# Pandas (part two)

In this notebook we use Scikit-learn (a Machine Learning library, http://scikit-learn.org) only to download the dataset, which we then load as a Pandas dataframe.

Let's look at some other useful operations on Pandas dataframes.

In [1]:
from sklearn.datasets import load_iris
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
data = load_iris()

In [3]:
iris = pd.DataFrame(data.data, columns=data.feature_names)
iris.head(10)

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
5,5.4,3.9,1.7,0.4
6,4.6,3.4,1.4,0.3
7,5.0,3.4,1.5,0.2
8,4.4,2.9,1.4,0.2
9,4.9,3.1,1.5,0.1


In [4]:
iris.shape

(150, 4)

In [5]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
sepal length (cm)    150 non-null float64
sepal width (cm)     150 non-null float64
petal length (cm)    150 non-null float64
petal width (cm)     150 non-null float64
dtypes: float64(4)
memory usage: 4.8 KB


In [6]:
iris.describe()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


## Reducing functions 

We can apply statistical functions directly to our dataframe:  
* mean()
* std()
* sum()
* first(), last() (as group-wise aggregations after `.groupby()`)
* max(), min()

In [7]:
iris.mean()

sepal length (cm)    5.843333
sepal width (cm)     3.054000
petal length (cm)    3.758667
petal width (cm)     1.198667
dtype: float64

In [8]:
iris.std()

sepal length (cm)    0.828066
sepal width (cm)     0.433594
petal length (cm)    1.764420
petal width (cm)     0.763161
dtype: float64
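
The remaining reducing functions behave the same way as `mean()` and `std()`. The sketch below uses a small made-up dataframe (`toy` and its column names are assumptions for illustration, not part of the iris data) to show that each call returns one value per column, while reducing a single column gives a scalar.

```python
import pandas as pd

# Hypothetical toy dataframe, used only for illustration
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]})

col_sum = toy.sum()   # one value per column
col_max = toy.max()
col_min = toy.min()

# Reducing a single column (a Series) returns a scalar
a_mean = toy["a"].mean()

print(col_sum)
print(a_mean)
```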

### Renaming columns

In [9]:
# The list must contain exactly as many names as there are columns
iris.columns = ['sepal_l', 'sepal_wid', 'petal_len', 'petal_wid']

In [10]:
iris.head()

,sepal_l,sepal_wid,petal_len,petal_wid
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


#### Renaming a single column
By default `rename` returns a new dataframe, so we assign the result back to a variable to make the change persist.

In [11]:
iris = iris.rename(columns = {'sepal_l':'sepal_len'})

In [12]:
iris.head()

,sepal_len,sepal_wid,petal_len,petal_wid
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
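
To see why the assignment matters, here is a minimal sketch on a made-up two-column dataframe (`toy` is an assumption for illustration). Calling `rename` without storing the result leaves the original column names untouched.

```python
import pandas as pd

# Hypothetical toy dataframe, used only for illustration
toy = pd.DataFrame({"sepal_l": [5.1, 4.9], "sepal_wid": [3.5, 3.0]})

# Without assignment the renamed copy is simply discarded
toy.rename(columns={"sepal_l": "sepal_len"})
unchanged_cols = list(toy.columns)

# Assigning the result keeps the new name
toy = toy.rename(columns={"sepal_l": "sepal_len"})
renamed_cols = list(toy.columns)

print(unchanged_cols)
print(renamed_cols)
```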


## Using multiple conditions with `.loc` (not `.iloc`)

We can use **&** (and) when both conditions (each wrapped in parentheses) must be true,  
or **|** (or) when it is enough for one of the conditions to be true.

In [13]:
new_iris = iris.loc[(iris['sepal_len'] >= 5) & (iris['sepal_wid'] >= 3.2)]

In [14]:
new_iris.head()

,sepal_len,sepal_wid,petal_len,petal_wid
0,5.1,3.5,1.4,0.2
4,5.0,3.6,1.4,0.2
5,5.4,3.9,1.7,0.4
7,5.0,3.4,1.5,0.2
10,5.4,3.7,1.5,0.2


In [15]:
new_iris.shape

(47, 4)
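
The cells above only use **&**. As a hedged sketch of the difference between the two operators, the example below applies the same two conditions to a four-row made-up dataframe (`toy` is an assumption for illustration), first with `&` and then with `|`.

```python
import pandas as pd

# Hypothetical toy dataframe, used only for illustration
toy = pd.DataFrame({
    "sepal_len": [5.1, 4.9, 4.7, 5.4],
    "sepal_wid": [3.5, 3.0, 3.2, 3.9],
})

# Both conditions must hold
both = toy.loc[(toy["sepal_len"] >= 5) & (toy["sepal_wid"] >= 3.2)]

# At least one condition must hold
either = toy.loc[(toy["sepal_len"] >= 5) | (toy["sepal_wid"] >= 3.2)]

print(list(both.index))
print(list(either.index))
```

The parentheses around each condition are required because `&` and `|` bind more tightly than comparison operators in Python.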

## Deleting columns or rows

In [16]:
iris2 = iris.drop(['sepal_len', 'sepal_wid'], axis=1)

In [17]:
iris2.head()

,petal_len,petal_wid
0,1.4,0.2
1,1.4,0.2
2,1.3,0.2
3,1.5,0.2
4,1.4,0.2


In [18]:
iris.head()

,sepal_len,sepal_wid,petal_len,petal_wid
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [19]:
iris3 = iris.drop([0, 1, 4, 8])

In [20]:
iris3.head(10)

,sepal_len,sepal_wid,petal_len,petal_wid
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
5,5.4,3.9,1.7,0.4
6,4.6,3.4,1.4,0.3
7,5.0,3.4,1.5,0.2
9,4.9,3.1,1.5,0.1
10,5.4,3.7,1.5,0.2
11,4.8,3.4,1.6,0.2
12,4.8,3.0,1.4,0.1
13,4.3,3.0,1.1,0.1


## Removing duplicates from a column

In [21]:
sepal_wid_unique = iris['sepal_wid'].drop_duplicates() 

In [22]:
sepal_wid_unique.head(10)

0     3.5
1     3.0
2     3.2
3     3.1
4     3.6
5     3.9
6     3.4
8     2.9
10    3.7
14    4.0
Name: sepal_wid, dtype: float64

In [23]:
iris.head(10)

,sepal_len,sepal_wid,petal_len,petal_wid
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
5,5.4,3.9,1.7,0.4
6,4.6,3.4,1.4,0.3
7,5.0,3.4,1.5,0.2
8,4.4,2.9,1.4,0.2
9,4.9,3.1,1.5,0.1
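
As the output above suggests, `drop_duplicates()` keeps the first occurrence of each value together with its original index label, and leaves the source dataframe unchanged. A minimal sketch on a made-up dataframe (`toy` is an assumption for illustration) also shows the dataframe-level variant with `subset`, which keeps whole rows:

```python
import pandas as pd

# Hypothetical toy dataframe, used only for illustration
toy = pd.DataFrame({
    "sepal_wid": [3.5, 3.0, 3.5, 3.2, 3.0],
    "petal_len": [1.4, 1.4, 1.3, 1.5, 1.4],
})

# On a single column: unique values, first occurrence kept
unique_wid = toy["sepal_wid"].drop_duplicates()

# On the dataframe: whole rows kept, duplicates judged on one column only
unique_rows = toy.drop_duplicates(subset="sepal_wid")

print(list(unique_wid.index))
print(unique_rows.shape)
```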


## Handling missing values

Let's look at two ways to drop or replace values that are missing or were not recognized during import.

### Dropping all rows with missing values

In [24]:
from numpy import nan

In [25]:
nan_df = pd.DataFrame([[1, 3, nan, 4],
                       [nan, 9, 10, nan],
                       [5, 6, 5, 6]],
                      columns=['come', 'quando', 'fuori', 'piove'])

In [26]:
nan_df

,come,quando,fuori,piove
0,1.0,3,,4.0
1,,9,10.0,
2,5.0,6,5.0,6.0


In [27]:
no_nan = nan_df.dropna()

In [28]:
no_nan

,come,quando,fuori,piove
2,5.0,6,5.0,6.0


### Converting missing values to zeros

In [29]:
zero_nan = nan_df.fillna(0)

In [30]:
zero_nan

,come,quando,fuori,piove
0,1.0,3,0.0,4.0
1,0.0,9,10.0,0.0
2,5.0,6,5.0,6.0
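
Before choosing between dropping and filling, it can help to count how many values are missing in each column. The sketch below does that on a made-up dataframe (`toy` is an assumption for illustration) and then compares `dropna()` with `fillna()`. Filling with the column mean is shown only as one possible alternative to zeros, not as a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe, used only for illustration
toy = pd.DataFrame({"x": [1.0, np.nan, 5.0], "y": [3.0, 9.0, np.nan]})

# Number of missing values per column
missing_per_col = toy.isna().sum()

# Keep only the rows that have no missing value at all
dropped = toy.dropna()

# Replace missing values with zero
filled_zero = toy.fillna(0)

# Alternative: replace missing values with each column's mean
# (mean() ignores NaN by default)
filled_mean = toy.fillna(toy.mean())

print(missing_per_col)
print(filled_mean)
```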


## Grouping values by a column and an operation

Let's go back to the dataset from the exercise to see a few examples of what `.groupby()` can do.  
The mathematical operation that follows `.groupby()` is applied to all the other numeric columns.  

In [31]:
survey = pd.read_csv('data/WA_American-Time-Use-Survey-lite.csv')

In [32]:
survey.head()

,Education Level,Age,Age Range,Employment Status,Gender,Children,Weekly Earnings,Year,Weekly Hours Worked,Sleeping,...,Caring for Children,Playing with Children,Job Searching,Shopping,Eating and Drinking,Socializing & Relaxing,Television,Golfing,Running,Volunteering
0,High School,51,50-59,Unemployed,Female,0,0,2005,0,825,...,0,0,0,0,40,180,120,0,0,0
1,Bachelor,42,40-49,Employed,Female,2,1480,2005,40,500,...,365,20,0,120,40,15,15,0,0,0
2,Master,47,40-49,Employed,Male,0,904,2005,40,480,...,0,0,0,15,85,214,199,0,0,0
3,Some College,21,20-29,Employed,Female,0,320,2005,40,705,...,0,0,0,105,30,240,240,0,0,0
4,High School,49,40-49,Not in labor force,Female,0,0,2005,0,470,...,0,0,0,0,35,600,40,0,0,0


In [33]:
survey['Age'].min(), survey['Age'].max()

(15, 85)

In [34]:
age_twenties = survey.groupby('Age')

In [35]:
age_twenties

<pandas.core.groupby.DataFrameGroupBy object at 0x1a1425db70>

In [36]:
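# numeric_only=True skips the text columns, which recent pandas versions refuse to average
age_mean = survey.groupby('Age').mean(numeric_only=True)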

In [37]:
age_mean.head()

Age,Children,Weekly Earnings,Year,Weekly Hours Worked,Sleeping,Grooming,Housework,Food & Drink Prep,Caring for Children,Playing with Children,Job Searching,Shopping,Eating and Drinking,Socializing & Relaxing,Television,Golfing,Running,Volunteering
15,2.049875,13.689526,2008.170823,1.816708,581.829177,46.074813,12.289277,7.158354,4.955112,0.900249,0.074813,19.438903,59.038653,289.301746,143.522444,1.041147,2.082294,10.457606
16,2.043558,32.954601,2008.354601,4.506135,579.134969,47.439264,13.7,8.253374,4.349693,1.448466,0.225767,17.003067,55.941718,290.050307,134.492638,0.665644,2.279141,8.996319
17,1.961795,53.817465,2008.199515,7.130382,575.237113,46.467556,15.244391,8.697999,4.493633,1.380837,0.473014,19.519709,55.080655,283.537902,131.081261,0.148575,2.298363,12.812007
18,1.095541,92.355778,2008.352138,10.801638,587.515924,46.00273,20.152866,9.785259,7.606005,2.415833,1.44222,21.311192,53.317561,301.792539,146.037307,1.977252,2.599636,7.543221
19,1.017878,149.948749,2008.412396,17.209774,578.221692,43.170441,17.332539,13.839094,13.812872,5.321812,2.27056,22.951132,54.856973,294.404052,142.587604,0.429082,1.476758,6.312277
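
Since the survey CSV may not be at hand, the same grouping step can be sketched on a tiny made-up dataframe (`toy` and its values are assumptions for illustration; only the column names echo the survey). Each distinct `Age` becomes one row of the result, and the mean is computed over the numeric columns of that group.

```python
import pandas as pd

# Hypothetical toy dataframe, used only for illustration
toy = pd.DataFrame({
    "Age": [20, 20, 21, 21],
    "Gender": ["Female", "Male", "Female", "Male"],
    "Sleeping": [500, 540, 480, 520],
    "Television": [100, 140, 60, 80],
})

# Group rows by Age, then average every numeric column within each group.
# numeric_only=True leaves the text column "Gender" out of the result.
age_mean_toy = toy.groupby("Age").mean(numeric_only=True)

print(age_mean_toy)
```

The grouping column becomes the index of the result, which is why the ages can later be selected by label.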


In [38]:
age_twenties = age_mean.loc[20:29, :] # Remember that .loc selects by label, so 29 is included too!

In [39]:
age_twenties

Age,Children,Weekly Earnings,Year,Weekly Hours Worked,Sleeping,Grooming,Housework,Food & Drink Prep,Caring for Children,Playing with Children,Job Searching,Shopping,Eating and Drinking,Socializing & Relaxing,Television,Golfing,Running,Volunteering
20,0.892771,186.771084,2008.483133,18.422892,567.961446,44.520482,24.006024,15.873494,22.86988,6.487952,1.859036,23.696386,56.425301,299.398795,151.06988,1.46988,0.913253,3.960241
21,0.762274,227.677003,2008.408269,21.322997,556.29199,41.45478,26.439276,19.529716,31.0,11.706718,3.528424,24.231266,57.976744,287.689922,151.919897,0.03876,0.569767,5.343669
22,0.739741,272.559395,2008.37041,22.673866,553.631749,40.647948,28.830454,22.987041,36.272138,11.667387,1.765659,25.786177,57.447084,274.62203,149.2473,0.87257,0.62635,4.218143
23,0.762238,342.386613,2008.536464,27.361638,546.757243,44.711289,33.992008,22.566434,40.882118,14.844156,1.58042,22.777223,60.705295,256.407592,135.123876,1.408591,0.919081,7.248751
24,0.777895,386.64,2008.352632,28.522105,543.169474,40.312632,32.585263,26.866316,50.482105,17.181053,3.708421,24.606316,64.86,255.626316,137.893684,0.447368,0.952632,6.174737
25,0.837268,426.701518,2008.455312,29.568297,539.879427,42.67032,34.390388,28.451096,52.780776,17.734401,1.956998,23.530354,60.607083,256.958685,140.268971,1.020236,1.218381,4.53457
26,0.960434,466.676493,2008.453064,30.192397,536.574088,39.211792,40.633049,29.364624,55.86346,18.342126,2.895268,23.852599,62.232739,251.687355,141.126455,1.158262,0.636152,5.796742
27,1.059892,491.025572,2008.48856,29.491252,532.653432,40.899058,37.573351,32.127187,60.400404,18.629206,1.940781,26.793405,63.595559,249.193809,139.254374,0.897039,0.956258,4.855989
28,1.196194,528.296588,2008.477034,30.315617,535.717192,39.17979,41.729003,31.133858,62.133858,21.305118,2.047244,24.252625,63.686352,246.807087,140.077428,1.079396,0.541339,5.167979
29,1.254693,554.275344,2008.546308,31.621402,531.459324,37.112015,40.994368,34.573217,66.187735,22.296621,1.769086,27.951189,67.957447,236.390488,136.723404,0.585106,0.805382,6.807885
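
The label-based slice used above includes its end point, unlike a position-based slice. A minimal sketch of the difference, on a made-up dataframe whose index holds ages (`toy` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy dataframe indexed by age, used only for illustration
toy = pd.DataFrame({"value": [10, 20, 30, 40]}, index=[18, 19, 20, 21])

# .loc slices by label: both 19 and 20 are included
by_label = toy.loc[19:20]

# .iloc slices by position: the end position 2 is excluded
by_position = toy.iloc[1:2]

print(list(by_label.index))
print(list(by_position.index))
```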


## Let's play!
Let's apply what we have seen to the iris dataframe and to the one from the exercise.