# Descriptive statistics
Descriptive statistics can be divided into two categories:
- Statistics that describe the values of a variable
- Statistics that describe the spread or distribution of a variable

### Observing the values of a variable
The following metrics are used to examine the values a variable takes:
- sum
- median
- mean
- maximum


### Uses of descriptive statistics
- identifying outliers
- identifying and selecting data for building machine learning models
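
As an illustration of the first use, a common rule of thumb flags values that lie more than 1.5 interquartile ranges beyond the quartiles. That rule is an assumption made here, not something this notebook prescribes, and the variable names are hypothetical. The sketch applies it to the `hp` values of the first ten rows of the dataset loaded in this notebook.

```python
import pandas as pd

# hp values of the first ten mtcars rows, as listed in this notebook
hp = pd.Series([110, 110, 93, 110, 175, 105, 245, 62, 95, 123])

q1 = hp.quantile(0.25)
q3 = hp.quantile(0.75)
iqr = q3 - q1

# Rule of thumb (assumed): values beyond 1.5 * IQR from the quartiles are suspicious
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = hp[(hp < lower_fence) | (hp > upper_fence)]
print(outliers)
```

With only ten rows the fences are tight, so treat the flagged values as candidates to inspect rather than as errors.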

In [33]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import scipy
from scipy import stats

In [34]:
address = "./Data/mtcars.csv"
cars = pd.read_csv(address)
cars.rename(columns = {"Unnamed: 0": "cars_names"}, inplace= True)
cars

,cars_names,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
5,Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1
6,Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
7,Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
8,Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
9,Merc 280,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4


#### Sum

In [35]:
cars.sum() # sums the observations of each feature (column)

cars_names    Mazda RX4Mazda RX4 WagDatsun 710Hornet 4 Drive...
mpg                                                       642.9
cyl                                                         198
disp                                                     7383.1
hp                                                         4694
drat                                                     115.09
wt                                                      102.952
qsec                                                     571.16
vs                                                           14
am                                                           13
gear                                                        118
carb                                                         90
dtype: object

In [36]:
cars.sum(axis=1) # sums the features (columns) of each record (row)

0     328.980
1     329.795
2     259.580
3     426.135
4     590.310
5     385.540
6     656.920
7     270.980
8     299.570
9     350.460
10    349.660
11    510.740
12    511.500
13    509.850
14    728.560
15    726.644
16    725.695
17    213.850
18    195.165
19    206.955
20    273.775
21    519.650
22    506.085
23    646.280
24    631.175
25    208.215
26    272.570
27    273.683
28    670.690
29    379.590
30    694.710
31    288.890
dtype: float64
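
The two summation directions can be checked on a small frame. This is a minimal sketch built from the first two rows of the dataset shown earlier; `sample` is a hypothetical name. Passing `numeric_only=True` makes the exclusion of the text column explicit instead of relying on version-dependent pandas behaviour.

```python
import pandas as pd

# First two rows of the mtcars table shown earlier in this notebook
sample = pd.DataFrame({
    "cars_names": ["Mazda RX4", "Mazda RX4 Wag"],
    "mpg": [21.0, 21.0], "cyl": [6, 6], "disp": [160.0, 160.0], "hp": [110, 110],
    "drat": [3.9, 3.9], "wt": [2.62, 2.875], "qsec": [16.46, 17.02],
    "vs": [0, 0], "am": [1, 1], "gear": [4, 4], "carb": [4, 4],
})

column_totals = sample.sum(numeric_only=True)       # one total per column
row_totals = sample.sum(axis=1, numeric_only=True)  # one total per row
print(row_totals)
```

The two row totals should agree with the first two values of the `cars.sum(axis=1)` output.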

#### Median
`median()` returns the median of each feature (column). The median is obtained by sorting the values and taking the one in the middle: with 2n+1 observations it is observation n+1 of the sorted list, and with 2n observations it is the mean of observations n and n+1.

In [37]:
cars.median()

mpg      19.200
cyl       6.000
disp    196.300
hp      123.000
drat      3.695
wt        3.325
qsec     17.710
vs        0.000
am        0.000
gear      4.000
carb      2.000
dtype: float64
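
The odd and even cases of the median can be seen on two short series. This is a small sketch with values taken from the `mpg` column; the series names are hypothetical.

```python
import pandas as pd

# 2n+1 = 5 observations (n = 2): the median is the 3rd value once sorted
odd = pd.Series([22.8, 18.7, 24.4, 21.0, 21.4])

# 2n = 4 observations (n = 2): the median is the mean of the 2nd and 3rd sorted values
even = pd.Series([22.8, 18.7, 21.0, 21.4])

odd_median = odd.median()
even_median = even.median()

print(sorted(odd), odd_median)
print(sorted(even), even_median)
```

The input does not need to be sorted beforehand, since `median()` handles the ordering itself.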

#### Mean
The mean is obtained by summing all the observations and dividing the result by the number of observations.

In [38]:
cars.mean()

mpg      20.090625
cyl       6.187500
disp    230.721875
hp      146.687500
drat      3.596563
wt        3.217250
qsec     17.848750
vs        0.437500
am        0.406250
gear      3.687500
carb      2.812500
dtype: float64
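
The definition can be verified by hand. The sketch below uses the `mpg` values of the first ten rows (variable names are hypothetical) and then repeats the check with the full-dataset figures reported in this notebook: a column sum of 642.9 over 32 rows.

```python
import pandas as pd

# mpg values of the first ten mtcars rows, as listed in this notebook
mpg = pd.Series([21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2])

manual_mean = mpg.sum() / len(mpg)  # sum of observations / number of observations
pandas_mean = mpg.mean()

# Same arithmetic with the totals reported for the whole dataset
full_dataset_mean = 642.9 / 32

print(manual_mean, pandas_mean, full_dataset_mean)
```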

#### Maximum

In [39]:
cars.max()

cars_names    Volvo 142E
mpg                 33.9
cyl                    8
disp                 472
hp                   335
drat                4.93
wt                 5.424
qsec                22.9
vs                     1
am                     1
gear                   5
carb                   8
dtype: object

To find the row (observation) that holds the maximum of a given variable, select the column of interest as a Series and call its `.idxmax()` method.

In [40]:
print(f"Maximum mpg value: {cars.mpg.max()}")
print(f"Row index of the maximum mpg value: {cars.mpg.idxmax()}")


Maximum mpg value: 33.9
Row index of the maximum mpg value: 19
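
`idxmax()` returns an index label, not a position, so its result depends on what the index contains. The sketch below, using the first ten rows and hypothetical variable names, shows that with car names as the index the method returns the name directly.

```python
import pandas as pd

names = ["Mazda RX4", "Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive",
         "Hornet Sportabout", "Valiant", "Duster 360", "Merc 240D",
         "Merc 230", "Merc 280"]
mpg_values = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2]

by_position = pd.Series(mpg_values)            # default integer index
by_name = pd.Series(mpg_values, index=names)   # car names as index labels

position_label = by_position.idxmax()  # integer label of the row with the highest mpg
name_label = by_name.idxmax()          # car name of the row with the highest mpg

print(position_label, name_label)
```

If several rows share the maximum, `idxmax()` returns the label of the first one.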


### Observing the distribution of a variable
The following metrics are used to examine the distribution of a variable:
- standard deviation
- variance
- counts
- quartiles

#### Standard deviation

In [29]:
cars.std() # standard deviation

mpg       6.026948
cyl       1.785922
disp    123.938694
hp       68.562868
drat      0.534679
wt        0.978457
qsec      1.786943
vs        0.504016
am        0.498991
gear      0.737804
carb      1.615200
dtype: float64
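
One detail worth knowing: pandas computes the sample standard deviation (`ddof=1`) by default, while `numpy.std` defaults to the population version (`ddof=0`). The sketch below compares the two on the `mpg` values of the first ten rows; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

mpg = pd.Series([21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2])

sample_std = mpg.std()               # pandas default: divides by n - 1
population_std = mpg.std(ddof=0)     # divides by n
numpy_default = np.std(mpg.to_numpy())            # NumPy default: divides by n
numpy_sample = np.std(mpg.to_numpy(), ddof=1)     # matches the pandas default

print(sample_std, population_std)
```

The gap between the two shrinks as the number of observations grows, but on small samples it is noticeable.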

#### Variance

In [41]:
cars.var()  # variance

mpg        36.324103
cyl         3.189516
disp    15360.799829
hp       4700.866935
drat        0.285881
wt          0.957379
qsec        3.193166
vs          0.254032
am          0.248992
gear        0.544355
carb        2.608871
dtype: float64
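
The variance is the square of the standard deviation. As a minimal sketch (hypothetical variable names, first ten `mpg` values), the sample variance can be computed from its definition and compared with what pandas returns.

```python
import pandas as pd

mpg = pd.Series([21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2])

n = len(mpg)
# Sample variance: squared deviations from the mean, divided by n - 1
manual_var = ((mpg - mpg.mean()) ** 2).sum() / (n - 1)

pandas_var = mpg.var()
std_squared = mpg.std() ** 2

print(manual_var, pandas_var, std_squared)
```

The same relation holds for the full-dataset outputs: the `mpg` standard deviation of 6.026948 squared gives roughly the 36.324103 reported by `cars.var()`.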

#### Unique value counts

In [42]:
gear = cars.gear
gear.value_counts()

3    15
4    12
5     5
Name: gear, dtype: int64

The `gear` variable has 3 unique values (3, 4, 5): 15 observations have the value 3, 12 have the value 4, and 5 have the value 5.
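
Counts are often easier to compare as proportions. The sketch below rebuilds a series with the same value counts as `gear` (15, 12 and 5; the row order is not the real one, and the names are hypothetical) and uses `normalize=True` to obtain shares.

```python
import pandas as pd

# Synthetic series reproducing the gear counts reported above (order is arbitrary)
gear = pd.Series([3] * 15 + [4] * 12 + [5] * 5, name="gear")

counts = gear.value_counts()                 # absolute frequencies
shares = gear.value_counts(normalize=True)   # relative frequencies, summing to 1

print(counts)
print(shares)
```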


### Describe
Returns a single table with the count, mean, standard deviation, minimum, quartiles, and maximum of each numeric column.

In [43]:
cars.describe()

,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
count,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0
mean,20.090625,6.1875,230.721875,146.6875,3.596563,3.21725,17.84875,0.4375,0.40625,3.6875,2.8125
std,6.026948,1.785922,123.938694,68.562868,0.534679,0.978457,1.786943,0.504016,0.498991,0.737804,1.6152
min,10.4,4.0,71.1,52.0,2.76,1.513,14.5,0.0,0.0,3.0,1.0
25%,15.425,4.0,120.825,96.5,3.08,2.58125,16.8925,0.0,0.0,3.0,2.0
50%,19.2,6.0,196.3,123.0,3.695,3.325,17.71,0.0,0.0,4.0,2.0
75%,22.8,8.0,326.0,180.0,3.92,3.61,18.9,1.0,1.0,4.0,4.0
max,33.9,8.0,472.0,335.0,4.93,5.424,22.9,1.0,1.0,5.0,8.0
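
The 25%, 50% and 75% rows are the quartiles, which can also be requested directly with `quantile()`. The sketch below uses the `mpg` values of the first ten rows and hypothetical variable names.

```python
import pandas as pd

mpg = pd.Series([21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2])

quartiles = mpg.quantile([0.25, 0.5, 0.75])  # Series indexed by the requested fractions
summary = mpg.describe()                     # includes the same values as 25%, 50%, 75%

# Interquartile range: width of the middle half of the data
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]

print(quartiles)
print(iqr)
```

By default pandas interpolates linearly between neighbouring observations when a quartile falls between two values.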



## Categorical data

In [54]:
cars.index = cars.cars_names
cars.head()

cars_names,cars_names,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


In [55]:
carb = cars.carb 
carb.value_counts() # check the unique values of the carb column

4    10
2    10
1     7
3     3
8     1
6     1
Name: carb, dtype: int64

In [56]:
cars_cat = cars[["cyl","vs","am","gear","carb"]] # create a new dataframe with the categorical columns
cars_cat.head()

cars_names,cyl,vs,am,gear,carb
Mazda RX4,6,0,1,4,4
Mazda RX4 Wag,6,0,1,4,4
Datsun 710,4,1,1,4,1
Hornet 4 Drive,6,1,0,3,1
Hornet Sportabout,8,0,0,3,2


In [60]:
gears_group = cars_cat.groupby("gear") # group by the value of "gear"
gears_group.describe()

,am,am,am,am,am,am,am,am,carb,carb,...,cyl,cyl,vs,vs,vs,vs,vs,vs,vs,vs
gear,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
3,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,15.0,2.666667,...,8.0,8.0,15.0,0.2,0.414039,0.0,0.0,0.0,0.0,1.0
4,12.0,0.666667,0.492366,0.0,0.0,1.0,1.0,1.0,12.0,2.333333,...,6.0,6.0,12.0,0.833333,0.389249,0.0,1.0,1.0,1.0,1.0
5,5.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,5.0,4.4,...,8.0,8.0,5.0,0.2,0.447214,0.0,0.0,0.0,0.0,1.0
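
When the full `describe()` table is wider than needed, `agg()` can restrict a grouped summary to chosen statistics. The sketch below uses the `gear` and `cyl` values of the first ten rows; the frame and variable names are hypothetical.

```python
import pandas as pd

# gear and cyl values of the first ten mtcars rows, as listed in this notebook
sample = pd.DataFrame({
    "gear": [4, 4, 4, 3, 3, 3, 3, 4, 4, 4],
    "cyl":  [6, 6, 4, 6, 8, 6, 8, 4, 4, 6],
})

# One row per gear value, one column per requested statistic
cyl_by_gear = sample.groupby("gear")["cyl"].agg(["count", "mean", "max"])
print(cyl_by_gear)
```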


### Converting a variable to a categorical data type

In [63]:
cars["groups"] = pd.Series(cars.gear, dtype="category")

In [65]:
cars.groups.dtypes

CategoricalDtype(categories=[3, 4, 5], ordered=False)

In [66]:
cars.head()

cars_names,cars_names,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb,groups
Mazda RX4,Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4,4
Mazda RX4 Wag,Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4,4
Datsun 710,Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1,4
Hornet 4 Drive,Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1,3
Hornet Sportabout,Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2,3


In [67]:
cars.groups.value_counts()

3    15
4    12
5     5
Name: groups, dtype: int64
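
The same conversion can be written with `astype("category")`, and an explicit `CategoricalDtype` lets the categories be declared and ordered up front. The sketch below is an illustration with the `gear` values of the first ten rows and hypothetical variable names; declaring the category 5 there is an assumption based on the values seen in the full dataset.

```python
import pandas as pd

gear = pd.Series([4, 4, 4, 3, 3, 3, 3, 4, 4, 4])

# Categories inferred from the data: only the values actually present
inferred = gear.astype("category")

# Categories declared explicitly and marked as ordered
gear_dtype = pd.CategoricalDtype(categories=[3, 4, 5], ordered=True)
ordered = gear.astype(gear_dtype)

counts = ordered.value_counts()          # declared but unobserved categories count as 0
more_than_three = (ordered > 3).sum()    # comparisons are allowed on ordered categories

print(inferred.cat.categories)
print(counts)
print(more_than_three)
```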

### Crosstab

In [69]:
pd.crosstab(cars["am"],cars["gear"])

am \ gear,3,4,5
0,15,4,0
1,0,8,5
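
`crosstab` counts how often each pair of values occurs together. Two optional parameters are often useful: `margins=True` adds row and column totals, and `normalize="index"` turns each row into proportions. The sketch below uses the `am` and `gear` values of the first ten rows, with hypothetical variable names.

```python
import pandas as pd

# am and gear values of the first ten mtcars rows, as listed in this notebook
sample = pd.DataFrame({
    "am":   [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "gear": [4, 4, 4, 3, 3, 3, 3, 4, 4, 4],
})

# Counts with an extra "All" row and column holding the totals
counts = pd.crosstab(sample["am"], sample["gear"], margins=True)

# Each row rescaled so that it sums to 1
row_shares = pd.crosstab(sample["am"], sample["gear"], normalize="index")

print(counts)
print(row_shares)
```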
