# Intro to Pandas
We will see how to load the Pandas and NumPy libraries, how to create data series with NumPy functions, and how to put them into a Pandas DataFrame.

We will also see how to select and filter data within a DataFrame and how to perform simple operations with scalars.

In [4]:
# The most common convention is to import numpy and pandas as np and pd respectively.
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

## Pandas Series
A Pandas Series is a one-dimensional array with labels (in other words, it is indexed by labels).

In [5]:
# Create a Pandas Series from a NumPy array.
series_obj = Series(np.arange(0,6), index =['row 1', 'row 2','row 3','row 4', 'row 5','row 6'])
series_obj

row 1    0
row 2    1
row 3    2
row 4    3
row 5    4
row 6    5
dtype: int32

In [6]:
# We can access the value of an element by its index label.
series_obj['row 6']

5

In [7]:
# We can also access elements by integer position; .iloc makes the positional lookup explicit.
series_obj.iloc[[0, 5]]

row 1    0
row 6    5
dtype: int32

In [9]:
# Or we can slice it by position.
series_obj[:3]

row 1    0
row 2    1
row 3    2
dtype: int32
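
The three access styles above can be written with the explicit `.loc` (label-based) and `.iloc` (position-based) indexers. The following is a minimal sketch that rebuilds the same Series; the variable names `by_label`, `by_position`, `first_three` and `picked` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Same Series as above: values 0..5 with string labels.
labels = ['row 1', 'row 2', 'row 3', 'row 4', 'row 5', 'row 6']
series_obj = pd.Series(np.arange(0, 6), index=labels)

by_label = series_obj.loc['row 6']      # label-based lookup
by_position = series_obj.iloc[5]        # position-based lookup
first_three = series_obj.iloc[:3]       # positions 0, 1, 2 (end excluded)
picked = series_obj.iloc[[0, 5]]        # a list of positions
```

Using `.loc` and `.iloc` makes it clear to the reader whether a key is a label or a position.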

## Creare un DataFrame
A Pandas DataFrame is a two-dimensional data structure; we can think of it as a table organized in rows and columns, where:
- the **rows** represent the observations;
- the **columns** represent the various *features*.
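
To make the rows-as-observations and columns-as-features idea concrete, here is a small sketch that builds a DataFrame from a dictionary. The column names, index labels and values are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical measurements: each key becomes a column (feature),
# each position in the lists becomes a row (observation).
measurements = pd.DataFrame(
    {'height_cm': [170, 182, 165], 'weight_kg': [68.5, 80.0, 59.2]},
    index=['obs 1', 'obs 2', 'obs 3'],
)

# shape is (number of rows, number of columns)
n_rows, n_columns = measurements.shape
```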

In [10]:
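# An empty Series is the simplest starting point; the dtype is given explicitly
# because the default dtype of an empty Series depends on the pandas version.
pd.Series(dtype='float64')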

Series([], dtype: float64)

In [15]:
# We can create a Series from a NumPy array.
data = np.array(['uno','due','tre','quattro','cinque'])
series_from_array = pd.Series(data)
print(series_from_array)

0        uno
1        due
2        tre
3    quattro
4     cinque
dtype: object
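
Since this section is about DataFrames, it may help to see how a Series like the one above becomes a one-column DataFrame. This is only a sketch; the names `words`, `words_df` and the column name `'word'` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

data = np.array(['uno', 'due', 'tre', 'quattro', 'cinque'])
words = pd.Series(data)

# to_frame() wraps the Series in a one-column DataFrame;
# the column name 'word' is a hypothetical choice for this example.
words_df = words.to_frame(name='word')
```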


In [23]:
# With np.random we can create a sequence of random numbers.
# With reshape(rows, columns) we can arrange the data into rows and columns.

np.random.seed(25)
DF = DataFrame(np.random.rand(36).reshape(6,6), index=['row 1', 'row 2','row 3','row 4', 'row 5','row 6'], 
              columns=['column 1','column 2','column 3','column 4','column 5','column 6'])
DF

       column 1  column 2  column 3  column 4  column 5  column 6
row 1  0.870124  0.582277  0.278839  0.185911  0.411100  0.117376
row 2  0.684969  0.437611  0.556229  0.367080  0.402366  0.113041
row 3  0.447031  0.585445  0.161985  0.520719  0.326051  0.699186
row 4  0.366395  0.836375  0.481343  0.516502  0.383048  0.997541
row 5  0.514244  0.559053  0.034450  0.719930  0.421004  0.436935
row 6  0.281701  0.900274  0.669612  0.456069  0.289804  0.525819


## Select Column & Rows

In [24]:
# The .loc indexer selects data precisely, by row and column labels.
DF.loc[['row 2','row 5'],['column 3','column 1']]

       column 3  column 1
row 2  0.556229  0.684969
row 5  0.034450  0.514244
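
The label-based selection above has a position-based counterpart, `.iloc`. The sketch below rebuilds the same DataFrame and selects the same cells both ways; the names `by_label`, `by_position` and `single_column` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(25)
DF = pd.DataFrame(
    np.random.rand(36).reshape(6, 6),
    index=[f'row {i}' for i in range(1, 7)],
    columns=[f'column {i}' for i in range(1, 7)],
)

by_label = DF.loc[['row 2', 'row 5'], ['column 3', 'column 1']]
by_position = DF.iloc[[1, 4], [2, 0]]   # same cells, chosen by position
single_column = DF['column 3']          # one column comes back as a Series
```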


## Slice Rows & Columns

In [25]:
# Slicing also works with labels; unlike positional slicing, the end label is included.
series_obj['row 2':'row 5']

row 2    1
row 3    2
row 4    3
row 5    4
dtype: int32
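
The difference between label slices and position slices is easy to miss, and the same label slicing applies to both axes of a DataFrame. A minimal sketch, assuming the same Series and DataFrame as above (the result names are introduced only for this example):

```python
import numpy as np
import pandas as pd

labels = ['row 1', 'row 2', 'row 3', 'row 4', 'row 5', 'row 6']
series_obj = pd.Series(np.arange(0, 6), index=labels)

label_slice = series_obj.loc['row 2':'row 5']   # end label is included
position_slice = series_obj.iloc[1:4]           # stops before position 4

np.random.seed(25)
DF = pd.DataFrame(
    np.random.rand(36).reshape(6, 6),
    index=labels,
    columns=[f'column {i}' for i in range(1, 7)],
)

# Slice rows and columns by label in a single call.
block = DF.loc['row 2':'row 4', 'column 1':'column 3']
```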

## Comparison with scalars

In [28]:
# We can check whether our data satisfy a given condition.
# The result is a table of booleans: True where the condition holds, False otherwise.
DF < .2

       column 1  column 2  column 3  column 4  column 5  column 6
row 1     False     False     False      True     False      True
row 2     False     False     False     False     False      True
row 3     False     False      True     False     False     False
row 4     False     False     False     False     False     False
row 5     False     False      True     False     False     False
row 6     False     False     False     False     False     False
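
A boolean table like this one can be summarized, because `True` counts as 1 when summed. The sketch below uses a small hypothetical DataFrame (columns `a` and `b` are not part of the original data) so that the results are easy to check by eye.

```python
import pandas as pd

# Hypothetical values chosen so the mask is easy to verify by hand.
small = pd.DataFrame({'a': [0.87, 0.18, 0.45], 'b': [0.12, 0.56, 0.70]})

mask = small < 0.2                     # same comparison as above
count_per_column = mask.sum()          # True values per column
any_per_row = mask.any(axis=1)         # does a row contain at least one True?
total = int(mask.values.sum())         # True values in the whole table
```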


## Filtering with scalars


In [31]:
# We can filter the data of a Series by comparing them with a scalar.
series_obj[series_obj<4]

row 1    0
row 2    1
row 3    2
row 4    3
dtype: int32

In [32]:
# The same works on a DataFrame; where the condition is not satisfied, NaN is returned.
DF[DF>0.2]

       column 1  column 2  column 3  column 4  column 5  column 6
row 1  0.870124  0.582277  0.278839       NaN  0.411100       NaN
row 2  0.684969  0.437611  0.556229  0.367080  0.402366       NaN
row 3  0.447031  0.585445       NaN  0.520719  0.326051  0.699186
row 4  0.366395  0.836375  0.481343  0.516502  0.383048  0.997541
row 5  0.514244  0.559053       NaN  0.719930  0.421004  0.436935
row 6  0.281701  0.900274  0.669612  0.456069  0.289804  0.525819
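
Filtering the whole DataFrame keeps its shape and inserts NaN. A common alternative is to filter rows using a condition on a single column, which drops the rows that do not match. A minimal sketch, using the first three rows and two columns copied from the table above (the name `DF_small` is introduced only for this example):

```python
import pandas as pd

# First three rows of 'column 1' and 'column 2' from the table above.
DF_small = pd.DataFrame(
    {'column 1': [0.870124, 0.684969, 0.447031],
     'column 2': [0.582277, 0.437611, 0.585445]},
    index=['row 1', 'row 2', 'row 3'],
)

# A boolean Series used as a mask keeps only the rows where it is True.
rows = DF_small[DF_small['column 1'] > 0.5]

# Conditions are combined with & (and) or | (or); the parentheses are required.
combined = DF_small[(DF_small['column 1'] > 0.5) & (DF_small['column 2'] > 0.5)]
```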


## Setting data with scalars

In [36]:
# We can also set values after the Series has been created.
# To select more than one element, the labels must be passed as a list.
series_obj.loc[['row 1', 'row 5']] = 1000
series_obj

row 1    1000
row 2       1
row 3       2
row 4       3
row 5    1000
row 6       5
dtype: int32
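
When assigning a scalar to several elements at once, passing the labels as an explicit list to `.loc` avoids any ambiguity about how the key is interpreted. The sketch below also shows assignment through a boolean mask; the replacement value `-1` is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

labels = ['row 1', 'row 2', 'row 3', 'row 4', 'row 5', 'row 6']
series_obj = pd.Series(np.arange(0, 6), index=labels)

# Assign one scalar to a list of labels.
series_obj.loc[['row 1', 'row 5']] = 1000

# Assign one scalar wherever a condition holds (here: values below 3).
series_obj.loc[series_obj < 3] = -1
```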

# Missing values
Missing values are represented by "NaN" (Not a Number).
There are several ways to handle missing values; one of them is to replace them with the mean of the values we do have.

In [37]:
missing = np.nan
series_obj = Series(['row 1', 'row 2', missing,missing, 'row 5', 'row 6'])
series_obj

0    row 1
1    row 2
2      NaN
3      NaN
4    row 5
5    row 6
dtype: object

In [38]:
# To check whether any values are missing, we can use the isnull() method.
series_obj.isnull()

0    False
1    False
2     True
3     True
4    False
5    False
dtype: bool

In [39]:
series_obj[series_obj.isnull()]

2    NaN
3    NaN
dtype: object
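
The mask returned by `isnull()` has a complement, `notnull()`, and both can be used to count values as well as to select them. A minimal sketch on the same Series (the result names are introduced here for illustration):

```python
import numpy as np
import pandas as pd

missing = np.nan
series_obj = pd.Series(['row 1', 'row 2', missing, missing, 'row 5', 'row 6'])

n_missing = int(series_obj.isnull().sum())    # True counts as 1
present = series_obj[series_obj.notnull()]    # keep only the non-missing values
n_present = int(series_obj.count())           # count() ignores NaN
```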

## Filling missing values

In [40]:
np.random.seed(25)
DF = DataFrame(np.random.rand(36).reshape(6,6))
DF

          0         1         2         3         4         5
0  0.870124  0.582277  0.278839  0.185911  0.411100  0.117376
1  0.684969  0.437611  0.556229  0.367080  0.402366  0.113041
2  0.447031  0.585445  0.161985  0.520719  0.326051  0.699186
3  0.366395  0.836375  0.481343  0.516502  0.383048  0.997541
4  0.514244  0.559053  0.034450  0.719930  0.421004  0.436935
5  0.281701  0.900274  0.669612  0.456069  0.289804  0.525819


In [42]:
# Set some values to missing.
DF.loc[3:5,0] = missing
DF.loc[1:4,5] = missing
DF

          0         1         2         3         4         5
0  0.870124  0.582277  0.278839  0.185911  0.411100  0.117376
1  0.684969  0.437611  0.556229  0.367080  0.402366       NaN
2  0.447031  0.585445  0.161985  0.520719  0.326051       NaN
3       NaN  0.836375  0.481343  0.516502  0.383048       NaN
4       NaN  0.559053  0.034450  0.719930  0.421004       NaN
5       NaN  0.900274  0.669612  0.456069  0.289804  0.525819


In [44]:
# With the fillna() method we assign an arbitrary value to the missing values, zero in our case.
DF_filled = DF.fillna(0)
DF_filled

          0         1         2         3         4         5
0  0.870124  0.582277  0.278839  0.185911  0.411100  0.117376
1  0.684969  0.437611  0.556229  0.367080  0.402366  0.000000
2  0.447031  0.585445  0.161985  0.520719  0.326051  0.000000
3  0.000000  0.836375  0.481343  0.516502  0.383048  0.000000
4  0.000000  0.559053  0.034450  0.719930  0.421004  0.000000
5  0.000000  0.900274  0.669612  0.456069  0.289804  0.525819


In [45]:
# We can control more precisely how the missing values are filled.
# Here the missing elements of column 0 are filled with 0.1 and those of column 5 with 2.
DF_filled = DF.fillna({0:0.1, 5:2})
DF_filled

          0         1         2         3         4         5
0  0.870124  0.582277  0.278839  0.185911  0.411100  0.117376
1  0.684969  0.437611  0.556229  0.367080  0.402366  2.000000
2  0.447031  0.585445  0.161985  0.520719  0.326051  2.000000
3  0.100000  0.836375  0.481343  0.516502  0.383048  2.000000
4  0.100000  0.559053  0.034450  0.719930  0.421004  2.000000
5  0.100000  0.900274  0.669612  0.456069  0.289804  0.525819


In [46]:
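# Forward fill ('ffill') replaces each missing value with the last non-NaN value above it in the same column.
# DataFrame.ffill() is used here because fillna(method='ffill') is deprecated in recent pandas versions.
DF_filled = DF.ffill()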
DF_filled

          0         1         2         3         4         5
0  0.870124  0.582277  0.278839  0.185911  0.411100  0.117376
1  0.684969  0.437611  0.556229  0.367080  0.402366  0.117376
2  0.447031  0.585445  0.161985  0.520719  0.326051  0.117376
3  0.447031  0.836375  0.481343  0.516502  0.383048  0.117376
4  0.447031  0.559053  0.034450  0.719930  0.421004  0.117376
5  0.447031  0.900274  0.669612  0.456069  0.289804  0.525819
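
Forward fill has a mirror image, backward fill, and both accept a `limit` on how many consecutive missing values are filled. The sketch below uses a small hypothetical Series so the effect of each call is easy to follow.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a gap of two missing values.
s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()           # propagate the last valid value downwards
backward = s.bfill()          # propagate the next valid value upwards
limited = s.ffill(limit=1)    # fill at most one consecutive NaN
```

With `limit=1` the second missing value in the gap stays NaN, which can be useful when long gaps should not be filled blindly.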


In [51]:
# A common approach is to fill the missing values of each column with the mean of that column's observed values.
media_0 = DF[0].mean()
media_5 = DF[5].mean()
DF_filled = DF.fillna({0:media_0,5:media_5})
DF_filled

          0         1         2         3         4         5
0  0.870124  0.582277  0.278839  0.185911  0.411100  0.117376
1  0.684969  0.437611  0.556229  0.367080  0.402366  0.321597
2  0.447031  0.585445  0.161985  0.520719  0.326051  0.321597
3  0.667375  0.836375  0.481343  0.516502  0.383048  0.321597
4  0.667375  0.559053  0.034450  0.719930  0.421004  0.321597
5  0.667375  0.900274  0.669612  0.456069  0.289804  0.525819
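
Computing one mean per column by hand does not scale to many columns. Since `mean()` skips NaN by default and `fillna()` accepts a Series keyed by column label, the two can be combined. A minimal sketch on a small hypothetical DataFrame (values chosen so the means are easy to verify):

```python
import numpy as np
import pandas as pd

# Hypothetical data: column 0 has mean 2.0, column 5 has mean 3.0 (NaN ignored).
small = pd.DataFrame({0: [1.0, 3.0, np.nan], 5: [np.nan, 2.0, 4.0]})

column_means = small.mean()            # one mean per column, NaN skipped
filled = small.fillna(column_means)    # each column filled with its own mean
```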


## Counting Missing Values

In [19]:
DF

          0         1         2         3         4         5
0  0.870124  0.582277  0.278839  0.185911  0.411100  0.117376
1  0.684969  0.437611  0.556229  0.367080  0.402366       NaN
2  0.447031  0.585445  0.161985  0.520719  0.326051       NaN
3       NaN  0.836375  0.481343  0.516502  0.383048       NaN
4       NaN  0.559053  0.034450  0.719930  0.421004       NaN
5       NaN  0.900274  0.669612  0.456069  0.289804  0.525819


In [47]:
# Chaining isnull() with sum() counts the missing values in each column.
DF.isnull().sum() 

0    3
1    0
2    0
3    0
4    0
5    4
dtype: int64
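
The same idea extends to counting per row or over the whole table. The sketch below rebuilds the DataFrame with the missing values set as above; the result names are introduced only for this example.

```python
import numpy as np
import pandas as pd

np.random.seed(25)
DF = pd.DataFrame(np.random.rand(36).reshape(6, 6))
DF.loc[3:5, 0] = np.nan    # label slices include both endpoints
DF.loc[1:4, 5] = np.nan

per_column = DF.isnull().sum()           # missing values in each column
per_row = DF.isnull().sum(axis=1)        # missing values in each row
total = int(DF.isnull().sum().sum())     # missing values in the whole table
```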

## Filtering out missing values
The dropna() method returns a new DataFrame without the rows that have at least one missing value (NaN).

In [48]:
DF_no_nan = DF.dropna()
DF_no_nan


          0         1         2         3       4         5
0  0.870124  0.582277  0.278839  0.185911  0.4111  0.117376


In [49]:
# With axis=1, the columns that contain missing values are dropped instead of the rows.
DF_no_nan = DF.dropna(axis=1)
DF_no_nan


          1         2         3         4
0  0.582277  0.278839  0.185911  0.411100
1  0.437611  0.556229  0.367080  0.402366
2  0.585445  0.161985  0.520719  0.326051
3  0.836375  0.481343  0.516502  0.383048
4  0.559053  0.034450  0.719930  0.421004
5  0.900274  0.669612  0.456069  0.289804
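
Dropping every row or column that has any NaN can discard a lot of data, as the single surviving row above shows. `dropna()` takes parameters that make it less strict. A minimal sketch on the same DataFrame (the result names are introduced here for illustration):

```python
import numpy as np
import pandas as pd

np.random.seed(25)
DF = pd.DataFrame(np.random.rand(36).reshape(6, 6))
DF.loc[3:5, 0] = np.nan
DF.loc[1:4, 5] = np.nan

# Only look at column 0 when deciding which rows to drop.
by_subset = DF.dropna(subset=[0])

# Keep rows that have at least 5 non-missing values.
by_thresh = DF.dropna(thresh=5)

# Drop a row only if all of its values are missing (none are, here).
all_missing = DF.dropna(how='all')
```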
