# Data Transformation with Scikit-Learn

We will show some of Scikit-Learn's data transformation functionality through examples on the Titanic dataset, just as we did with Pandas.

First we import the libraries and load the dataset.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()

In [2]:
# Load the data
df = pd.read_csv('DS_Clase_05_titanic.csv')
print(df.shape)
df.head(5)

(891, 12)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Imputing missing values with Scikit-Learn

One thing to keep in mind is that most Scikit-Learn estimators do not accept missing values, so one of the first things we have to do is impute them. The `sklearn.impute` module, whose [documentation](https://scikit-learn.org/stable/modules/impute.html#impute) we recommend reading, provides some useful classes for this task.

The simplest imputer is `SimpleImputer`, which fills in missing values in the columns we choose. Look at the following example and explore the parameters of this object.

In [35]:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')

In [36]:
edades = df.Age.values
imp.fit(edades.reshape(-1,1))
print(imp.statistics_)

[29.69911765]


In [37]:
edades_imputed = imp.transform(edades.reshape(-1,1))
print(edades_imputed[:10])

[[22.        ]
 [38.        ]
 [26.        ]
 [35.        ]
 [35.        ]
 [29.69911765]
 [54.        ]
 [ 2.        ]
 [27.        ]
 [14.        ]]


And, if we want to add them to the DataFrame:

In [38]:
df['Age_imputed'] = edades_imputed
df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_imputed,rangos_etarios_scikit
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,22.0,1.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,38.0,2.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,26.0,1.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,35.0,2.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,35.0,2.0
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,29.699118,1.0
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,54.0,3.0
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,2.0,0.0
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,27.0,1.0
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,14.0,0.0
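As a starting point for exploring the parameters of `SimpleImputer`, the following sketch compares several values of `strategy` on a small toy column. The data and variable names are made up for illustration and are not taken from the Titanic dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with one missing value (hypothetical data, not the Titanic ages)
ages = np.array([22.0, 38.0, np.nan, 35.0, 35.0]).reshape(-1, 1)

# Fit one imputer per strategy and keep the value it learned for the column
stats = {}
for strategy in ["mean", "median", "most_frequent"]:
    imputer = SimpleImputer(strategy=strategy)
    filled = imputer.fit_transform(ages)
    stats[strategy] = imputer.statistics_[0]
    print(strategy, imputer.statistics_, filled.ravel())

# strategy="constant" fills with the value passed in fill_value
constant_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
constant_filled = constant_imputer.fit_transform(ages).ravel()
print(constant_filled)
```

Which strategy is appropriate depends on the column: the mean is sensitive to outliers, while the median and the most frequent value are more robust.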


### Discretization and binning with Scikit-Learn

The main difference from Pandas is that Scikit-Learn chooses the bin edges itself, according to the strategy we pass in. The class we will use is `KBinsDiscretizer`.

In [65]:
from sklearn.preprocessing import KBinsDiscretizer
est = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy = 'uniform')

We extract the values we want to fit.

In [66]:
edades = df.Age_imputed.values
print(edades.reshape(-1,1).shape)

(891, 1)


And we fit the estimator.

In [67]:
est.fit(edades.reshape(-1,1))

KBinsDiscretizer(encode='ordinal', n_bins=5, strategy='uniform')

We look at the edges of each bin.

In [68]:
est.bin_edges_

array([array([ 0.42 , 16.336, 32.252, 48.168, 64.084, 80.   ])],
      dtype=object)
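With `strategy='uniform'` all bins have the same width, so the edges above can be reproduced by hand. The following is only a sketch of the arithmetic, not the library's implementation; the minimum and maximum are copied from the output above, and the variable names are illustrative.

```python
import numpy as np

# Minimum and maximum age, as shown in the bin edges above
age_min, age_max = 0.42, 80.0
n_bins = 5

# Equal-width edges: n_bins + 1 points evenly spaced between min and max
edges = np.linspace(age_min, age_max, n_bins + 1)
print(edges)

# Bin index of a single age: count how many interior edges are <= age
age = 38.0
bin_index = int(np.searchsorted(edges[1:-1], age, side="right"))
print(bin_index)
```

Each bin is (80 - 0.42) / 5 = 15.916 years wide, so an age of 38 falls between the edges 32.252 and 48.168, that is, in bin 2.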

In [69]:
bines_asignados = est.transform(edades.reshape(-1,1))
print(bines_asignados[:10])

[[1.]
 [2.]
 [1.]
 [2.]
 [2.]
 [1.]
 [3.]
 [0.]
 [1.]
 [0.]]


And we add them to the DataFrame.

In [51]:
df['rangos_etarios_scikit'] = bines_asignados
df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_imputed,rangos_etarios_scikit
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,22.0,1.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,38.0,2.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,26.0,1.0


This can be done in a single line with `.fit_transform`.

In [45]:
df['rangos_etarios_scikit'] = est.fit_transform(edades.reshape(-1,1))
df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_imputed,rangos_etarios_scikit
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,22.0,1.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,38.0,2.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,26.0,1.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,35.0,2.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,35.0,2.0
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,29.699118,1.0
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,54.0,3.0
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,2.0,0.0
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,27.0,1.0
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,14.0,0.0


What strategies does `KBinsDiscretizer` offer? In what ways can it *encode* the output?
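One way to explore these questions is to fit the discretizer on a small, skewed toy array and compare the resulting edges. The sketch below assumes made-up data and illustrative variable names. According to the Scikit-Learn documentation, `strategy` accepts `'uniform'`, `'quantile'` and `'kmeans'`, and `encode` accepts `'onehot'`, `'onehot-dense'` and `'ordinal'`; only some of them are shown here.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Skewed toy data (hypothetical), so the strategies give visibly different edges
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float).reshape(-1, 1)

# Compare the bin edges chosen by two of the available strategies
edges_by_strategy = {}
for strategy in ["uniform", "quantile"]:
    est = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    est.fit(values)
    edges_by_strategy[strategy] = est.bin_edges_[0]
    print(strategy, est.bin_edges_[0])

# encode="onehot-dense" returns one 0/1 column per bin instead of a bin index
onehot_est = KBinsDiscretizer(n_bins=3, encode="onehot-dense", strategy="uniform")
onehot_bins = onehot_est.fit_transform(values)
print(onehot_bins[:3])
```

With `'uniform'` the single large value stretches the bins, so almost all points land in the first one; `'quantile'` instead places the edges so that each bin holds roughly the same number of points.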

### `OneHotEncoder`

The workhorse for encoding categorical variables is `OneHotEncoder`.

In [70]:
from sklearn.preprocessing import OneHotEncoder
# Note: in scikit-learn >= 1.2 this parameter is named `sparse_output`
onehot_encoder = OneHotEncoder(sparse=False)

In [72]:
generos = df.Sex.values.reshape(-1,1)
print(np.unique(generos))

['female' 'male']


In [73]:
onehot_encoder.fit(generos)

OneHotEncoder(categorical_features=None, categories=None, drop=None,
              dtype=<class 'numpy.float64'>, handle_unknown='error',
              n_values=None, sparse=False)

In [74]:
onehot_encoder.categories_

[array(['female', 'male'], dtype=object)]

In [75]:
generos_encoded = onehot_encoder.transform(generos)
print(generos_encoded)

[[0. 1.]
 [1. 0.]
 [1. 0.]
 ...
 [1. 0.]
 [0. 1.]
 [0. 1.]]


In [76]:
onehot_encoder.inverse_transform(generos_encoded[500].reshape(1,-1))

array([['male']], dtype=object)

In [77]:
df['female_encoded'] = generos_encoded[:,0]
df['male_encoded'] = generos_encoded[:,1]
df.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_imputed,rangos_etarios_scikit,female_encoded,male_encoded
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,22.0,1.0,0.0,1.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,38.0,2.0,1.0,0.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,26.0,1.0,1.0,0.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,35.0,2.0,1.0,0.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,35.0,2.0,0.0,1.0
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,29.699118,1.0,0.0,1.0
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,54.0,3.0,0.0,1.0
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,2.0,0.0,0.0,1.0
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,27.0,1.0,1.0,0.0
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,14.0,0.0,1.0,0.0
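The same encoder works for columns with more than two categories. The following sketch uses a toy column with three values and shows the `handle_unknown='ignore'` option; the data and variable names are illustrative assumptions. It calls `.toarray()` on the default sparse output so that it does not depend on the name of the `sparse` / `sparse_output` parameter.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy port-of-embarkation column (values chosen for illustration)
ports = np.array(["S", "C", "Q", "S"]).reshape(-1, 1)

# handle_unknown="ignore" encodes categories unseen during fit as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(ports).toarray()  # default output is sparse

print(encoder.categories_)  # categories are stored in sorted order
print(encoded)

# A category not seen during fit becomes a row of zeros instead of an error
unseen = encoder.transform(np.array([["X"]])).toarray()
print(unseen)
```

Each row has a single 1 in the column of its category, and the column order follows `categories_`.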


### Exercise

Take the dataset 'DS_Clase_10_Heart.csv' and apply the same data transformations you did with Pandas, but now with Scikit-Learn. Transform the `sex` column with a `LabelEncoder` and the `thal` column with a `OneHotEncoder`.
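Since `LabelEncoder` has not appeared in the examples above, here is a minimal sketch of how it behaves on a toy list. The values are hypothetical, and the real `sex` column in the exercise dataset may be coded differently.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values (hypothetical; the real `sex` column may be coded differently)
sex = ["male", "female", "female", "male"]

label_encoder = LabelEncoder()
sex_encoded = label_encoder.fit_transform(sex)

print(label_encoder.classes_)  # categories in sorted order
print(sex_encoded)             # integer code of each value
print(label_encoder.inverse_transform(sex_encoded[:2]))  # back to the labels
```

Unlike `OneHotEncoder`, `LabelEncoder` takes a 1D input, so no `reshape(-1, 1)` is needed. The Scikit-Learn documentation describes it as intended for target labels; for input features it suggests `OrdinalEncoder`.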