## Missing values

  * Missing values can be encoded in many ways: "-", "?", "-9999", "N/A", "NA", etc.
  
  * NumPy has the special value `np.nan`; pandas has `pd.NA`
  
  * Identifying what represents a missing value and replacing it with `pd.NA` is generally part of preprocessing

  * There are multiple techniques for **imputing** missing values, since most models cannot handle them directly
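As a minimal sketch of the identification step, the snippet below maps a list of sentinel codes to a real missing value. The toy DataFrame and the list of codes are hypothetical, and `np.nan` is used instead of `pd.NA` so that the column can be converted to a plain float dtype afterwards; pandas treats both as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the sentinel codes below are illustrative only
raw = pd.DataFrame({
    "age": ["22", "?", "38", "-9999"],
    "town": ["Southampton", "N/A", "-", "Cherbourg"],
})

# Map every known "missing" code to a real missing value
missing_codes = ["-", "?", "-9999", "N/A", "NA"]
clean = raw.replace(missing_codes, np.nan)

# Once the codes are gone, the column can be converted to a numeric dtype
clean["age"] = pd.to_numeric(clean["age"])

# Number of missing values per column
print(clean.isna().sum())
```

When the data comes from a file, `pd.read_csv` accepts an `na_values` argument that performs the same mapping at load time.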

In [1]:
import seaborn as sns

titanic = sns.load_dataset('titanic')

In [2]:
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


We can see the percentage of missing values in each column:

In [3]:
titanic.isna().mean() * 100

survived        0.000000
pclass          0.000000
sex             0.000000
age            19.865320
sibsp           0.000000
parch           0.000000
fare            0.000000
embarked        0.224467
class           0.000000
who             0.000000
adult_male      0.000000
deck           77.216611
embark_town     0.224467
alive           0.000000
alone           0.000000
dtype: float64

We can drop every row that has at least one NA with pandas:

In [4]:
titanic.shape

(891, 15)

In [5]:
titanic.dropna().shape

(182, 15)

If a column has a large percentage of missing values, we can drop it and then drop every row that still has an NA in the remaining columns:

In [6]:
X = titanic.drop(columns=['deck', 'embarked', 'alive', 'survived', 'class', 'who'])
y = titanic['survived']

In [7]:
isna = X.isna().any(axis=1)

In [8]:
X[~isna].shape

(712, 9)
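The two-step procedure above (drop sparse columns, then drop incomplete rows) can also be written generically. This is a sketch on a hypothetical toy frame; the 50% threshold is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "deck" is mostly missing, "age" only partly
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
    "deck": [np.nan, "C", np.nan, np.nan, np.nan],
})

# Fraction of missing values per column
missing_fraction = df.isna().mean()

# Keep only the columns at or below the (illustrative) threshold ...
threshold = 0.5
kept = df.loc[:, missing_fraction <= threshold]

# ... and then drop the rows that still contain a missing value
complete = kept.dropna()
print(complete.shape)
```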

### Imputing missing values

If we have little data or do not want to lose observations, it is sometimes useful to fill in the missing values of one or more variables. scikit-learn implements several basic strategies in the `impute` module:

  * `impute.SimpleImputer`: can impute missing values with the mean, median, most frequent value or a constant
  * `impute.KNNImputer`: imputes using the mean of the $k$ nearest neighbors
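To make the difference between the two imputers concrete, here is a small sketch on a hypothetical toy matrix in which one value is missing. The data and `n_neighbors=2` are chosen only so that the arithmetic is easy to follow by hand.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy matrix: the second column is missing in the last row
X_toy = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [10.0, 100.0],
    [2.5, np.nan],
])

# Column mean of the observed values: (10 + 20 + 30 + 100) / 4 = 40
mean_filled = SimpleImputer(strategy="mean").fit_transform(X_toy)

# Mean of the 2 nearest rows (by the first column): (20 + 30) / 2 = 25
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X_toy)

print(mean_filled[-1, 1], knn_filled[-1, 1])
```

The mean is pulled up by the outlying row, while the neighbor-based estimate only uses the rows that look similar to the incomplete one.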

In [10]:
y.value_counts()

0    549
1    342
Name: survived, dtype: int64

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

In [13]:
X_train.shape

(668, 9)

In [14]:
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer

imputer = make_column_transformer(
    (SimpleImputer(strategy='mean', add_indicator=True), ['age']),
    (SimpleImputer(strategy='most_frequent', add_indicator=True), ['embark_town']),
    remainder='passthrough'
)

X_train_im = imputer.fit_transform(X_train)
X_test_im = imputer.transform(X_test)

In [12]:
X_train_im.shape

(668, 11)
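The output has 11 columns instead of 9 because each imputer was created with `add_indicator=True`, which appends a binary column marking where a value was originally missing. By default the indicator is only added for features that had missing values during `fit`. A minimal sketch with a hypothetical single-column input:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column input with one missing entry
ages = np.array([[20.0], [np.nan], [40.0]])

imputer = SimpleImputer(strategy="mean", add_indicator=True)
out = imputer.fit_transform(ages)

# Column 0: imputed values; column 1: 1.0 where the value was missing
print(out)
```

Keeping the indicator lets a downstream model learn whether the fact that a value was missing is itself informative.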

In [18]:
X_train

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,adult_male,embark_town,alone
671,1,male,31.0,1,0,52.0000,True,Southampton,False
417,2,female,18.0,0,2,13.0000,False,Southampton,False
634,3,female,9.0,3,2,27.9000,False,Southampton,False
323,2,female,22.0,1,1,29.0000,False,Southampton,False
379,3,male,19.0,0,0,7.7750,True,Southampton,True
...,...,...,...,...,...,...,...,...,...
131,3,male,20.0,0,0,7.0500,True,Southampton,True
490,3,male,,1,0,19.9667,True,Southampton,False
528,3,male,39.0,0,0,7.9250,True,Southampton,True
48,3,male,,2,0,21.6792,True,Cherbourg,False


In [23]:
import pandas as pd

pd.DataFrame(X_train_im)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,31.0,0.0,Southampton,False,1,male,1,0,52.0,True,False
1,18.0,0.0,Southampton,False,2,female,0,2,13.0,False,False
2,9.0,0.0,Southampton,False,3,female,3,2,27.9,False,False
3,22.0,0.0,Southampton,False,2,female,1,1,29.0,False,False
4,19.0,0.0,Southampton,False,3,male,0,0,7.775,True,True
...,...,...,...,...,...,...,...,...,...,...,...
663,20.0,0.0,Southampton,False,3,male,0,0,7.05,True,True
664,29.796842,1.0,Southampton,False,3,male,1,0,19.9667,True,False
665,39.0,0.0,Southampton,False,3,male,0,0,7.925,True,True
666,29.796842,1.0,Cherbourg,False,3,male,2,0,21.6792,True,False


In [13]:
import pandas as pd
pd.DataFrame(X_train_im).isna().mean()

0     0.0
1     0.0
2     0.0
3     0.0
4     0.0
5     0.0
6     0.0
7     0.0
8     0.0
9     0.0
10    0.0
dtype: float64

#### Model-based strategies

scikit-learn provides `impute.IterativeImputer`, added in version 0.21 and still experimental. It works as follows:

   * It fits a model where the output ($y$) is the variable to impute and the features are the remaining columns
   
   * It fills in the missing values using the model's predictions
   
   * This is repeated for every column with missing values
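The steps above can be sketched by hand for a single column. This is a simplified, single-pass illustration using `LinearRegression` on hypothetical toy data, not the actual `IterativeImputer` implementation, which starts from a simple initial imputation, cycles over all incomplete columns for several rounds and uses `BayesianRidge` as its default estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data where column 1 is exactly 2 * column 0
X_toy = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
])

target = 1  # index of the column to impute
missing = np.isnan(X_toy[:, target])
features = np.delete(X_toy, target, axis=1)

# Fit on the rows where the target column is observed ...
model = LinearRegression().fit(features[~missing], X_toy[~missing, target])

# ... and fill the gaps with the model's predictions
X_filled = X_toy.copy()
X_filled[missing, target] = model.predict(features[missing])
print(X_filled[-1])
```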

For basic transformations and/or ones not included in scikit-learn, we can also use pandas. For example, to replace missing values with the most frequent one:

In [14]:
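# mode() returns a Series; take its first element so a scalar is assigned
# (assigning the Series itself would align on the index and leave NaN in place)
X.loc[X['embark_town'].isna(), 'embark_town'] = X['embark_town'].mode()[0]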

Since we are now going to impute missing values using a model, all variables need to be numeric:

In [15]:
X_num = pd.get_dummies(X, drop_first=True)

In [16]:
X_num.head()

Unnamed: 0,pclass,age,sibsp,parch,fare,adult_male,alone,sex_male,embark_town_Queenstown,embark_town_Southampton
0,3,22.0,1,0,7.25,True,False,1,0,1
1,1,38.0,1,0,71.2833,False,False,0,0,0
2,3,26.0,0,0,7.925,False,True,0,0,1
3,1,35.0,1,0,53.1,False,False,0,0,1
4,3,35.0,0,0,8.05,True,True,1,0,1


In [17]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_num, y, stratify=y, random_state=0)

In [18]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.compose import make_column_transformer

imputer = IterativeImputer()
X_train_ii = imputer.fit_transform(X_train)
X_test_ii = imputer.transform(X_test)

In [19]:
pd.DataFrame({'SimpleImputer': X_train_im[:, 0], 'IterativeImputer': X_train_ii[:, 1]})

Unnamed: 0,SimpleImputer,IterativeImputer
0,31.0,31.000000
1,18.0,18.000000
2,9.0,9.000000
3,22.0,22.000000
4,19.0,19.000000
...,...,...
663,20.0,20.000000
664,29.796842,25.086903
665,39.0,39.000000
666,29.796842,22.684674


In [29]:
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)

iris.data[["petal length (cm)", "petal width (cm)"]]

Unnamed: 0,petal length (cm),petal width (cm)
0,1.4,0.2
1,1.4,0.2
2,1.3,0.2
3,1.5,0.2
4,1.4,0.2
...,...,...
145,5.2,2.3
146,5.0,1.9
147,5.2,2.0
148,5.4,2.3


#### Comparing imputation methods

The scikit-learn documentation includes two examples comparing the different methods:

  * [Imputing missing values with variants of IterativeImputer](https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html#sphx-glr-auto-examples-impute-plot-iterative-imputer-variants-comparison-py)
  * [Imputing missing values before building an estimator](https://scikit-learn.org/stable/auto_examples/impute/plot_missing_values.html#sphx-glr-auto-examples-impute-plot-missing-values-py)
  
Another visual comparison:

<img src="https://amueller.github.io/ml-workshop-1-of-4/slides/images/med_knn_rf_comparison.png" width="500">

### Combining preprocessing steps: *Pipelines*

We can combine several preprocessing steps so that they are applied to different columns (in parallel) with `ColumnTransformer`.

With the `Pipeline` class, we can combine preprocessing steps so that they are applied **sequentially**.

Pipelines also let us combine preprocessing with model fitting, for example to jointly search for the optimal parameters with `GridSearchCV`.

[User guide](https://scikit-learn.org/stable/modules/compose.html)
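One practical consequence of putting preprocessing and model in a single pipeline is that cross-validation re-fits the preprocessing on each training fold. The sketch below illustrates this with `cross_val_score` on hypothetical synthetic data; the data generation and the choice of steps are assumptions made only for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data with roughly 10% of the entries removed
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)
X_toy[rng.random(X_toy.shape) < 0.1] = np.nan

# Imputer and scaler are re-fitted on the training part of every fold,
# so no statistics from the held-out part leak into preprocessing
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(),
)
scores = cross_val_score(model, X_toy, y_toy, cv=5)
print(scores.mean())
```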

In [20]:
X.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,adult_male,embark_town,alone
0,3,male,22.0,1,0,7.25,True,Southampton,False
1,1,female,38.0,1,0,71.2833,False,Cherbourg,False
2,3,female,26.0,0,0,7.925,False,Southampton,True
3,1,female,35.0,1,0,53.1,False,Southampton,False
4,3,male,35.0,0,0,8.05,True,Southampton,True


In [30]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]
)

categorical_features = ['embark_town', 'sex', 'pclass']
categorical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ],
    remainder='passthrough'
)

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

In [32]:
X_train_pre = preprocessor.fit_transform(X_train)
X_test_pre = preprocessor.transform(X_test)

In [33]:
X_train

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,adult_male,embark_town,alone
671,1,male,31.0,1,0,52.0000,True,Southampton,False
417,2,female,18.0,0,2,13.0000,False,Southampton,False
634,3,female,9.0,3,2,27.9000,False,Southampton,False
323,2,female,22.0,1,1,29.0000,False,Southampton,False
379,3,male,19.0,0,0,7.7750,True,Southampton,True
...,...,...,...,...,...,...,...,...,...
131,3,male,20.0,0,0,7.0500,True,Southampton,True
490,3,male,,1,0,19.9667,True,Southampton,False
528,3,male,39.0,0,0,7.9250,True,Southampton,True
48,3,male,,2,0,21.6792,True,Cherbourg,False


In [35]:
pd.DataFrame(X_train_pre)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,0.105744,0.350476,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1,0,True,False
1,-0.901055,-0.379422,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0,2,False,False
2,-1.598071,-0.100564,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,3,2,False,False
3,-0.591271,-0.079977,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1,1,False,False
4,-0.823609,-0.47721,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0,0,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
663,-0.746163,-0.490778,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0,0,True,True
664,-0.049148,-0.249038,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1,0,True,False
665,0.725313,-0.474402,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0,0,True,True
666,-0.049148,-0.216988,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,2,0,True,False


In [36]:
from sklearn.linear_model import LogisticRegression

clf = Pipeline(
    steps=[
        ('preprocessor', preprocessor), 
        ('classifier', LogisticRegression())
    ]
)
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.8116591928251121

In [37]:
#from sklearn import set_config
#set_config(display='diagram')
clf

### Using Pipelines in parameter searches

More information and examples: [Pipelines](https://github.com/amueller/ml-workshop-3-of-4/blob/master/notebooks/03%20Pipelines.ipynb)
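Parameters of nested steps are addressed by joining the step names with double underscores, which is the naming used in a `GridSearchCV` parameter grid. The sketch below builds a hypothetical miniature pipeline (step names chosen for illustration) and lists some of the resulting parameter names with `get_params`.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature pipeline; step names are chosen for illustration
numeric = Pipeline(steps=[("imputer", SimpleImputer()), ("scaler", StandardScaler())])
pre = ColumnTransformer(transformers=[("num", numeric, ["age", "fare"])])
toy_clf = Pipeline(steps=[("pre", pre), ("clf", LogisticRegression())])

# Every tunable parameter is addressed as <step>__<substep>__...__<parameter>
names = sorted(toy_clf.get_params().keys())
print([n for n in names if n.endswith("strategy") or n.endswith("__C")])

# The same names work with set_params, which is how a grid search
# applies each candidate combination to a clone of the pipeline
toy_clf.set_params(pre__num__imputer__strategy="median", clf__C=0.1)
```

Calling `get_params().keys()` on the real pipeline is a quick way to check the exact spelling of a name before adding it to `param_grid`.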

In [53]:
import seaborn as sns
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression, BayesianRidge
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.ensemble import RandomForestRegressor

titanic = sns.load_dataset('titanic')

X = titanic.drop(columns=["class", "who", "adult_male", "embarked", "alone", "deck", "survived", "alive"])
y = titanic["survived"]

In [54]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
X.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,embark_town
0,3,male,22.0,1,0,7.25,Southampton
1,1,female,38.0,1,0,71.2833,Cherbourg
2,3,female,26.0,0,0,7.925,Southampton
3,1,female,35.0,1,0,53.1,Southampton
4,3,male,35.0,0,0,8.05,Southampton


In [55]:
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(
    steps=[
        ('imputer', IterativeImputer()),
        ('scaler', StandardScaler())
    ]
)

categorical_features = ['embark_town', 'sex', 'pclass']
categorical_transformer = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ]
)

ordinal_features = ["sibsp", "parch"]
ordinal_transformer = SimpleImputer(strategy="most_frequent")

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
        ('ord', ordinal_transformer, ordinal_features)
    ],
    remainder='passthrough'
)

clf = Pipeline(
    steps=[
        ('pre', preprocessor), 
        ('clf', LogisticRegression())
    ]
)

clf

In [57]:
param_grid = {
    'pre__num__imputer__estimator': [BayesianRidge(), RandomForestRegressor()],
    'pre__ord__strategy': ["most_frequent", "median"],
    'clf__C': [0.1, 1.0, 10],
    'clf__solver': ["lbfgs", "liblinear"]
}

cv = GridSearchCV(clf, param_grid, cv=5)
cv.fit(X_train, y_train)
cv.score(X_test, y_test)





0.7892376681614349

In [59]:
import joblib

joblib.dump(cv, "modelo.pkl")

['modelo.pkl']

In [60]:
cv1 = joblib.load("modelo.pkl")

In [62]:
cv1.predict(X_test)

array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
       0, 0, 0])

In [61]:
cv1.best_params_

{'clf__C': 10,
 'clf__solver': 'lbfgs',
 'pre__num__imputer__estimator': BayesianRidge(),
 'pre__ord__strategy': 'most_frequent'}

In [58]:
cv.best_params_

{'clf__C': 10,
 'clf__solver': 'lbfgs',
 'pre__num__imputer__estimator': BayesianRidge(),
 'pre__ord__strategy': 'most_frequent'}

### Exercises

#### Exercise 1

Using the titanic data, we will try to predict survival (`survived`) from the remaining variables, except:

   * `deck`: has many missing values
    
   * `embarked`: same as `embark_town`
    
   * `alive`: same as `survived`
   
   * `who`: same as `sex`
   
   * `class`: same as `pclass`
   
   * `adult_male`: same as `sex`

To do so, we first prepare the data:

   1. Fill in the values of the variable `embark_town` using the most frequent value
    
   2. Convert all variables to numeric using one-hot encoding


#### Exercise 2

Using the data from exercise 1, we will now fit a logistic regression model:
   
   1. Dropping the rows where the value of `age` is missing
   2. Imputing the variable `age` with the mean
   3. Imputing the variable `age` using k nearest neighbors
   4. Imputing the variable `age` using a *random forest* model (see the `estimator` parameter of `IterativeImputer`)

In [29]:
import seaborn as sns

titanic = sns.load_dataset('titanic')

X = titanic.drop(columns=['deck', 'embarked', 'alive', 'survived', 'class', 'who'])
y = titanic['survived']

# note: we are filling in both train and test, but since few values are missing in this case it is not a problem
# mode() returns a Series, so take its first element to assign a scalar
X.loc[X['embark_town'].isna(), 'embark_town'] = X['embark_town'].mode()[0]
X_num = pd.get_dummies(X, drop_first=True)

In [30]:
isna = X.isna().any(axis=1)
X_train, X_test, y_train, y_test = train_test_split(X_num[~isna], y[~isna], stratify=y[~isna], random_state=0)

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X_num, y, stratify=y, random_state=0)

In [32]:
X_train.isna().sum()

pclass                       0
age                        136
sibsp                        0
parch                        0
fare                         0
adult_male                   0
alone                        0
sex_male                     0
embark_town_Queenstown       0
embark_town_Southampton      0
dtype: int64

In [33]:
X_train.dtypes

pclass                       int64
age                        float64
sibsp                        int64
parch                        int64
fare                       float64
adult_male                    bool
alone                         bool
sex_male                     uint8
embark_town_Queenstown       uint8
embark_town_Southampton      uint8
dtype: object

In [34]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.pipeline import Pipeline, make_pipeline

clf = Pipeline(steps=[('imputer', IterativeImputer(estimator=RandomForestRegressor())), 
                      ('classifier', LogisticRegression(solver='liblinear'))])
clf.fit(X_train, y_train)
clf.score(X_test, y_test)



0.8026905829596412