## Example: Titanic

Let's build a **classification model** to predict whether a passenger would survive the Titanic disaster, using the following data:

| Variable | Definition | Key |
|----------|-------------|-----|
| survival | Survival | 0 = No, 1 = Yes |
| pclass   | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex      | Sex          |   |
| Age      | Age in years |   |
| sibsp    | # of siblings / spouses aboard the Titanic |   |
| parch    | # of parents / children aboard the Titanic  |   |
| ticket   | Ticket number |   |
| fare     | Passenger fare |   |
| cabin    | Cabin number   |   |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |


In [4]:
import sklearn
import pandas as pd

titanic = pd.read_csv("data/data/titanic.csv")
titanic.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


Let's reuse a simplified version of the "ML recipe" from the previous lesson:

In [5]:
# Sometimes some records have missing data

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# 1) Preprocess data and feature engineering
# Most machine learning models cannot handle text directly, only numbers.
# As a first approach, let's focus on numeric data
titanic_nb = titanic.select_dtypes(include='number')
print(titanic_nb.tail())

# Also, what are those NaNs???? Let's drop them!
titanic_nb = titanic_nb.dropna()
print("=" * 50)
print(titanic_nb.tail())


# Don't forget to split titanic into x and y!
x = titanic_nb.drop(columns="Survived") # dataframe with all columns except "Survived"
y = titanic_nb["Survived"] # output variable is "Survived"
print("=" * 50)
print(x.tail())

     PassengerId  Survived  Pclass   Age  SibSp  Parch   Fare
886          887         0       2  27.0      0      0  13.00
887          888         1       1  19.0      0      0  30.00
888          889         0       3   NaN      1      2  23.45
889          890         1       1  26.0      0      0  30.00
890          891         0       3  32.0      0      0   7.75
     PassengerId  Survived  Pclass   Age  SibSp  Parch    Fare
885          886         0       3  39.0      0      5  29.125
886          887         0       2  27.0      0      0  13.000
887          888         1       1  19.0      0      0  30.000
889          890         1       1  26.0      0      0  30.000
890          891         0       3  32.0      0      0   7.750
     PassengerId  Pclass   Age  SibSp  Parch    Fare
885          886       3  39.0      0      5  29.125
886          887       2  27.0      0      0  13.000
887          888       1  19.0      0      0  30.000
889          890       1  26.0      0      0  30.000
890          891       3  32.0      0      0   7.750

In [6]:
# 2) Choose a model. The most important thing is that the model
# matches the task at hand: for a regression problem, choose a
# regression model; for a classification problem, choose a classification model.
# Most models have a "regression" flavour and a "classification" flavour,
# for example DecisionTreeRegressor vs. DecisionTreeClassifier.
# In the current example, let's select the "classification" counterpart of
# the LinearRegression() model, which is called:
model = LogisticRegression() # This is a classification model


# 3) Split the data into training and test sets.
x_train, x_test, y_train, y_test = train_test_split(x, y) # records in the training set do not appear in the test set
# 4) Train the model on the training set to try to maximize performance.
model.fit(x_train, y_train) # this is the training step with the training data

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
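
The warning above names two remedies: raising `max_iter` or scaling the data. The following is a minimal sketch of both, assuming synthetic stand-in data whose two features live on very different scales. The arrays `X_demo` and `y_demo` and the value `max_iter=1000` are illustrative choices, not taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the Titanic data): two features on very different scales.
rng = np.random.default_rng(0)
X_demo = np.column_stack([rng.normal(0, 1, 300), rng.normal(0, 1000, 300)])
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 1000 > 0).astype(int)

# Scale first, then fit; max_iter is raised above its default of 100 as extra headroom.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

# n_iter_ reports how many solver iterations were actually needed.
print("Iterations used:", scaled_model[-1].n_iter_[0])
print("Training accuracy:", scaled_model.score(X_demo, y_demo))
```

If the number of iterations used stays below `max_iter`, the solver stopped because it converged, not because it ran out of budget.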


In [7]:
# 5) Measure the actual performance on the test set.
# Recall that the measure should be aligned with the task at hand 
# (regression Vs classification)
# For classification, a reasonable metric is ACCURACY
from sklearn.metrics import accuracy_score
y_test_pred = model.predict(x_test) 
acc = accuracy_score(y_test, y_test_pred)
print(f"Accuracy is {acc}")

Accuracy is 0.7486033519553073
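
Accuracy is simply the fraction of test records whose predicted label matches the true label. A small sketch with made-up toy labels (not Titanic predictions) shows that computing it by hand agrees with `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels for illustration only (not taken from the Titanic data).
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

manual_acc = np.mean(y_true == y_pred)        # fraction of matching labels
sklearn_acc = accuracy_score(y_true, y_pred)  # same quantity via scikit-learn

print(manual_acc, sklearn_acc)
```

Six of the eight toy predictions match, so both values are 0.75.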


**Can we improve it?**

# Exercise: Categorical and numerical features
What is a categorical feature? What is a numerical feature? Classify each feature of the Titanic dataset into one of these categories.
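
As a possible starting point for this exercise, pandas can report which columns are stored as numbers and how many distinct values each one takes. The sketch below rebuilds a few columns from the five rows printed by `titanic.head()` above; the variable names `sample`, `numeric_cols`, and `non_numeric_cols` are illustrative.

```python
import pandas as pd

# First five rows shown by titanic.head() above, restricted to a few columns.
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Embarked": ["S", "C", "S", "S", "S"],
})

numeric_cols = sample.select_dtypes(include="number").columns.tolist()
non_numeric_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Stored as numbers:", numeric_cols)
print("Stored as text:   ", non_numeric_cols)
# A numeric dtype does not guarantee a numerical feature: check the distinct values too.
print(sample.nunique())
```

The dtype is only a hint. A column stored as integers with very few distinct values may still be better treated as categorical, so the final classification remains a judgment call.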

# Improvement: feature selection and feature engineering
Feature engineering is the process of creating, transforming, or selecting input variables to improve the performance of machine learning models. Selecting which variables to keep is known as feature selection.

## Example: feature selection and feature engineering
Let's create a new variable called `FamSize` by adding `SibSp` and `Parch`. This is a simple example of feature engineering.

In addition, we will keep the variables that we believe have predictive value for our model and discard the rest. This is a simple example of feature selection.

In [None]:
x = titanic.drop(columns="Survived")
y = titanic["Survived"]

# Feature engineering
x["FamSize"] = x["SibSp"] + x["Parch"]
# Feature selection: it is not a good idea to use all features!
# This is especially true for high-dimensional data and is called
# "THE CURSE OF DIMENSIONALITY"
x = x[["Pclass", "Sex", "Age", "Fare", "FamSize", "Embarked"]] # dataframe with the columns we want

print(x)

# Split train-test data and preprocessing steps
x_train, x_test, y_train, y_test = train_test_split(x, y) # we shall change this later
# Let's save columns to restore the DataFrame later
cols = x.columns
# ...
x_train.head()

# Pclass is a categorical variable even though it is represented with numbers (it has a limited number of options)

     Pclass     Sex   Age     Fare  FamSize Embarked
0         3    male  22.0   7.2500        1        S
1         1  female  38.0  71.2833        1        C
2         3  female  26.0   7.9250        0        S
3         1  female  35.0  53.1000        1        S
4         3    male  35.0   8.0500        0        S
..      ...     ...   ...      ...      ...      ...
886       2    male  27.0  13.0000        0        S
887       1  female  19.0  30.0000        0        S
888       3  female   NaN  23.4500        3        S
889       1    male  26.0  30.0000        0        C
890       3    male  32.0   7.7500        0        Q

[891 rows x 6 columns]


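| | Pclass | Sex | Age | Fare | FamSize | Embarked |
|---|---|---|---|---|---|---|
| 480 | 3 | male | 9.0 | 46.9 | 7 | S |
| 879 | 1 | female | 56.0 | 83.1583 | 1 | C |
| 768 | 3 | male | NaN | 24.15 | 1 | Q |
| 232 | 2 | male | 59.0 | 13.5 | 0 | S |
| 179 | 3 | male | 36.0 | 0.0 | 0 | S |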


## Exercise: feature engineering
The `Name` variable may carry useful information in the title used for each passenger (Miss, Mrs., Mr., Master, etc.). Create a column called `is_married` from this column. This is another example of feature engineering.

In [None]:
x = titanic.drop(columns="Survived")
y = titanic["Survived"]

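# Vectorized alternative (regex=False so that "." is matched literally):
# x["is_married"] = x["Name"].str.contains("Mrs.", regex=False) | x["Name"].str.contains("Mr.", regex=False)

def is_married(name):
    """Return True if the name contains the title 'Mrs.' or 'Mr.'."""
    # Caveat: 'Mr.' does not actually encode marital status, so this is only a rough proxy.
    return 'Mrs.' in name or 'Mr.' in name

x["is_married"] = x["Name"].apply(is_married)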

x = x[["Pclass", "Sex", "Age", "Fare", "Embarked", "is_married"]]

print(x)

     Pclass     Sex   Age     Fare Embarked  is_married
0         3    male  22.0   7.2500        S        True
1         1  female  38.0  71.2833        C        True
2         3  female  26.0   7.9250        S       False
3         1  female  35.0  53.1000        S        True
4         3    male  35.0   8.0500        S        True
..      ...     ...   ...      ...      ...         ...
886       2    male  27.0  13.0000        S       False
887       1  female  19.0  30.0000        S       False
888       3  female   NaN  23.4500        S       False
889       1    male  26.0  30.0000        C        True
890       3    male  32.0   7.7500        Q        True

[891 rows x 6 columns]
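
A related option is to extract the title itself and let later steps decide how to use it. The sketch below assumes every name follows the layout `Surname, Title. Given names`, which holds for the rows printed by `titanic.head()` above but has not been checked against the whole file. The variable `titles` is a hypothetical name.

```python
import pandas as pd

# Names copied from the titanic.head() output above.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Allen, Mr. William Henry",
])

# Assumed layout: "<Surname>, <Title>. <Given names>".
# Capture the text between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)
print(titles.tolist())
```

A title column like this is categorical, so it would need one-hot encoding before being passed to the model.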


# Improvement: Better data preprocessing

In this section, we will do the following:
* Fill in missing data (NaN) through **imputation**. For example, we will apply "mean" imputation to numeric data and "most_frequent" imputation to categorical data.
* Convert categorical columns to a numeric format through **one-hot encoding**. Specifically, we will transform the columns "Pclass" (1, 2, or 3), "Sex" ("male" or "female"), and "Embarked" (with possible values "C", "S", or "Q").
* Scale the numeric data to improve convergence through **standard scaling** (also known as the **z-score**).

Note that different transformations apply to different columns. For example, how can we make sure that one-hot encoding is applied only to "Pclass", "Sex", and "Embarked"? To achieve this, we will use *ColumnTransformer*.

In addition, some columns require several transformations. For example, numeric data will first be imputed and then standardized. To chain these operations efficiently, we will use the ***Pipeline***.
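
To make the numeric steps concrete, here is a minimal sketch of imputation followed by standard scaling on a single toy column. The values are made up for illustration and are not Titanic data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy column with one missing value (illustrative numbers, not Titanic data).
ages = np.array([[20.0], [30.0], [np.nan], [40.0]])

numeric_steps = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),  # NaN -> mean of the observed values (30.0)
    ("scaler", StandardScaler()),                 # z = (value - mean) / std
])
ages_z = numeric_steps.fit_transform(ages)

print(numeric_steps.named_steps["imputer"].statistics_)  # mean learned by the imputer
print(ages_z.ravel())                                    # standardized column
```

After imputation the column is 20, 30, 30, 40, so the scaled values are symmetric around zero and the imputed entry lands exactly at 0.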

In [None]:
x_train.head()

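| | Pclass | Sex | Age | Fare | FamSize | Embarked |
|---|---|---|---|---|---|---|
| 586 | 2 | male | 47.0 | 15.0 | 0 | S |
| 73 | 3 | male | 26.0 | 14.4542 | 1 | C |
| 130 | 3 | male | 33.0 | 7.8958 | 0 | C |
| 57 | 3 | male | 28.5 | 7.2292 | 0 | C |
| 117 | 2 | male | 29.0 | 21.0 | 1 | S |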


In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer 
from sklearn.pipeline import Pipeline

print(x_train.head())

scaler = StandardScaler()
numeric_imputer = SimpleImputer(strategy="mean") # fills NaNs with the column mean
categorical_imputer = SimpleImputer(strategy="most_frequent") # fills NaNs with the most frequent value
oh_encoder = OneHotEncoder(sparse_output=False) # one-hot encoder that returns a dense array

# Collect the transformations for numeric and for categorical data in one pipeline each
numeric_transformer = Pipeline(steps=[
    ("imputer", numeric_imputer), # imputer for numeric data
    ("scaler", scaler) # scaler
])
categorical_transformer = Pipeline(steps=[
    ("imputer", categorical_imputer), # imputer for categorical data
    ("encoder", oh_encoder) # one-hot encoder
])

transformer = ColumnTransformer( # single transformer 
    transformers=[
        ('numeric_imp', numeric_transformer, ['Age', 'FamSize', 'Fare']),  # apply numeric_transformer to these columns
        ('categorial_imp', categorical_transformer, ['Pclass', 'Sex', 'Embarked']) # apply categorical_transformer to these columns
    ]
)
# set_output is only used for better visualization. If not set to "pandas",
# the transformation returns a NumPy array, which works but is harder to read when printed
transformer = transformer.set_output(transform="pandas")

# Note! Classifiers/regressors use fit and predict;
# preprocessors use fit and transform (and fit_transform).
# Compare fit_transform on train vs. transform on test
x_train = transformer.fit_transform(x_train) # fit and transform train data
x_test  = transformer.transform(x_test) # only 'transform' test data, reusing the statistics learned on train

print("=" * 80)
print(x_train.head())

     Pclass   Sex   Age     Fare  FamSize Embarked
586       2  male  47.0  15.0000        0        S
73        3  male  26.0  14.4542        1        C
130       3  male  33.0   7.8958        0        C
57        3  male  28.5   7.2292        0        C
117       2  male  29.0  21.0000        1        S
     numeric_imp__Age  numeric_imp__FamSize  numeric_imp__Fare  \
586          1.341923             -0.572326          -0.356205   
73          -0.272477              0.075663          -0.367286   
130          0.265656             -0.572326          -0.500444   
57          -0.080287             -0.572326          -0.513978   
117         -0.041848              0.075663          -0.234385   

     categorial_imp__Pclass_1  categorial_imp__Pclass_2  \
586                       0.0                       1.0   
73                        0.0                       0.0   
130                       0.0                       0.0   
57                        0.0                       0.0   
117                       0.0                       1.0   

## Exercise: Categorical features
**Why is `Pclass` a categorical feature?** Explain.

`Pclass` is a categorical variable even though it is represented with numbers: it takes only a limited number of possible values (1, 2, or 3).

## Exercise: Pipelining
`Pipelines` are great. Create a `Pipeline` that contains both the full transformer and the classifier! Name it `preproc_and_model`.

# Improvement: Better use of the data with cross-validation

## Example: 5-fold cross-validation
Although scikit-learn provides `KFold`, I recommend `StratifiedKFold`, which keeps the class proportions of `y` roughly equal in every fold.
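
The difference between the two splitters is easiest to see on a small, sorted, imbalanced label vector. The following sketch uses toy labels (15 negatives followed by 5 positives, invented for illustration) and counts how many positives land in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy, sorted labels (illustrative): 15 negatives followed by 5 positives.
y_toy = np.array([0] * 15 + [1] * 5)
X_toy = np.zeros((len(y_toy), 1))  # feature values are irrelevant for the split itself

def positives_per_test_fold(splitter):
    """Count the positive labels in each test fold produced by `splitter`."""
    return [int(y_toy[test_idx].sum()) for _, test_idx in splitter.split(X_toy, y_toy)]

plain = positives_per_test_fold(KFold(n_splits=5))
stratified = positives_per_test_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)       # [0, 0, 0, 1, 4]
print("StratifiedKFold:", stratified)  # [1, 1, 1, 1, 1]
```

Without shuffling, `KFold` leaves three test folds with no positives at all, whereas `StratifiedKFold` spreads them evenly.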

In [None]:
# Final code!!!!
from sklearn.model_selection import StratifiedKFold
import numpy as np # for mean and std

x = titanic.drop(columns="Survived")
y = titanic["Survived"]

# Feature engineering
x["FamSize"] = x["SibSp"] + x["Parch"]
# Feature selection: it is not a good idea to use all features!
# This is especially true for high-dimensional data and is called
# "THE CURSE OF DIMENSIONALITY"
x = x[["Pclass", "Sex", "Age", "Fare", "FamSize", "Embarked"]]

def create_preproc_and_model():
    scaler = StandardScaler()
    numeric_imputer = SimpleImputer(strategy="mean")
    categorical_imputer = SimpleImputer(strategy="most_frequent")
    oh_encoder = OneHotEncoder(sparse_output=False)

    numeric_transformer = Pipeline(steps=[
        ("imputer", numeric_imputer),
        ("scaler", scaler)
    ])
    categorical_transformer = Pipeline(steps=[
        ("imputer", categorical_imputer),
        ("encoder", oh_encoder)
    ])

    transformer = ColumnTransformer(
        transformers=[
            ('numeric_imp', numeric_transformer, ['Age', 'FamSize', 'Fare']),  
            ('categorial_imp', categorical_transformer, ['Pclass', 'Sex', 'Embarked']) 
        ]
    )
    return Pipeline(steps=[
        ("transformer", transformer),
        ("log_reg", LogisticRegression())
    ])

# instead of 
# x_train, x_test, y_train, y_test = train_test_split(x, y)
kfold = StratifiedKFold(n_splits=5) # define 5 folds

accuracies = [] # res for each fold
fold = 1
for train_indices, test_indices in kfold.split(x, y): # which indexes are for train and test
    x_train = x.iloc[train_indices] # train
    y_train = y.iloc[train_indices]
    x_test = x.iloc[test_indices] # test
    y_test = y.iloc[test_indices]

    preproc_and_model = create_preproc_and_model()

    preproc_and_model.fit(x_train, y_train) # fit over train data
    y_test_pred = preproc_and_model.predict(x_test) # predict over test data
    acc = accuracy_score(y_test, y_test_pred)
    print(f"Accuracy in fold {fold} is {acc}")
    fold += 1
    accuracies.append(acc) # save accuracy for this fold

print("********************")
print(f"Avg accuracy: {np.mean(accuracies)} ± {np.std(accuracies)}") # mean and std of accuracies of all folds

# The mean accuracy has improved (about 0.80 vs. about 0.75 on the single split above)

Accuracy in fold 1 is 0.770949720670391
Accuracy in fold 2 is 0.8202247191011236
Accuracy in fold 3 is 0.7865168539325843
Accuracy in fold 4 is 0.7808988764044944
Accuracy in fold 5 is 0.8258426966292135
********************
Avg accuracy: 0.7968865733475614 ± 0.02199538125972221


## Exercise: `cross_val_score`

Writing the code above from scratch is an excellent way to test your understanding of cross-validation. However, it can be simplified considerably by using sklearn's `cross_val_score` function. Rewrite the code to use this function.
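
As a hint for the general shape of the call, here is a minimal sketch on synthetic stand-in data, so that it runs on its own. `X_demo`, `y_demo`, and `demo_pipeline` are hypothetical placeholders; in the notebook you would presumably pass the pipeline returned by `create_preproc_and_model()` together with `x` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so the sketch is self-contained (NOT the Titanic data).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Hypothetical stand-in for the pipeline built by create_preproc_and_model().
demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("log_reg", LogisticRegression()),
])

# cross_val_score clones and refits the pipeline on each training fold,
# then scores it on the matching test fold.
scores = cross_val_score(
    demo_pipeline, X_demo, y_demo,
    cv=StratifiedKFold(n_splits=5), scoring="accuracy",
)
print(f"Avg accuracy: {scores.mean()} ± {scores.std()}")
```

Because the whole pipeline is passed in, the preprocessing is refitted inside every fold, which mirrors what the manual loop does.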

# Model selection and nested cross-validation

## Example: model selection with nested cross-validation

Let's use nested cross-validation to estimate the accuracy we can expect on the test set when model selection is part of the procedure. To keep things simple, let's compare three logistic regression classifiers with three different penalties: `penalties = [None, "l2", "l1"]`.

The best way to avoid confusion with nested CV is to write a function that receives training data and returns a fitted model: `fit_model`. If you want to select the best possible model among the candidates (that is, model selection!), this `fit_model` must include a CV loop (the so-called inner CV).

To estimate the performance of the model produced by `fit_model`, we will use another CV loop that calls this function several times (the so-called outer CV).

In [None]:
x = titanic.drop(columns="Survived")
y = titanic["Survived"]
x["FamSize"] = x["SibSp"] + x["Parch"]
x = x[["Pclass", "Sex", "Age", "Fare", "FamSize", "Embarked"]]

# !!!!! add penalty param   
def create_preproc_and_model(penalty):
    scaler = StandardScaler()
    numeric_imputer = SimpleImputer(strategy="mean")
    categorical_imputer = SimpleImputer(strategy="most_frequent")
    oh_encoder = OneHotEncoder(sparse_output=False)

    numeric_transformer = Pipeline(steps=[
        ("imputer", numeric_imputer),
        ("scaler", scaler)
    ])
    categorical_transformer = Pipeline(steps=[
        ("imputer", categorical_imputer),
        ("encoder", oh_encoder)
    ])

    transformer = ColumnTransformer(
        transformers=[
            ('numeric_imp', numeric_transformer, ['Age', 'FamSize', 'Fare']),  
            ('categorial_imp', categorical_transformer, ['Pclass', 'Sex', 'Embarked']) 
        ]
    )
    # !!!!!! add penalty param and select a solver compatible with all penalties
    return Pipeline(steps=[
        ("transformer", transformer),
        ("log_reg", LogisticRegression(penalty=penalty, solver="saga", max_iter=1000))
    ])

def fit_model(x_train_val, y_train_val):
    inner_kfold = StratifiedKFold(n_splits=5)
    # !!!!!! Inner loop devoted to hyperparameter tuning
    param_accuracies = {}
    for penalty in [None, "l2", "l1"]:
        inner_accuracies = []
        for train_indices, val_indices in inner_kfold.split(x_train_val, y_train_val):
            x_train = x_train_val.iloc[train_indices]
            y_train = y_train_val.iloc[train_indices]
            x_val = x_train_val.iloc[val_indices]
            y_val = y_train_val.iloc[val_indices]

            preproc_and_model = create_preproc_and_model(penalty)
            preproc_and_model.fit(x_train, y_train) 
            
            # !!! Inner accuracies are calculated on the validation set
            inner_accuracies.append(
                accuracy_score(y_val, preproc_and_model.predict(x_val))
            )
        param_accuracies[penalty] = np.mean(inner_accuracies)
    
    best_config = max(param_accuracies, key=param_accuracies.get)
    print(f"Best model is {best_config}")

    # Now that the best model has been selected, we can train it on the whole training set
    preproc_and_model = create_preproc_and_model(best_config)
    preproc_and_model.fit(x_train_val, y_train_val)
    return preproc_and_model



# !!! Create the outer k-fold (the inner k-fold is created inside fit_model)
outer_kfold = StratifiedKFold(n_splits=5)

accuracies = []
fold = 1
for train_val_indices, test_indices in outer_kfold.split(x, y):
    x_train_val = x.iloc[train_val_indices]
    y_train_val = y.iloc[train_val_indices]
    x_test = x.iloc[test_indices]
    y_test = y.iloc[test_indices]

    print("=" * 50)
    print("Outer fold", fold)
    preproc_and_model = fit_model(x_train_val, y_train_val)
    # The final accuracy is calculated on the test set
    acc = (
        accuracy_score(y_test, preproc_and_model.predict(x_test))
    )
    print(f"Accuracy in Outer fold {fold} is {acc}")
    fold += 1
    accuracies.append(acc)

print("*"*50)
print(f"Avg accuracy: {np.mean(accuracies)} ± {np.std(accuracies)}")

Outer fold 1
Best model is l2
Accuracy in Outer fold 1 is 0.770949720670391
Outer fold 2
Best model is l1
Accuracy in Outer fold 2 is 0.8202247191011236
Outer fold 3
Best model is l2
Accuracy in Outer fold 3 is 0.7865168539325843
Outer fold 4
Best model is l2
Accuracy in Outer fold 4 is 0.7808988764044944
Outer fold 5
Best model is None
Accuracy in Outer fold 5 is 0.8258426966292135
**************************************************
Avg accuracy: 0.7968865733475614 ± 0.02199538125972221


## Exercise: Nested cross-validation with sklearn

The code block above is fairly complex, and writing it from scratch is an excellent way to test your understanding of nested cross-validation. However, it can be simplified by using sklearn's `GridSearchCV` and `cross_val_score` functions. Rewrite the code using these functions.