# Complex machine-learning pipeline

First, let's fetch the "titanic" dataset directly from OpenML.

In [6]:
import pandas as pd

In this dataset, missing values are encoded with the character `"?"`. We tell pandas about this when reading the CSV file.

In [7]:
df = pd.read_csv(
    "https://www.openml.org/data/get_csv/16826755/phpMYEkMl.csv",
    na_values='?'  # missing values are encoded as '?'
)
df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
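To see what `na_values` changes, here is a minimal sketch on a tiny made-up CSV string (the values below are illustrative and are not taken from the OpenML file):

```python
import io

import pandas as pd

# Hypothetical three-row CSV; "?" marks a missing value, as in the Titanic file.
csv_text = "age,fare\n29,211.3\n?,151.55\n2,?\n"

raw = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Without na_values, the "?" forces both columns to be read as text.
print(raw.dtypes)
# With na_values, both columns are numeric and "?" becomes NaN.
print(parsed.dtypes)
print(parsed.isna().sum())
```

Reading the markers as NaN up front is what lets the imputation steps later in this notebook find the missing entries.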


The classification task is to predict whether or not a person will survive the Titanic disaster.

In [8]:
X_df = df.drop(columns='survived')
y = df['survived']

We will split the data into a training and a testing set.

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_df, y, random_state=42
)

<div class="alert alert-success">
    <p><b>QUESTIONS</b>:</p>
    <ul>
        <li>What would happen if you try to fit a <tt>RandomForestClassifier</tt>?</li>
    </ul>
</div>

In [31]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
# clf.fit(X_train, y_train)  # Fails because some columns contain non-numeric values
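The failure can be reproduced safely on a toy frame. The sketch below uses made-up data (the names `X_toy` and `y_toy` are hypothetical) and catches the error instead of letting the cell crash:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up frame with one string column, mimicking 'sex' in the Titanic data.
X_toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "fare": [7.9, 211.3, 151.5, 15.2],
})
y_toy = [0, 1, 1, 0]

try:
    RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)
    fit_error = None
except ValueError as exc:
    # scikit-learn tries to convert the whole frame to floats and fails on strings.
    fit_error = exc

print(type(fit_error).__name__, fit_error)
```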

# Working only with numerical data

We already saw in the previous lecture that we can easily train a scikit-learn model on numerical data. Therefore, we will start by selecting only the numerical columns and training a model on them.

## Pandas preprocessing

Before using scikit-learn, we will do some simple preprocessing with pandas. First, let's select only the numerical columns.

In [32]:
# Numerical columns ('pclass': travel class, 'parch': number of parents/children aboard)
num_cols = ['age', 'pclass', 'parch', 'fare']

X_train_num = X_train[num_cols]

<div class="alert alert-success">
    <p><b>QUESTIONS</b>:</p>
    <ul>
        <li>What would happen if you try to fit a <tt>RandomForestClassifier</tt>?</li>
    </ul>
</div>

In [33]:
model = RandomForestClassifier(n_estimators=100)
# model.fit(X_train_num, y_train)  # Fails here because of missing values (NaN)

We might want to look into a summary of the data that we try to fit.

In [14]:
X_train_num.info()
# One record has a missing 'fare'; many records have a missing 'age'
# Idea: replace the missing ages with the mean age

<class 'pandas.core.frame.DataFrame'>
Int64Index: 981 entries, 1139 to 1126
Data columns (total 4 columns):
age       784 non-null float64
pclass    981 non-null int64
parch     981 non-null int64
fare      980 non-null float64
dtypes: float64(2), int64(2)
memory usage: 38.3 KB


Since some values are missing, we can replace them with the mean of the corresponding column.

In [15]:
X_train_num_imputed = X_train_num.fillna(X_train_num.mean())
X_train_num_imputed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 981 entries, 1139 to 1126
Data columns (total 4 columns):
age       981 non-null float64
pclass    981 non-null int64
parch     981 non-null int64
fare      981 non-null float64
dtypes: float64(2), int64(2)
memory usage: 38.3 KB


In [16]:
X_train_num.mean()  # Check the values used for imputation: are they sensible?
# For 'parch' (number of parents/children), a fractional mean is not a realistic value,
# but it is better than filling with 0, which would add noise.

age       29.347683
pclass     2.298675
parch      0.391437
fare      33.686466
dtype: float64

In [17]:
model.fit(X_train_num_imputed, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

<div class="alert alert-success">
    <p><b>EXERCISE</b>:</p>
    <ul>
    <li>What should we do if there are also missing values in the test set?</li>
    <li>Process the test set so as to be able to compute the test score of the model.</li>
    </ul>
</div>

In [34]:
# Select the numerical columns
X_test_num = X_test[num_cols]

# WARNING: impute with the same means as those computed on X_train
X_test_num_imputed = X_test_num.fillna(X_train_num.mean())

model.fit(X_train_num_imputed, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [35]:
model.score(X_test_num_imputed, y_test)

0.6737804878048781
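As a minimal sketch of the rule used in the solution above (impute the test set with statistics computed on the training set), here is the same `fillna` pattern on two tiny made-up frames. The names `train`, `test` and `train_means` are hypothetical:

```python
import numpy as np
import pandas as pd

# Made-up data, only to illustrate the mechanics.
train = pd.DataFrame({"age": [20.0, 40.0, np.nan], "fare": [10.0, 30.0, 20.0]})
test = pd.DataFrame({"age": [np.nan, 60.0], "fare": [np.nan, 5.0]})

# Means are computed on the training data only; NaN entries are skipped.
train_means = train.mean()

# fillna with a Series aligns on column names, so each column gets its own mean.
test_imputed = test.fillna(train_means)
print(test_imputed)
```

Using `test.mean()` instead would let the test data influence the preprocessing, which is exactly what we want to avoid.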

## Make it less error prone using scikit-learn

We saw earlier that we must be careful when preprocessing data to avoid any "data leak" (i.e. letting information from the test set influence how the model or its preprocessing is fitted). Scikit-learn provides the `Pipeline` class to chain successive transformations. In addition, it ensures that the right operations are applied at the right time: statistics are learned during `fit` and only reused during `predict` or `score`.

In [29]:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer

model = make_pipeline(SimpleImputer(strategy='mean', add_indicator=True),
                      RandomForestClassifier(n_estimators=100))
model.fit(X_train_num, y_train)

Pipeline(memory=None,
         steps=[('simpleimputer',
                 SimpleImputer(add_indicator=True, copy=True, fill_value=None,
                               missing_values=nan, strategy='mean',
                               verbose=0)),
                ('randomforestclassifier',
                 RandomForestClassifier(bootstrap=True, class_weight=None,
                                        criterion='gini', max_depth=None,
                                        max_features='auto',
                                        max_leaf_nodes=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        n_estimators=100, n_jobs=None,
                                        oob_score=False, random_state=None,
         

In [30]:
model.score(X_test_num, y_test)

0.676829268292683
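To make the imputation step of this pipeline less of a black box, the sketch below runs `SimpleImputer` alone on small made-up arrays (`X_toy_train` and `X_toy_test` are hypothetical names). It shows that the means are learned in `fit` and reused in `transform`, and what `add_indicator=True` appends:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: the first column has a missing value in the training set.
X_toy_train = np.array([[20.0, 10.0], [40.0, 30.0], [np.nan, 20.0]])
X_toy_test = np.array([[np.nan, 5.0]])

imputer = SimpleImputer(strategy="mean", add_indicator=True)
imputer.fit(X_toy_train)

# Column means learned from the training data only.
print(imputer.statistics_)

# The test NaN is filled with the training mean, and one indicator column is
# appended for each feature that had missing values during fit.
out = imputer.transform(X_toy_test)
print(out)
```

The extra indicator column lets the downstream classifier know which values were imputed, which can be informative when missingness itself is related to the target.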

If we want to fit the model directly on `X_train`, we can select the numerical columns using a `ColumnTransformer` object:

In [22]:
from sklearn.compose import make_column_transformer

numerical_preprocessing = make_column_transformer(
    # each tuple pairs a transformer with the columns it is applied to
    (SimpleImputer(strategy='mean'), num_cols)
)
model = make_pipeline(
    numerical_preprocessing,
    RandomForestClassifier(n_estimators=100),
)
model.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('simpleimputer',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value=None,
                                                                missing_values=nan,
                                                                strategy='mean',
                                                                verbose=0),
                                                  ['age', 'pclass', 'parch',
                                                   'fare'])],
                                   verbose=False)),
                

In [23]:
model.score(X_test, y_test)

0.6737804878048781
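The following sketch, on a made-up frame (`X_toy` is a hypothetical name), illustrates what the column transformer hands to the classifier: only the listed columns are kept and imputed, and the others are dropped because `remainder='drop'` is the default.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer

# Made-up frame with one string column and two numerical columns.
X_toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "age": [20.0, np.nan, 40.0],
    "fare": [10.0, 20.0, 30.0],
})

ct = make_column_transformer(
    (SimpleImputer(strategy="mean"), ["age", "fare"]),
)
out = np.asarray(ct.fit_transform(X_toy))

# 'name' is not listed, so it is dropped; the missing age becomes the mean age.
print(out.shape)
print(out)
```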

# Working only with categorical data

Categorical columns (especially those stored as strings) are not natively supported by most machine-learning algorithms and require a preprocessing step usually called encoding. With tree-based algorithms, a sensible choice of encoding is the `OrdinalEncoder`.
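As a small illustration of what this encoding produces, here is a sketch on a made-up array with two categorical columns (the values are illustrative, not taken from the dataset):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical data: one column like 'sex', one like 'embarked'.
X_toy = np.array(
    [["male", "S"], ["female", "C"], ["female", "S"]],
    dtype=object,
)

encoder = OrdinalEncoder()
codes = encoder.fit_transform(X_toy)

# Categories are sorted within each column, then replaced by their rank.
print(encoder.categories_)
print(codes)
```

Each category becomes an arbitrary integer code. A tree can still isolate a category with a few threshold splits, which is why this compact encoding is usually acceptable for tree-based models.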

In [24]:
X_train.head()

Unnamed: 0,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1139,3,"Rekic, Mr. Tido",male,38.0,0,0,349249,7.8958,,S,,,
678,3,"Boulos, Master. Akar",male,6.0,1,1,2678,15.2458,,C,,,"Syria Kent, ON"
290,1,"Taussig, Mr. Emil",male,52.0,1,1,110413,79.65,E67,S,,,"New York, NY"
285,1,"Straus, Mr. Isidor",male,67.0,1,0,PC 17483,221.7792,C55 C57,S,,96.0,"New York, NY"
1157,3,"Rosblom, Mr. Viktor Richard",male,18.0,1,1,370129,20.2125,,S,,,


In [25]:
cat_col = ['sex', 'embarked', 'pclass']

In [26]:
X_train_cat = X_train[cat_col]

In [27]:
X_train_cat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 981 entries, 1139 to 1126
Data columns (total 3 columns):
sex         981 non-null object
embarked    980 non-null object
pclass      981 non-null int64
dtypes: int64(1), object(2)
memory usage: 30.7+ KB


In [36]:
from sklearn.preprocessing import OrdinalEncoder

cat_pipeline = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing'),
    OrdinalEncoder(),
)

categorical_preprocessing = make_column_transformer(
    (cat_pipeline, cat_col)
)
model = make_pipeline(
    categorical_preprocessing,
    RandomForestClassifier(n_estimators=100)
)
model.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('pipeline',
                                                  Pipeline(memory=None,
                                                           steps=[('simpleimputer',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value='missing',
                                                                                 missing_values=nan,
                                                                                 strategy='constant',
                                      

In [37]:
model.score(X_test, y_test)

0.7713414634146342

# Combining both categorical and numerical data in the pipeline

<div class="alert alert-success">
    <p><b>EXERCISE</b>:</p>
    <ul>
    <li>Try to combine the numerical and categorical pipelines into a single <tt>ColumnTransformer</tt></li>
        <li>Fit a <tt>RandomForestClassifier</tt> on the output of this feature engineering. How does the test score change?</li>
    </ul>
</div>

In [None]:

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder

# 'pclass' is now treated as categorical, so it is removed from the numerical columns
cat_col = ['sex', 'embarked', 'pclass']
num_cols = ['age', 'parch', 'fare']

cat_pipeline = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing'),
    OrdinalEncoder(),
)
num_pipeline = SimpleImputer(strategy='mean')

# Apply each preprocessing step to its own group of columns
preprocessing = make_column_transformer(
    (cat_pipeline, cat_col),
    (num_pipeline, num_cols),
)

model = make_pipeline(
    preprocessing,
    RandomForestClassifier(n_estimators=100),
)
model.fit(X_train, y_train).score(X_test, y_test)


In [13]:
# To run only the preprocessing step
# (the cell above must have been run first, so that `preprocessing` is defined and fitted):
preprocessing.transform(X_train)

NameError: name 'preprocessing' is not defined