
<br>
===================================<br>
Column Transformer with Mixed Types<br>
===================================<br>
This example illustrates how to apply different preprocessing and<br>
feature extraction pipelines to different subsets of features,<br>
using :class:`sklearn.compose.ColumnTransformer`.<br>
This is particularly useful for datasets that contain<br>
heterogeneous data types, since we may want to scale the<br>
numeric features and one-hot encode the categorical ones.<br>
In this example, the numeric data is standard-scaled after<br>
median-imputation, while the categorical data is one-hot<br>
encoded after imputing missing values with a new category<br>
(``'missing'``).<br>
Finally, the preprocessing pipeline is integrated in a<br>
full prediction pipeline using :class:`sklearn.pipeline.Pipeline`,<br>
together with a simple classification model.<br>


Author: Pedro Morales <part.morales@gmail.com><br>
<br>
License: BSD 3 clause

In [None]:
import numpy as np

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

In [None]:
np.random.seed(0)

Load data from https://www.openml.org/d/40945

In [None]:
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

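Alternatively, X and y can be obtained from the ``frame`` attribute of the object<br>
returned when ``return_X_y`` is not set, for example<br>
``titanic = fetch_openml("titanic", version=1, as_frame=True)``:<br>
X = titanic.frame.drop('survived', axis=1)<br>
y = titanic.frame['survived']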

We will train our classifier with the following features:<br>
Numeric Features:<br>
- age: float.<br>
- fare: float.<br>
Categorical Features:<br>
- embarked: categories encoded as strings {'C', 'S', 'Q'}.<br>
- sex: categories encoded as strings {'female', 'male'}.<br>
- pclass: ordinal integers {1, 2, 3}.

We create the preprocessing pipelines for both numeric and categorical data.

In [None]:
numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

In [None]:
categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [None]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
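
To make the effect of the two branches concrete, here is a minimal sketch on a hypothetical four-row toy frame (the `toy_*` names and the values are illustrative assumptions, not part of the Titanic data). It rebuilds the same kind of preprocessor on two numeric and two categorical columns, then transforms a row with an unseen category to show what `handle_unknown='ignore'` does.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for a few Titanic-like columns.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "fare": [7.25, 71.28, np.nan, 26.55],
    "sex": pd.Series(["male", "female", "female", "male"], dtype=object),
    "embarked": pd.Series(["S", "C", np.nan, "S"], dtype=object),
})

toy_preprocessor = ColumnTransformer(transformers=[
    # Numeric branch: median imputation, then standard scaling.
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())]), ["age", "fare"]),
    # Categorical branch: fill gaps with 'missing', then one-hot encode.
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     ["sex", "embarked"]),
])

toy_out = toy_preprocessor.fit_transform(toy)
# 2 scaled numeric columns + 2 'sex' columns + 3 'embarked' columns
# ('C', 'S' and the imputed 'missing' category).
print(toy_out.shape)

# A row whose 'embarked' value ('Q') was never seen during fit.
unseen = pd.DataFrame({
    "age": [30.0],
    "fare": [10.0],
    "sex": pd.Series(["male"], dtype=object),
    "embarked": pd.Series(["Q"], dtype=object),
})
unseen_out = toy_preprocessor.transform(unseen)
# The result may be sparse or dense depending on its density.
unseen_row = (unseen_out.toarray() if hasattr(unseen_out, "toarray")
              else np.asarray(unseen_out))
# With handle_unknown='ignore', the 'embarked' block is all zeros.
print(unseen_row[0, 4:])
```

The unseen category is encoded as an all-zero block rather than raising an error, which is why this setting is convenient when a rare category may be absent from the training split.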

Append classifier to preprocessing pipeline.<br>
Now we have a full prediction pipeline.

In [None]:
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

==============================================<br>
Using the prediction pipeline in a grid search<br>
==============================================<br>
Grid search can also be performed on the different preprocessing steps<br>
defined in the ``ColumnTransformer`` object, together with the classifier's<br>
hyperparameters as part of the ``Pipeline``.<br>
We will search for both the imputer strategy of the numeric preprocessing<br>
and the regularization parameter of the logistic regression using<br>
:class:`sklearn.model_selection.GridSearchCV`.
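
Nested parameters are addressed by joining step names with a double underscore. The sketch below illustrates this on a stand-alone copy of the pipeline (the `demo_clf` name is an assumption for illustration; no data is needed because the estimator is never fitted).

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-alone copy of the numeric branch plus classifier, for illustration.
demo_clf = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler())]), ["age", "fare"])])),
    ("classifier", LogisticRegression()),
])

# Path: pipeline step -> ColumnTransformer entry -> inner step -> parameter.
imputer_key = "preprocessor__num__imputer__strategy"

params = demo_clf.get_params()
print(imputer_key, "->", params[imputer_key])
print("classifier__C ->", params["classifier__C"])

# set_params accepts the same keys; GridSearchCV sets each candidate this way.
demo_clf.set_params(**{imputer_key: "mean"})
new_strategy = demo_clf.get_params()[imputer_key]
print(imputer_key, "->", new_strategy)
```

Listing `sorted(demo_clf.get_params())` is a quick way to find the exact key for any parameter before writing a grid.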

In [None]:
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10, 100],
}

In [None]:
grid_search = GridSearchCV(clf, param_grid, cv=10)
grid_search.fit(X_train, y_train)

In [None]:
print("best logistic regression from grid search: %.3f"
      % grid_search.score(X_test, y_test))
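
After fitting, the search object also records which combination was selected. The following is a minimal sketch on hypothetical synthetic data (the `demo_*` names, the 10% missing-value rate, and the smaller grid are assumptions chosen so that the block runs without downloading anything); the same attributes are available on `grid_search` above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data with roughly 10% of the entries set to NaN.
X_demo, y_demo = make_classification(n_samples=200, n_features=4,
                                     random_state=0)
rng = np.random.RandomState(0)
X_demo[rng.rand(*X_demo.shape) < 0.1] = np.nan

demo_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
demo_grid = {
    "imputer__strategy": ["mean", "median"],
    "classifier__C": [0.1, 1.0, 10],
}

demo_search = GridSearchCV(demo_pipe, demo_grid, cv=5)
demo_search.fit(X_demo, y_demo)

# Selected combination and its mean cross-validated accuracy.
print("best params:", demo_search.best_params_)
print("best mean CV accuracy: %.3f" % demo_search.best_score_)

# cv_results_ holds one entry per candidate (2 strategies x 3 values of C).
n_candidates = len(demo_search.cv_results_["params"])
print("candidates evaluated:", n_candidates)
```

Note that `best_score_` is a cross-validation average on the training data, whereas `score(X_test, y_test)` evaluates the refitted best estimator on held-out data, so the two numbers generally differ.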