# Column Transformer with Mixed Types

This example illustrates how to apply different preprocessing and feature extraction pipelines to different subsets of features, using ColumnTransformer. This is particularly handy for datasets that contain heterogeneous data types, since we may want to scale the numeric features and one-hot encode the categorical ones.

In this example, the numeric data is standard-scaled after median imputation. The categorical data is one-hot encoded via OneHotEncoder, which creates a new category for missing values. We further reduce the dimensionality by selecting categories using a chi-squared test.
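As a rough illustration of the two categorical steps described above, the sketch below one-hot encodes a toy column that contains a missing value and then keeps the half of the encoded columns with the highest chi-squared scores. The toy values and the names `embarked_toy` and `target_toy` are assumptions made for illustration; they are not taken from the Titanic data.

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column with one missing value (illustrative values only).
embarked_toy = np.array([["C"], ["S"], [np.nan], ["S"], ["C"], ["Q"]], dtype=object)
target_toy = np.array([1, 0, 0, 0, 1, 1])

# The missing value is encoded as its own category next to "C", "Q" and "S".
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(embarked_toy).toarray()
print(encoder.categories_[0])
print(encoded.shape)

# A category never seen during fit is encoded as an all-zero row.
unknown_row = encoder.transform(np.array([["X"]], dtype=object)).toarray()
print(unknown_row)

# Keep the 50% of one-hot columns with the highest chi-squared score.
selector_toy = SelectPercentile(chi2, percentile=50)
selected = selector_toy.fit_transform(encoded, target_toy)
print(selector_toy.get_support())
print(selected.shape)
```

With `handle_unknown="ignore"`, categories that appear only at prediction time do not raise an error, which is why the categorical pipeline later in this example uses that setting.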

Finally, the preprocessing pipeline is integrated in a full prediction pipeline using Pipeline, together with a simple classification model.

# 1. Load necessary libraries

In [16]:
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

np.random.seed(0)

# 2. Load the data and do initial analysis

In [2]:
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)


In [4]:
X.head()

| | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2.0 | | St Louis, MO |
| 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | 11.0 | | Montreal, PQ / Chesterville, ON |
| 2 | 1 | Allison, Miss. Helen Loraine | female | 2.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | | | Montreal, PQ / Chesterville, ON |
| 3 | 1 | Allison, Mr. Hudson Joshua Creighton | male | 30.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0 | 1 | 2 | 113781 | 151.55 | C22 C26 | S | | | Montreal, PQ / Chesterville, ON |


In [6]:
y.value_counts()

| survived | count |
|---|---|
| 0 | 809 |
| 1 | 500 |


In [7]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   pclass     1309 non-null   int64   
 1   name       1309 non-null   object  
 2   sex        1309 non-null   category
 3   age        1046 non-null   float64 
 4   sibsp      1309 non-null   int64   
 5   parch      1309 non-null   int64   
 6   ticket     1309 non-null   object  
 7   fare       1308 non-null   float64 
 8   cabin      295 non-null    object  
 9   embarked   1307 non-null   category
 10  boat       486 non-null    object  
 11  body       121 non-null    float64 
 12  home.dest  745 non-null    object  
dtypes: category(2), float64(3), int64(3), object(5)
memory usage: 115.4+ KB


# 3. Create preprocessing pipeline

We will train our classifier with the following features:

Numeric Features:

1. age: float

2. fare: float

Categorical Features:

1. embarked: categories encoded as strings {'C', 'S', 'Q'}

2. sex: categories encoded as strings {'female', 'male'}

3. pclass: ordinal integers {1, 2, 3}.

We create the preprocessing pipelines for both numeric and categorical data. Note that pclass could either be treated as a categorical or numeric feature.

In [9]:
X.age.isnull().sum(), X.fare.isnull().sum()

(263, 1)

In [10]:
numeric_features = ["age", "fare"]
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)

categorical_features = ["embarked", "sex", "pclass"]
categorical_transformer = Pipeline(
    steps=[
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ("selector", SelectPercentile(chi2, percentile=50)),
    ]
)
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
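To make the behaviour of such a preprocessor easier to inspect, here is a minimal sketch on a four-row toy frame. It is a simplified variant, not the preprocessor defined above: the chi-squared selector is left out, `sparse_threshold=0` is set so the result is always a dense array, and the frame `toy` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with one missing age (illustrative values only).
toy = pd.DataFrame(
    {
        "age": [20.0, np.nan, 40.0, 30.0],
        "sex": ["female", "male", "male", "female"],
    }
)

# Numeric branch: median imputation followed by standard scaling.
toy_numeric = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)

# Each branch only sees the columns listed for it.
toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", toy_numeric, ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

transformed = toy_preprocessor.fit_transform(toy)
print(transformed.shape)  # one scaled numeric column plus two one-hot columns
print(transformed)
```

The missing age is replaced by the median of the observed ages (30.0), which here equals the column mean after imputation, so that row's scaled age is zero.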

# 4. Develop baseline models for logistic regression, random forest and gradient boosting

In [11]:
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

model score: 0.798


In [15]:
rf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", RandomForestClassifier())]
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rf.fit(X_train, y_train)
print("model score: %.3f" % rf.score(X_test, y_test))

model score: 0.809


In [17]:
gbm = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", GradientBoostingClassifier())]
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

gbm.fit(X_train, y_train)
print("model score: %.3f" % gbm.score(X_test, y_test))

model score: 0.824
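The three baseline pipelines above are built around the same `preprocessor` object, so each call to `fit` refits that one shared instance. One possible alternative, sketched below under stated assumptions, gives every pipeline its own unfitted copy with `sklearn.base.clone`. The helper `make_model`, the synthetic data, and the column names `f0` to `f3` are hypothetical and stand in for the Titanic features.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the Titanic dataset).
features, labels = make_classification(n_samples=200, n_features=4, random_state=0)
frame = pd.DataFrame(features, columns=["f0", "f1", "f2", "f3"])

shared_preprocessor = ColumnTransformer(
    transformers=[("num", StandardScaler(), ["f0", "f1", "f2", "f3"])]
)


def make_model(classifier):
    """Hypothetical helper: pair a classifier with its own preprocessor copy."""
    return Pipeline(
        steps=[
            ("preprocessor", clone(shared_preprocessor)),  # unfitted copy
            ("classifier", classifier),
        ]
    )


models = {
    "logistic_regression": make_model(LogisticRegression()),
    "random_forest": make_model(RandomForestClassifier(random_state=0)),
}

for name, model in models.items():
    model.fit(frame, labels)
    print(f"{name} training accuracy: {model.score(frame, labels):.3f}")
```

Because every model is fitted on the same training data here, sharing the preprocessor gives the same transformation; cloning mainly keeps the fitted pipelines independent of each other.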


In [18]:
clf

In [19]:
rf

In [20]:
gbm

When dealing with a cleaned dataset, the preprocessing can be automated by using the data types of the columns to decide whether to treat a column as a numerical or categorical feature. sklearn.compose.make_column_selector makes this possible. First, let's select only a subset of columns to simplify our example.

In [21]:
subset_feature = ["embarked", "sex", "pclass", "age", "fare"]
X_train, X_test = X_train[subset_feature], X_test[subset_feature]

In [22]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1047 entries, 1118 to 684
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   embarked  1045 non-null   category
 1   sex       1047 non-null   category
 2   pclass    1047 non-null   int64   
 3   age       841 non-null    float64 
 4   fare      1046 non-null   float64 
dtypes: category(2), float64(2), int64(1)
memory usage: 35.0 KB


In [24]:
from sklearn.compose import make_column_selector as selector

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, selector(dtype_exclude="category")),
        ("cat", categorical_transformer, selector(dtype_include="category")),
    ]
)
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", LogisticRegression())]
)

rf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", RandomForestClassifier())]
)

gbm = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", GradientBoostingClassifier())]
)

clf.fit(X_train, y_train)
print("Logistic regression model score: %.3f" % clf.score(X_test, y_test))

rf.fit(X_train, y_train)
print("Random forest model score: %.3f" % rf.score(X_test, y_test))

gbm.fit(X_train, y_train)
print("Gradient Boosting model score: %.3f" % gbm.score(X_test, y_test))


Logistic regression model score: 0.798
Random forest model score: 0.805
Gradient Boosting model score: 0.817


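The resulting scores are not all exactly the same as those from the previous pipeline because the dtype-based selector treats the pclass column as a numeric feature instead of a categorical feature as before. The tree-based classifiers are also created without a fixed random_state, so part of the difference may come from randomness in training.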
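If pclass should still be handled as a categorical feature with dtype-based selection, one option is to cast the column to the pandas `category` dtype before building the preprocessor. The sketch below shows the effect on a toy frame; the frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Toy frame mirroring the dtypes of the subset used above (illustrative values).
toy = pd.DataFrame(
    {
        "pclass": [1, 3, 2],
        "sex": pd.Categorical(["female", "male", "male"]),
        "fare": [211.3, 7.9, 26.0],
    }
)

# With an integer dtype, pclass is not picked up by the categorical selector.
categorical_before = selector(dtype_include="category")(toy)
print(categorical_before)

# After the cast, pclass is selected together with sex.
toy["pclass"] = toy["pclass"].astype("category")
categorical_after = selector(dtype_include="category")(toy)
numeric_after = selector(dtype_exclude="category")(toy)
print(categorical_after)
print(numeric_after)
```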

In [26]:
clf

In [27]:
rf

In [28]:
gbm

In [29]:
selector(dtype_exclude="category")(X_train)

['pclass', 'age', 'fare']

In [30]:
selector(dtype_include="category")(X_train)

['embarked', 'sex']

# 5. Using the prediction pipeline in a grid search

Grid search can also be performed on the different preprocessing steps defined in the ColumnTransformer object, together with the classifier’s hyperparameters, as part of the Pipeline. We will search over the imputer strategy of the numeric preprocessing, the percentile of the categorical feature selector, and the regularization parameter of the logistic regression using RandomizedSearchCV. This hyperparameter search randomly samples a fixed number of parameter settings, configured by n_iter. Alternatively, one can use GridSearchCV, which evaluates the full Cartesian product of the parameter space.
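To get a feel for the difference in cost, the following sketch counts the settings an exhaustive grid would evaluate for the parameter space used in this section, compared with a randomized search limited to 10 sampled settings. It uses only the standard library; the variable names are illustrative.

```python
from itertools import product

# Same candidate values as the search space used in this section.
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "preprocessor__cat__selector__percentile": [10, 30, 50, 70],
    "classifier__C": [0.1, 1.0, 10, 100],
}

# An exhaustive grid search evaluates every combination of candidate values.
all_settings = list(product(*param_grid.values()))
n_grid = len(all_settings)

# A randomized search evaluates only n_iter sampled settings.
n_iter = 10

print(f"grid search settings: {n_grid}")  # 2 * 4 * 4 = 32
print(f"randomized search settings: {n_iter}")
```

Each setting is additionally fitted once per cross-validation fold, so the gap in total fits grows with the number of folds.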

In [31]:
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "preprocessor__cat__selector__percentile": [10, 30, 50, 70],
    "classifier__C": [0.1, 1.0, 10, 100],
}
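The keys of this dictionary follow scikit-learn's nested parameter naming: step and transformer names are joined with double underscores, ending in the parameter name. A minimal sketch of the same convention with `set_params` and `get_params` follows; the small pipeline here is a cut-down stand-in for the one above, and the names `numeric`, `pre` and `model` are illustrative.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

numeric = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)
pre = ColumnTransformer(transformers=[("num", numeric, ["age", "fare"])])
model = Pipeline(steps=[("preprocessor", pre), ("classifier", LogisticRegression())])

# Path: pipeline step "preprocessor" -> transformer "num" -> step "imputer" -> strategy
model.set_params(preprocessor__num__imputer__strategy="mean", classifier__C=10)

params = model.get_params()
print(params["preprocessor__num__imputer__strategy"])
print(params["classifier__C"])
```

Calling `model.get_params().keys()` lists every name that can be used as a key in the search space.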


In [32]:
search_cv = RandomizedSearchCV(clf, param_grid, n_iter=10, random_state=0)
search_cv

In [33]:
search_cv.fit(X_train, y_train)

print("Best params:")
print(search_cv.best_params_)

Best params:
{'preprocessor__num__imputer__strategy': 'mean', 'preprocessor__cat__selector__percentile': 30, 'classifier__C': 100}


In [34]:
print(f"Internal CV score: {search_cv.best_score_:.3f}")

Internal CV score: 0.786


# 6. Visualize and introspect the top grid search results as a pandas dataframe

In [35]:
import pandas as pd

cv_results = pd.DataFrame(search_cv.cv_results_)
cv_results = cv_results.sort_values("mean_test_score", ascending=False)
cv_results[
    [
        "mean_test_score",
        "std_test_score",
        "param_preprocessor__num__imputer__strategy",
        "param_preprocessor__cat__selector__percentile",
        "param_classifier__C",
    ]
].head(5)

| | mean_test_score | std_test_score | param_preprocessor__num__imputer__strategy | param_preprocessor__cat__selector__percentile | param_classifier__C |
|---|---|---|---|---|---|
| 7 | 0.786015 | 0.03102 | mean | 30 | 100.0 |
| 0 | 0.785063 | 0.030498 | median | 30 | 1.0 |
| 2 | 0.785063 | 0.030498 | mean | 30 | 1.0 |
| 4 | 0.785063 | 0.030498 | mean | 10 | 10.0 |
| 3 | 0.783149 | 0.030462 | mean | 30 | 0.1 |


The best hyper-parameters have been used to re-fit a final model on the full training set. We can evaluate that final model on held-out test data that was not used for hyperparameter tuning.

In [36]:
print(
    "accuracy of the best model from randomized search: "
    f"{search_cv.score(X_test, y_test):.3f}"
)

accuracy of the best model from randomized search: 0.798
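The refit described above is available as the `best_estimator_` attribute of the search object, and calling `score` on the search delegates to that refitted model. The sketch below checks this on synthetic data with a plain logistic regression; the data and variable names are assumptions and stand in for the Titanic pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (an assumption; not the Titanic dataset).
features, labels = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

search = RandomizedSearchCV(
    LogisticRegression(),
    {"C": [0.1, 1.0, 10, 100]},
    n_iter=3,
    random_state=0,
)
search.fit(X_tr, y_tr)

# With the default refit=True, the best setting is refitted on all of X_tr.
final_model = search.best_estimator_
print(final_model.C == search.best_params_["C"])

# Scoring the search object and scoring the refitted model agree.
print(search.score(X_te, y_te) == final_model.score(X_te, y_te))
```

The refitted model can be used on its own for prediction, for example with `final_model.predict(X_te)`.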
