# What is sklearn.compose.ColumnTransformer?
Real-world data often contains heterogeneous data types. When processing the data before applying the final prediction model, we typically want to use different preprocessing steps and transformations for those different types of columns.

A simple example: we may want to scale the numerical features and one-hot encode the categorical features.

In [13]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer, make_column_transformer

In [8]:
titanic = pd.read_csv("https://raw.githubusercontent.com/amueller/scipy-2017-sklearn/master/notebooks/datasets/titanic3.csv")
# there is still a small problem with using the OneHotEncoder and missing values,
# so for now I am going to assume there are no missing values by dropping them
titanic2 = titanic.dropna(subset=['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked'])
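Dropping rows is the simplest option, but it throws data away. As an alternative, here is a minimal sketch that imputes missing values inside the preprocessing step instead. The `toy` DataFrame is made-up data that only borrows the Titanic column names, and the imputation strategies (median for numbers, most frequent value for categories) are assumptions chosen for illustration, not something this notebook prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows (not the real Titanic data) with a few missing values.
toy = pd.DataFrame({
    "pclass": [1, 3, 2, 3],
    "sex": ["female", "male", "male", "female"],
    "age": [29.0, np.nan, 40.0, 18.0],
    "fare": [211.3, 7.9, np.nan, 13.0],
    "embarked": ["S", "C", np.nan, "S"],
})

# Numerical columns: fill gaps with the median, then scale.
numeric = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

preprocess_with_imputation = make_column_transformer(
    (numeric, ["age", "fare"]),
    (categorical, ["pclass", "sex", "embarked"]),
    sparse_threshold=0,  # always return a dense array
)

X_toy = preprocess_with_imputation.fit_transform(toy)
print(X_toy.shape)
print(np.isnan(X_toy).any())
```

Because the imputers live inside the transformer, the fill values are learned in `fit` and reused in `transform`, so the same treatment is applied to new data.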

In [14]:
target = titanic2.survived.values
features = titanic2[['pclass', 'sex', 'age', 'fare', 'embarked']] # select arbitrary subset

In [15]:
features.head()

| | pclass | sex | age | fare | embarked |
|---|---|---|---|---|---|
| 0 | 1 | female | 29.0 | 211.3375 | S |
| 1 | 1 | male | 0.9167 | 151.55 | S |
| 2 | 1 | female | 2.0 | 151.55 | S |
| 3 | 1 | male | 30.0 | 151.55 | S |
| 4 | 1 | female | 25.0 | 151.55 | S |


This dataset contains some categorical variables ("pclass", "sex" and "embarked") and some numerical variables ("age" and "fare"). Note that "pclass", although categorical, is already encoded as integers in the dataset. So let's use the ColumnTransformer to combine transformers for those two types of features.

In [18]:
preprocess = make_column_transformer( # create preprocessing pipeline
    (StandardScaler(), ['age', 'fare']), # scale numerical features
    # one-hot encode categorical features
    (OneHotEncoder(), ['pclass', 'sex', 'embarked'] )
)
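`make_column_transformer` is a shorthand that names each step automatically. For comparison, here is a sketch of the same preprocessing written with the `ColumnTransformer` class directly, where we pick the step names ourselves. The `toy` DataFrame and the names `"scale"` and `"onehot"` are made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows (not the real Titanic data), no missing values.
toy = pd.DataFrame({
    "pclass": [1, 3, 2, 3],
    "sex": ["female", "male", "male", "female"],
    "age": [29.0, 2.0, 40.0, 18.0],
    "fare": [211.3, 7.9, 26.0, 13.0],
    "embarked": ["S", "C", "Q", "S"],
})

numeric_cols = ["age", "fare"]
categorical_cols = ["pclass", "sex", "embarked"]

# Explicit form: each entry is (name, transformer, columns).
explicit = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), numeric_cols),
        ("onehot", OneHotEncoder(), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Shorthand form: each entry is (transformer, columns); names are generated.
shorthand = make_column_transformer(
    (StandardScaler(), numeric_cols),
    (OneHotEncoder(), categorical_cols),
    sparse_threshold=0,
)

X_explicit = explicit.fit_transform(toy)
X_shorthand = shorthand.fit_transform(toy)

same_output = np.allclose(X_explicit, X_shorthand)
auto_names = [name for name, _, _ in shorthand.transformers]
print(same_output)
print(auto_names)
```

Explicit names are mostly useful later, for example when addressing a step's parameters in a grid search.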

In [20]:
preprocess.fit_transform(features)[:5]

array([[-0.05663194,  3.13554913,  1.        ,  0.        ,  0.        ,
         1.        ,  0.        ,  0.        ,  0.        ,  1.        ],
       [-2.01237899,  2.06268333,  1.        ,  0.        ,  0.        ,
         0.        ,  1.        ,  0.        ,  0.        ,  1.        ],
       [-1.93693697,  2.06268333,  1.        ,  0.        ,  0.        ,
         1.        ,  0.        ,  0.        ,  0.        ,  1.        ],
       [ 0.01300899,  2.06268333,  1.        ,  0.        ,  0.        ,
         0.        ,  1.        ,  0.        ,  0.        ,  1.        ],
       [-0.33519565,  2.06268333,  1.        ,  0.        ,  0.        ,
         1.        ,  0.        ,  0.        ,  0.        ,  1.        ]])

Tadaaa! Everything is in floats, ready to be passed to a classifier!
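A natural next step is to chain the preprocessing with a model so that both are fitted together. Here is a minimal sketch, assuming a `LogisticRegression` classifier (my choice for illustration, the notebook does not name a model) and a made-up `toy` dataset with a `survived` column.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows (not the real Titanic data).
toy = pd.DataFrame({
    "pclass": [1, 3, 2, 3, 1, 2, 3, 1],
    "sex": ["female", "male", "male", "female",
            "female", "male", "male", "female"],
    "age": [29.0, 2.0, 40.0, 18.0, 35.0, 50.0, 22.0, 27.0],
    "fare": [211.3, 7.9, 26.0, 13.0, 151.6, 21.0, 8.1, 80.0],
    "embarked": ["S", "C", "Q", "S", "S", "C", "S", "Q"],
    "survived": [1, 0, 0, 1, 1, 0, 0, 1],
})
X = toy.drop(columns="survived")
y = toy["survived"].values

preprocess_toy = make_column_transformer(
    (StandardScaler(), ["age", "fare"]),
    # handle_unknown="ignore" encodes categories unseen during fit as all zeros
    (OneHotEncoder(handle_unknown="ignore"), ["pclass", "sex", "embarked"]),
)

# The pipeline fits the preprocessing and the classifier in one call.
model = make_pipeline(preprocess_toy, LogisticRegression())
model.fit(X, y)

predictions = model.predict(X)
print(predictions)
```

Keeping the transformer inside the pipeline means the scaling statistics and category lists are learned only from the data passed to `fit`, which matters once we start cross-validating.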