In [1]:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer

If importing `ColumnTransformer` raises an error, you are probably using an old version of `scikit-learn` (`ColumnTransformer` was added in version 0.20). <br>
To update `scikit-learn` to the latest version, type the following into the terminal or command prompt. <br>
```
pip install -U scikit-learn
``` 
Then hit enter. 

Here, I am constructing a simple data frame for this tutorial. 

In [2]:
d = {'col_1': np.linspace(0, 1.5, 5),
    'col_2': ['A', 'B', 'C', 'D', 'E'],
    'col_3': ['f', 'g', 'h', 'i', 'j'],
    'col_4': [5, 6, 7, 8, 9]}

In [3]:
pd.DataFrame(d)

Unnamed: 0,col_1,col_2,col_3,col_4
0,0.0,A,f,5
1,0.375,B,g,6
2,0.75,C,h,7
3,1.125,D,i,8
4,1.5,E,j,9


In [4]:
data = pd.DataFrame(d)

Let us have a look at our data frame. 

In [5]:
data

Unnamed: 0,col_1,col_2,col_3,col_4
0,0.0,A,f,5
1,0.375,B,g,6
2,0.75,C,h,7
3,1.125,D,i,8
4,1.5,E,j,9


`col_2` and `col_3` are categorical features. <br> 
It is recommended to preprocess categorical features with one-hot encoding before passing them to a machine learning model. <br><br>
Let us just preprocess one feature first. <br>

**`ColumnTransformer`** takes a list of tuples; each tuple has the form (name, transformer, column_name(s)). <br>
For this example, I **named** the tuple ‘categorical_features’; you can call it anything, and it does not affect the process. <br>
**Transformer** here is `OneHotEncoder` from scikit-learn. Note that the transformer must support the `fit` and `transform` methods; for example, we cannot use `get_dummies` from Pandas here. <br>
**Column name** here is ‘col_2’; we will use multiple columns in the next example.


In [6]:
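from sklearn.preprocessing import OneHotEncoder
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`, and 1.4 removed `sparse`;
# on those versions, pass sparse_output=False instead of sparse=False.
ColumnTransformer([('categorical_features', OneHotEncoder(sparse=False), ['col_2'])])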

ColumnTransformer(n_jobs=None, remainder='drop', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('categorical_features',
                                 OneHotEncoder(categorical_features=None,
                                               categories=None, drop=None,
                                               dtype=<class 'numpy.float64'>,
                                               handle_unknown='error',
                                               n_values=None, sparse=False),
                                 ['col_2'])],
                  verbose=False)

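Saving it in a variable ‘ct’, this time with `drop='first'` so that the first category of the feature is dropped from the encoding. 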

In [7]:
ct = ColumnTransformer([('categorical_features', OneHotEncoder(sparse=False, drop = 'first'), ['col_2'])])

Here the first column corresponds to ‘B’ of ‘col_2’, the second to ‘C’, and so forth; the row for ‘A’ is all zeros. 

In [8]:
ct.fit_transform(data)

array([[0., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])

In [9]:
data 

Unnamed: 0,col_1,col_2,col_3,col_4
0,0.0,A,f,5
1,0.375,B,g,6
2,0.75,C,h,7
3,1.125,D,i,8
4,1.5,E,j,9
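
The reason the transformer needs separate `fit` and `transform` methods is that the categories learned during `fit` can then be reused on rows the transformer has not seen before. The following is a minimal sketch of that idea; the `new_rows` frame is made up for illustration, and `sparse_threshold=0` is used only to get a dense array regardless of the scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({'col_1': np.linspace(0, 1.5, 5),
                     'col_2': ['A', 'B', 'C', 'D', 'E'],
                     'col_3': ['f', 'g', 'h', 'i', 'j'],
                     'col_4': [5, 6, 7, 8, 9]})

ct = ColumnTransformer(
    [('categorical_features', OneHotEncoder(drop='first'), ['col_2'])],
    sparse_threshold=0)

# fit learns the categories of col_2 from the original data.
ct.fit(data)

# Hypothetical new rows with the same columns but fewer categories.
new_rows = pd.DataFrame({'col_1': [0.2, 0.9],
                         'col_2': ['C', 'A'],
                         'col_3': ['h', 'f'],
                         'col_4': [6, 8]})

# transform reuses the learned categories, so the output still has
# four columns (B, C, D, E) even though only 'C' and 'A' appear here.
new_encoded = ct.transform(new_rows)
print(new_encoded)
```

`pd.get_dummies` applied to `new_rows` alone would only create columns for the categories present in those two rows, which is why it cannot stand in for a transformer here.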


Let us take a look at multiple categorical columns.

In [10]:
ct_2 = ColumnTransformer([('categorical_features', OneHotEncoder(sparse=False, drop = 'first'), ['col_2', 'col_3'])])

Similarly, you can list the names of all the columns with categorical features. 

In [11]:
ct_2.fit_transform(data)

array([[0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0., 0., 0., 1.]])

In [19]:
data

Unnamed: 0,col_1,col_2,col_3,col_4
0,0.0,A,f,5
1,0.375,B,g,6
2,0.75,C,h,7
3,1.125,D,i,8
4,1.5,E,j,9


Now let us take numerical features into consideration, since numerical and categorical features need different preprocessing. <br>
For this example, we will use `MinMaxScaler` for the numerical features, and I named this step ‘numerical_features’. 


In [12]:
from sklearn.preprocessing import MinMaxScaler
ct_3 = ColumnTransformer([('categorical_features', OneHotEncoder(sparse=False, drop = 'first'), ['col_2', 'col_3']), 
                         ('numerical_features', MinMaxScaler(), ['col_1', 'col_4'])])

In [13]:
ct_3.fit_transform(data)

array([[0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ],
       [1.  , 0.  , 0.  , 0.  , 1.  , 0.  , 0.  , 0.  , 0.25, 0.25],
       [0.  , 1.  , 0.  , 0.  , 0.  , 1.  , 0.  , 0.  , 0.5 , 0.5 ],
       [0.  , 0.  , 1.  , 0.  , 0.  , 0.  , 1.  , 0.  , 0.75, 0.75],
       [0.  , 0.  , 0.  , 1.  , 0.  , 0.  , 0.  , 1.  , 1.  , 1.  ]])
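
The last two columns of this array come from `MinMaxScaler`, which with its default range of 0 to 1 computes (x - min) / (max - min) for each column. As a quick check, the same arithmetic can be reproduced with NumPy alone; `min_max_scale` is a helper written just for this illustration, not part of scikit-learn.

```python
import numpy as np

col_1 = np.linspace(0, 1.5, 5)                   # 0.0, 0.375, 0.75, 1.125, 1.5
col_4 = np.array([5, 6, 7, 8, 9], dtype=float)

def min_max_scale(x):
    """Rescale a 1-D array to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

scaled_col_1 = min_max_scale(col_1)
scaled_col_4 = min_max_scale(col_4)

print(scaled_col_1)
print(scaled_col_4)
```

Both columns are evenly spaced, so both scale to 0, 0.25, 0.5, 0.75 and 1, as in the array above.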

In [14]:
new_data = ct_3.fit_transform(data)

I am inserting the transformed data into a data frame. 

In [15]:
pd.DataFrame(new_data)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.25,0.25
2,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.5,0.5
3,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.75,0.75
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0


If you want to leave specific columns untouched, in this example the numerical features, you can use the ‘passthrough’ keyword in place of a transformer. <br>
You can also use the 'drop' keyword to remove the columns.

In [17]:
ct_4 = ColumnTransformer([('categorical_features', OneHotEncoder(sparse=False, drop = 'first'), ['col_2', 'col_3']), 
                         ('numerical_features', 'passthrough', ['col_1', 'col_4'])])

In [18]:
pd.DataFrame(ct_4.fit_transform(data))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0
1,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.375,6.0
2,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.75,7.0
3,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.125,8.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,1.5,9.0
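
The 'drop' keyword works the same way, and the `remainder` parameter (visible as `remainder='drop'` in the printed `ColumnTransformer` earlier) controls what happens to columns that are not listed in any tuple. The sketch below drops `col_1` explicitly and passes every unlisted column through; the step name ‘unused’ is arbitrary, and `sparse_threshold=0` is again only there to return a dense array on any scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({'col_1': np.linspace(0, 1.5, 5),
                     'col_2': ['A', 'B', 'C', 'D', 'E'],
                     'col_3': ['f', 'g', 'h', 'i', 'j'],
                     'col_4': [5, 6, 7, 8, 9]})

ct_5 = ColumnTransformer(
    [('categorical_features', OneHotEncoder(drop='first'), ['col_2', 'col_3']),
     ('unused', 'drop', ['col_1'])],   # col_1 is removed from the output
    remainder='passthrough',           # unlisted columns (col_4) are kept as-is
    sparse_threshold=0)

result = ct_5.fit_transform(data)
print(pd.DataFrame(result))
```

The result has nine columns: eight from the one-hot encoding, followed by the untouched `col_4`; `col_1` no longer appears.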
