# Column-specific processing

Sometimes different columns need to be transformed in different ways. The most obvious example is the different processing of categorical and numerical columns:

- For numeric columns, you need to apply normalisation techniques;
- For categorical columns, you need to apply encoding (one-hot, mean, etc.).

It's easy to build such a transformation yourself, but `sklearn` provides an out-of-the-box solution that integrates easily into sklearn-style pipelines: `sklearn.compose.ColumnTransformer`.

Learn more <a href="https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html">here</a>.
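As a minimal sketch of the pipeline integration mentioned above, a `ColumnTransformer` can serve as the first step of a `sklearn.pipeline.Pipeline`. The toy data, the column names (`age`, `city`) and the choice of `LogisticRegression` are assumptions made only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical feature.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "age": rng.normal(40, 10, 100),
    "city": rng.choice(["a", "b", "c"], 100),
})
toy_y = (toy_X["age"] > 40).astype(int)

# Scale the numeric column, one-hot encode the categorical one.
preprocessing = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), ["age"]),
        ("encode", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ]
)

# The column transformer is just another pipeline step.
toy_pipeline = Pipeline(
    steps=[
        ("preprocessing", preprocessing),
        ("model", LogisticRegression()),
    ]
)
toy_pipeline.fit(toy_X, toy_y)
toy_predictions = toy_pipeline.predict(toy_X)
```

Because the preprocessing lives inside the pipeline, the same column-wise transformations are fitted on the training data and re-applied automatically at prediction time.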

In [8]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import (
    OneHotEncoder,
    StandardScaler,
    FunctionTransformer
)
from sklearn.compose import ColumnTransformer
from IPython.display import HTML, display

## Basic example

The next cell generates a random data frame with some categorical and some numerical columns. It will be used to show how to build a pipeline component that processes categorical columns in one way and numeric columns in another.

In [105]:
sample_size = 500
np.random.seed(10)

generate_word = lambda: "".join([
    chr(val) for val in 
    np.random.randint(ord("a"), ord("z") + 1, 10)
])
get_cat_var = lambda: np.random.choice(
    [
        generate_word() for i in 
        range(np.random.randint(2,7))
    ], 
    sample_size
)
get_num_var = lambda: np.random.normal(
    np.random.uniform(-1,1), 
    np.random.uniform(1,10),
    sample_size
)

variables_generator = [get_cat_var, get_num_var]

data_frame = pd.concat(
    {
        f"var {i}" : \
        pd.Series(np.random.choice(variables_generator)())
        for i in range(20)
    },
    axis = 1
)

data_frame.head()

Unnamed: 0,var 0,var 1,var 2,var 3,var 4,var 5,var 6,var 7,var 8,var 9,var 10,var 11,var 12,var 13,var 14,var 15,var 16,var 17,var 18,var 19
0,6.352738,ghfmmekjzz,ewvspmvrkg,-3.916784,0.579251,3.078876,jljighbmio,iieafcivri,-3.503851,hwadgiwzth,zderdinjyy,-0.851043,-7.137998,-0.990391,4.128471,lduutwjjin,-6.858011,-3.455499,kdzpmsglss,fjogwgrkig
1,-1.562264,ghfmmekjzz,dlfjbofnbr,-1.45895,0.755219,-0.498048,phrxnjsbae,iieafcivri,-5.212578,yxickhmgkp,kpqepphruh,-6.878257,-1.712574,-7.783903,-3.623413,lduutwjjin,3.198987,-7.290196,eywzqkuzza,fjogwgrkig
2,-2.453819,booaisyeuj,dlfjbofnbr,-0.124566,4.070167,-2.27191,lzsssmsaim,vhfoucvgil,-3.504522,pdzajvgbzz,ynhwdgvtke,-0.838181,1.89863,-6.63206,-1.394765,zghwqxiakd,-14.830121,10.490557,irrdfszbwf,voumadgklp
3,-0.042513,booaisyeuj,kkagxtgiko,-6.897858,-0.065287,-3.459478,phrxnjsbae,yfmijifvmo,0.742066,wectjxhbio,kpqepphruh,-0.087694,-1.808818,0.053985,0.494845,lduutwjjin,-0.341344,4.539596,eywzqkuzza,dzlpowvufa
4,-5.946806,xmtwmxfxpz,dlfjbofnbr,7.45373,-3.450039,0.091773,jljighbmio,vhfoucvgil,1.830545,hwadgiwzth,ynhwdgvtke,-0.218426,0.492733,-2.954776,-2.614179,zghwqxiakd,-2.672298,6.436154,kdzpmsglss,dzlpowvufa


To prepare a transformer that handles different columns in different ways, you need to pass a list of your transformers to the `transformers` parameter of the `sklearn.compose.ColumnTransformer` constructor.

Each element of the `transformers` list should be a tuple of the form `(<transformer name>, <transformer instance>, <columns that will use this transformer>)`.
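The tuple format can be sketched on a small frame. The frame and its column names are hypothetical. The example assumes two documented options besides an explicit list of names: a callable selector built with `sklearn.compose.make_column_selector`, and the string `"drop"` in place of a transformer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical frame used only for this illustration.
demo = pd.DataFrame({
    "height": [1.6, 1.7, 1.8, 1.9],
    "weight": [55.0, 65.0, 75.0, 85.0],
    "colour": ["red", "blue", "red", "green"],
    "row_id": ["a1", "a2", "a3", "a4"],
})

demo_transformer = ColumnTransformer(
    transformers=[
        # Columns selected by dtype instead of by explicit names.
        ("scale", StandardScaler(), make_column_selector(dtype_include=np.number)),
        # Columns selected by an explicit list of names.
        ("encode", OneHotEncoder(), ["colour"]),
        # The string "drop" removes the listed columns from the output
        # (unlisted columns are also dropped by the default remainder="drop").
        ("discard", "drop", ["row_id"]),
    ]
)
demo_result = demo_transformer.fit_transform(demo)

# 2 scaled numeric columns + 3 one-hot columns for "colour".
print(demo_result.shape)
```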

The following cell creates such an object and shows both how it is displayed in Jupyter and the result of applying it to the data frame described above.

In [141]:
numeric_columns = list(data_frame.select_dtypes("number").columns)
categorical_columns = list(set(data_frame.columns) - set(numeric_columns))

my_transformer = ColumnTransformer(
    transformers = [
        ("one_hot_encoder", OneHotEncoder(), categorical_columns),
        ("standard_scaler", StandardScaler(), numeric_columns)
    ]
)

display(HTML("<p style=\"font-size:20px\">Class display in jupyter</p>"))
display(my_transformer)
display(HTML("<p style=\"font-size:20px\">Fit and transform result</p>"))
display(
    pd.DataFrame(
        my_transformer.fit_transform(data_frame)
    ).head()
)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,37,38,39,40,41,42,43,44,45,46
0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,...,-1.169878,0.128906,1.319012,-0.962918,-0.316245,-1.378868,-0.00842,1.003985,-0.811118,-0.795706
1,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,-0.391908,0.160042,-0.407646,-1.467765,-2.575318,-0.360064,-1.728754,-1.025626,0.436845,-1.53117
2,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.03046,0.746601,-1.263927,-0.963117,-0.311424,0.31806,-1.43707,-0.442118,-1.80037,1.879037
3,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,-2.113468,0.014859,-1.837191,0.291547,-0.030132,-0.378137,0.256049,0.052623,-0.002472,0.73769
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,2.429194,-0.584051,-0.122927,0.61314,-0.079132,0.054056,-0.505865,-0.761387,-0.291717,1.101435


## No transformations

If some columns should pass through the transformer unchanged, you can apply `FunctionTransformer(lambda x: x)` to them. It is a transformer that returns its input as is, so the output keeps the original values of those columns.

In the following example, both a dummy transformer and a standard scaler are applied to each input column. As the output shows, the dummy transformer leaves the values untouched (output columns 0 and 1), while the scaler produces standardised copies (output columns 2 and 3).

In [85]:
np.random.seed(10)
sample_size = 10

df = pd.DataFrame({
    "col1" : np.random.uniform(5, 10, sample_size),
    "col2" : np.random.normal(5, 10, sample_size)
})

display(HTML("<h3>Input frame</h3>"))
display(df)

df = ColumnTransformer(
    transformers = [
        ("dummy", FunctionTransformer(lambda x: x), ["col1", "col2"]),
        ("standard_scaler", StandardScaler(), ["col1", "col2"])
    ]
).fit_transform(df)
display(HTML("<h3>Transformation result</h3>"))
pd.DataFrame(df)

Unnamed: 0,col1,col2
0,8.856603,7.655116
1,5.10376,6.085485
2,8.168241,5.042914
3,8.744019,3.253998
4,7.492535,9.330262
5,6.123983,17.030374
6,5.990314,-4.650657
7,8.802654,15.282741
8,5.845554,7.286301
9,5.441699,9.451376


Unnamed: 0,0,1,2,3
0,8.856603,7.655116,1.25827,0.013588
1,5.10376,6.085485,-1.365599,-0.258714
2,8.168241,5.042914,0.776989,-0.43958
3,8.744019,3.253998,1.179555,-0.749924
4,7.492535,9.330262,0.304557,0.304194
5,6.123983,17.030374,-0.652291,1.64002
6,5.990314,-4.650657,-0.745748,-2.121234
7,8.802654,15.282741,1.22055,1.336838
8,5.845554,7.286301,-0.84696,-0.050395
9,5.441699,9.451376,-1.129323,0.325205
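As an alternative to the identity `FunctionTransformer`, `ColumnTransformer` also accepts the string `"passthrough"`, either in place of a transformer or as the `remainder` argument. The sketch below uses a hypothetical frame. It may be preferable when the fitted object has to be pickled, since a `lambda` generally cannot be serialised with the standard `pickle` module.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical frame for illustration.
frame = pd.DataFrame({
    "col1": [5.0, 6.0, 7.0, 8.0],
    "col2": [1.0, 3.0, 5.0, 7.0],
})

# Option 1: name the columns and use the string "passthrough".
explicit = ColumnTransformer(
    transformers=[
        ("keep", "passthrough", ["col1"]),
        ("standard_scaler", StandardScaler(), ["col2"]),
    ]
)
explicit_result = explicit.fit_transform(frame)

# Option 2: let every unlisted column pass through via `remainder`.
# Remainder columns are appended after the transformed ones.
implicit = ColumnTransformer(
    transformers=[("standard_scaler", StandardScaler(), ["col2"])],
    remainder="passthrough",
)
implicit_result = implicit.fit_transform(frame)
```

In the first variant `col1` is the first output column. In the second it is the last one, because the remainder is stacked to the right of the explicitly listed transformers.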
