# 5.5) Column Transformer

- A dataset usually contains several types of features, such as numerical and categorical ones. Some need scaling, others need encoding or imputation.
- A ColumnTransformer lets us apply different preprocessing techniques to different columns in a single step.
- Example: Assume we have a customer dataset with the following columns:
  - age: Suppose 20 values are missing. We can apply SimpleImputer here, which returns a NumPy array.
  - city and gender: One-hot encoding can be applied, which also returns a NumPy array.
  - review: Ordinal encoding can be applied, which again returns a NumPy array.
  - Finally, we would have to combine all these NumPy arrays by hand to get the final array. This is tedious and error-prone. ColumnTransformer solves this problem.

### Code:

In [4]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
df = pd.read_csv('covid_toy.csv') # toy dataset created by the tutor
df.head()

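| | age | gender | fever | cough | city | has_covid |
|---|---|---|---|---|---|---|
| 0 | 60 | Male | 103.0 | Mild | Kolkata | No |
| 1 | 27 | Male | 100.0 | Mild | Delhi | Yes |
| 2 | 42 | Male | 101.0 | Mild | Delhi | No |
| 3 | 31 | Female | 98.0 | Mild | Kolkata | No |
| 4 | 65 | Female | 101.0 | Mild | Mumbai | No |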


- The dataset has records for 100 people.
- gender: nominal
- fever: numerical, with 10 missing values
- cough: ordinal, with values Mild or Strong
- city: nominal, with values Kolkata, Bangalore, Delhi, Mumbai

In [13]:
df['fever'].isnull().sum()

np.int64(10)

In [11]:
df['cough'].value_counts()

cough
Mild      62
Strong    38
Name: count, dtype: int64

In [12]:
df['city'].value_counts()

city
Kolkata      32
Bangalore    30
Delhi        22
Mumbai       16
Name: count, dtype: int64

In [14]:
# train_test_split
from sklearn.model_selection import train_test_split
X_train, x_test, y_train, y_test = train_test_split(df.drop(columns=['has_covid']), df['has_covid'], test_size=0.2)

- **Case (i): Without using ColumnTransformer:**

In [16]:
# Applying SimpleImputer to the fever column:
si = SimpleImputer()
X_train_fever = si.fit_transform(X_train[['fever']])  # missing values are replaced by the column mean (the default strategy)
X_test_fever = si.transform(x_test[['fever']])        # reuse the mean learned from the training data
X_train_fever.shape   # (80, 1): a 2D NumPy array

(80, 1)

In [17]:
# OrdinalEncoding on the cough column:
oe = OrdinalEncoder(categories = [['Mild', 'Strong']])
X_train_cough = oe.fit_transform(X_train[['cough']])
X_test_cough = oe.transform(x_test[['cough']])
X_train_cough.shape

(80, 1)

In [18]:
X_train[['cough']]   # Returns a single column, but still as a DataFrame (2D). Shape will be like: (rows, 1)

| | cough |
|---|---|
| 33 | Mild |
| 51 | Strong |
| 46 | Mild |
| 84 | Strong |
| 26 | Mild |
| ... | ... |
| 86 | Mild |
| 42 | Mild |
| 7 | Strong |
| 45 | Mild |


In [19]:
X_train['cough']    # Returns a single column as a Series (1D). Shape will be like: (rows,)

33      Mild
51    Strong
46      Mild
84    Strong
26      Mild
       ...  
86      Mild
42      Mild
7     Strong
45      Mild
66      Mild
Name: cough, Length: 80, dtype: object
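
The distinction between the two selections matters because scikit-learn transformers expect 2D input. The sketch below shows this on a small stand-in column; the DataFrame `toy` and its values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical stand-in for the cough column.
toy = pd.DataFrame({"cough": ["Mild", "Strong", "Mild"]})

as_frame = toy[["cough"]]   # double brackets: 2D DataFrame
as_series = toy["cough"]    # single brackets: 1D Series
print(as_frame.shape, as_series.shape)

encoder = OrdinalEncoder(categories=[["Mild", "Strong"]])
encoded = encoder.fit_transform(as_frame)   # works because the input is 2D

try:
    encoder.fit_transform(as_series)        # 1D input is rejected
    series_accepted = True
except ValueError:
    series_accepted = False
print(series_accepted)
```

This is why the cells above pass `X_train[['cough']]` rather than `X_train['cough']` to the encoder.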

In [21]:
# OneHotEncoding on the gender and city columns:
ohe = OneHotEncoder(drop='first', sparse_output=False)
# drop='first' drops one category per column, so gender (2 values) and city (4 values) give 1 + 3 = 4 columns.
# Note: from scikit-learn 1.2 onwards, the argument sparse is replaced by sparse_output.
X_train_gender_city = ohe.fit_transform(X_train[['gender','city']])
X_test_gender_city = ohe.transform(x_test[['gender', 'city']])
X_train_gender_city.shape

(80, 4)

In [24]:
# Extracting the age column, which needs no transformation:
X_train_age = X_train[['age']].values
X_test_age = x_test[['age']].values
X_train_age.shape

(80, 1)

In [25]:
# Combining the age column with the transformed columns:
X_train_transformed = np.concatenate((X_train_age, X_train_fever, X_train_gender_city, X_train_cough), axis = 1)
X_test_transformed = np.concatenate((X_test_age, X_test_fever, X_test_gender_city, X_test_cough), axis = 1)

In [26]:
X_train_transformed.shape

(80, 7)

- **Case (ii): Using ColumnTransformer:**

In [27]:
from sklearn.compose import ColumnTransformer

In [31]:
transformer = ColumnTransformer(transformers = [
    ('tnf1', SimpleImputer(), ['fever']),  # 'tnf1' is an arbitrary name for this step
    ('tnf2', OrdinalEncoder(categories=[['Mild', 'Strong']]), ['cough']),
    ('tnf3', OneHotEncoder(sparse_output=False, drop='first'), ['gender','city']),
], remainder='passthrough')
# transformers: a list of (name, transformer, columns) tuples describing what to apply to which columns.
# remainder: what to do with the columns not listed above, either 'passthrough' (keep unchanged) or 'drop' (the default).

In [33]:
transformer.fit_transform(X_train).shape

(80, 7)

In [35]:
transformer.transform(x_test).shape   # only transform the test set: reuse what was learned from X_train instead of refitting

(20, 7)
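
Preprocessing statistics are normally learned from the training split only and then reused on the test split. The sketch below illustrates the difference with a one-step ColumnTransformer on invented `train` and `test` frames (both hypothetical): refitting on the test data silently changes the value used for imputation.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical splits; values are invented for illustration.
train = pd.DataFrame({"fever": [100.0, 102.0, np.nan]})
test = pd.DataFrame({"fever": [98.0, np.nan]})

ct = ColumnTransformer(transformers=[("impute", SimpleImputer(), ["fever"])])

ct.fit(train)                         # learns the training mean
test_from_train = ct.transform(test)  # missing test value filled with the training mean

# Refitting on the test split instead uses the test mean.
refit = ColumnTransformer(transformers=[("impute", SimpleImputer(), ["fever"])])
test_refit = refit.fit_transform(test)

print(test_from_train[1, 0], test_refit[1, 0])
```

With categorical encoders the effect can be larger: if a category is absent from the test split, refitting can even produce a different number of columns.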

In [36]:
transformer.fit_transform(X_train)

array([[ 98.        ,   0.        ,   0.        ,   0.        ,
          1.        ,   0.        ,  26.        ],
       [100.        ,   1.        ,   0.        ,   0.        ,
          1.        ,   0.        ,  11.        ],
       [101.        ,   0.        ,   0.        ,   0.        ,
          0.        ,   1.        ,  19.        ],
       [ 98.        ,   1.        ,   0.        ,   0.        ,
          0.        ,   1.        ,  69.        ],
       [100.        ,   0.        ,   0.        ,   0.        ,
          1.        ,   0.        ,  19.        ],
       [100.72222222,   0.        ,   1.        ,   1.        ,
          0.        ,   0.        ,  38.        ],
       [100.        ,   1.        ,   0.        ,   0.        ,
          1.        ,   0.        ,  13.        ],
       [100.72222222,   1.        ,   1.        ,   0.        ,
          1.        ,   0.        ,  71.        ],
       ...

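- ColumnTransformer produces the same seven features in a single step, with no manual concatenation. Note that the column order differs from Case (i): the outputs follow the order of the transformers list (fever, cough, then the encoded gender and city columns), and the passthrough column age comes last.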

==========================================================================================================