# Column Transformer
`ColumnTransformer` is a scikit-learn class that applies separate transformers to different columns of a dataset, for example one set of transformers for numerical columns and another for categorical columns.

To build one, we pass a list of tuples. Each tuple contains a name, the transformer object, and the list of columns that transformer should be applied to.

In the One Hot Encoding notebook we did the encoding through the pandas library: we encoded each column individually and then merged the results back into the DataFrame. To avoid that manual work, we will use the `ColumnTransformer` class.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [2]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

In [3]:
df = pd.read_csv('/content/covid_toy.csv')

In [4]:
df.sample(5)

index,age,gender,fever,cough,city,has_covid
72,83,Female,101.0,Mild,Kolkata,No
38,49,Female,101.0,Mild,Delhi,Yes
32,34,Female,101.0,Strong,Delhi,Yes
1,27,Male,100.0,Mild,Delhi,Yes
31,83,Male,103.0,Mild,Kolkata,No


In [5]:
df['city'].value_counts()

city,count
Kolkata,32
Bangalore,30
Delhi,22
Mumbai,16


In [6]:
df['cough'].value_counts()

cough,count
Mild,62
Strong,38


Use One-Hot Encoding when the categories are just different labels with no order or ranking (like types of fruits or colors).

Use Ordinal Encoding when the categories have a clear order or hierarchy (like "Low", "Medium", "High" or rating scales).

We will apply:
- **One Hot Encoding:** `gender`, `city`
- **Ordinal Encoding:** `cough`

In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['has_covid']), df['has_covid'], test_size=0.2)

In [9]:
X_train

index,age,gender,fever,cough,city
70,68,Female,101.0,Strong,Delhi
74,34,Female,104.0,Strong,Delhi
76,80,Male,100.0,Mild,Bangalore
65,69,Female,102.0,Mild,Bangalore
55,81,Female,101.0,Mild,Mumbai
...,...,...,...,...,...
61,81,Female,98.0,Strong,Mumbai
37,55,Male,100.0,Mild,Kolkata
8,19,Female,100.0,Strong,Bangalore
66,51,Male,104.0,Mild,Kolkata


# Without Column Transformer

First we have to impute (fill in) the missing values of the `fever` column.

In [10]:
# Fit a simple imputer (mean strategy by default) on the training fever column
si = SimpleImputer()
X_train_fever = si.fit_transform(X_train[['fever']])

# Transform the test data with the imputer fitted on the training data
X_test_fever = si.transform(X_test[['fever']])

print("X_train Shape: ", X_train_fever.shape)
print("X_test Shape: ", X_test_fever.shape)

X_train Shape:  (80, 1)
X_test Shape:  (20, 1)
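
The usual convention is to learn the imputation value from the training split only and then reuse it on the test split, so that no information from the test data leaks into preprocessing. A minimal sketch of that pattern, using hypothetical toy arrays rather than the notebook's data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy arrays standing in for a train/test split
train = np.array([[100.0], [102.0], [np.nan]])
test = np.array([[np.nan], [98.0]])

imputer = SimpleImputer()   # default strategy is the column mean
imputer.fit(train)          # the mean is learned from the training rows only
print(imputer.statistics_)  # [101.]

# The missing test value is filled with the training mean, not the test mean
test_filled = imputer.transform(test)
print(test_filled.ravel())  # [101.  98.]
```

Calling `fit_transform` on the test split instead would recompute the mean from the test rows, so the two splits would be preprocessed differently.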


Next we apply an ordinal encoder to the `cough` column. It has two categories, `Mild` and `Strong`, and they have a natural order (`Strong > Mild`), so ordinal encoding is appropriate. Passing `categories=[['Mild', 'Strong']]` maps `Mild` to `0` and `Strong` to `1`.

In [13]:
# Ordinal Encoding -> Cough (Mild -> 0, Strong -> 1)
oe = OrdinalEncoder(categories=[['Mild', 'Strong']])
X_train_cough = oe.fit_transform(X_train[['cough']])

# Transform the test data with the encoder fitted on the training data
X_test_cough = oe.transform(X_test[['cough']])

print("X_train Shape: ", X_train_cough.shape)
print("X_test Shape: ", X_test_cough.shape)

X_train Shape:  (80, 1)
X_test Shape:  (20, 1)


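Now we apply nominal encoding, that is One Hot Encoding, to the `gender` and `city` columns. These columns have no order; they are plain categorical labels, so each category becomes its own `0`/`1` column. Because we use `drop='first'`, the first category of each column (in sorted order) is dropped and is represented by all zeros. This gives 1 column for `gender` and 3 columns for `city`, 4 in total.

**Example (Gender):** `Female: [0]`, `Male: [1]`

**Example (City):** `Bangalore: [0, 0, 0]`, `Delhi: [1, 0, 0]`, `Kolkata: [0, 1, 0]`, `Mumbai: [0, 0, 1]`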


In [15]:
# One Hot Encoding -> Gender, City
ohe = OneHotEncoder(drop='first', sparse_output=False)
X_train_gender_city = ohe.fit_transform(X_train[['gender', 'city']])

# Transform the test data with the encoder fitted on the training data
X_test_gender_city = ohe.transform(X_test[['gender', 'city']])

print("X_train Shape: ", X_train_gender_city.shape)
print("X_test Shape: ", X_test_gender_city.shape)

X_train Shape:  (80, 4)
X_test Shape:  (20, 4)


Before concatenating all the columns, we first extract the `age` column, which needs no transformation.

In [16]:
# Extracting Age
X_train_age = X_train.drop(columns = ['gender', 'fever', 'cough', 'city']).values

# Also the test data
X_test_age = X_test.drop(columns = ['gender', 'fever', 'cough', 'city']).values

print("X_train Shape: ", X_train_age.shape)
print("X_test Shape: ", X_test_age.shape)

X_train Shape:  (80, 1)
X_test Shape:  (20, 1)


After extracting `age`, we concatenate all the columns.

In [17]:
X_train_transformed = np.concatenate((X_train_age, X_train_fever, X_train_gender_city, X_train_cough), axis=1)
X_test_transformed = np.concatenate((X_test_age, X_test_fever, X_test_gender_city, X_test_cough), axis=1)
print(" Transformed X_train Shape: ", X_train_transformed.shape)
print(" Transformed X_test Shape: ", X_test_transformed.shape)

 Transformed X_train Shape:  (80, 7)
 Transformed X_test Shape:  (20, 7)


# With Column Transformer

In [18]:
from sklearn.compose import ColumnTransformer

In [25]:
transformer = ColumnTransformer(transformers=[
    ('transformer-1', SimpleImputer(), ['fever']),
    ('transformer-2', OrdinalEncoder(categories=[['Mild', 'Strong']]), ['cough']),
    ('transformer-3', OneHotEncoder(sparse_output=False, drop='first'), ['gender', 'city'])
], remainder="passthrough")

In [27]:
transformer.fit_transform(X_train).shape

(80, 7)

In [28]:
transformer.transform(X_test).shape

(20, 7)