Column transformation is a preprocessing technique in machine learning. It transforms the values in a column (for example by imputing missing values or encoding categories as numbers) so that they suit the learning algorithm. Applying the right transformation to each column can improve the performance of our machine learning models.

Example: In machine learning, column transformation is like choosing and preparing the right school supplies for a class. Imagine your data is a school bag filled with books, pencils, and lunch boxes, where each item is like a column in your data. Not every item is useful for every subject: for a math class, you might take only your math book and pencil and leave out the rest. Similarly, column transformation means selecting the useful columns (like age or city) and sometimes changing them so the computer can understand them better. For example, if a column holds words such as city names, we turn them into numbers. This helps the computer learn more effectively from the data, just as the right school supplies help you do better in class.

In [1]:
import numpy as np
import pandas as pd

In [2]:

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

In [4]:

df = pd.read_csv('covid_toy.csv')

In [5]:

df.head()

Unnamed: 0,age,gender,fever,cough,city,has_covid
0,60,Male,103.0,Mild,Kolkata,No
1,27,Male,100.0,Mild,Delhi,Yes
2,42,Male,101.0,Mild,Delhi,No
3,31,Female,98.0,Mild,Kolkata,No
4,65,Female,101.0,Mild,Mumbai,No


In [7]:
df.shape

(100, 6)

In [8]:
df['cough'].value_counts()

cough,count
Mild,62
Strong,38


In [9]:
df['city'].value_counts()

city,count
Kolkata,32
Bangalore,30
Delhi,22
Mumbai,16


In [10]:
df.isnull().sum()

age,0
gender,0
fever,10
cough,0
city,0
has_covid,0


In [11]:

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df.drop(columns=['has_covid']),df['has_covid'],
                                                test_size=0.2)

In [12]:
X_train

Unnamed: 0,age,gender,fever,cough,city
1,27,Male,100.0,Mild,Delhi
19,42,Female,,Strong,Bangalore
39,50,Female,103.0,Mild,Kolkata
13,64,Male,102.0,Mild,Bangalore
46,19,Female,101.0,Mild,Mumbai
...,...,...,...,...,...
5,84,Female,,Mild,Bangalore
67,65,Male,99.0,Mild,Bangalore
58,23,Male,98.0,Strong,Mumbai
61,81,Female,98.0,Strong,Mumbai


# 1. Without using column transformer

In [13]:

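# Impute missing values in the fever column (SimpleImputer defaults to the column mean)
si = SimpleImputer()
X_train_fever = si.fit_transform(X_train[['fever']])

# Reuse the imputer fitted on the training data; do not refit on the test data
X_test_fever = si.transform(X_test[['fever']])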

X_train_fever.shape

(80, 1)
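
The reason for fitting on the training data only can be sketched with a few hypothetical fever readings (the numbers below are invented for illustration). The imputer learns the mean during `fit_transform` and then fills test-set gaps with that same value in `transform`, so no statistic is computed from the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical fever readings; np.nan marks a missing value
train_fever = np.array([[98.0], [100.0], [np.nan], [102.0]])
test_fever = np.array([[np.nan], [104.0]])

imputer = SimpleImputer()                          # default strategy: column mean
train_filled = imputer.fit_transform(train_fever)  # learns the mean of 98, 100, 102
test_filled = imputer.transform(test_fever)        # reuses the training mean

learned_mean = imputer.statistics_[0]
print(learned_mean)       # mean learned from the training column
print(test_filled[0, 0])  # the missing test value is filled with that mean
```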

In [14]:

# Ordinal encoding -> cough (Mild = 0, Strong = 1)
oe = OrdinalEncoder(categories=[['Mild','Strong']])
X_train_cough = oe.fit_transform(X_train[['cough']])

# Reuse the encoder fitted on the training data for the test data
X_test_cough = oe.transform(X_test[['cough']])

X_train_cough.shape

(80, 1)
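
To see what the `categories` argument does, here is a small sketch on a hypothetical `cough` column (toy values, not the real data). The position of each label in the list becomes its numeric code, so the order given encodes the severity ranking.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({'cough': ['Strong', 'Mild', 'Mild', 'Strong']})

# 'Mild' is listed first, so it maps to 0; 'Strong' maps to 1
encoder = OrdinalEncoder(categories=[['Mild', 'Strong']])
codes = encoder.fit_transform(toy[['cough']])

print(codes.ravel())
```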

In [16]:
# OneHotEncoding -> gender, city
ohe = OneHotEncoder(drop='first', sparse_output=False)
X_train_gender_city = ohe.fit_transform(X_train[['gender', 'city']])

# also the test data
X_test_gender_city = ohe.transform(X_test[['gender', 'city']])

X_train_gender_city.shape

(80, 4)

In [17]:
# Extracting age (the only column that needs no transformation)
X_train_age = X_train[['age']].values

# also the test data
X_test_age = X_test[['age']].values

X_train_age.shape

(80, 1)

In [18]:

X_train_transformed = np.concatenate((X_train_age,X_train_fever,X_train_gender_city,X_train_cough),axis=1)
# also the test data
X_test_transformed = np.concatenate((X_test_age,X_test_fever,X_test_gender_city,X_test_cough),axis=1)

X_train_transformed.shape

(80, 7)

# 2. Using column transformer

In [19]:
from sklearn.compose import ColumnTransformer

In [21]:
transformer = ColumnTransformer(transformers=[
    ('tnf1', SimpleImputer(), ['fever']),
    ('tnf2', OrdinalEncoder(categories=[['Mild', 'Strong']]), ['cough']),
    ('tnf3', OneHotEncoder(sparse_output=False, drop='first'), ['gender', 'city'])
], remainder='passthrough')

In [22]:
transformer.fit_transform(X_train).shape

(80, 7)

In [23]:
transformer.transform(X_test).shape

(20, 7)