## ColumnTransformer

### Definition

`ColumnTransformer` is a class from `sklearn.compose` that allows you to apply different preprocessing techniques to different columns of a dataset at the same time.

It is mainly used to transform specific columns. By default the remaining columns are dropped; they are kept unchanged only when `remainder="passthrough"` is set.

---

### Why ColumnTransformer is Used

In real-world datasets, different columns require different preprocessing methods.

Example:

- Categorical columns → OneHotEncoder
- Ordinal columns → OrdinalEncoder
- Numerical columns → StandardScaler

`ColumnTransformer` helps apply all these transformations in one step.

---

### Problem Without ColumnTransformer

Without ColumnTransformer, you must:

- Select columns manually
- Apply encoding manually
- Combine transformed columns manually
- Handle column order manually

This increases complexity and the risk of errors, for example when transformed columns are joined back in the wrong order.

---

### Solution Using ColumnTransformer

ColumnTransformer automates the process by:

- Applying transformations to selected columns
- Keeping the remaining columns unchanged when `remainder="passthrough"` is set (by default they are dropped)
- Combining everything into a single output
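
The effect of the `remainder` parameter can be checked with a short sketch. The two-column frame and the variable names here are hypothetical; only `ColumnTransformer`, `StandardScaler`, and the `remainder` parameter come from scikit-learn.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical frame: "id" is not listed in any transformer.
toy = pd.DataFrame({"id": ["a", "b", "c"], "score": [60.0, 80.0, 70.0]})

# Default behaviour (remainder="drop"): unlisted columns are discarded.
dropped = ColumnTransformer([("scale", StandardScaler(), ["score"])])

# remainder="passthrough": unlisted columns are appended, unchanged, after the transformed ones.
kept = ColumnTransformer([("scale", StandardScaler(), ["score"])], remainder="passthrough")

out_drop = dropped.fit_transform(toy)
out_keep = kept.fit_transform(toy)
print(out_drop.shape)  # only the scaled column
print(out_keep.shape)  # scaled column plus the untouched "id" column
```

Passed-through columns always land at the end of the output, regardless of where they sat in the input.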

---



## Preprocessing Without ColumnTransformer

In [33]:
import pandas as pd 
import numpy as np

In [34]:
dataset = pd.read_csv("Job_Candidate_Selection.csv")

In [35]:
dataset

Unnamed: 0,Candidate_ID,Age,Gender,Education,Experience_Level,City,Expected_Salary,Interview_Score,Selected
0,C001,25.0,Male,Bachelor,Beginner,Hyderabad,400000.0,75.0,Yes
1,C002,30.0,Female,Master,Intermediate,Chennai,600000.0,82.0,Yes
2,C003,28.0,Male,Bachelor,Beginner,Bangalore,450000.0,68.0,No
3,C004,,Female,PhD,Advanced,Hyderabad,900000.0,91.0,Yes
4,C005,35.0,Male,Master,Intermediate,Pune,700000.0,85.0,Yes
5,C006,27.0,,Bachelor,Beginner,Chennai,420000.0,72.0,No
6,C007,32.0,Female,PhD,Advanced,Bangalore,950000.0,95.0,Yes
7,C008,,Male,Master,Intermediate,Hyderabad,,78.0,Yes
8,C009,26.0,Female,Bachelor,Beginner,Pune,390000.0,65.0,No
9,C010,31.0,Male,PhD,Advanced,Chennai,880000.0,,Yes


In [36]:
dataset.shape

(10, 9)

### Imputing Missing Values with SimpleImputer

In [37]:
from sklearn.impute import SimpleImputer

# Numerical columns: replace missing values with the column mean
num_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

dataset[["Age", "Expected_Salary", "Interview_Score"]] = num_imputer.fit_transform(
    dataset[["Age", "Expected_Salary", "Interview_Score"]]
)

# Categorical columns: replace missing values with the most frequent value
cat_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

dataset[["Gender"]] = cat_imputer.fit_transform(
    dataset[["Gender"]]
)
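
To see which value a fitted imputer will fill in, its `statistics_` attribute can be inspected. The sketch below uses only the `Age` values from the table above, in a standalone frame named `ages` (a name introduced here for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Age values copied from the dataset shown above, including the two missing entries.
ages = pd.DataFrame({"Age": [25.0, 30.0, 28.0, np.nan, 35.0, 27.0, 32.0, np.nan, 26.0, 31.0]})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(ages)

# statistics_ holds one learned fill value per column (here: the mean of the 8 known ages).
print(imputer.statistics_)
```

The learned value matches the `29.25` that appears in the imputed `Age` column.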

In [38]:
dataset

Unnamed: 0,Candidate_ID,Age,Gender,Education,Experience_Level,City,Expected_Salary,Interview_Score,Selected
0,C001,25.0,Male,Bachelor,Beginner,Hyderabad,400000.0,75.0,Yes
1,C002,30.0,Female,Master,Intermediate,Chennai,600000.0,82.0,Yes
2,C003,28.0,Male,Bachelor,Beginner,Bangalore,450000.0,68.0,No
3,C004,29.25,Female,PhD,Advanced,Hyderabad,900000.0,91.0,Yes
4,C005,35.0,Male,Master,Intermediate,Pune,700000.0,85.0,Yes
5,C006,27.0,Male,Bachelor,Beginner,Chennai,420000.0,72.0,No
6,C007,32.0,Female,PhD,Advanced,Bangalore,950000.0,95.0,Yes
7,C008,29.25,Male,Master,Intermediate,Hyderabad,632222.222222,78.0,Yes
8,C009,26.0,Female,Bachelor,Beginner,Pune,390000.0,65.0,No
9,C010,31.0,Male,PhD,Advanced,Chennai,880000.0,79.0,Yes


In [39]:
dataset.isnull().sum()

Candidate_ID        0
Age                 0
Gender              0
Education           0
Experience_Level    0
City                0
Expected_Salary     0
Interview_Score     0
Selected            0
dtype: int64

### Ordinal Encoding of Education and Experience_Level

In [41]:
dataset["Education"].value_counts()

Education
Bachelor    4
Master      3
PhD         3
Name: count, dtype: int64

In [42]:
dataset["Experience_Level"].value_counts()

Experience_Level
Beginner        4
Intermediate    3
Advanced        3
Name: count, dtype: int64

In [43]:
from sklearn.preprocessing import OrdinalEncoder

# List each column's categories from lowest to highest so the integer codes follow that order
oe = OrdinalEncoder(
    categories=[["Bachelor", "Master", "PhD"], ["Beginner", "Intermediate", "Advanced"]],
    dtype=np.int16,
)

dataset[["Education", "Experience_Level"]] = oe.fit_transform(dataset[["Education", "Experience_Level"]])
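
The explicit `categories` argument matters because, without it, `OrdinalEncoder` sorts the category values, which for strings means alphabetical order. The following sketch compares the two behaviours on a hypothetical three-row frame named `levels`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical frame with one row per experience level.
levels = pd.DataFrame({"Experience_Level": ["Beginner", "Intermediate", "Advanced"]})

# Default: categories are sorted alphabetically (Advanced, Beginner, Intermediate).
default_codes = OrdinalEncoder().fit_transform(levels).ravel().tolist()

# Explicit: codes follow the order given in `categories`.
explicit = OrdinalEncoder(categories=[["Beginner", "Intermediate", "Advanced"]])
explicit_codes = explicit.fit_transform(levels).ravel().tolist()

print(default_codes)   # Advanced gets the lowest code
print(explicit_codes)  # Beginner gets the lowest code
```

With the default ordering, "Advanced" would receive a lower code than "Beginner", which defeats the purpose of an ordinal encoding.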

In [44]:
dataset

Unnamed: 0,Candidate_ID,Age,Gender,Education,Experience_Level,City,Expected_Salary,Interview_Score,Selected
0,C001,25.0,Male,0,0,Hyderabad,400000.0,75.0,Yes
1,C002,30.0,Female,1,1,Chennai,600000.0,82.0,Yes
2,C003,28.0,Male,0,0,Bangalore,450000.0,68.0,No
3,C004,29.25,Female,2,2,Hyderabad,900000.0,91.0,Yes
4,C005,35.0,Male,1,1,Pune,700000.0,85.0,Yes
5,C006,27.0,Male,0,0,Chennai,420000.0,72.0,No
6,C007,32.0,Female,2,2,Bangalore,950000.0,95.0,Yes
7,C008,29.25,Male,1,1,Hyderabad,632222.222222,78.0,Yes
8,C009,26.0,Female,0,0,Pune,390000.0,65.0,No
9,C010,31.0,Male,2,2,Chennai,880000.0,79.0,Yes


### One-Hot Encoding of Gender and City

In [47]:
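from sklearn.preprocessing import OneHotEncoder

# drop="first" removes the first listed category of each column ("Male", "Hyderabad")
ohe = OneHotEncoder(categories=[["Male", "Female"], ["Hyderabad", "Chennai", "Bangalore", "Pune"]],
                    drop="first", sparse_output=False, dtype=np.int16)

# The result is a separate array; it still has to be joined back to dataset by hand
new_data = ohe.fit_transform(dataset[["Gender", "City"]])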

In [48]:
new_data

array([[0, 0, 0, 0],
       [1, 1, 0, 0],
       [0, 0, 1, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1],
       [0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 0, 0, 0],
       [1, 0, 0, 1],
       [0, 1, 0, 0]], dtype=int16)

### Label Encoding of Selected

In [51]:
from sklearn.preprocessing import LabelEncoder 

le = LabelEncoder() 

dataset["Selected"] = le.fit_transform(dataset["Selected"])

In [52]:
dataset

Unnamed: 0,Candidate_ID,Age,Gender,Education,Experience_Level,City,Expected_Salary,Interview_Score,Selected
0,C001,25.0,Male,0,0,Hyderabad,400000.0,75.0,1
1,C002,30.0,Female,1,1,Chennai,600000.0,82.0,1
2,C003,28.0,Male,0,0,Bangalore,450000.0,68.0,0
3,C004,29.25,Female,2,2,Hyderabad,900000.0,91.0,1
4,C005,35.0,Male,1,1,Pune,700000.0,85.0,1
5,C006,27.0,Male,0,0,Chennai,420000.0,72.0,0
6,C007,32.0,Female,2,2,Bangalore,950000.0,95.0,1
7,C008,29.25,Male,1,1,Hyderabad,632222.222222,78.0,1
8,C009,26.0,Female,0,0,Pune,390000.0,65.0,0
9,C010,31.0,Male,2,2,Chennai,880000.0,79.0,1


In [None]:
# The process above is lengthy, time-consuming, and requires many manual steps.

In [None]:
# ColumnTransformer can perform the same steps in a single object.

## Data Preprocessing Using ColumnTransformer

In [55]:
df = pd.read_csv("Job_Candidate_Selection.csv")

In [56]:
df

Unnamed: 0,Candidate_ID,Age,Gender,Education,Experience_Level,City,Expected_Salary,Interview_Score,Selected
0,C001,25.0,Male,Bachelor,Beginner,Hyderabad,400000.0,75.0,Yes
1,C002,30.0,Female,Master,Intermediate,Chennai,600000.0,82.0,Yes
2,C003,28.0,Male,Bachelor,Beginner,Bangalore,450000.0,68.0,No
3,C004,,Female,PhD,Advanced,Hyderabad,900000.0,91.0,Yes
4,C005,35.0,Male,Master,Intermediate,Pune,700000.0,85.0,Yes
5,C006,27.0,,Bachelor,Beginner,Chennai,420000.0,72.0,No
6,C007,32.0,Female,PhD,Advanced,Bangalore,950000.0,95.0,Yes
7,C008,,Male,Master,Intermediate,Hyderabad,,78.0,Yes
8,C009,26.0,Female,Bachelor,Beginner,Pune,390000.0,65.0,No
9,C010,31.0,Male,PhD,Advanced,Chennai,880000.0,,Yes


In [57]:
df.isnull().sum()

Candidate_ID        0
Age                 2
Gender              1
Education           0
Experience_Level    0
City                0
Expected_Salary     1
Interview_Score     1
Selected            0
dtype: int64

In [67]:
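from sklearn.compose import ColumnTransformer

# Each transformer receives the ORIGINAL columns of df; the steps run side by side, not in sequence.
# As a result, "tnf2" outputs the imputed Gender as text, while "tnf5" one-hot encodes the
# un-imputed Gender and treats NaN as its own category.
ct = ColumnTransformer(transformers=[
    ("tnf1", SimpleImputer(missing_values=np.nan, strategy="mean"), ["Age", "Expected_Salary", "Interview_Score"]),
    ("tnf2", SimpleImputer(missing_values=np.nan, strategy="most_frequent"), ["Gender"]),
    ("tnf3", OrdinalEncoder(categories=[["Bachelor", "Master", "PhD"], ["Beginner", "Intermediate", "Advanced"]], dtype=np.int16), ["Education", "Experience_Level"]),
    ("tnf5", OneHotEncoder(drop="first", sparse_output=False, dtype=np.int16), ["Gender", "City"]),
], remainder="passthrough")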

In [68]:
encoded_df = ct.fit_transform(df)

In [76]:
new_df = pd.DataFrame(encoded_df)

In [77]:
new_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,25.0,400000.0,75.0,Male,0,0,1,0,0,1,0,C001,Yes
1,30.0,600000.0,82.0,Female,1,1,0,0,1,0,0,C002,Yes
2,28.0,450000.0,68.0,Male,0,0,1,0,0,0,0,C003,No
3,29.25,900000.0,91.0,Female,2,2,0,0,0,1,0,C004,Yes
4,35.0,700000.0,85.0,Male,1,1,1,0,0,0,1,C005,Yes
5,27.0,420000.0,72.0,Male,0,0,0,1,1,0,0,C006,No
6,32.0,950000.0,95.0,Female,2,2,0,0,0,0,0,C007,Yes
7,29.25,632222.222222,78.0,Male,1,1,1,0,0,1,0,C008,Yes
8,26.0,390000.0,65.0,Female,0,0,0,0,0,0,1,C009,No
9,31.0,880000.0,79.0,Male,2,2,1,0,1,0,0,C010,Yes
