# Column Transformer:

**What is ColumnTransformer?**

The ColumnTransformer is a class in the scikit-learn Python machine learning library that allows you to selectively apply data preparation transforms to different columns in your dataset. This is particularly useful when you have a mix of numerical and categorical data that require different preprocessing steps.

**Why Use ColumnTransformer?**

Using ColumnTransformer offers several advantages:

- Selective Transformation: Apply specific transformations to subsets of columns.
- Pipeline Integration: Easily integrate with scikit-learn's Pipeline for streamlined workflows.
- Code Organization: Encapsulate preprocessing logic in a single, maintainable object.
- Less Manual Work: Without ColumnTransformer, you fill missing values, scale, and encode each group of columns separately, one step at a time, and then merge the columns back together. ColumnTransformer does all of this in a few lines of code and avoids the merging issues.
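As a rough illustration of selective transformation, the sketch below scales one column and one-hot encodes another in a single object. The DataFrame `toy` and its columns `age` and `city` are made up for this example and are not part of the salary dataset used later.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "age": [25.0, 35.0, 45.0],
    "city": ["Paris", "Rome", "Paris"],
})

toy_ct = ColumnTransformer(
    [
        ("scale", StandardScaler(), ["age"]),   # numeric column -> standardized
        ("onehot", OneHotEncoder(), ["city"]),  # categorical column -> 0/1 columns
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

toy_out = toy_ct.fit_transform(toy)
print(toy_out.shape)  # one scaled column plus one column per city
```

The two results are concatenated automatically, so no manual merge step is needed.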

## Import libraries:

In [1]:
import pandas as pd
import numpy as np

## Load Dataset:

In [2]:
df=pd.read_csv('../Data/Salary_Data.csv')
df.head()

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
0,32.0,Male,Bachelor's,Software Engineer,5.0,90000.0
1,28.0,Female,Master's,Data Analyst,3.0,65000.0
2,45.0,Male,PhD,Senior Manager,15.0,150000.0
3,36.0,Female,Bachelor's,Sales Associate,7.0,60000.0
4,52.0,Male,Master's,Director,20.0,200000.0


In [3]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6704 entries, 0 to 6703
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Age                  6702 non-null   float64
 1   Gender               6702 non-null   object 
 2   Education Level      6701 non-null   object 
 3   Job Title            6702 non-null   object 
 4   Years of Experience  6701 non-null   float64
 5   Salary               6699 non-null   float64
dtypes: float64(3), object(3)
memory usage: 314.4+ KB


Unnamed: 0,Age,Years of Experience,Salary
count,6702.0,6701.0,6699.0
mean,33.620859,8.094687,115326.964771
std,7.614633,6.059003,52786.183911
min,21.0,0.0,350.0
25%,28.0,3.0,70000.0
50%,32.0,7.0,115000.0
75%,38.0,12.0,160000.0
max,62.0,34.0,250000.0


In [4]:
df.isnull().sum()

Age                    2
Gender                 2
Education Level        3
Job Title              2
Years of Experience    3
Salary                 5
dtype: int64

In [5]:
# Normalize the education labels so each level has a single spelling
df = df.replace({"Bachelor's": "Bachelor's Degree", "Master's": "Master's Degree", "phD": "PhD"})

In [6]:
df[df['Education Level'].isnull()]

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
172,,,,,,
260,,,,,,
2011,27.0,Male,,Developer,7.0,100000.0
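Rows 172 and 260 are empty in every column, so imputing them would create entire records out of fill values. One option, which this notebook does not take, is to drop such rows before preprocessing. The sketch below shows the idea on a hypothetical miniature frame (`demo` and `cleaned` are illustrative names).

```python
import numpy as np
import pandas as pd

# Hypothetical frame mimicking the pattern above:
# row 0 misses one value, row 1 misses everything
demo = pd.DataFrame({
    "Age": [27.0, np.nan, 30.0],
    "Education Level": [np.nan, np.nan, "PhD"],
    "Salary": [100000.0, np.nan, 120000.0],
})

# how="all" removes only the rows in which every column is missing
cleaned = demo.dropna(how="all")
print(cleaned.isnull().sum())
```

The partially filled row is kept, so it can still be handled by an imputer later.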


-----------

**Filling missing values**
- Every column has missing values, so we have to fill them.
- For numerical columns, we use `SimpleImputer` with the mean strategy.
- For categorical columns, we use `SimpleImputer` with the most-frequent strategy.
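A minimal sketch of the two imputation strategies, run on a hypothetical toy frame rather than the salary data (the variable names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "Age": [30.0, np.nan, 40.0],
    "Gender": ["Male", "Male", np.nan],
})

# Numerical column: replace NaN with the column mean
mean_imputer = SimpleImputer(strategy="mean")
age_filled = mean_imputer.fit_transform(toy[["Age"]])

# Categorical column: replace NaN with the most frequent value
mode_imputer = SimpleImputer(strategy="most_frequent")
gender_filled = mode_imputer.fit_transform(toy[["Gender"]])

print(age_filled.ravel())
print(gender_filled.ravel())
```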
  
**Ordinal encoding**
- Used for ordinal categorical data, where the categories have a natural order.
- `Education Level` is that column.
  
**One-hot encoding**
- Used for nominal categorical data, where the categories have no order.
- `Job Title` and `Gender` are those columns.

**Scaling**
- Applied to the numerical columns: `Age`, `Years of Experience`, and `Salary`.
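The difference between the two encodings can be sketched on a few hypothetical rows. The frame `toy` is made up; the category order is the one this notebook uses for `Education Level`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical sample rows
toy = pd.DataFrame({
    "Education Level": ["PhD", "High School", "Master's Degree"],
    "Gender": ["Male", "Female", "Male"],
})

# Ordinal encoding: one integer code per level, following the given order
ordinal = OrdinalEncoder(
    categories=[["High School", "Bachelor's Degree", "Master's Degree", "PhD"]]
)
education_codes = ordinal.fit_transform(toy[["Education Level"]])

# One-hot encoding: one 0/1 column per category, no order implied
onehot = OneHotEncoder()
gender_matrix = onehot.fit_transform(toy[["Gender"]]).toarray()

print(education_codes.ravel())  # position of each level in the ordered list
print(onehot.categories_[0])    # column order of the one-hot matrix
print(gender_matrix)
```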
  
------------

## Column Transformer:

In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [8]:
X_train,X_test,y_train,y_test=train_test_split(df,df['Salary'],test_size=0.2,random_state=42)
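Note that the split above passes the whole DataFrame as the features, so `Salary` ends up both inside `X_train` and as the target. If the transformed data is later used to train a model that predicts salary, this would leak the target into the features. A minimal sketch of the usual alternative, on hypothetical toy data with illustrative variable names:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the salary data
demo = pd.DataFrame({
    "Age": [25.0, 32.0, 41.0, 38.0, 29.0],
    "Years of Experience": [2.0, 7.0, 15.0, 12.0, 4.0],
    "Salary": [50000.0, 90000.0, 150000.0, 120000.0, 65000.0],
})

features = demo.drop(columns=["Salary"])  # everything except the target
target = demo["Salary"]

Xf_train, Xf_test, yf_train, yf_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)
print(Xf_train.columns.tolist())
```

With this layout, `Salary` would also have to be removed from the column lists passed to the transformers.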

In [9]:
X_train

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
2922,52.0,Female,PhD,Software Engineer Manager,29.0,194778.0
4941,26.0,Male,Bachelor's Degree,Senior Product Marketing Manager,5.0,85000.0
135,39.0,Female,Bachelor's Degree,Administrative Assistant,10.0,55000.0
2306,23.0,Female,Bachelor's Degree,Software Engineer,1.0,50000.0
3433,38.0,Male,PhD,Senior Human Resources Manager,11.0,115000.0
...,...,...,...,...,...,...
3772,41.0,Male,PhD,Data Scientist,12.0,130000.0
5191,26.0,Female,Bachelor's Degree,Social Media Manager,3.0,55000.0
5226,27.0,Male,Bachelor's Degree,Product Designer,3.0,60000.0
5390,31.0,Female,Bachelor's Degree,Marketing Coordinator,4.0,65000.0


In [10]:
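# Each transformer below is applied to the ORIGINAL input columns, and the results
# are concatenated side by side; the steps are not chained. The output therefore
# holds both imputed and scaled copies of the numeric columns, and the encoders
# and the scaler do not see the imputed values.
ct = transformer = ColumnTransformer([
    ('num_imputer', SimpleImputer(strategy='mean'), ['Age', 'Years of Experience', 'Salary']),
    ('cat_imputer', SimpleImputer(strategy='most_frequent'), ['Gender', 'Education Level', 'Job Title']),
    ('ohe', OneHotEncoder(sparse_output=False, drop='first', handle_unknown='ignore'), ['Gender', 'Job Title']),
    ('ord_enco', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1,
                                categories=[["High School", "Bachelor's Degree", "Master's Degree", "PhD"]]),
     ['Education Level']),
    ('scaling', StandardScaler(), ['Age', 'Years of Experience', 'Salary']),
], remainder='passthrough')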
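In the cell above, every transformer receives the original columns, so imputation does not feed into encoding or scaling. If the intent is to impute first and then encode or scale the same columns, one common pattern is to wrap the steps for each column group in a `Pipeline` and pass those pipelines to `ColumnTransformer`. The sketch below is an assumption about that intent, not the notebook's own method. It runs on hypothetical toy rows, leaves out `Salary` and `drop='first'` for brevity, and uses illustrative names such as `numeric_steps` and `chained_ct`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical toy rows with the same column names as the salary data
toy = pd.DataFrame({
    "Age": [32.0, np.nan, 45.0, 36.0],
    "Years of Experience": [5.0, 3.0, np.nan, 7.0],
    "Gender": ["Male", "Female", np.nan, "Female"],
    "Job Title": ["Engineer", "Analyst", "Manager", np.nan],
    "Education Level": ["Bachelor's Degree", "Master's Degree", "PhD", np.nan],
})

education_order = [["High School", "Bachelor's Degree", "Master's Degree", "PhD"]]

# Inside a Pipeline the steps ARE chained: impute first, then scale or encode
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
nominal_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
ordinal_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("ordinal", OrdinalEncoder(categories=education_order,
                               handle_unknown="use_encoded_value",
                               unknown_value=-1)),
])

chained_ct = ColumnTransformer(
    [
        ("num", numeric_steps, ["Age", "Years of Experience"]),
        ("nom", nominal_steps, ["Gender", "Job Title"]),
        ("ord", ordinal_steps, ["Education Level"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

chained_out = chained_ct.fit_transform(toy)
print(chained_out.shape)
print(np.isnan(chained_out).any())  # no missing values remain
```

Each input column now appears once in the output, already imputed and then scaled or encoded.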

In [11]:
ct.fit(X_train)

In [12]:
ct.transform(X_train)

array([[52.0, 29.0, 194778.0, ..., 2.406234352525286, 3.4320973428784556,
        1.4880204204586012],
       [26.0, 5.0, 85000.0, ..., -1.0059079232089017,
        -0.5177127801342539, -0.582251871840238],
       [39.0, 10.0, 55000.0, ..., 0.7001632146581922,
        0.30516432882672734, -1.1480133975463465],
       ...,
       [27.0, 3.0, 60000.0, ..., -0.8746716818345097,
        -0.8468636237186464, -1.0537198099286615],
       [31.0, 4.0, 65000.0, ..., -0.3497267163369424,
        -0.6822882019264501, -0.9594262223109769],
       [24.0, 1.0, 90000.0, ..., -1.2683804059576853,
        -1.1760144673030388, -0.48795828422255316]], dtype=object)

In [13]:
ct.transform(X_test)



array([[43.0, 19.0, 156486.0, ..., 1.2251081801557595,
        1.7863431249564936, 0.7658824090473244],
       [34.0, 8.0, 140000.0, ..., 0.04398200778623306,
        -0.023986514757665174, 0.4549775919542942],
       [27.0, 3.0, 80000.0, ..., -0.8746716818345097,
        -0.8468636237186464, -0.6765454594579227],
       ...,
       [34.0, 9.0, 150000.0, ..., 0.04398200778623306,
        0.14058890703453109, 0.6435647671896637],
       [46.0, 15.0, 180000.0, ..., 1.618816904278935, 1.1280414377877086,
        1.2093262928957722],
       [45.0, 22.0, 171468.0, ..., 1.4875806629045432,
        2.2800693903330824, 1.048423714984955]], dtype=object)