<a href="https://colab.research.google.com/github/geonextgis/Mastering-Machine-Learning-and-GEE-for-Earth-Science/blob/main/03_Feature_Engineering/04_Column_Transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Column Transformer**
`ColumnTransformer` is a class in scikit-learn's `sklearn.compose` module that applies different preprocessing steps to different subsets of the columns (features) in a dataset. It is particularly useful when a dataset mixes numerical and categorical features that need different transformations.

Here's an overview of how the ColumnTransformer works:

1. **Specify Transformers:**<br> First, you define a list of transformers, where each transformer specifies a particular preprocessing step to be applied to a subset of the columns. For example, you might have one transformer for numerical columns (e.g., scaling), another for categorical columns (e.g., one-hot encoding), and maybe even other transformers for specific subsets of columns.

2. **Specify Columns:**<br> For each transformer, you also specify which columns it should be applied to. The columns are given as the third element of each `(name, transformer, columns)` tuple, and can be column names, integer indices, slices, boolean masks, or a callable selector.

3. **Combine Transformers:**<br> You create a `ColumnTransformer` object and pass in the list of transformers. The `remainder` parameter controls what happens to columns that are not listed in any transformer: `"drop"` (the default) discards them, `"passthrough"` keeps them unchanged, and an estimator can be supplied to transform them.

4. **Fit and Transform:**<br> You fit the `ColumnTransformer` on the training data with the `fit` method, then apply the `transform` method to both the training and testing data (`fit_transform` does both steps on the training data in one call). Each transformer is applied to its designated columns, and the results are concatenated side by side into a single transformed dataset.

## **Import Required Libraries**

In [43]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [44]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

## **Read the Data**

In [45]:
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/GitHub Repo/Mastering-Machine-Learning-and-GEE-for-Earth-Science/Datasets/covid_toy.csv")
df

,age,gender,fever,cough,city,has_covid
0,60,Male,103.0,Mild,Kolkata,No
1,27,Male,100.0,Mild,Delhi,Yes
2,42,Male,101.0,Mild,Delhi,No
3,31,Female,98.0,Mild,Kolkata,No
4,65,Female,101.0,Mild,Mumbai,No
...,...,...,...,...,...,...
95,12,Female,104.0,Mild,Bangalore,No
96,51,Female,101.0,Strong,Kolkata,Yes
97,20,Female,101.0,Mild,Bangalore,No
98,5,Female,98.0,Strong,Mumbai,No


In [46]:
# Check the information of the columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   age        100 non-null    int64  
 1   gender     100 non-null    object 
 2   fever      90 non-null     float64
 3   cough      100 non-null    object 
 4   city       100 non-null    object 
 5   has_covid  100 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 4.8+ KB


In [47]:
# Check the number of null values in each column
df.isnull().sum()

age           0
gender        0
fever        10
cough         0
city          0
has_covid     0
dtype: int64

In [48]:
# Check all the unique values of the categorical columns
for column in df.select_dtypes(include="object").columns:
    unique_values = df[column].unique()
    print(f"{column}: {unique_values}")

gender: ['Male' 'Female']
cough: ['Mild' 'Strong']
city: ['Kolkata' 'Delhi' 'Mumbai' 'Bangalore']
has_covid: ['No' 'Yes']


## **Preprocessing without `ColumnTransformer`**

### **Train Test Split**

In [49]:
X_train, X_test, y_train, y_test = train_test_split(df.drop("has_covid", axis=1),
                                                    df["has_covid"],
                                                    test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

((70, 5), (30, 5))

In [50]:
X_train.head(10)

,age,gender,fever,cough,city
60,24,Female,102.0,Strong,Bangalore
80,14,Female,99.0,Mild,Mumbai
90,59,Female,99.0,Strong,Delhi
68,54,Female,104.0,Strong,Kolkata
51,11,Female,100.0,Strong,Kolkata
27,33,Female,102.0,Strong,Delhi
18,64,Female,98.0,Mild,Bangalore
56,71,Male,,Strong,Kolkata
63,10,Male,100.0,Mild,Bangalore
74,34,Female,104.0,Strong,Delhi


### **Fill the Null Values of `fever` Column using `SimpleImputer`**

- **SimpleImputer:**<br>
 It is a univariate imputer for filling missing values with simple strategies. It replaces missing values using a descriptive statistic (e.g. mean, median, or most frequent) along each column, or using a constant value.
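As a rough illustration of how the strategies differ, the sketch below fits each one on the same small array. The readings in `temps` and the constant `99.0` are made-up values chosen so that the three statistics come out different.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical temperature readings with one missing value
temps = np.array([[98.0], [np.nan], [98.0], [100.0], [104.0]])

# Collect the value each statistic-based strategy would fill in
fill_values = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    imputer.fit(temps)
    # statistics_ holds the learned fill value for each column
    fill_values[strategy] = float(imputer.statistics_[0])

print(fill_values)

# The "constant" strategy ignores the data and uses fill_value instead
constant_imputer = SimpleImputer(strategy="constant", fill_value=99.0)
constant_filled = constant_imputer.fit_transform(temps)
print(constant_filled.ravel())
```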

In [51]:
# Create a SimpleImputer object
simple_imputer = SimpleImputer(strategy='mean')

# Fit the 'fever' column of the training data
simple_imputer.fit(X_train[["fever"]])

# Transform the 'fever' column of the training and testing data
X_train_fever = simple_imputer.transform(X_train[["fever"]])
X_test_fever = simple_imputer.transform(X_test[["fever"]])

In [52]:
# Print the first ten values of X_train_fever
X_train_fever[:10]

array([[102.        ],
       [ 99.        ],
       [ 99.        ],
       [104.        ],
       [100.        ],
       [102.        ],
       [ 98.        ],
       [101.06557377],
       [100.        ],
       [104.        ]])

### **Apply `OrdinalEncoder` to `cough` Column**



In [53]:
# Create an object of the OrdinalEncoder
ordinal_encoder = OrdinalEncoder(categories=[["Mild", "Strong"]], dtype=int)

# Fit the 'cough' column of the training data
ordinal_encoder.fit(X_train[["cough"]])

# Transform the 'cough' column of the training and testing data
X_train_cough = ordinal_encoder.transform(X_train[["cough"]])
X_test_cough = ordinal_encoder.transform(X_test[["cough"]])

In [54]:
# Print the first ten values of X_train_cough
X_train_cough[:10]

array([[1],
       [0],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [1]])
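The explicit `categories` list is what fixes the order of the integer codes. For `Mild` and `Strong` the default alphabetical order happens to match the intended order, but that is not true in general. The sketch below uses a hypothetical three-level `severity` column (not part of this dataset) to show the difference.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered feature with three levels
severity = pd.DataFrame({"severity": ["Low", "High", "Medium", "Low"]})

# Default behaviour: categories are sorted alphabetically (High, Low, Medium)
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(severity).ravel()

# Explicit order: Low < Medium < High
ordered_encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ordered_codes = ordered_encoder.fit_transform(severity).ravel()

print(default_codes)  # codes follow alphabetical order
print(ordered_codes)  # codes follow the intended ranking
```

Only the second encoder yields codes whose numeric order reflects the ranking of the categories.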

### **Apply `OneHotEncoder` to `gender` and `city` Columns**

In [55]:
# Create an object of the OneHotEncoder
one_hot_encoder = OneHotEncoder(drop="first", sparse_output=False, dtype=int)

# Fit the 'gender' and 'city' columns of the training data
one_hot_encoder.fit(X_train[["gender", "city"]])

# Transform the 'gender' and 'city' columns of the training and testing data
X_train_gender_city = one_hot_encoder.transform(X_train[["gender", "city"]])
X_test_gender_city = one_hot_encoder.transform(X_test[["gender", "city"]])

In [56]:
# Check the new column names after applying One Hot Encoding
one_hot_encoder.get_feature_names_out()

array(['gender_Male', 'city_Delhi', 'city_Kolkata', 'city_Mumbai'],
      dtype=object)

In [57]:
# Print the first ten values of X_train_gender_city
X_train_gender_city[:10]

array([[0, 0, 0, 0],
       [0, 0, 0, 1],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 1, 0],
       [0, 1, 0, 0],
       [0, 0, 0, 0],
       [1, 0, 1, 0],
       [1, 0, 0, 0],
       [0, 1, 0, 0]])

In [58]:
X_train_cough.shape

(70, 1)

In [59]:
# Convert the 'age' column into a 2D NumPy array; -1 lets NumPy infer the number of rows
X_train_age = np.array(X_train["age"]).reshape(-1, 1)
X_test_age = np.array(X_test["age"]).reshape(-1, 1)

In [60]:
# Print the first ten values of X_train_age
X_train_age[:10]

array([[24],
       [14],
       [59],
       [54],
       [11],
       [33],
       [64],
       [71],
       [10],
       [34]])

### **Concatenating all the Arrays for the Training and Testing Data**

In [61]:
# Concatenating all the columns of the training data
X_train_transformed = np.concatenate((X_train_age, X_train_fever, X_train_cough, X_train_gender_city), axis=1)

# Concatenating all the columns of the testing data
X_test_transformed = np.concatenate((X_test_age, X_test_fever, X_test_cough, X_test_gender_city), axis=1)

In [62]:
# Defining the column names of the transformed dataframe
column_names = np.concatenate((np.array(["age", "fever", "cough"]), one_hot_encoder.get_feature_names_out()))
column_names

array(['age', 'fever', 'cough', 'gender_Male', 'city_Delhi',
       'city_Kolkata', 'city_Mumbai'], dtype=object)

In [63]:
# Convert transformed data into pandas dataframe
X_train_transformed = pd.DataFrame(X_train_transformed, columns=column_names)
X_test_transformed = pd.DataFrame(X_test_transformed, columns=column_names)

In [64]:
# Print the transformed data
X_train_transformed

,age,fever,cough,gender_Male,city_Delhi,city_Kolkata,city_Mumbai
0,24.0,102.0,1.0,0.0,0.0,0.0,0.0
1,14.0,99.0,0.0,0.0,0.0,0.0,1.0
2,59.0,99.0,1.0,0.0,1.0,0.0,0.0
3,54.0,104.0,1.0,0.0,0.0,1.0,0.0
4,11.0,100.0,1.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...
65,51.0,101.0,1.0,0.0,0.0,1.0,0.0
66,65.0,99.0,0.0,1.0,0.0,0.0,0.0
67,42.0,104.0,0.0,1.0,0.0,0.0,1.0
68,18.0,104.0,0.0,0.0,0.0,0.0,0.0


In [65]:
# Print the information of the transformed training data
X_train_transformed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   age           70 non-null     float64
 1   fever         70 non-null     float64
 2   cough         70 non-null     float64
 3   gender_Male   70 non-null     float64
 4   city_Delhi    70 non-null     float64
 5   city_Kolkata  70 non-null     float64
 6   city_Mumbai   70 non-null     float64
dtypes: float64(7)
memory usage: 4.0 KB


In [66]:
X_test_transformed.head(10)

,age,fever,cough,gender_Male,city_Delhi,city_Kolkata,city_Mumbai
0,19.0,100.0,0.0,0.0,0.0,1.0,0.0
1,25.0,104.0,0.0,1.0,0.0,0.0,0.0
2,42.0,101.0,0.0,1.0,1.0,0.0,0.0
3,81.0,101.0,0.0,0.0,0.0,0.0,1.0
4,5.0,102.0,0.0,1.0,0.0,1.0,0.0
5,27.0,100.0,0.0,1.0,0.0,1.0,0.0
6,69.0,103.0,0.0,0.0,0.0,1.0,0.0
7,34.0,98.0,1.0,1.0,0.0,1.0,0.0
8,60.0,99.0,0.0,0.0,0.0,0.0,1.0
9,12.0,104.0,0.0,0.0,0.0,0.0,0.0


In [67]:
# Print the information of the transformed testing data
X_test_transformed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   age           30 non-null     float64
 1   fever         30 non-null     float64
 2   cough         30 non-null     float64
 3   gender_Male   30 non-null     float64
 4   city_Delhi    30 non-null     float64
 5   city_Kolkata  30 non-null     float64
 6   city_Mumbai   30 non-null     float64
dtypes: float64(7)
memory usage: 1.8 KB


## **Preprocessing with `ColumnTransformer`**

In [68]:
# Create an object of the ColumnTransformer
transformer = ColumnTransformer(transformers=[
    ("tranformer_1", SimpleImputer(strategy='mean'), ["fever"]),
    ("transformer_2", OrdinalEncoder(categories=[["Mild", "Strong"]]), ["cough"]),
    ("transformer_3", OneHotEncoder(drop="first", sparse_output=False), ["gender", "city"])
], remainder="passthrough")

In [69]:
# Fit and transform the training data
X_train_transformed = transformer.fit_transform(X_train)
X_train_transformed.shape

(70, 7)

In [70]:
# Transform the testing data
X_test_transformed = transformer.transform(X_test)
X_test_transformed.shape

(30, 7)

In [71]:
# Checking the new column names of the transformed data
transformer.get_feature_names_out()

array(['tranformer_1__fever', 'transformer_2__cough',
       'transformer_3__gender_Male', 'transformer_3__city_Delhi',
       'transformer_3__city_Kolkata', 'transformer_3__city_Mumbai',
       'remainder__age'], dtype=object)

In [72]:
# Convert the transformed array into pandas dataframe
X_train_transformed = pd.DataFrame(X_train_transformed, columns=transformer.get_feature_names_out())
X_test_transformed = pd.DataFrame(X_test_transformed, columns=transformer.get_feature_names_out())

In [73]:
X_train_transformed

,tranformer_1__fever,transformer_2__cough,transformer_3__gender_Male,transformer_3__city_Delhi,transformer_3__city_Kolkata,transformer_3__city_Mumbai,remainder__age
0,102.0,1.0,0.0,0.0,0.0,0.0,24.0
1,99.0,0.0,0.0,0.0,0.0,1.0,14.0
2,99.0,1.0,0.0,1.0,0.0,0.0,59.0
3,104.0,1.0,0.0,0.0,1.0,0.0,54.0
4,100.0,1.0,0.0,0.0,1.0,0.0,11.0
...,...,...,...,...,...,...,...
65,101.0,1.0,0.0,0.0,1.0,0.0,51.0
66,99.0,0.0,1.0,0.0,0.0,0.0,65.0
67,104.0,0.0,1.0,0.0,0.0,1.0,42.0
68,104.0,0.0,0.0,0.0,0.0,0.0,18.0


In [74]:
X_test_transformed.head(10)

,tranformer_1__fever,transformer_2__cough,transformer_3__gender_Male,transformer_3__city_Delhi,transformer_3__city_Kolkata,transformer_3__city_Mumbai,remainder__age
0,100.0,0.0,0.0,0.0,1.0,0.0,19.0
1,104.0,0.0,1.0,0.0,0.0,0.0,25.0
2,101.0,0.0,1.0,1.0,0.0,0.0,42.0
3,101.0,0.0,0.0,0.0,0.0,1.0,81.0
4,102.0,0.0,1.0,0.0,1.0,0.0,5.0
5,100.0,0.0,1.0,0.0,1.0,0.0,27.0
6,103.0,0.0,0.0,0.0,1.0,0.0,69.0
7,98.0,1.0,1.0,0.0,1.0,0.0,34.0
8,99.0,0.0,0.0,0.0,0.0,1.0,60.0
9,104.0,0.0,0.0,0.0,0.0,0.0,12.0
