# Ensemble Techniques And Its Types-5

**Q1. You are working on a machine learning project where you have a dataset containing numerical and
categorical features. You have identified that some of the features are highly correlated and there are
missing values in some of the columns. You want to build a pipeline that automates the feature
engineering process and handles the missing values.**

**Design a pipeline that includes the following steps:**  

- Use an automated feature selection method to identify the important features in the dataset
- Create a numerical pipeline that includes the following steps
    - Impute the missing values in the numerical columns using the mean of the column values
    - Scale the numerical columns using standardization
- Create a categorical pipeline that includes the following steps
    - Impute the missing values in the categorical columns using the most frequent value of the column
    - One-hot encode the categorical columns
- Combine the numerical and categorical pipelines using a ColumnTransformer
- Use a Random Forest Classifier to build the final model
- Evaluate the accuracy of the model on the test dataset
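
The first step in the list, automated feature selection, can be sketched with scikit-learn's `SelectFromModel`, which keeps the features whose importance (here taken from a random forest) reaches a threshold. This is a minimal illustration on synthetic data from `make_classification`, not on the penguins dataset; the names `X_demo`, `y_demo`, and `selector`, and the `"median"` threshold, are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 10 features, of which 3 are informative
# and 2 are linear combinations of the informative ones.
X_demo, y_demo = make_classification(
    n_samples=300,
    n_features=10,
    n_informative=3,
    n_redundant=2,
    random_state=42,
)

# Keep the features whose forest importance is at least the median importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="median",
)
X_selected = selector.fit_transform(X_demo, y_demo)

print("kept feature indices:", selector.get_support(indices=True))
print("shape before:", X_demo.shape, "shape after:", X_selected.shape)
```

Because correlated features tend to share importance between them, an importance threshold is only one possible choice; the selector could also be placed as a step inside a `Pipeline` so that it is fitted on training data only.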

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder,LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
from seaborn import load_dataset

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [2]:
df = load_dataset("penguins")

In [3]:
df.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |


In [4]:
df.shape

(344, 7)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB


In [6]:
df.isnull().sum()

species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64

In [7]:
## Encoding the target feature i.e., species
label_encoder = LabelEncoder()
df["species"] = label_encoder.fit_transform(df["species"])

In [8]:
df.head()

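| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | 0 | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | 0 | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | 0 | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | 0 | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |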


In [9]:
df.species.value_counts()

species
0    152
2    124
1     68
Name: count, dtype: int64

**Problem Understanding: It's a multiclass classification problem.**

In [10]:
## Splitting the data into dependent and independent features

X = df.drop("species",axis=1)
y = df["species"]

In [11]:
##Splitting the data into train and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=42)

In [12]:
print(f"shape of X_train:{X_train.shape}")
print(f"shape of y_train:{y_train.shape}")
print(f"shape of X_test:{X_test.shape}")
print(f"shape of y_test:{y_test.shape}")

shape of X_train:(240, 6)
shape of y_train:(240,)
shape of X_test:(104, 6)
shape of y_test:(104,)


In [13]:
## Getting the categorical and numerical feature names
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   island             344 non-null    object 
 1   bill_length_mm     342 non-null    float64
 2   bill_depth_mm      342 non-null    float64
 3   flipper_length_mm  342 non-null    float64
 4   body_mass_g        342 non-null    float64
 5   sex                333 non-null    object 
dtypes: float64(4), object(2)
memory usage: 16.2+ KB


In [14]:
numerical_cols = list(X.select_dtypes(include='float64').columns)

In [15]:
categorical_cols = list(X.select_dtypes(include='object').columns)

In [16]:
print(f"numerical_cols: {numerical_cols}")
print(f"categorical_cols: {categorical_cols}")

numerical_cols: ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
categorical_cols: ['island', 'sex']
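
The dtype-based column lists can also be produced with `make_column_selector` from `sklearn.compose`, which returns a callable that a `ColumnTransformer` can use in place of a hard-coded list. The sketch below runs on a tiny hypothetical frame (`demo`) with the same kinds of columns as the penguins data; selecting categorical columns as "everything non-numeric" is an assumption made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Tiny hypothetical frame mimicking the penguins column types.
demo = pd.DataFrame({
    "island": ["Torgersen", "Biscoe", "Dream"],
    "bill_length_mm": [39.1, np.nan, 46.5],
    "sex": ["Male", None, "Female"],
})

# Selectors are callables: given a DataFrame, they return matching column names.
num_selector = make_column_selector(dtype_include=np.number)
cat_selector = make_column_selector(dtype_exclude=np.number)

print("numerical:", num_selector(demo))
print("categorical:", cat_selector(demo))
```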


Creating the Pipelines

In [17]:
## Numerical Pipeline
num_pipeline = Pipeline(
    steps=[
        ('imputer',SimpleImputer(strategy='mean')),
        ('scaler',StandardScaler())
    ]
)

In [18]:
## Categorical Pipeline

cat_pipeline = Pipeline(
    steps=[
        ('imputer',SimpleImputer(strategy='most_frequent')),
        ('oh_encoder',OneHotEncoder())
    ]
)

Combining both the pipelines using ColumnTransformer

In [19]:
preprocessor = ColumnTransformer([
    ('num_pipeline',num_pipeline,numerical_cols),
    ('cat_pipeline',cat_pipeline,categorical_cols)
])

In [20]:
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)
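
An alternative to transforming `X_train` and `X_test` by hand is to chain the preprocessor and the classifier in a single `Pipeline`, so that one `fit` call learns the imputation, scaling, encoding, and model from the training data only. The sketch below uses a small synthetic frame rather than the penguins data; the column names (`length`, `mass`, `site`), the derived target, and the `handle_unknown="ignore"` setting are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data with two numerical and one categorical column.
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "length": rng.normal(40, 5, n),
    "mass": rng.normal(4000, 500, n),
    "site": rng.choice(["A", "B", "C"], n),
})
target = (demo["length"] > 40).astype(int)

# Inject missing values so the imputers have something to do.
demo.loc[demo.index % 15 == 0, "length"] = np.nan
demo.loc[demo.index % 20 == 0, "site"] = np.nan

num_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])
cat_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # Ignore categories at predict time that were not seen during fit.
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", num_steps, ["length", "mass"]),
    ("cat", cat_steps, ["site"]),
])

# Preprocessing and model live in one estimator.
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", RandomForestClassifier(random_state=42)),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    demo, target, test_size=0.3, random_state=42
)
model.fit(X_tr, y_tr)
acc = model.score(X_te, y_te)
print(f"test accuracy: {acc:.3f}")
```

Keeping everything in one estimator also means the same object can be passed to `cross_val_score` or `GridSearchCV` without leaking test-fold statistics into the preprocessing.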

**Building a Random Forest Classifier Model on the transformed data**

In [21]:
rf_clf = RandomForestClassifier()

In [22]:
rf_clf.fit(X_train,y_train)

In [23]:
y_pred = rf_clf.predict(X_test)

**Evaluating the performance of the model on the test data**

In [24]:
acc_score = accuracy_score(y_test,y_pred)
precision = precision_score(y_test,y_pred,average='weighted')
recall = recall_score(y_test,y_pred,average='weighted')
f1 = f1_score(y_test,y_pred,average='weighted')

In [25]:
print(f"The accuracy of the model: {acc_score}")
print(f"The weighted precision of the model: {precision}")
print(f"The weighted recall of the model: {recall}")
print(f"The weighted f1 score of the model: {f1}")

The accuracy of the model: 0.9903846153846154
The weighted precision of the model: 0.9905731523378581
The weighted recall of the model: 0.9903846153846154
The weighted f1 score of the model: 0.9903089421368172


**The model scores almost too well on a single test split, so we might need to perform hyperparameter tuning, checked with cross-validation, to make sure that it generalizes well.**
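
One way to check whether a very high score holds up beyond a single split is stratified k-fold cross-validation, which reports the spread of accuracy across several train/test partitions. The sketch below is an illustration only: it uses scikit-learn's bundled wine dataset as a stand-in for the penguins data, and the fold count and `random_state` values are arbitrary assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Bundled three-class dataset used here only as a stand-in.
X_demo, y_demo = load_wine(return_X_y=True)

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(random_state=42),
    X_demo,
    y_demo,
    cv=cv,
    scoring="accuracy",
)

print("fold accuracies:", scores.round(3))
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```

A small standard deviation across folds suggests the score is not an artifact of one lucky split; a large one suggests the single test-set figure should be read with caution.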

**Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then
use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its
accuracy.**

In [26]:
## importing the required libraries
from sklearn.ensemble import RandomForestClassifier,VotingClassifier
from sklearn.linear_model import LogisticRegression

In [27]:
df = load_dataset("iris")

In [28]:
df.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [29]:
df.shape

(150, 5)

In [30]:
df["species"].value_counts()

species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64

In [31]:
label_encoder = LabelEncoder()
df["species"] = label_encoder.fit_transform(df["species"])

In [32]:
X = df.drop("species",axis=1)
y= df["species"]

In [33]:
##splitting the dataset into training and testing datasets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.28,random_state=42)

In [34]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
dtypes: float64(4)
memory usage: 4.8 KB


**There are only numerical features, so a single numerical pipeline is enough.**

In [35]:
main_pipeline = Pipeline(
    steps=[
        ('imputer',SimpleImputer(strategy='mean')),
        ('scaler',StandardScaler())
    ]
)

In [36]:
preprocessor = ColumnTransformer([
    ('main_pipeline',main_pipeline,list(X.columns))
])

In [37]:
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

In [38]:
# Create Random Forest classifier
rf_clf = RandomForestClassifier()

# Create Logistic Regression classifier
lr_clf = LogisticRegression()

In [39]:
voting_clf = VotingClassifier(estimators=[
    ("random_forest_classifier",rf_clf),
    ("logistic_regression_classifier",lr_clf)
])

In [40]:
voting_clf.fit(X_train,y_train)

In [41]:
y_pred = voting_clf.predict(X_test)

In [42]:
acc_score = accuracy_score(y_test,y_pred)
precision = precision_score(y_test,y_pred,average='weighted')
recall = recall_score(y_test,y_pred,average='weighted')
f1 = f1_score(y_test,y_pred,average='weighted')

In [43]:
print(f"The accuracy of the model: {acc_score}")
print(f"The weighted precision of the model: {precision}")
print(f"The weighted recall of the model: {recall}")
print(f"The weighted f1 score of the model: {f1}")

The accuracy of the model: 1.0
The weighted precision of the model: 1.0
The weighted recall of the model: 1.0
The weighted f1 score of the model: 1.0
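
Since Q2 asks for a pipeline, the scaler and the voting classifier can also be wrapped in one `Pipeline` object. The sketch below is one possible arrangement, not the only one: it loads iris from `sklearn.datasets` instead of seaborn, adds a stratified split, and compares hard voting (the `VotingClassifier` default, used above) with soft voting. The helper `make_voting_pipeline` and the `max_iter` and `random_state` values are assumptions introduced for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.28, random_state=42, stratify=y_iris
)


def make_voting_pipeline(voting):
    """Hypothetical helper: scaler followed by a two-model voting ensemble."""
    return Pipeline([
        ("scaler", StandardScaler()),
        ("voter", VotingClassifier(
            estimators=[
                ("rf", RandomForestClassifier(random_state=42)),
                ("lr", LogisticRegression(max_iter=1000)),
            ],
            voting=voting,  # "hard" = majority vote, "soft" = averaged probabilities
        )),
    ])


results = {}
for voting in ("hard", "soft"):
    pipe = make_voting_pipeline(voting).fit(X_tr, y_tr)
    results[voting] = pipe.score(X_te, y_te)

print(results)
```

With only two estimators, hard voting produces a tie whenever the models disagree, and scikit-learn resolves ties in favor of the class that comes first in ascending label order. Soft voting, or a third estimator, avoids relying on that tie-break.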
