# Ensemble Techniques-5

Q1. You are working on a machine learning project where you have a dataset containing numerical and
categorical features. You have identified that some of the features are highly correlated and there are
missing values in some of the columns. You want to build a pipeline that automates the feature
engineering process and handles the missing values.

Design a pipeline that includes the following steps:

- Use an automated feature selection method to identify the important features in the dataset
- Create a numerical pipeline that includes the following steps
    - Impute the missing values in the numerical columns using the mean of the column values
    - Scale the numerical columns using standardization
- Create a categorical pipeline that includes the following steps
    - Impute the missing values in the categorical columns using the most frequent value of the column
    - One-hot encode the categorical columns
- Combine the numerical and categorical pipelines using a ColumnTransformer
- Use a Random Forest Classifier to build the final model
- Evaluate the accuracy of the model on the test dataset

Note: Your solution should include code snippets for each step of the pipeline and a brief explanation of
each step. You should also provide an interpretation of the results and suggest possible improvements for
the pipeline.
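The steps listed above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the solution used in the rest of this notebook: the toy data and the column names `age`, `income`, and `city` are hypothetical, and `SelectFromModel` with a random forest is only one of several possible automated feature selection methods, since the task does not name one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numerical features and one categorical feature.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 8_000, n),
    "city": rng.choice(["A", "B", "C"], n),
})
toy_target = (toy["age"] + rng.normal(0, 5, n) > 40).astype(int)

# Knock out some values so that the imputers have something to do.
toy.loc[rng.choice(n, 20, replace=False), "age"] = np.nan
toy.loc[rng.choice(n, 20, replace=False), "city"] = np.nan

numeric_features = ["age", "income"]
categorical_features = ["city"]

# Numerical pipeline: mean imputation, then standardization.
num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Categorical pipeline: most-frequent imputation, then one-hot encoding.
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns through its own pipeline.
preprocess = ColumnTransformer([
    ("num", num_pipe, numeric_features),
    ("cat", cat_pipe, categorical_features),
])

# Preprocess, keep features with above-average importance, then classify.
model = Pipeline([
    ("preprocess", preprocess),
    ("select", SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0))),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=0)),
])

X_tr, X_te, y_tr, y_te = train_test_split(toy, toy_target, test_size=0.2, random_state=0)
model.fit(X_tr, y_tr)
toy_accuracy = model.score(X_te, y_te)
print(f"Test accuracy on the toy data: {toy_accuracy:.3f}")
```

Keeping every step inside one `Pipeline` means the imputers, scaler, encoder, and selector are fitted on the training split only, which avoids leaking test-set statistics into preprocessing.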


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [11]:
df=pd.read_csv('diabetes.csv')
df.head()

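|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |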


We will predict the Outcome column from the remaining features.

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


All columns are numeric and df.info() reports no null values. Since there is no categorical column, we only need a numerical pipeline.

In [22]:
X=df.iloc[:,:-1]
y=df['Outcome']

X.describe()

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 |


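A minimum of zero is not physiologically plausible for Glucose, BloodPressure, SkinThickness, Insulin, or BMI, so zeros in these columns are treated as missing values, even though df.info() reports no nulls.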
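A quick way to see how many such hidden missing values there are is to count zeros per column. The sketch below uses only the five rows shown by `df.head()` above so that it runs on its own; the names `sample`, `zero_counts`, and `sample_nan` are illustrative, and on the full data the same expressions would be applied to `df`.

```python
import numpy as np
import pandas as pd

# The five rows printed by df.head(), restricted to the affected columns.
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Count the zeros in each column.
zero_counts = (sample == 0).sum()
print(zero_counts)

# Mark zeros as NaN so that standard missing-value tools can see them.
sample_nan = sample.replace(0, np.nan)
print(sample_nan.isna().sum())
```

In this small sample, Insulin has three zeros and SkinThickness has one. An alternative to converting zeros to NaN is to tell the imputer directly that zero marks a missing value through its `missing_values` parameter.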

In [47]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.20, random_state=42)

In [44]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

In [42]:
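# Columns in which a zero is treated as a missing value. Outcome is the target;
# Pregnancies is removed because zero is a valid count there.
col_to_impute=df.columns.to_list()
col_to_impute.remove('Outcome')
col_to_impute.remove('Pregnancies')
col_to_impute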

['Glucose',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age']

In [46]:
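# The task statement asks for mean imputation; the median is used here instead,
# which is less sensitive to skewed columns such as Insulin.
num_pipeline=Pipeline(
    steps=[
        ('imputer', SimpleImputer(missing_values=0, strategy='median')),  # zeros mark missing values
        ('scaler', StandardScaler())  # standardize to zero mean and unit variance
    ]
)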


In [49]:
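# Only the columns in col_to_impute are transformed and kept. ColumnTransformer
# defaults to remainder='drop', so Pregnancies is not passed on to the model.
preprocessor=ColumnTransformer([
    ('num_pipeline', num_pipeline, col_to_impute)
])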

In [50]:
X_train=preprocessor.fit_transform(X_train)
X_test=preprocessor.transform(X_test)
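One detail worth checking at this step: `ColumnTransformer` drops every column that is not listed in a transformer, because its `remainder` parameter defaults to `'drop'`. Since Pregnancies was removed from `col_to_impute`, it does not appear in the transformed arrays. The sketch below illustrates the difference on a three-column frame built from the rows shown earlier; the names `frame`, `zero_as_missing`, `dropping`, and `keeping` are illustrative only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

frame = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "Insulin": [0, 0, 0, 94, 168],
})
zero_as_missing = ["Glucose", "Insulin"]

# Default behaviour: columns that are not listed are dropped.
dropping = ColumnTransformer([
    ("imputer", SimpleImputer(missing_values=0, strategy="median"), zero_as_missing),
])

# remainder='passthrough' appends the unlisted columns unchanged.
keeping = ColumnTransformer(
    [("imputer", SimpleImputer(missing_values=0, strategy="median"), zero_as_missing)],
    remainder="passthrough",
)

dropped_out = dropping.fit_transform(frame)
kept_out = keeping.fit_transform(frame)
print(dropped_out.shape)  # two columns: Glucose and Insulin
print(kept_out.shape)     # three columns: Glucose, Insulin, Pregnancies
```

With `remainder='passthrough'` the Pregnancies column is kept but not scaled. If scaling is wanted as well, one option is to route it through a second transformer that contains only a `StandardScaler`.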

In [57]:
from sklearn.ensemble import RandomForestClassifier

classifier=RandomForestClassifier()

In [58]:
from sklearn.model_selection import RandomizedSearchCV
import warnings
warnings.filterwarnings('ignore')

params={'max_depth':[3,5,10, None],
       'n_estimators':[100,200,300],
       'criterion':['gini','entropy']
       }

clf=RandomizedSearchCV(classifier, param_distributions=params, scoring='accuracy', cv=5)
clf.fit(X_train, y_train)

In [61]:
best_params=clf.best_params_
best_params

{'n_estimators': 200, 'max_depth': 10, 'criterion': 'gini'}

In [60]:
clf.best_score_

0.7720378515260562

In [62]:
# Refit a forest with the best parameters found by the search
new_classifier=RandomForestClassifier(**best_params)
new_classifier.fit(X_train,y_train)
y_pred=new_classifier.predict(X_test)
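Refitting a new forest from `best_params` works, but `RandomizedSearchCV` already refits the best configuration on the full training set by default (`refit=True`) and exposes it as `best_estimator_`. A possible improvement, sketched below on synthetic data from `make_classification` rather than on the diabetes data, is to put preprocessing and classifier into one `Pipeline` and search over it, so that the imputer and scaler are refitted inside each cross-validation fold. The names `full_pipeline`, `search`, and the reduced parameter grid are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so that the example is self-contained.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

full_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_distributions = {
    "classifier__max_depth": [3, 5, 10, None],
    "classifier__n_estimators": [20, 50],
    "classifier__criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    full_pipeline,
    param_distributions=param_distributions,
    n_iter=4,
    scoring="accuracy",
    cv=3,
    random_state=42,  # makes the sampled candidates reproducible
)
search.fit(Xd_train, yd_train)

# best_estimator_ is the whole pipeline, already refitted on the training data.
best_model = search.best_estimator_
demo_accuracy = best_model.score(Xd_test, yd_test)
print(search.best_params_)
print(f"Test accuracy: {demo_accuracy:.3f}")
```

Fixing `random_state` in both the forest and the search also makes the selected parameters and the reported score repeatable between runs.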

In [66]:
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
# scikit-learn metrics expect the true labels first, then the predictions
print(confusion_matrix(y_true=y_test,y_pred=y_pred))
print(f'\nAccuracy:{accuracy_score(y_test,y_pred)}\n')
print(classification_report(y_test,y_pred))

[[77 22]
 [18 37]]

Accuracy:0.7402597402597403

              precision    recall  f1-score   support

           0       0.81      0.78      0.79        99
           1       0.63      0.67      0.65        55

    accuracy                           0.74       154
   macro avg       0.72      0.73      0.72       154
weighted avg       0.75      0.74      0.74       154
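To interpret these results, the per-class metrics can be recomputed by hand from the confusion matrix printed above. In scikit-learn's convention the rows are the true classes and the columns are the predicted classes. The sketch below needs only NumPy; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = np.array([[77, 22],
               [18, 37]])

true_totals = cm.sum(axis=1)  # samples per true class
pred_totals = cm.sum(axis=0)  # samples per predicted class
correct = np.diag(cm)         # correctly classified samples per class

accuracy = correct.sum() / cm.sum()
recall = correct / true_totals     # share of each true class that was found
precision = correct / pred_totals  # share of each predicted class that was right

print(f"accuracy:  {accuracy:.4f}")
print(f"recall:    {np.round(recall, 2)}")
print(f"precision: {np.round(precision, 2)}")
```

Read this way, the model finds 37 of the 55 diabetic cases in the test set (recall of about 0.67 for class 1) and misses 18, while 22 of the 99 non-diabetic cases are flagged incorrectly. The test accuracy of about 0.74 is also a little below the cross-validated score of about 0.77. Possible improvements include keeping the Pregnancies feature, using class weights or a different decision threshold if missed positives are costly, and searching a wider parameter range.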



Q2. Build a pipeline that includes a random forest classifier and a logistic regression classifier, and then
use a voting classifier to combine their predictions. Train the pipeline on the iris dataset and evaluate its
accuracy.

Ans. A solution in Python is as follows:

In [67]:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create individual classifiers (Random Forest and Logistic Regression)
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
lr_classifier = LogisticRegression(random_state=42)

# Create a Voting Classifier that combines the two classifiers
voting_classifier = VotingClassifier(
    estimators=[('rf', rf_classifier), ('lr', lr_classifier)],
    voting='hard'  # Use majority class labels for voting
)

# Create a pipeline that includes feature scaling, then the Voting Classifier
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Standardize features
    ('voting', voting_classifier)  # Combine predictions with Voting Classifier
])

# Train the pipeline on the training data
pipeline.fit(X_train, y_train)

In [68]:
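from sklearn.metrics import confusion_matrix, classification_report

y_pred=pipeline.predict(X_test)

# scikit-learn metrics expect the true labels first, then the predictions
print(confusion_matrix(y_true=y_test,y_pred=y_pred))
print(f'\nAccuracy:{accuracy_score(y_test,y_pred)}\n')
print(classification_report(y_test,y_pred))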

[[19  0  0]
 [ 0 13  0]
 [ 0  0 13]]

Accuracy:1.0

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
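A perfect score on a 45-sample test split is encouraging, but it rests on a small sample, so a cross-validated estimate gives a steadier picture. The sketch below compares each classifier with a voting ensemble using 5-fold cross-validation. It swaps in soft voting (averaging predicted probabilities) as an assumption: with only two voters, hard voting resolves a disagreement by picking the class that comes first in sorted label order, which is not always what is wanted. `max_iter=1000` and the names `candidates` and `cv_means` are also illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# The two base models on their own, plus a soft-voting ensemble of both.
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000, random_state=42),
    "soft_voting": VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
            ("lr", LogisticRegression(max_iter=1000, random_state=42)),
        ],
        voting="soft",  # average the predicted class probabilities
    ),
}

cv_means = {}
for name, estimator in candidates.items():
    # Scaling sits inside the pipeline so that it is refitted in every fold.
    scores = cross_val_score(
        make_pipeline(StandardScaler(), estimator),
        X_iris, y_iris, cv=5, scoring="accuracy",
    )
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the ensemble does not clearly beat its best member under cross-validation, the simpler single model is usually the better choice.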

