## Ensembling: Combining multiple ML models to get better accuracy than a single model

### Ensembling Techniques:

1. Bagging - also called Bootstrap Aggregation. It is an ML technique that trains multiple ML models on different random subsets of the training data (sampled with replacement) and combines their predictions, which improves accuracy and stability and reduces overfitting.

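2. Boosting - Training multiple ML models one after another, each focusing on the errors of the previous ones, so that many weak learners combine into a strong learner.
3. Voting - Fitting multiple ML models on the data and combining their predictions, taking either the majority class (hard voting) or the averaged class probabilities (soft voting).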
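
The two ideas that bagging and voting rest on, sampling with replacement and taking the majority prediction, can be sketched in plain Python. This is only an illustration: `bootstrap_sample`, `majority_vote`, and the toy values are hypothetical names and data, not part of the notebook.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement (one bagging subset)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the label predicted most often (hard voting)."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)   # fixed seed so the sketch is repeatable
rows = list(range(10))   # stand-in for 10 training rows
sample = bootstrap_sample(rows, rng)

# A bootstrap sample has the same size as the original but usually repeats some rows
print(len(sample), len(set(sample)))

# Three hypothetical models predict a class for one flower; the majority wins
print(majority_vote([2, 1, 2]))  # 2
```

In bagging, each model would be trained on its own bootstrap sample and the final class chosen by a vote like the one above.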

## Loading the libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

## Load data

In [32]:
#from sklearn.datasets import load_iris
data = pd.read_csv('iris.csv')
data.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [7]:
data.drop('Id', axis = 1, inplace = True)
data.head()

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


## Encode Target (Species)

In [8]:
dic = {'Iris-setosa' : 0, 'Iris-virginica' : 1, 'Iris-versicolor' : 2}
data['Species'] = data['Species'].replace(dic)

In [9]:
data.head()

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
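
Once the target is encoded, predictions come back as integers, so it can help to keep a way of turning them back into species names. A minimal sketch, assuming the same `dic` mapping as above and a tiny hypothetical stand-in for the Species column:

```python
import pandas as pd

# Same mapping as in the notebook
dic = {'Iris-setosa': 0, 'Iris-virginica': 1, 'Iris-versicolor': 2}

# Tiny stand-in for the Species column (hypothetical rows, not the real file)
species = pd.Series(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])

# Label -> integer code
encoded = species.map(dic)

# Invert the dictionary to go from integer code back to label
inverse = {code: name for name, code in dic.items()}
decoded = encoded.map(inverse)

print(encoded.tolist())  # [0, 2, 1]
print(decoded.tolist())
```

Unlike `replace`, `map` turns any label missing from the dictionary into NaN, which makes unexpected values easy to spot.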


## Separate X and y

In [11]:
X = data.drop('Species', axis = 1)
y = data['Species']

## Feature Scaling

In [10]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()

In [14]:
X = ss.fit_transform(X)
X

array([[-9.00681170e-01,  1.03205722e+00, -1.34127240e+00,
        -1.31297673e+00],
       [-1.14301691e+00, -1.24957601e-01, -1.34127240e+00,
        -1.31297673e+00],
       [-1.38535265e+00,  3.37848329e-01, -1.39813811e+00,
        -1.31297673e+00],
       [-1.50652052e+00,  1.06445364e-01, -1.28440670e+00,
        -1.31297673e+00],
       [-1.02184904e+00,  1.26346019e+00, -1.34127240e+00,
        -1.31297673e+00],
       [-5.37177559e-01,  1.95766909e+00, -1.17067529e+00,
        -1.05003079e+00],
       [-1.50652052e+00,  8.00654259e-01, -1.34127240e+00,
        -1.18150376e+00],
       [-1.02184904e+00,  8.00654259e-01, -1.28440670e+00,
        -1.31297673e+00],
       [-1.74885626e+00, -3.56360566e-01, -1.34127240e+00,
        -1.31297673e+00],
       [-1.14301691e+00,  1.06445364e-01, -1.28440670e+00,
        -1.44444970e+00],
       [-5.37177559e-01,  1.49486315e+00, -1.28440670e+00,
        -1.31297673e+00],
       [-1.26418478e+00,  8.00654259e-01, -1.22754100e+00,
      

## Split the data into train and test sets

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
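
Note that the notebook scales all rows before splitting, so the test rows contribute to the mean and standard deviation used for scaling. On a small, clean dataset this is unlikely to change the result, but one common way to avoid it is to split first and let a pipeline fit the scaler on the training rows only. The sketch below assumes scikit-learn's bundled iris data as a stand-in for `iris.csv`; the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled iris data stands in for iris.csv
X_raw, y_raw = load_iris(return_X_y=True)

# Split first, so the test rows play no part in fitting the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.25, random_state=0
)

# The pipeline fits StandardScaler on the training rows only,
# then applies the same transformation when predicting
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))
model.fit(X_tr, y_tr)

print(model.score(X_te, y_te))
```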

### Applying Decision Tree Classifier as a standalone ML model

In [19]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()

In [21]:
dtc.fit(X_train, y_train)

In [23]:
y_pred = dtc.predict(X_test)
y_pred

array([1, 2, 0, 1, 0, 1, 0, 2, 2, 2, 1, 2, 2, 2, 2, 0, 2, 2, 0, 0, 1, 2,
       0, 0, 1, 0, 0, 2, 2, 0, 1, 2, 0, 1, 1, 2, 0, 1], dtype=int64)

In [24]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
print(accuracy_score(y_test, y_pred))  # scikit-learn's argument order is (y_true, y_pred)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

0.9736842105263158
[[13  0  0]
 [ 0  9  0]
 [ 0  1 15]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        13
           1       0.90      1.00      0.95         9
           2       1.00      0.94      0.97        16

    accuracy                           0.97        38
   macro avg       0.97      0.98      0.97        38
weighted avg       0.98      0.97      0.97        38



#### Note: In scikit-learn, the bagging technique can be implemented using the "BaggingClassifier" or "BaggingRegressor" classes:

- BaggingClassifier - Classification problems
- BaggingRegressor - Regression problems
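
The notebook only uses the classifier, so here is a rough sketch of the regression counterpart. The data is synthetic (a noisy sine curve made up for illustration), and the names `X_demo`, `y_demo`, and `br` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy sine curve
rng = np.random.RandomState(0)
X_demo = np.sort(rng.uniform(0, 6, size=(200, 1)), axis=0)
y_demo = np.sin(X_demo).ravel() + rng.normal(scale=0.2, size=200)

# The base model is passed positionally because its keyword name differs
# between scikit-learn releases (base_estimator in older ones, estimator in newer ones)
br = BaggingRegressor(DecisionTreeRegressor(), n_estimators=20, random_state=0)
br.fit(X_demo, y_demo)

# Each prediction is the average of the 20 trees' predictions
preds = br.predict([[1.5], [4.5]])
print(preds)
```

For regression the individual predictions are averaged, whereas the classifier combines them by voting or by averaging class probabilities.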

In [17]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()

In [25]:
from sklearn.ensemble import BaggingClassifier
# Base model passed positionally: the keyword base_estimator was renamed to estimator in scikit-learn 1.2
bc = BaggingClassifier(dtc, max_samples = 100, bootstrap = True, random_state = 42)

In [26]:
bc.fit(X_train, y_train)

In [29]:
y_pred_bc = bc.predict(X_test)
y_pred_bc

array([1, 2, 0, 1, 0, 1, 0, 2, 2, 2, 1, 2, 2, 2, 2, 0, 2, 2, 0, 0, 1, 2,
       0, 0, 1, 0, 0, 2, 2, 0, 1, 2, 0, 1, 1, 2, 0, 1], dtype=int64)

In [30]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
print(accuracy_score(y_test, y_pred_bc))  # scikit-learn's argument order is (y_true, y_pred)
print(confusion_matrix(y_test, y_pred_bc))
print(classification_report(y_test, y_pred_bc))

0.9736842105263158
[[13  0  0]
 [ 0  9  0]
 [ 0  1 15]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        13
           1       0.90      1.00      0.95         9
           2       1.00      0.94      0.97        16

    accuracy                           0.97        38
   macro avg       0.97      0.98      0.97        38
weighted avg       0.98      0.97      0.97        38



### Applying Logistic Regression as a standalone ML model

In [33]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

In [34]:
lr.fit(X_train, y_train)

In [35]:
y_pred_lr = lr.predict(X_test)

In [37]:
accuracy_score(y_test, y_pred_lr)

0.9736842105263158
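
The three models trained so far (decision tree, bagged trees, logistic regression) could be combined with the voting technique described at the top. The following is only a sketch of one way to do that with scikit-learn's `VotingClassifier`; it assumes the bundled iris data as a stand-in for `iris.csv`, and the hyperparameters are illustrative rather than tuned.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled iris data stands in for iris.csv
X_raw, y_raw = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.25, random_state=0
)

# Fit the scaler on the training rows and reuse it for the test rows
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# voting='hard' picks the class that most of the listed models predict
vc = VotingClassifier(
    estimators=[
        ('dtc', DecisionTreeClassifier(random_state=0)),
        ('bc', BaggingClassifier(DecisionTreeClassifier(), random_state=0)),
        ('lr', LogisticRegression(max_iter=1000)),
    ],
    voting='hard',
)
vc.fit(X_tr, y_tr)

print(vc.score(X_te, y_te))
```

Setting `voting='soft'` would instead average the predicted class probabilities, which requires every listed model to support `predict_proba`.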