# Bagging

## What is bagging?
* Bagging, also known as bootstrap aggregation, is an ensemble learning method commonly used to reduce variance when learning from a noisy dataset. In bagging, random samples of the training set are drawn with replacement, meaning that an individual data point can be chosen more than once.


* Bagging is typically used to reduce the variance of a high-variance model such as a decision tree classifier. The objective is to create several subsets of the training data, each chosen randomly with replacement. Each subset is then used to train a separate base model.

![](bag.png)


**The output side is called aggregation.**

**For a regression task, it averages the base models' outputs.**

**For a classification task, it counts the votes for each class and takes the majority.**

## How bagging works

#### Bootstrapping:
* Bagging uses bootstrap sampling to create diverse samples. This resampling method generates different subsets of the training dataset by selecting data points at random and with replacement, so each time you draw a data point you may select an instance that was already chosen. As a result, a value or instance can appear twice (or more) in a sample.
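
To make sampling with replacement concrete, here is a minimal sketch using only the standard library. The helper name `bootstrap_sample` and the ten stand-in rows are illustrative assumptions, not part of the wine dataset used later.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows uniformly at random, with replacement."""
    return rng.choices(rows, k=len(rows))

rng = random.Random(0)            # fixed seed so the sketch is repeatable
rows = list(range(10))            # stand-in for ten training rows
sample = bootstrap_sample(rows, rng)

in_bag = set(sample)              # distinct rows drawn at least once
out_of_bag = set(rows) - in_bag   # rows never drawn for this sample

print("bootstrap sample:", sample)
print("out-of-bag rows: ", sorted(out_of_bag))
```

Because the sample has the same size as the original data but allows repeats, some rows usually appear several times while others are left out entirely. For large datasets, roughly 63% of the distinct rows end up in any one bootstrap sample.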

#### Parallel training:
* A weak or base learner is then trained on each bootstrap sample, independently of and in parallel with the others.

#### Aggregation:
* Finally, depending on the task (regression or classification), the individual predictions are averaged or put to a majority vote to compute a more accurate estimate. For regression, the outputs predicted by the individual models are averaged. For classification, the class with the most votes is accepted; this is known as hard voting or majority voting. Averaging the predicted class probabilities instead is known as soft voting.
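
The two aggregation rules can be sketched in a few lines. The function names `hard_vote` and `average_prediction` and the toy predictions are assumptions made for illustration only.

```python
from collections import Counter
from statistics import mean

def hard_vote(labels):
    """Return the label predicted by the largest number of base learners."""
    return Counter(labels).most_common(1)[0][0]

def average_prediction(values):
    """Return the mean of the base learners' numeric predictions."""
    return mean(values)

class_votes = [1, 2, 1, 1, 3]     # five base learners voting on one sample
reg_outputs = [2.0, 3.0, 4.0]     # three base learners predicting a number

print(hard_vote(class_votes))           # 1
print(average_prediction(reg_outputs))  # 3.0
```

In this sketch a tie is resolved in favour of the label that was seen first; a real implementation may break ties differently.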

## Benefits:

* The biggest advantage of bagging is that an ensemble of many weak learners can outperform a single strong learner.

#### Ease of implementation: 
* Python libraries such as scikit-learn (also known as sklearn) make it easy to combine the predictions of base learners or estimators to improve model performance.


#### Reduction of variance:
* Bagging can reduce the variance of a learning algorithm. This is particularly helpful with high-dimensional data, where missing values can lead to higher variance, making the model more prone to overfitting and preventing accurate generalization to new data.
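
A standard textbook identity (not derived in this notebook) helps explain the variance reduction. Assume $B$ base learners whose predictions at a point $x$ each have variance $\sigma^2$ and pairwise correlation $\rho$; these symbols are introduced here purely for illustration.

$$
\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\right)
= \frac{1}{B^2}\left(B\,\sigma^2 + B(B-1)\,\rho\,\sigma^2\right)
= \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

The second term shrinks as $B$ grows, while the first term stays. Under these assumptions, adding more base learners helps only to the extent that their predictions are not perfectly correlated, which is what the diverse bootstrap samples are meant to achieve.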


## Challenges of bagging:


#### Computationally expensive:
* Bagging slows down and grows more resource-intensive as the number of iterations (base learners) increases. Clustered systems or a large number of processing cores are ideal for quickly creating bagged ensembles on large datasets.
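
Because the base learners are independent, their training can be spread over several cores. The sketch below assumes scikit-learn is installed and uses a synthetic dataset from `make_classification` as a stand-in; the variable names and the choice of 20 estimators and 2 jobs are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the wine dataset used in this notebook).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

parallel_bagger = BaggingClassifier(
    KNeighborsClassifier(),   # base learner, passed positionally
    n_estimators=20,          # number of base learners to train
    n_jobs=2,                 # number of parallel jobs; -1 would use all cores
    random_state=0,           # reproducible bootstrap samples
)
parallel_bagger.fit(X_demo, y_demo)

print("trained base learners:", len(parallel_bagger.estimators_))
```

Whether parallelism actually speeds things up depends on the dataset size and the cost of the base learner; for small data the overhead of starting workers can dominate.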



## First, we will use a single model

## Business case: Predicting the quality of wine from the given features.

In [None]:
#importing the required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load the dataset
data=pd.read_csv('wine.csv')

# Basic check

In [None]:
data.head()  # first five rows

In [None]:
data.wine.value_counts()  # number of rows per value of the target column `wine`

In [None]:
data.tail()  # last five rows

In [None]:
data.info()  # data types and non-null counts

In [None]:
data.describe()  # summary statistics: count, mean, std, min, percentiles, max

In [None]:
data.shape  # (rows, columns)

# EDA

### Renaming the columns

In [None]:
data.rename(columns={'Alcalinity of ash':'AOA','Total phenols':'total_phe',
                     'Nonflavanoid phenols':'NOP','Color intensity':'color_intensity','Hue':'hu',
                     'OD280/OD315 of diluted wines':'DW','Malic acid':'M_acid'},inplace=True)
#renaming the columns with long names to short identifiers

In [None]:
data.head()#first 5 rows

## Checking distribution

In [None]:
#creating a DataFrame that holds only the continuous variables
box = data[['Alcohol','M_acid','Ash',
          'AOA','Magnesium','total_phe',
          'Flavanoids','NOP','Proanthocyanins',
          'color_intensity','hu','DW','Proline']]

In [None]:
plt.figure(figsize=(25,25),facecolor='white')#canvas size

plotnum=1 #counter

for c in box:#columns from the DataFrame
    if(plotnum<9):#only the first 8 columns fit on the 4x2 grid
        a=plt.subplot(4,2,plotnum)#select the next subplot
        sns.histplot(box[c],kde=True)#histogram with KDE curve; replaces the deprecated distplot
    plotnum+=1#increment counter
plt.tight_layout()

In [None]:
sns.histplot(data.Proanthocyanins,kde=True)#histplot replaces the deprecated distplot

In [None]:
sns.histplot(data.color_intensity,kde=True)

# Data preprocessing

## 1. Checking null values

In [None]:
data.isnull().sum()

## 2. Checking constant columns

In [None]:
data.describe()

In [None]:
## The standard deviation of every feature is nonzero, so there are no constant features in the dataset.

## 3. Checking outliers

In [None]:
#creating a DataFrame that holds only the continuous variables
box=data[['Alcohol','M_acid','Ash',
          'AOA','Magnesium','total_phe',
          'Flavanoids','NOP','Proanthocyanins',
          'color_intensity','hu','DW','Proline']]

In [None]:
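plt.figure(figsize=(25,25),facecolor='white')#canvas size
plotnum=1#counter
for c in box:#columns from the DataFrame
    if(plotnum<14):#all 13 columns fit on the 4x4 grid
        ax=plt.subplot(4,4,plotnum)#select the next subplot
        sns.boxplot(x=box[c])#horizontal box plot to spot outliers
    plotnum+=1#increment counter
plt.tight_layout()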

# Outlier Handling

# M_acid

In [None]:
#outlier handling for M_acid: its distribution is not normal, so we use the IQR rule
from scipy import stats


In [None]:
IQR = stats.iqr(data.M_acid, interpolation = 'midpoint') #calculating the interquartile range

IQR

In [None]:
Q1=data.M_acid.quantile(0.25)#first quartile (25th percentile)
Q3=data.M_acid.quantile(0.75)#third quartile (75th percentile)

min_limit=Q1 - 1.5*IQR #setting minimum limit

max_limit=Q3 + 1.5*IQR #setting maximum limit


In [None]:
min_limit

In [None]:
max_limit

In [None]:
data.loc[data['M_acid']<min_limit] #checking values which are less than minimum limit

In [None]:
data.loc[data['M_acid']>max_limit]#checking values which are greater than maximum limit


In [None]:
#imputing outliers above the maximum limit with the median
data.loc[data['M_acid']>max_limit,'M_acid']=np.median(data.M_acid)

In [None]:
data.loc[data['M_acid']>max_limit] #checking whether the outliers were replaced (should return no rows)
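
The IQR steps above can be wrapped in a small reusable helper. This is only a sketch: the function name `iqr_limits`, the toy series, and the use of pandas' default linear quantile interpolation throughout (rather than the `midpoint` interpolation passed to `stats.iqr` above) are assumptions made here.

```python
import pandas as pd

def iqr_limits(series, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

demo = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])   # toy data with one obvious outlier
low, high = iqr_limits(demo)

# Replace values outside the fences with the median, as done for M_acid.
capped = demo.mask((demo < low) | (demo > high), demo.median())

print(low, high)          # -1.0 7.0
print(capped.tolist())    # [1.0, 2.0, 3.0, 4.0, 3.0]
```

With such a helper, the same rule could be applied to any skewed column by calling, for example, `iqr_limits(data.M_acid)`.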

# Ash

### Using 3 sigma rule

In [None]:
# for Ash we will use the 3 sigma rule
lower_limit=data.Ash.mean() - 3*data.Ash.std() #calculating lower limit
print(lower_limit)

upper_limit=data.Ash.mean() + 3*data.Ash.std() #calculating upper limit
upper_limit

In [None]:
data.loc[data['Ash']<lower_limit] #checking values which are less than minimum limit

In [None]:
data.loc[data['Ash']<lower_limit,'Ash']=np.mean(data.Ash)#imputing value with mean

In [None]:
data.loc[data['Ash']<lower_limit]

In [None]:
data.loc[data['Ash']>upper_limit]#checking values which are greater than maximum limit

In [None]:
data.loc[data['Ash']>upper_limit,'Ash']=np.mean(data.Ash)#imputing value with mean

In [None]:
data.loc[data['Ash']>upper_limit]#recheck

# AOA is approximately normally distributed, so using the 3 sigma rule

In [None]:
data.sort_values('AOA')

In [None]:
lower_limit=data.AOA.mean() - 3*data.AOA.std()##calculating lower limit
print(lower_limit)

upper_limit=data.AOA.mean() + 3*data.AOA.std()#calculating upper limit
upper_limit

In [None]:
data.loc[data['AOA']<lower_limit]#checking values which are less than minimum limit

In [None]:
data.loc[data['AOA']<lower_limit,'AOA']=np.median(data.AOA)#imputing value with median

In [None]:
data.loc[data['AOA']<lower_limit]

In [None]:
data.loc[data['AOA']>upper_limit]#checking values which are greater than maximum limit

In [None]:
data.loc[data['AOA']>upper_limit,'AOA']=np.mean(data.AOA)#imputing value with mean

In [None]:
data.loc[data['AOA']>upper_limit]

# Magnesium is approximately normally distributed, so using the 3 sigma rule

In [None]:
lower_limit=data.Magnesium.mean() - 3*data.Magnesium.std()#calculating lower limit
print(lower_limit)

upper_limit=data.Magnesium.mean() + 3*data.Magnesium.std()#calculating upper limit
upper_limit

In [None]:
data.loc[data['Magnesium']<lower_limit]#checking values which are less than minimum limit

In [None]:
data.loc[data['Magnesium']>upper_limit]#checking values which are  greater than maximum limit

In [None]:
data.loc[data['Magnesium']>upper_limit,'Magnesium']=np.mean(data.Magnesium)#imputing values using mean

In [None]:
data.loc[data['Magnesium']>upper_limit]
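
The 3 sigma steps repeated for Ash, AOA and Magnesium follow one pattern, which could be factored into helpers. The names `three_sigma_limits` and `impute_outside`, the toy series, and the choice of the median as the fill value are illustrative assumptions; the cells above mix mean and median imputation.

```python
import pandas as pd

def three_sigma_limits(series, n_sigma=3.0):
    """Return (lower, upper) limits at mean -/+ n_sigma standard deviations."""
    mean, std = series.mean(), series.std()
    return mean - n_sigma * std, mean + n_sigma * std

def impute_outside(series, lower, upper, fill_value):
    """Replace values outside [lower, upper] with fill_value."""
    return series.where(series.between(lower, upper), fill_value)

demo = pd.Series([10.0] * 19 + [100.0])   # toy data with one extreme value
lower, upper = three_sigma_limits(demo)
cleaned = impute_outside(demo, lower, upper, demo.median())

print(round(lower, 2), round(upper, 2))
print(cleaned.max())
```

Note that the limits are computed before imputation, so the extreme value still inflates the mean and standard deviation used for the limits, just as in the cells above.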

### Feature Selection

In [None]:
## Checking correlation

plt.figure(figsize=(30, 30))#canvas size
sns.heatmap(data.corr(), annot=True, cmap="RdYlGn", annot_kws={"size":15})
#plotting heat map to check correlation
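
Reading a large heat map by eye can be error-prone, so a tabular view of the strongest correlations may help. The sketch below is one possible approach; the helper name `high_corr_pairs`, the 0.8 threshold, and the toy frame are assumptions and not part of the original analysis.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is listed once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

demo = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.1],   # almost a multiple of "a"
    "c": [4.0, 1.0, 3.0, 2.0],   # only weakly related to "a" and "b"
})
result = high_corr_pairs(demo)
print(result)
```

On the notebook's data this could be called as `high_corr_pairs(data.iloc[:,1:])` to list candidate redundant features; what threshold counts as too correlated is a judgment call.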

# Model building


In [None]:
data.head()

In [None]:
## Creating independent and dependent variable
X=data.iloc[:,1:] #from alcohol column
y=data.wine

In [None]:
### creating train and test data
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.25,random_state=42)

In [None]:
## checking the train and test shape
X_test.shape

In [None]:
X_train.shape

In [None]:
from sklearn.neighbors import KNeighborsClassifier#using KNN as the single model
KNN1=KNeighborsClassifier() ## model object creation (default n_neighbors=5)
KNN1.fit(X_train,y_train)  ## fitting the model
y_hat_knn=KNN1.predict(X_test) ## getting the predictions from the fitted model

In [None]:
from sklearn.metrics import f1_score 
f1_knn=f1_score(y_test,y_hat_knn,average='weighted')#checking model performance 
f1_knn

## Using bagging


In [None]:
from sklearn.ensemble import BaggingClassifier #import bagging 

## model object creation
model_bagg1=BaggingClassifier(KNN1,n_estimators=100) 

# first argument---> algorithm which you want to pass; its keyword is `estimator` in
#                    scikit-learn 1.2+ (`base_estimator` in older releases, removed in 1.4)
# n_estimators-----> number of base learners


## fitting the model
model_bagg1.fit(X_train,y_train) 


## getting the prediction
y_hat_bagg=model_bagg1.predict(X_test) 

In [None]:
f1_bagg=f1_score(y_test,y_hat_bagg,average='weighted') ## The weighted-averaged F1 score is calculated by taking 
                                             ## the mean of all per-class F1 scores while considering each class’s support.

In [None]:
f1_bagg #score after bagging
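
A single train/test split can make the comparison between the two models noisy. One way to get a steadier picture is cross-validation, sketched below. It uses scikit-learn's bundled `load_wine` data as a stand-in, which is an assumption and may not match `wine.csv` exactly; the 5 folds, 50 estimators and variable names are also illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data bundled with scikit-learn (13 features, 3 classes).
X_demo, y_demo = load_wine(return_X_y=True)

single = KNeighborsClassifier()
bagged = BaggingClassifier(KNeighborsClassifier(), n_estimators=50, random_state=42)

# Weighted F1 on each of 5 folds, matching the metric used above.
single_f1 = cross_val_score(single, X_demo, y_demo, cv=5, scoring="f1_weighted")
bagged_f1 = cross_val_score(bagged, X_demo, y_demo, cv=5, scoring="f1_weighted")

print("single KNN mean F1:", single_f1.mean())
print("bagged KNN mean F1:", bagged_f1.mean())
```

Bagging a stable learner such as KNN does not always improve the score, so the two means may turn out close; the benefit tends to be larger for high-variance learners such as deep decision trees.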