# Model Building:

## Bagging and Boosting techniques applied to the bank dataset:
- Use every base algorithm with both techniques, then compare which technique performs best with which base algorithm

## Objective: 
- Here we apply every base classifier with both the bagging technique and the boosting technique, then compare which model performs best with which base algorithm

- Problem statement: Predict if a customer will opt for the term deposit or not (Binary Classification problem)
- target variable = y
- y = 'yes' means the customer opted for the term deposit
- y = 'no' means the customer didn't opt for the term deposit

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler       #scaling is needed for KNN, linear models, etc. (not for tree-based algorithms)
from sklearn.model_selection import train_test_split   

#bagging and boosting techniques:
from sklearn.ensemble import BaggingClassifier        
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier


#All the base algo:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier   
from sklearn.linear_model import LogisticRegression


#accuracy:
from sklearn.metrics import accuracy_score      

In [2]:
df = pd.read_csv(r'C:\users\91842\Downloads\bank-dataset.csv')
df.head()

Unnamed: 0,age,job,marital,education,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,no,no,telephone,may,mon,149,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,telephone,may,mon,151,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,yes,telephone,may,mon,307,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [5]:
df.dtypes

age                 int64
job                object
marital            object
education          object
housing            object
loan               object
contact            object
month              object
day_of_week        object
duration            int64
campaign            int64
pdays               int64
previous            int64
poutcome           object
emp.var.rate      float64
cons.price.idx    float64
cons.conf.idx     float64
euribor3m         float64
nr.employed       float64
y                  object
dtype: object

In [3]:
df.isnull().sum()

age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
y                 0
dtype: int64

In [4]:
df.shape

(41188, 20)

Check if the dataset is imbalanced or not:

In [4]:
#The proportion looks like
(df['y'].value_counts()/df.shape[0])*100

no     88.734583
yes    11.265417
Name: y, dtype: float64

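- So the classes are imbalanced: about 89% 'no' versus 11% 'yes'.

- This is milder than an extreme split such as 98% and 2%, and no resampling is applied here
- Still, a model that always predicts 'no' would already reach about 88.7% accuracy, so the accuracies below should be read against that baseline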
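One way to put the 89% / 11% split in context is to compute the accuracy of a model that always predicts the majority class. The sketch below is illustrative only: it rebuilds the target from class counts derived from the proportions above and the 41188 rows, and the names `y_all`, `X_dummy` and `baseline_acc` are placeholders, not part of the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Class counts implied by the proportions above: 36548 'no' (0), 4640 'yes' (1).
y_all = np.array([0] * 36548 + [1] * 4640)

# DummyClassifier ignores the features, so a column of zeros is enough.
X_dummy = np.zeros((len(y_all), 1))

# Always predict the most frequent class and measure the resulting accuracy.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_dummy, y_all)
baseline_acc = accuracy_score(y_all, baseline.predict(X_dummy))

print(f"Majority-class baseline accuracy: {baseline_acc:.4f}")
```

Any classifier evaluated on this data is only useful to the extent that it beats this figure.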

- Data preprocessing: process the data before passing it to the model. Here we need to apply encoding (one of the data preprocessing steps) because many categorical columns are present.

In [5]:
x_features = list(df.columns)
x_features.remove('y')
x_features

['age',
 'job',
 'marital',
 'education',
 'default',
 'housing',
 'loan',
 'contact',
 'month',
 'day_of_week',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'poutcome',
 'emp.var.rate',
 'cons.price.idx',
 'cons.conf.idx',
 'euribor3m',
 'nr.employed']

In [6]:
#Dropping 'month' and 'day_of_week' from the x_features
x_features.remove('month')
x_features.remove('day_of_week')

In [7]:
#encoding of all the categorical independent variables:
encoded_data = pd.get_dummies(df[x_features], drop_first=True)
encoded_data.head(3)

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,...,education_high.school,education_illiterate,education_professional.course,education_university.degree,default_yes,housing_yes,loan_yes,contact_telephone,poutcome_nonexistent,poutcome_success
0,56.0,261.0,1.0,999.0,0.0,1.1,93.994,-36.4,4.857,5191.0,...,0,0,0,0,0,0,0,1,1,0
1,57.0,149.0,1.0,999.0,0.0,1.1,93.994,-36.4,4.857,5191.0,...,1,0,0,0,0,0,0,1,1,0
2,37.0,226.0,1.0,999.0,0.0,1.1,93.994,-36.4,4.857,5191.0,...,1,0,0,0,0,1,0,1,1,0
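To make the effect of `drop_first=True` concrete, here is a minimal sketch on a tiny hypothetical frame (the rows and the name `toy` are made up for illustration). It assumes a pandas version where `get_dummies` accepts `dtype`; passing `dtype=int` keeps the 0/1 output shown above, since newer pandas releases return booleans by default.

```python
import pandas as pd

# Toy frame (hypothetical rows): one numeric and two categorical columns.
toy = pd.DataFrame({
    "age": [56, 37, 40],
    "marital": ["married", "single", "married"],
    "housing": ["no", "yes", "no"],
})

# drop_first=True drops the first (sorted) level of each categorical column,
# so 'married' and 'no' become the implicit reference levels.
# dtype=int forces 0/1 values instead of True/False.
toy_encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

print(toy_encoded.columns.tolist())
print(toy_encoded)
```

Numeric columns pass through unchanged, and each categorical column with k levels contributes k-1 indicator columns.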


In [8]:
#encoding of the categorical dependent variable:
df['y'] = df['y'].map(lambda x: 1 if x=='yes' else 0)
df['y'].value_counts()

0    36548
1     4640
Name: y, dtype: int64

Prepare x and y for model building:

In [9]:
#prepare x and y:
x = encoded_data
y = df['y']

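Scale the data before applying the KNN algorithm. Note that the scaler below is fit on the full dataset, before the train/test split, so the test rows influence the scaling statistics: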

In [15]:
x_scalar = StandardScaler()
x_scaled = x_scalar.fit_transform(x)

Model building: We have performed all the data preprocessing required for this data (encoding and scaling)
- Now split the data and then build the model

In [30]:
x_train, x_test, y_train, y_test = train_test_split(x_scaled,
                                                   y,
                                                   test_size=0.2,
                                                   random_state=42)
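Because the scaler in this notebook is fit before the split, the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only, for example inside a `Pipeline`. The sketch below uses synthetic data from `make_classification` as a stand-in for the encoded bank data; the names `X_demo`, `y_demo` and `knn_pipe` are illustrative assumptions, and `stratify` is an optional addition that keeps the class ratio similar in both parts.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded bank data (roughly 89% / 11% classes).
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.89, 0.11], random_state=42)

# Split first; stratify keeps the class ratio similar in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=42)

# The scaler is fit inside the pipeline, on the training rows only,
# and the same fitted statistics are reused when scoring the test rows.
knn_pipe = Pipeline([("scaler", StandardScaler()),
                     ("knn", KNeighborsClassifier())])
knn_pipe.fit(X_tr, y_tr)

test_acc = knn_pipe.score(X_te, y_te)
print(f"Test accuracy: {test_acc:.3f}")
```

With 41188 rows the leakage from scaling is likely small, but the pipeline form also makes cross-validation safe without extra code.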

### KNN Without Bagging:

- First apply KNN without Bagging and note the accuracy
- Then apply KNN with Bagging and note the accuracy
- Finally compare the two accuracies

In [17]:
#Apply KNN algo without Bagging and notice the accuracy: 

#create an instance
knn = KNeighborsClassifier()

#fit the train data to the model
knn.fit(x_train, y_train)

#predict using test set
y_pred_knn = knn.predict(x_test)

#check the accuracy of the model
accuracy_score(y_test, y_pred_knn)

0.8988832240835154

So, without the Bagging technique the KNN model correctly classifies about 89.9% of the test data.

- Now let's use bagging over our KNN classifier and see if our score improves or not:

### KNN With Bagging:

In [18]:
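#the base model is passed positionally: its keyword is `base_estimator` in older
#scikit-learn releases and `estimator` in newer ones
bagging_knn = BaggingClassifier(KNeighborsClassifier(),
                                bootstrap=True,
                                oob_score=True)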

#fit the train data to the bagging model
bagging_knn.fit(x_train, y_train)

#predict using test set
y_pred_bagging_knn = bagging_knn.predict(x_test)

#check the accuracy of the model
accuracy_score(y_test, y_pred_bagging_knn)

  warn("Some inputs do not have OOB scores. "
  predictions.sum(axis=1)[:, np.newaxis])


0.8976693372177713

In [44]:
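print('KNN without Bagging accuracy:', 89.88)
print('KNN with Bagging accuracy:', 89.76)

KNN without Bagging accuracy: 89.88
KNN with Bagging accuracy: 89.76


- So, accuracy from KNN = 89.88%
- And accuracy from Bagging with KNN = 89.76%
- Bagging did not improve KNN here. The two scores differ by only about 0.1 percentage points, which is small enough to come from the random bootstrap sampling, so this result gives no evidence that the single KNN model overfitted.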
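A gap of about 0.1 percentage points on a single train/test split is hard to interpret, because a different split or bootstrap seed could move it either way. One way to check is to compare the two models with cross-validation and look at the spread of the fold scores. This is a sketch on synthetic data, not on the bank dataset; `X_demo`, `y_demo`, `models` and `cv_scores` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with a similar class ratio (about 89% / 11%).
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     weights=[0.89, 0.11], random_state=0)

models = {
    "KNN": KNeighborsClassifier(),
    # The base model is passed positionally so the call works whether the
    # keyword is named `base_estimator` or `estimator`.
    "Bagged KNN": BaggingClassifier(KNeighborsClassifier(),
                                    n_estimators=10, random_state=0),
}

# 5-fold cross-validated accuracy for each model.
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.4f}, std={scores.std():.4f}")
```

If the difference between the means is smaller than the fold-to-fold standard deviation, the two models are effectively tied.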

### DT Without Bagging:

- First apply DT without Bagging and notice the accuracy
- Then apply DT algo with Bagging and notice the accuracy

In [20]:
#create an instance
dt_clf = DecisionTreeClassifier()

#fit the train data to the model
dt_clf.fit(x_train, y_train)

#predict using test set
y_pred_dt = dt_clf.predict(x_test)

#check the accuracy of the model
accuracy_score(y_test, y_pred_dt)

0.8855304685603301

### DT With Bagging:

In [21]:
bagging_dt = BaggingClassifier(DecisionTreeClassifier(),
                               bootstrap=True,
                               oob_score=True)

#fit the train data to the bagging model
bagging_dt.fit(x_train, y_train)

#predict using test set
y_pred_bagging_dt = bagging_dt.predict(x_test)

#check the accuracy of the model
accuracy_score(y_test, y_pred_bagging_dt)

  warn("Some inputs do not have OOB scores. "
  predictions.sum(axis=1)[:, np.newaxis])


0.9118718135469774

In [46]:
print('DT without Bagging accuracy:', 88.55)
print('DT with Bagging accuracy:', 91.18)

DT without Bagging accuracy: 88.55
DT with Bagging accuracy: 91.18


- So, accuracy from a single DT = 88.55%
- And accuracy from Bagging with DT = 91.18%
- So the bagging technique gives higher accuracy than a single DT.
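The bagging cells request `oob_score=True` but never read the score, and the "Some inputs do not have OOB scores" warning in the outputs is what scikit-learn typically emits when too few estimators are used (the default is 10). The sketch below, on synthetic data, shows why: with 10 bootstrap samples roughly 1% of rows are never left out of a bag, while with 100 that fraction is negligible. The names `X_demo`, `y_demo`, `bag` and `oob_acc` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# A bootstrap sample contains about 63.2% of the distinct rows, so the chance
# that one row lands in every one of n samples (never out-of-bag) is 0.632**n.
p_in_bag = 0.632
never_oob_10 = p_in_bag ** 10
never_oob_100 = p_in_bag ** 100
print(f"never out-of-bag with 10 estimators:  {never_oob_10:.4f}")
print(f"never out-of-bag with 100 estimators: {never_oob_100:.2e}")

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     random_state=0)

# With 100 estimators practically every row gets an out-of-bag prediction.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        bootstrap=True, oob_score=True, random_state=0)
bag.fit(X_demo, y_demo)

# oob_score_ is the accuracy measured on rows each tree did not train on.
oob_acc = bag.oob_score_
print(f"OOB accuracy: {oob_acc:.3f}")
```

The out-of-bag accuracy gives a validation-style estimate without touching the test set.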

### Bagging with Random Forest (by default the base algo is Decision Tree):

In [22]:
bagging_rf = BaggingClassifier(RandomForestClassifier(),
                               bootstrap=True,
                               oob_score=True)

#fit the train data to the bagging model
bagging_rf.fit(x_train, y_train)

#predict using test set
y_pred_bagging_rf = bagging_rf.predict(x_test)

#check the accuracy of the model
accuracy_score(y_test, y_pred_bagging_rf)

  warn("Some inputs do not have OOB scores. "
  predictions.sum(axis=1)[:, np.newaxis])


0.9151493080844866

In [47]:
print('RF with Bagging accuracy:', 91.51)

RF with Bagging accuracy: 91.51


So, Bagging with RF is giving accuracy = 91.51%
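A Random Forest is itself a bagging ensemble of decision trees (with extra feature sampling at each split), so wrapping it in `BaggingClassifier` bags whole forests. For comparison, a plain `RandomForestClassifier` can be fit directly and already exposes an out-of-bag score. This is a sketch on synthetic data; `X_demo`, `y_demo`, `rf`, `n_trees` and `rf_oob` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     random_state=0)

# bootstrap=True is the default, so each tree trains on a bootstrap sample.
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X_demo, y_demo)

# The fitted forest is a list of individual decision trees.
n_trees = len(rf.estimators_)
first_is_tree = isinstance(rf.estimators_[0], DecisionTreeClassifier)
rf_oob = rf.oob_score_

print(f"trees: {n_trees}, first is a decision tree: {first_is_tree}")
print(f"OOB accuracy: {rf_oob:.3f}")
```

Comparing this plain forest with the bagged forest on the same split would show how much the extra bagging layer actually adds.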

### Logistic Regression without Bagging Technique:

In [31]:
#create an instance
LR_clf = LogisticRegression()

#fit the train data to the model
LR_clf.fit(x_train, y_train)

#predict using test set
y_pred_LR = LR_clf.predict(x_test)

#check the accuracy of the model
accuracy_score(y_test, y_pred_LR)

0.9093226511289147

### Logistic Regression with Bagging Technique:

In [32]:
bagging_LR = BaggingClassifier(LogisticRegression(),
                               bootstrap=True,
                               oob_score=True)

#fit the train data to the bagging model
bagging_LR.fit(x_train, y_train)

#predict using test set
y_pred_bagging_LR = bagging_LR.predict(x_test)

#check the accuracy of the model
accuracy_score(y_test, y_pred_bagging_LR)

  warn("Some inputs do not have OOB scores. "
  predictions.sum(axis=1)[:, np.newaxis])


0.9094440398154892

- Logistic regression accuracy = 90.93%
- Logistic regression with Bagging accuracy = 90.94%
- So the two are almost the same

### Conclusion : 
- Using Bagging technique with Random Forest has given the highest accuracy for this particular data = 91.51%

## BOOSTING TECHNIQUES:

Decision Tree with Boosting Technique (AdaBoost's default base estimator is a decision tree of depth 1):

In [39]:
DT_clf = AdaBoostClassifier()
DT_clf.fit(x_train, y_train)
y_pred = DT_clf.predict(x_test)
accuracy_score(y_test,y_pred)

0.9090798737557659

Boosting using DT is giving accuracy = 90.9%
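When `AdaBoostClassifier()` is called with no arguments, the base learner is a decision stump (a tree with `max_depth=1`), not a full-depth tree like the `DecisionTreeClassifier()` used in the bagging section. The sketch below checks this on synthetic data and shows how a deeper tree could be boosted instead; the depth of 3 and the names `default_boost` and `deeper_boost` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     random_state=0)

# Default AdaBoost: every weak learner is a depth-1 tree (a stump).
default_boost = AdaBoostClassifier(random_state=0).fit(X_demo, y_demo)
default_depth = default_boost.estimators_[0].max_depth

# Boosting a deeper tree: the base model is passed positionally so the call
# works whether the keyword is named `base_estimator` or `estimator`.
deeper_boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                                  random_state=0).fit(X_demo, y_demo)
deeper_depth = deeper_boost.estimators_[0].max_depth

print("default base learner max_depth:", default_depth)
print("custom base learner max_depth:", deeper_depth)
```

This matters when comparing bagging and boosting with a "Decision Tree" base, since the two sections use trees of very different capacity.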

Random Forest with Boosting Technique:

In [38]:
#base model passed positionally (the keyword is `base_estimator` in older scikit-learn, `estimator` in newer)
rf_clf = AdaBoostClassifier(RandomForestClassifier())
rf_clf.fit(x_train, y_train)
y_pred = rf_clf.predict(x_test)
accuracy_score(y_test,y_pred)

0.9135712551590192

So, RF with Boosting accuracy = 91.4%

Logistic Regression with Boosting Technique:

In [40]:
LR_clf = AdaBoostClassifier(LogisticRegression())
LR_clf.fit(x_train, y_train)
y_pred = LR_clf.predict(x_test)
accuracy_score(y_test,y_pred)

0.9048312697256615

So, LR with Boosting accuracy = 90.5%

### Conclusion : 
- Using Boosting technique with Random Forest has given the highest accuracy for this particular data = 91.4%

Conclusion based on evaluation of the different models:
- Random Forest has outperformed the other models for this data
- Bagging and Boosting accuracy for RF is almost the same (about 91.5% and 91.4%)
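To compare all the runs side by side, the test accuracies printed in the cells above can be collected into one table and set against the majority-class share. The sketch below only re-enters numbers already shown in this notebook; the column names and the `gain_over_baseline` column are illustrative additions, and the baseline uses the overall class share (36548 of 41188 rows), which may differ slightly from the share in the test split.

```python
import pandas as pd

# Test accuracies copied from the cell outputs above.
results = pd.DataFrame([
    ("None", "KNN", 0.8988832240835154),
    ("Bagging", "KNN", 0.8976693372177713),
    ("None", "Decision Tree", 0.8855304685603301),
    ("Bagging", "Decision Tree", 0.9118718135469774),
    ("Bagging", "Random Forest", 0.9151493080844866),
    ("None", "Logistic Regression", 0.9093226511289147),
    ("Bagging", "Logistic Regression", 0.9094440398154892),
    ("Boosting", "Decision Tree (AdaBoost default)", 0.9090798737557659),
    ("Boosting", "Random Forest", 0.9135712551590192),
    ("Boosting", "Logistic Regression", 0.9048312697256615),
], columns=["technique", "base_model", "test_accuracy"])

# Accuracy of always predicting 'no', from the overall class counts.
majority_baseline = 36548 / 41188
results["gain_over_baseline"] = results["test_accuracy"] - majority_baseline

# Best model first.
results = results.sort_values("test_accuracy", ascending=False)
results = results.reset_index(drop=True)
print(results.round(4).to_string(index=False))
```

Read this way, the best model improves on the majority-class share by under three percentage points, so metrics such as recall or F1 for the 'yes' class would be a useful complement to accuracy.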