# Ensemble


Ensemble learning improves machine learning results by combining several models, which often gives better predictive performance than a single model.

Most ensemble methods use a single base learning algorithm, i.e. learners of the same type, leading to homogeneous ensembles.

Some methods instead use learners of different types, leading to heterogeneous ensembles. For an ensemble to be more accurate than any of its individual members, the base learners have to be as accurate as possible and as diverse as possible.
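
The effect of combining diverse learners can be illustrated with a small calculation. The sketch below assumes an idealized two-class problem in which every base learner is correct with the same probability and the learners make independent errors; real ensembles only approximate this. The function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb

def majority_vote_accuracy(n_learners: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent learners is correct.

    Assumes a two-class problem, an odd number of learners, and independent
    errors (the idealized "as diverse as possible" case).
    """
    k_min = n_learners // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_learners, k) * p_correct**k * (1 - p_correct) ** (n_learners - k)
        for k in range(k_min, n_learners + 1)
    )

# Each learner alone is right 60% of the time
for n in (1, 5, 25):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the ensemble accuracy grows with the number of learners when each learner is better than chance, and shrinks when each learner is worse than chance, which is why both accuracy and diversity matter.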



### Bagging

Bagging stands for bootstrap aggregation. One way to reduce the variance of an estimate is to average together multiple estimates. For example, we can train M different trees on different subsets of the data (chosen randomly with replacement).

Bagging uses bootstrap sampling to obtain the data subsets for training the base learners. For aggregating the outputs of base learners, bagging uses voting for classification and averaging for regression.
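
The two ingredients described above, bootstrap sampling and aggregation, can be sketched in plain Python. This is only an illustration of the idea, not what `BaggingClassifier` does internally; the helper names are invented for the example.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items uniformly at random, with replacement."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(labels):
    """Aggregate class labels from the base learners (classification)."""
    return Counter(labels).most_common(1)[0][0]

def average(values):
    """Aggregate numeric outputs from the base learners (regression)."""
    return sum(values) / len(values)

rng = random.Random(30)
data = list(range(10))  # stand-in for the row indices of a training set

# Each base learner would be trained on its own bootstrap sample
samples = [bootstrap_sample(data, rng) for _ in range(3)]
for s in samples:
    print(sorted(s))  # same size as the data; rows can repeat or be left out

print(majority_vote(["A", "B", "A"]))  # A
print(average([2.0, 3.0, 4.0]))        # 3.0
```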

### Boosting

Boosting is a general ensemble method that creates a strong classifier from a number of weak classifiers.

This is done by building a model from the training data, then creating a second model that attempts to correct the errors of the first model. Models are added until the training set is predicted perfectly or a maximum number of models is reached.

AdaBoost is one of the most successful boosting algorithms developed for binary classification.
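
To make "correcting the errors of the previous model" concrete, here is a minimal sketch of one round of the discrete AdaBoost weight update for labels in {-1, +1}. It assumes the sample weights sum to 1 and that the weak learner's weighted error is strictly between 0 and 1; the function name `adaboost_round` is made up for this illustration.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One round of discrete AdaBoost for labels in {-1, +1}.

    Returns the weak learner's vote weight (alpha) and the re-normalized
    sample weights. Assumes 0 < weighted error < 1.
    """
    # Weighted error of the weak learner
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Learners with lower error get a larger say in the final vote
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples (t * p == -1) are up-weighted, the rest down-weighted
    new_w = [w * math.exp(-alpha * t * p)
             for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, -1, -1, -1, +1]  # the weak learner gets the second sample wrong
weights = [0.2] * 5            # start from uniform weights

alpha, weights = adaboost_round(weights, y_true, y_pred)
print(alpha)
print(weights)
```

After the update the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.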

### Import all the libraries required

The libraries used in the ensemble experiments are imported below.

In [1]:
import pandas as pd
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import accuracy_score

### Load the "letter-recognition" data

In [2]:
# import dataset
df = pd.read_csv("letter-recognition.data.txt", header = None)
df.head()

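| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T | 2 | 8 | 3 | 5 | 1 | 8 | 13 | 0 | 6 | 6 | 10 | 8 | 0 | 8 | 0 | 8 |
| 1 | I | 5 | 12 | 3 | 7 | 2 | 10 | 5 | 5 | 4 | 13 | 3 | 9 | 2 | 8 | 4 | 10 |
| 2 | D | 4 | 11 | 6 | 8 | 6 | 10 | 6 | 2 | 6 | 10 | 3 | 7 | 3 | 7 | 3 | 9 |
| 3 | N | 7 | 11 | 6 | 6 | 3 | 5 | 9 | 4 | 6 | 4 | 4 | 10 | 6 | 10 | 2 | 8 |
| 4 | G | 2 | 1 | 3 | 1 | 1 | 8 | 6 | 6 | 6 | 6 | 5 | 9 | 1 | 7 | 5 | 10 |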


### Split the dataset into training and testing parts (70-30 ratio with a random state value 30)

In [3]:
# Select the independent variables and the target attribute
X = df[df.columns[1:]] # independent variables (the 16 numeric feature columns)
Y = df[df.columns[0]] # target column (the letter label)
X.head()

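| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 8 | 3 | 5 | 1 | 8 | 13 | 0 | 6 | 6 | 10 | 8 | 0 | 8 | 0 | 8 |
| 1 | 5 | 12 | 3 | 7 | 2 | 10 | 5 | 5 | 4 | 13 | 3 | 9 | 2 | 8 | 4 | 10 |
| 2 | 4 | 11 | 6 | 8 | 6 | 10 | 6 | 2 | 6 | 10 | 3 | 7 | 3 | 7 | 3 | 9 |
| 3 | 7 | 11 | 6 | 6 | 3 | 5 | 9 | 4 | 6 | 4 | 4 | 10 | 6 | 10 | 2 | 8 |
| 4 | 2 | 1 | 3 | 1 | 1 | 8 | 6 | 6 | 6 | 6 | 5 | 9 | 1 | 7 | 5 | 10 |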


In [4]:
# Divide the dataset into training and testing partition
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)

### Q1. Ensemble Method by manipulation of Dataset (Bagged Decision Trees)

Bagging performs best with algorithms that have high variance. Decision trees, often grown without pruning, are a popular example.

We will create decision tree classifiers with and without the bagging ensemble method and compare their performance.

In [5]:
# Implement the decision tree classifier using entropy and random state value as 30
dtree_entropy = DecisionTreeClassifier(criterion='entropy', random_state = 30) 

In [6]:
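# Fit on the full training split so the model can predict on the test set later
dtree_entropy = dtree_entropy.fit(X_train,Y_train)
# Use k-fold cross validation with k=5 (cross_val_score fits clones on each fold)
scores = cross_val_score(dtree_entropy, X_train, Y_train, cv=5, scoring='accuracy')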
print('scores: ', scores)
print('mean score: ', scores.mean())

scores:  [0.865      0.86964286 0.85214286 0.86       0.87      ]
mean score:  0.8633571428571427


### Prediction and Evaluation

In [7]:
# Predict results on the testing part
predictions_entropy = dtree_entropy.predict(X_test)

In [8]:
# Calculate and print confusion matrix and other performance measures 
print(classification_report(Y_test,predictions_entropy))
print("Confusion Matrix")
print(confusion_matrix(Y_test,predictions_entropy))
print("\n Accuracy")
print(accuracy_score(Y_test,predictions_entropy))

              precision    recall  f1-score   support

           A       0.93      0.91      0.92       229
           B       0.82      0.83      0.83       228
           C       0.91      0.88      0.89       220
           D       0.78      0.86      0.82       219
           E       0.84      0.87      0.85       232
           F       0.83      0.77      0.80       225
           G       0.87      0.80      0.83       234
           H       0.74      0.79      0.76       206
           I       0.88      0.92      0.90       236
           J       0.90      0.89      0.90       209
           K       0.83      0.84      0.84       213
           L       0.92      0.92      0.92       239
           M       0.91      0.91      0.91       240
           N       0.89      0.87      0.88       239
           O       0.90      0.81      0.85       243
           P       0.85      0.92      0.88       243
           Q       0.87      0.82      0.84       228
           R       0.81    

### Comparison with Bagged Decision Tree

In [10]:
# Create a bagging model from 5 decision tree classifiers
from sklearn.ensemble import BaggingClassifier

seed = 30
dtree = DecisionTreeClassifier(criterion='entropy', random_state=seed)
num_trees = 5
# The base estimator is passed positionally because its keyword was renamed
# from base_estimator to estimator in scikit-learn 1.2
model = BaggingClassifier(dtree, n_estimators=num_trees, random_state=seed)

In [11]:
# Fit on the full training split (used for the test-set predictions below)
model = model.fit(X_train,Y_train)
# Use k-fold cross validation with k=5
scores = cross_val_score(model, X_train, Y_train, cv=5, scoring='accuracy')
print('scores: ', scores)
print('mean score: ', scores.mean())

scores:  [0.88714286 0.89035714 0.89       0.88285714 0.88714286]
mean score:  0.8875


### Prediction and Evaluation

In [13]:
# Predict results on the testing part
predictions_model = model.predict(X_test)

In [14]:
# Calculate and print confusion matrix and other performance measures 
print(classification_report(Y_test,predictions_model))
print("Confusion Matrix")
print(confusion_matrix(Y_test,predictions_model))
print("\n Accuracy")
print(accuracy_score(Y_test,predictions_model))

              precision    recall  f1-score   support

           A       0.93      0.98      0.95       229
           B       0.78      0.92      0.84       228
           C       0.88      0.89      0.89       220
           D       0.78      0.90      0.84       219
           E       0.84      0.91      0.87       232
           F       0.87      0.81      0.84       225
           G       0.85      0.82      0.84       234
           H       0.82      0.86      0.84       206
           I       0.90      0.93      0.91       236
           J       0.93      0.89      0.91       209
           K       0.87      0.92      0.89       213
           L       0.94      0.92      0.93       239
           M       0.93      0.93      0.93       240
           N       0.96      0.90      0.92       239
           O       0.87      0.81      0.84       243
           P       0.89      0.93      0.91       243
           Q       0.89      0.89      0.89       228
           R       0.89    

### Q2. Ensemble Method by manipulation of Classifiers (using Voting Classifier)

The VotingClassifier takes a list of (name, estimator) pairs and a voting method. **Hard** voting uses the predicted labels and majority rule, while **soft** voting sums the predicted class probabilities across the classifiers and predicts the label with the largest sum (the argmax).

After we provide the desired classifiers, we need to fit the resulting ensemble classifier object. We can then get predictions and use accuracy metrics.
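
The difference between the two voting methods is easiest to see on a single sample where they disagree. The sketch below is plain Python with made-up probabilities; `hard_vote` and `soft_vote` are illustrative helpers, not scikit-learn functions.

```python
from collections import Counter

def hard_vote(predicted_labels):
    """Majority rule over the labels predicted by each classifier."""
    return Counter(predicted_labels).most_common(1)[0][0]

def soft_vote(probabilities, classes):
    """Argmax of the summed class probabilities.

    probabilities holds one list per classifier, aligned with classes.
    """
    totals = [sum(column) for column in zip(*probabilities)]
    return classes[totals.index(max(totals))]

classes = ["A", "B"]
probs = [
    [0.55, 0.45],  # classifier 1 leans slightly towards A
    [0.60, 0.40],  # classifier 2 leans slightly towards A
    [0.05, 0.95],  # classifier 3 is very confident in B
]

# Each classifier's own label is its most probable class
labels = [classes[p.index(max(p))] for p in probs]

print(hard_vote(labels))          # A (two votes against one)
print(soft_vote(probs, classes))  # B (summed probability 1.80 against 1.20)
```

Hard voting only needs predicted labels, whereas soft voting requires every base classifier to produce class probabilities.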

In [16]:
#Import required library
from sklearn.ensemble import VotingClassifier

In [17]:
# Implement the different classifiers
cf1 = DecisionTreeClassifier(criterion='entropy', random_state=30)
cf2 = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
cf3 = GaussianNB()


In [18]:
# Build Voting Classifier using above estimators and hard voting method
# Function to be used: VotingClassifier(estimators, voting)
# estimators is a list of ('base classifier name', variable_name) pairs
voter = VotingClassifier([('dtree',cf1),('knn',cf2),('gnb',cf3)], voting='hard')


In [19]:
# Fit the voting classifier on the full training split
voter = voter.fit(X_train,Y_train)
# Print scores using k-fold cross validation with k=5
scores = cross_val_score(voter, X_train, Y_train, cv=5, scoring='accuracy')
print('scores: ', scores)
print('mean score: ', scores.mean())

scores:  [0.89428571 0.90714286 0.895      0.89857143 0.90142857]
mean score:  0.8992857142857142


### Prediction and Evaluation

In [21]:
# Predict results on the testing part
predictions_voter = voter.predict(X_test)

In [22]:
# Calculate and print confusion matrix and other performance measures 
print(classification_report(Y_test,predictions_voter))
print("Confusion Matrix")
print(confusion_matrix(Y_test,predictions_voter))
print("\n Accuracy")
print(accuracy_score(Y_test,predictions_voter))

              precision    recall  f1-score   support

           A       0.91      1.00      0.95       229
           B       0.68      0.95      0.79       228
           C       0.91      0.95      0.93       220
           D       0.70      0.95      0.81       219
           E       0.85      0.94      0.90       232
           F       0.86      0.88      0.87       225
           G       0.90      0.87      0.89       234
           H       0.82      0.83      0.82       206
           I       0.91      0.94      0.92       236
           J       0.95      0.88      0.92       209
           K       0.89      0.82      0.86       213
           L       0.97      0.93      0.95       239
           M       0.87      0.96      0.91       240
           N       0.97      0.90      0.94       239
           O       0.92      0.89      0.91       243
           P       0.95      0.93      0.94       243
           Q       0.96      0.87      0.91       228
           R       0.93    

### Q3. Manipulating the features

In [101]:
# Generate five random subsets of 10 feature columns
print(list(X.sample(10,axis=1,random_state=1).columns))
print(list(X.sample(10,axis=1,random_state=2).columns))
print(list(X.sample(10,axis=1,random_state=3).columns))
print(list(X.sample(10,axis=1,random_state=4).columns))
print(list(X.sample(10,axis=1,random_state=5).columns))


[4, 14, 8, 3, 7, 11, 5, 2, 15, 1]
[13, 5, 6, 1, 10, 4, 2, 11, 8, 15]
[8, 14, 5, 2, 7, 3, 13, 16, 6, 1]
[13, 1, 7, 4, 5, 10, 12, 3, 16, 14]
[6, 2, 8, 3, 11, 16, 12, 5, 9, 10]


In [54]:
# Model 1
# Select a random subset of 10 feature columns
X1_train=X_train.sample(10,axis=1,random_state=1)
# Train the model on the selected features only
dtree_entropy = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dtree_entropy_1 = dtree_entropy.fit(X1_train,Y_train)

In [55]:
# Model 2
# Select a random subset of 10 feature columns
X2_train=X_train.sample(10,axis=1,random_state=2)
# Train the model on the selected features only
dtree_entropy = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dtree_entropy_2 = dtree_entropy.fit(X2_train,Y_train)

In [56]:
# Model 3
# Select a random subset of 10 feature columns
X3_train=X_train.sample(10,axis=1,random_state=3)
# Train the model on the selected features only
dtree_entropy = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dtree_entropy_3 = dtree_entropy.fit(X3_train,Y_train)

In [57]:
# Model 4
# Select a random subset of 10 feature columns
X4_train=X_train.sample(10,axis=1,random_state=4)
# Train the model on the selected features only
dtree_entropy = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dtree_entropy_4 = dtree_entropy.fit(X4_train,Y_train)

In [58]:
# Model 5
# Select a random subset of 10 feature columns
X5_train=X_train.sample(10,axis=1,random_state=5)
# Train the model on the selected features only
dtree_entropy = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dtree_entropy_5 = dtree_entropy.fit(X5_train,Y_train)

In [59]:
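# Apply Voting Classifier
# Caveat: VotingClassifier.fit refits clones of the listed estimators on the
# data passed here, so all five trees are retrained on the full X_train and the
# 10-feature subsets selected above are not used by this ensemble.
voter= VotingClassifier([('dtree1',dtree_entropy_1),('dtree2',dtree_entropy_2),('dtree3',dtree_entropy_3),('dtree4',dtree_entropy_4),('dtree5',dtree_entropy_5)],voting='hard')
voter = voter.fit(X_train,Y_train)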

In [60]:
# Calculate and print confusion matrix and other performance measures 
predictions_attribute = voter.predict(X_test)
print(classification_report(Y_test,predictions_attribute))
print("Confusion Matrix")
print(confusion_matrix(Y_test,predictions_attribute))
print("\n Accuracy")
print(accuracy_score(Y_test,predictions_attribute))

              precision    recall  f1-score   support

           A       0.93      0.91      0.92       229
           B       0.82      0.83      0.83       228
           C       0.91      0.88      0.89       220
           D       0.78      0.86      0.82       219
           E       0.84      0.87      0.85       232
           F       0.83      0.77      0.80       225
           G       0.87      0.80      0.83       234
           H       0.74      0.79      0.76       206
           I       0.88      0.92      0.90       236
           J       0.90      0.89      0.90       209
           K       0.83      0.84      0.84       213
           L       0.92      0.92      0.92       239
           M       0.91      0.91      0.91       240
           N       0.89      0.87      0.88       239
           O       0.90      0.81      0.85       243
           P       0.85      0.92      0.88       243
           Q       0.87      0.82      0.84       228
           R       0.81    
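
Because `VotingClassifier.fit` refits clones of its estimators on whatever data it receives, a feature-subset ensemble needs each model to see only its own columns at both training and prediction time. The sketch below shows one way to do this with manual majority voting. It is plain Python on toy data, and `OneNearestNeighbour` is a hypothetical stand-in for the decision trees used in this notebook; all names here are invented for the illustration.

```python
import random
from collections import Counter

class OneNearestNeighbour:
    """Tiny stand-in base learner (hypothetical, for illustration only)."""

    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, X):
        def nearest_label(row):
            dists = [sum((a - b) ** 2 for a, b in zip(row, ref)) for ref in self.X]
            return self.y[dists.index(min(dists))]
        return [nearest_label(row) for row in X]

def take_columns(X, cols):
    """Keep only the given feature columns of every row."""
    return [[row[c] for c in cols] for row in X]

def fit_subspace_ensemble(X, y, n_models, n_features, seed):
    """Train each model on its own random subset of feature columns."""
    rng = random.Random(seed)
    ensemble = []
    for _ in range(n_models):
        cols = rng.sample(range(len(X[0])), n_features)
        model = OneNearestNeighbour().fit(take_columns(X, cols), y)
        ensemble.append((cols, model))  # remember which columns the model uses
    return ensemble

def predict_subspace_ensemble(ensemble, X):
    """Each model predicts from its own columns; labels are combined by majority."""
    per_model = [model.predict(take_columns(X, cols)) for cols, model in ensemble]
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*per_model)]

X_toy = [[0, 0, 0, 0], [0, 1, 0, 1], [5, 5, 5, 5], [5, 4, 5, 4]]
y_toy = ["A", "A", "B", "B"]
ensemble = fit_subspace_ensemble(X_toy, y_toy, n_models=5, n_features=2, seed=30)
print(predict_subspace_ensemble(ensemble, [[0, 0, 1, 0], [5, 5, 4, 5]]))  # ['A', 'B']
```

In scikit-learn, `BaggingClassifier` exposes `max_features`, `bootstrap` and `bootstrap_features` parameters that can draw random feature subsets per estimator, which may be a more convenient route than hand-written voting.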

### Q4. Manipulating the classes

In [78]:
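# Generate 5 sets of two class representation
import string
import random
# Call random.seed(30); the original "random.seed=30" overwrote the function
# and seeded nothing, so the sets printed below came from an unseeded run
random.seed(30)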
alphabet_list=(list(string.ascii_uppercase))
S1=random.sample(alphabet_list,13)
S2=random.sample(alphabet_list,13)
S3=random.sample(alphabet_list,13)
S4=random.sample(alphabet_list,13)
S5=random.sample(alphabet_list,13)
print(S1,S2,S3,S4,S5)
#print(alphabet_list)


['X', 'N', 'A', 'L', 'Q', 'C', 'U', 'Y', 'M', 'V', 'D', 'P', 'F'] ['S', 'V', 'T', 'K', 'Z', 'D', 'P', 'G', 'X', 'M', 'I', 'W', 'L'] ['D', 'U', 'H', 'A', 'M', 'I', 'Q', 'P', 'G', 'B', 'Y', 'E', 'K'] ['N', 'M', 'E', 'W', 'R', 'C', 'Q', 'F', 'T', 'S', 'Y', 'U', 'H'] ['Z', 'D', 'C', 'I', 'Y', 'T', 'G', 'J', 'H', 'R', 'K', 'E', 'B']


In [93]:
# Model 1
# Build the two-class target: 1 if the letter is in S1, 0 otherwise
Y1_train=Y_train.isin(S1).astype(int)
# Train the model
dtree_entropy = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dtree_entropy_1 = dtree_entropy.fit(X_train,Y1_train)

In [95]:
# Model 2
# Build the two-class target: 1 if the letter is in S2, 0 otherwise
Y2_train=Y_train.isin(S2).astype(int)
# Train the model
dtree_entropy = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dtree_entropy_2 = dtree_entropy.fit(X_train,Y2_train)

In [96]:
# Model 3
# Build the two-class target: 1 if the letter is in S3, 0 otherwise
Y3_train=Y_train.isin(S3).astype(int)
# Train the model
dtree_entropy = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dtree_entropy_3 = dtree_entropy.fit(X_train,Y3_train)

In [97]:
# Model 4
# Build the two-class target: 1 if the letter is in S4, 0 otherwise
Y4_train=Y_train.isin(S4).astype(int)
# Train the model
dtree_entropy = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dtree_entropy_4 = dtree_entropy.fit(X_train,Y4_train)

In [98]:
# Model 5
# Build the two-class target: 1 if the letter is in S5, 0 otherwise
Y5_train=Y_train.isin(S5).astype(int)
# Train the model
dtree_entropy = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dtree_entropy_5 = dtree_entropy.fit(X_train,Y5_train)

In [99]:
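# Apply Voting Classifier
# Caveat: VotingClassifier.fit refits clones of the listed estimators on
# (X_train, Y_train), so the five trees are retrained on the original 26-letter
# labels and the two-class targets built above are not used by this ensemble.
voter= VotingClassifier([('dtree1',dtree_entropy_1),('dtree2',dtree_entropy_2),('dtree3',dtree_entropy_3),('dtree4',dtree_entropy_4),('dtree5',dtree_entropy_5)],voting='hard')
voter = voter.fit(X_train,Y_train)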

In [100]:
# Calculate and print confusion matrix and other performance measures 
predictions_attribute = voter.predict(X_test)
print(classification_report(Y_test,predictions_attribute))
print("Confusion Matrix")
print(confusion_matrix(Y_test,predictions_attribute))
print("\n Accuracy")
print(accuracy_score(Y_test,predictions_attribute))

              precision    recall  f1-score   support

           A       0.93      0.91      0.92       229
           B       0.82      0.83      0.83       228
           C       0.91      0.88      0.89       220
           D       0.78      0.86      0.82       219
           E       0.84      0.87      0.85       232
           F       0.83      0.77      0.80       225
           G       0.87      0.80      0.83       234
           H       0.74      0.79      0.76       206
           I       0.88      0.92      0.90       236
           J       0.90      0.89      0.90       209
           K       0.83      0.84      0.84       213
           L       0.92      0.92      0.92       239
           M       0.91      0.91      0.91       240
           N       0.89      0.87      0.88       239
           O       0.90      0.81      0.85       243
           P       0.85      0.92      0.88       243
           Q       0.87      0.82      0.84       228
           R       0.81    
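
When the classes are manipulated, each binary model only answers "is the letter in my set or not", so the five answers have to be decoded back into a letter. One common scheme, sketched below in plain Python, gives a vote to every class that is consistent with each binary prediction and returns the classes with the most votes. The function name `decode_votes` and the toy class sets are assumptions made for this illustration.

```python
def decode_votes(binary_predictions, class_sets, all_classes):
    """Turn one sample's binary predictions into candidate class labels.

    binary_predictions[k] is 1 if model k predicts "the label is in
    class_sets[k]" and 0 otherwise. Every class consistent with a prediction
    receives one vote; all classes tied for the most votes are returned.
    """
    votes = {c: 0 for c in all_classes}
    for pred, members in zip(binary_predictions, class_sets):
        members = set(members)
        consistent = members if pred == 1 else set(all_classes) - members
        for c in consistent:
            votes[c] += 1
    best = max(votes.values())
    return sorted(c for c in all_classes if votes[c] == best)

# Toy example with four classes and two binary models
all_classes = ["A", "B", "C", "D"]
class_sets = [["A", "B"], ["A", "C"]]

print(decode_votes([1, 0], class_sets, all_classes))  # ['B']
print(decode_votes([1, 1], class_sets, all_classes))  # ['A']
```

With randomly drawn sets, two letters can end up with the same membership pattern, so ties are possible and need a tie-breaking rule.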

### Q5. Which method performs the best

In [None]:
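The hard-voting ensemble of a decision tree, k-NN and Gaussian naive Bayes (Q2) performed best in 5-fold cross-validation, with a mean accuracy of about 0.899, ahead of the bagged decision trees (0.888) and the single decision tree (0.863).
The Q3 and Q4 reports shown above are identical to the single tree's report, so they do not show a gain from those two ensembles.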