# Ensemble


Ensemble learning improves machine learning results by combining several models, which often yields better predictive performance than any single model.

Most ensemble methods use a single base learning algorithm, i.e. learners of the same type, leading to homogeneous ensembles.

Some methods instead use learners of different types, leading to heterogeneous ensembles. For an ensemble to be more accurate than any of its individual members, the base learners have to be as accurate as possible and as diverse as possible, so that their errors do not all fall on the same examples.
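
The benefit of combining accurate and diverse learners can be illustrated with a small calculation. The sketch below assumes a binary problem, an odd number of base learners, and fully independent errors (an idealisation that real ensembles only approximate); the function name `majority_vote_error` is introduced here for illustration only.

```python
from math import comb

def majority_vote_error(n_learners, error_rate):
    """Probability that more than half of n independent learners are wrong.

    Assumes a binary task where every learner has the same error rate
    and the learners make their mistakes independently of each other.
    """
    k_min = n_learners // 2 + 1  # smallest number of wrong votes that forms a majority
    return sum(
        comb(n_learners, k) * error_rate**k * (1 - error_rate) ** (n_learners - k)
        for k in range(k_min, n_learners + 1)
    )

# 25 independent learners that are each wrong 35% of the time
ensemble_error = majority_vote_error(25, 0.35)
print(f"ensemble error: {ensemble_error:.3f}")

# Learners that are no better than a coin flip gain nothing from voting
coin_flip_error = majority_vote_error(25, 0.5)
print(f"coin-flip ensemble error: {coin_flip_error:.3f}")
```

Under these assumptions the ensemble error falls well below the individual error rate, but only while each learner is better than chance; correlated errors shrink the gain.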



### Bagging

Bagging stands for bootstrap aggregation. One way to reduce the variance of an estimate is to average together multiple estimates. For example, we can train M different trees on different subsets of the data (chosen randomly with replacement).

Bagging uses bootstrap sampling to obtain the data subsets for training the base learners. For aggregating the outputs of base learners, bagging uses voting for classification and averaging for regression.
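
The two ingredients described above, bootstrap sampling and aggregation, can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation; the helper names `bootstrap_sample`, `majority_vote` and `average` are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows uniformly at random, with replacement."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(labels):
    """Aggregate classifier outputs: return the most common label."""
    return Counter(labels).most_common(1)[0][0]

def average(values):
    """Aggregate regressor outputs: return the arithmetic mean."""
    return sum(values) / len(values)

rng = random.Random(30)
data = list(range(10))

# One bootstrap sample per base learner; duplicates are expected
samples = [bootstrap_sample(data, rng) for _ in range(5)]
for sample in samples:
    print(sorted(sample))

# Aggregating five hypothetical base-learner outputs
print(majority_vote(["A", "B", "A", "A", "C"]))  # A
print(average([2.0, 3.0, 4.0]))                  # 3.0
```

Because each sample is drawn with replacement, some rows appear several times and others not at all, which is what makes the base learners differ from one another.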

### Boosting

Boosting is a general ensemble method that creates a strong classifier from a number of weak classifiers.

This is done by building a model from the training data, then creating a second model that attempts to correct the errors from the first model. Models are added until the training set is predicted perfectly or a maximum number of models are added.

AdaBoost is one of the most successful boosting algorithms developed for binary classification.
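
To make the "correct the errors of the previous model" step concrete, here is a sketch of a single reweighting round in the textbook formulation of discrete AdaBoost for labels in {-1, +1}. It is an illustration under those assumptions, not the code used by any particular library, and the function name `adaboost_round` is hypothetical.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost reweighting step for labels in {-1, +1}.

    Assumes the weights sum to 1 and that the weighted error of the
    weak learner is strictly between 0 and 1.
    """
    # Weighted error: total weight of the misclassified samples
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Importance of this weak learner in the final weighted vote
    alpha = 0.5 * math.log((1.0 - error) / error)
    # t * p is +1 for a correct prediction and -1 for a mistake,
    # so mistakes are scaled up and correct predictions scaled down
    updated = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]   # the last sample is misclassified
weights = [0.2] * 5           # start from uniform weights

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print([round(w, 3) for w in new_weights])  # [0.125, 0.125, 0.125, 0.125, 0.5]
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.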

### Import all the libraries required

In [17]:
import pandas as pd
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import accuracy_score

### Load the "letter-recognition" data

In [18]:
# import dataset
df = pd.read_csv("letter-recognition.data.txt", header = None)
df.head()

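|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T | 2 | 8 | 3 | 5 | 1 | 8 | 13 | 0 | 6 | 6 | 10 | 8 | 0 | 8 | 0 | 8 |
| 1 | I | 5 | 12 | 3 | 7 | 2 | 10 | 5 | 5 | 4 | 13 | 3 | 9 | 2 | 8 | 4 | 10 |
| 2 | D | 4 | 11 | 6 | 8 | 6 | 10 | 6 | 2 | 6 | 10 | 3 | 7 | 3 | 7 | 3 | 9 |
| 3 | N | 7 | 11 | 6 | 6 | 3 | 5 | 9 | 4 | 6 | 4 | 4 | 10 | 6 | 10 | 2 | 8 |
| 4 | G | 2 | 1 | 3 | 1 | 1 | 8 | 6 | 6 | 6 | 6 | 5 | 9 | 1 | 7 | 5 | 10 |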


### Split the dataset into training and testing parts (70-30 ratio with a random state value 30)

In [19]:
# Select the independent variables and the target attribute
X = df[df.columns[1:]] # Select the independent variables (columns 1-16)
Y = df[df.columns[0]] # Select only the target labelled column (column 0, the letter)
X.head()

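|   | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 8 | 3 | 5 | 1 | 8 | 13 | 0 | 6 | 6 | 10 | 8 | 0 | 8 | 0 | 8 |
| 1 | 5 | 12 | 3 | 7 | 2 | 10 | 5 | 5 | 4 | 13 | 3 | 9 | 2 | 8 | 4 | 10 |
| 2 | 4 | 11 | 6 | 8 | 6 | 10 | 6 | 2 | 6 | 10 | 3 | 7 | 3 | 7 | 3 | 9 |
| 3 | 7 | 11 | 6 | 6 | 3 | 5 | 9 | 4 | 6 | 4 | 4 | 10 | 6 | 10 | 2 | 8 |
| 4 | 2 | 1 | 3 | 1 | 1 | 8 | 6 | 6 | 6 | 6 | 5 | 9 | 1 | 7 | 5 | 10 |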


In [20]:
# Divide the dataset into training and testing partition
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)

### Q1. Ensemble Method by manipulation of Dataset (Bagged Decision Trees)

Bagging performs best with algorithms that have high variance. A popular example is the decision tree, often constructed without pruning.

We will create decision tree classifiers with and without bagging ensemble method and compare their performance.

In [21]:
# Implement the decision tree classifier using entropy and random state value as 30
dtree_entropy = DecisionTreeClassifier(criterion='entropy', random_state = 30) 

In [22]:
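# Fit on the full training set (this fitted model is used for the test-set predictions below),
# then estimate accuracy with k-fold cross validation, k=5.
# cross_val_score fits fresh clones of the estimator on each fold.
dtree_entropy = dtree_entropy.fit(X_train,Y_train)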
scores = cross_val_score(dtree_entropy, X_train, Y_train, cv=5, scoring='accuracy')
print('scores: ', scores)
print('mean score: ', scores.mean())

scores:  [0.865      0.86964286 0.85214286 0.86       0.87      ]
mean score:  0.8633571428571427


### Prediction and Evaluation

In [23]:
# Predict results on the testing part
predictions = dtree_entropy.predict(X_test)

In [24]:
# Calculate and print confusion matrix and other performance measures 
print(classification_report(Y_test,predictions))
print("Confusion Matrix")
print(confusion_matrix(Y_test,predictions))
print("\n Accuracy")
print(accuracy_score(Y_test,predictions))

              precision    recall  f1-score   support

           A       0.93      0.91      0.92       229
           B       0.82      0.83      0.83       228
           C       0.91      0.88      0.89       220
           D       0.78      0.86      0.82       219
           E       0.84      0.87      0.85       232
           F       0.83      0.77      0.80       225
           G       0.87      0.80      0.83       234
           H       0.74      0.79      0.76       206
           I       0.88      0.92      0.90       236
           J       0.90      0.89      0.90       209
           K       0.83      0.84      0.84       213
           L       0.92      0.92      0.92       239
           M       0.91      0.91      0.91       240
           N       0.89      0.87      0.88       239
           O       0.90      0.81      0.85       243
           P       0.85      0.92      0.88       243
           Q       0.87      0.82      0.84       228
           R       0.81    

### Comparison with Bagged Decision Tree

In [25]:
# Create a bagging model with 5 decision tree classifiers
from sklearn.ensemble import BaggingClassifier

seed = 30
dtree = DecisionTreeClassifier(criterion='entropy', random_state=seed)
num_trees = 5
# The base estimator is passed positionally because its keyword was renamed
# from `base_estimator` to `estimator` in scikit-learn 1.2
model = BaggingClassifier(dtree, n_estimators=num_trees, random_state=seed)

In [29]:
# Fit the bagged model on the full training set, then use k-fold cross validation with k=5
bagging_model = model.fit(X_train, Y_train)
scores = cross_val_score(bagging_model, X_train, Y_train, cv=5, scoring='accuracy')
print('scores: ', scores)
print('mean score: ', scores.mean())

scores:  [0.88714286 0.89035714 0.89       0.88285714 0.88714286]
mean score:  0.8875


### Prediction and Evaluation

In [30]:
# Predict results on the testing part
predictions = bagging_model.predict(X_test)

In [31]:
# Calculate and print confusion matrix and other performance measures 
print(classification_report(Y_test,predictions))
print("Confusion Matrix")
print(confusion_matrix(Y_test,predictions))
print("\n Accuracy")
print(accuracy_score(Y_test,predictions))

              precision    recall  f1-score   support

           A       0.93      0.98      0.95       229
           B       0.78      0.92      0.84       228
           C       0.88      0.89      0.89       220
           D       0.78      0.90      0.84       219
           E       0.84      0.91      0.87       232
           F       0.87      0.81      0.84       225
           G       0.85      0.82      0.84       234
           H       0.82      0.86      0.84       206
           I       0.90      0.93      0.91       236
           J       0.93      0.89      0.91       209
           K       0.87      0.92      0.89       213
           L       0.94      0.92      0.93       239
           M       0.93      0.93      0.93       240
           N       0.96      0.90      0.92       239
           O       0.87      0.81      0.84       243
           P       0.89      0.93      0.91       243
           Q       0.89      0.89      0.89       228
           R       0.89    

### Q2. Ensemble Method by manipulation of Classifiers (using Voting Classifier)

The VotingClassifier takes a list of (name, estimator) pairs and a voting method. With **hard** voting, each estimator predicts a label and the majority label wins. With **soft** voting, the predicted class probabilities are summed across the estimators and the class with the largest total is chosen, so every estimator must be able to output probabilities.

After we provide the desired classifiers, we need to fit the resulting ensemble classifier object. We can then get predictions and use accuracy metrics.
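
The difference between the two voting rules is easiest to see on a case where they disagree. The sketch below uses made-up probabilities for three hypothetical classifiers and two classes; `hard_vote` and `soft_vote` are illustrative helpers, not scikit-learn functions.

```python
from collections import Counter

classes = ["A", "B"]

# Predicted class probabilities for one sample from three hypothetical classifiers
probas = [
    [0.45, 0.55],  # classifier 1 leans slightly towards B
    [0.40, 0.60],  # classifier 2 leans slightly towards B
    [0.95, 0.05],  # classifier 3 is very confident in A
]

def hard_vote(probas, classes):
    """Each classifier casts one vote for its most probable class."""
    labels = [classes[row.index(max(row))] for row in probas]
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probas, classes):
    """Sum the probabilities per class and pick the largest total."""
    totals = [sum(column) for column in zip(*probas)]
    return classes[totals.index(max(totals))]

print(hard_vote(probas, classes))  # B (two votes against one)
print(soft_vote(probas, classes))  # A (1.80 total against 1.20)
```

Soft voting lets one confident classifier outweigh two hesitant ones, which helps only when the predicted probabilities are reasonably well calibrated.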

In [32]:
#Import required library
from sklearn.ensemble import VotingClassifier

In [35]:
# Implement the different classifiers
dtree_gini=DecisionTreeClassifier(criterion='gini',random_state=30)
euclidean_3nn=KNeighborsClassifier(n_neighbors=3,metric='euclidean')
euclidean_5nn=KNeighborsClassifier(n_neighbors=5,metric='euclidean')
manhattan_5nn=KNeighborsClassifier(n_neighbors=5,metric='manhattan')
gnb=GaussianNB()

In [36]:
# Build Voting Classifier using the estimators above and the hard voting method
# Function to be used: VotingClassifier(estimators, voting)
# Each estimator is given as a ('name', estimator_object) tuple
# Note: gnb is defined above but is not included in this ensemble
voting_classifier=VotingClassifier(estimators=[('dt', dtree_gini),('e3nn', euclidean_3nn),('e5nn', euclidean_5nn),('m5nn', manhattan_5nn)], voting='hard')

In [37]:
# Fit the voting classifier model and print scores using k-fold cross validation with k=5
voting_classifier.fit(X_train, Y_train)
scores = cross_val_score(voting_classifier, X_train, Y_train, cv=5, scoring='accuracy')
print('scores: ', scores)
print('mean score: ', scores.mean())

scores:  [0.94321429 0.94285714 0.94357143 0.94714286 0.94607143]
mean score:  0.9445714285714285


### Prediction and Evaluation

In [38]:
# Predict results on the testing part
predictions = voting_classifier.predict(X_test)

In [39]:
# Calculate and print confusion matrix and other performance measures 
print(classification_report(Y_test,predictions))
print("Confusion Matrix")
print(confusion_matrix(Y_test,predictions))
print("\n Accuracy")
print(accuracy_score(Y_test,predictions))

              precision    recall  f1-score   support

           A       0.99      1.00      0.99       229
           B       0.84      0.97      0.90       228
           C       0.96      0.95      0.96       220
           D       0.89      0.98      0.93       219
           E       0.94      0.93      0.93       232
           F       0.93      0.93      0.93       225
           G       0.95      0.92      0.93       234
           H       0.88      0.92      0.90       206
           I       0.94      0.97      0.96       236
           J       0.97      0.93      0.95       209
           K       0.93      0.89      0.91       213
           L       0.99      0.96      0.97       239
           M       0.97      0.98      0.97       240
           N       0.98      0.94      0.96       239
           O       0.92      0.94      0.93       243
           P       0.97      0.93      0.95       243
           Q       0.96      0.96      0.96       228
           R       0.94    

### Q3. Manipulating the features

In [42]:
# Generate five random feature subsets, each holding 10 of the 16 feature columns
import numpy as np
# Feature columns are 1..16 (column 0 is the target), so the upper bound of arange is 17
X1 = np.random.choice(np.arange(1, 17), 10, replace=False)
X2 = np.random.choice(np.arange(1, 17), 10, replace=False)
X3 = np.random.choice(np.arange(1, 17), 10, replace=False)
X4 = np.random.choice(np.arange(1, 17), 10, replace=False)
X5 = np.random.choice(np.arange(1, 17), 10, replace=False)


In [45]:
# Model 1
# Select the independent variables (the feature columns listed in X1)
# Select only the target labelled column
# Train the model
X=df[df.columns[X1]]
Y=df[0]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)
model1=DecisionTreeClassifier(criterion='entropy', random_state=30)
model1=model1.fit(X_train,Y_train)

In [47]:
# Model 2
# Select the independent variables (the feature columns listed in X2)
# Select only the target labelled column
# Train the model
X=df[df.columns[X2]]
Y=df[0]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)
model2=DecisionTreeClassifier(criterion='entropy', random_state=30)
model2=model2.fit(X_train,Y_train)

In [48]:
# Model 3
# Select the independent variables (the feature columns listed in X3)
# Select only the target labelled column
# Train the model
X=df[df.columns[X3]]
Y=df[0]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)
model3=DecisionTreeClassifier(criterion='entropy', random_state=30)
# Fit model3 itself; fitting model1 here would overwrite Model 1 and alias it
model3=model3.fit(X_train,Y_train)

In [49]:
# Model 4
# Select the independent variables (the feature columns listed in X4)
# Select only the target labelled column
# Train the model
X=df[df.columns[X4]]
Y=df[0]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)
model4=DecisionTreeClassifier(criterion='entropy', random_state=30)
# Fit model4 itself; fitting model1 here would overwrite Model 1 and alias it
model4=model4.fit(X_train,Y_train)

In [50]:
# Model 5
# Select the independent variables (the feature columns listed in X5)
# Select only the target labelled column
# Train the model
X=df[df.columns[X5]]
Y=df[0]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)
model5=DecisionTreeClassifier(criterion='entropy', random_state=30)
# Fit model5 itself; fitting model1 here would overwrite Model 1 and alias it
model5=model5.fit(X_train,Y_train)

In [51]:
# Apply Voting Classifier
voting_classifier=VotingClassifier(estimators=[('model1', model1),('model2', model2),('model3', model3),('model4', model4),('model5', model5)], voting='hard')

In [52]:
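# Fit the voting classifier and print scores using k-fold cross validation with k=5
# Caution: VotingClassifier clones every estimator and refits it on the data passed to fit(),
# so all five trees are retrained here on the X5 feature subset (the most recent X_train);
# the per-model feature subsets chosen above are not preserved.
voting_classifier.fit(X_train, Y_train)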
scores = cross_val_score(voting_classifier, X_train, Y_train, cv=5, scoring='accuracy')
print('scores: ', scores)
print('mean score: ', scores.mean())

scores:  [0.76892857 0.76285714 0.76464286 0.7675     0.76642857]
mean score:  0.7660714285714286
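
Because VotingClassifier refits clones of its estimators on one shared feature matrix, an ensemble whose members each see a different feature subset has to remember which columns belong to which model. The following is a minimal sketch of one way to do that with a manual majority vote. It runs on synthetic data from `make_classification` rather than the letter-recognition file, and the names `subsets`, `models`, `all_preds` and `ensemble_pred` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: 16 features, 3 integer-coded classes
X, y = make_classification(n_samples=600, n_features=16, n_informative=8,
                           n_classes=3, random_state=30)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=30)

# Five random subsets of 10 feature indices each, drawn without replacement
rng = np.random.default_rng(30)
subsets = [rng.choice(X.shape[1], size=10, replace=False) for _ in range(5)]

# Train one tree per subset, using only that subset's columns
models = [
    DecisionTreeClassifier(criterion="entropy", random_state=30)
    .fit(X_train[:, cols], y_train)
    for cols in subsets
]

# Each tree must predict on the same columns it was trained on
all_preds = np.array([m.predict(X_test[:, cols])
                      for m, cols in zip(models, subsets)])  # shape (5, n_test)

# Majority vote per test sample (labels are non-negative integers)
ensemble_pred = np.array([np.bincount(votes).argmax() for votes in all_preds.T])
ensemble_accuracy = (ensemble_pred == y_test).mean()
print("ensemble accuracy:", ensemble_accuracy)
```

For string labels such as the letters A to Z, the vote could be taken with `collections.Counter` instead of `np.bincount`.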


### Q4. Manipulating the classes

In [53]:
# Generate 5 random two-class groupings of the 26 letters
import numpy as np

# Each row holds 13 distinct letter codes (A=1, ..., Z=26) that are relabelled as class 0;
# the remaining 13 letters become class 1.
# np.random.choice is used because np.random.sample does not draw from a given population.
array = np.zeros((5, 13), dtype=int)
for i in range(5):
    array[i] = np.random.choice(np.arange(1, 27), 13, replace=False)

print(array)

# Letter code of every row's target (A=1, ..., Z=26)
codes = df[0].map(ord) - 64

# Build one binary-labelled copy of the dataset per grouping
binary_dfs = []
for i in range(5):
    df_i = df.copy(deep=True)
    df_i[0] = np.where(codes.isin(array[i]), 0, 1)
    binary_dfs.append(df_i)

df1, df2, df3, df4, df5 = binary_dfs

In [None]:
# Model 1
# Select the independent variables
# Select only the target labelled column
# Train the model
X = df1[df1.columns[1:]]
Y = df1[df1.columns[0]]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)
dtr1 = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dtr1 = dtr1.fit(X_train,Y_train)


In [None]:
# Model 2
# Select the independent variables
# Select only the target labelled column
# Train the model
X = df2[df2.columns[1:]]
Y = df2[df2.columns[0]]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)
dtr2 = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dtr2 = dtr2.fit(X_train,Y_train)

In [None]:
# Model 3
# Select the independent variables
# Select only the target labelled column
# Train the model
X = df3[df3.columns[1:]]
Y = df3[df3.columns[0]]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)
dtr3 = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dtr3 = dtr3.fit(X_train,Y_train)

In [None]:
# Model 4
# Select the independent variables
# Select only the target labelled column
# Train the model
X = df4[df4.columns[1:]]
Y = df4[df4.columns[0]]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)
dtr4 = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dtr4 = dtr4.fit(X_train,Y_train)

In [None]:
# Model 5
# Select the independent variables
# Select only the target labelled column
# Train the model
X = df5[df5.columns[1:]]
Y = df5[df5.columns[0]]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state = 30)
dtr5 = DecisionTreeClassifier(criterion='entropy', random_state = 30)
dtr5 = dtr5.fit(X_train,Y_train)

In [None]:
# Apply Voting Classifier

In [None]:
# Calculate and print confusion matrix and other performance measures 
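
The binary models above each answer only "is this letter in my group or not", so their outputs have to be decoded back into a letter before they can be scored against the original 26 classes. One possible decoding, sketched below, gives a vote to every letter consistent with each binary prediction and picks the letter with the most votes. This is an assumption about how the step could be completed, not the notebook's own solution; `decode` and `groups` are hypothetical names, and the label convention (0 = letter is in the group, 1 = it is not) follows the relabelling used above.

```python
import numpy as np

n_classes = 26
rng = np.random.default_rng(30)

# groups[i] holds the 13 letter codes (A=1, ..., Z=26) that model i labels as 0
groups = np.array([rng.choice(np.arange(1, n_classes + 1), size=13, replace=False)
                   for _ in range(5)])

def decode(binary_preds, groups, n_classes=26):
    """Turn 0/1 predictions of shape (n_models, n_samples) into letter codes.

    A prediction of 0 from model i votes for every letter in groups[i];
    a prediction of 1 votes for every letter outside it.
    """
    n_models, n_samples = binary_preds.shape
    letter_codes = np.arange(1, n_classes + 1)
    votes = np.zeros((n_samples, n_classes), dtype=int)
    for i in range(n_models):
        in_group = np.isin(letter_codes, groups[i])          # shape (n_classes,)
        predicted_in_group = binary_preds[i][:, None] == 0   # shape (n_samples, 1)
        votes += np.where(predicted_in_group, in_group, ~in_group)
    # argmax returns the first letter among those tied for the most votes
    return votes.argmax(axis=1) + 1

# Simulate perfect binary models for three true letters (A, G, Z)
true_letters = np.array([1, 7, 26])
binary_preds = np.array([[0 if letter in groups[i] else 1 for letter in true_letters]
                         for i in range(5)])
decoded = decode(binary_preds, groups)
print(decoded)
```

With only five random splits, several letters can share the same pattern of group memberships, so even perfect binary models may decode to a different letter with an identical pattern; more splits reduce such ties.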

### Q5. Which method performs the best