# Ensemble Methods

In this topic, we will learn how to combine (or simply ensemble) the models we have tried so that the combination predicts better than any of the individual models.

Commonly, the "weak" learners we use are decision trees. In fact, a decision tree is the default base learner for most ensemble methods in sklearn. However, we can swap it for any of the models we have seen so far.
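As a minimal sketch of swapping the base learner, the snippet below bags logistic regressions instead of trees. The synthetic data from `make_classification` and the names `X_demo`, `y_demo` and `bag_of_logits` are assumptions for illustration only, not part of the spam example that follows.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption, not the SMS dataset used later).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# The base learner is passed as the first positional argument; its keyword
# name differs between scikit-learn versions, so it is not named here.
bag_of_logits = BaggingClassifier(LogisticRegression(max_iter=1000),
                                  n_estimators=5, random_state=0)
bag_of_logits.fit(X_demo, y_demo)

# Each fitted member is a logistic regression rather than a decision tree.
member_types = {type(m).__name__ for m in bag_of_logits.estimators_}
print(member_types)
```

Leaving the first argument out would fall back to the default decision tree.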

## Why do we need to ensemble learners?

There are two competing quantities to balance when finding a well-fitting machine learning model: **Bias** and **Variance**.

**Bias**: When a model has high bias, it doesn't do a good job of bending to the data. An example of an algorithm that usually has high bias is linear regression. Even with completely different datasets, we still end up fitting a straight line, whatever shape the data actually has. High bias is bad because the model underfits.

**Variance**: When a model has high variance, it changes drastically to meet the needs of every point in our dataset. Linear models like the one above have low variance but high bias. An example of an algorithm that tends to have high variance and low bias is a decision tree (especially a decision tree with no early stopping parameters). A decision tree, as a high variance algorithm, will attempt to split every point into its own branch if possible. This is a trait of high variance, low bias algorithms: they are flexible enough to fit exactly whatever data they see. Combining many such learners is a way to keep their flexibility while smoothing out how much any single one overreacts to the data.
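The contrast can be made concrete with a small sketch. It assumes a noisy sine curve as toy data (not the dataset used later) and compares a straight-line fit with an unrestricted decision tree; all variable names here are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Toy data (an assumption): a sine curve plus Gaussian noise.
rng = np.random.RandomState(0)
x_train = rng.uniform(0, 6, 80).reshape(-1, 1)
y_train_demo = np.sin(x_train).ravel() + rng.normal(0, 0.3, 80)
x_test = rng.uniform(0, 6, 80).reshape(-1, 1)
y_test_demo = np.sin(x_test).ravel() + rng.normal(0, 0.3, 80)

line = LinearRegression().fit(x_train, y_train_demo)                       # high bias
tree = DecisionTreeRegressor(random_state=0).fit(x_train, y_train_demo)    # high variance

# Store (train MSE, test MSE) for each model.
errors = {}
for name, model in [("line", line), ("tree", tree)]:
    errors[name] = (
        mean_squared_error(y_train_demo, model.predict(x_train)),
        mean_squared_error(y_test_demo, model.predict(x_test)),
    )
    print(name, "train MSE", errors[name][0], "test MSE", errors[name][1])
```

The unrestricted tree memorises the training points, so its training error collapses while its test error does not; the line cannot bend to the curve, so its error stays similar on both sets.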

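## Bootstrapping
Draw random samples of the data (with replacement) and fit a model on each sample.
Then collect the prediction of each model and take a consensus-based approach,
either by averaging or by max voting. Combining bootstrapped models this way is
called bagging (bootstrap aggregating).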
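A hand-rolled sketch of this idea, assuming synthetic data from `make_classification` and decision trees as the individual models (the names `models`, `all_votes` and `majority` are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

rng = np.random.RandomState(0)
n_models = 11  # odd, so a binary vote cannot tie
models = []
for _ in range(n_models):
    # Bootstrap sample: draw row indices with replacement.
    idx = rng.randint(0, len(X_tr), size=len(X_tr))
    models.append(DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx]))

# One row of predictions per model, then a majority (max) vote per test point.
all_votes = np.array([m.predict(X_te) for m in models])
majority = (all_votes.mean(axis=0) > 0.5).astype(int)
bagged_accuracy = (majority == y_te).mean()
print("Bagged test accuracy", bagged_accuracy)
```

For a regression problem the vote would be replaced by the mean of the individual predictions.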

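## Boosting
Like bootstrapping, boosting trains multiple models, but sequentially: each model builds on the previous one by adding a term that focuses on the errors made so far.
The final prediction comes from the last model in the sequence, which is the (weighted) sum of all the terms added along the way.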
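One way to picture "the previous model plus an additional term" is the residual-fitting loop below. It is a gradient-boosting-style sketch for squared error on toy regression data, not the AdaBoost algorithm used later in this notebook; the data, the stump depth and the `learning_rate` value are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (an assumption): noisy sine curve.
rng = np.random.RandomState(0)
x = rng.uniform(0, 6, 200).reshape(-1, 1)
y = np.sin(x).ravel() + rng.normal(0, 0.2, 200)

learning_rate = 0.5
prediction = np.full_like(y, y.mean())          # model 0: predict the mean
mse_history = [np.mean((y - prediction) ** 2)]
stages = []
for _ in range(20):
    residual = y - prediction                   # what the current model gets wrong
    stump = DecisionTreeRegressor(max_depth=1).fit(x, residual)
    # New model = previous model + one additional (shrunken) term.
    prediction = prediction + learning_rate * stump.predict(x)
    stages.append(stump)
    mse_history.append(np.mean((y - prediction) ** 2))

print("Training MSE before boosting", mse_history[0])
print("Training MSE after 20 rounds", mse_history[-1])
```

The final `prediction` is the last model in the sequence, and it contains every term added in earlier rounds.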

## Ensemble Methods in Scikit-Learn

In [89]:
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from time import time

In [66]:
data = pd.read_csv("SMSSpamCollection.tsv", names=['label', 'message'], sep='\t')
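display(data.head())
# Note: data.info without parentheses displays the bound method (its repr embeds the frame);
# call data.info() to get the column/dtype summary instead.
display(data.info)
# Encode the target: ham -> 0, spam -> 1
data['label'] = data.label.map({'ham': 0, 'spam': 1})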

|   | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


<bound method DataFrame.info of      label                                            message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...
...    ...                                                ...
5567  spam  This is the 2nd time we have tried 2 contact u...
5568   ham               Will ü b going to esplanade fr home?
5569   ham  Pity, * was in mood for that. So...any other s...
5570   ham  The guy did some bitching but I acted like i'd...
5571   ham                         Rofl. Its true to its name

[5572 rows x 2 columns]>

In [67]:
X = data[['message']].values
y = data[['label']].values

In [68]:
vectorizer = CountVectorizer(stop_words='english')
spam_vector = vectorizer.fit_transform(data["message"])
# get_feature_names() was removed in scikit-learn 1.2; on 1.0 or newer use get_feature_names_out()
spam_features = vectorizer.get_feature_names()

In [69]:
df_spam = pd.DataFrame(spam_vector.toarray(), columns=spam_features)

In [70]:
df_spam.head()

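|   | 00 | 000 | 000pes | 008704050406 | 0089 | 0121 | 01223585236 | 01223585334 | 0125698789 | 02 | ... | zhong | zindgi | zoe | zogtorius | zoom | zouk | zyada | èn | ú1 | 〨ud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |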
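The table above is mostly zeros because each column counts one vocabulary word. A tiny sketch on three made-up messages (an assumption, not rows from the dataset) shows what `CountVectorizer` produces; it reads the vocabulary from the `vocabulary_` attribute so that it does not depend on the feature-name method of a particular scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up messages (hypothetical, for illustration only).
toy_messages = ["Free entry to win a prize", "Ok lar, joking", "Win a FREE prize"]

toy_vectorizer = CountVectorizer(stop_words='english')
toy_counts = toy_vectorizer.fit_transform(toy_messages)   # sparse matrix: messages x words

vocab = toy_vectorizer.vocabulary_                         # word -> column index
print(sorted(vocab, key=vocab.get))                        # words in column order

# Text is lower-cased by default, so "Free" and "FREE" share one column.
free_column = toy_counts[:, vocab['free']].toarray().ravel()
print(free_column)
```

Stop words such as "to" are dropped, which is why they never appear as columns.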


In [71]:
X_train, X_test, y_train, y_test = train_test_split(spam_vector.toarray(), data['label'],
                                                   test_size=0.2, random_state=111)
X_train.shape, X_test.shape

((4457, 8444), (1115, 8444))
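Calling `.toarray()` turns the count matrix into a dense array with 8444 columns, which is memory hungry. As a hedged alternative, the sketch below keeps the matrix sparse all the way through `train_test_split` and a random forest fit. The eight toy messages and their labels are made up for illustration.

```python
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Made-up messages and labels (1 = spam, 0 = ham), for illustration only.
toy_messages = [
    "win a free prize now", "are you coming home",
    "free entry win cash", "see you at lunch",
    "claim your free cash prize", "call me when you are home",
    "urgent win free entry", "lunch at home today",
]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0]

toy_matrix = CountVectorizer().fit_transform(toy_messages)   # stays sparse

# train_test_split accepts sparse matrices and returns sparse splits.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    toy_matrix, toy_labels, test_size=0.25, random_state=111)

forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(Xs_train, ys_train)                               # sparse input is supported
preds = forest.predict(Xs_test)
print(type(Xs_train).__name__, Xs_train.shape, preds)
```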

In [72]:
RFC = RandomForestClassifier(criterion='entropy', n_estimators=10)  # optionally pass n_jobs=3 to build trees in parallel

In [73]:
RFC.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

Train Acc 0.997083239847431
Test Acc 0.9713004484304932


In [80]:
DTC = DecisionTreeClassifier(criterion='entropy', random_state=1111)
DTC.fit(X_train, y_train)
train_pred = DTC.predict(X_train)
test_pred = DTC.predict(X_test)
print ("Train Acc", metrics.accuracy_score(y_train, train_pred))
print ("Train Recall", metrics.recall_score(y_train, train_pred))
print ("Train Precision", metrics.precision_score(y_train, train_pred))
print ("Train F1", metrics.f1_score(y_train, train_pred))
print ()
print ("Test Acc", metrics.accuracy_score(y_test, test_pred))
print ("Test Recall", metrics.recall_score(y_test, test_pred))
print ("Test Precision", metrics.precision_score(y_test, test_pred))
print ("Test F1", metrics.f1_score(y_test, test_pred))

Train Acc 1.0
Train Recall 1.0
Train Precision 1.0
Train F1 1.0

Test Acc 0.968609865470852
Test Recall 0.847682119205298
Test Precision 0.9142857142857143
Test F1 0.8797250859106529
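The same eight `print` calls are repeated for every model in this notebook. A small helper could replace them; `report_scores` below is a hypothetical name, and the labels passed to it are made up so the sketch runs on its own.

```python
from sklearn import metrics

def report_scores(name, y_true, y_pred):
    """Print and return accuracy, recall, precision and F1 for one data split."""
    scores = {
        "Acc": metrics.accuracy_score(y_true, y_pred),
        "Recall": metrics.recall_score(y_true, y_pred),
        "Precision": metrics.precision_score(y_true, y_pred),
        "F1": metrics.f1_score(y_true, y_pred),
    }
    for metric_name, value in scores.items():
        print(name, metric_name, value)
    return scores

# Made-up labels and predictions, for illustration only.
demo_scores = report_scores("Demo", [0, 0, 1, 1, 1, 0], [0, 0, 1, 0, 1, 1])
```

With the models above, the calls would look like `report_scores("Train", y_train, train_pred)` followed by `report_scores("Test", y_test, test_pred)`.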


In [78]:
train_pred = RFC.predict(X_train)
test_pred = RFC.predict(X_test)
print ("Train Acc", metrics.accuracy_score(y_train, train_pred))
print ("Train Recall", metrics.recall_score(y_train, train_pred))
print ("Train Precision", metrics.precision_score(y_train, train_pred))
print ("Train F1", metrics.f1_score(y_train, train_pred))
print ()
print ("Test Acc", metrics.accuracy_score(y_test, test_pred))
print ("Test Recall", metrics.recall_score(y_test, test_pred))
print ("Test Precision", metrics.precision_score(y_test, test_pred))
print ("Test F1", metrics.f1_score(y_test, test_pred))

Train Acc 0.997083239847431
Train Recall 0.9781879194630873
Train Precision 1.0
Train F1 0.9889737065309585

Test Acc 0.9713004484304932
Test Recall 0.8013245033112583
Test Precision 0.983739837398374
Test F1 0.8832116788321168


In [85]:
BC = BaggingClassifier(n_estimators=10, random_state=1111, verbose=True, n_jobs=3)
BC.fit(X_train, y_train)
train_pred = BC.predict(X_train)
test_pred = BC.predict(X_test)
print ("Train Acc", metrics.accuracy_score(y_train, train_pred))
print ("Train Recall", metrics.recall_score(y_train, train_pred))
print ("Train Precision", metrics.precision_score(y_train, train_pred))
print ("Train F1", metrics.f1_score(y_train, train_pred))
print ()
print ("Test Acc", metrics.accuracy_score(y_test, test_pred))
print ("Test Recall", metrics.recall_score(y_test, test_pred))
print ("Test Precision", metrics.precision_score(y_test, test_pred))
print ("Test F1", metrics.f1_score(y_test, test_pred))

[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done   3 out of   3 | elapsed:   35.8s finished
[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done   3 out of   3 | elapsed:   14.7s finished
[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.


Train Acc 0.9977563383441777
Train Recall 0.9848993288590604
Train Precision 0.9982993197278912
Train F1 0.9915540540540542

Test Acc 0.9668161434977578
Test Recall 0.8410596026490066
Test Precision 0.9071428571428571
Test F1 0.872852233676976


[Parallel(n_jobs=3)]: Done   3 out of   3 | elapsed:    3.3s finished


In [93]:
for LR in [0.1, 0.5, 1, 10]:
    start = time()
    print ('Learning Rate', LR)
    ABC = AdaBoostClassifier(n_estimators=10, learning_rate=LR, random_state=1111)
    ABC.fit(X_train, y_train)
    print (f"{time()-start:.3f}s fitting time")
    train_pred = ABC.predict(X_train)
    test_pred = ABC.predict(X_test)
    print ("Train Acc", metrics.accuracy_score(y_train, train_pred))
    print ("Train Recall", metrics.recall_score(y_train, train_pred))
    print ("Train Precision", metrics.precision_score(y_train, train_pred))
    print ("Train F1", metrics.f1_score(y_train, train_pred))
    print ()
    print ("Test Acc", metrics.accuracy_score(y_test, test_pred))
    print ("Test Recall", metrics.recall_score(y_test, test_pred))
    print ("Test Precision", metrics.precision_score(y_test, test_pred))
    print ("Test F1", metrics.f1_score(y_test, test_pred))
    print (f"{time()-start:.3f}s")
    print ('\n\n')

Learning Rate 0.1
15.283s fitting time
Train Acc 0.901503253309401
Train Recall 0.26677852348993286
Train Precision 0.9875776397515528
Train F1 0.4200792602377807

Test Acc 0.8977578475336323
Test Recall 0.24503311258278146
Test Precision 1.0
Test F1 0.39361702127659576
16.932s



Learning Rate 0.5
15.364s fitting time
Train Acc 0.9275297285169396
Train Recall 0.46308724832214765
Train Precision 0.989247311827957
Train F1 0.6308571428571429

Test Acc 0.9210762331838565
Test Recall 0.41721854304635764
Test Precision 1.0
Test F1 0.5887850467289719
17.562s



Learning Rate 1
16.733s fitting time
Train Acc 0.9481714157505048
Train Recall 0.6644295302013423
Train Precision 0.927400468384075
Train F1 0.7741935483870969

Test Acc 0.9417040358744395
Test Recall 0.6158940397350994
Test Precision 0.93
Test F1 0.7410358565737053
18.485s



Learning Rate 10
6.668s fitting time
Train Acc 0.1337222346870092
Train Recall 1.0
Train Precision 0.1337222346870092
Train F1 0.23589946566396197

Test Acc 0.

In [94]:
for LR in [0.5]:
    start = time()
    print ('Learning Rate', LR)
    ABC = AdaBoostClassifier(n_estimators=50, learning_rate=LR, random_state=1111)
    ABC.fit(X_train, y_train)
    print (f"{time()-start:.3f}s fitting time")
    train_pred = ABC.predict(X_train)
    test_pred = ABC.predict(X_test)
    print ("Train Acc", metrics.accuracy_score(y_train, train_pred))
    print ("Train Recall", metrics.recall_score(y_train, train_pred))
    print ("Train Precision", metrics.precision_score(y_train, train_pred))
    print ("Train F1", metrics.f1_score(y_train, train_pred))
    print ()
    print ("Test Acc", metrics.accuracy_score(y_test, test_pred))
    print ("Test Recall", metrics.recall_score(y_test, test_pred))
    print ("Test Precision", metrics.precision_score(y_test, test_pred))
    print ("Test F1", metrics.f1_score(y_test, test_pred))
    print (f"{time()-start:.3f}s")
    print ('\n\n')

Learning Rate 0.5
78.317s fitting time
Train Acc 0.9694862014808167
Train Recall 0.790268456375839
Train Precision 0.9771784232365145
Train F1 0.8738404452690167

Test Acc 0.9623318385650225
Test Recall 0.7350993377483444
Test Precision 0.9823008849557522
Test F1 0.8409090909090909
87.516s
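Refitting AdaBoost for each value of `n_estimators` is slow, as the timings above show. A possible shortcut is `staged_score`, which reports the accuracy after every boosting round of a single fitted model. The sketch below assumes synthetic data from `make_classification` rather than the spam matrix.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with some label noise (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=111)

booster = AdaBoostClassifier(n_estimators=50, learning_rate=0.5, random_state=1111)
booster.fit(X_tr, y_tr)

# Test accuracy after 1, 2, ..., n boosting rounds from one fit.
staged_test_acc = list(booster.staged_score(X_te, y_te))
best_round = max(range(len(staged_test_acc)), key=staged_test_acc.__getitem__) + 1
print("Rounds evaluated", len(staged_test_acc))
print("Best test accuracy", max(staged_test_acc), "at round", best_round)
```

Picking the round on the test set would leak information, so a separate validation split is the safer choice in practice.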



