## Chapter 7 - Ensemble Learning and Random Forests

### Voting Classifiers

Given several trained classifiers, we can often build a better one by aggregating their predictions and predicting the class that receives the most votes. This is called a hard voting classifier.

Even if each classifier is a weak learner (only slightly better than random guessing), the ensemble can be a strong learner (achieving high accuracy), provided there are enough weak learners and they are sufficiently diverse. This works best when the classifiers are as independent as possible and make uncorrelated errors. Classifiers trained on the same data tend to make the same kinds of errors, which reduces the benefit.
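As a rough illustration of why many weak learners can add up to a strong one, the sketch below computes the probability that a strict majority of `n` classifiers is correct, under the idealized assumption that each one is right with probability `p` independently of the others. The function name `majority_vote_accuracy` and the chosen values of `p` and `n` are illustrative, not taken from the notebook.

```python
from math import exp, lgamma, log


def majority_vote_accuracy(p, n):
    """Probability that more than half of n independent voters are correct.

    Assumes every voter is correct with the same probability p and that
    their errors are fully independent (an idealization).
    """
    k_min = n // 2 + 1  # smallest number of correct votes that wins
    total = 0.0
    for k in range(k_min, n + 1):
        # Binomial probability of exactly k correct votes, in log space
        # to avoid overflow/underflow for large n.
        log_term = (
            lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p)
        )
        total += exp(log_term)
    return total


# Each voter is right only 51% of the time; odd n avoids ties.
for n in (1, 101, 1001, 10001):
    print(n, round(majority_vote_accuracy(0.51, n), 3))
```

Real classifiers trained on the same data are correlated, so the gain in practice is smaller than this calculation suggests.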

In [1]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.datasets import make_moons

In [2]:
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [3]:
log_clf, for_clf, svm_clf = LogisticRegression(), RandomForestClassifier(), SVC()

voting_clf = VotingClassifier(
    estimators = [('lr', log_clf), ('rf', for_clf), ('svc', svm_clf)], voting='hard')

voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr',
                              LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False)),
                             ('rf',
                              RandomForestClassifier(bootstrap=True,
                                                     ccp_alpha=0.0,
                                                     class_weight=None,
                                             

In [4]:
nms = ('LogisticRegression', 'RandomForestClassifier', 'SVC', 'VotingClassifier')
clfs = (log_clf, for_clf, svm_clf, voting_clf)
for c, n in zip(clfs, nms):
    c.fit(X_train, y_train)
    y_pred = c.predict(X_test)
    print(n, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.912


In this run, the VotingClassifier outperforms each of the individual classifiers.

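If all classifiers can estimate class probabilities (that is, they have a `predict_proba()` method), Scikit-Learn can predict the class with the highest probability, averaged over all classifiers. This is called soft voting. It often achieves higher performance than hard voting because it gives more weight to highly confident votes. Note that `SVC` does not estimate probabilities by default; it needs `probability=True`.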
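To make the difference between the two voting schemes concrete, here is a minimal sketch in plain Python. The probability values and the helper names `hard_vote` and `soft_vote` are made up for illustration; they are not part of Scikit-Learn.

```python
from collections import Counter


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def hard_vote(probas):
    """Each classifier votes for its most likely class; majority wins."""
    votes = [argmax(p) for p in probas]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(probas):
    """Average the class probabilities, then pick the most likely class."""
    n_classes = len(probas[0])
    avg = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return argmax(avg)


# Hypothetical class probabilities [P(class 0), P(class 1)] for one sample
# from three classifiers: two lean weakly toward class 1, one is very
# confident in class 0.
probas = [[0.45, 0.55], [0.45, 0.55], [0.90, 0.10]]

print("hard:", hard_vote(probas))  # two of three votes go to class 1
print("soft:", soft_vote(probas))  # averaged probabilities favor class 0
```

The single confident classifier outweighs the two hesitant ones under soft voting, which is exactly the "more weight to highly confident votes" effect.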

In [5]:
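# SVC() is left out: with the default probability=False it has no
# predict_proba(), which soft voting requires.
voting_clf2 = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', for_clf)], voting='soft')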

voting_clf2.fit(X_train, y_train)

VotingClassifier(estimators=[('lr',
                              LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='auto',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=None,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False)),
                             ('rf',
                              RandomForestClassifier(bootstrap=True,
                                                     ccp_alpha=0.0,
                                                     class_weight=None,
                                             

In [6]:
nms = ('LogisticRegression', 'RandomForestClassifier', 'SVC', 'VotingClassifier(hard)', 'VotingClassifier(soft)')
clfs = (log_clf, for_clf, svm_clf, voting_clf, voting_clf2)
for c, n in zip(clfs, nms):
    c.fit(X_train, y_train)
    y_pred = c.predict(X_test)
    print(n, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.912
SVC 0.896
VotingClassifier(hard) 0.912
VotingClassifier(soft) 0.904


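In this run, soft voting (0.904) scores 0.8 percentage points below hard voting (0.912), so it does not improve on it here. Two caveats apply: the soft voter combines only the logistic regression and the random forest (no SVC), and none of the classifiers is seeded, so the random forest score changes between runs (0.896 earlier, 0.912 here). Differences this small on a 125-sample test set should not be over-interpreted.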
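One possible way to make the hard vs. soft comparison more like-for-like is to seed every estimator and give both voters the same three classifiers, with `SVC(probability=True)` so that it can take part in soft voting. The sketch below assumes the same moons dataset as above; the helper name `compare_voting` and the choice of seed are illustrative, and the resulting scores are not claimed to match the ones printed in this notebook.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def compare_voting(seed=42):
    """Fit hard and soft voters on the moons data; return test accuracies."""
    X, y = make_moons(n_samples=500, noise=0.30, random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=seed)

    # Same three estimators for both voters; VotingClassifier clones them,
    # so sharing the list is safe.
    estimators = [
        ('lr', LogisticRegression(random_state=seed)),
        ('rf', RandomForestClassifier(random_state=seed)),
        # probability=True enables predict_proba(), needed for soft voting.
        ('svc', SVC(probability=True, random_state=seed)),
    ]

    scores = {}
    for voting in ('hard', 'soft'):
        clf = VotingClassifier(estimators=estimators, voting=voting)
        clf.fit(X_train, y_train)
        scores[voting] = accuracy_score(y_test, clf.predict(X_test))
    return scores


scores = compare_voting()
print(scores)
```

Averaging the scores over several seeds, or using cross-validation, would give a more reliable comparison than a single train/test split.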