# Ensemble learning

In [52]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, BaggingClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

**Ensemble learning** is a method in machine learning that relies on the 'wisdom of the crowd'. It combines many classifiers that each make a prediction, and the final label is chosen by majority vote.

An example of this is training a group of decision tree classifiers, each on a different random subset of the training set. You then collect the predictions of all the trees and pick the label that gets the most votes. Such an ensemble of decision trees is called a random forest.

Usually you will build an ensemble classifier near the end of a project, once you have already built a few good classifiers.

Ensembles can achieve high accuracy provided the classifiers they combine are diverse enough. If all the classifiers are trained on the same data with the same algorithm, they are likely to make the same mistakes. To improve the ensemble's overall accuracy, train them on different subsets of the data or use very different algorithms.

This increases the chance that they will make different types of errors, which improves the ensemble's accuracy.
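The 'wisdom of the crowd' effect can be illustrated with a small simulation. The sketch below assumes 1,000 hypothetical classifiers that are each right only 51% of the time and whose errors are fully independent, which is an idealization that real classifiers trained on the same data do not satisfy. The names `n_classifiers`, `n_trials`, and `p_correct` are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

n_classifiers = 1000   # hypothetical weak classifiers (assumption)
n_trials = 5000        # number of instances to classify
p_correct = 0.51       # each classifier is barely better than a coin flip

# correct[i, j] is True when classifier j is right on instance i.
correct = rng.random((n_trials, n_classifiers)) < p_correct

# The majority vote is right when more than half of the classifiers are right.
majority_correct = correct.sum(axis=1) > n_classifiers / 2

individual_accuracy = correct.mean()
majority_accuracy = majority_correct.mean()
print(f"average individual accuracy: {individual_accuracy:.3f}")
print(f"majority-vote accuracy:      {majority_accuracy:.3f}")
```

Under these assumptions the majority vote is right noticeably more often than any single voter. The gain shrinks as the classifiers' errors become correlated, which is why diversity matters.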

In [2]:
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [3]:
voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=42)),
        ('rf', RandomForestClassifier(random_state=42)),
        ('svc', SVC(random_state=42))
    ]
)

In [4]:
voting_clf.fit(X_train, y_train)

When we fit the voting classifier, it clones each estimator and fits the clones on the training data.

The original, untrained estimators are available through the `named_estimators` attribute:

In [5]:
voting_clf.named_estimators

{'lr': LogisticRegression(random_state=42),
 'rf': RandomForestClassifier(random_state=42),
 'svc': SVC(random_state=42)}

The fitted clones are available through `named_estimators_` (note the trailing underscore):

In [6]:
voting_clf.named_estimators_

{'lr': LogisticRegression(random_state=42),
 'rf': RandomForestClassifier(random_state=42),
 'svc': SVC(random_state=42)}

## how ensemble learning works

First, get the accuracy of each individual classifier on the test data:

In [7]:
for name, clf in voting_clf.named_estimators_.items():
    print(name, "=", clf.score(X_test, y_test))

lr = 0.864
rf = 0.896
svc = 0.896


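How the voting works: each fitted classifier predicts a class for the instance, and the ensemble returns the class that receives the most votes (hard voting). For the first test instance, two of the three classifiers predict class 1, so the ensemble predicts class 1.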

In [8]:
X_test[:1]

array([[0.50169252, 0.21717211]])

In [9]:
voting_clf.predict(X_test[:1])

array([1])

In [10]:
[clf.predict(X_test[:1]) for clf in voting_clf.estimators_]

[array([1]), array([1]), array([0])]
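The vote count above can be reproduced by hand. The following is a minimal sketch, assuming binary labels 0/1 and three voters as in this notebook; `votes` and `manual_pred` are names introduced here, and the shortcut of summing the votes only works for that binary case.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        ("svc", SVC(random_state=42)),
    ]
)
voting_clf.fit(X_train, y_train)

# One row per test instance, one column per fitted classifier.
votes = np.column_stack([clf.predict(X_test) for clf in voting_clf.estimators_])

# With labels 0/1 and three voters, class 1 wins when it gets at least 2 votes.
manual_pred = (votes.sum(axis=1) >= 2).astype(int)

# Compare the hand-counted majority with the ensemble's own prediction.
same_as_ensemble = np.array_equal(manual_pred, voting_clf.predict(X_test))
print(same_as_ensemble)
```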

How the voting process improves the classification:

There it is, the ensemble classifier has outperformed each of the individual classifiers!

In [11]:
voting_clf.score(X_test, y_test)

0.912

## soft voting

In [12]:
voting_clf.voting = "soft"

In [13]:
voting_clf.named_estimators['svc'].probability = True

In [14]:
voting_clf.fit(X_train, y_train)

In [15]:
voting_clf.score(X_test, y_test)

0.92

In [16]:
voting_clf.predict_proba(X_test)[:3]

array([[0.51626368, 0.48373632],
       [0.75887079, 0.24112921],
       [0.68581925, 0.31418075]])
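Soft voting averages the class probabilities of the individual classifiers and predicts the class with the highest average, which gives more weight to confident votes. The sketch below recomputes that average by hand. It assumes equal weights for all classifiers (the default when no `weights` are passed) and passes `probability=True` to `SVC` up front instead of setting it afterwards; `mean_proba` and `manual_pred` are names introduced here.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

soft_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("rf", RandomForestClassifier(random_state=42)),
        # SVC only exposes predict_proba when probability=True.
        ("svc", SVC(probability=True, random_state=42)),
    ],
    voting="soft",
)
soft_clf.fit(X_train, y_train)

# Shape: (n_classifiers, n_instances, n_classes).
probas = np.stack([clf.predict_proba(X_test) for clf in soft_clf.estimators_])

# Average the probabilities across classifiers, then pick the likeliest class.
mean_proba = probas.mean(axis=0)
manual_pred = soft_clf.classes_[mean_proba.argmax(axis=1)]

print(np.allclose(mean_proba, soft_clf.predict_proba(X_test)))
print(np.array_equal(manual_pred, soft_clf.predict(X_test)))
```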

## bagging and pasting

Another ensemble method trains the same type of estimator on different random subsets of the training set.

There are two types:

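- **bagging** : sampling with replacement (the same instance can be drawn several times for the same estimator)
- **pasting** : sampling without replacement (the same instance cannot be drawn twice for the same estimator)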

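Each individual estimator has a higher bias than if it were trained on the original training set. However, aggregating the votes reduces both bias and variance. Generally, the ensemble ends up with a similar bias but a lower variance than a single estimator trained on the original training set.

Because the estimators are trained independently of one another, each can be trained on a separate CPU core or server, so these methods scale well.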
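In scikit-learn, the difference between the two comes down to the `bootstrap` parameter of `BaggingClassifier`. The sketch below is one way to see it: it inspects `estimators_samples_` (the indices drawn for each estimator) and checks whether any index repeats. The helper `has_duplicates` and the small `n_estimators=20` are choices made here for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)


def has_duplicates(indices):
    """Return True when at least one instance index was drawn more than once."""
    return len(np.unique(indices)) < len(indices)


# bootstrap=True samples with replacement (bagging).
bagging = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=20, max_samples=100,
    bootstrap=True, random_state=42,
).fit(X, y)

# bootstrap=False samples without replacement (pasting).
pasting = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=20, max_samples=100,
    bootstrap=False, random_state=42,
).fit(X, y)

bagging_dups = [has_duplicates(idx) for idx in bagging.estimators_samples_]
pasting_dups = [has_duplicates(idx) for idx in pasting.estimators_samples_]

print("bagging subsets with repeated instances:", sum(bagging_dups), "of", len(bagging_dups))
print("pasting subsets with repeated instances:", sum(pasting_dups), "of", len(pasting_dups))
```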

In [17]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, n_jobs=-1, random_state=42)

bag_clf.fit(X_train, y_train)

The bagged ensemble typically generalizes better than a single decision tree trained on the same data. Its accuracy on the test set:

In [18]:
bag_clf.score(X_test, y_test)

0.904
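To check that claim on this dataset, one option is to score a single tree next to the ensemble. The sketch below assumes that a single unrestricted `DecisionTreeClassifier` is the intended baseline (the notebook does not say so explicitly); `tree_clf`, `tree_score`, and `bag_score` are names introduced here.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Baseline: one decision tree trained on the full training set.
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)

# Ensemble: 500 trees, each trained on 100 instances sampled with replacement.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500, max_samples=100, random_state=42
)
bag_clf.fit(X_train, y_train)

tree_score = tree_clf.score(X_test, y_test)
bag_score = bag_clf.score(X_test, y_test)
print(f"single tree: {tree_score:.3f}")
print(f"bagging:     {bag_score:.3f}")
```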

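With bagging, each estimator sees only part of the training set: because sampling is done with replacement, some instances are drawn several times and others not at all. The instances an estimator never sees are known as its out-of-bag (OOB) instances. Since the estimator was not trained on them, they can be used to validate a bagging ensemble without a separate validation set.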
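How many instances end up out-of-bag can be estimated with a short calculation. The sketch below assumes each estimator draws as many instances as the training set contains (`m = 375` here, that is 75% of the 500 samples), with replacement. The chance that a given instance is never drawn is then `(1 - 1/m) ** m`, which approaches `1/e` as `m` grows. The simulation and the names `p_oob` and `empirical_oob` are illustrative.

```python
import numpy as np

m = 375  # training set size in this notebook (75% of 500 samples)

# Probability that a given instance is never drawn in m draws with replacement.
p_oob = (1 - 1 / m) ** m
print(f"analytic OOB fraction:  {p_oob:.3f}")

# Empirical check: repeat the bootstrap draw and count the unseen instances.
rng = np.random.default_rng(42)
fractions = []
for _ in range(1000):
    sample = rng.integers(0, m, size=m)           # one bootstrap sample of indices
    fractions.append(1 - len(np.unique(sample)) / m)
empirical_oob = float(np.mean(fractions))
print(f"simulated OOB fraction: {empirical_oob:.3f}")
```

So each estimator leaves roughly 37% of the training instances unseen and trains on roughly 63% of them.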

In [19]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, n_jobs=-1, random_state=42, oob_score=True)

In [20]:
bag_clf.fit(X_train, y_train)


According to the OOB evaluation, `bag_clf` is likely to achieve about 89.6% accuracy on unseen data:

In [21]:
bag_clf.oob_score_

0.896

In [22]:
bag_clf.score(X_test, y_test)

0.912

In [23]:
bag_clf.oob_decision_function_[:10]

array([[0.32352941, 0.67647059],
       [0.3375    , 0.6625    ],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.06145251, 0.93854749],
       [0.35465116, 0.64534884],
       [0.01142857, 0.98857143],
       [0.98930481, 0.01069519],
       [0.97927461, 0.02072539]])

## Random patches and random subspaces

## Random forest

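The `RandomForestClassifier` is an ensemble of decision trees, generally trained with bagging, that is more convenient and better optimized for decision trees than wrapping a `DecisionTreeClassifier` in a `BaggingClassifier`.

With a few exceptions, `RandomForestClassifier` has all the hyperparameters of `DecisionTreeClassifier` (to control how the trees are grown) and of `BaggingClassifier` (to control the ensemble itself).

Instead of searching for the best feature among all features when splitting a node, it searches among a random subset of features (sqrt(n) by default, where n is the number of features). This introduces further randomness and greater tree diversity.

The result trades a higher bias for a lower variance, which generally yields a better model overall.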

In [24]:
rndf_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, random_state=42, n_jobs=-1)
rndf_clf.fit(X_train, y_train)

In [25]:
rndf_clf.score(X_test, y_test)

0.912

The above code is roughly equivalent to the following `BaggingClassifier`:

In [26]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(max_features='sqrt', max_leaf_nodes=16), random_state=42, n_jobs=-1, n_estimators=500)

bag_clf.fit(X_train, y_train)

bag_clf.score(X_test, y_test)

0.912

**extremely randomised trees** You can speed up tree construction by setting the splitter to random. Each node is then split using a random threshold instead of searching for the best possible threshold (as a regular decision tree does).

Here we get the same test accuracy even though training is faster.

In [27]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(splitter='random', max_features='sqrt', max_leaf_nodes=16), random_state=42, n_jobs=-1, n_estimators=500)

bag_clf.fit(X_train, y_train)

bag_clf.score(X_test, y_test)

0.912

**ExtraTreesClassifier** has the same API as the random forest classifier, except that it uses random split thresholds and its `bootstrap` parameter defaults to `False`.

**bootstrap**: randomly sampling the data with replacement to create multiple training sets, which are then used to train multiple classifiers.

If `bootstrap` is set to `False`, the entire dataset is used to train every tree.

In [28]:
extr_clf = ExtraTreesClassifier(n_estimators=500, n_jobs=-1, random_state=42, max_leaf_nodes=16)
extr_clf.fit(X_train, y_train)

extr_clf.score(X_test, y_test)

0.912

## feature importance

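A random forest measures the importance of a feature by looking at how much the tree nodes that use that feature reduce impurity on average, across all trees in the forest. It is a weighted average, where each node's weight is the number of training samples associated with it. The scores are scaled so that they sum to 1.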

In [34]:
iris = load_iris(as_frame=True)

In [49]:
rndf_clf = RandomForestClassifier(n_estimators=500, random_state=42)

rndf_clf.fit(iris.data, iris.target)

In [50]:
for feature, importance in zip(iris.data.columns, rndf_clf.feature_importances_):
    print(feature, round(importance, 2))

sepal length (cm) 0.11
sepal width (cm) 0.02
petal length (cm) 0.44
petal width (cm) 0.42
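The forest-level scores can be related back to the individual trees. The sketch below checks numerically, rather than taking it for granted, that the forest's `feature_importances_` matches the plain average of each tree's own `feature_importances_` and that the scores sum to 1. It loads iris as NumPy arrays instead of a DataFrame to stay self-contained, and uses 100 trees to keep it quick; `per_tree` and `averaged` are names introduced here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(iris.data, iris.target)

# One row per tree, one column per feature.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

for name, score in zip(iris.feature_names, averaged):
    print(name, round(float(score), 2))

# Compare the hand-computed average with the forest's own attribute.
print(np.allclose(averaged, forest.feature_importances_))
print(round(float(forest.feature_importances_.sum()), 6))
```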


In [51]:
rndf_clf.score(iris.data, iris.target)

1.0

## Boosting

Boosting refers to any ensemble method that combines several weak learners into a strong learner. The general idea is to train the learners sequentially, each one trying to correct the errors of its predecessor.

Two popular methods are:

1. **AdaBoost**, short for adaptive boosting
2. **gradient boosting**

### AdaBoost

One way for a new learner to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfit.

For example, the algorithm first trains a base classifier, such as a decision tree, and uses it to make predictions on the training set. **The algorithm then increases the relative weights of the misclassified training instances.** It trains a second classifier using the updated weights, and so on.

It is somewhat like gradient descent, except that instead of tweaking a single predictor's parameters to minimize a cost function, AdaBoost adds predictors to the ensemble, gradually making it better.

One drawback is that training is sequential: each learner can only be trained after the previous one, so training cannot be parallelized across learners. As a result, it does not scale as well as bagging or pasting.
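The reweighting loop described above can be sketched by hand. This is a simplified version of discrete AdaBoost for binary labels 0/1, using decision stumps, 30 rounds, and a learning rate of 0.5; it is not scikit-learn's implementation, and `AdaBoostClassifier` may differ in its details. The names `weights`, `alphas`, `stumps`, and `predict` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

m = len(X_train)
weights = np.full(m, 1 / m)  # start with equal instance weights
learning_rate = 0.5
stumps, alphas = [], []

for _ in range(30):
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X_train, y_train, sample_weight=weights)
    wrong = stump.predict(X_train) != y_train

    # Weighted error rate of this stump on the training set.
    error = weights[wrong].sum() / weights.sum()
    if error <= 0 or error >= 0.5:
        break  # perfect, or no better than chance: stop adding learners

    # The stump's say in the final vote grows as its error shrinks.
    alpha = learning_rate * np.log((1 - error) / error)

    # Boost the weights of the misclassified instances, then renormalize.
    weights[wrong] *= np.exp(alpha)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)


def predict(X_new):
    """Weighted vote: map labels {0, 1} to {-1, +1} and sum the alpha-weighted votes."""
    votes = sum(a * np.where(s.predict(X_new) == 1, 1, -1) for s, a in zip(stumps, alphas))
    return (votes > 0).astype(int)


test_accuracy = (predict(X_test) == y_test).mean()
print(f"{len(stumps)} stumps, test accuracy: {test_accuracy:.3f}")
```

Each pass through the loop depends on the weights produced by the previous pass, which is why the rounds cannot run in parallel.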

In [78]:
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=30, learning_rate=0.5, random_state=42)

In [79]:
ada_clf.fit(X_train, y_train)

In [80]:
ada_clf.score(X_test, y_test)

0.904

## gradient boosting