# CHAPTER 7: ENSEMBLE LEARNING AND RANDOM FORESTS

## Notes

- A group of predictors is called an ensemble; thus, this technique is called ***Ensemble Learning***.

- As an example of an ensemble method, you can train a group of decision tree classifiers, each on a different random subset of the training set. To make a prediction, you obtain the predictions of all the individual trees and then predict the class that gets the most votes. Such an ensemble of decision trees is called a Random Forest, and despite its simplicity it is one of the most powerful ML algorithms available today.

### Voting Classifiers

- This method is fairly straightforward: you train several different predictors on the same training data and then aggregate their predictions.

- For a classification problem you take the mode of the predictions (the majority vote); for a regression problem you take the mean of the individual regressors' predictions.

- Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms. This increases the chance that they will make very different types of errors, improving the ensemble's accuracy.
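
To see why independence matters, here is a minimal sketch that computes the exact accuracy of a majority vote. It assumes every classifier is correct with the same probability p and that their errors are fully independent, which is an idealization: real classifiers trained on the same data are correlated. The function name majority_vote_accuracy is made up for this illustration.

```python
from math import comb


def majority_vote_accuracy(n_classifiers: int, p: float) -> float:
    """Probability that a strict majority of n independent voters is correct."""
    needed = n_classifiers // 2 + 1  # smallest strict majority
    return sum(
        comb(n_classifiers, k) * p**k * (1 - p) ** (n_classifiers - k)
        for k in range(needed, n_classifiers + 1)
    )


# One barely-better-than-random voter versus 1001 of them voting together.
single = majority_vote_accuracy(1, 0.51)
ensemble = majority_vote_accuracy(1001, 0.51)
print(f"single: {single:.3f}, ensemble of 1001: {ensemble:.3f}")
```

Under these assumptions the ensemble is clearly more accurate than any single voter; the more the voters' errors are correlated, the smaller this gain becomes.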

#### Example Voting classifier


In [1]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons



In [2]:
# Create the training set.

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [3]:
log_clf = LogisticRegression(penalty=None)
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(estimators=[("lr", log_clf),
                                          ("rand_forest", rnd_clf),
                                          ("svm", svm_clf)],
                              voting="hard")

voting_clf.fit(X_train, y_train)

In [4]:
# Test-set accuracy of the voting classifier (compare it with each individual classifier's score).
voting_clf.score(X_test, y_test)

0.904

- If all classifiers can estimate class probabilities (that is, they have a predict_proba method), you can tell scikit-learn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called soft voting. It often achieves higher performance than hard voting because it gives more weight to highly confident votes. All you need to do is replace voting="hard" with voting="soft" and ensure that all classifiers can estimate class probabilities. Note that SVC does not do so by default; you need to set probability=True.
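
A minimal soft-voting sketch, reusing the same moons dataset as the hard-voting example. The variable names (soft_voting_clf, soft_proba, soft_score) and the random_state values are choices made for this illustration, and LogisticRegression is left at its defaults here.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

soft_voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("rand_forest", RandomForestClassifier(random_state=42)),
        # SVC only exposes predict_proba when probability=True.
        ("svm", SVC(probability=True, random_state=42)),
    ],
    voting="soft",
)
soft_voting_clf.fit(X_train, y_train)

# Averaged class probabilities, one row per test instance.
soft_proba = soft_voting_clf.predict_proba(X_test)
soft_score = soft_voting_clf.score(X_test, y_test)
print(soft_score)
```

Whether soft voting actually beats hard voting depends on the dataset and on how well calibrated each classifier's probabilities are, so it is worth comparing both scores.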

### Bagging and Pasting

- Bagging and pasting use the same training algorithm for every predictor but train each one on a different random subset of the training set. When sampling is performed with replacement, the method is called bagging (short for bootstrap aggregating). When sampling is performed without replacement, it is called pasting.

- Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. The aggregation function is typically the statistical mode for classification, or the average for regression. Generally, the net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set.
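
The two steps above (resample, then aggregate) can be sketched by hand. This is only an illustration of the idea, not how BaggingClassifier is implemented; the number of trees and the seeds are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rng = np.random.default_rng(0)
n_estimators, m = 51, len(X_train)  # odd count, so a binary vote cannot tie

trees = []
for i in range(n_estimators):
    # Bagging: draw m row indices with replacement.
    # For pasting one could use rng.choice(m, size=m // 2, replace=False).
    idx = rng.integers(0, m, size=m)
    tree = DecisionTreeClassifier(random_state=i)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Labels are 0/1, so the mode of the votes is "mean vote > 0.5".
votes = np.stack([tree.predict(X_test) for tree in trees])
manual_pred = (votes.mean(axis=0) > 0.5).astype(int)
manual_accuracy = (manual_pred == y_test).mean()
print(manual_accuracy)
```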

#### Example bagging

In [10]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            bootstrap=True, random_state=1337)
bag_clf.fit(X_train, y_train)
bag_clf.score(X_test, y_test)

0.904

In [11]:
from sklearn.model_selection import cross_val_score
## using bagging model

scores = cross_val_score(bag_clf, X_train, y_train, cv=5, verbose=2)
scores

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s


[CV] END .................................................... total time=   0.1s
[CV] END .................................................... total time=   0.1s
[CV] END .................................................... total time=   0.1s
[CV] END .................................................... total time=   0.0s
[CV] END .................................................... total time=   0.1s


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.7s finished


array([0.86666667, 0.92      , 0.88      , 0.88      , 0.88      ])

In [16]:
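# Note: with max_samples=1.0 and bootstrap=False every tree is trained on the
# full training set; for true pasting, set max_samples below 1.0.
past_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                             max_samples=1.0, bootstrap=False, random_state=1337)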

scores = cross_val_score(past_clf, X_train, y_train,
                         cv=5)
scores

array([0.84      , 0.89333333, 0.86666667, 0.85333333, 0.86666667])

- With bagging, on average only about 63% of the training instances are sampled for each predictor. The sample usually has the same size as the training set, so the remaining slots are filled by duplicate rows.

- This also means that each predictor never sees about 37% of the training instances, which are called its out-of-bag (oob) instances. We can evaluate the ensemble on those instances without a separate validation set: set oob_score=True and read the result from the oob_score_ attribute after training.
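
The 63% figure comes from the probability that a given instance is picked at least once in m draws with replacement, 1 - (1 - 1/m)^m, which approaches 1 - 1/e as m grows. A quick simulation (the sample size and seed are arbitrary) agrees with it:

```python
import numpy as np

rng = np.random.default_rng(42)
m = 10_000  # size of the (hypothetical) training set

# One bootstrap sample: m row indices drawn with replacement.
bootstrap_idx = rng.integers(0, m, size=m)

in_bag_fraction = np.unique(bootstrap_idx).size / m
theoretical = 1 - (1 - 1 / m) ** m

print(f"in-bag: {in_bag_fraction:.3f}, theory: {theoretical:.3f}")
print(f"out-of-bag: {1 - in_bag_fraction:.3f}")
```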

#### Example OOB score

In [20]:
bag_clf = BaggingClassifier(DecisionTreeClassifier(), 
                            n_estimators=100,
                            bootstrap=True, 
                            oob_score=True,
                            random_state=1337)


bag_clf.fit(X_train, y_train)
bag_clf.oob_score_, bag_clf.score(X_test, y_test), bag_clf.score(X_train, y_train)
# oob_score_ is the accuracy of the out-of-bag predictions on the training instances.

(0.896, 0.904, 1.0)

In [19]:
bag_clf.oob_decision_function_

array([[0.38095238, 0.61904762],
       [0.45      , 0.55      ],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.05128205, 0.94871795],
       [0.31428571, 0.68571429],
       [0.        , 1.        ],
       [1.        , 0.        ],
       [0.89473684, 0.10526316],
       [0.79487179, 0.20512821],
       [0.        , 1.        ],
       [0.80952381, 0.19047619],
       [0.91176471, 0.08823529],
       [0.97560976, 0.02439024],
       [0.02857143, 0.97142857],
       [0.02777778, 0.97222222],
       [1.        , 0.        ],
       [1.        , 0.        ],
       [1.        , 0.        ],
       [0.02941176, 0.97058824],
       [0.28571429, 0.71428571],
       [0.975     , 0.025     ],
       [1.        , 0.        ],
       [0.925     , 0.075     ],
       [0.        , 1.        ],
       [1.        , 0.        ],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.62857143, 0.37142857],
       [0.

- The oob decision function for each training instance is also available through the oob_decision_function_ attribute. Because the base estimator has a predict_proba method, the decision function returns class probabilities for each training instance.

- Each row is the average of the class probabilities predicted by the predictors for which that training instance was out-of-bag.

- max_samples and bootstrap control how training instances are sampled; in the same way, max_features and bootstrap_features control how features are sampled.

- max_features controls how many features are selected for each predictor, and bootstrap_features controls whether features are sampled with replacement. It is False by default, so no feature is duplicated.

- Sampling both training instances and features is called the Random Patches method. Keeping all training instances but sampling features is called the Random Subspaces method.
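
A sketch of both configurations with BaggingClassifier. The synthetic 10-feature dataset and the specific fractions (0.7, 0.5) are assumptions chosen only to make the feature sampling visible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           random_state=42)

# Random Patches: sample both instances and features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=20,
    max_samples=0.7, bootstrap=True,
    max_features=0.5, bootstrap_features=True,
    random_state=42,
).fit(X, y)

# Random Subspaces: keep every instance, sample only the features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=20,
    max_samples=1.0, bootstrap=False,
    max_features=0.5, bootstrap_features=True,
    random_state=42,
).fit(X, y)

# Feature indices drawn for the first tree of each ensemble.
print(patches_clf.estimators_features_[0])
print(subspaces_clf.estimators_features_[0])
```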

### Random Forests

- Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can use the RandomForestClassifier class, which is more convenient and optimized for decision trees.

- The Random Forest algorithm introduces extra randomness when growing trees: instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. This results in greater tree diversity, which trades a higher bias for a lower variance, generally yielding an overall better model.

- Another great quality of the random forests is they make it easy to measure the relative importance of each feature.

- You can access the result through the feature_importances_ attribute.

- Random Forests are very handy for getting a quick understanding of which features actually matter, in particular if you need to perform feature selection.
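
A short sketch of reading feature importances, using the iris dataset as a stand-in (the dataset, the number of trees and the seed are choices made for this example, not part of the notes above):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris.data, iris.target)

# One score per feature; scikit-learn scales them so they sum to 1.
importances = rnd_clf.feature_importances_
for name, score in zip(iris.feature_names, importances):
    print(f"{name}: {score:.3f}")
```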

### Boosting

- The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor.

- Two of the most popular boosting methods are AdaBoost (short for Adaptive Boosting) and Gradient Boosting.

- One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases. This is the technique used by AdaBoost.

- Once all predictors are trained, the ensemble makes predictions very much like bagging or pasting, except that predictors have different weights depending on their overall accuracy on the weighted training set.

- Gradient Boosting, by contrast, fits each new predictor to the residual errors made by the previous predictor instead of reweighting the training instances.



In [None]:
from sklearn.ensemble import AdaBoostClassifier

# AdaBoost over decision stumps (trees with max_depth=1).
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    learning_rate=0.5
)
ada_clf.fit(X_train, y_train)
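
The residual-fitting idea behind Gradient Boosting can also be sketched by hand with three small regression trees. The noisy quadratic dataset, the tree depth and the helper name boosted_predict are assumptions made for this illustration; scikit-learn's GradientBoostingRegressor does the same kind of thing with more options, such as a learning rate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data: a noisy quadratic.
rng = np.random.default_rng(42)
X = rng.uniform(-0.5, 0.5, size=(200, 1))
y = 3 * X[:, 0] ** 2 + 0.05 * rng.normal(size=200)

trees = []
residual = y.copy()
for _ in range(3):
    # Each tree is trained on what the previous trees still get wrong.
    tree = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, residual)
    residual = residual - tree.predict(X)
    trees.append(tree)


def boosted_predict(X_new, n_trees=3):
    """Ensemble prediction: the sum of the first n_trees trees' predictions."""
    return sum(tree.predict(X_new) for tree in trees[:n_trees])


mse_one = np.mean((y - boosted_predict(X, 1)) ** 2)
mse_three = np.mean((y - boosted_predict(X, 3)) ** 2)
print(f"training MSE, 1 tree: {mse_one:.5f}; 3 trees: {mse_three:.5f}")
```

The training error shrinks as trees are added; on held-out data it eventually rises again once the ensemble starts to overfit, which is why the number of trees is usually tuned.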