A group of predictors is called an *ensemble*, so a technique that aggregates the predictions of a group of predictors is called ***Ensemble Learning***.

# Voting Classifiers

Imagine training three classifiers: a Logistic Regression classifier, an SVM classifier, and a Random Forest classifier. 

Now we aggregate their predictions, and whatever class the majority predicts is our final prediction. This is called *hard voting*.

So, if two of the three predict **True**, we predict **True**.
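
The majority rule can be sketched in a few lines of NumPy. This is only an illustration: the prediction matrix and the helper name `hard_vote` are made up here and are not part of Scikit-Learn.

```python
import numpy as np

# Hypothetical predictions from three classifiers on five instances
# (rows = classifiers, columns = instances); the values are made up.
predictions = np.array([
    [1, 0, 1, 1, 0],   # classifier A
    [1, 1, 0, 1, 0],   # classifier B
    [0, 0, 1, 1, 1],   # classifier C
])

def hard_vote(preds):
    """Return the most frequent class label in each column."""
    n_classes = preds.max() + 1
    # Count the votes per class for every instance, then pick the winner.
    votes = np.apply_along_axis(np.bincount, 0, preds, minlength=n_classes)
    return votes.argmax(axis=0)

majority = hard_vote(predictions)
print(majority)  # [1 0 1 1 0]
```

With an odd number of binary classifiers there are no ties; with an even number, `argmax` would silently favor the lower class label.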

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [2]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression(solver="lbfgs", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svm_clf = SVC(gamma="scale", random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')

In [3]:
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr',
                              LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='warn',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=42,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False)),
                             ('rf',
                              RandomForestClassifier(bootstrap=True,
                                                     class_weight=None,
                                                     criterion='gini',
                                            

In [4]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.912


## Soft Voting

In [5]:
log_clf = LogisticRegression(solver="lbfgs", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svm_clf_soft = SVC(gamma='scale', probability=True, random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf_soft)],
    voting='soft')

voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('lr',
                              LogisticRegression(C=1.0, class_weight=None,
                                                 dual=False, fit_intercept=True,
                                                 intercept_scaling=1,
                                                 l1_ratio=None, max_iter=100,
                                                 multi_class='warn',
                                                 n_jobs=None, penalty='l2',
                                                 random_state=42,
                                                 solver='lbfgs', tol=0.0001,
                                                 verbose=0, warm_start=False)),
                             ('rf',
                              RandomForestClassifier(bootstrap=True,
                                                     class_weight=None,
                                                     criterion='gini',
                                            

In [6]:
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.92
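
Soft voting averages the class probabilities returned by each classifier's `predict_proba` and picks the class with the highest mean, which is why the SVC above needs `probability=True`. A minimal sketch of the difference from hard voting, using made-up probabilities for a single instance:

```python
import numpy as np

# Hypothetical predict_proba outputs for ONE instance from three classifiers
# (columns = classes 0 and 1); the numbers are made up for illustration.
probas = np.array([
    [0.55, 0.45],   # classifier A leans weakly towards class 0
    [0.60, 0.40],   # classifier B leans weakly towards class 0
    [0.10, 0.90],   # classifier C is very confident in class 1
])

# Hard voting: each classifier casts one vote for its most likely class.
hard_winner = np.bincount(probas.argmax(axis=1)).argmax()

# Soft voting: average the probabilities, then take the most likely class.
soft_winner = probas.mean(axis=0).argmax()

print(hard_winner, soft_winner)  # 0 1
```

Soft voting gives more weight to confident votes, so the two schemes can disagree, as they do here.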


# Bagging and Pasting

When we use the same training algorithm for every predictor and train each one on a different random subset of the training set, sampled ***WITH*** replacement, it is called *bagging*, which is short for **bootstrap aggregating**.  

When the sampling is done ***WITHOUT*** replacement, it's called **pasting**.  

Both methods allow training instances to be sampled several times across multiple predictors; however, only bagging allows training instances to be sampled several times for the ***SAME*** predictor.
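
The difference between the two sampling schemes can be seen on a toy index array. This is a sketch with arbitrary sizes, not how Scikit-Learn draws its samples internally:

```python
import numpy as np

rng = np.random.default_rng(42)
n_instances, subset_size = 10, 6   # toy sizes, chosen only for illustration
indices = np.arange(n_instances)

# Bagging: sample WITH replacement, so an index may repeat within one subset.
bagging_subset = rng.choice(indices, size=subset_size, replace=True)

# Pasting: sample WITHOUT replacement, so every index in a subset is unique.
pasting_subset = rng.choice(indices, size=subset_size, replace=False)

print("bagging:", np.sort(bagging_subset))
print("pasting:", np.sort(pasting_subset))
```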

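Keypoint 1: Aggregation reduces ***BOTH*** bias and variance; the ensemble typically ends up with a similar bias but a lower variance than a single predictor trained on the original training set. 

Keypoint 2: Bagging often results in better models than pasting, which is why it is generally preferred. 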

## Bagging and Pasting in Scikit-Learn

In [7]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

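bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),  # base predictor, cloned for each ensemble member
    n_estimators=500,          # number of trees in the ensemble
    max_samples=100,           # each tree is trained on 100 sampled instances
    bootstrap=True,            # sample with replacement (False gives pasting)
    n_jobs=-1)                 # use all available CPU cores

bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)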

## Out-of-Bag Evaluation

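With bagging and ```max_samples``` equal to the size of the training set, only about 63% of the training instances are sampled on average for each predictor. The remaining 37% that are not sampled are called *out-of-bag* (oob) instances. Note that they are not the same 37% for every predictor. 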
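
Where the 63% comes from: when drawing m instances with replacement from a set of m, a given instance is picked at least once with probability 1 - (1 - 1/m)^m, which tends to 1 - 1/e ≈ 0.632 as m grows. A quick numerical check, assuming m = 375 (the size of the moons training set after the default 75/25 split):

```python
import numpy as np

m = 375  # assumed training-set size (75% of the 500 moons samples)

# Probability that a given instance is picked at least once in m draws
# with replacement: 1 - (1 - 1/m)^m, which tends to 1 - 1/e as m grows.
p_sampled = 1 - (1 - 1 / m) ** m
limit = 1 - np.exp(-1)

# Empirical check: draw one bootstrap sample and count the unique instances.
rng = np.random.default_rng(0)
sample = rng.integers(0, m, size=m)
empirical = np.unique(sample).size / m

print(f"exact: {p_sampled:.3f}, limit: {limit:.3f}, empirical: {empirical:.3f}")
```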

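Because a predictor never sees its own oob instances during training, it can be evaluated on them without a separate validation set. The ensemble itself can then be evaluated by averaging the oob evaluations of each predictor. In Scikit-Learn, setting ```oob_score``` to ```True``` requests this evaluation automatically after training, and the result is available in ```oob_score_```.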

In [8]:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), 
    n_estimators=500, 
    bootstrap=True, 
    n_jobs=-1, 
    oob_score=True)

bag_clf.fit(X_train, y_train)
bag_clf.oob_score_

0.896

In [9]:
y_pred = bag_clf.predict(X_test)

## Random Patches and Random Subspaces

If we have a high-dimensional dataset, we may want to only sample some of the features. 

The **Random Patches** method is when we sample some of the features as well as some of the training instances.

The **Random Subspaces** method is when we sample some of the features but all of the training instances. 
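
In Scikit-Learn, feature sampling in `BaggingClassifier` is controlled by `max_features` and `bootstrap_features`, which mirror `max_samples` and `bootstrap` for instances. The sketch below shows one possible configuration of each method; the synthetic dataset and the sampling ratios are arbitrary choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic high-dimensional data, generated only for this illustration.
X_hd, y_hd = make_classification(n_samples=300, n_features=40,
                                 n_informative=10, random_state=42)

# Random Patches: sample both instances and features.
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=0.7, bootstrap=True,             # sample instances
    max_features=0.5, bootstrap_features=True,   # sample features
    random_state=42)

# Random Subspaces: keep all instances, sample only features.
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=50,
    max_samples=1.0, bootstrap=False,            # use every instance
    max_features=0.5, bootstrap_features=True,   # sample features
    random_state=42)

patches_clf.fit(X_hd, y_hd)
subspaces_clf.fit(X_hd, y_hd)

# Each predictor was trained on 20 of the 40 feature columns.
print(len(subspaces_clf.estimators_features_[0]))
```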

# Random Forests  
A Random Forest is an ensemble of Decision Trees, generally trained via the bagging method (or sometimes pasting), typically with ```max_samples``` set to the size of the training set.


In [10]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, 
                                max_leaf_nodes=16,
                                n_jobs=-1,
                                random_state=42)

rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)

In [11]:
accuracy_score(y_test, y_pred_rf)

0.912
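
To connect this back to bagging: a Random Forest behaves roughly like a `BaggingClassifier` wrapped around Decision Trees that consider only a random subset of features at each split. The sketch below is an approximation under that assumption, not the actual `RandomForestClassifier` implementation, and it regenerates the moons data so that it runs on its own.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same moons dataset and split as at the top of the notebook.
X_m, y_m = make_moons(n_samples=500, noise=0.30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_m, y_m, random_state=42)

# Bagged trees that each look at a random subset of features per split.
bag_rf_clf = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", max_leaf_nodes=16),
    n_estimators=500,
    random_state=42)
bag_rf_clf.fit(X_tr, y_tr)

bag_rf_acc = accuracy_score(y_te, bag_rf_clf.predict(X_te))
print(bag_rf_acc)
```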

## Extra-Trees

We can make a Random Forest even more random by using random thresholds for each feature instead of searching for the best possible thresholds. This algorithm is called an Extremely Randomized Trees ensemble or Extra-Trees for short. 

In [12]:
from sklearn.ensemble import ExtraTreesClassifier

extra_trees_clf = ExtraTreesClassifier(n_estimators=500, 
                                   max_leaf_nodes=16, 
                                   n_jobs=-1,
                                   random_state=42 )
extra_trees_clf.fit(X_train, y_train)

y_pred_extra = extra_trees_clf.predict(X_test)

In [13]:
accuracy_score(y_test, y_pred_extra)

0.912

## Feature Importance

After training, we can inspect the importance of each feature.

In [14]:
from sklearn.datasets import load_iris
iris =  load_iris()

In [15]:
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rnd_clf.fit(iris["data"], iris['target'])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)

sepal length (cm) 0.11249225099876375
sepal width (cm) 0.02311928828251033
petal length (cm) 0.4410304643639577
petal width (cm) 0.4233579963547682


Consequently, if we want to perform feature selection, we now have a much better idea of which feature(s) to drop.
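
One simple way to act on these scores is to rank the features and keep only the top ones. The sketch below does this by hand; the cut-off of two features is an arbitrary choice for illustration, and the forest is retrained with fewer trees to keep the example quick.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(iris["data"], iris["target"])

# Rank the features from most to least important.
order = np.argsort(forest.feature_importances_)[::-1]
for idx in order:
    name = iris["feature_names"][idx]
    print(f"{name:<20} {forest.feature_importances_[idx]:.3f}")

# Keep the two top-ranked features; the cut-off of 2 is an arbitrary choice.
X_reduced = iris["data"][:, order[:2]]
print(X_reduced.shape)  # (150, 2)
```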

# Boosting

Boosting refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. The most popular are AdaBoost and Gradient Boosting. 

## AdaBoost
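AdaBoost corrects its predecessor by paying more attention to the training instances that the predecessor underfitted, so each new predictor focuses more and more on the hard cases.

The big drawback of any sequential learning technique is that it cannot be fully parallelized, because each predictor can only be trained after the previous one has been trained and evaluated, so it does not scale as well as bagging or pasting. 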
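
The reweighting idea can be sketched from scratch with decision stumps. This follows one common textbook formulation for binary labels (weighted error rate, predictor weight, exponential weight update) and is not necessarily what `AdaBoostClassifier` does internally; the 20 rounds and the variable names are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X_ab, y_ab = make_moons(n_samples=500, noise=0.30, random_state=42)
m = len(X_ab)
weights = np.full(m, 1 / m)      # start with uniform instance weights
learning_rate = 0.5
stumps, alphas = [], []

for _ in range(20):
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X_ab, y_ab, sample_weight=weights)
    wrong = stump.predict(X_ab) != y_ab

    # Weighted error rate of this predictor (clipped to avoid log(0)).
    r = np.clip(weights[wrong].sum() / weights.sum(), 1e-10, 1 - 1e-10)
    # Predictor weight: large when the predictor is accurate.
    alpha = learning_rate * np.log((1 - r) / r)

    # Boost the weights of the misclassified instances, then renormalize.
    weights[wrong] *= np.exp(alpha)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: weighted vote of all stumps (labels mapped to -1/+1).
votes = sum(a * (2 * s.predict(X_ab) - 1) for a, s in zip(alphas, stumps))
y_pred_sketch = (votes > 0).astype(int)
train_acc = (y_pred_sketch == y_ab).mean()
print(train_acc)
```

Each pass of the loop depends on the weights produced by the previous pass, which is the sequential dependency that prevents full parallelization.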

In [16]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm='SAMME.R', learning_rate=0.5)

ada_clf.fit(X_train, y_train)

y_pred_ada = ada_clf.predict(X_test)

accuracy_score(y_test, y_pred_ada)

0.896

## Gradient Boosting
What's the difference between Gradient Boosting and AdaBoost? 
* AdaBoost tweaks the instance weights at every iteration
* Gradient Boosting tries to fit the new predictor to the *residual errors* made by the previous predictor

In [17]:
import numpy as np 

np.random.seed(42)
X = np.random.rand(100, 1) - 0.5
y = 3*X[:, 0]**2 + 0.05 * np.random.randn(100)

In [18]:
from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X,y)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

In [19]:
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

In [20]:
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')

In [21]:
X_new = np.array([[0.8]])

In [22]:
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
y_pred

array([0.75026781])

Or, in fewer lines, using Scikit-Learn's ```GradientBoostingRegressor```:

In [23]:
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X,y)

GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
                          learning_rate=1.0, loss='ls', max_depth=2,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=3,
                          n_iter_no_change=None, presort='auto',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)

Finally, a key thing to remember is that you can spend an inordinate amount of time hand-tuning the hyperparameters, or you can simply use XGBoost and tune its hyperparameters with Scikit-Learn's ```GridSearchCV``` or ```RandomizedSearchCV```.

In [24]:
import xgboost 
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)

xgb_reg = xgboost.XGBRegressor(random_state=42)
xgb_reg.fit(X_train, y_train, 
            eval_set=[(X_val, y_val)], early_stopping_rounds=2)

[0]	validation_0-rmse:0.22055
Will train until validation_0-rmse hasn't improved in 2 rounds.
[1]	validation_0-rmse:0.16547
[2]	validation_0-rmse:0.12243
[3]	validation_0-rmse:0.10044
[4]	validation_0-rmse:0.08467
[5]	validation_0-rmse:0.07344
[6]	validation_0-rmse:0.06728
[7]	validation_0-rmse:0.06383
[8]	validation_0-rmse:0.06125
[9]	validation_0-rmse:0.05959
[10]	validation_0-rmse:0.05902
[11]	validation_0-rmse:0.05852
[12]	validation_0-rmse:0.05844
[13]	validation_0-rmse:0.05801
[14]	validation_0-rmse:0.05747
[15]	validation_0-rmse:0.05772
[16]	validation_0-rmse:0.05778
Stopping. Best iteration:
[14]	validation_0-rmse:0.05747



XGBRegressor(base_score=0.5, booster=None, colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints=None,
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints=None,
             n_estimators=100, n_jobs=0, num_parallel_tree=1,
             objective='reg:squarederror', random_state=42, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method=None,
             validate_parameters=False, verbosity=None)

In [25]:
y_pred = xgb_reg.predict(X_val)
mean_squared_error(y_val, y_pred)

0.0033024080171411836
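
The hyperparameter search mentioned earlier can be sketched with `GridSearchCV`. To keep the example dependent on Scikit-Learn only, it uses `GradientBoostingRegressor` as a stand-in; an `XGBRegressor` could in principle be passed in its place, since it follows the Scikit-Learn estimator interface. The grid values are arbitrary choices, not recommendations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Same noisy quadratic data as in the Gradient Boosting cells.
np.random.seed(42)
X_q = np.random.rand(100, 1) - 0.5
y_q = 3 * X_q[:, 0] ** 2 + 0.05 * np.random.randn(100)

# Small, arbitrary grid; every combination is evaluated with 3-fold CV.
param_grid = {
    "n_estimators": [10, 50, 100],
    "learning_rate": [0.1, 0.3],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",  # higher (closer to 0) is better
    cv=3)
search.fit(X_q, y_q)

print(search.best_params_)
print(-search.best_score_)  # cross-validated MSE of the best combination
```

`RandomizedSearchCV` takes the same estimator but samples a fixed number of combinations instead of trying them all, which tends to scale better when the grid is large.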

## Stacking

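Instead of using a trivial function such as hard voting to aggregate the predictions of all the predictors in an ensemble, we train a model to perform the aggregation. This final model, called a *blender* or *meta learner*, takes the other predictors' predictions as inputs and outputs the final prediction. 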
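
A minimal stacking sketch, assuming Scikit-Learn 0.22 or later (where `StackingClassifier` is available). The base predictors and the Logistic Regression used as the final model are arbitrary choices that reuse the classifiers from the voting section, and the moons data is regenerated so that the example runs on its own.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_m, y_m = make_moons(n_samples=500, noise=0.30, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_m, y_m, random_state=42)

stacking_clf = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("svc", SVC(gamma="scale", probability=True, random_state=42)),
    ],
    # The final model is trained on the base predictors' outputs.
    final_estimator=LogisticRegression(),
    cv=5)  # base predictions are produced out-of-fold to limit leakage
stacking_clf.fit(X_tr, y_tr)

stack_acc = stacking_clf.score(X_te, y_te)
print(stack_acc)
```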