# Programming for Data Science and Artificial Intelligence

## 9.5 Classification - Random Forests and Ensembles Scratch

### Readings: 
- [GERON] Ch3, 5, 6
- [VANDER] Ch5
- [HASTIE] Ch4
- https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

### Random Forests

Random forests are an example of an *ensemble* method, meaning that they rely on aggregating the results of an ensemble of simpler estimators.
The somewhat surprising result with such ensemble methods is that the sum can be greater than the parts: that is, a majority vote among a number of estimators can end up being better than any of the individual estimators doing the voting!
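To see why a vote can beat its voters, here is a minimal sketch under two simplifying assumptions that are not from the text above: the task is binary, and every estimator is correct independently with the same probability. The function name `majority_vote_accuracy` is hypothetical. Real estimators trained on the same data are correlated, so the gain in practice is smaller than this idealised calculation suggests.

```python
from math import comb


def majority_vote_accuracy(n_voters, p_correct):
    """P(more than half of n_voters independent voters are correct).

    Assumes a binary task, an odd n_voters (no ties), and independent voters.
    """
    k_min = n_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(k_min, n_voters + 1)
    )


# Each voter alone is right only 60% of the time.
for n in (1, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

The accuracy of the vote grows with the number of voters, even though no individual voter improves.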

Random forests are an example of an *ensemble learner* built on decision trees.
For this reason we'll start by discussing decision trees themselves.

Decision trees are extremely intuitive ways to classify or label objects: you simply ask a series of questions designed to zero in on the classification.
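As an illustration of this "series of questions" view, the sketch below fits a shallow tree and prints its learned rules as text. The iris dataset and `max_depth=2` are assumptions chosen only to keep the output short; they are not part of the material above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A depth-2 tree asks at most two questions before assigning a class.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders each split as a threshold question on one feature.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line is one question of the form "is this feature below a threshold?", and each leaf names the predicted class.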

#### Ensembles

A single deep decision tree tends to overfit its training data. The notion that multiple overfitting estimators can be combined to reduce the effect of this overfitting is what underlies an ensemble method called *bagging*.
Bagging makes use of an ensemble (a grab bag, perhaps) of parallel estimators, each of which overfits the data, and averages the results to find a better classification.
An ensemble of randomized decision trees is known as a *random forest*.
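Bagging can be sketched from scratch in a few lines: draw bootstrap samples, fit one tree per sample, and take a majority vote. The synthetic `make_blobs` data, the 25 trees, and all variable names below are assumptions made for illustration. This is a rough sketch of the idea, not how Scikit-Learn implements it internally.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data, used only to make the sketch runnable.
X, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(1)
n_estimators = 25
trees = []
for _ in range(n_estimators):
    # Bootstrap: sample row indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

# Each tree votes; the ensemble predicts the most common label per sample.
votes = np.stack([t.predict(X_test) for t in trees])  # (n_estimators, n_test)
y_pred = np.array([np.bincount(col).argmax() for col in votes.T])

bagged_accuracy = (y_pred == y_test).mean()
print(f"bagged accuracy: {bagged_accuracy:.2f}")
```

A random forest adds one more ingredient on top of this: each split inside a tree considers only a random subset of the features.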

This type of bagging classification can be done manually using Scikit-Learn's ``BaggingClassifier`` meta-estimator, as shown here:

In [1]:
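from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Assumption: X and y are not defined in this section, so we generate a
# small synthetic 4-class dataset purely for illustration.
X, y = make_blobs(n_samples=300, centers=4, random_state=0, cluster_std=1.0)

tree = DecisionTreeClassifier()

# Fit each tree on a bootstrap sample (drawn with replacement) whose size is
# 80% of the training set; this randomness decorrelates the trees so that
# their aggregated vote is more stable than any single tree.
bag = BaggingClassifier(tree, n_estimators=100, max_samples=0.8,
                        random_state=1)

bag.fit(X, y)

# plot_tree is assumed to be a plotting helper defined earlier in the course
# notebooks (it is not sklearn.tree.plot_tree, which has a different signature).
plot_tree(bag, X, y)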

In [None]:
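# A random forest is bagging of decision trees plus one extra source of
# randomness: each split considers only a random subset of the features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hyperparameter grid, searched with cross-validation (5-fold by default in
# recent scikit-learn versions). X and y are reused from the bagging cell above.
param_grid = {"n_estimators": [10, 50, 100],
              "criterion": ["gini", "entropy"],
              "max_depth": np.arange(1, 10)}
model = RandomForestClassifier()

grid = GridSearchCV(model, param_grid)
grid.fit(X, y)

print(grid.best_params_)

# best_estimator_ has already been refit on all of X, y (refit=True by
# default), so no further call to fit is needed.
model = grid.best_estimator_

# plot_tree: plotting helper assumed to be defined earlier in the course
plot_tree(model, X, y)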

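#### Apply random forests to our news data

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

# Assumption: X_train, X_test, train.target, y_test and the plot_pr_curve
# helper come from the earlier news-data notebook; param_grid,
# RandomForestClassifier and GridSearchCV come from the previous cell.
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
grid = GridSearchCV(RandomForestClassifier(), param_grid=param_grid, cv=cv)
grid.fit(X_train, train.target)

print(f"The best parameters are {grid.best_params_} "
      f"with a score of {grid.best_score_:.2f}")

# best_estimator_ is already refit on all of X_train (refit=True by default)
model = grid.best_estimator_

# Class probabilities, shape (n_samples, n_classes)
y_score = model.predict_proba(X_test)

# Plot the precision-recall curve
plot_pr_curve(y_test, y_score)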
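The `plot_pr_curve` helper is not defined in this section. As a rough idea of what such a helper needs to compute, the sketch below derives one-vs-rest precision-recall points per class from `predict_proba` output. The function name `pr_curves`, the synthetic `make_classification` data, and the forest settings are all assumptions; the course's actual helper may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split


def pr_curves(y_true, y_score, classes):
    """Return {class: (precision, recall, average_precision)}, one-vs-rest."""
    y_true = np.asarray(y_true)
    curves = {}
    for k, label in enumerate(classes):
        is_label = (y_true == label).astype(int)  # 1 for this class, 0 otherwise
        precision, recall, _ = precision_recall_curve(is_label, y_score[:, k])
        ap = average_precision_score(is_label, y_score[:, k])
        curves[label] = (precision, recall, ap)
    return curves


# Hypothetical 3-class data standing in for the news dataset.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_score = model.predict_proba(X_test)  # columns follow model.classes_

curves = pr_curves(y_test, y_score, model.classes_)
for label, (precision, recall, ap) in curves.items():
    print(f"class {label}: average precision = {ap:.2f}")
```

Passing each `recall`, `precision` pair to a plotting call such as matplotlib's `plt.plot(recall, precision)` would draw the curves.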

#### When to use Random Forests
Random forests have several advantages:
- Voting across many trees reduces overfitting
- Scikit-learn exposes <code>feature_importances_</code> on a fitted <code>RandomForestClassifier</code>, which helps you understand which features are most useful for classification
- Like other tree ensembles, random forests work well with structured/tabular data. Indeed, XGBoost (a gradient-boosted tree ensemble) is among the best classifiers for structured/tabular data and is often used in Kaggle competitions. For unstructured data such as images, sound, or brain signals, where useful features are hard to define by hand, deep learning is usually the better choice.
- Averaging over many trees gives smoother class-probability estimates than a single fully grown decision tree, whose leaves are usually pure and therefore output probabilities of 0 or 1
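Two of the points above, feature importances and class probabilities, can be inspected directly on a fitted forest. The sketch below uses the iris dataset as a stand-in, which is an assumption for illustration only. In scikit-learn, `feature_importances_` holds impurity-based importances normalised to sum to 1.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)

# Impurity-based importance of each feature, largest first.
importances = forest.feature_importances_
for i in np.argsort(importances)[::-1]:
    print(f"{iris.feature_names[i]:<20} {importances[i]:.3f}")

# Class probabilities: the average of the per-tree probability estimates.
proba = forest.predict_proba(iris.data[:3])
print(proba)
```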

The two biggest disadvantages are:
- It is difficult to interpret how the ensemble arrives at a prediction
- It does not work well with rare outcomes, since a bootstrap sample may contain few or no examples of the rare class
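The rare-outcome problem can be made concrete with a small calculation: the chance that a bootstrap sample contains no rare example at all. The function name `prob_bootstrap_misses` and the numbers used are hypothetical, and the sketch assumes plain sampling with replacement, with every row equally likely on each draw.

```python
import math


def prob_bootstrap_misses(n_samples, n_rare, n_draws=None):
    """Probability that n_draws draws with replacement include no rare row."""
    if n_draws is None:
        n_draws = n_samples  # a standard bootstrap draws n_samples rows
    return (1 - n_rare / n_samples) ** n_draws


for n_rare in (1, 2, 5):
    p = prob_bootstrap_misses(1000, n_rare)
    print(f"{n_rare} rare example(s) out of 1000: P(tree sees none) = {p:.3f}")

# For large n_samples this approaches exp(-n_rare).
print(math.exp(-1))
```

With a single rare example, more than a third of the trees would never see it during training.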

In conclusion, if you are working with structured/tabular data and would like high accuracy but do not care much about interpretability (as in most Kaggle competitions), you may want to use ensemble methods (including random forests and the like).