<a href="https://colab.research.google.com/github/dgunning/financial-ml/blob/master/Chapter_6_Ensemble_Methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ensemble Methods

## Bootstrap Aggregation

In [5]:
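from scipy.special import comb

# N independent classifiers, each correct with probability p, choosing among k classes
N, p, k = 100, 1. / 3, 3.
p_ = 0
# Accumulate P[X <= N/k], where X ~ Binomial(N, p) counts the correct votes
for i in range(0, int(N / k) + 1):
    p_ += comb(N, i) * p ** i * (1 - p) ** (N - i)
# Individual accuracy vs. probability that more than N/k classifiers are correct
p, 1 - p_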

(0.3333333333333333, 0.4811966952738904)
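
The cell above sums binomial terms to get the probability that more than N/k of the N classifiers are correct, which is a sufficient condition for the majority vote to be right. As a cross-check, the same tail probability can be read from SciPy's binomial survival function. This is a minimal sketch, assuming the classifiers are independent; the helper name `bagging_accuracy` is hypothetical and not part of the original notebook.

```python
from math import floor
from scipy.stats import binom

def bagging_accuracy(N, p, k):
    """P[X > N/k] for X ~ Binomial(N, p) (hypothetical helper name)."""
    # binom.sf(x, n, p) returns P[X > x], the complement of the CDF
    return binom.sf(floor(N / k), N, p)

# Same inputs as the loop: N=100 classifiers, accuracy p=1/3, k=3 classes
print(bagging_accuracy(100, 1. / 3, 3))
```

The printed value should agree with `1 - p_` from the loop up to floating-point error.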

## Random Forest
Decision trees are prone to overfitting, which increases the variance of their forecasts. Random forest (RF) addresses this by producing ensemble forecasts with lower variance.
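
To see why averaging lowers variance, consider a sketch under simplifying assumptions that the text does not state: $N$ forecasts $\varphi_i$ that are averaged, with a common variance $\sigma^2$ and a common pairwise correlation $\rho$. These symbols are introduced here for illustration only. The variance of the average is then

$$
\mathrm{Var}\left[\frac{1}{N}\sum_{i=1}^{N}\varphi_i\right] = \frac{\sigma^2}{N} + \frac{N-1}{N}\rho\sigma^2 = \sigma^2\left(\rho + \frac{1-\rho}{N}\right)
$$

Under these assumptions the variance falls as $N$ grows, but only down to $\rho\sigma^2$, so the gain depends on how weakly correlated the individual trees are.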

### Reducing overfitting in Random Forests
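If a large number of samples are redundant (non-IID), overfitting will still take place: sampling with replacement then builds many nearly identical trees, each of them overfit. The following approaches address this: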

1. Set the parameter *max_features* to a lower value, as a way of forcing discrepancy between the trees.
2. Early stopping: set the regularization parameter *min_weight_fraction_leaf* to a sufficiently large value (e.g., 5%) such that the out-of-bag accuracy converges to the out-of-sample (k-fold) accuracy.
3. Use *BaggingClassifier* on *DecisionTreeClassifier*, where *max_samples* is set to the average uniqueness (*avgU*) between samples.
   (a) `clf = DecisionTreeClassifier(criterion='entropy', max_features='auto', class_weight='balanced')`
   (b) `bc = BaggingClassifier(base_estimator=clf, n_estimators=1000, max_samples=avgU, max_features=1.)`
4. Use *BaggingClassifier* on *RandomForestClassifier*, where *max_samples* is set to the average uniqueness (*avgU*) between samples.
   (a) `clf = RandomForestClassifier(n_estimators=1, criterion='entropy', bootstrap=False, class_weight='balanced_subsample')`
   (b) `bc = BaggingClassifier(base_estimator=clf, n_estimators=1000, max_samples=avgU, max_features=1.)`
5. Modify the RF class to replace standard bootstrapping with sequential bootstrapping.
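
Item 2 can be checked empirically by comparing the out-of-bag accuracy with the k-fold accuracy for different values of *min_weight_fraction_leaf*. The following is a minimal sketch, not the notebook's own procedure: the synthetic data from `make_classification`, the helper name `oob_vs_cv`, and the settings (100 trees, 5 folds, `max_features=1`) are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic IID data, used only to make the sketch runnable
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

def oob_vs_cv(min_leaf_fraction):
    """Return (out-of-bag accuracy, mean 5-fold accuracy) for one RF setting."""
    rf = RandomForestClassifier(
        n_estimators=100,
        criterion='entropy',
        max_features=1,                              # item 1: few features per split
        min_weight_fraction_leaf=min_leaf_fraction,  # item 2: early stopping
        class_weight='balanced_subsample',
        oob_score=True,
        random_state=0,
    )
    # cross_val_score fits clones, so rf itself is still unfitted afterwards
    cv_acc = cross_val_score(rf, X, y, cv=5, scoring='accuracy').mean()
    oob_acc = rf.fit(X, y).oob_score_
    return oob_acc, cv_acc

for frac in (0.0, 0.05):
    oob_acc, cv_acc = oob_vs_cv(frac)
    print(f"min_weight_fraction_leaf={frac}: OOB={oob_acc:.3f}, k-fold={cv_acc:.3f}")
```

On IID data like this the two estimates are usually already close. The gap that early stopping is meant to shrink is expected mainly when samples are redundant, which this sketch does not simulate.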

### Three ways of setting up an RF

In [0]:
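from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Placeholder value. In practice avgU is the average uniqueness, a float in (0, 1],
# which max_samples reads as a fraction of the training set (an int is an absolute count).
avgU = 6

# 1) Standard random forest
clf0 = RandomForestClassifier(n_estimators=1000, criterion='entropy', class_weight='balanced_subsample')

# 2) Bagged decision trees. max_features='sqrt' is what 'auto' meant for classifiers
#    before 'auto' was removed in scikit-learn 1.3. The base estimator is passed
#    positionally because its keyword changed from base_estimator to estimator in 1.2.
clf1 = DecisionTreeClassifier(criterion='entropy', max_features='sqrt', class_weight='balanced')
clf1 = BaggingClassifier(clf1, n_estimators=1000, max_samples=avgU)

# 3) Bagged single-tree random forests without bootstrap
clf2 = RandomForestClassifier(n_estimators=1, criterion='entropy', bootstrap=False, class_weight='balanced_subsample')
clf2 = BaggingClassifier(clf2, n_estimators=1000, max_samples=avgU, max_features=1.)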
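
How *max_samples* is interpreted matters when it is set to the average uniqueness. The sketch below illustrates this on synthetic data with a hypothetical `avgU = 0.5`; the value, the data, and the reduced `n_estimators=20` are assumptions made to keep the example fast, not values from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, only to make the sketch runnable
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

avgU = 0.5  # hypothetical average uniqueness, a fraction in (0, 1]

tree = DecisionTreeClassifier(criterion='entropy', max_features='sqrt',
                              class_weight='balanced')
# The base estimator is passed positionally to work across scikit-learn versions
bc = BaggingClassifier(tree, n_estimators=20, max_samples=avgU,
                       random_state=0).fit(X, y)

# A float max_samples is a fraction: each tree is fit on avgU * len(X) draws
draws_per_tree = len(bc.estimators_samples_[0])
print(draws_per_tree, len(X))
```

Passing an integer instead would be read as an absolute number of draws per tree, so a fractional uniqueness estimate should be passed as a float.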


When fitting decision trees, rotating the feature space in a direction that aligns with the axes typically reduces the number of levels the tree needs. For this reason, consider fitting the RF on a PCA of the features, as that may speed up calculations and reduce some overfitting.
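
One way to fit an RF on a PCA of the features is to chain the two steps in a scikit-learn pipeline. This is a minimal sketch under assumptions not taken from the notebook: synthetic data with redundant features, standardization before PCA, a 95% explained-variance cutoff, and 100 trees.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with redundant (linearly dependent) features
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           n_redundant=6, random_state=0)

pca_rf = make_pipeline(
    StandardScaler(),                           # assumption: common scale before PCA
    PCA(n_components=0.95, svd_solver='full'),  # keep components explaining 95% of variance
    RandomForestClassifier(n_estimators=100, criterion='entropy',
                           class_weight='balanced_subsample', random_state=0),
)
pca_rf.fit(X, y)

# Number of principal components the forest was actually trained on
n_kept = pca_rf.named_steps['pca'].n_components_
print(f"{X.shape[1]} features rotated into {n_kept} principal components")
```

Keeping PCA inside the pipeline means the rotation is re-estimated on each training fold during cross-validation, which avoids leaking information from the test folds.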