# Random Forests
#### Math 3480 - Machine Learning - Dr. Michael E. Olson

## Reading
* Geron, Chapter 7

Decision Trees are fast to train, but a single tree often generalizes poorly because it tends to overfit the training data. To improve the results, we combine the predictions of a group of predictors. This group of predictors is called an __ensemble__.

Applying this to Decision Trees,
* Take a random subset of the training data (Bagging)
* Train a Decision Tree on that subset
* Take a new random subset and train a different Decision Tree on it
* Continue until you have trained all the trees
* To classify a new observation, run it through every tree and predict the class that gets the most votes

This process, since we are looking through multiple trees, is called a __Random Forest__, and is one of the most powerful Machine Learning algorithms available today.



### Voting
Another way to build an ensemble is to train several different methods (Logistic Regression, SVM, Random Forests, KNN,...) on the same data and combine their predictions, rather than keeping only the single best model
* Train each ML algorithm on the training data
* Have each trained model predict the class of a new observation
* Take a majority vote: the class predicted by the most models is the prediction of the ensemble

The majority-vote classifier is called *hard voting*.

*Soft voting* predicts the class with the highest class probability, averaged over all the individual classifiers. It requires that every classifier can estimate class probabilities.
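To see how hard and soft voting can disagree, here is a minimal sketch in plain Python. The three probability rows are made-up numbers for a single observation (they are assumptions for illustration, not the output of any real model), and the helper names `hard_vote` and `soft_vote` are hypothetical.

```python
from collections import Counter

# Hypothetical predicted probabilities [P(class 0), P(class 1)]
# from three classifiers for ONE observation (illustrative numbers only).
probas = [
    [0.55, 0.45],  # classifier A leans slightly toward class 0
    [0.55, 0.45],  # classifier B leans slightly toward class 0
    [0.10, 0.90],  # classifier C is very confident in class 1
]

def hard_vote(probas):
    """Each classifier votes for its most probable class; the majority wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probas):
    """Average the class probabilities, then pick the most probable class."""
    n_classes = len(probas[0])
    avg = [sum(p[k] for p in probas) / len(probas) for k in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

hard = hard_vote(probas)
soft = soft_vote(probas)
print("hard vote:", hard)
print("soft vote:", soft)
```

Two of the three classifiers vote for class 0, so hard voting picks class 0. The averaged probabilities are 0.4 for class 0 and 0.6 for class 1, so soft voting picks class 1, because it gives more weight to the confident classifier.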

### Bagging (Bootstrap AGGregatING)
To bootstrap (create a random resample of your dataset, called a bag):
* Take a random element from your data, and include it in your bag
  * with replacement (it's not removed from the original dataset, so it could be selected again later)
* Continue until you have drawn a predetermined number of elements, always with replacement
  * If you repeat some elements, that is just fine - it is useful in the algorithm
  * How many elements in each bag?
    * If $n$ is the number of elements in the original dataset, and $n'$ is the number in the bag, we want $n' \le n$ (the classical bootstrap uses $n' = n$)
    * A good round figure would be $n' \approx 60\% \cdot n$. Not a hard number - can vary based on need
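The sampling steps above can be sketched with the standard library alone. This is only an illustration: the helper name `bootstrap_sample`, the toy dataset of 1000 integers, and the seed are all assumptions. Here we draw $n' = n$ elements to show how many distinct elements end up in the bag.

```python
import random

def bootstrap_sample(data, n_prime, rng):
    """Draw n_prime elements from data with replacement."""
    return [rng.choice(data) for _ in range(n_prime)]

rng = random.Random(0)        # fixed seed so the run is repeatable
data = list(range(1000))      # toy dataset: the integers 0..999

bag = bootstrap_sample(data, len(data), rng)

# Because we sample with replacement, some elements repeat and others are left out.
unique_fraction = len(set(bag)) / len(data)
print("bag size:", len(bag))
print("fraction of distinct original elements in the bag:", round(unique_fraction, 3))
```

When $n' = n$, the expected fraction of distinct elements approaches $1 - 1/e \approx 63\%$ for large $n$, so each bag leaves out roughly a third of the data even though it has the same size as the original dataset.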

We then train one model on each bag. To make a prediction, we apply the new data to every model and combine the results: a majority vote for classification, or the average for regression. This is the __ensemble__ process.

#### How to do Bagging in Python

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (16, 6)
from sklearn import datasets

iris_data = datasets.load_iris()
iris = pd.DataFrame(iris_data['data'], columns=iris_data['feature_names'])
species = pd.DataFrame(iris_data['target'], columns=['species_num'])

# Map the numeric targets (0, 1, 2) to their species names
species_names = dict(enumerate(iris_data['target_names']))
iris['species'] = species['species_num'].map(species_names)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.drop(['species'],axis=1), iris['species'], test_size=0.3, random_state=32)

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

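# 500 trees, each trained on 100 rows drawn with replacement (bootstrap=True)
# from the training set; n_jobs=-1 uses all available CPU cores
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=True, n_jobs=-1)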
    
bag_clf.fit(X_train, y_train)
y_predict = bag_clf.predict(X_test)

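from sklearn.metrics import confusion_matrix, classification_report
# Compare the predictions against the test labels (y_test), not the training labels
print(confusion_matrix(y_test, y_predict))
print(classification_report(y_test, y_predict))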

### Random Forests
* Bootstrap your data
* Using the bootstrapped data, select only a few variables at random to test
* Determine which of those variables creates the best split in the data - this is your first node
* At each following node, select a new random handful of variables to test
* The variable that creates the best split becomes that node
* ...etc...
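As a rough sketch of how these steps might look in practice, scikit-learn's `RandomForestClassifier` bootstraps the data for each tree and considers a random subset of the variables at every split. The settings below (`n_estimators=500`, `max_features="sqrt"`, and the `random_state` values) are assumptions chosen for illustration, not values prescribed by these notes.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the iris data and hold out 30% of it for testing
iris_data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_data.data, iris_data.target, test_size=0.3, random_state=32)

# Each of the 500 trees is trained on a bootstrap sample, and every split
# considers only sqrt(number of variables) randomly chosen variables.
rnd_clf = RandomForestClassifier(
    n_estimators=500, max_features="sqrt", random_state=32, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

accuracy = accuracy_score(y_test, rnd_clf.predict(X_test))
print(f"test accuracy: {accuracy:.3f}")

# How much each variable contributed to the splits, averaged over all trees
for name, importance in zip(iris_data.feature_names, rnd_clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

The `feature_importances_` values sum to 1, which gives a quick view of which variables the forest relied on most.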