# Lab 17

Today, we will begin exploring the idea of _ensemble methods_: methods that aggregate the results of several machine learning methods. We will:

0. Explain aggregation in terms of ensemble methods 
1. Define the random forest

## Ensemble Methods

**Ensemble methods** are methods that use an "ensemble" (or group) of methods to perform a task. 

For example, we might want to do a classification task assigning Smith students to houses (training on data from a previous class of students). We might try 1-NN, 3-NN, and 5-NN classifiers to assign the house for a student, and get Tyler House twice and Chapin once. What house would you assign this student to? _Why?_

Just to push your thinking a bit, consider that the results were:
* **Situation 1** - Tyler (result from 1-NN), Tyler (result from 3-NN), and Chapin (result from 5-NN)
* **Situation 2** - Tyler (result from 1-NN), Chapin (result from 3-NN), and Tyler (result from 5-NN)
* **Situation 3** - Chapin (result from 1-NN), Tyler (result from 3-NN), and Tyler (result from 5-NN)

Typically, we use **majority vote** to decide the final class assigned by an ensemble method, but we can also develop other kinds of voting systems, including weighted votes. (You can also "tune" this weighting as part of your supervised learning, but that is beyond the scope of this lab.)

In the above example, under majority vote, all the situations would assign the student to Tyler House. But if one used a weighted voting scheme that more heavily weighted the 5-NN, then in Situation 1 (where the 5-NN chose Chapin) it is possible that the student would be assigned to Chapin instead.
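To make the two voting schemes concrete, here is a minimal sketch in plain Python. The function names and the weights `[1, 1, 3]` are hypothetical choices for illustration only; any weighting that gives the 5-NN more say than the other two classifiers combined would behave the same way.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label that appears most often among the votes."""
    return Counter(votes).most_common(1)[0][0]


def weighted_vote(votes, weights):
    """Return the label whose votes carry the largest total weight."""
    totals = Counter()
    for label, weight in zip(votes, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Votes are ordered as (1-NN, 3-NN, 5-NN), matching the situations above.
situations = {
    "Situation 1": ["Tyler", "Tyler", "Chapin"],
    "Situation 2": ["Tyler", "Chapin", "Tyler"],
    "Situation 3": ["Chapin", "Tyler", "Tyler"],
}

# Hypothetical weights that favor the 5-NN classifier.
weights = [1, 1, 3]

for name, votes in situations.items():
    print(name, majority_vote(votes), weighted_vote(votes, weights))
```

With these particular weights, only Situation 1 changes its answer, because that is the only situation in which the heavily weighted 5-NN disagrees with the majority.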

In addition to varying the _k_ in kNN, we could also build several 3-NN classifiers with a different set of training examples. 

### Grove of trees

In the above example, we used several versions of kNN and then performed some kind of voting system. We can do a similar thing with decision trees by creating many **pruned** trees.

Recall the bias-variance trade-off: a decision tree with many nodes may overfit our data and exhibit high _variance_, meaning that it generalizes poorly to new data. It might also simply contain too many rules or decisions for assigning a label (or class) to a datapoint. In a sense, we want nice compact trees where each node contributes more to the classification than the effort of adding that node to the tree costs.

So after building a decision tree, we might _prune_ it by cutting off branches that are simply not contributing much. It is natural to ask why we do not just make short trees (i.e. cap the number of nodes) instead of building a whole tree and then pruning. The reason is simply that we don't know which branches are useful until we have the _whole_ tree.

Borrowing from the example above, we decide to build 5 pruned decision trees using five different sets of previous students as our training sets. We then take the majority vote for house assignment. 
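The grove described above can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic stand-ins (not real housing records), the labels are reduced to two classes, each training set is drawn by sampling rows with replacement, and pruning is done with `sklearn`'s cost-complexity parameter `ccp_alpha` (available in recent versions of `sklearn`). The helper `grove_predict` is a hypothetical name.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption): 200 "students", 4 binary features.
X = rng.integers(0, 2, size=(200, 4))
# The label depends only on the first two features.
y = (X[:, 0] + X[:, 1] >= 1).astype(int)

trees = []
for seed in range(5):
    # A different training set for each tree: sample rows with replacement.
    rows = rng.integers(0, len(X), size=len(X))
    # ccp_alpha > 0 asks sklearn to prune branches that contribute little.
    tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=seed)
    tree.fit(X[rows], y[rows])
    trees.append(tree)


def grove_predict(trees, X_new):
    """Majority vote across the trees for each row of X_new (labels 0/1)."""
    votes = np.array([tree.predict(X_new) for tree in trees])  # (n_trees, n_rows)
    # With five trees and two classes there can be no ties.
    return (votes.mean(axis=0) > 0.5).astype(int)


print(grove_predict(trees, np.array([[0, 0, 1, 1], [1, 0, 0, 0]])))
```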

### (Possibly) Motivating example

Consider that you are given the task of assigning each US senator to a policy committee. In effect, we are trying to accurately assign each senator a label that is a committee name. For some senators, we have their current committee membership, as they were in the Senate last session (this will serve as our training set). We would like to generate a rule-based method (like a decision tree) that _explains_ why each senator is placed on a committee. For each senator from the previous session, we have all of their votes as well as their party affiliation.

As a first attempt, you follow the process that you used for your house assignment algorithm: building 5 pruned decision trees using five different sets of previous senators as the training sets. You notice something weird: the top node is always the senator's party affiliation.

The US has an incredibly partisan 2-party system. So when trying to use a decision tree, the first node splits senators by their party as this is the most overwhelming predictor of a senator's voting behavior. 

When you have a situation like this, where one predictor is always at the top of a decision tree regardless of the training data, we say that we have a _bully_ predictor. In this case, we need to do something different...
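A bully predictor can be reproduced on made-up data. In this sketch, everything about the data is an assumption for illustration: column 0 plays the role of party, columns 1 to 3 are unrelated noise, and the label is a two-class stand-in for committee membership that agrees with "party" about 90% of the time. The code reads the feature used at the root of each fitted tree from `tree_.feature[0]`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# Synthetic stand-in data (an assumption, not real Senate records).
n = 300
X = rng.integers(0, 2, size=(n, 4))
flip = rng.random(n) < 0.1                 # about 10% of labels disagree with "party"
y = np.where(flip, 1 - X[:, 0], X[:, 0])

root_features = []
for seed in range(5):
    rows = rng.integers(0, n, size=n)      # a different training set each time
    tree = DecisionTreeClassifier(max_depth=3, random_state=seed)
    tree.fit(X[rows], y[rows])
    # Index of the feature used for the split at the root node.
    root_features.append(int(tree.tree_.feature[0]))

print(root_features)
```

Every tree splits on column 0 first, even though each one saw a different training set.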

## Random Forests

A random forest is a grove of decision trees in which we restrict the variables available for making each decision: at every split, a tree may only choose among a random subset of the input variables. Returning to our weather example, this means that each split is chosen from a subset of the input variables:
* cloudy
* humid
* temp	
* rained_yesterday

This restriction helps to circumvent the bully predictor and gives us genuinely different trees in our grove. Today we will jump straight to the implementation of this in `sklearn`.

### Imports for Today

Again today, we will use the weather data from last time. 

In [2]:
## Import block
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

import numpy as np
import pandas as pd

from sklearn.ensemble import RandomForestClassifier

In [None]:
# For function testing 



In [8]:
## Import Data

weather_data = np.genfromtxt("../Lab16-DecisionTrees/lab16data.csv", delimiter=',', skip_header=1)
weather_pd = pd.read_csv("../Lab16-DecisionTrees/lab16data.csv", sep = ",")


In [9]:
# Split into the input variables and the target classes
in_weather = weather_data[:,:4]
out_class = weather_data[:,4]

# Get the variable names 
var_names = list(weather_pd.columns)[:4]

## Random Forests in `sklearn`

It should come as no surprise that we will first specify our random forest model, and then fit the random forest model to our data. 

Here we have a few parameters to unpack: 
* `n_estimators` means the number of trees in your grove
* `max_features` means the number of features that each tree in your grove considers when choosing a split
* `max_depth` means the maximum number of layers (depth) each tree can have

In [10]:
# Specify our model
grove = RandomForestClassifier(n_estimators=10, max_features = 3, max_depth=2, random_state=0)

In [12]:
# Fit our model to the data
grove.fit(in_weather, out_class)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=2, max_features=3, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)
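One way to see what `n_estimators` and `max_depth` control is to inspect a fitted forest. The sketch below uses random synthetic data (an assumption, since the lab's CSV is not bundled here) and the hypothetical name `demo_grove`; the individual trees live in the `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the weather data (an assumption): 4 binary inputs.
X = rng.integers(0, 2, size=(100, 4))
y = rng.integers(0, 2, size=100)

demo_grove = RandomForestClassifier(
    n_estimators=10, max_features=3, max_depth=2, random_state=0
)
demo_grove.fit(X, y)

# One fitted decision tree per estimator.
n_trees = len(demo_grove.estimators_)
# No tree is allowed to grow deeper than max_depth.
depths = [tree.get_depth() for tree in demo_grove.estimators_]

print(n_trees, max(depths))
```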

### Inferences from our random forests

We can use our random forest to tell us which features are the most important: 

In [14]:
print(grove.feature_importances_)

[0.40506577 0.17887533 0.2896692  0.1263897 ]
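The raw array is easier to read when each importance is paired with its variable name. The sketch below is self-contained and runs on synthetic data, so its numbers will not match the output above. The `names` list assumes the columns are ordered as in the weather example, and `forest` is a hypothetical name. In the notebook itself, the same idea would be `zip(var_names, grove.feature_importances_)`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Assumed column order, taken from the weather example.
names = ["cloudy", "humid", "temp", "rained_yesterday"]

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption): the label follows column 0,
# with about 20% of the labels flipped at random.
X = rng.integers(0, 2, size=(200, 4))
flip = rng.random(200) < 0.2
y = np.where(flip, 1 - X[:, 0], X[:, 0])

forest = RandomForestClassifier(
    n_estimators=10, max_features=3, max_depth=2, random_state=0
)
forest.fit(X, y)

# Pair each name with its importance and sort from most to least important.
ranked = sorted(
    zip(names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

The importances are normalized, so they sum to 1 across the features.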


We can also make predictions, like for a clear, not humid, hot day, following a rainy day:

In [18]:
grove.predict([[0, 0, 1, 1]])

array([0.])

We can also see how well we did on our training set (the same data we used to fit the model, so this accuracy is likely optimistic):

In [19]:
grove.score(in_weather, out_class)

0.6764705882352942
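Because the score above is computed on the same rows used for fitting, it says little about new data. A common alternative is to hold out part of the data before fitting. The sketch below shows this with `train_test_split` on synthetic stand-in data (an assumption); the name `holdout_grove` and the 25% test size are illustrative choices, not part of the lab.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption): 4 binary inputs, noisy labels.
X = rng.integers(0, 2, size=(200, 4))
flip = rng.random(200) < 0.2
y = np.where(flip, 1 - X[:, 0], X[:, 0])

# Keep 25% of the rows aside; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

holdout_grove = RandomForestClassifier(
    n_estimators=10, max_features=3, max_depth=2, random_state=0
)
holdout_grove.fit(X_train, y_train)

train_acc = holdout_grove.score(X_train, y_train)  # accuracy on rows used to fit
test_acc = holdout_grove.score(X_test, y_test)     # accuracy on held-out rows

print(train_acc, test_acc)
```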

### Final Thoughts

To finish up this lab, answer the question: **Do you think that random forests are an improvement over the decision tree from last lab?** Why or why not? Share your answer in a post on the **#lab_submission** channel on Slack. Your post must start with **Lab17** to get credit.

If you have questions from this lab, post them to #lab_questions with the same preamble (i.e. starting with **Lab17**). If you have the same question as someone else, please use one of the emojis to upvote their question. If you would like to answer someone's question, please use the thread function. This will tie your answer to their question.

### Next Time

We will start with 15 minutes on Labs 16 and 17. Bring questions!

#### Resources consulted 

0. [ISLR](http://faculty.marshall.usc.edu/gareth-james/ISL/)
1. [SDS 293 Notes by R. Jordan Crouser](http://www.science.smith.edu/~jcrouser/SDS293/)
2. [Random Forest in sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)