# Supervised Learning

![Random Forest Comic](https://s-media-cache-ak0.pinimg.com/736x/e9/c9/df/e9c9df5e60831e134d0dfa367bf286ce.jpg)

In [None]:
# %load utils/imports.py
%matplotlib inline

import numpy as np
import pandas as pd

from utils import *
from utils.plotting import *

from utils.demo import *
from utils.styles import *

# Random Forest

Random forests, or random decision forests, are an [ensemble learning](https://en.wikipedia.org/wiki/Ensemble_learning) method for classification, regression, and other tasks. Ensemble methods combine multiple learners to obtain better predictive performance than any single learner could achieve on its own.

A random forest works by constructing many decision trees at training time and outputting the mode of the individual trees' predicted classes (for classification) or the mean of their predictions (for regression).

In other words, a random forest uses many small decision trees as weak learners to build a higher-level strong learner. The strong learner polls its weak learners, which 'vote' on the class (`y`) to select for a given input (`x`). The class with the most votes wins.
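
The voting and averaging described above can be sketched in a few lines. This is a minimal illustration, not how any particular library implements it: the helper names `majority_vote` and `mean_prediction` and the per-tree outputs are hypothetical values made up for the example.

```python
from collections import Counter

import numpy as np


def majority_vote(tree_predictions):
    """Return the most common class label among the per-tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]


def mean_prediction(tree_predictions):
    """Return the average of the per-tree numeric predictions."""
    return float(np.mean(tree_predictions))


# Hypothetical outputs of five trees for a single input x.
class_votes = [4, 3, 4, 4, 5]                         # classification: one label per tree
regression_outputs = [21.0, 19.5, 22.5, 20.0, 22.0]   # regression: one value per tree

print(majority_vote(class_votes))           # label with the most votes
print(mean_prediction(regression_outputs))  # average of the tree outputs
```

Note that scikit-learn's `RandomForestClassifier` combines trees by averaging their predicted class probabilities rather than counting hard votes, so the sketch above shows the idea rather than that library's exact procedure.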

![Example Random Forest](http://image.slidesharecdn.com/mlsquare-140801092353-phpapp01/95/squares-machine-learning-infrastructure-and-applications-rong-yan-17-638.jpg?cb=1406885118)

## Strengths and Weaknesses

### Strengths

- Are generally quite fast to build and train.
- Do not expect linear features, or even features that interact linearly.
- Handle both categorical and continuous features very well.
- Handle high-dimensional spaces very well.
- Scale well to a large number of instances, since each individual tree is relatively small.
- Are often highly accurate compared with other models.

### Weaknesses

- Can be a bit slow to query, but it is generally fast enough in practice.
- Can overfit on particularly noisy data.
- The learned model is harder to interpret than a single decision tree.

Scikit-learn provides the [`RandomForestClassifier`](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). Here is a quick example of how to use a random forest on the Motor Trend (`mtcars`) dataset.

In [None]:
fn = 'mtcars.csv'
download_data(fn)
df = pd.read_csv('data/' + fn)

In [None]:
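# cross_val_score lives in sklearn.model_selection
# (the old sklearn.cross_validation module has been removed)
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Predict the number of gears from the remaining numeric columns
features = df.drop(['Car', 'gear'], axis=1)
target = df.gear

rf = RandomForestClassifier()
# Cross-validated accuracy, one score per fold
scores = cross_val_score(rf, features, target)
print("Accuracy: {:.2f} +/- {:.2f}".format(scores.mean(), scores.std() * 2))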

### Overfitting

Random forest is a fairly robust machine learning algorithm, but it is still prone to overfitting if given bad parameters. In particular, watch out when tweaking the following parameters of the `RandomForestClassifier`:

- `n_estimators`: In general, the more trees, the less likely the algorithm is to overfit. The default is 100 as of scikit-learn 0.22 (it was 10 in earlier versions); if you have a large amount of data you may want to try increasing it. The lower this number, the closer the model is to a single decision tree with a restricted feature set.
- `max_features`: This determines how many features are randomly considered at each split. By default the classifier uses the square root of the number of features. If you have a very large number of features, you may want to try adjusting this number (try 30-50% of the number of features). The smaller the value, the less likely the model is to overfit, but too small a value will start to introduce underfitting.
- `max_depth`: Experiment with this. Limiting the depth reduces the complexity of the learned trees, lowering the risk of overfitting. Try starting small, say 5-10, and increasing until you get the best result.
- `min_samples_leaf`: Try setting this to values greater than one. This has a similar effect to the `max_depth` parameter: a split is only made if it leaves at least this many samples in each resulting branch.
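
To see how the parameters above are passed in practice, here is a minimal sketch that compares a forest with default settings against a more constrained one. The synthetic dataset from `make_classification` and the specific parameter values are assumptions chosen only for illustration; on real data you would tune them, for example with a grid search. A large gap between the training score and the cross-validated score is a common sign of overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, deliberately noisy data (illustrative only):
# flip_y assigns a random class to roughly 20% of the samples.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)

candidates = {
    "defaults": RandomForestClassifier(n_estimators=100, random_state=0),
    "constrained": RandomForestClassifier(
        n_estimators=100,
        max_features=0.3,    # consider 30% of the features at each split
        max_depth=5,         # cap the depth of every tree
        min_samples_leaf=3,  # every leaf must hold at least 3 samples
        random_state=0,
    ),
}

results = {}
for name, model in candidates.items():
    # Cross-validated accuracy estimates performance on unseen data.
    cv_scores = cross_val_score(model, X, y, cv=5)
    # Accuracy on the training data itself, for comparison.
    train_score = model.fit(X, y).score(X, y)
    results[name] = (train_score, cv_scores.mean())
    print("{}: train={:.2f}, cv={:.2f}".format(name, train_score, cv_scores.mean()))
```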