#### What is a Random Forest?

When I started learning about Random Forests, I stumbled upon [Edwin Chen's blog post](http://blog.echen.me/2011/03/14/laymans-introduction-to-random-forests/) he wrote in 2011, in which he provided an extremely nice way to fully describe what a random forest is. I'm reproducing it below (in case you're too lazy to click on the link and read it)

Suppose you’re very indecisive, so whenever you want to watch a movie, you ask your friend Willow if she thinks you’ll like it. In order to answer, Willow first needs to figure out what movies you like, so you give her a bunch of movies and tell her whether you liked each one or not (i.e., you give her a labeled training set). Then, when you ask her if she thinks you’ll like movie X or not, she plays a 20 questions-like game with IMDB, asking questions like “Is X a romantic movie?”, “Does Johnny Depp star in X?”, and so on. She asks more informative questions first (i.e., she maximizes the information gain of each question), and gives you a yes/no answer at the end. **Willow is thus a decision tree for your movie preferences.**

But Willow is only human, so she doesn’t always generalize your preferences very well (i.e., she overfits). In order to get more accurate recommendations, you’d like to ask a bunch of your friends, and watch movie X if most of them say they think you’ll like it. That is, instead of asking only Willow, you want to ask Woody, Apple, and Cartman as well, and they vote on whether you’ll like a movie. **All these friends make up what's called an ensemble classifier**, aka a forest in this case.

Now you don’t want each of your friends to do the same thing and give you the same answer, so you first give each of them slightly different data. After all, you’re not absolutely sure of your preferences yourself – you told Willow you loved Titanic, but maybe you were just happy that day because it was your birthday, so maybe some of your friends shouldn’t use the fact that you liked Titanic in making their recommendations. Or maybe you told her you loved Cinderella, but actually you really really loved it, so some of your friends should give Cinderella more weight. So instead of giving your friends the same data you gave Willow, you give them slightly perturbed versions. You don’t change your love/hate decisions, you just say you love/hate some movies a little more or less (formally, you give each of your friends a *bootstrapped* version of your original training data). For example, whereas you told Willow that you liked Black Swan and Harry Potter and disliked Avatar, you tell Woody that you liked Black Swan so much you watched it twice, you disliked Avatar, and don’t mention Harry Potter at all. By using this modified ensemble, you hope that while each of your friends gives somewhat idiosyncratic recommendations (Willow thinks you like vampire movies more than you do, Woody thinks you like Pixar movies, and Cartman thinks you just hate everything), the errors get canceled out in the majority. **Your friends now form a bagged (bootstrap aggregated) forest of your movie preferences.**

There’s still one problem with your data, however. While you loved both Titanic and Inception, it wasn’t because you like movies that star Leonardio DiCaprio. Maybe you liked both movies for other reasons. Thus, you don’t want your friends to all base their recommendations on whether Leo is in a movie or not. So when each friend asks IMDB a question, only a random subset of the possible questions is allowed (i.e., when you’re building a decision tree, at each node you use some randomness in selecting the attribute to split on, say by randomly selecting an attribute or by selecting an attribute from a random subset). This means your friends aren’t allowed to ask whether Leonardo DiCaprio is in the movie whenever they want. So whereas previously you injected randomness at the data level, by perturbing your movie preferences slightly, now you’re injecting randomness at the model level, by making your friends ask different questions at different times.

**And so your friends now form a random forest.**

Random Forest is a highly versatile machine learning technique, and can perform both classification and regression. It's thus a solid choice for nearly any type of prediction models, including the non-linear ones. It can handle a large number of features, and it can also be used to estimate which of your variables are important in the underlying data being modeled.

All these benefits are okay, but what makes Random Forests especially appealing? It's the fact that you can throw pretty much anything at it and it'll do a serviceable job. It does a particularly good job of estimating inferred transformations, and, as a result, doesn't require much tuning like, say, a Support Vector Machine model. If you have a tight deadline coming up, this is a handy tool to have in your arsenal.

#### Random Forest in action

What better dataset than the good ol' [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set) to show how Random Forest works? This dataset is available in Python datasets module and you can import it as follows (but make sure you have the scikit-learn library installed):

In [1]:
from sklearn.datasets import load_iris

We'll import a few more libraries and modules that you'll need to make all this work, such as pandas, numpy, and ofcourse, the RandomForestClassifier (which will make the magic happen).

In [2]:
import pandas as pd
import numpy as np
from sklearn import metrics, cross_validation
from sklearn.ensemble import RandomForestClassifier



In [9]:
# loading the dataset
iris = load_iris()

# storing the features in a dataset called features
features = pd.DataFrame(iris.data, columns=iris.feature_names)

# storing the features in a dataset called target
target = pd.DataFrame(iris.target)

# renaming the columns of these two datasets
features.columns = ['Sepal_length','Sepal_width','Petal_length','Petal_width']
target.columns = ['class']

# taking a peek at the features dataset
features.head() 

Unnamed: 0,Sepal_length,Sepal_width,Petal_length,Petal_width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


And now the next logical step: separating them into training and testing sets, so that we can train our model on the training dataset and see how well it is performing on data it's not seen (the testing dataset)

In [10]:
# Putting two-thirds of the data in the training set
X_train, X_test, y_train, y_test = cross_validation.train_test_split(features, target, test_size=0.333, random_state=0)

The Random Forest will accept only numpy arrays, and so we can convert our pandas dataframes onto numpy arrays easily as follows:

In [11]:
trainArr = X_train.values #training array
trainRes = y_train.values # training results

We'll now initialize our Random Forest. I have provided a descriptio of the various attributes in the RandomForestClassifier function for reference, and most of the values have been set to their defaults.

In [12]:
rf = RandomForestClassifier(n_estimators=100, # the numer of trees in the forest
                            criterion="gini", # The function to measure the quality of a split. Supported criteria are
                                              # "gini" for the Gini impurity and "entropy" for the information gain.
                            max_features="auto", # The number of features to consider when looking for the best split, default
                                                 # is "auto" which means sqrt(number of features)
                            min_samples_split=2, # The minimum number of samples required to split an internal node, default is 2
                            max_depth=None, # The maximum depth of the tree. The default is None, which means the nodes are 
                                            # expanded until all leaves are pure or until all leaves contain less than
                                            # 'min_samples_split' samples
                            min_samples_leaf=1, # The minimum number of samples required to be at a leaf node, default is 1
                            min_weight_fraction_leaf=0., # The minimum weighted fraction of the sum total of weights (of all
                                                         # the input samples) required to be at a leaf node. Default is 0.0. 
                                                         # Samples have equal weight when sample_weight is not provided.
                            max_leaf_nodes=None, # Grow trees with ``max_leaf_nodes`` in best-first fashion. Best nodes are 
                                                 # defined as relative reduction in impurity. Default is None, which means an
                                                 # unlimited number of leaf nodes.
                            min_impurity_split=1e-7, # Threshold for early stopping in tree growth. A node will split if its 
                                                     # impurity is above the threshold, otherwise it is a leaf.
                            bootstrap=True, # Whether bootstrap samples are used when building trees.
                            oob_score=False, # Whether to use out-of-bag samples to estimate the generalization accuracy, 
                                             # default is False
                            n_jobs=1, # The number of jobs to run in parallel for both fiting and predicting stage. 
                                      # Default is 1. If -1, then the number of jobs is set to the number of cores.
                            random_state=None,
                            verbose=0, # controls the verbosity of the tree buillding process, default is 0
                            warm_start=False) # If true, reuse the solution of the previous call to fit and add more 
                                              # estimators to the ensemble, otherwise, just fit a whole new forest.

Fitting the model to our data:

In [13]:
rf.fit(trainArr, trainRes)

  if __name__ == '__main__':


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

Now our classifier is ready, and all we need to do it predict on our test data set (for which we have to convert it into a numpy array)

In [19]:
# array of testing data
testArr = X_test.values

# array of testing targets
testRes = y_test.values

# getting predictions and storing them in results
results = rf.predict(testArr)

To know how well the Random Forest did on the testing dataset, we can print the accuracy score and the confusuion matrix

In [23]:
print (metrics.confusion_matrix(testRes, results))

[[16  0  0]
 [ 0 18  1]
 [ 0  1 14]]


In [24]:
print (100*metrics.accuracy_score(testRes, results))

96.0


**Only 2 our of 50 observations were misclassified, and thus the accuracy score was 96%.** This is good, even not considering the fact that it took almost no time for the model to be trained. If you are interested in the individual class probabilities, here's how you can get them:

In [36]:
probs = rf.predict_proba(testArr)
print(probs[:15,:]) # class probabilities for the first 15 test observations

[[ 0.    0.    1.  ]
 [ 0.    1.    0.  ]
 [ 1.    0.    0.  ]
 [ 0.    0.    1.  ]
 [ 1.    0.    0.  ]
 [ 0.    0.    1.  ]
 [ 1.    0.    0.  ]
 [ 0.    0.99  0.01]
 [ 0.    0.9   0.1 ]
 [ 0.    1.    0.  ]
 [ 0.    0.29  0.71]
 [ 0.    1.    0.  ]
 [ 0.    1.    0.  ]
 [ 0.    0.98  0.02]
 [ 0.    1.    0.  ]]
