# Introduction to Random Forests

Decision trees are a powerful and popular machine learning technique. The basic concept is very similar to trees that we commonly use to aid decision-making.

Machine learning techniques enable us to automatically construct a decision tree that tells us which outcome to predict in a given situation.

The decision tree algorithm is a supervised learning algorithm: we first construct the tree with historical data, and then use it to predict an outcome. One of the major advantages of decision trees is that they can pick up nonlinear interactions between variables in the data that linear regression can't.

The following is a walkthrough of the building blocks for constructing a decision tree automatically.

We'll be looking at individual income in the United States. The data is from the 1994 census and contains information on each individual's marital status, age, type of work, and more.

The target column, or what we want to predict, is whether individuals make less than or equal to 50k a year, or more than 50k a year. The dataset can be accessed from the [University of California, Irvine's website](http://archive.ics.uci.edu/ml/datasets/Adult).


The following projects show the implementation and application of decision trees on the same data set:

- [Introduction to Decision Trees](https://github.com/ajdatahub/ProjectDS/blob/master/Decision%20Trees/Introduction%20to%20Decision%20Trees.ipynb)

- [Applying Decision Trees](https://github.com/ajdatahub/ProjectDS/blob/master/Decision%20Trees/Applying%20Decision%20Trees.ipynb)

In these projects, I learned how decision trees are constructed and how to use them most effectively.


One of the most powerful tools for reducing decision tree overfitting is the random forest algorithm. In the current project, we'll learn how to construct and apply random forests.

In [18]:
import pandas as pd

# First column is age of the person. Set index_col to False to avoid pandas thinking that the first column is row indexes
income = pd.read_csv("income.csv", index_col=False)
print(income.head(5))

   age          workclass  fnlwgt   education  education_num  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital_status          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capital_gain  capital_loss  hours_per_week  native_country high_income  
0          2174             0              40   United-States   

We have categorical variables such as workclass that have string values. 
- Multiple individuals can share the same string value. The types of work include State-gov, Self-emp-not-inc, Private, and so on. Each of these strings is a label for a category. 
- Another example of a column of categories is sex, where the options are Male and Female.

Before we get started with decision trees, we need to convert the categorical variables in our data set to numeric variables.

In [19]:
workclass_col = pd.Categorical(income['workclass'])
income['workclass'] = workclass_col.codes
income['workclass'].head()

0    7
1    6
2    4
3    4
4    4
Name: workclass, dtype: int8

Similarly, we will convert all the other categorical variables to numeric variables.

In [20]:
def convert_categorical(column, data):
    """Return integer category codes for one column of a DataFrame."""
    column_cat = pd.Categorical(data[column])
    return column_cat.codes
    
income['education'] = convert_categorical('education',income)
income['marital_status'] = convert_categorical('marital_status',income)
income['occupation'] = convert_categorical('occupation',income)
income['relationship'] = convert_categorical('relationship',income)
income['race'] = convert_categorical('race',income)
income['sex'] = convert_categorical('sex',income)
income['native_country'] = convert_categorical('native_country',income)
income['high_income'] = convert_categorical('high_income',income)
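
To make the idea of category codes concrete, here is a minimal pure-Python sketch of the mapping. It assumes that the categories are the sorted unique strings in the column, which is how `pd.Categorical` orders them when no explicit categories are given. The helper name `encode_categories` and the five sample values are illustrative only and are not part of the notebook.

```python
def encode_categories(values):
    """Map each distinct string to an integer code (hypothetical helper)."""
    # The sorted unique labels play the role of the category list.
    categories = sorted(set(values))
    lookup = {label: code for code, label in enumerate(categories)}
    return [lookup[v] for v in values], categories


# First five workclass values from the table above (toy input).
workclass = ["State-gov", "Self-emp-not-inc", "Private", "Private", "Private"]
codes, categories = encode_categories(workclass)

print(categories)  # ['Private', 'Self-emp-not-inc', 'State-gov']
print(codes)       # [2, 1, 0, 0, 0]
```

The codes here differ from the notebook output (7, 6, 4) because the full column contains more categories than this five-row sample. Keeping the category list around lets us translate a code back to its label with `categories[code]`.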

A decision tree is made up of a series of nodes and branches. A node is where we split the data based on a variable, and a branch is one side of the split. 

The nodes at the bottom of the tree, where we decide to stop splitting, are called terminal nodes, or leaves.
- We don't split the data randomly; we split it with an objective in mind. Our goal is to make predictions on future data.
    - Ideally, all rows in a leaf share a single value for our target column, so each leaf maps to exactly one prediction.
    
If a leaf mixes several target values, predictions for the rows that land in it become less reliable.

After constructing a tree using the data, we'll want to make predictions. In order to do this, we'll take a new row and feed it through our decision tree.
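
As a rough illustration of feeding a row through a tree, the sketch below walks a small hand-written tree from the root to a leaf. The tree structure, the thresholds (30 and 12), and the helper name `predict_row` are all made up for this example; they are not learned from the income data.

```python
# Hypothetical hand-built tree: internal nodes test one feature against a
# threshold, and leaves hold the predicted value of high_income.
tree = {
    "feature": "age", "threshold": 30,
    "left": {"leaf": 0},
    "right": {
        "feature": "education_num", "threshold": 12,
        "left": {"leaf": 0},
        "right": {"leaf": 1},
    },
}


def predict_row(node, row):
    """Walk from the root to a leaf, going left when value <= threshold."""
    while "leaf" not in node:
        go_left = row[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]


new_row = {"age": 39, "education_num": 13}
print(predict_row(tree, new_row))  # 1
```

A fitted scikit-learn tree does the same kind of traversal internally when we call `predict()`.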

In [21]:
income.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,high_income
0,39,7,77516,9,13,4,1,1,4,1,2174,0,40,39,0
1,50,6,83311,9,13,2,4,0,4,1,0,0,13,39,0
2,38,4,215646,11,9,0,6,1,4,1,0,0,40,39,0
3,53,4,234721,1,7,2,6,0,2,1,0,0,40,39,0
4,28,4,338409,9,13,2,10,5,2,0,0,0,40,5,0


### Random Forests

A random forest is a kind of ensemble model. Ensembles combine the predictions of multiple models to create a more accurate final prediction. We'll make a simple ensemble to see how they work.

To learn about this, let us create two decision trees with slightly different parameters:

- One with min_samples_leaf set to 2
- One with max_depth set to 5

We will check their accuracies separately. Later, we'll combine their predictions and compare the combined accuracy with the individual accuracies of both trees.

In [22]:
# Splitting the data into train and test

import numpy as np
import math

# Set a random seed so the shuffle is the same every time
np.random.seed(1)

income = income.reindex(np.random.permutation(income.index))

split_val = math.floor(income.shape[0] * .8)

train = income.iloc[:split_val]
test = income.iloc[split_val:]

In [23]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

columns = ["age", "workclass", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "hours_per_week", "native_country"]

cl = DecisionTreeClassifier(random_state=1, min_samples_leaf=2)
cl.fit(train[columns], train["high_income"])

cl1 = DecisionTreeClassifier(random_state=1, max_depth=5)
cl1.fit(train[columns], train["high_income"])


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=1,
            splitter='best')

We will use AUC-ROC (area under the receiver operating characteristic curve) as our evaluation metric. AUC ranges from 0 to 1, where 0.5 corresponds to random guessing, and it is well suited to binary classification. The higher the AUC, the better our predictions.

We can compute AUC with the roc_auc_score function from sklearn.metrics. This function takes in two parameters:
- y_true: true labels
- y_score: predicted scores, such as class probabilities (in this project we pass the predicted labels)

It then calculates and returns the AUC value.
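
To show what the metric measures, here is a small pure-Python sketch of AUC as the fraction of (positive, negative) pairs that the scores rank correctly, with ties counted as half. The helper name `auc` and the toy numbers are assumptions for illustration; this is not how scikit-learn implements `roc_auc_score`, although the two should agree on inputs like these.

```python
def auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
probabilities = [0.1, 0.2, 0.3, 0.8]             # toy class-1 probabilities
hard_labels = [round(p) for p in probabilities]  # [0, 0, 0, 1]

print(auc(y_true, probabilities))  # 1.0
print(auc(y_true, hard_labels))    # 0.75
```

In this toy case the probabilities rank every positive above every negative, but rounding them to labels throws away part of that ordering. That is one reason AUC is often computed on probabilities rather than on hard 0/1 predictions.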

In [24]:
# min_samples_leaf=2 predictions
min_samples_predictions = cl.predict(test[columns])

min_samples_auc = roc_auc_score(test['high_income'],min_samples_predictions)

# max_depth=5 predictions
max_depth_predictions = cl1.predict(test[columns])

max_depth_auc = roc_auc_score(test['high_income'], max_depth_predictions)

print('Min Samples AUC : ',min_samples_auc)
print('Max Depth AUC : ',max_depth_auc)

Min Samples AUC :  0.6878964226062301
Max Depth AUC :  0.6759853906508785


### Combining Predictions

When we have multiple classifiers making predictions, we can treat each set of predictions as a column in a matrix. Here's an example where we have Decision Tree 1 (DT1), Decision Tree 2 (DT2), and Decision Tree 3 (DT3):

DT1     DT2    DT3 <br>
 0       1      0  <br>
 1       1      1 <br>
 0       0      1 <br>
 1       0      0 <br>

But we want a single vector containing one prediction per row in the data set we are predicting on. 
To accomplish this, we'll need to create rules to convert each row of our matrix of predictions into a single number.

We want to create a Final Prediction vector that looks like this:

DT1     DT2    DT3    Final Prediction <br>
 0       1      0      0 <br>
 1       1      1      1 <br>
 0       0      1      0 <br>
 1       0      0      0 <br>

There are many ways to get from the output of multiple models to a final vector of predictions. 
- One method is majority voting, in which each classifier gets a "vote," and the most commonly voted value for each row "wins." 
- This only works if there are more than two classifiers (and ideally an odd number, so we don't have to write a rule to break ties). Majority voting is what we applied in the example above.
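
Majority voting over the three-tree example can be sketched in a few lines. The helper name `majority_vote` is hypothetical, and the sketch assumes an odd number of classifiers so that no tie-breaking rule is needed.

```python
from collections import Counter

# Each inner list is one row of predictions: [DT1, DT2, DT3].
prediction_matrix = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]


def majority_vote(row):
    """Return the most common prediction in a row (assumes no ties)."""
    return Counter(row).most_common(1)[0][0]


final_predictions = [majority_vote(row) for row in prediction_matrix]
print(final_predictions)  # [0, 1, 0, 0]
```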

Earlier, we built only two classifiers (one with max_depth set and one with min_samples_leaf set), so a majority vote could end in a tie. Instead, we'll take the mean of the predictions in each row. 

- Instead of using the predict() method, which returns either 0 or 1, we can use the predict_proba() method of DecisionTreeClassifier, which predicts a probability from 0 to 1 that a given class is the right one for a row. 
- predict_proba() returns output like the following:

 0     1<br>
.7    .3<br>
.2    .8<br>
.1    .9<br>


- Each row will correspond to a prediction. The first column is the probability that the prediction is a 0, and the second column is the probability that the prediction is a 1. Each row adds up to 1.


- If we just take the second column, we get the probability that each row belongs to class 1. If there's a .9 probability that the correct classification is 1, we can use .9 as the value the classifier is predicting. This gives us a continuous output in a single vector, instead of just 0 or 1.


- Then we can add together all of the vectors we get through this method, and divide the sum by the total number of vectors to get the mean prediction made across the entire ensemble for a particular row. Finally, we round off to get a 0 or 1 prediction for the row.
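
The averaging step can be sketched with plain lists. The probabilities below are invented for illustration and stand in for the second column of each classifier's predict_proba() output.

```python
# Toy class-1 probabilities from two hypothetical classifiers.
tree1_probs = [0.7, 0.2, 1.0, 0.9]
tree2_probs = [0.9, 0.4, 0.0, 0.3]

# Mean prediction per row, then round to a 0/1 label.
mean_probs = [(a + b) / 2 for a, b in zip(tree1_probs, tree2_probs)]
ensemble = [round(p) for p in mean_probs]

print(ensemble)  # [1, 0, 0, 1]
```

One detail worth knowing: the third row averages to exactly 0.5, and Python's `round()` rounds halves to the nearest even integer, so it becomes 0. `np.round` follows the same rule, which means exact ties between two classifiers resolve to 0 rather than 1.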

In [25]:
import numpy as np
'''
If we use the predict_proba() method on both classifiers to generate probabilities, 
take the mean for each row, and then round the results, we'll get ensemble predictions.
'''

predictions1 = cl.predict_proba(test[columns])[:,1]
predictions2 = cl1.predict_proba(test[columns])[:,1]

mean_predictions = np.round((predictions1 + predictions2)/2)

combined_auc_score = roc_auc_score(test['high_income'],mean_predictions)

print('Combined AUC: ',combined_auc_score)

Combined AUC:  0.7150846804038882


We see that the AUC value for the combined predictions of the two trees is higher than that of either tree on its own.

The individual models are approaching the same problem in slightly different ways, and building different trees because we used different parameters for each one. 
- Each tree makes different predictions in different areas. 
- Even though both trees have about the same accuracy, when we combine them, the result is stronger because it leverages the strengths of both approaches.


- The more "diverse" or dissimilar the models we use to construct an ensemble are, the stronger their combined predictions will be (assuming that all of the models have about the same accuracy). 


- Ensembling a decision tree and a logistic regression model, for example, will result in stronger predictions than ensembling two decision trees with similar parameters. That's because those two models use very different approaches to arrive at their answers.

### Variation with Bagging

A random forest is an ensemble of decision trees. 


- If we don't make any modifications to the trees, each tree will be exactly the same, so we'll get no boost when we ensemble them. In order to make ensembling effective, we have to introduce variation into each individual decision tree model.

If we introduce variation, each tree will be constructed slightly differently, and will therefore make different predictions.

- There are two main ways to introduce variation in a random forest: bagging and random feature subsets. 


__Bagging__

In a random forest, we don't train each tree on the entire data set. We train it on a random sample of the data, or a "bag," instead. We perform this sampling with replacement, which means that after we select a row from the data we're sampling, we put the row back in the data so it can be picked again. Some rows from the original data may appear in the "bag" multiple times.
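
Here is a minimal sketch of sampling with replacement, using only the standard library. The ten row ids, the 60% bag size, and the seed are arbitrary choices for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

rows = list(range(10))            # stand-in for row ids of the training data
bag_size = int(len(rows) * 0.6)   # each bag holds 60% as many rows

# Draw one row at a time and "put it back", so a row can be picked again.
bag = [rng.choice(rows) for _ in range(bag_size)]

print(bag)
print("distinct rows in bag:", len(set(bag)))
```

Because every draw is made from the full list, the number of distinct rows in a bag can be smaller than the bag size. Different seeds give different bags, which is what makes each tree see slightly different data.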



In [26]:
tree_count = 10 # Number of trees we want to build

# List to hold predictions for each tree
predictions_list = [] 

# Each "bag" will have 60% of the number of original rows
bag_proportion = .6

for i in range(tree_count):
    
    bag = train.sample(frac=bag_proportion, replace=True, random_state=i)
    
    # Fit a decision tree model to the "bag"
    cl = DecisionTreeClassifier(random_state=1, min_samples_leaf=2)
    cl.fit(bag[columns], bag["high_income"])
    
    # Using the model, make predictions on the test data
    predictions_list.append(cl.predict_proba(test[columns])[:,1])
    
mean_predictions = [np.round(sum(i)/tree_count) for i in zip(*predictions_list)]

combined_auc_score = roc_auc_score(test['high_income'],mean_predictions)

print('Combined AUC: ',combined_auc_score)

Combined AUC:  0.7329963297474371


With bagging, we gained some accuracy over a single decision tree. 
- We achieved an AUC score of around .733 with bagging, which is an improvement over the AUC score of .688 we got without bagging.

### Selecting Random Features

To build a tree, we split the data on a certain feature. The feature used for each split is chosen based on the information gain it provides.  
- We compute the information gain for each feature in our random sample, and pick the one with the highest information gain to split on.
    - Entropy is a measure of "disorder" in the data set. 
    - If all the labels in a data set are the same, it has low entropy. If the labels are evenly mixed, it has high entropy. 
    - Splits that give us more information about the data are the ones that reduce entropy the most. 
    - The tree we are building will split the labels into distinct groups with minimum disorder (mix of values). This'll allow the splits to give our tree more predictive power.
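
The entropy and information gain described above can be sketched as follows. The helper names and the four-label toy data are assumptions for illustration; the formulas are the standard Shannon entropy in bits and the entropy reduction of a two-way split.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(labels, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


labels = [0, 0, 1, 1]
print(entropy([0, 0, 0, 0]))                      # 0.0 (all labels the same)
print(entropy(labels))                            # 1.0 (evenly mixed)
print(information_gain(labels, [0, 0], [1, 1]))   # 1.0 (perfect split)
print(information_gain(labels, [0, 1], [0, 1]))   # 0.0 (uninformative split)
```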


Each time we split a node while training, rather than evaluating all the features to find the maximum information gain, we'll only evaluate a __constrained set of features that we select randomly.__
- This introduces variation into the trees, and makes for more powerful ensembles.

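We can perform this random subset selection in scikit-learn:
- Set the max_features parameter on DecisionTreeClassifier to "sqrt" (older scikit-learn releases also accepted "auto", which meant the same thing for classifiers), and the splitter parameter to "random". 
- If we have N columns, max_features picks a subset of features of size square root of N at each node, computes the Gini impurity for each (this plays a role similar to information gain), and splits the node on the best column in the subset. Setting splitter to "random" additionally draws the candidate split thresholds at random instead of searching for the best threshold.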
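
The two ingredients mentioned above, Gini impurity and a square-root-sized feature subset, can be sketched in plain Python. The helper name `gini_impurity` and the seed are illustrative assumptions, and the subset size uses the integer part of the square root, which is how I understand scikit-learn to compute it.

```python
import math
import random
from collections import Counter


def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0 means a pure node."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


columns = ["age", "workclass", "education_num", "marital_status", "occupation",
           "relationship", "race", "sex", "hours_per_week", "native_country"]

# With N = 10 columns, the subset holds int(sqrt(10)) = 3 features.
subset_size = int(math.sqrt(len(columns)))

rng = random.Random(1)  # arbitrary seed for a repeatable sketch
candidate_features = rng.sample(columns, subset_size)

print(gini_impurity([0, 0, 1, 1]))  # 0.5
print(gini_impurity([1, 1, 1, 1]))  # 0.0
print(subset_size, candidate_features)
```

A new subset is drawn for every split, so different nodes (and different trees) end up considering different columns.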

In [27]:
tree_count = 10 # Number of trees we want to build

# List to hold predictions for each tree
predictions_list = [] 

# Each "bag" will have 60% of the number of original rows
bag_proportion = .6

for i in range(tree_count):
    
    bag = train.sample(frac=bag_proportion, replace=True, random_state=i)
    
    # Fit a decision tree model to the "bag"
    # max_features='sqrt' matches what 'auto' meant for classifiers in older scikit-learn releases
    cl = DecisionTreeClassifier(random_state=1, min_samples_leaf=2, splitter='random', max_features='sqrt')
    cl.fit(bag[columns], bag["high_income"])
    
    # Using the model, make predictions on the test data
    predictions_list.append(cl.predict_proba(test[columns])[:,1])
    
mean_predictions = [np.round(sum(i)/tree_count) for i in zip(*predictions_list)]

combined_auc_score = roc_auc_score(test['high_income'],mean_predictions)

print('Combined AUC: ',combined_auc_score)

Combined AUC:  0.7345958637997538


As discussed, random forests have two building blocks: __bagging__ and __random feature subsets__. 

scikit-learn has a __RandomForestClassifier__ class and a __RandomForestRegressor__ class that enable us to train and test random forest models quickly.

When we instantiate a RandomForestClassifier, we pass in an n_estimators parameter that indicates how many trees to build. 
- While adding more trees usually improves accuracy, it also increases the overall time the model takes to train.

In [30]:
from sklearn.ensemble import RandomForestClassifier

clr = RandomForestClassifier(n_estimators = 5, random_state = 1, min_samples_leaf = 2)

# Fit the model (pass the target as a 1-D Series, not a one-column DataFrame)
clr.fit(train[columns], train['high_income'])

# Make predictions
predictions = clr.predict(test[columns])

auc_score = roc_auc_score(test['high_income'], predictions)

print('RF AUC Score', auc_score)

RF AUC Score 0.7347461391939776


  


Similar to decision trees, we can tweak some of the parameters for random forests, including:

- min_samples_leaf
- min_samples_split
- max_depth
- max_leaf_nodes

There are also parameters specific to the random forest that alter the overall construction of the forest:

- n_estimators
- bootstrap - "Bootstrap aggregation" is another name for bagging; this parameter indicates whether to turn it on (Defaults to True)

Tweaking parameters can increase the accuracy of the forest. 
- The easiest tweak is to increase the number of estimators we use. 
- This approach yields diminishing returns: going from 10 trees to 100 will make a bigger difference than going from 100 to 500, which will make a bigger difference than going from 500 to 1000. 
- The accuracy gain grows roughly logarithmically with the number of trees, so increasing the number of trees beyond a certain point (often around 200) won't help much at all.

In [32]:
# Increasing n_estimators to 150.
from sklearn.ensemble import RandomForestClassifier

clr = RandomForestClassifier(n_estimators = 150, random_state = 1, min_samples_leaf = 2)

# Fit the model (pass the target as a 1-D Series, not a one-column DataFrame)
clr.fit(train[columns], train['high_income'])

# Make predictions
predictions = clr.predict(test[columns])

auc_score = roc_auc_score(test['high_income'], predictions)

print('RF AUC Score', auc_score)


RF AUC Score 0.7379403213124711


We increased the AUC score from about 0.735 to 0.738, but the model took longer to run. 
- With much larger data sets, the extra training time could amount to hours or days.

One of the major advantages of random forests over single decision trees is that they tend to overfit less. 
- Although the individual decision trees in a random forest vary widely, the average of their predictions is less sensitive to the input data than a single tree is. 
- This is because while one tree can construct an incorrect and overfit model, the average of 100 or more trees is more likely to capture the signal and ignore the noise. 
- The signal will be the same across all of the trees, whereas each tree will home in on the noise differently. This means that the average tends to discard the noise and keep the signal.
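
The noise-cancelling argument can be illustrated with a small simulation. It assumes each "tree" outputs a shared signal plus its own independent Gaussian noise, which is a simplification: real trees in a forest are correlated, so the reduction in practice is smaller. The signal value, noise level, and helper name `noisy_estimate` are made up for this sketch.

```python
import random
import statistics

rng = random.Random(42)
SIGNAL = 0.7   # hypothetical true value every tree is trying to estimate
NOISE = 0.2    # hypothetical per-tree noise level


def noisy_estimate():
    """One tree's output: the shared signal plus that tree's own noise."""
    return SIGNAL + rng.gauss(0, NOISE)


# Spread of a single tree's estimate versus the average of 100 trees.
single_tree = [noisy_estimate() for _ in range(1000)]
forest_mean = [
    statistics.mean(noisy_estimate() for _ in range(100))
    for _ in range(1000)
]

print("single tree stdev:", statistics.stdev(single_tree))
print("100-tree mean stdev:", statistics.stdev(forest_mean))
```

Under the independence assumption, the spread of the averaged estimate should come out roughly ten times smaller than that of a single tree, while both stay centred on the same signal.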

### Where Random Forests are Useful

The random forest algorithm is incredibly powerful, but it isn't applicable to all tasks. 

- The main strengths of a random forest are:
    - __Very accurate predictions__ - Random forests achieve highly accurate performance on many machine learning tasks. Along with neural networks and gradient-boosted trees, they're typically one of the top-performing algorithms.

    - __Resistance to overfitting__ - Due to their construction, random forests are fairly resistant to overfitting. 
    

- The main weaknesses of using a random forest are:

    - __Difficult to interpret__ - Because we're averaging the results of many trees, it can be hard to figure out why a random forest is making predictions the way it is.

    - __Takes longer to create__ - A random forest takes longer to build than a single tree, because it has to train many trees. Fortunately, we can exploit multicore processors to parallelize tree construction. scikit-learn allows us to do this through the n_jobs parameter on RandomForestClassifier. 


Given these trade-offs, it makes sense to use random forests in situations where __accuracy is of the utmost importance__ and being able to interpret or explain the decisions the model is making isn't key. 
- In cases where time is of the essence or interpretability is important, a single decision tree may be a better choice.

