# Random Forests

A _random forest_ is an ensemble machine learning technique: it contains many decision trees that work together to classify new points. When a random forest is asked to classify a new point, it gives that point to each of its decision trees. Each tree reports its classification, and the random forest returns the most popular one. It’s like every tree gets a vote, and the most popular classification wins.

Some of the trees in the random forest may be overfit, but because the prediction is based on a large number of trees, the overfitting of any single tree has less impact on the final result.
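To see why voting helps, consider a simplified calculation. The sketch below assumes a two-class problem in which every tree is correct with the same probability and the trees make their mistakes independently. Real trees in a forest are correlated, so this is an optimistic illustration rather than a guarantee. The function name `majority_vote_accuracy` is ours and is not part of any library.

```python
from math import comb

def majority_vote_accuracy(n_trees, p_correct):
    """Probability that more than half of n_trees independent voters are right.

    Assumes a binary problem, an odd number of trees (no ties), and that
    each tree is correct with probability p_correct, independently.
    """
    majority = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct) ** (n_trees - k)
        for k in range(majority, n_trees + 1)
    )

# accuracy of the vote for forests of increasing size, each tree 70% accurate
for n in (1, 5, 21, 101):
    print(n, round(majority_vote_accuracy(n, 0.7), 4))
```

Under these assumptions, the accuracy of the vote climbs toward 1 as trees are added, even though each individual tree stays at 70%.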

## Bagging

Random forests create different trees using a process known as __bagging__. Every time a decision tree is made, it is created using a different subset of the points in the training set. For example, if our training set had `1000` rows in it, we could make a decision tree by picking `100` of those rows at random to build the tree. This way, every tree is different, but all trees will still be created from a portion of the training data.

One thing to note is that when we’re randomly selecting these `100` rows, we’re doing so with replacement. Picture putting all `1000` rows in a bag and reaching in and grabbing one row at random. After writing down which row we picked, we put that row back in the bag.

This means that when we’re picking our `100` random rows, we could pick the same row more than once. It’s extremely unlikely, but all `100` randomly picked rows could even be the same row!

Because we’re picking these rows with replacement, there’s no need to shrink our bagged training set from `1000` rows to `100`. We can pick `1000` rows at random, and because some rows will be picked more than once while others are never picked, we’ll still end up with a data set that differs from the original.
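One way to check that a bagged sample of `1000` rows still differs from the original is to count how many distinct rows it contains. The sketch below uses row indices in place of real data, and the variable names are illustrative. On average, a sample drawn this way contains roughly 63% of the original rows, with the remainder made up of repeats.

```python
import random

random.seed(4)
n_rows = 1000

# draw n_rows indices with replacement
bagged_indices = random.choices(range(n_rows), k=n_rows)

# fraction of the original rows that appear at least once in the sample
unique_fraction = len(set(bagged_indices)) / n_rows
print(f"distinct rows in the bagged sample: {unique_fraction:.1%}")

# expected fraction of distinct rows: 1 - (1 - 1/n)^n, which is close to 1 - 1/e
expected_fraction = 1 - (1 - 1 / n_rows) ** n_rows
print(f"expected fraction: {expected_fraction:.1%}")
```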

In [1]:
from tree import build_tree, print_tree, car_data, car_labels
import random
random.seed(4)

tree = build_tree(car_data, car_labels)
#print_tree(tree)

# implement bagging: pick 1000 row indices at random, with replacement
indices = []
for i in range(0, 1000):
  indices.append(random.randint(0, 999))

data_subset = [car_data[index] for index in indices]
labels_subset = [car_labels[index] for index in indices]

subset_tree = build_tree(data_subset, labels_subset)
print_tree(subset_tree)

Person Capacity
--> Branch 2:
  Predict Counter({'unacc': 354})
--> Branch 4:
  Estimated Saftey
  --> Branch high:
    Buying Price
    --> Branch high:
      Price of maintenance
      --> Branch high:
        Predict Counter({'acc': 3})
      --> Branch low:
        Predict Counter({'acc': 6})
      --> Branch med:
        Predict Counter({'acc': 6})
      --> Branch vhigh:
        Predict Counter({'unacc': 6})
    --> Branch low:
      Price of maintenance
      --> Branch high:
        Size of luggage boot
        --> Branch big:
          Predict Counter({'vgood': 6})
        --> Branch med:
          Number of doors
          --> Branch 2:
            Predict Counter({'acc': 1})
          --> Branch 5more:
            Predict Counter({'vgood': 2})
        --> Branch small:
          Predict Counter({'acc': 1})
      --> Branch low:
        Size of luggage boot
        --> Branch big:
          Predict Counter({'vgood': 2})
        --> Branch med:
          Predict Counter({'vgoo

## Bagging Features

We’re now making trees based on different random subsets of our initial dataset. But we can continue to add variety to the ways our trees are created by changing the features that we use.

Recall that for our car data set, the original features were the following:

* The price of the car
* The cost of maintenance
* The number of doors
* The number of people the car can hold
* The size of the trunk
* The safety rating

Right now when we create a decision tree, we look at every one of those features and choose to split the data based on the feature that produces the most information gain. We could change how the tree is created by only allowing a subset of those features to be considered at each split.

For example, when finding which feature to split the data on the first time, we might randomly choose to only consider the price of the car, the number of doors, and the safety rating.

After splitting the data on the best feature from that subset, we’ll likely want to split again. For this next split, we’ll randomly select three features again to consider. This time those features might be the cost of maintenance, the number of doors, and the size of the trunk. We’ll continue this process until the tree is complete.

One question to consider is how to choose the number of features to randomly select. Why did we choose `3` in this example? A good rule of thumb is to randomly select the square root of the total number of features. Our car dataset doesn’t have a lot of features, so in this example, it’s difficult to follow this rule. But if we had a dataset with `25` features, we’d want to randomly select `5` features to consider at every split point.
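As a small sketch of this rule of thumb, the helper below computes the number of features to sample and then draws that many feature indices without replacement. The function names are hypothetical, and rounding the square root down is only one possible choice when it is not a whole number.

```python
import math
import random

def features_per_split(n_features):
    """Square-root rule of thumb, rounded down, but never fewer than 1."""
    return max(1, math.isqrt(n_features))

def sample_features(n_features, rng=random):
    """Pick the feature indices to consider at one split (no repeats)."""
    return sorted(rng.sample(range(n_features), features_per_split(n_features)))

print(features_per_split(25))  # 5
print(features_per_split(6))   # 2, since sqrt(6) is about 2.45
print(sample_features(25))     # five distinct indices between 0 and 24
```

The car example in this section considers `3` of its `6` features at each split, which is slightly more than this rounding would give.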

In [2]:
from tree2 import *
import random
import numpy as np

np.random.seed(1)
random.seed(4)

def find_best_split(dataset, labels):
    best_gain = 0
    best_feature = 0
    features = np.random.choice(len(dataset[0]), 3, replace = False)
    for feature in features:
        data_subsets, label_subsets = split(dataset, labels, feature)
        gain = information_gain(labels, label_subsets)
        if gain > best_gain:
            best_gain, best_feature = gain, feature
    return best_gain, best_feature

indices = [random.randint(0, 999) for i in range(1000)]

data_subset = [car_data[index] for index in indices]
labels_subset = [car_labels[index] for index in indices]

# find the best feature to split on; prints (information gain, feature index)
print(find_best_split(data_subset, labels_subset))

(0.010225712539814483, 4)


## Classify

Now that we can make different decision trees, it’s time to plant a whole forest! Let’s say we make `8` different trees using bagging and feature bagging. We can now take a new unlabeled point, give that point to each tree in the forest, and count the number of times each label is predicted.

The trees give us their votes and the label that is predicted most often will be our final classification! For example, if we gave our random forest of 8 trees a new data point, we might get the following results:

`["vgood", "vgood", "good", "vgood", "acc", "vgood", "good", "vgood"]`

Since the most commonly predicted classification was `"vgood"`, this would be the random forest’s final classification.
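The vote count for this example can be sketched with `collections.Counter` from the Python standard library:

```python
from collections import Counter

predictions = ["vgood", "vgood", "good", "vgood", "acc", "vgood", "good", "vgood"]

# tally how many trees voted for each label
votes = Counter(predictions)
print(votes)

# most_common(1) returns a list holding the (label, count) pair with the most votes
final_prediction, n_votes = votes.most_common(1)[0]
print(final_prediction, n_votes)  # vgood 5
```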

Let’s write some code that can classify an unlabeled point.

In [3]:
from tree2 import build_tree, print_tree, car_data, car_labels, classify
import random
random.seed(4)

# The features are the price of the car, the cost of maintenance, the number of doors, 
# the number of people the car can hold, the size of the trunk, and the safety rating
unlabeled_point = ['high', 'vhigh', '3', 'more', 'med', 'med']

# create 20 trees and record the prediction of each one
predictions = []
for x in range(0, 20):
  indices = [random.randint(0, 999) for i in range(1000)]
  data_subset = [car_data[index] for index in indices]
  labels_subset = [car_labels[index] for index in indices]
  #create a tree using bagging and feature bagging
  subset_tree = build_tree(data_subset, labels_subset)
  predictions.append(classify(unlabeled_point, subset_tree))

print(predictions)

# find the most common prediction
final_prediction = max(predictions, key=predictions.count)
print(final_prediction)

['acc', 'unacc', 'acc', 'unacc', None, 'acc', 'acc', 'unacc', 'unacc', None, 'acc', 'unacc', 'acc', 'acc', 'acc', 'acc', 'acc', 'unacc', None, 'acc']
acc


## Test Set

We’re now able to create a random forest, but how accurate is it compared to a single decision tree? To answer this question we’ve split our data into a training set and test set. By building our models using the training set and testing on every data point in the test set, we can calculate the accuracy of both a single decision tree and a random forest.

We’ve given you code that calculates the accuracy of a single tree. This tree was made without using any of the bagging techniques we just learned. We created the tree by using every row from the training set once and considered every feature when splitting the data rather than a random subset.

Let’s also calculate the accuracy of a random forest and see how it compares. 
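The `make_random_forest` function imported in the next cell comes from the `tree3` module, and its body is not shown here. As a rough sketch of what such a helper could look like, the code below builds one tree per bagged sample. Everything in it is an assumption made for illustration: `make_forest_sketch`, `bootstrap_sample`, and the toy `majority_stub` builder are hypothetical names, and the stub stands in for a real tree builder so that the example runs on its own.

```python
import random
from collections import Counter

def bootstrap_sample(data, labels, rng):
    """Draw len(data) rows with replacement, keeping rows and labels aligned."""
    n = len(data)
    indices = [rng.randrange(n) for _ in range(n)]
    return [data[i] for i in indices], [labels[i] for i in indices]

def make_forest_sketch(n_trees, data, labels, build_tree, seed=0):
    """Hypothetical stand-in for make_random_forest: one tree per bagged sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample_data, sample_labels = bootstrap_sample(data, labels, rng)
        forest.append(build_tree(sample_data, sample_labels))
    return forest

def majority_stub(data, labels):
    """Toy 'tree' used only to make the sketch runnable: it ignores the
    features and remembers the most common label in its sample."""
    return Counter(labels).most_common(1)[0][0]

toy_data = [[i] for i in range(10)]
toy_labels = ["acc"] * 7 + ["unacc"] * 3

forest = make_forest_sketch(5, toy_data, toy_labels, majority_stub)
print(len(forest))  # 5
```

Passing the tree builder in as an argument keeps the bagging logic separate from how each tree is grown, so a real builder that also does feature bagging could be swapped in for the stub.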

In [4]:
from tree3 import training_data, training_labels, testing_data, testing_labels, make_random_forest, make_single_tree, classify
import numpy as np
import random
np.random.seed(1)
random.seed(1)

# create a single tree
tree = make_single_tree(training_data, training_labels)

# make_random_forest takes three parameters:
# the number of trees in the forest, the training data, and the training labels;
# it returns a list of trees
forest = make_random_forest(40, training_data, training_labels)

# create a variable to keep track of how many points in the test set the random forest classifies correctly
forest_correct = 0

single_tree_correct = 0

# loop through every point in the test set, and count the number of points the tree classified correctly 
for i in range(len(testing_data)):
  prediction = classify(testing_data[i], tree)
  if prediction == testing_labels[i]:
    single_tree_correct += 1
  predictions = []
  for forest_tree in forest:
    predictions.append(classify(testing_data[i], forest_tree))
  # find the most common prediction and compare it to the true label
  forest_prediction = max(predictions,key=predictions.count)
  if forest_prediction == testing_labels[i]:
    forest_correct += 1

# report the accuracy of a single tree
print(single_tree_correct/len(testing_data))

# report the accuracy of the random forest (the fraction of test points classified correctly)
print(forest_correct / len(testing_data))

0.8815028901734104
0.9219653179190751


## Random Forest in Scikit-learn

You now have the ability to make a random forest using your own decision trees. However, `scikit-learn` has a `RandomForestClassifier` class that will do all of this work for you! `RandomForestClassifier` is in the `sklearn.ensemble` module.

`RandomForestClassifier` works almost identically to `DecisionTreeClassifier`: the `.fit()`, `.predict()`, and `.score()` methods work in exactly the same way.

When creating a `RandomForestClassifier`, you can choose how many trees to include in the random forest by using the `n_estimators` parameter like this:

`classifier = RandomForestClassifier(n_estimators = 100)`

We now have a very powerful machine learning model that is fairly resistant to overfitting!
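As a self-contained sketch that does not depend on the car data, the code below trains a single `DecisionTreeClassifier` and a `RandomForestClassifier` on a synthetic dataset generated with `make_classification`. The dataset sizes and parameter values are arbitrary choices for illustration. Setting `max_features="sqrt"` asks each split to consider the square root of the number of features, matching the rule of thumb described earlier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic data: 1000 points with 25 features (an arbitrary stand-in dataset)
X, y = make_classification(
    n_samples=1000, n_features=25, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# a single decision tree for comparison
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

# a forest of 100 trees, each split considering sqrt(25) = 5 features
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

print("single tree accuracy:", tree.score(X_test, y_test))
print("random forest accuracy:", forest.score(X_test, y_test))
```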

In [5]:
import warnings
# silence library warnings so the cell output shows only the accuracy
warnings.filterwarnings("ignore")
from cars import training_points, training_labels, testing_points, testing_labels
from sklearn.ensemble import RandomForestClassifier

# create a RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=2000, random_state=0)

# train the forest
classifier.fit(training_points, training_labels)

# predict labels for the test set
predictions = classifier.predict(testing_points)

# print the accuracy of the model
print(classifier.score(testing_points, testing_labels))

0.9826589595375722


## Review

Nice work! Here are some of the major takeaways about random forests:

* A _random forest_ is an ensemble machine learning model. It makes a classification by aggregating the classifications of many decision trees.
* Random forests help reduce overfitting. Because the final classification aggregates the votes of many trees, an overfit tree in the forest has less impact on the result.
* Every decision tree in a random forest is created by using a different subset of data points from the training set. Those data points are chosen at random with replacement, which means a single data point can be chosen more than once. This process is known as _bagging_.
* When creating a tree in a random forest, a randomly selected subset of features is considered at each split as candidates for the best splitting feature. If your dataset has `n` features, it is common practice to randomly select the square root of `n` features.