# Random Forest
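Random forest is a popular ensemble machine learning technique. Essentially, it combines a batch of decision trees with bootstrap aggregation (*bagging*) to reduce variance. A single fully grown decision tree tends to have low bias but high variance: it fits the training data closely, so small changes in the data can produce a very different tree. Bagging addresses the variance problem by training each tree on a bootstrap sample of the data and combining the predictions of all the trees.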
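
To make the bagging idea concrete, here is a minimal sketch in plain Python. The helper names `bootstrap_sample` and `majority_vote` are made up for this illustration and are not part of `sklearn`. The integers stand in for training examples, and the hard-coded votes stand in for the predictions of three trees.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows uniformly at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the prediction that occurs most often among the trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)   # fixed seed so the sketch is repeatable
rows = list(range(10))   # stand-in for ten training examples

# Each tree in the forest would be trained on its own resampled copy.
samples = [bootstrap_sample(rows, rng) for _ in range(3)]
for sample in samples:
    print(sorted(sample))

# Combining the trees: the class predicted most often wins.
print(majority_vote(['Apple', 'Grape', 'Apple']))
```

Because the sampling is done with replacement, some examples typically appear several times in a sample while others are left out, which is what makes the individual trees differ from one another.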

We can build a decision tree easily using `sklearn` and achieve more than 80% accuracy on the MNIST dataset, using all pixel values as features.

In [1]:
import numpy as np
from sklearn import tree
from keras.datasets import mnist
from keras.utils import to_categorical

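(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Flatten each 28x28 image into a vector of pixel values scaled to [0, 1]
N, H, W = x_train.shape
x = x_train.reshape((N,H*W)).astype('float') / 255
# One-hot labels make this a multi-output problem for sklearn,
# so score() below counts a prediction as correct only if all 10 outputs match
y = to_categorical(y_train, num_classes=10)

model = tree.DecisionTreeClassifier()
model.fit(x, y)

# Apply the same preprocessing to the test set
N, H, W = x_test.shape
x = x_test.reshape((N,H*W)).astype('float') / 255
y = to_categorical(y_test, num_classes=10)

model.score(x, y)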

Using TensorFlow backend.


0.87539999999999996

A random forest with 100 trees outperforms the single decision tree, at the cost of a longer training time.

In [4]:
from sklearn import ensemble

N, H, W = x_train.shape
x = x_train.reshape((N,H*W)).astype('float') / 255
y = to_categorical(y_train, num_classes=10)

model = ensemble.RandomForestClassifier(n_estimators=100)
model.fit(x, y)

N, H, W = x_test.shape
x = x_test.reshape((N,H*W)).astype('float') / 255
y = to_categorical(y_test, num_classes=10)

model.score(x, y)

0.90229999999999999

## Decision Tree Construction
### Tree Node
There are two types of nodes in a decision tree. **Prediction nodes** are the leaf nodes of the tree. **Decision nodes** are the internal (non-leaf) nodes of the tree. Most decision tree implementations use a binary tree, i.e. every decision node splits into at most two branches. Now the question is, how do we split the data at every decision node? The essence of splitting comes down to reducing impurity. We keep splitting until the impurity is zero, i.e. all the data in a node belong to the same class, unless a maximum depth restriction stops us first.

In [17]:
# Create a small toy dataset, as if it were loaded from a CSV file
header = ['color', 'diameter', 'label']

training_data = [
    ['Green', 3, 'Apple'],
    ['Yellow', 3, 'Apple'],
    ['Red', 1, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Yellow', 3, 'Lemon']
]

testing_data = [
    ['Green', 3, 'Apple'],
    ['Yellow', 4, 'Apple'],
    ['Red', 2, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Yellow', 3, 'Lemon']
]

def class_counts(rows):
    """Counts the number of each type of example in a dataset.
    """
    counts = dict()
    for row in rows:
        label = row[-1]
        if label not in counts:
            counts[label] = 0
        
        counts[label] += 1
    
    return counts
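
The two node types can be sketched as small classes. This is one possible layout, not the only one: the names `Leaf`, `DecisionNode`, `matches` and `partition` are assumptions made for this illustration, and the rule "numeric columns use `>=`, categorical columns use `==`" is a design choice of the sketch.

```python
class Leaf(object):
    """Prediction node: stores the label counts of the rows that reach it."""

    def __init__(self, rows):
        counts = {}
        for row in rows:
            counts[row[-1]] = counts.get(row[-1], 0) + 1
        self.predictions = counts


class DecisionNode(object):
    """Decision node: tests one column and routes rows to one of two branches."""

    def __init__(self, column, value, true_branch, false_branch):
        self.column = column
        self.value = value
        self.true_branch = true_branch
        self.false_branch = false_branch


def matches(row, column, value):
    """Numeric columns are tested with >=, categorical ones with ==."""
    if isinstance(value, (int, float)):
        return row[column] >= value
    return row[column] == value


def partition(rows, column, value):
    """Split rows into those that pass the test and those that do not."""
    true_rows = [r for r in rows if matches(r, column, value)]
    false_rows = [r for r in rows if not matches(r, column, value)]
    return true_rows, false_rows


rows = [
    ['Green', 3, 'Apple'],
    ['Yellow', 3, 'Apple'],
    ['Red', 1, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Yellow', 3, 'Lemon'],
]

# One decision node asking "is diameter >= 3?", with a leaf on each side
true_rows, false_rows = partition(rows, 1, 3)
node = DecisionNode(1, 3, Leaf(true_rows), Leaf(false_rows))
print(node.true_branch.predictions)
print(node.false_branch.predictions)
```

The false branch contains only grapes, so it is already pure. The true branch still mixes apples and a lemon, so a full tree builder would try to split it again.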

### CART Gini Index
A population is **pure** if, whenever we select two items from it, the probability of them being of the same class is one. The Gini index measures how impure a population is. It is 0 for a pure population and approaches 1 as the items are spread evenly over more and more classes.

$$
G = 1 - \sum_{i=1}^{C} P(i)^{2}
$$
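
As a quick check of the range, consider the most impure case, where the items are spread evenly over the $C$ classes so that $P(i) = 1/C$ for every class:

$$
G_{max} = 1 - \sum_{i=1}^{C} \left(\frac{1}{C}\right)^{2} = 1 - \frac{1}{C}
$$

This gives 0.5 for two classes and 0.9 for ten classes, so the index gets close to 1 only when there are many classes.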

The procedure to score a candidate split is:

1. Calculate the Gini index for the left and right sub-nodes.
2. Take the average of the two indices, weighted by the number of items in each sub-node, as the impurity of the split.

In [47]:
def gini(data_rows):
    counts = class_counts(data_rows)
    impurity = 1
    for label in counts:
        prob = counts[label] / float(len(data_rows))
        impurity -= prob**2
    
    return impurity

# Impurity should be 0
left_branch_gini = gini([['Apple'], ['Apple'], ['Apple']])

# Impurity should be high
right_branch_gini = gini([['Apple'], ['Orange'], ['Banana'], ['Apple'], ['Orange']])

# Use the weighted average to compute the total impurity of the split
print('Subpar split: %f' % (left_branch_gini * (3.0/8) + right_branch_gini * (5.0/8)))

# The lower the impurity, the better the split
left_branch_gini = gini([['Apple'], ['Apple'], ['Apple'], ['Apple'], ['Apple']])
right_branch_gini = gini([['Orange'], ['Banana'], ['Orange']])
print('Optimal split: %f' % (left_branch_gini * (5.0/8) + right_branch_gini * (3.0/8)))

Subpar split: 0.400000
Optimal split: 0.166667
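
The same weighted Gini score can be used to pick a split automatically. The sketch below tries every (column, value) pair on the toy fruit data and keeps the one with the lowest weighted impurity. The names `matches` and `best_split` are assumptions made for this illustration, as is the rule of testing numeric columns with `>=` and categorical columns with `==`. The block redefines `gini` so that it runs on its own.

```python
def gini(rows):
    """Gini impurity of the labels stored in the last column of rows."""
    counts = {}
    for row in rows:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    return 1.0 - sum((n / float(len(rows))) ** 2 for n in counts.values())


def matches(row, column, value):
    """Numeric columns are tested with >=, categorical ones with ==."""
    if isinstance(value, (int, float)):
        return row[column] >= value
    return row[column] == value


def best_split(rows):
    """Return (impurity, column, value) of the lowest-impurity split, or None."""
    best = None
    n = float(len(rows))
    for column in range(len(rows[0]) - 1):          # last column is the label
        for value in sorted(set(row[column] for row in rows)):
            left = [r for r in rows if matches(r, column, value)]
            right = [r for r in rows if not matches(r, column, value)]
            if not left or not right:
                continue                            # this test separates nothing
            impurity = gini(left) * len(left) / n + gini(right) * len(right) / n
            if best is None or impurity < best[0]:  # ties keep the first split found
                best = (impurity, column, value)
    return best


training_data = [
    ['Green', 3, 'Apple'],
    ['Yellow', 3, 'Apple'],
    ['Red', 1, 'Grape'],
    ['Red', 1, 'Grape'],
    ['Yellow', 3, 'Lemon'],
]

impurity, column, value = best_split(training_data)
print('Best split: column %d, value %r, impurity %f' % (column, value, impurity))
```

On this data, splitting on the color `Red` and splitting on a diameter of at least 3 separate the rows in the same way, so they tie. The sketch keeps whichever it encounters first.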


### ID3
ID3 is one of the classic algorithms for building decision trees. It employs a top-down, greedy search through the space of possible branches with no backtracking. ID3 uses *entropy* and *information gain* to construct a decision tree.

Entropy is defined as follows:

$$
E = -\sum_{i=1}^{C} P(i) \log_{2} P(i)
$$

The procedure to score a candidate split is:

1. Calculate the entropy before the split happens.
2. Calculate the entropy for the left and right sub-branches.
3. Subtract the weighted sum of the sub-branch entropies from the prior entropy to get the information gain.

$$
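IG = E_{0} - \sum_{j=1}^{2} \frac{N_{j}}{N} E_{j}
$$

Here $E_{0}$ is the entropy before the split, $E_{j}$ is the entropy of sub-branch $j$, $N_{j}$ is the number of items in that sub-branch, and $N$ is the number of items before the split.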

In [48]:
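def entropy(data_rows):
    """Entropy of the class labels in data_rows.

    Note: np.log is the natural logarithm, so the result is in nats rather
    than bits. Use np.log2 to match the base-2 formula; the ranking of
    splits is the same either way.
    """
    counts = class_counts(data_rows)
    total = 0
    for label in counts:
        prob = counts[label] / float(len(data_rows))
        total += -1 * prob * np.log(prob)

    return total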


def info_gain(partitions):
    combined = []
    for part in partitions:
        combined += part
    
    gain = entropy(combined)
    for part in partitions:
        prob = float(len(part)) / len(combined)
        gain -= prob * entropy(part)
        
    return gain
    

left_part = [['Apple'], ['Apple'], ['Apple']]
right_part = [['Apple'], ['Orange'], ['Banana'], ['Apple'], ['Orange']]
print('Subpar split: %f' % info_gain([left_part, right_part]))

# The higher the information gain, the better the split
left_part = [['Apple'], ['Apple'], ['Apple'], ['Apple'], ['Apple']]
right_part = [['Orange'], ['Banana'], ['Orange']]
print('Optimal split: %f' % info_gain([left_part, right_part]))

Subpar split: 0.240931
Optimal split: 0.661563
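
The numbers printed above come from the natural logarithm, so they are in nats, while the entropy formula is written with a base-2 logarithm. The sketch below is a standalone variant that takes the logarithm base as a parameter, to show that the two differ only by a constant factor of ln 2. The `base` parameter and the use of flat label lists instead of rows are assumptions made for this illustration.

```python
import math


def entropy(labels, base=2.0):
    """Entropy of a list of class labels, in units set by the log base."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    n = float(len(labels))
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())


def info_gain(parts, base=2.0):
    """Entropy before the split minus the weighted entropy of each part."""
    combined = [label for part in parts for label in part]
    gain = entropy(combined, base)
    for part in parts:
        gain -= len(part) / float(len(combined)) * entropy(part, base)
    return gain


subpar = [['Apple'] * 3, ['Apple', 'Orange', 'Banana', 'Apple', 'Orange']]
optimal = [['Apple'] * 5, ['Orange', 'Banana', 'Orange']]

for name, parts in [('Subpar', subpar), ('Optimal', optimal)]:
    bits = info_gain(parts, base=2.0)
    nats = info_gain(parts, base=math.e)
    print('%s split: %f bits, %f nats' % (name, bits, nats))
```

Since changing the base only rescales every score by the same factor, the choice does not affect which split is ranked best.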
