# Random Forest Classifier from Scratch

## Introduction

In this IPython Notebook, we implement the random forest classification algorithm from scratch, with help from the C4.5 decision tree classifier that we built previously.  Random forest is an ensemble machine learning method that combines the outputs of many decision tree classifiers and takes their majority vote.

In [5]:
import pandas as pd
import numpy as np
import c45dtree as dt


## Bagging Samples

A random forest builds a set of decision trees.  Each tree is trained on a bootstrap sample of the original data (drawn at random with replacement), as shown below:

|Original Data|1|2|3|4|5|6|7|8|9|10|
|--|--|--|--|--|--|--|--|--|--|--|
|Bagging (1)|7|8|10|8|2|5|10|10|5|9|
|Bagging (2)|1|4|9|1|2|3|2|7|3|2|
|Bagging (3)|1|8|5|10|5|5|9|6|3|7|

Samples that are not drawn in a given iteration are called out-of-bag samples.


In [6]:
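def bagSamples(m):
    # Draw m row indices with replacement (the bootstrap sample).
    b = np.random.choice(m, size=m, replace=True)
    # Indices that were never drawn are the out-of-bag samples.
    drawn = set(b)
    oob = [i for i in range(m) if i not in drawn]
    return b, oob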
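
As a quick check on the sampler above, the following sketch draws one bootstrap sample and measures how many rows end up out-of-bag. The name `bag_samples_sketch` and the seeded `RandomState` are assumptions for illustration and are not part of the notebook; it uses `np.setdiff1d` rather than a list comprehension, which should give the same out-of-bag indices. For large `m`, the out-of-bag fraction is expected to approach $(1 - 1/m)^m \approx e^{-1} \approx 0.368$.

```python
import numpy as np

def bag_samples_sketch(m, rng):
    """Draw m row indices with replacement; return them plus the out-of-bag indices."""
    b = rng.choice(m, size=m, replace=True)
    oob = np.setdiff1d(np.arange(m), b)  # sorted indices that were never drawn
    return b, oob

rng = np.random.RandomState(0)  # seed chosen arbitrarily, only for repeatability
m = 10000
b, oob = bag_samples_sketch(m, rng)
oob_fraction = len(oob) / float(m)
print("out-of-bag fraction: %.3f" % oob_fraction)
```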


## Random Feature Selection

The idea is that combining a set of weak classifiers yields a strong classifier.  In addition to drawing bootstrap samples as input, we make each decision tree weaker by selecting only $\sqrt { n }$ random features, where $n$ is the number of features, to build it.

In [7]:
import sys

def buildForest(X, y, forestSize=100, maxDepth=50, minLeafSize=1):
    forest = []
    oobs = []
    
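    for k in range(forestSize):
        # Progress indicator, overwritten in place.
        sys.stdout.write("\rIteration #%d" % k)
        sys.stdout.flush()
        
        # Train each tree on its own bootstrap sample.
        b, oob = bagSamples(len(y))
        X_k = X.iloc[b]
        y_k = y.iloc[b]
        tree_k = dt.buildTree(X_k, y_k, maxDepth=maxDepth, minLeafSize=minLeafSize, randomFeatrue=True)
        forest.append(tree_k)
        # Keep the out-of-bag indices for the later error estimate.
        oobs.append(oob)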
        
    return forest, oobs
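
How `dt.buildTree` applies `randomFeatrue=True` is not shown in this notebook. As a rough sketch, assuming the subset is drawn without replacement and $\sqrt{n}$ is rounded down (both assumptions), picking the candidate features could look like this. `random_feature_subset` is a hypothetical helper, not a function of `c45dtree`.

```python
import numpy as np

def random_feature_subset(columns, rng):
    """Pick floor(sqrt(n)) distinct feature names at random (at least one)."""
    n = len(columns)
    k = max(1, int(np.sqrt(n)))
    picked = rng.choice(n, size=k, replace=False)  # distinct positions
    return [columns[i] for i in sorted(picked)]

rng = np.random.RandomState(0)  # seed chosen arbitrarily
columns = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare",
           "Embarked", "Title", "HasCabin", "FamilySize"]
subset = random_feature_subset(columns, rng)
print(subset)  # 3 of the 10 feature names
```

In the usual formulation of random forests the subset is redrawn at every split rather than once per tree; which of the two `c45dtree` does is not visible here.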


## Majority Vote of Classifiers

In [8]:
def predictForest(row, forest):
    # Collect one prediction per tree, then return the majority vote.
    p = [dt.predictTree(row, tree) for tree in forest]
    return dt.voteMajority(p)
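
`dt.voteMajority` is defined in the `c45dtree` module and is not shown here. A minimal sketch of what a majority vote over a list of predictions might look like is given below; `vote_majority_sketch` is a hypothetical name, and tie handling is left to `Counter`, which may differ from the module's behavior.

```python
from collections import Counter

def vote_majority_sketch(predictions):
    """Return the most frequent label among the predictions."""
    return Counter(predictions).most_common(1)[0][0]

votes = [1, 0, 1, 1, 0]  # one prediction per tree
print(vote_majority_sketch(votes))
```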


## Out-of-bag Estimation

In [9]:
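def oobError(X, y, oobs, forest):
    pred = []
    hist = []
    for k in range(len(forest)):
        # Predictions of tree k on its own out-of-bag rows only.
        pred.append(X.iloc[oobs[k]].apply(lambda row: dt.predictTree(row, forest[k]), axis=1))
        # Majority vote per sample over the first k+1 trees.
        hist.append(pd.DataFrame(pred).apply(lambda col: dt.voteMajority(col)) )
        
    # One error value per forest size: mean squared difference from the true labels.
    return pd.DataFrame(hist).apply(lambda row: np.mean((row-y[row.index])**2), axis=1)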
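
For reference, the out-of-bag estimate is usually defined per sample: each sample is voted on only by the trees that did not see it during training, and the error is the fraction of those votes that disagree with the true label. The sketch below illustrates that definition on a toy example with binary 0/1 labels. `oob_error_sketch`, the prediction matrix, and the mask layout are assumptions for illustration and are independent of the `oobError` function above, which reports one value per forest size.

```python
import numpy as np

def oob_error_sketch(preds, oob_mask, y):
    """Out-of-bag error for binary 0/1 labels.

    preds[k, i]    : prediction of tree k for sample i
    oob_mask[k, i] : True if sample i was out-of-bag for tree k
    y[i]           : true label of sample i
    """
    preds = np.asarray(preds)
    oob_mask = np.asarray(oob_mask, dtype=bool)
    y = np.asarray(y)

    votes_for_one = (preds * oob_mask).sum(axis=0)  # OOB trees voting 1
    n_votes = oob_mask.sum(axis=0)                  # OOB trees per sample
    covered = n_votes > 0                           # skip samples never out-of-bag

    # Strict majority for label 1; ties fall back to label 0 (an arbitrary choice).
    majority = (2 * votes_for_one[covered] > n_votes[covered]).astype(int)
    return float(np.mean(majority != y[covered]))

# Toy data: 3 trees (rows) and 3 samples (columns).
preds = [[1, 0, 1],
         [1, 1, 0],
         [0, 0, 1]]
oob_mask = [[True, False, True],
            [True, True, False],
            [False, True, False]]
print(oob_error_sketch(preds, oob_mask, [1, 0, 1]))
```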


## Submit to the Kaggle Titanic Competition

In [10]:
def submitTitanic():
    df = dt.prepareTitanic("train.csv")
    X = df[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "Title", "HasCabin", "FamilySize"]]
    y = df["Survived"]
    forest, oobs = buildForest(X, y, forestSize=500, maxDepth=10, minLeafSize=1)

    df_test = dt.prepareTitanic("test.csv")
    X_test = df_test[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "Title", "HasCabin", "FamilySize"]]
    p_test = X_test.apply(lambda row: predictForest(row, forest), axis=1)
    submission = pd.DataFrame({ "PassengerId": df_test["PassengerId"],
                                "Survived": p_test})
    submission.to_csv("forest.csv", index=False)
    return forest, oobs

#uncomment to run
forest, oobs = submitTitanic()
#500, 8, 1, 0.80861


Iteration #499