# Random Forest Tree Implementation Using the Scikit-Learn Library

## Decision Tree
Decision Trees are commonly used in data mining with the objective of creating a model that predicts the value of a target (or dependent) variable based on the values of several input (or independent) variables.

## Random Forests
The general method of random decision forests was first proposed by Ho in 1995.

### Classification Tree
In a classification tree, the target variable is categorical, and the tree is used to identify the "class" into which the target variable would most likely fall.
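As a minimal sketch of a single classification tree, the cell below fits scikit-learn's `DecisionTreeClassifier` on the bundled Iris data. The Iris data set, the `max_depth=3` setting and the `*_demo` names are assumptions chosen for illustration; they are not part of the glass experiment in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Iris is used here only as a stand-in data set with a categorical target.
iris = load_iris()
X_demo, y_demo = iris.data, iris.target

# A shallow tree keeps the learned rules easy to inspect.
tree_demo = DecisionTreeClassifier(max_depth=3, random_state=0)
tree_demo.fit(X_demo, y_demo)

# Predict the class index of the first sample and map it to its class name.
predicted_index = tree_demo.predict(X_demo[:1])[0]
print(iris.target_names[predicted_index])
```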


### Random Forest Representation
A Random Forest model is represented as a collection of many classification trees, which is why it is called a forest.

Random Forests grows many classification trees. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
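The voting rule described above can be sketched in a few lines. The helper name `majority_vote` and the five example votes are hypothetical; they only illustrate how the most common class is picked.

```python
from collections import Counter

def majority_vote(votes):
    """Return the class label that received the most votes."""
    # most_common(1) returns a list holding the single (label, count) pair
    # with the highest count.
    return Counter(votes).most_common(1)[0][0]

# Hypothetical votes from five trees for one input vector.
tree_votes = ["headlamps", "containers", "headlamps", "tableware", "headlamps"]
print(majority_vote(tree_votes))  # headlamps
```

With tied counts, `Counter.most_common` returns the label that was encountered first, so a real implementation may want an explicit tie-breaking rule.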

#### Features
* It is highly accurate compared with many other algorithms.
* It runs efficiently on large databases.
* It can handle thousands of input variables without variable deletion.
* It gives estimates of what variables are important in the classification.
* It generates an internal unbiased estimate of the generalization error as the forest building progresses.
* It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
* It has methods for balancing error in class population unbalanced data sets.
* Generated forests can be saved for future use on other data.
* Prototypes are computed that give information about the relation between the variables and the classification.
* It computes proximities between pairs of cases that can be used in clustering, locating outliers, or (by scaling) give interesting views of the data.
* The capabilities of the above can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.
* It offers an experimental method for detecting variable interactions.
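Two of the features listed above, the variable importance estimates and the internal estimate of the generalization error, are exposed by scikit-learn as `feature_importances_` and `oob_score_`. The sketch below uses the Iris data and `n_estimators=200` as illustrative assumptions; it is not the configuration used for the glass data in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# oob_score=True scores each sample using only the trees that did not
# see it in their bootstrap sample (the "out-of-bag" estimate).
forest_demo = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest_demo.fit(iris.data, iris.target)

print("OOB accuracy estimate:", forest_demo.oob_score_)

# Importances are normalized so that they sum to 1 across all features.
for name, importance in zip(iris.feature_names, forest_demo.feature_importances_):
    print(f"{name}: {importance:.3f}")
```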

### The pseudocode for the random forest algorithm can be split into two stages.

* Random forest creation pseudocode.
* Pseudocode to perform prediction from the created random forest classifier.

First, let’s begin with the random forest creation pseudocode.

#### Random forest creation pseudocode:
1. Randomly select **“k”** features from total **“m”** features.
    * Where **k << m**
    
2. Among the **“k”** features, calculate the node **“d”** using the best split point.

3. Split the node into daughter nodes using the best split.

4. Repeat steps 1 to 3 until **“l”** nodes have been reached.

5. Build the forest by repeating steps 1 to 4 **“n”** times to create **“n”** trees.

Together, these **“n”** randomly created trees form the random forest.
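The creation steps above can be approximated with scikit-learn's `DecisionTreeClassifier`, which already picks a random subset of features at each split when `max_features` is set. This is a sketch under stated assumptions, not the library's own forest implementation: `build_forest` and its parameter names are hypothetical, `max_leaf_nodes` stands in for the node limit "l", and the Iris data is used only for the demonstration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def build_forest(X, y, n_trees, k_features, max_leaf_nodes, seed=0):
    """Grow n_trees randomized trees, loosely following steps 1 to 5."""
    rng = np.random.RandomState(seed)
    forest = []
    for _ in range(n_trees):  # step 5: repeat to obtain n trees
        # max_features=k makes the tree consider k randomly chosen features
        # at each split (steps 1 to 3); max_leaf_nodes bounds the tree
        # size (an approximation of step 4).
        tree = DecisionTreeClassifier(
            max_features=k_features,
            max_leaf_nodes=max_leaf_nodes,
            random_state=rng.randint(0, 2**31 - 1),
        )
        forest.append(tree.fit(X, y))
    return forest

X_demo, y_demo = load_iris(return_X_y=True)
demo_forest = build_forest(X_demo, y_demo, n_trees=5, k_features=2, max_leaf_nodes=8)
print(len(demo_forest))  # 5
```

The pseudocode does not mention it, but `RandomForestClassifier` additionally trains each tree on a bootstrap sample of the rows by default; that step is left out here to stay close to the steps listed above.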

#### Random forest prediction pseudocode:

To perform prediction with the trained random forest, the algorithm uses the pseudocode below.

1. Take the test features and use the rules of each randomly created decision tree to predict the outcome, then store the predicted outcome (target).

2. Calculate the votes for each predicted target.

3. Take the predicted target with the most votes as the final prediction of the random forest algorithm.
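The three prediction steps can be sketched as a hard majority vote over a list of fitted trees. The function `forest_predict` and the small demo forest are hypothetical and exist only to make the example self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def forest_predict(trees, X):
    """Majority vote over the predictions of already fitted trees."""
    # Step 1: one row of predictions per tree, shape (n_trees, n_samples).
    all_votes = np.array([tree.predict(X) for tree in trees])
    final = []
    for sample_votes in all_votes.T:
        # Step 2: count the votes each predicted target received.
        labels, counts = np.unique(sample_votes, return_counts=True)
        # Step 3: keep the target with the most votes.
        final.append(labels[np.argmax(counts)])
    return np.array(final)

X_demo, y_demo = load_iris(return_X_y=True)
demo_trees = [
    DecisionTreeClassifier(max_features=2, random_state=seed).fit(X_demo, y_demo)
    for seed in range(5)
]
demo_predictions = forest_predict(demo_trees, X_demo[:3])
print(demo_predictions)
```

Note that scikit-learn's `RandomForestClassifier.predict` averages the class probabilities of the trees rather than counting hard votes, so its result can differ from this sketch when the trees are uncertain.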

### Loading the required modules from the scikit-learn library:

In [1]:
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from joblib import Memory  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.datasets import load_svmlight_file



### Loading Other modules

In [2]:
import csv
import pandas as pd 
import numpy as np
import time

### Enabling Inline Plotting

In [3]:
%matplotlib inline

### Loading Dataset

In [4]:
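# Load a dataset from a CSV or libsvm (svmlight) file
def load_data(filename, filetype):
    if filetype == 'csv':
        # The last column is treated as the target, all others as features
        dataset = pd.read_csv(filename)
        nrow, ncol = dataset.shape
        X = dataset.iloc[:, :ncol-1]
        Y = dataset.iloc[:, ncol-1:ncol]
        return X, Y
    
    elif filetype == 'libsvm':
        # Cache the parsed file on disk so that repeated loads are fast
        mem = Memory("./mycache")
        # Note: get_data takes no arguments, so the cache key does not include
        # filename; clear ./mycache before loading a different file
        @mem.cache
        def get_data():
            data = load_svmlight_file(filename)
            return data[0], data[1]
        X, Y = get_data()
        return X, Y
    
    else:
        # Raise instead of printing so that callers do not unpack None
        raise ValueError('File Type Not Supported: ' + str(filetype))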

### Prepare data 

#### Glass Identification Data Set 
From the USA Forensic Science Service; 6 types of glass, defined in terms of their oxide content (i.e. Na, Fe, K, etc.).

* Number of Instances: 214
* Number of Attributes: 10 (including the Id number), plus the class attribute

Attribute Information:

1. Id number: 1 to 214 
2. RI: refractive index 
3. Na: Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10) 
4. Mg: Magnesium 
5. Al: Aluminum 
6. Si: Silicon 
7. K: Potassium 
8. Ca: Calcium 
9. Ba: Barium 
10. Fe: Iron 
11. Type of glass: (class attribute) 
    1. building_windows_float_processed 
    2. building_windows_non_float_processed 
    3. vehicle_windows_float_processed 
    4. vehicle_windows_non_float_processed (none in this database) 
    5. containers 
    6. tableware 
    7. headlamps



In [5]:
# Dataset Splitting For Training and Testing
X, Y = load_data('glass', 'libsvm')
data = X
target = Y

data_train, data_test, target_train, target_test = train_test_split(data, target, test_size=0.33, random_state=42)

________________________________________________________________________________
[Memory] Calling __main__-E%3A-DM_Git-Classifier-__ipython-input__.get_data...
get_data()
_________________________________________________________get_data - 0.0s, 0.0min


In [6]:
# fit a Random Forest model to the data
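# max_features='sqrt' matches the former 'auto' setting, which newer scikit-learn versions no longer accept
cfTree = RandomForestClassifier(criterion='gini', max_depth=10, max_features='sqrt', max_leaf_nodes=10)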

startTime = time.time()
cfTree.fit(data_train, target_train)
endTime = time.time()
timeDiff = endTime - startTime
print(timeDiff)

0.1873645782470703


In [7]:
# make predictions
expectedClass = target_test
predictedClass = cfTree.predict(data_test)

In [8]:
# summarize the fit of the model
#print(metrics.classification_report(expectedClass, predictedClass))

In [9]:
#print(metrics.confusion_matrix(expectedClass, predictedClass))
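To show what the two commented-out summaries report, here is a small sketch on made-up labels. The lists `expected_demo` and `predicted_demo` are hypothetical and are not results from the glass data.

```python
from sklearn import metrics

# Hypothetical labels, chosen only to keep the output small.
expected_demo = [1, 1, 2, 2, 7, 7]
predicted_demo = [1, 2, 2, 2, 7, 1]

# Rows are true classes and columns are predicted classes, both in
# sorted label order (1, 2, 7).
matrix_demo = metrics.confusion_matrix(expected_demo, predicted_demo)
print(matrix_demo)

# Per-class precision, recall and F1 score.
print(metrics.classification_report(expected_demo, predicted_demo))
```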

In [10]:
# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0

In [11]:
accuracy = accuracy_metric(expectedClass, predictedClass)
print(accuracy)

74.64788732394366
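The hand-written accuracy function computes the same quantity as scikit-learn's `metrics.accuracy_score`, scaled to a percentage. A minimal sketch on hypothetical labels:

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels: three of the four predictions are correct.
expected_demo = np.array([1, 2, 2, 7])
predicted_demo = np.array([1, 2, 1, 7])

# accuracy_score returns a fraction in [0, 1]; multiply by 100 for percent.
accuracy_demo = metrics.accuracy_score(expected_demo, predicted_demo) * 100.0
print(accuracy_demo)  # 75.0
```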


## Pruning
<table>
    <tr>
        <th>Split Criterion</th>
        <th>Max Depth</th>
        <th>Max Leaf Nodes</th>
        <th>Run Time (s)</th>
        <th>Accuracy (%)</th>
    </tr>
    <tr>
        <td>Gini</td>
        <td>2</td>
        <td>Default</td>
        <td>0.022</td>
        <td>60.56</td>
    </tr>
    <tr>
        <td>Gini</td>
        <td>10</td>
        <td>5</td>
        <td>0.025</td>
        <td>70.42</td>
    </tr>  
    <tr>
        <td>Gini</td>
        <td>5</td>
        <td>10</td>
        <td>0.036</td>
        <td>73.23</td>
    </tr>
    <tr>
        <td>Gini</td>
        <td>5</td>
        <td>5</td>
        <td>0.039</td>
        <td>70.42</td>
    </tr> 
    <tr>
        <td>Entropy</td>
        <td>5</td>
        <td>10</td>
        <td>0.039</td>
        <td>69.01</td>
    </tr> 
    <tr>
        <td>Entropy</td>
        <td>10</td>
        <td>10</td>
        <td>0.043</td>
        <td>69.01</td>
    </tr>
    <tr>
        <td>Entropy</td>
        <td>5</td>
        <td>5</td>
        <td>0.022</td>
        <td>61.97</td>
    </tr>
    <tr>
        <td>Entropy</td>
        <td>10</td>
        <td>5</td>
        <td>0.039</td>
        <td>60.56</td>
    </tr>
</table>
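A table like the one above can be produced by looping over the parameter combinations. The sketch below is an assumption about how such a sweep might look, not the procedure actually used for the table: it runs on scikit-learn's bundled wine data as a stand-in for the glass file, fixes `random_state` for repeatability, and its timings and accuracies will differ from the values shown.

```python
import time
from itertools import product

from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# load_wine is a stand-in; the table above was produced on the glass data.
X_demo, y_demo = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

rows = []
for criterion, max_depth, max_leaf_nodes in product(
    ["gini", "entropy"], [5, 10], [5, 10]
):
    model = RandomForestClassifier(
        criterion=criterion,
        max_depth=max_depth,
        max_leaf_nodes=max_leaf_nodes,
        random_state=0,  # fixed so that repeated runs are comparable
    )
    # Time only the fit, as in the training cell of this notebook.
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    accuracy = accuracy_score(y_test, model.predict(X_test)) * 100.0
    rows.append((criterion, max_depth, max_leaf_nodes, elapsed, accuracy))

for row in rows:
    print("{:8s} depth={:<3d} leaves={:<3d} {:.3f}s {:.2f}%".format(*row))
```

Single timings of a few hundredths of a second are noisy, so averaging several runs per setting gives a fairer comparison of run times.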

### Conclusion
* In the runs above, the Gini criterion gave higher accuracy than Entropy for every matching depth and leaf-node setting in the table (for example, 73.23% versus 69.01% at depth 5 with 10 leaf nodes), so the Gini index looks like the better split criterion for this data set. Run times were similar for both criteria (roughly 0.02 to 0.04 s), so the table does not show a clear speed advantage for either.
* With the criterion and depth held fixed, increasing the maximum number of leaf nodes from 5 to 10 increased the accuracy in every case.
* The random forest method can handle missing values.
* Adding more trees to the forest does not cause the random forest classifier to overfit.
* A random forest classifier can also be used to model categorical values.