# Random Forest

I'm going to use the famous UCI abalone dataset to predict the age of abalone from physical measurements.  The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope, which is a boring and time-consuming task.  Other measurements, which are easier to obtain, are used to predict the age.  Further information, such as weather patterns and location (hence food availability), may be required to solve the problem.

The variables are described below:

    Name            Data Type   Meas.   Description
    ----            ---------   -----   -----------
    Sex                nominal          M, F, and I (infant)
    Length          continuous     mm   Longest shell measurement
    Diameter        continuous     mm   perpendicular to length
    Height          continuous     mm   with meat in shell
    Whole weight    continuous   grams  whole abalone
    Shucked weight  continuous   grams  weight of meat
    Viscera weight  continuous   grams  gut weight (after bleeding)
    Shell weight    continuous   grams  after being dried
    Rings              integer          +1.5 gives the age in years
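
As a quick illustration of the last row of the table, the age in years is the ring count plus 1.5. A minimal sketch of that conversion follows; the helper name `rings_to_age` is hypothetical and not part of the dataset or of any library.

```python
def rings_to_age(rings):
    """Convert an abalone ring count to an estimated age in years (rings + 1.5)."""
    return rings + 1.5


# Ring counts of the first three rows of the dataset: 15, 7 and 9
for rings in (15, 7, 9):
    print(rings, "rings ->", rings_to_age(rings), "years")
```

Since the offset is constant, predicting the ring count is equivalent to predicting the age, which is why the notebook works with `Rings` directly.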

In [54]:
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier as rforest

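# The file has no header row, so the columns are labelled 0..8 (0 = Sex, 8 = Rings)
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data', sep=',', header=None)

# One-hot encode Sex (column 0) as three boolean columns M, F and I
# (alternative: integer codes via pd.factorize(df[0])[0])
for label in "MFI":
    df[label] = df[0] == label
del df[0]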

df.head(10)

Unnamed: 0,1,2,3,4,5,6,7,8,M,F,I
0,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15,True,False,False
1,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7,True,False,False
2,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9,False,True,False
3,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10,True,False,False
4,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7,False,False,True
5,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8,False,False,True
6,0.53,0.415,0.15,0.7775,0.237,0.1415,0.33,20,False,True,False
7,0.545,0.425,0.125,0.768,0.294,0.1495,0.26,16,False,True,False
8,0.475,0.37,0.125,0.5095,0.2165,0.1125,0.165,9,True,False,False
9,0.55,0.44,0.15,0.8945,0.3145,0.151,0.32,19,False,True,False


In [55]:
# Target: Rings (column label 8); every remaining column is a feature
Y = np.asarray(df[8])
del df[8]
X = np.asarray(df)
df.head(10)

Unnamed: 0,1,2,3,4,5,6,7,M,F,I
0,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,True,False,False
1,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,True,False,False
2,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,False,True,False
3,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,True,False,False
4,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,False,False,True
5,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,False,False,True
6,0.53,0.415,0.15,0.7775,0.237,0.1415,0.33,False,True,False
7,0.545,0.425,0.125,0.768,0.294,0.1495,0.26,False,True,False
8,0.475,0.37,0.125,0.5095,0.2165,0.1125,0.165,True,False,False
9,0.55,0.44,0.15,0.8945,0.3145,0.151,0.32,False,True,False


In [58]:
X[0]

array([0.455, 0.365, 0.095, 0.514, 0.2245, 0.10099999999999999, 0.15, True,
       False, False], dtype=object)

# Decision tree

In [63]:
from sklearn import tree

# Train-test split: 75% of the rows for training, 25% for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.75)

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, Y_train)
pred = clf.predict(X_test)
print(classification_report(Y_test, pred))
print(confusion_matrix(Y_test,pred))

             precision    recall  f1-score   support

          1       0.00      0.00      0.00         1
          2       0.00      0.00      0.00         1
          3       0.00      0.00      0.00         2
          4       0.56      0.53      0.54        19
          5       0.24      0.27      0.26        22
          6       0.25      0.24      0.25        59
          7       0.30      0.33      0.31        90
          8       0.26      0.26      0.26       132
          9       0.24      0.24      0.24       163
         10       0.31      0.28      0.29       185
         11       0.16      0.18      0.17       118
         12       0.18      0.20      0.19        66
         13       0.08      0.07      0.07        58
         14       0.04      0.03      0.03        37
         15       0.11      0.17      0.14        23
         16       0.12      0.13      0.13        23
         17       0.00      0.00      0.00        10
         18       0.00      0.00      0.00   

# Random forest

In [70]:
rfor = rforest(n_estimators=250)
rfor.fit(X_train, Y_train)
pred = rfor.predict(X_test)
print(classification_report(Y_test, pred))

             precision    recall  f1-score   support

          1       0.00      0.00      0.00         1
          2       0.00      0.00      0.00         1
          3       0.25      0.50      0.33         2
          4       0.62      0.42      0.50        19
          5       0.27      0.41      0.33        22
          6       0.33      0.27      0.30        59
          7       0.32      0.39      0.35        90
          8       0.31      0.36      0.33       132
          9       0.26      0.34      0.29       163
         10       0.29      0.26      0.28       185
         11       0.23      0.33      0.27       118
         12       0.21      0.15      0.18        66
         13       0.15      0.10      0.12        58
         14       0.33      0.08      0.13        37
         15       0.15      0.09      0.11        23
         16       0.18      0.09      0.12        23
         17       0.20      0.20      0.20        10
         18       0.00      0.00      0.00   

In [71]:
print(confusion_matrix(Y_test,pred))

[[ 0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0]
 [ 0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0]
 [ 0  0  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0]
 [ 0  0  1  8  8  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0]
 [ 0  0  1  1  9  6  4  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0]
 [ 0  0  0  0 11 16 23  5  1  3  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0]
 [ 0  0  0  2  4 13 35 17 13  4  2  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0]
 [ 0  0  0  0  1  5 24 47 33 13  7  1  1  0  0  0  0  0  0  0  0  0  0  0
   0  0]
 [ 0  0  0  0  0  4  9 37 55 33 17  4  3  0  1  0  0  0  0  0  0  0  0  0
   0  0]
 [ 0  0  0  0  0  1  3 21 56 49 39  5  5  0  1  3  2  0  0  0  0  0  0  0
   0  0]
 [ 0  0  0  0  0  2  4  9 24 24 39  8  4  1  2  1  0  0  0  0  0  0  0  0
   0  0]
 [ 0  0  0  0  0  0  1  6  8 10 25 10  3  2  0  0  0  0  0  1  0  0  0  0
   0  0]
 [ 0

# Cross-validated version

In [72]:
from sklearn.model_selection import cross_val_score

# 10-fold cross-validation: mean accuracy over the 10 folds
print(np.mean(cross_val_score(rfor, X_train, Y_train, cv=10)))

0.239109165069


The random forest barely outperformed the decision tree. Looking at the confusion matrix, however, most predictions fall on or close to the diagonal, so the predicted ring count is usually not far from the true one, and the model could still be useful for biological studies.
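
One way to put a number on how close the predictions are is to count a prediction as correct when it lands within a small number of rings of the true value. The following is a minimal sketch and not part of the original notebook: the helper name `within_tolerance_accuracy` and the toy ring counts are made up for illustration. In the notebook you would pass `Y_test` and `pred` instead.

```python
def within_tolerance_accuracy(y_true, y_pred, tol=1):
    """Fraction of predictions whose ring count is within `tol` of the truth."""
    y_true, y_pred = list(y_true), list(y_pred)
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    # Count the predictions that miss the true ring count by at most `tol`
    hits = sum(abs(int(t) - int(p)) <= tol for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Toy data (hypothetical ring counts, not taken from the abalone results)
toy_true = [9, 10, 7, 8, 11]
toy_pred = [9, 11, 9, 8, 14]

exact = within_tolerance_accuracy(toy_true, toy_pred, tol=0)  # plain accuracy
near = within_tolerance_accuracy(toy_true, toy_pred, tol=1)   # off by at most one ring
print(exact)
print(near)
```

With `tol=0` this reduces to ordinary accuracy, so comparing it against `tol=1` or `tol=2` shows how much of the classifier's error comes from near misses.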