# Challenge: If a tree falls in the forest...

Now that you've learned about random forests and decision trees, let's do an exercise in accuracy. You know that random forests are basically a collection of decision trees. But how do the accuracies of the two models compare?

So here's what you should do. Pick a dataset. It could be one you've worked with before or it could be a new one. Then build the best decision tree you can.

Now try to match that accuracy with the simplest random forest you can. For our purposes, measure simplicity by runtime, and compare the forest's runtime to that of the decision tree. This measure is imperfect, but just go with it.
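Since the comparison hinges on runtime, it can help to wrap the timing in one small helper so both models are measured the same way. The following is only a minimal sketch: the `timed` function is a hypothetical name, not part of the challenge, and `sum` stands in for the model evaluation.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()  # monotonic clock suited to measuring intervals
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload; in the challenge this would be the model evaluation.
total, seconds = timed(sum, range(1_000_000))
print('result: %d, time to run: %.4f seconds' % (total, seconds))
```

A single run is noisy, so repeating the measurement a few times and comparing the typical value is usually safer than trusting one number.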

## Data Content

This dataset contains information about fees assessed against properties in the state of New York and is updated daily. My decision tree will predict the borough in which each fee was issued (the `BoroID` column), which uses these codes:

* 1 = Manhattan
* 2 = Bronx
* 3 = Brooklyn
* 4 = Queens
* 5 = Staten Island
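The codes above can be kept in a small lookup table so that numeric predictions are easier to read. This is a sketch only: `BORO_NAMES` and `boro_name` are hypothetical helper names, and the list of predictions is made up for illustration.

```python
# Borough codes as listed above.
BORO_NAMES = {
    1: 'Manhattan',
    2: 'Bronx',
    3: 'Brooklyn',
    4: 'Queens',
    5: 'Staten Island',
}


def boro_name(boro_id):
    """Translate a numeric borough code into its name."""
    return BORO_NAMES.get(boro_id, 'Unknown')


# Made-up predictions, purely for illustration.
predicted = [3, 3, 1, 5]
print([boro_name(code) for code in predicted])
```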

---

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Office file location (file1) and personal file location (file2)
file1 = 'C:/Users/ccarlsjh/Desktop/Important Files/Untitled Folder/Class/3 Deeper into supervised learning/Challenges/3 Dataset.csv'
file2 = 'C:/Users/Carter Carlson/Documents/Thinkful/Classwork/3 Deeper into supervised learning/Challenges/3 Dataset.csv'

try:
    df = pd.read_csv(file1)
except FileNotFoundError:
    df = pd.read_csv(file2)

In [3]:
df = df.dropna()
df.head()

| Unnamed: 0 | FeeID | BuildingID | BoroID | Boro | HouseNumber | StreetName | Zip | Block | Lot | LifeCycle | FeeTypeID | FeeType | FeeSourceTypeID | FeeSourceType | FeeSourceID | FeeIssuedDate | FeeAmount | DoFAccountType | DoFTransferDate |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 441 | 183598 | 3 | BROOKLYN | 227 | ALBANY AVENUE | 11213.0 | 1370 | 14 | Building | 1 | Initial Re-inspection Fee | 51 | PROJECT BLDG | 30305 | 2008-05-23T00:00:00 | 2000.0 | 236 | 2008-06-20T00:00:00 |
| 1 | 442 | 381566 | 3 | BROOKLYN | 232 | TOMPKINS AVENUE | 11216.0 | 1785 | 39 | Building | 1 | Initial Re-inspection Fee | 51 | PROJECT BLDG | 30305 | 2008-05-23T00:00:00 | 1500.0 | 236 | 2008-06-20T00:00:00 |
| 2 | 443 | 330335 | 3 | BROOKLYN | 786 | MACON STREET | 11233.0 | 1497 | 18 | Building | 1 | Initial Re-inspection Fee | 51 | PROJECT BLDG | 30305 | 2008-05-23T00:00:00 | 1500.0 | 236 | 2008-06-20T00:00:00 |
| 3 | 444 | 357697 | 3 | BROOKLYN | 1109 | PUTNAM AVENUE | 11221.0 | 3366 | 54 | Building | 1 | Initial Re-inspection Fee | 51 | PROJECT BLDG | 30305 | 2008-05-23T00:00:00 | 1500.0 | 236 | 2008-06-20T00:00:00 |
| 4 | 445 | 381865 | 3 | BROOKLYN | 237 | TROUTMAN STREET | 11237.0 | 3174 | 42 | Building | 1 | Initial Re-inspection Fee | 51 | PROJECT BLDG | 30305 | 2008-05-23T00:00:00 | 3000.0 | 236 | 2008-06-20T00:00:00 |


In [4]:
# Print each text (object) column's name followed by its number of distinct values
categorical = df.select_dtypes(include=['object'])
for name in categorical:
    print(name)
    print(categorical[name].nunique())

Boro
5
HouseNumber
3039
StreetName
1720
LifeCycle
4
FeeType
7
FeeSourceType
4
FeeIssuedDate
2927
DoFTransferDate
125


In [5]:
# Remove the categorical features with more than 200 unique values,
# plus the Boro column, which repeats the target as a name instead of a number
df = df.drop(['HouseNumber','StreetName','FeeIssuedDate','Boro'], axis=1)

# Also remove zip code and block number, since the decision tree would overfit on them
df = df.drop(['Zip','Block'], axis=1)
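The 200-value cutoff in the cell above was applied by reading the printed counts. As a rough sketch, the same rule could be expressed in code. The helper name `high_cardinality_columns` and the toy frame are assumptions made for illustration; they are not the notebook's data.

```python
import pandas as pd


def high_cardinality_columns(frame, max_unique=200):
    """Return names of non-numeric columns with more than max_unique distinct values."""
    counts = frame.select_dtypes(exclude='number').nunique()
    return counts[counts > max_unique].index.tolist()


# Toy data standing in for the real fee dataset.
toy = pd.DataFrame({
    'StreetName': ['street %d' % i for i in range(300)],  # 300 distinct values
    'LifeCycle': ['Building', 'Planned'] * 150,           # 2 distinct values
    'FeeAmount': [float(i) for i in range(300)],          # numeric, so ignored
})

to_drop = high_cardinality_columns(toy)
print(to_drop)
trimmed = toy.drop(to_drop, axis=1)
print(trimmed.columns.tolist())
```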

In [6]:
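from sklearn.model_selection import cross_val_score
def get_accuracy(classifier):
    # Mean accuracy over 10 cross-validation folds.
    # X and Y are module-level variables that must be defined before this is called.
    accuracy = cross_val_score(classifier, X, Y, cv=10).mean()
    return accuracy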
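To make `cv=10` concrete, here is a simplified, pure-Python sketch of how k-fold splitting partitions the rows: each fold is used once as the test set while the rest are used for training, and the scores are then averaged. The function `k_fold_indices` is a hypothetical illustration. It produces plain contiguous folds, whereas scikit-learn stratifies the folds by class when a classifier is passed with an integer `cv`.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=25, k=10))
for train, test in folds[:2]:
    print('train on %d rows, test on rows %s' % (len(train), test))
```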

In [7]:
X = df.drop('BoroID', axis=1)
Y = df['BoroID']
X = pd.get_dummies(X)
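`pd.get_dummies` leaves numeric columns alone and replaces each text column with one indicator column per distinct value. The small sketch below shows the effect on a made-up frame; the `'Planned'` value is invented for illustration and may not occur in the real data.

```python
import pandas as pd

# Made-up rows: one numeric column and one text column.
toy = pd.DataFrame({
    'FeeAmount': [2000.0, 1500.0, 3000.0],
    'LifeCycle': ['Building', 'Planned', 'Building'],
})

encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```

Each indicator column is named after the original column and the value it marks, which is why dropping high-cardinality columns first keeps the feature matrix from growing by thousands of columns.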

In [8]:
import time

from sklearn import tree
d_tree = tree.DecisionTreeClassifier()
start_time = time.time()
print('Decision tree accuracy: %0.3f' % get_accuracy(d_tree))
print('Decision tree time to run: %1.2f seconds' % (time.time() - start_time))

from sklearn import ensemble
rfc = ensemble.RandomForestClassifier()
start_time = time.time()
print('Random forest accuracy: %0.3f' % get_accuracy(rfc))
print('Random forest time to run: %1.2f seconds' % (time.time() - start_time))

Decision tree accuracy: 0.990
Decision tree time to run: 2.33 seconds
Random forest accuracy: 0.749
Random forest time to run: 7.60 seconds
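The challenge asks for the simplest random forest that matches the tree, so one possible next step is to vary the forest's size and watch how accuracy and runtime move together. The sketch below is an assumption about how that search might look, not something the notebook ran: it uses synthetic data from `make_classification` in place of the fee dataset, and the tree counts and `max_depth=4` are arbitrary illustrative choices.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the fee dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=3, random_state=0
)

results = []
for n_trees in (5, 20, 50):
    forest = RandomForestClassifier(n_estimators=n_trees, max_depth=4, random_state=0)
    start = time.perf_counter()
    accuracy = cross_val_score(forest, X_demo, y_demo, cv=5).mean()
    elapsed = time.perf_counter() - start
    results.append((n_trees, accuracy, elapsed))
    print('%2d trees: accuracy %0.3f, time to run %1.2f seconds' % (n_trees, accuracy, elapsed))
```

On the real data, the same loop could call `get_accuracy` with `X` and `Y` instead, stopping at the smallest forest whose accuracy is close enough to the decision tree's.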
