# Problem Statement
Now that you've learned about random forests and decision trees, let's do an exercise in accuracy. You know that a random forest is basically a collection of decision trees. But how do the accuracies of the two models compare?

So here's what you should do. Pick a dataset. It could be one you've worked with before or it could be a new one. Then build the best decision tree you can.

Now try to match that accuracy with the simplest random forest you can. For our purposes, measure simplicity by runtime, and compare the forest's runtime to the decision tree's. This is an imperfect measure, but just go with it.

Hopefully out of this you'll see the power of random forests, but also their potential costs. Remember, in the real world you won't necessarily be dealing with thousands of rows. It could be millions, billions, or even more.
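
One way to make the runtime comparison concrete is a small timing helper. The sketch below is an assumption rather than part of the exercise: `timed` is a hypothetical name, and it uses `time.perf_counter`, which is generally better suited to short intervals than `time.time`. The workload is a stand-in, not a model fit.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Example: time a stand-in workload (summing integers) instead of a model fit.
total, seconds = timed(sum, range(1_000_000))
print("sum={} took {:.4f}s".format(total, seconds))
```

In the notebook, the same helper could wrap `cross_val_score(model, X, Y, cv=5)` for each model so that both are timed in exactly the same way.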

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import ensemble # random forest model
from sklearn.model_selection import cross_val_score
from sklearn import tree # decision tree model
from IPython.display import Image, display # display tree and DataFrames
import pydotplus # render tree
import graphviz # render tree
import time

In [11]:
data_path = '../../datasets/opusdata/opusdata.csv'
df = pd.read_csv(data_path)
df.drop(['movie_name', 'movie_odid', 'domestic_box_office'], axis = 1, inplace = True)
X = df.drop(['sequel'], axis = 1)
X = pd.get_dummies(X)
Y = df['sequel']
print(df.shape)
display(df.head())

(1784, 10)


Unnamed: 0,production_year,production_budget,international_box_office,rating,creative_type,source,production_method,genre,sequel,running_time
0,2006,10000000,366513,PG-13,Dramatization,Based on Fiction Book/Short Story,Live Action,Drama,0,
1,2006,10000000,175380,PG-13,Historical Fiction,Original Screenplay,Live Action,Drama,0,
2,2006,10000000,31000000,Not Rated,Science Fiction,Original Screenplay,Live Action,Action,1,
3,2006,10000000,62581,PG-13,Contemporary Fiction,Based on Play,Live Action,Comedy,1,
4,2006,10000000,9920000,PG-13,Contemporary Fiction,Original Screenplay,Live Action,Comedy,0,108.0


In [12]:
# clean data: list the columns that contain missing values
X_null = X.isnull().sum()
display(X_null[X_null > 0])
# impute running_time with its median (the only column with missing values)
X['running_time'] = X['running_time'].fillna(X['running_time'].median())

running_time    117
dtype: int64

In [7]:
# decision tree
# Initialize and train our tree.
dt_start_time = time.time()
decision_tree = tree.DecisionTreeClassifier(
    criterion='entropy',
    max_features=2,
    max_depth=5,
    random_state=1337
)
# Fit on the full data (only needed for rendering); this fit is included in the timing.
decision_tree.fit(X, Y)
# cross_val_score clones the estimator and refits it on each of the 5 folds.
display(cross_val_score(decision_tree, X, Y, cv=5))
print("Decision Tree runtime: {}".format(time.time() - dt_start_time))
'''
# Render our tree.
dot_data = tree.export_graphviz(
    decision_tree, out_file=None,
    feature_names=X.columns,
    class_names=['Not Sequel', 'Sequel'],
    filled=True
)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
''';

array([0.85195531, 0.85154062, 0.85434174, 0.84550562, 0.3258427 ])

Decision Tree runtime: 0.03255772590637207


In [8]:
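# random forest with default hyperparameters
rf_start_time = time.time()
rfc = ensemble.RandomForestClassifier()
# Only cross-validation is timed here; there is no separate fit on the full data.
display(cross_val_score(rfc, X, Y, cv=5))
print("Random Forest runtime: {}".format(time.time() - rf_start_time))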

array([0.86592179, 0.86554622, 0.87114846, 0.8258427 , 0.49719101])

Random Forest runtime: 0.13008785247802734
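
The fold scores and runtimes printed above can be summarized with a little arithmetic. The sketch below simply copies the reported numbers into lists; the variable names are illustrative, and the split into "all five folds" versus "first four folds" is an assumption made because the fifth fold is much lower for both models.

```python
from statistics import mean

# Fold scores and runtimes copied from the cell outputs above.
dt_scores = [0.85195531, 0.85154062, 0.85434174, 0.84550562, 0.3258427]
rf_scores = [0.86592179, 0.86554622, 0.87114846, 0.8258427, 0.49719101]
dt_runtime = 0.03255772590637207
rf_runtime = 0.13008785247802734

dt_mean, rf_mean = mean(dt_scores), mean(rf_scores)
# Means over the first four folds, excluding the outlying fifth fold.
dt_mean4, rf_mean4 = mean(dt_scores[:4]), mean(rf_scores[:4])
runtime_ratio = rf_runtime / dt_runtime

print("mean accuracy (5 folds): tree={:.3f} forest={:.3f}".format(dt_mean, rf_mean))
print("mean accuracy (folds 1-4): tree={:.3f} forest={:.3f}".format(dt_mean4, rf_mean4))
print("runtime ratio (forest / tree): {:.1f}x".format(runtime_ratio))
```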


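# Write-up
From this simple evaluation, the default random forest takes about 4x longer to run than the decision tree (0.13 s versus 0.03 s) in exchange for a small gain in accuracy: under 1 percentage point on the first four folds (mean 0.857 versus 0.851). Both models score far lower on the fifth fold (0.50 versus 0.33), which pulls the five-fold means down to 0.785 and 0.746 and suggests the rows may be ordered in a way that unshuffled folds handle poorly. The timing comparison is also rough: the decision tree cell times an extra `fit` on the full data in addition to cross-validation, while the random forest cell times cross-validation only.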
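
One way to check whether fold ordering is behind the low fifth-fold scores in the outputs above is to repeat the comparison with shuffled, stratified folds. The following is a minimal sketch under stated assumptions: it uses synthetic data from `make_classification` in place of the movie dataset, and the `compare` helper, the sample size, and the hyperparameter values are all illustrative. It relies on `cross_validate`, which reports per-fold fit and score times alongside the test scores, so both models are timed the same way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the movie data (assumption, not the real dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Shuffled, stratified folds so each fold has a similar class balance.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

def compare(model):
    """Return (mean accuracy, total fit + score time) over the folds."""
    res = cross_validate(model, X_demo, y_demo, cv=folds, scoring="accuracy")
    total_time = float(np.sum(res["fit_time"] + res["score_time"]))
    return float(res["test_score"].mean()), total_time

models = {
    "tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "forest": RandomForestClassifier(n_estimators=20, random_state=0),
}
results = {name: compare(model) for name, model in models.items()}
for name, (acc, seconds) in results.items():
    print("{}: accuracy={:.3f} time={:.3f}s".format(name, acc, seconds))
```

Swapping `X_demo` and `y_demo` for the notebook's `X` and `Y` would apply the same procedure to the real data; the numbers it would produce are not known here.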