# Challenge: If a tree falls in the forest...

Comparing a decision tree with a random forest on the same classification task.

## Data
[Kickstarter projects](https://www.kaggle.com/kemical/kickstarter-projects/home) from Kaggle.
The goal is to predict whether a project will be successful.

In [1]:
import pandas as pd
from sklearn import tree
from sklearn import ensemble
from sklearn.model_selection import cross_val_score
import time

ks = pd.read_csv('ks-projects-201801.csv', header=0)

ks.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,2421.0,30000.0
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,failed,3,US,220.0,220.0,45000.0
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,failed,1,US,1.0,1.0,5000.0
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,canceled,14,US,1283.0,1283.0,19500.0


In [2]:
categorical = ks.select_dtypes(include=['object'])
for i in categorical:
    column = categorical[i]
    print(i)
    print(column.nunique())

name
375764
category
159
main_category
15
currency
14
deadline
3164
launched
378089
state
6
country
23
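The per-column loop above can be condensed into a single `nunique()` call, and a cardinality cutoff makes the choice of columns to drop explicit. A minimal sketch on a toy frame: the helper name `high_cardinality_columns` and the `max_levels` threshold are illustrative assumptions, not part of the original notebook, and the sketch selects all non-numeric columns rather than only `object` ones.

```python
import pandas as pd


def high_cardinality_columns(df, max_levels):
    """Return names of non-numeric columns with more than max_levels distinct values."""
    counts = df.select_dtypes(exclude=["number"]).nunique()
    return counts[counts > max_levels].index.tolist()


# Toy stand-in for the Kickstarter frame (hypothetical values).
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "main_category": ["Music", "Music", "Film", "Film", "Music"],
    "goal": [1000.0, 500.0, 250.0, 4000.0, 10.0],
})

to_drop = high_cardinality_columns(toy, max_levels=3)
print(to_drop)
```

On the real data, a cutoff somewhere between 23 (`country`) and 159 (`category`) would reproduce the manual selection of text columns, although `ID` is numeric and would still have to be dropped by hand.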


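Based on the counts above, drop the identifier column (`ID`) and the text columns with too many distinct values to one-hot encode (`name`, `category`, `deadline`, `launched`).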

In [3]:
ks.drop(['ID', 'name', 'category', 'deadline', 'launched'], axis=1, inplace=True)

# Correlation matrix of the numeric columns
print(ks.corr(numeric_only=True))

                      goal   pledged   backers  usd pledged  usd_pledged_real  \
goal              1.000000  0.007358  0.004012     0.005534          0.005104   
pledged           0.007358  1.000000  0.717079     0.857370          0.952843   
backers           0.004012  0.717079  1.000000     0.697426          0.752539   
usd pledged       0.005534  0.857370  0.697426     1.000000          0.907743   
usd_pledged_real  0.005104  0.952843  0.752539     0.907743          1.000000   
usd_goal_real     0.942692  0.005024  0.004517     0.006172          0.005596   

                  usd_goal_real  
goal                   0.942692  
pledged                0.005024  
backers                0.004517  
usd pledged            0.006172  
usd_pledged_real       0.005596  
usd_goal_real          1.000000  


The three pledge columns (`pledged`, `usd pledged`, `usd_pledged_real`) are highly correlated with one another (0.86 to 0.95), as are `goal` and `usd_goal_real` (0.94). Keep `pledged` and `goal` and drop the rest.

In [4]:
ks.drop(['usd pledged', 'usd_pledged_real', 'usd_goal_real'], axis=1, inplace=True)
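
Reading redundant features off a printed correlation matrix gets harder as the number of columns grows. As a rough sketch, the same check can be automated by listing every column pair whose absolute correlation exceeds a cutoff. The helper name `correlated_pairs`, the 0.9 cutoff, and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold):
    """List (col_a, col_b, |r|) for numeric column pairs with |r| above threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the strict upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # NaN entries (masked cells) never pass the comparison below.
    return [(a, b, round(float(r), 3))
            for (a, b), r in upper.stack().items() if r > threshold]


# Toy stand-in for the numeric Kickstarter columns (hypothetical values).
toy = pd.DataFrame({
    "goal": [1, 2, 3, 4, 5],
    "usd_goal_real": [1.1, 2.0, 3.2, 3.9, 5.1],
    "backers": [5, 1, 4, 2, 3],
})

pairs = correlated_pairs(toy, threshold=0.9)
print(pairs)
```

For each reported pair, one of the two columns is a candidate for removal; which one to keep remains a judgment call.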

### Decision tree

In [5]:
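start_time = time.time()
decision_tree = tree.DecisionTreeClassifier(random_state=0)
X = ks.drop('state', axis=1)
Y = ks['state']
# One-hot encode the remaining categorical columns
X = pd.get_dummies(X)
# Drop any feature column that still contains missing values
X = X.dropna(axis=1)
# Not needed for scoring: cross_val_score clones and refits the estimator on each fold
decision_tree.fit(X, Y)
print(cross_val_score(decision_tree, X, Y, cv=10))
print("--- %s seconds ---" % (time.time() - start_time))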

[0.82066653 0.82240942 0.82272163 0.82272163 0.82272163 0.82404204
 0.82400634 0.82160306 0.82015053 0.82351698]
--- 46.883976221084595 seconds ---
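
The Data section frames the task as predicting whether a project succeeds, while `Y` above keeps all six `state` values, so the classifier is solving a six-class problem. One possible way to build a binary target is sketched below. It assumes that only the label `successful` counts as success and that it is spelled that way in the data; the original notebook does not take this step.

```python
import pandas as pd

# Toy stand-in for ks['state'] (hypothetical values).
states = pd.Series(["failed", "successful", "canceled", "successful", "live"])

# 1 for successful projects, 0 for every other state.
y_binary = (states == "successful").astype(int)
print(y_binary.tolist())
```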


### Random forest

In [6]:
start_time = time.time()
rfc = ensemble.RandomForestClassifier()
print(cross_val_score(rfc, X, Y, cv=10))
print("--- %s seconds ---" % (time.time() - start_time))

[0.85364952 0.85409845 0.85229884 0.85401537 0.85306467 0.85327594
 0.85427175 0.85271359 0.85173643 0.85418097]
--- 61.012364864349365 seconds ---
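
The wall-clock timings above cover everything in each cell, not just model fitting. As a hedged alternative, scikit-learn's `cross_validate` reports fit and score times per fold alongside the test scores, which allows a like-for-like comparison. The sketch below uses a small synthetic data set from `make_classification` and an arbitrary `n_estimators=20`; both are assumptions for illustration, not the Kickstarter setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and Y (not the Kickstarter data).
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=20, random_state=0),
}

results = {}
for name, model in models.items():
    # cross_validate returns a dict with 'fit_time', 'score_time' and 'test_score'.
    cv = cross_validate(model, X_toy, y_toy, cv=5)
    results[name] = cv
    print("%s: accuracy %.3f, total fit time %.3fs"
          % (name, cv["test_score"].mean(), cv["fit_time"].sum()))
```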


## Conclusion

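The random forest classifier is about 3 percentage points more accurate than the single decision tree (mean 10-fold accuracy of roughly 0.853 versus 0.822), at the cost of about 14 s more processing time (61 s versus 47 s). The decision tree timing also includes the feature encoding and one extra fit outside cross-validation, so the gap in cross-validation time alone is somewhat larger.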
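
The comparison can be checked directly from the fold scores printed above. This short sketch copies the two score arrays from the notebook output and summarizes them with NumPy; the variable names are illustrative.

```python
import numpy as np

# Fold accuracies copied from the decision tree and random forest outputs above.
tree_scores = np.array([0.82066653, 0.82240942, 0.82272163, 0.82272163, 0.82272163,
                        0.82404204, 0.82400634, 0.82160306, 0.82015053, 0.82351698])
forest_scores = np.array([0.85364952, 0.85409845, 0.85229884, 0.85401537, 0.85306467,
                          0.85327594, 0.85427175, 0.85271359, 0.85173643, 0.85418097])

# Difference in mean accuracy, as a fraction (multiply by 100 for percentage points).
gain = forest_scores.mean() - tree_scores.mean()

print("decision tree: %.4f +/- %.4f" % (tree_scores.mean(), tree_scores.std()))
print("random forest: %.4f +/- %.4f" % (forest_scores.mean(), forest_scores.std()))
print("gain: %.4f" % gain)
```

Reporting the spread across folds next to the mean shows whether the gap between the two models is large compared with fold-to-fold variation.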