# Checkpoint 24.7 | Random Forest Models

Now that you've learned about random forests and decision trees, let's do an exercise in accuracy. You know that random forests are basically a collection of decision trees. But how do the accuracies of the two models compare?

So here's what you should do. Pick a dataset. It could be one you've worked with before or it could be a new one. Then build the best decision tree you can.

Now try to match that with the simplest random forest you can. For our purposes, measure simplicity by runtime, and compare it to the runtime of the decision tree. This is an imperfect measure, but just go with it.

Hopefully out of this you'll see the power of random forests, but also their potential costs. Remember, in the real world you won't necessarily be dealing with thousands of rows. It could be millions, billions, or even more.

Submit a link to your models below.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-0"><span class="toc-item-num">0&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#Load-and-prepare-the-dataset" data-toc-modified-id="Load-and-prepare-the-dataset-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load and prepare the dataset</a></span><ul class="toc-item"><li><span><a href="#Data-cleaning" data-toc-modified-id="Data-cleaning-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Data cleaning</a></span><ul class="toc-item"><li><span><a href="#Removing-NaN-rows" data-toc-modified-id="Removing-NaN-rows-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Removing NaN rows</a></span></li></ul></li><li><span><a href="#Dealing-with-highly-unique-columns" data-toc-modified-id="Dealing-with-highly-unique-columns-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Dealing with highly unique columns</a></span><ul class="toc-item"><li><span><a href="#Converting-launched--and-deadline-to-datetime-variables" data-toc-modified-id="Converting-launched--and-deadline-to-datetime-variables-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Converting <code>launched</code>  and <code>deadline</code> to datetime variables</a></span></li><li><span><a href="#Converting-name-to-word,-number,-and-punctuation-count" data-toc-modified-id="Converting-name-to-word,-number,-and-punctuation-count-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Converting <code>name</code> to word, number, and punctuation count</a></span></li></ul></li><li><span><a href="#Dropping-unnecessary-columns" data-toc-modified-id="Dropping-unnecessary-columns-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Dropping unnecessary columns</a></span></li><li><span><a href="#Splitting-into-X-and-Y-vars" data-toc-modified-id="Splitting-into-X-and-Y-vars-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Splitting into X and Y vars</a></span></li><li><span><a 
href="#Feature-generation" data-toc-modified-id="Feature-generation-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Feature generation</a></span></li></ul></li><li><span><a href="#Data-is-ready-to-be-trained" data-toc-modified-id="Data-is-ready-to-be-trained-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data is ready to be trained</a></span><ul class="toc-item"><li><span><a href="#Initializing-the-models" data-toc-modified-id="Initializing-the-models-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Initializing the models</a></span></li></ul></li></ul></div>

## Imports

In [1]:
import pandas as pd_reg
pd_reg.__version__

'0.25.3'

In [2]:
import modin.pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [3]:
pd.__version__

'0.7.0'

## Load and prepare the dataset

To best illustrate the complexity differences between the two models, I'm going to load a larger dataset that I can hopefully compare at different row counts. I'm going to be using the [Kickstarter Projects](https://www.kaggle.com/kemical/kickstarter-projects) dataset, where the goal is to predict whether a Kickstarter campaign succeeds.

In [4]:
raw = pd.read_csv("/Users/chanvarma/Box/datasets-chanvarma/thinkful/kickstarter-projects/ks-projects-201801.csv")
raw.shape

(378661, 15)

In [5]:
data = raw.copy()
data.head()

| | ID | name | category | main_category | currency | deadline | goal | launched | pledged | state | backers | country | usd pledged | usd_pledged_real | usd_goal_real |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000002330 | The Songs of Adelaide & Abullah | Poetry | Publishing | GBP | 2015-10-09 | 1000.0 | 2015-08-11 12:12:28 | 0.0 | failed | 0 | GB | 0.0 | 0.0 | 1533.95 |
| 1 | 1000003930 | Greeting From Earth: ZGAC Arts Capsule For ET | Narrative Film | Film & Video | USD | 2017-11-01 | 30000.0 | 2017-09-02 04:43:57 | 2421.0 | failed | 15 | US | 100.0 | 2421.0 | 30000.0 |
| 2 | 1000004038 | Where is Hank? | Narrative Film | Film & Video | USD | 2013-02-26 | 45000.0 | 2013-01-12 00:20:50 | 220.0 | failed | 3 | US | 220.0 | 220.0 | 45000.0 |
| 3 | 1000007540 | ToshiCapital Rekordz Needs Help to Complete Album | Music | Music | USD | 2012-04-16 | 5000.0 | 2012-03-17 03:24:11 | 1.0 | failed | 1 | US | 1.0 | 1.0 | 5000.0 |
| 4 | 1000011046 | Community Film Project: The Art of Neighborhoo... | Film & Video | Film & Video | USD | 2015-08-29 | 19500.0 | 2015-07-04 08:35:03 | 1283.0 | canceled | 14 | US | 1283.0 | 1283.0 | 19500.0 |


The dataset records projects in 6 'states', but we're only going to be looking at the 'failed' and 'successful' states. 

In [6]:
data['state'].value_counts()


failed        197719
successful    133956
canceled       38779
undefined       3562
live            2799
suspended       1846
dtype: int64

To make matters simpler, we only account for `failed` and `successful` campaigns.

In [7]:
def success_or_not(state):
    if state == 'successful':
        return 1
    return 0

And we convert them to a dummy variable.

In [8]:
data = data.loc[data['state'].isin(['failed', 'successful'])]
data['successful'] = pd.to_numeric(data['state'].apply(success_or_not))
data.drop(columns = 'state', inplace = True)
data['successful'].value_counts()



0    197719
1    133956
dtype: int64
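
The same 0/1 target can also be built without a helper function. The following is only a minimal sketch: it assumes plain pandas rather than modin, and `toy` is a made-up frame used purely for illustration.

```python
import pandas as pd

# Hypothetical miniature of the `state` column, for illustration only.
toy = pd.DataFrame(
    {"state": ["failed", "successful", "canceled", "successful", "live"]}
)

# Keep only the two outcomes of interest; .copy() avoids assigning into a view.
toy = toy.loc[toy["state"].isin(["failed", "successful"])].copy()

# A boolean comparison cast to int gives the same 0/1 dummy as success_or_not.
toy["successful"] = (toy["state"] == "successful").astype(int)

print(toy["successful"].tolist())
```

The comparison is vectorized, so it avoids calling a Python function once per row, which may matter at several hundred thousand rows.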

### Data cleaning

#### Removing NaN rows

In [9]:
data.info(verbose=True)



<class 'modin.pandas.dataframe.DataFrame'>
Int64Index: 331675 entries, 0 to 378660
Data columns (total 15 columns):
ID                  331675 non-null int64
name                331672 non-null object
category            331675 non-null object
main_category       331675 non-null object
currency            331675 non-null object
deadline            331675 non-null object
goal                331675 non-null float64
launched            331675 non-null object
pledged             331675 non-null float64
backers             331675 non-null int64
country             331675 non-null object
usd pledged         331465 non-null float64
usd_pledged_real    331675 non-null float64
usd_goal_real       331675 non-null float64
successful          331675 non-null int64
dtypes: float64(5), int64(3), object(7)
memory usage: 40.5+ MB


Since very few rows contain NaN values (3 in `name` and 210 in `usd pledged`), we can drop them.

In [10]:
data.isna().sum()

ID                    0
name                  3
category              0
main_category         0
currency              0
deadline              0
goal                  0
launched              0
pledged               0
backers               0
country               0
usd pledged         210
usd_pledged_real      0
usd_goal_real         0
successful            0
dtype: int64

In [11]:
data = data.dropna()
data.shape

(331462, 15)

### Dealing with highly unique columns 

In [12]:
categorical = data.select_dtypes(include=['object'])
unique_counts = {}

for i in categorical:
    unique_counts[i] = categorical[i].nunique()
    
unique_counts = {k: v for k, v in sorted(unique_counts.items(), key=lambda item: item[1], reverse = True)}
unique_counts

{'launched': 331042,
 'name': 329386,
 'deadline': 3102,
 'category': 159,
 'country': 22,
 'main_category': 15,
 'currency': 14}

In [13]:
categorical = list(categorical.columns)

#### Converting `launched`  and `deadline` to datetime variables
`launched` and `deadline` are stored as `str` objects. We will parse them as datetime objects and then extract the month and year.

In [14]:
data[['launched', 'deadline']].head()

| | launched | deadline |
|---|---|---|
| 0 | 2015-08-11 12:12:28 | 2015-10-09 |
| 1 | 2017-09-02 04:43:57 | 2017-11-01 |
| 2 | 2013-01-12 00:20:50 | 2013-02-26 |
| 3 | 2012-03-17 03:24:11 | 2012-04-16 |
| 5 | 2016-02-26 13:38:27 | 2016-04-01 |


In [15]:
from datetime import datetime

def get_month_year(date_object, variable):
    if variable == 'launched':
        date_object = datetime.strptime(date_object, "%Y-%m-%d %H:%M:%S")
    elif variable == 'deadline':
        date_object = datetime.strptime(date_object, "%Y-%m-%d")
        
    return date_object.strftime("%b %Y")

In [16]:
data['launched_month'] = data['launched'].apply(get_month_year, args = ('launched', ))
data['deadline_month'] = data['deadline'].apply(get_month_year, args = ('deadline', ))
data[['launched', 'launched_month', 'deadline_month']].head()

| | launched | launched_month | deadline_month |
|---|---|---|---|
| 0 | 2015-08-11 12:12:28 | Aug 2015 | Oct 2015 |
| 1 | 2017-09-02 04:43:57 | Sep 2017 | Nov 2017 |
| 2 | 2013-01-12 00:20:50 | Jan 2013 | Feb 2013 |
| 3 | 2012-03-17 03:24:11 | Mar 2012 | Apr 2012 |
| 5 | 2016-02-26 13:38:27 | Feb 2016 | Apr 2016 |


In [17]:
data['launched_month'].nunique(), data['deadline_month'].nunique()

(105, 105)

In [18]:
data.drop(columns=['launched', 'deadline'], inplace=True)
categorical.remove('launched')
categorical.remove('deadline')
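
As an aside, the month-and-year extraction can also be done column-wise instead of row by row. This is a sketch under assumptions: it uses plain pandas, and `dates` is a hypothetical two-row sample copied from the values shown above.

```python
import pandas as pd

# Hypothetical sample mirroring the two string formats shown above.
dates = pd.DataFrame({
    "launched": ["2015-08-11 12:12:28", "2017-09-02 04:43:57"],
    "deadline": ["2015-10-09", "2017-11-01"],
})

# Parse each column with an explicit format so malformed values fail loudly.
launched = pd.to_datetime(dates["launched"], format="%Y-%m-%d %H:%M:%S")
deadline = pd.to_datetime(dates["deadline"], format="%Y-%m-%d")

# Format as abbreviated month plus year; %b follows the process locale.
dates["launched_month"] = launched.dt.strftime("%b %Y")
dates["deadline_month"] = deadline.dt.strftime("%b %Y")

print(dates[["launched_month", "deadline_month"]])
```

Parsing the whole column at once keeps the format strings in one place and should produce the same labels as `get_month_year` in an English or default locale.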

#### Converting `name` to word, number, and punctuation count

The helper below counts the words, sentences, and digit characters in each title; it does not count punctuation.

In [19]:
data['name'].head()

0                      The Songs of Adelaide & Abullah
1        Greeting From Earth: ZGAC Arts Capsule For ET
2                                       Where is Hank?
3    ToshiCapital Rekordz Needs Help to Complete Album
5                                 Monarch Espresso Bar
Name: name, dtype: object

In [20]:
from nltk import RegexpTokenizer, sent_tokenize

def title_counter(s, type_of_count):
    if type_of_count == 'word':
        tokenizer = RegexpTokenizer(r'\w+')
        return len(tokenizer.tokenize(s))
    
    if type_of_count == 'sent':
        return len(sent_tokenize(s))
    
    if type_of_count == 'num':
        tokenizer = RegexpTokenizer("[0-9]")
        return len(tokenizer.tokenize(s))

In [21]:
data['title_word_count'] = data['name'].apply(title_counter, args = ('word', ))
data['title_sent_count'] = data['name'].apply(title_counter, args = ('sent', ))
data['title_num_count'] = data['name'].apply(title_counter, args = ('num', ))

data[['name', 'title_word_count', 'title_sent_count', 'title_num_count']].head()

| | name | title_word_count | title_sent_count | title_num_count |
|---|---|---|---|---|
| 0 | The Songs of Adelaide & Abullah | 5 | 1 | 0 |
| 1 | Greeting From Earth: ZGAC Arts Capsule For ET | 8 | 1 | 0 |
| 2 | Where is Hank? | 3 | 1 | 0 |
| 3 | ToshiCapital Rekordz Needs Help to Complete Album | 7 | 1 | 0 |
| 5 | Monarch Espresso Bar | 3 | 1 | 0 |


In [22]:
data.drop(columns=['name'], inplace=True)
categorical.remove('name')
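
The word and digit counts can be reproduced with pandas string methods alone, which may be handy when NLTK is not installed. This sketch assumes plain pandas; the third title is invented to show a non-zero digit count, and the sentence count is left out because it relies on NLTK's `sent_tokenize`.

```python
import pandas as pd

# Two titles from the data above plus one hypothetical title with digits.
titles = pd.Series([
    "The Songs of Adelaide & Abullah",
    "Where is Hank?",
    "Top 10 of 2020",  # made up for illustration
])

# Same patterns as title_counter: runs of word characters, and single digits.
word_count = titles.str.count(r"\w+")
digit_count = titles.str.count(r"[0-9]")

print(pd.DataFrame({"title": titles, "words": word_count, "digits": digit_count}))
```

Note that `[0-9]` matches one digit at a time, so "2020" contributes four to the count rather than one.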

The remaining categorical variables have few enough unique values to be encoded directly.

### Dropping unnecessary columns

In [23]:
data.columns

Index(['ID', 'category', 'main_category', 'currency', 'goal', 'pledged',
       'backers', 'country', 'usd pledged', 'usd_pledged_real',
       'usd_goal_real', 'successful', 'launched_month', 'deadline_month',
       'title_word_count', 'title_sent_count', 'title_num_count'],
      dtype='object')

In [24]:
data.drop(columns=['ID', 'currency', 'goal', 'pledged', 'usd pledged'], inplace=True)

### Splitting into X and Y vars

In [25]:
X = data.drop(columns=['successful'])
Y = data['successful']

X.shape, Y.shape

((331462, 11), (331462,))

### Feature generation

We start by creating dummies for the categorical variables.

In [26]:
X.dtypes

category             object
main_category        object
backers               int64
country              object
usd_pledged_real    float64
usd_goal_real       float64
launched_month       object
deadline_month       object
title_word_count      int64
title_sent_count      int64
title_num_count       int64
dtype: object

In [27]:
X = pd.get_dummies(X, columns=['country', 'launched_month',
                               'deadline_month', 'category', 'main_category'], drop_first=True)
X.shape

(331462, 407)
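
To see where the 407 columns come from: with `drop_first=True`, a column with k categories contributes k - 1 indicator columns. Using the unique counts listed earlier, that is 21 + 104 + 104 + 158 + 14 = 401 dummies plus the 6 numeric columns. The toy example below is a sketch with plain pandas and a made-up frame, not the notebook's data.

```python
import pandas as pd

# Hypothetical frame with one numeric and one three-category column.
toy = pd.DataFrame({
    "backers": [0, 15, 3, 1],
    "country": ["GB", "US", "US", "CA"],
})

# Three categories become two indicator columns; the first (CA) is dropped.
encoded = pd.get_dummies(toy, columns=["country"], drop_first=True)

print(encoded.columns.tolist())
print(encoded.shape)
```

The dropped category is represented implicitly by a row where every indicator for that column is zero.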

In [28]:
Y.shape

(331462,)

## Data is ready to be trained

In [29]:
Y.value_counts()



0    197611
1    133851
dtype: int64

In [30]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state = 13, test_size = .2, stratify = Y)

In [31]:
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

((265169, 407), (66293, 407), (265169,), (66293,))
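
The `stratify = Y` argument keeps the share of successful campaigns the same in both splits. A small sketch of that behavior, assuming synthetic arrays (`X_toy` and `y_toy` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 60 of class 0 and 40 of class 1.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 60 + [1] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=13, stratify=y_toy
)

# Both splits keep the original 40% positive rate.
print(y_tr.mean(), y_te.mean())
```

Without stratification the class ratio in the test set can drift by chance, which makes accuracy comparisons between models slightly noisier.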

### Initializing the models

In [33]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

dtree = DecisionTreeClassifier(random_state = 13)
rfm = RandomForestClassifier(random_state=13)
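
Since a random forest is a collection of decision trees, one way to make it "simpler" is to limit how many trees it grows and how deep they go. The sketch below assumes synthetic data from `make_classification`; the values `n_estimators=5` and `max_depth=3` are arbitrary choices for illustration, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Kickstarter features (assumption).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=13)

# A deliberately small forest: few trees, each limited in depth.
small_forest = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=13)
small_forest.fit(X_demo, y_demo)

# The fitted forest exposes its individual trees through estimators_.
print(len(small_forest.estimators_))
print(all(isinstance(t, DecisionTreeClassifier) for t in small_forest.estimators_))
print(max(t.get_depth() for t in small_forest.estimators_))
```

Fewer and shallower trees generally train faster, at the possible cost of accuracy, which is the trade-off this checkpoint asks about.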

In [None]:
import time
from sklearn.model_selection import cross_val_score

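# Time a 10-fold cross-validation of the decision tree.
dtree_start_time = time.time()
cross_val_score(dtree, X_train, Y_train, cv=10, n_jobs = -1)
print("--- %s seconds ---" % (time.time() - dtree_start_time))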

KeyboardInterrupt: 

Exception ignored in: 'ray._raylet.prepare_args'
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/ray/worker.py", line 1493, in put
    object_id = worker.put_object(value)
  File "/usr/local/lib/python3.7/site-packages/ray/worker.py", line 277, in put_object
    serialized_value = self.get_serialization_context().serialize(value)
  File "/usr/local/lib/python3.7/site-packages/ray/serialization.py", line 485, in serialize
    value, protocol=5, buffer_callback=writer.buffer_callback)
  File "/usr/local/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 68, in dumps
    cp.dump(obj)
  File "/usr/local/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 557, in dump
    return Pickler.dump(self, obj)
  File "/usr/local/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 499, in reducer_override
    return self._function_reduce(obj)
  File "/usr/local/lib/python3.7/site-packages/ray/cloudpickle/clou

In [None]:
rmf_start_time = time.time()
cross_val_score(rfm, X_train, Y_train, cv=10, n_jobs = -1)
print("--- %s seconds ---" % (time.time() - rmf_start_time))
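
To compare the two models at different row counts, the timing can be wrapped in a small helper. This is a sketch under assumptions, not the run above: it uses synthetic NumPy data from `make_classification` instead of the Kickstarter frame, `timed_cv` is a hypothetical helper, and the row counts, `cv=5`, and `n_estimators=10` are arbitrary.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def timed_cv(model, X, y, cv=5):
    """Return (mean accuracy, elapsed seconds) for one cross-validation run."""
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=cv)
    return scores.mean(), time.perf_counter() - start


# Synthetic stand-in for the prepared feature matrix (assumption).
X_syn, y_syn = make_classification(n_samples=2000, n_features=20, random_state=13)

models = {
    "decision tree": DecisionTreeClassifier(random_state=13),
    "random forest": RandomForestClassifier(n_estimators=10, random_state=13),
}

results = []
for n_rows in (500, 2000):
    for name, model in models.items():
        acc, secs = timed_cv(model, X_syn[:n_rows], y_syn[:n_rows])
        results.append((n_rows, name, acc, secs))
        print(f"{n_rows:>5} rows | {name:<13} | accuracy {acc:.3f} | {secs:.2f} s")
```

`time.perf_counter` is used here because it is intended for measuring short durations; the absolute numbers will vary by machine, so only the relative trend across row counts is meaningful.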