# Kickstarter Example (reworked)

## Imports

Here we import the necessary modules.

In [1]:
import os
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

## Reading the data

Load the data from the `.csv` into a `DataFrame`.

In [2]:
PATH_ROOT = os.path.join('input', 'kickstarter-projects')
FLOC = os.path.join(PATH_ROOT, 'ks-projects-201801.csv')

In [3]:
df = pd.read_csv(FLOC, index_col='ID')
df.head()

ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95
1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,2421.0,30000.0
1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,failed,3,US,220.0,220.0,45000.0
1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,failed,1,US,1.0,1.0,5000.0
1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,canceled,14,US,1283.0,1283.0,19500.0


## Building a model

Let's build a `RandomForestClassifier` model!

### Preprocessing

Preprocessing here means building the feature space: a clean, processed version of the raw `df`.

In [4]:
# create a constant list of desired columns
COLUMNS = ['category', 'main_category', 
           'currency', 'goal', 'pledged', 
           'state', 'backers', 'country', 
           'usd pledged', 'usd_pledged_real', 
           'usd_goal_real']
# keep only the desired columns and drop rows with missing (NaN) values
feature_space = df[COLUMNS].dropna()
feature_space.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 374864 entries, 1000002330 to 999988282
Data columns (total 11 columns):
category            374864 non-null object
main_category       374864 non-null object
currency            374864 non-null object
goal                374864 non-null float64
pledged             374864 non-null float64
state               374864 non-null object
backers             374864 non-null int64
country             374864 non-null object
usd pledged         374864 non-null float64
usd_pledged_real    374864 non-null float64
usd_goal_real       374864 non-null float64
dtypes: float64(5), int64(1), object(5)
memory usage: 34.3+ MB
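
Before dropping rows, it can be worth checking which columns actually contain missing values and how many rows `dropna()` removes. A minimal sketch on a toy frame (the values are made up and only mimic the column names used above):

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few columns of df (values are made up)
toy = pd.DataFrame({
    'goal': [1000.0, 30000.0, 5000.0, 250.0],
    'usd pledged': [0.0, np.nan, 1.0, np.nan],
    'state': ['failed', 'failed', 'failed', 'successful'],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Number of rows that dropna() would remove
n_dropped = len(toy) - len(toy.dropna())
print(n_dropped)  # 2
```

On the real data the same two checks would be run on `df[COLUMNS]`.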


In [5]:
# features
X = feature_space.drop('state', axis=1)
# labels
y = feature_space['state']

In [6]:
# define some preprocessor objects...
label_encoder = LabelEncoder()

# ...and apply fit_transform to our features
for col in X.select_dtypes(include='object').columns.values:
    X[col] = label_encoder.fit_transform(X[col])

X.head()

ID,category,main_category,currency,goal,pledged,backers,country,usd pledged,usd_pledged_real,usd_goal_real
1000002330,108,12,5,1000.0,0.0,0,9,0.0,0.0,1533.95
1000003930,93,6,13,30000.0,2421.0,15,21,100.0,2421.0,30000.0
1000004038,93,6,13,45000.0,220.0,3,21,220.0,220.0,45000.0
1000007540,90,10,13,5000.0,1.0,1,21,1.0,1.0,5000.0
1000011046,55,6,13,19500.0,1283.0,14,21,1283.0,1283.0,19500.0
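
One caveat with the loop above: the same `LabelEncoder` is refit on every column, so only the mapping of the last column is still available afterwards. A minimal sketch, assuming we want to decode the integer codes later, keeps one encoder per column. The toy frame and the `encoders` dict are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the categorical part of X (values are made up)
toy = pd.DataFrame({
    'currency': ['GBP', 'USD', 'USD', 'EUR'],
    'country': ['GB', 'US', 'US', 'DE'],
})

# One fitted encoder per column, so every mapping can be inverted later
encoders = {}
encoded = toy.copy()
for col in toy.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

# Recover the original labels from the integer codes
decoded_currency = encoders['currency'].inverse_transform(encoded['currency'])
print(list(decoded_currency))
```

Note that scikit-learn documents `LabelEncoder` as a tool for target labels; `OrdinalEncoder` is its feature-oriented counterpart and handles several columns at once.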


In [7]:
# Create the (stratified) train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
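
By default `train_test_split` holds out 25% of the rows, and `stratify=y` keeps the class proportions (nearly) identical in both parts. A small sketch on made-up labels illustrates this; `random_state=0` is an assumption added here for reproducibility, which the call above does not set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels: 80 'failed', 16 'successful', 4 'live'
y_toy = np.array(['failed'] * 80 + ['successful'] * 16 + ['live'] * 4)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=0)

def class_share(labels):
    """Fraction of samples per class."""
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values, counts / counts.sum()))

# Both parts show the same class shares as the full label array
print(class_share(y_tr))
print(class_share(y_te))
```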

### Training

Here we instantiate the model and train it on the training data.

In [13]:
%%time
# Let's just use the default parameters (10 estimators here; scikit-learn 0.22+ defaults to 100)
model = RandomForestClassifier()
model.fit(X_train, y_train)



Wall time: 5.17 s


### Testing

How well does the model perform on the test data (unseen during training)?

In [9]:
model.score(X_test, y_test)

0.8607494984847838

Not bad... about 86% accuracy. Can we tune the model's hyperparameters for the chosen feature space?
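
To judge whether 86% is actually good, it helps to compare against the accuracy of always predicting the most common class. A minimal sketch with made-up labels (the helper `majority_baseline_accuracy` is hypothetical, not a scikit-learn function):

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Made-up labels, only to show the calculation
toy_labels = ['failed'] * 5 + ['successful'] * 4 + ['canceled']
print(majority_baseline_accuracy(toy_labels))  # 0.5
```

On the real data the call would be `majority_baseline_accuracy(y_test)`.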

### Model grid (`GridSearchCV`)

Let's make use of scikit-learn's grid-search capabilities. **Warning**: Grid searching is computationally expensive - the full grid search will take at least a few minutes to run.

In [11]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score, make_scorer, confusion_matrix

In [18]:
%%time
# 5-fold cross validation... 
# (https://en.wikipedia.org/wiki/Cross-validation_(statistics)#k-fold_cross-validation)
skf = StratifiedKFold(n_splits=5)

# here we define the parameter grid we wish to test.
# let's just vary a couple for the time being.
params = {'max_depth': [10, 20], 'n_estimators': [10, 30]}

# choose the type of model to use in grid searching
model = RandomForestClassifier()

# establish and fit the grid
grid = GridSearchCV(estimator=model, 
                    param_grid=params, 
                    cv=skf, 
                    refit=True,
                    return_train_score=True,
                    scoring=make_scorer(accuracy_score))
grid.fit(X_train, y_train)

Wall time: 2min 37s
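
The run time follows from how many models get fitted: one per parameter combination per fold, plus one final refit of the best combination on the whole training set because `refit=True`. A quick sketch of that count, using only the standard library:

```python
from itertools import product

params = {'max_depth': [10, 20], 'n_estimators': [10, 30]}
n_splits = 5

# Every combination of the listed values is one candidate
names = sorted(params)
candidates = [dict(zip(names, values))
              for values in product(*(params[n] for n in names))]

# +1 for the refit of the best candidate on all training data
n_fits = len(candidates) * n_splits + 1
print(len(candidates), n_fits)  # 4 21
```

Adding a third value to each list would already raise this to 9 candidates and 46 fits.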


In [19]:
# print out the optimal parameters found by the grid search
print('Best parameters found: {}'.format(grid.best_params_))

Best parameters found: {'max_depth': 10, 'n_estimators': 30}
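
Besides `best_params_`, the fitted grid keeps the scores of every candidate in `cv_results_`, which is convenient to view as a `DataFrame`. A self-contained sketch on synthetic data (`make_classification` and the small grid are stand-ins, not the Kickstarter setup):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic problem so the search finishes quickly
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

toy_grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={'max_depth': [2, 4], 'n_estimators': [5, 10]},
    cv=3,
    return_train_score=True,
)
toy_grid.fit(X_toy, y_toy)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(toy_grid.cv_results_)
cols = ['params', 'mean_train_score', 'mean_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score'))
```

A large gap between `mean_train_score` and `mean_test_score` for a candidate hints that it overfits the training folds.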


In [27]:
accuracy_score(y_test, grid.predict(X_test))

0.8817384438089547

We were able to increase our test-set accuracy by about 2 percentage points (from 86.1% to 88.2%) with a small grid search... The grid can be made more granular, and maybe some better results will be found. Let's look at a confusion matrix for our predictions.

In [28]:
# show confusion matrix
predicted_base = 'Predicted {}'
actually_base = 'Actually {}'
# confusion_matrix orders rows and columns by sorted label, so use the same order for the names
label_vals = sorted(y.unique())
cm = confusion_matrix(y_test, grid.predict(X_test), labels=label_vals)
cm_frame = pd.DataFrame(cm, columns=[predicted_base.format(v) for v in label_vals],
                            index=[actually_base.format(v) for v in label_vals])
cm_frame.head()

,Predicted canceled,Predicted failed,Predicted live,Predicted successful,Predicted suspended
Actually canceled,2,9446,0,241,0
Actually failed,1,49169,0,233,0
Actually live,0,565,0,135,0
Actually successful,1,0,0,33462,0
Actually suspended,0,379,0,82,0


Here the problem of class imbalance becomes especially apparent - our model **doesn't predict *any* live or suspended projects**, and it predicts canceled only 4 times (almost all canceled projects end up as failed)... For live and suspended this is most likely because these classes are severely under-represented in the data; canceled projects may simply be hard to tell apart from failed ones with these features.
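
Overall accuracy can hide this kind of failure, because the large classes dominate the average. Per-class recall (the share of each true class that was predicted correctly) makes it visible. A minimal sketch on a made-up 3-class confusion matrix, with rows as true classes and columns as predictions, which is the layout `confusion_matrix` returns (the helper `per_class_recall` is hypothetical):

```python
def per_class_recall(matrix, labels):
    """Recall per class: diagonal entry divided by its row sum."""
    recalls = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        recalls[label] = matrix[i][i] / row_total if row_total else 0.0
    return recalls

# Made-up counts: class 'c' is rare and never predicted
toy_labels = ['a', 'b', 'c']
toy_cm = [
    [90, 10, 0],
    [5, 95, 0],
    [6, 4, 0],
]

print(per_class_recall(toy_cm, toy_labels))
# {'a': 0.9, 'b': 0.95, 'c': 0.0}
```

scikit-learn's `classification_report` (in `sklearn.metrics`) prints the same per-class recall, together with precision, directly from `y_test` and the predictions.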