# Does Your Kickstarter Suck?
Analysis of Kickstarter data to predict campaign success

## Goals
Crowdfunding is an increasingly popular way for entrepreneurs to raise money for a product or project. Kickstarter has emerged as the top funding platform for this purpose. I was interested in what makes a Kickstarter campaign successful, specifically, which attributes are common to the most successful campaigns. The attributes I wanted to look at were the main category, the sub category, the funding goal, whether or not the campaign was selected as a "staff pick," and the contents of the product description.

## Data Wrangling and Cleaning
I obtained my data from webrobots.io, which posts archives of Kickstarter data it has been scraping from the Kickstarter website since 2015. There were over 40 GB of CSV files archived on the site. The CSV files contained JSON in several columns, as well as regularly typed data. I had to pull the relevant information out of the JSON and delete columns that I determined didn't provide insight for the analysis. I also decided to work only on projects from the US, which makes campaigns easier to compare against one another.
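
As a rough illustration of the JSON extraction step, here is a minimal sketch on a toy frame. The `category` column name and its `name`/`slug` keys are assumptions about the raw layout, not something confirmed by the cleaned file loaded below.

```python
import json
import pandas as pd

# Toy frame standing in for one scraped CSV; the 'category' column and its
# 'name'/'slug' keys are assumed for illustration.
raw = pd.DataFrame({
    'id': [1, 2],
    'country': ['US', 'GB'],
    'category': [
        '{"name": "Tabletop Games", "slug": "games/tabletop games"}',
        '{"name": "Jazz", "slug": "music/jazz"}',
    ],
})

def split_category(cell):
    """Parse one JSON cell into main and sub category fields."""
    parsed = json.loads(cell)
    main = parsed['slug'].split('/')[0]
    return pd.Series({'main_category': main, 'sub_category': parsed['name']})

# Expand the JSON column into two plain columns, then drop the original.
categories = raw['category'].apply(split_category)
flat = pd.concat([raw.drop(columns='category'), categories], axis=1)

# Keep only US campaigns, mirroring the filtering described above.
us_only = flat[flat['country'] == 'US']
print(us_only)
```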



## Model Fitting
### Logistic Regression (Logit)
Logistic regression is useful because it produces a coefficient for each feature, which makes it easier to see how each feature affects the model as a whole.
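
To make the idea of per-feature coefficients concrete, here is a small sketch on synthetic data. The feature names `informative` and `noise` are made up for the example and have nothing to do with the Kickstarter fields.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-ins: one informative feature and one pure-noise feature.
X = pd.DataFrame({
    'informative': rng.normal(size=n),
    'noise': rng.normal(size=n),
})
# The label depends only on the informative feature, plus a little noise.
y = (X['informative'] + 0.3 * rng.normal(size=n) > 0).astype(int)

model = LogisticRegression().fit(X, y)
# One coefficient per feature; a larger magnitude means a stronger effect.
coefs = pd.Series(model.coef_[0], index=X.columns)
print(coefs.abs().sort_values(ascending=False))
```

Coefficient magnitudes are only comparable when the features are on similar scales, which is worth keeping in mind for a dollar-valued field like `goal`.
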
### Random Forest





# Importing and Cleaning Data

In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings(action='ignore')


In [None]:
#df0 = pd.read_csv('0.1.csv')
#df1 = pd.read_csv('1.1.csv')
#df2 = pd.read_csv('2.1.csv')
#df3 = pd.read_csv('3.1.csv')
df4 = pd.read_csv('4.1.csv')
#arr = [df0, df1, df2, df3, df4]

In [None]:
super_df = df4 #pd.concat(arr)
super_df = super_df.set_index('id')
super_df.columns


Only a few of these fields are relevant to our analysis. We will create a new dataframe with only the important fields. 

In [None]:
# .copy() gives an independent frame, so the column assignments below are safe
df = super_df[['country', 'name', 'blurb', 'goal', 'state', 'deadline', 'launched_at', 'main_category', 'sub_category', 'staff_pick']].copy()

In [None]:
# fill empty blurbs
df['blurb'] = df['blurb'].fillna('')
df = df[df['blurb'] != 'False']
# drop campaigns that are live, canceled, or suspended;
# the latter two states imply exceptional circumstances
# and Kickstarter administrative involvement
df = df[~df['state'].isin(['live', 'canceled', 'suspended'])]





# only campaigns in the US
df = df[df['country'] == 'US']
# encode staff_pick as 1 (True) or 0 (False); np.int was removed in NumPy 1.24
df['staff_pick'] = (df['staff_pick'] == True).astype(int)
# encode state as 1 for success, 0 for failure
df['state'] = (df['state'] != 'failed').astype(int)






In [None]:
df['launched_at_month'] = pd.DatetimeIndex(df['launched_at']).month
df['deadline_month'] = pd.DatetimeIndex(df['deadline']).month
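
One thing worth checking here: if `launched_at` and `deadline` are stored as Unix epoch seconds (an assumption about this file, though common in scraped dumps), pandas needs to be told the unit. A minimal sketch with two hypothetical timestamps:

```python
import pandas as pd

# Hypothetical sample: two launch timestamps stored as Unix epoch seconds.
sample = pd.DataFrame({'launched_at': [1420070400, 1435708800]})

# unit='s' tells pandas the integers are seconds; without it they would be
# read as nanoseconds and every row would land in January 1970.
launched = pd.to_datetime(sample['launched_at'], unit='s')
sample['launched_at_month'] = launched.dt.month
print(sample)
```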


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
# Import classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, KFold, GridSearchCV

# Import other preprocessing modules
from sklearn.preprocessing import FunctionTransformer, StandardScaler, MaxAbsScaler, OneHotEncoder, LabelEncoder
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, roc_curve, auc
from sklearn.feature_selection import RFE

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')



NUMERIC = ['main_category_encoded', 'sub_category_encoded', 'launched_at_month', 'deadline_month', 'goal', 'staff_pick']
STATE = ['state']

In [None]:
le = LabelEncoder()
# apply "le.fit_transform"
df_encoded = df[['main_category', 'sub_category']].apply(le.fit_transform)
df['main_category_encoded'] = df_encoded['main_category']
df['sub_category_encoded'] = df_encoded['sub_category']


## Logistic Regression
We first want to use a simple logistic regression to determine the most influential fields in determining whether a Kickstarter campaign is a success or a failure.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df[NUMERIC], df[STATE], random_state=42)


In [None]:
logit = LogisticRegression()
rfe = RFE(logit)
rfe = rfe.fit(X_train, y_train)
#print(rfe.support_)
#print(rfe.ranking_)
support = np.array(rfe.support_)
ranking = np.array(rfe.ranking_)
# support
var_importance = pd.DataFrame({'support': support, 'ranking': ranking})
var_importance = var_importance.T
var_importance.columns = NUMERIC
var_importance

From this analysis, we find that main category, launch month, and staff pick are the most important attributes for predicting success or failure. This makes sense because some categories are typically more successful than others, with sub category coming closely behind in the importance rankings. We also see that the month in which the campaign was launched is a big factor in predicting success. This is interesting because the deadline month is not nearly as influential as the launch month. We will take a closer look at this relationship later. Staff pick is also a good indicator because it creates more exposure for the campaign and adds some level of endorsement of the project.


In [None]:
pred = rfe.predict(X_test)
score = rfe.score(X_test, y_test)
print(score)

We can see that our model is 60 percent accurate, meaning it performs only slightly better than guessing whether a campaign will succeed or fail. A naive model would simply flip a coin to decide, giving it 50 percent accuracy.
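
A coin flip is one baseline; another common one is always predicting the most frequent class, which can score above 50 percent when the classes are imbalanced. The sketch below uses hypothetical labels with a made-up 60/40 split, not the real class balance of this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels with a 60/40 class split, standing in for the real y.
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Always predict the majority class and measure how often that is right.
baseline = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)
print(baseline_acc)
```

Running the same two lines on `X_train`, `y_train` and scoring on `X_test`, `y_test` would give the number the 60 percent figure should be compared against.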

In [None]:
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))

The classifier for successful campaigns has high precision and low recall, meaning that it is very picky about what it classifies as successful, but ultimately misses a lot of true successes because it is so strict. On the failed campaign side, it's the exact opposite. Our classifier casts a wide net to identify failed campaigns and catches 96 percent of them, but is only right 56 percent of the time. Hopefully some parameter tuning can improve the classifier.
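
For readers who want to see how those four numbers fall out of a confusion matrix, here is a worked sketch with toy labels. The labels are invented to mimic a "picky" classifier and are not taken from the Kickstarter results.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (1 = successful, 0 = failed); not taken from the real data.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_hat = [1, 0, 0, 0, 0, 0, 0, 0]   # a "picky" classifier

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision_success = tp / (tp + fp)   # how often a predicted success is right
recall_success = tp / (tp + fn)      # share of true successes that were found
precision_failed = tn / (tn + fn)    # how often a predicted failure is right
recall_failed = tn / (tn + fp)       # share of true failures that were found
print(precision_success, recall_success, precision_failed, recall_failed)
```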

In [None]:
probs = rfe.predict_proba(X_test)
preds = probs[:,1]
fpr, tpr, threshold = roc_curve(y_test, preds)
roc_auc = auc(fpr, tpr)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()


We can see that our classifier is just above random chance. Hopefully we can improve this with some parameter tuning.
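
As a point of reference for reading the AUC, the sketch below compares scores that are unrelated to the label with scores that carry some signal. The data is synthetic and only meant to show that an uninformative ranking sits near 0.5.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=2000)

# Scores unrelated to the label versus scores that are noisy but related.
random_scores = rng.random(2000)
informative_scores = y_true + rng.normal(0, 1, 2000)

auc_random = roc_auc_score(y_true, random_scores)
auc_informative = roc_auc_score(y_true, informative_scores)
print(round(auc_random, 2), round(auc_informative, 2))
```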

In [None]:
# Tuning parameters for logistic regression
logit_params = {'C': [.01, .1, 1., 10., 100.], 'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}
logit = LogisticRegression(class_weight='balanced')
lgt = GridSearchCV(logit, logit_params, cv=10)
lgt.fit(X_train, y_train)

In [None]:
# model accuracy
print('Acc: ', lgt.score(X_test, y_test))


print(lgt.best_params_)
C = lgt.best_params_['C']
solver = lgt.best_params_['solver']
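
An alternative to rebuilding the model by hand from `best_params_` is to use the search object's `best_estimator_`, which is already refit on the full training set when `refit=True` (the default). A sketch on synthetic data, with a reduced parameter grid chosen only to keep the example fast:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data standing in for df[NUMERIC] and df[STATE].
X, y = make_classification(n_samples=400, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

search = GridSearchCV(
    LogisticRegression(class_weight='balanced', max_iter=1000),
    {'C': [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X_tr, y_tr)

# refit=True retrains the best configuration on all of X_tr, so
# best_estimator_ is ready to use without constructing a new model.
best = search.best_estimator_
print(search.best_params_, best.score(X_te, y_te))
```

Using `best_estimator_` also carries over settings that are not part of the grid, such as `class_weight='balanced'`.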

In [None]:
logit = LogisticRegression(C=C, solver=solver)
logit = logit.fit(X_train, y_train)

pred = logit.predict(X_test)
score = logit.score(X_test, y_test)
print(score)

In [None]:
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))

In [None]:
# score the tuned logistic regression, not the earlier RFE model
probs = logit.predict_proba(X_test)
preds = probs[:,1]
fpr, tpr, threshold = roc_curve(y_test, preds)
roc_auc = auc(fpr, tpr)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()


Tuning the parameters only led to a 2 percent increase in AUC. Maybe it's time to try a different classifier to see if we can improve on it.


## Random Forest Classifier

In [None]:
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(df[NUMERIC], df[STATE], random_state=42)

In [None]:
rfc = RandomForestClassifier()
rfe = RFE(rfc)
rfe = rfe.fit(X_train, y_train)
#print(rfe.support_)
#print(rfe.ranking_)
support = np.array(rfe.support_)
ranking = np.array(rfe.ranking_)
# support
var_importance = pd.DataFrame({'support': support, 'ranking': ranking})
var_importance = var_importance.T
var_importance.columns = NUMERIC
var_importance

These results are surprising because they are almost exactly the inverse of the feature importances for logistic regression.
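
One possible contributor to this disagreement (an assumption, not something verified on this data) is the encoding. `LabelEncoder` assigns arbitrary integers to categories, which a linear model reads as an ordered magnitude, while a tree can split on them freely. One-hot encoding removes the artificial ordering. A minimal sketch on toy categories:

```python
import pandas as pd

# Toy categories; the notebook itself would use df['main_category'].
toy = pd.DataFrame({'main_category': ['games', 'music', 'art', 'games']})

# One indicator column per category, so no artificial ordering is implied.
one_hot = pd.get_dummies(toy['main_category'], prefix='main_category', dtype=int)
print(one_hot)
```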

In [None]:
rfc = RandomForestClassifier()
# Fit the classifier to the training data

rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
# Compute and print accuracy
accuracy = rfc.score(X_test, y_test)
print("\nRandomForestClassifier accuracy: ", accuracy)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Out of the box, our random forest classifier performs better than our logistic classifier. The F1 scores for successful and failed campaigns are both around 0.70.

In [None]:
probs = rfc.predict_proba(X_test)
preds = probs[:,1]
fpr, tpr, threshold = roc_curve(y_test, preds)
roc_auc = auc(fpr, tpr)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()



In [None]:
# Let's try tuning parameters!


In [None]:
n_estimators = [1, 2, 4, 8, 16]
# max_depth must be an integer; np.linspace(1, 12, 6) would produce floats
max_depths = [1, 3, 5, 7, 9, 12]
# fractions of the training samples required to split a node
min_samples_splits = np.linspace(0.5, 1.0, 5, endpoint=True)

# Tuning parameters for the random forest
rfc_params = {'n_estimators': n_estimators,
              'max_depth': max_depths,
              'min_samples_split': min_samples_splits}
rfc = RandomForestClassifier()
rfc = GridSearchCV(rfc, rfc_params, cv=5)
rfc.fit(X_train, y_train)

In [None]:
# model accuracy
print('Acc: ', rfc.score(X_test, y_test))


print(rfc.best_params_)
n_estimator = rfc.best_params_['n_estimators']
max_depth = rfc.best_params_['max_depth']
min_samples_split = rfc.best_params_['min_samples_split']


In [None]:
rfc = RandomForestClassifier(max_depth=max_depth, min_samples_split=min_samples_split, n_estimators=n_estimator)
rfc = rfc.fit(X_train, y_train)

y_pred = rfc.predict(X_test)
# Compute and print accuracy
accuracy = rfc.score(X_test, y_test)
print("\nRandomForestClassifier accuracy: ", accuracy)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

In [None]:
probs = rfc.predict_proba(X_test)
preds = probs[:,1]
fpr, tpr, threshold = roc_curve(y_test, preds)
roc_auc = auc(fpr, tpr)

plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

## Naive Bayes

In [None]:
from sklearn.naive_bayes import MultinomialNB


In [None]:
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(df[NUMERIC], df[STATE], random_state=42)

In [None]:
mnb = MultinomialNB()
rfe = RFE(mnb)
rfe = rfe.fit(X_train, y_train)
#print(rfe.support_)
#print(rfe.ranking_)
support = np.array(rfe.support_)
ranking = np.array(rfe.ranking_)
# support
var_importance = pd.DataFrame({'support': support, 'ranking': ranking})
var_importance = var_importance.T
var_importance.columns = NUMERIC
var_importance

In [None]:
mnb

In [None]:
mnb = MultinomialNB()

mnb.fit(X_train, y_train)
y_pred = mnb.predict(X_test)
# Compute and print accuracy
accuracy = mnb.score(X_test, y_test)
print("\nMultinomialNB: ", accuracy)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

In [None]:
probs = mnb.predict_proba(X_test)
preds = probs[:,1]
fpr, tpr, threshold = roc_curve(y_test, preds)
roc_auc = auc(fpr, tpr)


plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()


In [None]:
# Tuning parameters for naive Bayes
bayes_params = {'alpha': [.01, .1, 1., 10., 100.]}
mnb = MultinomialNB()
mnb = GridSearchCV(mnb, bayes_params, cv=10)
mnb.fit(X_train, y_train)

In [None]:
# model accuracy
print('Acc: ', mnb.score(X_test, y_test))


print(mnb.best_params_)


In [None]:

mnb = MultinomialNB(alpha=100)

mnb.fit(X_train, y_train)
y_pred = mnb.predict(X_test)
# Compute and print accuracy
accuracy = mnb.score(X_test, y_test)
print("\nMultinomialNB: ", accuracy)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

In [None]:
probs = mnb.predict_proba(X_test)
preds = probs[:,1]
fpr, tpr, threshold = roc_curve(y_test, preds)
roc_auc = auc(fpr, tpr)


plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
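
`MultinomialNB` models count-like, non-negative features, so it is a more natural fit for word counts from the `blurb` field than for label-encoded categories. The sketch below pairs it with `CountVectorizer` (imported earlier but not yet used). The blurbs and outcomes are hypothetical and for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical blurbs and outcomes, for illustration only.
blurbs = [
    'a cooperative board game for families',
    'board game expansion with new cards',
    'my first album of jazz standards',
    'help me record a jazz album',
]
outcomes = [1, 1, 0, 0]

text_clf = Pipeline([
    ('counts', CountVectorizer(stop_words='english')),  # blurb -> word counts
    ('nb', MultinomialNB()),
])
text_clf.fit(blurbs, outcomes)

# Classify two unseen toy blurbs.
toy_pred = text_clf.predict(['a new board game', 'jazz album'])
print(toy_pred)
```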

## SVM


In [None]:
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(df[NUMERIC], df[STATE], random_state=42)

In [None]:
# RFE needs coef_, which SVC only exposes with a linear kernel
svc = SVC(kernel='linear')
rfe = RFE(svc)
rfe = rfe.fit(X_train, y_train)
#print(rfe.support_)
#print(rfe.ranking_)
support = np.array(rfe.support_)
ranking = np.array(rfe.ranking_)
# support
var_importance = pd.DataFrame({'support': support, 'ranking': ranking})
var_importance = var_importance.T
var_importance.columns = NUMERIC
var_importance

In [None]:
svc = SVC()

svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
# Compute and print accuracy
accuracy = svc.score(X_test, y_test)
print("\nSVM: ", accuracy)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
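
SVMs with an RBF kernel are sensitive to feature scale, and `goal` is likely on a much larger scale than the month or category codes. A common remedy is to standardize inside a `Pipeline` so the scaler is fit on training data only. A sketch on synthetic data, with one column inflated to mimic a dollar-valued field (an assumption made for the example):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data; one column is inflated to mimic a dollar-valued 'goal'.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X[:, 0] *= 10_000
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scaled_svc = Pipeline([
    ('scale', StandardScaler()),   # zero mean, unit variance per feature
    ('svc', SVC()),                # default RBF kernel
])
scaled_svc.fit(X_tr, y_tr)
scaled_acc = scaled_svc.score(X_te, y_te)
print(scaled_acc)
```

The same pipeline object can be passed to `GridSearchCV`, with grid keys written as `svc__C` and `svc__gamma`.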

In [None]:
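# SVC has no predict_proba unless probability=True, so use the signed
# distance to the decision boundary as the ranking score
preds = svc.decision_function(X_test)
fpr, tpr, threshold = roc_curve(y_test, preds)
roc_auc = auc(fpr, tpr)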


plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

In [None]:
# Tuning parameters for the SVM
parameters = {'C':[1, 10, 100],
              'gamma':[0.1, 0.01]}
svc = SVC()
svc = GridSearchCV(svc, parameters, cv=10)
svc.fit(X_train, y_train)

In [None]:
# model accuracy
print('Acc: ', svc.score(X_test, y_test))


print(svc.best_params_)

In [None]:
# refit an SVC with the best parameters found by the grid search
svc = SVC(**svc.best_params_)

svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
# Compute and print accuracy
accuracy = svc.score(X_test, y_test)
print("\nSVM: ", accuracy)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

In [None]:
# use decision_function scores, since SVC has no predict_proba by default
preds = svc.decision_function(X_test)
fpr, tpr, threshold = roc_curve(y_test, preds)
roc_auc = auc(fpr, tpr)


plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()