# Predicting Kickstarter Success

![title](http://the-lfb.com/wp-content/uploads/2013/05/kickstarter-thumbnail.jpg)

# Summary:

## What would I recommend?
* Write more.
* Use certain words (like "we" and "you") more, and use others (like "I") less.
* Set your goal low. 
* If you can, create a project in Technology, Games, or Food.

## What tools did I use?
* To ensure a repeatable environment, I built a Docker container with all the tools I used, including Pandas, NLTK, and TensorFlow: https://github.com/aaronwro/docker-jupyter. To get started, install Docker Toolbox (Windows or Mac), then run:
```bash
git clone https://github.com/aaronwro/docker-jupyter
cd docker-jupyter
docker-compose up jupyter
```
* Pandas for loading and cleaning data
* NLTK for stopwords and stemming
* Keras for building a deep learning model
* Plotly and Matplotlib for plots

## What would I do next?
* Statistics on provided and LIWC features
 * Remove outliers (+/- 3 std deviations)
 * Confidence interval on 
* Deep learning model:
 * Word2Vec embeddings to teach the model more about language, semantics, and similar concepts.
 * Use the LIWC features as additional training inputs for the model.
 * Change the model to predict % funded instead of the binary label. 
 * Multilayer convolutional network to allow the model to make more cross-sentence inferences.
 * Set up a GPU with TensorFlow on a local machine (ideal, better GPUs available) or in AWS (more expensive and slower) for faster training
 * Create an interactive experience for real-time feedback
* Data collection
 * Use Beautiful Soup to gather more Kickstarter campaigns. Almost 300k projects on Kickstarter alone!
 * Use Beautiful Soup to gather data on campaigns from other crowdfunding platforms.
 * Gather metadata on videos: is one present? How long is it? Can we get the number of views?
 * Gather data on reward types, pricing, and language. Reward selection is very important for me personally.
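
The outlier-removal step listed above could look roughly like the following. This is a minimal sketch, not code from the notebook: the helper name `remove_outliers` and the toy `goal` values are made up for illustration, and it assumes pandas is available as in the rest of the notebook.

```python
import pandas as pd


def remove_outliers(frame, column, n_std=3.0):
    """Keep rows whose value in `column` lies within n_std standard
    deviations of the column mean (hypothetical helper)."""
    mean = frame[column].mean()
    std = frame[column].std()  # pandas default: sample standard deviation (ddof=1)
    within = (frame[column] - mean).abs() <= n_std * std
    return frame[within]  # rows with NaN in `column` are dropped as well


# Toy data (made up): 19 modest goals and one extreme goal.
toy = pd.DataFrame({'goal': [100.0] * 19 + [10000.0]})
trimmed = remove_outliers(toy, 'goal')
print(len(toy), len(trimmed))
```

If a column turns out to be heavily skewed, applying the same filter to a log-transformed copy may be a better fit; that choice is left open here.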


In [None]:
import pandas as pd
import numpy as np
import nltk

import warnings  # silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

import plotly
#print plotly.__version__  # version >1.9.4 required
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot, plot
from plotly.graph_objs import *
init_notebook_mode()
import matplotlib.pyplot as plt
#from wordcloud import WordCloud, STOPWORDS
%matplotlib inline

from dateutil import parser as dateparser
from pygeocoder import Geocoder
from time import sleep
from tqdm import tqdm
from geonamescache import GeonamesCache
from termcolor import colored

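try:
    string_types = basestring  # Python 2
except NameError:
    string_types = str  # Python 3

def get_len(row, column):
    """Return the length of a text field, or 0 if it is missing (NaN) or not a string."""
    value = row[column]
    return len(value) if isinstance(value, string_types) else 0

df = pd.read_csv('kickstarter_corpus_cleaned.csv')
# Derived features used in the plots and statistics below
df['avg_pledge_per_backer'] = df['pledged'] / df['backers']
df['percent_of_goal'] = df['pledged'] / df['goal']
df['cleaned_words_len'] = df.apply(lambda row: get_len(row, 'cleaned_words'), axis=1)
print('[kickstarter_corpus] shape: ' + str(df.shape))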

In [None]:
df.head(3)

In [None]:
total_pledged_by_category = df[['category', 'funded', 'pledged']].groupby(['category', 'funded']).sum().reset_index()

funded = Bar(
    x=total_pledged_by_category.query('funded == True')['category'],
    y=total_pledged_by_category.query('funded == True')['pledged'],
    name='Funded'
)

not_funded = Bar(
    x=total_pledged_by_category.query('funded == False')['category'],
    y=total_pledged_by_category.query('funded == False')['pledged'],
    name='Not Funded'
)

data = [funded, not_funded]
layout = Layout(
    title='total funds pledged by category',
    barmode='group'
)

iplot(Figure(data=data, layout=layout))
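
Each grouped bar chart in this notebook is drawn from the same kind of table: one row per (category, funded) pair, with the metric summed or averaged. The sketch below shows that aggregation on a made-up four-row frame. The helper name `aggregate_by_category` and the sample values are assumptions for illustration only, not part of the original analysis.

```python
import pandas as pd


def aggregate_by_category(frame, value_column, how='mean'):
    """Aggregate `value_column` per (category, funded) pair (hypothetical helper)."""
    grouped = frame[['category', 'funded', value_column]].groupby(['category', 'funded'])
    return grouped.agg(how).reset_index()


# Toy data (made up) standing in for the Kickstarter corpus.
toy = pd.DataFrame({
    'category': ['Games', 'Games', 'Food', 'Food'],
    'funded': [True, True, False, True],
    'pledged': [100.0, 300.0, 50.0, 150.0],
})

totals = aggregate_by_category(toy, 'pledged', how='sum')
funded_totals = totals[totals['funded']]  # the rows that would feed the "Funded" bars
print(totals)
```

Only pairs that actually occur in the data appear in the result, so a category with no unfunded projects simply has no "Not Funded" bar.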

In [None]:
avg_pledged_by_category = df[['category', 'funded', 'pledged']].groupby(['category', 'funded']).mean().reset_index()

funded = Bar(
    x=avg_pledged_by_category.query('funded == True')['category'],
    y=avg_pledged_by_category.query('funded == True')['pledged'],
    name='Funded'
)

not_funded = Bar(
    x=avg_pledged_by_category.query('funded == False')['category'],
    y=avg_pledged_by_category.query('funded == False')['pledged'],
    name='Not Funded'
)

data = [funded, not_funded]
layout = Layout(
    title='avg funds pledged by category',
    barmode='group'
)

iplot(Figure(data=data, layout=layout))

In [None]:
avg_percent_over_goal_by_category = df[['category', 'funded', 'percent_of_goal']].groupby(['category', 'funded']).mean().reset_index()

funded = Bar(
    x=avg_percent_over_goal_by_category.query('funded == True')['category'],
    y=avg_percent_over_goal_by_category.query('funded == True')['percent_of_goal'],
    name='Funded'
)

not_funded = Bar(
    x=avg_percent_over_goal_by_category.query('funded == False')['category'],
    y=avg_percent_over_goal_by_category.query('funded == False')['percent_of_goal'],
    name='Not Funded'
)

data = [funded, not_funded]
layout = Layout(
    title='avg % of goal pledged by category',
    barmode='group'
)

iplot(Figure(data=data, layout=layout))

In [None]:
text_len_by_category = df[['category', 'funded', 'cleaned_words_len']].groupby(['category', 'funded']).mean().reset_index()

funded = Bar(
    x=text_len_by_category.query('funded == True')['category'],
    y=text_len_by_category.query('funded == True')['cleaned_words_len'],
    name='Funded'
)

not_funded = Bar(
    x=text_len_by_category.query('funded == False')['category'],
    y=text_len_by_category.query('funded == False')['cleaned_words_len'],
    name='Not Funded'
)

data = [funded, not_funded]
layout = Layout(
    title='text length by category',
    barmode='group'
)

iplot(Figure(data=data, layout=layout))

In [None]:
goal_by_category = df[['category', 'funded', 'goal']].groupby(['category', 'funded']).mean().reset_index()

funded = Bar(
    x=goal_by_category.query('funded == True')['category'],
    y=goal_by_category.query('funded == True')['goal'],
    name='Funded'
)

not_funded = Bar(
    x=goal_by_category.query('funded == False')['category'],
    y=goal_by_category.query('funded == False')['goal'],
    name='Not Funded'
)

data = [funded, not_funded]
layout = Layout(
    title='avg. goal by category',
    barmode='group'
)

iplot(Figure(data=data, layout=layout))

In [None]:
layout = Layout(title='goal by category for funded projects')
# Combine both filters in one boolean mask instead of chaining df[...][...]
boxes = [Box(y=df[(df.funded == True) & (df.category == category)]['goal'], name=category, showlegend=True) for category in df.sort_values(by='category').category.unique()]
iplot(Figure(data=boxes, layout=layout), show_link=False)

In [None]:
avg_text_len_by_category = df[['category', 'funded', 'cleaned_words_len']].groupby(['category', 'funded']).mean().reset_index()

funded = Bar(
    x=avg_text_len_by_category.query('funded == True')['category'],
    y=avg_text_len_by_category.query('funded == True')['cleaned_words_len'],
    name='Funded'
)

not_funded = Bar(
    x=avg_text_len_by_category.query('funded == False')['category'],
    y=avg_text_len_by_category.query('funded == False')['cleaned_words_len'],
    name='Not Funded'
)

data = [funded, not_funded]
layout = Layout(
    title='avg text length by category',
    barmode='group'
)

iplot(Figure(data=data, layout=layout))

In [None]:
from scipy.misc import imread
from wordcloud import WordCloud, STOPWORDS

kickstarter_mask = imread('./kickstarter.png', flatten=True)

wordcloud = WordCloud(
                      stopwords=STOPWORDS,
                      background_color='white',
                      width=1800,
                      height=1400,
                      mask=kickstarter_mask
                     ).generate(' '.join(df[df.funded==True]['stemmed_words'].dropna()))  # skip rows with missing text

plt.figure(figsize=(10,10))
plt.imshow(wordcloud)
plt.axis('off')
plt.savefig('./funded_kickstarter_wordcloud.png', dpi=300)
plt.show()

In [None]:
kickstarter_mask = imread('./kickstarter.png', flatten=True)

wordcloud = WordCloud(
                      stopwords=STOPWORDS,
                      background_color='white',
                      width=1800,
                      height=1400,
                      mask=kickstarter_mask
                     ).generate(' '.join(df[df.funded==False]['stemmed_words'].dropna()))  # skip rows with missing text

plt.figure(figsize=(10,10))
plt.imshow(wordcloud)
plt.axis('off')
plt.savefig('./not_funded_kickstarter_wordcloud.png', dpi=300)
plt.show()

In [None]:
avgs = df[['category', 'funded', 'cleaned_words_len', 'avg_pledge_per_backer', 'goal']].groupby(['category', 'funded']).mean().reset_index()
avgs

In [None]:
from scipy import stats
import statsmodels.stats.api as sms

stats_columns=['Column', '% Diff (funded - not_funded)', 'p_value']
#stats_columns=['Column', '% Diff (funded - not_funded)', 'Confidence Lower', 'Confidence Upper', 'p_value']

def percent_diff(a,b):
    return 100 * (a-b)/((a+b)/2)

def get_stats_df(df, col):
    """Return a one-row DataFrame with the percent difference in means and the t-test p-value for `col`."""
    funded = df[df.funded == True][col]
    not_funded = df[df.funded == False][col]
    t_test = stats.ttest_ind(funded, not_funded)  # two-sided t-test, equal variances assumed by default
    #cm = sms.CompareMeans(sms.DescrStatsW(funded), sms.DescrStatsW(not_funded))
    row = [col, percent_diff(np.mean(funded), np.mean(not_funded)), t_test[1]]
    #row = [col, percent_diff(np.mean(funded), np.mean(not_funded)), cm.tconfint_diff(usevar='unequal')[0], cm.tconfint_diff(usevar='unequal')[1], t_test[1]]
    return pd.DataFrame([row], columns=stats_columns)

def print_stats(df, col):
    """Print one tab-separated row for `col`, in green when p < 0.05 and grey otherwise."""
    funded = df[df.funded == True][col]
    not_funded = df[df.funded == False][col]
    result = stats.ttest_ind(funded, not_funded)
    # Use the percent_diff() helper instead of shadowing it with a raw difference of means
    diff = percent_diff(np.mean(funded), np.mean(not_funded))
    color = 'green' if result[1] < 0.05 else 'grey'
    print(colored(' \t '.join([col, str(diff), str(result[1])]), color))

#for category in df.sort_values(by='category').category.unique():
#    print '====== {} ======'.format(category)
#    print_stats(col, category=category)

#print ' \t '.join(stats_columns)
#print_stats(df, 'cleaned_words_len')
#print_stats(df[~df.avg_pledge_per_backer.isnull()], 'avg_pledge_per_backer')
#print_stats(df, 'goal')

# Collect one row per feature, then keep only the statistically significant ones (p < 0.05)
stats_df = pd.concat([
    get_stats_df(df, 'cleaned_words_len'),
    get_stats_df(df[~df.avg_pledge_per_backer.isnull()], 'avg_pledge_per_backer'),
    get_stats_df(df, 'goal'),
], ignore_index=True)
stats_df[stats_df.p_value < 0.05]
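
The `% Diff` column uses a symmetric percent difference: the gap between the two means divided by their average. The result therefore has the same magnitude whichever group is treated as the baseline, and only the sign flips. A small worked example with made-up numbers (not values from the corpus):

```python
def percent_diff(a, b):
    """Symmetric percent difference between a and b, relative to their average."""
    return 100.0 * (a - b) / ((a + b) / 2.0)


# Made-up group means: 120 for funded, 80 for not funded.
example = percent_diff(120.0, 80.0)   # 100 * 40 / 100
mirrored = percent_diff(80.0, 120.0)  # same magnitude, opposite sign
print(example, mirrored)
```

Note that the formula is undefined when the two means sum to zero, which should not arise for non-negative features such as word counts or goals.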

In [None]:
# LIWC stats
# http://lit.eecs.umich.edu/~geoliwc/LIWC_Dictionary.htm

liwc_df = pd.read_csv('kickstarter_assignment/LIWC2015_kickstarter_corpus_cleaned.csv')
liwc_columns = ['WC', 'Analytic', 'Clout', 'Authentic', 'Tone', 'WPS', 'Sixltr','Dic', 'function', 'pronoun', 'ppron', 'i', 'we', 'you', 'shehe','they', 'ipron', 'article', 'prep', 'auxverb', 'adverb', 'conj','negate', 'verb', 'adj', 'compare', 'interrog', 'number', 'quant','affect', 'posemo', 'negemo', 'anx', 'anger', 'sad', 'social','family', 'friend', 'female', 'male', 'cogproc', 'insight', 'cause','discrep', 'tentat', 'certain', 'differ', 'percept', 'see', 'hear','feel', 'bio', 'body', 'health', 'sexual', 'ingest', 'drives','affiliation', 'achieve', 'power', 'reward', 'risk', 'focuspast','focuspresent', 'focusfuture', 'relativ', 'motion', 'space', 'time','work', 'leisure', 'home', 'money', 'relig', 'death', 'informal','swear', 'netspeak', 'assent', 'nonflu', 'filler', 'AllPunc','Period', 'Comma', 'Colon', 'SemiC', 'QMark', 'Exclam', 'Dash','Quote', 'Apostro', 'Parenth', 'OtherP']
# Run the same comparison for every LIWC feature and list the significant ones, largest difference first
stats_df = pd.concat([get_stats_df(liwc_df, col) for col in liwc_columns], ignore_index=True)
stats_df[stats_df.p_value < 0.05].sort_values('% Diff (funded - not_funded)', ascending=False)