# An Analysis of Kickstarter Campaigns
## Tyler Zarnik: github - atTlr

---

## Executive Summary

---
 
The primary purpose of this project is to compile a comprehensive analysis of publicly available Kickstarter datasets. The goal is to provide visualizations of the data as well as a machine learning model that predicts whether a campaign will be successful and gives a probability for that outcome.

We will use the Python libraries pandas, numpy, and matplotlib for the data cleaning and data visualization steps of the project, and sklearn for the machine learning models. We will use Amazon Web Services (AWS) to run and return our fitted model. A logistic regression will serve as the explanatory model, while a random forest classifier will serve as a more predictive model. Natural Language Processing will supply additional features within our modeling pipeline so that the title and description of the Kickstarter campaign are incorporated as well.

The analysis will also incorporate current articles that have been published by Kickstarter itself as 'best practices' for running a campaign. We will look to corroborate these 'best practices' where they can be shown through the data, and provide our own recommendations based on our model.

### Contents:

- [Preliminary Data On Kickstarter](#Preliminary-Data-On-Kickstarter)
- [Tips Normally Explaining Successful Campaigns](#Tips-Normally-Explaining-Successful-Campaigns)
- [Data Dictionary](#Data-Dictionary)
- [Data Collection](#Data-Collection)
- [Imports](#Imports)
- [Exploratory Data Analysis](#EDA)
- [Preprocessing and Modeling](#Preprocessing-and-Modeling)
- [Evaluation and Conceptual Understanding](#Evaluation-and-Conceptual-Understanding)
- [Conclusion and Recommendations](#Conclusion-and-Recommendations)

### Preliminary Data On Kickstarter

Before conducting the analysis, we find it necessary to explain some of the nuances of the Kickstarter crowdfunding platform. The two main nuances that set Kickstarter apart from other platforms are that Kickstarter does not offer a 'flexible' funding option and that Kickstarter only receives money from a campaign if that campaign reaches its goal. In the US, Kickstarter applies a 3-5% fee for credit card and PayPal processing and takes 5% of any successfully funded campaign (https://www.kickstarter.com/articles/creative-projects-community-covid-19). This creates a unique situation: Kickstarter is incentivized to cultivate and assist its campaign creators so that they reach the goals they have set. When a campaign does not reach its chosen goal, no pledges are collected or passed on to the creator by Kickstarter. As stated before, this is markedly different from other platforms that offer 'flexible' funding options. Flexible funding allows any funds pledged towards a goal to be collected by the creator, with the caveat that when the total falls short of the specified goal, the crowdfunding company takes out a high percentage.
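The all-or-nothing rule described above can be sketched as a small function. This is an illustration only: the function name `creator_payout` and the default fee rates are assumptions chosen from the ranges quoted above, not Kickstarter's exact fee schedule.

```python
def creator_payout(pledged, goal, platform_fee=0.05, processing_fee=0.03):
    """Return what a creator would receive under all-or-nothing funding.

    The fee rates are illustrative assumptions (5% platform fee and a
    processing fee inside the quoted 3-5% range), not exact figures.
    """
    if pledged < goal:
        # Goal missed: no pledges are collected, so nothing is paid out.
        return 0.0
    fees = pledged * platform_fee + pledged * processing_fee
    return pledged - fees


# A campaign that misses its goal versus one that exceeds it.
print(creator_payout(pledged=800, goal=1000))
print(creator_payout(pledged=1200, goal=1000))
```

Under these assumed rates, a campaign that falls short receives nothing, while a funded campaign keeps its pledges minus the combined fees.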

Because of these nuances, Kickstarter has published, via its blog, different strategies and practices that it sees successful campaigns follow according to its data. According to Kickstarter's stats page, around 38% of all campaigns reach their goal (https://www.kickstarter.com/help/stats?ref=global-footer). This means that roughly 62% of all campaigns, a few percentage points less than two thirds, do not reach their goal. Interestingly, out of the 5.2 billion dollars requested by all campaigns launched on the site, 4.73 billion dollars have been collected by successfully funded projects. This could indicate either that the campaigns that get funded more often are those with high goals, or that successfully funded campaigns are often backed far above their stated goal.

### Tips Normally Explaining Successful Campaigns

After reading through Kickstarter's blog and other related articles and op-eds about how to run successful campaigns, we found that the articles typically present similar ideas. I will attempt to simplify and present these in a broader view below:

> 1. __Have a Complete and Fleshed Out Kickstarter Page__ - 
    > Pages should have a picture, a well-defined description, a timeline, a video, and a personal touch.
    
> 2. __Projects Should Have Clearly Defined Goals and Incentives__ - 

> 3. __Projects Should Engage with Backers Who Have Pledged__ - 

From a cursory glance at the data we were able to collect, we do not believe that we will be able to definitively prove or disprove these claims. Rather, we hope to provide some additional data-driven information to help bolster the tips above.

## Data Dictionary

|Feature|Type|Dataset|Description|
|---|---|---|---|
|id|object|df|Identifying ID For Each Campaign|
|name|object|df|Name and Description of the Campaign|
|category|object|df|Subcategory Description of the Campaign|
|main_category|object|df|Category Of the Campaign|
|currency|object|df|Currency of Pledged Amount|
|launched|datetime|df|Date When Campaign Was Created and Started|
|deadline|datetime|df|Designated End Date of Campaign|
|pledged|float|df|Amount Pledged In Amount of Currency|
|usd_pledged|float|df|Translated Amount Pledged in US Dollars|
|goal|float|df|Amount That Creator Needs To Complete Project|
|backers|int|df|Number of Users That Have Donated to the Campaign|
|country|object|df|Country In Which the Campaign is Taking Place|
|spotlight|object|df|If The Campaign Was Spotlighted on the Kickstarter Website|
|staff_pick|object|df|If the Campaign Was Endorsed by the Staff at Kickstarter|
|duration|int|df|Length In Days of Kickstarter Campaign|
|month_launched|object|df|Month in Which Campaign Was Launched|
|result|int|df|Whether the Campaign Reached The Goal Set|

## Data Collection

We collected our data from two sources: a Kaggle dataset and a collection of datasets from WebRobots.io. We have included a notebook that contains the full process of cleaning and merging the datasets. Both datasets were cleared of any duplicates based on the unique campaign ID, which ensured that there was no overlap between the datasets gathered. Several columns were cleaned, such as the 'name' and 'category' columns, and the 'launched' and 'deadline' columns were converted to datetime objects. Kickstarter's stats page says that there have been 497,000 projects. After cleaning and removing duplicates, our dataset contains roughly 380,000 campaigns. The time period of our data is from 2009 to July of 2020. While our data does not capture every single campaign, we do believe that it is representative of the campaign population as a whole.
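The merging steps described above could look roughly like the sketch below. It is not the code from the cleaning notebook: the two small frames are made-up stand-ins for the Kaggle and WebRobots.io exports, and deriving `duration` as the day difference between `deadline` and `launched` is an assumption based on the data dictionary.

```python
import pandas as pd

# Tiny, made-up frames standing in for the two real sources.
kaggle = pd.DataFrame({
    "id": [1, 2, 3],
    "launched": ["2019-01-01", "2019-02-01", "2019-03-01"],
    "deadline": ["2019-01-31", "2019-03-03", "2019-03-31"],
})
webrobots = pd.DataFrame({
    "id": [3, 4],
    "launched": ["2019-03-01", "2020-06-01"],
    "deadline": ["2019-03-31", "2020-07-01"],
})

# Stack both sources, then keep one row per campaign id.
combined = pd.concat([kaggle, webrobots], ignore_index=True)
combined = combined.drop_duplicates(subset="id", keep="first").reset_index(drop=True)

# Convert the date columns from strings to datetime objects.
for col in ["launched", "deadline"]:
    combined[col] = pd.to_datetime(combined[col])

# Assumed derivation of the 'duration' feature: campaign length in days.
combined["duration"] = (combined["deadline"] - combined["launched"]).dt.days
print(combined)
```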

### Imports

In [11]:
import pandas as pd
import numpy as np
import os
import re
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from datetime import datetime
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import make_column_transformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.ensemble import RandomForestClassifier

## EDA

In [None]:
df = pd.read_csv('../data/total_kickstarter.csv')

We will start by creating a pairplot to look at all of our numeric variables. Generally, this gives a sense of how our data is spread and visualizes any overarching patterns.

In [None]:
sns.pairplot(df[['pledged','goal','backers','result','duration']]);

The most important thing to notice from this quick pairplot is that our dataset is rife with outliers. This is most evident in the 'goal' vs 'pledged' plots. Our data is also clumped at some points. From a cursory look, duration is the most visibly clumped variable, especially around the apparent 30 and 60 day marks. We can also see from the 'result' vs 'result' plot that there are more failed than successful campaigns. Below we will dig deeper into some of the more interesting plots.
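One hedged way to put a number on the outliers seen in the pairplot is the common 1.5 x IQR rule. The sketch below uses a made-up `goal` column rather than the real dataset, and the rule itself is an illustrative choice, not a step taken in this notebook.

```python
import pandas as pd

# Made-up goals with one extreme value, standing in for df['goal'].
toy = pd.DataFrame({"goal": [500, 1000, 2500, 5000, 10000, 1_000_000]})

# Quartiles and the conventional upper fence of Q3 + 1.5 * IQR.
q1, q3 = toy["goal"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)

# Rows above the fence are flagged as outliers under this rule.
outliers = toy[toy["goal"] > upper_fence]
print(upper_fence)
print(outliers)
```

Comparing the median against the upper quantiles in the same way gives a quick sense of how heavy the right tail is before choosing plot limits.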

In [None]:
plt.figure(figsize = (16,10))
plt.hist(df['duration'])
plt.hist(df.loc[df['result']==1]['duration'])
plt.title('Distribution of Duration of Campaigns in Days With Color Portion of Success: Bins 10',fontsize = 20)
plt.xlabel('Duration',fontsize = 12)
plt.ylabel('Frequency',fontsize = 12)
plt.tight_layout();  # call the function; the bare attribute reference did nothing

The graph above shows the distribution of campaign durations. The distribution is roughly bell-shaped with considerable right skew. With 10 bins, most of the data falls between the 20 and 40 day marks, which indicates that most campaigns last around 30 days. Visually, all bins have a similar share of successful campaigns, denoted by the orange bars in the graph above. The 11-bin plot further below adjusts the number of bins to give a more accurate picture of the data.

In [None]:
plt.figure(figsize = (16,10))
# NOTE: df_cov is not defined in this notebook; it is assumed to be the Covid-19-period subset of df
plt.hist(df_cov['duration'])
plt.hist(df_cov.loc[df_cov['result']==1]['duration'])
plt.title('Covid-19: Distribution of Duration of Campaigns in Days With Color Portion of Success: Bins 10',fontsize = 20)
plt.xlabel('Duration',fontsize = 12)
plt.ylabel('Frequency',fontsize = 12)

In [None]:
plt.figure(figsize = (16,10))
plt.hist(df['duration'],bins = 11)
plt.hist(df.loc[df['result']==1]['duration'],bins = 11)
plt.title('Distribution of Duration of Campaigns in Days With Color Portion of Success: Bins 11',fontsize = 20)
plt.xlabel('Duration',fontsize = 12)
plt.ylabel('Frequency',fontsize = 12)
plt.tight_layout();  # call the function; the bare attribute reference did nothing

With the increased number of bins, we can now see that most of the data is, as hypothesized above, centered around the 30 day mark. Whereas success looked relatively even across durations earlier, we can now clearly see that campaigns of 30 days or less have a high percentage of successes. At around two months, there is clearly a low point. Based on the evidence and suggestions given by Kickstarter, the more successful campaigns are generally those that are well planned from the beginning, have a clearly defined outline, and are marketed well. Longer campaigns may suffer from a lack of coordination and a clear plan, and may rely on a longer deadline as a crutch for getting more funding.
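Overlaid histograms compare raw counts, so the visual impression can be checked by computing the success rate per duration bin directly. The sketch below uses a small made-up frame with the `duration` and `result` columns from the data dictionary; the bin edges are illustrative assumptions.

```python
import pandas as pd

# Made-up campaigns standing in for the real dataset.
toy = pd.DataFrame({
    "duration": [10, 25, 30, 30, 45, 60, 60, 60],
    "result":   [1,  1,  1,  0,  0,  0,  0,  1],
})

# Assign each campaign to a duration bin (right edge inclusive).
bins = [0, 30, 45, 60]
toy["duration_bin"] = pd.cut(toy["duration"], bins=bins)

# Mean of a 0/1 column is the success rate; count shows the bin size.
rate = toy.groupby("duration_bin", observed=True)["result"].agg(["mean", "count"])
print(rate)
```

Reporting the count next to the rate helps avoid over-reading bins that contain only a handful of campaigns.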

In [None]:
plt.figure(figsize = (16,10))
plt.hist(df_cov['duration'],bins = 11)
plt.hist(df_cov.loc[df_cov['result']==1]['duration'],bins = 11)
plt.title('Covid-19: Distribution of Duration of Campaigns in Days With Color Portion of Success: Bins 11',fontsize = 20)
plt.xlabel('Duration',fontsize = 12)
plt.ylabel('Frequency',fontsize = 12)

In [None]:
plt.figure(figsize = (16,10))
plt.bar(x = df['main_category'].value_counts().keys(), height = df['main_category'].value_counts())
plt.bar(x = df.loc[df['result']==1]['main_category'].value_counts().keys(), height = df.loc[df['result']==1]['main_category'].value_counts())
plt.xlabel('Main Categories', fontsize = 12)
plt.ylabel('Frequency', fontsize = 12)
plt.title('Main Categories Frequency vs Success', fontsize = 20)
plt.tight_layout();  # call the function; the bare attribute reference did nothing

One topic that we found was often not discussed is how category influences success. Kickstarter clearly focuses on the idea of 'projects' and does not allow crowdfunding of medical or personal campaigns that are not directly tied to a business campaign or a personal artistic or technology-based campaign. This inherently shapes the demographics that frequent and back the site. Kickstarter's data found that projects are not purely or even mostly funded by friends and family, but rather by people who are independent of the creator's circle. While the categories are ordered from largest to smallest by number of campaigns, this does not mean that the most common categories are the most successful. The most successful campaign categories by percentage are comics, theater, dance, and music. The music category is the only one of these that would be considered a 'main' or 'large' category on the website. Technology, journalism, fashion, and food have the lowest success rates upon visual inspection.

This might indicate that there is a demographic of backers that is not being fully catered to. There may also be a market share that uses other crowdfunding websites to fund specific categories such as theater or dance. Without data from other crowdfunding sites, we can only speculate at this moment.
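The category ranking above is read off the chart by eye. A row-normalized cross-tabulation is one way to make the same comparison numerically; the sketch below uses made-up rows, so the printed rates say nothing about the real data.

```python
import pandas as pd

# Made-up campaigns; the category names are only examples.
toy = pd.DataFrame({
    "main_category": ["Comics", "Comics", "Comics",
                      "Technology", "Technology", "Technology", "Technology",
                      "Music", "Music"],
    "result": [1, 1, 0, 0, 0, 0, 1, 1, 0],
})

# normalize="index" turns each row into shares of failed (0) and successful (1).
shares = pd.crosstab(toy["main_category"], toy["result"], normalize="index")

# Column 1 holds the share of successful campaigns per category.
success_rate = shares[1].sort_values(ascending=False)
print(success_rate)
```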

In [None]:
plt.figure(figsize = (16,10))
plt.bar(x = df_cov['main_category'].value_counts().keys(), height = df_cov['main_category'].value_counts())
plt.bar(x = df_cov.loc[df_cov['result']==1]['main_category'].value_counts().keys(), height = df_cov.loc[df_cov['result']==1]['main_category'].value_counts())
plt.xlabel('Main Categories', fontsize = 12)
plt.ylabel('Frequency', fontsize = 12)
plt.title('Covid-19: Main Categories Frequency vs Success', fontsize = 20)
plt.tight_layout();  # call the function; the bare attribute reference did nothing

In [None]:
plt.figure(figsize = (16,10))
df.loc[(df['backers'] < 500)]['backers'].hist(bins = 15, grid = False)
df.loc[(df['backers']<500) & (df['result']==1)]['backers'].hist(bins = 15, grid = False)
plt.xlabel('Number of Backers Per Campaign',fontsize = 12)
plt.ylabel('Frequency',fontsize = 12)
plt.title('Distribution of Backers with Success Highlighted: Less Than 500 Backers',fontsize = 20)
plt.tight_layout();  # call the function; the bare attribute reference did nothing

The above histogram shows that most successful campaigns have at least 50 backers when they reach their goal. Assuming the distribution of goals is also extremely right skewed, it is reasonable to assume that campaigns that reach their goal are funded by many lower-pledging backers rather than a few high-pledging backers. This would indicate that you want to engage with as many people as possible rather than chasing the 'white whale' backers who drop a far larger pledge.
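The claim about many small pledges versus a few large ones could be checked with the average pledge per backer. The following is a sketch on made-up numbers; it assumes `usd_pledged` and `backers` are populated as described in the data dictionary.

```python
import pandas as pd

# Made-up campaigns standing in for the real dataset.
toy = pd.DataFrame({
    "usd_pledged": [5000.0, 12000.0, 300.0, 0.0],
    "backers":     [100,    40,      6,     0],
    "result":      [1,      1,       0,     0],
})

# Keep funded campaigns with at least one backer to avoid dividing by zero.
funded = toy[(toy["result"] == 1) & (toy["backers"] > 0)].copy()
funded["pledge_per_backer"] = funded["usd_pledged"] / funded["backers"]

# The median is less sensitive to a few very large pledges than the mean.
print(funded["pledge_per_backer"].median())
```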

In [None]:
plt.figure(figsize = (16,10))
df[df['goal']<50000]['goal'].hist(bins = 15)
df[(df['goal']<50000)&(df['result']==1)]['goal'].hist(bins = 15,grid = False)
plt.xlabel('Amount in USD',fontsize = 12)
plt.ylabel('Frequency',fontsize = 12)
plt.title('Distribution of Goals with Success Highlighted in USD: Less Than $50,000',fontsize = 20)
plt.tight_layout();  # call the function; the bare attribute reference did nothing

Continuing from the graph above, the distribution of campaign goals is also heavily right skewed. However, it appears that beyond the initial, more affordable goals of less than $500, goals are about equally likely, percentage-wise, to be reached. This runs counter to our earlier observation, based on the data from Kickstarter's stats page, that campaigns with higher goals may be the ones that get funded more often.

In [None]:
plt.figure(figsize = (16,10))
plt.bar(x = df['month_launched'].value_counts().keys(), height = df['month_launched'].value_counts())
plt.bar(x = df.loc[df['result']==1]['month_launched'].value_counts().keys(), height = df.loc[df['result']==1]['month_launched'].value_counts())
plt.xlabel('Months', fontsize = 15)
plt.ylabel('Frequency', fontsize = 15)
plt.title('Number of Campaigns per Month: Color Denotes Success', fontsize = 20)
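# The month labels are hard-coded in the order returned by value_counts(); they must be updated if the data changes
plt.xticks(np.arange(12), (['Jul','Mar','May','Jun','Apr','Oct','Nov','Feb','Aug','Sep','Jan','Dec']),fontsize = 15)
plt.yticks(fontsize = 15)
plt.tight_layout();  # call the function; the bare attribute reference did nothing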

## Preprocessing and Modeling

The attached data cleaning notebook produces a dataset that adds dummy variables for many of the numeric and categorical columns used below. All else is the same.
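As a rough illustration of what adding dummy variables means, the sketch below one-hot encodes two categorical columns with `pd.get_dummies`. The toy rows are made up, and `drop_first=True` is an assumption made here to avoid redundant columns in a linear model; the cleaning notebook may encode the columns differently.

```python
import pandas as pd

# Made-up rows with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "main_category": ["Music", "Comics", "Music"],
    "country": ["US", "GB", "US"],
    "goal": [1000.0, 500.0, 2500.0],
})

# Each listed column is replaced by indicator columns, one per category
# level, with the first level dropped as the reference category.
encoded = pd.get_dummies(toy, columns=["main_category", "country"], drop_first=True)
print(encoded.columns.tolist())
```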

In [2]:
df = pd.read_csv('../data/no_cvec_kickstarter.csv')

In [3]:
df.dropna(inplace=True)

In [4]:
X = df.drop(columns = ['result','pledged','usd_pledged','launched','deadline','id'])

In [5]:
y = df['result']

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X,y)

In [8]:
col_transformer = make_column_transformer(
    (CountVectorizer(), 'name'), 
    remainder = "passthrough"
)

pipe = Pipeline([
    ("col_trans", col_transformer),
    ("log_reg", LogisticRegression())
])

pipe_params = {
    # Digging through transformers
    'col_trans__countvectorizer__ngram_range': [(1,1),(1,2)],
    'col_trans__countvectorizer__max_features': [100, 200],
    
    "log_reg__C" : [1, 5],
    "log_reg__max_iter":[1000]
}

gs = GridSearchCV(pipe, pipe_params, cv = 3)

gs.fit(X_train, y_train)

GridSearchCV(cv=3, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('col_trans',
                                        ColumnTransformer(n_jobs=None,
                                                          remainder='passthrough',
                                                          sparse_threshold=0.3,
                                                          transformer_weights=None,
                                                          transformers=[('countvectorizer',
                                                                         CountVectorizer(analyzer='word',
                                                                                         binary=False,
                                                                                         decode_error='strict',
                                                                                         dtype=<class 'numpy.int64'>,
                 

In [9]:
gs.score(X_train, y_train)

0.9136314218164875

In [10]:
gs.score(X_test, y_test)

0.9129520910342828
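An accuracy score is easier to interpret next to a majority-class baseline. The sketch below shows one way to compute that baseline with sklearn's `DummyClassifier` on made-up labels; the 60/40 split is an assumption for illustration, not the class balance of this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: 6 failed (0) and 4 successful (1) campaigns.
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
# The baseline ignores the features, so a placeholder column is enough.
X_toy = np.zeros((len(y_toy), 1))

# Always predict the most frequent class seen during fitting.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)
print(baseline_acc)
```

On the real data, the analogous calls would be `baseline.fit(X_train, y_train)` followed by `baseline.score(X_test, y_test)`, giving a floor against which the grid search scores above can be compared.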

In [None]:
rf_col_transformer = make_column_transformer(
    (CountVectorizer(), 'name'), 
    remainder = "passthrough"
)

rf_pipe = Pipeline([
    ("col_trans", rf_col_transformer),
    ("rf", RandomForestClassifier())
])

rf_pipe_params = {
    # Digging through transformers
    'col_trans__countvectorizer__ngram_range': [(1,1),(1,2)],
    'col_trans__countvectorizer__max_features': [100, 200],
    
    "rf__max_depth" : [20, 30],
    'rf__max_features' : [90,100]
}

rf_gs = GridSearchCV(rf_pipe, rf_pipe_params, cv = 3)

rf_gs.fit(X_train, y_train)

In [None]:
rf_gs.score(X_train, y_train)

In [None]:
rf_gs.score(X_test, y_test)
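Since the stated goal is a probability of success rather than only a label, the fitted pipelines can also return class probabilities. The sketch below rebuilds a reduced version of the logistic regression pipeline on made-up rows to show the call; the toy data and the two-column feature set are assumptions. On the fitted search objects above, the analogous call would be `gs.predict_proba(X_test)`.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up campaigns with a text column and one numeric column.
toy = pd.DataFrame({
    "name": ["indie comic book", "smart watch gadget", "folk music album",
             "drone gadget kit", "comic art book", "smart home gadget"],
    "goal": [1000.0, 50000.0, 2000.0, 80000.0, 1500.0, 60000.0],
    "result": [1, 0, 1, 0, 1, 0],
})
X_toy, y_toy = toy[["name", "goal"]], toy["result"]

# Same structure as the pipeline above: vectorize 'name', pass the rest through.
toy_pipe = Pipeline([
    ("col_trans", make_column_transformer((CountVectorizer(), "name"),
                                          remainder="passthrough")),
    ("log_reg", LogisticRegression(max_iter=1000)),
])
toy_pipe.fit(X_toy, y_toy)

# Column 1 of predict_proba is the estimated probability of class 1 (success).
proba = toy_pipe.predict_proba(X_toy)[:, 1]
print(proba.round(3))
```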