## Project Kickstarter 
by Vivika Wilde wilde.vivika@gmail.com
   Sebastian Fuhrer fuhrer_sebastian@web.de

## Objective 

In recent years, the range of funding options for projects created by individuals and small companies has expanded considerably. In addition to savings, bank loans, friends & family funding and other traditional options, crowdfunding has become a popular and readily available alternative. 

Kickstarter, founded in 2009, is one particularly well-known and popular crowdfunding platform. It has an all-or-nothing funding model, whereby a project is only funded if it meets its goal amount; otherwise no money is given by backers to a project.
Many factors contribute to the success or failure of a project, both in general and on Kickstarter. Some of them can be quantified or categorized, which makes it possible to build a model that tries to predict whether a project will succeed. The aim of this project is to build such a model and to analyse Kickstarter project data more generally, in order to help potential project creators assess whether Kickstarter is a good funding option for them and what their chances of success are.


## Set up

In [1]:
%reset -fs
import glob, os, re
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import json
%matplotlib inline

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import cross_val_predict, cross_val_score, cross_validate
from sklearn.metrics import roc_curve, confusion_matrix, accuracy_score, recall_score, precision_score

from sklearn.linear_model import LogisticRegression

# Set random seed 
RSEED = 42

In [2]:
data = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "data/*.csv"))))
data = data.reset_index(drop=True)

In [3]:
df = data.copy()

## Variable names and description

* **backers_count** - Number of supporters/investors
* **blurb** - A short description of the product written for promotional purposes
* **category** - Projects have been classified into 16 categories, which broadly define the genre a project belongs to.
* **subcategory** - Categories are further divided into subcategories that give more detail on the project. For instance, the category “Technology” is split into subcategories such as Gadgets, Web, Apps and Software. There are 144 subcategories in total.
* **converted_pledged_amount** - Total pledged amount in USD.
* **country** - Country of origin of the project
* **created_at** - Date when Project was created (timestamp)
* **creator** - Information on the Creator including ID, name, etc.
* **currency** - Currency used to support the project (3-letter code)**???**
* **currency_symbol** - Symbol of the currency
* **currency_trailing_code** - Defines whether the currency codes are always shown after the amount, independent of the locale.
* **current_currency** - **???**
* **deadline** - The date before which the goal amount has to be gathered.
* **disable_communication** - **???**
* **friends** - No values given (208922 NaN and 300 empty lists)
* **fx_rate** - **???**
* **goal** - Funding amount the project initially asked for **??? USD ???**
* **id** - Project ID
* **is_backing** - No values given (208922 NaN and 300 empty lists)
* **is_starrable** - Whether the project can be marked as favourite or not **???**
* **is_starred** - Whether the project was marked as favourite or not **???**
* **launched_at** - Launch date of the project (timestamp)
* **location** - Project location
* **name** - Project name
* **permissions** - No values given (208922 NaN and 300 empty lists)
* **photo** - Project photo
* **pledged** - Pledge amount in original currency
* **profile** - **???**
* **slug** - **??? Creator-selected keyword id of the project ???**
* **source_url** - URL to the category the project belongs to
* **spotlight** - Spotlight allows creators to make a home for their project on Kickstarter after they've been successfully funded. Each creator can take control of their page and build a customized, central hub for news, updates, links to finished work, and anything else they want the world to know about their project
* **staff_pick** - Staff picks was a feature that highlighted promising projects on the site to give them a boost by helping them get exposure through the email newsletter and highlighted spots around the site. The old 'Kickstarter Staff Pick' badge.
* **state** - Status of the project, i.e. whether it was funded successfully. state is a categorical variable with the levels successful, failed, live, cancelled, undefined and suspended.
* **state_changed_at** - Date the state was changed last (timestamp)
* **static_usd_rate** - Conversion rate used by Kickstarter to calculate usd_pledged
* **urls** - URL of the project's page
* **usd_pledged** - Pledged amount in USD (conversion made by Kickstarter)
* **usd_type** - **???**

## Data types and missing values

In [4]:
df.head(2)

Unnamed: 0,backers_count,blurb,category,converted_pledged_amount,country,created_at,creator,currency,currency_symbol,currency_trailing_code,...,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_pledged,usd_type
0,315,Babalus Shoes,"{""id"":266,""name"":""Footwear"",""slug"":""fashion/fo...",28645,US,1541459205,"{""id"":2094277840,""name"":""Lucy Conroy"",""slug"":""...",USD,$,True,...,babalus-childrens-shoes,https://www.kickstarter.com/discover/categorie...,False,False,live,1548223375,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",28645.0,international
1,47,A colorful Dia de los Muertos themed oracle de...,"{""id"":273,""name"":""Playing Cards"",""slug"":""games...",1950,US,1501684093,"{""id"":723886115,""name"":""Lisa Vollrath"",""slug"":...",USD,$,True,...,the-ofrenda-oracle-deck,https://www.kickstarter.com/discover/categorie...,True,False,successful,1504976459,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",1950.0,domestic


In [5]:
df.tail(2)

Unnamed: 0,backers_count,blurb,category,converted_pledged_amount,country,created_at,creator,currency,currency_symbol,currency_trailing_code,...,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_pledged,usd_type
209220,76,Seattle Transmedia & Independent Film Festival...,"{""id"":295,""name"":""Festivals"",""slug"":""film & vi...",5692,US,1425256957,"{""id"":307076473,""name"":""Timothy Vernor"",""is_re...",USD,$,True,...,transmedia-gallery-space-stiff-2015,https://www.kickstarter.com/discover/categorie...,True,False,successful,1429536379,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",5692.0,domestic
209221,44,The @1000TimesYes 2009 Tweet Box is a handmade...,"{""id"":13,""name"":""Journalism"",""slug"":""journalis...",1293,US,1263225900,"{""id"":1718677513,""name"":""Article"",""slug"":""arti...",USD,$,True,...,the-1000timesyes-2009-tweet-box,https://www.kickstarter.com/discover/categorie...,True,True,successful,1266814815,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",1293.0,domestic


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209222 entries, 0 to 209221
Data columns (total 37 columns):
backers_count               209222 non-null int64
blurb                       209214 non-null object
category                    209222 non-null object
converted_pledged_amount    209222 non-null int64
country                     209222 non-null object
created_at                  209222 non-null int64
creator                     209222 non-null object
currency                    209222 non-null object
currency_symbol             209222 non-null object
currency_trailing_code      209222 non-null bool
current_currency            209222 non-null object
deadline                    209222 non-null int64
disable_communication       209222 non-null bool
friends                     300 non-null object
fx_rate                     209222 non-null float64
goal                        209222 non-null float64
id                          209222 non-null int64
is_backing                  300 

### Missing Data

In [7]:
missing = pd.DataFrame(df.isnull().sum(), columns=['Number'])
missing['Percentage'] = round(missing.Number / df.shape[0] * 100, 1)
missing[missing.Number != 0]

Unnamed: 0,Number,Percentage
blurb,8,0.0
friends,208922,99.9
is_backing,208922,99.9
is_starred,208922,99.9
location,226,0.1
permissions,208922,99.9
usd_type,480,0.2


For the features 'friends', 'is_backing', 'is_starred' and 'permissions', only 0.1 percent of the values are present.
These features are therefore not usable and will be removed from the dataset.
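
The same decision can be written as a threshold rule instead of a hand-picked column list. The following is only a minimal sketch on toy data: the helper name `drop_sparse_columns` and the 0.5 threshold are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, max_missing_share=0.5):
    """Return a copy of `frame` without columns whose share of missing values exceeds the threshold."""
    missing_share = frame.isnull().mean()  # fraction of NaN per column
    keep = missing_share[missing_share <= max_missing_share].index
    return frame[keep].copy()


# Toy data (hypothetical): 'friends' is almost entirely missing, 'blurb' only rarely
toy = pd.DataFrame({
    'blurb': ['a', 'b', None, 'd'],
    'friends': [np.nan, np.nan, np.nan, '[]'],
    'goal': [100.0, 250.0, 500.0, 50.0],
})
cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```

Because the function returns a copy, the original frame stays untouched, which makes it easier to re-run the cell without side effects.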


In [8]:
df.drop(['friends', 'permissions', 'is_backing', 'is_starred'], axis=1, inplace=True);

### Backers

In [9]:
df.backers_count.unique()

array([ 315,   47,  271, ..., 3142, 6586, 1192])

In [10]:
df.rename(columns={'backers_count': 'backers'}, inplace=True)

### Category

In [11]:
df.category.unique()[0]

'{"id":266,"name":"Footwear","slug":"fashion/footwear","position":5,"parent_id":9,"color":16752598,"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/fashion/footwear"}}}'

In [12]:
df_cat = pd.DataFrame.from_dict([json.loads(x) for x in df.category])
df_cat['urls_web'] = pd.DataFrame.from_dict([x for x in df_cat.urls])
df_cat['urls_web_discover'] = pd.DataFrame.from_dict([x for x in df_cat.urls_web])
df_cat.head(3)

Unnamed: 0,color,id,name,parent_id,position,slug,urls,urls_web,urls_web_discover
0,16752598,266,Footwear,9.0,5,fashion/footwear,{'web': {'discover': 'http://www.kickstarter.c...,{'discover': 'http://www.kickstarter.com/disco...,http://www.kickstarter.com/discover/categories...
1,51627,273,Playing Cards,12.0,4,games/playing cards,{'web': {'discover': 'http://www.kickstarter.c...,{'discover': 'http://www.kickstarter.com/disco...,http://www.kickstarter.com/discover/categories...
2,10878931,43,Rock,14.0,17,music/rock,{'web': {'discover': 'http://www.kickstarter.c...,{'discover': 'http://www.kickstarter.com/disco...,http://www.kickstarter.com/discover/categories...


In [13]:
df[['category', 'subcategory']] = df_cat.slug.str.split("/",expand=True,)
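
As an alternative to unpacking the nested `urls` dictionaries level by level, the parsed JSON could be flattened in one step with `pd.json_normalize`. This is a sketch on two toy strings, not the notebook's own approach: the first string is shortened from the example above, the second (a top-level category without a subcategory) is hypothetical.

```python
import json
import pandas as pd

# Toy category strings; the second entry is a hypothetical top-level category
raw = pd.Series([
    '{"id":266,"name":"Footwear","slug":"fashion/footwear","parent_id":9,'
    '"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/fashion/footwear"}}}',
    '{"id":14,"name":"Music","slug":"music",'
    '"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/music"}}}',
])

# json_normalize flattens nested dictionaries into dotted column names
flat = pd.json_normalize([json.loads(s) for s in raw])

# Split the slug once; a top-level category has no subcategory
parts = flat['slug'].str.split('/', n=1, expand=True)
flat['category'] = parts[0]
flat['subcategory'] = parts[1]
print(flat[['category', 'subcategory', 'urls.web.discover']])
```

Rows whose slug contains no `/` end up with a missing subcategory, which is worth keeping in mind before using that column as a feature.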

### Country

In [14]:
df.country.unique()

array(['US', 'GB', 'FR', 'AU', 'NZ', 'ES', 'IT', 'NO', 'NL', 'CA', 'SG',
       'MX', 'SE', 'IE', 'DE', 'BE', 'HK', 'AT', 'JP', 'DK', 'CH', 'LU'],
      dtype=object)

### Currency 

In [15]:
df.groupby(['currency', 'currency_symbol','currency_trailing_code', 'current_currency']).size()

currency  currency_symbol  currency_trailing_code  current_currency
AUD       $                True                    CAD                     11
                                                   GBP                      1
                                                   USD                   4854
CAD       $                True                    AUD                      2
                                                   CAD                     23
                                                   GBP                      2
                                                   USD                   9796
CHF       Fr               False                   CAD                      1
                                                   EUR                      1
                                                   USD                    661
DKK       kr               True                    CAD                      1
                                                   USD                    

In [16]:
df.drop('currency_symbol',axis=1, inplace=True);
df.drop('currency_trailing_code',axis=1, inplace=True);

As the currency symbols are less specific and more ambiguous than the currencies themselves, only the currency will 
be used as a feature.
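
The ambiguity of the symbols can be made explicit by counting how many currencies share each symbol. The sketch below uses a few toy rows that mirror the currency/symbol pairs visible in the grouped output above; it is meant as an illustration, not as a result computed on the full dataset.

```python
import pandas as pd

# Toy rows mirroring currency/symbol pairs from the grouped output above
toy = pd.DataFrame({
    'currency': ['AUD', 'CAD', 'CHF', 'DKK', 'AUD'],
    'currency_symbol': ['$', '$', 'Fr', 'kr', '$'],
})

# Number of distinct currencies behind each symbol
currencies_per_symbol = toy.groupby('currency_symbol')['currency'].nunique()

# Symbols that stand for more than one currency
ambiguous = currencies_per_symbol[currencies_per_symbol > 1]
print(ambiguous)
```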

### Dates

In [17]:
# Convert the Unix timestamps (seconds) to datetime objects in the local time zone of the machine
df_dates = pd.DataFrame([datetime.fromtimestamp(x) for x in df['state_changed_at']], columns=['state_changed'])
df_dates['launched'] = [datetime.fromtimestamp(x) for x in df['launched_at']]
df_dates['created'] = [datetime.fromtimestamp(x) for x in df['created_at']]
df_dates['deadline'] = [datetime.fromtimestamp(x) for x in df['deadline']]
# Time between project creation and deadline
df_dates['duration'] = df_dates['deadline'] - df_dates['created']
df_dates['state'] = df['state']

df_dates.head()

Unnamed: 0,state_changed,launched,created,deadline,duration,state
0,2019-01-23 07:02:55,2019-01-23 07:02:55,2018-11-06 00:06:45,2019-03-14 06:02:55,128 days 05:56:10,live
1,2017-09-09 19:00:59,2017-08-10 19:00:59,2017-08-02 16:28:13,2017-09-09 19:00:59,38 days 02:32:46,successful
2,2013-06-12 07:03:15,2013-05-13 07:03:15,2012-09-30 08:45:33,2013-06-12 07:03:15,254 days 22:17:42,successful
3,2017-03-13 18:22:56,2017-01-12 19:22:56,2017-01-07 10:11:11,2017-03-13 18:22:56,65 days 08:11:45,failed
4,2013-01-09 21:32:07,2012-12-10 21:32:07,2012-12-06 19:04:31,2013-01-09 21:32:07,34 days 02:27:36,successful
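
The conversion could also be done in a vectorised way with `pd.to_datetime`. Note that `unit='s'` interprets the integers as seconds since the epoch in UTC, whereas `datetime.fromtimestamp` uses the local time zone, so the two approaches can differ by the UTC offset. The sketch uses hypothetical timestamps, and the column name `campaign_length` (time from launch to deadline, as opposed to the `duration` above, which starts at creation) is an assumption made for illustration.

```python
import pandas as pd

# Toy Unix timestamps in seconds (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    'created_at': [1500000000, 1510000000],
    'launched_at': [1500086400, 1510604800],
    'deadline': [1502678400, 1513196800],
})

# Vectorised conversion of every column; unit='s' means seconds since the epoch (UTC)
dates = toy.apply(pd.to_datetime, unit='s')

# Time between launch and deadline, i.e. the length of the funding period
dates['campaign_length'] = dates['deadline'] - dates['launched_at']
print(dates['campaign_length'].dt.days.tolist())
```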


### Creator

In [18]:
df.creator.unique()[0]

'{"id":2094277840,"name":"Lucy Conroy","slug":"babalus","is_registered":null,"chosen_currency":null,"avatar":{"thumb":"https://ksr-ugc.imgix.net/assets/023/784/556/6ed11b25c853ec1aef7f4360d0eb59ef_original.jpg?ixlib=rb-1.1.0&w=40&h=40&fit=crop&v=1548222691&auto=format&frame=1&q=92&s=b64463d8ae6195f7aeb62393e2ca2dde","small":"https://ksr-ugc.imgix.net/assets/023/784/556/6ed11b25c853ec1aef7f4360d0eb59ef_original.jpg?ixlib=rb-1.1.0&w=160&h=160&fit=crop&v=1548222691&auto=format&frame=1&q=92&s=00bc518b23a932bd76fb6e21f4eb6834","medium":"https://ksr-ugc.imgix.net/assets/023/784/556/6ed11b25c853ec1aef7f4360d0eb59ef_original.jpg?ixlib=rb-1.1.0&w=160&h=160&fit=crop&v=1548222691&auto=format&frame=1&q=92&s=00bc518b23a932bd76fb6e21f4eb6834"},"urls":{"web":{"user":"https://www.kickstarter.com/profile/babalus"},"api":{"user":"https://api.kickstarter.com/v1/users/2094277840?signature=1552621545.c7a32fed985a78dec253fe61c1acb7a99edbc0af"}}}'

In [39]:
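def dict_extract(search_col, extract_after, end):
    """For each entry, return the text between `extract_after` and the next occurrence of `end`."""
    result_col = []
    # Iterate over the values directly so the function does not depend on the index labels
    for text in search_col.astype('str'):
        pos = text.find(extract_after)
        if pos == -1:
            # Marker not found (e.g. a missing value): store an empty string
            result_col.append('')
            continue
        index_start = pos + len(extract_after)
        index_end = text.find(end, index_start)
        result_col.append(text[index_start:index_end])
    return result_col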

In [80]:
df['creator_id'] = dict_extract(df.creator, '"id":', ',')
df['creator_id'] = df['creator_id'].astype('int')
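
Since the column holds JSON strings, the field could alternatively be read with `json.loads` instead of a substring search, which does not depend on where delimiters happen to appear. This is a sketch under the assumption that every non-missing entry is valid JSON, which has not been verified for the full dataset; the helper name `json_field` is hypothetical.

```python
import json
import pandas as pd


def json_field(column, key):
    """Parse each JSON string in `column` and return the value stored under `key` (None if unavailable)."""
    values = []
    for text in column:
        if isinstance(text, str):
            values.append(json.loads(text).get(key))
        else:
            values.append(None)  # missing entries such as NaN or None
    # dtype=object keeps integers as integers even when None is present
    return pd.Series(values, index=column.index, dtype=object)


# Toy creator strings, shortened from the structure shown above
toy_creator = pd.Series([
    '{"id":2094277840,"name":"Lucy Conroy","slug":"babalus"}',
    '{"id":723886115,"name":"Lisa Vollrath"}',
    None,
])
print(json_field(toy_creator, 'id').tolist())
```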

### FX

In [41]:
(df['fx_rate'] - df['static_usd_rate']).describe().round(4)
df.query('(fx_rate - static_usd_rate) > 0.05').shape[0]

4182

### Pledged Amount

In [42]:
(df['usd_pledged'] - df['static_usd_rate']* df['pledged']).describe().round()

count    209222.0
mean          0.0
std         740.0
min      -50333.0
25%           0.0
50%           0.0
75%           0.0
max      255167.0
dtype: float64

In [43]:
(df['converted_pledged_amount'] - df['usd_pledged']).describe().round()

KeyError: 'converted_pledged_amount'

In [46]:
df.query('(converted_pledged_amount - usd_pledged)/usd_pledged > 0.05').shape[0]

UndefinedVariableError: name 'converted_pledged_amount' is not defined

In [47]:
df['usd_pledged'] = df[['converted_pledged_amount','usd_pledged']].mean(axis=1);

KeyError: "['converted_pledged_amount'] not in index"

In [48]:
df.drop('converted_pledged_amount', axis=1, inplace=True);

KeyError: "['converted_pledged_amount'] not found in axis"

As usd_pledged is the product of the conversion rate and the original pledged amount, the latter two can be removed as features. By definition, converted_pledged_amount and usd_pledged should be identical. converted_pledged_amount will therefore be removed from the dataset after usd_pledged has been adjusted for the deviations between the two.
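
Whether usd_pledged really matches rate times pledged can be checked row by row with a relative tolerance. The sketch below uses hypothetical numbers, and the 1 % tolerance is an arbitrary assumption, not a value used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): the second row deviates from rate * pledged
toy = pd.DataFrame({
    'pledged': [100.0, 200.0, 50.0],
    'static_usd_rate': [1.0, 1.3, 0.75],
    'usd_pledged': [100.0, 275.0, 37.5],
})

expected = toy['static_usd_rate'] * toy['pledged']

# True where usd_pledged matches the product within a relative tolerance of 1 %
consistent = np.isclose(toy['usd_pledged'], expected, rtol=0.01)
print(consistent.tolist())
```

A relative tolerance is used here because absolute differences scale with the size of the pledge, so a fixed absolute cut-off would flag mostly large projects.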

### ID

In [81]:
df.id.nunique()

182264

In [50]:
df.id.unique()[0]

2108505034

### Location

In [56]:
df.location.unique()[0]

'{"id":2462429,"name":"Novato","slug":"novato-ca","short_name":"Novato, CA","displayable_name":"Novato, CA","localized_name":"Novato","country":"US","state":"CA","type":"Town","is_root":false,"urls":{"web":{"discover":"https://www.kickstarter.com/discover/places/novato-ca","location":"https://www.kickstarter.com/locations/novato-ca"},"api":{"nearby_projects":"https://api.kickstarter.com/v1/discover?signature=1552595066.49b64db66a5124f5831752d055cd09aff20cc652&woe_id=2462429"}}}'

In [73]:
df['location_id'] = dict_extract(df.location, '"id":', ',')
df['location_id'].nunique()

15236

In [75]:
df['location_name'] = dict_extract(df.location, '"short_name":"', '",')
df['location_name'].nunique()

14515

### Profile

In [76]:
df.profile[0]

'{"id":3508024,"project_id":3508024,"state":"inactive","state_changed_at":1541459205,"name":null,"blurb":null,"background_color":null,"text_color":null,"link_background_color":null,"link_text_color":null,"link_text":null,"link_url":null,"show_feature_image":false,"background_image_opacity":0.8,"should_show_feature_image_section":true,"feature_image_attributes":{"image_urls":{"default":"https://ksr-ugc.imgix.net/assets/023/667/205/a565fde5382d6b53276597bcbf505af7_original.jpg?ixlib=rb-1.1.0&crop=faces&w=1552&h=873&fit=crop&v=1546238810&auto=format&frame=1&q=92&s=4faccb2ba6fae37a2d990e8471669753","baseball_card":"https://ksr-ugc.imgix.net/assets/023/667/205/a565fde5382d6b53276597bcbf505af7_original.jpg?ixlib=rb-1.1.0&crop=faces&w=560&h=315&fit=crop&v=1546238810&auto=format&frame=1&q=92&s=53798a47ff4e37129dfd4d11827fa5c4"}}}'

In [86]:
df['profile_id'] = dict_extract(df.profile, '"id":', ',"')
df['profile_id'] = df['profile_id'].astype('int')
df['profile_id'].nunique()

182264

The profile_id and id are different variables; however, both seem to identify individual projects.
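
Equal numbers of unique values suggest, but do not prove, that the two columns map one-to-one. One way to check this is to look at the distinct pairs of both ids. The sketch uses hypothetical ids and includes a repeated row, since the full dataset has more rows than unique project ids.

```python
import pandas as pd

# Toy data (hypothetical ids); one project appears twice
toy = pd.DataFrame({
    'id': [11, 22, 22, 33],
    'profile_id': [501, 502, 502, 503],
})

# Distinct (id, profile_id) pairs
pairs = toy[['id', 'profile_id']].drop_duplicates()

# The mapping is one-to-one if neither column repeats among the distinct pairs
one_to_one = pairs['id'].is_unique and pairs['profile_id'].is_unique
print(one_to_one)
```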

## Label and Features

In [None]:
df.state.value_counts()

In [None]:
df.state.hist()

In [None]:
df.describe().round(2)

## Model

### Preprocessing

In [None]:
# Keep only finished projects (successful or failed) and select the feature columns
df1 = df.query('state == "successful" or state == "failed"')
#X = df1.drop(['blurb', 'creator', 'photo', 'profile', 'state', 'urls', 'location', 'creator_RM', 'source_url', 'slug', 'name', 'usd_type'], axis=1)
X = df1[['backers', 'category', 'country', 'created_at', 'currency', 'current_currency', 'deadline', 'disable_communication', 'fx_rate', 'goal', 'is_starrable', 'launched_at', 'pledged', 'spotlight', 'staff_pick', 'state_changed_at', 'static_usd_rate', 'usd_pledged', 'usd_type']]
y = pd.get_dummies(df1[['state']], drop_first=True)
print(X.shape)
print(y.shape)

In [None]:
X.nunique()

In [None]:
# Creating list for categorical predictors/features 
cat_features = list(X.columns[X.dtypes==object])
cat_features

In [None]:
# Creating list for numerical predictors/features
# The target variable 'state' is not part of X, so no column has to be excluded here
num_features = list(X.columns[X.dtypes!=object])
num_features

In [None]:
# Split into train and test set 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RSEED)

In [None]:
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)

In [None]:
X_train.nunique()

In [None]:
y_train.nunique()

### Pipeline

In [None]:
from sklearn.pipeline import Pipeline

# Pipeline for numerical features
num_pipeline = Pipeline([
    ('imputer_num', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler())
])

# Pipeline for categorical features 
cat_pipeline = Pipeline([
    ('imputer_cat', SimpleImputer(strategy='constant', fill_value='missing')),
    ('1hot', OneHotEncoder(handle_unknown='ignore'))
])

In [None]:
from sklearn.compose import ColumnTransformer

# Complete pipeline
preprocessor = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features)
])
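
To see what such a preprocessor produces, it can be applied to a tiny frame. The sketch below rebuilds the same two pipelines on hypothetical toy data with one numerical and one categorical column; the name `toy_preprocessor` is an assumption, and the column lists are hard-coded rather than derived from dtypes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (hypothetical values) with a missing entry in each column
toy = pd.DataFrame({
    'goal': [1000.0, np.nan, 3000.0, 2000.0],
    'country': ['US', 'GB', np.nan, 'US'],
})

toy_preprocessor = ColumnTransformer([
    ('num', Pipeline([('imputer_num', SimpleImputer(strategy='median')),
                      ('std_scaler', StandardScaler())]), ['goal']),
    ('cat', Pipeline([('imputer_cat', SimpleImputer(strategy='constant', fill_value='missing')),
                      ('1hot', OneHotEncoder(handle_unknown='ignore'))]), ['country']),
])

transformed = toy_preprocessor.fit_transform(toy)

# One scaled column for 'goal' plus one indicator column per category (GB, US, missing)
print(transformed.shape)
```

The missing country becomes its own category here, so the number of output columns grows with every distinct value, including the placeholder.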

### Predictive Modelling

In [None]:
# Building a full pipeline with our preprocessor and a LogisticRegression Classifier
pipe_logreg = Pipeline([
    ('preprocessor', preprocessor),
    ('logreg', LogisticRegression(max_iter=1000))
])

In [None]:
# Making predictions on the training set using cross validation
y_train_predicted = cross_val_predict(pipe_logreg, X_train, y_train.values.ravel(), cv=5, verbose=5)

In [None]:
# Calculating accuracy, recall and precision for the LogisticRegression Classifier
print('Cross validation scores:')
print('-------------------------')
print("Accuracy: {:.2f}".format(accuracy_score(y_train, y_train_predicted)))
print("Recall: {:.2f}".format(recall_score(y_train, y_train_predicted)))
print("Precision: {:.2f}".format(precision_score(y_train, y_train_predicted)))
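
The scores above are computed once over the pooled out-of-fold predictions. To also see how much they vary between folds, `cross_validate` (already imported in the set up) accepts several metrics at once. The sketch below runs on synthetic stand-in data; in the notebook, `pipe_logreg`, `X_train` and `y_train.values.ravel()` would take the place of the demo objects.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data, not the Kickstarter data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

scores = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo, y_demo,
    cv=5,
    scoring=['accuracy', 'recall', 'precision'],
)

# Mean and spread per metric across the five folds
for metric in ['accuracy', 'recall', 'precision']:
    fold_scores = scores['test_' + metric]
    print('{}: {:.2f} (+/- {:.2f})'.format(metric, fold_scores.mean(), fold_scores.std()))
```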

### Optimizing via Grid Search

In [None]:
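# Defining parameter space for grid-search. Since we want to access the classifier step in our pipeline 
# we have to add 'logreg__' in front of the corresponding hyperparameters. 
# The default 'lbfgs' solver does not support the l1 penalty, so 'liblinear' is used, which handles both.
param_logreg = {'logreg__penalty': ['l1', 'l2'],
                'logreg__C': [0.01, 0.1, 1, 10, 100],
                'logreg__solver': ['liblinear']
               }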

grid_logreg = GridSearchCV(pipe_logreg, param_grid=param_logreg, cv=3, scoring='accuracy', 
                           verbose=5, n_jobs=-1)

In [None]:
grid_logreg.fit(X_train, y_train.values.ravel());

In [None]:
# Show best parameters
print('Best score:\n{:.2f}'.format(grid_logreg.best_score_))
print("Best parameters:\n{}".format(grid_logreg.best_params_))

In [None]:
# Save best model as best_model
best_model = grid_logreg.best_estimator_['logreg']

### Final Evaluation

In [None]:
# Preparing the test set 
preprocessor.fit(X_train)
X_test_preprocessed = preprocessor.transform(X_test)

In [None]:
# Calculating the accuracy, recall and precision for the test set with the optimized model
y_test_predicted = best_model.predict(X_test_preprocessed)

print("Accuracy: {:.2f}".format(accuracy_score(y_test, y_test_predicted)))
print("Recall: {:.2f}".format(recall_score(y_test, y_test_predicted)))
print("Precision: {:.2f}".format(precision_score(y_test, y_test_predicted)))
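
An alternative to fitting the preprocessor separately is to predict with the whole refitted pipeline, since `best_estimator_` of a grid search over a pipeline contains both the preprocessing and the classifier. The sketch below shows this on synthetic stand-in data and adds a confusion matrix; the `demo_` names and the reduced parameter grid are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (the notebook would use X_train, X_test, y_train, y_test)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=1000)),
])
demo_grid = GridSearchCV(demo_pipe, param_grid={'logreg__C': [0.1, 1, 10]}, cv=3, scoring='accuracy')
demo_grid.fit(X_tr, y_tr)

# best_estimator_ is the whole pipeline refitted on the training data,
# so raw test features can be passed in without a separate transform step
y_pred = demo_grid.best_estimator_.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)
print(cm)
```

Keeping preprocessing and model in one object avoids the risk of transforming the test set with a differently fitted preprocessor.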