




![Kickstarter_logo](Kickstarter_Logo.png)

Launched in 2009, **Kickstarter** is one of the world's leading crowdfunding platforms. As of December 2019, Kickstarter had received more than $4.6 billion in pledges from 17.2 million backers to fund 445,000 projects, such as films, music, stage shows, comics, journalism, video games, technology, publishing, and food-related projects. Its mission is to bring "*creative projects to life*".

Kickstarter uses an all-or-nothing funding model: a project is funded only if it reaches its goal amount; otherwise, backers pay nothing.
Many factors contribute to the success or failure of a project, both in general and on Kickstarter. Some of them can be quantified or categorized, which makes it possible to build a model that attempts to predict whether a project will succeed.

## Objective of the Analysis 

1. The primary **business objective** is to give creators recommendations on how to launch a successful Kickstarter campaign.

2. Determine which factors decide whether or not a project will achieve its funding goal.

### How do we define a "successful" Project? 

Success in the context of the Kickstarter Dataset is defined as achieving the funding goal.
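
As a minimal sketch of how this definition could be turned into a model target, the snippet below builds a binary label from the `state` column of a small made-up dataframe. Restricting the data to `'successful'` and `'failed'` campaigns and the column name `is_successful` are assumptions made for illustration, not steps taken from the analysis itself.

```python
import pandas as pd

# Toy data: column names follow the dataset description, values are made up.
toy = pd.DataFrame({
    "goal":    [1000.0, 5000.0, 250.0, 800.0],
    "pledged": [1500.0, 1200.0, 250.0, 900.0],
    "state":   ["successful", "failed", "successful", "canceled"],
})

# Assumption: only finished campaigns ('successful' or 'failed') receive a label.
finished = toy[toy["state"].isin(["successful", "failed"])].copy()

# Binary target: 1 if the campaign reached its funding goal, 0 otherwise.
finished["is_successful"] = (finished["state"] == "successful").astype(int)

# Sanity check: a successful campaign should have pledged at least its goal.
finished["reached_goal"] = finished["pledged"] >= finished["goal"]

print(finished[["goal", "pledged", "state", "is_successful", "reached_goal"]])
```

Comparing `is_successful` with `reached_goal` is a cheap way to spot rows where the recorded state and the pledged amount disagree.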

## Project Requirements: 

1. Try at least three different machine learning algorithms to check which performs best on the problem at hand.
2. What would be the right performance metric: precision, recall, accuracy, F1 score, or something else? (Check TPR?)

**Hint**: Check for Data imbalance
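
The hint can be illustrated with a small sketch. The labels below are made up (an 80/20 split is an assumption, not the distribution of the Kickstarter data). The snippet shows how to inspect class shares and why accuracy alone can mislead when classes are imbalanced.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Hypothetical labels: 1 = successful, 0 = failed (made-up 80/20 split).
y_true = pd.Series([1] * 8 + [0] * 2)

# Relative class frequencies reveal the imbalance.
class_share = y_true.value_counts(normalize=True)
print(class_share)

# A naive baseline that always predicts the majority class.
y_pred = pd.Series([1] * len(y_true))

accuracy = accuracy_score(y_true, y_pred)
# Metrics for the minority class (label 0): the baseline never finds a failed project.
minority_recall = recall_score(y_true, y_pred, pos_label=0, zero_division=0)
minority_f1 = f1_score(y_true, y_pred, pos_label=0, zero_division=0)

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}, minority F1={minority_f1:.2f}")
```

The baseline looks accurate although it never identifies a single failed project, which is why recall or F1 for the class of interest is usually reported alongside accuracy.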


## Description of the Dataset 

**'backers_count':** Number of folks who pledge money to join creators in bringing projects to life

**'blurb':** Description of the project / company

**'category':** Describes the topic of the project (e.g. music, fashion)

**'converted_pledged_amount':** Amount of money pledged, converted to the currency given in the `current_currency` column

**'country':** Country the project creator is from

**'created_at':** Date and time when the project was initially created on Kickstarter

**'creator':** The person or team behind the project idea, working to bring it to life

**'currency':** Name of original currency 

**'currency_symbol':** corresponding currency symbol

**'currency_trailing_code':** ?

**'current_currency':** Currency after the conversion has taken place

**'deadline':** Final crowdfunding date

**'disable_communication':** Whether or not a project creator is able to communicate with their backers

**'friends':** unclear, null or empty

**'fx_rate':** Foreign exchange rate between the original currency and the current currency

**'goal':** The amount of money that a creator needs to complete their project. Minimum requirement for the project to be financed

**'id':** Project ID

**'is_backing':** 

**'is_starrable':** provides the option to leave a star review

**'is_starred':** has received a star review

**'launched_at':** Date and time when the project was launched for funding on Kickstarter

**'location':** Contains the town or city of the project creator

**'name':** Name of the campaign

**'permissions':** unclear; is either NA or empty in the dataset

**'photo':** Contains a link to, and information about, the project's photos

**'pledged':** Amount pledged by the contributors in the original currency 

**'profile':** Details about the project's profile, including its ID number and various visual settings

**'slug':** Name of the project with hyphens and lowercase letters instead of spaces and uppercase letters

**'source_url':** link to the project category on the Kickstarter website

**'spotlight':** Option to put the campaign in a spotlight via a landing page on Kickstarter after it has been successfully financed

**'staff_pick':** Whether a project was handpicked and highlighted by the Kickstarter team. These projects are displayed favorably on the Kickstarter page.

**'state':** Status of the campaign that can be classified into one of the following categories:
   * *'successful'*: The project has achieved its funding goal and is neither canceled nor suspended. (Only classified as successful after the deadline has passed?)
   * *'failed'*: The project failed to achieve its funding goal by the deadline.
   * *'live'*: The campaign is still ongoing, regardless of whether it has already achieved its funding goal.
   * *'suspended'*: A project may be suspended if the Trust & Safety team uncovers evidence that it violates Kickstarter's rules.
   * *'canceled'*: A project may be canceled if the creator wants to make major changes to the project, such as the funding goal or campaign duration, or wants to rework the idea and start again.
   
**'state_changed_at':** Date and time when a project status was changed (e.g. from live to successful / failed)

**'static_usd_rate':** Conversion rate between the original currency and USD

**'urls':** link to the creator's campaign on Kickstarter

**'usd_pledged':** Pledged amount converted to USD by Kickstarter

**'usd_type':** unclear, classifies either as domestic or international 

##  Import Libraries

In [63]:
import pandas as pd
import numpy as np
import glob
import matplotlib.pyplot as plt
import seaborn as sns
import time
import calendar

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, plot_confusion_matrix
from sklearn.metrics import roc_curve, accuracy_score, precision_recall_curve, f1_score, precision_score, recall_score
from sklearn.preprocessing import StandardScaler


%matplotlib inline
sns.set_theme(palette="light:#5A9")
sns.set_context("paper", rc={"font.size":8,"axes.titlesize":8,"axes.labelsize":5})

##  Import the Datasets

Our dataset is split across 55 CSV files. As a first step, we load all of them and concatenate them into a single dataframe.

In [58]:
df = pd.concat([pd.read_csv(i) for i in glob.glob("data/Kickstarter*.csv")], ignore_index=True)

In [59]:
df.shape

(209222, 37)

**Remark:**
Our combined Kickstarter dataset has a total of 209222 observations and 37 columns.  

## Data Cleaning

### Understanding the dataset

In [72]:
df.head(2).T

|  | 0 | 1 |
|---|---|---|
| backers_count | 315 | 47 |
| blurb | Babalus Shoes | A colorful Dia de los Muertos themed oracle de... |
| category | {"id":266,"name":"Footwear","slug":"fashion/fo... | {"id":273,"name":"Playing Cards","slug":"games... |
| converted_pledged_amount | 28645 | 1950 |
| country | US | US |
| created_at | 1541459205 | 1501684093 |
| creator | {"id":2094277840,"name":"Lucy Conroy","slug":"... | {"id":723886115,"name":"Lisa Vollrath","slug":... |
| currency | USD | USD |
| currency_symbol | \$ | \$ |
| currency_trailing_code | True | True |


In [73]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209222 entries, 0 to 209221
Data columns (total 37 columns):
backers_count               209222 non-null int64
blurb                       209214 non-null object
category                    209222 non-null object
converted_pledged_amount    209222 non-null int64
country                     209222 non-null object
created_at                  209222 non-null int64
creator                     209222 non-null object
currency                    209222 non-null object
currency_symbol             209222 non-null object
currency_trailing_code      209222 non-null bool
current_currency            209222 non-null object
deadline                    209222 non-null int64
disable_communication       209222 non-null bool
friends                     300 non-null object
fx_rate                     209222 non-null float64
goal                        209222 non-null float64
id                          209222 non-null int64
is_backing                  300 

**Commentary**:

Looking at the output of `head()` and `info()`, a number of columns seem to contain information that is not usable in its current format.

In addition, `permissions`, `is_backing`, `friends` and `is_starred` have only 300 non-null observations each, so we should consider removing them. In contrast, the remaining columns are almost complete.

Furthermore, the time-related columns are stored as integer Unix timestamps rather than dates and therefore need to be converted.
Finally, the columns `name` and `slug` contain the same information in slightly different forms, so we can consider removing one of them.

#### Remove columns

In [80]:
df.drop(labels=["permissions", "is_backing", "friends", "is_starred", 'name'], 
        axis=1, 
        inplace=True)

In [82]:
df.shape

(209222, 32)

#### Convert time columns to dates 

In [83]:
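# Named `time_cols` so that the list does not shadow the imported `time` module.
time_cols = ["created_at", "deadline", "launched_at", "state_changed_at"]

In [84]:
# The raw values are Unix timestamps in seconds.
for col in time_cols:
    df[col] = pd.to_datetime(df[col], unit="s")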

In [85]:
df.head(1).T

|  | 0 |
|---|---|
| backers_count | 315 |
| blurb | Babalus Shoes |
| category | {"id":266,"name":"Footwear","slug":"fashion/fo... |
| converted_pledged_amount | 28645 |
| country | US |
| created_at | 2018-11-05 23:06:45 |
| creator | {"id":2094277840,"name":"Lucy Conroy","slug":"... |
| currency | USD |
| currency_symbol | \$ |
| currency_trailing_code | True |


#### Check and remove duplicates 

Duplicate entries would change the distribution of our dataset by weighting some observations more than others. We can use the `id` column to check for duplicates in the dataset. 

In [86]:
df.duplicated(subset='id', keep='first').sum()/len(df['id'])

0.12884878263280153

The dataset contains around 13% duplicate entries, which need to be addressed. Before removing them, we need to check whether the rows sharing an `id` are identical in all columns or merely have the same `id`.

In [101]:
df_duplicates = df[df.duplicated(['id'], keep = False)]

In [105]:
df_duplicates.sort_values(by="id").tail(2).T

|  | 120077 | 159940 |
|---|---|---|
| backers_count | 709 | 709 |
| blurb | Over 6,000 thoughtfully designed icons that ar... | Over 6,000 thoughtfully designed icons that ar... |
| category | {"id":51,"name":"Software","slug":"technology/... | {"id":51,"name":"Software","slug":"technology/... |
| converted_pledged_amount | 70751 | 70751 |
| country | US | US |
| created_at | 2017-06-01 12:43:16 | 2017-06-01 12:43:16 |
| creator | {"id":876759573,"name":"Jory Raphael","slug":"... | {"id":876759573,"name":"Jory Raphael","slug":"... |
| currency | USD | USD |
| currency_symbol | \$ | \$ |
| currency_trailing_code | True | True |


The entries do appear to be identical, so we keep only the first occurrence of each `id` and drop the rest.
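
Inspecting two rows by eye is only a spot check. A more systematic check could look like the sketch below, which uses a small made-up dataframe instead of the real data and flags every `id` whose rows differ in at least one column. The variable names are illustrative.

```python
import pandas as pd

# Toy data: id 1 is an exact duplicate, id 3 appears twice with different values.
toy = pd.DataFrame({
    "id":            [1, 1, 2, 3, 3],
    "backers_count": [10, 10, 5, 7, 8],
    "state":         ["live", "live", "failed", "live", "successful"],
})

# All rows whose id occurs more than once.
same_id = toy[toy.duplicated(subset="id", keep=False)]

# Count distinct full rows per id; more than one means the "duplicates" differ.
distinct_rows_per_id = same_id.drop_duplicates().groupby("id").size()
conflicting_ids = distinct_rows_per_id[distinct_rows_per_id > 1].index.tolist()

print(conflicting_ids)
```

If the list of conflicting ids is empty, dropping on `id` alone loses no information. Otherwise, it is worth deciding explicitly which version of each project to keep, for example the most recent one.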

In [107]:
df.drop_duplicates(subset='id', keep='first', inplace=True)

In [108]:
df.shape

(182264, 32)

After removing the duplicate entries, we have 182264 observations left.

#### Check and treat missing values

In [115]:
df.isna().sum()

backers_count                 0
blurb                         8
category                      0
converted_pledged_amount      0
country                       0
created_at                    0
creator                       0
currency                      0
currency_symbol               0
currency_trailing_code        0
current_currency              0
deadline                      0
disable_communication         0
fx_rate                       0
goal                          0
id                            0
is_starrable                  0
launched_at                   0
location                    224
photo                         0
pledged                       0
profile                       0
slug                          0
source_url                    0
spotlight                     0
staff_pick                    0
state                         0
state_changed_at              0
static_usd_rate               0
urls                          0
usd_pledged                   0
usd_type

Overall, the number of missing values is very limited. 

For `blurb`, which has only 8 missing values, we can look at the individual data points to see whether the missing information can be filled in manually.

Regarding `usd_type`, we will drop the entire column because its meaning is not clear. 

Lastly, `location` has 224 missing values. Here we might be able to extract the information from another column. 

Let's start with `blurb`. 

In [163]:
df[df["blurb"].isnull()]

|  | backers_count | blurb | category | converted_pledged_amount | country | created_at | creator | currency | currency_symbol | currency_trailing_code | ... | slug | source_url | spotlight | staff_pick | state | state_changed_at | static_usd_rate | urls | usd_pledged | usd_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 65168 | 39 | NaN | {"id":269,"name":"Ready-to-wear","slug":"fashi... | 8675 | DE | 2017-09-02 14:59:35 | {"id":1303591875,"name":"Annabelle Deisler","i... | EUR | € | False | ... | serious-business-collection | https://www.kickstarter.com/discover/categorie... | False | False | failed | 2017-10-10 08:46:30 | 1.2037 | {"web":{"project":"https://www.kickstarter.com... | 8873.674115 | international |
| 65299 | 0 | NaN | {"id":311,"name":"Food Trucks","slug":"food/fo... | 0 | US | 2016-09-08 19:44:28 | {"id":874463436,"name":"LeMae Fitzwater","is_r... | USD | \$ | True | ... | foragers-cuisine-food-truck | https://www.kickstarter.com/discover/categorie... | False | False | canceled | 2016-09-15 19:35:46 | 1.0 | {"web":{"project":"https://www.kickstarter.com... | 0.0 | domestic |
| 90055 | 0 | NaN | {"id":20,"name":"Conceptual Art","slug":"art/c... | 0 | US | 2015-02-16 16:19:14 | {"id":1316410093,"name":"Rumi Forum","slug":"i... | USD | \$ | True | ... | international-festival-of-language-and-culture | https://www.kickstarter.com/discover/categorie... | False | False | canceled | 2015-02-20 16:21:07 | 1.0 | {"web":{"project":"https://www.kickstarter.com... | 0.0 | domestic |
| 99620 | 242 | NaN | {"id":339,"name":"Sound","slug":"technology/so... | 54599 | GB | 2015-07-02 09:51:04 | {"id":161070731,"name":"ACWorldwide","slug":"a... | GBP | £ | False | ... | star-wars-bluetooth-speakers | https://www.kickstarter.com/discover/categorie... | False | False | canceled | 2015-10-02 22:30:21 | 1.516337 | {"web":{"project":"https://www.kickstarter.com... | 54676.079907 | domestic |
| 108662 | 0 | NaN | {"id":21,"name":"Digital Art","slug":"art/digi... | 0 | US | 2017-11-03 03:24:21 | {"id":1454907110,"name":"moe","is_registered":... | USD | \$ | True | ... | charivari | https://www.kickstarter.com/discover/categorie... | False | False | failed | 2018-01-12 23:34:08 | 1.0 | {"web":{"project":"https://www.kickstarter.com... | 0.0 | domestic |
| 117999 | 0 | NaN | {"id":286,"name":"Spaces","slug":"theater/spac... | 0 | US | 2015-12-08 01:17:09 | {"id":376626888,"name":"Amanda Donnadio (delet... | USD | \$ | True | ... | long-island-school-auditorium | https://www.kickstarter.com/discover/categorie... | False | False | canceled | 2015-12-08 10:43:04 | 1.0 | {"web":{"project":"https://www.kickstarter.com... | 0.0 | domestic |
| 169430 | 0 | NaN | {"id":20,"name":"Conceptual Art","slug":"art/c... | 0 | US | 2012-03-06 19:47:56 | {"id":79887943,"name":"Brian Mercer","is_regis... | USD | \$ | True | ... | the-lineup-0 | https://www.kickstarter.com/discover/categorie... | False | False | canceled | 2012-03-12 19:42:07 | 1.0 | {"web":{"project":"https://www.kickstarter.com... | 0.0 | domestic |
| 183688 | 2 | NaN | {"id":351,"name":"Printing","slug":"crafts/pri... | 20 | US | 2014-08-02 15:05:38 | {"id":2029667279,"name":"Danger Grills","slug"... | USD | \$ | True | ... | online-sticker-book-vending-machine | https://www.kickstarter.com/discover/categorie... | False | False | canceled | 2014-08-18 03:52:00 | 1.0 | {"web":{"project":"https://www.kickstarter.com... | 20.0 | domestic |


We tried to retrieve the missing `blurb` information from the campaign websites, but the pages are no longer active. Hence, we will drop the 8 rows with missing information. Given the size of our dataset, this loses very little information.

In [168]:
df.dropna(subset=["blurb"], inplace=True)

In [169]:
df.shape

(182256, 32)

All 8 missing values for **blurb** have been removed. Next, we will drop the **usd_type** column.

In [171]:
df.drop(labels="usd_type", axis=1, inplace=True)

In [161]:
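# Use the full option names: short patterns can match more than one pandas option.
pd.set_option("display.max_colwidth", 50)
pd.set_option("display.max_columns", 20)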
df[4:5][["blurb", "location", "urls"]].T

|  | 4 |
|---|---|
| blurb | Livng with a brain impairment, what its like t... |
| location | {"id":2507703,"name":"Traverse City","slug":"t... |
| urls | {"web":{"project":"https://www.kickstarter.com... |


#### Extract relevant information from object columns by retrieving the text 

We need to retrieve the relevant information from the following columns:
1. category
2. creator
3. location
4. photo
5. profile
6. source_url
7. urls
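
One possible way to pull fields out of the JSON-like columns is sketched below. It assumes the cells are valid JSON strings with keys such as `name` and `slug`, as the truncated output above suggests. The sample values, the helper `extract_field`, and the idea that the part of the slug before `/` names a parent category are illustrative assumptions.

```python
import json

import pandas as pd

# Made-up rows that mimic the structure visible in the `category` column.
toy = pd.DataFrame({
    "category": [
        '{"id": 266, "name": "Footwear", "slug": "fashion/footwear"}',
        '{"id": 273, "name": "Playing Cards", "slug": "games/playing cards"}',
    ]
})


def extract_field(raw, key):
    """Return `key` from a JSON string, or None if parsing fails or the key is absent."""
    try:
        return json.loads(raw).get(key)
    except (TypeError, ValueError, AttributeError):
        return None


# Extract the human-readable category name.
toy["category_name"] = toy["category"].apply(extract_field, key="name")

# Assumption: the part of the slug before "/" is the parent category.
toy["parent_category"] = (
    toy["category"].apply(extract_field, key="slug").str.split("/").str[0]
)

print(toy[["category_name", "parent_category"]])
```

Returning `None` for unparsable cells keeps the step from failing on missing or malformed entries, which can then be counted with `isna()` before deciding how to treat them.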
