# KICKSTARTER - group 6

## From Business Problem to ML Problem

#### BUSINESS OVERVIEW
Kickstarter is a crowdfunding platform that allows people to support creative projects. Film, gaming, and music, as well as art, design, and technology projects, are all covered.

The dataset at hand was crawled from the platform and contains detailed information about all current and historic projects on Kickstarter, as well as their status (successful, failed, canceled, live, suspended). 

Every project creator sets a funding goal and a deadline for their project. People who like the idea can pledge money to help make it a reality. Funding on Kickstarter is *“all-or-nothing”*. If the project meets its funding goal, all backers' credit cards are charged after the deadline passes and Kickstarter deducts a 5% fee from the pledged amount. If the project falls short of its funding goal, no one is charged.
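
The all-or-nothing rule can be illustrated with a small sketch. The function name `creator_payout` is hypothetical, and only the 5% platform fee mentioned above is modelled; any other fees are ignored.

```python
def creator_payout(goal: float, pledged: float, fee_rate: float = 0.05) -> float:
    """Return the amount a creator receives under all-or-nothing funding.

    Assumption: only the 5% Kickstarter fee is deducted; other fees are ignored.
    """
    if pledged < goal:
        # Goal missed: backers are not charged, so the creator receives nothing
        return 0.0
    # Goal met: backers are charged and the fee is taken from the pledged amount
    return pledged * (1 - fee_rate)


print(creator_payout(goal=10_000, pledged=12_000))  # goal met
print(creator_payout(goal=10_000, pledged=9_999))   # goal missed
```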

Project creators retain complete ownership of their work. After a project is deemed successful, Kickstarter cannot be used to seek loans or to give financial returns or equity. Backers can support projects to help them come to life, not to profit monetarily. 

According to the platform’s website, 10% of projects finished without ever receiving a single pledge, while 78% of projects that raised more than 20% of their goal were successfully funded. Kickstarter therefore has great potential to bring ideas to fruition.

#### PROJECT’S GOAL

During the course of this project, we will take the perspective of project creators to assist them in optimizing their proposal. 

For a project’s success or failure on crowdfunding platforms, it’s important to consider the influence of all the factors characterizing that project. Some of these factors can be measured or classified, allowing for the development of a model to forecast whether a project will succeed or fail. 

Some projects are more successful than others, and our intuition is that this does not always depend on the key idea. Some projects might fail to reach their target audience (backers) because of a poor description, an uncommon topic, or a funding goal that is too high, or simply because the project does not seem trustworthy.

The goal of this project is to analyze Kickstarter projects’ data and build a useful model for project creators to understand which features attract backers the most or which projects are most likely to collect a higher amount. We will try to find the main patterns and the odds of a project’s success. Thanks to this model, decision makers (project creators) will gain useful insights before publishing their project on the platform.

#### PROBLEM DEFINITION

In order to achieve the goal explained above, we will use a dataset crawled from Kickstarter, which contains detailed information about all current and historic projects on Kickstarter, as well as their status (successful, failed, canceled, live, suspended). The dataset contains all the projects hosted between **MISSING** and **MISSING**.

Given the **large amount** of original data available (205696 projects with more than 37 variables) and the reasons listed below, it is reasonable to use automation to solve this problem. 
- There is no existing formula to answer the main question. The features of each project set on the platform contribute in different ways to its success, and this cannot be translated into simple rules.
- Analyzing the probability of success and the main drivers of the outcome project by project would **not** be **feasible**. 
- Some columns like the description of the project (blurb) contain **unstructured text** which needs to be analyzed. Nevertheless, there are few features with unstructured text. 

All in all, there is a big potential for data to be **represented in a meaningful way**, with both numbers and categorical values (e.g. state, status, location). 

#### STRUCTURE OF THE DATA

The amount of data available is enough to build a machine learning model. We have information regarding:
- The type of the project (category, subcategory, brief description, its profile)
- The creator
- The start date, the duration of the crowdfunding, and the date when the status of the project was changed 
- The funding goal, the pledged amount, the original currency, the exchange rate and the converted pledged amount. 
- The number of backers achieved. 
For more information regarding the variables available, please see the **DATA CLEANING** section.

The **quality** and the **quantity** of data are fundamental to building an efficient model. The data available is largely complete and consistent across the datasets. However, some variables are almost entirely empty (friends, is_starred, etc.) and others are stored in a format that must be parsed before use (category, creator, location, etc.). 

Around 55% of the projects are successful, ~36% are labeled as failed, and the rest are live, canceled or suspended. This means we have little information about canceled projects. Nevertheless, we have a large and reasonably balanced number of successful and failed projects. 

12% of the projects are current ones, while around 88% are past projects. This difference is valuable since we will work on past projects to build an efficient model and apply it to current projects. 

Projects are split into 15 categories and 159 subcategories. As we can see on the right, Music, Film & Video and Technology are the categories with the most projects, while Dance has only 3 subcategories and 3156 projects (less than 2% of the entire dataset).

All in all, we can state the quality of data is good enough to work on it and create a model. 

The data show **regular patterns** between the independent variables (inputs) and the final result (success/failure, pledged amount).
These patterns are necessary for the model to learn from and to produce a valid output.

**WE MISS A PART DESCRIBING IF WE DROP SOME VARIABLES AND WHY, IF WE WILL USE NLP AND WHY, AND PARTS LIKE Some features were initially retained for exploratory data analysis (EDA) purposes, but were then dropped in order to use machine learning models. These included features that are related to outcomes (e.g. the amount pledged and the number of backers) rather than related to the properties of the project itself (e.g. category, goal, length of campaign). IN THE EDA**

 
#### EVALUATION OF THE MODEL – criteria
**TO BE REVIEWED**
1. Proportion of the projects where the model accurately predicted the final success or failure of the project. 
2. The difference between the level of the pledged amount predicted by the model against the amount achieved 
3. The number of backers predicted for the specific type of project from the model versus the actual amount of backers
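
The three criteria above could be computed roughly as follows. This is a minimal sketch on made-up values: the source only speaks of a "difference" for criteria 2 and 3, so the use of the mean absolute error here is an assumption, and the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Share of projects whose success/failure was predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def mean_absolute_error(actual, predicted):
    """Average absolute gap between predicted and actual values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up values for four hypothetical projects (1 = successful, 0 = failed)
state_true, state_pred = [1, 0, 1, 1], [1, 0, 0, 1]
pledged_true, pledged_pred = [5000.0, 120.0, 900.0, 15000.0], [4500.0, 300.0, 1000.0, 14000.0]
backers_true, backers_pred = [80, 3, 20, 210], [70, 5, 25, 200]

print("Accuracy:", accuracy(state_true, state_pred))                        # criterion 1
print("MAE pledged:", mean_absolute_error(pledged_true, pledged_pred))      # criterion 2
print("MAE backers:", mean_absolute_error(backers_true, backers_pred))      # criterion 3
```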


## Loading data

Import the required libraries:

In [1]:
import pandas as pd
import numpy as np
import os
import datetime


Merge all the csv files to have all the data together.

In [None]:
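# Collect only the CSV files from the dataset folder
folder = 'Kickstarter_Dataset'
files = [file for file in os.listdir(folder) if file.endswith('.csv')]

# Read every file and concatenate once (faster than growing a DataFrame inside the loop)
all_df = pd.concat([pd.read_csv(os.path.join(folder, file)) for file in files], ignore_index=True)

all_df.to_csv("Kickstarter_Complete.csv", index = False)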

Import the complete dataset.

In [None]:
df = pd.read_csv("Kickstarter_Complete.csv")

df.head()

In [None]:
df.dtypes

## Data Cleaning

#### Columns to delete.

We decided to delete the following columns: currency_symbol, id, photo, permissions(276), friends(274), source_url, is_backing (276), is_starred (276). 

In [None]:
# Drop all unwanted columns in a single call
df = df.drop(columns=['currency_symbol', 'id', 'photo', 'permissions', 'friends',
                      'source_url', 'is_backing', 'is_starred'])

#### Rename backers_count into nr_backers.
backers_count shows the number of backers for that project.

In [None]:
df=df.rename(columns={"backers_count":"nr_backers"})

#### Create 3 new columns from the category column: category, subcategory and subcategory_id.

In [None]:
df=df.rename(columns={"category":"Category"})

In [None]:
df['category'] = df['Category'].apply(lambda x: x.split('"slug":"')[1].split('/')[0])
df['category'] = df['category'].apply(lambda x: x.split('"')[0])
df['subcategory'] = df['Category'].apply(lambda x: x.split('"name":"')[1].split('"')[0])
df['subcategory_id'] = df['Category'].apply(lambda x: x.split('"id":')[1].split(',')[0])

In [None]:
del df['Category']
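
The string splitting above relies on the exact character layout of each entry. Assuming the raw category entries are valid JSON strings (an assumption, since the raw data is not shown here), a parser-based variant could look like the sketch below. The helper name `parse_category` and the example string are made up for illustration.

```python
import json


def parse_category(raw: str) -> dict:
    """Extract category, subcategory and subcategory_id from one JSON string."""
    record = json.loads(raw)
    return {
        # The slug looks like "fashion/footwear"; the part before "/" is the main category
        "category": record["slug"].split("/")[0],
        "subcategory": record["name"],
        "subcategory_id": record["id"],
    }


# Made-up example in the shape the string-splitting code expects
example = '{"id":266,"name":"Footwear","slug":"fashion/footwear","position":5}'
print(parse_category(example))
```

Applied to the notebook, something like `df['Category'].apply(parse_category).apply(pd.Series)` would yield the three columns at once. Note that the id comes back as an integer here, whereas the splitting approach keeps it as a string.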

#### Modify the date time columns.

In [None]:
df['created_at'] = pd.to_datetime(df['created_at'], unit="s").dt.date
df['state_changed_at'] = pd.to_datetime(df['state_changed_at'], unit="s").dt.date
df['deadline'] = pd.to_datetime(df['deadline'], unit="s").dt.date
df['launched_at'] = pd.to_datetime(df['launched_at'], unit="s").dt.date
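
The conversion above treats the raw columns as Unix timestamps in seconds. The sketch below shows the same conversion with the standard library only, and how a campaign duration could be derived from it. The timestamps are made up and `campaign_days` is a hypothetical name that does not exist in the dataset.

```python
from datetime import date, datetime, timezone


def to_date(unix_seconds: int) -> date:
    """Convert a Unix timestamp in seconds to a calendar date (UTC)."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc).date()


# Made-up timestamps for one hypothetical project
launched_at = 1546300800  # 2019-01-01 00:00:00 UTC
deadline = 1548892800     # 2019-01-31 00:00:00 UTC

# Duration of the crowdfunding campaign in whole days
campaign_days = (to_date(deadline) - to_date(launched_at)).days
print(to_date(launched_at), to_date(deadline), campaign_days)
```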

In [None]:
df.head()

#### Create 3 new columns from the creator one: creator_id, creator_name, is_creator_registered.

In [None]:
df['creator_id'] = df['creator'].apply(lambda x: x.split('"id":')[1].split(',')[0])
df['creator_name'] = df['creator'].apply(lambda x: x.split('"name":"')[1].split('"')[0])
df['is_creator_registered'] = df['creator'].apply(lambda x: x.split('"is_registered":')[1].split(',')[0])

In [None]:
del df['creator']

#### Create 2 new columns from the location one: nation and city.

In [None]:
df['nation'] = df['location'].astype(str).apply(lambda x: x.split('"state":"')[1].split('"')[0] if len(x.split('"state":"'))>1 else x.split('-')[0])
df['city'] = df['location'].astype(str).apply(lambda x: x.split('"name":"')[1].split('"')[0] if len(x.split('"name":"'))>1 else x.split('-')[0])

In [None]:
del df['location']

#### Create 2 columns from the profile one: project_id and project_status.

In [None]:
df['project_id'] = df['profile'].apply(lambda x: x.split('"id":')[1].split(',')[0])
df['project_status'] = df['profile'].apply(lambda x: x.split('"state":"')[1].split('"')[0])

In [None]:
del df['profile']

#### Modify the urls column.

In [None]:
df['url'] = df['urls'].apply(lambda x: x.split('"project":"')[1].split('"')[0])

In [None]:
del df['urls']

#### Ordering the columns.

In [None]:
df = df[['project_id', 'state', 'name', 'slug', 'blurb', 'url', 'category', 'subcategory','subcategory_id', 
         'creator_id', 'creator_name', 'is_creator_registered', 'country', 'nation', 'city', 'created_at', 
         'launched_at','deadline', 'nr_backers', 'goal', 'pledged', 'currency', 'usd_pledged', 'current_currency', 
         'fx_rate', 'static_usd_rate', 'currency_trailing_code', 'usd_type', 'project_status', 'state_changed_at',  
         'disable_communication', 'is_starrable', 'spotlight', 'staff_pick' ]]

In [None]:
df.to_csv("Cleaned_Kickstarter.csv", index = False)

In [None]:
df.dtypes

#### Description of our final variables

- project_id: id of the project.
- state: status of the project (successful, failed, canceled, live, suspended)
- name: name of the project.
- slug: nickname of the project.
- blurb: short description of what the project is about.
- url: url of the project.
- category: category of the project.
- subcategory: subcategory of the project.
- subcategory_id: id of the subcategory of the project.
- creator_id: id of the creator of the project.
- creator_name: name of the creator of the project.
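- is_creator_registered: whether the creator is a registered user (parsed from the is_registered field of the creator column).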
- country: country where the project has originated.
- nation: state or region where the project originated (parsed from the state field of the location column).
- city: city where the project has originated.
- created_at: when the project has been created - yyyy/mm/dd.
- launched_at: launch date of the project - yyyy/mm/dd.
- deadline: deadline of the project - yyyy/mm/dd.
- nr_backers: number of backers for the project.
- goal: amount of money for reaching the goal.
- pledged: pledged amount in the initial currency.
- currency: currency of the project.
- usd_pledged: pledged amount multiplied by the static usd rate.
- current_currency: current currency of the project.
- fx_rate: exchange rate.               
- static_usd_rate.      
- currency_trailing_code.   
- usd_type: international or domestic.               
- project_status.          
- state_changed_at: when the state of the project changed - yyyy/mm/dd.         
- disable_communication: status about communication, is false for all campaigns that have ended.
- is_starrable: how successful Kickstarter believes the campaign will be.           
- spotlight: after your project is successfully funded you will gain access to the Spotlight page tool which allows you to make a home for your project.   
- staff_pick: feature that highlights promising projects to give them a boost, providing exposure through the email newsletter and highlighted spots around the site.

##  Exploratory Data Analysis

### Main Statistics

How many successful/failed/canceled projects?

In [None]:
df['state'].value_counts(normalize=True) * 100

Statistics regarding categories and subcategories available

In [None]:
nr_category = df['category'].nunique()
nr_subcategory = df['subcategory'].nunique()
active_projects = df['project_status'].value_counts()

print(f'There are {nr_category} categories')
print(f'There are {nr_subcategory} subcategories')
# value_counts() sorts by frequency, so past projects come first and current ones second
print(f'There are {active_projects.iloc[1]} current projects and {active_projects.iloc[0]} past ones')

In [None]:
df_category = pd.DataFrame({ 'Nr of subcategories': df.groupby('category')['subcategory'].nunique(),
                            'Projects per category': df.groupby('category')['project_id'].nunique()
                           }).sort_values('Projects per category', ascending = False)
df_category["Frequency"] = df_category['Projects per category']/df_category['Projects per category'].sum()*100

df_category


In [None]:
cat_sub = df.groupby(['category','subcategory']).size()
cat_sub_frame = cat_sub.to_frame()
cat_sub_frame

#### Statistical evaluation:
- mean/median # of backers or amount collected per cat/subcat
- correlation? 
    - on nr_backers, usd_pledged, category, subcategory

In [None]:
df.columns

In [None]:
pd.set_option('display.max_columns', None)

df_grouped = df.groupby('category')
df_grouped.describe()

In [None]:
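# Restrict the mean to numeric columns; text columns would raise an error in recent pandas versions
df_grouped.mean(numeric_only=True)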

### Visualization

In [None]:
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib import cm

In [None]:
plt.figure(figsize=(16,6))
df['launched_at'] = pd.to_datetime(df['launched_at'])
df.set_index('launched_at').category.resample('M').count().plot() #resampling time series to Months
plt.xlim('2009-01-01', '2018-12-31')
plt.xlabel('')
plt.ylabel('Number of projects')
plt.title('Number of projects launched on Kickstarter, 2009-2019')
plt.show()

In [None]:
#year_df['launched_at'] = pd.to_datetime(year_df['launched_at'],format='%Y')
year_df = df.set_index('launched_at').state
year_df = pd.get_dummies(year_df).resample('YS').sum()
year_df1 = year_df[['successful', 'failed']]

fig, ax = plt.subplots(1,2, figsize=(16,6))
year_df1.plot.bar(ax=ax[0], color=['darkblue', 'grey'])
ax[0].set_title('Total number of failed and successful projects')
ax[0].set_xlabel('')

year_df1["successful"].div(year_df.sum(axis=1), axis=0).plot(kind='bar', ax=ax[1], color='darkblue') # Normalizes counts across rows
ax[1].set_title('Success Rate')
ax[1].set_xlabel('')
plt.show()

The left plot shows the total number of failed and successful projects per year; both have been decreasing since 2013. The two have not fallen at the same pace, as the right plot shows: the success rate has declined over the past years.

In [None]:
fig, ((ax1, ax2, ax3)) = plt.subplots(3, 1, figsize=(16,20))
color = cm.CMRmap(np.linspace(0, 1, df.category.nunique()))  # one colour per category

df.groupby('category').category.count().plot(kind='bar', ax=ax1, color=color)
ax1.set_title('Number of projects')
ax1.set_xlabel('')

df.groupby('category').goal.median().plot(kind='bar', ax=ax2, color=color)
ax2.set_title('Median project goal ($)')
ax2.set_xlabel('')

df.groupby('category').usd_pledged.median().plot(kind='bar', ax=ax3, color=color)
ax3.set_title('Median pledged per project ($)')
ax3.set_xlabel('')

fig.subplots_adjust(hspace=0.5)
plt.show()

The plots above highlight the differences among the 15 categories. Film & Video is the most used category, closely followed by Music. Art, Publishing and Technology share third place. However, Technology has the highest median project goal, and Design has the highest median pledged amount per project.

In [None]:
plt.figure(figsize=(16,6))
df.set_index('launched_at').sort_index().usd_pledged.cumsum().plot()
plt.xlim('2009-01-01', '2019-02-28') # Limiting to whole months
plt.xlabel('')
plt.ylabel('Cumulative amount pledged in $', fontsize=12)
plt.title('Cumulative pledged', fontsize=16)
plt.show()

The cumulative pledged figure shows the running total of pledged amounts from 2009 to 2019. The trend can be split into two phases, with a change in 2013/2014.

In [None]:
plt.figure(figsize=(16,6))
# log1p = log(1 + x), so projects with zero pledges stay finite instead of becoming -inf
sns.boxplot(x=df.launched_at.dt.year, y=np.log1p(df.usd_pledged))
plt.xlabel('')
plt.ylabel('Amount pledged (log-transformed)') #Log-transforming to make the trend clearer, as the distribution is heavily positively skewed
plt.title('Amount pledged on Kickstarter projects, 2009-2019')
plt.show()

Again, the trend can be split into two phases, with a change in 2014. We can see a greater variation in amounts pledged from 2014, with lower median amounts than before 2014, but generally higher mean amounts due to some very large projects.

### Prototype model - Logistic Regression
For our dummy model, we train a classifier on several numeric and categorical features of completed projects to estimate whether a model can, in general, predict the success of a project.

Numeric features: 
- Nr of Backers
- Goal

Categorical features:
- Category
- Subcategory
- Country 
- Nation
- spotlight
- staff_pick

Target variable: 
- state

In [None]:
#Imports
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn import set_config 
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

In [None]:
#Transform dataset: keep only finished projects that either failed or succeeded
mask = df['state'].isin(["failed","successful"]) & df['project_status'].isin(["inactive"])
df_dummy = df.loc[mask].copy()  # copy so that later changes do not touch df

#Drop unused columns
del_col=['project_id', 'name', 'slug', 'blurb', 'url', 'subcategory_id', 'creator_id', 'creator_name',
        'city', 'created_at', 'launched_at', 'deadline','pledged', 'currency', 'usd_pledged', 'current_currency',
        'fx_rate', 'static_usd_rate', 'currency_trailing_code', 'usd_type', 'project_status','state_changed_at',
        'disable_communication', 'is_creator_registered','is_starrable']

df_dummy=df_dummy.drop(del_col, axis = 1)

#Transform y to 0:1
y=df_dummy["state"].map({'failed' : 0, 'successful': 1})

X=df_dummy
X=X.drop('state', axis = 1)

#### Trial #1

In [None]:
#Build preprocessor for columns
#Standardize numerical features
numeric_features=["nr_backers", "goal"]
numeric_transformer = Pipeline(steps =[
    ("imputer",SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())])

#Encode categorical features
cat_features=["category", "subcategory", "country", "nation", "spotlight", "staff_pick"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

#Column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, cat_features)])

set_config(display="diagram")

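#Split first, then fit the column transformer on the training rows only,
#so that no test-set statistics leak into the imputer and scaler
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,shuffle=True, random_state=123)
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)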

#Build Logistic Regression
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Making predictions
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)

#### Evaluation #1

In [None]:
# Logistic regression scores
print("Logistic regression score for training set:", round(clf.score(X_train, y_train),5))
print("Logistic regression score for test set:", round(clf.score(X_test, y_test),5))
print("\nClassification report:")
print(classification_report(y_test, y_test_pred))

The performance metrics are extremely high and describe a perfect model. This is mainly because the feature "spotlight" is perfectly correlated with the target variable. In the following trial we evaluate the model without this feature. Nonetheless, "spotlight" is an important variable that needs further analysis. 
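
One way to check whether a feature such as "spotlight" fully determines the target is a cross-tabulation. The sketch below uses made-up rows, and the helper name `determines_target` is hypothetical; in the notebook the columns would come from `df_dummy`.

```python
import pandas as pd

# Made-up rows standing in for the filtered project data
toy = pd.DataFrame({
    "spotlight": [True, True, False, False, True, False],
    "staff_pick": [True, False, False, True, False, False],
    "state": ["successful", "successful", "failed", "failed", "successful", "failed"],
})


def determines_target(frame: pd.DataFrame, feature: str, target: str = "state") -> bool:
    """Return True if every value of `feature` maps to exactly one target value."""
    table = pd.crosstab(frame[feature], frame[target])
    # A feature value determines the target if only one target column is non-zero in its row
    return bool((table > 0).sum(axis=1).eq(1).all())


print(determines_target(toy, "spotlight"))   # every spotlight value maps to a single state
print(determines_target(toy, "staff_pick"))  # staff_pick=True occurs with both states
```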

#### Trial #2

In [None]:
#Build preprocessor for columns
#Standardize numerical features
numeric_features=["nr_backers", "goal"]
numeric_transformer = Pipeline(steps =[
    ("imputer",SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())])

#Encode categorical features
cat_features=["category", "subcategory", "country", "nation", "staff_pick"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

#Column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, cat_features)])

set_config(display="diagram")

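#Split first, then fit the column transformer on the training rows only,
#so that no test-set statistics leak into the imputer and scaler
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,shuffle=True, random_state=123)
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)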

#Build Logistic Regression
clf = LogisticRegression()
clf.fit(X_train, y_train)

# Making predictions
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)

#### Evaluation #2

In [None]:
# Logistic regression scores
print("Logistic regression score for training set:", round(clf.score(X_train, y_train),5))
print("Logistic regression score for test set:", round(clf.score(X_test, y_test),5))
print("\nClassification report:")
print(classification_report(y_test, y_test_pred))
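
A possible alternative layout for the trials above is to wrap the preprocessing and the classifier in one scikit-learn `Pipeline`, so that a single `fit` call learns the scaling and encoding from the training rows only. This is a sketch on made-up data with a reduced feature set, not the setup used in the trials; all `toy_*` names are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up stand-in for the cleaned data: 1 = successful, 0 = failed
toy = pd.DataFrame({
    "goal": [1000, 5000, 250, 80000, 1200, 30000, 700, 15000],
    "category": ["music", "technology", "art", "technology",
                 "music", "design", "art", "design"],
    "country": ["US", "US", "GB", "DE", "US", "GB", "US", "DE"],
    "state": [1, 0, 1, 0, 1, 0, 1, 0],
})
X_toy, y_toy = toy.drop(columns="state"), toy["state"]

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["goal"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["category", "country"]),
])

# Preprocessing and classifier are fitted together, on the training rows only
toy_model = Pipeline(steps=[
    ("preprocess", toy_preprocessor),
    ("classifier", LogisticRegression()),
])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=123)
toy_model.fit(X_tr, y_tr)
print("Test accuracy on the toy data:", toy_model.score(X_te, y_te))
```

With eight made-up rows the score itself is meaningless; the point is only that `toy_model.predict` applies the fitted preprocessing to unseen rows automatically.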

# Machine Learning Models
## Overview

1. Logistic Regression
2. Logistic Regression PCA (Domi)
3. XYZ
4. Random Forest (Maria, Eugenia)

In [None]:
# Importing the required libraries
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA


In [None]:
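# Fitting PCA on the preprocessed features (PCA cannot handle the raw string columns in X)
X_pca = preprocessor.fit_transform(X)
if hasattr(X_pca, "toarray"):
    # The one-hot encoded output is sparse; densify it so PCA works across scikit-learn versions
    # (this can be memory-heavy on the full dataset)
    X_pca = X_pca.toarray()
pca = PCA()
pca.fit(X_pca)
explained_var = np.cumsum(pca.explained_variance_ratio_)
# Plotting the amount of variation explained by PCA with different numbers of components
plt.plot(list(range(1, len(explained_var)+1)), explained_var)
plt.title('Amount of variation explained by PCA', fontsize=14)
plt.xlabel('Number of components')
plt.ylabel('Explained variance');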
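
The cumulative curve can also be used to pick a number of components. The sketch below uses made-up variance ratios in place of `pca.explained_variance_ratio_`, and the 90% threshold is an assumed value, not one chosen in this project.

```python
import numpy as np

# Made-up explained-variance ratios standing in for pca.explained_variance_ratio_
ratios = np.array([0.50, 0.25, 0.12, 0.08, 0.03, 0.02])
cumulative = np.cumsum(ratios)

threshold = 0.90  # assumed target share of variance, not a value from this project
# np.argmax returns the first index at which the condition is True
n_components = int(np.argmax(cumulative >= threshold)) + 1
print(n_components)
```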