# The Kickstarter Project

#### Design a model to assess whether a campaign for a project on KICKSTARTER is likely to succeed or fail. This could help project owners decide whether the work involved in maintaining a campaign is worth it. 

Data can be found [here](https://neuefische-students.slack.com/files/U01BHAYB2CU/F06E67VRYMB/kickstarter_data.zip) and should be saved in a folder named `Kickstarter_data`.

## Description of the Dataset 

The dataset used in this project consists of data collected between April 2009 and March 2019. In total, 209222 projects from 157326 creators are included in the dataset. 
\
Columns potentially relevant to the success or failure of a project include:

* backers_count: number of people pledging money to the project
* category: JSON string containing information such as the project's id, categories and URL
* country: country of the project's creator
* deadline: end date of the project
* goal: amount of money needed to succeed, in the local currency of the project
* launched_at: start date of the project
* spotlight: feature enabled after reaching the goal, to show off the project
* staff_pick: marked by a Kickstarter staff member
* state: outcome of the project (successful/failed/canceled/live/suspended)

Further columns include information on the project's currency and exchange rates, as well as the creator, their specific location and their profile information. Of these columns, only static_usd_rate was included, in order to convert the goal to USD for better comparability. 

As our model aims at predicting a campaign's success before it is launched, columns containing information not available to the potential creator up front had to be excluded. These columns are _'backers_count'_ and _'staff_pick'_. Furthermore, _'spotlight'_ is excluded because its information is identical to that in the _'state'_ column, which is used as the target column in the following.  
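
The column selection described above can be sketched as follows. This is a minimal illustration on a made-up two-row frame, not the project's preprocessing code: the values are invented, the name `goal_usd` is hypothetical, and it assumes that `static_usd_rate` is the multiplier from the local currency to USD.

```python
import pandas as pd

# Toy rows with invented values; only the column names follow the dataset.
raw = pd.DataFrame({
    "goal": [1000.0, 5000.0],
    "static_usd_rate": [1.0, 1.25],
    "backers_count": [12, 3],
    "staff_pick": [True, False],
    "spotlight": [True, False],
    "state": ["successful", "failed"],
})

# Columns unknown before launch, plus 'spotlight', which mirrors the target.
leaky_columns = ["backers_count", "staff_pick", "spotlight"]
features = raw.drop(columns=leaky_columns)

# Express every goal in USD so campaigns in different currencies are comparable
# (assumption: static_usd_rate converts local currency to USD by multiplication).
features["goal_usd"] = features["goal"] * features["static_usd_rate"]
```

Dropping these columns before any modelling step keeps information that only exists during or after a campaign out of the features.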

In [None]:
# important libraries 

import pandas as pd 
import os, re, json, warnings
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pickle

from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

from utils.evaluation import *                  #contains functions used for model evaluation
from utils.Preprocessing_funct import *         #contains functions used for pre-processing

RSEED = 42

### Pre-process the data 
The file *Preprocessing_funct.py* contains all functions used in the preprocessing. The reasons for applying each function are given in the comments next to the functions or in the 'EDA.ipynb' notebook.

The following cell contains a workflow to load, preprocess, clean and save the data. Alternatively, dataloader.py can be run.
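
To give an idea of what such preprocessing steps can look like, here is a rough sketch on three invented rows. It is not the code from *Preprocessing_funct.py*: the assumption that the category JSON carries a `slug` key, the rule of keeping only successful and failed campaigns, and all values are illustrative guesses.

```python
import json
import pandas as pd

# Invented example rows; timestamps are Unix seconds as in the raw data.
demo = pd.DataFrame({
    "category": ['{"id": 1, "slug": "music/rock"}',
                 '{"id": 2, "slug": "games/tabletop games"}',
                 '{"id": 3, "slug": "art/painting"}'],
    "state": ["successful", "failed", "live"],
    "launched_at": [1546300800, 1548979200, 1548979200],
    "deadline": [1548892800, 1552867200, 1552867200],
})

# Keep only finished campaigns and encode the target as 1 (success) / 0 (failure).
demo = demo[demo["state"].isin(["successful", "failed"])].copy()
demo["state"] = (demo["state"] == "successful").astype(int)

# Extract the category slug from the JSON string (assumed key: 'slug').
demo["slug"] = demo["category"].apply(lambda s: json.loads(s)["slug"])

# Convert Unix timestamps to datetimes and derive date-based features.
for col in ("launched_at", "deadline"):
    demo[col] = pd.to_datetime(demo[col], unit="s")
demo["duration_days"] = (demo["deadline"] - demo["launched_at"]).dt.days
demo["launched_at_month"] = demo["launched_at"].dt.month
demo["launched_at_weekday"] = demo["launched_at"].dt.weekday  # Monday = 0
```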

In [None]:
# ####################################################################################################################
# ###### The preprocessed Data is saved under 'data/cleaned_data.csv' ################################################
# ####################################################################################################################


# ############# Load the data into a dataframe #######################################################################

# directory = 'Kickstarter_data/'
# data = pd.DataFrame()
# relevant_columns = ['category', 'country', 'creator', 'state', 'static_usd_rate', 'goal', 'launched_at', 'deadline']

# for file in sorted(os.listdir(directory)):
#     df_temp = pd.read_csv(directory+file)
#     data = pd.concat([data, df_temp[relevant_columns]], ignore_index=True)

# ############ clean data of duplicates ################################################################################

# data = data.drop_duplicates(ignore_index =True)     

# ############## transform columns containing dates to datetime ##########################################################

# data['launched_at'] = pd.to_datetime(data['launched_at'], unit='s')
# data['deadline'] = pd.to_datetime(data['deadline'], unit='s')

# ############## preprocess the columns ##################################################################################

# data = state_to_binary(data)                        # work on state-column (to binary -> failed/succeeded)
# data = extract_category(data)                       # extract the category of the project
# data = extract_year_date_month(data, 'launched_at') # extract year and date of the launched_at column
# data = duration(data, 'launched_at', 'deadline')    # extract campaign duration from launched_at and deadline
# data = convert_to_usd(data)                         # convert goal column to USD for unification 
# data = north_america(data)                          # work on country column to divide between North America and not North America
# data = unrealistic_goal(data)                       # exclude goals above 1000000 USD, see EDA notebook

# ########### drop unnecessary columns #####################################################################################

# data = data.drop(['category', 'country', 'creator', 'static_usd_rate', 'launched_at', 'deadline'], axis =1)

# ########### Label-encode the categorical columns containing strings ######################################################

# le = LabelEncoder()
# data['slug'] = le.fit_transform(data['slug'])

# ##########  save pre-processed Dataframe in csv_file #####################################################################

# data.to_csv('data/cleaned_data.csv', index=False)

### Separate Target from Features, perform Train-Test split and scale data

Some of the models tested in this project require standard-scaled values, while others require MinMax-scaled values for good training results. In addition to the unscaled dataset, we therefore create a standard-scaled and a MinMax-scaled train and test set for use with the different models. For the scaled datasets, the categorical data is additionally one-hot encoded. The scaled train and test datasets generated in the cell below are saved in the folder 'data'. To uncomment the cell, click inside the cell, then press cmd+a, followed by cmd+k and cmd+u.
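
The idea of fitting the scaler on the training set only and applying it to both sets can be sketched on a toy example. The frames and values below are invented, and the `reindex` step is an optional safeguard (not necessarily part of the project's workflow) for the case where a category is missing from one of the splits.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data: one numerical and one (label-encoded) categorical column.
X_train = pd.DataFrame({"goal": [100.0, 300.0, 500.0], "slug": [0, 1, 2]})
X_test = pd.DataFrame({"goal": [300.0], "slug": [1]})
cat_columns, num_columns = ["slug"], ["goal"]

# One-hot encode the training data, dropping the first level.
X_train_oh = pd.get_dummies(X_train, columns=cat_columns, drop_first=True)

# Encode the test data with all levels, then align it to the training columns
# so both frames have identical columns even if a category is absent in the test set.
X_test_oh = pd.get_dummies(X_test, columns=cat_columns)
X_test_oh = X_test_oh.reindex(columns=X_train_oh.columns, fill_value=0)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler()
X_train_scaled = X_train_oh.astype(float)
X_test_scaled = X_test_oh.astype(float)
X_train_scaled[num_columns] = scaler.fit_transform(X_train_oh[num_columns])
X_test_scaled[num_columns] = scaler.transform(X_test_oh[num_columns])
```

Fitting on the training set only prevents statistics of the test set from leaking into training. The same pattern applies to `MinMaxScaler`.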

In [None]:
# ###########################################################################################################################
# #### The data generated in this cell can be found in the data folder. Only execute when aiming at new train/test sets #####
# ###########################################################################################################################



# ########## Load Data ######################################################################################################
# data = pd.read_csv('data/cleaned_data.csv')

# ########## define target and features variables ###########################################################################
# X = data.drop('state', axis=1)
# y = data.state

# ########## split to train and test ########################################################################################
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=RSEED)

# ########## define categorical and numerical columns #######################################################################
# cat_columns = ['slug', 'launched_at_weekday', 'launched_at_month']
# num_columns = ['goal', 'duration_days']

# ########## One hot encode categorical data ################################################################################
# X_train_oh = pd.get_dummies(X_train, columns = cat_columns, drop_first = True)
# X_test_oh = pd.get_dummies(X_test, columns = cat_columns, drop_first = True)

# ########## Apply standard scaler on numerical data #######################################################################
# std_scaler = StandardScaler()
# X_train_stdscaled = std_scaler.fit_transform(X_train[num_columns])
# X_test_stdscaled = std_scaler.transform(X_test[num_columns])

# X_train_stdscaled = pd.DataFrame(np.concatenate([X_train_stdscaled, X_train_oh.drop(num_columns, axis=1)], axis=1))
# X_test_stdscaled = pd.DataFrame(np.concatenate([X_test_stdscaled, X_test_oh.drop(num_columns, axis=1)], axis=1))

# ########## Apply MinMax scaler on numerical data #########################################################################
# mm_scaler = MinMaxScaler()
# X_train_mmscaled = mm_scaler.fit_transform(X_train_oh[num_columns])
# X_test_mmscaled = mm_scaler.transform(X_test_oh[num_columns])

# X_train_mmscaled = pd.DataFrame(np.concatenate([X_train_mmscaled, X_train_oh.drop(num_columns, axis=1)], axis=1))
# X_test_mmscaled = pd.DataFrame(np.concatenate([X_test_mmscaled, X_test_oh.drop(num_columns, axis=1)], axis=1))


# ########## Save the different x_train & x_test versions ##################################################################
# y_train.to_csv('data/y_train.csv', index=False)
# y_test.to_csv('data/y_test.csv', index=False)

# X_train.to_csv('data/unscaled_Xtrain.csv', index=False)
# X_test.to_csv('data/unscaled_Xtest.csv', index=False)

# X_train_stdscaled.to_csv('data/stdscaled_Xtrain.csv', index=False)
# X_test_stdscaled.to_csv('data/stdscaled_Xtest.csv', index=False)

# X_train_mmscaled.to_csv('data/mmscaled_Xtrain.csv', index=False)
# X_test_mmscaled.to_csv('data/mmscaled_Xtest.csv', index=False)

### The Models

This notebook shows our best model, found after investigating multiple machine learning algorithms with grid search, including **logistic regression**, **k-nearest neighbours**, **random forest**, **xgboost** and **stacking** of different models. The best model was an **XGBoost** model; its parameters are listed in the section 'The Best model'.
The baseline model (logistic regression with default parameters) is shown as well.
The grid searches for the different models are shown in the 'Gridsearch_models.ipynb' notebook. The best model for each algorithm is saved in the models folder. 
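
For readers unfamiliar with grid search, the following is a minimal sketch on synthetic data. The parameter grid, the choice of logistic regression, and the number of folds are assumptions for illustration and do not reproduce the searches in 'Gridsearch_models.ipynb'. Only the use of precision as the scoring metric follows this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the Kickstarter features.
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Hypothetical grid: try three regularisation strengths.
param_grid = {"C": [0.1, 1.0, 10.0]}

# Cross-validated search that selects the setting with the best precision.
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, scoring="precision", cv=3
)
search.fit(X_tr, y_tr)

best = search.best_estimator_   # refitted on the full training split
y_pred = best.predict(X_te)
```

`search.best_params_` then holds the selected setting, and `best` can be pickled like the models in the models folder.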


In [None]:
# Load scaled train and test data (standard scaled for logistic regression and original data for xgboost)

y_train = pd.read_csv('data/y_train.csv')
y_test = pd.read_csv('data/y_test.csv')

X_train = pd.read_csv('data/unscaled_Xtrain.csv')
X_test = pd.read_csv('data/unscaled_Xtest.csv')

X_train_stdscaled = pd.read_csv('data/stdscaled_Xtrain.csv')
X_test_stdscaled = pd.read_csv('data/stdscaled_Xtest.csv')

#### The Baseline model

As a baseline model, a simple logistic regression was trained using the default parameters (penalty='l2', C=1.0, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto'). 

In [None]:
# # Train baseline model
# logreg_baseline = LogisticRegression()      
# logreg_baseline.fit(X_train, y_train.values.ravel())   # pass the target as a 1-D array

# # Save baseline model
# with open('./models/baseline_logreg_model', 'wb') as f:
#     pickle.dump(logreg_baseline, f)

# Load baseline model
with open('./models/baseline_logreg_model', 'rb') as f:
    logreg_baseline = pickle.load(f)

# Visualise Results
y_pred_baseline = logreg_baseline.predict(X_test)
vis_results(y_test, y_pred_baseline, 'Baseline_LogReg', savefig=False)

# get baseline accuracy in percent
base_line_acc = accuracy_score(y_test, y_pred_baseline)*100

#### The Best model

The best model was an XGBoost model with the following parameters: 'colsample_bytree': 0.85, 'learning_rate': 0.1, 'max_depth': 23, 'n_estimators': 100, 'subsample': 0.75

In [None]:
# Load best model
with open('./models/best_xgb_model', 'rb') as f:
    best_model = pickle.load(f)

# Predict on best model
y_pred_best = best_model.predict(X_test)

# Plot classification report and confusion matrix
vis_results(y_test, y_pred_best, 'XGBoost', savefig=False)

### Compare accuracies

Since the statistical background of our clients (small business owners) may vary, accuracy was chosen as the reporting metric because it can be understood intuitively. It is also relatively safe here: the models were optimized for precision, so we can exclude the risk of a high accuracy masking a rather low precision. Nevertheless, confusion matrices were saved for each model so that they can be shown to clients interested in more in-depth statistics.\
To compare the accuracies, a bar plot appeared to be the most comprehensible approach, as it can be grasped at a glance in the presentation. The baseline accuracy was additionally plotted for comparison.
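
How accuracy and precision relate to the confusion matrix can be illustrated with a small worked example. The labels below are invented and unrelated to the project's results.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

# Invented labels: 1 = successful campaign, 0 = failed campaign.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1, 0, 0]

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Accuracy: share of all predictions that are correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Precision: share of predicted successes that really were successes.
precision = tp / (tp + fp)

# The hand-computed values agree with the scikit-learn helpers.
sk_accuracy = accuracy_score(y_true, y_pred)
sk_precision = precision_score(y_true, y_pred)
```

Here 6 of 8 predictions are correct (accuracy 0.75), while 2 of the 3 predicted successes are real successes (precision of about 0.67), which shows that the two metrics can differ on the same predictions.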

In [None]:
with open('accuracies.txt', 'rb') as f:
    accuracies = pickle.load(f)
accuracies = pd.DataFrame(accuracies.items(), columns=['model', 'accuracy'])
accuracies.accuracy = accuracies.accuracy * 100

In [None]:
clrs = ['grey' if (x < accuracies.accuracy.max()) else 'red' for x in accuracies.accuracy]
ax = sns.barplot(x='model', y='accuracy', data=accuracies, palette=clrs, saturation=0.5)
ax.axhline(base_line_acc, color="black", linestyle = 'dashed', label='baseline')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.set_ylim(0,100)
ax.set_ylabel('Accuracy')
ax.set_xlabel('Model')
ax.legend()
ax.figure.savefig('images/accuracy_comparison.png',dpi=600)

### Error Analysis 

Finally, a small error analysis was performed on the results of the best model (XGBoost, as seen above). The results are shown in the **Error_analysis.ipynb** notebook. However, no specific error patterns were found. 
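
One common way to look for such patterns is to compare the misclassification rate across the values of a feature. The sketch below uses invented predictions and readable category names instead of the label-encoded `slug` values. The analysis in **Error_analysis.ipynb** may be structured differently.

```python
import pandas as pd

# Invented predictions for a handful of campaigns.
results = pd.DataFrame({
    "slug": ["music", "music", "games", "games", "games", "art"],
    "y_true": [1, 0, 1, 1, 0, 0],
    "y_pred": [1, 0, 0, 1, 1, 0],
})

# Flag every misclassified campaign.
results["error"] = results["y_true"] != results["y_pred"]

# Share of misclassified campaigns per category, highest first.
error_rate = (
    results.groupby("slug")["error"].mean().sort_values(ascending=False)
)
```

A category whose error rate is far above the others would point to a systematic weakness of the model.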