#### CIS 520

## Predicting Success of Kickstarter Projects

This notebook presents a brief walkthrough of the various aspects of the machine learning task of predicting the success of projects on the crowdfunding platform Kickstarter.

### Overview of the dataset and feature engineering 

* The data was obtained from [here](https://www.kaggle.com/kemical/kickstarter-projects) and [here](https://webrobots.io/kickstarter-datasets/), and the two sources were merged on the project ID to obtain a large number of features.
* From all obtained features, careful manual feature selection retained only those that *could plausibly* influence the success of a project and that are known to the project team before launching on Kickstarter. This left a set of 26 features.
* The correlations between features were computed, and feature importances were determined separately for classification and regression, to further prune correlated and unimportant features.
* Various scaling methods were applied to the features as appropriate.
* To avoid fitting a regression to projects that raised only a small fraction of their goal, the regression dataset was restricted to projects that reached at least 50% of their funding goal. This improved performance because the regression model was no longer forced to fit near-zero targets.
* The final dataset contained 69210 examples and 19 features for classification, and 45641 examples and 20 features for regression. It can be found [here](https://www.dropbox.com/sh/zzsjziljp7qz9fq/AABqGrO6WUwyFjbzIDnO3hdCa).
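
As a rough illustration of the merge step in the first bullet, the sketch below joins two small, made-up tables on a shared project ID with pandas. The column names (`ID`, `goal`, `blurb`), the toy rows, and the choice of an inner join are assumptions for illustration only; the schemas of the two real sources are not shown in this notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the two sources; the real column names may differ.
kaggle_df = pd.DataFrame({"ID": [1, 2, 3], "goal": [1000.0, 5000.0, 250.0]})
webrobots_df = pd.DataFrame({"ID": [2, 3, 4], "blurb": ["A board game", "An album", "A film"]})

# An inner join keeps only the projects that appear in both sources.
merged = kaggle_df.merge(webrobots_df, on="ID", how="inner")
print(merged)
```

With an inner join, projects missing from either source are dropped, which is one plausible reason a merged dataset ends up smaller than either input.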

#### Data distributions

The following set of pie charts depicts the distribution of some of the categorical features in the dataset.  
![alt text](https://i.imgur.com/O5eXQ7e.png)

Some insights: 
* 64% of the projects are successful. 
* The most popular country for Kickstarter is the US.
* Most project creators prefer not to launch their projects on weekends.

The following set of bar graphs depicts the distribution of some of the continuous features in the dataset.  
![alt text](https://i.imgur.com/NaaKCtj.png)

Some insights: 
* The distribution of amounts raised by the failed projects is very different from the distribution for successful projects.

#### Feature selection

We consider two features to be correlated when their Pearson correlation coefficient exceeds a threshold of 0.9.

![alt text](https://i.imgur.com/438l7HT.png)

The figure above shows the correlations between all the features in our dataset. Negative values indicate features that are negatively correlated, such as the average sentence length, the number of sentences in the blurb, daysine and yesweekend.
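
The thresholding described above can be sketched as follows. This is a minimal illustration, not the exact procedure used for the project: the helper name `drop_correlated`, the toy columns, and the use of the absolute value of the coefficient (so that strongly negative pairs are also caught) are all assumptions.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Drop one feature from every pair whose |Pearson r| exceeds the threshold."""
    corr = df.corr(method="pearson").abs()
    # Keep only the upper triangle so that each pair is examined once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy data: "b" is almost a copy of "a", while "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.01 * rng.normal(size=200),
    "c": rng.normal(size=200),
})
pruned, dropped = drop_correlated(toy)
print(dropped)
```

Dropping the later column of each correlated pair is an arbitrary tie-break; keeping whichever feature has the higher importance would be an equally reasonable choice.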

We also use a wrapper-based approach to further select features: a LightGBM classifier and a LightGBM regressor provide feature importances for classification and regression respectively, as plotted below.

**Classification**
![Classification](https://i.imgur.com/Q9jiLNN.png)
**Regression**
![Regression](https://i.imgur.com/UIiQFzA.png)

* Out of the 26 features, only 19 are required for attaining 95% of the total feature importance for the classification task. 

![alt text](https://i.imgur.com/YShdDYI.png)

* Based on the plots above, and to avoid fitting a regression to projects that raised only a small fraction of their goal, the dataset was restricted to projects that reached at least 50% of their funding goal, reducing its size from 69210 to 45641.
* Out of the 26 features, only 20 are required for attaining 95% of the total feature importance for the regression task.     
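
One way to read the "95% of the total feature importance" criterion is to sort the features by importance and keep the smallest prefix whose cumulative share reaches 0.95. The sketch below assumes that reading. The helper name, feature names, and importance values are invented; in practice the importances would come from a fitted model (for example the `feature_importances_` attribute of a tree-based estimator).

```python
import numpy as np

def features_for_importance(names, importances, target=0.95):
    """Return the most important features whose cumulative share reaches target."""
    importances = np.asarray(importances, dtype=float)
    order = np.argsort(importances)[::-1]            # most important first
    share = importances[order] / importances.sum()   # normalise so shares sum to 1
    cumulative = np.cumsum(share)
    # Position of the first feature at which the running total reaches the target.
    n_keep = int(np.searchsorted(cumulative, target) + 1)
    return [names[i] for i in order[:n_keep]]

# Hypothetical feature names and importance values.
names = ["f1", "f2", "f3", "f4"]
importances = [50, 30, 16, 4]
kept = features_for_importance(names, importances)
print(kept)
```

Here the cumulative shares are 0.50, 0.80, 0.96 and 1.00, so the first three features are enough to reach the 0.95 target.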

### Hyperparameter tuning, training and testing

![alt text](https://i.imgur.com/VkaxnPI.png)

![alt text](https://i.imgur.com/7xKEe3v.png)

In this notebook, we walk through one classification model and one regression model, both based on **Gradient Boosting**, and report their performance metrics.

In [0]:
# Imports 

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, normalize, scale
from sklearn.feature_selection import *
import numpy as np
import calendar
import re
from nltk.tokenize import word_tokenize , sent_tokenize
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import *
import matplotlib.pyplot as plt
import time
from sklearn.model_selection import learning_curve
from sklearn.dummy import DummyClassifier
import sys

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, ElasticNet, Lasso, LogisticRegression
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor, MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor, AdaBoostRegressor, AdaBoostClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

In [27]:
# Downloading data
!wget https://www.dropbox.com/sh/zzsjziljp7qz9fq/AAAx-2P_uNIJhDqhFFSCn-Wna/fresh_train_206_5.csv
!wget https://www.dropbox.com/sh/zzsjziljp7qz9fq/AABdxLbVrBc2-X5_8ik-oJMza/fresh_test_206_5.csv


In [0]:
# Loading data
# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
df_train = pd.read_csv("fresh_train_206_5.csv",engine='python',on_bad_lines='skip')
df_test = pd.read_csv("fresh_test_206_5.csv",engine='python',on_bad_lines='skip')

In [0]:
# Dataset for classification
classificationTrain_X = df_train.drop(['state','usd_pledged_real','blurbSpellingErrors'],axis=1)
classificationTrain_y = df_train['state']
classificationTest_X = df_test.drop(['state','usd_pledged_real','blurbSpellingErrors'],axis=1)
classificationTest_y = df_test['state']

In [0]:
# Dataset for regression
# Keep only projects whose pledged-to-goal ratio is at least 0.5; the mask is computed once per split
train_mask = np.divide(np.exp(df_train.usd_pledged_real-1),np.exp(df_train.usd_goal_real-1))>=0.5
regression_train_X_2 = df_train[train_mask].drop(['state','usd_pledged_real'],axis=1)
regression_train_y_2 = df_train[train_mask]['usd_pledged_real']

test_mask = np.divide(np.exp(df_test.usd_pledged_real-1),np.exp(df_test.usd_goal_real-1))>=0.5
regression_test_X_2 = df_test[test_mask].drop(['state','usd_pledged_real'],axis=1)
regression_test_y_2 = df_test[test_mask]['usd_pledged_real']

In [0]:
# Standard scale the data for regression
# Note: fit_transform refits the same scaler on every call, so each array below
# (train/test features and targets) is standardized with its own mean and variance
scaler = StandardScaler()
train_regression_set_scaled = scaler.fit_transform(regression_train_X_2)
train_regression_target_scaled = scaler.fit_transform(regression_train_y_2.values.reshape(-1,1))
test_regression_set_scaled = scaler.fit_transform(regression_test_X_2)
test_regression_target_scaled = scaler.fit_transform(regression_test_y_2.values.reshape(-1,1))
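
A common alternative to the cell above is to fit one scaler for the features and one for the target on the training split only, and then reuse those statistics for the test split. The sketch below shows that pattern on synthetic arrays; the variable names and data are placeholders, not the project's. The metrics reported later in this notebook were produced with the per-split scaling shown above, so they would likely differ under this scheme.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data standing in for the regression features and target.
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(5, 2, size=(100, 3)), rng.normal(5, 2, size=(20, 3))
y_train, y_test = rng.normal(10, 3, size=100), rng.normal(10, 3, size=20)

x_scaler = StandardScaler().fit(X_train)                 # statistics from training features only
y_scaler = StandardScaler().fit(y_train.reshape(-1, 1))  # separate scaler for the target

X_train_s = x_scaler.transform(X_train)
X_test_s = x_scaler.transform(X_test)                    # reuses the training mean and variance
y_train_s = y_scaler.transform(y_train.reshape(-1, 1)).ravel()
y_test_s = y_scaler.transform(y_test.reshape(-1, 1)).ravel()

# Values in scaled units can be mapped back to the original units.
y_back = y_scaler.inverse_transform(y_test_s.reshape(-1, 1)).ravel()
```

Keeping a dedicated target scaler also makes it possible to report errors in the original units via `inverse_transform`.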

#### Hyperparameter tuning for Gradient Boosting Classification

Takes a long time to run. Uncomment to rerun. 

In [0]:
# hyperparameters_gradientboost_c = {
#         'n_estimators' : [1500,2500],
#         'learning_rate' : [0.05,0.1,1],
#         'loss' : ['deviance', 'exponential']
# }
# grad_class = GradientBoostingClassifier()
# clf5 = GridSearchCV(grad_class, hyperparameters_gradientboost_c, n_jobs=-1,
#              cv=5, verbose=20)  # iid argument dropped: it was removed in scikit-learn 0.24
# clf5.fit(classificationTrain_X,classificationTrain_y)
# clf5.best_params_

#### Training for Gradient Boosting Classification

![alt text](https://i.imgur.com/7CHAgAG.png)

In [7]:
# Train the classification
params_gradientboosting_classifier = {
    'learning_rate' : 0.05,
    'n_estimators' : 2500,
    'loss' : 'exponential'
}

grad_class = GradientBoostingClassifier(**params_gradientboosting_classifier)
grad_class.fit(classificationTrain_X,classificationTrain_y)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.05, loss='exponential', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=2500,
                           n_iter_no_change=None, presort='auto',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

#### Testing performance for Gradient Boosting Classification

In [0]:
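# Compute classification metrics
def aucgenerate(test_y,y_pred):
    # Per-class precision, recall and F1
    print(classification_report(test_y,y_pred))
    # y_pred holds hard class labels, so this AUC reflects a single operating point
    print('Area under ROC curve')
    print(roc_auc_score(test_y,y_pred))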

In [9]:
# Accuracy on the training set, then on the unseen test set
print(grad_class.score(classificationTrain_X,classificationTrain_y))
print(grad_class.score(classificationTest_X,classificationTest_y))
# Classification report and AUC on the test set
y_pred_gb = grad_class.predict(classificationTest_X)
aucgenerate(classificationTest_y,y_pred_gb)

0.9063899725473198
0.8906949862736598
              precision    recall  f1-score   support

          -1       0.88      0.79      0.84      4863
           1       0.89      0.94      0.92      8979

    accuracy                           0.89     13842
   macro avg       0.89      0.87      0.88     13842
weighted avg       0.89      0.89      0.89     13842

Area under ROC curve
0.8682862315173818
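
The AUC above is computed from hard class labels, in which case it equals the mean of the two per-class recalls (roughly (0.79 + 0.94) / 2 in the report above). Passing predicted probabilities instead lets `roc_auc_score` evaluate the whole ROC curve. The sketch below contrasts the two on synthetic data from `make_classification`; the dataset and the small model are placeholders, not the project's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem (placeholder for the Kickstarter data).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
labels = clf.predict(X_te)
scores = clf.predict_proba(X_te)[:, 1]   # probability of the positive class

auc_labels = roc_auc_score(y_te, labels)          # single operating point
auc_scores = roc_auc_score(y_te, scores)          # full ROC curve
bal_acc = balanced_accuracy_score(y_te, labels)   # mean of per-class recalls
print(auc_labels, bal_acc, auc_scores)
```

The score-based value is the one that corresponds to the area under a plotted ROC curve.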


#### Hyperparameter tuning for Gradient Boosting Regression

Takes a long time to run. Uncomment to rerun.

In [0]:
# hyperparameters_gradientboost = {
#         'n_estimators' : [950,1500,2000,2500,3000,5000],
#         'learning_rate' : [0.1,0.01],
#         'loss' : ['ls', 'lad', 'huber', 'quantile']
# }
# grad_reg = GradientBoostingRegressor()
# clf3 = GridSearchCV(grad_reg, hyperparameters_gradientboost, n_jobs=-1,
#              cv=5, verbose=15)  # iid argument dropped: it was removed in scikit-learn 0.24
# # Tune on the training split so that the test split stays held out for the final evaluation
# clf3.fit(train_regression_set_scaled,train_regression_target_scaled.ravel())
# clf3.best_params_

#### Training for Gradient Boosting Regression

![alt text](https://i.imgur.com/IlbizMq.png)

In [35]:
# Train the regression
grad_reg = GradientBoostingRegressor(learning_rate=0.01,n_estimators=3000,loss='ls', random_state=1)
# ravel() passes the target as a 1-D array, avoiding the column_or_1d conversion warning
grad_reg.fit(train_regression_set_scaled,train_regression_target_scaled.ravel())


GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
                          learning_rate=0.01, loss='ls', max_depth=3,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=3000,
                          n_iter_no_change=None, presort='auto', random_state=1,
                          subsample=1.0, tol=0.0001, validation_fraction=0.1,
                          verbose=0, warm_start=False)

#### Testing performance for Gradient Boosting Regression

In [0]:
# Compute regression metrics
def regression_metrics(y_test,y_pred):
    print('Mean Squared Error')
    print(mean_squared_error(y_test,y_pred))
    print('Mean absolute error')
    print(mean_absolute_error(y_test, y_pred))

In [37]:
# Test the model on unseen data
y_pred = grad_reg.predict(test_regression_set_scaled)
print('R^2')
print(grad_reg.score(test_regression_set_scaled,test_regression_target_scaled))
regression_metrics(test_regression_target_scaled,y_pred)

R^2
0.8899141765546065
Mean Squared Error
0.11008582344539353
Mean absolute error
0.2044089047831065
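
The R^2 and the mean squared error above add up to 1. This follows from the test target having been standardized to unit variance, since R^2 = 1 - MSE / Var(y). The sketch below checks that identity on synthetic values; the data and noise level are arbitrary placeholders.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
y_true = rng.normal(10, 3, size=1000)
y_pred = y_true + rng.normal(0, 1, size=1000)   # synthetic, imperfect predictions

# Standardize with the statistics of the true target (zero mean, unit variance).
scaler = StandardScaler().fit(y_true.reshape(-1, 1))
y_true_s = scaler.transform(y_true.reshape(-1, 1)).ravel()
y_pred_s = scaler.transform(y_pred.reshape(-1, 1)).ravel()

mse = mean_squared_error(y_true_s, y_pred_s)
r2 = r2_score(y_true_s, y_pred_s)
print(r2, 1 - mse)   # the two values agree up to floating-point error
```

A consequence is that the MSE and MAE above are in units of standard deviations of the scaled target, not in dollars.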


We combine the ROC curves for all our classifiers in the plot below:

![alt text](https://i.imgur.com/ORgpUH5.png)