# **Kickstarter Success Exploratory Data Analysis**
**BUSI/COMP 488-002 Final Project** | May 7, 2021

*Brendan Carr  
Daniel Tracy  
Kevin Barth  
Siddharth Bowgal  
Alex Damiano  
Peter Morrow*

### Objective & Overview

___

We were tasked with creating a strategy for identifying successful Kickstarter campaigns so that investors can choose the projects with the most promise. To do this, we analyzed which features of a Kickstarter project available at the project start date were most useful in determining success. We also looked for projects with an amount of money pledged that was over the goal in order to determine which projects were the best investments. Finally, we performed a text analysis to identify key adjectives in the successful projects, to give investors further insight into what kinds of language help promote a project toward success.

#### Outline


1. Set Up Notebook & Load Data
2. Data Transformation
3. Comparing Models for Classification
4. Does Model Generalize Well?
5. Predicting Amount over Goal
6. Appendix

### 1. Set Up Notebook & Load Data

___

First, we will import some fundamental libraries and read in our raw data.

> **Note:** In order to run this notebook, you must download the ***ks-projects-201801.csv.zip*** file included on the [GitHub page](https://github.com/brendancarr34/Kickstarter-project-EDA) and unzip it.




In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
from datetime import date
import matplotlib.pyplot as plt

# This class is used for aesthetic output purposes of this notebook.
class color:
  BOLD = '\033[1m'
  END = '\033[0m'

In [None]:
# Filepath may need to be changed depending on the location of the csv file

df = pd.read_csv(r"C:\Users\xxxx\Desktop\ks-projects-201801.csv")

#### Data Preview

We take a quick snapshot of the data to get a feel for the imported data.

In [None]:
df.head(10)

In [None]:
df.shape

### 2. Data Transformation

___

Given such a large dataset, there will be variables and data we do not want to include in our analysis. Removing them simplifies the results of our EDA and gives actionable insights into predicting the success of certain projects. In this section, we prime our data for analysis using the following strategies:
- Dropping Columns
- Filtering Rows
- Visualizing Outliers and Correlation
- Removing Outliers
- Data Type Casting
- Feature Engineering

#### Dropping Columns

To simplify our analysis, we will look at only US data, so our recommendations apply to US-based projects. We drop ID since it is just an identifier with no real information inherent in it. The name column is also an identifier, but we keep it for now because the text analysis in the Appendix uses it; it is dropped before modeling. We also notice there are multiple variations of the "pledged" variable; since we are only working with US data, we keep only "pledged" (and "goal") and drop the USD-converted versions.

In [None]:
# Drop ID ('name' is kept for the text analysis and dropped before modeling)
try:
  df.drop(labels = ['ID'], axis = 1, inplace = True)
except KeyError:
  print("Already dropped 'ID' column.")

# Filter on 'currency' = 'USD' and drop 'currency'
try:
  df = df[df.currency == 'USD'].drop(labels = ['currency'], axis = 1)
except (AttributeError, KeyError):
  print("Already dropped 'currency' column.")

# Filter on country = US and drop country
try:
  df = df[df.country == 'US'].drop(labels = ['country'], axis = 1)
except (AttributeError, KeyError):
  print("Already dropped 'country' column.")

# Drop usd pledged, usd_pledged_real, and usd_goal_real
try:
  df = df.drop(labels = ['usd pledged','usd_pledged_real', 'usd_goal_real'], axis = 1)
except KeyError:
  print("Already dropped 'usd pledged','usd_pledged_real', and 'usd_goal_real' columns.")

print(df.shape)
display(df)
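
The `try`/`except` guards above exist so that the cell can be re-run without failing once a column is gone. A minimal sketch of the same idea, assuming a small made-up DataFrame (the rows and the helper name `keep_us_usd` are illustrative, not taken from the dataset or the notebook), uses the `errors='ignore'` option of `DataFrame.drop`:

```python
import pandas as pd

# Toy stand-in for the Kickstarter data; values are invented for illustration.
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "currency": ["USD", "GBP", "USD"],
    "country": ["US", "GB", "US"],
    "pledged": [100.0, 50.0, 0.0],
})

def keep_us_usd(frame):
    """Keep US projects funded in USD, then drop the identifier and filter columns."""
    out = frame
    # Only filter on a column if it is still present, so re-running is harmless.
    if "currency" in out.columns:
        out = out[out["currency"] == "USD"]
    if "country" in out.columns:
        out = out[out["country"] == "US"]
    # errors='ignore' skips labels that were already dropped.
    return out.drop(columns=["ID", "currency", "country"], errors="ignore")

once = keep_us_usd(toy)
twice = keep_us_usd(once)  # second call finds nothing left to filter or drop
print(once.shape, twice.shape)
```

Calling the helper a second time leaves the frame unchanged, which is the behavior the guarded cell is after.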

#### Filtering Rows

We want to look only at completed campaigns, so we can train our classifier on data that had a definite outcome (shown in the 'state' variable). We also keep only projects with a goal and a pledged amount of at most $50,000, which we judged to be a reasonable ceiling, and medium-term projects (a term of at most 1 year).



In [None]:
print(df.state.value_counts())

# Filter out postings that are still active or undefined
df = df[df.state != "undefined"]
df = df[df.state != "live"]
df.state.value_counts()

# Drop projects with a goal over 50k
df.loc[df.goal > 50000,'goal']=None
# Drop projects with a term over 1 year
df.loc[df.term >365, 'term']=None
# Drop outliers with over 50k pledged
df.loc[df.pledged > 50000, 'pledged']=None
df.dropna(inplace=True)

In [None]:
# Confirm that active and undefined projects have been removed.
print(df.state.value_counts())
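
The filtering cell marks out-of-range values as missing and then calls `dropna`, which also removes rows that were missing values for other reasons. As a sketch of an alternative that filters only on the stated conditions, here is a combined boolean mask on invented rows (the frame `toy` and its values are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["successful", "failed", "live", "canceled", "undefined"],
    "goal": [5000, 80000, 1000, 20000, 3000],
    "pledged": [6000, 10, 500, 0, 100],
})

# One combined boolean mask: finished campaigns with goal and pledged at or below 50k.
finished = ~toy["state"].isin(["live", "undefined"])
in_range = (toy["goal"] <= 50000) & (toy["pledged"] <= 50000)
filtered = toy[finished & in_range]
print(filtered)
```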

#### Visualizing Outliers and Correlation

In this section, we visualize the data to get a better understanding of outliers, trends, and correlation. We recode the state variable so that every project that is not "successful" is considered "unsuccessful". Again, this applies to completed campaigns only: those labeled "live" or "undefined" were removed.

In [None]:
# Note: temp_df is an alias of df (not a copy), so 'state' is also encoded in df.
# Later cells rely on df.state being 1 (successful) or 0 (unsuccessful).
temp_df = df
# Any state other than 'successful' counts as unsuccessful
temp_df['state'] = (temp_df['state'] == 'successful').astype('uint8')

In [None]:
fig, axarr = plt.subplots(4, 2, figsize=(30, 40))
(temp_df[temp_df['state']==1]).main_category.value_counts().plot(kind='bar', ax=axarr[0][0])
temp_df.groupby('main_category')['state'].mean().plot(kind='bar', ax=axarr[0][1])
temp_df.groupby('year')['state'].mean().plot(kind='bar', ax=axarr[1][0])
temp_df.groupby('mnth_lnch')['state'].mean().plot(kind='bar', ax=axarr[1][1])
temp_df.groupby('dow_lnch')['state'].mean().plot(kind='bar', ax=axarr[2][0])
temp_df.groupby('hour_lnch')['state'].mean().plot(kind='bar', ax=axarr[2][1])
temp_df.groupby('mnth_ddln')['state'].mean().plot(kind='bar', ax=axarr[3][0])
temp_df.groupby('term')['state'].mean().plot( ax=axarr[3][1])
facet = sns.FacetGrid(temp_df, hue="state",aspect=4)
facet.map(sns.kdeplot, 'goal', fill=True)  # 'fill' replaces the deprecated 'shade' argument (seaborn 0.11+)
facet.set(xlim=(temp_df['goal'].min(), temp_df['goal'].max()))
facet.add_legend()
plt.show()


In [None]:
plt.subplots(figsize=(18,12))
# Correlation is only defined for numeric columns
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap="Blues")
plt.show()

#### Remove Outliers

To identify possible outliers, we create two visuals: histograms and boxplots. A histogram shows how many values fall into each bin, while a boxplot shows the minimum, maximum, and IQR (interquartile range) of a dataset. After looking at our visualizations to get a better understanding of outliers, we filter the data even further by keeping campaigns with a term under 61 days. Finally, after we confirm that we have no missing data, our data is ready for analysis using machine learning.

Kickstarter funding is all-or-nothing: a campaign that does not reach its goal receives none of the pledged money. Because we want projects that are likely to succeed, we further restrict the data to goals under $10,000, which are more attainable.
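
Boxplots flag points beyond the whiskers, which by default sit 1.5 times the IQR outside the quartiles. As a rough sketch of that rule (the sample values and the helper name `iqr_bounds` are made up for illustration; the notebook itself applies fixed cutoffs rather than this rule):

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences used by the common k * IQR outlier rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical campaign lengths in days
terms = np.array([10, 20, 30, 30, 30, 40, 45, 60, 90, 365])
low, high = iqr_bounds(terms)
outliers = terms[(terms < low) | (terms > high)]
print(low, high, outliers)
```

A data-driven fence like this can be compared against the hand-picked cutoffs to check that they are not discarding too much of the data.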

In [None]:
# Keep only projects with a goal under 10k
df = df.loc[df.goal < 10000]

In [None]:
# Display the distribution of numeric features to observe outliers.


numeric_features =['goal', 'pledged', 'term']
for col in numeric_features:
  f, axes = plt.subplots(1, 2, figsize=(8, 4)) 
  df[col].hist(bins = 30, ax = axes[0])
  axes[0].set_title('Distribution of '+ col)
  df.boxplot(column = col, ax = axes[1])
  plt.show()

In [None]:
# Share of projects that fall below the goal and pledged cutoffs
print('% under 25k goal: ' + str((df['goal']<25000).sum()/len(df)*100))
print('% under 50k funded: ' + str((df['pledged']<50000).sum()/len(df)*100))

In [None]:
# Keep only campaigns with a term under 61 days
df = df.loc[df.term < 61]
# Drop outliers with 50k or more pledged
df = df.loc[df.pledged < 50000]

In [None]:
# Run visualization again to see distribution
numeric_features =['goal', 'pledged', 'term']
for col in numeric_features:
  f, axes = plt.subplots(1, 2, figsize=(8, 4)) 
  df[col].hist(bins = 30, ax = axes[0])
  axes[0].set_title('Distribution of '+ col)
  df.boxplot(column = col, ax = axes[1])
  plt.show()

In [None]:
# check for missing data
print(df.isnull().sum())
print(df.dtypes)

#### Data Type Casting

We type cast our variables appropriately so they can be properly read by machine learning models.

In [None]:
#Set Categorical data to category type
df.state = df.state.astype('category')
df.dow_lnch = df.dow_lnch.astype('category')
df.main_category = df.main_category.astype('category')

In [None]:
df['mnth_lnch']=df['mnth_lnch'].astype('category')
df['mnth_ddln']=df['mnth_ddln'].astype('category')
df['hour_lnch']=df['hour_lnch'].astype('category')
df['5friends'] = df['5friends'].astype('uint8')
df['10friends'] = df['10friends'].astype('uint8')
df['15friends'] = df['15friends'].astype('uint8')
print(df.dtypes)

#### Feature Engineering

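We create features for year, month, day of the week, and hour from the launch and deadline timestamps as potential predictors. We also create the feature "term" (campaign length in days) to replace the "deadline" and "launched" features, since this information can be expressed in one feature. Finally, we drop the features that were used to engineer the new features, to avoid multicollinearity, along with the "category" column. Because the filtering, visualization, and type-casting cells earlier in this section already reference these engineered columns (for example `term` and `year`), the cells below need to be run before them.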

**Definitions of key new variables:**

1. **5friends** = Does the project have at least 5 backers?
    * Expressed as a binary categorical variable (1 for Yes, 0 for No)
    * Similar variables created: **10friends** and **15friends**  
    

2. **over_goal_50** = Does the project have a pledged amount that is more than 150% of the goal?
    * Expressed as a binary categorical variable (1 for Yes, 0 for No)
    * Similar variables created: **over_goal_20** and **over_goal_30**  
    

3. **predicted_state** = These are predictions of whether a project will be successful using the Random Forest classifier we trained and refined.
    * Expressed as a binary categorical variable (1 for Yes, 0 for No)

In [None]:
# add month, day of the week, and hour of posting
df['deadline'] = pd.to_datetime(df['deadline'])
df['launched'] = pd.to_datetime(df['launched'])
df['year'] = pd.DatetimeIndex(df['launched']).year
df['mnth_lnch'] = pd.DatetimeIndex(df['launched']).month
df['dow_lnch']=df['launched'].dt.day_name()
df['hour_lnch'] = pd.DatetimeIndex(df['launched']).hour
df['mnth_ddln'] = pd.DatetimeIndex(df['deadline']).month

# add term length
df['term'] = df['deadline'] - df['launched']
df['term'] = pd.TimedeltaIndex(df['term']).days

#drop unnecessary columns
df.drop(['deadline', 'launched', 'category'], axis = 1, inplace = True)

In [None]:
# adding categorical variables for backers greater than 5, 10, and 15
df['5friends'] = np.where(df['backers'] >= 5, 1, 0)
df['10friends'] = np.where(df['backers'] >= 10, 1, 0)
df['15friends'] = np.where(df['backers'] >= 15, 1, 0)
print(df['5friends'].value_counts())
print(df['10friends'].value_counts())
print(df['15friends'].value_counts())
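
To make the date features concrete, here is a small sketch on two invented campaigns. It uses the `.dt` accessor rather than `pd.DatetimeIndex`, which is an equivalent way to read the same fields; the frame `toy` and its timestamps are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "launched": ["2015-03-02 14:30:00", "2016-12-25 09:00:00"],
    "deadline": ["2015-04-01", "2017-01-24"],
})
toy["launched"] = pd.to_datetime(toy["launched"])
toy["deadline"] = pd.to_datetime(toy["deadline"])

toy["year"] = toy["launched"].dt.year
toy["mnth_lnch"] = toy["launched"].dt.month
toy["dow_lnch"] = toy["launched"].dt.day_name()
toy["hour_lnch"] = toy["launched"].dt.hour
# .dt.days keeps whole days only, so a partial final day is truncated.
toy["term"] = (toy["deadline"] - toy["launched"]).dt.days
print(toy[["year", "mnth_lnch", "dow_lnch", "hour_lnch", "term"]])
```

Note that `term` counts complete days between launch and deadline, so a campaign that runs 29 days and some hours is recorded as 29.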

### 3. Comparing Models for Classification

___

We want to compare different classifiers to see what works best with our data, and use the winner for further analysis. First we will prepare the data set for the machine learning algorithms. The models we evaluate are the following:

*   Random Forest
*   Support Vector Machine
*   Logistic Regression
*   Decision Tree Classifier (with AdaBoost)
*   Bagging Classifier
*   K-Nearest Neighbors

#### Prepare Data for Machine Learning

We will use 2015 data to evaluate different classifiers. Choosing 2015 was an arbitrary subset of the data - we will check if the chosen model generalizes to other years' data later in the notebook.

Preparing the data involves dropping "5friends" and "15friends", and keeping "10friends". For further explanation on why we went about preparing the data this way, please refer to the Appendix.

In [None]:
# Define functions to evaluate our models

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

def show_results(y_test, y_pred):

  # Output the accuracy of our prediction
  print(f"Accuracy = {round(accuracy_score(y_test, y_pred),4)}")

  # Visualize the confusion matrix to make it easier to read
  con_matrix = confusion_matrix(y_test, y_pred)
  confusion_matrix_df = pd.DataFrame(con_matrix, ('Unsuccessful', 'Successful'), ('Unsuccessful', 'Successful'))
  heatmap = sns.heatmap(confusion_matrix_df, annot=True, annot_kws={"size": 20}, fmt="d", cmap="Blues")
  heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize = 14)
  heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize = 14)
  plt.ylabel('Actual', fontsize = 14)
  plt.xlabel('Predicted', fontsize = 14)

  # Print the classification report
  from sklearn.metrics import classification_report
  print(classification_report(y_test, y_pred))

def best_model(model):
    print(model.best_score_)    
    print(model.best_params_)
    print(model.best_estimator_)  

In [None]:
# Create evaluate_model function
def evaluate_model(predictions, probs, train_predictions, train_probs):
    """Compare machine learning model to baseline performance.
    Computes statistics and shows ROC curve."""
    
    baseline = {}
    
    baseline['recall'] = recall_score(y_test, [1 for _ in range(len(y_test))])
    baseline['precision'] = precision_score(y_test, [1 for _ in range(len(y_test))])
    baseline['roc'] = 0.5
    
    results = {}
    
    results['recall'] = recall_score(y_test, predictions)
    results['precision'] = precision_score(y_test, predictions)
    results['roc'] = roc_auc_score(y_test, probs)
    
    train_results = {}
    train_results['recall'] = recall_score(y_sm, train_predictions)
    train_results['precision'] = precision_score(y_sm, train_predictions)
    train_results['roc'] = roc_auc_score(y_sm, train_probs)
    
    for metric in ['recall', 'precision', 'roc']:
        print(f'{metric.capitalize()} Test: {round(results[metric], 2)} Train: {round(train_results[metric], 2)}')
    
    # Calculate false positive rates and true positive rates
    base_fpr, base_tpr, _ = roc_curve(y_test, [1 for _ in range(len(y_test))])
    model_fpr, model_tpr, _ = roc_curve(y_test, probs)

    plt.figure(figsize = (8, 6))
    plt.rcParams['font.size'] = 16
    
    # Plot both curves
    plt.plot(base_fpr, base_tpr, 'b', label = 'baseline')
    plt.plot(model_fpr, model_tpr, 'r', label = 'model')
    plt.legend();
    plt.xlabel('False Positive Rate'); plt.ylabel('True Positive Rate'); plt.title('ROC Curves');
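
To make the baseline in `evaluate_model` concrete: a classifier that predicts "successful" for every project has a recall of 1.0 and a precision equal to the share of successful projects. A small sketch with made-up labels (the helper `precision_recall` is hypothetical and written in plain Python for clarity, rather than calling scikit-learn):

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = [1, 0, 0, 1, 0, 1, 0, 0]      # invented labels: 3 of 8 successful
always_one = [1] * len(y_true)         # the "predict everything successful" baseline
model_pred = [1, 0, 0, 1, 1, 0, 0, 0]  # invented model output

print(precision_recall(y_true, always_one))
print(precision_recall(y_true, model_pred))
```

A model is only useful to an investor if its precision clearly beats the baseline's, since the baseline already achieves perfect recall by selecting everything.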

In [None]:
# Drop highly influential variables, keep only "10friends"
df_classifier2015 = df[df.year== 2015]
df_classifier2015 = df_classifier2015.drop(['pledged','year','backers','name'], axis = 1)

df_classifier2015 = df_classifier2015.drop(['5friends','15friends'], axis = 1)

display(df_classifier2015)

In [None]:
df_classifier2015.state.value_counts()

In [None]:
# We typecast variables accordingly and make sure all data types are correct.
df_classifier2015['state'] = df_classifier2015['state'].astype('category')
df_classifier2015.dtypes

In [None]:
# Subset the data into predictor variables (X) and the target variable (y)
X = df_classifier2015.loc[:, df_classifier2015.columns != 'state']

y = df_classifier2015.state

# Now we need to one hot encode the categorical features to make them machine readable. 
X = pd.get_dummies(X)

print(f"{color.BOLD}Predictor Variables for Kickstarter Data{color.END} - {X.shape[1]} columns x {X.shape[0]:,d} rows\n")

display(X.head())
try: 
  repmap={"successful": 1, "unsuccessful": 0}
  y.replace(repmap, inplace=True)
except:
  print(f"\n{color.BOLD}~~~{color.END} 'state' already encoded during this runtime execution. {color.BOLD}~~~{color.END}\n")

print(f"{color.BOLD}Target Variable for Kickstarter Data{color.END} - {y.shape[0]:,d} rows\n")
display(y.head())

In [None]:
# We want to scale our continuous numeric data to optimize our models

def scale_numeric(features, numeric_features, scaler):
    for col in numeric_features:
        features[col] = scaler.fit_transform(features[col].values.reshape(-1, 1))
    return features

numeric_features = ['goal','term']

#2 we can now define the scaler we want to use and apply it to our features 
from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = MinMaxScaler()
X_scaled = scale_numeric(X, numeric_features, scaler)

#3 Let's see if it worked
X_scaled.describe()
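
The cell above fits the scaler on the full 2015 feature matrix. One alternative, shown here only as a sketch with invented numbers and hypothetical helper names (`fit_min_max`, `apply_min_max`), is to record the scaling bounds on the training data and reuse them on any later data, so that the same raw goal always maps to the same scaled value. With scikit-learn this corresponds to calling `fit` on the training data and `transform` on everything else.

```python
import numpy as np

def fit_min_max(column):
    """Record the min and max of a training column."""
    return float(np.min(column)), float(np.max(column))

def apply_min_max(column, lo, hi):
    """Scale a column with previously recorded bounds: (x - lo) / (hi - lo)."""
    return (np.asarray(column, dtype=float) - lo) / (hi - lo)

train_goal = np.array([500.0, 1000.0, 5000.0, 9500.0])  # invented training goals
new_goal = np.array([750.0, 9500.0, 12000.0])           # invented goals from another year

lo, hi = fit_min_max(train_goal)
train_scaled = apply_min_max(train_goal, lo, hi)
new_scaled = apply_min_max(new_goal, lo, hi)  # values may fall outside [0, 1]
print(train_scaled, new_scaled)
```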

In [None]:
# Basic imports
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Split data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y,
test_size=0.3,
stratify=y,
random_state=42)

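We need to take care of class imbalance. To do so, we use SMOTE (Synthetic Minority Over-sampling Technique) from the imbalanced-learn (`imblearn`) package, applied to the training set only.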

> Source: https://www.analyticsvidhya.com/blog/2020/10/overcoming-class-imbalance-using-smote-techniques/





In [None]:
# Handles class imbalance
# Source: https://www.analyticsvidhya.com/blog/2020/10/overcoming-class-imbalance-using-smote-techniques/

from imblearn.over_sampling import SMOTE
from collections import Counter
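# Count classes in the training set, since that is what SMOTE resamples
counter= Counter(y_train)
print('Before', counter)

smt=SMOTE(random_state=42)  # fixed seed for reproducible synthetic samples
x_sm, y_sm = smt.fit_resample(X_train, y_train)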
counter= Counter(y_sm)
print('After', counter)
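
SMOTE balances the classes by creating synthetic minority-class rows that lie between an existing minority row and one of its nearest minority-class neighbors. The sketch below illustrates only that interpolation step on two invented rows; it is a simplified illustration, not the imbalanced-learn implementation, and the helper `interpolate_sample` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

def interpolate_sample(x, neighbor, rng):
    """Create one synthetic point on the segment between x and a same-class neighbor."""
    lam = rng.random()  # uniform in [0, 1)
    return x + lam * (neighbor - x)

# Two invented minority-class rows with scaled (goal, term) features
minority = np.array([[0.10, 0.30], [0.20, 0.50]])
synthetic = interpolate_sample(minority[0], minority[1], rng)
print(synthetic)
```

Because the synthetic rows are built from training rows, resampling is applied after the train/test split so that the test set contains only real campaigns.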

#### Random Forest

In [None]:
# Random Forest

# Instantiate a random forests classifier, let's call it 'rf'
rf = RandomForestClassifier(n_estimators=25, 
                            bootstrap = True, 
                            max_features = 'sqrt',  # same as the former 'auto' for classifiers
                            min_samples_leaf = 5, 
                            criterion='gini',
                            random_state=42)

# Fit 'rf' to the training set
rf.fit(x_sm, y_sm)

# Predict the test set labels 'y_pred'
y_pred = rf.predict(X_test)
show_results(y_test, y_pred)



In [None]:
# Let's evaluate our Tree's performance using AUC
# Make probability predictions
train_probs = rf.predict_proba(x_sm)[:, 1]
probs = rf.predict_proba(X_test)[:, 1]

train_predictions = rf.predict(x_sm)
predictions = rf.predict(X_test)

from sklearn.metrics import precision_score, recall_score, roc_auc_score, roc_curve
print(f'Train ROC AUC Score: {roc_auc_score(y_sm, train_probs)}')
print(f'Test ROC AUC  Score: {roc_auc_score(y_test, probs)}')
print(f'Baseline ROC AUC: {roc_auc_score(y_test, [1 for _ in range(len(y_test))])}')

# Call our ROC evaluation function
evaluate_model(predictions, probs, train_predictions, train_probs)

#### Support Vector Machine

In [None]:
#Create a svm Classifier
from sklearn.svm import SVC

svmachine = SVC(C=100, gamma=0.1, kernel='poly', max_iter=1000, random_state=42) 
     # C = regularization (penalty), gamma = kernel coefficient (over- vs. underfitting), kernel = transformation function

#Train the model using the training sets
svmachine.fit(x_sm, y_sm)

#Predict the response for test dataset
y_pred = svmachine.predict(X_test)
show_results(y_test, y_pred)

#### Logistic Regression

In [None]:
# Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Fit primal logistic regression
param_grid = {'C': [10,50,100,200,300], 'max_iter': [1000], 'fit_intercept':[True],'intercept_scaling':[1],
              'penalty':['l2'], 'tol':[0.001,0.0001,0.00001]}
log_primal_Grid = GridSearchCV(LogisticRegression(solver='lbfgs', random_state=42),param_grid, cv=5, refit=True, verbose=0)
log_primal_Grid.fit(x_sm, y_sm);

In [None]:
best_model(log_primal_Grid)

In [None]:
y_pred = log_primal_Grid.predict(X_test)
show_results(y_test, y_pred)

#### Decision Tree Classifier (with AdaBoost)

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

# Instantiate a classification-tree, let's call it 'dt'
dt = DecisionTreeClassifier(max_depth=1, criterion='gini', min_samples_leaf = 10, splitter = "random")

# Instantiate an AdaBoost classifier, let's call it 'adb_clf'
# (the tree is passed positionally because newer scikit-learn renamed 'base_estimator' to 'estimator')
adb_clf = AdaBoostClassifier(dt, n_estimators=25,random_state=42, learning_rate=.1)

# Fit 'adb_clf' to the training set
adb_clf.fit(x_sm, y_sm)

# Predict the test data
y_pred = adb_clf.predict(X_test)
show_results(y_test, y_pred)

#### Bagging Classifier

In [None]:
from sklearn.ensemble import BaggingClassifier

# Instantiate a classification-tree, let's call it 'dt'
dt = DecisionTreeClassifier(criterion='gini', random_state = 22)

dt.fit(x_sm,y_sm)

# Predict test-set labels
y_pred= dt.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print('Accuracy of Decision Classifier: {:.3f}'.format(accuracy))

# Instantiate a BaggingClassifier, let's call it 'bc'
# (the tree is passed positionally because newer scikit-learn renamed 'base_estimator' to 'estimator')
bc = BaggingClassifier(dt, n_estimators=25, n_jobs=-1,random_state=22)

# Fit 'bc' to the training set
bc.fit(x_sm, y_sm)

# Predict test set labels
y_pred = bc.predict(X_test)

# Evaluate and print test-set accuracy
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy of Bagging Classifier: {:.3f}'.format(accuracy))

#### K-Nearest Neighbors

In [None]:
# import the k-nearest neighbors classifier from sci-kit learn
from sklearn.neighbors import KNeighborsClassifier

# Instantiate the KNeighborsClassifier with a n_neighbors value of 3
knn = KNeighborsClassifier(n_neighbors=3)

# Fit the model
knn.fit(x_sm, y_sm)

In [None]:
# Run prediction on test data
y_pred = knn.predict(X_test)
print("Test set predictions: \n {}".format(y_pred))

# Calculate the accuracy of our prediction using np.mean
print("Accuracy of Prediction: {:.2f}".format(np.mean(y_pred==y_test)))

# Alternatively, we can use knn's internal score function
print("Accuracy of Prediction: {:.2f}".format(knn.score(X_test, y_test)))

We evaluate our models using confusion matrices and accuracy scores. After comparing models and some team discussion, we decide to go with Random Forest as our best model.

### 4. Does Model Generalize Well?

___


In the previous section, we tested and evaluated models on a subset of our data, namely 2015 data. We need to verify that our chosen model, Random Forest, works well on the other data so that we remain confident in its ability to predict success of a campaign.

#### Generalize to 20XX data

We can look at data from different years to see if the model generalizes well. In this example, we test the Random Forest on 2017 data.

In [None]:
df_classifier2017 = df[df.year== 2017]
df_classifier2017 = df_classifier2017.drop(['pledged','year','backers','name'], axis = 1)

df_classifier2017 = df_classifier2017.drop(['5friends','15friends'], axis = 1)

display(df_classifier2017)

In [None]:
df_classifier2017.state.value_counts()

In [None]:
df_classifier2017['state'] = df_classifier2017['state'].astype('category')
df_classifier2017.dtypes

In [None]:
X = df_classifier2017.loc[:, df_classifier2017.columns != 'state']

y = df_classifier2017.state

# Now you need to one hot encode the categorical features to make them machine readable. 
X = pd.get_dummies(X)

print(f"{color.BOLD}Predictor Variables for Kickstarter Data{color.END} - {X.shape[1]} columns x {X.shape[0]:,d} rows\n")

display(X.head())
try: 
  repmap={"successful": 1, "unsuccessful": 0}
  y.replace(repmap, inplace=True)
except:
  print(f"\n{color.BOLD}~~~{color.END} 'state' already encoded during this runtime execution. {color.BOLD}~~~{color.END}\n")

print(f"{color.BOLD}Target Variable for Kickstarter Data{color.END} - {y.shape[0]:,d} rows\n")
display(y.head())

In [None]:
#2 we can now define the scaler we want to use and apply it to our features 
from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = MinMaxScaler()
X_scaled = scale_numeric(X, numeric_features, scaler)

#3 Let's see if it worked
X_scaled.describe()

In [None]:
# Here, we use the Random Forest already created in the "Random Forest" section

# Predict the test set labels 'y_pred'
y_pred = rf.predict(X_scaled)
show_results(y, y_pred)
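
A fitted scikit-learn model expects new data to have the same columns, in the same order, as the data it was trained on. Because `pd.get_dummies` only creates columns for categories that appear in the frame it is given, a different year can produce a different set of columns. A minimal sketch of one way to guard against this, using invented frames named `X_2015` and `X_2017`:

```python
import pandas as pd

# Invented one-hot encoded frames: the second year lacks some categories and adds a new one.
X_2015 = pd.get_dummies(pd.DataFrame({"goal": [0.2, 0.8, 0.5],
                                      "main_category": ["Art", "Games", "Music"]}))
X_2017 = pd.get_dummies(pd.DataFrame({"goal": [0.4, 0.9],
                                      "main_category": ["Games", "Dance"]}))

# Align to the training layout: missing columns are filled with 0, unseen ones are dropped.
X_2017_aligned = X_2017.reindex(columns=X_2015.columns, fill_value=0)
print(list(X_2017_aligned.columns))
```

If the columns already match, `reindex` leaves the frame unchanged, so the step is safe to apply unconditionally.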

### 5. Predicting Amount over Goal

___

Campaigns with especially strong funding are great indicators of the strong faith that backers have in a particular project. We believe that a pledged amount of more than 150% of a project's goal shows remarkable interest in that project, and we want to be able to identify these types of campaigns for investors.


#### Preparing Data for Over 50% Excess Funding

We prepare the data the same way it was prepared in the "Prepare Data for Machine Learning" section, except we do it for our entire dataset. We also create a new column, "predicted_state", to create predictions for whether a project is successful using our Random Forest classifier that we created earlier. We then add this column to our Dataframe.






In [None]:
df_classifierOverGoal = df.drop(['pledged','year','backers','name'], axis = 1)

df_classifierOverGoal = df_classifierOverGoal.drop(['5friends','15friends'], axis = 1)

display(df_classifierOverGoal)

df_classifierOverGoal['state'] = df_classifierOverGoal['state'].astype('category')


In [None]:
X = df_classifierOverGoal.loc[:, df_classifierOverGoal.columns != 'state']

y = df_classifierOverGoal.state

# Now we need to one hot encode the categorical features to make them machine readable. 
X = pd.get_dummies(X)

print(f"{color.BOLD}Predictor Variables for Kickstarter Data{color.END} - {X.shape[1]} columns x {X.shape[0]:,d} rows\n")

display(X.head())
try: 
  repmap={"successful": 1, "unsuccessful": 0}
  y.replace(repmap, inplace=True)
except:
  print(f"\n{color.BOLD}~~~{color.END} 'state' already encoded during this runtime execution. {color.BOLD}~~~{color.END}\n")

print(f"{color.BOLD}Target Variable for Kickstarter Data{color.END} - {y.shape[0]:,d} rows\n")
display(y.head())

In [None]:
numeric_features = ['goal','term']

#2 we can now define the scaler we want to use and apply it to our features 
from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = MinMaxScaler()
X_scaled = scale_numeric(X, numeric_features, scaler)

#3 Let's see if it worked
X_scaled.describe()

predicted_values = rf.predict(X_scaled)
df['predicted_state'] = predicted_values

In [None]:
display(X_scaled)

#### Excess Funding Feature Engineering

We create the following columns and add them to our dataset: 'over_goal_50', 'over_goal_20', and 'over_goal_30'. These are binary variables that take the value 1 for projects whose pledged amount exceeds the goal by more than 50%, 20%, and 30%, respectively.

In [None]:
# Flag projects whose pledged amount exceeds the goal by more than 50%, 20%, and 30%
thresholds = {'over_goal_50': 1.5, 'over_goal_20': 1.2, 'over_goal_30': 1.3}
funding_ratio = df['pledged']/df['goal']

for col, threshold in thresholds.items():
  # The boolean comparison is cast directly to 0 or 1
  df[col] = (funding_ratio > threshold).astype('uint8')

# Quickly see if our new columns have indeed been added to the data
display(df[['pledged','goal','over_goal_50','over_goal_20','over_goal_30']].sample(n=10, random_state = 42))

In [None]:
display(df)

#### Creating X and Y for Classifier



After performing scenario analysis to see which of the three "over_goal" variables we should use as our target variable, we settled on using 'over_goal_50', so we dropped the other two.


To build on the success of our Random Forest classifier, we subset the data based on how our Random Forest predicted the success of a project (i.e. where 'predicted_state' = 1), and ran another Random Forest on this new subset to predict whether a project will have a 'pledged' value that is over 50% in excess of its goal.






In [None]:

pred_Successful = df[df['predicted_state'] == 1]
print(pred_Successful.columns)
print(pred_Successful.dtypes)
pred_Successful = pred_Successful.drop(columns=['predicted_state','pledged','state','backers','over_goal_20','over_goal_30', '5friends','15friends','name'])

pred_Successful = pd.get_dummies(pred_Successful)

overgoal_rf = pred_Successful.copy()

overgoal_rf_Y = overgoal_rf.over_goal_50

overgoal_rf_X = overgoal_rf.loc[:, overgoal_rf.columns != 'over_goal_50']

print(overgoal_rf_X.columns)
print(overgoal_rf_Y)

In [None]:
display(overgoal_rf)

#### Classifier Output

We create a new Random Forest classifier and train it on our new subset of data, to see how well it predicts 'over_goal_50'.

In [None]:
# Basic imports
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Split data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(overgoal_rf_X, overgoal_rf_Y,
test_size=0.3,
random_state=42)

# Instantiate a random forests classifier
rf = RandomForestClassifier(n_estimators=25, 
                            bootstrap = True, 
                            max_features = 'sqrt',  # same as the former 'auto' for classifiers
                            min_samples_leaf = 5, 
                            criterion='gini',
                            random_state=42)

# Fit 'rf' to the training set
rf.fit(X_train, y_train)

# Predict the test set labels 'y_pred'
y_pred = rf.predict(X_test)

# Output the accuracy of our prediction
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

# Visualize the confusion matrix to make it easier to read
con_matrix = confusion_matrix(y_test, y_pred)
confusion_matrix_df = pd.DataFrame(con_matrix, ('Over Goal By < 50%', 'Over Goal By > 50%'), ('Over Goal By < 50%', 'Over Goal By > 50%'))
heatmap = sns.heatmap(confusion_matrix_df, annot=True, annot_kws={"size": 20}, fmt="d", cmap="Blues")
heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize = 14)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize = 14)
plt.ylabel('Actual', fontsize = 14)
plt.xlabel('Predicted', fontsize = 14)

# Print the classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

To sum up, we first used a Random Forest classifier to predict whether a campaign will be successful. We then trained a second Random Forest classifier on the projects that the first one predicted to be successful, to predict whether a project will attract pledges more than 50% in excess of its goal. Evaluating the combination of the two classifiers, we arrive at the following result:





In [None]:
# Work on a copy so that adding a column does not modify a view of df
df_successful = df[df.predicted_state == 1].copy()
predicted_over50_values = rf.predict(overgoal_rf_X)
df_successful['over50'] = predicted_over50_values
df_successful = df_successful[df_successful['over50']==1].copy()
df_successful.state = df_successful.state.astype('uint8')

print(f'Of the campaigns that we predicted to be successful and reach 150% of their goal, {color.BOLD}{df_successful.state.mean()*100}%{color.END} were actually successful\n')
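
The figure printed above is the precision of the two-stage filter with respect to actual success: of the campaigns that pass both classifiers, what share really succeeded. As an illustration with invented values (in the notebook, `over50` is only computed for rows where `predicted_state` is 1):

```python
import pandas as pd

toy = pd.DataFrame({
    "state":           [1, 1, 0, 1, 0, 1],  # actual outcome
    "predicted_state": [1, 1, 1, 1, 0, 0],  # stage 1: predicted successful?
    "over50":          [1, 0, 1, 1, 0, 0],  # stage 2: predicted to exceed 150% of goal?
})

# Keep only rows that pass both stages, then measure how many truly succeeded.
both = toy[(toy["predicted_state"] == 1) & (toy["over50"] == 1)]
hit_rate = both["state"].mean()
print(f"{len(both)} campaigns selected, {hit_rate:.0%} actually successful")
```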

#### Visualization of Predicted Successful and Over 50% Funding in Excess of Goal

In this section, we create visualizations of our final dataset that contains predictions made by our two Random Forest classifiers.

In [None]:
fig, axarr = plt.subplots(4, 2, figsize=(30, 40))
df_successful.main_category.value_counts().plot(kind='bar', ax=axarr[0][0])
df_successful.groupby('year')['state'].sum().plot(kind='bar', ax=axarr[1][0])
df_successful.groupby('mnth_lnch')['state'].sum().plot(kind='bar', ax=axarr[1][1])
df_successful.groupby('dow_lnch')['state'].sum().plot(kind='bar', ax=axarr[2][0])
df_successful.groupby('hour_lnch')['state'].sum().plot(kind='bar', ax=axarr[2][1])
df_successful.groupby('mnth_ddln')['state'].sum().plot(kind='bar', ax=axarr[3][0])
df_successful.groupby('term')['state'].sum().plot( ax=axarr[3][1])

### 6. Appendix

#### 10 Friends Reasoning

Knowing that individuals are less likely to give if no one else has done so, we wanted to find a way to quantify this momentum effect. To do this, we looked at projects with fewer than 50 backers and compared all of them against those with more than 10 backers.

In [None]:
# Projects with more than 10 but fewer than 50 backers
df_10friends = df[(df.backers > 10) & (df.backers < 50)]

In [None]:
df_no10friends=df[df.backers<50]

In [None]:
df_10friends.backers.plot(kind='box')

In [None]:
df_no10friends.backers.plot(kind='box')

From this, we see that it is far more difficult to get from 0 to 10 backers than from 10 to 20. When we filter to projects with fewer than 50 backers, the average number of backers is less than 10, but once 10 is reached, the average is 25. This shows the importance of momentum and how important the first few backers are in rallying others to give money to a project.

#### Text Analysis of Successful Projects

Finally, to understand how to best phrase a posting, we used text analysis to find the top 10 nouns and modifiers used in the names of successful posts for each category.

In [None]:
# importing package for text analysis
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
import spacy
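# The 'en' shortcut is not available in spaCy 3+; install the model with: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')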
from nltk.tokenize import word_tokenize
from collections import Counter
from nltk.corpus import stopwords

In [None]:
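# Loops through data and selects only adjectival modifiers ('amod') from successful posts of each category
# Takes 7-8 mins to run
adjective_counter = pd.DataFrame()
for cat in df.main_category.unique():
  string = ''
  for index, row in df.iterrows():
    if row['main_category']==cat:
      if row['state']==1:
        string= string+' ' +row['name']
  BoW = []
  doc = nlp(string)
  for token in doc:
    if token.dep_ == 'amod':
      BoW.append(token.text)
  # Normalize the tokens the same way as in the noun cell:
  # lowercase, then drop stopwords and non-alphabetic tokens
  lower_tokens = [t.lower() for t in BoW]
  no_stops = [t for t in lower_tokens
            if t not in stopwords.words('english')]
  no_punct = [w for w in no_stops
        if w.isalpha()]

  adjective_counter[cat]=Counter(no_punct).most_common(10)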

display(adjective_counter)
      

In [None]:
# Loops through data and selects only nouns from successful posts of each category
# Takes 6-7 mins to run
noun_counter = pd.DataFrame()
for cat in df.main_category.unique():
  string = ''
  for index, row in df.iterrows():
    if row['main_category'] == cat:
      if row['state'] == 1:
        string = string+' ' +row['name']
  BoW = []
  doc = nlp(string)
  for token in doc:
    if token.pos_ =='NOUN':
      BoW.append(token.text)
  lower_tokens = [t.lower() for t in BoW]
  no_stops = [t for t in lower_tokens
            if t not in stopwords.words('english')]
  no_punct = [w for w in no_stops
        if w.isalpha()]
  
  noun_counter[cat]=Counter(no_punct).most_common(10)

display(noun_counter)
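
The counting step in both text-analysis cells follows the same pattern: lowercase the tokens, drop stopwords and non-alphabetic tokens, then take the most common words. A stripped-down sketch of just that pattern, using plain whitespace splitting instead of spaCy's part-of-speech tags and a tiny hand-picked stopword set instead of NLTK's list (the titles, the `STOPWORDS` set, and the helper `top_words` are all invented for illustration):

```python
from collections import Counter

# Tiny hand-picked stopword set for illustration; the notebook uses NLTK's English list.
STOPWORDS = {"the", "a", "of", "for", "and"}

def top_words(names, n=3):
    """Lowercase, drop stopwords and non-alphabetic tokens, then count."""
    tokens = [w.lower() for name in names for w in name.split()]
    kept = [w for w in tokens if w.isalpha() and w not in STOPWORDS]
    return Counter(kept).most_common(n)

# Invented project titles
names = ["The Art of Games", "Games for Kids", "A Card Game - 2nd Edition", "Art Book"]
print(top_words(names, n=2))
```

Building the stopword collection once as a set, as done here, also avoids re-reading the stopword list for every token.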
      