# Problem Statement

Kickstarter is a popular crowdfunding platform. The exploratory analysis done previously examined the factors affecting the success of a crowdfunding campaign. Here the goal is to predict, before its actual deadline, whether a Kickstarter project will succeed or fail.

In [None]:
#Setting up required libraries and packages

In [None]:
import numpy as np
import pandas as pd
import os
from datetime import datetime
import time
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    roc_auc_score,
    classification_report,
)
import string

# Data Overview

The combined data covers 700,000+ projects from two data sources.
Below is a brief summary of the explanatory variables.

ID                :  It is the unique identifier for a project.

Name              :  Name of the project seeking crowdfunding.

Category          :  The category in which the project falls.

Main Category     :  The high-level category in which the project falls.

Currency          :  Self-explanatory.

Deadline          :  The date by which the campaign seeks to reach its funding goal.

Goal              :  The amount the crowdfunding campaign seeks to raise.

Launched          :  The date on which the project was launched.

Pledged           :  The amount pledged by the backers of the campaign.

State             :  The final state of the project, for example successful, failed or cancelled.

Backers           :  The number of users backing the project.

Country           :  The country of origin.

USD Pledged       :  The amount pledged for the project, in US dollars.

In [59]:
# Importing Data
df_kick_201801 = pd.read_csv("ks-projects-201801.csv",encoding = "utf-8",low_memory=False)
df_kick_201612 = pd.read_csv("ks-projects-201612.csv",encoding = "utf-8",low_memory=False)

Checking the structure of the two datasets

In [16]:
print(df_kick_201801.shape)
print(df_kick_201612.shape)

(378661, 15)
(323750, 17)


Keeping the relevant columns and merging the two datasets.
df_kick_main is the main working set for the analysis.

In [19]:
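# Keep the first 12 columns; they are assumed to match by position in both files
df_kick_201612_Workset = df_kick_201612.iloc[:, 0:12].copy()
df_kick_201801_Workset = df_kick_201801.iloc[:, 0:12].copy()
# Align column labels so the two frames stack column-for-column
df_kick_201612_Workset.columns = df_kick_201801_Workset.columns
df_kick_main = pd.concat([df_kick_201612_Workset, df_kick_201801_Workset], ignore_index=True)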

Aggregated Dataset

In [57]:
print(df_kick_main.shape)
df_kick_main.info()

In [None]:
Data Frame Shape

(702411, 12)

Data Frame Info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 702411 entries, 0 to 702410
Data columns (total 12 columns):
ID                702411 non-null int64
name              702403 non-null object
category          702406 non-null object
main_category     702411 non-null object
currency          702411 non-null object
deadline          702411 non-null object
goal              702411 non-null object
launched          702411 non-null object
pledged           702411 non-null object
state             702411 non-null object
backers           702411 non-null object
country           702411 non-null object
dtypes: int64(1), object(11)
memory usage: 64.3+ MB
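
The listing above shows goal, pledged and backers stored as object rather than as numbers, and the two date columns as object as well. A minimal sketch of coercing them, run on a toy frame: the sample values are hypothetical, and only the column names come from the listing above.

```python
import pandas as pd

# Toy frame with hypothetical values; column names follow the info() listing
toy = pd.DataFrame(
    {
        "goal": ["1000", "2500.5", "not a number"],
        "pledged": ["150", "3000", "0"],
        "backers": ["3", "42", "0"],
        "launched": ["2016-01-05 10:00:00", "2016-02-01 08:30:00", "2016-03-10 12:00:00"],
        "deadline": ["2016-02-04 10:00:00", "2016-03-02 08:30:00", "2016-04-09 12:00:00"],
    }
)

# errors="coerce" turns unparseable entries into NaN instead of raising
for col in ["goal", "pledged", "backers"]:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

# Parse the date columns so that they can be compared or subtracted later
for col in ["launched", "deadline"]:
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

print(toy.dtypes)
```

Whether the real files contain unparseable entries is not shown here; counting the NaN values after coercion is one way to find out.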

Segregating the variables as categorical and continuous

In [72]:
df_kick_main.columns = ['ID','name','category','main_category','currency','deadline','goal','launched','pledged','state','backers','country']

ks_cat_vars = ['category', 'main_category', 'currency', 'country']
# 'pledged' was listed twice; each continuous variable should appear once
ks_cont_vars = ['goal', 'pledged', 'backers']

Checking correlation among the continuous variables defined above. It can be seen that the pledged amount and backers are highly correlated (> 0.75). This is expected, since more backers generally means a larger amount pledged.


In [91]:
df_kick_main[ks_cont_vars]

In [92]:
import matplotlib.pyplot as plt

plt.matshow(df_kick_main[ks_cont_vars].corr())
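
Since `corr()` only considers numeric columns, and the earlier `info()` listing shows these columns as object, a conversion step may be needed first. The sketch below uses hypothetical toy values (not taken from the dataset) to show the conversion and then list the variable pairs whose correlation exceeds the 0.75 threshold mentioned above.

```python
import pandas as pd

# Hypothetical toy values, stored as strings like the object columns above
toy = pd.DataFrame(
    {
        "goal": ["1000", "5000", "2000", "800", "10000"],
        "pledged": ["50", "6000", "150", "900", "300"],
        "backers": ["2", "120", "5", "20", "8"],
    }
)

# Convert every column to numeric so that corr() has data to work with
numeric = toy.apply(pd.to_numeric, errors="coerce")
corr = numeric.corr()

# Collect each unordered pair whose absolute correlation exceeds the threshold
threshold = 0.75
cols = list(corr.columns)
high_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```

Listing the pairs as tuples makes the result easier to read than the unlabeled heat map, which carries no axis names by default.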

Filtering only for successful and failed projects for subsequent EDA

In [93]:
# .copy() gives an independent frame, so the assignment below does not trigger SettingWithCopyWarning
kick_projects = df_kick_main[df_kick_main['state'].isin(['failed', 'successful'])].copy()
# converting 'successful' state to 1 and 'failed' to 0
kick_projects['state'] = (kick_projects['state'] == 'successful').astype(int)
print(kick_projects.shape)

(612977, 12)


In [94]:
#checking distribution of projects across various main categories
kick_projects.groupby(['main_category','state']).size()
#kick_projects.groupby(['category','state']).size()

main_category  state
Art            0        26223
               1        21164
Comics         0         7442
               1        10341
Crafts         0        10333
               1         3786
Dance          0         2324
               1         4439
Design         0        26853
               1        18509
Fashion        0        25682
               1         9903
Film & Video   0        62557
               1        45027
Food           0        29571
               1        11341
Games          0        29016
               1        21903
Journalism     0         5794
               1         1881
Music          0        40945
               1        45960
Photography    0        12127
               1         6208
Publishing     0        43065
               1        22555
Technology     0        36963
               1        11496
Theater        0         7045
               1        12524
dtype: int64
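
The raw counts above are easier to compare as success rates. A small sketch, using the counts for three of the categories copied from the output above:

```python
import pandas as pd

# Counts copied from the groupby output above for three categories
counts = pd.DataFrame(
    {
        "main_category": ["Dance", "Dance", "Technology", "Technology", "Music", "Music"],
        "state": [0, 1, 0, 1, 0, 1],
        "n": [2324, 4439, 36963, 11496, 40945, 45960],
    }
)

# Pivot to one row per category and one column per outcome (0 = failed, 1 = successful)
table = counts.pivot(index="main_category", columns="state", values="n")

# Success rate = successful / (failed + successful), highest first
success_rate = (table[1] / (table[0] + table[1])).sort_values(ascending=False)
print(success_rate.round(3))
```

Because state is coded as 0/1, the same rates should also be obtainable directly from the full frame with `kick_projects.groupby('main_category')['state'].mean()`.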