# Data analysis for Kickstarter campaigns

Crowdfunding is a crowded space today, with Kickstarter and Indiegogo holding the largest market share. My goal is to identify the factors behind a successful campaign. A campaign is successful when the 'pledged' amount (the total money raised) is greater than or equal to the 'goal' (the total money requested) by the deadline.
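The success rule above can be written as a one-line check. This is only a minimal sketch: the function name `is_successful` is my own and is not part of the dataset, and it assumes both amounts are in the same currency and that 'pledged' is the total at the deadline.

```python
def is_successful(pledged: float, goal: float) -> bool:
    """Return True when the amount raised meets or exceeds the goal."""
    return pledged >= goal


# Hypothetical (pledged, goal) pairs for illustration only
examples = [(1200.0, 1000.0), (1000.0, 1000.0), (999.99, 1000.0)]
results = [is_successful(p, g) for p, g in examples]
print(results)  # [True, True, False]
```

Note that hitting the goal exactly counts as success, which is why the comparison is `>=` rather than `>`.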

I am using Kaggle data for this analysis, which contains the following features:

1. ID - unique ID
2. name - name of the campaign
3. category - level 2 category
4. main_category - level 1 category
5. currency
6. deadline - date stamp
7. goal - amount in local currency
8. launched - date stamp
9. pledged - amount in local currency
10. state - failed, successful, undefined, canceled, live
11. backers - total number of people who pledged
12. country - country where campaign is launched
13. usd_pledged - conversion to USD done by Kickstarter
14. usd_pledged_real - conversion to USD done by Fixer.io API
15. usd_goal_real - conversion to USD done by Fixer.io API


I will run through several techniques to understand and clean the dataset. A second notebook covers data exploration using visuals; this notebook is used strictly for data wrangling.


In [22]:
# Import all required packages

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import re

In [23]:
# Read dataset csv file downloaded from Kaggle website

df = pd.read_csv(r"C:\Users\Adi\Desktop\Data_Science\Capstone1\DataSet.csv")

# Check the basic structure for all features

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 378661 entries, 0 to 378660
Data columns (total 15 columns):
ID                  378661 non-null int64
name                378657 non-null object
category            378661 non-null object
main_category       378661 non-null object
currency            378661 non-null object
deadline            378661 non-null object
goal                378661 non-null float64
launched            378661 non-null object
pledged             378661 non-null float64
state               378661 non-null object
backers             378661 non-null int64
country             378661 non-null object
usd pledged         374864 non-null float64
usd_pledged_real    378661 non-null float64
usd_goal_real       378661 non-null float64
dtypes: float64(5), int64(2), object(8)
memory usage: 43.3+ MB


## Update column names

The output above shows that 'name' and 'usd pledged' have fewer non-null entries than the other columns. The latter also has a space character in its column name. Let's check the list of column names and rename that column.

In [24]:
# Check column names

print(df.columns)

# Change column name

df.rename(columns = {'usd pledged':'usd_pledged'}, inplace=True)


Index(['ID', 'name', 'category', 'main_category', 'currency', 'deadline',
       'goal', 'launched', 'pledged', 'state', 'backers', 'country',
       'usd pledged', 'usd_pledged_real', 'usd_goal_real'],
      dtype='object')
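Only one column needs renaming here, but the same idea can be applied to every column at once. The sketch below uses a small toy frame (the name `raw` and its columns are stand-ins, not the real dataset) and passes a function to `rename` so that any space in a column name becomes an underscore.

```python
import pandas as pd

# Toy frame standing in for the raw Kaggle data (column names only)
raw = pd.DataFrame(columns=['ID', 'name', 'usd pledged', 'usd_pledged_real'])

# Strip surrounding whitespace and replace inner spaces with underscores
clean = raw.rename(columns=lambda c: c.strip().replace(' ', '_'))

print(list(clean.columns))
```

This returns a new dataframe instead of modifying `raw` in place, which makes it easy to compare the column names before and after.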


## Check for null and duplicate values


df is our raw dataframe, read from the Kaggle CSV. I will store the cleaned result in a new dataframe, 'data', after filtering out unwanted entries. Let's begin by identifying unwanted or "bad" data.

The output below displays the number of null entries in each column. We are not concerned about usd_pledged, as another column, usd_pledged_real, has a more accurate currency conversion and is in sync with usd_goal_real.

Let's explore the name column for missing values. I will begin by capturing value counts for the name column.

In [25]:
# Check null values in each column

print('Frequency of null values \n')
print(df.isnull().sum())

print('\nTop 5 duplicate campaign names')
df.name.value_counts().head()

Frequency of null values 

ID                     0
name                   4
category               0
main_category          0
currency               0
deadline               0
goal                   0
launched               0
pledged                0
state                  0
backers                0
country                0
usd_pledged         3797
usd_pledged_real       0
usd_goal_real          0
dtype: int64

Top 5 duplicate campaign names


New EP/Music Development    41
Canceled (Canceled)         13
N/A (Canceled)              11
Music Video                 11
Debut Album                 10
Name: name, dtype: int64
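To make the two checks above easier to follow, here is a small sketch on made-up data. The frame `toy` and its values are invented for illustration; only the pandas calls mirror what the cell above does.

```python
import pandas as pd

# Invented rows: one missing name, one repeated name, two missing amounts
toy = pd.DataFrame({
    'name': ['Debut Album', 'Debut Album', None, 'Music Video', 'Solo Tour'],
    'usd_pledged': [10.0, None, 5.0, None, 7.0],
})

# Number of null entries per column
null_counts = toy.isnull().sum()

# value_counts() ignores nulls by default, so only real names are counted
name_counts = toy['name'].value_counts()
repeated = name_counts[name_counts > 1]

print(null_counts)
print(repeated)
```

Filtering `value_counts()` for counts above 1 lists only the names that really repeat, instead of just the top five.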

## Drop invalid names 

New information has come to light! Multiple campaigns share the same name, and several names contain the word 'canceled'. Let's explore this in more depth.

I want to check the state of each campaign whose name contains the word 'canceled'. These are bad data points: campaigns named 'canceled' were never executed and do not contribute to the analysis of successful campaigns. I will drop all rows where the name column contains 'canceled' and store the result as a new dataframe, 'data'.


In [26]:
# Build a boolean mask that is True where the name contains 'canceled' or 'cancelled'
# (case-insensitive); na=False treats missing names as not matching
# Display the count of True (name contains 'canceled') and False

invalid_names = df.name.str.contains('canceled|cancelled', case=False, na=False)
print(invalid_names.value_counts())

# Keep only the rows where 'canceled' is not present in the name

data = df[~invalid_names].copy()
data.info()

False    355530
True      23131
Name: name, dtype: int64
<class 'pandas.core.frame.DataFrame'>
Int64Index: 355530 entries, 0 to 378660
Data columns (total 15 columns):
ID                  355530 non-null int64
name                355526 non-null object
category            355530 non-null object
main_category       355530 non-null object
currency            355530 non-null object
deadline            355530 non-null object
goal                355530 non-null float64
launched            355530 non-null object
pledged             355530 non-null float64
state               355530 non-null object
backers             355530 non-null int64
country             355530 non-null object
usd_pledged         351737 non-null float64
usd_pledged_real    355530 non-null float64
usd_goal_real       355530 non-null float64
dtypes: float64(5), int64(2), object(8)
memory usage: 43.4+ MB
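The state check mentioned above can be sketched as follows. The rows in `toy` are invented; on the real data the same two lines would be run against `df` before dropping anything.

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    'name': ['Canceled (Canceled)', 'My Board Game', None,
             'Tour CANCELLED', 'Debut Album (Canceled)'],
    'state': ['canceled', 'successful', 'failed', 'canceled', 'failed'],
})

# True where the name mentions either spelling; missing names count as False
flagged = toy['name'].str.contains('canceled|cancelled', case=False, na=False)

# Distribution of state among the flagged rows
state_counts = toy.loc[flagged, 'state'].value_counts()
print(state_counts)
```

If some flagged rows turn out to have a state other than 'canceled', that is a hint the name alone may not be a perfect filter.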


## Drop null values

As the data info shows, the name column still has some missing entries. Let's look at the rows with missing values.


In [27]:
# Check for null values

print('Total null values in name column: ', data.name.isnull().sum())

print('\nInconsistent data for the missing name rows\n')


Total null values in name column:  4

Inconsistent data for the missing name rows



Looking at the four rows, we cannot be sure that the remaining features were captured correctly. The number of rows is also too small to have an impact on the dataset as a whole. Hence, I will drop these rows from our dataframe.
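For reference, rows with a missing name can be pulled out with a boolean filter. This is a sketch on an invented frame, not the actual four rows from the dataset.

```python
import pandas as pd

# Invented rows; the second one has no name
toy = pd.DataFrame({
    'name': ['A', None, 'B'],
    'state': ['failed', 'failed', 'successful'],
    'backers': [1, 0, 3],
})

# Select only the rows where name is null so they can be inspected
missing_rows = toy[toy['name'].isnull()]
print(missing_rows)
```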


In [28]:
# Reassign data frame without null value rows in name column

data = data.dropna(subset=['name'], how='any')
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 355526 entries, 0 to 378660
Data columns (total 15 columns):
ID                  355526 non-null int64
name                355526 non-null object
category            355526 non-null object
main_category       355526 non-null object
currency            355526 non-null object
deadline            355526 non-null object
goal                355526 non-null float64
launched            355526 non-null object
pledged             355526 non-null float64
state               355526 non-null object
backers             355526 non-null int64
country             355526 non-null object
usd_pledged         351733 non-null float64
usd_pledged_real    355526 non-null float64
usd_goal_real       355526 non-null float64
dtypes: float64(5), int64(2), object(8)
memory usage: 43.4+ MB


In [29]:
# Check for null values to confirm drop

data.name.isnull().sum()

0

## Drop undefined state

We can tell from the dataframe that some of the observations have an 'undefined' state. For the scope of this analysis we are mostly concerned with failed and successful projects. To begin with, I will drop all rows where the state is undefined.



In [30]:
# Filter dataframe without undefined state and assign it back to data

data = data[data.state != 'undefined']
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 351967 entries, 0 to 378660
Data columns (total 15 columns):
ID                  351967 non-null int64
name                351967 non-null object
category            351967 non-null object
main_category       351967 non-null object
currency            351967 non-null object
deadline            351967 non-null object
goal                351967 non-null float64
launched            351967 non-null object
pledged             351967 non-null float64
state               351967 non-null object
backers             351967 non-null int64
country             351967 non-null object
usd_pledged         351733 non-null float64
usd_pledged_real    351967 non-null float64
usd_goal_real       351967 non-null float64
dtypes: float64(5), int64(2), object(8)
memory usage: 43.0+ MB
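Before dropping a state it helps to see how many rows each state has. The sketch below uses invented rows; `finished` is an optional, stricter filter that I am adding for illustration and is not applied in this notebook.

```python
import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    'state': ['failed', 'successful', 'undefined', 'live', 'canceled', 'failed'],
})

# How many rows fall under each state
print(toy['state'].value_counts())

# What this notebook does: remove only the 'undefined' rows
defined = toy[toy['state'] != 'undefined']

# Hypothetical stricter alternative: keep only failed and successful
finished = toy[toy['state'].isin(['failed', 'successful'])]
```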


## Drop duplicate names

Aha. There are multiple rows with the same name. Let's spot-check a couple of duplicate names.


There are multiple entries under a given name. These entries have a valid state and other details, so it would be wrong to assume they are bad data. At the same time, if we accept them as normal data, we risk counting repeated values, as these might be test campaigns or, even worse, incorrect entries.

Duplicate entries make up about 1.4% of the dataset, i.e. 4,781 of roughly 352,000 rows. I will drop rows with duplicate names to maintain a cleaner dataset.


In [31]:
# Spot check example
# Filter data where name is one of the repeated values

data[data.name == 'New EP/Music Development'].head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd_pledged,usd_pledged_real,usd_goal_real
13622,1068645001,New EP/Music Development,Rock,Music,USD,2015-04-23,3500.0,2015-03-18 16:20:34,1.0,failed,1,US,1.0,1.0,3500.0
99303,1504097381,New EP/Music Development,Rock,Music,USD,2015-04-27,2500.0,2015-03-24 21:35:07,11.0,failed,2,US,11.0,11.0,2500.0
132498,1672411097,New EP/Music Development,Music,Music,USD,2016-05-06,3000.0,2016-03-11 20:10:55,290.0,failed,0,"N,0""",,290.0,3000.0
140448,1713090809,New EP/Music Development,Rock,Music,USD,2015-04-24,3800.0,2015-03-24 21:02:47,16.0,failed,2,US,16.0,16.0,3800.0
142270,1722909259,New EP/Music Development,Metal,Music,USD,2015-04-22,3000.0,2015-03-18 16:10:38,123.0,failed,6,US,123.0,123.0,3000.0


In [32]:
# Total duplicates in name column

data.name.duplicated(keep=False).sum()

4781

In [33]:
# Drop rows with duplicate names

data = data.drop_duplicates(subset=['name'], keep=False)
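The `keep=False` argument matters here, so here is a small sketch of what it does. The names in `toy` are invented. With `keep=False` every row that shares a name is removed, while `keep='first'` (shown only for comparison, not used in this notebook) would retain one row per name.

```python
import pandas as pd

# Invented names: two of them appear twice
toy = pd.DataFrame({
    'name': ['Debut Album', 'Debut Album', 'Music Video',
             'Solo Tour', 'Music Video', 'Art Book'],
})

# Mark every row whose name appears more than once
dup_mask = toy['name'].duplicated(keep=False)
share = dup_mask.mean()  # fraction of rows involved in a duplicate

# Drop all rows with a repeated name (what the cell above does)
unique_only = toy.drop_duplicates(subset=['name'], keep=False)

# For comparison: keep the first occurrence of each name
first_kept = toy.drop_duplicates(subset=['name'], keep='first')

print(int(dup_mask.sum()), len(unique_only), len(first_kept))  # 4 2 4
```

Taking the mean of the boolean mask is a quick way to get the duplicate share without hard-coding row counts.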

# Data wrangling (part 2)

In [34]:
# Work on a copy so that later in-place edits do not alter 'data'
df = data.copy()

### Drop null values

The total number of null values is 230, which is less than 0.1% of all observations.

In [35]:
# Drop null values
df.dropna(how='any', inplace=True)


### Check feature data types

The data type of every feature is correct except for those containing dates. Therefore, I am converting the launched and deadline columns to datetime objects.

In [36]:
# Convert columns from string to datetime

df['launched'] = pd.to_datetime(df.launched)
df['deadline'] = pd.to_datetime(df.deadline)
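If some date strings were malformed, the conversion above would raise an error. A more defensive variant, shown here on invented values, passes `errors='coerce'` so that unparseable entries become `NaT` and can be counted afterwards. This is an optional sketch, not what the notebook runs.

```python
import pandas as pd

# Invented values; the last launched entry is deliberately not a date
toy = pd.DataFrame({
    'launched': ['2015-08-11 12:12:28', '1970-01-01 01:00:00', 'not a date'],
    'deadline': ['2015-10-09', '2010-09-15', '2016-01-01'],
})

# Unparseable strings become NaT instead of raising an exception
toy['launched'] = pd.to_datetime(toy['launched'], errors='coerce')
toy['deadline'] = pd.to_datetime(toy['deadline'], errors='coerce')

# Count how many values failed to parse in each column
print(toy[['launched', 'deadline']].isna().sum())
```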


### Add new features

1. The launched date is when the campaign first became active. I will use this date for seasonality. It is better to create new features for year, month and day of the week (Monday through Sunday, encoded as 0 through 6). These will come in handy when looking at time series data.
2. Another very important feature is duration. This is the time available for a campaign to raise an amount equal to or greater than the goal. Remember, for a campaign to be successful, it must raise at least as much as the goal by the deadline.
3. usd_pledged_real and usd_goal_real are the values for each campaign in USD, irrespective of country/currency. backers is the total number of people who pledged money to a given campaign. Using these, I created a few new features:
    
    p_timesgoal - pledged-to-goal ratio. Note: this should be equal to or greater than 1 for successful projects.
    
    p_perbacker - average money pledged per backer for a given project. Note: this metric could be skewed, as not every backer pledges the same amount of money. However, we will dive deeper into this idea at some point.
    
    Also, I will check for inf values in p_perbacker, as there are projects where the number of backers is zero.

In [37]:
# Create new columns and pick values from launched date

df['year'] = df.launched.dt.year
df['day'] = df.launched.dt.dayofweek
df['month'] = df.launched.dt.month

# Duration in days is the difference between deadline and launched date
# Adding 1 to the difference as the actual days is 1 more than the difference

df['duration'] = (df['deadline'] - df['launched']).astype('timedelta64[D]') + 1

# Create new features

df['p_timesgoal'] = df['usd_pledged_real'] / df['usd_goal_real']
df['p_perbacker'] = df['usd_pledged_real'] / df['backers']

# Replace inf values (zero backers) and NaN with zero
df[['p_perbacker']] = df[['p_perbacker']].replace([np.inf, -np.inf], np.nan).fillna(0)

# Round the ratio to 2 decimals and convert the per-backer average to int
df['p_timesgoal'] = df['p_timesgoal'].round(2)
df['p_perbacker'] = df['p_perbacker'].round(0).astype(int)
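Here is a small worked example of the same features on two rows whose dates and amounts are taken from the cleaned output shown later in this notebook. Two things differ from the cell above and are my own substitutions: the duration uses `.dt.days` (recent pandas versions may not accept `astype('timedelta64[D]')`), which gives integers instead of floats, and `day_name()` is added only to make the weekday encoding readable.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'launched': pd.to_datetime(['2015-08-11 12:12:28', '2012-03-17 03:24:11']),
    'deadline': pd.to_datetime(['2015-10-09', '2012-04-16']),
    'usd_pledged_real': [0.0, 1.0],
    'backers': [0, 1],
})

# Whole days between launch and deadline, plus 1 as in the cell above
toy['duration'] = (toy['deadline'] - toy['launched']).dt.days + 1

# Human-readable weekday of the launch date
toy['day_name'] = toy['launched'].dt.day_name()

# Dividing by zero backers gives NaN or inf; both are replaced with 0
per_backer = toy['usd_pledged_real'] / toy['backers']
toy['p_perbacker'] = per_backer.replace([np.inf, -np.inf], np.nan).fillna(0)

print(toy[['duration', 'day_name', 'p_perbacker']])
```

The durations come out as 59 and 30 days, matching the values for these two campaigns in the final dataframe.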

### Drop columns

We do not need some of the features in the existing dataset. I will drop the columns below:

1. usd_pledged is a currency conversion done using exchange rates from different sources. I will keep usd_pledged_real instead, which uses one exchange rate table for all campaigns.
2. ID is not required, as every name is now unique.
3. goal is the value in local currency; we already have this for all campaigns in USD in the column usd_goal_real.
4. pledged is the value in local currency; we already have this for all campaigns in USD in the column usd_pledged_real.
5. currency is the currency code for the campaign's country; we already have a country column.

Dropping these features uses less memory and gives a better view of the dataframe when displayed for analysis.

In [38]:
del df['usd_pledged']
del df['ID']
del df['goal']
del df['pledged']
del df['currency']
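The same drop can be done in a single call with `DataFrame.drop`. The sketch below uses an invented two-row frame and also checks the assumption that names are unique, which is the reason given for discarding ID.

```python
import pandas as pd

# Invented rows with the columns relevant to this step
toy = pd.DataFrame({
    'ID': [1, 2],
    'name': ['A', 'B'],
    'goal': [10.0, 20.0],
    'pledged': [1.0, 2.0],
    'currency': ['USD', 'GBP'],
    'usd_pledged': [1.0, 2.6],
    'usd_goal_real': [10.0, 26.0],
})

redundant = ['usd_pledged', 'ID', 'goal', 'pledged', 'currency']

# Returns a new dataframe without the listed columns
slim = toy.drop(columns=redundant)

# Confirm that name can act as the row identifier once ID is gone
print(slim['name'].is_unique)
print(list(slim.columns))
```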

There is one observation for the year 1970, which cannot be correct because it predates Kickstarter. I will drop this entry.

In [39]:
# Drop any observations from 2005 or earlier

df = df[df.year > 2005]
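A quick way to spot such outliers is to count rows per launch year before applying the cutoff. The timestamps below are invented for illustration.

```python
import pandas as pd

# Invented launch timestamps, one of them clearly out of range
toy = pd.DataFrame({
    'launched': pd.to_datetime([
        '1970-01-01 01:00:00', '2009-04-28 13:55:41', '2015-08-11 12:12:28',
    ]),
})
toy['year'] = toy['launched'].dt.year

# Rows per year, in chronological order, to reveal implausible years
print(toy['year'].value_counts().sort_index())

# Same cutoff as in the cell above
recent = toy[toy['year'] > 2005]
```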

## Display clean dataset

Note: I did not need to clean the values in the usd_pledged column. Another column, usd_pledged_real, captures the correct currency conversion data.

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 346955 entries, 0 to 378660
Data columns (total 16 columns):
name                346955 non-null object
category            346955 non-null object
main_category       346955 non-null object
deadline            346955 non-null datetime64[ns]
launched            346955 non-null datetime64[ns]
state               346955 non-null object
backers             346955 non-null int64
country             346955 non-null object
usd_pledged_real    346955 non-null float64
usd_goal_real       346955 non-null float64
year                346955 non-null int64
day                 346955 non-null int64
month               346955 non-null int64
duration            346955 non-null float64
p_timesgoal         346955 non-null float64
p_perbacker         346955 non-null int32
dtypes: datetime64[ns](2), float64(4), int32(1), int64(4), object(5)
memory usage: 43.7+ MB


In [41]:
df.head()

Unnamed: 0,name,category,main_category,deadline,launched,state,backers,country,usd_pledged_real,usd_goal_real,year,day,month,duration,p_timesgoal,p_perbacker
0,The Songs of Adelaide & Abullah,Poetry,Publishing,2015-10-09,2015-08-11 12:12:28,failed,0,GB,0.0,1533.95,2015,1,8,59.0,0.0,0
1,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,2017-11-01,2017-09-02 04:43:57,failed,15,US,2421.0,30000.0,2017,5,9,60.0,0.08,161
2,Where is Hank?,Narrative Film,Film & Video,2013-02-26,2013-01-12 00:20:50,failed,3,US,220.0,45000.0,2013,5,1,45.0,0.0,73
3,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,2012-04-16,2012-03-17 03:24:11,failed,1,US,1.0,5000.0,2012,5,3,30.0,0.0,1
4,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,2015-08-29,2015-07-04 08:35:03,canceled,14,US,1283.0,19500.0,2015,5,7,56.0,0.07,92


## Export dataset

I will use this dataset for the next exercise on data storytelling.

In [42]:
df.to_csv('DataSet_clean.csv', index=False)
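One thing to keep in mind for the next notebook: a CSV file does not store data types, so the date columns come back as strings unless they are parsed again on load. The sketch below shows a round trip on an invented frame, using an in-memory buffer in place of a file on disk.

```python
import io

import pandas as pd

# Invented rows for illustration only
toy = pd.DataFrame({
    'name': ['A', 'B'],
    'launched': pd.to_datetime(['2015-08-11 12:12:28', '2012-03-17 03:24:11']),
    'duration': [59, 30],
})

# Write to an in-memory buffer (stands in for the CSV file)
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype when reading back
reloaded = pd.read_csv(buffer, parse_dates=['launched'])
print(reloaded.dtypes)
```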

## Summary

The Kickstarter dataset has more than 300,000 observations, which is good for finding correlations and trends. This dataset has many useful features.

As part of data wrangling, I performed the following steps.

##### Import dataset from Kaggle CSV
##### Update column names to maintain consistency
##### Check for null and duplicate entries
##### Drop observations with invalid campaign names
##### Drop observations with null values
##### Drop observations where 'state' feature is 'undefined'
(state feature can take values - failed, successful, canceled, live and undefined)
##### Drop duplicate observations
##### Data wrangling (part 2)
1. Drop any null values
2. Check feature data types
3. Add new features
4. Drop columns which are not required in analysis

##### Export dataset into new CSV file



