## Project Kickstarter 
by Vivika Wilde wilde.vivika@gmail.com
   Sebastian Fuhrer fuhrer_sebastian@web.de

## Objective 

In recent years, the range of funding options for projects created by individuals and small companies has expanded considerably. In addition to savings, bank loans, friends & family funding and other traditional options, crowdfunding has become a popular and readily available alternative. 

Kickstarter, founded in 2009, is one particularly well-known and popular crowdfunding platform. It has an all-or-nothing funding model, whereby a project is only funded if it meets its goal amount; otherwise no money is given by backers to a project.
Many factors contribute to the success or failure of a project, both in general and on Kickstarter. Some of these can be quantified or categorized, which makes it possible to build a model that attempts to predict whether a project will succeed. The aim of this project is to construct such a model and to analyse Kickstarter project data more generally, in order to help potential project creators assess whether Kickstarter is a good funding option for them and what their chances of success are.
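
The all-or-nothing rule described above can be written down in a few lines. This is only an illustrative sketch: the function names `is_funded` and `amount_collected` are hypothetical and are not part of the dataset or of any Kickstarter API.

```python
def is_funded(pledged: float, goal: float) -> bool:
    """All-or-nothing rule: a project is funded only if pledges reach the goal."""
    return pledged >= goal


def amount_collected(pledged: float, goal: float) -> float:
    """Money actually transferred to the creator under the all-or-nothing rule."""
    # If the goal is missed, backers are not charged at all.
    return pledged if is_funded(pledged, goal) else 0.0


# Hypothetical example values, chosen only for illustration.
reached = amount_collected(pledged=1950.0, goal=1000.0)
missed = amount_collected(pledged=500.0, goal=1000.0)
```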


## Set up

In [1]:
import glob, os
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
%matplotlib inline

In [2]:
data = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "data/*.csv"))))
data = data.reset_index(drop=True)

In [3]:
df = data.copy()

## Variable names and description

* Backers: Number of supporters/investors
* Blurb: A short description of the product written for promotional purposes
* Category: Projects have been classified into 16 categories. These categories broadly define the genre a project belongs to.
* Subcategory: Categories are further subdivided into subcategories that describe the project in more detail. For instance, the category “Technology” is split into subcategories such as Gadgets, Web, Apps and Software. There are 144 subcategories in total.
* Converted_pledged_amount: Total pledged amount in USD.
* Currency: Currency used to support the project
* Country: The country the project seeks its pledges from (target audience, two-letter code)
* USD_pledged: Pledged amount in USD (conversion made by Kickstarter)
* USD_pledged_real: Pledged amount in USD (conversion made by fixer.io api)
* USD_goal_real: Goal amount in USD
* Launched: The date on which the project was launched.
* Deadline: The date by which the goal amount has to be raised.

## Data types and missing values

In [4]:
df.head(2)

Unnamed: 0,backers_count,blurb,category,converted_pledged_amount,country,created_at,creator,currency,currency_symbol,currency_trailing_code,...,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_pledged,usd_type
0,315,Babalus Shoes,"{""id"":266,""name"":""Footwear"",""slug"":""fashion/fo...",28645,US,1541459205,"{""id"":2094277840,""name"":""Lucy Conroy"",""slug"":""...",USD,$,True,...,babalus-childrens-shoes,https://www.kickstarter.com/discover/categorie...,False,False,live,1548223375,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",28645.0,international
1,47,A colorful Dia de los Muertos themed oracle de...,"{""id"":273,""name"":""Playing Cards"",""slug"":""games...",1950,US,1501684093,"{""id"":723886115,""name"":""Lisa Vollrath"",""slug"":...",USD,$,True,...,the-ofrenda-oracle-deck,https://www.kickstarter.com/discover/categorie...,True,False,successful,1504976459,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",1950.0,domestic


In [5]:
df.tail(2)

Unnamed: 0,backers_count,blurb,category,converted_pledged_amount,country,created_at,creator,currency,currency_symbol,currency_trailing_code,...,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_pledged,usd_type
209220,76,Seattle Transmedia & Independent Film Festival...,"{""id"":295,""name"":""Festivals"",""slug"":""film & vi...",5692,US,1425256957,"{""id"":307076473,""name"":""Timothy Vernor"",""is_re...",USD,$,True,...,transmedia-gallery-space-stiff-2015,https://www.kickstarter.com/discover/categorie...,True,False,successful,1429536379,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",5692.0,domestic
209221,44,The @1000TimesYes 2009 Tweet Box is a handmade...,"{""id"":13,""name"":""Journalism"",""slug"":""journalis...",1293,US,1263225900,"{""id"":1718677513,""name"":""Article"",""slug"":""arti...",USD,$,True,...,the-1000timesyes-2009-tweet-box,https://www.kickstarter.com/discover/categorie...,True,True,successful,1266814815,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",1293.0,domestic


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209222 entries, 0 to 209221
Data columns (total 37 columns):
backers_count               209222 non-null int64
blurb                       209214 non-null object
category                    209222 non-null object
converted_pledged_amount    209222 non-null int64
country                     209222 non-null object
created_at                  209222 non-null int64
creator                     209222 non-null object
currency                    209222 non-null object
currency_symbol             209222 non-null object
currency_trailing_code      209222 non-null bool
current_currency            209222 non-null object
deadline                    209222 non-null int64
disable_communication       209222 non-null bool
friends                     300 non-null object
fx_rate                     209222 non-null float64
goal                        209222 non-null float64
id                          209222 non-null int64
is_backing                  300 

### Missing Data

In [7]:
missing = pd.DataFrame(df.isnull().sum(), columns=['Number'])
missing['Percentage'] = round(missing.Number / df.shape[0] * 100, 1)
missing[missing.Number != 0]

Unnamed: 0,Number,Percentage
blurb,8,0.0
friends,208922,99.9
is_backing,208922,99.9
is_starred,208922,99.9
location,226,0.1
permissions,208922,99.9
usd_type,480,0.2


For the features 'friends', 'is_backing', 'is_starred' and 'permissions', only 0.1 percent of the values are present (99.9 percent are missing).
Therefore these features are not usable and will be removed from the set.
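
The same decision can be made by a threshold instead of by naming columns by hand. The following is a minimal sketch on toy data; the helper `drop_sparse_columns` and the 50 percent cut-off are assumptions made for illustration, not something prescribed by the dataset.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame: pd.DataFrame, max_missing: float = 0.5):
    """Return a copy without columns whose share of missing values exceeds max_missing."""
    missing_share = frame.isnull().mean()  # fraction of NaN per column
    sparse = missing_share[missing_share > max_missing].index.tolist()
    return frame.drop(columns=sparse), sparse


# Toy frame standing in for the real data (values are illustrative).
toy = pd.DataFrame({
    "backers_count": [315, 47, 271, 12],
    "friends": [np.nan, np.nan, np.nan, "[]"],  # 75 % missing
    "blurb": ["a", None, "c", "d"],             # 25 % missing
})

cleaned, dropped = drop_sparse_columns(toy, max_missing=0.5)
print(dropped)
```

Note that `drop` returns a new DataFrame, so the result has to be assigned for the columns to actually disappear.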


In [8]:
# drop() returns a new DataFrame, so assign the result back to keep the change
df = df.drop(columns=['friends', 'permissions', 'is_backing', 'is_starred'])

In [9]:
df.backers_count.unique()

array([ 315,   47,  271, ..., 3142, 6586, 1192])

### Backers

In [10]:
df.rename(columns = {'backers_count':'backers'}, inplace = True)

### Blurb ( --> tbd)

In [11]:
# drop() returns a new DataFrame, so assign the result back to keep the change
df = df.drop(columns='blurb')

### Category

In [12]:
df.category.unique()

array(['{"id":266,"name":"Footwear","slug":"fashion/footwear","position":5,"parent_id":9,"color":16752598,"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/fashion/footwear"}}}',
       '{"id":273,"name":"Playing Cards","slug":"games/playing cards","position":4,"parent_id":12,"color":51627,"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/games/playing%20cards"}}}',
       '{"id":43,"name":"Rock","slug":"music/rock","position":17,"parent_id":14,"color":10878931,"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/music/rock"}}}',
       '{"id":48,"name":"Nonfiction","slug":"publishing/nonfiction","position":9,"parent_id":18,"color":14867664,"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/publishing/nonfiction"}}}',
       '{"id":36,"name":"Classical Music","slug":"music/classical music","position":3,"parent_id":14,"color":10878931,"urls":{"web":{"discover":"http://www.kickstarter.com/discover

In [13]:
df_cat = pd.DataFrame([x for x in df['category']])
df_cat.head()

Unnamed: 0,0
0,"{""id"":266,""name"":""Footwear"",""slug"":""fashion/fo..."
1,"{""id"":273,""name"":""Playing Cards"",""slug"":""games..."
2,"{""id"":43,""name"":""Rock"",""slug"":""music/rock"",""po..."
3,"{""id"":273,""name"":""Playing Cards"",""slug"":""games..."
4,"{""id"":48,""name"":""Nonfiction"",""slug"":""publishin..."


In [None]:
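import json  # df_cat[0] holds raw JSON strings, so parse each one to reach its slug
df[['category', 'subcategory']] = df_cat[0].map(lambda s: json.loads(s)['slug']).str.split("/", n=1, expand=True)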
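
The category column holds JSON strings, so the slug has to be parsed out before it can be split into category and subcategory. Below is a minimal sketch on toy data. The JSON strings are shortened versions of the ones shown above, the third slug (a category without a subcategory) is a hypothetical case, and the column name `main_category` is introduced here only for illustration.

```python
import json

import pandas as pd

toy = pd.DataFrame({"category": [
    '{"id":266,"name":"Footwear","slug":"fashion/footwear"}',
    '{"id":43,"name":"Rock","slug":"music/rock"}',
    '{"id":13,"name":"Journalism","slug":"journalism"}',  # hypothetical: no subcategory
]})

# Parse every JSON string into a dict and expand the dicts into columns.
parsed = pd.DataFrame([json.loads(s) for s in toy["category"]], index=toy.index)

# Split "parent/child" once; rows without "/" get a missing subcategory.
parts = parsed["slug"].str.split("/", n=1, expand=True)
toy["main_category"] = parts[0]
toy["subcategory"] = parts[1]
```

Limiting the split with `n=1` keeps the result at two columns even if a slug were to contain more than one slash.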

In [None]:
df.country.unique()

### Currency 

In [None]:
df.currency_symbol.unique()

In [None]:
df.groupby(['currency', 'currency_symbol']).size()

In [None]:
# drop() returns a new DataFrame, so assign the result back to keep the change
df = df.drop(columns='currency_symbol')

In [None]:
df.source_url.unique();

As the currency symbols are less specific and more ambiguous than the currencies themselves, only the currency will be used as a feature.
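
One way to check this ambiguity is to count how many distinct currencies share each symbol. The sketch below uses a small toy frame; its rows are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Toy data: several dollar currencies share the "$" symbol.
toy = pd.DataFrame({
    "currency": ["USD", "CAD", "AUD", "GBP", "EUR"],
    "currency_symbol": ["$", "$", "$", "£", "€"],
})

# Number of distinct currencies behind each symbol.
currencies_per_symbol = toy.groupby("currency_symbol")["currency"].nunique()

# Symbols that map to more than one currency are ambiguous.
ambiguous = currencies_per_symbol[currencies_per_symbol > 1]
print(ambiguous)
```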

### Dates

In [None]:
# Vectorised conversion of the Unix timestamps (seconds); values are interpreted as UTC
df_dates = pd.DataFrame({'created_at': pd.to_datetime(df['created_at'], unit='s')})
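
The date columns are stored as Unix timestamps in seconds, so they can be converted in one vectorised step and then subtracted from each other. This is a sketch only, using the `created_at` and `state_changed_at` values of the two rows shown in `df.head(2)`; the derived column `days_until_state_change` is a hypothetical name.

```python
import pandas as pd

toy = pd.DataFrame({
    "created_at": [1541459205, 1501684093],
    "state_changed_at": [1548223375, 1504976459],
})

# Convert seconds since the epoch into timezone-aware UTC timestamps.
for col in ["created_at", "state_changed_at"]:
    toy[col] = pd.to_datetime(toy[col], unit="s", utc=True)

# Whole days between project creation and the last state change.
toy["days_until_state_change"] = (toy["state_changed_at"] - toy["created_at"]).dt.days
```

Working in UTC avoids results that depend on the time zone of the machine running the notebook.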

### Creator

In [None]:
df.creator.unique()

In [None]:
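# Series.replace only matches whole cell values; use .str.replace for substrings
df['creator'] = df['creator'].str.replace(',', ':', regex=False)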
#df['new_creator'] = df['creator'].apply(lambda x: x.split(':')[1])
#df.head()
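
Since the creator column appears to contain JSON strings, an alternative to string splitting is to parse them. The sketch below assumes every entry is valid JSON with `id` and `name` keys; the strings are shortened versions of the ones shown in `df.head(2)`, and the column names `creator_id` and `creator_name` are hypothetical.

```python
import json

import pandas as pd

toy = pd.DataFrame({"creator": [
    '{"id":2094277840,"name":"Lucy Conroy"}',
    '{"id":723886115,"name":"Lisa Vollrath"}',
]})

# Parse each JSON string once, then pull out the fields of interest.
creator = toy["creator"].map(json.loads)
toy["creator_id"] = creator.map(lambda c: c.get("id"))
toy["creator_name"] = creator.map(lambda c: c.get("name"))
```

Using `dict.get` returns `None` instead of raising an error when a key is absent.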

## Label and Features

In [None]:
df.state.value_counts()

In [None]:
# state is categorical, so plot the counts per state as a bar chart
df.state.value_counts().plot(kind='bar')
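
For a success prediction model, the state column has to be turned into a binary label. A minimal sketch follows, assuming that only finished projects are kept and that the relevant state values are `successful` and `failed`. Only `live` and `successful` appear in the output above, so the other state names and the column name `success` are assumptions.

```python
import pandas as pd

# Toy data; "failed" and "canceled" are assumed state values.
toy = pd.DataFrame({"state": ["live", "successful", "failed", "successful", "canceled"]})

# Keep only projects with a final, unambiguous outcome.
finished = toy[toy["state"].isin(["successful", "failed"])].copy()

# 1 = goal reached, 0 = goal missed.
finished["success"] = (finished["state"] == "successful").astype(int)
```

Whether states other than these two should be dropped or counted as failures is a modelling choice that this sketch does not settle.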

In [None]:
df.describe().round(2)