# Exploring data from https://webrobots.io/kickstarter-datasets/

We downloaded the files in CSV format from webrobots. All files from 2015-11-01 up to the time of writing (2017-10-15) share a similar structure; the differences are minor (newer files have a few more columns, so column indexes shift). Older files are available only in JSON format and are unlikely to help our approach, because they were scraped at long intervals.

Our first goal is to find out how project features (e.g. pledged amount, backers count) evolve over time. We therefore need data collected at least roughly every 30 days, because 30 days is the default project duration proposed by Kickstarter (60 days is the maximum).

Webrobots stores each monthly scrape in a zip file. Each zip file contains about 13 to 48 CSV files.

In this notebook we explore what kind of data the CSV files contain (no plots or analysis yet), specifically the files scraped on 2015-11-01.

In [1]:
import pandas as pd
import os
import ast

In [2]:
df = pd.read_csv(r'C:\Users\Patrik\Downloads\webrobots.iokickstarter-datasets\Kickstarter_2015-11-01T14_09_04_557Z\Kickstarter007.csv')
df.head()

Unnamed: 0,id,name,blurb,goal,pledged,state,slug,disable_communication,country,currency,...,static_usd_rate,usd_pledged,photo,creator,location,category,profile,spotlight,urls,source_url
0,1838469271,MechRunner,An endless arcade action game that casts you a...,25000.0,28434.0,successful,mechrunner,False,US,USD,...,1.0,28434.0,"{""small"":""https://ksr-ugc.imgix.net/projects/8...","{""urls"":{""web"":{""user"":""https://www.kickstarte...","{""country"":""US"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""link_text_col...",True,"{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...
1,496736434,Kingdom Death : Monster,Cooperative board game set in a nightmare-horr...,35000.0,2049721.07,successful,kingdom-death-monster,False,US,USD,...,1.0,2049721.0,"{""small"":""https://ksr-ugc.imgix.net/projects/2...","{""urls"":{""web"":{""user"":""https://www.kickstarte...","{""country"":""US"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""link_text_col...",True,"{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...
2,1757011463,The X-Cube,"The neXt generation, shape-shifting, 3D logic ...",30000.0,53854.73,successful,the-x-cube,False,US,USD,...,1.0,53854.73,"{""small"":""https://ksr-ugc.imgix.net/projects/5...","{""urls"":{""web"":{""user"":""https://www.kickstarte...","{""country"":""US"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""link_text_col...",True,"{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...
3,1453716748,Thunderbeam! for the iPad,Thunderbeam is a retro-futuristic adventure ga...,20000.0,24221.78,successful,thunderbeam-for-the-ipad,False,US,USD,...,1.0,24221.78,"{""small"":""https://ksr-ugc.imgix.net/projects/3...","{""urls"":{""web"":{""user"":""https://www.kickstarte...","{""country"":""US"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""link_text_col...",True,"{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...
4,1700479977,Super World Karts - Indie Kart AllStars! - PC ...,SNES style retro mode 7 kart racer fun for 1-4...,16000.0,17816.35,successful,super-world-karts-indie-kart-racer,False,AU,AUD,...,0.928942,16550.36,"{""small"":""https://ksr-ugc.imgix.net/projects/1...","{""urls"":{""web"":{""user"":""https://www.kickstarte...","{""country"":""AU"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":1,""link_text_color...",True,"{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...


In [3]:
df.describe()

Unnamed: 0,id,goal,pledged,deadline,state_changed_at,created_at,launched_at,backers_count,static_usd_rate,usd_pledged
count,4610.0,4610.0,4610.0,4610.0,4610.0,4610.0,4610.0,4610.0,4610.0,4610.0
mean,1070710000.0,36130.07,33812.14,1403824000.0,1403275000.0,1396709000.0,1400907000.0,519.428633,1.032326,33767.96
std,608629600.0,545709.0,216768.2,46274030.0,45811330.0,46608150.0,46481960.0,2760.109742,0.198005,217744.6
min,62304.0,1.0,0.0,1244833000.0,1244833000.0,1241317000.0,1241500000.0,0.0,0.114865,0.0
25%,543835400.0,2500.0,26.0,1384575000.0,1384575000.0,1374009000.0,1381944000.0,2.0,1.0,27.39911
50%,1084258000.0,6050.0,2128.5,1418941000.0,1418914000.0,1411760000.0,1416249000.0,45.0,1.0,2162.107
75%,1574755000.0,15000.0,13713.0,1438098000.0,1438087000.0,1431492000.0,1435357000.0,249.75,1.0,13838.0
max,2146878000.0,35000000.0,8596475.0,1451499000.0,1446379000.0,1446297000.0,1446379000.0,74405.0,1.715913,8596475.0
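
The `deadline`, `state_changed_at`, `created_at` and `launched_at` columns in the summary above look like Unix timestamps in seconds. This is an assumption based on their magnitude; nothing in the file states it. A minimal sketch of converting them to dates and deriving the campaign length follows. The `deadline` value is the one of the MechRunner project in this file, while the `launched_at` value is hypothetical and chosen only for illustration:

```python
import pandas as pd

SECONDS_PER_DAY = 86400

# one-row stand-in for df; launched_at is a made-up value (deadline minus 30 days)
sample = pd.DataFrame({
    'deadline': [1400295600],
    'launched_at': [1400295600 - 30 * SECONDS_PER_DAY],
})

# interpret the integers as seconds since 1970-01-01 UTC (assumption)
for col in ['deadline', 'launched_at']:
    sample[col] = pd.to_datetime(sample[col], unit='s', utc=True)

# campaign length in whole days
sample['duration_days'] = (sample['deadline'] - sample['launched_at']).dt.days
print(sample)
```

If the assumption holds, the same loop could be run on `df` itself to check how many projects use the 30-day duration.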


In [4]:
df['state'].value_counts()

successful    2213
failed        1554
live           757
canceled        86
Name: state, dtype: int64

## Exploring columns and determining which information could be useful for our purposes

As we can see, some columns contain data in JSON format. We can parse these values into Python dictionaries and then work with them using basic Python tools.

We pick one row to find out what kinds of problems we may face and how to deal with them. Obviously, this method does not guarantee that we encounter every kind of problem, but it is sufficient for now.

In [5]:
row = 0
df.loc[row,:]

id                                                               1838469271
name                                                             MechRunner
blurb                     An endless arcade action game that casts you a...
goal                                                                  25000
pledged                                                               28434
state                                                            successful
slug                                                             mechrunner
disable_communication                                                 False
country                                                                  US
currency                                                                USD
currency_symbol                                                           $
currency_trailing_code                                                 True
deadline                                                         1400295600
state_change

In [6]:
df.loc[row,'blurb']

'An endless arcade action game that casts you as a powerful mech in stunning cinematic battles against a relentless robotic army.'

### photo

In [7]:
dict_photo = ast.literal_eval(df.loc[row,'photo'])
dict_photo

{'1024x768': 'https://ksr-ugc.imgix.net/projects/878320/photo-original.jpg?v=1400295449&w=1024&h=768&fit=crop&auto=format&q=92&s=a5ec63c73668e03ba83058188b5872c1',
 '1536x1152': 'https://ksr-ugc.imgix.net/projects/878320/photo-original.jpg?v=1400295449&w=1536&h=1152&fit=crop&auto=format&q=92&s=f8fb270a57f4fdef7b092aa065e02b11',
 'ed': 'https://ksr-ugc.imgix.net/projects/878320/photo-original.jpg?v=1400295449&w=338&h=250&fit=crop&auto=format&q=92&s=14b860031e30da7f9c55c973f7591b7d',
 'full': 'https://ksr-ugc.imgix.net/projects/878320/photo-original.jpg?v=1400295449&w=560&h=420&fit=crop&auto=format&q=92&s=aff53a0945947e5ac54a1d15ab327ecd',
 'key': 'projects/878320/photo-original.jpg',
 'little': 'https://ksr-ugc.imgix.net/projects/878320/photo-original.jpg?v=1400295449&w=200&h=150&fit=crop&auto=format&q=92&s=3fcc8e7c5afb15235d6af45d16cc2d6c',
 'med': 'https://ksr-ugc.imgix.net/projects/878320/photo-original.jpg?v=1400295449&w=266&h=200&fit=crop&auto=format&q=92&s=5527e1ffe1f80592ffaeeab1

#### what may be handy:

In [8]:
len(dict_photo)

9

### creator

In [9]:
dict_creator = ast.literal_eval(df.loc[row,'creator'])
dict_creator

{'avatar': {'medium': 'https://ksr-ugc.imgix.net/avatars/1812653/logo.original.jpg?v=1392916085&w=160&h=160&fit=crop&auto=format&q=92&s=e49ad51f2fb80bdb31a249493a54e152',
  'small': 'https://ksr-ugc.imgix.net/avatars/1812653/logo.original.jpg?v=1392916085&w=80&h=80&fit=crop&auto=format&q=92&s=bae1a74ca8ffb384fa6f5e316d40cf82',
  'thumb': 'https://ksr-ugc.imgix.net/avatars/1812653/logo.original.jpg?v=1392916085&w=40&h=40&fit=crop&auto=format&q=92&s=eabea1abd42568e29cb6dd63c724ac69'},
 'id': 375569972,
 'name': 'Spark Plug Games',
 'slug': 'sparkpluggames',
 'urls': {'api': {'user': 'https://api.kickstarter.com/v1/users/375569972?signature=1446477215.36602b5966bc0fdb16f5f8708e7c1cfbb91afffc'},
  'web': {'user': 'https://www.kickstarter.com/profile/sparkpluggames'}}}

#### what may be handy:

In [10]:
dict_creator['urls']['web']['user']

'https://www.kickstarter.com/profile/sparkpluggames'

### location

In [11]:
dict_location = ast.literal_eval(df.loc[row,'location'])
dict_location

ValueError: malformed node or string: <_ast.Name object at 0x000002DB3C4B1208>

In [12]:
# let's find out what caused ValueError
df.loc[row,'location']

'{"country":"US","urls":{"web":{"discover":"https://www.kickstarter.com/discover/places/cary-nc","location":"https://www.kickstarter.com/locations/cary-nc"},"api":{"nearby_projects":"https://api.kickstarter.com/v1/discover?signature=1446458433.d4e6ccc44854b2a0d5581b5b69ea425bbc2ccf4e&woe_id=2375810"}},"name":"Cary","displayable_name":"Cary, NC","short_name":"Cary, NC","id":2375810,"state":"NC","type":"Town","is_root":false,"slug":"cary-nc"}'

In [13]:
# "is_root":false - the JSON literal false is not a known name in Python, let's convert it to False
df.loc[row,'location'].replace('false', 'False')

'{"country":"US","urls":{"web":{"discover":"https://www.kickstarter.com/discover/places/cary-nc","location":"https://www.kickstarter.com/locations/cary-nc"},"api":{"nearby_projects":"https://api.kickstarter.com/v1/discover?signature=1446458433.d4e6ccc44854b2a0d5581b5b69ea425bbc2ccf4e&woe_id=2375810"}},"name":"Cary","displayable_name":"Cary, NC","short_name":"Cary, NC","id":2375810,"state":"NC","type":"Town","is_root":False,"slug":"cary-nc"}'

In [14]:
dict_location = ast.literal_eval(df.loc[row,'location'].replace('true', 'True').replace('false', 'False'))
dict_location

{'country': 'US',
 'displayable_name': 'Cary, NC',
 'id': 2375810,
 'is_root': False,
 'name': 'Cary',
 'short_name': 'Cary, NC',
 'slug': 'cary-nc',
 'state': 'NC',
 'type': 'Town',
 'urls': {'api': {'nearby_projects': 'https://api.kickstarter.com/v1/discover?signature=1446458433.d4e6ccc44854b2a0d5581b5b69ea425bbc2ccf4e&woe_id=2375810'},
  'web': {'discover': 'https://www.kickstarter.com/discover/places/cary-nc',
   'location': 'https://www.kickstarter.com/locations/cary-nc'}}}
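
Chained `str.replace` calls also rewrite the words `true`, `false` and `null` wherever they occur inside string values (a blurb, a project name). Since these columns hold JSON, the standard-library `json` module is an alternative worth considering. A minimal sketch follows, using a shortened copy of the location string above and a made-up record (`tricky`) that shows the difference:

```python
import ast
import json

# shortened copy of the location string shown above
raw_location = '{"country":"US","name":"Cary","state":"NC","type":"Town","is_root":false,"slug":"cary-nc"}'
location = json.loads(raw_location)  # JSON false becomes Python False, no replacing needed

# made-up record: the word "true" also appears inside a string value
tricky = '{"blurb":"Based on a true story","spotlight":true}'
via_replace = ast.literal_eval(tricky.replace('true', 'True'))
via_json = json.loads(tricky)

print(location['is_root'])    # False
print(via_replace['blurb'])   # Based on a True story  (text was altered)
print(via_json['blurb'])      # Based on a true story  (text is intact)
```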

#### what may be handy:

In [15]:
dict_location['is_root']

False

In [16]:
dict_location['name']

'Cary'

In [17]:
dict_location['state']

'NC'

In [18]:
dict_location['type']

'Town'

In [19]:
dict_location['urls']['web']['discover']

'https://www.kickstarter.com/discover/places/cary-nc'

### category

In [20]:
dict_category = ast.literal_eval(df.loc[row,'category'])
dict_category

{'color': 51627,
 'id': 35,
 'name': 'Video Games',
 'parent_id': 12,
 'position': 7,
 'slug': 'games/video games',
 'urls': {'web': {'discover': 'http://www.kickstarter.com/discover/categories/games/video%20games'}}}

#### what may be handy:

In [21]:
dict_category['id']

35

In [22]:
dict_category['name']

'Video Games'

In [23]:
dict_category['parent_id']

12

In [24]:
dict_category['position']

7

In [25]:
dict_category['slug'].split('/')[0]

'games'

### profile

In [26]:
dict_profile = ast.literal_eval(df.loc[row,'profile'])
dict_profile

ValueError: malformed node or string: <_ast.Name object at 0x000002DB3C5246D8>

In [27]:
# let's find out what caused ValueError
df.loc[row,'profile']

'{"background_image_opacity":0.8,"link_text_color":"","state_changed_at":1427207526,"should_show_feature_image":true,"blurb":"An endless arcade action game that casts you as a powerful mech in stunning cinematic battles against a relentless robotic army.","background_color":"191d1d","project_id":898572,"name":"MechRunner","feature_image_attributes":{"image_urls":{"default":"https://ksr-ugc.imgix.net/projects/878320/photo-original.jpg?v=1400295449&w=1536&h=1152&fit=crop&auto=format&q=92&s=f8fb270a57f4fdef7b092aa065e02b11","baseball_card":"https://ksr-ugc.imgix.net/projects/878320/photo-original.jpg?v=1400295449&w=1536&h=1152&fit=crop&auto=format&q=92&s=f8fb270a57f4fdef7b092aa065e02b11"}},"link_url":"","show_feature_image":false,"id":898572,"state":"active","text_color":"ffffff","link_text":"Follow along!","link_background_color":""}'

In [28]:
# boolean values cause problems again, and null causes a problem as well
# however, the important information from this column already seems to be contained in other columns
dict_profile = ast.literal_eval(df.loc[row,'profile'].replace('true', 'True').replace('false', 'False').replace('null', 'None'))
dict_profile

{'background_color': '191d1d',
 'background_image_opacity': 0.8,
 'blurb': 'An endless arcade action game that casts you as a powerful mech in stunning cinematic battles against a relentless robotic army.',
 'feature_image_attributes': {'image_urls': {'baseball_card': 'https://ksr-ugc.imgix.net/projects/878320/photo-original.jpg?v=1400295449&w=1536&h=1152&fit=crop&auto=format&q=92&s=f8fb270a57f4fdef7b092aa065e02b11',
   'default': 'https://ksr-ugc.imgix.net/projects/878320/photo-original.jpg?v=1400295449&w=1536&h=1152&fit=crop&auto=format&q=92&s=f8fb270a57f4fdef7b092aa065e02b11'}},
 'id': 898572,
 'link_background_color': '',
 'link_text': 'Follow along!',
 'link_text_color': '',
 'link_url': '',
 'name': 'MechRunner',
 'project_id': 898572,
 'should_show_feature_image': True,
 'show_feature_image': False,
 'state': 'active',
 'state_changed_at': 1427207526,
 'text_color': 'ffffff'}
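
Before parsing every row, it may help to wrap the parsing in one function that also tolerates empty cells (pandas reads them as `NaN`, which is a float, not a string). A hedged sketch follows. `parse_json_cell` is a hypothetical helper name, and it relies on `json.loads`, which understands `true`, `false` and `null` directly, instead of the `replace` chain:

```python
import json

def parse_json_cell(value):
    """Return the parsed JSON value, or None for empty, missing or malformed cells."""
    # NaN and other non-string cells cannot hold JSON text
    if not isinstance(value, str) or not value.strip():
        return None
    try:
        return json.loads(value)
    except json.JSONDecodeError:
        # malformed text: report it as missing instead of stopping the whole run
        return None

print(parse_json_cell('{"link_url":null,"show_feature_image":false}'))
print(parse_json_cell(float('nan')))
print(parse_json_cell('{"truncated":'))
```

It could then be applied to a whole column, e.g. `df['profile'].apply(parse_json_cell)`.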

### urls

In [29]:
dict_urls = ast.literal_eval(df.loc[row,'urls'])
dict_urls

{'web': {'project': 'https://www.kickstarter.com/projects/sparkpluggames/mechrunner?ref=category',
  'rewards': 'https://www.kickstarter.com/projects/sparkpluggames/mechrunner/rewards'}}

#### what may be handy:

In [30]:
dict_urls['web']['rewards']

'https://www.kickstarter.com/projects/sparkpluggames/mechrunner/rewards'
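
The fields marked as handy in the sections above can be collected into one flat record per project. The following is a sketch under assumptions: `dig` and `extract_handy_fields` are hypothetical helpers, the output key names are arbitrary, and the sample dictionaries are abridged copies of the row explored above:

```python
def dig(data, *keys):
    """Walk nested dicts; return None as soon as a key is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return None
        data = data[key]
    return data

def extract_handy_fields(creator, location, category, urls):
    """Flatten the nested values that looked useful into a single dict."""
    slug = dig(category, 'slug')
    return {
        'creator_url': dig(creator, 'urls', 'web', 'user'),
        'location_name': dig(location, 'name'),
        'location_state': dig(location, 'state'),
        'location_type': dig(location, 'type'),
        'category_id': dig(category, 'id'),
        'category_name': dig(category, 'name'),
        'category_parent_id': dig(category, 'parent_id'),
        # first part of the slug, e.g. 'games' from 'games/video games'
        'category_parent': slug.split('/')[0] if slug else None,
        'rewards_url': dig(urls, 'web', 'rewards'),
    }

# abridged copies of the dictionaries parsed above
creator = {'urls': {'web': {'user': 'https://www.kickstarter.com/profile/sparkpluggames'}}}
location = {'name': 'Cary', 'state': 'NC', 'type': 'Town', 'is_root': False}
category = {'id': 35, 'name': 'Video Games', 'parent_id': 12, 'slug': 'games/video games'}
urls = {'web': {'rewards': 'https://www.kickstarter.com/projects/sparkpluggames/mechrunner/rewards'}}

flat = extract_handy_fields(creator, location, category, urls)
print(flat)
```

Returning `None` for missing keys keeps the function usable on rows where a nested value is absent, at the cost of hiding which key was missing.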

## What next

* data integration:
    * read all csv files and store information about all live projects
    * fetch information about end state (success/failure) for every live project
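
The two steps listed above could look roughly like the sketch below. It assumes that `id` identifies a project across files and that a project's final state shows up in a later scrape; the function names and the miniature frames are made up for illustration, and real frames would come from `pd.read_csv` on the individual files:

```python
import pandas as pd

def collect_live_projects(frames):
    """Concatenate scraped frames and keep one row per live project."""
    combined = pd.concat(frames, ignore_index=True)
    live = combined[combined['state'] == 'live']
    return live.drop_duplicates(subset='id', keep='first').reset_index(drop=True)

def attach_end_state(live, later_scrape):
    """Add an end_state column taken from a later scrape (NaN if still unknown)."""
    finished = later_scrape.loc[later_scrape['state'] != 'live', ['id', 'state']]
    finished = finished.drop_duplicates(subset='id', keep='last')
    finished = finished.rename(columns={'state': 'end_state'})
    return live.merge(finished, on='id', how='left')

# made-up miniature scrapes
scrape_a = pd.DataFrame({'id': [1, 2, 3], 'state': ['live', 'live', 'failed']})
scrape_b = pd.DataFrame({'id': [1, 4], 'state': ['live', 'successful']})
later = pd.DataFrame({'id': [1, 2], 'state': ['successful', 'live']})

live = collect_live_projects([scrape_a, scrape_b])
result = attach_end_state(live, later)
print(result)
```

Projects that are still live in the later scrape keep a missing `end_state`, so they can be looked up again in the next scrape.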