# Kickstarter ML Project

## Preparation

In [26]:
import pandas as pd
import numpy as np
import glob

import sweetviz as sv

### Load Data

In [5]:
# Load the CSV files and merge them into a single dataframe

#path = ".../*.csv" # Borak's path
path = "data/*.csv" # Christian's path
#path = ".../*.csv" # Matthias's path

all_files = glob.glob(path)
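# ignore_index=True gives the merged frame a unique 0..n-1 index instead of repeating each file's index
df_raw = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)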

### Overview

In [4]:
# Execute Sweetviz
#my_report = sv.analyze(df_raw)
#my_report.show_html() # Default arguments will generate to "SWEETVIZ_REPORT.html"

In [3]:
df_raw.columns

Index(['backers_count', 'blurb', 'category', 'converted_pledged_amount',
       'country', 'created_at', 'creator', 'currency', 'currency_symbol',
       'currency_trailing_code', 'current_currency', 'deadline',
       'disable_communication', 'friends', 'fx_rate', 'goal', 'id',
       'is_backing', 'is_starrable', 'is_starred', 'launched_at', 'location',
       'name', 'permissions', 'photo', 'pledged', 'profile', 'slug',
       'source_url', 'spotlight', 'staff_pick', 'state', 'state_changed_at',
       'static_usd_rate', 'urls', 'usd_pledged', 'usd_type'],
      dtype='object')

## Data Cleaning

For safety, we operate on a copy of the data frame.

In [None]:
df = df_raw.copy()

For a basic feature analysis, we focus on easily accessible features. Consequently, we drop the columns
- `blurb`, `creator`, `slug`, `name` and `photo`, because they could only be exploited through a semantic or context analysis,
- `currency_symbol` and `currency_trailing_code`, because they are redundant,
- `friends` and `permissions`, because they do not contain any information,
- `disable_communication`, `is_backing` and `is_starred`, because they have an entry only for the same 300 data points, and it is questionable whether the missing entries may be treated as one category; later on we might try this with one of them and drop the other two,
- `urls`, `source_url`, `profile` and `state_changed_at`, because they do not contain additional information.
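
The drop described in the list above could be sketched as follows. The column names are taken from the list; the helper name `drop_unused_columns` and the toy frame are hypothetical and only serve as illustration.

```python
import pandas as pd

# Columns named in the list above
cols_to_drop = [
    "blurb", "creator", "slug", "name", "photo",
    "currency_symbol", "currency_trailing_code",
    "friends", "permissions",
    "disable_communication", "is_backing", "is_starred",
    "urls", "source_url", "profile", "state_changed_at",
]

def drop_unused_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame without the columns listed in cols_to_drop."""
    # errors="ignore" skips names that are not present in the frame
    return frame.drop(columns=cols_to_drop, errors="ignore")

# Toy frame with made-up values, standing in for df
toy = pd.DataFrame({
    "id": [1, 2],
    "blurb": ["a", "b"],
    "goal": [100.0, 200.0],
    "friends": [None, None],
})
remaining = drop_unused_columns(toy).columns.tolist()
print(remaining)
```

On the real data this would be applied as `df = drop_unused_columns(df)`.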

Extract field and subfield ids from `category` (id and parent_id):
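
A minimal sketch of this step, assuming that each `category` entry is a JSON string with an `id` key and, for subcategories, a `parent_id` key. That format, the new column names `category_id` and `category_parent_id`, and the toy rows are assumptions, not taken from the data shown here.

```python
import json
import pandas as pd

def extract_category_ids(frame: pd.DataFrame) -> pd.DataFrame:
    """Add category_id and category_parent_id (hypothetical names) to a copy of frame."""
    # Assumes every entry is a valid JSON string
    parsed = frame["category"].apply(json.loads)
    out = frame.copy()
    out["category_id"] = parsed.apply(lambda d: d.get("id"))
    # Entries without a parent_id key end up as missing values
    out["category_parent_id"] = parsed.apply(lambda d: d.get("parent_id"))
    return out

# Toy rows with made-up values
toy = pd.DataFrame({"category": ['{"id": 266, "parent_id": 9}', '{"id": 9}']})
result = extract_category_ids(toy)
print(result[["category_id", "category_parent_id"]])
```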

Sort the `country` entries into the categories US, CA, GB, AN, EU and Other:
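
One possible way to do the grouping is a plain lookup function. The sketch below assumes that `country` holds two-letter ISO codes, reads AN as Australia and New Zealand, and uses an illustrative, incomplete set of EU members; all three are assumptions that need to be checked against the data.

```python
import pandas as pd

# Assumed groupings (ISO 3166-1 alpha-2 codes); adjust after inspecting the data
AN_CODES = {"AU", "NZ"}
EU_CODES = {"AT", "BE", "DE", "DK", "ES", "FR", "IE", "IT", "LU", "NL", "SE"}

def country_group(code: str) -> str:
    """Map a country code to one of US, CA, GB, AN, EU, Other."""
    if code in {"US", "CA", "GB"}:
        return code
    if code in AN_CODES:
        return "AN"
    if code in EU_CODES:
        return "EU"
    return "Other"

# Toy series with made-up values, standing in for df["country"]
toy_country = pd.Series(["US", "AU", "DE", "JP", "GB"])
groups = toy_country.map(country_group)
print(groups.tolist())
```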

Turn the entries of `created_at`, `launched_at` and `deadline` into dates:
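
A short sketch of the conversion, assuming that the three columns store Unix timestamps in seconds. If they turn out to be strings or milliseconds, the `unit` argument has to be changed or removed. The toy values are made up.

```python
import pandas as pd

date_cols = ["created_at", "launched_at", "deadline"]

# Toy frame with made-up Unix timestamps (seconds since 1970-01-01 UTC)
toy = pd.DataFrame({
    "created_at": [1500000000],
    "launched_at": [1500086400],
    "deadline": [1502678400],
})

for col in date_cols:
    # unit="s" interprets the integers as seconds since the epoch
    toy[col] = pd.to_datetime(toy[col], unit="s")

print(toy.dtypes)
```

Once converted, differences such as `toy["deadline"] - toy["launched_at"]` yield timedeltas, which may be useful as a campaign duration feature.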

Compare `fx_rate` and `static_usd_rate` and keep just one of them:
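
The comparison might start with the share of rows in which both rates agree and the largest relative deviation. The sketch below uses a toy frame with made-up rates; which column to keep is left open here, as in the text.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up exchange rates, standing in for df
toy = pd.DataFrame({
    "fx_rate": [1.0, 1.30, 0.75],
    "static_usd_rate": [1.0, 1.25, 0.75],
})

# Share of rows in which both rates agree up to floating point tolerance
share_equal = np.isclose(toy["fx_rate"], toy["static_usd_rate"]).mean()

# Relative deviation of fx_rate from static_usd_rate, row by row
rel_diff = (toy["fx_rate"] - toy["static_usd_rate"]).abs() / toy["static_usd_rate"]

print(share_equal, rel_diff.max())
```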

Use what remains of `fx_rate` and `static_usd_rate` to convert `goal` and `pledged` entries into USD, then drop `fx_rate`:
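
A possible sketch of the conversion, with the rate column passed as a parameter because the choice between the two rates is still open. The default `static_usd_rate`, the helper name `to_usd`, and the new column names `goal_usd` and `pledged_usd` are assumptions for illustration.

```python
import pandas as pd

def to_usd(frame: pd.DataFrame, rate_col: str = "static_usd_rate") -> pd.DataFrame:
    """Add goal_usd and pledged_usd (hypothetical names) and drop the rate column."""
    out = frame.copy()
    for col in ["goal", "pledged"]:
        # Amount in original currency times USD rate gives the amount in USD
        out[col + "_usd"] = out[col] * out[rate_col]
    return out.drop(columns=[rate_col])

# Toy frame with made-up values
toy = pd.DataFrame({
    "goal": [100.0, 200.0],
    "pledged": [50.0, 300.0],
    "static_usd_rate": [1.0, 1.25],
})
converted = to_usd(toy)
print(converted)
```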

Check relation between `converted_pledged_amount`, `usd_pledged` and `pledged` after conversion into USD. Keep just one of them:

Check for duplicated `id` entries and remove them if they refer to the same data point. Multiplicities of the `id` entries:

In [10]:
print(*df.id.value_counts().unique())

3 2 1


Collect the `id` entries that occur more than once and count them:

In [31]:
counts = df.id.value_counts()
# ids that occur more than once, in order of descending multiplicity
multiples = counts[counts > 1].index.to_numpy()
len(multiples)

26957

Rows sharing an `id` seem to differ mainly in `creator`, `source_url` and `usd_type`:

In [52]:
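#np.array([id for id in multiples if all(df.query('id == '+str(id)).nunique() != 1)])
# For a sample of duplicated ids, list the columns whose number of distinct
# non-NaN values is not exactly 1 (columns that are all NaN count 0 and are listed too)
for dup_id in multiples[1:20]:
    rows = df[df.id == dup_id]
    print(dup_id, [c for c in df.columns if rows[c].nunique() != 1])
    print()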

1860103231 ['creator', 'friends', 'is_backing', 'is_starred', 'permissions', 'source_url', 'usd_type']

1335851549 ['creator', 'friends', 'is_backing', 'is_starred', 'permissions', 'source_url', 'usd_type']

860973294 ['creator', 'friends', 'is_backing', 'is_starred', 'permissions', 'source_url', 'usd_type']

938836477 ['creator', 'friends', 'is_backing', 'is_starred', 'permissions', 'source_url', 'usd_type']

1531898946 ['creator', 'friends', 'is_backing', 'is_starred', 'permissions', 'source_url', 'usd_type']

1883785497 ['creator', 'friends', 'is_backing', 'is_starred', 'permissions', 'source_url', 'usd_type']

1637278787 ['converted_pledged_amount', 'creator', 'current_currency', 'fx_rate', 'source_url', 'urls']

1792982088 ['creator', 'friends', 'is_backing', 'is_starred', 'permissions', 'source_url', 'usd_type']

650550348 ['creator', 'friends', 'is_backing', 'is_starred', 'permissions', 'source_url']

1617549428 ['creator', 'friends', 'is_backing', 'is_starred', 'permissions', '

In that case, only one of the rows with the same `id` entry needs to be kept. Let us check which `id` entries refer to rows that differ in further values:

In [None]:
# ...
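
The check announced above could be sketched as follows. The list `ignore` holds the columns that already showed up as varying in most of the printed examples; treating differences in these columns as harmless is an assumption. The toy frame and the names `conflicting_ids` and `deduped` are made up for illustration.

```python
import pandas as pd

# Columns that varied between copies in most of the examples printed above
ignore = ["creator", "friends", "is_backing", "is_starred",
          "permissions", "source_url", "usd_type"]

# Toy frame with made-up values, standing in for df
toy = pd.DataFrame({
    "id": [1, 1, 2, 2, 3],
    "usd_type": ["domestic", "international", "domestic", "domestic", "domestic"],
    "fx_rate": [1.0, 1.0, 1.2, 1.3, 1.0],
})

check_cols = [c for c in toy.columns if c not in ignore and c != "id"]

# dropna=False counts NaN as a value, so an all-NaN column yields 1, not 0
n_values = toy.groupby("id")[check_cols].nunique(dropna=False)

# ids whose rows disagree in at least one of the remaining columns
conflicting_ids = n_values.index[(n_values > 1).any(axis=1)].tolist()

# Keep the first row per id; conflicting ids may need a closer look first
deduped = toy.drop_duplicates(subset="id", keep="first")

print(conflicting_ids, len(deduped))
```

On the real data, columns with unhashable entries would have to be dropped or converted before `nunique` can be applied.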

Remove data points whose `state` entry is `canceled` or `suspended`. In `state`, replace `live` by `successful` if `converted_pledged_amount` >= `goal`, and drop all other data points marked as `live`:
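
A minimal sketch of this filtering on a toy frame with made-up values. It assumes that `goal` has already been converted into USD, so that the comparison with `converted_pledged_amount` is made in the same currency.

```python
import pandas as pd

# Toy frame with made-up values, standing in for df
toy = pd.DataFrame({
    "state": ["successful", "failed", "canceled", "suspended", "live", "live"],
    "converted_pledged_amount": [500, 10, 0, 5, 1200, 30],
    "goal": [400, 100, 50, 50, 1000, 100],
})

# 1. Remove canceled and suspended projects
out = toy[~toy["state"].isin(["canceled", "suspended"])].copy()

# 2. Live projects that already reached their goal count as successful
is_live = out["state"] == "live"
reached_goal = out["converted_pledged_amount"] >= out["goal"]
out.loc[is_live & reached_goal, "state"] = "successful"

# 3. Drop the remaining live projects, whose outcome is still unknown
out = out[out["state"] != "live"].reset_index(drop=True)

print(out["state"].tolist())
```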

Remaining questions:
- Relation between `currency` and `current_currency`
- What to do with `location`? Extract cities and try to assign some score to each? To test this, extract a few big cities and check for correlation with the target.