# Table of Contents
1. [Data Wrangling](#Data_wrangling)<br>
    1a. [Dataframe cleaning](#cleaning)<br>
    1b. [Text preprocessing](#text)<br>

<a id='Data_wrangling'></a>

# 1. Data Wrangling
The data come from webrobots.io, which crawls all Kickstarter projects once a month. This project uses the 2019-08-15 crawl, which is split across 56 CSV files.
<br>
The data wrangling for this project is straightforward. First, we eliminate features that are unrelated to campaign outcome. Next, we create new year and month features to capture any temporal effect on pledged money. Finally, we preprocess the campaign text (the title, taken from the `slug` column).

In [20]:
import numpy as np
import pandas as pd
import glob
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
from keras.layers import Input, Dense, Dropout, Embedding, LSTM, Bidirectional
from keras.models import Model
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from langdetect import detect
from nltk.stem import WordNetLemmatizer, PorterStemmer

In [2]:
#define path and set up to import csv files
path = 'Dataset'
all_files = glob.glob(path +"/*.csv")

In [3]:
#read every csv file into its own dataframe
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col = None)
    li.append(df)

In [4]:
#merge files
df = pd.concat(li, ignore_index = True)
df.head()

Unnamed: 0,backers_count,blurb,category,converted_pledged_amount,country,created_at,creator,currency,currency_symbol,currency_trailing_code,...,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_pledged,usd_type
0,103,"Funding the mixing, mastering, and promotion o...","{""id"":39,""name"":""Hip-Hop"",""slug"":""music/hip-ho...",5612,US,1456593666,"{""id"":1531055178,""name"":""JC Stroebel and Henry...",USD,$,True,...,john-chuck-and-the-class-debut-ep,https://www.kickstarter.com/discover/categorie...,True,True,successful,1459964983,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",5612.0,domestic
1,318,We follow the challenges and achievements of g...,"{""id"":30,""name"":""Documentary"",""slug"":""film & v...",26237,US,1495058182,"{""id"":652875854,""name"":""Matthew Temple"",""is_re...",USD,$,True,...,girls-of-summer-big-diamond-dreams,https://www.kickstarter.com/discover/categorie...,True,True,successful,1499054401,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",26237.0,domestic
2,0,Task No.1 is inspired by the history and expre...,"{""id"":38,""name"":""Electronic Music"",""slug"":""mus...",0,GB,1357630802,"{""id"":1699678150,""name"":""Sonny Phillips"",""slug...",GBP,£,False,...,task-no1,https://www.kickstarter.com/discover/categorie...,False,False,failed,1362937678,1.614583,"{""web"":{""project"":""https://www.kickstarter.com...",0.0,international
3,22,MAJOR KEY ALERT - Future Heroes is a Denver ra...,"{""id"":39,""name"":""Hip-Hop"",""slug"":""music/hip-ho...",1575,US,1455591114,"{""id"":518056209,""name"":""Future Heroes"",""is_reg...",USD,$,True,...,future-heroes-sxsw-is-calling,https://www.kickstarter.com/discover/categorie...,True,False,successful,1457935201,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",1575.0,domestic
4,17,We're traveling to Rhode Island to film Mako a...,"{""id"":30,""name"":""Documentary"",""slug"":""film & v...",3290,US,1465224753,"{""id"":632937188,""name"":""Ryan Walton"",""is_regis...",USD,$,True,...,pelagic-shark-diving-shoot,https://www.kickstarter.com/discover/categorie...,True,False,successful,1467825676,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",3290.0,domestic
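
The loading step above can also be written more compactly. The following is a minimal sketch, assuming a directory of CSV files that share the same columns. The temporary directory, the file names and the toy columns are made up for illustration and are not part of the Kickstarter data.

```python
import glob
import os
import tempfile

import pandas as pd

# Build two tiny CSV files in a temporary directory (hypothetical stand-ins
# for the crawl files).
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"slug": ["a-b", "c-d"], "goal": [100.0, 200.0]}).to_csv(
    os.path.join(tmp_dir, "part_001.csv"), index=False)
pd.DataFrame({"slug": ["e-f"], "goal": [300.0]}).to_csv(
    os.path.join(tmp_dir, "part_002.csv"), index=False)

# sorted() makes the row order reproducible; glob itself returns files
# in arbitrary order.
all_files = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))

# Read every file, then concatenate once and renumber the index from 0.
frames = [pd.read_csv(f) for f in all_files]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Concatenating once at the end, rather than growing a dataframe inside the loop, avoids copying the accumulated rows on every iteration.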


<a id='cleaning'></a>

## 1a. Dataframe Cleaning

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207621 entries, 0 to 207620
Data columns (total 37 columns):
backers_count               207621 non-null int64
blurb                       207613 non-null object
category                    207621 non-null object
converted_pledged_amount    207621 non-null int64
country                     207621 non-null object
created_at                  207621 non-null int64
creator                     207621 non-null object
currency                    207621 non-null object
currency_symbol             207621 non-null object
currency_trailing_code      207621 non-null bool
current_currency            207621 non-null object
deadline                    207621 non-null int64
disable_communication       207621 non-null bool
friends                     444 non-null object
fx_rate                     207621 non-null float64
goal                        207621 non-null float64
id                          207621 non-null int64
is_backing                  444 

In [6]:
#drop unrelated features
df = df.drop(columns = ['backers_count', 'blurb', 'category', 'converted_pledged_amount', 'created_at',
                        'creator', 'currency', 'currency_symbol', 'currency_trailing_code',
                        'current_currency', 'deadline', 'friends', 'id', 'is_backing', 'is_starrable',
                        'is_starred', 'location', 'name', 'permissions', 'photo', 'pledged',
                        'profile', 'source_url', 'staff_pick', 'static_usd_rate', 'urls',
                        'usd_pledged', 'usd_type'])

In [7]:
#convert goal to USD, then keep only campaigns with communication enabled
df['goal'] = df['goal']*df['fx_rate']
df = df[df['disable_communication'] == False]
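
As a small illustration of this cell, here is a sketch on a toy frame. Only the column names come from the dataset; the three rows and their exchange rates are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "goal": [5000.0, 1000.0, 2000.0],
    "fx_rate": [1.0, 1.25, 0.5],                 # made-up rates
    "disable_communication": [False, False, True],
})

# Element-wise product: each goal is scaled by the rate on its own row.
toy["goal"] = toy["goal"] * toy["fx_rate"]

# ~ negates the boolean column. .copy() gives an independent frame, which
# avoids chained-assignment warnings when columns are added later.
toy = toy[~toy["disable_communication"]].copy()
print(toy)
```

The third row is removed by the filter, so two rows remain.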

In [8]:
#convert epoch seconds to datetime (fromtimestamp returns local time)
df['state_changed_at'] = df['state_changed_at'].apply(lambda x: datetime.fromtimestamp(x))
df['launched_at'] = df['launched_at'].apply(lambda x: datetime.fromtimestamp(x))

#create new feature: time elapsed between launch and the final state change
df['days_to_state_change'] = df['state_changed_at'] - df['launched_at']
df = df.drop(columns = ['fx_rate', 'disable_communication', 'state_changed_at'])

In [9]:
#convert the elapsed time to whole days
df['days_to_state_change'] = df['days_to_state_change'].dt.days

In [10]:
#create month_launched and year_launched columns
df['month_launched'] = df['launched_at'].apply(lambda x: x.month)
df['year_launched'] = df['launched_at'].apply(lambda x: x.year)
df = df.drop(columns = ['launched_at'])
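
The timestamp handling in the three cells above can also be done without `apply`. This is a minimal sketch, assuming the epoch columns hold seconds and that UTC is an acceptable reference. Plain `datetime.fromtimestamp` uses the machine's local timezone instead, so dates close to midnight can differ between the two approaches. The `state_changed_at` values are those of the first two rows shown earlier; the `launched_at` values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "launched_at": [1457372983, 1496807200],       # invented epoch seconds
    "state_changed_at": [1459964983, 1499054401],  # from the first two rows
})

# Vectorized conversion from epoch seconds to (UTC-based) timestamps.
launched = pd.to_datetime(toy["launched_at"], unit="s")
changed = pd.to_datetime(toy["state_changed_at"], unit="s")

# .dt.days keeps only the whole-day part of each timedelta.
toy["days_to_state_change"] = (changed - launched).dt.days
toy["month_launched"] = launched.dt.month
toy["year_launched"] = launched.dt.year
toy = toy.drop(columns=["launched_at", "state_changed_at"])
print(toy)
```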

In [11]:
#drop all rows that contain null values
df = df.dropna()
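
`dropna()` with default arguments removes rows, not columns, and it removes a row if any of its columns is null. That is why the nearly empty columns (such as `friends`) have to be gone before this call. The sketch below illustrates the effect on an invented toy frame.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "title": ["first title", None, "third title"],
    "goal": [100.0, 200.0, np.nan],
    "friends": [None, None, None],   # a column that is entirely empty
})

# With the empty column still present, every row contains a null.
rows_left = len(toy.dropna())
print(rows_left)

# Restricting the null check to the columns that matter keeps usable rows.
kept = toy.dropna(subset=["title", "goal"])
print(kept.index.tolist())
```

Passing `subset` is an alternative to dropping the sparse columns first; the notebook takes the latter route.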

<a id='text'></a>

## 1b. Text Preprocessing

In [12]:
#use the slug as the campaign title, replacing hyphens with spaces
df = df.rename(columns = {'slug': 'title'})
df['title'] = df['title'].apply(lambda x: x.replace('-', ' '))
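
A vectorized variant of the same transformation, as a small sketch. The two slugs are taken from the `df.head()` output shown earlier.

```python
import pandas as pd

slugs = pd.Series(["john-chuck-and-the-class-debut-ep", "task-no1"], name="title")

# regex=False treats "-" as a literal character rather than a pattern.
titles = slugs.str.replace("-", " ", regex=False)
print(titles.tolist())
```

The slugs in this dataset appear to be lowercase already and to have punctuation removed (for example `dantes` and `worlds` in later outputs), so apostrophes cannot be recovered from them.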

In [13]:
#detect the language of each title; return '' when detection fails
def detector(phrases):
    try:
        return detect(phrases)
    except Exception:
        return ''

df['language'] = df.title.apply(detector)

In [14]:
#keep only english titles
df = df[df.language == 'en']
#display without the language column; drop is not in-place, so df itself keeps 'language'
df.drop(columns = ['language'])

Unnamed: 0,country,goal,title,spotlight,state,days_to_state_change,month_launched,year_launched
0,US,5000.00000,john chuck and the class debut ep,True,successful,30,3,2016
1,US,24042.00000,girls of summer big diamond dreams,True,successful,26,6,2017
3,US,500.00000,future heroes sxsw is calling,True,successful,19,2,2016
4,US,2500.00000,pelagic shark diving shoot,True,successful,30,6,2016
5,AU,1019.09097,gorilla my dreams mime of my life,True,successful,28,10,2017
...,...,...,...,...,...,...,...,...
207616,US,500.00000,dantes capstone project who am i,True,successful,31,3,2016
207617,US,2200.00000,the pond of stars,False,failed,30,8,2014
207618,AT,55738.62600,fluxo the worlds first truly smart lamp,True,successful,33,12,2015
207619,HK,3826.37430,next bund18 deck blackstone magic bar playing ...,True,successful,25,1,2019
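
The `language` column is missing from the display above but still shows up in the later `df.info()` output. A short sketch on an invented toy frame shows why: `drop` returns a new dataframe and leaves the original untouched unless the result is assigned.

```python
import pandas as pd

toy = pd.DataFrame({"title": ["a b", "c d"], "language": ["en", "en"]})

# The returned frame is discarded, so toy is unchanged.
toy.drop(columns=["language"])
before = list(toy.columns)
print(before)

# Assigning the result actually removes the column.
toy = toy.drop(columns=["language"])
after = list(toy.columns)
print(after)
```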


In [15]:
#lowercase and tokenize the title feature
df['title'] = df.title.apply(lambda x: x.lower())
df['title'] = df.title.apply(lambda x: word_tokenize(x))

In [16]:
#copy for RNN analysis (copy() must be called; df.copy alone only references the method)
df_copy = df.copy()

In [17]:
#remove stopwords (stop_words avoids overwriting the imported stopwords module)
stop_words = set(stopwords.words('english'))
def important_words(tokens):
    output = []
    for token in tokens:
        if token not in stop_words:
            output.append(token)
    return output

df['title'] = df['title'].apply(important_words)
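
The filtering logic can be tried without NLTK. The sketch below uses a hypothetical five-word stop list in place of NLTK's English list; the tokens come from the first title in the dataset.

```python
# Hypothetical mini stop list for illustration only; the notebook uses
# NLTK's English stop word list.
stop_words = {"and", "the", "of", "my", "is"}

def important_words(tokens):
    """Return the tokens that are not stop words, preserving their order."""
    return [token for token in tokens if token not in stop_words]

filtered = important_words(["john", "chuck", "and", "the", "class", "debut", "ep"])
print(filtered)
```

Storing the stop words in a `set` keeps each membership check constant-time on average, which matters when the function runs over every title. Using a separate name such as `stop_words` also leaves the imported `stopwords` module usable if the cell is run again.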

In [18]:
#expand contraction fragments (n't -> not, 've -> have, etc.) for negation handling
#and drop every other token that is not purely alphabetic
app = {"n't": 'not', "'d": 'had', "'ve": 'have', "'re": 'are', "'ll": 'will'}
def replacing(tokens):
    output = []
    for word in tokens:
        if word in app:
            output.append(app[word])
        elif word.isalpha():
            output.append(word)
    return output

df['title'] = df['title'].apply(replacing)
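
A quick check of this step on a hand-written token list. The tokens mimic how a Treebank-style tokenizer such as `word_tokenize` typically splits "we don't stop!"; they are an assumption for illustration, not output captured from the notebook.

```python
app = {"n't": "not", "'d": "had", "'ve": "have", "'re": "are", "'ll": "will"}

def replacing(tokens):
    """Map contraction fragments to full words; drop other non-alphabetic tokens."""
    return [app.get(word, word) for word in tokens
            if word in app or word.isalpha()]

tokens = ["we", "do", "n't", "stop", "!"]   # assumed tokenizer output
cleaned = replacing(tokens)
print(cleaned)

# Tokens that mix letters and digits fail isalpha() and are removed too.
mixed = replacing(["task", "no1"])
print(mixed)
```

Because this step runs after stop word removal, a `not` produced here is kept even if the stop list contains it.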

In [21]:
#lemmatize each token, then stem the lemma
ps = PorterStemmer()
wnl = WordNetLemmatizer()
def lem_stem(tokens):
    output = []
    for word in tokens:
        lem = wnl.lemmatize(word)
        ps_lem = ps.stem(lem)
        output.append(ps_lem)
    return output

df['title'] = df['title'].apply(lambda x: lem_stem(x))

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 151113 entries, 0 to 207620
Data columns (total 9 columns):
country                 151113 non-null object
goal                    151113 non-null float64
title                   151113 non-null object
spotlight               151113 non-null bool
state                   151113 non-null object
days_to_state_change    151113 non-null int64
month_launched          151113 non-null int64
year_launched           151113 non-null int64
language                151113 non-null object
dtypes: bool(1), float64(1), int64(3), object(4)
memory usage: 10.5+ MB


In [23]:
df.head()

Unnamed: 0,country,goal,title,spotlight,state,days_to_state_change,month_launched,year_launched,language
0,US,5000.0,"[john, chuck, class, debut, ep]",True,successful,30,3,2016,en
1,US,24042.0,"[girl, summer, big, diamond, dream]",True,successful,26,6,2017,en
3,US,500.0,"[futur, hero, sxsw, call]",True,successful,19,2,2016,en
4,US,2500.0,"[pelag, shark, dive, shoot]",True,successful,30,6,2016,en
5,AU,1019.09097,"[gorilla, dream, mime, life]",True,successful,28,10,2017,en


In [24]:
df.describe()

Unnamed: 0,goal,days_to_state_change,month_launched,year_launched
count,151113.0,151113.0,151113.0,151113.0
mean,38248.29,30.624069,6.304256,2015.972901
std,998661.7,13.083881,3.246589,2.134699
min,0.01,0.0,1.0,2009.0
25%,1500.0,28.0,4.0,2015.0
50%,5000.0,30.0,6.0,2016.0
75%,12882.61,32.0,9.0,2018.0
max,100000000.0,92.0,12.0,2019.0


## Data Wrangling Summary
After concatenating the files, all repetitive and unrelated features were dropped from the combined dataset. Campaigns with disabled communication were dropped: the majority of campaigns have communication enabled, so the rest were treated as outliers. The goal was normalized to USD, and launch month and launch year were created as new features, along with the number of days between launch and state change. All rows with null values were dropped.

For the text preprocessing step, we identified and removed all titles that were not in English, since this project applies NLP to English text only rather than to multiple languages. We then lowercased and tokenized the titles and removed stop words. Next, punctuation and tokens that are not purely alphabetic were removed, and finally the text was lemmatized and stemmed.