# Predictions of Crowdfunding Campaign Success with Machine Learning Approach

Over the past decade, crowdfunding activity has increased dramatically, giving creators an alternative way to sell products and backers an alternative way to invest in creative businesses. However, empirical analysis shows that only one-third of crowdfunding campaigns meet their fundraising goal. The aim of this project is to develop a machine learning model that predicts the success of a crowdfunding project. The datasets are collected retrospectively from Web Robots, the Kickstarter website, and the Indiegogo website. The model could provide insights at the pre-launch stage and in the early stage of fundraising.

## Table of Contents
1. [Libraries](#libs)
2. [Reader](#reads)
3. [Analysis](#analysis)
4. [Preparation](#prepares)
5. [Selection](#selection)
6. [Preprocessing](#preps)
7. [Word Embeddings](#wembed)

## 1. Libraries <a class="anchor" id="libs"></a>

In [None]:
# standard libraries
import json
import re  # used by html_tag_removal and punctuation_removal

# 3rd party libraries
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# custom libraries
from src.reader import read_json, read_csv

## 2. Reader <a class="anchor" id="reads"></a>

We combine datasets from [Web Robots](https://webrobots.io/kickstarter-datasets/) and [our scraper](https://github.com/unedo08/kickstarter-scrapper).

In [None]:
# raw strings keep the backslashes literal and avoid invalid escape sequences such as "\k"
path_json_file = r"dataset\kickstarter-corpus.json"
path_kickstarter_csv = r"dataset\kickstarter"

df_ks = read_json(path_json_file).merge(read_csv(path_kickstarter_csv), how="inner", on="site")
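
The effect of the inner join on `site` can be illustrated with a minimal sketch. The two toy frames below are hypothetical stand-ins for the scraped corpus and the Web Robots data; their values are invented for illustration and are not taken from the datasets.

```python
import pandas as pd

# Hypothetical toy frames standing in for the scraped corpus and the CSV data.
df_corpus = pd.DataFrame({
    "site": ["a", "b", "c"],
    "campaign": [{"story": "s1"}, {"story": "s2"}, {"story": "s3"}],
})
df_csv = pd.DataFrame({
    "site": ["b", "c", "d"],
    "backers_count": [10, 25, 7],
})

# how="inner" keeps only the sites that appear in both frames.
merged = df_corpus.merge(df_csv, how="inner", on="site")
print(merged["site"].tolist())
```

Sites that appear in only one of the two sources are dropped, so the merged frame can be smaller than either input.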

Check the first three rows.

In [None]:
display(df_ks.head(3))

## 3. Analysis  <a class="anchor" id="analysis"></a>

In [None]:
df_ks.info()

The merged data has 45 columns (attributes) and 11637 rows.

For text attributes _(temporary)_, we will use:
- `story` (a key in the `campaign` column)
- `post_comment` (a key in each entry of the `comment` column)

For meta attributes _(temporary)_, we will use:
- `backers_count`
- `usd_pledged`

## 4. Preparation <a class="anchor" id="prepares"></a>

In [None]:
dfc_ks = df_ks.copy()

# create a new column from the "story" key in the campaign column
dfc_ks["story"] = [d.get("story") for d in dfc_ks.campaign]

# drop rows with empty story
dfc_ks = dfc_ks[(dfc_ks.story != "") & (dfc_ks.story != "<n/a>")]

# drop rows with empty comment
dfc_ks = dfc_ks[(dfc_ks.comment != {}) & (dfc_ks.comment != "<n/a>")]
dfc_ks = dfc_ks.reset_index(drop=True)
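
The filtering above can be sketched on toy data as follows. This is a minimal sketch, not the notebook's exact logic: it assumes each `comment` value is either a dict or the `"<n/a>"` placeholder, and it uses explicit per-row checks instead of `!=` comparisons. The frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical rows: only the first has both a story and at least one comment.
df = pd.DataFrame({
    "campaign": [
        {"story": "We build robots."},
        {"story": ""},
        {"story": "<n/a>"},
        {"story": "A board game."},
    ],
    "comment": [
        {"0": {"post_comment": "Nice!"}},
        {"0": {"post_comment": "Cool"}},
        "<n/a>",
        {},
    ],
})

# Pull the "story" key out of each campaign dict.
df["story"] = [d.get("story") for d in df["campaign"]]

# Explicit masks: a usable story, and a non-empty dict of comments.
has_story = ~df["story"].isin(["", "<n/a>"])
has_comment = df["comment"].apply(lambda c: isinstance(c, dict) and len(c) > 0)

clean = df[has_story & has_comment].reset_index(drop=True)
print(len(clean))
```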

## 5. Selection <a class="anchor" id="selection"></a>

### 5.1. Selection on Text Attributes

In [None]:
# keep story as a string and collect each campaign's comments as a list of strings
list_comments = []
for i in dfc_ks.itertuples():
    list_sub_comments = []
    for j in i.comment:
        list_sub_comments.append(i.comment[j]["post_comment"])
    list_comments.append(list_sub_comments)
df_text = pd.DataFrame(dfc_ks[["site", "story"]])
df_text["comment"] = list_comments
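
The nested loop can also be written as a comprehension over the dict values. The sketch below assumes each `comment` value is a dict whose entries carry a `post_comment` key, as in the cell above; `collect_post_comments` is a hypothetical helper name and the sample dict is invented.

```python
def collect_post_comments(comment):
    """Return the post_comment strings of one campaign's comment dict."""
    return [entry["post_comment"] for entry in comment.values()]

# Hypothetical comment dict for a single campaign.
sample_comment = {
    "0": {"post_comment": "Great idea"},
    "1": {"post_comment": "When does it ship?"},
}
flat = collect_post_comments(sample_comment)
print(flat)
```

Applied to the real frame, this would amount to `[collect_post_comments(c) for c in dfc_ks["comment"]]`.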

### 5.2. Selection on Meta Attributes

In [None]:
df_meta = pd.DataFrame(dfc_ks[["backers_count", "usd_pledged"]])

## 6. Preprocessing <a class="anchor" id="preps"></a>

In [None]:
def word_tokenization(df_in):
    try:
        print(f"Word Tokenization is in progress...")
        df_copy = df_in.copy()
        df_copy["story"] = df_copy["story"].apply(lambda t: word_tokenize(t))
        df_copy["comment"] = [[word_tokenize(t) for t in i] for i in df_copy["comment"]]
        df_out = df_copy.copy()
        print(f"Word Tokenization is complete.")
        return df_out
    except Exception as e:
        print(e)
        return df_in

def lowercasing(df_in):
    try:
        print(f"Lowercasing is in progress...")
        df_copy = df_in.copy()
        df_copy["story"] = df_copy["story"].apply(lambda i: list(map(lambda t: t.lower(), i)))
        df_copy["comment"] = [[[t.lower() for t in j] for j in i] for i in df_copy["comment"]]
        df_out = df_copy
        print(f"Lowercasing is complete.")
        return df_out
    except Exception as e:
        print(e)
        return df_in
    
def stopword_removal(df_in):
    try:
        print(f"Stopword Removal is in progress...")
        # English stopword list from the NLTK corpus; a set makes membership tests fast
        stoplist = set(stopwords.words('english'))
        df_copy = df_in.copy()
        # exclude stopwords with Python's list comprehension and pandas.Series.apply.
        df_copy["story"] = df_copy["story"].apply(lambda i: [t for t in i if t not in stoplist])
        df_copy["comment"] = [[[t for t in j if t not in stoplist] for j in i] for i in df_copy["comment"]]
        df_out = df_copy
        print(f"Stopword Removal is complete.")
        return df_out
    except Exception as e:
        print(e)
        return df_in

def stemming(df_in):
    try:
        print(f"Stemming is in progress...")
        df_copy = df_in.copy()
        # Porter M.F. (1980) An Algorithm for Suffix Stripping. Program, 14: 130-137.
        ps = PorterStemmer()
        df_copy["story"] = df_copy["story"].apply(lambda i: [ps.stem(t) for t in i])
        df_copy["comment"] = [[[ps.stem(t) for t in j] for j in i] for i in df_copy["comment"]]
        df_out = df_copy
        print(f"Stemming is complete.")
        return df_out
    except Exception as e:
        print(e)
        return df_in

def html_tag_removal(txt):
    # regex pattern object of html bracket
    bracket = re.compile("<.*?>")
    res = re.sub(bracket, "", txt)
    return res

def punctuation_removal(txt):
    res = re.sub(r'[^\w\s]', '', txt)
    return res

def html_tag_and_punctuation_removal(df_in):
    try:
        print(f"HTML Tag and Punctuation Removal is in progress...")
        df_copy = df_in.copy()
        
        # html tag removal
        df_copy["story"] = [list(filter(None, [html_tag_removal(t) for t in i])) for i in df_copy["story"]]
        df_copy["comment"] = [[list(filter(None, [html_tag_removal(t) for t in j])) for j in i] for i in df_copy["comment"]]
        
        # punctuation removal
        df_copy["story"] = [list(filter(None, [punctuation_removal(t) for t in i])) for i in df_copy["story"]]
        df_copy["comment"] = [[list(filter(None, [punctuation_removal(t) for t in j])) for j in i] for i in df_copy["comment"]]
        
        df_out = df_copy
        print(f"HTML Tag and Punctuation Removal is complete.")
        return df_out
    except Exception as e:
        print(e)
        return df_in
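
The token-level steps above share one shape: a function from a token list to a token list is applied to `story` and to every comment. A minimal sketch of that shared pattern follows; `apply_to_tokens` is a hypothetical helper that is not part of the notebook, and the demo frame is invented.

```python
import pandas as pd

def apply_to_tokens(df_in, fn):
    """Apply fn (token list -> token list) to the story and to each comment."""
    df_out = df_in.copy()
    df_out["story"] = df_out["story"].apply(fn)
    df_out["comment"] = df_out["comment"].apply(
        lambda comments: [fn(tokens) for tokens in comments]
    )
    return df_out

# Hypothetical, already tokenized demo frame.
df_demo = pd.DataFrame({
    "story": [["Hello", "World"]],
    "comment": [[["Nice", "Idea"], ["Good"]]],
})

lowered = apply_to_tokens(df_demo, lambda tokens: [t.lower() for t in tokens])
print(lowered.loc[0, "story"], lowered.loc[0, "comment"])
```

Unlike the functions above, this sketch does not catch exceptions, so a failing step stops the pipeline instead of silently returning the unprocessed input.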

### 6.1. Text Attributes Preprocessing

In [None]:
df_text = word_tokenization(df_text)
df_text = lowercasing(df_text)
df_text = stopword_removal(df_text)
df_text = stemming(df_text)
df_text = html_tag_and_punctuation_removal(df_text)
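
One point worth checking in this ordering: HTML tag removal runs on individual tokens, after tokenization. If the tokenizer has already split a tag such as `<p>` into `<`, `p`, and `>`, the pattern `<.*?>` no longer matches any single token, and the tag name may survive as a stray token. The sketch below illustrates this with a hand-written token list (an assumption about the tokenizer's output, not a recorded result) and shows a possible alternative that cleans the raw string before tokenization. `clean_raw_text` is a hypothetical helper.

```python
import re

TAG_RE = re.compile(r"<.*?>")      # same pattern as html_tag_removal
PUNCT_RE = re.compile(r"[^\w\s]")  # same pattern as punctuation_removal

# Token-level cleaning, assuming "<p>" was split into three tokens.
tokens = ["<", "p", ">", "Back"]
cleaned_tokens = [PUNCT_RE.sub("", TAG_RE.sub("", t)) for t in tokens]
cleaned_tokens = [t for t in cleaned_tokens if t]
print(cleaned_tokens)  # the tag name "p" is left behind

def clean_raw_text(txt):
    """Strip HTML tags and punctuation from a raw string, then collapse whitespace."""
    no_tags = TAG_RE.sub(" ", txt)
    no_punct = PUNCT_RE.sub("", no_tags)
    return " ".join(no_punct.split())

raw_clean = clean_raw_text("<p>Back our <b>project</b>, please!</p>")
print(raw_clean)
```

Whether the reordering is appropriate depends on how the tokenizer treats markup in the actual stories and comments, so it is worth inspecting a few tokenized rows first.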

### 6.2. Meta Attributes Preprocessing

## 7. Word Embeddings <a class="anchor" id="wembed"></a>

---