# NLP - Kickstarter
From an investor's point of view, the risk of losing capital on a failed investment is high.
Kickstarter projects have a few characteristics that make them risky for backers who want to invest in successful projects:
- Items are frequently new and have not been tested in a mature market.
- Creators may be unskilled and lack the necessary abilities to develop and launch products.

As a result, backers resemble venture capitalists. The only difference is that, instead of equity, backers pledge money in exchange for a (usually tangible) reward.

Therefore, taking the perspective of backers, we would only put money into initiatives that have the best chance of succeeding. Given the vast number of projects on Kickstarter, there would be plenty of high-probability-of-success campaigns to pick from.

With NLP, we will try to fit a model and evaluate its **precision**, since this appears to be the most relevant metric: out of the projects that the model predicted would succeed, how many actually succeeded?
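
To make the metric concrete, here is a minimal sketch of how precision is computed from true and predicted labels. The function name `precision_from_labels` and the toy labels are illustrative assumptions, not part of the analysis itself.

```python
def precision_from_labels(y_true, y_pred):
    """Precision = TP / (TP + FP), with 1 = success and 0 = failure."""
    # True positives: predicted success, actually a success
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
    # False positives: predicted success, actually a failure
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)
    if tp + fp == 0:
        # No project was predicted to succeed, so precision is undefined
        return float("nan")
    return tp / (tp + fp)

# Toy labels (made up for illustration only)
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0]
toy_precision = precision_from_labels(y_true, y_pred)
print(toy_precision)
```

A backer who only funds predicted successes cares about this ratio more than about the successes the model misses, which is why recall plays a secondary role here.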

In [5]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()

In [7]:
# The following need to be rerun only if you don't have the packages
!conda install nltk --yes
!conda install -c conda-forge spacy --yes
!python -m spacy download en_core_web_sm
!pip install wordcloud
!conda install -c anaconda gensim --yes

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting wordcloud
  Using cached wordcloud-1.8.1.tar.gz (220 kB)
Building wheels for collected packages: wordcloud
  Building wheel for wordcloud (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /opt/anaconda3/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/v7/rl617x8x7yg31vngrgpbx2pw0000gn/T/pip-install-o5jpgb2k/wordcloud_71eb616027a24085941cb60c672aeba9/setup.py'"'"'; __file__='"'"'/private/var/folders/v7/rl617x8x7yg31vngrgpbx2pw0000gn/T/pip-install-o5jpgb2k/wordcloud_71eb616027a24085941cb60c672aeba9/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exe

In [None]:
# DATA


In [1]:
# Blurb length: number of words in each blurb
df_nlp['blurb_length'] = df['blurb'].str.split().str.len()


In [None]:
# Keep only projects whose state is not "live"
# Keep the columns id, slug and state
# Remove missing values
# Encode the outcome: 1 = success, 0 = failure
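
The steps listed in this cell could be sketched as follows. This is only an illustration under stated assumptions: the helper name `prepare_labels`, the toy data, and the state strings `"successful"`, `"failed"` and `"live"` are hypothetical, and the sketch treats every non-successful finished project as a failure and keeps the `blurb` column because later cells use it.

```python
import pandas as pd

def prepare_labels(df):
    """Keep finished projects and encode the outcome as 1 (success) / 0 (failure)."""
    # Assumed column names: id, slug, state, blurb
    out = df.loc[df["state"] != "live", ["id", "slug", "state", "blurb"]]
    # Remove rows with missing values
    out = out.dropna().copy()
    # Assumption: "successful" marks a success; any other finished state counts as 0
    out["success"] = (out["state"] == "successful").astype(int)
    return out.reset_index(drop=True)

# Toy data for illustration only
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "slug": ["a", "b", "c", "d"],
    "state": ["successful", "failed", "live", "successful"],
    "blurb": ["A board game", "A smart mug", "A film", None],
})
labelled = prepare_labels(toy)
print(labelled)
```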

In [None]:
# Remove all non-alphabetical characters, then convert the words to lowercase.

# Here, we assume that non-alphabetic tokens, such as numbers and punctuation, play a minimal role in prediction.
# The words are then stemmed to their root forms, and stop words are removed.
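
A minimal sketch of this cleaning step, assuming NLTK's `PorterStemmer` as the stemmer. The helper `clean_blurb`, the tiny `STOP_WORDS` set and the example blurbs are hypothetical; in practice a full stop-word list (for example NLTK's English list) would replace the small set used here.

```python
import re
from nltk.stem import PorterStemmer

# Tiny illustrative stop-word list; a full list would be used in practice
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}
stemmer = PorterStemmer()

def clean_blurb(blurb):
    # Replace every non-alphabetical character with a space, then lowercase
    letters_only = re.sub(r"[^a-zA-Z]", " ", str(blurb)).lower()
    # Drop stop words and stem the remaining words
    words = [stemmer.stem(w) for w in letters_only.split() if w not in STOP_WORDS]
    return " ".join(words)

# Example blurbs made up for illustration
blurbs = ["The 3 Robots read 2 Books!", "A game of cards"]
corpus = [clean_blurb(b) for b in blurbs]
print(corpus)
```

The result is a list of cleaned strings, one per blurb, which is the shape `CountVectorizer` expects as input.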

In [None]:
# Create the bag-of-words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=2500)  # Keep the 2500 most frequently used words
X = cv.fit_transform(corpus).toarray()  # Document-term count matrix, converted to a dense array
y = dataset.iloc[:, 3].values  # Labels, taken from the fourth column of the dataset


# Consider only the **** most used words. Choose the number by trying several values and checking which one makes the most difference
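
One way to explore the vocabulary size is to loop over candidate values of `max_features`. The sketch below only shows how the parameter changes the vocabulary on a toy corpus. The candidate values and the corpus are made up, and in the real notebook each candidate would presumably be scored by refitting the classifier and comparing precision.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the cleaned blurbs
toy_corpus = [
    "robot board game",
    "card game game",
    "robot film music",
]

vocab_sizes = {}
for k in (2, 4):  # hypothetical candidate values for max_features
    cv_k = CountVectorizer(max_features=k)
    X_k = cv_k.fit_transform(toy_corpus)  # sparse document-term matrix
    vocab_sizes[k] = X_k.shape[1]
    # Only the k most frequent words across the corpus are kept
    print(k, sorted(cv_k.vocabulary_))
```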


In [None]:
# Word Cloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Rows excluded from the word cloud
skip_rows = {44343, 62766, 97999}
# Join the blurbs with spaces so that words from adjacent blurbs do not fuse
text = " ".join(
    str(dataset.blurb[i]) for i in range(len(dataset.blurb)) if i not in skip_rows
)

wordcloud = WordCloud(max_font_size=50, max_words=40, background_color="white").generate(text.lower())
plt.figure(figsize=(12,5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
# Fitting our classifier
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
# Fitting Logistic Regression to the Training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)


In [None]:
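# Predictions
from sklearn.metrics import confusion_matrix

# Predict "success" only when the estimated probability exceeds 0.9
pred = classifier.predict_proba(X_test)[:, 1]
pred = (pred > 0.9).astype(int)
cm = confusion_matrix(y_test, pred)
# Precision = TP / (FP + TP); undefined if no project is predicted to succeed
precision = cm[1, 1] / (cm[0, 1] + cm[1, 1])
print(f"Precision is {precision:.3f}")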

In [None]:
# Try a neural network on the bag-of-words model.
# Use Word2vec to convert the words to vectors using the GloVe data, and then run a neural network on them. I also attempted to cluster the observations with K-Means based on the descriptions to achieve better results. Dimensionality reduction techniques did not help much either.
# Despite their greater sophistication and purported effectiveness with text, these models could not beat the accuracy of a simple Logistic Regression, and they took much longer to run.