# Natural Language Processing - Basics

In this project we will use articles from [Hacker News](http://news.ycombinator.com/) to predict the number of upvotes (or likes) of each post, using basic natural language processing techniques. We will proceed as follows:

1. Data exploration
2. Tokenization of headlines
3. Removal of irrelevant tokens (stopwords) and of low-frequency tokens, which would cause overfitting
4. Train/test data split
5. Prediction of upvotes
6. Error measurement (MSE, RMSE)

In [1]:
# Disable warnings in Anaconda
import warnings
warnings.filterwarnings('ignore')

import pandas as pd # Data processing
pd.options.display.max_columns = 52
submissions = pd.read_csv("sel_hn_stories.csv")
submissions.columns = ["submission_time", "upvotes", "url", "headline"]
submissions = submissions.dropna()
submissions.head()

| | submission_time | upvotes | url | headline |
|---|---|---|---|---|
| 0 | 2010-02-17T16:57:59Z | 1 | blog.jonasbandi.net | Software: Sadly we did adopt from the construc... |
| 1 | 2014-02-04T02:36:30Z | 1 | blogs.wsj.com | Google’s Stock Split Means More Control for L... |
| 2 | 2011-10-26T07:11:29Z | 1 | threatpost.com | SSL DOS attack tool released exploiting negoti... |
| 3 | 2011-04-03T15:43:44Z | 67 | algorithm.com.au | Immutability and Blocks Lambdas and Closures |
| 4 | 2013-01-13T16:49:20Z | 1 | winmacsofts.com | Comment optimiser la vitesse de Wordpress? |


## Tokenization of headlines

In [2]:
tokenized_headlines = submissions['headline'].apply(str.split).tolist()

In [3]:
# We could use: from string import punctuation, but we customized the punctuation signs in the list below
punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]
clean_tokenized = []

# In this loop we'll convert tokens into lowercase and get rid of punctuation
for headline in tokenized_headlines:
    for i, word in enumerate(headline):
        headline[i] = headline[i].lower()
        for punct in punctuation:
            if punct in headline[i]:
                headline[i] = headline[i].replace(punct,"")
    clean_tokenized.append(headline)
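
The same cleaning step can be sketched more compactly with `str.translate`. This is only an illustrative alternative, not the code used in this notebook: the helper name `clean_headline` is hypothetical, and dropping empty tokens is an added assumption (the loop above keeps them as empty strings).

```python
# Same punctuation characters as the list used in the loop above.
PUNCTUATION = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]

# With a third argument, str.maketrans maps each listed character to None (deletion).
_STRIP_TABLE = str.maketrans("", "", "".join(PUNCTUATION))


def clean_headline(headline):
    """Hypothetical helper: lowercase, split on whitespace, strip punctuation."""
    tokens = headline.lower().split()
    cleaned = [token.translate(_STRIP_TABLE) for token in tokens]
    # Assumption: discard tokens that end up empty (for example a lone "-").
    return [token for token in cleaned if token]


print(clean_headline("Google’s Stock Split - Means More Control?"))
# ['googles', 'stock', 'split', 'means', 'more', 'control']
```

Building the translation table once and reusing it avoids the inner loop over every punctuation sign for every token.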

In [4]:
import numpy as np
unique_tokens = []
single_tokens = []

# In this loop we'll add all the words/tokens that occur more than once to the unique_tokens list.
# We will use a helper list single_tokens to identify which tokens occur more than once
# We could have used a dictionary for this to avoid using a helper list.
for headline in clean_tokenized:
    for word in headline:
        if word in single_tokens and word not in unique_tokens:
            unique_tokens.append(word)
        elif word not in single_tokens:
            single_tokens.append(word)
            
counts = pd.DataFrame(0, index = np.arange(len(clean_tokenized)),columns=unique_tokens)

counts.head(1)

Unnamed: 0,and,for,as,you,is,the,split,good,how,what,Unnamed: 11,of,de,in,a,with,amazon,cloud,at,google,to,status,back,raises,faster,an,...,deploying,plate,healthcare,term,gist,saving,devops,improved,practical,celebrate,thomas,sabo,club,breaking,macbook,contracts,frameworks,animated,walks,auctions,clouds,hammer,autonomous,vehicle,crowdsourcing,disaster
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
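
As the comment in the cell above notes, a dictionary would avoid the helper list. A minimal sketch of that idea with `collections.Counter` follows; the toy `clean_tokenized` list is made up for illustration and is not taken from the data set.

```python
from collections import Counter

# Toy stand-in for clean_tokenized (assumed data, for illustration only).
clean_tokenized = [
    ["google", "raises", "the", "bar"],
    ["the", "cloud", "is", "faster"],
    ["google", "cloud", "status"],
]

# Count every token across all headlines in a single pass.
token_counts = Counter(token for headline in clean_tokenized for token in headline)

# Keep tokens that occur more than once. A Counter preserves first-seen order,
# so the column order can differ from the loop above, which appends a token
# when it is seen for the second time.
unique_tokens = [token for token, n in token_counts.items() if n > 1]

print(unique_tokens)  # ['google', 'the', 'cloud']
```

Membership tests on a list scan the whole list, so the nested loop slows down as the vocabulary grows, whereas dictionary lookups stay roughly constant-time.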


In [5]:
# In this loop we'll add the word counts for each headline
for index, headline in enumerate(clean_tokenized):
    for token in headline:
        if token in unique_tokens:
            counts.loc[index,token] +=1
counts.head(1)

Unnamed: 0,and,for,as,you,is,the,split,good,how,what,Unnamed: 11,of,de,in,a,with,amazon,cloud,at,google,to,status,back,raises,faster,an,...,deploying,plate,healthcare,term,gist,saving,devops,improved,practical,celebrate,thomas,sabo,club,breaking,macbook,contracts,frameworks,animated,walks,auctions,clouds,hammer,autonomous,vehicle,crowdsourcing,disaster
0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
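
Incrementing single cells with `counts.loc` can be slow on larger samples. One possible alternative, sketched here on made-up data, is to build the rows as plain lists first and create the DataFrame once at the end. The names `vocabulary` and `rows` are hypothetical.

```python
from collections import Counter

# Toy inputs (assumed data, for illustration only).
clean_tokenized = [["the", "cloud", "the"], ["google", "cloud"]]
unique_tokens = ["the", "cloud"]

vocabulary = set(unique_tokens)  # set membership checks are fast
rows = []
for headline in clean_tokenized:
    # Count only the tokens that are part of the vocabulary.
    headline_counts = Counter(t for t in headline if t in vocabulary)
    # One row per headline, one column per token, in unique_tokens order.
    rows.append([headline_counts[t] for t in unique_tokens])

print(rows)  # [[2, 1], [0, 1]]

# With pandas available, pd.DataFrame(rows, columns=unique_tokens) would give
# a table with the same layout as `counts`.
```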


## Token removal
Stopwords like `and` and `to` occur in most headlines and provide no information.
Words that occur only a few times will cause overfitting, because there are not enough examples to learn from them.

In [6]:
word_counts = counts.sum() # return number of times each word appears

# We will remove words that occur fewer than 5 times or more than 100 times
col_filter = word_counts[(word_counts >= 5) & (word_counts <= 100)].index

counts = counts.loc[:,col_filter]
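
The effect of the filter can be illustrated on a small dictionary. The counts below are made up and the names `MIN_COUNT` and `MAX_COUNT` are hypothetical; both bounds are inclusive, matching the `>=` and `<=` in the cell above.

```python
# Made-up total counts per token (for illustration only).
word_counts = {"the": 250, "google": 40, "cloud": 12, "sabo": 2, "to": 180, "devops": 5}

MIN_COUNT, MAX_COUNT = 5, 100  # inclusive bounds

# Keep tokens that are neither too rare nor too common.
kept = [word for word, n in word_counts.items() if MIN_COUNT <= n <= MAX_COUNT]

print(kept)  # ['google', 'cloud', 'devops']
```

Very frequent tokens such as `the` and `to` are dropped as likely stopwords, and very rare ones such as `sabo` are dropped because they appear too seldom to generalize.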

## Splitting the data - Train-Test
- `X_train`: Train features
- `X_test`: Test features
- `y_train`: Train target (upvotes)
- `y_test`: Test target (upvotes)

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    counts, submissions["upvotes"], test_size=0.2, random_state=1)

## Linear regression model

In [8]:
from sklearn.linear_model import LinearRegression

clf = LinearRegression()
clf.fit(X_train,y_train)
predictions = clf.predict(X_test)

### Linear regression: Error score
We'll use mean squared error (MSE) and root mean squared error (RMSE). RMSE is the square root of MSE, so it is expressed in the same unit as the target (upvotes).
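
To make the two metrics concrete, here is a small pure-Python sketch on toy values. The functions `mse` and `rmse` are hypothetical helpers written for illustration; the notebook itself relies on `mean_squared_error` from scikit-learn.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, expressed in the target's unit (upvotes)."""
    return math.sqrt(mse(y_true, y_pred))


# Toy values, assumed for illustration only.
y_true = [1, 5, 9, 1]
y_pred = [1, 1, 9, 1]

print(mse(y_true, y_pred))   # 16 / 4 = 4.0
print(rmse(y_true, y_pred))  # sqrt(4.0) = 2.0
```

In this toy case the mean absolute error is 1.0 while the RMSE is 2.0: squaring weighs a single large miss more heavily than several small ones.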

In [9]:
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test,predictions)
mean_upvotes = submissions['upvotes'].mean()
std_upvotes = submissions['upvotes'].std()
print("Linear Regression Model")
print("-----------------------")
print("MSE = {:.2f}".format(mse))
print("RMSE = {:.2f}".format(mse**(1/2)))
print("Mean Upvotes = {:.2f}".format(mean_upvotes))
print("Standard Deviation Upvotes = {:.2f}".format(std_upvotes))

Linear Regression Model
-----------------------
MSE = 2651.15
RMSE = 51.49
Mean Upvotes = 10.10
Standard Deviation Upvotes = 39.50


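Our RMSE is about 51 upvotes, which is greater than the standard deviation of the upvotes (39.50). This means our model is not very accurate: a model that always predicted the mean number of upvotes would have an RMSE of roughly one standard deviation.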
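
The comparison with the standard deviation can be checked on a toy example: a baseline that always predicts the mean has an RMSE equal to the population standard deviation of the values it is scored on. The upvote values below are made up for illustration.

```python
import math
import statistics

# Made-up upvote counts (for illustration only).
upvotes = [1, 1, 1, 67, 1, 3, 2, 4]

# Baseline: predict the mean for every post.
mean_upvotes = statistics.mean(upvotes)
baseline_mse = sum((u - mean_upvotes) ** 2 for u in upvotes) / len(upvotes)
baseline_rmse = math.sqrt(baseline_mse)

# Population standard deviation (divides by n, like the MSE above).
population_std = statistics.pstdev(upvotes)

print(baseline_rmse, population_std)
print(math.isclose(baseline_rmse, population_std))  # True
```

Note that pandas `Series.std()` uses the sample standard deviation (`ddof=1`) by default, which is slightly larger than the population value on small samples, so the match is approximate rather than exact.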

## Random forests regression model

In [10]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(min_samples_leaf=5)
rf.fit(X_train,y_train)
predictions = rf.predict(X_test)

### Random forests: Error score

In [11]:
mse = mean_squared_error(y_test,predictions)
mean_upvotes = submissions['upvotes'].mean()
std_upvotes = submissions['upvotes'].std()
print("Random Forest Regression Model")
print("-------------------------------")
print("MSE = {:.2f}".format(mse))
print("RMSE = {:.2f}".format(mse**(1/2)))
print("Mean Upvotes = {:.2f}".format(mean_upvotes))
print("Standard Deviation Upvotes = {:.2f}".format(std_upvotes))

Random Forest Regression Model
-------------------------------
MSE = 1916.18
RMSE = 43.77
Mean Upvotes = 10.10
Standard Deviation Upvotes = 39.50


Using random forests, our RMSE decreases a bit (from 51.49 to 43.77), but it is still above the standard deviation, so there is room for improvement.

## Conclusion
Our first attempts to predict the upvotes are not very accurate, since our RMSE is greater than our standard deviation.

In order to increase our model accuracy, we can:
        
- Use more data: For learning purposes, we are using a small sample of [this data set](https://github.com/arnauddri/hn)
- Use features like headline length and average word length (meta features)
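
The meta features mentioned in the list above could be computed along these lines. This is only a sketch: the function name `headline_meta_features` and the dictionary keys are hypothetical, and measuring headline length in characters is an assumption.

```python
def headline_meta_features(headline):
    """Hypothetical helper returning simple meta features for one headline."""
    words = headline.split()
    # Guard against empty headlines to avoid dividing by zero.
    avg_word_length = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        "headline_length": len(headline),  # assumption: length in characters
        "avg_word_length": avg_word_length,
    }


features = headline_meta_features("Immutability and Blocks Lambdas and Closures")
print(features)  # {'headline_length': 44, 'avg_word_length': 6.5}
```

Columns like these could be added next to the token counts before the train/test split.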
