# NLP: Predicting Upvotes Based on Headline
## Introduction
Hacker News is a community where users can submit articles, and other users can upvote those articles. The articles with the most upvotes make it to the front page, where they're more visible to the community.
## Goal
In this project, I'll predict the number of upvotes an article receives based on its headline. Because upvotes indicate popularity, this should also show which types of articles tend to be the most popular.
## Data

The data set consists of submissions users made to Hacker News from 2006 to 2015. Developer Arnaud Drizard used the Hacker News API to scrape the data, which can be found in one of his [GitHub repositories](https://github.com/arnauddri/hn). I randomly sampled 3000 rows from the data and removed the extraneous columns. I'll work only with the following four columns:

* `submission_time` - When the article was submitted
* `upvotes` - The number of upvotes the article received
* `url` - The base URL of the article
* `headline` - The article's headline

In [1]:
import pandas as pd
import numpy as np

submissions = pd.read_csv("sel_hn_stories.csv")
submissions.columns = ["submission_time", "upvotes", "url", "headline"]
submissions = submissions.dropna()
submissions.head()

Unnamed: 0,submission_time,upvotes,url,headline
0,2010-02-17T16:57:59Z,1,blog.jonasbandi.net,Software: Sadly we did adopt from the construc...
1,2014-02-04T02:36:30Z,1,blogs.wsj.com,Google’s Stock Split Means More Control for L...
2,2011-10-26T07:11:29Z,1,threatpost.com,SSL DOS attack tool released exploiting negoti...
3,2011-04-03T15:43:44Z,67,algorithm.com.au,Immutability and Blocks Lambdas and Closures
4,2013-01-13T16:49:20Z,1,winmacsofts.com,Comment optimiser la vitesse de Wordpress?


### Data Preparation
My goal is to train a linear regression model that predicts the number of upvotes a headline would receive. To do this, I'll need to convert each headline to a numerical representation. I'll use the 'bag of words' model, which represents each piece of text as a vector of word counts, with one column per word in the vocabulary.
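
To make the representation concrete, here is a minimal sketch of a bag of words built with the standard library only. The two headlines and the names `toy_headlines`, `toy_tokens`, `vocabulary`, and `vectors` are made up for illustration and are not part of the data set.

```python
from collections import Counter

# Two made-up headlines, used only for illustration.
toy_headlines = ["Show HN: my new app", "My app is new and new"]

# Lowercase, split on whitespace, and strip colons from token edges.
toy_tokens = [[w.strip(":") for w in h.lower().split()] for h in toy_headlines]

# Vocabulary: every distinct token, sorted, becomes one column.
vocabulary = sorted({tok for tokens in toy_tokens for tok in tokens})

# One row per headline; each entry counts how often the column's token appears.
vectors = []
for tokens in toy_tokens:
    token_counts = Counter(tokens)
    vectors.append([token_counts[tok] for tok in vocabulary])

print(vocabulary)  # ['and', 'app', 'hn', 'is', 'my', 'new', 'show']
for row in vectors:
    print(row)     # [0, 1, 1, 0, 1, 1, 1] then [1, 1, 0, 1, 1, 2, 0]
```

Each headline becomes a row of counts, and word order is discarded. That is the trade-off the bag of words model makes.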

In [2]:
tokenized_headlines = []
for item in submissions['headline']:
    tokenized_headlines.append(item.split())

#preview the data  
print(tokenized_headlines[0:5])

[['Software:', 'Sadly', 'we', 'did', 'adopt', 'from', 'the', 'construction', 'analogy'], ['Google’s', 'Stock', 'Split', 'Means', 'More', 'Control', 'for', 'Larry', 'and', 'Sergey'], ['SSL', 'DOS', 'attack', 'tool', 'released', 'exploiting', 'negotiation', 'overhead'], ['Immutability', 'and', 'Blocks', 'Lambdas', 'and', 'Closures'], ['Comment', 'optimiser', 'la', 'vitesse', 'de', 'Wordpress?']]


The tokens need some cleanup before they are useful for prediction. I'll strip punctuation and lowercase every word so that variants such as `Software:` and `software` count as the same token.

In [3]:
punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", 
               "-", "+", "&", "(", ")"]
clean_tokenized = []

for item in tokenized_headlines:
    tokens = []
    for token in item:
        token = token.lower()
        for punc in punctuation:
            token = token.replace(punc, "")
        tokens.append(token)
    clean_tokenized.append(tokens)
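
As an alternative sketch, the same cleanup can be done in one pass per token with `str.translate`. It assumes the same punctuation characters as the list above. The helper name `clean_token` and the sample tokens are illustrative only.

```python
# Same punctuation characters as the list above, joined into one string.
punctuation_chars = ",:;.'\"’?/-+&()"

# With a third argument, str.maketrans maps each listed character to None (deleted).
strip_table = str.maketrans("", "", punctuation_chars)

def clean_token(token):
    """Lowercase a token and delete the listed punctuation characters."""
    return token.lower().translate(strip_table)

sample = ["Software:", "Google’s", "Wordpress?"]
cleaned = [clean_token(t) for t in sample]
print(cleaned)  # ['software', 'googles', 'wordpress']
```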

Next, I'll collect every token that appears more than once across the headlines, create a matrix with those tokens as column headers, and then populate the matrix with the number of times each token occurs in each headline.

In [4]:
unique_tokens = []  # tokens seen at least twice across all headlines
single_tokens = []  # tokens seen at least once
for tokens in clean_tokenized:
    for token in tokens:
        if token not in single_tokens:
            single_tokens.append(token)
        elif token not in unique_tokens:
            unique_tokens.append(token)

counts = pd.DataFrame(0, index=np.arange(len(clean_tokenized)), columns=unique_tokens)
counts.head()

Unnamed: 0,and,for,as,you,is,the,split,good,how,what,...,frameworks,animated,walks,auctions,clouds,hammer,autonomous,vehicle,crowdsourcing,disaster
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
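
The selection of repeated tokens can also be sketched with `collections.Counter`, which avoids repeated list lookups. This is a sketch on made-up data: `toy_clean_tokenized` stands in for `clean_tokenized`, and the other names are illustrative.

```python
from collections import Counter

# Stand-in for clean_tokenized: three short, made-up token lists.
toy_clean_tokenized = [["open", "source", "api"], ["new", "api"], ["open", "web"]]

# Count every token across all headlines in a single pass.
token_totals = Counter(tok for tokens in toy_clean_tokenized for tok in tokens)

# Keep tokens seen at least twice, in order of first appearance.
repeated_tokens = [tok for tok, n in token_totals.items() if n >= 2]
print(repeated_tokens)  # ['open', 'api']
```

The set of tokens matches what the nested loop produces, but the order can differ, because the loop orders tokens by their second appearance rather than their first.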


In [5]:
for i, item in enumerate(clean_tokenized):
    for token in item:
        if token in unique_tokens:
            # Use .loc with (row, column) so the update is written to counts itself.
            counts.loc[i, token] += 1
            
counts.head()

Unnamed: 0,and,for,as,you,is,the,split,good,how,what,...,frameworks,animated,walks,auctions,clouds,hammer,autonomous,vehicle,crowdsourcing,disaster
0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Removing Columns to Increase Accuracy
My resulting matrix contains over 2000 columns, far too many for a linear regression model to fit reliably on only a few thousand rows. To fix this, I will remove every column for a word that occurs fewer than 5 times or more than 100 times. The upper cutoff drops the most frequent stopwords, and the lower cutoff drops words that are too rare to generalize.

In [6]:
word_counts = counts.sum(axis=0)

counts = counts.loc[:,(word_counts >= 5) & (word_counts <= 100)]

counts.shape

(2800, 661)
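
The frequency filter can be illustrated on a small dictionary of word totals. The counts below are invented for the example, and `toy_word_counts` is a hypothetical stand-in for `word_counts`.

```python
# Made-up total occurrence counts per word, standing in for word_counts.
toy_word_counts = {"the": 412, "google": 37, "startup": 12, "hammer": 2, "api": 5}

MIN_COUNT, MAX_COUNT = 5, 100  # same inclusive cutoffs as in the cell above

# Keep a word only if its total lies inside the inclusive range.
kept_words = [w for w, n in toy_word_counts.items() if MIN_COUNT <= n <= MAX_COUNT]
print(kept_words)  # ['google', 'startup', 'api']
```

Here `the` is dropped for being too frequent and `hammer` for being too rare, while `api` survives because the cutoffs are inclusive.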

### Linear Regression


In [7]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X_train, X_test, y_train, y_test = train_test_split(counts, submissions["upvotes"], test_size=0.2, random_state=1)

lr = LinearRegression()
lr.fit(X_train, y_train)
predictions = lr.predict(X_test)

mse = mean_squared_error(y_test, predictions)
print(mse)

2651.1457056689683


This is a fairly large error. The mean number of upvotes is 10 and the standard deviation is 39.5, while the root mean squared error (RMSE, the square root of the MSE) is about 51.5. A model that always predicted the mean would have an error of roughly one standard deviation, so an RMSE above the standard deviation means my predictions are far off-base.
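
The arithmetic behind that comparison is short enough to check directly. This sketch reuses the MSE printed above and the quoted standard deviation; the variable names are illustrative.

```python
import math

mse = 2651.1457056689683   # MSE printed by the linear regression cell
upvote_std = 39.5          # standard deviation of upvotes quoted in the text

rmse = math.sqrt(mse)
print(round(rmse, 1))               # 51.5
print(round(rmse / upvote_std, 2))  # 1.3
```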

To Do:
* Add headline length and average word length as features
* Use a random forest or another technique
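
A possible sketch of the first item, assuming headline length is measured in words (it could equally be measured in characters). The helper name `headline_features` is hypothetical and not part of the notebook.

```python
def headline_features(headline):
    """Return (word count, average word length) for one headline."""
    words = headline.split()
    if not words:  # guard against empty headlines
        return 0, 0.0
    avg_word_length = sum(len(w) for w in words) / len(words)
    return len(words), avg_word_length

print(headline_features("Immutability and Blocks Lambdas and Closures"))  # (6, 6.5)
```

The two values could be appended as extra columns next to the word counts before training.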

In [8]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
preds = rf.predict(X_test)

mse = mean_squared_error(y_test, preds)
print(mse)



2395.3236084823275


By switching to a random forest, I reduced the RMSE to 48.9 upvotes, which is a small improvement but still a large error. I suspect some stopwords remain in my data set, and removing them may increase my model's accuracy.
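
One way the stopword removal might look, as a sketch only. The column names are a handful taken from the filtered matrix, and `example_stops` is a hypothetical, hand-picked list rather than a complete stopword set.

```python
# A few column names from the filtered matrix, for illustration.
column_names = ["as", "you", "good", "amazon", "cloud", "at", "google"]

# Hypothetical, hand-picked stopword list; a real one would be much longer.
example_stops = {"as", "you", "at", "an", "or", "it", "but"}

# Keep only the columns whose name is not a stopword.
kept_columns = [name for name in column_names if name not in example_stops]
print(kept_columns)  # ['good', 'amazon', 'cloud', 'google']
```

With the full list of column names, selecting `counts[kept_columns]` would give the reduced feature matrix.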

In [13]:
stops = []

counts.columns.values

array(['as', 'you', 'good', 'what', 'de', 'amazon', 'cloud', 'at',
       'google', 'back', 'raises', 'an', '2014', 'out', 'show', 'dont',
       'from', 'video', 'facebook', 'via', 'startups', 'testing',
       'releases', 'into', 'job', 'released', 'or', 'it', 'icrosoft',
       'programming', 'new', 'using', 'email', 'most', 'api', 'network',
       'first', 'but', 'hn', 'startup', 'security', 'part', '1', 'get',
       'really', 'rich', 'open', 'online', 'years', 'after', 'his',
       'live', 'not', 'china', 'web', 'three', 'hour', 'all', '[video]',
       '–', 'app', 'power', 'us', 'tv', 'why', 'top', 'windows', 'hacker',
       'microsoft', 'that', 'now', 'think', 'companies', 'can', 'control',
       'hack', 'may', 'introducing', 'this', 'my', 'twitter', 'tech',
       'internet', 'i', 'nsa', 'find', 'by', 'stop', 'them', 'traffic',
       'blog', 'some', 'best', 'own', 'success', 'industry', 'search',
       'time', 'start', 'its', 'realtime', 'chat', 'images', 'need',
       