# NLP: Predicting Upvotes Based on Headline
## Introduction
Hacker News is a community where users can submit articles, and other users can upvote those articles. The articles with the most upvotes make it to the front page, where they're more visible to the community.
## Goal
In this project, I'll predict the number of upvotes an article receives from its headline alone. Because upvotes are an indicator of popularity, the results should also show which types of articles tend to be the most popular.
## Data

The data set consists of submissions users made to Hacker News from 2006 to 2015. Developer Arnaud Drizard used the Hacker News API to scrape the data, which can be found in one of his [GitHub repositories](https://github.com/arnauddri/hn).

I'll work with four of its columns:

* `submission_time` - When the article was submitted
* `upvotes` - The number of upvotes the article received
* `url` - The base URL of the article
* `headline` - The article's headline

In [3]:
import pandas as pd
import numpy as np

submissions = pd.read_csv("stories.csv")
submissions.columns = ["id", "submission_time", "submission_id", "author", "upvotes", "url", "num_comments", "headline"]
submissions = submissions.dropna()
submissions.head()

|   | id | submission_time | submission_id | author | upvotes | url | num_comments | headline |
|---|---|---|---|---|---|---|---|---|
| 0 | 9079983 | 2015-02-20T11:34:22.000Z | 1424432062 | Rutger24s | 1 | startupjuncture.com | 0 | 24sessions: live business advice over video-chat |
| 1 | 9079986 | 2015-02-20T11:35:32.000Z | 1424432132 | AndrewDucker | 3 | blog.erratasec.com | 0 | Some notes on SuperFish |
| 2 | 9079988 | 2015-02-20T11:36:18.000Z | 1424432178 | davidiach | 1 | twitter.com | 0 | Apple Watch models could contain 29.16g of gold |
| 3 | 9080000 | 2015-02-20T11:41:06.000Z | 1424432466 | CiaranR | 1 | phpconference.co.uk | 0 | PHP UK Conference Diversity Scholarship Programme |
| 4 | 9080006 | 2015-02-20T11:43:04.000Z | 1424432584 | mstolpm | 2 | preview.onedrive.com | 2 | Microsoft giving away 100GB free OneDrive stor... |


In [6]:
# Keep only the columns needed for the analysis
submissions = submissions.drop(["id", "submission_id", "author", "num_comments"], axis=1)
submissions.head()

|   | submission_time | upvotes | url | headline |
|---|---|---|---|---|
| 0 | 2015-02-20T11:34:22.000Z | 1 | startupjuncture.com | 24sessions: live business advice over video-chat |
| 1 | 2015-02-20T11:35:32.000Z | 3 | blog.erratasec.com | Some notes on SuperFish |
| 2 | 2015-02-20T11:36:18.000Z | 1 | twitter.com | Apple Watch models could contain 29.16g of gold |
| 3 | 2015-02-20T11:41:06.000Z | 1 | phpconference.co.uk | PHP UK Conference Diversity Scholarship Programme |
| 4 | 2015-02-20T11:43:04.000Z | 2 | preview.onedrive.com | Microsoft giving away 100GB free OneDrive stor... |


In [10]:
submissions.shape

(1455868, 4)

### Data Preparation
My goal is to train a linear regression model that predicts the number of upvotes a headline will receive. To do this, I need to convert each headline to a numerical representation. I will use the bag-of-words model, which represents each piece of text as a vector of word counts, with one entry per word in the vocabulary.
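
To make the bag-of-words idea concrete, here is a minimal sketch on two made-up headlines. The names `toy_headlines`, `vocabulary`, and `vectors` are illustrative assumptions and are not part of the project code.

```python
from collections import Counter

# Two made-up headlines, used only for illustration.
toy_headlines = ["apple buys startup", "startup sells apple pie"]

# Tokenize on whitespace and build a sorted vocabulary of distinct words.
toy_tokens = [headline.split() for headline in toy_headlines]
vocabulary = sorted({token for tokens in toy_tokens for token in tokens})

# Each headline becomes a vector with one count per vocabulary word.
vectors = []
for tokens in toy_tokens:
    token_counts = Counter(tokens)
    vectors.append([token_counts[word] for word in vocabulary])

print(vocabulary)  # ['apple', 'buys', 'pie', 'sells', 'startup']
print(vectors)     # [[1, 1, 0, 0, 1], [1, 0, 1, 1, 1]]
```

Every vector has the same length as the vocabulary, which is what lets a regression model treat each word as a feature.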

In [7]:
# Split each headline on whitespace to get a list of tokens
tokenized_headlines = [item.split() for item in submissions['headline']]

# Preview the first five tokenized headlines
print(tokenized_headlines[0:5])

[['24sessions:', 'live', 'business', 'advice', 'over', 'video-chat'], ['Some', 'notes', 'on', 'SuperFish'], ['Apple', 'Watch', 'models', 'could', 'contain', '29.16g', 'of', 'gold'], ['PHP', 'UK', 'Conference', 'Diversity', 'Scholarship', 'Programme'], ['Microsoft', 'giving', 'away', '100GB', 'free', 'OneDrive', 'storage', 'for', '1', 'year']]


The raw tokens need some cleaning before they are useful for prediction. Tokens such as `24sessions:` and `24sessions` should count as the same word, so I will strip punctuation and convert every token to lowercase.

In [8]:
punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", 
               "-", "+", "&", "(", ")"]
clean_tokenized = []

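for item in tokenized_headlines:
    tokens = []
    for token in item:
        token = token.lower()
        # Remove every punctuation character from the token
        for punc in punctuation:
            token = token.replace(punc, "")
        # Skip tokens that were pure punctuation and are now empty
        if token:
            tokens.append(token)
    clean_tokenized.append(tokens)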
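
As an alternative sketch of the same cleaning step, Python's built-in `str.maketrans` and `str.translate` can delete all the punctuation characters in a single pass per token. The helper name `clean_token` and the variables `punctuation_chars` and `strip_table` are assumptions made for this example.

```python
# The same punctuation characters as in the list above, written as one string.
punctuation_chars = ",:;.'\"’?/-+&()"

# With three arguments, str.maketrans maps every character in the third
# argument to None, so translate() deletes those characters.
strip_table = str.maketrans("", "", punctuation_chars)

def clean_token(token):
    """Lowercase a token and delete the punctuation characters."""
    return token.lower().translate(strip_table)

sample = ["24sessions:", "live", "business", "advice", "over", "video-chat"]
print([clean_token(token) for token in sample])
# ['24sessions', 'live', 'business', 'advice', 'over', 'videochat']
```

Note that removing the hyphen joins `video-chat` into `videochat`, which matches the behavior of the loop above.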

Next, I will collect every token that occurs more than once across all headlines and use those tokens as the column headers of a matrix with one row per headline. Afterwards, I will populate the matrix with the number of times each token occurs in each headline.

In [9]:
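unique_tokens = []       # tokens that occur at least twice, in order of first repeat
single_tokens = set()    # every token seen so far; a set keeps membership checks fast
repeated_tokens = set()  # tokens already added to unique_tokens
for tokens in clean_tokenized:
    for token in tokens:
        if token not in single_tokens:
            single_tokens.add(token)
        elif token not in repeated_tokens:
            repeated_tokens.add(token)
            unique_tokens.append(token)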

counts = pd.DataFrame(0, index=np.arange(len(clean_tokenized)), columns=unique_tokens)
counts.head()

KeyboardInterrupt: 
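
A dense frame with one row per headline and one column per token grows very large with roughly 1.46 million headlines. As a minimal sketch of a lighter approach, the counts can be stored sparsely, keeping only the nonzero entries for each headline. The data in `toy_clean_tokenized` is made up, and the names `corpus_counts`, `kept_tokens`, and `sparse_rows` are assumptions for illustration; this is not the method the notebook used.

```python
from collections import Counter

# Toy stand-in for clean_tokenized; these headlines are made up.
toy_clean_tokenized = [
    ["show", "hn", "my", "new", "app"],
    ["ask", "hn", "which", "app", "is", "best"],
    ["new", "app", "launch"],
]

# Count every token once over the whole corpus.
corpus_counts = Counter(
    token for tokens in toy_clean_tokenized for token in tokens
)

# Keep tokens that occur at least twice, the same filter as the loop above.
kept_tokens = sorted(t for t, n in corpus_counts.items() if n >= 2)
kept_set = set(kept_tokens)

# One small Counter per headline, holding only the nonzero counts.
sparse_rows = [
    Counter(token for token in tokens if token in kept_set)
    for tokens in toy_clean_tokenized
]

print(kept_tokens)  # ['app', 'hn', 'new']
```

Each entry of `sparse_rows` stores only the tokens that actually appear in that headline, so memory use scales with the total number of tokens rather than with headlines times vocabulary size.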