# NLP: Predicting Upvotes Based on Headline
## Introduction
Hacker News is a community where users can submit articles, and other users can upvote those articles. The articles with the most upvotes make it to the front page, where they're more visible to the community.
## Goal
In this project, I'll be predicting the number of upvotes articles received, based on their headlines. Because upvotes are an indicator of popularity, I'll discover which types of articles tend to be the most popular.
## Data

The data set consists of submissions users made to Hacker News from 2006 to 2015. Developer Arnaud Drizard used the Hacker News API to scrape the data, which can be found in one of his [GitHub repositories](https://github.com/arnauddri/hn). I'll be working with four of its columns:

* `submission_time` - When the article was submitted
* `upvotes` - The number of upvotes the article received
* `url` - The base URL of the article
* `headline` - The article's headline

In [2]:
import pandas as pd
import numpy as np

data = pd.read_csv("stories.csv")
data.columns = ["id", "submission_time", "submission_id", "author", "upvotes", "url", "num_comments", "headline"]
data = data.dropna()

In [3]:
#drop columns I don't need
data = data.drop(["id", "submission_id", "author", "num_comments"], axis=1)

In [4]:
data.shape

(1455868, 4)

This dataset is large, so I'll shuffle the rows of the data frame and use a quarter of them for this project.

In [6]:
np.random.seed(1)
shuffled_index = np.random.permutation(data.index)
data = data.reindex(shuffled_index)

#number of rows I want
print(int(len(data) / 4))

363967


In [12]:
#create new df of my desired length (the first quarter of the shuffled rows)
submissions = data.iloc[:len(data) // 4]
#to_csv returns None when it is given a path, so there is nothing to assign
submissions.to_csv(r'C:\Users\deand\Documents\Repositories\Predicting_Upvotes\submissions.csv', index=False)
submissions.head(3)
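
As an aside, the shuffle-and-slice steps above can be collapsed into a single call to `DataFrame.sample`. The sketch below runs on a small stand-in frame (`toy` and `quarter` are illustrative names, not part of the notebook), and it is not guaranteed to select the same rows as the `np.random.permutation` approach used here.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; in the notebook this would be `data`
toy = pd.DataFrame({
    "upvotes": np.arange(8),
    "headline": [f"headline {i}" for i in range(8)],
})

# frac=0.25 keeps a quarter of the rows; random_state makes the draw repeatable
quarter = toy.sample(frac=0.25, random_state=1)

print(len(quarter))
```

Because `sample` draws without replacement by default, no row appears twice in the result.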

### Data Preparation
My goal is to train a linear regression model that predicts the number of upvotes a headline would receive. To do this, I'll need to convert each headline to a numerical representation. I will be using the 'bag of words' model, which represents each piece of text as a vector of word counts.
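
To make the bag-of-words idea concrete, here is a minimal sketch on two made-up headlines. The headlines and the names `toy_headlines`, `vocabulary`, and `vectors` are assumptions for illustration only; they are not taken from the data set. Each headline becomes a vector with one slot per vocabulary word, holding how many times that word appears.

```python
from collections import Counter

# Two made-up headlines, used only for illustration
toy_headlines = ["show hn my new app", "my app my rules"]

# Vocabulary: every distinct word, sorted so the column order is stable
vocabulary = sorted({word for headline in toy_headlines for word in headline.split()})

# One count vector per headline, aligned with the vocabulary
vectors = []
for headline in toy_headlines:
    word_counts = Counter(headline.split())
    vectors.append([word_counts[word] for word in vocabulary])

print(vocabulary)
print(vectors)
```

Word order is discarded in this representation: only the counts survive, which is what lets a linear model treat each word as a separate feature.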

In [8]:
tokenized_headlines = []
for item in submissions['headline']:
    tokenized_headlines.append(item.split())

#preview the data  
print(tokenized_headlines[0:5])

[['Safe', 'Conferences', 'Are', 'Deliberately', 'Designed'], ['Nike', 'Free', '7.0', 'V2', "Men's", 'Shoes', 'In', 'Black', '/', 'Gray', '/', 'Orange', '-', '$52.68', ':', 'Nike', 'Free', 'Run'], ['y', 'Own', 'HTTP', 'Daemon', '-', 'Simple', 'daemon', 'in', 'Linux', 'x86', 'assembly'], ['emory', 'card', 'xbox', '360'], ['OWC', 'launching', 'SandForce-based', 'SSDs', 'for', 'latest', 'MacBook', 'Air']]


Now that I have my tokens, I need to clean them so that variations of the same word are counted together. I will strip punctuation and make all words lowercase for consistency.

In [9]:
punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", 
               "-", "+", "&", "(", ")"]
clean_tokenized = []

for item in tokenized_headlines:
    tokens = []
    for token in item:
        token = token.lower()
        for punc in punctuation:
            token = token.replace(punc, "")
        tokens.append(token)
    clean_tokenized.append(tokens)
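
The same cleaning step can also be sketched with `str.translate`, which removes all listed characters in one pass instead of calling `replace` once per punctuation mark. The helper `clean_token` and the `sample` tokens are hypothetical names introduced for this example; the punctuation characters are the ones from the list above.

```python
# Same punctuation characters as the list above, joined into one string
punctuation_chars = ",:;.'\"’?/-+&()"

# With three arguments, str.maketrans maps every character in the third to None
remove_punctuation = str.maketrans("", "", punctuation_chars)

def clean_token(token):
    """Lowercase a token and strip the listed punctuation characters."""
    return token.lower().translate(remove_punctuation)

# A few tokens in the style of the preview output
sample = ["Men's", "SandForce-based", "$52.68", "/"]
cleaned = [clean_token(token) for token in sample]
print(cleaned)
```

Note that a token made up entirely of punctuation, such as `/`, becomes an empty string under either approach, so it may be worth filtering empty tokens out before counting.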

Now I will collect every token that appears more than once across the headlines, create a matrix, and assign those tokens as column headers. After that, I will populate the matrix with the number of times each token occurs in each headline.

In [10]:
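unique_tokens = []
single_tokens = []
#sets mirror the two lists so that membership checks stay fast on a large corpus
seen = set()
repeated = set()
for tokens in clean_tokenized:
    for token in tokens:
        if token not in seen:
            #first time this token appears
            seen.add(token)
            single_tokens.append(token)
        elif token not in repeated:
            #second appearance: keep the token as a column
            repeated.add(token)
            unique_tokens.append(token)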

counts = pd.DataFrame(0, index=np.arange(len(clean_tokenized)), columns=unique_tokens)
counts.head()

KeyboardInterrupt: 
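
The two steps described above, keeping tokens that occur more than once and then filling in per-headline counts, can be sketched on a toy input with `collections.Counter`. This is a minimal illustration under assumptions, not the notebook's method: `toy_clean`, `columns`, `column_index`, and `rows` are made-up names, and the columns come out in order of first appearance rather than second appearance.

```python
from collections import Counter

# Made-up cleaned headlines, for illustration only
toy_clean = [
    ["show", "my", "new", "app"],
    ["my", "app", "my", "rules"],
    ["new", "rules"],
]

# Count every token across all headlines in a single pass
totals = Counter(token for tokens in toy_clean for token in tokens)

# Keep tokens seen at least twice; Counter preserves first-seen order (Python 3.7+)
columns = [token for token, total in totals.items() if total >= 2]
column_index = {token: position for position, token in enumerate(columns)}

# One row per headline, one slot per kept token
rows = []
for tokens in toy_clean:
    row = [0] * len(columns)
    for token in tokens:
        if token in column_index:
            row[column_index[token]] += 1
    rows.append(row)

print(columns)
print(rows)
```

Looking up a token in a dictionary or set takes roughly constant time, whereas `token not in some_list` scans the whole list, so this form scales much better as the vocabulary grows.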