## Predicting The Upvotes

In this project, we'll predict the number of upvotes that articles received, based on their headlines. Because upvotes are an indicator of popularity, we'll discover which types of articles tend to be the most popular.

Hacker News is a community where users can submit articles, and other users can upvote those articles. The articles with the most upvotes make it to the front page, where they're more visible to the community. Our data set consists of submissions users made to Hacker News from 2006 to 2015.

Our data has only four columns:

- **submission_time** - When the article was submitted
- **upvotes** - The number of upvotes the article received
- **url** - The base URL of the article
- **headline** - The article's headline

In [1]:
import pandas as pd

In [2]:
submissions = pd.read_csv("sel_hn_stories.csv")
submissions.columns = ["submission_time", "upvotes", "url", "headline"]

In [3]:
submissions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2999 entries, 0 to 2998
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   submission_time  2999 non-null   object
 1   upvotes          2999 non-null   int64 
 2   url              2810 non-null   object
 3   headline         2989 non-null   object
dtypes: int64(1), object(3)
memory usage: 93.8+ KB


In [4]:
submissions = submissions.dropna()
submissions.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2800 entries, 0 to 2998
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   submission_time  2800 non-null   object
 1   upvotes          2800 non-null   int64 
 2   url              2800 non-null   object
 3   headline         2800 non-null   object
dtypes: int64(1), object(3)
memory usage: 109.4+ KB


Our goal is to train a linear regression algorithm that predicts the number of upvotes a headline would receive. To do this, we'll need to convert each headline to a numerical representation.

To do this, we'll split each headline into a list of words, clean each word by lowercasing it and removing punctuation, and finally build a list of the unique words.

In [5]:
tokenized_headlines = []

for row in submissions['headline']:
    lst = row.split()
    tokenized_headlines.append(lst)

In [6]:
punctuation = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]
clean_tokenized = []

for item in tokenized_headlines:
    tokens = []
    for token in item:
        token = token.lower()
        for punc in punctuation:
            token = token.replace(punc, "")
        tokens.append(token)
    clean_tokenized.append(tokens)
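
The two cells above can be condensed into a single helper. This is a minimal sketch rather than the notebook's own code: the name `clean_headline` is hypothetical, and it assumes that deleting the punctuation characters with `str.translate` is an acceptable replacement for the repeated `replace` calls.

```python
# Same punctuation characters as the list used above, joined into one string.
PUNCTUATION = ",:;.'\"’?/-+&()"

# Translation table that maps every punctuation character to None (deletion).
DELETE_PUNCTUATION = str.maketrans("", "", PUNCTUATION)


def clean_headline(headline):
    """Hypothetical helper: split on whitespace, lowercase, strip punctuation."""
    return [token.lower().translate(DELETE_PUNCTUATION) for token in headline.split()]


example_tokens = clean_headline("Ask HN: What's your favorite editor?")
print(example_tokens)
```

As in the loop above, a token made up only of punctuation (such as a lone `-`) becomes an empty string rather than being dropped.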

Next, we build a matrix with one column per unique word and one row per headline. Each cell holds the number of times that word occurs in that headline. In the code below, only words that occur more than once across all headlines are kept as columns.

For example, take these two tokenized headlines:

1) `[i, rode, my, horse, to, berlin]`
2) `[you, rode, my, horse, to, berlin, in, the, winter]`

Their combined list of unique words is:

`[i, rode, my, horse, to, berlin, in, the, winter, you]`

which gives the following matrix:

|   | i | rode | my | horse | to | berlin | in | the | winter | you |
|---|---|------|----|-------|----|--------|----|-----|--------|-----|
| 1 | 1 | 1    | 1  | 1     | 1  | 1      | 0  | 0   | 0      | 0   |
| 2 | 0 | 1    | 1  | 1     | 1  | 1      | 1  | 1   | 1      | 1   |
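
The toy example above can be reproduced in a few lines. This is only an illustrative sketch: the names `headlines`, `vocabulary`, and `toy_counts` are made up for the example, and the column order is copied from the table.

```python
import pandas as pd

headlines = [
    ["i", "rode", "my", "horse", "to", "berlin"],
    ["you", "rode", "my", "horse", "to", "berlin", "in", "the", "winter"],
]

# Column order taken from the table above.
vocabulary = ["i", "rode", "my", "horse", "to", "berlin", "in", "the", "winter", "you"]

# Start from an all-zero matrix: one row per headline, one column per word.
toy_counts = pd.DataFrame(0, index=[1, 2], columns=vocabulary)

# Add 1 to the matching cell for every token in every headline.
for row_label, tokens in zip(toy_counts.index, headlines):
    for token in tokens:
        toy_counts.loc[row_label, token] += 1

print(toy_counts)
```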

In [7]:
import numpy as np

unique_tokens = []  # tokens seen at least twice; these become the columns
single_tokens = []  # tokens seen at least once

for each in clean_tokenized:
    for item in each:
        if item not in single_tokens:
            single_tokens.append(item)
        elif item not in unique_tokens:
            unique_tokens.append(item)

# One row per headline, one column per repeated token, all counts starting at 0
counts = pd.DataFrame(0, index=np.arange(len(clean_tokenized)), columns=unique_tokens)

In [8]:
for index, each in enumerate(clean_tokenized):
    for item in each:
        if item in unique_tokens:
            # A single .loc call avoids chained indexing, which may update a copy
            counts.loc[index, item] += 1

In [9]:
counts.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2800 entries, 0 to 2799
Columns: 2310 entries, and to disaster
dtypes: int64(2310)
memory usage: 49.4 MB


We have over 2000 columns in our matrix. This can make it very hard for a linear regression model to make good predictions. Too many columns will cause the model to fit to noise instead of the signal in the data.

There are two kinds of features that will reduce prediction accuracy. Features that occur only a few times will cause overfitting, because the model doesn't have enough information to accurately decide whether they're important. These features will probably correlate differently with upvotes in the test set and the training set.

Features that occur too many times can also cause issues. These are words like `and` and `to`, which occur in nearly every headline. These words don't add any information, because they don't necessarily correlate with upvotes. These types of words are sometimes called `stopwords`.

To reduce the number of features and enable the linear regression model to make better predictions, we'll remove any words that occur fewer than `5` times or more than `100` times.

In [10]:
word_counts = counts.sum(axis=0)
counts = counts.loc[:,(word_counts >= 5) & (word_counts <= 100)]
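
To see what this filter does, here is a small sketch on made-up data. The DataFrame `toy` and the thresholds 2 and 3 are assumptions chosen so the effect is visible on four rows; they stand in for the real matrix and the thresholds 5 and 100.

```python
import pandas as pd

# Toy count matrix (hypothetical data): rows are headlines, columns are words.
toy = pd.DataFrame(
    {"the": [1, 1, 1, 1], "python": [1, 0, 1, 0], "zebra": [0, 0, 0, 1]}
)

# Total occurrences of each word across all headlines.
totals = toy.sum(axis=0)

# Keep words that appear at least 2 and at most 3 times
# (toy thresholds standing in for 5 and 100).
keep = (totals >= 2) & (totals <= 3)
filtered = toy.loc[:, keep]

print(list(filtered.columns))
```

Here `the` is dropped for being too common and `zebra` for being too rare, so only `python` survives.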

Now we'll need to split the data into two sets so that we can evaluate our algorithm effectively. We'll train our algorithm on a training set, then test its performance on a test set.

In [11]:
from sklearn.model_selection import train_test_split

In [12]:
X_train, X_test, y_train, y_test = train_test_split(counts, submissions["upvotes"], test_size=0.2, random_state=1)
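
As a quick illustration of what `test_size=0.2` means, the sketch below splits ten made-up rows. The arrays `X` and `y` are hypothetical stand-ins for `counts` and the `upvotes` column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 rows, 3 feature columns, one target value per row.
X = np.arange(30).reshape(10, 3)
y = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# 80% of the rows go to training and 20% to testing.
print(X_tr.shape, X_te.shape)
```

The features and targets are shuffled together, so each row of `X_te` still lines up with its value in `y_te`. With the 2800 headlines used here, the same setting leaves about 2240 rows for training and about 560 for testing.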

Now that we have a training set and a test set, let's train a model and make test predictions. When we fit a linear regression model, it assigns a coefficient to each column.

Essentially, the model is determining which words correlate with more upvotes, and which with fewer. By finding these correlations, the model will be able to predict which headlines will be highly upvoted in the future.

In [13]:
from sklearn.linear_model import LinearRegression

In [14]:
clf = LinearRegression()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
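
To make the idea of one coefficient per word concrete, here is a minimal sketch on made-up data. The words, the upvote values, and the names `X_toy`, `y_toy`, and `model` are all hypothetical; the targets are constructed as `10 + 5 * python - 3 * jobs` so the fit is exact.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy word counts (hypothetical): does the headline contain "python" or "jobs"?
X_toy = pd.DataFrame({"python": [1, 0, 1, 0], "jobs": [0, 1, 1, 0]})
# Toy upvotes built as 10 + 5 * python - 3 * jobs, so the fit is exact.
y_toy = pd.Series([15, 7, 12, 10])

model = LinearRegression().fit(X_toy, y_toy)

# Pair each learned coefficient with its column name.
coefficients = pd.Series(model.coef_, index=X_toy.columns)
print(coefficients.round(2))
print(round(model.intercept_, 2))
```

A positive coefficient means the word is associated with more upvotes in the training data, and a negative one with fewer. The same pairing of `coef_` with the column names should work for the model fitted above, although with hundreds of noisy columns the individual values are much less reliable.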

In [15]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, predictions)
print(mse)

2651.1457056689665
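
For reference, the mean squared error is the average of the squared differences between actual and predicted values. The sketch below computes it by hand on made-up numbers (`actual` and `predicted` are hypothetical) and also takes the square root, which puts the error back in units of upvotes.

```python
import numpy as np

# Toy values (hypothetical): actual and predicted upvotes for four articles.
actual = np.array([10, 0, 3, 50])
predicted = np.array([12, 4, 3, 20])

# MSE is the mean of the squared differences.
toy_mse = np.mean((actual - predicted) ** 2)
# RMSE is its square root, expressed in upvotes rather than squared upvotes.
toy_rmse = np.sqrt(toy_mse)

print(toy_mse, toy_rmse)
```

Because the errors are squared, the single large miss on the fourth article accounts for most of the total. Applied to the result above, an MSE of about 2651 corresponds to a root mean squared error of roughly 51.5 upvotes.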


Let's try the `RandomForestRegressor` algorithm for comparison.

In [16]:
from sklearn.ensemble import RandomForestRegressor

In [17]:
rfr = RandomForestRegressor(max_depth=20, random_state=1)
rfr.fit(X_train, y_train)
predictions = rfr.predict(X_test)

In [18]:
mse = mean_squared_error(y_test, predictions)

In [19]:
print(mse)

2039.3099402428354


The random forest improves on linear regression: the MSE drops from about 2651 to about 2039. The model may still be overfitting, though.
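
One common way to check for overfitting is to compare the error on the training set with the error on the test set. The sketch below does this on synthetic data, not on the Hacker News data: the arrays `X` and `y` are made up, and only the model settings (`max_depth=20`, `random_state=1`) match the cell above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): 10 count-like features, only the first one matters.
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(300, 10))
y = 3 * X[:, 0] + rng.normal(0, 5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

forest = RandomForestRegressor(max_depth=20, random_state=1)
forest.fit(X_tr, y_tr)

train_mse = mean_squared_error(y_tr, forest.predict(X_tr))
test_mse = mean_squared_error(y_te, forest.predict(X_te))

# A training error far below the test error suggests overfitting.
print(train_mse, test_mse)
```

Running the same comparison with `rfr`, `X_train`, and `X_test` would show how large the gap is for the headline model.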