# Natural Language Processing

May 18, 2018

A demonstration of manually encoding headlines as a bag of words and using that encoding to predict the number of upvotes (an indicator of popularity) that Hacker News articles received.

The data set consists of submissions users made to Hacker News from 2006 to 2015, scraped by developer Arnaud Drizard. 3000 rows were randomly sampled from the data set, and all columns were removed except these four:

- submission_time - when the article was submitted
- upvotes - the number of upvotes the article received
- url - the base URL of the article
- headline - the article's headline

- Hacker News: https://news.ycombinator.com
- Arnaud Drizard's data set: https://github.com/arnauddri/hn



In [295]:
import pandas as pd
submissions = pd.read_csv("sel_hn_stories.csv")
submissions.columns = ["submission_time", "upvotes", "url", "headline"]
submissions = submissions.dropna()

In [296]:
submissions.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 2800 entries, 0 to 2998
Data columns (total 4 columns):
submission_time    2800 non-null object
upvotes            2800 non-null int64
url                2800 non-null object
headline           2800 non-null object
dtypes: int64(1), object(3)
memory usage: 109.4+ KB


# Tokenizing Headlines

The headlines are encoded with a bag-of-words model, which represents each piece of text as a numerical vector of token counts. The steps are: split each headline into tokens, lowercase every token, strip punctuation, build a matrix of per-headline token counts (mostly zeros and ones, since few words repeat within a headline), and drop any token whose total frequency exceeds 100 or falls below 5. The smaller feature set should help the linear regression model make better predictions.

https://en.wikipedia.org/wiki/Bag-of-words_model
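
As a minimal sketch of the idea, each headline becomes a vector of token counts over a shared vocabulary. The two toy headlines and the names `toy_headlines`, `vocabulary`, and `vectors` are illustrative assumptions, not taken from the data set:

```python
# Toy illustration of a bag-of-words encoding; the headlines are made up.
toy_headlines = ["Show HN: a new Python tool", "A Python tool for data"]

# Lowercase and split on whitespace, as the cells below do.
tokenized = [h.lower().split() for h in toy_headlines]

# Vocabulary in order of first appearance.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# One row per headline, one column per vocabulary word.
vectors = [[tokens.count(word) for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for row in vectors:
    print(row)
```

Note that `hn:` keeps its colon here, which is why the punctuation-stripping step matters before building the real matrix.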

In [297]:
tokenized_headlines = []

for item in submissions["headline"]:
    tokenized_headlines.append(item.split())

In [298]:
sample = tokenized_headlines

In [302]:
loweree = [[t.lower() for t in each] for each in sample]

In [303]:
loweree

[['software:',
  'sadly',
  'we',
  'did',
  'adopt',
  'from',
  'the',
  'construction',
  'analogy'],
 ['google’s',
  'stock',
  'split',
  'means',
  'more',
  'control',
  'for',
  'larry',
  'and',
  'sergey'],
 ['ssl',
  'dos',
  'attack',
  'tool',
  'released',
  'exploiting',
  'negotiation',
  'overhead'],
 ['immutability', 'and', 'blocks', 'lambdas', 'and', 'closures'],
 ['comment', 'optimiser', 'la', 'vitesse', 'de', 'wordpress?'],
 ['ilk', 'is', 'not', 'as', 'good', 'for', 'you', 'as', 'you', 'think'],
 ['worldometers', '-', 'real', 'time', 'world', 'statistics'],
 ['icrosoft', 'strikes', 'back:', 'introduces', 'docs', 'for', 'facebook'],
 ['net', 'http', 'status', 'codes'],
 ['anecdata',
  'or',
  'how',
  'mckinsey’s',
  'story',
  'became',
  'sheryl',
  'sandberg’s',
  'fact'],
 ['immigration', 'overhaul', 'passes', 'in', 'senate'],
 ['what', 'matters', 'most', 'at', 'ad:tech', 'sf', '2014'],
 ['amazon',
  'silk',
  'revisited:',
  'is',
  'the',
  'split',
  'cloud',

In [305]:
puncs = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]

In [306]:
puncs

[',', ':', ';', '.', "'", '"', '’', '?', '/', '-', '+', '&', '(', ')']

In [307]:
clean = []
            
for each in loweree:
    sub=[]
    for t in each:
        for p in puncs:
            t = t.replace(p,"")
        sub.append(t)
    clean.append(sub)
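
The same stripping could also be done with a translation table instead of repeated `replace` calls. The sketch below is an alternative, not the method used in this notebook: the helper name `clean_tokens` is hypothetical, and unlike the loop above it also drops tokens that become empty once punctuation is removed (such as a lone `-`).

```python
# Sketch: strip the same punctuation characters with str.translate.
puncs = [",", ":", ";", ".", "'", '"', "’", "?", "/", "-", "+", "&", "(", ")"]

# Map every punctuation character to None, i.e. delete it.
strip_table = str.maketrans("", "", "".join(puncs))

def clean_tokens(tokens):
    """Remove punctuation and drop tokens that end up empty (an added assumption)."""
    stripped = (t.translate(strip_table) for t in tokens)
    return [t for t in stripped if t]

example = ["worldometers", "-", "real", "time", "world", "statistics"]
print(clean_tokens(example))
```

Dropping empty tokens here would avoid the `''` column that otherwise appears in the count matrix.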

In [308]:
clean


[['software',
  'sadly',
  'we',
  'did',
  'adopt',
  'from',
  'the',
  'construction',
  'analogy'],
 ['googles',
  'stock',
  'split',
  'means',
  'more',
  'control',
  'for',
  'larry',
  'and',
  'sergey'],
 ['ssl',
  'dos',
  'attack',
  'tool',
  'released',
  'exploiting',
  'negotiation',
  'overhead'],
 ['immutability', 'and', 'blocks', 'lambdas', 'and', 'closures'],
 ['comment', 'optimiser', 'la', 'vitesse', 'de', 'wordpress'],
 ['ilk', 'is', 'not', 'as', 'good', 'for', 'you', 'as', 'you', 'think'],
 ['worldometers', '', 'real', 'time', 'world', 'statistics'],
 ['icrosoft', 'strikes', 'back', 'introduces', 'docs', 'for', 'facebook'],
 ['net', 'http', 'status', 'codes'],
 ['anecdata',
  'or',
  'how',
  'mckinseys',
  'story',
  'became',
  'sheryl',
  'sandbergs',
  'fact'],
 ['immigration', 'overhaul', 'passes', 'in', 'senate'],
 ['what', 'matters', 'most', 'at', 'adtech', 'sf', '2014'],
 ['amazon',
  'silk',
  'revisited',
  'is',
  'the',
  'split',
  'cloud',
  'brows

# Build the encode matrix

Build a DataFrame with one row per headline and one column per unique token, initialized to zero. The next step fills in the token counts.

In [309]:
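import numpy as np

clean_tokenized = clean

# Flatten the per-headline token lists into a single list of tokens
single_tokens = [token for tokens in clean_tokenized for token in tokens]

# Collect the unique tokens in order of first appearance
unique_tokens = []
for token in single_tokens:
    if token not in unique_tokens:
        unique_tokens.append(token)

# One row per headline, one column per unique token, all zeros to start
counts = pd.DataFrame(0, index=np.arange(len(clean_tokenized)), columns=unique_tokens)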

In [310]:
counts


Unnamed: 0,software,sadly,we,did,adopt,from,the,construction,analogy,googles,...,bzier,curves,headshots,telephones,accessories,rotjs,roguelike,nissan,connecting,response
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
# Increment each headline's count for every token it contains

for i, item in enumerate(clean_tokenized):
    for token in item:
        if token in unique_tokens:
            counts.loc[i, token] += 1
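
Updating the matrix one cell at a time with `.loc` can be slow on a few thousand headlines. One possible alternative, sketched here on made-up data, counts tokens per headline with `collections.Counter` and lets pandas align the columns. The names `toy_tokenized` and `toy_counts` are illustrative and stand in for `clean_tokenized` and `counts`:

```python
from collections import Counter

import pandas as pd

# Two toy tokenized headlines standing in for clean_tokenized.
toy_tokenized = [["python", "tool", "python"], ["data", "tool"]]

# Count tokens per headline, then let pandas line up the columns.
# Tokens missing from a headline come out as NaN, so fill them with 0.
toy_counts = (
    pd.DataFrame([dict(Counter(tokens)) for tokens in toy_tokenized])
    .fillna(0)
    .astype(int)
)
print(toy_counts)
```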

In [335]:
counts

Unnamed: 0,software,sadly,we,did,adopt,from,the,construction,analogy,googles,...,bzier,curves,headshots,telephones,accessories,rotjs,roguelike,nissan,connecting,response
0,1,1,1,1,1,1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [336]:
sum(counts["and"])


289

In [340]:
# Total occurrences of each token across all headlines; tokens seen fewer than
# 5 times (too rare) or more than 100 times (too common) are dropped below
word_counts = []
for j in range(0, counts.shape[1]):
    word_counts.append(counts[counts.columns[j]].sum())



In [363]:
zz = counts
colname = []
for w in range(0,len(word_counts)):
    if word_counts[w] < 5 or word_counts[w] > 100:
        colname.append(counts.columns[w])
        
zz = zz.drop(colname,axis=1)
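
The same filter can be expressed as a boolean mask over the column totals. This is a sketch on a made-up matrix; the names `toy_counts`, `totals`, and `keep` are illustrative, and the thresholds mirror the 5 and 100 used above:

```python
import pandas as pd

# Toy count matrix; column names and values are made up.
toy_counts = pd.DataFrame({"the": [60, 50], "python": [3, 4], "rare": [1, 0]})

# Total occurrences of each token across all headlines.
totals = toy_counts.sum(axis=0)

# Keep tokens seen at least 5 and at most 100 times.
keep = (totals >= 5) & (totals <= 100)
filtered = toy_counts.loc[:, keep]
print(list(filtered.columns))
```

Keeping the mask in a variable also makes it easy to try other thresholds later.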

In [360]:
# features we dropped 
colname

['sadly',
 'adopt',
 'the',
 'construction',
 'analogy',
 'stock',
 'split',
 'for',
 'larry',
 'and',
 'sergey',
 'ssl',
 'dos',
 'exploiting',
 'negotiation',
 'overhead',
 'immutability',
 'blocks',
 'lambdas',
 'closures',
 'comment',
 'optimiser',
 'la',
 'vitesse',
 'ilk',
 'is',
 'worldometers',
 '',
 'strikes',
 'introduces',
 'docs',
 'status',
 'codes',
 'anecdata',
 'how',
 'mckinseys',
 'sheryl',
 'sandbergs',
 'fact',
 'immigration',
 'overhaul',
 'passes',
 'in',
 'senate',
 'matters',
 'adtech',
 'sf',
 'silk',
 'revisited',
 'faster',
 'dieter',
 'rams',
 'ten',
 'principles',
 'of',
 'a',
 'style',
 'said',
 'to',
 'implicate',
 'russia',
 'ocrtranslation',
 'chinese',
 'text',
 'des',
 'dcorations',
 'nol',
 'plus',
 'colos',
 'childish',
 'behaviour',
 'shouldnt',
 'atlantic',
 'monthly',
 'writer',
 'mirillis',
 'action',
 '1330',
 'older',
 'entrepreneurs',
 'niche',
 'children',
 'minus',
 'cryptic',
 'syntax',
 'ab',
 'with',
 'analytics',
 'latino',
 'sexdates',

In [365]:
counts = zz
counts

Unnamed: 0,software,we,did,from,googles,means,more,control,attack,tool,...,garden,shoes,see,ui,c#,wins,pirate,bay,preview,diet
0,1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# Model

Fit a linear regression model using scikit-learn's built-in LinearRegression class.


In [368]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(counts, submissions["upvotes"], test_size=0.2, random_state=1)

In [369]:
from sklearn.linear_model import LinearRegression

clf = LinearRegression()
clf.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [377]:
clf.coef_

array([-6.33981580e+00, -1.18617954e+00,  1.56589595e+00,  4.36475395e+00,
       -8.01809873e+00,  1.01799889e+01, -1.11567225e+01,  2.73060860e+01,
       -1.63175703e+01, -9.08068706e+00,  1.75204169e+01, -2.76722073e+00,
       -6.75975764e-01,  2.43201967e+00, -5.83061602e+00,  4.75168476e+00,
        9.16611325e-01,  6.19423877e+00,  1.78861309e+01,  4.04290514e+00,
       -1.06455261e+01, -2.13213415e+00, -2.42273866e+01,  2.08899091e+01,
        1.15315914e+01,  8.92053713e-01, -1.19174981e+01,  1.46052900e+00,
       -3.74986599e+00,  1.06127698e+00, -3.60581940e+00, -2.40192325e-01,
        8.75818583e-01,  7.57110100e+00, -1.02349097e+01,  1.27316248e-01,
       -9.03605739e+00,  7.06608230e+00, -1.73292663e+00,  1.76900495e+01,
        1.19755952e+01,  1.53616035e+01,  1.60365897e+01, -3.76476931e+00,
       -2.84294146e+01, -1.01201254e+01, -1.62388766e+00,  5.40058042e+00,
        1.26469966e+01,  2.61686822e-01, -8.52287851e+00, -4.43729334e+00,
        4.19996861e+01, -

In [372]:
# Mean squared error (MSE): the average of the squared differences
# between the predicted and actual upvotes on the test set
print("Mean squared error: %.2f"
      % np.mean((clf.predict(X_test) - y_test) ** 2))

Mean squared error: 2651.15


In [373]:
# R-squared (coefficient of determination): the proportion of the variance in the
# dependent variable that is predictable from the independent variables.
# 1 is a perfect prediction; a negative score means the model does worse than
# always predicting the mean.
print('Variance score: %.2f' % clf.score(X_test, y_test))

Variance score: -0.43
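
To see how a score can go negative, here is a small sketch of the standard R-squared formula, 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the toy values are illustrative assumptions:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [1, 2, 3, 4]
mean_baseline = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
reversed_guess = r_squared(y_true, [4, 3, 2, 1])         # worse than the mean
print(mean_baseline, reversed_guess)
```

Always predicting the mean scores exactly 0, so any negative value means the predictions have a larger squared error than that baseline.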


In [382]:
# The mean and standard deviation of upvotes give a sense of their dispersion,
# to compare against the RMSE (the square root of the MSE above)

print(np.mean(submissions.upvotes))
print(np.std(submissions.upvotes))

10.095357142857143
39.49177825665033


# Conclusion

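The RMSE of the predicted upvotes (about 51.5, the square root of the MSE of 2651.15) exceeds the standard deviation of upvotes in the data set (about 39.5), and the R-squared is negative, so the model predicts worse than simply guessing the mean. Several steps could improve it: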

- Use the entire data set. With more data, the model is more likely to see the same tokens in both the training and test sets, which should help it make better predictions.
- Add "meta" data features like headline length and average word length.
- Use a random forest, or another more powerful machine learning technique.
- Explore different thresholds for removing extraneous columns.
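
As a sketch of the "meta" features suggestion, headline length and average word length can be derived directly from the headline column. The toy DataFrame and the column names `headline_length`, `word_count`, and `avg_word_length` are illustrative assumptions, and the code assumes every headline has at least one word:

```python
import pandas as pd

# Toy stand-in for the submissions DataFrame.
toy = pd.DataFrame(
    {"headline": ["Show HN: a new Python tool",
                  "Immigration overhaul passes in Senate"]}
)

# Split each headline on whitespace once and reuse the result.
words = toy["headline"].str.split()

toy["headline_length"] = toy["headline"].str.len()  # characters, spaces included
toy["word_count"] = words.apply(len)
toy["avg_word_length"] = words.apply(lambda ws: sum(len(w) for w in ws) / len(ws))
print(toy)
```

These columns could then be joined to the token count matrix before fitting, taking care that both share the same row order.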
