# Sentiment Analysis on News Articles

Now that the data has been cleaned, we can proceed with trying to train a classifier. The general plan is to use word embeddings from a Word2Vec model to represent the text for each article, then use a many-to-one recurrent neural network to predict the sentiment of these articles.

In [13]:
import time

import pandas as pd

import gensim.models

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

## Processing the Headlines and Descriptions

Right now, the headlines and descriptions are two separate sequences of words that we can't directly analyze. To make things easier, let's concatenate the headlines and descriptions into one block of text so we can treat them as one sequence for the RNN to analyze.

In [2]:
articles = pd.read_csv('./data/cleaned_articles.csv')

In [3]:
def add_combined_column(articles):
    '''For each article in the dataframe, concatenate the headline and description into a combined entry,
    ignoring NaN entries.'''
    for i, row in articles.iterrows():
        headline, description = row['headline'], row['description']
        combined = ""
        
        if type(headline) == str:
            combined += headline
        elif type(description) == str:
            combined += description
            
        articles.loc[i, 'combined'] = combined

In [4]:
add_combined_column(articles)
articles.head()

Unnamed: 0,headline,url,description,category,source,combined
0,"A Sad Bulldog, A Happy Prince And More Things ...",https://www.huffpost.com/entry/coronavirus-dis...,"A sad bulldog and a happy, paint-covered princ...",good,huffpost,"A Sad Bulldog, A Happy Prince And More Things ..."
1,John Krasinski Shocks 9-Year-Old 'Hamilton' Fa...,https://www.huffpost.com/entry/john-krasinski-...,"""The Office"" star struck gold again in his You...",good,huffpost,John Krasinski Shocks 9-Year-Old 'Hamilton' Fa...
2,I Was Struggling As A Single Mom. Then A Stran...,https://www.huffpost.com/entry/struggling-sing...,"""With this gift, I was finally able to get out...",good,huffpost,I Was Struggling As A Single Mom. Then A Stran...
3,Pink's Advice To Fans: 'Change The F**king Wor...,https://www.huffpost.com/entry/pink-peoples-ch...,“I care about decency and humanity and kindnes...,good,huffpost,Pink's Advice To Fans: 'Change The F**king Wor...
4,10 Books For Parents Who Want To Raise Kind Kids,https://www.huffpost.com/entry/parenting-books...,These parenting books emphasize emotional inte...,good,huffpost,10 Books For Parents Who Want To Raise Kind Kids


Nice! Before we can represent this new 'combined' entry as a matrix, we have to remove punctuation / special characters and numbers, since they don't convey much meaningful information about sentiment and could bog down the model.

In [5]:
import re

def process_text(text):
    text = text.lower()
    # remove any non alphanumeric characters,
    # escaping ' and * because of contractions and swear censorship (e.g. f**k)
    text = re.sub(r"[^a-zA-Z'\*]", " ", text)
    # tokenize the combined text by splitting on spaces and removing whitespace
    text = text.split()
    word_list = []
    for i, word in enumerate(text):
        # remove any censored swears which would likely just waste space in the w2v model
        if '*' in word:
            continue
        # remove any artifact of the regex checking that leaves a single apostrophe as a 'word'
        elif word == "'":
            continue
        # remove single-quoted text by checking for first character apostrophes then last character apostrophes
        word = re.sub(r"^'", "", word)
        word = re.sub(r"'$", "", word)
        
        # removing 's possessives
        word = re.sub(r"'s$", "", word)
        word_list.append(word)
        
    return word_list

In [6]:
articles['processed'] = articles['combined'].apply(lambda text: process_text(text))
articles.head()

Unnamed: 0,headline,url,description,category,source,combined,processed
0,"A Sad Bulldog, A Happy Prince And More Things ...",https://www.huffpost.com/entry/coronavirus-dis...,"A sad bulldog and a happy, paint-covered princ...",good,huffpost,"A Sad Bulldog, A Happy Prince And More Things ...","[a, sad, bulldog, a, happy, prince, and, more,..."
1,John Krasinski Shocks 9-Year-Old 'Hamilton' Fa...,https://www.huffpost.com/entry/john-krasinski-...,"""The Office"" star struck gold again in his You...",good,huffpost,John Krasinski Shocks 9-Year-Old 'Hamilton' Fa...,"[john, krasinski, shocks, year, old, hamilton,..."
2,I Was Struggling As A Single Mom. Then A Stran...,https://www.huffpost.com/entry/struggling-sing...,"""With this gift, I was finally able to get out...",good,huffpost,I Was Struggling As A Single Mom. Then A Stran...,"[i, was, struggling, as, a, single, mom, then,..."
3,Pink's Advice To Fans: 'Change The F**king Wor...,https://www.huffpost.com/entry/pink-peoples-ch...,“I care about decency and humanity and kindnes...,good,huffpost,Pink's Advice To Fans: 'Change The F**king Wor...,"[pink, advice, to, fans, change, the, world]"
4,10 Books For Parents Who Want To Raise Kind Kids,https://www.huffpost.com/entry/parenting-books...,These parenting books emphasize emotional inte...,good,huffpost,10 Books For Parents Who Want To Raise Kind Kids,"[books, for, parents, who, want, to, raise, ki..."


## Creating and Training the Word2Vec Model

Let's initialize a Word2Vec Model with 300-dimensional word vectors and a subsampling parameter of 1e-5, since [Mikolov et al](https://arxiv.org/pdf/1310.4546.pdf) achieved good results with these settings.

In [16]:
w2v_model = gensim.models.Word2Vec(size = 300, sample=1e-5)

I pulled a [pre-existing dataset](https://www.kaggle.com/therohk/million-headlines) of news headlines that we can add to the dataset I assembled to give our model as robust and wide a vocabulary as possible.

In [10]:
abc_headlines = pd.read_csv('./data/abc_articles.csv')
abc_headlines.head()

Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers


First, we'll process the headlines using our previous `process_text()` function.

In [11]:
abc_headlines['processed'] = abc_headlines['headline_text'].apply(lambda text: process_text(text))
abc_headlines.head()

Unnamed: 0,publish_date,headline_text,processed
0,20030219,aba decides against community broadcasting lic...,"[aba, decides, against, community, broadcastin..."
1,20030219,act fire witnesses must be aware of defamation,"[act, fire, witnesses, must, be, aware, of, de..."
2,20030219,a g calls for infrastructure protection summit,"[a, g, calls, for, infrastructure, protection,..."
3,20030219,air nz staff in aust strike for pay rise,"[air, nz, staff, in, aust, strike, for, pay, r..."
4,20030219,air nz strike to affect australian travellers,"[air, nz, strike, to, affect, australian, trav..."


The Gensim Word2Vec model can't input `pd.DataFrame` or `pd.Series` objects, so let's convert the 'processed' `pd.Series` from each dataframe into a list of entries.

In [12]:
abc_rows = abc_headlines['processed'].to_list()

article_rows = articles['processed'].to_list()

combined_rows = abc_rows + article_rows

Now, we can build the vocabulary of the model.

In [17]:
start = time.time()

w2v_model.build_vocab(combined_rows)

current = time.time()
elapsed = (current - start) / 60
print('Time to build vocab: {} mins'.format(round(elapsed, 2)))

Time to build vocab: 0.26 mins


Then, train it on the *combined* rows because the more data we have here, the better the model will be able to discern the relationships between the words. The sentiment classification will be done by the RNN later, so for right now we just want the best embeddings we can get.

In [244]:
start = time.time()

w2v_model.train(
    combined_rows, 
    total_examples = w2v_model.corpus_count, 
    epochs = 30)

current = time.time()
elapsed = (current - start) / 60
print('Time to train model: {} mins'.format(round(elapsed, 2)))

w2v_model.save('./models/w2v')

Time to train model: 4.48 mins


Run the following cell if you'd like to play around with the w2v model without having to train it!

In [19]:
w2v_model = gensim.models.Word2Vec.load('./models/w2v')

Time to sanity check the model.

In [20]:
w2v_model.wv.most_similar('covid')

[('pandemic', 0.6704552173614502),
 ('vaccinated', 0.5635499358177185),
 ('streaming', 0.554112434387207),
 ('updates', 0.5361689329147339),
 ('antibodies', 0.5320745706558228),
 ('ebola', 0.5311349630355835),
 ('coronavirus', 0.5284198522567749),
 ('neighbors', 0.5263728499412537),
 ('vaccinate', 0.5231813192367554),
 ('swine', 0.5097109079360962)]

Looks like it did alright!

In [193]:
def torch_embed(word_list, word_vectors):
    word_tensors = []
    num_words = len(word_list)
    for word in word_list:
        try:
            word_vector = word_vectors.get_vector(word)
            word_torch_tensor = torch.from_numpy(word_vector)
            word_tensors.append(word_torch_tensor)
        # if the word doesn't haven an embedding, a KeyError is thrown
        except KeyError as e:
            print(e, 'Ignoring word.')
            num_words -= 1
            pass
    # convert the list of tensors into one long tensor with shape (len(word_list), 300)
    article_tensor = torch.stack(word_tensors)
    article_tensor = article_tensor.view(article_tensor.shape[0], 300, 1)
    return article_tensor

In [46]:
x = articles['description'].isna()

In [47]:
for i, item in enumerate(x):
    if item == True:
        print(i)
        break

862


In [None]:
class SentimentClassifier(nn.Module):
    def __init__(self):
        super(SentimentClassifier, self).__init__()
        
        self.embeddings = 