Library imports:

In [1]:
# Standard libs
import random

# DS libs
import pandas as pd

# Additional libraries
from tqdm import tqdm

In [2]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [3]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA  # for compactness

In [4]:
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/alexanderdesouza/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [5]:
# General notebook configs:
pd.set_option('display.max_columns', 999)

The data used for this experiment comes from news headlines posted by the Australian Broadcasting Corporation (ABC News). The dataset consists of roughly 1M such headlines, which have already been substantially preprocessed.

In [6]:
raw_headlines = pd.read_csv('../data/abcnews_million_headlines.csv')

In [7]:
raw_headlines.sample(3)

Unnamed: 0,date,headline
811562,20130815,mother of dead twins granted release on parole
777508,20130410,abbott accuses government of 'surrender' on bo...
991831,20151102,15 years of continuous human space time


In [8]:
raw_headlines.shape

(1048575, 2)

**ToDo** _Introduce sorting by date here to ensure the dataframe is ordered._
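One possible way to address this ToDo is sketched below on a toy frame. It assumes the `date` column holds integers in `YYYYMMDD` form, as the sample output suggests; the frame `toy` and its rows are made up for illustration.

```python
import pandas as pd

# Toy stand-in for raw_headlines; the rows are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    'date': [20130815, 20030219, 20151102],
    'headline': ['headline b', 'headline a', 'headline c'],
})

# Parse the YYYYMMDD integers into proper timestamps.
toy['date'] = pd.to_datetime(toy['date'].astype(str), format='%Y%m%d')

# A stable sort keeps same-day headlines in their original relative order;
# resetting the index turns the row number into a chronological enumeration.
toy_sorted = toy.sort_values('date', kind='mergesort').reset_index(drop=True)
print(toy_sorted)
```

Resetting the index matters here because the row number is later used as the unique document tag.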

Some headlines appear in duplicate; or do they...

In [11]:
raw_headlines[raw_headlines.duplicated()]

Unnamed: 0,date,headline


In [12]:
raw_headlines[raw_headlines.duplicated(subset=['headline'])].sample(10)

Unnamed: 0,date,headline
288438,20070227,closer am1
381259,20080504,nrl interview jason taylor
714519,20120808,belinda varischetti interviews tim macnamara
653474,20111117,interview george bailey
446550,20090228,man charged over fatal stabbing
641190,20110925,abc entertainment
780577,20130420,interview anthony faingaa
801009,20130706,interview john cartwright
829825,20131026,interview alastair cook
929308,20150127,national rural news


A random sample of the duplicated headlines suggests that they stem from recurring news items (the weather, for example), for which the date provides the distinguishing context. For the present analysis, however, the date is implicitly captured by the unique enumeration assigned to each article, so the `date` column can be dropped.
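The difference between the two duplicate checks can be illustrated on a toy frame. This is only a sketch: the dates are made up, and the headline texts are borrowed from the sample above purely as examples.

```python
import pandas as pd

# Toy frame: the same headline text recurs on two different (made-up) dates.
toy = pd.DataFrame({
    'date': [20070227, 20070228, 20070228],
    'headline': ['national rural news',
                 'national rural news',
                 'man charged over fatal stabbing'],
})

# No row is a duplicate once the date is taken into account...
full_row_dupes = toy.duplicated()
# ...but the second occurrence is flagged when only the headline text is compared.
text_dupes = toy.duplicated(subset=['headline'])

print(full_row_dupes.tolist())
print(text_dupes.tolist())
```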

In [13]:
headlines = raw_headlines.drop('date', axis=1).copy(deep=True)

Next, some simple processing reduces the input text to a simplified form, which lowers the computational cost of matching.

In [14]:
# Requires the NLTK stop word corpus: nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')

With an eye toward a production-like environment, the following is constructed...

In [15]:
def remove_stopwords(text, stopwords=()):
    """
    Given an input 'text', stop words are removed, reducing the complexity of the input text. A collection of stop words
    must be supplied, otherwise the words of the input text are returned unchanged.
        :param: text: Input string.
        :param: stopwords: Collection of words to remove (a set gives the fastest lookups).
        :return: Reduced complexity string.
    """
    return ' '.join([word for word in text.split() if word not in stopwords])
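A short usage sketch of the same idea, restated so that it runs on its own. The stop word set `toy_stopwords` is a tiny hand-picked assumption, not the NLTK list; a set is used because membership tests on it are constant-time on average.

```python
def remove_stopwords(text, stopwords=frozenset()):
    """Return `text` with every whitespace-separated token found in `stopwords` removed."""
    return ' '.join(word for word in text.split() if word not in stopwords)

# A tiny hand-picked stop word set, for illustration only (not the NLTK list).
toy_stopwords = frozenset({'on', 'of', 'to', 'up', 'the'})

reduced = remove_stopwords('meeting focuses on uni regional campus shake up', toy_stopwords)
unchanged = remove_stopwords('herron sigma merger complete')  # no stop words supplied
print(reduced)
print(unchanged)
```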

In [16]:
# Reuse the helper defined above; converting the stop word list to a set speeds up the membership tests
headlines['stopped_headline'] = headlines['headline'].apply(remove_stopwords, stopwords=set(stopwords))

In [17]:
headlines.sample(3)

Unnamed: 0,headline,stopped_headline
122741,meeting focuses on uni regional campus shake up,meeting focuses uni regional campus shake
14783,herron sigma merger complete,herron sigma merger complete
1042307,mont blanc: rescue efforts suspended for stranded,mont blanc: rescue efforts suspended stranded


Employ the sentiment intensity analyzer from NLTK, `SIA`, a robust but rule-based, lexicon-driven heuristic.

**ToDo** _Ideally, sentiment tagging would be learned from the text itself rather than applied by rote._

In [18]:
# Load the SIA scored headlines here and skip the next 3 cells (the scoring below takes approximately 4 mins)
scored_headlines = pd.read_pickle('../data/abcnews_million_headlines_sia_scored.pkl')

In [19]:
sia = SIA()  # initialize the NLTK sentiment intensity analyzer

In [20]:
# And score each of the headlines in the dataset
scored_headlines = []

for index, row in tqdm(headlines.iterrows(), desc='SIA scoring of headlines'):
    sia_scores = sia.polarity_scores(row['stopped_headline'])
    sia_scores['headline'] = row['headline']
    sia_scores['stopped_headline'] = row['stopped_headline']
    scored_headlines.append(sia_scores)
    
scored_headlines = pd.DataFrame(scored_headlines)

SIA scoring of headlines: 1048575it [03:27, 5056.50it/s]


In [21]:
# Sort columns as desired, for readability
scored_headlines = scored_headlines[['headline', 'stopped_headline', 'neg', 'neu', 'pos', 'compound']]

In [22]:
scored_headlines.sample(3)

Unnamed: 0,headline,stopped_headline,neg,neu,pos,compound
541257,residents urge council to save sugarworld,residents urge council save sugarworld,0.0,0.556,0.444,0.4939
531064,grant money allocation decided soon,grant money allocation decided soon,0.0,0.615,0.385,0.3612
494119,smurray38 said it,smurray38 said,0.0,1.0,0.0,0.0


In [27]:
# Save the SIA scored headlines as it's quite a costly operation to generate them
scored_headlines.to_pickle('../data/abcnews_million_headlines_sia_scored-22okt2018.pkl')
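The load-or-recompute pattern used around the scoring step could be wrapped in a small helper. The sketch below uses only the standard library; `load_or_compute` and `expensive_scoring` are hypothetical names, and in this notebook `pd.read_pickle` and `DataFrame.to_pickle` would play the role of `pickle.load` and `pickle.dump`.

```python
import pickle
import tempfile
from pathlib import Path

def load_or_compute(cache_path, compute):
    """Return the pickled object at `cache_path`, computing and caching it on first use."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open('rb') as fh:
            return pickle.load(fh)
    result = compute()
    with cache_path.open('wb') as fh:
        pickle.dump(result, fh)
    return result

# Demonstration with a cheap stand-in for the expensive scoring step.
calls = []

def expensive_scoring():
    calls.append(1)  # record that the costly path was taken
    return [{'headline': 'example', 'compound': 0.0}]

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / 'scored.pkl'
    first = load_or_compute(cache_file, expensive_scoring)   # computes and writes the cache
    second = load_or_compute(cache_file, expensive_scoring)  # served from the cache

print(len(calls))
```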

Sentiment can be coarsely binned to create set(s) of document tags.

In [28]:
def binary_sentiment_tag(score):
    """
    A simple method returning 0 if the input score is negative and 1 otherwise (a neutral score of exactly 0.0 is tagged 1).
        :param: score: A value between [-1.0, +1.0]
        :return: Integer value of 0 or 1
    """
    return 0 if score < 0.0 else 1
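Because the binary tag places neutral headlines (compound score 0.0) in the positive bin, a three-way binning may be worth considering. The sketch below is a variant, not the method used in this notebook; `ternary_sentiment_tag` is a hypothetical name, and the default threshold of 0.05 follows the cut-off commonly suggested for VADER compound scores, treated here as an assumption to tune.

```python
def ternary_sentiment_tag(score, threshold=0.05):
    """Map a compound score in [-1.0, +1.0] to 'neg', 'neu' or 'pos'."""
    if score <= -threshold:
        return 'neg'
    if score >= threshold:
        return 'pos'
    return 'neu'

# Example scores: one clearly negative, one neutral, one clearly positive.
tags = [ternary_sentiment_tag(s) for s in (-0.5994, 0.0, 0.4939)]
print(tags)
```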

**Tagging Strategy** <br/>
Two tags are introduced:
1. The first is a single unique integer that could represent, for example, the unique date and time at which the article was published (though timestamps are unavailable).
2. The second tag represents the output of the binary sentiment tag defined above.
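The two tags can be assembled per headline roughly as follows. This sketch departs from the notebook in one respect, which should be read as a suggestion rather than the method used here: gensim uses plain integer tags as direct indexes into its document-vector array, so an integer sentiment tag of 0 or 1 would share a slot with the headlines at rows 0 and 1. String labels avoid that overlap. The helper `make_tags` and the label names are hypothetical.

```python
def make_tags(index, compound):
    """Build the two tags for one headline: its row index and a sentiment label."""
    sentiment = 'sentiment_neg' if compound < 0.0 else 'sentiment_pos'
    return [index, sentiment]

# Illustrative inputs: (row index, compound score).
print(make_tags(0, 0.4939))
print(make_tags(1, -0.5994))
```

Downstream lookups that assume every returned tag is a row index would then need to skip the string labels.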

In [29]:
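# Note: gensim uses plain integer tags as direct indexes into its document-vector array,
# so the sentiment tags 0 and 1 share a slot with the headlines at rows 0 and 1.
tagged_docs = [
    TaggedDocument(words=nltk.word_tokenize(row['stopped_headline']),
                   tags=[index, binary_sentiment_tag(row['compound'])])
    for index, row in tqdm(scored_headlines.iterrows(), desc='Sentence tagging')
]
random.sample(tagged_docs, 5)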

Sentence tagging: 1048575it [03:29, 5014.28it/s]


[TaggedDocument(words=['interview', 'ben', 'barba'], tags=[789946, 1]),
 TaggedDocument(words=['belinda', 'varischetti', 'interviews', 'rory', 'graham'], tags=[743795, 1]),
 TaggedDocument(words=['griffith', 'club', 'discuss', 'new', 'airspace', 'system'], tags=[55716, 1]),
 TaggedDocument(words=['growing', 'concern', 'potential', "'needle", 'stick', "'", 'inju'], tags=[819436, 1]),
 TaggedDocument(words=['man', 'dies', 'gyrocopter', 'crash'], tags=[535447, 0])]

In [None]:
# Define a set of hyperparameters that can be optimized later
max_epochs = 100  # number of training epochs
alpha = 0.025     # initial learning rate, selected 

In [None]:
# Alternative configuration kept for reference (not used below):
# Doc2Vec(min_count=1, window=10, vector_size=100, sample=1e-4, negative=5, workers=8)

# `vector_size` and `epochs` replace the older `size` and `iter` names
model = Doc2Vec(vector_size=10,     # dimensionality of the learned document vectors
                alpha=alpha,        # initial learning rate
                min_alpha=0.00025,  # final learning rate
                min_count=1,        # minimum term frequency
                dm=1)               # 1 = distributed memory (PV-DM), 0 = distributed bag of words (PV-DBOW)

model.build_vocab(tagged_docs)

# A single call to train() suffices: gensim decays the learning rate linearly
# from `alpha` to `min_alpha` over the requested number of epochs.
model.train(tagged_docs,
            total_examples=model.corpus_count,
            epochs=max_epochs)
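The effect of a linearly decaying learning rate can be made concrete with a few lines of plain Python. This is only a per-epoch approximation for intuition; gensim adjusts the rate at a finer granularity during training. The function name `linear_alpha_schedule` is hypothetical.

```python
def linear_alpha_schedule(alpha, min_alpha, epochs):
    """Learning rate at the start of each epoch, decaying linearly from alpha towards min_alpha."""
    step = (alpha - min_alpha) / epochs
    return [alpha - step * epoch for epoch in range(epochs)]

# Same start and end rates as the model configuration above.
schedule = linear_alpha_schedule(alpha=0.025, min_alpha=0.00025, epochs=100)
print(schedule[0], schedule[-1])
```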

In [None]:
# Save the model to file for reuse later
model.save('./doc2vec.model')

In [None]:
# Load the model from file to reuse
model = Doc2Vec.load('./doc2vec.model')

The example(s) below are illustrative only and will need to be modified slightly if loading and running this notebook from scratch.

A headline is sampled at random, and a related query is then composed by hand to assess the quality of the model.

In [None]:
scored_headlines.sample(1)

In [None]:
tokenized_query_statement = "al qaeda bombing suspect".split()  # note: stop words are pre-filtered in this composition
query_vector = model.infer_vector(tokenized_query_statement)
similarity = model.docvecs.most_similar([query_vector])
similarity
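The ranking returned above is a list of `(tag, similarity)` pairs ordered by cosine similarity. A minimal standard-library sketch of that idea follows; the two-dimensional vectors in `toy_vectors` are made up, and `rank_by_similarity` is a hypothetical helper rather than gensim's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query, doc_vectors, topn=2):
    """Return the `topn` (tag, similarity) pairs most similar to `query`, best first."""
    scored = [(tag, cosine_similarity(query, vec)) for tag, vec in doc_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up document vectors keyed by document tag.
toy_vectors = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
ranking = rank_by_similarity([2.0, 0.1], toy_vectors)
print(ranking)
```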

In [None]:
matched_indexes, scores = zip(*similarity)
results = scored_headlines.iloc[list(matched_indexes)].copy()  # copy to avoid a SettingWithCopyWarning
results['similarity'] = list(scores)

In [None]:
results

In [None]:
matched_headline_indexes = [int(i) for i, s in similarity]
raw_headlines.iloc[matched_headline_indexes]  # select rows by position; plain [] would look up columns

In [None]:
import numpy as np
np.argmax([-0.1, -0.2, -0.3])