# Introduction
The purpose of this notebook is to present an example of topic modeling using Latent Dirichlet Allocation (LDA). The dataset is a list of over one million news headlines published over a period of more than 15 years. The headlines were sourced from the Australian Broadcasting Corporation (ABC) and can be downloaded from [Kaggle](https://www.kaggle.com/therohk/million-headlines/data).

# Steps
* [The data](#The-data)
* [Data pre-processing](#Data-pre-processing)
* [Generate Bag of Words](#Generate-Bag-of-Words)
* [Generate TF-IDF](#Generate-TF-IDF)
* [Generate LDA models](#Generate-LDA-models)
    * [Using Bag of Words](#Using-Bag-of-Words)
    * [Using TF-IDF](#Using-TF-IDF)

# The data

In [54]:
import pandas as pd
pd.options.display.max_rows = 10

# on_bad_lines='skip' replaces the deprecated error_bad_lines=False (pandas >= 1.3)
documents = pd.read_csv('data/abcnews-date-text.csv', on_bad_lines='skip')
# Viewing
display(documents)

Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers
...,...,...
1186013,20191231,vision of flames approaching corryong in victoria
1186014,20191231,wa police and government backflip on drug amne...
1186015,20191231,we have fears for their safety: victorian premier
1186016,20191231,when do the 20s start


# Data pre-processing
The data is pre-processed in the following steps:
* Split the text into sentences and the sentences into words (tokenization);
* Lowercase the words and remove punctuation;
* Remove words that have 3 or fewer characters;
* Remove all stopwords;
* Lemmatize words (words in third person are changed to first person and verbs in past and future tenses are changed to present);
* Stem words (words are reduced to their root form).
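
As a rough illustration of the first four steps, here is a minimal sketch that uses only the standard library. The names `TOY_STOPWORDS` and `tokenize_and_filter` are hypothetical, the stopword set is a tiny illustrative subset rather than gensim's list, and the length cutoff mirrors the `len(token) > 3` check in the notebook's `preprocess` function. Lemmatization and stemming are left out on purpose.

```python
import re

# Illustrative subset only (assumption); the notebook relies on gensim's much larger STOPWORDS set.
TOY_STOPWORDS = frozenset({"in", "for", "to", "of", "be", "must", "against"})

def tokenize_and_filter(text, stopwords=TOY_STOPWORDS, min_len=4):
    """Lowercase, split on non-letters, then drop short tokens and stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in stopwords]

print(tokenize_and_filter("air nz staff in aust strike for pay rise"))
# ['staff', 'aust', 'strike', 'rise']
```

The short tokens `air`, `nz`, `in`, `for` and `pay` are dropped by the length filter alone, which is why the stopword list matters mostly for longer function words.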

In [56]:
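import numpy as np
np.random.seed(59)

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

import nltk
from nltk.stem import SnowballStemmer, WordNetLemmatizer
# nltk.download('wordnet')  # run once if the WordNet data is not installed

# Create the stemmer and lemmatizer once instead of on every call
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    """Tokenize a headline, drop stopwords and short tokens, then lemmatize and stem."""
    processed_text = []
    for token in simple_preprocess(text):
        # Keep only non-stopword tokens longer than 3 characters
        if token not in STOPWORDS and len(token) > 3:
            # Lemmatize as a verb first, then reduce the lemma to its stem
            processed_text.append(stemmer.stem(lemmatizer.lemmatize(token, pos='v')))
    return processed_text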

processed_docs = documents['headline_text'].map(preprocess)
# Viewing
display(processed_docs)

0                   [decid, communiti, broadcast, licenc]
1                                      [wit, awar, defam]
2                  [call, infrastructur, protect, summit]
3                             [staff, aust, strike, rise]
4                    [strike, affect, australian, travel]
                                ...                      
1186013     [vision, flame, approach, corryong, victoria]
1186014     [polic, govern, backflip, drug, amnesti, bin]
1186015                [fear, safeti, victorian, premier]
1186016                                           [start]
1186017    [yarravill, shoot, woman, dead, critic, injur]
Name: headline_text, Length: 1186018, dtype: object

# Generate Bag of Words

In [58]:
dictionary = gensim.corpora.Dictionary(processed_docs)
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
# Viewing
doc_sample = bow_corpus[4310]
for word_id, count in doc_sample:
    print('Word {} "{}" appears {} time(s)'.format(word_id, dictionary[word_id], count))

Word 162 "govt" appears 1 time(s)
Word 240 "group" appears 1 time(s)
Word 292 "vote" appears 1 time(s)
Word 589 "local" appears 1 time(s)
Word 838 "want" appears 1 time(s)
Word 3567 "compulsori" appears 1 time(s)
Word 3568 "ratepay" appears 1 time(s)
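
To make the bag-of-words representation concrete, the sketch below rebuilds the idea with the standard library on a toy corpus. It is a simplified stand-in, not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, ids are assigned alphabetically for reproducibility (gensim assigns them differently), and only the document-frequency bounds corresponding to `no_below` and `no_above` are modeled, not `keep_n`.

```python
from collections import Counter

def build_vocab(docs, no_below=1, no_above=1.0):
    """Map each kept token to an integer id, filtering by document frequency."""
    # Count in how many documents each token occurs (not how many times overall)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, ignoring tokens outside the vocabulary."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

toy_docs = [
    ["strike", "affect", "travel"],
    ["staff", "strike", "rise"],
    ["flood", "travel", "flood"],
    ["strike", "flood"],
]
vocab = build_vocab(toy_docs, no_below=2, no_above=0.75)
print(vocab)                        # {'flood': 0, 'strike': 1, 'travel': 2}
print(to_bow(toy_docs[2], vocab))   # [(0, 2), (2, 1)]
```

Tokens seen in only one toy document (`affect`, `staff`, `rise`) are filtered out, in the same spirit as `no_below=15` removing rare stems from the headline dictionary.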


# Generate TF-IDF

In [63]:
from gensim import models

tfidf = models.TfidfModel(bow_corpus)
tfidf_corpus = tfidf[bow_corpus]
# Viewing
doc_sample = tfidf_corpus[4310]
for word_id, weight in doc_sample:
    print('Word {} "{}" has weight {}'.format(word_id, dictionary[word_id], weight))

Word 162 "govt" has weight 0.25617525269671065
Word 240 "group" has weight 0.3011111395538523
Word 292 "vote" has weight 0.33416888830557095
Word 589 "local" has weight 0.33377677352466983
Word 838 "want" has weight 0.3121925622107832
Word 3567 "compulsori" has weight 0.5158075532653446
Word 3568 "ratepay" has weight 0.5070590825348879
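
The weights above can be read as follows, assuming the default `TfidfModel` settings: each term count is multiplied by log2(N / df), where N is the number of documents and df the number of documents containing the term, and the resulting vector is scaled to unit Euclidean length. The sketch below reproduces that scheme with the standard library on a toy corpus. `tfidf_weights` is a hypothetical helper, not gensim code, and unlike gensim it keeps zero-weight terms.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return, per document, a {token: weight} dict scaled to unit Euclidean length."""
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    weighted = []
    for doc in docs:
        # Raw weight: term frequency times log2 inverse document frequency
        raw = {t: tf * math.log2(n_docs / doc_freq[t]) for t, tf in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: w / norm for t, w in raw.items()} if norm else {})
    return weighted

toy_docs = [
    ["govt", "vote"],
    ["govt", "group"],
    ["govt", "vote", "ratepay"],
    ["local", "group"],
]
weights = tfidf_weights(toy_docs)[2]
for token, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(token, round(weight, 4))
```

In the toy document the rarest term (`ratepay`) receives the largest weight and the most common one (`govt`) the smallest, matching the pattern in the notebook output, where `compulsori` and `ratepay` outweigh `govt`.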


# Generate LDA models
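
LDA treats each document as a mixture of topics and each topic as a distribution over words. As a hedged illustration of that generative story (not of how gensim fits the model), the sketch below draws one toy document using only the standard library. The two topics, their word probabilities, and the helper names `sample_dirichlet` and `generate_document` are all hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one document: pick a topic mixture, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    words = []
    for _ in range(length):
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return theta, words

rng = random.Random(59)
vocab = ["court", "charg", "market", "price"]
topic_word = [
    [0.5, 0.5, 0.0, 0.0],  # hypothetical "crime" topic
    [0.0, 0.0, 0.5, 0.5],  # hypothetical "economy" topic
]
theta, words = generate_document(topic_word, vocab, alpha=[0.5, 0.5], length=6, rng=rng)
print(theta, words)
```

Training runs this story in reverse: given only the observed words, it estimates topic-word distributions like `topic_word` and per-document mixtures like `theta`.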

## Using Bag of Words

In [64]:
lda_model_bow = gensim.models.LdaMulticore(
    bow_corpus,
    num_topics=10,
    id2word=dictionary,
    passes=2,
    workers=2
)
# Viewing
for idx, topic in lda_model_bow.print_topics(-1):
    print('Topic {}: {}\n'.format(idx, topic))

Topic 0: 0.050*"australia" + 0.035*"trump" + 0.015*"test" + 0.014*"tasmania" + 0.013*"final" + 0.012*"royal" + 0.012*"win" + 0.011*"world" + 0.011*"lose" + 0.010*"street"

Topic 1: 0.044*"say" + 0.027*"elect" + 0.014*"state" + 0.013*"labor" + 0.012*"china" + 0.012*"countri" + 0.011*"claim" + 0.011*"minist" + 0.010*"call" + 0.010*"deal"

Topic 2: 0.034*"govern" + 0.032*"queensland" + 0.025*"south" + 0.022*"north" + 0.021*"coast" + 0.019*"water" + 0.014*"west" + 0.014*"commiss" + 0.014*"gold" + 0.012*"flood"

Topic 3: 0.030*"charg" + 0.027*"court" + 0.025*"murder" + 0.023*"attack" + 0.020*"donald" + 0.018*"face" + 0.017*"alleg" + 0.016*"jail" + 0.015*"accus" + 0.014*"woman"

Topic 4: 0.025*"year" + 0.022*"market" + 0.018*"australian" + 0.015*"world" + 0.013*"rise" + 0.013*"high" + 0.013*"bank" + 0.013*"price" + 0.013*"women" + 0.012*"power"

Topic 5: 0.029*"melbourn" + 0.019*"miss" + 0.018*"island" + 0.017*"tasmanian" + 0.015*"guilti" + 0.013*"death" + 0.012*"chines" + 0.012*"john" + 0.0

## Using TF-IDF

In [65]:
lda_model_tfidf = gensim.models.LdaMulticore(
    tfidf_corpus,
    num_topics=10,
    id2word=dictionary,
    passes=2,
    workers=4
)
# Viewing
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic {}: {}\n'.format(idx, topic))

Topic 0: 0.012*"rural" + 0.012*"countri" + 0.009*"govern" + 0.009*"hour" + 0.009*"health" + 0.007*"fund" + 0.006*"nation" + 0.005*"say" + 0.005*"budget" + 0.005*"feder"

Topic 1: 0.018*"crash" + 0.010*"die" + 0.010*"polic" + 0.010*"miss" + 0.009*"search" + 0.009*"road" + 0.009*"death" + 0.009*"friday" + 0.009*"monday" + 0.008*"driver"

Topic 2: 0.007*"fiji" + 0.006*"presid" + 0.006*"indonesia" + 0.006*"august" + 0.006*"syria" + 0.006*"marriag" + 0.006*"brexit" + 0.006*"elect" + 0.005*"terror" + 0.005*"univers"

Topic 3: 0.012*"bushfir" + 0.010*"weather" + 0.009*"hill" + 0.006*"explain" + 0.006*"social" + 0.005*"footag" + 0.005*"human" + 0.005*"outback" + 0.005*"onlin" + 0.005*"firefight"

Topic 4: 0.013*"coast" + 0.010*"turnbul" + 0.009*"gold" + 0.009*"thursday" + 0.008*"plead" + 0.007*"histori" + 0.006*"peter" + 0.006*"mother" + 0.006*"malcolm" + 0.005*"storm"

Topic 5: 0.010*"scott" + 0.009*"pacif" + 0.009*"grandstand" + 0.008*"march" + 0.007*"queensland" + 0.006*"leagu" + 0.005*"spr