<a href="https://colab.research.google.com/github/andrewmsilva/DataScienceStudies/blob/master/Topic%20Modeling%20with%20headlines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction
The purpose of this notebook is to present an example of topic modeling using Latent Dirichlet Allocation (LDA). The dataset is a list of over one million news headlines published between 2003 and 2019. The headlines were sourced from ABC (Australian Broadcasting Corporation) and can be downloaded from [Kaggle](https://www.kaggle.com/therohk/million-headlines/data).

# Steps
* [Data loading](#Data-loading)
* [Data pre-processing](#Data-pre-processing)
* [Features extraction](#Features-extraction)
  * [Generate Bag of Words](#Generate-Bag-of-Words)
  * [Generate TF-IDF](#Generate-TF-IDF)
* [Models training](#Models-training)
  * [With Bag of Words](#With-Bag-of-Words)
  * [With TF-IDF](#With-TF-IDF)

# Data loading

In [5]:
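from google.colab import files

# Upload kaggle.json (the Kaggle API token) from the local machine
files.upload()

# Install the Kaggle CLI and put the token where the CLI expects it
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Download the dataset archive into the working directory
!kaggle datasets download -d therohk/million-headlines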

Saving kaggle.json to kaggle.json
Downloading million-headlines.zip to /content
100% 20.2M/20.2M [00:00<00:00, 99.2MB/s]


In [0]:
import pandas as pd

headlines_df = pd.read_csv('million-headlines.zip', compression='zip', header=0, sep=',', quotechar='"')

In [7]:
# Displaying
headlines_df

| | publish_date | headline_text |
|---|---|---|
| 0 | 20030219 | aba decides against community broadcasting lic... |
| 1 | 20030219 | act fire witnesses must be aware of defamation |
| 2 | 20030219 | a g calls for infrastructure protection summit |
| 3 | 20030219 | air nz staff in aust strike for pay rise |
| 4 | 20030219 | air nz strike to affect australian travellers |
| ... | ... | ... |
| 1186013 | 20191231 | vision of flames approaching corryong in victoria |
| 1186014 | 20191231 | wa police and government backflip on drug amne... |
| 1186015 | 20191231 | we have fears for their safety: victorian premier |
| 1186016 | 20191231 | when do the 20s start |


# Data pre-processing

In [0]:
import numpy as np
np.random.seed(59)

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

import nltk
from nltk.stem import SnowballStemmer, WordNetLemmatizer

# WordNetLemmatizer needs the WordNet corpus; uncomment on the first run
#nltk.download('wordnet')

# Build the stemmer and lemmatizer once instead of on every call
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def preprocess(text):
  """Tokenize a headline, drop stop words and short tokens, then lemmatize and stem."""
  processed_text = []
  for token in simple_preprocess(text):
    if token not in STOPWORDS and len(token) > 3:
      # Lemmatize as a verb first, then reduce the result to its stem
      token = stemmer.stem(lemmatizer.lemmatize(token, pos='v'))
      processed_text.append(token)
  return processed_text

processed = headlines_df.headline_text.map(preprocess)
headlines_df['processed'] = processed

In [11]:
# Displaying
headlines_df

| | publish_date | headline_text | processed |
|---|---|---|---|
| 0 | 20030219 | aba decides against community broadcasting lic... | [decid, communiti, broadcast, licenc] |
| 1 | 20030219 | act fire witnesses must be aware of defamation | [wit, awar, defam] |
| 2 | 20030219 | a g calls for infrastructure protection summit | [call, infrastructur, protect, summit] |
| 3 | 20030219 | air nz staff in aust strike for pay rise | [staff, aust, strike, rise] |
| 4 | 20030219 | air nz strike to affect australian travellers | [strike, affect, australian, travel] |
| ... | ... | ... | ... |
| 1186013 | 20191231 | vision of flames approaching corryong in victoria | [vision, flame, approach, corryong, victoria] |
| 1186014 | 20191231 | wa police and government backflip on drug amne... | [polic, govern, backflip, drug, amnesti, bin] |
| 1186015 | 20191231 | we have fears for their safety: victorian premier | [fear, safeti, victorian, premier] |
| 1186016 | 20191231 | when do the 20s start | [start] |
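
The filtering part of `preprocess` can be illustrated without gensim or NLTK. The sketch below is only an approximation under stated assumptions: `TOY_STOPWORDS` is a small hypothetical subset standing in for gensim's much larger `STOPWORDS`, the regular expression only roughly imitates `simple_preprocess`, and lemmatization and stemming are left out. The helper name `filter_tokens` is also hypothetical.

```python
import re

# Hypothetical stop word subset; gensim's STOPWORDS contains many more entries.
TOY_STOPWORDS = frozenset({"when", "do", "the", "in", "for", "of", "be", "must"})

def filter_tokens(text, min_len=4):
    """Lowercase, split into alphabetic tokens, drop stop words and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # len(t) >= 4 mirrors the `len(token) > 3` condition used in preprocess
    return [t for t in tokens if t not in TOY_STOPWORDS and len(t) >= min_len]

print(filter_tokens("when do the 20s start"))
print(filter_tokens("air nz staff in aust strike for pay rise"))
```

This shows why a headline such as "when do the 20s start" is reduced to a single token: every other word is either a stop word or shorter than four characters.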


# Features extraction

## Generate Bag of Words

In [0]:
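# Map every distinct token to an integer id
dictionary = gensim.corpora.Dictionary(processed)
# Drop tokens found in fewer than 15 headlines or in more than 50% of them; keep at most 100,000 tokens
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
# Represent each headline as a list of (token id, count) pairs
bow_corpus = [dictionary.doc2bow(doc) for doc in processed]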

In [14]:
# Displaying
doc_sample = bow_corpus[4310]
for word_id, count in doc_sample:
  print("Word {} \"{}\" appears {} time(s)".format(word_id, dictionary[word_id], count))

Word 162 "govt" appears 1 time(s)
Word 240 "group" appears 1 time(s)
Word 292 "vote" appears 1 time(s)
Word 589 "local" appears 1 time(s)
Word 838 "want" appears 1 time(s)
Word 3567 "compulsori" appears 1 time(s)
Word 3568 "ratepay" appears 1 time(s)
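
To make the two steps above concrete, here is a minimal pure-Python sketch of document-frequency filtering followed by a bag-of-words conversion. It is an illustration, not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, the toy documents are invented, and the ids assigned here (alphabetical order) will not match the ids gensim assigns.

```python
from collections import Counter

def build_vocab(docs, no_below=2, no_above=0.5):
    """Map each kept token to an integer id, filtering by document frequency."""
    # Count in how many documents each token occurs (not how often overall)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    kept = sorted(
        token for token, df in doc_freq.items()
        if no_below <= df <= no_above * len(docs)
    )
    return {token: idx for idx, token in enumerate(kept)}

def to_bow(doc, vocab):
    """Return sorted (token id, count) pairs for the tokens present in vocab."""
    counts = Counter(token for token in doc if token in vocab)
    return sorted((vocab[token], n) for token, n in counts.items())

# Invented toy corpus of already pre-processed headlines
docs = [
    ["polic", "charg", "murder"],
    ["polic", "investig", "crash"],
    ["govt", "fund", "polic"],
    ["govt", "fund", "hospit"],
]
vocab = build_vocab(docs)
print(vocab)
print(to_bow(["govt", "fund", "fund", "polic"], vocab))
```

In this toy corpus "polic" is dropped for being too common (3 of 4 documents) and the tokens that occur only once are dropped for being too rare, so only "fund" and "govt" survive.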


## Generate TF-IDF

In [0]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)
tfidf_corpus = tfidf[bow_corpus]

In [16]:
# Displaying
doc_sample = tfidf_corpus[4310]
for word_id, weight in doc_sample:
    print("Word {} \"{}\" has weight {}".format(word_id, dictionary[word_id], weight))

Word 162 "govt" has weight 0.25617525269671065
Word 240 "group" has weight 0.3011111395538523
Word 292 "vote" has weight 0.33416888830557095
Word 589 "local" has weight 0.33377677352466983
Word 838 "want" has weight 0.3121925622107832
Word 3567 "compulsori" has weight 0.5158075532653446
Word 3568 "ratepay" has weight 0.5070590825348879
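
The squared weights printed above sum to approximately 1, which is consistent with unit-length normalization. The sketch below reproduces that kind of weighting in pure Python, assuming the scheme is raw term frequency times `log2(N / df)` followed by L2 normalization. That formula is an assumption about `TfidfModel`'s default settings, and `tfidf_weights` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(bow_docs):
    """Return unit-length TF-IDF vectors for documents given as (id, count) lists."""
    n_docs = len(bow_docs)
    # Document frequency: number of documents containing each term id
    doc_freq = Counter(term_id for doc in bow_docs for term_id, _ in doc)
    vectors = []
    for doc in bow_docs:
        raw = [
            (term_id, count * math.log2(n_docs / doc_freq[term_id]))
            for term_id, count in doc
        ]
        # Terms present in every document get weight 0 and are dropped
        raw = [(term_id, w) for term_id, w in raw if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in raw))
        vectors.append([(term_id, w / norm) for term_id, w in raw] if norm else [])
    return vectors

# Invented toy corpus in bag-of-words form
toy_bow = [
    [(0, 1), (1, 1)],
    [(0, 1), (2, 1)],
    [(0, 1), (1, 1), (3, 1)],
    [(4, 2)],
]
for vector in tfidf_weights(toy_bow):
    print([(term_id, round(w, 3)) for term_id, w in vector])
```

As in the notebook output, rarer terms receive larger weights than common ones within the same document.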


# Models training

## With Bag of Words

In [0]:
lda_model_bow = gensim.models.LdaMulticore(
    bow_corpus,
    num_topics=10,
    id2word=dictionary,
    passes=2,
    workers=2
)

In [18]:
# Displaying
for idx, topic in lda_model_bow.print_topics(-1):
    print('Topic {}: {}\n'.format(idx, topic))

Topic 0: 0.026*"govern" + 0.020*"nation" + 0.020*"south" + 0.017*"canberra" + 0.017*"north" + 0.015*"tasmania" + 0.013*"china" + 0.012*"lose" + 0.011*"west" + 0.011*"dead"

Topic 1: 0.020*"news" + 0.016*"water" + 0.015*"hospit" + 0.015*"warn" + 0.013*"power" + 0.013*"plan" + 0.010*"farmer" + 0.010*"scott" + 0.009*"council" + 0.009*"resid"

Topic 2: 0.029*"court" + 0.019*"face" + 0.016*"accus" + 0.015*"peopl" + 0.013*"child" + 0.013*"speak" + 0.013*"sentenc" + 0.013*"jail" + 0.013*"drug" + 0.012*"abus"

Topic 3: 0.018*"elect" + 0.013*"time" + 0.011*"game" + 0.011*"say" + 0.009*"win" + 0.009*"make" + 0.008*"video" + 0.008*"open" + 0.008*"meet" + 0.008*"australian"

Topic 4: 0.033*"trump" + 0.017*"chang" + 0.013*"health" + 0.011*"rural" + 0.011*"say" + 0.011*"countri" + 0.011*"busi" + 0.011*"fund" + 0.011*"indigen" + 0.010*"work"

Topic 5: 0.048*"polic" + 0.029*"sydney" + 0.023*"death" + 0.021*"charg" + 0.019*"donald" + 0.017*"murder" + 0.016*"shoot" + 0.016*"woman" + 0.015*"perth" + 0.01
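
Each topic is printed as a single string, which is awkward to work with programmatically. As a minimal sketch, the hypothetical helper `parse_topic` below turns such a string into `(word, weight)` pairs with a regular expression; it assumes the `weight*"word"` layout shown in the output above and silently skips any trailing term that is cut off.

```python
import re

# Matches terms of the form 0.029*"court"
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic_string):
    """Convert a printed topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]

example = '0.029*"court" + 0.019*"face" + 0.016*"accus"'
print(parse_topic(example))
```

The resulting pairs can then be sorted, filtered, or loaded into a DataFrame for comparison across topics.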

## With TF-IDF

In [0]:
lda_model_tfidf = gensim.models.LdaMulticore(
    tfidf_corpus,
    num_topics=10,
    id2word=dictionary,
    passes=2,
    workers=2
)

In [22]:
# Displaying
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic {}: {}\n'.format(idx, topic))

Topic 0: 0.015*"murder" + 0.014*"charg" + 0.014*"polic" + 0.012*"alleg" + 0.011*"woman" + 0.010*"court" + 0.009*"jail" + 0.009*"sentenc" + 0.008*"morrison" + 0.008*"child"

Topic 1: 0.015*"drum" + 0.015*"countri" + 0.013*"royal" + 0.011*"commiss" + 0.010*"peopl" + 0.010*"hour" + 0.007*"weather" + 0.007*"peter" + 0.006*"pacif" + 0.006*"footag"

Topic 2: 0.024*"news" + 0.018*"queensland" + 0.009*"john" + 0.008*"leagu" + 0.007*"david" + 0.006*"histori" + 0.006*"nation" + 0.006*"kohler" + 0.006*"ash" + 0.005*"univers"

Topic 3: 0.011*"stori" + 0.008*"market" + 0.008*"price" + 0.007*"australian" + 0.006*"million" + 0.006*"share" + 0.005*"dollar" + 0.005*"brief" + 0.005*"tasmanian" + 0.005*"andrew"

Topic 4: 0.015*"donald" + 0.012*"elect" + 0.011*"govern" + 0.008*"feder" + 0.007*"liber" + 0.007*"say" + 0.007*"labor" + 0.006*"thursday" + 0.006*"korea" + 0.006*"street"

Topic 5: 0.012*"rural" + 0.010*"climat" + 0.009*"michael" + 0.008*"financ" + 0.007*"hong" + 0.007*"kong" + 0.007*"video" + 0.