# Introduction
The purpose of this notebook is to present an example of topic modeling using Latent Dirichlet Allocation (LDA). The dataset used is a collection of over 140,000 news articles, each with a title, an author, and the article content; the models are trained on the content. The dataset can be downloaded from [Kaggle](https://www.kaggle.com/snapcrack/all-the-news/data).

# Steps
* [Data loading](#Data-loading)
* [Data pre-processing](#Data-pre-processing)
* [Features extraction](#Features-extraction)
  * [Generate Bag of Words](#Generate-Bag-of-Words)
  * [Generate TF-IDF](#Generate-TF-IDF)
* [Models training](#Models-training)
  * [With Bag of Words](#With-Bag-of-Words)
  * [With TF-IDF](#With-TF-IDF)

# Data loading

In [0]:
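from google.colab import files

# Upload kaggle.json (the Kaggle API token) from the local machine.
files.upload()

# Install the Kaggle CLI and put the token where it expects to find it.
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

# Download the dataset archive into the working directory.
!kaggle datasets download -d snapcrack/all-the-news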

Saving kaggle.json to kaggle.json
Downloading all-the-news.zip to /content
 96% 234M/244M [00:03<00:00, 98.6MB/s]
100% 244M/244M [00:03<00:00, 75.6MB/s]


In [0]:
from zipfile import ZipFile
import pandas as pd

# Read every CSV in the archive without extracting it, then stack them into one frame.
with ZipFile('all-the-news.zip') as zip_file:
    dfs = [pd.read_csv(zip_file.open(text_file.filename)) for text_file in zip_file.infolist()]
news_df = pd.concat(dfs, axis=0, ignore_index=True)

# Keep only the columns used in this notebook.
news_df = news_df[['title', 'author', 'content']]

In [2]:
# Displaying
news_df

| | title | author | content |
|---|---|---|---|
| 0 | House Republicans Fret About Winning Their Hea... | Carl Hulse | WASHINGTON — Congressional Republicans have... |
| 1 | Rift Between Officers and Residents as Killing... | Benjamin Mueller and Al Baker | After the bullet shells get counted, the blood... |
| 2 | Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ... | Margalit Fox | When Walt Disney’s “Bambi” opened in 1942, cri... |
| 3 | Among Deaths in 2016, a Heavy Toll in Pop Musi... | William McDonald | Death may be the great equalizer, but it isn’t... |
| 4 | Kim Jong-un Says North Korea Is Preparing to T... | Choe Sang-Hun | SEOUL, South Korea — North Korea’s leader, ... |
| ... | ... | ... | ... |
| 142565 | An eavesdropping Uber driver saved his 16-year... | Avi Selk | Uber driver Keith Avila picked up a p... |
| 142566 | Plane carrying six people returning from a Cav... | Sarah Larimer | Crews on Friday continued to search L... |
| 142567 | After helping a fraction of homeowners expecte... | Renae Merle | When the Obama administration announced a... |
| 142568 | Yes, this is real: Michigan just banned bannin... | Chelsea Harvey | This story has been updated. A new law in... |


# Data pre-processing

In [0]:
import numpy as np
np.random.seed(59)

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

import nltk
from nltk.stem import SnowballStemmer, WordNetLemmatizer
nltk.download('wordnet')  # WordNetLemmatizer needs the WordNet corpus

# Build the stemmer and lemmatizer once instead of on every call.
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

def preprocess(text):
  # Tokenize, drop stop words and tokens of 3 characters or fewer,
  # then lemmatize each token as a verb and stem the result.
  processed_text = []
  for token in simple_preprocess(text):
    if token not in STOPWORDS and len(token) > 3:
      token = stemmer.stem(lemmatizer.lemmatize(token, pos='v'))
      processed_text.append(token)
  return processed_text

processed = news_df.content.map(preprocess)
news_df['processed'] = processed
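
The filtering rule inside `preprocess` can be tried on its own, without gensim or NLTK. The following is a minimal sketch and not the notebook's pipeline: `filter_tokens` and `DEMO_STOPWORDS` are hypothetical names, the regular-expression tokenizer only approximates `simple_preprocess`, and lemmatization and stemming are left out.

```python
import re

# Hypothetical, tiny stop word list for illustration only;
# the notebook itself uses gensim's much larger STOPWORDS set.
DEMO_STOPWORDS = {"the", "have", "with", "about", "their", "that", "this"}

def filter_tokens(text, stopwords=DEMO_STOPWORDS, min_len=4):
    """Lowercase, split on non-letters, then drop stop words and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # len(token) > 3 in the notebook is the same as len(token) >= 4 here.
    return [t for t in tokens if t not in stopwords and len(t) >= min_len]

sample = "The Republicans have a plan about winning their health care suit"
print(filter_tokens(sample))
```

Short words such as "a" are removed by the length rule even when they are not in the stop word list, which is why the length threshold matters as much as the list itself.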

In [4]:
# Displaying
news_df

| | title | author | content | processed |
|---|---|---|---|---|
| 0 | House Republicans Fret About Winning Their Hea... | Carl Hulse | WASHINGTON — Congressional Republicans have... | [washington, congression, republican, fear, co... |
| 1 | Rift Between Officers and Residents as Killing... | Benjamin Mueller and Al Baker | After the bullet shells get counted, the blood... | [bullet, shell, count, blood, dri, votiv, cand... |
| 2 | Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ... | Margalit Fox | When Walt Disney’s “Bambi” opened in 1942, cri... | [walt, disney, bambi, open, critic, prais, spa... |
| 3 | Among Deaths in 2016, a Heavy Toll in Pop Musi... | William McDonald | Death may be the great equalizer, but it isn’t... | [death, great, equal, necessarili, evenhand, f... |
| 4 | Kim Jong-un Says North Korea Is Preparing to T... | Choe Sang-Hun | SEOUL, South Korea — North Korea’s leader, ... | [seoul, south, korea, north, korea, leader, sa... |
| ... | ... | ... | ... | ... |
| 142565 | An eavesdropping Uber driver saved his 16-year... | Avi Selk | Uber driver Keith Avila picked up a p... | [uber, driver, keith, avila, pick, passeng, lo... |
| 142566 | Plane carrying six people returning from a Cav... | Sarah Larimer | Crews on Friday continued to search L... | [crew, friday, continu, search, lake, eri, pla... |
| 142567 | After helping a fraction of homeowners expecte... | Renae Merle | When the Obama administration announced a... | [obama, administr, announc, massiv, effort, he... |
| 142568 | Yes, this is real: Michigan just banned bannin... | Chelsea Harvey | This story has been updated. A new law in... | [stori, updat, michigan, prohibit, local, gove... |


# Features extraction

## Generate Bag of Words

In [0]:
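# Map each distinct token to an integer id.
dictionary = gensim.corpora.Dictionary(processed)
# Drop tokens found in fewer than 15 documents or in more than 50% of them; keep at most 100,000.
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
# Represent each document as a list of (token_id, count) pairs.
bow_corpus = [dictionary.doc2bow(doc) for doc in processed]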
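
To make the two steps above concrete, here is a rough pure-Python sketch of what a document-frequency filter followed by a bag-of-words conversion does. It is an illustration under assumptions, not gensim's implementation: the toy `docs`, the helper names `keep` and `doc2bow`, and the scaled-down thresholds (`no_below=2` instead of 15) are all made up for the example, and the `keep_n` cap is omitted.

```python
from collections import Counter

# Toy corpus of already-preprocessed documents (hypothetical data).
docs = [
    ["court", "judg", "case"],
    ["court", "trump", "campaign"],
    ["trump", "campaign", "trump"],
    ["court", "polic"],
]

# Document frequency: in how many documents does each token appear?
doc_freq = Counter(token for doc in docs for token in set(doc))

def keep(token, no_below=2, no_above=0.5):
    """Keep tokens that are neither too rare nor too common."""
    df = doc_freq[token]
    return df >= no_below and df / len(docs) <= no_above

# Assign integer ids to the surviving tokens.
vocab = sorted(t for t in doc_freq if keep(t))
token2id = {t: i for i, t in enumerate(vocab)}

def doc2bow(doc):
    """Return sorted (token_id, count) pairs, ignoring filtered-out tokens."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

bow = [doc2bow(d) for d in docs]
print(token2id)
print(bow)
```

In this toy corpus "court" is dropped for being too common (three of four documents) and "judg", "case", and "polic" for being too rare, so two of the four documents end up empty. The same can happen to short articles in the real corpus.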

## Generate TF-IDF

In [0]:
from gensim import models

# Fit IDF weights on the bag-of-words corpus; indexing applies them lazily, document by document.
tfidf = models.TfidfModel(bow_corpus)
tfidf_corpus = tfidf[bow_corpus]
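
The weighting can be sketched by hand. The example below assumes what I understand to be gensim's default scheme (raw term count times `log2(N / df)`, followed by L2 normalization of each document); the toy `bow_docs` and the helper name `tfidf` (which shadows nothing here, but differs from the notebook's `tfidf` model object) are hypothetical and only meant to show the arithmetic.

```python
import math
from collections import Counter

# Toy bag-of-words corpus: each document is a list of (token_id, count) pairs.
bow_docs = [
    [(0, 1), (1, 2)],
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1)],
    [(0, 2), (1, 1)],
]

n_docs = len(bow_docs)
# Number of documents containing each token id.
doc_freq = Counter(token_id for doc in bow_docs for token_id, _ in doc)

def tfidf(doc):
    """Weight counts by log2(N / df), drop zero weights, then L2-normalize."""
    weighted = [(tid, count * math.log2(n_docs / doc_freq[tid])) for tid, count in doc]
    weighted = [(tid, w) for tid, w in weighted if w > 0.0]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(tid, w / norm) for tid, w in weighted]

tfidf_docs = [tfidf(doc) for doc in bow_docs]
for doc in tfidf_docs:
    print(doc)
```

Token 0 appears in every document, so its weight is zero and it disappears from the output, while the rare token 2 dominates the one document that contains it.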

# Models training

## With Bag of Words

In [0]:
lda_model_bow = gensim.models.LdaMulticore(
    bow_corpus,
    num_topics=20,
    id2word=dictionary,
    workers=2
)

In [8]:
# Displaying
for idx, topic in lda_model_bow.print_topics(-1):
    print('Topic {}: {}\n'.format(idx, topic))

Topic 0: 0.100*"trump" + 0.020*"clinton" + 0.015*"campaign" + 0.014*"donald" + 0.010*"presid" + 0.007*"immigr" + 0.007*"presidenti" + 0.006*"hillari" + 0.006*"state" + 0.006*"candid"

Topic 1: 0.008*"facebook" + 0.006*"post" + 0.006*"compani" + 0.006*"appl" + 0.005*"user" + 0.005*"video" + 0.005*"internet" + 0.005*"stori" + 0.004*"work" + 0.004*"googl"

Topic 2: 0.007*"food" + 0.005*"go" + 0.005*"work" + 0.005*"come" + 0.005*"look" + 0.004*"know" + 0.004*"home" + 0.004*"think" + 0.004*"citi" + 0.004*"want"

Topic 3: 0.007*"know" + 0.006*"think" + 0.005*"tell" + 0.005*"famili" + 0.005*"life" + 0.005*"go" + 0.005*"want" + 0.005*"love" + 0.005*"film" + 0.005*"stori"

Topic 4: 0.023*"court" + 0.015*"case" + 0.010*"judg" + 0.009*"feder" + 0.009*"state" + 0.008*"attorney" + 0.007*"justic" + 0.007*"charg" + 0.007*"prison" + 0.007*"sentenc"

Topic 5: 0.017*"china" + 0.013*"north" + 0.010*"korea" + 0.008*"south" + 0.006*"chines" + 0.006*"nuclear" + 0.005*"state" + 0.005*"water" + 0.005*"missil"
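
The topic strings printed above are easy to read but awkward to process. A small sketch for turning one of them into `(word, weight)` pairs follows; `parse_topic` is a hypothetical helper, and the sample string is copied from the start of Topic 0. As far as I know, gensim's `show_topic` returns such pairs directly, so parsing is mainly useful when only the printed output is available.

```python
import re

# Matches one term of the form 0.100*"trump"
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')

def parse_topic(topic):
    """Turn a printed topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic)]

topic_0 = '0.100*"trump" + 0.020*"clinton" + 0.015*"campaign"'
terms = parse_topic(topic_0)
print(terms)
```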

## With TF-IDF

In [0]:
lda_model_tfidf = gensim.models.LdaMulticore(
    tfidf_corpus,
    id2word=dictionary,
    num_topics=20,
    workers=2
)

In [10]:
# Displaying
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic {}: {}\n'.format(idx, topic))

Topic 0: 0.016*"brazil" + 0.015*"olymp" + 0.014*"pipelin" + 0.010*"rousseff" + 0.010*"brazilian" + 0.010*"temer" + 0.008*"cooney" + 0.007*"dakota" + 0.007*"tribe" + 0.006*"reai"

Topic 1: 0.003*"women" + 0.002*"film" + 0.002*"music" + 0.002*"love" + 0.002*"book" + 0.002*"feel" + 0.002*"stori" + 0.002*"life" + 0.002*"movi" + 0.002*"think"

Topic 2: 0.010*"attack" + 0.010*"polic" + 0.008*"mosul" + 0.007*"kill" + 0.006*"iraqi" + 0.006*"islam" + 0.006*"bomb" + 0.006*"shoot" + 0.006*"milit" + 0.005*"citi"

Topic 3: 0.009*"comey" + 0.009*"trump" + 0.007*"russia" + 0.007*"russian" + 0.007*"intellig" + 0.006*"investig" + 0.006*"clinton" + 0.006*"email" + 0.005*"flynn" + 0.005*"committe"

Topic 4: 0.014*"ceasefir" + 0.011*"nypd" + 0.009*"kaepernick" + 0.006*"nhtsa" + 0.006*"bowser" + 0.005*"farag" + 0.005*"solarc" + 0.004*"anthem" + 0.004*"mcginti" + 0.003*"francoi"

Topic 5: 0.011*"ail" + 0.010*"reilli" + 0.005*"kelli" + 0.005*"sander" + 0.005*"deutsch" + 0.005*"carlson" + 0.004*"murdoch" + 0.