<a href="https://colab.research.google.com/github/andrewmsilva/DataScienceStudies/blob/master/Topic%20Modeling%20with%20news.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
This notebook presents an example of topic modeling with Latent Dirichlet Allocation (LDA). The dataset is a collection of roughly 143,000 news articles from publications such as the New York Times and the Washington Post, and it can be downloaded from [Kaggle](https://www.kaggle.com/snapcrack/all-the-news/data).

# Steps
* [Data loading](#Data-loading)
* [Data pre-processing](#Data-pre-processing)
* [Features extraction](#Features-extraction)
  * [Generate Bag of Words](#Generate-Bag-of-Words)
  * [Generate TF-IDF](#Generate-TF-IDF)
* [Models training](#Models-training)
  * [With Bag of Words](#With-Bag-of-Words)
  * [With TF-IDF](#With-TF-IDF)

# Data loading

In [0]:
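from google.colab import files
files.upload()  # prompts for the kaggle.json API token
!pip install -q kaggle
# Put the token where the Kaggle CLI looks for it and restrict its permissions
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download -d snapcrack/all-the-news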

Saving kaggle.json to kaggle.json
Downloading all-the-news.zip to /content
 96% 234M/244M [00:03<00:00, 98.6MB/s]
100% 244M/244M [00:03<00:00, 75.6MB/s]


In [0]:
from zipfile import ZipFile
import pandas as pd

zip_file = ZipFile('all-the-news.zip')
dfs = [ pd.read_csv(zip_file.open(text_file.filename)) for text_file in zip_file.infolist() ]
news_df = pd.concat(dfs, axis=0, ignore_index=True)

In [2]:
# Displaying
news_df

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ..."
...,...,...,...,...,...,...,...,...,...,...
142565,146028,218078,An eavesdropping Uber driver saved his 16-year...,Washington Post,Avi Selk,2016-12-30,2016.0,12.0,https://web.archive.org/web/20161231004909/htt...,Uber driver Keith Avila picked up a p...
142566,146029,218079,Plane carrying six people returning from a Cav...,Washington Post,Sarah Larimer,2016-12-30,2016.0,12.0,https://web.archive.org/web/20161231004909/htt...,Crews on Friday continued to search L...
142567,146030,218080,After helping a fraction of homeowners expecte...,Washington Post,Renae Merle,2016-12-30,2016.0,12.0,https://web.archive.org/web/20161231004909/htt...,When the Obama administration announced a...
142568,146031,218081,"Yes, this is real: Michigan just banned bannin...",Washington Post,Chelsea Harvey,2016-12-30,2016.0,12.0,https://web.archive.org/web/20161231004909/htt...,This story has been updated. A new law in...


# Data pre-processing

In [0]:
import numpy as np
np.random.seed(59)  # seed NumPy's global random generator

import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS

import nltk
nltk.download('wordnet')  # corpus required by WordNetLemmatizer

# Build the stemmer and lemmatizer once instead of once per document
stemmer = nltk.SnowballStemmer('english')
lemmatizer = nltk.WordNetLemmatizer()

def preprocess(text):
  # Tokenize and lowercase, drop stop words and tokens of 3 characters or fewer,
  # then lemmatize each remaining token as a verb and stem it
  processed_text = []
  for token in simple_preprocess(text):
    if token not in STOPWORDS and len(token) > 3:
      token = stemmer.stem(lemmatizer.lemmatize(token, pos='v'))
      processed_text.append(token)
  return processed_text

processed = news_df.content.map(preprocess)
news_df['processed'] = processed
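
To make the filtering rule in `preprocess` easier to see in isolation, here is a minimal standard-library sketch of the same idea: lowercase, tokenize, drop stop words, and drop tokens of three characters or fewer. The names `TOY_STOPWORDS` and `toy_preprocess` are hypothetical, the stop-word list is deliberately tiny, and lemmatization and stemming are left out, so the output will not match what gensim and NLTK produce.

```python
import re

# Hypothetical, deliberately tiny stop-word list; gensim's STOPWORDS is much larger.
TOY_STOPWORDS = frozenset({"the", "and", "that", "this", "with", "have", "from", "about"})

def toy_preprocess(text, stopwords=TOY_STOPWORDS, min_len=4):
    """Lowercase, keep alphabetic runs, drop stop words and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords and len(t) >= min_len]

sample = "Congressional Republicans have a new fear about the health law."
print(toy_preprocess(sample))
# ['congressional', 'republicans', 'fear', 'health']
```

Here `min_len=4` mirrors the `len(token) > 3` test used in the cell above.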

In [4]:
# Displaying
news_df

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content,processed
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...,"[washington, congression, republican, fear, co..."
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood...","[bullet, shell, count, blood, dri, votiv, cand..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,,"When Walt Disney’s “Bambi” opened in 1942, cri...","[walt, disney, bambi, open, critic, prais, spa..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,,"Death may be the great equalizer, but it isn’t...","[death, great, equal, necessarili, evenhand, f..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,,"SEOUL, South Korea — North Korea’s leader, ...","[seoul, south, korea, north, korea, leader, sa..."
...,...,...,...,...,...,...,...,...,...,...,...
142565,146028,218078,An eavesdropping Uber driver saved his 16-year...,Washington Post,Avi Selk,2016-12-30,2016.0,12.0,https://web.archive.org/web/20161231004909/htt...,Uber driver Keith Avila picked up a p...,"[uber, driver, keith, avila, pick, passeng, lo..."
142566,146029,218079,Plane carrying six people returning from a Cav...,Washington Post,Sarah Larimer,2016-12-30,2016.0,12.0,https://web.archive.org/web/20161231004909/htt...,Crews on Friday continued to search L...,"[crew, friday, continu, search, lake, eri, pla..."
142567,146030,218080,After helping a fraction of homeowners expecte...,Washington Post,Renae Merle,2016-12-30,2016.0,12.0,https://web.archive.org/web/20161231004909/htt...,When the Obama administration announced a...,"[obama, administr, announc, massiv, effort, he..."
142568,146031,218081,"Yes, this is real: Michigan just banned bannin...",Washington Post,Chelsea Harvey,2016-12-30,2016.0,12.0,https://web.archive.org/web/20161231004909/htt...,This story has been updated. A new law in...,"[stori, updat, michigan, prohibit, local, gove..."


# Features extraction

## Generate Bag of Words

In [0]:
dictionary = gensim.corpora.Dictionary(processed)
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)
bow_corpus = [dictionary.doc2bow(doc) for doc in processed]
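
As a rough illustration of what the dictionary filtering and the bag-of-words conversion do, the sketch below reimplements the idea with the standard library on a toy corpus. It is not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, token ids are assigned alphabetically for readability, and the `keep_n` cap is omitted.

```python
from collections import Counter

def build_vocab(docs, no_below=2, no_above=0.5):
    """Map tokens to ids, keeping tokens found in at least `no_below` documents
    and in at most a fraction `no_above` of all documents."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for tokens present in the vocabulary."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

toy_docs = [
    ["court", "judg", "case", "court"],
    ["trump", "campaign", "court", "trump"],
    ["trump", "clinton", "campaign"],
    ["china", "korea", "court"],
]
vocab = build_vocab(toy_docs)
toy_bow = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)    # {'campaign': 0, 'trump': 1}
print(toy_bow)  # [[], [(0, 1), (1, 2)], [(0, 1), (1, 1)], []]
```

In this toy corpus "court" is dropped because it appears in three of four documents (more than half), and the remaining rare tokens are dropped because they appear in only one document.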

## Generate TF-IDF

In [0]:
from gensim import corpora, models

tfidf = models.TfidfModel(bow_corpus)
tfidf_corpus = tfidf[bow_corpus]
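
The reweighting step can be sketched by hand on a toy bag-of-words corpus. The version below assumes a base-2 logarithm for the inverse document frequency and unit-length (L2) normalisation per document; that is believed to match gensim's defaults, but it is an assumption here, and `tfidf_weights` is a hypothetical helper rather than gensim's code.

```python
import math

def tfidf_weights(bow_docs):
    """Weight each (token_id, count) by count * log2(N / df), then L2-normalise per document."""
    n_docs = len(bow_docs)
    doc_freq = {}
    for doc in bow_docs:
        for token_id, _ in doc:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted = []
    for doc in bow_docs:
        raw = [(tid, count * math.log2(n_docs / doc_freq[tid])) for tid, count in doc]
        norm = math.sqrt(sum(w * w for _, w in raw))
        # Tokens that occur in every document get weight 0 and are dropped.
        weighted.append([(tid, w / norm) for tid, w in raw if w > 0.0] if norm else [])
    return weighted

toy_bow_counts = [
    [(0, 1), (1, 1)],
    [(0, 1), (2, 1)],
    [(1, 2), (3, 1)],
    [(3, 1)],
]
toy_tfidf = tfidf_weights(toy_bow_counts)
for doc in toy_tfidf:
    print([(tid, round(w, 3)) for tid, w in doc])
```

In the second toy document, token 2 appears in only one of the four documents, so it ends up with twice the weight of token 0, which appears in two.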

# Models training

## With Bag of Words

In [0]:
lda_model_bow = gensim.models.LdaMulticore(
    bow_corpus,
    num_topics=20,
    id2word=dictionary,
    workers=2
)

In [8]:
# Displaying
for idx, topic in lda_model_bow.print_topics(-1):
    print('Topic {}: {}\n'.format(idx, topic))

Topic 0: 0.095*"trump" + 0.021*"clinton" + 0.016*"campaign" + 0.013*"donald" + 0.010*"presid" + 0.007*"immigr" + 0.007*"presidenti" + 0.007*"hillari" + 0.006*"state" + 0.006*"candid"

Topic 1: 0.008*"facebook" + 0.006*"appl" + 0.006*"compani" + 0.005*"post" + 0.005*"user" + 0.005*"internet" + 0.005*"video" + 0.005*"stori" + 0.004*"googl" + 0.004*"work"

Topic 2: 0.006*"food" + 0.005*"go" + 0.005*"work" + 0.005*"come" + 0.005*"look" + 0.004*"know" + 0.004*"think" + 0.004*"citi" + 0.004*"home" + 0.004*"want"

Topic 3: 0.007*"know" + 0.005*"think" + 0.005*"tell" + 0.005*"life" + 0.005*"go" + 0.005*"famili" + 0.005*"film" + 0.005*"want" + 0.005*"love" + 0.005*"women"

Topic 4: 0.023*"court" + 0.015*"case" + 0.009*"judg" + 0.009*"feder" + 0.009*"state" + 0.007*"attorney" + 0.007*"justic" + 0.006*"prison" + 0.006*"charg" + 0.006*"legal"

Topic 5: 0.016*"china" + 0.014*"north" + 0.011*"korea" + 0.008*"south" + 0.006*"chines" + 0.006*"nuclear" + 0.005*"missil" + 0.005*"state" + 0.005*"water" +
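
The topic strings printed above are convenient to read but awkward to process further. A minimal sketch, assuming the `weight*"word"` format shown in the output, turns such a string into (word, weight) pairs with a regular expression. `parse_topic` is a hypothetical helper; gensim's `show_topic` method should return similar pairs directly from the model.

```python
import re

# Matches one term such as 0.095*"trump" and captures the weight and the word.
TERM_PATTERN = re.compile(r'(\d*\.?\d+)\*"([^"]+)"')

def parse_topic(topic_str):
    """Turn '0.095*"trump" + 0.021*"clinton"' into [('trump', 0.095), ('clinton', 0.021)]."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_str)]

topic_0 = '0.095*"trump" + 0.021*"clinton" + 0.016*"campaign"'
print(parse_topic(topic_0))
# [('trump', 0.095), ('clinton', 0.021), ('campaign', 0.016)]
```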

## With TF-IDF

In [0]:
lda_model_tfidf = gensim.models.LdaMulticore(
    tfidf_corpus,
    id2word=dictionary,
    num_topics=20,
    workers=2
)

In [10]:
# Displaying
for idx, topic in lda_model_tfidf.print_topics(-1):
    print('Topic {}: {}\n'.format(idx, topic))

Topic 0: 0.004*"game" + 0.003*"film" + 0.003*"music" + 0.003*"season" + 0.003*"play" + 0.003*"song" + 0.002*"movi" + 0.002*"player" + 0.002*"team" + 0.002*"charact"

Topic 1: 0.006*"percent" + 0.005*"insur" + 0.005*"health" + 0.004*"drug" + 0.003*"rate" + 0.003*"cost" + 0.003*"school" + 0.003*"price" + 0.003*"program" + 0.003*"student"

Topic 2: 0.011*"cuba" + 0.009*"castro" + 0.009*"cuban" + 0.009*"dutert" + 0.008*"trump" + 0.006*"russia" + 0.006*"mexico" + 0.006*"nune" + 0.005*"russian" + 0.005*"reelect"

Topic 3: 0.007*"palin" + 0.004*"manziel" + 0.004*"sharapova" + 0.003*"defenc" + 0.003*"raddatz" + 0.003*"neanderth" + 0.003*"laila" + 0.003*"qualcomm" + 0.003*"nehlen" + 0.003*"dudley"

Topic 4: 0.004*"opel" + 0.002*"keillor" + 0.002*"hernandez" + 0.002*"golf" + 0.002*"shark" + 0.002*"niki" + 0.002*"brewer" + 0.002*"wimbledon" + 0.001*"jupit" + 0.001*"tenni"

Topic 5: 0.007*"israel" + 0.007*"muslim" + 0.006*"refuge" + 0.006*"isra" + 0.005*"palestinian" + 0.005*"macron" + 0.005*"milb