# A short introduction to Topic Modelling using LDAs

In this notebook we will introduce the practical side of LDAs. We have already done most of the work in the previous notebooks, so defining and training LDAs will be a walk in the park, especially when using gensim.

We will process the `shorter_abcnews_text.csv` file in the Drive folder.

## Updating, installing and importing some packages

We will update `gensim` and install a visualization tool called `pyLDAvis`.

In [1]:
!pip install --upgrade gensim

Collecting gensim
  Downloading gensim-4.0.1-cp37-cp37m-manylinux1_x86_64.whl (23.9 MB)
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 3.6.0
    Uninstalling gensim-3.6.0:
      Successfully uninstalled gensim-3.6.0
Successfully installed gensim-4.0.1


In [2]:
!pip install pyldavis

Collecting pyldavis
  Downloading pyLDAvis-3.3.1.tar.gz (1.7 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
    Preparing wheel metadata ... done
Collecting numpy>=1.20.0
  Downloading numpy-1.21.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
Collecting pandas>=1.2.0
  Downloading pandas-1.3.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (10.8 MB)
Collecting funcy
  Downloading funcy-1.16-py2.py3-none-any.whl (32 kB)
Building wheels for collected packages: pyldavis
  Building wheel for pyldavis (PEP 517) ... done
  Created wheel for pyldavis: filename=pyLDAvis-3.3.1-py2.py3-none-any.whl size=136897 sha256=27cc6af15346e

In [1]:
from typing import List

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# sns.set_theme()

import nltk
nltk.download("wordnet")
nltk.download("stopwords")
from nltk import WordNetLemmatizer

from wordcloud import WordCloud

from gensim.utils import simple_preprocess
from gensim.models import LdaModel, LdaMulticore
from gensim.corpora import Dictionary

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!




## Loading and Preprocessing the data

In [2]:
def clean_text(tweet_text: str) -> str:
  """
  Takes the text of a tweet and removes
  every word that contains an at sign (@),
  such as user mentions.
  """
  words = tweet_text.split()

  clean_words = [word for word in words if not ("@" in word)]
  return " ".join(clean_words)

In [3]:
def preprocess_headlines(col: pd.Series, remove_stopwords: bool = True) -> List[List[str]]:
  """
  Takes a column of headline strings and
  returns a list of preprocessed (tokenized) documents.

  See the notebooks from the previous class for details.
  """
  # Doing a simple preprocess using gensim
  preprocessed_docs = [
    simple_preprocess(t, min_len=0, max_len=100) for t in col
  ]

  # Lemmatizing
  lemmatizer = WordNetLemmatizer()
  preprocessed_docs = [
    [lemmatizer.lemmatize(token) for token in doc] for doc in preprocessed_docs
  ]

  # Removing stopwords
  if remove_stopwords:
    english_stopwords = set(nltk.corpus.stopwords.words("english"))
    preprocessed_docs = [
      [word for word in doc if word not in english_stopwords] for doc in preprocessed_docs
    ]

  return preprocessed_docs


In [4]:
news = pd.read_csv("shorter_abcnews_text.csv")

In [5]:
news.head()

Unnamed: 0.1,Unnamed: 0,publish_date,headline_text
0,0,20030219,aba decides against community broadcasting lic...
1,1,20030219,act fire witnesses must be aware of defamation
2,2,20030219,a g calls for infrastructure protection summit
3,3,20030219,air nz staff in aust strike for pay rise
4,4,20030219,air nz strike to affect australian travellers


In [6]:
corpus = preprocess_headlines(news["headline_text"])

In [7]:
corpus[:5]

[['aba', 'decides', 'community', 'broadcasting', 'licence'],
 ['act', 'fire', 'witness', 'must', 'aware', 'defamation'],
 ['g', 'call', 'infrastructure', 'protection', 'summit'],
 ['air', 'nz', 'staff', 'aust', 'strike', 'pay', 'rise'],
 ['air', 'nz', 'strike', 'affect', 'australian', 'traveller']]

## Training an LDA

Because of how LDA is implemented in `gensim`, we first need to define a `Dictionary`, then build a list with the bag-of-words of every document (using the `Dictionary.doc2bow` method), and finally train the LDA on that list.
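To make the bag-of-words step concrete, here is a minimal pure-Python sketch of the idea. It is not gensim's implementation: the helper names `build_token_ids` and `to_bow` are hypothetical, and the ids gensim assigns to tokens may differ. It only illustrates that each document becomes a sorted list of `(token_id, count)` pairs and that tokens missing from the dictionary are skipped.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_token_ids(docs: List[List[str]]) -> Dict[str, int]:
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id: Dict[str, int] = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(doc: List[str], token2id: Dict[str, int]) -> List[Tuple[int, int]]:
    """Return sorted (token_id, count) pairs, skipping tokens without an id."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())


# Two preprocessed headlines taken from the corpus shown earlier.
docs = [
    ["air", "nz", "staff", "aust", "strike", "pay", "rise"],
    ["air", "nz", "strike", "affect", "australian", "traveller"],
]
token2id = build_token_ids(docs)

# "unseen" has no id, so it does not appear in the result.
bow = to_bow(["air", "strike", "strike", "unseen"], token2id)
print(bow)  # [(0, 1), (4, 2)]
```

The order of words is lost in this representation; only which tokens occur and how often is kept, which is all LDA uses.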

In [8]:
dct = Dictionary(corpus)

In [13]:
# There are actually over 1 million headlines, so let's restrict to the first 4000.
bows = [dct.doc2bow(doc) for doc in corpus[:4000]]

In [14]:
# should take 2.5 minutes.
lda_model = LdaMulticore(
    bows,
    id2word=dct,
    num_topics=10,
    random_state=100,
    chunksize=100,
    passes=10,
    per_word_topics=True
)

## Visualizing and understanding the LDA

We can use pyLDAvis to check the topics in an interactive way.

In [15]:
pyLDAvis.enable_notebook()
lda_viz = gensimvis.prepare(lda_model, bows, dct)



In [16]:
lda_viz

The `lda_model` object also has a `.print_topics` method that gives a quick summary of each topic: its highest-weighted tokens together with their weights.

In [21]:
lda_model.print_topics()

[(0,
  '0.031*"win" + 0.018*"howard" + 0.018*"probe" + 0.016*"vic" + 0.015*"day" + 0.015*"top" + 0.013*"play" + 0.012*"continues" + 0.011*"welcome" + 0.011*"clash"'),
 (1,
  '0.026*"war" + 0.020*"indian" + 0.017*"get" + 0.016*"water" + 0.014*"australia" + 0.013*"green" + 0.013*"nz" + 0.013*"support" + 0.012*"rally" + 0.011*"well"'),
 (2,
  '0.050*"say" + 0.027*"govt" + 0.022*"call" + 0.019*"council" + 0.018*"plan" + 0.014*"year" + 0.012*"fund" + 0.012*"still" + 0.012*"job" + 0.011*"union"'),
 (3,
  '0.024*"man" + 0.020*"anti" + 0.016*"car" + 0.014*"child" + 0.013*"melbourne" + 0.013*"home" + 0.013*"attack" + 0.012*"protest" + 0.012*"bombing" + 0.012*"crash"'),
 (4,
 (5,
  '0.021*"hospital" + 0.019*"coast" + 0.018*"nsw" + 0.015*"coalition" + 0.014*"minister" + 0.013*"health" + 0.013*"iraqi" + 0.013*"deal" + 0.013*"public" + 0.012*"defends"'),
 (6,
  '0.028*"aust" + 0.027*"court" + 0.026*"claim" + 0.019*"face" + 0.018*"take" + 0.017*"mp" + 0.015*"may" + 0.014*"charge" + 0.013*"killed" + 
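Each summary string follows the pattern `weight*"token" + weight*"token" + ...`. If you want the pairs as data rather than text, a small regular expression is enough. The sketch below is only an illustration: `parse_topic` is a hypothetical helper, and the input is the topic 0 string copied from the output above. gensim's `show_topic` method should give you such pairs directly from the model, so this is mostly useful when you only have the printed text.

```python
import re
from typing import List, Tuple

# Matches one term such as 0.031*"win" and captures the weight and the token.
TERM_PATTERN = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(summary: str) -> List[Tuple[str, float]]:
    """Turn a print_topics summary string into (token, weight) pairs."""
    return [(token, float(weight)) for weight, token in TERM_PATTERN.findall(summary)]


# Topic 0, copied from the print_topics output.
topic_0 = (
    '0.031*"win" + 0.018*"howard" + 0.018*"probe" + 0.016*"vic" + 0.015*"day" + '
    '0.015*"top" + 0.013*"play" + 0.012*"continues" + 0.011*"welcome" + 0.011*"clash"'
)

terms = parse_topic(topic_0)
for token, weight in terms[:3]:
    print(f"{token}: {weight}")
```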

In [24]:
idx = 10


bow = bows[idx]
print(lda_model.get_document_topics(bow))
print(corpus[idx])

[(0, 0.01669908), (1, 0.3488181), (2, 0.01669908), (3, 0.01669908), (4, 0.3505845), (5, 0.01669908), (6, 0.01669908), (7, 0.01669908), (8, 0.01669908), (9, 0.18370381)]
['australia', 'contribute', 'million', 'aid', 'iraq']
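The output above is a list of `(topic_id, probability)` pairs for one document. A minimal sketch for ranking them, using the values printed above (the helper name `top_topics` is hypothetical):

```python
from typing import List, Tuple

# Topic distribution of document 10, copied from the output above.
doc_topics: List[Tuple[int, float]] = [
    (0, 0.01669908), (1, 0.3488181), (2, 0.01669908), (3, 0.01669908),
    (4, 0.3505845), (5, 0.01669908), (6, 0.01669908), (7, 0.01669908),
    (8, 0.01669908), (9, 0.18370381),
]


def top_topics(pairs: List[Tuple[int, float]], n: int = 3) -> List[Tuple[int, float]]:
    """Return the n most probable topics, highest probability first."""
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]


ranked = top_topics(doc_topics)
total = sum(probability for _, probability in doc_topics)

print(ranked)  # topics 4, 1 and 9 carry most of the probability mass
print(total)   # the probabilities sum to (approximately) 1
```

For this headline, topics 4 and 1 are almost tied, so picking a single "dominant" topic would hide the fact that the model considers the document a mixture.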


In [23]:
corpus[1]

['act', 'fire', 'witness', 'must', 'aware', 'defamation']



In [28]:
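# Note: log_perplexity returns a per-word likelihood bound (a negative number),
# not the perplexity itself; gensim reports perplexity as 2 ** (-bound).
perplexities = []
for k in [5, 10, 15, 20]:
  lda_model = LdaMulticore(
    bows,
    id2word=dct,
    num_topics=k,
    random_state=100,
    chunksize=100,
    passes=10,
    per_word_topics=True
  )
  # Evaluated on the training bag-of-words, not on held-out documents.
  perplexities.append(lda_model.log_perplexity(bows))

print(perplexities)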

[-9.178746672571533, -10.111345102333573, -16.078761754964447, -19.837467083793882]
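The numbers above are per-word likelihood bounds, so they are easier to compare once converted. A minimal sketch, assuming the relation perplexity = 2^(-bound) that gensim uses when it logs its perplexity estimate, and reusing the values printed above:

```python
# Per-word likelihood bounds printed above, keyed by the number of topics.
bounds = {
    5: -9.178746672571533,
    10: -10.111345102333573,
    15: -16.078761754964447,
    20: -19.837467083793882,
}

# Assumed conversion: perplexity = 2 ** (-bound). Lower perplexity is better.
perplexity = {k: 2 ** (-bound) for k, bound in bounds.items()}

for k, value in perplexity.items():
    print(f"num_topics={k}: perplexity ~ {value:.3g}")

best_k = min(perplexity, key=perplexity.get)
print("lowest perplexity at num_topics =", best_k)
```

On these numbers the perplexity grows quickly with the number of topics, so the smallest model scores best. Keep in mind that the bound was computed on the training documents themselves; scoring a held-out set of headlines would likely be a more reliable way to compare values of `num_topics`.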


In [30]:
lda_model = LdaMulticore(
    bows,
    id2word=dct,
    num_topics=100,
    random_state=100,
    chunksize=100,
    passes=10,
    per_word_topics=True
)

In [31]:
lda_model.log_perplexity(bows)

-177.69207720075332