<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Adapted by Valdis Saulespurens from [Nathan Kelber](http://nkelber.com) and Ted Lawless for [JSTOR Labs](https://labs.jstor.org/) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email valdis.s.coding at gmail com<br />
___

# Latent Dirichlet Allocation (LDA) Topic Modeling

**Description:**
This [notebook](https://docs.constellate.org/key-terms/#jupyter-notebook) demonstrates how to do topic modeling. The following processes are described:

* Filtering based on a [stop words list](https://docs.constellate.org/key-terms/#stop-words)
* Cleaning the tokens in the dataset
* Creating a [gensim dictionary](https://docs.constellate.org/key-terms/#gensim-dictionary)
* Creating a [gensim](https://docs.constellate.org/key-terms/#gensim) [bag of words](https://docs.constellate.org/key-terms/#bag-of-words) [corpus](https://docs.constellate.org/key-terms/#corpus)
* Computing a topic list using [gensim](https://docs.constellate.org/key-terms/#gensim)
* Visualizing the topic list with `pyldavis`

**Use Case:** For Researchers (Mostly code without explanation, not ideal for learners)

**Difficulty:** Intermediate

**Completion time:** 60 minutes

**Knowledge Required:** 
* Python Basics Series ([Start Python Basics I](./python-basics-1.ipynb))

**Knowledge Recommended:**
* [Exploring Metadata](./metadata.ipynb)
* [Working with Dataset Files](./working-with-dataset-files.ipynb)
* [Pandas I](./pandas-1.ipynb)
* [Creating a Stopwords List](./creating-stopwords-list.ipynb)
* A familiarity with [gensim](https://docs.constellate.org/key-terms/#gensim) is helpful but not required.

**Data Format:** [JSON Lines (.jsonl)](https://docs.constellate.org/key-terms/#jsonl)

**Libraries Used:**

* [pandas](https://constellate.org/docs/key-terms/#pandas) to load a preprocessing list
* `csv` to load a custom stopwords list
* [gensim](https://docs.constellate.org/key-terms/#gensim) to accomplish the topic modeling
* [NLTK](https://docs.constellate.org/key-terms/#nltk) to create a stopwords list (if no list is supplied)
* `pyldavis` to visualize our topic model

**Research Pipeline**
1. Build a dataset
2. Create a "Pre-Processing CSV" with [Exploring Metadata](./exploring-metadata.ipynb) (Optional)
3. Create a "Custom Stopwords List" with [Creating a Stopwords List](./creating-stopwords-list.ipynb) (Optional)
4. Complete the Topic Modeling analysis with this notebook
____

## What is Topic Modeling?

**Topic modeling** is a **machine learning** technique that attempts to discover groupings of words (called topics) that commonly occur together in a body of texts. The body of texts could be anything from journal articles to newspaper articles to tweets.

**Topic modeling** is an unsupervised clustering technique for text. We give the machine a series of texts, and it attempts to cluster them into a given number of topics. There is also a *supervised* technique called **Topic Classification**, where we supply the machine with examples of pre-labeled topics and then see if the machine can identify them given the examples.

**Topic modeling** is usually considered an exploratory technique; it helps us discover new patterns within a set of texts. **Topic Classification**, using labeled data, is intended to be a predictive technique; we want it to find more things like the examples we give it.

<font color='red'>Read more</font>

* ["Latent Dirichlet Allocation: Intuition, math, implementation and visualisation with pyLDAvis" Ioana](https://towardsdatascience.com/latent-dirichlet-allocation-intuition-math-implementation-and-visualisation-63ccb616e094) 2020
* ["Latent Dirichlet Allocation" Blei, Ng, Jordan](https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf?TB_iframe=true&width=370.8&height=658.8) 2003

## Import your dataset

In [1]:

import pandas
import os
import gensim
import requests


In [2]:
pandas.__version__

'1.3.5'

In [3]:
gensim.__version__

'3.6.0'

In [4]:
url = "https://github.com/ValRCS/BSSDH_22/raw/main/corpora/lv_old_newspapers_5k.tsv"

df = pandas.read_csv(url, sep="\t") 
df.head()

Unnamed: 0,Language,Source,Date,Text
0,Latvian,rekurzeme.lv,2008/09/04,"""Viņa pirmsnāves zīmītē bija rakstīts vienīgi ..."
1,Latvian,diena.lv,2012/01/10,info@zurnalistiem.lv
2,Latvian,bauskasdzive.lv,2007/12/27,"Bhuto, kas Pakistānā no trimdas atgriezās tika..."
3,Latvian,bauskasdzive.lv,2008/10/08,Plkst. 4.00 Samoilovs / Pļaviņš (pludmales vol...
4,Latvian,diena.lv,2011/10/05,"CVK bija vērsusies Skaburska, lūdzot izskaidro..."


In [55]:
full_url = "https://github.com/ValRCS/BSSDH_22/raw/main/corpora/lv_old_newspapers.zip"
# the Estonian corpus is "https://github.com/ValRCS/BSSDH_22/raw/main/corpora/ee_old_newspapers.zip"
# the Ukrainian corpus is "https://github.com/ValRCS/BSSDH_22/raw/main/corpora/ua_old_newspapers.zip"
full_df = pandas.read_csv(full_url, sep="\t", compression="zip")
full_df.shape

(319428, 4)

In [5]:
raw_documents = list(df.Text)
len(raw_documents) # we will use each document separately so again we have a list of strings so far

4999

In [6]:
type(raw_documents)

list

In [7]:
type(raw_documents[0]) # each document is a string so far

str

## Load Stopwords List

If you have created a stopword list in the stopwords notebook, we will import it here. (You can always modify the CSV file to add or subtract words then reload the list.) Otherwise, we'll load the NLTK [stopwords](https://docs.constellate.org/key-terms/#stop-words) list automatically.

In [8]:
# how to find all languages stopwords built in NLTK
# https://stackoverflow.com/questions/54573853/nltk-available-languages-for-stopwords
# bigger collection of all stopwords
# https://github.com/stopwords-iso
# latvian https://github.com/stopwords-iso/stopwords-lv/raw/master/stopwords-lv.txt
url = "https://github.com/stopwords-iso/stopwords-lv/raw/master/stopwords-lv.txt"
stop_words = []
response = requests.get(url)
if response.status_code == 200:
    stop_words = response.text.split()
len(stop_words), stop_words[:5]
# see the previous session on how to save your stopwords locally

(161, ['aiz', 'ap', 'apakš', 'apakšpus', 'ar'])
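
Because the token-cleaning step tests `token in stop_words` for every token, it may help to keep the stop words in a set rather than a list. The sketch below assumes the same whitespace-separated format as the file loaded above; `parse_stop_words` and `sample_text` are hypothetical names used only for illustration.

```python
def parse_stop_words(text):
    """Turn whitespace-separated stop words into a lowercase set."""
    return {word.lower() for word in text.split()}

# A tiny sample in the same format as the stop word file (one word per line)
sample_text = "aiz\nap\napakš\napakšpus\nar\n"
stop_word_set = parse_stop_words(sample_text)

print(len(stop_word_set))     # number of distinct stop words
print("ar" in stop_word_set)  # set membership is a constant-time lookup on average
```

With the real data, the same idea would be `stop_words = set(response.text.split())`.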

## Define a Function to Process Tokens
Next, we create a short function to clean up our tokens.

In [12]:
def process_token(token):
    token = token.lower()
    if token in stop_words:
        return # return None
    if len(token) < 4:
        return
    if not token.isalpha(): # if the token has any non-alphabetic characters, we return None
        return
    return token

In [9]:
"Valdis".isalpha()

True

In [10]:
"Valdis34".isalpha()

False

In [13]:
process_token("Valdis")

'valdis'

In [14]:
process_token("Valdis324")

In [15]:
# %%time
# Split each raw document into tokens and keep only the cleaned ones.

documents = [] # start with a blank list of documents
for document in raw_documents:
    # so we get tokens out of each individual document
    tokens = document.split() # here you could modify to use nltk.word_tokenize
    # we create a list of processed tokens for each document
    processed_document = [process_token(token) for token in tokens if process_token(token) is not None] # TODO could be improved with new walrus :=
    documents.append(processed_document)
print(f'Converted all documents to list of clean tokens')
documents[:3]

Converted all documents to list of clean tokens


[['pirmsnāves',
  'zīmītē',
  'rakstīts',
  'vienīgi',
  'smēķēšanas',
  'aizlieguma',
  'radītajiem',
  'laikrakstam',
  'paskaidroja',
  'nelaiķa',
  'svainis',
  'helmuts',
  'nebija',
  'vērsta',
  'viņa',
  'ģimeni'],
 [],
 ['pakistānā',
  'trimdas',
  'atgriezās',
  'diviem',
  'uzstājās',
  'priekšvēlēšanu',
  'organizēts',
  'nākamajā',
  'mēnesī',
  'gaidāmajām',
  'parlamenta']]

In [16]:
documents[-2:] # let's check the last two documents

[['vairāki',
  'pasākumi',
  'veltīti',
  'jaunākajiem',
  'lasītājiem',
  'šodien',
  'apgāds',
  'rīko',
  'gada',
  'rakstnieki',
  'oficiālo',
  'pasludināšanu',
  'notiks',
  'antoloģijas',
  'skaitāmi',
  'atvēršanas',
  'sestdien',
  'baudīt',
  'māra',
  'putniņa',
  'stāstus',
  'romānā',
  'atdzīvinātajiem',
  'ēdamajiem',
  'svētdien',
  'risināsies',
  'latvijas',
  'bērnu',
  'žūrijas',
  'lielie',
  'lasīšanas',
  'sveikti',
  'populārākie',
  'bērnu',
  'grāmatu',
  'tulkotāji'],
 ['piecu',
  'stundu',
  'pavadīšanas',
  'taškentas',
  'lidostas',
  'uzgaidāmajā',
  'telpā',
  'nakts',
  'lidmašīnā',
  'citā',
  'izkāpjot',
  'lidmašīnas',
  'sajūtams',
  'taizemi',
  'brauc',
  'vairums',
  'ziemeļvalstīs']]
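
The list comprehension in the cleaning loop calls `process_token` twice per token. A minimal sketch of a single-call version follows, assuming Python 3.8 or newer for the walrus operator (`:=`). The stop word set and the sample sentence are hypothetical stand-ins, and `process_document` is an illustrative helper name, not part of the notebook.

```python
# Hypothetical stand-in for the stop word list loaded earlier
stop_words = {"kuri"}

def process_token(token):
    """Same rules as the notebook's function: lowercase, then drop stop words,
    tokens shorter than 4 characters, and anything not purely alphabetic."""
    token = token.lower()
    if token in stop_words or len(token) < 4 or not token.isalpha():
        return None
    return token

def process_document(text):
    # The walrus operator stores the result of the single call in `clean`
    return [clean for token in text.split()
            if (clean := process_token(token)) is not None]

sample = "Bhuto kuri Pakistan 2007 gada"
print(process_document(sample))
```

On older Python versions, the two-call comprehension shown above remains a valid choice.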

Build a gensim dictionary and a bag-of-words corpus, and then train the model. More information about parameters can be found at the [Gensim LDA Model page](https://radimrehurek.com/gensim/models/ldamodel.html).

In [17]:
dictionary = gensim.corpora.Dictionary(documents)

In [18]:
len(dictionary)

31687

In [20]:
# so dictionary is just a mapping of ids to tokens - because ML algorithms most often work with numbers
list(dictionary.items())[:10]

[(0, 'aizlieguma'),
 (1, 'helmuts'),
 (2, 'laikrakstam'),
 (3, 'nebija'),
 (4, 'nelaiķa'),
 (5, 'paskaidroja'),
 (6, 'pirmsnāves'),
 (7, 'radītajiem'),
 (8, 'rakstīts'),
 (9, 'smēķēšanas')]

In [30]:
doc_count = len(documents)
num_topics = 7 # The number of topics; 7 is just a wild guess here
passes = 5 # The number of passes used to train the model
# By default: remove terms that appear in fewer than 5 documents (no_below=5) and terms that occur in more than 50% of documents (no_above=0.5).
dictionary.filter_extremes()
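
To make the effect of the two thresholds concrete, here is a conceptual sketch in plain Python. It is not gensim's implementation (gensim also caps the vocabulary size with `keep_n`); `filter_by_document_frequency` and the toy documents are hypothetical and exist only for illustration.

```python
from collections import Counter

def filter_by_document_frequency(documents, no_below=5, no_above=0.5):
    """Return the set of tokens kept under the document-frequency thresholds."""
    # Count each token once per document it appears in
    doc_freq = Counter(token for doc in documents for token in set(doc))
    max_docs = no_above * len(documents)
    return {token for token, df in doc_freq.items()
            if no_below <= df <= max_docs}

toy_documents = [
    ["latvijas", "domes", "gada"],
    ["latvijas", "skolas", "darba"],
    ["latvijas", "domes", "eiro"],
    ["latvijas", "skolas", "novada"],
]
# With 4 documents: keep tokens found in at least 2 documents and in at most 50% of them
kept = filter_by_document_frequency(toy_documents, no_below=2, no_above=0.5)
print(sorted(kept))
```

In this toy corpus, a word present in every document is dropped as too common, and words seen in only one document are dropped as too rare.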

In [21]:
single_bow = dictionary.doc2bow(documents[0])
single_bow

[(0, 1),
 (1, 1),
 (2, 1),
 (3, 1),
 (4, 1),
 (5, 1),
 (6, 1),
 (7, 1),
 (8, 1),
 (9, 1),
 (10, 1),
 (11, 1),
 (12, 1),
 (13, 1),
 (14, 1),
 (15, 1)]
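
Conceptually, a bag of words is a list of `(token_id, count)` pairs. The plain-Python sketch below mimics the idea behind `doc2bow`. It assumes ids are handed out in order of first appearance, which may differ from the ids gensim assigns, and both helper names are hypothetical.

```python
from collections import Counter

def build_token2id(documents):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in documents:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bag_of_words(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens missing from the mapping."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

toy_documents = [["bērnu", "grāmatu", "bērnu"], ["grāmatu", "tulkotāji"]]
token2id = build_token2id(toy_documents)
print(token2id)
print(to_bag_of_words(toy_documents[0], token2id))
```

A token repeated in a document shows up once in the bag with a count above 1, and word order is discarded.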

In [22]:
last_bow = dictionary.doc2bow(documents[-1])
last_bow


[(1296, 1),
 (1778, 1),
 (1786, 1),
 (4857, 1),
 (6202, 1),
 (6330, 1),
 (6764, 1),
 (9817, 1),
 (10769, 1),
 (17773, 1),
 (24955, 1),
 (31681, 1),
 (31682, 1),
 (31683, 1),
 (31684, 1),
 (31685, 1),
 (31686, 1)]

In [27]:
dictionary[1296],dictionary[1778],dictionary[31686]

('stundu', 'piecu', 'ziemeļvalstīs')

In [28]:
documents[-1]

['piecu',
 'stundu',
 'pavadīšanas',
 'taškentas',
 'lidostas',
 'uzgaidāmajā',
 'telpā',
 'nakts',
 'lidmašīnā',
 'citā',
 'izkāpjot',
 'lidmašīnas',
 'sajūtams',
 'taizemi',
 'brauc',
 'vairums',
 'ziemeļvalstīs']

In [23]:
# now we are going to generate a bag of words for each document
# the result is a list of bags of words
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

In [32]:
# this loop does the same thing as the one-line version above (slightly slower)
bow_corpus = []
for doc in documents:
    bow = dictionary.doc2bow(doc)
    bow_corpus.append(bow)

In [33]:
%%time
# %%time is a Jupyter "magic" command that times the cell
# Train the LDA model
model = gensim.models.LdaModel(
    corpus=bow_corpus, # list of bags of words, one per document
    id2word=dictionary, # dictionary mapping ids to words
    num_topics=num_topics, # a hyperparameter we can adjust
    passes=passes # more passes take longer; 5 is a reasonable starting compromise
)

CPU times: user 8.87 s, sys: 213 ms, total: 9.09 s
Wall time: 8.87 s


## Perplexity

During training, the LDA model can report a "perplexity" score based on the "held-out log-likelihood". Perplexity is a measure of how "surprised" the machine is to see certain data. In other words, perplexity measures how successfully a trained topic model predicts new data. The model may be trained many times with different parameters, optimizing for the lowest possible perplexity.

In general, the perplexity score should trend downward as the machine "learns" what to expect from the data. While a low perplexity score may signal the machine has learned the documents' patterns, that does not mean that the topics formed from a model with low perplexity will form the most coherent topics. (See ["Reading Tea Leaves: How Humans Interpret Topic Models" Chang, et al. 2009](https://papers.nips.cc/paper/2009/hash/f92586a25bb3145facd64ab20fd554ff-Abstract.html).)
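
As a rough illustration of the idea, perplexity can be written as the exponential of the negative average log-likelihood per token. The sketch below is not gensim's calculation (gensim reports a variational bound, for example through `LdaModel.log_perplexity`), and the probabilities are hypothetical values invented for the example.

```python
import math

def perplexity(token_probabilities):
    """exp of the negative mean log-probability assigned to held-out tokens."""
    log_likelihood = sum(math.log(p) for p in token_probabilities)
    return math.exp(-log_likelihood / len(token_probabilities))

# Hypothetical probabilities two models assign to the same four held-out tokens
confident_model = [0.5, 0.5, 0.5, 0.5]
uncertain_model = [0.1, 0.1, 0.1, 0.1]

print(perplexity(confident_model))  # lower: the model is less "surprised"
print(perplexity(uncertain_model))  # higher: the model is more "surprised"
```

A model that assigns higher probability to unseen text gets the lower perplexity, which is why the score is expected to fall as training progresses.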



## Topic Coherence

The failure of perplexity scores to consistently create "good" topics has led to new methods in "topic coherence". Here we demonstrate one of these methods (UMass) with Gensim, but there are additional methods available. Ideally, a researcher would run many topic models, discovering the optimum settings for topic coherence.

Ultimately, however, the best judge of topic coherence is a disciplinary expert, particularly someone familiar with the materials in question.

<font color='red'>Read more</font>

* ["Optimizing Semantic Coherence in Topic Models" Mimno, et al. 2011](http://dirichlet.net/pdf/mimno11optimizing.pdf)
* ["Automatic Evaluation of Topic Coherence" Newman, et al. 2010](https://mimno.infosci.cornell.edu/info6150/readings/N10-1012.pdf)
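
To give a feel for what a UMass-style score measures, here is a minimal plain-Python sketch based on the document co-occurrence formula from Mimno et al.: for each pair of top words, take the log of (number of documents containing both, plus 1) divided by the number of documents containing the higher-ranked word. It assumes every top word occurs in at least one document. Gensim's `CoherenceModel` aggregates pair scores differently, so absolute values will not match; `umass_coherence` and the toy corpus are hypothetical.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents, epsilon=1.0):
    """Sum of log((D(later, earlier) + epsilon) / D(earlier)) over ordered word pairs."""
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        # Number of documents that contain all of the given words
        return sum(all(word in doc for word in words) for doc in doc_sets)

    score = 0.0
    # combinations() keeps the ranking order: `earlier` is ranked above `later`
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_count(later, earlier) + epsilon) / doc_count(earlier))
    return score

toy_documents = [
    ["skolas", "bērnu"],
    ["skolas", "bērnu", "kultūras"],
    ["domes", "policijas"],
    ["domes", "skolas"],
]
print(umass_coherence(["skolas", "bērnu"], toy_documents))      # words that co-occur
print(umass_coherence(["skolas", "policijas"], toy_documents))  # words that never co-occur
```

Word pairs that rarely share a document pull the score down, so the pair that never co-occurs scores lower than the pair that does.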


In [35]:
# Compute the coherence score using UMass
# u_mass scores are typically negative; higher (closer to 0) is better
from gensim.models import CoherenceModel

coherence_model_lda = CoherenceModel(
    model=model,
    corpus=bow_corpus,
    dictionary=dictionary, 
    coherence='u_mass'
)

# Compute Coherence Score using UMass
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  -4.75814787728133


## Display a List of Topics
Print the most significant terms, as determined by the model, for each topic.

In [36]:
num_topics

7

In [37]:
model.get_topic_terms(0) #  you can adjust topn to see more or less than default 10 terms

[(59, 0.01357901),
 (4, 0.009470687),
 (58, 0.00822398),
 (14, 0.007737055),
 (451, 0.007566931),
 (84, 0.0064007454),
 (146, 0.00633214),
 (56, 0.0058322484),
 (567, 0.0056131748),
 (350, 0.0053895665)]

In [40]:
dictionary[59], dictionary[350]

('viņš', 'kuri')

In [39]:
for topic_num in range(0, num_topics): # so this loop goes from 0 to 6 (since 7 is not included in range)
    word_ids = model.get_topic_terms(topic_num)
    words = []
    for wid, weight in word_ids:
        word = dictionary[wid] # dictionary[...] also works when id2token has not been populated yet
        words.append(word)
    print("Topic {}".format(str(topic_num).ljust(5)), " ".join(words))

Topic 0     viņš viņa viņi latvijas ļoti savu gadā visu tagad kuri
Topic 1     gada eiropas šogad latvijas darba latvijā valsts pagasta savu šajā
Topic 2     latvijas viņa vairāk eiropas savukārt laikā lielu ļoti pirmā komandas
Topic 3     rīgas novada policijas domes latu savukārt eiro gada pārvaldes varētu
Topic 4     valsts izglītības kultūras mākslas latvijas skolas jēkabpils bērnu novada darba
Topic 5     latvijas daudz mājas ļoti mums gadā gadus laikā vismaz savukārt
Topic 6     valsts darba gada mūsu kuriem būtu varētu šādu viņš kopumā


## Visualize the Topic Distances

Visualize the model using [`pyLDAvis`](https://pyldavis.readthedocs.io/en/latest/). This visualization can take a while to generate depending on the size of your dataset.

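Try choosing a topic and adjusting the λ slider. When λ approaches 0, the listed terms are ranked by how exclusive they are to the selected topic, so terms that occur almost entirely in that topic rise to the top. When λ approaches 1, terms are ranked by their probability within the topic, so words that are also frequent in other topics appear more often.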
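
The slider controls the "relevance" ranking described by Sievert and Shirley (2014): relevance = λ · log p(word | topic) + (1 − λ) · log(p(word | topic) / p(word)). The sketch below shows how the ranking can flip between the two extremes. The probabilities are hypothetical numbers chosen for illustration, not values from the model above.

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """Relevance of a term to a topic for a given lambda between 0 and 1."""
    return (lam * math.log(p_word_given_topic)
            + (1 - lam) * math.log(p_word_given_topic / p_word))

# Hypothetical values: (p(word | topic), p(word) in the whole corpus)
terms = {
    "latvijas": (0.05, 0.04),    # frequent in the topic, but frequent everywhere
    "policijas": (0.02, 0.004),  # rarer overall, concentrated in this topic
}

def rank(lam):
    """Order the terms from most to least relevant for the given lambda."""
    return sorted(terms, key=lambda word: relevance(*terms[word], lam), reverse=True)

print(rank(1.0))  # ranking by probability within the topic
print(rank(0.0))  # ranking by how distinctive the term is for the topic
```

Intermediate values of λ blend the two rankings, which is why moving the slider reshuffles the term list.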

In [None]:
# most likely we do not have pyLDAvis visualization library so we will install it
!pip install pyLDAvis

In [None]:
# later versions of pyLDAvis do not play well with Colab
# https://stackoverflow.com/questions/66096149/pyldavis-visualization-from-gensim-not-displaying-the-result-in-google-colab

In [None]:
import pyLDAvis
# you can try installing an older version
# !pip install pyLDAvis==2.1.2

In [43]:
pyLDAvis.__version__

'3.3.1'

In [None]:
# import pyLDAvis
import pyLDAvis.gensim_models as gensimvis
pyLDAvis.enable_notebook()

# feed the LDA model into the pyLDAvis instance
# model is our LDA model
# bow_corpus is our list of bag of words for each document
# again dictionary is our mapping of ids to actual words 
lda_viz = gensimvis.prepare(model, bow_corpus, dictionary)


In [None]:
# Export this visualization as an HTML file
# An internet connection is still required to view the HTML
p = gensimvis.prepare(model, bow_corpus, dictionary)
pyLDAvis.save_html(p, 'my_visualization.html')
# there are other options such as save_json and some other ones

In [48]:
type(p)

pyLDAvis._prepare.PreparedData

In [49]:
p # so you can in fact run inside colab

In [None]:
# let's try creating a model with 10 topics
model_10 = gensim.models.LdaModel(
    corpus=bow_corpus, # list of bags of words, one per document
    id2word=dictionary, # dictionary mapping ids to words
    num_topics=10, # a hyperparameter we can adjust
    passes=passes # more passes take longer; 5 is a reasonable starting compromise
)

In [52]:
coherence_model_lda_10 = CoherenceModel(
    model=model_10,
    corpus=bow_corpus,
    dictionary=dictionary, 
    coherence='u_mass'
)

# Compute Coherence Score using UMass
coherence_lda_10 = coherence_model_lda_10.get_coherence()
print('\nCoherence Score: ', coherence_lda_10)


Coherence Score:  -6.056574221505013


In [None]:
model_20 = gensim.models.LdaModel(
    corpus=bow_corpus, # list of bags of words, one per document
    id2word=dictionary, # dictionary mapping ids to words
    num_topics=20, # a hyperparameter we can adjust
    passes=passes # more passes take longer; 5 is a reasonable starting compromise
)

In [None]:
model_5 = gensim.models.LdaModel(
    corpus=bow_corpus, # list of bags of words, one per document
    id2word=dictionary, # dictionary mapping ids to words
    num_topics=5, # a hyperparameter we can adjust
    passes=passes # more passes take longer; 5 is a reasonable starting compromise
)
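
Rather than training each model in its own cell, the comparison can be wrapped in a small loop. The sketch below is one possible shape for it, with `compare_topic_counts` as a hypothetical helper. To stay runnable without retraining, it uses stand-in callables and the two u_mass scores recorded earlier in this notebook; the comments mark where the real gensim calls would go.

```python
def compare_topic_counts(candidates, train, score):
    """Train one model per candidate topic count and return (best_k, scores)."""
    scores = {}
    for k in candidates:
        scores[k] = score(train(k))
    best_k = max(scores, key=scores.get)  # a higher coherence score is better
    return best_k, scores

# u_mass scores printed earlier for the 7-topic and 10-topic models
recorded_scores = {7: -4.75814787728133, 10: -6.056574221505013}

best_k, scores = compare_topic_counts(
    candidates=[7, 10],
    # stand-in; in practice: gensim.models.LdaModel(..., num_topics=k)
    train=lambda k: k,
    # stand-in; in practice: CoherenceModel(model=model, ...).get_coherence()
    score=lambda model: recorded_scores[model],
)
print(best_k, scores)
```

A coherence score is only one signal, so it is still worth reading the topic word lists before settling on a topic count.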

## Prepare a topic model on your own corpus

In [None]:
# Your code goes here

## Assignment - Day 2

Submit the following for Day 2:
1. my_wordcloud.png from Day 2 - Session 2
2. topic-modeling-for-custom-data.ipynb (with your extra code)
3. my_visualization.html - with visualization for YOUR corpus

Assignment is due Thursday July 28th, 2022 21:00 GMT+2 (Riga time).


[Submit Assignment](https://forms.gle/cbBP4LVXNbdMFtfZ8)

Note: the form requires a Gmail account. If you do not have one, you can email your submission directly to valdis.s.coding at gmail com