## Topic Modelling

Topic models are statistical models that uncover hidden structure in a collection of texts. In topic modelling, we build clusters of words rather than clusters of texts. Each text can then be described as a mixture of all the topics, with a specific weight for each topic.

`Latent Dirichlet allocation (LDA)` is a bag-of-words model: it treats every text as a collection of words and pays no attention to grammar or word order. A document is modelled as a probability distribution over topics, and each topic is in turn a probability distribution over words.
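
To make the "mixture of topics" idea concrete, here is a minimal sketch in plain Python. The topic names and all probabilities are hypothetical toy values, not output from any model in this notebook. It shows how a document's word distribution follows from its topic weights and each topic's word distribution.

```python
# Hypothetical toy numbers, chosen only for illustration.
# Each topic is a probability distribution over the vocabulary.
topic_word = {
    "bushfire": {"smoke": 0.5, "air": 0.3, "virus": 0.1, "case": 0.1},
    "pandemic": {"smoke": 0.05, "air": 0.05, "virus": 0.5, "case": 0.4},
}

# Each document is a probability distribution over topics.
doc_topic = {"bushfire": 0.8, "pandemic": 0.2}


def word_probability(word, doc_topic, topic_word):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(
        weight * topic_word[topic].get(word, 0.0)
        for topic, weight in doc_topic.items()
    )


vocabulary = sorted({w for dist in topic_word.values() for w in dist})
doc_word = {w: word_probability(w, doc_topic, topic_word) for w in vocabulary}
print(doc_word)
```

Because both levels are proper probability distributions, the resulting word probabilities for the document also sum to one.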

The dataset consists of articles gathered from various news sites that contain the term ‘Monash University’.

### Importing data

In [1]:
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize    
from nltk.tokenize import wordpunct_tokenize
from nltk.stem import WordNetLemmatizer
import pandas as pd
import numpy as np
import spacy
import random
import string
import math
import re

In [2]:
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

In [3]:
# Load the crawled articles and clean the 'body' column.
df = pd.read_csv('Monash_crawled.csv')
# Strip punctuation characters from every article.
df['body'] = df['body'].apply(lambda doc: "".join(ch for ch in doc if ch not in string.punctuation))
# Remove newline characters.
df['body'] = df['body'].apply(lambda doc: doc.replace("\n", ""))

In [4]:
# Collapse runs of whitespace into single spaces.
df['body'] = df['body'].apply(lambda doc: " ".join(doc.split()))

In [5]:
# docs = docs.apply(lambda x: ','.join(map(str, x)))
docs = df['body'].tolist()
print(len(docs))
print(docs[0][0:500])

366
Canberra has experienced its worst air quality on record as bushfire smoke became trapped by atmospheric conditions and residents were told to stay indoors and brace for more smog in the coming days The ACTas acting chief health officer Dr Paul Dugdale said the smoke was the worst since the 2003 bushfires and was acertainly the worsta since air quality monitoring started in the city 15 years ago Air quality index readings in Canberra city were at 3463 on Wednesday afternoon according to the ACT 


In [6]:
# Define a lemmatizing tokenizer using spaCy, to be used later while tokenizing the documents.
class LemmaTokenizerSpacy(object):        
    def __call__(self,doc):
        trydoc = nlp(doc)
        return [token.lemma_ for token in trydoc]

In [7]:
# Split the documents into tokens.
tokenizer = LemmaTokenizerSpacy()
for idx in range(len(docs)):
    docs[idx] = docs[idx].lower()  # Convert to lowercase.
#     docs[idx] = ["".join( j for j in i if j not in string.punctuation) for i in  docs[idx]]
    docs[idx] = tokenizer(docs[idx])  # Split into words.

docs = [[token for token in doc if token not in nlp.Defaults.stop_words] for doc in docs]

# Remove numbers, but not words that contain numbers.
docs = [[token for token in doc if not token.isnumeric()] for doc in docs]

# Remove words that are only one character.
docs = [[token for token in doc if len(token) > 1] for doc in docs]

In [8]:
docs[0]

['canberra',
 'experience',
 '-PRON-',
 'bad',
 'air',
 'quality',
 'record',
 'bushfire',
 'smoke',
 'trap',
 'atmospheric',
 'condition',
 'resident',
 'tell',
 'stay',
 'indoor',
 'brace',
 'smog',
 'come',
 'day',
 'acta',
 'act',
 'chief',
 'health',
 'officer',
 'dr',
 'paul',
 'dugdale',
 'smoke',
 'bad',
 'bushfire',
 'acertainly',
 'worsta',
 'air',
 'quality',
 'monitoring',
 'start',
 'city',
 'year',
 'ago',
 'air',
 'quality',
 'index',
 'reading',
 'canberra',
 'city',
 'wednesday',
 'afternoon',
 'accord',
 'act',
 'health',
 'website',
 'rating',
 'consider',
 'hazardous',
 'suburb',
 'monash',
 'level',
 'florey',
 'act',
 'health',
 'spokesperson',
 'aqi',
 'read',
 'fine',
 'particle',
 'peak',
 'a.m.',
 'wednesday',
 'monash',
 'monitoring',
 'site',
 'canberrabased',
 'university',
 'nsw',
 'climate',
 'scientist',
 'dr',
 'sophie',
 'lewis',
 'city',
 '-PRON-',
 'twoyearold',
 'daughter',
 'condition',
 'alike',
 'experience',
 'beforea',
 '-PRON-',
 'plan',
 'lea

### Model 1

1. Adding bigrams with min_count = 20, i.e. only bigrams that occur at least 20 times in the corpus are considered, to the existing tokens.
2. Removing rare and common tokens using the filter_extremes() method provided by gensim. The no_below and no_above parameters are set to 20 and 50% respectively, so words that occur in fewer than 20 documents or in more than 50% of the documents are filtered out.
3. Converting the tokenized documents into a bag-of-words representation using the doc2bow() method provided by gensim.

In [9]:
from gensim.models import Phrases

# Add bigrams to docs (only ones that appear 20 times or more).
bigram = Phrases(docs, min_count=20)
for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        if '_' in token:
            # Token is a bigram, add to document.
            docs[idx].append(token)
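
Note that min_count alone does not decide which bigrams are kept: Phrases also scores each candidate and compares the score to a threshold. The sketch below reproduces the default scoring formula as I understand it from the gensim documentation, (bigram_count - min_count) / count_a / count_b * vocab_size, with a threshold of 10.0. The function name and all counts are hypothetical and only illustrate the idea.

```python
def default_bigram_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score a candidate bigram "a b" from raw corpus counts.

    Assumed to mirror gensim's documented default scorer; this is a sketch,
    not gensim's own code.
    """
    return (count_ab - min_count) / count_a / count_b * vocab_size


THRESHOLD = 10.0  # assumed default threshold of Phrases

# Two fairly rare words that almost always appear together.
score_rare = default_bigram_score(
    count_a=60, count_b=50, count_ab=45, vocab_size=5000, min_count=20
)

# Two very common words that appear together equally often.
score_common = default_bigram_score(
    count_a=2000, count_b=1500, count_ab=45, vocab_size=5000, min_count=20
)

print(score_rare > THRESHOLD, score_common > THRESHOLD)
```

Under these assumptions, the pair of rare words is merged into a bigram while the pair of common words is not, even though both pairs co-occur the same number of times.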

In [10]:
# Remove rare and common tokens.
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)

# Filter out words that occur in fewer than 20 documents, or in more than 50% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.5)

In [11]:
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]
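
For intuition, the filtering and bag-of-words steps can be approximated in plain Python. This is a simplified sketch, not gensim's implementation: the helper names and the toy documents are hypothetical, and the rule "keep a token if its document frequency is between no_below and no_above * number of documents" is my reading of filter_extremes().

```python
from collections import Counter


def document_frequencies(docs):
    """Count in how many documents each token appears."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df


def filter_vocabulary(docs, no_below, no_above):
    """Keep tokens that are neither too rare nor too common."""
    df = document_frequencies(docs)
    max_docs = no_above * len(docs)
    kept = sorted(tok for tok, n in df.items() if no_below <= n <= max_docs)
    return {tok: idx for idx, tok in enumerate(kept)}


def to_bow(doc, token2id):
    """Represent a document as sorted (token_id, count) pairs."""
    counts = Counter(tok for tok in doc if tok in token2id)
    return sorted((token2id[tok], n) for tok, n in counts.items())


toy_docs = [
    ["smoke", "air", "air", "monash"],
    ["virus", "case", "monash"],
    ["virus", "smoke", "monash"],
    ["cruise", "virus", "monash"],
]
token2id = filter_vocabulary(toy_docs, no_below=2, no_above=0.75)
bow = [to_bow(doc, token2id) for doc in toy_docs]
print(token2id)
print(bow)
```

In this toy corpus, "monash" appears in every document and is therefore dropped as too common, while tokens that appear in only one document are dropped as too rare.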

In [12]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 1086
Number of documents: 366


Next, the LDA model is prepared with the following training parameters:
1. NUM_TOPICS = 10, the number of topics to divide the corpus into.
2. chunksize = 2000, the number of documents to process in one go.
3. passes = 30, also known as epochs, the number of times the model is trained on the entire corpus.
4. iterations = 500, which controls how many times the inference loop is repeated over each document.
5. alpha = 'auto', which lets the model learn the prior on the per-document topic distribution from the data.
6. eta = 'auto', which lets the model learn the prior on the per-topic word distribution from the data.
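
As a rough worked example of how chunksize and passes interact, the sketch below counts model updates under the assumption that the model is updated once per chunk (which, as far as I know, is the default behaviour of gensim's single-core LdaModel). The helper name is hypothetical.

```python
import math


def count_model_updates(num_docs, chunksize, passes):
    """Return (chunks per pass, total updates), assuming one update per chunk."""
    chunks_per_pass = math.ceil(num_docs / chunksize)
    return chunks_per_pass, chunks_per_pass * passes


# Values used in this notebook: 366 documents, chunksize 2000, 30 passes.
chunks_per_pass, total_updates = count_model_updates(
    num_docs=366, chunksize=2000, passes=30
)
print(chunks_per_pass, total_updates)
```

Since the corpus has fewer documents than chunksize, every pass processes the whole corpus as a single chunk, so the number of updates equals the number of passes under this assumption.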

In [13]:
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
NUM_TOPICS = 10
chunksize = 2000
passes = 30
iterations = 500
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index-to-word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=NUM_TOPICS,
    passes=passes,
    eval_every=eval_every
)
outputfile = f'model{NUM_TOPICS}.gensim'
print("Saving model in " + outputfile)
print("")
model.save(outputfile)

Saving model in model10.gensim



In [14]:
top_topics = model.top_topics(corpus)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / NUM_TOPICS
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(model.print_topics())

Average topic coherence: -0.9302.
[(0,
  '0.033*"student" + 0.017*"school" + 0.014*"ban" + 0.012*"travel" + '
  '0.010*"home" + 0.010*"close" + 0.009*"sydney" + 0.008*"leave" + '
  '0.008*"work" + 0.008*"state"'),
 (1,
  '0.039*"ship" + 0.030*"cruise" + 0.024*"passenger" + 0.022*"princess" + '
  '0.020*"diamond" + 0.020*"cruise_ship" + 0.018*"diamond_princess" + '
  '0.017*"quarantine" + 0.017*"japan" + 0.015*"test"'),
 (2,
  '0.025*"woman" + 0.018*"death" + 0.016*"report" + 0.015*"wuhan" + '
  '0.014*"chinese" + 0.013*"number" + 0.012*"saturday" + 0.011*"hong" + '
  '0.011*"kong" + 0.011*"hong_kong"'),
 (3,
  '0.030*"pandemic" + 0.018*"cent" + 0.015*"care" + 0.014*"covid19" + '
  '0.013*"need" + 0.013*"minister" + 0.012*"professor" + 0.012*"thursday" + '
  '0.011*"disease" + 0.011*"plan"'),
 (4,
  '0.021*"use" + 0.016*"area" + 0.015*"study" + 0.011*"research" + '
  '0.010*"patient" + 0.010*"cell" + 0.009*"woman" + 0.008*"researcher" + '
  '0.007*"find" + 0.007*"work"'),
 (5,
  '0.028*

In [15]:
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(model, corpus, dictionary, sort_topics=True)
pyLDAvis.display(lda_display)

### Model 2

1. Added bigrams with min_count = 30.
2. Removed rare and common tokens, using the filter_extremes() provided by gensim. Set the no_below and no_above parameters as 20 and 60% respectively.

In [16]:
# Remove rare and common tokens.
from gensim.corpora import Dictionary

# Create a dictionary representation of the documents.
dictionary = Dictionary(docs)

# Filter out words that occur in fewer than 20 documents, or in more than 60% of the documents.
dictionary.filter_extremes(no_below=20, no_above=0.6)


In [17]:
# Bag-of-words representation of the documents.
corpus = [dictionary.doc2bow(doc) for doc in docs]


In [18]:
print('Number of unique tokens: %d' % len(dictionary))
print('Number of documents: %d' % len(corpus))

Number of unique tokens: 1102
Number of documents: 366



Next, the LDA model is prepared with the following training parameters:
1. NUM_TOPICS = 5.
2. passes = 60, i.e. the number of epochs.
3. iterations = 1200, which controls how many times the inference loop is repeated over each document.

In [19]:
# Train LDA model.
from gensim.models import LdaModel

# Set training parameters.
NUM_TOPICS = 5
chunksize = 2000
passes = 60
iterations = 1200
eval_every = None  # Don't evaluate model perplexity, takes too much time.

# Make an index-to-word dictionary.
temp = dictionary[0]  # This is only to "load" the dictionary.
id2word = dictionary.id2token

model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    chunksize=chunksize,
    alpha='auto',
    eta='auto',
    iterations=iterations,
    num_topics=NUM_TOPICS,
    passes=passes,
    eval_every=eval_every
)
outputfile = f'model{NUM_TOPICS}.gensim'
print("Saving model in " + outputfile)
print("")
model.save(outputfile)


Saving model in model5.gensim



In [20]:
top_topics = model.top_topics(corpus) #, num_words=20)

# Average topic coherence is the sum of topic coherences of all topics, divided by the number of topics.
avg_topic_coherence = sum([t[1] for t in top_topics]) / NUM_TOPICS
print('Average topic coherence: %.4f.' % avg_topic_coherence)

from pprint import pprint
pprint(model.print_topics())

Average topic coherence: -0.8192.
[(0,
  '0.039*"february" + 0.014*"ship" + 0.012*"passenger" + 0.011*"virus" + '
  '0.011*"flight" + 0.011*"cruise" + 0.011*"quarantine" + 0.010*"mask" + '
  '0.009*"japan" + 0.009*"wuhan"'),
 (1,
  '0.019*"fire" + 0.013*"school" + 0.011*"home" + 0.010*"state" + '
  '0.010*"bushfire" + 0.008*"close" + 0.008*"social" + 0.007*"sydney" + '
  '0.006*"melbourne" + 0.006*"work"'),
 (2,
  '0.026*"virus" + 0.020*"china" + 0.018*"case" + 0.015*"wuhan" + 0.011*"test" '
  '+ 0.010*"symptom" + 0.010*"spread" + 0.010*"confirm" + 0.010*"patient" + '
  '0.009*"hospital"'),
 (3,
  '0.029*"february" + 0.024*"student" + 0.018*"china" + 0.014*"ban" + '
  '0.013*"travel" + 0.013*"chinese" + 0.009*"country" + 0.009*"cent" + '
  '0.009*"week" + 0.009*"government"'),
 (4,
  '0.015*"use" + 0.012*"area" + 0.011*"study" + 0.009*"woman" + 0.009*"year" + '
  '0.008*"research" + 0.008*"time" + 0.007*"work" + 0.007*"cell" + '
  '0.007*"change"')]
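
The negative coherence values reported above are expected if, as I believe, top_topics() uses the UMass measure by default: it sums logarithms of ratios that are at most about one. The sketch below is a simplified version of that measure with +1 smoothing on a hypothetical toy corpus. gensim's exact smoothing and aggregation may differ, so the numbers are not comparable to the ones in this notebook.

```python
import math


def umass_coherence(top_words, docs):
    """Simplified UMass coherence for one topic.

    Sums log((D(w_i, w_j) + 1) / D(w_j)) over ordered pairs of top words,
    where D counts the documents containing all of the given words.
    Assumes every top word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        return sum(all(w in d for w in words) for d in doc_sets)

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            co_occurrences = doc_count(top_words[i], top_words[j])
            score += math.log((co_occurrences + 1) / doc_count(top_words[j]))
    return score


toy_docs = [
    ["ship", "cruise", "passenger"],
    ["ship", "cruise"],
    ["ship", "fire"],
    ["fire", "school"],
]
coherent = umass_coherence(["ship", "cruise"], toy_docs)
incoherent = umass_coherence(["ship", "school"], toy_docs)
print(coherent, incoherent)
```

Words that tend to appear in the same documents give a score closer to zero, while words that never co-occur pull the score further into the negatives.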



In [21]:
import pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(model, corpus, dictionary, sort_topics=True)
pyLDAvis.display(lda_display)


### Analysis

1. Most often, the topics with a similar context (in this case, the coronavirus-related topics) ended up on one side of the PC1 axis, while the other two were spread out on the opposite side. This suggests that topics on opposite sides of the PC1 axis are contextually different.
2. Throughout my analysis, I found just a single term that referred to Monash University, even though every document in the dataset mentions the term “Monash University”. Since I am removing words that appear in more than 60% of the documents, the terms “monash” and “university” do not appear anywhere on their own, but “monash_university” survives as a term because I am adding bigrams (with min_count = 30). So, I believe the university is mostly being mentioned because of its research developments.
3. I believe the biggest advantage of topic modelling is that, just by following a proper procedure for processing the dataset and applying a little intuition, we can divide a huge dataset covering a variety of subjects into multiple topics with similar contexts. In the case of the term ‘Monash University’, I can say that although it is present throughout the dataset, it is not a defining term for the topics. So, I think this model does quite a good job of separating the different topics.
4. The biggest shortcoming is that we have to select a suitable value for the number of topics: if it is too low, there will hardly be any discrimination between the topics, and if it is too high, there will be too many topics with a lot of overlap, so we will not be able to distinguish them properly. Also, at the end of the day, it is up to the person looking at the topics to decide whether the words clustered together in a particular topic actually make sense.
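
One common way to reduce the guesswork in point 4 is to train a model for several candidate topic counts and compare their coherence scores. The sketch below only shows the selection step. The helper name is hypothetical, and the scores are the two averages reported in this notebook. Since the two models also differ in no_above, passes and iterations, this is not a clean comparison of the number of topics alone, and a human still has to judge whether the topics make sense.

```python
def pick_num_topics(candidates, score_fn):
    """Score every candidate topic count and return the best one with all scores.

    score_fn is any callable mapping a topic count to a coherence score,
    where higher (closer to zero for negative scores) is better.
    """
    scores = {k: score_fn(k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Average topic coherence reported for Model 1 (10 topics) and Model 2 (5 topics).
reported = {10: -0.9302, 5: -0.8192}
best_k, scores = pick_num_topics(reported.keys(), reported.get)
print(best_k, scores)
```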