# Topic Modeling
In this exercise, we will do topic modeling with gensim. Use the [topics and transformations tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html) as a reference.

In [1]:
import os
from collections import defaultdict

import gensim
import nltk

For tokenizing words and stopword removal, download the NLTK punkt tokenizer and stopwords list.

In [2]:
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/jovyan/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

First, we load the [Lee Background Corpus](https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf) included with gensim, which contains 300 news articles from the Australian Broadcasting Corporation.

In [3]:
from gensim.test.utils import datapath
train_file = datapath('lee_background.cor')
# read one article per line; the context manager closes the file afterwards
with open(train_file) as f:
    articles_orig = f.read().splitlines()

Preprocess the text by lowercasing, removing stopwords, stemming, and removing rare words.

In [4]:
# define stopword list
stopwords = set(nltk.corpus.stopwords.words('english'))
stopwords = stopwords | {'\"', '\'', '\'\'', '`', '``', '\'s'}

# initialize stemmer
stemmer = nltk.stem.PorterStemmer()

def preprocess(article):
    # tokenize
    article = nltk.word_tokenize(article)

    # lowercase all words
    article = [word.lower() for word in article]

    # remove stopwords
    article = [word for word in article if word not in stopwords]

    # optional: stem
    article = [stemmer.stem(word) for word in article]
    return article

articles = [preprocess(article) for article in articles_orig]

# create the dictionary and corpus objects that gensim uses for topic modeling
dictionary = gensim.corpora.Dictionary(articles)

# remove words that occur in fewer than 2 documents or in more than 50% of documents
dictionary.filter_extremes(no_below=2, no_above=0.5)
temp = dictionary[0]  # accessing an item forces gensim to build the id-to-token mapping
corpus_bow = [dictionary.doc2bow(article) for article in articles]
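
To make the effect of the `no_below` and `no_above` thresholds concrete, here is a minimal pure-Python sketch of document-frequency filtering. It is not gensim's implementation (for instance, it ignores the `keep_n` limit), and the function name `filter_by_document_frequency` and the toy documents are made up for illustration.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below=2, no_above=0.5):
    """Keep tokens found in at least `no_below` documents and in at most
    a fraction `no_above` of all documents (hypothetical helper)."""
    # count each token once per document, not once per occurrence
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = int(no_above * len(docs))
    keep = {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}
    return [[tok for tok in doc if tok in keep] for doc in docs]

# made-up toy documents, already tokenized and stemmed
toy_docs = [
    ["storm", "sydney", "power"],
    ["storm", "fire", "sydney"],
    ["fire", "bank", "sydney"],
    ["bank", "storm", "sydney", "qanta"],
]
filtered_docs = filter_by_document_frequency(toy_docs)
print(filtered_docs)
```

In this toy corpus, "power" and "qanta" are dropped as too rare (one document each), while "storm" and "sydney" are dropped as too frequent (more than half of the documents).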


Now we create a TF-IDF model and transform the corpus into TF-IDF vectors.

In [5]:
tfidf_model = gensim.models.TfidfModel(corpus_bow)
corpus_tfidf = tfidf_model[corpus_bow] 
print(corpus_tfidf[0])

[(0, 0.045055071138122856), (1, 0.048886979465828866), (2, 0.09375399876622974), (3, 0.06781504448328635), (4, 0.08078452162475803), (5, 0.10134065654795091), (6, 0.058730707569876556), (7, 0.07269531841603805), (8, 0.03468336276559477), (9, 0.07660928255691886), (10, 0.06936672553118954), (11, 0.06363980541544716), (12, 0.14639572768607376), (13, 0.06561120606131572), (14, 0.05734407327660557), (15, 0.21959359152911065), (16, 0.0861673409845086), (17, 0.03352561206466457), (18, 0.05163007712346614), (19, 0.0861673409845086), (20, 0.04649508920613628), (21, 0.08078452162475803), (22, 0.058730707569876556), (23, 0.044374596135133865), (24, 0.048886979465828866), (25, 0.053711220021490196), (26, 0.04725890956009348), (27, 0.06185645660730057), (28, 0.05163007712346614), (29, 0.0861673409845086), (30, 0.056053147633726), (31, 0.06185645660730057), (32, 0.14639572768607376), (33, 0.04725890956009348), (34, 0.06022838670156517), (35, 0.06363980541544716), (36, 0.07660928255691886), (37, 0.0
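
The weights above can be understood with a small hand-rolled version of TF-IDF. The sketch below assumes gensim's default weighting, that is, the raw term count multiplied by `log2(N / df)` and followed by L2 normalization of each document vector. The function name `tfidf_vectors` and the toy corpus are hypothetical, and the code is an illustration rather than a replacement for `TfidfModel`.

```python
import math
from collections import Counter

def tfidf_vectors(bow_corpus):
    """bow_corpus: list of documents, each a list of (token_id, count) pairs."""
    num_docs = len(bow_corpus)
    # document frequency: in how many documents each token id appears
    doc_freq = Counter(token_id for doc in bow_corpus for token_id, _ in doc)
    vectors = []
    for doc in bow_corpus:
        # term count times inverse document frequency (log base 2)
        weights = [(tid, count * math.log2(num_docs / doc_freq[tid]))
                   for tid, count in doc]
        # tokens that occur in every document get weight 0 and are dropped
        weights = [(tid, w) for tid, w in weights if w > 0.0]
        # scale each document vector to unit Euclidean length
        norm = math.sqrt(sum(w * w for _, w in weights))
        vectors.append([(tid, w / norm) for tid, w in weights] if norm else [])
    return vectors

# made-up bag-of-words corpus with three documents and three token ids
toy_bow = [
    [(0, 1), (1, 2)],
    [(0, 1), (2, 1)],
    [(0, 3), (1, 1), (2, 1)],
]
toy_tfidf = tfidf_vectors(toy_bow)
for vector in toy_tfidf:
    print(vector)
```

Token 0 appears in all three toy documents, so its inverse document frequency is zero and it vanishes from every vector, which is the behaviour that makes TF-IDF down-weight uninformative words.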

Now we train an [LDA model](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html) with 10 topics on the TF-IDF corpus. Save it to a variable `model_lda`.

In [6]:
num_topics = 10

model_lda = gensim.models.LdaModel(
    corpus=corpus_tfidf,
    id2word=dictionary,
    num_topics=num_topics,
)

Let's inspect 5 of the model's topics.

In [7]:
model_lda.print_topics(5)

[(0,
  '0.002*"race" + 0.002*"australia" + 0.002*"storm" + 0.002*"bill" + 0.002*"south" + 0.002*"cancer" + 0.002*"govern" + 0.001*"asic" + 0.001*"bank" + 0.001*"airlin"'),
 (8,
  '0.002*"best" + 0.002*"team" + 0.002*"win" + 0.002*"hill" + 0.002*"cut" + 0.002*"year" + 0.002*"bank" + 0.002*"reid" + 0.002*"afghanistan" + 0.001*"oil"'),
 (2,
  '0.002*"arrest" + 0.002*"hospit" + 0.002*"new" + 0.002*"australia" + 0.002*"road" + 0.002*"mr" + 0.002*"peopl" + 0.002*"death" + 0.002*"die" + 0.002*"whether"'),
 (6,
  '0.003*"fire" + 0.002*"union" + 0.002*"india" + 0.002*"isra" + 0.002*"palestinian" + 0.002*"call" + 0.002*"commiss" + 0.002*"mr" + 0.002*"depart" + 0.002*"lee"'),
 (4,
  '0.004*"palestinian" + 0.003*"isra" + 0.003*"arafat" + 0.002*"farmer" + 0.002*"israel" + 0.002*"worker" + 0.002*"union" + 0.002*"qanta" + 0.002*"hama" + 0.002*"economi"')]

We see 5 of the 10 topics. LDA topics have no natural ordering, so the subset that `print_topics(5)` returns is arbitrary and may change between training runs. For each topic, the 10 most probable words are shown, each with its probability under that topic.
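
The topic strings in the output follow a `probability*"word"` pattern. As a rough illustration of how such a string can be derived from a topic's word distribution, the sketch below sorts words by probability and formats the top entries. The helper `format_topic` and the probabilities in `toy_topic` are invented for this example and are not taken from the trained model.

```python
def format_topic(word_probs, topn=10):
    """Format the `topn` most probable words of a topic (hypothetical helper)."""
    # sort words by descending probability and keep the first `topn`
    top = sorted(word_probs.items(), key=lambda item: -item[1])[:topn]
    return " + ".join('%.3f*"%s"' % (prob, word) for word, prob in top)

# made-up word probabilities for a single topic
toy_topic = {"palestinian": 0.0041, "isra": 0.0032, "arafat": 0.0029, "farmer": 0.0018}
topic_string = format_topic(toy_topic, topn=3)
print(topic_string)
```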

## Document Similarity
We now use our LDA model to measure how similar new documents (*queries*) are to the documents in our collection.

First, create an index of the news articles in our corpus. Use the `MatrixSimilarity` transformation as described in gensim's [similarity queries tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html).

In [8]:
index = gensim.similarities.MatrixSimilarity(model_lda[corpus_tfidf])
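
`MatrixSimilarity` scores a query against every indexed document by cosine similarity. The following pure-Python sketch shows that computation on made-up topic vectors with three topics. The names `cosine_similarity`, `toy_index`, and `toy_query` are illustrative only, and the real index works on a dense matrix rather than Python lists.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define similarity with an all-zero vector as 0
    return dot / (norm_a * norm_b)

# made-up topic distributions for three documents and one query
toy_index = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.5, 0.5, 0.0]]
toy_query = [0.8, 0.2, 0.0]

toy_sims = [cosine_similarity(toy_query, doc) for doc in toy_index]
# document positions ordered from most to least similar
ranking = sorted(range(len(toy_sims)), key=lambda i: -toy_sims[i])
print(ranking, toy_sims)
```

Because the score depends only on the topic proportions, two documents about different subjects can still score highly when the model assigns them to the same dominant topic.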

Now, write a function that takes a query string as input and returns the LDA representation for it. Make sure to apply the same preprocessing as we did to the documents.

In [18]:
def get_lda_representation(query):
    query = preprocess(query)
    query_bow = dictionary.doc2bow(query)
    query_tfidf = tfidf_model[query_bow]
    return model_lda[query_tfidf]
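
The value returned by `model_lda[...]` is a sparse list of `(topic_id, probability)` pairs, and topics whose probability falls below the model's `minimum_probability` threshold are left out. When inspecting or comparing representations by hand, it can help to expand them into a fixed-length vector. The helper `to_dense` and the example values below are hypothetical.

```python
def to_dense(sparse_vector, num_topics):
    """Expand [(topic_id, prob), ...] into a list of length `num_topics`."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_vector:
        dense[topic_id] = prob
    return dense

# made-up sparse representation: only topics 2 and 7 are present
toy_sparse = [(2, 0.35), (7, 0.6)]
dense = to_dense(toy_sparse, num_topics=10)
print(dense)
```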
    

Print the 5 most similar documents, together with their similarity scores, using the index you created above.

In [19]:
sims = index[get_lda_representation("Sydney is a harbour city")]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_position, doc_score in sims[:5]:
    print(doc_score, articles[doc_position])

0.9509123 ['new', 'south', 'wale', 'state', 'emerg', 'servic', '(', 'se', ')', 'say', 'receiv', '5,000', 'call', 'help', 'wake', 'monday', 'fierc', 'storm', '.', 'natur', 'disast', 'area', 'declar', 'throughout', 'sydney', 'surround', 'area', 'part', 'state', 'north-west', '.', 'sydney', ',', '2,000', 'home', ',', 'mainli', 'northern', 'suburb', ',', 'remain', 'without', 'power', '.', 'se', 'spokeswoman', 'laura', 'goodin', 'say', 'sever', 'hundr', 'volunt', 'back', 'field', 'morn', '.', "'ve", '5,000', 'call', 'help', "'ve", 'complet', 'two-third', '.', "'ve", '800', 'volunt', 'field', 'help', 'royal', 'fire', 'servic', 'new', 'south', 'wale', 'fire', 'brigad', "'re", 'expect', 'job', 'complet', 'friday', ',', 'ms', 'goodin', 'said', '.', 'extens', 'storm', 'damag', 'prompt', 'warn', 'peopl', 'fals', 'claim', 'work', 'se', '.', 'warn', ',', 'fair', 'trade', 'minist', 'john', 'aquilina', ',', 'follow', 'report', 'suburb', 'hornsbi', 'peopl', 'claim', 'work', 'se', 'ask', 'payment', 'st

Run your code again, now training an LDA model with 100 topics. Do you see a qualitative difference in the top-5 most similar documents?
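
One way to support the qualitative comparison with a number is to measure how much the two top-5 lists overlap. The sketch below computes the Jaccard overlap of two rankings. The function `top_k_overlap` and the document positions in the two example rankings are made up and do not come from an actual run.

```python
def top_k_overlap(ranking_a, ranking_b, k=5):
    """Jaccard overlap between the first `k` entries of two rankings."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    union = top_a | top_b
    if not union:
        return 0.0  # both rankings are empty
    return len(top_a & top_b) / len(union)

# made-up document positions, ordered from most to least similar
ranking_10_topics = [12, 48, 7, 150, 33, 201]
ranking_100_topics = [48, 12, 90, 7, 260, 33]

overlap = top_k_overlap(ranking_10_topics, ranking_100_topics, k=5)
print(overlap)
```

A value near 1 means both models retrieve largely the same documents, while a value near 0 means the choice of topic count changes the results substantially.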