# Topic Modeling
In this exercise, we will do topic modeling with gensim. Use the [topics and transformations tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html) as a reference.

In [1]:
import os
from collections import defaultdict

import gensim
import nltk

For tokenizing words and stopword removal, download the NLTK punkt tokenizer and stopwords list.

In [2]:
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Error loading punkt_tab: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1006)>
[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1006)>


False

First, we load the [Lee Background Corpus](https://hekyll.services.adelaide.edu.au/dspace/bitstream/2440/28910/1/hdl_28910.pdf) included with gensim, which contains 300 news articles from the Australian Broadcasting Corporation.

In [3]:
from gensim.test.utils import datapath
train_file = datapath('lee_background.cor')
# read one article per line; the context manager closes the file afterwards
with open(train_file) as f:
    articles_orig = f.read().splitlines()

Preprocess the text by tokenizing, lowercasing, and removing stopwords (stemming is optional). Then build the dictionary and drop words that are very rare or very common.

In [4]:
# define stopword list
stopwords = set(nltk.corpus.stopwords.words('english'))
stopwords = stopwords | {'\"', '\'', '\'\'', '`', '``', '\'s'}

# initialize stemmer
stemmer = nltk.stem.PorterStemmer()

def preprocess(article):
    # tokenize
    article = nltk.word_tokenize(article)

    # lowercase all words
    article = [word.lower() for word in article]

    # remove stopwords
    article = [word for word in article if word not in stopwords]

    # optional: stem
    # article = [stemmer.stem(word) for word in article]
    return article

articles = [preprocess(article) for article in articles_orig]

# create the dictionary and corpus objects that gensim uses for topic modeling
dictionary = gensim.corpora.Dictionary(articles)

# remove words that occur in fewer than 2 documents, or in more than 50% of documents
dictionary.filter_extremes(no_below=2, no_above=0.5)
temp = dictionary[0]  # index the dictionary once so its id-to-token mapping is built
corpus_bow = [dictionary.doc2bow(article) for article in articles]
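
To make the document-frequency filter above concrete, here is a minimal pure-Python sketch of the same idea. It is not gensim's implementation: the function name `filter_by_document_frequency` and the toy documents are hypothetical, the bounds are assumed to be inclusive, and gensim's additional `keep_n` cap is ignored.

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, no_below=2, no_above=0.5):
    """Return the tokens whose document frequency lies within the given bounds."""
    num_docs = len(tokenized_docs)
    # Count each token once per document (document frequency, not term frequency).
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    # Convert the fractional upper bound into an absolute number of documents.
    max_docs = int(no_above * num_docs)
    return {token for token, df in doc_freq.items() if no_below <= df <= max_docs}


# Hypothetical toy corpus: "strike" is in every document, "qantas" in two.
toy_docs = [
    ["qantas", "workers", "strike"],
    ["qantas", "unions", "strike"],
    ["test", "cricket", "strike"],
    ["fire", "river", "strike"],
]
kept = filter_by_document_frequency(toy_docs)
print(sorted(kept))
```

In this toy corpus only `qantas` survives: `strike` appears in more than half of the documents, and every other word appears in just one.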


Now we create a TF-IDF model and transform the corpus into TF-IDF vectors.

In [5]:
from gensim import models

tfidf = models.TfidfModel(corpus_bow)
corpus_tfidf = tfidf[corpus_bow]
print(corpus_tfidf[0])

[(0, 0.045163832296308125), (1, 0.049004990699027966), (2, 0.09398031720792203), (3, 0.06797874731615453), (4, 0.08637534553463992), (5, 0.10158528888120417), (6, 0.058872481173046734), (7, 0.045871696227162966), (8, 0.04660732651093343), (9, 0.03476708703034139), (10, 0.09174339245432593), (11, 0.06379342938648586), (12, 0.08097953226203827), (13, 0.08637534553463992), (14, 0.06576958891547403), (15, 0.05748249959948285), (16, 0.07679421433236962), (17, 0.09398031720792203), (18, 0.04197717742438698), (19, 0.06379342938648586), (20, 0.09398031720792203), (21, 0.07679421433236962), (22, 0.08097953226203827), (23, 0.058872481173046734), (24, 0.05497796237027076), (25, 0.05497796237027076), (26, 0.07337456058875615), (27, 0.05497796237027076), (28, 0.08637534553463992), (29, 0.058872481173046734), (30, 0.062005775644911734), (31, 0.08637534553463992), (32, 0.09398031720792203), (33, 0.04737299069698862), (34, 0.07048328454536662), (35, 0.09398031720792203), (36, 0.09398031720792203), (37
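
The weights above can be reproduced in spirit with a short hand-rolled computation. The sketch below assumes what I understand to be gensim's default scheme (raw term count times `log2(N / df)`, followed by L2 normalization of each document vector); the function name `tfidf_vectors` and the toy corpus are hypothetical, and it is meant as an illustration rather than a replacement for `TfidfModel`.

```python
import math
from collections import Counter


def tfidf_vectors(bow_corpus):
    """Turn a list of [(term_id, count), ...] documents into unit-length TF-IDF vectors."""
    num_docs = len(bow_corpus)
    # Document frequency: in how many documents does each term id occur?
    doc_freq = Counter(term_id for doc in bow_corpus for term_id, _ in doc)
    vectors = []
    for doc in bow_corpus:
        weights = [
            (term_id, count * math.log2(num_docs / doc_freq[term_id]))
            for term_id, count in doc
        ]
        # Terms that occur in every document get weight 0 and are dropped.
        weights = [(term_id, w) for term_id, w in weights if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in weights))
        vectors.append([(term_id, w / norm) for term_id, w in weights] if norm else [])
    return vectors


# Hypothetical bag-of-words corpus with four documents and four term ids.
toy_corpus = [
    [(0, 1), (1, 1)],
    [(0, 1), (2, 1)],
    [(1, 2), (3, 1)],
    [(3, 1)],
]
vectors = tfidf_vectors(toy_corpus)
for vec in vectors:
    print(vec)
```

Term 2 occurs in only one document, so it receives a larger weight in the second vector than term 0, which occurs in two.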

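Now we train a topic model on the TF-IDF corpus and save it to a variable `model_lda`. The cell below uses `models.LsiModel` (latent semantic indexing) with `num_topics=100`; to train an [LDA model](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html) with 10 topics instead, swap in `models.LdaModel` and set `num_topics=10`.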

In [6]:
model_lda = models.LsiModel(
    corpus_tfidf,
    id2word=dictionary,
    num_topics=100
)

Let's inspect the first 5 topics of our model.

In [7]:
model_lda.print_topics(5)

[(0,
  '-0.256*"palestinian" + -0.175*"arafat" + -0.163*"israeli" + -0.127*"mr" + -0.108*"israel" + -0.106*"afghanistan" + -0.103*"hamas" + -0.096*"us" + -0.094*"government" + -0.091*"attacks"'),
 (1,
  '0.404*"palestinian" + 0.277*"arafat" + 0.257*"israeli" + 0.170*"israel" + 0.167*"hamas" + 0.145*"gaza" + 0.126*"sharon" + 0.110*"suicide" + -0.099*"afghanistan" + -0.098*"qantas"'),
 (2,
  '0.250*"qantas" + -0.213*"afghanistan" + -0.209*"bin" + -0.206*"laden" + 0.198*"workers" + -0.161*"al" + -0.161*"qaeda" + -0.138*"tora" + -0.138*"bora" + 0.136*"industrial"'),
 (3,
  '-0.332*"qantas" + -0.256*"workers" + 0.251*"test" + -0.180*"industrial" + 0.164*"south" + -0.160*"unions" + -0.157*"maintenance" + 0.146*"africa" + -0.124*"dispute" + 0.116*"waugh"'),
 (4,
  '-0.232*"test" + -0.158*"qantas" + -0.139*"africa" + 0.135*"river" + 0.122*"fire" + -0.112*"waugh" + 0.112*"guides" + 0.109*"canyoning" + -0.108*"workers" + 0.104*"adventure"')]

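The output lists the first 5 topics. For each topic, the 10 words with the largest absolute weights are shown, together with their signed coefficient of "alignment" to the topic. The negative coefficients arise because the cell above uses `LsiModel`; an LDA model would report non-negative word probabilities instead.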

## Document Similarity
We now use our trained model (`model_lda`) to measure how similar new documents (*queries*) are to the documents in our collection.

First, create an index of the news articles in our corpus. Use the `MatrixSimilarity` transformation as described in gensim's [similarity queries tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html).

In [8]:
from gensim import similarities

index = similarities.MatrixSimilarity(model_lda[corpus_tfidf])
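
Querying such an index amounts to computing the cosine similarity between the query vector and every indexed document vector. The following is a minimal pure-Python sketch of that computation on sparse `(id, weight)` lists; the helper names `cosine_similarity` and `rank_documents` and the toy vectors are hypothetical, and gensim's actual implementation works on a dense matrix instead.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as [(id, weight), ...]."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(weight * b.get(idx, 0.0) for idx, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_n=5):
    """Return the top_n (document index, similarity) pairs, most similar first."""
    scores = [(idx, cosine_similarity(query_vec, doc)) for idx, doc in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical two-dimensional topic vectors for three documents.
toy_doc_vecs = [
    [(0, 1.0)],
    [(1, 1.0)],
    [(0, 1.0), (1, 1.0)],
]
toy_query = [(0, 2.0)]
ranking = rank_documents(toy_query, toy_doc_vecs, top_n=2)
print(ranking)
```

Because cosine similarity ignores vector length, the query `[(0, 2.0)]` matches the first toy document perfectly even though their magnitudes differ.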

Now, write a function that takes a query string as input and returns its representation in the topic space of `model_lda`. Make sure to apply the same preprocessing as we did to the documents.

In [9]:
query = "Who is president?"

def get_lda_vector(text):
    text = preprocess(text)
    bow = dictionary.doc2bow(text)
    tfidf_vector = tfidf[bow]
    lda_vector = model_lda[tfidf_vector]
    return lda_vector

query_vector = get_lda_vector(query)
print(query_vector)
    

[(0, -0.03213616814593185), (1, -0.014216068744702336), (2, -0.009263189325339068), (3, -0.003663589720304929), (4, 0.02087933745311699), (5, 0.020982025430939886), (6, -0.031368670536165356), (7, -0.015777024118179515), (8, 0.023397708297488433), (9, 0.000852916929276271), (10, -0.006852492871940254), (11, 0.005771291772283533), (12, -0.024219241788723797), (13, -0.02769962525342491), (14, 0.036966597742016695), (15, 0.027731261848940394), (16, 0.0012665565675291993), (17, 0.029555511338381156), (18, 0.01204054413599529), (19, 0.021656190078206076), (20, -0.018458663750750226), (21, -0.014238400445925203), (22, -0.0014801973703616016), (23, 0.04282279769695785), (24, -0.04428804167229525), (25, -0.01757267308809045), (26, -0.04823066004899685), (27, 0.008890126662761934), (28, -0.00928100229141298), (29, -0.024719086986037576), (30, -0.006015124731031865), (31, -0.033120883519865765), (32, -0.005031921733230446), (33, -0.04869540296059207), (34, 0.015470187494727774), (35, -0.03041269
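
The `doc2bow` step in the function above silently ignores any query token that is not in the dictionary, so stopwords and unseen words contribute nothing to the query vector. A minimal sketch of that behaviour, with a hypothetical helper `to_bag_of_words` and a hypothetical toy vocabulary standing in for the gensim dictionary:

```python
from collections import Counter


def to_bag_of_words(tokens, token2id):
    """Map tokens to sorted (id, count) pairs, skipping tokens missing from the vocabulary."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())


# Hypothetical vocabulary; the real one comes from the corpus dictionary.
toy_token2id = {"president": 0, "government": 1, "election": 2}
bow = to_bag_of_words(["president", "unknownword", "president", "election"], toy_token2id)
print(bow)  # [(0, 2), (2, 1)]
```

A query made up entirely of out-of-vocabulary words would therefore yield an empty bag-of-words vector.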

Print the top 5 most similar documents, together with their similarities, using your index created above.

In [10]:
document_similarities = index[query_vector]

sorted_documents = sorted(enumerate(document_similarities), key=lambda x: x[1], reverse=True)
for doc_idx, doc_similarity in sorted_documents[:5]:
    print(doc_similarity, articles_orig[doc_idx])


0.5046015 At least four people, including two policemen, have been killed during an attempted coup in Haiti overnight. Armed commandos had stormed the national palace in the Haitian capital after midnight, local time and seized control of radio communications equipment. The attackers, understood to be former members of the Haitian military, fired at security guards as they entered the palace - the official residence of President Jean Bertrand Aristide. But the President was at another home in the capital Port-au-Prince during the attack. It is understood some of the gunmen have been arrested and the Haitian Government says it is now back in control. President Aristide was deposed in a coup 10 years ago, but was returned to power in 1994 after a United States invasion. He was recently re-elected for five years. 
0.44464743 A director of a defunct Swiss company that organised a canyoning trip in 1999 that ended with 21 people dying, 14 of them Australians, has denied responsibility for t

Run your code again, now training the model with 100 topics (`num_topics=100`). Do you see a qualitative difference in the top-5 most similar documents compared with the 10-topic model?