![MSE Logo](https://moodle.msengineering.ch/pluginfile.php/1/core_admin/logocompact/300x300/1613732714/logo-mse.png "MSE Logo") 

# AnTeDe Lab 5: Latent Semantic Analysis with Gensim

## Objective
The goal of this lab is to perform LSA on a small corpus of news.  You will use the LSA word vectors to estimate word similarity, and then to perform ranked retrieval given a query. 

<font color='green'>Please answer the questions in green within this notebook, and submit the completed notebook under the corresponding homework on Moodle.</font>

In [1]:
import os
import nltk
import gensim
import pandas as pd

from gensim import models, corpora, similarities
from gensim.models import LsiModel, LdaModel, LdaMulticore

In [2]:
# Import TextProcessor.py from local directory structure
import sys

module_path = os.path.abspath(os.path.join('../week 1'))
if module_path not in sys.path:
    sys.path.append(module_path)

from TextPreprocessor import *

The data used in this lab is the same set of 300 Australian news articles that you used in Lab 4 on Document Representation.  It is a shortened version of the Lee Background Corpus [described here](http://www.socsci.uci.edu/~mdlee/lee_pincombe_welsh_document.PDF) and it is available with the **gensim** package that you installed.  The following code will load the documents into a Pandas dataframe.

In [3]:
# Code inspired from:
# https://github.com/bhargavvader/personal/blob/master/notebooks/text_analysis_tutorial/topic_modelling.ipynb

test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
# Read one document per line; the context manager closes the file afterwards
with open(lee_train_file) as f:
    text = f.read().splitlines()
data_df = pd.DataFrame({'text': text})

## Data preprocessing

You will first need to preprocess the data through the following stages:
1. tokenization
2. stopword removal
3. POS-based filtering (optional)
4. lemmatization or stemming (optional)
5. addition of bigrams to each document (optional)
6. filtering of infrequent words
7. inspection and filtering of frequent words

Use our in-house `TextPreprocessor.py` file, as explained in Lab 1.
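
As a rough illustration of the first two stages (tokenization and stopword removal), here is a minimal sketch in plain Python. The regular expression, the tiny `TOY_STOPWORDS` set and the `simple_preprocess` helper are assumptions made for this example only; they are not part of `TextPreprocessor.py`, which should still be used for the lab itself.

```python
import re

# Hypothetical, deliberately tiny stopword list (the lab uses NLTK's English list).
TOY_STOPWORDS = {"the", "a", "of", "to", "and", "in", "is", "has"}


def simple_preprocess(text, stopwords=TOY_STOPWORDS, min_length=2):
    """Lowercase, tokenize on alphabetic runs, drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords and len(t) >= min_length]


print(simple_preprocess("The union has rejected the offer of Qantas."))
```

Unlike the in-house preprocessor, this sketch does no lemmatization, so inflected forms such as "rejected" are kept as they are.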

Preprocessing stages / steps:

In [4]:
nltk.download('popular')

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /Users/davebrunner/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to
[nltk_data]    |     /Users/davebrunner/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to
[nltk_data]    |     /Users/davebrunner/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to
[nltk_data]    |     /Users/davebrunner/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to
[nltk_data]    |     /Users/davebrunner/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /Users/davebrunner/nltk_data...
[nltk_data]    |   Pa

True

In [5]:
# Please write here the preprocessing instructions
stopwords = set(nltk.corpus.stopwords.words('english'))
stopwords.update(
    [',', '.', '!', '?', ';', ':', '(', ')', '[', ']', '{', '}', '``', "''", '""', '’', '“', '”', '‘',
     '—', '–', '…', '•', '·', '°', '€', '£', '¥', '¢', '§', '©', '®', '™', '¶', '†', '‡', '‰', '№', 'Ω', '℮', '→', '↔'])
processor = TextPreprocessor(stopwords=stopwords, lemmatize=True, stem=False, min_length=2, n_jobs=4)

In [6]:
data_df['processed'] = processor.transform(data_df['text'])



In [7]:
data_df['tokenized'] = data_df['processed'].apply(nltk.word_tokenize)

In [8]:
print(data_df['tokenized'].iloc[120])

['union', 'represent', 'qantas', 'maintenance', 'worker', 'warn', 'escalate', 'industrial', 'action', 'company', 'reject', 'offer', 'long', 'run', 'dispute', 'arbitrate', 'party', 'lock', 'private', 'talk', 'yesterday', 'industrial', 'relation', 'commission', '3,000', 'maintenance', 'worker', 'earlier', 'vote', 'reject', 'qantas', 'propose', 'wage', 'freeze', 'national', 'secretary', 'australian', 'manufacturing', 'worker', 'union', 'amwu', 'doug', 'cameron', 'say', 'union', 'do', 'everything', 'possible', 'resolve', 'dispute', 'qantas', 'prepared', 'accept', 'private', 'arbitration', 'absolutely', 'alternative', 'worker', 'take', 'industrial', 'action', 'escalate', 'industrial', 'action', 'necessary', 'ensure', 'get', 'fair', 'company', 'seem', 'determine', 'crush', 'underfoot', 'say']


Please make a list of all words from all articles.  Then, using `nltk.FreqDist`, consider the most frequent and the least frequent words.  If you find uninformative words among the most frequent ones, please remove them from the articles.  Similarly, please remove from articles the words appearing fewer than 2 or 3 times in the corpus.  

<font color='green'> **Question**:  Please justify these choices. What is now the size of your vocabulary?</font> 

In [9]:
all_words = [word for doc in data_df['tokenized'] for word in doc]
fdist = nltk.FreqDist(all_words)
print(f"Most common: {fdist.most_common(50)}")
print("------------------")
print(f"Least common: {fdist.most_common()[-50:]}")

Most common: [('say', 1011), ('australian', 178), ('new', 171), ('palestinian', 168), ('australia', 157), ('people', 153), ('government', 146), ('attack', 140), ('two', 136), ('day', 131), ('south', 130), ('state', 129), ('force', 126), ('year', 126), ('would', 115), ('take', 114), ('one', 114), ('israeli', 112), ('also', 111), ('minister', 106), ('fire', 103), ('last', 102), ('first', 102), ('arafat', 96), ('make', 92), ('afghanistan', 90), ('united', 89), ('three', 87), ('police', 86), ('world', 84), ('security', 83), ('time', 83), ('official', 83), ('could', 82), ('report', 81), ('call', 80), ('kill', 80), ('area', 79), ('give', 78), ('today', 77), ('leader', 77), ('group', 75), ('told', 75), ('come', 74), ('get', 74), ('company', 73), ('union', 71), ('authority', 69), ('laden', 69), ('well', 68)]
------------------
Least common: [('check-in', 1), ('terminate', 1), ('throw', 1), ('basically', 1), ('unwind', 1), ('recrimination', 1), ('congressman', 1), ('excess', 1), ('precursor', 

In [10]:
# To remove uninformative words, I set min length of 2 on the TextPreprocessor

In [11]:
data_df['filtered'] = data_df['tokenized'].apply(lambda x: [word for word in x if fdist[word] > 2])
all_words_filtered = [word for doc in data_df['filtered'] for word in doc]
# Filtering must have removed at least one token
assert len(all_words) > len(all_words_filtered)

In [12]:
print(data_df['filtered'].iloc[33])

['new', 'south', 'wale', 'firefighter', 'hop', 'wind', 'help', 'ease', 'workload', 'today', 'predict', 'nasty', 'condition', 'weekend', 'wind', 'expect', 'ease', 'today', 'weather', 'bureau', 'say', 'temperature', 'high', 'fire', 'still', 'burning', 'across', 'new', 'south', 'wale', 'rural', 'fire', 'service', 'say', 'change', 'may', 'allow', 'concentrate', 'action', 'room', 'complacency', 'mark', 'sullivan', 'rural', 'fire', 'service', 'say', 'condition', 'may', 'little', 'today', 'outlook', 'weekend', 'certainly', 'appear', 'weather', 'forecast', 'high', 'temperature', 'high', 'wind', 'certainly', 'could', 'nasty', 'couple', 'day', 'ahead', 'sullivan', 'say', 'one', 'area', 'cause', 'great', 'concern', 'today', 'long', 'blaze', 'low', 'blue', 'mountain', 'firefighter', 'also', 'keep', 'close', 'eye', 'blaze', 'spencer', 'north', 'sydney', 'yesterday', 'broke', 'containment', 'line', 'concern', 'fire', 'may', 'hawkesbury', 'river', 'continue', 'state', 'central', 'west', 'south', 'syd

In [13]:
# Sanity check: no word with a corpus frequency below 3 survives the filter
filtered_vocab = set(all_words_filtered)
assert all(word not in filtered_vocab for word, freq in fdist.items() if freq < 3)
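
The question above also asks for the vocabulary size after filtering. A minimal sketch of the same frequency filter on toy data, using only the standard library, could look as follows; `toy_docs` and `min_count` are illustrative names, not part of the lab code. On the lab data, the analogous value would be `len(set(all_words_filtered))`.

```python
from collections import Counter

# Toy tokenized documents standing in for data_df['tokenized'].
toy_docs = [
    ["fire", "wind", "fire", "sydney"],
    ["fire", "qantas", "union"],
    ["union", "qantas", "union", "wind"],
]

min_count = 2  # keep words occurring at least this many times in the whole corpus

# Corpus-wide frequency of every word
counts = Counter(word for doc in toy_docs for word in doc)
filtered_docs = [[w for w in doc if counts[w] >= min_count] for doc in toy_docs]

vocab_before = set(counts)
vocab_after = {w for doc in filtered_docs for w in doc}
print(len(vocab_before), "->", len(vocab_after))
```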

## LSA with Gensim

In this section, you will write the Gensim commands to compute a term-document matrix from the above documents, then transform it using SVD, and truncate the result.  To learn what the commands are, please follow the [Topics and Transformations tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html) from Gensim. 
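
For reference, the transformation can be written compactly. The symbols below are notation chosen for this sketch (they do not come from the Gensim tutorial): $X$ is the $m \times n$ TF-IDF term-document matrix ($m$ terms, $n$ documents) and $k$ is the number of topics kept.

$$
X = U \Sigma V^{\top} \;\approx\; X_k = U_k \Sigma_k V_k^{\top}, \qquad k \le \operatorname{rank}(X) \le \min(m, n)
$$

Here $U_k$ ($m \times k$) holds the term coordinates on each topic, $\Sigma_k$ contains the $k$ largest singular values, and $V_k$ ($n \times k$) holds the document coordinates.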

<font color="green"> **Question**: Please gather these commands into a function called `train_lsa`.  They should cover: dictionary creation, corpus mapping, computation of TF-IDF values, and creation of the LSA model.</font> 

In [14]:
def train_lsa(filtered_texts, num_topics=10):
    dictionary = corpora.Dictionary(filtered_texts)
    corpus = [dictionary.doc2bow(text) for text in filtered_texts]
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    lsa_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=num_topics)

    return lsa_model, dictionary, corpus, corpus_tfidf
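
To make the TF-IDF step less of a black box, here is a minimal pure-Python sketch of one common weighting scheme: raw term frequency times $\log_2(N / df)$, followed by L2 normalization of each document vector. This is believed to correspond to the default settings of Gensim's `TfidfModel`, but please treat that as an assumption and check the Gensim documentation; `toy_tfidf` is a hypothetical helper, not a Gensim function.

```python
import math
from collections import Counter


def toy_tfidf(docs):
    """Return one {term: weight} dict per document (tf * log2(N/df), L2-normalized)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * math.log2(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors


docs = [["fire", "wind", "fire"], ["fire", "union"], ["union", "qantas"]]
for vec in toy_tfidf(docs):
    print(vec)
```

Terms that appear in fewer documents receive a larger weight, which is why "wind" outweighs "fire" in the first toy document even though "fire" occurs twice.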

<font color="green"> **Question**: Please fix the number of topics to 10.  Then, execute the cell that performs `train_lsa`.</font>

In [15]:
number_of_topics = 10

In [16]:
lsa_model, dictionary, corpus, corpus_tfidf = train_lsa(data_df['filtered'], number_of_topics)

<font color="green"> **Question**: Please display several topics found by LSA using the Gensim `print_topics` function.  Please explain in your own words the meaning of what is displayed.  How do you relate it with what was explained in the course on LSA?</font>

In [17]:
lsa_model.print_topics()

[(0,
  '0.311*"palestinian" + 0.205*"israeli" + 0.185*"arafat" + 0.117*"israel" + 0.113*"hamas" + 0.111*"attack" + 0.095*"force" + 0.094*"afghanistan" + 0.094*"gaza" + 0.089*"security"'),
 (1,
  '0.426*"palestinian" + 0.281*"israeli" + 0.255*"arafat" + 0.158*"israel" + 0.153*"hamas" + 0.132*"gaza" + 0.114*"sharon" + 0.101*"suicide" + -0.101*"afghanistan" + -0.091*"south"'),
 (2,
  '-0.251*"qantas" + 0.220*"afghanistan" + 0.208*"bin" + 0.205*"laden" + -0.202*"worker" + -0.192*"union" + 0.161*"qaeda" + -0.142*"industrial" + 0.141*"bora" + 0.141*"tora"'),
 (3,
  '0.303*"qantas" + -0.259*"test" + 0.240*"worker" + 0.223*"union" + 0.170*"industrial" + -0.167*"south" + -0.158*"africa" + 0.143*"maintenance" + -0.126*"waugh" + -0.124*"match"'),
 (4,
  '0.231*"fire" + -0.207*"test" + -0.131*"africa" + -0.127*"qantas" + -0.118*"afghanistan" + 0.117*"river" + 0.116*"firefighter" + 0.115*"wind" + 0.113*"sydney" + -0.112*"bin"'),
 (5,
  '-0.249*"fire" + 0.193*"guide" + 0.190*"river" + 0.183*"canyoni

##### Answer:
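`print_topics` returns a list of tuples, one per topic. The first element of each tuple is the topic index; the second is a string listing the words that contribute most to that topic, each preceded by its weight. A weight with a large absolute value means the word is strongly associated with the topic, and the sign gives the direction along that latent dimension. In terms of the course, each topic corresponds to one latent dimension kept by the truncated SVD of the TF-IDF term-document matrix, and the weights are the term coordinates on that dimension. For example, topic 1 opposes words about the Israeli-Palestinian conflict (positive weights) to "afghanistan" and "south" (negative weights).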

<font color="green"> **Question**: Please define a function that returns the cosine similarity between two words (testing first if they are in the vocabulary). Please illustrate it on two different word pairs, one of which should be obviously more similar than the other, and comment on the values.</font>  You can get inspiration from this [Gensim Tutorial on Document Similarity](https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html).

In [18]:
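def wordsim(word1, word2, model, dictionary_):
    # Words missing from the vocabulary have no LSA vector: score them 0.0
    if word1 not in dictionary_.token2id or word2 not in dictionary_.token2id:
        return 0.0
    vec1 = model[dictionary_.doc2bow([word1])]
    vec2 = model[dictionary_.doc2bow([word2])]
    return gensim.matutils.cossim(vec1, vec2)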
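
For intuition, the cosine similarity computed on Gensim-style sparse vectors (lists of `(index, value)` pairs) can be sketched in plain Python as below. `sparse_cosine` is a hypothetical helper written for this illustration; returning 0.0 for an empty vector is a convention adopted here so that out-of-vocabulary words get a neutral score.

```python
import math


def sparse_cosine(vec1, vec2):
    """Cosine similarity between two sparse vectors given as (index, value) pairs."""
    d1, d2 = dict(vec1), dict(vec2)
    if not d1 or not d2:
        return 0.0  # convention for empty vectors (e.g. unknown words)
    # Only indices present in both vectors contribute to the dot product
    dot = sum(value * d2.get(index, 0.0) for index, value in d1.items())
    norm1 = math.sqrt(sum(v * v for v in d1.values()))
    norm2 = math.sqrt(sum(v * v for v in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


print(sparse_cosine([(0, 1.0), (1, 2.0)], [(0, 2.0), (1, 4.0)]))  # parallel vectors
print(sparse_cosine([(0, 1.0)], [(1, 1.0)]))                      # orthogonal vectors
```

Because the cosine only depends on direction, rescaling either vector (for example through TF-IDF weighting of a single-word query) does not change the score.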

In [19]:
# Single quotes inside the f-string keep this valid on Python versions before 3.12
print(f"Game and wale -> {wordsim('game', 'wale', lsa_model, dictionary)}")
print(f"Attack and war -> {wordsim('attack', 'war', lsa_model, dictionary)}")

Game and wale -> 0.21850186325286072
Attack and war -> 0.6767999008799849


<font color="green"> **Question**: Please use the [Gensim Tutorial on Document Similarity](https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html) to write a function that prints a list of words sorted by decreasing LSA similarity with a given word and showing the score too.  You don't have to use the cosine_similarity function here.  Please choose a "query" word and ten other words, apply your function, and comment on the results.</font>

In [20]:
def word_ranking(word_0, word_list_, model, dictionary_):
    sims = []
    for word in word_list_:
        sims.append((word, wordsim(word, word_0, model, dictionary_)))
    return sorted(sims, key=lambda x: x[1], reverse=True)

In [21]:
word0 = "rocket"
word_list = ["war", "love", "hate", "game", "attack", "defend", "fight", "kill"]
word_ranking(word0, word_list, lsa_model, dictionary)

[('kill', 0.7641262489704287),
 ('attack', 0.5388384066386244),
 ('war', 0.3894481351982706),
 ('defend', 0.21424124862003135),
 ('fight', 0.045305839263935936),
 ('hate', 0.0),
 ('game', -0.06611927564231743),
 ('love', -0.1140072071227226)]

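The ranking makes sense: *rocket* is more similar to *kill*, *attack* and *war* than to *love* and *game*. The score of exactly 0.0 for *hate* most likely means that the word was removed by the frequency filter and is therefore not in the dictionary.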

<font color="green"> **Question**: Please select now a significantly larger number of topics, and train a new LSA model.  Perform the same `word_ranking` task as above and compare the new ranking with the previous one.  Which one seems better?</font>

In [22]:
number_of_topics = 20
lsa_model, dictionary, corpus, corpus_tfidf = train_lsa(data_df['filtered'], number_of_topics)
word_ranking(word0, word_list, lsa_model, dictionary)

[('kill', 0.7616049117685081),
 ('attack', 0.3424902483484021),
 ('war', 0.19689796118512962),
 ('defend', 0.11185859214100352),
 ('hate', 0.0),
 ('fight', -0.019251714677282705),
 ('game', -0.03188218739950623),
 ('love', -0.03214056290640063)]

In [23]:
number_of_topics = 100
lsa_model, dictionary, corpus, corpus_tfidf = train_lsa(data_df['filtered'], number_of_topics)
word_ranking(word0, word_list, lsa_model, dictionary)

[('defend', 0.1410294627718505),
 ('kill', 0.08478136583807817),
 ('attack', 0.06979602070601751),
 ('love', 0.02934590177466269),
 ('war', 0.027143178786544358),
 ('hate', 0.0),
 ('fight', -0.020421597123538845),
 ('game', -0.022904489906125022)]

In [24]:
number_of_topics = 500
lsa_model, dictionary, corpus, corpus_tfidf = train_lsa(data_df['filtered'], number_of_topics)
word_ranking(word0, word_list, lsa_model, dictionary)

[('kill', 0.048394945314964286),
 ('fight', 0.01062061079054211),
 ('hate', 0.0),
 ('game', -0.003994180377152921),
 ('love', -0.007752508416960653),
 ('defend', -0.07083423555909896),
 ('war', -0.07362298578528105),
 ('attack', -0.19020624280406517)]

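The ranking seems better with a lower number of topics; the one obtained with 10 topics looks best. With 10 and 20 topics, *kill*, *attack* and *war* are ranked first, which matches intuition for the query *rocket*. With 100 and 500 topics, all similarities shrink toward zero and the order becomes hard to interpret (with 500 topics, *attack* is ranked last). Note also that the corpus has only 300 documents, so the term-document matrix has rank at most 300 and a request for 500 topics cannot yield 500 meaningful dimensions.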
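
One way to make the comparison less subjective is to measure how far each ranking drifts from the 10-topic ranking, for instance with Spearman's rank correlation. The sketch below hard-codes the word orders printed above; `spearman_rho` is a helper written for this illustration, and treating the 10-topic ranking as the reference is a choice made here, not a ground truth.

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman rank correlation between two orderings of the same items (no ties)."""
    n = len(ranking_a)
    position_b = {word: i for i, word in enumerate(ranking_b)}
    d_squared = sum((i - position_b[word]) ** 2 for i, word in enumerate(ranking_a))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Word orders copied from the outputs above, keyed by number of topics
rankings = {
    10: ["kill", "attack", "war", "defend", "fight", "hate", "game", "love"],
    20: ["kill", "attack", "war", "defend", "hate", "fight", "game", "love"],
    100: ["defend", "kill", "attack", "love", "war", "hate", "fight", "game"],
    500: ["kill", "fight", "hate", "game", "love", "defend", "war", "attack"],
}

reference = rankings[10]
for k, ranking in rankings.items():
    print(k, round(spearman_rho(reference, ranking), 3))
```

A value close to 1 means the ordering is almost unchanged; values near 0 or below mean that the ranking has little in common with the reference.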

## End of Lab 5
Please make sure all cells have been executed, save this completed notebook, compress it to a *zip* file, and upload it to [Moodle](https://moodle.msengineering.ch/course/view.php?id=1869).