![MSE Logo](https://moodle.msengineering.ch/pluginfile.php/1/core_admin/logocompact/300x300/1613732714/logo-mse.png "MSE Logo") 

# AnTeDe Lab 5: Latent Semantic Analysis with Gensim
## Solution of Adrian Willi & Florian Bär
- adrian.willi@hslu.ch
- florian.baer@hslu.ch

## Objective
The goal of this lab is to perform LSA on a small corpus of news.  You will use the LSA word vectors to estimate word similarity, and then to perform ranked retrieval given a query. 

<font color='green'>Please answer the questions in green within this notebook, and submit the completed notebook under the corresponding homework on Moodle.</font>

In [1]:
!pip install contractions

Collecting contractions
  Downloading contractions-0.1.68-py2.py3-none-any.whl (8.1 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.21-py2.py3-none-any.whl (7.5 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-1.4.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (106 kB)
Collecting anyascii
  Downloading anyascii-0.3.1-py3-none-any.whl (287 kB)
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.1 contractions-0.1.68 pyahocorasick-1.4.4 textsearch-0.0.21


In [2]:
import os    
import nltk
import gensim
import pandas as pd

from gensim import models, corpora, similarities
from gensim.models import LsiModel, LdaModel, LdaMulticore

In [3]:
from google.colab import drive
drive.mount('/content/gdrive')

# Modify path according to your configuration
# !ls "/content/gdrive/MyDrive/ColabNotebooks/MSE_AnTeDe_Spring2022"
import sys
sys.path.insert(0,'/content/gdrive/MyDrive/Colab Notebooks/MSE/AnTeDe/MSE_AnTeDe_Lab4')

from TextPreprocessor import *

Mounted at /content/gdrive


The data used in this lab is the same set of 300 Australian news articles that you used in Lab 4 on Document Representation.  It is a shortened version of the Lee Background Corpus [described here](http://www.socsci.uci.edu/~mdlee/lee_pincombe_welsh_document.PDF) and it is available with the **gensim** package that you installed.  The following code loads the documents into a Pandas dataframe.

In [4]:
# Code inspired from:
# https://github.com/bhargavvader/personal/blob/master/notebooks/text_analysis_tutorial/topic_modelling.ipynb

test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
# One document per line; the context manager closes the file after reading
with open(lee_train_file) as f:
    text = f.read().splitlines()
data_df = pd.DataFrame({'text': text})

## Data preprocessing

You will first need to preprocess the data through the following stages:
1. tokenization
2. stopword removal
3. POS-based filtering (optional)
4. lemmatization or stemming (optional)
5. addition of bigrams to each document (optional)
6. filtering of infrequent words
7. inspection and filtering of frequent words

You can use NLTK or our in-house `TextPreprocessor.py` file, as explained in Lab 1.

<font color='green'>Please state here which solution you use and list stages you implement.</font>


<font color='red'>I used the following stages:
- tokenization
- stopword-removal
- lemmatization/stemming
- filtering of infrequent words (manually)
- inspection and filtering of frequent words (manually)
</font>
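As a rough illustration of the first two stages only, here is a minimal, dependency-free sketch. It is not the `TextPreprocessor` used in this notebook: the function name `simple_preprocess`, the tiny stopword set, and the example sentence are all made up for illustration, and no lemmatization is performed.

```python
import re

# Hypothetical mini stopword list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "has", "have", "been", "by"}

def simple_preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, tokenize on alphabetic runs, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]

example = "The union has rejected an offer by the company."
tokens = simple_preprocess(example)
print(tokens)
```

Without lemmatization, "rejected" stays as it is; the lemmatizing pipeline used in this notebook would map it to "reject", as in the tokenized output shown further down.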

In [5]:
# Please write here the preprocessing instructions if you use TextPreprocessor.py
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
language = 'english'
stop_words = set(stopwords.words(language))
# Extend the list here (quote characters and the possessive marker):
stop_words.update(['"', "'", "''", '`', '``', "'s"])
# TextPreprocessor? - get help regarding the attributes

processor = TextPreprocessor(
    # Add options here:
    language=language,
    stopwords=stop_words,
    lemmatize=True
)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [6]:
data_df['processed'] = processor.transform(data_df['text'])

In [7]:
data_df['tokenized'] = data_df['processed'].apply(nltk.word_tokenize)

In [9]:
# Alternatively, please write here the preprocessing instructions if you use NLTK
# -

In [10]:
print(data_df['tokenized'].iloc[120])

['union', 'represent', 'qantas', 'maintenance', 'worker', 'warn', 'escalate', 'industrial', 'action', 'company', 'reject', 'offer', 'long', 'run', 'dispute', 'arbitrate', 'party', 'lock', 'private', 'talk', 'yesterday', 'industrial', 'relation', 'commission', '3,000', 'maintenance', 'worker', 'earlier', 'vote', 'reject', 'qantas', 'propose', 'wage', 'freeze', 'national', 'secretary', 'australian', 'manufacturing', 'worker', 'union', 'amwu', 'doug', 'cameron', 'say', 'union', 'do', 'everything', 'possible', 'resolve', 'dispute', 'qantas', 'prepared', 'accept', 'private', 'arbitration', 'absolutely', 'alternative', 'worker', 'take', 'industrial', 'action', 'escalate', 'industrial', 'action', 'necessary', 'ensure', 'get', 'fair', 'go', 'company', 'seem', 'determine', 'crush', 'underfoot', 'say']


Please make a list of all words from all articles.  Then, using `nltk.FreqDist`, consider the most frequent and the least frequent words.  If you find uninformative words among the most frequent ones, please remove them from the articles.  Similarly, please remove from articles the words appearing fewer than 2 or 3 times in the corpus.  <font color='green'> Please justify these choices. What is now the size of your vocabulary?</font> 
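The frequency-based filtering described above can be sketched on toy data as follows. This is only an illustration under stated assumptions: the three documents are made up, `collections.Counter` stands in for `nltk.FreqDist` (which is a `Counter` subclass), and the names `too_frequent`, `too_rare` and `to_drop` are hypothetical.

```python
from collections import Counter

# Toy tokenized documents (made up for illustration, not taken from the corpus).
docs = [
    ["union", "say", "worker", "strike"],
    ["worker", "say", "company", "offer"],
    ["company", "say", "mr", "offer"],
]

counts = Counter(w for doc in docs for w in doc)
too_frequent = {"say", "mr"}                          # hand-picked after inspecting the top counts
too_rare = {w for w, c in counts.items() if c < 2}    # words seen only once

# Sets make the membership tests O(1) per token.
to_drop = too_frequent | too_rare
filtered = [[w for w in doc if w not in to_drop] for doc in docs]
vocabulary = {w for doc in filtered for w in doc}
print(len(vocabulary), sorted(vocabulary))
```

Using sets rather than lists for the words to drop matters on the real corpus, where thousands of rare words are tested against every token.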

In [11]:
# Please write here all the necessary instructions.  You may use several cells.
word_list = [w for ws in data_df['tokenized'] for w in ws if w.isalpha()]
freqDist = nltk.FreqDist(word_list)
words_to_remove = ['mr', 'also', 'u', '\'\'', '``']
freqDist.most_common(50)

[('say', 1011),
 ('mr', 306),
 ('australian', 178),
 ('new', 171),
 ('palestinian', 168),
 ('australia', 157),
 ('people', 153),
 ('government', 146),
 ('attack', 140),
 ('two', 136),
 ('u', 136),
 ('day', 131),
 ('south', 130),
 ('state', 129),
 ('force', 126),
 ('year', 126),
 ('would', 115),
 ('take', 114),
 ('one', 114),
 ('israeli', 112),
 ('also', 111),
 ('minister', 106),
 ('fire', 103),
 ('last', 102),
 ('first', 102),
 ('arafat', 96),
 ('go', 94),
 ('make', 92),
 ('afghanistan', 90),
 ('united', 89),
 ('three', 87),
 ('police', 86),
 ('world', 84),
 ('security', 83),
 ('time', 83),
 ('official', 83),
 ('could', 82),
 ('report', 81),
 ('call', 80),
 ('kill', 80),
 ('area', 79),
 ('give', 78),
 ('today', 77),
 ('leader', 77),
 ('group', 75),
 ('told', 75),
 ('come', 74),
 ('get', 74),
 ('company', 73),
 ('union', 71)]

In [12]:
print(len(freqDist.most_common()))
least_common_words = [w[0] for w in freqDist.most_common() if w[1] < 2]
print(len(least_common_words))
print(least_common_words)

5336
2182
['vacate', 'deterioration', 'outlying', 'mittagong', 'finger', 'formation', 'cranebrook', 'meteorology', 'claire', 'richards', 'shootout', 'dora', 'kilometer', 'srinagar', 'behest', 'hafiz', 'saeed', 'karachi', 'targetting', 'deepen', 'aldolfo', 'rodregiuez', 'predecessor', 'ramon', 'puerta', 'caretaker', 'lawmaker', 'whoever', 'assumes', 'daunt', 'lending', 'unprofessional', 'sherbon', 'annoyed', 'afghani', 'becomes', 'await', 'unseeded', 'burswood', 'dome', 'unbeatable', 'eighth', 'overpower', 'virginie', 'straight', 'martina', 'hingis', 'federer', 'tentative', 'scrapper', 'frankly', 'ascendancy', 'hunger', 'strive', 'regain', 'peak', 'stride', 'opponent', 'tenacious', 'canoeist', 'dusty', 'thirty', 'canoe', 'yarrawonga', 'gruelling', 'milder', 'favourable', 'pose', 'flare', 'flank', 'glenbrook', 'bulaburra', 'newcastle', 'lightning', 'electrical', 'grafton', 'incendiary', 'inaccessible', 'phil', 'koperberg', 'cancel', 'wilton', 'penrith', 'springwood', 'wisemans', 'ferry',

In [13]:
def filter_words(words, words_to_remove, least_common_words):
    # Keep alphabetic tokens that are neither hand-picked frequent words nor rare words
    words = [w for w in words if w.isalpha() and w not in words_to_remove and w not in least_common_words]
    return words

In [14]:
data_df['filtered'] = data_df['tokenized'].apply(filter_words, args=(words_to_remove, least_common_words))

In [15]:
print(data_df['filtered'].iloc[10])

['work', 'continue', 'morning', 'restore', 'power', 'supply', 'ten', 'thousand', 'home', 'black', 'wild', 'storm', 'struck', 'queensland', 'last', 'night', 'gale', 'force', 'wind', 'tree', 'brought', 'power', 'line', 'damage', 'home', 'car', 'energy', 'every', 'available', 'person', 'work', 'night', 'restore', 'power', 'location', 'around', 'brisbane', 'west', 'toowoomba', 'north', 'sunshine', 'coast', 'brisbane', 'ripped', 'home', 'still', 'undergo', 'repair', 'follow', 'severe', 'storm', 'christmas', 'four', 'people', 'rescue', 'high', 'power', 'line', 'fell', 'across', 'car', 'trap', 'inside', 'fierce', 'wind', 'sent', 'large', 'tree', 'crash', 'house', 'one', 'injured']


## LSA with Gensim

In this section, you will write the Gensim commands to compute a term-document matrix from the above documents, then transform it using SVD, and truncate the result.  To learn what the commands are, please follow the [Topics and Transformations tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html) from Gensim. 

<font color="green">Please gather these commands into a function called `train_lsa`.  They should cover: dictionary creation, corpus mapping, computation of TF-IDF values, and creation of the LSA model.</font> 
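To make the TF-IDF step concrete, the following plain-Python sketch computes the weights by hand on three made-up documents. It assumes Gensim's default `TfidfModel` weighting (raw term frequency times `log2(N / df)`, followed by scaling each document vector to unit length); the function name `tfidf_vector` is hypothetical.

```python
import math
from collections import Counter

# Toy tokenized documents (made up for illustration).
docs = [
    ["storm", "power", "storm"],
    ["power", "union"],
    ["union", "strike", "union"],
]

n_docs = len(docs)
# Document frequency: number of documents containing each term.
df = Counter(term for doc in docs for term in set(doc))

def tfidf_vector(doc):
    """Return {term: weight} with tf * log2(N / df), scaled to unit length."""
    tf = Counter(doc)
    weights = {t: tf[t] * math.log2(n_docs / df[t]) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights

vectors = [tfidf_vector(doc) for doc in docs]
print(vectors[0])
```

In the first document, "storm" gets a much larger weight than "power": it occurs twice there and appears in no other document, whereas "power" is shared with the second document.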

In [16]:
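def train_lsa(filtered_texts, num_topics = 10):
    """Train LSA on tokenized documents; returns (model, dictionary, BoW corpus, TF-IDF corpus)."""
    # Map each distinct word to an integer id
    dictionary = corpora.Dictionary(list(filtered_texts))
    # Represent each document as a bag of (word_id, count) pairs
    corpus = [dictionary.doc2bow(doc) for doc in filtered_texts]
    # Reweight the counts with TF-IDF
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    # Truncated SVD of the TF-IDF matrix, keeping num_topics dimensions
    lsa = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=num_topics)
    return lsa, dictionary, corpus, corpus_tfidf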

<font color="green">Please fix a `number_of_topics`, on the lower side of the range mentioned in the course.  Then, execute the cell that performs `train_lsa`.</font>

In [17]:
number_of_topics = 2

In [18]:
lsa_model, dictionary, corpus, corpus_tfidf = train_lsa(data_df['filtered'], number_of_topics)

<font color="green">Please display several topics found by LSA using the Gensim `print_topics` function.  Please explain in your own words the meaning of what is displayed.  How do you relate it with what was explained in the course on LSA?</font>

In [19]:
lsa_model.print_topics(number_of_topics) # seems to be a pretty war related dataset...

[(0,
  '0.321*"palestinian" + 0.212*"israeli" + 0.191*"arafat" + 0.120*"israel" + 0.117*"hamas" + 0.112*"attack" + 0.098*"gaza" + 0.095*"force" + 0.094*"afghanistan" + 0.088*"suicide"'),
 (1,
  '-0.417*"palestinian" + -0.275*"israeli" + -0.248*"arafat" + -0.153*"israel" + -0.151*"hamas" + -0.130*"gaza" + -0.112*"sharon" + 0.109*"afghanistan" + 0.097*"qantas" + -0.095*"suicide"')]

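Each line is one LSA topic, i.e. one of the latent dimensions kept after truncating the SVD. A topic is a linear combination of vocabulary words: `print_topics` lists the ten words with the largest absolute weights, and since each topic vector has unit length, every weight lies between -1 and 1. Here, topic 0 is dominated by words about the Israeli-Palestinian conflict, while topic 1 contrasts those same words (negative weights) with words such as "afghanistan" and "qantas" (positive weights).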
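The link with the SVD from the course can be illustrated with NumPy on a tiny, made-up term-document matrix. This is a sketch only: the matrix values and the four terms are invented, and treating each topic as one left singular vector is the standard LSA reading rather than a trace of Gensim's internal computation.

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents); values are made up.
terms = ["palestinian", "israeli", "qantas", "union"]
X = np.array([
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

k = 2  # number of topics kept after truncation
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]

for topic_id in range(k):
    column = U_k[:, topic_id]
    # Each topic is one left singular vector: a unit-length vector with one weight per term.
    parts = [f'{column[i]:.3f}*"{terms[i]}"' for i in np.argsort(-np.abs(column))]
    print(topic_id, " + ".join(parts))
```

The overall sign of a singular vector is arbitrary, so a topic whose main words all carry negative weights describes the same direction as one where they are all positive. Only the relative signs within a topic carry information.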

<font color="green">Please define a function that returns the cosine similarity between two words (testing first if they are in the vocabulary). Please exemplify its value on two different word pairs, one of which should be obviously more similar than the other, and comment the values.</font>  You can get inspiration from this [Gensim Tutorial on Document Similarity](https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html).

In [20]:
from sklearn.metrics.pairwise import cosine_similarity

In [21]:
def wordsim(word1, word2, model, dictionary):
    # look up each word in the LSA/LSI model of choice
    vec_bow1 = dictionary.doc2bow([word1])
    vec_bow2 = dictionary.doc2bow([word2])

    # doc2bow returns an empty list for out-of-vocabulary words
    if not vec_bow1:
        raise ValueError(f"Word '{word1}' is not in dictionary")
    if not vec_bow2:
        raise ValueError(f"Word '{word2}' is not in dictionary")
    # model[...] yields a list of (topic_id, weight) pairs for each word
    vec_lsi_1 = model[vec_bow1]
    vec_lsi_2 = model[vec_bow2]
    # NOTE: cosine_similarity treats each (topic_id, weight) pair as a row,
    # so this returns a matrix rather than a single score
    return cosine_similarity(vec_lsi_1, vec_lsi_2)

In [22]:
# print here the cosine similarities of several pairs and comment the results.
print(wordsim('arafat', 'israeli', lsa_model, dictionary))
print(wordsim('hindu', 'cricket', lsa_model, dictionary))

[[ 1.         -0.26517789]
 [-0.2406257   0.999678  ]]
[[1.         0.03346132]
 [0.00163016 0.99949323]]
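The 2x2 matrices printed above arise because the model returns each word as a list of `(topic_id, weight)` pairs, and `cosine_similarity` then compares those pairs row by row instead of comparing the two topic vectors. A minimal sketch of a single-number cosine is given below. The helpers `sparse_to_dense` and `cosine` are hypothetical (Gensim offers `gensim.matutils.sparse2full` for the conversion), and the weights are the rounded values copied from the `print_topics` output, so the results are approximate.

```python
import math

def sparse_to_dense(pairs, num_topics):
    """Turn [(topic_id, weight), ...] into a plain list of length num_topics."""
    dense = [0.0] * num_topics
    for topic_id, weight in pairs:
        dense[topic_id] = weight
    return dense

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Rounded weights for topics 0 and 1, copied from the print_topics output above.
arafat = sparse_to_dense([(0, 0.191), (1, -0.248)], num_topics=2)
israeli = sparse_to_dense([(0, 0.212), (1, -0.275)], num_topics=2)
afghanistan = sparse_to_dense([(0, 0.094), (1, 0.109)], num_topics=2)

print(cosine(arafat, israeli))       # one number per word pair
print(cosine(arafat, afghanistan))
```

With these rounded weights, "arafat" and "israeli" point in almost the same direction, while "arafat" and "afghanistan" get a negative cosine.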


<font color="green">Please use the [Gensim Tutorial on Document Similarity](https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html) to write a function that prints a list of words sorted by decreasing LSA similarity with a given word and showing the score too.  You don't have to use the cosine_similarity function here.  Please choose a "query" word and ten other words, apply your function, and comment the results.</font>

In [23]:
from gensim import similarities

In [24]:
def word_ranking(word0, word_list, model, dictionary):
    """Print the words in word_list sorted by decreasing LSA similarity to word0."""
    word0_bow = dictionary.doc2bow([word0])
    words_bow = [dictionary.doc2bow([w]) for w in word_list]

    word0_lsa = model[word0_bow]
    words_lsa = model[words_bow]

    # index the candidate words so that they can be queried by cosine similarity
    index = similarities.MatrixSimilarity(words_lsa)

    # perform a similarity query against the candidate words
    sims = index[word0_lsa]
    sims = sorted(enumerate(sims), key=lambda item: -item[1])

    for idx, (pos, score) in enumerate(sims):
        print('{0}: "{1}", score: {2:.5f}'.format(idx, word_list[pos], score))

In [25]:
# call here the function on your choice of words
lsa_model, dictionary, corpus, corpus_tfidf = train_lsa(data_df['filtered'], num_topics=2)
word_ranking('gaza', ['israel', 'cricket', 'sport', 'ball', 'war', 'hamas', 'arafat', 'rain', 'military', 'rocket'], lsa_model, dictionary)

0: "rocket", score: 0.99999
1: "arafat", score: 0.99992
2: "hamas", score: 0.99989
3: "israel", score: 0.99969
4: "military", score: 0.72846
5: "war", score: 0.67661
6: "rain", score: 0.03231
7: "sport", score: -0.03768
8: "cricket", score: -0.31153
9: "ball", score: -0.37062


In [26]:
# Please write here your comments on the rankings


<font color='red'>The words related to the Gaza war clearly receive the highest similarity scores. It is also visible that the words related to cricket and sports lie at a similar distance from our base word (gaza).</font>
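The ranking logic can be reproduced without Gensim by normalizing each word vector to unit length and sorting by dot product, which is in essence what a cosine-similarity index does. The sketch below is an assumption-laden illustration: the function names are hypothetical, the vectors are the rounded two-topic weights copied from the `print_topics` output, and an explicit vocabulary check is added.

```python
import math

# Rounded (topic 0, topic 1) weights copied from the print_topics output above.
word_vectors = {
    "palestinian": (0.321, -0.417),
    "israeli": (0.212, -0.275),
    "arafat": (0.191, -0.248),
    "hamas": (0.117, -0.151),
    "gaza": (0.098, -0.130),
    "afghanistan": (0.094, 0.109),
}

def unit(v):
    """Scale a vector to length 1 so that dot products equal cosines."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)

def rank_by_similarity(query, candidates, vectors):
    """Return (word, cosine) pairs sorted from most to least similar."""
    missing = [w for w in [query, *candidates] if w not in vectors]
    if missing:
        raise KeyError(f"not in vocabulary: {missing}")
    q = unit(vectors[query])
    scores = [(w, sum(a * b for a, b in zip(q, unit(vectors[w])))) for w in candidates]
    return sorted(scores, key=lambda item: item[1], reverse=True)

ranking = rank_by_similarity("gaza", ["afghanistan", "hamas", "arafat"], word_vectors)
for position, (word, score) in enumerate(ranking):
    print(f'{position}: "{word}", score: {score:.5f}')
```

Returning the sorted list instead of printing inside the function keeps the ranking reusable, and raising on unknown words avoids silently scoring an empty vector.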

In [27]:
lsa_model, dictionary, corpus, corpus_tfidf = train_lsa(data_df['filtered'], num_topics=300)
word_ranking('gaza', ['israel', 'cricket', 'sport', 'ball', 'war', 'hamas', 'arafat', 'rain', 'military', 'rocket'], lsa_model, dictionary)

0: "rocket", score: 0.30290
1: "arafat", score: 0.17825
2: "military", score: 0.07208
3: "hamas", score: 0.04219
4: "ball", score: 0.01770
5: "sport", score: -0.00276
6: "rain", score: -0.00336
7: "cricket", score: -0.00434
8: "war", score: -0.01313
9: "israel", score: -0.08778


<font color="green">Please select now a significantly larger number of topics, and train a new LSA model.  Perform the same `word_ranking` task as above and compare the new ranking with the previous one.  Which one seems better?</font>

<font color='red'>It seems that the model with only two topics produces more intuitively understandable similarities. It also seems to be more accurate! I would choose a model with around 100 to 150 topics for further investigation.</font>
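One possible reading of this contrast is geometric: with very few dimensions, word vectors are forced to be nearly parallel or anti-parallel, whereas with many dimensions they tend towards orthogonality. The NumPy sketch below illustrates this on a random toy matrix. It is not the corpus, and using the rows of the first `k` left singular vectors as word vectors is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy non-negative term-document matrix (8 terms x 12 documents); values are random.
X = rng.random((8, 12))
U, s, Vt = np.linalg.svd(X, full_matrices=False)

def cosine_matrix(k):
    """Pairwise cosines between term vectors taken as rows of the first k columns of U."""
    W = U[:, :k]
    W = W / np.linalg.norm(W, axis=1, keepdims=True)
    return W @ W.T

for k in (1, 2, 8):
    C = cosine_matrix(k)
    off_diag = np.abs(C[~np.eye(len(C), dtype=bool)])
    # Mean absolute cosine between distinct terms for this number of dimensions
    print(k, round(float(off_diag.mean()), 3))
```

With `k = 1` every cosine is exactly +1 or -1, and with all 8 dimensions the term vectors of this toy matrix become mutually orthogonal. The toy exaggerates the effect, but it suggests why scores near 1 with two topics should not be read as stronger evidence than the smaller scores obtained with many topics.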

## End of Lab 5
Please make sure all cells have been executed, save this completed notebook, compress it to a *zip* file, and upload it to [Moodle](https://moodle.msengineering.ch/course/view.php?id=1869).

In [28]:
lsa_model, dictionary, corpus, corpus_tfidf = train_lsa(data_df['filtered'], num_topics=150)
word_ranking('gaza', ['israel', 'cricket', 'sport', 'ball', 'war', 'hamas', 'arafat', 'rain', 'military', 'rocket'], lsa_model, dictionary)

0: "rocket", score: 0.52206
1: "arafat", score: 0.30613
2: "rain", score: 0.07036
3: "cricket", score: 0.05143
4: "military", score: 0.03448
5: "hamas", score: 0.02980
6: "sport", score: 0.01980
7: "ball", score: -0.01394
8: "war", score: -0.08704
9: "israel", score: -0.16823
