
# Solution of Adrian Willi & Florian Bär
- adrian.willi@hslu.ch
- florian.baer@hslu.ch



![MSE Logo](https://moodle.msengineering.ch/pluginfile.php/1/core_admin/logocompact/300x300/1613732714/logo-mse.png "MSE Logo") 

# AnTeDe Lab 5: Latent Semantic Analysis with Gensim

## Objective
The goal of this lab is to perform LSA on a small corpus of news.  You will use the LSA word vectors to estimate word similarity, and then to perform ranked retrieval given a query. 

<font color='green'>Please answer the questions in green within this notebook, and submit the completed notebook under the corresponding homework on Moodle.</font>

In [1]:
!pip install contractions

Collecting contractions
  Downloading contractions-0.1.68-py2.py3-none-any.whl (8.1 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.21-py2.py3-none-any.whl (7.5 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-1.4.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (106 kB)
Collecting anyascii
  Downloading anyascii-0.3.0-py3-none-any.whl (284 kB)
Installing collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.0 contractions-0.1.68 pyahocorasick-1.4.4 textsearch-0.0.21


In [2]:
from google.colab import drive
drive.mount('/content/drive')


# Modify path according to your configuration
# !ls "/content/gdrive/MyDrive/ColabNotebooks/MSE_AnTeDe_Spring2022"
import sys
sys.path.insert(0,'/content/drive/MyDrive/Colab Notebooks')

from TextPreprocessor import *

Mounted at /content/drive


In [3]:
import os    
import nltk
import gensim
import pandas as pd
#from TextPreprocessor import *
from gensim import models, corpora, similarities
from gensim.models import LsiModel, LdaModel, LdaMulticore

The data used in this lab is the same set of 300 Australian news articles that you used in Lab 4 on Document Representation.  It is a shortened version of the Lee Background Corpus [described here](http://www.socsci.uci.edu/~mdlee/lee_pincombe_welsh_document.PDF) and it is available with the **gensim** package that you installed.  The following code loads the documents into a Pandas dataframe.

In [4]:
# Code inspired from:
# https://github.com/bhargavvader/personal/blob/master/notebooks/text_analysis_tutorial/topic_modelling.ipynb

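# Locate the corpus file shipped with the gensim test data
test_data_dir = os.path.join(gensim.__path__[0], 'test', 'test_data')
lee_train_file = os.path.join(test_data_dir, 'lee_background.cor')
# One document per line; the context manager closes the file afterwards
with open(lee_train_file) as f:
    text = f.read().splitlines()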
data_df = pd.DataFrame({'text': text})

## Data preprocessing

You will first need to preprocess the data through the following stages:
1. tokenization
2. stopword removal
3. POS-based filtering (optional)
4. lemmatization or stemming (optional)
5. addition of bigrams to each document (optional)
6. filtering of infrequent words
7. inspection and filtering of frequent words

You can use NLTK or our in-house `TextPreprocessor.py` file, as explained in Lab 1.

<font color='green'>Please state here which solution you use and list the stages you implement.</font>
- - -
1. tokenization
2. stopword removal
3. lemmatization
4. filtering of infrequent words
5. inspection and filtering of frequent words

In [5]:
# Please write here the preprocessing instructions if you use TextPreprocessor.py
language = 'english'
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

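# Explicit import, so the cell does not rely on the star import from TextPreprocessor
from nltk.corpus import stopwords

stop_words = set(stopwords.words(language))
# Also treat quote characters and the possessive marker as stopwords
stop_words.update(['"', "'", "''", '`', '``', "'s"])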

processor = TextPreprocessor(
# Add options here:
    language = language,
    stopwords = stop_words,
    lemmatize = True
)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [6]:
data_df['processed'] = processor.transform(data_df['text'])

In [7]:
data_df['processed'].head()

0    hundred people force vacate home southern high...
1    indian security force shot dead eight suspect ...
2    national road toll christmas-new year holiday ...
3    argentina political economic crisis deepen res...
4    six midwife suspend wollongong hospital south ...
Name: processed, dtype: object

In [8]:
data_df['tokenized'] = data_df['processed'].apply(nltk.word_tokenize)

In [9]:
data_df['tokenized'].head()

0    [hundred, people, force, vacate, home, souther...
1    [indian, security, force, shot, dead, eight, s...
2    [national, road, toll, christmas-new, year, ho...
3    [argentina, political, economic, crisis, deepe...
4    [six, midwife, suspend, wollongong, hospital, ...
Name: tokenized, dtype: object

In [10]:
# Alternatively, please write here the preprocessing instructions if you use NLTK


In [11]:
print(data_df['tokenized'].iloc[120])

['union', 'represent', 'qantas', 'maintenance', 'worker', 'warn', 'escalate', 'industrial', 'action', 'company', 'reject', 'offer', 'long', 'run', 'dispute', 'arbitrate', 'party', 'lock', 'private', 'talk', 'yesterday', 'industrial', 'relation', 'commission', '3,000', 'maintenance', 'worker', 'earlier', 'vote', 'reject', 'qantas', 'propose', 'wage', 'freeze', 'national', 'secretary', 'australian', 'manufacturing', 'worker', 'union', 'amwu', 'doug', 'cameron', 'say', 'union', 'do', 'everything', 'possible', 'resolve', 'dispute', 'qantas', 'prepared', 'accept', 'private', 'arbitration', 'absolutely', 'alternative', 'worker', 'take', 'industrial', 'action', 'escalate', 'industrial', 'action', 'necessary', 'ensure', 'get', 'fair', 'go', 'company', 'seem', 'determine', 'crush', 'underfoot', 'say']


Please make a list of all words from all articles.  Then, using `nltk.FreqDist`, consider the most frequent and the least frequent words.  If you find uninformative words among the most frequent ones, please remove them from the articles.  Similarly, please remove from articles the words appearing fewer than 2 or 3 times in the corpus.  <font color='green'> Please justify these choices. What is now the size of your vocabulary?</font> 

In [12]:
# Please write here all the necessary instructions.  You may use several cells.

# keep only purely alphabetic tokens (drops tokens containing digits, hyphens or punctuation)
word_list = [w for ws in data_df['tokenized'] for w in ws if w.isalpha()]

fdist = nltk.FreqDist(word_list)

# most frequent
print('Most frequent words (n=50): \n', fdist.most_common(50)) 

# least frequent
print('\nLeast frequent words (n=50): \n', fdist.most_common()[-50:]) 

# keep only words that occur more than 3 times in the corpus
filtered = dict((word, freq) for word, freq in fdist.items() if freq > 3)
fdist_filtered = nltk.FreqDist(filtered)
print('\nNew least frequent words (n=10): \n', fdist_filtered.most_common()[-10:]) 

# check vocab size
vocab = set(word for word, _ in fdist.items())
vocab_filtered = set(word for word, _ in fdist_filtered.items())
print('\nBefore filtering: %s' % len(vocab))
print('After filtering: %s' % len(vocab_filtered))

Most frequent words (n=50): 
 [('say', 1011), ('mr', 306), ('australian', 178), ('new', 171), ('palestinian', 168), ('australia', 157), ('people', 153), ('government', 146), ('attack', 140), ('two', 136), ('u', 136), ('day', 131), ('south', 130), ('state', 129), ('force', 126), ('year', 126), ('would', 115), ('take', 114), ('one', 114), ('israeli', 112), ('also', 111), ('minister', 106), ('fire', 103), ('last', 102), ('first', 102), ('arafat', 96), ('go', 94), ('make', 92), ('afghanistan', 90), ('united', 89), ('three', 87), ('police', 86), ('world', 84), ('security', 83), ('time', 83), ('official', 83), ('could', 82), ('report', 81), ('call', 80), ('kill', 80), ('area', 79), ('give', 78), ('today', 77), ('leader', 77), ('group', 75), ('told', 75), ('come', 74), ('get', 74), ('company', 73), ('union', 71)]

Least frequent words (n=50): 
 [('uk', 1), ('mysticism', 1), ('dawn', 1), ('tent', 1), ('marquee', 1), ('rob', 1), ('terminate', 1), ('throw', 1), ('basically', 1), ('unwind', 1), (

In [13]:
# filter the data
data_df['filtered'] = [[w for w in ws if w in vocab_filtered] for ws in data_df['tokenized']]
print(data_df['filtered'].iloc[50])
data_df['filtered']

['afghan', 'security', 'force', 'arrest', 'wound', 'arab', 'al', 'qaeda', 'fighter', 'seven', 'others', 'weapon', 'explosive', 'remain', 'hospital', 'southern', 'city', 'kandahar', 'spokesman', 'governor', 'gul', 'agha', 'say', 'man', 'arrest', 'left', 'ward', 'one', 'arab', 'believe', 'take', 'custody', 'come', 'ward', 'mr', 'say', 'say', 'seven', 'carry', 'weapon', 'include', 'grenade', 'explosive', 'try', 'explosive', 'surrender', 'weapon', 'mr', 'say', 'concerned', 'safety', 'arab', 'wound', 'earlier', 'u', 'bombing', 'kandahar', 'airport', 'admit', 'hospital', 'taliban', 'militia', 'earlier', 'month', 'flee', 'taliban', 'hand', 'weapon', 'include', 'grenade', 'explosive', 'arab', 'could', 'protect', 'threaten', 'blow', 'hospital', 'room', 'attempt', 'make', 'arrest']


0      [hundred, people, force, home, southern, highl...
1      [indian, security, force, shot, dead, eight, s...
2      [national, road, toll, year, holiday, period, ...
3      [argentina, political, economic, crisis, inter...
4      [six, midwife, suspend, hospital, south, sydne...
                             ...                        
295    [team, australian, israeli, scientist, conduct...
296    [today, world, aid, day, late, figure, show, m...
297    [federal, national, party, reject, possible, s...
298    [university, canberra, proposal, republic, one...
299    [australia, take, france, double, rubber, davi...
Name: filtered, Length: 300, dtype: object
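
The threshold logic of the two cells above can be checked on a toy corpus. The following is a minimal sketch using only `collections.Counter`; the documents and the name `min_count` are made up for illustration and are not part of the lab data. Note that `freq >= min_count` keeps words occurring at least `min_count` times, whereas the condition `freq > 3` used in the lab cell keeps only words occurring four times or more.

```python
from collections import Counter

# Hypothetical toy corpus of tokenized documents (not taken from the lab data).
docs = [
    ["fire", "force", "say", "fire"],
    ["say", "force", "police"],
    ["say", "fire", "wind"],
]

# Count every token across all documents, much like nltk.FreqDist does.
counts = Counter(token for doc in docs for token in doc)

# Keep words that occur at least `min_count` times in the whole corpus.
min_count = 2
vocab = {word for word, freq in counts.items() if freq >= min_count}

# Drop out-of-vocabulary tokens while preserving the order inside each document.
filtered_docs = [[token for token in doc if token in vocab] for doc in docs]

print(sorted(vocab))
print(filtered_docs)
```

Using a set for `vocab` keeps the membership test in the final list comprehension cheap, which matters once the corpus grows beyond a toy example.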

## LSA with Gensim

In this section, you will write the Gensim commands to compute a term-document matrix from the above documents, then transform it using SVD, and truncate the result.  To learn what the commands are, please follow the [Topics and Transformations tutorial](https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html) from Gensim. 

<font color="green">Please gather these commands into a function called `train_lsa`.  They should cover: dictionary creation, corpus mapping, computation of TF-IDF values, and creation of the LSA model.</font> 

In [14]:
def train_lsa(filtered_texts, num_topics = 10):

  # create dictionary
  dictionary = corpora.Dictionary(filtered_texts)

  # corpus mapping
  corpus = [dictionary.doc2bow(text) for text in filtered_texts]

  # TF-IDF model
  tfidf = models.TfidfModel(corpus, normalize=True)
  corpus_tfidf = tfidf[corpus]

  # LSA model
  lsa = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=num_topics)

  return lsa, dictionary, corpus, corpus_tfidf
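
To make the first three steps of `train_lsa` more concrete, here is a standard-library sketch of dictionary creation, bag-of-words mapping and TF-IDF weighting on a toy corpus. The documents and all names (`token2id`, `bow_corpus`, `tfidf`) are hypothetical. The weighting used is `tf * log2(N / df)` followed by L2 normalization, which to my understanding resembles the default behaviour of Gensim's `TfidfModel`; please treat that as an assumption rather than a statement about the library.

```python
import math
from collections import Counter

# Hypothetical toy corpus of tokenized documents (not taken from the lab data).
docs = [
    ["gaza", "hamas", "attack"],
    ["gaza", "israel", "attack"],
    ["wind", "water", "fire"],
]

# Dictionary creation: give every distinct token an integer id.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

# Corpus mapping: each document becomes a sorted list of (token_id, count) pairs.
bow_corpus = [sorted(Counter(token2id[t] for t in doc).items()) for doc in docs]

# Document frequency: in how many documents does each token id occur?
n_docs = len(bow_corpus)
doc_freq = Counter(token_id for bow in bow_corpus for token_id, _ in bow)


def tfidf(bow):
    """Weight a bag-of-words vector by tf * log2(N / df), then L2-normalize it."""
    weights = [(tid, tf * math.log2(n_docs / doc_freq[tid])) for tid, tf in bow]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0.0:
        return []
    return [(tid, w / norm) for tid, w in weights if w > 0.0]


tfidf_corpus = [tfidf(bow) for bow in bow_corpus]
print(tfidf_corpus[0])
```

In the first document, `hamas` receives a larger weight than `gaza` or `attack` because it appears in only one document, which is the effect TF-IDF is meant to produce before the SVD step.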

<font color="green">Please fix a `number_of_topics`, on the lower side of the range mentioned in the course.  Then, execute the cell that performs `train_lsa`.</font>

In [15]:
number_of_topics = 2

In [26]:
lsa_model, dictionary, corpus, corpus_tfidf = train_lsa(data_df['filtered'], number_of_topics)

<font color="green">Please display several topics found by LSA using the Gensim `print_topics` function.  Please explain in your own words the meaning of what is displayed.  How do you relate it with what was explained in the course on LSA?</font>
- - -
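Each topic is one latent dimension kept by the truncated SVD, printed as a weighted combination of the ten words with the largest absolute weights. In topic 0, 'palestinian', 'israeli' and 'arafat' carry the largest weights, so they contribute most to that dimension. Topic 1 gives the same words large negative weights and 'afghanistan' a positive one, so it contrasts the Israeli-Palestinian vocabulary with 'afghanistan'.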

In [27]:
lsa_model.print_topics(number_of_topics)

[(0,
  '0.296*"palestinian" + 0.196*"israeli" + 0.173*"arafat" + 0.119*"mr" + 0.110*"attack" + 0.110*"israel" + 0.104*"hamas" + 0.099*"afghanistan" + 0.096*"force" + 0.091*"u"'),
 (1,
  '-0.444*"palestinian" + -0.295*"israeli" + -0.260*"arafat" + -0.162*"israel" + -0.156*"hamas" + -0.139*"gaza" + -0.112*"sharon" + 0.102*"afghanistan" + -0.101*"suicide" + -0.093*"militant"')]

<font color="green">Please define a function that returns the cosine similarity between two words (testing first if they are in the vocabulary). Please exemplify its value on two different word pairs, one of which should be obviously more similar than the other, and comment the values.</font>  You can get inspiration from this [Gensim Tutorial on Document Similarity](https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html).
- - -
The similarity of the first pair is clearly higher than that of the second pair.

In [18]:
from sklearn.metrics.pairwise import cosine_similarity

In [20]:
def wordsim(word1, word2, model, dictionary):
  word1_bow = dictionary.doc2bow([word1])
  word2_bow = dictionary.doc2bow([word2])

  if len(word1_bow) == 0 or len(word2_bow) == 0:
    raise Exception('Words not in dictionary!')

  # convert to LSA space
  word1_lsa = model[word1_bow]
  word2_lsa = model[word2_bow]
    
  # compute the similarity
  sim = cosine_similarity(word1_lsa, word2_lsa)

  return sim

In [24]:
# print here the cosine similarities of several pairs and comment the results.
sim_high = wordsim('arafat', 'israeli', lsa_model, dictionary)
sim_low = wordsim('hindu', 'gaza', lsa_model, dictionary)

print(sim_high)
print(sim_low)

[[1.         0.2829564 ]
 [0.25114428 0.99945513]]
[[ 1.          0.13786654]
 [-0.0014011   0.99025668]]
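
Each result above is a 2×2 matrix rather than a single score: `model[bow]` returns a list of `(topic_id, weight)` pairs, and scikit-learn's `cosine_similarity` treats every pair as a separate row. A scalar cosine similarity over such sparse vectors could be sketched as follows, using only the standard library. The function name `sparse_cosine` and the example vectors are made up for illustration and are not lab results. Gensim also ships `gensim.matutils.cossim`, which accepts the same sparse format.

```python
import math


def sparse_cosine(vec1, vec2):
    """Cosine similarity of two sparse vectors given as (index, weight) pairs."""
    d1, d2 = dict(vec1), dict(vec2)
    dot = sum(w * d2.get(i, 0.0) for i, w in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm1 * norm2)


# Hypothetical 2-topic vectors in the (topic_id, weight) format.
vec_a = [(0, 0.3), (1, -0.4)]
vec_b = [(0, 0.6), (1, -0.8)]   # same direction as vec_a
vec_c = [(0, 0.4), (1, 0.3)]    # orthogonal to vec_a

sim_parallel = sparse_cosine(vec_a, vec_b)
sim_orthogonal = sparse_cosine(vec_a, vec_c)
print(sim_parallel, sim_orthogonal)
```

With the objects from this notebook, a call would look like `sparse_cosine(lsa_model[dictionary.doc2bow(['arafat'])], lsa_model[dictionary.doc2bow(['israeli'])])`.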


<font color="green">Please use the [Gensim Tutorial on Document Similarity](https://radimrehurek.com/gensim/auto_examples/core/run_similarity_queries.html) to write a function that prints a list of words sorted by decreasing LSA similarity with a given word and showing the score too.  You don't have to use the cosine_similarity function here.  Please choose a "query" word and ten other words, apply your function, and comment the results.</font>

In [None]:
from gensim import similarities

In [29]:
def word_ranking(word0, word_list, model, dictionary):
  word0_bow = dictionary.doc2bow([word0])
  words_bow = [dictionary.doc2bow([w]) for w in word_list]
    
  word0_lsa = model[word0_bow]
  words_lsa = model[words_bow]
    
  index = similarities.MatrixSimilarity(words_lsa)
    
  # perform a similarity query against the corpus
  sims = index[word0_lsa]  
  sims = sorted(enumerate(sims), key=lambda item: -item[1])

  for i, (doc_position, doc_score) in enumerate(sims):
    print('{0}: "{1}"", score: {2:.5f}'.format(i, word_list[doc_position], doc_score))


In [30]:
# call here the function on your choice of words
word_ranking('gaza', ['hot', 'water', 'palestinian', 'israel', 'hindu', 'black', 'wind', 'arafat', 'hamas', 'force'], lsa_model, dictionary)

0: "hamas"", score: 0.99990
1: "palestinian"", score: 0.99989
2: "arafat"", score: 0.99987
3: "israel"", score: 0.99975
4: "hot"", score: 0.20795
5: "force"", score: 0.14559
6: "hindu"", score: -0.04589
7: "black"", score: -0.16789
8: "wind"", score: -0.29011
9: "water"", score: -0.30594


In [32]:
# Please write here your comments on the rankings

The words 'hamas', 'palestinian', 'arafat' and 'israel' score close to 1 with 'gaza' because they occur in the same articles and therefore load on the same latent dimensions. At the other end, 'water' and 'wind' get negative scores because they rarely appear in articles about Gaza.


<font color="green">Please select now a significantly larger number of topics, and train a new LSA model.  Perform the same `word_ranking` task as above and compare the new ranking with the previous one.  Which one seems better?</font>
- - -
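The model with 2 topics gives the better ranking. It places the four related words ('hamas', 'palestinian', 'arafat', 'israel') at the top with scores close to 1, whereas the model with 300 topics gives every word a score below 0.15, drops 'hamas' to 0.027 and ranks 'israel' last with a negative score.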

In [36]:
lsa_model, dictionary, corpus, corpus_tfidf = train_lsa(data_df['filtered'], num_topics=300)
word_ranking('gaza', ['hot', 'water', 'palestinian', 'israel', 'hindu', 'black', 'wind', 'arafat', 'hamas', 'force'], lsa_model, dictionary)

0: "palestinian"", score: 0.14060
1: "arafat"", score: 0.13423
2: "force"", score: 0.04083
3: "hamas"", score: 0.02699
4: "wind"", score: 0.00703
5: "hindu"", score: 0.00558
6: "water"", score: -0.01309
7: "black"", score: -0.06254
8: "hot"", score: -0.06660
9: "israel"", score: -0.09030
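
The effect of the number of retained dimensions can be illustrated on a toy term-document matrix with NumPy. This is only a sketch under stated assumptions: the matrix values and the helper `term_cosine` are made up, and term vectors are taken as rows of `U` scaled by the singular values, which may differ from how `LsiModel` represents single-word queries.

```python
import numpy as np

# Hypothetical term-document matrix: rows are terms, columns are documents.
terms = ["gaza", "hamas", "israel", "wind", "water"]
A = np.array([
    [1.0, 1.0, 0.0, 0.0],  # gaza
    [1.0, 0.0, 1.0, 0.0],  # hamas
    [0.0, 1.0, 1.0, 0.0],  # israel
    [0.0, 0.0, 0.0, 1.0],  # wind
    [0.0, 0.0, 0.0, 1.0],  # water
])

# Thin SVD: A = U @ diag(s) @ Vt, singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)


def term_cosine(term1, term2, k):
    """Cosine similarity of two terms using only the first k latent dimensions."""
    vectors = U[:, :k] * s[:k]
    v1 = vectors[terms.index(term1)]
    v2 = vectors[terms.index(term2)]
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))


cos_k2 = term_cosine("gaza", "hamas", k=2)         # truncated to 2 dimensions
cos_full = term_cosine("gaza", "hamas", k=len(s))  # all dimensions kept
print(round(cos_k2, 3), round(cos_full, 3))
```

In this toy example, 'gaza' and 'hamas' share only one of their two documents, so their similarity is 0.5 when all dimensions are kept, but it rises to 1.0 after truncation to two dimensions because both words collapse onto the same latent direction. Keeping many dimensions leaves less of this smoothing, which is one plausible reading of the weaker ranking obtained with 300 topics.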


## End of Lab 5
Please make sure all cells have been executed, save this completed notebook, compress it to a *zip* file, and upload it to [Moodle](https://moodle.msengineering.ch/course/view.php?id=1869).