![MSE Logo](https://moodle.msengineering.ch/pluginfile.php/1/core_admin/logo/0x150/1643104191/logo-mse.png)

# AdvNLP Lab 4 GRADED: Testing a pretrained word2vec model on analogy tasks

**Objectives:**  experiment with *word vectors* from word2vec: test them on analogy tasks; use *accuracy and MRR* (Mean Reciprocal Rank) scores.

**Useful documentation:** the [section on KeyedVectors in Gensim](https://radimrehurek.com/gensim/models/keyedvectors.html) and possibly the [section on word2vec](https://radimrehurek.com/gensim/models/word2vec.html).

## 1. Word2vec model trained on Google News
**1a.** Please install the latest version of Gensim, preferably in a Conda environment. 

In [1]:
# !pip install --upgrade gensim
# You can run the following verification:
!pip show gensim

Name: gensim
Version: 4.3.3
Summary: Python framework for fast Vector Space Modelling
Home-page: https://radimrehurek.com/gensim/
Author: Radim Rehurek
Author-email: me@radimrehurek.com
License: LGPL-2.1-only
Location: C:\Users\andre\Documents\SUPSI\NLP\labs\.advNLP\Lib\site-packages
Requires: numpy, scipy, smart-open
Required-by: 


In [2]:
import gensim, os, random
from gensim import downloader
from gensim.test.utils import datapath
from gensim.models import KeyedVectors
from gensim import utils
# help(gensim.models.word2vec) # take a look if needed
import time
import itertools

**1b.** Please download from Gensim the `word2vec-google-news-300` model, upon your first use.  Then, please write code to answer the following questions:
* Where is the model stored on your computer and what is the file name?  You can store the absolute path in a variable called `path_to_model_file`.
* What is the size of the corresponding file?  Please display the size in gigabytes with two decimals.

In [3]:
# Download the model from Gensim (needed only the first time)
#gensim.downloader.load("word2vec-google-news-300")
# No need to store the returned value (uses a lot of memory).

In [4]:
# Where is the model stored on your computer and what is the file name?  You can store the absolute path in a variable called `path_to_model_file`.
path_to_model_file = os.path.join(gensim.downloader.BASE_DIR, "word2vec-google-news-300", "word2vec-google-news-300.gz")
print(f"model is stored at {path_to_model_file}")

# What is the size of the corresponding file?  Please display the size in gigabytes with two decimals.
print(f"its size is {os.path.getsize(path_to_model_file) / (1024**3):.2f} GB")


model is stored at C:\Users\andre/gensim-data\word2vec-google-news-300\word2vec-google-news-300.gz
its size is 1.62 GB


**1c.** Please load the word2vec model as an instance of the class `KeyedVectors`, and store it in a variable called `wv_model`. 
What is, at this point, the memory size of the process corresponding to this notebook?  Simply write the value you obtain from any OS-specific utility that you like.
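
The memory size can also be read from within Python. The sketch below is only one possible approach: `process_memory_gib` is a hypothetical helper name, `psutil` is a third-party package that is assumed to be installed, and the fallback uses the standard `resource` module, which exists only on Unix and reports the *peak* rather than the current resident size.

```python
import sys

def process_memory_gib():
    """Return the resident memory of this process in GiB, or None if it cannot be read."""
    try:
        import psutil  # third-party; assumed to be installed
        return psutil.Process().memory_info().rss / 1024**3
    except ImportError:
        pass
    try:
        import resource  # standard library, Unix only
    except ImportError:
        return None
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS
    scale = 1 if sys.platform == "darwin" else 1024
    return peak * scale / 1024**3

memory = process_memory_gib()
if memory is not None:
    print(f"memory size: {memory:.2f} GB")
```

The value may differ from what an OS utility shows, since such tools do not all count shared or memory-mapped pages in the same way.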

In [5]:
# Please write your Python code below and execute it.  Write the memory size on a commented line.
wv_model = KeyedVectors.load_word2vec_format(path_to_model_file, binary=True)

# memory size: 0.78 GB

**1d.** Please write the instructions that generate the answers to the following questions.
* What is the size of the vocabulary of the `wv_model` model?  
* What is the dimensionality of each word vector?  
* What is the word corresponding to the vector in position 1234?  
* What are the first 10 coefficients of the word vector for the word *pyramid*?  

In [6]:
# what is the size of the vocabulary of the wv_model ? 
print(f"the size of the vocabulary of the wv_model is {len(wv_model.index_to_key)}")

# what is the dimensionality of each word vector ?
print(f"the dimensionality of each word vector is {wv_model.vector_size}")

# what is the word corresponding to the vector in position 1234? 
print(f"the word corresponding to the vector in position 1234 is {wv_model.index_to_key[1234]}")

# what are the first 10 coefficients of the word vector for the word pyramid ?
print(f"the first 10 coefficients of the word vector for the word pyramid are {wv_model['pyramid'][:10]}")


the size of the vocabulary of the wv_model is 3000000
the dimensionality of each word vector is 300
the word corresponding to the vector in position 1234 is learn
the first 10 coefficients of the word vector for the word pyramid are [ 0.00402832 -0.00260925  0.04296875  0.19433594 -0.03979492 -0.06445312
  0.42773438 -0.18359375 -0.27148438 -0.12890625]


## 2. Solving analogies using word2vec trained on Google News
In this section, you are going to use word vectors to solve analogy tasks provided with Gensim, such as "What is to France what Rome is to Italy?".  The predefined function in Gensim that evaluates a model on this task does not provide enough details, so you will need to make modifications to it.
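
To make the idea concrete, here is a minimal sketch of one common formulation of analogy solving by vector offset: the answer to "a is to b as c is to ?" is taken to be the word whose vector is closest, by cosine similarity, to `b - a + c`. The 2-D vectors are invented for illustration and are not taken from word2vec, and `solve_analogy` is a hypothetical helper, not a Gensim function.

```python
import math

# Toy 2-D vectors, invented for illustration only
vectors = {
    "Rome": (1.0, 3.0),
    "Italy": (1.0, 1.0),
    "Paris": (3.0, 3.2),
    "France": (3.0, 1.0),
    "Berlin": (5.0, 3.1),
}

def cosine(u, v):
    """Cosine similarity between two vectors given as tuples."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def solve_analogy(a, b, c, vectors):
    """Return the word closest to b - a + c, excluding the three input words."""
    target = tuple(vb - va + vc
                   for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# "What is to France what Rome is to Italy?"
print(solve_analogy("Italy", "Rome", "France", vectors))
```

The three input words are excluded from the candidates because the nearest vector to `b - a + c` is often one of the inputs themselves.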

**2a.** The analogy tasks are stored in a text file called `questions-words.txt` which is typically found in `C:\Users\YourNameHere\.conda\envs\YourEnvNameHere\Lib\site-packages\gensim\test\test_data`.  You can access it from here with Gensim as `datapath('questions-words.txt')`.  

Please create a file called `questions-words-100.txt` with the first 100 lines from the original file.  Please run the evaluation task on this file, using the [documentation of the KeyedVectors class](https://radimrehurek.com/gensim/models/keyedvectors.html), then answer the following questions:
* How many analogy tasks are there in your `questions-words-100.txt` file?
* How many analogies were solved correctly and how many incorrectly?
* What is the accuracy returned by `evaluate_word_analogies`?
* How much time did it take to solve the analogies?

In [7]:
with open(datapath('questions-words.txt'), 'r') as f:
    lines = f.readlines()

with open('questions-words-100.txt', 'w') as f:
    f.writelines(lines[:100])

In [8]:
import time

start_time = time.time()
wv_model_eval = wv_model.evaluate_word_analogies('questions-words-100.txt')
end_time = time.time()

# Counts are read from the last entry of the returned sections, the 'Total accuracy'
# aggregate; summing over every entry would count each analogy twice.
total_section = wv_model_eval[1][-1]

# Number of analogy tasks
num_analogy_tasks = len(total_section['correct']) + len(total_section['incorrect'])
print(f"There are {num_analogy_tasks} analogy tasks in the questions-words-100.txt file.")

# Correct and incorrect analogies
num_correct = len(total_section['correct'])
num_incorrect = len(total_section['incorrect'])
print(f"There are {num_correct} analogies solved correctly and {num_incorrect} solved incorrectly.")

# Accuracy returned
accuracy = wv_model_eval[0]
print(f"The accuracy returned by evaluate_word_analogies is {accuracy:.4f}")

# Time taken
print(f"It took {end_time - start_time:.2f} seconds to evaluate the analogy tasks.")

There are 99 analogy tasks in the questions-words-100.txt file.
There are 80 analogies solved correctly and 19 solved incorrectly.
The accuracy returned by evaluate_word_analogies is 0.8081
It took 3.49 seconds to evaluate the analogy tasks.


**2b.** Please answer in writing the following questions:
* What is the meaning of the first line of `questions-words-100.txt`?
* How many analogies are there in the original `questions-words.txt`?
* How much time would it take to solve the original set of analogies?

In [9]:
# What is the meaning of the first line of questions-words-100.txt?
print(f"The first line of questions-words-100.txt is {lines[0].strip()} and from the documentation we know that the first line is the section name.")

# How many analogies are there in the original questions-words.txt file?
with open(datapath('questions-words.txt'), 'r') as f:
    lines = f.readlines()

analogy_lines = [line for line in lines if not line.startswith(":")]

print(f"There are {len(analogy_lines)} analogies in the original questions-words.txt file.")

# How much time would it take to solve the original set of analogies?
# Linear extrapolation from the timed run, which covered the analogies among
# the first 100 lines (the section header is not an analogy).
num_timed = sum(1 for line in lines[:100] if not line.startswith(":"))
time_estimate = (end_time - start_time) / num_timed * len(analogy_lines)
print(f"It would take about {time_estimate:.1f} seconds to solve the original set of analogies.")



The first line of questions-words-100.txt is : capital-common-countries and from the documentation we know that the first line is the section name.
There are 19544 analogies in the original questions-words.txt file.
It would take about 689.5 seconds to solve the original set of analogies.


**2c.** The built-in function from Gensim has several weaknesses, which you will address here.  Please copy the source code of the function `evaluate_word_analogies` from the file `gensim\models\keyedvectors.py` and create here a new function which will improve the built-in one as follows.  The function will be called `my_evaluate_word_analogies` and you will also pass it the model as the first argument.  Overall, please proceed gradually and only make minimal modifications, to ensure you don't break the function.  It is important to first understand the structure of the result, `analogies_score` and `sections`.

* Modify the line where `section['incorrect']` is assembled in order to also add to each analogy the *incorrect guess* (i.e. what the model thought was the good answer, but got it wrong).

* Modify the code so that when `section['incorrect']` is assembled, you also add the *rank of the correct answer* among the candidates returned by the system (after the incorrect guess).  If the correct answer is not present at all, then code the rank as 0.
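
One way to derive both additions from a ranked candidate list is sketched below. It is an illustration under stated assumptions, not the code of Gensim: `guess_and_rank` is a hypothetical helper, the candidates are assumed to be ordered from most to least similar, and the rank is counted among the candidates that remain after the three input words are discarded.

```python
def guess_and_rank(candidates, expected, ignore):
    """Return (guess, rank) for one analogy.

    candidates: words ordered from most to least similar
    expected:   the correct answer
    ignore:     the three input words of the analogy
    guess is the best remaining candidate (None if there is none);
    rank is the 1-based position of `expected`, or 0 if it is absent.
    """
    kept = [word for word in candidates if word not in ignore]
    guess = kept[0] if kept else None
    rank = kept.index(expected) + 1 if expected in kept else 0
    return guess, rank

# Toy candidate list, invented for illustration
print(guess_and_rank(
    ["LONDON", "BRITAIN", "ENGLAND", "UK"],
    expected="ENGLAND",
    ignore={"BANGKOK", "THAILAND", "LONDON"},
))
```

With this convention an analogy is solved correctly exactly when the rank is 1; any other value belongs in the incorrect list together with the guess.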

In [10]:
import itertools
import logging
import random
from gensim import utils

logger = logging.getLogger(__name__)

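def my_evaluate_word_analogies(model, analogies, restrict_vocab=300000, case_insensitive=True):
    """Evaluate `model` on the analogy file `analogies`.

    Returns a list of section dictionaries with the keys 'section', 'correct' and
    'incorrect'; the last entry aggregates all sections under the name 'Total accuracy'.
    Entries of 'correct' are (a, b, c, expected); entries of 'incorrect' are
    (a, b, c, expected, guess, rank), with guess None and rank 0 when a word is missing.
    With case_insensitive=True, words are upper-cased before being looked up in the model.
    """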
    oov = 0
    logger.info("Evaluating word analogies for top %i words in the model on %s", restrict_vocab, analogies)
    sections, section = [], None
    quadruplets_no = 0

    with utils.open(analogies, 'rb') as fin:
        for line_no, line in enumerate(fin):
            line = utils.to_unicode(line).strip()
            if line.startswith(': '):
                if section:
                    sections.append(section)
                section = {'section': line.lstrip(': ').strip(), 'correct': [], 'incorrect': []}
            else:
                if not section:
                    raise ValueError(f"Missing section header before line #{line_no} in {analogies}")

                try:
                    if case_insensitive:
                        a, b, c, expected = [word.upper() for word in line.split()]
                    else:
                        a, b, c, expected = [word for word in line.split()]
                except ValueError:
                    logger.info(f"Skipping invalid line #{line_no} in {analogies}")
                    continue

                quadruplets_no += 1

                # Check vocabulary presence using Gensim's built-in method
                if not (model.has_index_for(a) and model.has_index_for(b) and model.has_index_for(c) and model.has_index_for(expected)):
                    oov += 1
                    section['incorrect'].append((a, b, c, expected, None, 0))  # No guess, rank = 0
                    continue

                ignore = {a, b, c}

                try:
                    sims = model.most_similar(positive=[b, c], negative=[a], topn=5, restrict_vocab=restrict_vocab)
                except KeyError as e:
                    logger.warning(f"Skipping analogy due to missing words: {e}")
                    oov += 1
                    section['incorrect'].append((a, b, c, expected, None, 0))
                    continue

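                # Scan the top-5 candidates. The loop stops at the first candidate equal
                # to `expected`, so `predicted == expected` holds whenever the expected
                # word appears anywhere in the top 5: such analogies are stored as correct.
                # Analogies stored as incorrect therefore always have rank 0, and their
                # `predicted` is the last candidate scanned.
                predicted = None
                rank = 0
                for idx, (word, _) in enumerate(sims, start=1):
                    predicted = word.upper() if case_insensitive else word
                    if predicted in ignore:
                        continue
                    if predicted == expected:
                        rank = idx
                        break

                if predicted == expected:
                    section['correct'].append((a, b, c, expected))
                else:
                    section['incorrect'].append((a, b, c, expected, predicted, rank))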

    if section:
        sections.append(section)

    total = {
        'section': 'Total accuracy',
        'correct': list(itertools.chain.from_iterable(s['correct'] for s in sections)),
        'incorrect': list(itertools.chain.from_iterable(s['incorrect'] for s in sections)),
    }

    oov_ratio = float(oov) / quadruplets_no * 100 if quadruplets_no > 0 else 0
    logger.info('Quadruplets with out-of-vocabulary words: %.1f%%', oov_ratio)

    sections.append(total)
    return sections

**2d.** Please run the `my_evaluate_word_analogies` function on `questions-words-100.txt` and then write instructions to display, from the results stored in `analogy_scores`:
* one incorrectly-solved analogy (selected at random), including also the error made by the model and the rank of the correct answer, thus adding:
  - a fifth word, which is the incorrect one found by the model
  - a sixth term, which is the integer indicating the rank (or 0)
* one correctly-solved analogy selected at random (in principle, four terms).

In [11]:
# Please write your Python code below and execute it.
sections = my_evaluate_word_analogies(wv_model,'questions-words-100.txt')
incorrect = sections[-1]['incorrect']
correct = sections[-1]['correct']

if incorrect:
    print(f"Incorrect example:{random.choice(incorrect)}")
if correct:
    print(f"Correct example:{random.choice(correct)}")

Incorrect example:('BANGKOK', 'THAILAND', 'LONDON', 'ENGLAND', 'BRITISH', 0)
Correct example:('BERLIN', 'GERMANY', 'HANOI', 'VIETNAM')


**2e.** Please write a function to compute the MRR score given a structure with correctly and incorrectly solved analogies, such as the one that is found in the results from `evaluate_word_analogies`.  The structure is not divided into categories.

The Mean Reciprocal Rank (please use the [formula here](https://en.wikipedia.org/wiki/Mean_reciprocal_rank)) gives some credit for incorrectly solved analogies, in inverse proportion to the rank of the correct answer among the candidates.  The reciprocal rank is 1 for correctly solved analogies (full credit), and 1/k for incorrectly solved ones whose correct answer appears at rank k (or 0 if it does not appear at all).
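
As a worked illustration, and assuming the convention of this lab that a rank coded as 0 contributes a reciprocal rank of 0, the score over a set $Q$ of analogies can be written as:

$$
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i},
\qquad \text{with } \frac{1}{\mathrm{rank}_i} := 0 \text{ when } \mathrm{rank}_i = 0
$$

For example, four analogies whose correct answers have ranks 1, 1, 2 and 3 give

$$
\mathrm{MRR} = \frac{1}{4}\left(1 + 1 + \frac{1}{2} + \frac{1}{3}\right) = \frac{17}{24} \approx 0.71
$$

whereas their accuracy is only $2/4 = 0.50$, since accuracy gives no credit to the two answers found at ranks 2 and 3.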

In [12]:
# Please define here the function that computes MRR from the information stored in analogy_scores
def myMRR(analogies):
    '''
    Compute the Mean Reciprocal Rank (MRR) of the model based on the analogy scores.
    For correct analogies, rank is 1 (so reciprocal is 1)
    For incorrect analogies, use the rank stored in position 5, or 0 if not found
    '''
    total_analogies = len(analogies['correct']) + len(analogies['incorrect'])
    if total_analogies == 0:
        return 0.0
    
    reciprocal_ranks = []
    
    # For correct analogies, rank is 1
    reciprocal_ranks.extend([1.0] * len(analogies['correct']))
    
    # For incorrect analogies, use stored rank
    for incorrect in analogies['incorrect']:
        rank = incorrect[5]  # Rank is stored in position 5
        reciprocal_ranks.append(1/rank if rank > 0 else 0)
    
    return sum(reciprocal_ranks) / total_analogies

# example usage:
analogy_scores = (
    0.50,  # Example accuracy score: 2 of the 4 analogies below are correct
    [
        {'section': 'Category 1', 'correct': [('king', 'man', 'queen', 'woman')], 'incorrect': [('apple', 'fruit', 'banana', 'vegetable', 'orange', 2)]},
        {'section': 'Category 2', 'correct': [('cat', 'animal', 'dog', 'pet')], 'incorrect': [('car', 'vehicle', 'bike', 'transport', 'plane', 3)]},
        {'section': 'Total accuracy', 'correct': [('king', 'man', 'queen', 'woman'), ('cat', 'animal', 'dog', 'pet')], 'incorrect': [('apple', 'fruit', 'banana', 'vegetable', 'orange', 2), ('car', 'vehicle', 'bike', 'transport', 'plane', 3)]}
    ]
)


In [13]:
# Please test your MRR function by running the following code, which displays the total number of analogy tasks, 
# the number of different categories (sections), the accuracy of the results (total number of correctly 
# solved analogies), and the MRR score of the results:
print("Total number of analogies:",  # The last dictionary is the total
      len(analogy_scores[1][-1]['correct']) + 
      len(analogy_scores[1][-1]['incorrect']))
print("Total number of categories:", len(analogy_scores[1]) - 1) # the "total" is excluded 
print(f"Overall accuracy: {analogy_scores[0]:.2f} and MRR: {myMRR(analogy_scores[1][-1]):.2f}")

Total number of analogies: 4
Total number of categories: 2
Overall accuracy: 0.50 and MRR: 0.71


**2f.** Please compute now the accuracy and MRR and the total time for the entire `questions-words.txt` file.  Is the timing compatible with your estimate from (2b)?  What do you think about the difference between accuracy and MRR? 

In [14]:
# Please write your Python code below and execute it.

# Note: this run evaluates the 100-line subset; pass datapath('questions-words.txt')
# instead to evaluate the entire file.
start_time = time.time()
sections = my_evaluate_word_analogies(wv_model,'questions-words-100.txt')
end_time = time.time()

# Accuracy, computed from the last section, which aggregates all categories
total_section = sections[-1]
accuracy = len(total_section['correct']) / (len(total_section['correct']) + len(total_section['incorrect']))
print(f"The accuracy returned by my_evaluate_word_analogies is {accuracy:.4f}")

# MRR score
mrr = myMRR(sections[-1])
print(f"The MRR score of the results is {mrr:.4f}")

# Time taken
print(f"It took {end_time - start_time:.2f} seconds to evaluate the analogy tasks.")

# compare with estimate from 2b
print(f"Estimated time from 2b: {time_estimate:.1f} seconds")

The accuracy returned by my_evaluate_word_analogies is 0.8889
The MRR score of the results is 0.8889
It took 1.27 seconds to evaluate the analogy tasks.
Estimated time from 2b: 689.5 seconds


In [15]:
# Please write your answer here.

## End of AdvNLP Lab 4
Please submit your lab report as a .ipynb file after you have fully run and checked it in Google Colab; then upload it to Moodle.
Please submit one notebook per group only and do not forget to put the last names of all team members in the filename.