![MSE Logo](https://moodle.msengineering.ch/pluginfile.php/1/core_admin/logo/0x150/1643104191/logo-mse.png)

# AdvNLP Lab 4 GRADED: Testing a pretrained word2vec model on analogy tasks

**Objectives:**  experiment with *word vectors* from word2vec: test them on analogy tasks; use *accuracy and MRR* (Mean Reciprocal Rank) scores.

**Useful documentation:** the [section on KeyedVectors in Gensim](https://radimrehurek.com/gensim/models/keyedvectors.html) and possibly the [section on word2vec](https://radimrehurek.com/gensim/models/word2vec.html).

---
#### Submission of Dave Brunner & Andrea Wey
---

## 1. Word2vec model trained on Google News
**1a.** Please install the latest version of Gensim, preferably in a Conda environment. 

In [1]:
# !pip install --upgrade gensim
# You can run the following verification:
# !pip show gensim

In [2]:
import gensim, os, random
from gensim import downloader
from gensim.test.utils import datapath
from gensim.models import KeyedVectors
from pathlib import Path



**1b.** Please download from Gensim the `word2vec-google-news-300` model, upon your first use.  Then, please write code to answer the following questions:
* Where is the model stored on your computer and what is the file name?  You can store the absolute path in a variable called `path_to_model_file`.
* What is the size of the corresponding file?  Please display the size in gigabytes with two decimals.

In [3]:
# Download the model from Gensim (needed only the first time)
# gensim.downloader.load("word2vec-google-news-300")
# No need to store the returned value (uses a lot of memory).

In [4]:
path_to_model_file = Path(gensim.downloader.base_dir) / 'word2vec-google-news-300' / 'word2vec-google-news-300.gz'
print(f'Model stored at: {path_to_model_file}')
assert path_to_model_file.exists()

file_size = path_to_model_file.stat().st_size
file_size_gb = file_size / 1024 ** 3
print(f"Size of the file: {file_size_gb:.2f} GB")

Model stored at: /Users/davebrunner/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz
Size of the file: 1.62 GB


**1c.** Please load the word2vec model as an instance of the class `KeyedVectors`, and store it in a variable called `wv_model`. 
What is, at this point, the memory size of the process corresponding to this notebook?  Simply write the value you obtain from any OS-specific utility that you like.

In [5]:
wv_model = KeyedVectors.load_word2vec_format(path_to_model_file, binary=True)
# Memory size of the notebook process after loading the model: 3.96 GB (MacBook Pro)
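
The memory figure can also be read from inside the notebook instead of an OS utility. The following is a minimal sketch, assuming a Unix-like system (the standard-library `resource` module is not available on Windows). The helper name `peak_memory_gb` is hypothetical, and it reports the *peak* resident set size of the process, which may differ somewhat from the value shown by an OS tool.

```python
import resource
import sys


def peak_memory_gb():
    """Return the peak resident set size of this process in gigabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    bytes_used = peak if sys.platform == "darwin" else peak * 1024
    return bytes_used / 1024 ** 3


print(f"Peak memory of this process: {peak_memory_gb():.2f} GB")
```

Calling the helper right after loading `wv_model` gives a number that can be compared with the one read from the OS utility.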

**1d.** Please write the instructions that generate the answers to the following questions.
* What is the size of the vocabulary of the `wv_model` model?  
* What is the dimensionality of each word vector?  
* What is the word corresponding to the vector in position 1234?  
* What are the first 10 coefficients of the word vector for the word *pyramid*?  

In [6]:
# 1. Size of the vocabulary
vocab_size = len(wv_model.index_to_key)
print(f"Size of the vocabulary: {vocab_size}")

Size of the vocabulary: 3000000


In [7]:
# 2. Dimensionality of each word vector
vector_dim = wv_model.vector_size
print(f"Dimensionality of each word vector: {vector_dim}")

Dimensionality of each word vector: 300


In [8]:
# 3. Word corresponding to the vector in position 1234
word_at_position_1234 = wv_model.index_to_key[1234]
print(f"Word at position 1234: {word_at_position_1234}")

Word at position 1234: learn


In [9]:
# 4. First 10 coefficients of the word vector for the word 'pyramid'
if 'pyramid' in wv_model:
    pyramid_vector = wv_model['pyramid']
    first_10_coefficients = pyramid_vector[:10]
    print(f"First 10 coefficients of the word vector for 'pyramid': {first_10_coefficients}")

First 10 coefficients of the word vector for 'pyramid': [ 0.00402832 -0.00260925  0.04296875  0.19433594 -0.03979492 -0.06445312
  0.42773438 -0.18359375 -0.27148438 -0.12890625]


## 2. Solving analogies using word2vec trained on Google News
In this section, you are going to use word vectors to solve analogy tasks provided with Gensim, such as "What is to France what Rome is to Italy?".  The predefined function in Gensim that evaluates a model on this task does not provide enough details, so you will need to make modifications to it.
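
The idea behind such analogy tasks can be sketched with plain vector arithmetic: the candidate answer is the word whose vector is closest, by cosine similarity, to `b - a + c`, excluding the three question words. The toy 2-dimensional vectors and the helper `solve_analogy` below are invented for illustration and are not taken from the Google News model. Gensim works on unit-normalized vectors, which this sketch does not reproduce exactly.

```python
import numpy as np

# Toy 2-dimensional vectors, invented for illustration only.
toy_vectors = {
    "Italy": np.array([1.0, 0.0]),
    "Rome": np.array([1.0, 1.0]),
    "France": np.array([2.0, 0.0]),
    "Paris": np.array([2.0, 1.0]),
    "Berlin": np.array([3.0, 1.0]),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def solve_analogy(a, b, c, vectors):
    """Return the word closest to b - a + c, excluding a, b and c."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = {w: v for w, v in vectors.items() if w not in {a, b, c}}
    return max(candidates, key=lambda w: cosine(candidates[w], target))


# "What is to France what Rome is to Italy?"
print(solve_analogy("Italy", "Rome", "France", toy_vectors))
```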

**2a.** The analogy tasks are stored in a text file called `questions-words.txt` which is typically found in `C:\Users\YourNameHere\.conda\envs\YourEnvNameHere\Lib\site-packages\gensim\test\test_data`.  You can access it from here with Gensim as `datapath('questions-words.txt')`.  

Please create a file called `questions-words-100.txt` with the first 100 lines from the original file.  Please run the evaluation task on this file, using the [documentation of the KeyedVectors class](https://radimrehurek.com/gensim/models/keyedvectors.html), then answer the following questions:
* How many analogy tasks are there in your `questions-words-100.txt` file?
* How many analogies were solved correctly and how many incorrectly?
* What is the accuracy returned by `evaluate_word_analogies`?
* How much time did it take to solve the analogies?

In [10]:
import time

# Locate the original analogy tasks file
original_file_path = datapath('questions-words.txt')

# Read the first 100 lines from the original file
with open(original_file_path, 'r', encoding='utf-8') as file:
    lines = [next(file) for _ in range(100)]

# Write these lines to a new file
new_file_path = Path(os.getcwd()) / 'questions-words-100.txt'
with open(new_file_path, 'w', encoding='utf-8') as new_file:
    new_file.writelines(lines)

# Evaluate the analogy tasks
start_time = time.time()
accuracy, results = wv_model.evaluate_word_analogies(new_file_path)
end_time = time.time()

# Extract results
total_tasks = len(results[-1]['correct']) + len(results[-1]['incorrect'])
correct_tasks = len(results[-1]['correct'])
incorrect_tasks = total_tasks - correct_tasks
accuracy_percentage = accuracy * 100
time_taken = end_time - start_time

# Print the results
print(f"Total analogy tasks: {total_tasks}")
print(f"Correctly solved analogies: {correct_tasks}")
print(f"Incorrectly solved analogies: {incorrect_tasks}")
print(f"Accuracy: {accuracy_percentage:.2f}%")
print(f"Time taken to solve analogies: {time_taken:.2f} seconds")

Total analogy tasks: 99
Correctly solved analogies: 80
Incorrectly solved analogies: 19
Accuracy: 80.81%
Time taken to solve analogies: 3.28 seconds


**2b.** Please answer in writing the following questions:
* What is the meaning of the first line of `questions-words-100.txt`?
* How many analogies are there in the original `questions-words.txt`?
* How much time would it take to solve the original set of analogies?

In [11]:
# The first line of questions-words-100.txt is ": capital-common-countries".
# According to the Gensim documentation, a line starting with ": " is a section header,
# so this line names the first category of analogies and is not an analogy itself.

with open(original_file_path, 'r', encoding='utf-8') as file:
    total_analogies = sum(1 for line in file if not line.startswith(':') and line.strip())
print(f"Total number of analogies in the original 'questions-words.txt': {total_analogies}")

# Time to solve the original set of analogies
# Assumes that the time scales linearly with the number of analogies. The 100-line file
# holds 99 analogies plus one section header; 100 is used here as an approximation.
time_to_solve_original = time_taken * total_analogies / 100
print(f'Time to solve the original set of analogies: {time_to_solve_original:.2f} seconds')

Total number of analogies in the original 'questions-words.txt': 19544
Time to solve the original set of analogies: 641.09 seconds
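
The file layout discussed above (a `: name` header line followed by four-word analogy lines) can be illustrated on a small in-memory sample. This is only a sketch: the sample text and the helper `count_analogies_per_section` are hypothetical, and the code does not read the real `questions-words.txt`.

```python
from collections import Counter

# Small made-up sample in the same layout as the analogy file.
SAMPLE = """\
: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Berlin Germany
: family
boy girl brother sister
"""


def count_analogies_per_section(text):
    """Count the analogy lines found under each ': section' header."""
    counts = Counter()
    section = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(": "):
            section = line[2:]  # header line: remember the section name
        elif section is not None:
            counts[section] += 1  # analogy line: count it for this section
    return dict(counts)


print(count_analogies_per_section(SAMPLE))
```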


**2c.** The built-in function from Gensim has several weaknesses, which you will address here.  Please copy the source code of the function `evaluate_word_analogies` from the file `gensim\models\keyedvectors.py` and create here a new function which will improve the built-in one as follows.  The function will be called `my_evaluate_word_analogies` and you will also pass it the model as the first argument.  Overall, please proceed gradually and only make minimal modifications, to ensure you don't break the function.  It is important to first understand the structure of the result, `analogies_scores` and `sections`. 

* Modify the line where `section['incorrect']` is assembled in order to also add to each analogy the *incorrect guess* (i.e. the word that the model returned as its answer, which turned out to be wrong).

* Modify the code so that when `section['incorrect']` is assembled, you also add the *rank of the correct answer* among the candidates returned by the system (after the incorrect guess).  If the correct answer is not present at all, then code the rank as 0.
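
The rank convention requested here can be sketched independently of Gensim. In the hypothetical helper `rank_of_expected` below, `candidates` stands for the ordered list of words returned by the model (best first). Ranks are 1-based, so a correct answer found in second position gets rank 2, and 0 means it was not among the candidates. Skipping the question words without counting them towards the rank is an assumption of this sketch.

```python
def rank_of_expected(candidates, expected, ignore=()):
    """Return the 1-based rank of `expected` among `candidates`, or 0 if absent.

    Words listed in `ignore` (typically the three question words) are
    skipped and do not count towards the rank.
    """
    rank = 0
    for word in candidates:
        if word in ignore:
            continue
        rank += 1
        if word == expected:
            return rank
    return 0


candidates = ["Baghdad", "Iraqi", "Iraq", "Kuwait"]  # made-up ranking
print(rank_of_expected(candidates, "Iraq", ignore={"Athens", "Greece", "Baghdad"}))
print(rank_of_expected(candidates, "Syria"))
```

In this made-up ranking, `Baghdad` is ignored, so `Iraq` is the second counted candidate; `Syria` never appears and gets rank 0.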

In [12]:
import itertools
import logging
import random
from gensim import utils

logger = logging.getLogger(__name__)


def my_evaluate_word_analogies(model, analogies, restrict_vocab=300000, case_insensitive=True):
    oov = 0
    logger.info("Evaluating word analogies for top %i words in the model on %s", restrict_vocab, analogies)
    results, section, accuracy = [], None, 0
    quadruplets_no = 0

    with utils.open(analogies, 'rb') as fin:
        for line_no, line in enumerate(fin):
            line = utils.to_unicode(line).strip()
            if line.startswith(': '):
                category_name = line.lstrip(': ')
                if section:
                    results.append(section)
                section = {'section': category_name, 'correct': [], 'incorrect': []}
            else:
                if not section:
                    raise ValueError(f"Missing section header before line #{line_no} in {analogies}")

                try:
                    if case_insensitive:
                        a, b, c, expected = [word.upper() for word in line.split()]
                    else:
                        a, b, c, expected = line.split()
                except ValueError:
                    logger.info(f"Skipping invalid line #{line_no} in {analogies}")
                    continue

                quadruplets_no += 1

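                # Check vocabulary presence using Gensim's built-in method.
                # With case_insensitive=True the words were upper-cased above, so the
                # upper-case forms (e.g. 'ATHENS') are the ones looked up in the model.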
                if not (model.has_index_for(a) and model.has_index_for(b) and model.has_index_for(
                        c) and model.has_index_for(expected)):
                    oov += 1
                    section['incorrect'].append((a, b, c, expected, None, 0))  # No guess, rank = 0
                    continue

                ignore = {a, b, c}

                try:
                    sims = model.most_similar(positive=[b, c], negative=[a], topn=5, restrict_vocab=restrict_vocab)
                except KeyError as e:
                    logger.warning(f"Skipping analogy due to missing words: {e}")
                    oov += 1
                    section['incorrect'].append((a, b, c, expected, None, 0))
                    continue

                rank = 0

                # Check whether the top candidate is the expected word
                predicted = sims[0][0].upper() if case_insensitive else sims[0][0]
                if predicted == expected:
                    section['correct'].append((a, b, c, expected))
                else:
                    # Scan the remaining candidates for the expected word.
                    # `predicted` is overwritten at each step, and `idx` counts from 0
                    # within sims[1:], so a match in second position is stored as rank 0.
                    for idx, (word, _) in enumerate(sims[1:]):
                        predicted = word.upper() if case_insensitive else word
                        if predicted in ignore:
                            continue
                        if predicted == expected:
                            rank = idx
                            break
                    section['incorrect'].append((a, b, c, expected, predicted, rank))

    results.append(section)

    total = {
        'section': 'Total accuracy',
        'correct': list(itertools.chain.from_iterable(s['correct'] for s in results)),
        'incorrect': list(itertools.chain.from_iterable(s['incorrect'] for s in results)),
    }

    oov_ratio = float(oov) / quadruplets_no * 100 if quadruplets_no > 0 else 0
    logger.info('Quadruplets with out-of-vocabulary words: %.1f%%', oov_ratio)
    accuracy = len(total['correct']) / (len(total['correct']) + len(total['incorrect']))

    results.append(total)
    return accuracy, results


analogy_scores = my_evaluate_word_analogies(wv_model, "questions-words-100.txt")

**2d.** Please run the `my_evaluate_word_analogies` function on `questions-words-100.txt` and then write instructions to display, from the results stored in `analogy_scores`:
* one incorrectly-solved analogy (selected at random), including also the error made by the model and the rank of the correct answer, thus adding:
  - a fifth word, which is the incorrect one found by the model
  - a sixth term, which is the integer indicating the rank (or 0)
* one correctly-solved analogy selected at random (in principle, four terms).

In [13]:
analogy_scores = my_evaluate_word_analogies(wv_model, "questions-words-100.txt")
incorrect_samples = analogy_scores[1][-1]['incorrect']
correct_samples = analogy_scores[1][-1]['correct']

if incorrect_samples:
    print("Incorrectly-solved analogy:", random.choice(incorrect_samples))
if correct_samples:
    print("\nCorrectly-solved analogy:", random.choice(correct_samples))

Incorrectly-solved analogy: ('ATHENS', 'GREECE', 'BAGHDAD', 'IRAQ', 'IRAQ', 0)

Correctly-solved analogy: ('BERLIN', 'GERMANY', 'MOSCOW', 'RUSSIA')


**2e.** Please write a function to compute the MRR score given a structure with correctly and incorrectly solved analogies, such as the one that is found in the results from `evaluate_word_analogies`.  The structure is not divided into categories.

The Mean Reciprocal Rank (please use the [formula here](https://en.wikipedia.org/wiki/Mean_reciprocal_rank)) gives some credit for incorrectly solved analogies, in inverse proportion to the rank of the correct answer among the candidates.  The reciprocal rank is 1 for correctly solved analogies (full credit), and 1/k for incorrectly solved ones, where k is the rank of the correct answer (or 0 if the correct answer is not among the candidates).
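
As a small worked example of this definition (a sketch with made-up ranks, not results from the model): with four analogies whose correct answers are ranked 1, 1, 2 and 0 (absent), the reciprocal ranks are 1, 1, 1/2 and 0, so the MRR is 2.5 / 4 = 0.625, whereas the accuracy is 2 / 4 = 0.5. The helper `mrr_from_ranks` is hypothetical.

```python
def mrr_from_ranks(ranks):
    """Mean reciprocal rank; a rank of 0 means 'not found' and scores 0."""
    if not ranks:
        return 0.0
    return sum(1.0 / r for r in ranks if r > 0) / len(ranks)


ranks = [1, 1, 2, 0]  # made-up ranks of the correct answers
accuracy = sum(r == 1 for r in ranks) / len(ranks)
print(f"accuracy = {accuracy:.3f}, MRR = {mrr_from_ranks(ranks):.3f}")
```

Since every correctly solved analogy contributes 1 to both scores, the MRR computed this way can never be lower than the accuracy.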

In [14]:
def myMRR(analogies):
    """Compute the Mean Reciprocal Rank over correctly and incorrectly solved analogies."""
    # Total number of analogies (the denominator of the mean)
    q = len(analogies['correct']) + len(analogies['incorrect'])
    if q == 0:
        return 0.0

    # Correctly solved analogies have rank 1, hence a reciprocal rank of 1.0 each
    reciprocal_ranks = [1.0] * len(analogies['correct'])

    # Incorrectly solved analogies contribute 1/rank; a rank of 0 means that
    # the correct answer was not found and contributes nothing
    for analogy in analogies['incorrect']:
        rank = analogy[5]  # The rank of the correct answer
        if rank != 0:
            reciprocal_ranks.append(1.0 / rank)

    # Calculate MRR
    return sum(reciprocal_ranks) / q



print("Total number of analogies:",  # The last dictionary is the total
      len(analogy_scores[1][-1]['correct']) +
      len(analogy_scores[1][-1]['incorrect']))
print("Total number of categories:", len(analogy_scores[1]) - 1)  # the "total" is excluded
print(f"Overall accuracy: {analogy_scores[0]:.2f} and MRR: {myMRR(analogy_scores[1][-1]):.2f}")

Total number of analogies: 99
Total number of categories: 1
Overall accuracy: 0.68 and MRR: 0.73


In [15]:
# Please test your MRR function by running the following code, which displays the total number of analogy tasks,
# the number of different categories (sections), the accuracy of the results (proportion of correctly
# solved analogies), and the MRR score of the results:
print("Total number of analogies:",  # The last dictionary is the total
      len(analogy_scores[1][-1]['correct']) +
      len(analogy_scores[1][-1]['incorrect']))
print("Total number of categories:", len(analogy_scores[1]) - 1)  # the "total" is excluded
print(f"Overall accuracy: {analogy_scores[0]:.2f} and MRR: {myMRR(analogy_scores[1][-1]):.2f}")

Total number of analogies: 99
Total number of categories: 1
Overall accuracy: 0.68 and MRR: 0.73


**2f.** Please compute now the accuracy and MRR and the total time for the entire `questions-words.txt` file.  Is the timing compatible with your estimate from (2b)?  What do you think about the difference between accuracy and MRR? 

In [16]:
import time

# Run the function on the entire `questions-words.txt` file
start_time = time.time()
sections = my_evaluate_word_analogies(wv_model, datapath('questions-words.txt'))
end_time = time.time()

# Calculate accuracy
total_analogies = len(sections[1][-1]['correct']) + len(sections[1][-1]['incorrect'])
accuracy = len(sections[1][-1]['correct']) / total_analogies

# Calculate MRR
mrr_score = myMRR(sections[1][-1])

# Calculate total time
total_time = end_time - start_time

# Print the results
print(f"Accuracy: {accuracy:.4f}")
print(f"Mean Reciprocal Rank (MRR): {mrr_score:.4f}")
print(f"Total time taken: {total_time:.2f} seconds")

# Average measured time per analogy, for comparison with the estimate from (2b)
estimated_time_per_analogy = total_time / total_analogies
print(f"Estimated time per analogy: {estimated_time_per_analogy:.4f} seconds")

Accuracy: 0.1823
Mean Reciprocal Rank (MRR): 0.2173
Total time taken: 111.88 seconds
Estimated time per analogy: 0.0057 seconds


## End of AdvNLP Lab 4
Please submit your lab report as a .ipynb file after you have fully run and checked it in Google Colab; then upload it to Moodle.
Please submit one notebook per group only and do not forget to put the last names of all team members in the filename.