
# Task 12: analogy evaluation

In the lecture, we touched upon *Mikolov's analogy dataset*, which was one of the first analogy evaluation datasets for word embeddings. It consists of 9+5=14 sets of word analogies (9 grammatical and 5 semantic). You can find it, for example, here: https://github.com/nicholas-leonard/word2vec/blob/master/questions-words.txt

It might be interesting to know how well our embeddings fare on each of these 14 tasks. And that will be our exercise. The steps are as follows:

1. Read in the analogy tuples from the file above, for each task separately (the format of the file is kinda self-explanatory once you open it)
2. Write a function `eval_analogy(tuples,embeddings,K)` which returns the top-K accuracy of the `embeddings` (Gensim's KeyedVectors) on `tuples`, which are the analogy 4-tuples. For instance, the tuple ("Athens","Greece","Havana","Cuba") is counted as correct if the analogy computed from the first three words yields K nearest neighbors that include "Cuba". Hope this is clear. :)
3. Run this function on the 14 tasks you read in step (1) and check whether there are any interesting differences between them

Below is the relevant embedding-loading and analogy example code from the lecture that you can reuse.

**Tip:** these do take a while to compute, so you might want to debug your code on a small sample and when happy, run the whole thing only once. I also like to use `tqdm` to get a progress bar, so I see how long I need to wait to see some output.
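
As a rough sketch of the "debug on a small sample" idea: the helper below (the name `sample_per_task` and the toy data are my own assumptions, not part of the lecture code) takes a dictionary that maps each task name to its list of 4-tuples and keeps at most `n` random tuples per task.

```python
import random

def sample_per_task(tasks, n=50, seed=0):
    """Return a copy of `tasks` with at most `n` randomly chosen tuples per task."""
    rng = random.Random(seed)  # fixed seed so debugging runs are repeatable
    small = {}
    for name, tuples in tasks.items():
        k = min(n, len(tuples))  # small tasks are kept whole
        small[name] = rng.sample(tuples, k)
    return small

# Toy stand-in for the real task dictionary (hypothetical data)
toy_tasks = {
    "capital-common-countries": [("Athens", "Greece", "Havana", "Cuba")] * 200,
    "family": [("boy", "girl", "king", "queen")] * 3,
}
small_tasks = sample_per_task(toy_tasks, n=50)
print({name: len(tuples) for name, tuples in small_tasks.items()})
# prints {'capital-common-countries': 50, 'family': 3}
```

Once the evaluation code works on the small dictionary, the same call can be repeated on the full data. Wrapping the loop over tuples in `tqdm.tqdm(...)` then gives the progress bar mentioned above.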

In [1]:
pip install gensim



In [None]:
# Fix binary incompatibility issues in Colab
!pip install --upgrade --force-reinstall numpy scipy pandas gensim

# Restart the runtime to reload libraries properly
import os
os.kill(os.getpid(), 9)



In [1]:
import gensim

In [2]:
# I found this link in the NLPL repository
# It refers to an English model trained on the Gigaword corpus of news
##!wget http://vectors.nlpl.eu/repository/20/12.zip

## Try these if the download above is too slow, I mirrored these:
!wget http://dl.turkunlp.org/TKO_7095_2023/12.zip
#!wget http://dl.turkunlp.org/TKO_7095_2023/42.zip


--2025-05-06 19:27:36--  http://dl.turkunlp.org/TKO_7095_2023/12.zip
Resolving dl.turkunlp.org (dl.turkunlp.org)... 195.148.30.23
Connecting to dl.turkunlp.org (dl.turkunlp.org)|195.148.30.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 613577258 (585M) [application/zip]
Saving to: ‘12.zip’


2025-05-06 19:28:09 (17.7 MB/s) - ‘12.zip’ saved [613577258/613577258]



In [3]:
# Somewhat awkwardly, these are numbered files and both
# .zip files contain "model.bin"
# Let's unzip and rename
# -o means "do not ask, overwrite by default"
!unzip -o 12.zip
!mv model.bin en.bin


Archive:  12.zip
  inflating: meta.json               
  inflating: model.bin               
  inflating: model.txt               
  inflating: README                  




*   Now we can load the embeddings
*   These are huge, but they are sorted by frequency, so we can easily limit ourselves to the top 100,000 words, which will be plenty for us
*   It is worth noting that we are now entering the territory of NLP models whose size is counted in gigabytes
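
To get a feeling for why limiting the vocabulary helps, here is a back-of-the-envelope estimate of the size of a dense embedding matrix. The dimensionality of 300 is an assumption on my part; on a loaded model you can check the real value with `vector_size`. The helper name `embedding_megabytes` is likewise made up for this sketch.

```python
def embedding_megabytes(n_words, dim, bytes_per_value=4):
    """Approximate size in MiB of a dense matrix with 4-byte (float32) values."""
    return n_words * dim * bytes_per_value / 1024**2

dim = 300  # ASSUMED dimensionality, not taken from the model itself
for n_words in (100_000, 1_000_000):
    size = embedding_megabytes(n_words, dim)
    print(f"{n_words:>9,} words x {dim} dims is roughly {size:.0f} MiB")
```

The estimate only covers the raw vectors; the vocabulary itself and any cached norms add to it.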



In [4]:
# This is how you load the trained embeddings
# check the documentation
# w2v embeddings are traditionally distributed in one of two formats: a text form, and a binary form
# The embeddings we downloaded above are in the binary form, so we need to indicate that when loading

from gensim.models import KeyedVectors

wv_emb_en=KeyedVectors.load_word2vec_format("en.bin", limit=100000, binary=True)


`KeyedVectors` documentation is here: https://radimrehurek.com/gensim/models/keyedvectors.html

# Basic operations with the embeddings

* The KeyedVectors object allows for all the basic operations with embeddings which we saw in the lecture


# Word analogy

* "A is to B as C is to D"
* Can be implemented as D=B-A+C, where (A,B,C) are word embeddings
* Then list words nearest to the computed embedding D
* In the library, the implementation lets us list words with "+" sign, and words with "-" sign
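
To make the arithmetic concrete, here is a minimal sketch of D=B-A+C on hand-made 2-d vectors. The toy vectors and the `analogy` helper are invented for illustration. As far as I know, the library's own implementation differs in details (it works with length-normalized vectors), so its scores will not match this sketch exactly.

```python
import math

# Hypothetical 2-d toy embeddings, made up for illustration only
toy = {
    "France": (1.0, 0.0),
    "Paris": (1.0, 1.0),
    "Sweden": (3.0, 0.0),
    "Stockholm": (3.0, 1.0),
    "Oslo": (2.0, 1.0),
}

def cosine(u, v):
    """Cosine similarity of two vectors given as tuples."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(a, b, c, vectors, topn=3):
    """Rank words by cosine similarity to D = B - A + C, skipping the input words."""
    d = tuple(vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in (a, b, c)]
    return sorted(candidates, key=lambda w: cosine(vectors[w], d), reverse=True)[:topn]

# France is to Paris as Sweden is to ...?
print(analogy("France", "Paris", "Sweden", toy))
# prints ['Stockholm', 'Oslo']
```

The input words are left out of the candidates on purpose, since otherwise one of them would often come out as its own nearest neighbor.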


In [5]:
# B     A      C
# Paris-France+Sweden= ___?
#
# i.e. France is to Paris as Sweden is to X
wv_emb_en.most_similar(positive=["Paris","Sweden"],negative=["France"])

[('Stockholm', 0.7338932752609253),
 ('Malmo', 0.5458161234855652),
 ('Helsinki', 0.5444940328598022),
 ('Goteborg', 0.5421050190925598),
 ('Swedish', 0.5309098362922668),
 ('Malmoe', 0.5198634266853333),
 ('Oslo', 0.5004472732543945),
 ('Gothenburg', 0.4957912266254425),
 ('STOCKHOLM', 0.48791587352752686),
 ('Copenhagen', 0.47769418358802795)]

In [6]:
triples=[("cow","milk","hen"),
         ("Paris","France","Helsinki"),
         ("car","wheel","airplane"),
         ("airplane","propeller","ship"),
         ("king","queen","man"),
         ("man","doctor","woman"),
         ("man","boss","woman")
         ]
for what,is_to_what,as_this_is in triples:
    # is_to_what-what+as_this_is
    to_what=wv_emb_en.most_similar(positive=[is_to_what,as_this_is],negative=[what])[0][0]
    print(f"{what} is to {is_to_what} as {as_this_is} is to: {to_what}")


cow is to milk as hen is to: sauce
Paris is to France as Helsinki is to: Finland
car is to wheel as airplane is to: rudder
airplane is to propeller as ship is to: vessel
king is to queen as man is to: woman
man is to doctor as woman is to: physician
man is to boss as woman is to: bosses


# The exercise code starts below

* I will give you a function for reading Mikolov's data, but I recommend you delete it and write your own as a further exercise
* Reading annoying file formats is an integral part of NLP

In [7]:
#Remember you always need to download the "raw" link from GitHub, or else you get an HTML with the pretty-printed data, not the data itself
!wget https://raw.githubusercontent.com/nicholas-leonard/word2vec/master/questions-words.txt

--2025-05-06 19:28:25--  https://raw.githubusercontent.com/nicholas-leonard/word2vec/master/questions-words.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 603955 (590K) [text/plain]
Saving to: ‘questions-words.txt’


2025-05-06 19:28:25 (16.7 MB/s) - ‘questions-words.txt’ saved [603955/603955]



In [8]:
tasks={} #A dictionary with taskname as key, and value will then be a list of 4-tuples with the analogy data

with open("questions-words.txt") as f:
    for line in f:
        line=line.rstrip("\n")
        if not line:
            continue #skip possible empty lines
        if line.startswith(": "): #All tasks seem to start with a line like ": task-name"
            taskname=line[2:] #get rid of ": "
            tuple_list=[] #let's make a new list for the tuples and store it into the tasks dictionary
            #then we keep filling it, until a new task starts, when a new list is created, the previous
            #of course remains in the `tasks` dictionary
            tasks[taskname]=tuple_list
        else: #not a task line, so this should be a 4-word line, with words space-separated it seems
            w1,w2,w3,w4=line.split()
            tuple_list.append((w1,w2,w3,w4)) #let's append it and we're done

print(f"We have {len(tasks)} tasks.")


We have 14 tasks.
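
Since the embeddings were loaded with a vocabulary limit, some analogy words may simply be missing, which is worth knowing before interpreting any accuracy. Below is a small sketch of a coverage check. The helper name `coverage` and the toy data are assumptions. With the real objects, the vocabulary could be passed as `wv_emb_en.key_to_index`.

```python
def coverage(tuples, vocab):
    """Fraction of analogy tuples whose four words are all in `vocab`."""
    if not tuples:
        return 0.0
    covered = sum(all(word in vocab for word in t) for t in tuples)
    return covered / len(tuples)

# Toy data for illustration (hypothetical vocabulary)
toy_vocab = {"Athens", "Greece", "Havana", "Cuba", "Oslo", "Norway"}
toy_tuples = [
    ("Athens", "Greece", "Havana", "Cuba"),
    ("Athens", "Greece", "Oslo", "Norway"),
    ("Athens", "Greece", "Tirana", "Albania"),
    ("Havana", "Cuba", "Harare", "Zimbabwe"),
]
print(f"{coverage(toy_tuples, toy_vocab):.0%} of the tuples are fully in vocabulary")
# prints 50% of the tuples are fully in vocabulary
```

A task with low coverage is effectively evaluated on a smaller and possibly easier subset, so it helps to report coverage next to accuracy.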


In [None]:
import tqdm

def eval_analogy(tuples,embeddings,K):
    #### YOUR CODE GOES HERE ########
    raise NotImplementedError("eval_analogy is left as the exercise")

### MY results are
# Task *capital-common-countries* has top-3 accuracy of 93.68%
# Task *capital-world* has top-3 accuracy of 95.97%
# Task *currency* has top-3 accuracy of 40.54%
# Task *city-in-state* has top-3 accuracy of 63.76%
# Task *family* has top-3 accuracy of 93.16%
# Task *gram1-adjective-to-adverb* has top-3 accuracy of 49.25%
# Task *gram2-opposite* has top-3 accuracy of 49.62%
# Task *gram3-comparative* has top-3 accuracy of 95.50%
# Task *gram4-superlative* has top-3 accuracy of 86.32%
# Task *gram5-present-participle* has top-3 accuracy of 83.97%
# Task *gram6-nationality-adjective* has top-3 accuracy of 95.45%
# Task *gram7-past-tense* has top-3 accuracy of 90.45%
# Task *gram8-plural* has top-3 accuracy of 89.04%
# Task *gram9-plural-verbs* has top-3 accuracy of 83.45%
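
To spot the interesting differences more easily, it can help to sort the tasks by accuracy. The sketch below simply copies the top-3 numbers from the list above into a dictionary (the variable names are my own) and prints them from hardest to easiest.

```python
# Top-3 accuracies copied from the reported results above
reported = {
    "capital-common-countries": 93.68,
    "capital-world": 95.97,
    "currency": 40.54,
    "city-in-state": 63.76,
    "family": 93.16,
    "gram1-adjective-to-adverb": 49.25,
    "gram2-opposite": 49.62,
    "gram3-comparative": 95.50,
    "gram4-superlative": 86.32,
    "gram5-present-participle": 83.97,
    "gram6-nationality-adjective": 95.45,
    "gram7-past-tense": 90.45,
    "gram8-plural": 89.04,
    "gram9-plural-verbs": 83.45,
}

# Sort ascending, so the hardest tasks come first
ranked = sorted(reported.items(), key=lambda item: item[1])
for name, accuracy in ranked:
    print(f"{accuracy:6.2f}%  {name}")
```

With these numbers, `currency` comes out as the hardest task and `capital-world` as the easiest.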

In [10]:
import tqdm  # For progress display

def evaluate_analogies(pairs_list, model, top_k):
    correct_matches = 0   # Count of correct analogies
    total_checked = 0     # Total evaluated (excluding skipped ones)

    # Go through each 4-word analogy tuple
    for word_a, word_b, word_c, expected_word in tqdm.tqdm(pairs_list):
        try:
            # Predict: word_a is to word_b as word_c is to ?, i.e. word_b - word_a + word_c
            predictions = model.most_similar(positive=[word_b, word_c], negative=[word_a], topn=top_k)

            # Extract only the predicted words from (word, similarity) tuples
            predicted_words = set(w for w, _ in predictions)

            # Check if expected answer is among the top-k predictions
            if expected_word in predicted_words:
                correct_matches += 1

            total_checked += 1

        except KeyError:
            # Skip the tuple if any word is missing from the vocabulary
            continue

    # Return accuracy as a percentage
    return correct_matches / total_checked * 100 if total_checked > 0 else 0


# Store (task name, accuracy) results
evaluation_results = []

# Number of top candidates to consider in each analogy prediction
top_k_value = 3

# Evaluate each task in the analogy set
for name, analogy_tuples in tasks.items():
    task_accuracy = evaluate_analogies(analogy_tuples, wv_emb_en, top_k_value)
    evaluation_results.append((name, task_accuracy))
    print(f"Task *{name}* has top-{top_k_value} accuracy of {task_accuracy:.2f}%")

# Print

