
# Home Exercise 2: Word Embeddings
In this second home exercise, you will use the knowledge from Tutorial 3 to perform a more systematic evaluation of embeddings based on a small analogy dataset.

In this notebook, please complete every instruction marked with 👋 ⚒: write your code in the code cell that follows the sign, or provide your analysis in the text cell that follows it.

## **Word2Vec Analogy-based Evaluation**

We first need to load the pretrained embeddings and the dataset. The dataset can be found on [GitHub](https://github.com/dgromann/cl_intro_ws2024/blob/main/exercises/HomeExercise2.txt) and will be loaded directly from there.

In [1]:
!wget https://github.com/dgromann/cl_intro_ws2024/raw/main/word2vec_embeddings.bin
!wget https://raw.githubusercontent.com/dgromann/cl_intro_ws2024/master/exercises/HomeExercise2.txt

--2024-11-24 17:06:42--  https://github.com/dgromann/cl_intro_ws2024/raw/main/word2vec_embeddings.bin
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dgromann/cl_intro_ws2024/main/word2vec_embeddings.bin [following]
--2024-11-24 17:06:42--  https://raw.githubusercontent.com/dgromann/cl_intro_ws2024/main/word2vec_embeddings.bin
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 96769269 (92M) [application/octet-stream]
Saving to: ‘word2vec_embeddings.bin’


2024-11-24 17:06:45 (128 MB/s) - ‘word2vec_embeddings.bin’ saved [96769269/96769269]


Then we need to load the model with gensim so that we can access the embeddings.

In [4]:
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format("word2vec_embeddings.bin", binary=True)   # load pre-trained Word2Vec embeddings in binary format --> .bin == binary encoded file

And we need to open the HomeExercise2.txt file that contains analogy pairs.

In [24]:
with open("HomeExercise2.txt", "r") as analogy:   # the with-block closes the file automatically
  analogy_lines = analogy.readlines()   # .readlines() returns a list of strings; each string is one line of the file


# NB: analogy:
# A : B :: C : D

To look at the first few lines, the following code can be used. The analogies are grouped by category; each category is named on a line starting with a colon (:) that precedes its analogies. The fourth and last element of each analogy line is the true result we will use to evaluate the embedding model.
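The file layout described above can be sketched as a small parser. This is only an illustration: the function name `parse_analogies` and the inline `SAMPLE` text are hypothetical stand-ins, not part of the exercise, and the real file may contain details this sketch does not anticipate.

```python
from collections import defaultdict

# Hypothetical inline sample that mirrors the layout of HomeExercise2.txt.
SAMPLE = """\
: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Berlin Germany
: family
boy girl brother sister
"""


def parse_analogies(lines):
    """Group analogy 4-tuples (a, b, c, d) by the category header preceding them."""
    categories = defaultdict(list)
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(":"):
            current = line[1:].strip()    # category header, e.g. ": family"
            continue
        words = line.split()
        if len(words) == 4 and current is not None:
            categories[current].append(tuple(words))
    return dict(categories)


parsed = parse_analogies(SAMPLE.splitlines())
for name, items in parsed.items():
    print(name, len(items), items[0])
```

Separating parsing from evaluation in this way would let the same parsed data be reused for more than one embedding model.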

In [10]:
line_no = 0
for line in analogy_lines:
  line_no += 1
  print(f"Line number {line_no} with analogy {line}")
  if line_no == 5:
    break

Line number 1 with analogy : capital-common-countries

Line number 2 with analogy Athens Greece Baghdad Iraq

Line number 3 with analogy Athens Greece Berlin Germany

Line number 4 with analogy Athens Greece Cairo Egypt

Line number 5 with analogy Athens Greece Canberra Australia



👋 ⚒ Systematically evaluate this simple word embedding model based on the entire analogy dataset. To do this:


*   Use the analogy function from Tutorial 3 to obtain 'd'
*   Compare 'd' with the true result from the `HomeExercise2.txt` file
*   Calculate the accuracy for all analogies (how many times out of all attempts did the embedding model provide the correct result)
*   Calculate the accuracy for each analogy category separately

When parsing the file, note that lines starting with a colon (:) name an analogy category and are not analogies themselves.
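To make the idea behind obtaining 'd' concrete, here is a minimal sketch of the arithmetic b - a + c followed by a cosine-similarity search. The 2-dimensional vectors in `TOY` and the function `toy_analogy` are made up for illustration; gensim's `most_similar` additionally works with length-normalized vectors, which this sketch skips.

```python
import math

# Hypothetical 2-d toy vectors; real embeddings have far more dimensions.
TOY = {
    "Athens":  [1.0, 0.0],
    "Greece":  [1.0, 1.0],
    "Baghdad": [2.0, 0.1],
    "Iraq":    [2.0, 1.1],
    "Germany": [3.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equally long vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def toy_analogy(a, b, c, vectors):
    """Return the word whose vector is closest to b - a + c, excluding a, b and c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


prediction = toy_analogy("Athens", "Greece", "Baghdad", TOY)
print(prediction)
```

The input words are excluded from the candidates because they are usually very close to the target vector themselves and would otherwise crowd out the intended answer.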


In [23]:
# Your code here:
# Example: Athens is to Greece as Baghdad is to ?
# True result from file: Iraq
# Model result: also Iraq?


def analogy(a, b, c):
  result = model.most_similar(positive=[b, c], negative=[a], topn=1)    # calculate the analogy: b - a + c = d
  return result[0][0]

# Initialize variables to track accuracy of prediction
total_attempts = 0    # tracks total number of analogies attempted
total_correct = 0   # total number of correctly predicted analogies
category_attempts = {}    # dict to track total number of attempts for each category
category_correct = {}   # dict to track total number of correct predictions for each category


with open("HomeExercise2.txt", "r") as file:
  current_category = None   # track current analogy category

  for line in file:
    line = line.strip()

    # Check if the line is a category header
    if line.startswith(":"):
      current_category = line[1:].strip()   # extract category name without colon at index [0]
      # Initialize counters for the new category
      category_attempts[current_category] = 0
      category_correct[current_category] = 0

    else:
      # Split the remaining lines into separate strings (a, b, c, d_true)
      words = line.split()
      if len(words) != 4:   # skip blank or malformed lines instead of reusing values from an earlier line
        continue
      a, b, c, d_true = words   # assign a, b, c, d_true respectively to each word in that line

      # Use the analogy function to get the predicted result
      predicted_d = analogy(a, b, c)

      # Update overall counts
      total_attempts += 1
      category_attempts[current_category] += 1

      if predicted_d == d_true:
        total_correct += 1
        category_correct[current_category] += 1

# Calculate and print overall accuracy
overall_accuracy = total_correct / total_attempts if total_attempts > 0 else 0
print(f"Overall accuracy: {overall_accuracy:.2f}")

# Calculate and print category-specific accuracies
for category, attempts in category_attempts.items():
  correct = category_correct[category]
  accuracy = correct / attempts if attempts > 0 else 0
  print(f"Accuracy for {category}: {accuracy:.2f}")


print(analogy("Athens", "Greece", "Baghdad"))   # Greece (b) - Athens (a) + Baghdad (c): the true d is "Iraq", but this model returns "Iraqi", so exact-match scoring counts it as incorrect



Overall accuracy: 0.75
Accuracy for capital-common-countries: 0.87
Accuracy for capital-world: 0.90
Accuracy for currency: 0.00
Accuracy for city-in-state: 0.79
Accuracy for family: 0.93
Accuracy for gram1-adjective-to-adverb: 0.30
Accuracy for gram2-opposite: 0.54
Accuracy for gram3-comparative: 0.91
Accuracy for gram4-superlative: 0.88
Accuracy for gram5-present-participle: 0.78
Accuracy for gram6-nationality-adjective: 0.96
Accuracy for gram7-past-tense: 0.69
Accuracy for gram8-plural: 0.87
Accuracy for gram9-plural-verbs: 0.68
Iraqi


## **Comparison: GloVe Analogy-based Evaluation**

The next step is to compare this very small word2vec embedding model with a different model available in gensim that is also small but more powerful.

All models and corpora available in gensim can be found [here](https://github.com/piskvorky/gensim-data).

Since this model is considerably bigger than the tiny word2vec model, it takes some time to load when you run the following code cell.

In [None]:
import gensim.downloader as api
from gensim.models import KeyedVectors

model_glove = api.load("glove-wiki-gigaword-100")
print(type(model_glove))

The model can then be used in exactly the same way as the word2vec model, since gensim standardizes model access.

In [None]:
model_glove["bread"]
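Because both models share the same interface, the evaluation can be written once as a function that takes the model as an argument. The sketch below is one possible shape, not the required solution: `evaluate` and `StubModel` are hypothetical names, and the stub only imitates the two `KeyedVectors` features used here (`most_similar` and the `in` operator). The `lowercase` switch rests on the assumption that the GloVe vocabulary is lowercase, as the lowercase country names used later in this notebook suggest. Analogies whose input words are missing from the vocabulary are counted as skipped rather than as attempts, which changes the denominator of the accuracy.

```python
class StubModel:
    """Hypothetical stand-in that imitates the parts of KeyedVectors used below."""

    def __init__(self, answers):
        self._answers = answers   # maps (a, b, c) -> predicted d
        self._vocab = {w for key in answers for w in key} | set(answers.values())

    def __contains__(self, word):
        return word in self._vocab

    def most_similar(self, positive, negative, topn=1):
        b, c = positive
        a = negative[0]
        return [(self._answers[(a, b, c)], 1.0)]


def evaluate(model, lines, lowercase=False):
    """Return (overall_accuracy, per_category_accuracy, skipped_count)."""
    attempts, correct, skipped = {}, {}, 0
    category = None
    for raw in lines:
        line = raw.strip()
        if line.startswith(":"):
            category = line[1:].strip()
            attempts.setdefault(category, 0)
            correct.setdefault(category, 0)
            continue
        words = line.lower().split() if lowercase else line.split()
        if len(words) != 4 or category is None:
            continue                  # blank or malformed line
        a, b, c, d_true = words
        if not all(w in model for w in (a, b, c)):
            skipped += 1              # an input word is out of vocabulary
            continue
        predicted = model.most_similar(positive=[b, c], negative=[a], topn=1)[0][0]
        attempts[category] += 1
        correct[category] += int(predicted == d_true)
    total = sum(attempts.values())
    overall = sum(correct.values()) / total if total else 0.0
    per_category = {k: (correct[k] / n if n else 0.0) for k, n in attempts.items()}
    return overall, per_category, skipped


# Usage with made-up lines and a stub whose "predictions" are placeholders.
lines = [
    ": capital-common-countries",
    "Athens Greece Baghdad Iraq",
    "Athens Greece Berlin Germany",
    "Athens Greece Oslo Norway",
]
stub = StubModel({
    ("athens", "greece", "baghdad"): "iraq",
    ("athens", "greece", "berlin"): "austria",
})
overall, per_category, skipped = evaluate(stub, lines, lowercase=True)
print(overall, per_category, skipped)
```

In the notebook, the same function could be called with `model` and `model_glove` on the lines of `HomeExercise2.txt`; reporting the skipped count alongside the accuracy keeps the two results comparable.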

👋 ⚒  Run the same systematic analysis for this gensim model as for the word2vec model above. Which model performs better overall and in specific categories?

In [None]:
# Your code here

## **Visual Comparison**

As a final step, use the visualization from Tutorial 3 to plot the embeddings of the following words for the two models.

👋 ⚒  ❓ Do the clusters (groupings of embeddings) in the GloVe visualization differ substantially from the clusters in the word2vec visualization from Tutorial 3?

In [None]:
import numpy as np

from sklearn.decomposition import PCA
from matplotlib import pyplot as plt

def display_pca_scatterplot(model, words):

    word_vectors = np.array([model[w] for w in words])

    twodim = PCA().fit_transform(word_vectors)[:,:2]

    plt.figure(figsize=(6,6))
    plt.scatter(twodim[:,0], twodim[:,1], edgecolors='k', c='r')
    for word, (x,y) in zip(words, twodim):
        plt.text(x+0.05, y+0.05, word)

In [None]:
display_pca_scatterplot(model_glove,
                        ['coffee', 'tea', 'beer', 'wine', 'water',
                         'hamburger', 'pizza',  'sushi', 'meatballs',
                         'dog', 'horse', 'cat', 'monkey', 'parrot', 'lizard',
                         'france', 'germany', 'hungary',
                         'school', 'college', 'university', 'institute'])

**Provide your answer to the question on the clusters here.**

## **Bias in Embeddings**

Language models and embedding models tend to reflect the bias that is present in the textual data they were trained on. This can be analyzed with embeddings by explicitly testing biased analogies.

For instance, man is to doctor as woman is to ?

The bias here is that professions tend to be assigned a specific gender, e.g. men are doctors and women are nurses.

The same is true for cultures and cultural bias, e.g. Bratwurst or Sauerkraut and Germany.



In [None]:
result1 = model_glove.most_similar(positive=["doctor", "woman"], negative=["man"], topn=3)
print(f"man is to doctor as woman is to {result1}")
result2 = model_glove.most_similar(positive=["bratwurst", "france"], negative=["germany"], topn=3)
print(f"Germany is to Bratwurst as France is to {result2}")
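The two queries above follow the same pattern, so a small helper could run one analogy on several models at once. This is a sketch under assumptions: `compare_models` and `FakeModel` are hypothetical names, and the words and scores returned by the fake model are placeholders, not real model output. The helper relies on `most_similar` raising a `KeyError` for a word that is not in the vocabulary.

```python
def compare_models(models, a, b, c, topn=3):
    """Query 'a is to b as c is to ?' on each model and keep only the words."""
    results = {}
    for name, m in models.items():
        try:
            hits = m.most_similar(positive=[b, c], negative=[a], topn=topn)
            results[name] = [word for word, _score in hits]
        except KeyError:
            results[name] = None   # at least one word is missing from this vocabulary
    return results


class FakeModel:
    """Hypothetical stand-in for a gensim model, used only to run the sketch."""

    def __init__(self, table):
        self.table = table         # maps (a, b, c) -> list of (word, score)

    def most_similar(self, positive, negative, topn=3):
        key = (negative[0], positive[0], positive[1])
        return self.table[key][:topn]   # raises KeyError for unknown queries


models = {
    "glove": FakeModel({
        ("man", "doctor", "woman"): [("placeholder_1", 0.9), ("placeholder_2", 0.8)],
    }),
    "word2vec": FakeModel({}),
}
out = compare_models(models, "man", "doctor", "woman", topn=2)
print(out)
```

With the real models, the call might look like `compare_models({"word2vec": model, "glove": model_glove}, "man", "doctor", "woman")`. A `None` entry then signals that the analogy needs different wording or casing for that model.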

👋 ⚒ Try to come up with two biased analogies yourself and test whether the GloVe and word2vec models suffer from this type of bias. Please try to be creative and do not just change woman to girl and man to boy or something similar.

In [None]:
# Test your biased analogies on both models here