<a href="https://colab.research.google.com/github/fubotz/MCMLR_2024W/blob/main/Bonus_Exercise_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Bonus Exercises 1: Aligning Multilingual Embedding Spaces**



This notebook is the first bonus exercise for the lecture Multilingual and Crosslingual Methods and Language Resources (2024W 340168-1). Each successfully completed bonus exercise earns up to three points, which are added to the points of the final exam. The tasks to be completed in this notebook are marked with 👋 ⚒.



In this notebook, you will perform and evaluate a supervised method for aligning the embedding spaces of two languages. The examples in the notebook use the language pair English-German, but feel free to change this pair to languages of your choice from the available embeddings and dictionaries (see below).

-----------
## **Preparing the Embeddings and Data**

In this notebook, we will use fastText embeddings, a character n-gram based extension of the word2vec skip-gram method. Details on the method can be found in the [original publication](https://aclanthology.org/Q17-1010.pdf) and on [this website](https://fasttext.cc/).

Pretrained fastText embeddings are available in [157 languages](https://fasttext.cc/docs/en/crawl-vectors.html). The following code cell loads the fastText embeddings for English and German.

👋 ⚒ Please change the following download command if you wish to align languages other than English and German.

Before you decide on a final language pair, please make sure that:
1.   There are pretrained embeddings for this language (see [here](https://fasttext.cc/docs/en/crawl-vectors.html))
2.   There is a bilingual word list available (see the [MUSE GitHub](https://github.com/facebookresearch/MUSE/tree/main) section "Ground-truth bilingual dictionaries")

If the embeddings are available, change the two-letter ISO code in `cc.en.300.vec.gz` and `cc.de.300.vec.gz` to the language(s) of your choice.
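
The two download URLs differ only in the language code, so they can be generated from it. The following is a minimal sketch, assuming that the URL pattern of the two `wget` commands below also holds for other languages; the helper name `fasttext_url` is hypothetical and not part of fastText.

```python
# Hypothetical helper: builds the download URL of a pretrained fastText
# crawl-vector file, assuming the same pattern as the wget commands in this notebook.
BASE_URL = "https://dl.fbaipublicfiles.com/fasttext/vectors-crawl"

def fasttext_url(lang_code: str) -> str:
    """Return the assumed download URL for a two-letter language code."""
    return f"{BASE_URL}/cc.{lang_code}.300.vec.gz"

for code in ("en", "de"):
    print(fasttext_url(code))
```

Before relying on a generated URL, it is worth checking that the language is actually listed on the fastText page linked above.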

In [2]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz        # English
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.de.300.vec.gz        # German

--2025-01-04 16:14:30--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 3.163.24.51, 3.163.24.87, 3.163.24.72, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|3.163.24.51|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1325960915 (1.2G) [binary/octet-stream]
Saving to: ‘cc.en.300.vec.gz’


2025-01-04 16:14:39 (141 MB/s) - ‘cc.en.300.vec.gz’ saved [1325960915/1325960915]

--2025-01-04 16:14:39--  https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.de.300.vec.gz
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 3.163.24.51, 3.163.24.87, 3.163.24.72, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|3.163.24.51|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1278030050 (1.2G) [binary/octet-stream]
Saving to: ‘cc.de.300.vec.gz’


2025-01-04 16:15:11 (38.0 MB/s) - ‘cc.de.300.vec.gz’ saved [1278030050/127803

### Loading the Embeddings

As a next step, we will unzip and load the embeddings. For this alignment task, we will only use the 100,000 most frequent words of each language to speed up processing. This cutoff also depends on the length of the available bilingual word lists.
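
To make the file format concrete: a `.vec` file is plain text whose first line gives the vocabulary size and the dimensionality, followed by one word and its vector per line. The sketch below builds a tiny in-memory file with invented words and values (not real fastText data) and reads its first `top_n` entries, similar to the loader in this notebook. The names `toy_text`, `toy_file` and `load_top_n` are made up for this illustration.

```python
import gzip
import io

import numpy as np

# Toy stand-in for a fastText .vec.gz file; the words and values are invented.
toy_text = "3 4\nthe 1 0 0 0\nand 0 2 0 0\nof 0 0 3 4\n"
toy_file = io.BytesIO(gzip.compress(toy_text.encode("utf-8")))

def load_top_n(fileobj, top_n):
    """Read the header, then keep the first top_n unit-normalized vectors."""
    embeddings = {}
    with gzip.open(fileobj, "rt", encoding="utf-8") as f:
        vocab_size, dim = map(int, f.readline().split())
        for i, line in enumerate(f):
            if i >= top_n:
                break  # stop early instead of scanning the rest of the file
            tokens = line.rstrip().split(" ")
            vec = np.array(tokens[1:], dtype=np.float32)
            embeddings[tokens[0]] = vec / np.linalg.norm(vec)
    return embeddings, dim

toy_embeddings, toy_dim = load_top_n(toy_file, top_n=2)
print(list(toy_embeddings), toy_dim)
```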

In [3]:
import gzip
import numpy as np

def load_fasttext_embeddings(file_path, top_n):
    embeddings = {}
    with gzip.open(file_path, 'rb') as f:
        for i, line in enumerate(f):
            if i == 0:
                continue  # Line 0 is a header line (vocabulary size and dimension)
            if i > top_n:
                break  # Stop reading once the top_n most frequent words are loaded
            tokens = line.decode('utf-8').strip().split(' ')
            word = tokens[0]
            vector = np.array(tokens[1:], dtype=np.float32)
            vector = vector / np.linalg.norm(vector)
            embeddings[word] = vector
    return embeddings

# Load the embeddings of the 100,000 most frequent English and German words
# FastText sorts the embeddings by decreasing order of word frequency by default
en_embeddings = load_fasttext_embeddings('cc.en.300.vec.gz', 100000)
de_embeddings = load_fasttext_embeddings('cc.de.300.vec.gz', 100000)

print(f"Loaded {len(en_embeddings)} English embeddings")
print(f"Loaded {len(de_embeddings)} German embeddings")

print("Top 10 English words:", list(en_embeddings.keys())[:10])     # NB: fastText embeddings are sorted by frequency
print("Top 10 German words:", list(de_embeddings.keys())[:10])

Loaded 100000 English embeddings
Loaded 100000 German embeddings
Top 10 English words: [',', 'the', '.', 'and', 'to', 'of', 'a', '</s>', 'in', 'is']
Top 10 German words: [',', '.', '</s>', 'und', 'der', ':', 'die', '"', ')', '(']


Let us explore the format of the downloaded and loaded embeddings.

In [4]:
print(f'The loaded embeddings represent a {type(en_embeddings)} datatype.\n')       # fasttext_embeddings represented as dict {key = token string : value = np.array 300 floats} (1 list = 1D)
print(f'Each entry represents the word and the related embedding.\n')

print(f'We can query the word as a key and obtain the embedding, e.g. for good the embedding is: \n')
print(f'{en_embeddings["good"]}.\n')

print(f'The dimensionality of these embeddings corresponds to {len(en_embeddings["good"])}.')

The loaded embeddings represent a <class 'dict'> datatype.

Each entry represents the word and the related embedding.

We can query the word as a key and obtain the embedding, e.g. for good the embedding is: 

[-0.08404064 -0.05785208  0.00155124  0.1233691  -0.05985956  0.00565746
  0.11506542 -0.01505614  0.01587738 -0.00118624 -0.0886031   0.02126109
  0.00912493  0.00419747  0.01450865  0.0062962   0.07829193 -0.01815862
 -0.0549321  -0.02126109  0.01076742  0.07500695  0.01359615  0.00821244
  0.00638745 -0.05867332  0.03056853 -0.01916236  0.0617758   0.0275573
  0.06569952 -0.05192087 -0.03987596  0.00583996  0.04005846  0.05520585
 -0.00556621 -0.11187168 -0.03221102 -0.02463732 -0.01879736  0.0068437
 -0.0062962   0.03312351 -0.03020353  0.05292461  0.00757369 -0.05785208
 -0.05274212  0.00994618 -0.08440564  0.0142349  -0.03722973  0.00611371
 -0.05812582  0.05365461  0.06579077 -0.04918339 -0.13377152 -0.03695598
 -0.02290358 -0.04516842 -0.04763215 -0.06250579  0.04261344  

### Downloading and Loading the Bilingual Word List

To perform this alignment, we will use a bilingual word list that is provided by the Multilingual Unsupervised and Supervised Embeddings (MUSE) project (see [here](https://github.com/facebookresearch/MUSE/tree/main) for all languages).

👋 ⚒ Please change the following downloading command to the language pair of your choice (as long as available on MUSE).


In [5]:
!wget https://dl.fbaipublicfiles.com/arrival/dictionaries/en-de.txt

--2025-01-04 16:16:35--  https://dl.fbaipublicfiles.com/arrival/dictionaries/en-de.txt
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 3.163.189.108, 3.163.189.51, 3.163.189.96, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|3.163.189.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1742131 (1.7M) [text/x-c++]
Saving to: ‘en-de.txt’


2025-01-04 16:16:35 (36.7 MB/s) - ‘en-de.txt’ saved [1742131/1742131]



### Creating a bilingual word list

As a next step, we will create a bilingual word list from the downloaded text file.

👋 ⚒ Create a list of tuples `[(en_word1, de_word1), (en_word2, de_word2),...]` from the downloaded text file in the following code cell. To complete this task, please complete the provided function `load_bilingual_word_list` where it says `Your code here`.

For English-German, the first ten tuples of the list look like this:

```
[('the', 'die'), ('the', 'der'), ('the', 'dem'), ('the', 'den'), ('the', 'das'), ('and', 'sowie'), ('and', 'und'), ('was', 'war'), ('was', 'wurde'), ('for', 'für')]
```



In [6]:
def load_bilingual_word_list(file_path: str) -> list[tuple[str, str]]:
    """
    Create a list of tuples that contain word translations.

    Parameters:
        file_path (str): Path to the text file with one bilingual word pair per line.

    Returns:
        list of tuple: A list of tuples, each containing one bilingual word pair (en_word, de_word).
    """
    bilingual_dict = []
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            # Skip empty lines
            if not line.strip():
                continue
            # Split the line by whitespace
            parts = line.strip().split()
            if len(parts) == 2:
                en_word, de_word = parts
                bilingual_dict.append((en_word, de_word))
    return bilingual_dict


# Load English-German word pairs
en_de_pairs = load_bilingual_word_list('en-de.txt')

print(en_de_pairs[:10])

[('the', 'die'), ('the', 'der'), ('the', 'dem'), ('the', 'den'), ('the', 'das'), ('and', 'sowie'), ('and', 'und'), ('was', 'war'), ('was', 'wurde'), ('for', 'für')]


### Getting the Embeddings for our Word List

As a next step, we need to see which words from the word list have a vector representation in the embedding space for both languages and create a list of corresponding embeddings for both languages.


In [7]:
import numpy as np

def extract_word_embeddings(
    bilingual_pairs: list[tuple[str, str]],
    en_embeddings: dict[str, np.ndarray],
    de_embeddings: dict[str, np.ndarray]
) -> tuple[np.ndarray, np.ndarray]:
    """
    Function to create a list of word embeddings that is parallel to a bilingual list of words.

    Parameters:
        Bilingual list of words, embeddings in the first language, embeddings in the second language.

    Returns:
        Two numpy arrays of embeddings that correspond to the bilingual word list.
    """
    en_vecs = []
    de_vecs = []

    for en_word, de_word in bilingual_pairs:
        if en_word in en_embeddings and de_word in de_embeddings:
            en_vecs.append(en_embeddings[en_word])
            de_vecs.append(de_embeddings[de_word])

    # Convert lists to numpy arrays
    en_vecs = np.array(en_vecs)
    de_vecs = np.array(de_vecs)

    return en_vecs, de_vecs

# Extract English and German embeddings for the bilingual lexicon
en_vecs, de_vecs = extract_word_embeddings(en_de_pairs, en_embeddings, de_embeddings)

print(f"Extracted {en_vecs.shape[0]} aligned word vectors.")

Extracted 22546 aligned word vectors.


-----------
## **Embedding Alignment**

We will now use the dictionary and embeddings to align the two vector spaces. The English vector space will be aligned to the German vector space using the [Procrustes](https://en.wikipedia.org/wiki/Orthogonal_Procrustes_problem) alignment method.

Given two matrices, Procrustes finds an orthogonal matrix which most closely maps one input matrix to the other. As a first step, we need to compute this orthogonal transformation matrix.  
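
Written out, the problem is $W^* = \arg\min_{W^\top W = I} \lVert XW - Y \rVert_F$, and its solution is $W^* = UV^\top$, where $U \Sigma V^\top$ is the singular value decomposition of $X^\top Y$. The following sketch checks this on invented toy data: it rotates random vectors with a known orthogonal matrix `Q` and verifies that the SVD recipe recovers it. All names (`W_toy`, `Q`, the toy sizes) are chosen for this illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (invented): 50 unit-length "source" vectors in 5 dimensions.
X = rng.normal(size=(50, 5))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Build a known orthogonal matrix via QR and use it to create the "target" side.
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))
Y = X @ Q

# Procrustes solution: SVD of X^T Y, then W = U V^T.
U, _, Vt = np.linalg.svd(X.T @ Y)
W_toy = U @ Vt

print(np.allclose(W_toy.T @ W_toy, np.eye(5)))  # checks that W_toy is orthogonal
print(np.allclose(W_toy, Q))                    # checks that the known map is recovered
```

With real embeddings there is no exact rotation between the two languages, so the learned matrix is only the best orthogonal approximation on the dictionary pairs.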



In [8]:
def orthogonal_procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """
    Function to perform orthogonal Procrustes alignment to learn a mapping from X to Y.

    Parameters:
        X (numpy array): Source language word embeddings (English).
        Y (numpy array): Target language word embeddings (German).

    Returns:
        W (numpy array): Orthogonal transformation matrix.
    """
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)

    # Compute matrix product of X^T and Y
    M = np.dot(X.T, Y)

    # Perform Singular Value Decomposition (SVD) on the matrix M
    U, _, Vt = np.linalg.svd(M)

    # Compute the orthogonal transformation matrix W
    W = np.dot(U, Vt)

    return W

W = orthogonal_procrustes(en_vecs, de_vecs)

print("Orthogonal mapping matrix learned.")

Orthogonal mapping matrix learned.


In a second step, the obtained matrix is applied to the English embeddings to map the English vector space into the German vector space. Here we can transform the entire vector space of 100,000 embeddings.
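
Because the matrix is orthogonal, multiplying by it only rotates (or reflects) the space: vector lengths and cosine similarities between English words stay the same. The check below illustrates this on invented data, using a random orthogonal matrix `W_demo` as a stand-in for the learned one and mapping all vectors at once with a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a learned orthogonal map (random, for illustration only).
W_demo, _ = np.linalg.qr(rng.normal(size=(4, 4)))

# Invented unit-length embeddings for three placeholder words.
demo_embeddings = {w: v / np.linalg.norm(v)
                   for w, v in zip(["a", "b", "c"], rng.normal(size=(3, 4)))}

# Map all vectors at once: stack them into a matrix and multiply by W.
demo_words = list(demo_embeddings)
E = np.stack([demo_embeddings[w] for w in demo_words])
E_mapped = E @ W_demo

# Lengths and pairwise cosine similarities are unchanged by the mapping.
print(np.allclose(np.linalg.norm(E_mapped, axis=1), 1.0))
print(np.allclose(E @ E.T, E_mapped @ E_mapped.T))
```

One consequence is that re-normalizing the mapped vectors, as `apply_mapping` does, should only remove floating-point drift.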

In [9]:
def apply_mapping(embeddings: dict[str, np.ndarray], W: np.ndarray) -> dict[str, np.ndarray]:
    """
    Apply the learned orthogonal mapping to the source language embeddings.

    Parameters:
        embeddings (dict): Source language embeddings (English)
        W (numpy array): Orthogonal transformation matrix

    Returns:
        mapped_embeddings (dict): Transformed embeddings
    """
    mapped_embeddings = {}
    for word, vec in embeddings.items():
        mapped_vec = np.dot(vec, W)
        # Normalize the mapped vector
        mapped_vec = mapped_vec / np.linalg.norm(mapped_vec)
        mapped_embeddings[word] = mapped_vec
    return mapped_embeddings

aligned_en_embeddings = apply_mapping(en_embeddings, W)

print(f"Aligned {len(aligned_en_embeddings)} English embeddings into the German space.")        # NB: Why en -> de? Because the German embeddings need to remain unchanged? -> better comparison across languages?

Aligned 100000 English embeddings into the German space.


-----------
## **Evaluation**

In this part, you will explore two different tasks for evaluating the final vector space:


1.   Word Translation
2.   Cross-Lingual Analogy Completion

### Word Translation

We will now use the bilingual word list downloaded from MUSE to evaluate the ability of our newly created aligned embedding space to translate words from English to German.

A function that takes an English word as input and outputs its *k* nearest neighbors (KNN) in the German vector space is already provided for your convenience.
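
For unit-length vectors, cosine similarity is just a dot product, so the German vectors can be stacked into one matrix once and reused for every query. The sketch below shows this idea on invented 2-D vectors; `build_index` and `nearest_words` are hypothetical helpers and not the provided `get_nn`.

```python
import numpy as np

def build_index(embeddings):
    """Stack a word -> vector dict into (words, row-normalized matrix)."""
    words = list(embeddings)
    matrix = np.stack([embeddings[w] for w in words]).astype(np.float32)
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    return words, matrix

def nearest_words(query_vec, words, matrix, top_k):
    """Return the top_k words by cosine similarity to query_vec."""
    query_vec = query_vec / np.linalg.norm(query_vec)
    similarities = matrix @ query_vec  # dot product equals cosine for unit vectors
    order = np.argsort(-similarities)[:top_k]
    return [words[i] for i in order]

# Invented 2-D vectors, for illustration only.
toy_de = {
    "Hund": np.array([1.0, 0.1]),
    "Katze": np.array([0.8, 0.6]),
    "Haus": np.array([0.0, 1.0]),
}
toy_words, toy_matrix = build_index(toy_de)
toy_neighbors = nearest_words(np.array([1.0, 0.0]), toy_words, toy_matrix, top_k=2)
print(toy_neighbors)
```

Building the matrix once matters when many words are queried in a loop, since converting 100,000 vectors to an array on every call dominates the runtime.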

In [10]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def get_nn(word, aligned_en_embeddings, de_embeddings, top_k):
    en_vec = aligned_en_embeddings[word]
    de_words = list(de_embeddings.keys())
    de_vecs = np.array(list(de_embeddings.values()))

    # Compute cosine similarity between the English word vector and all German word vectors (+normalize vectors)
    en_vec = en_vec / np.linalg.norm(en_vec)
    de_vecs_norm = de_vecs / np.linalg.norm(de_vecs, axis=1, keepdims=True)
    similarities = cosine_similarity([en_vec], de_vecs_norm).flatten()

    # Get top_k most similar German words
    nearest_idxs = similarities.argsort()[-top_k:][::-1]
    nearest_words = [de_words[i] for i in nearest_idxs]

    return nearest_words

en_word = 'the'
nearest_neighbors = get_nn(en_word, aligned_en_embeddings, de_embeddings, 5)
print(f"Nearest neighbors of '{en_word}': {nearest_neighbors}")

Nearest neighbors of 'the': ['der', 'die', 'den', 'dem', 'besagten']


👋 ⚒ Use the already downloaded bilingual word list to evaluate the ability of our aligned vector space to translate from English to German. The output of this task should be the **accuracy** calculated on **1000 words** from the word list, i.e., for how many of the first 1000 English words the German translation from the MUSE word list appears among the five nearest German neighbors.

Use the provided function `get_nn` to obtain the *k* nearest words in the vector space in German, given an English input word.


In [11]:
# Your code here:
def evaluate_translation_accuracy(bilingual_pairs, aligned_en_embeddings, de_embeddings, top_k=5, num_words=1000, print_limit=20):
    """
    Evaluate the accuracy of the aligned vector space for translation.

    Parameters:
        bilingual_pairs (list of tuple): List of bilingual word pairs (English, German).
        aligned_en_embeddings (dict): Aligned English word embeddings.
        de_embeddings (dict): German word embeddings.
        top_k (int): Number of nearest neighbors to consider for translation.
        num_words (int): Number of English words to evaluate.
        print_limit (int): Maximum number of words to print nearest neighbors for.

    Returns:
        tuple: (accuracy, correct_translations, total_translations, skipped_pairs).
    """
    correct_translations = 0        # Increases when correct german translation (taken from the bilingual MUSE word list) is found among k=5 nn of english embedding
    total_translations = 0          # Tracks total number of bilingual MUSE pairs evaluated (both eng and ger words taken from MUSE have valid embeddings in embedding space)
    printed = 0                     # Counter for printed nearest neighbors
    skipped_pairs = 0               # Counter for skipped pairs due to missing embeddings

    # Iterate over the first "num_words" bilingual pairs; here: 1000
    for en_word, de_word in bilingual_pairs[:num_words]:
        if en_word in aligned_en_embeddings and de_word in de_embeddings:
            # Get the top-k nearest German neighbors
            nearest_neighbors = get_nn(en_word, aligned_en_embeddings, de_embeddings, top_k)

            # Print only the first `print_limit` English words and their neighbors
            if printed < print_limit:
                print(f"English word: '{en_word}' -> Nearest German neighbors: {nearest_neighbors}")
                printed += 1

            # Check if the correct German word is among the top-k nearest neighbors
            if de_word in nearest_neighbors:
                correct_translations += 1
            total_translations += 1
        else:
            skipped_pairs += 1

    # Calculate accuracy
    accuracy = correct_translations / total_translations if total_translations > 0 else 0

    return accuracy, correct_translations, total_translations, skipped_pairs

# Evaluate translation accuracy using the first 1000 words from the bilingual word list
accuracy, correct_translations, total_translations, skipped_pairs = evaluate_translation_accuracy(
    bilingual_pairs=en_de_pairs,
    aligned_en_embeddings=aligned_en_embeddings,
    de_embeddings=de_embeddings,
    top_k=5,
    num_words=1000,
    print_limit=20      # Limited to print only the first 20 words
)

print(f"Accuracy: {accuracy * 100:.2f}% ({correct_translations}/{total_translations})")
print(f"Skipped pairs: {skipped_pairs}")

English word: 'the' -> Nearest German neighbors: ['der', 'die', 'den', 'dem', 'besagten']
English word: 'the' -> Nearest German neighbors: ['der', 'die', 'den', 'dem', 'besagten']
English word: 'the' -> Nearest German neighbors: ['der', 'die', 'den', 'dem', 'besagten']
English word: 'the' -> Nearest German neighbors: ['der', 'die', 'den', 'dem', 'besagten']
English word: 'the' -> Nearest German neighbors: ['der', 'die', 'den', 'dem', 'besagten']
English word: 'and' -> Nearest German neighbors: ['und', 'sowie', 'udn', 'aber', 'sodass']
English word: 'and' -> Nearest German neighbors: ['und', 'sowie', 'udn', 'aber', 'sodass']
English word: 'was' -> Nearest German neighbors: ['war', 'wurde', 'hatte', 'damals', 'waren']
English word: 'was' -> Nearest German neighbors: ['war', 'wurde', 'hatte', 'damals', 'waren']
English word: 'for' -> Nearest German neighbors: ['für', 'Für', 'vor', 'dafür', 'bei']
English word: 'that' -> Nearest German neighbors: ['dass', 'tatsächlich', 'glaube', 'aber', '

### Explanation of the output:

1. The same English word might appear multiple times because of duplicate entries in the bilingual dictionary of "MUSE" (e.g.: ('player', 'spieler'), ('player', 'Mitspieler'), ('player', 'Spieler')).

2. The nearest neighbors are identical for every repetition of an English word because both the aligned English embedding and the German embedding space are static, so the same query always returns the same result.
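
One possible way to avoid counting the same English word several times is to group all reference translations per word and count a word as correct if any of them is retrieved. The sketch below shows this alternative on invented pairs with a stub predictor; `group_translations`, `precision_at_k` and `toy_predict` are hypothetical names, and the resulting number is not the same metric as the pair-based accuracy printed above.

```python
from collections import defaultdict

def group_translations(pairs):
    """Map each source word to the set of all its reference translations."""
    gold = defaultdict(set)
    for src, tgt in pairs:
        gold[src].add(tgt)
    return dict(gold)

def precision_at_k(gold, predict, top_k):
    """Share of source words with at least one reference among the top_k predictions."""
    hits = sum(1 for src, refs in gold.items() if refs & set(predict(src, top_k)))
    return hits / len(gold) if gold else 0.0

# Invented pairs and a stub predictor, for illustration only.
toy_pairs = [("the", "die"), ("the", "der"), ("and", "und"), ("for", "für")]
toy_predictions = {"the": ["der", "die"], "and": ["aber", "und"], "for": ["vor", "bei"]}

def toy_predict(word, top_k):
    return toy_predictions[word][:top_k]

toy_gold = group_translations(toy_pairs)
toy_score = precision_at_k(toy_gold, toy_predict, top_k=2)
print(len(toy_gold), toy_score)
```

With the objects of this notebook, the predictor could be something like `lambda w, k: get_nn(w, aligned_en_embeddings, de_embeddings, k)`, applied only to words that have an embedding.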


### Cross-Lingual Analogy Completion

An analogy compares two related pairs of words, e.g. *man is to woman as king is to queen*. This task can be extended to use analogies for translation, e.g. *man is to woman as Mann is to Frau*.


👋 ⚒ Create **twenty** examples of crosslingual analogies and see whether the aligned vector space is able to correctly complete analogies across languages, e.g. positive=(queen, König), negative=(king). You can use examples from the analogy text file in GitHub for this purpose.

Hints:


*   Multilingual Analogies: To create the examples, all you need is a translation of an existing analogy. You can use the already loaded bilingual word list to obtain the translations and the existing analogy list (anlogies.txt on Github) to obtain analogies.
*   Implementation: In the code below, you only need to change the embeddings to the German embeddings for `c` and provide the function with the German embeddings.
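
As an illustration of the vector arithmetic behind such an analogy, the sketch below uses the common offset formulation `b - a + c` on unit vectors and optionally excludes the query word from the candidates, since the query itself is often the closest vector to the target. The 2-D vectors are invented and `complete_analogy` is a hypothetical helper; it is not the function used in the next cell and may rank words differently.

```python
import numpy as np

def unit(vec):
    return vec / np.linalg.norm(vec)

def complete_analogy(embeddings, vec_a, vec_b, vec_c, exclude=(), top_k=1):
    """Rank words by cosine similarity to b - a + c, skipping excluded words."""
    target = unit(unit(vec_b) - unit(vec_a) + unit(vec_c))
    scored = [(float(unit(vec) @ target), word)
              for word, vec in embeddings.items() if word not in exclude]
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_k]]

# Invented 2-D vectors: first axis roughly "royalty", second axis roughly "gender".
toy_en = {"king": np.array([1.0, 0.0]), "queen": np.array([1.0, 1.0])}
toy_de = {
    "König": np.array([1.0, 0.05]),
    "Königin": np.array([1.0, 1.0]),
    "Frau": np.array([0.1, 1.0]),
}

toy_result = complete_analogy(toy_de, toy_en["king"], toy_en["queen"],
                              toy_de["König"], exclude={"König"})
print(toy_result)
```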

In [15]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def norm(vec):
    return vec / np.linalg.norm(vec)        # NB: normalization: length=1

def get_target_words(embeddings, vec_a, vec_b, vec_c, top_k):
    words = list(embeddings.keys())
    vecs = np.array(list(embeddings.values()))

    # Compute the analogy target: normalize b+c, subtract a, then normalize again (e.g. woman+king-man)
    positive = norm(vec_b+vec_c)
    target_vec = norm(positive - vec_a)
    vecs_norm = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    similarities = cosine_similarity([target_vec], vecs_norm).flatten()

    # Get top_k most similar words for the retrieved result vector d
    nearest_idxs = similarities.argsort()[-top_k:][::-1]
    nearest_words = [words[i] for i in nearest_idxs]

    return nearest_words

# Cross-lingual examples
cross_lingual_examples = [
    # Format: (English word1, English word2, German word1, Expected German word2)
    ("Athens", "Greece", "Berlin", "Deutschland"),
    ("Ottawa", "Canada", "Paris", "Frankreich"),
    ("Madrid", "Spain", "Rom", "Italien"),
    ("London", "England", "Moskau", "Russland"),
    ("Helsinki", "Finland", "Stockholm", "Schweden"),
    ("man", "woman", "Mann", "Frau"),
    ("king", "queen", "König", "Königin"),
    ("brother", "sister", "Bruder", "Schwester"),
    ("teacher", "student", "Lehrer", "Schüler"),
    ("cat", "dog", "Katze", "Hund"),
    ("car", "bicycle", "Auto", "Fahrrad"),
    ("book", "pen", "Buch", "Stift"),
    ("doctor", "nurse", "Arzt", "Krankenschwester"),
    ("city", "village", "Stadt", "Dorf"),
    ("apple", "orange", "Apfel", "Orange"),
    ("morning", "night", "Morgen", "Nacht"),
    ("summer", "winter", "Sommer", "Winter"),
    ("ocean", "river", "Ozean", "Fluss"),
    ("mountain", "valley", "Berg", "Tal"),
    ("sun", "moon", "Sonne", "Mond")
]

# Validate each analogy
for en_word1, en_word2, de_word1, expected_de_word2 in cross_lingual_examples:
    if en_word1 in aligned_en_embeddings and en_word2 in aligned_en_embeddings and de_word1 in de_embeddings:
        vec_a = norm(aligned_en_embeddings[en_word1])       # English word1
        vec_b = norm(aligned_en_embeddings[en_word2])       # English word2
        vec_c = norm(de_embeddings[de_word1])       # German word1
        nearest_neighbors = get_target_words(de_embeddings, vec_a, vec_b, vec_c, top_k=1)

        print(f"Analogy: {en_word1} : {en_word2} :: {de_word1} : ?")
        print(f"Expected: {expected_de_word2}, Predicted: {nearest_neighbors[0]}\n")
    else:
        print(f"Skipping analogy: {en_word1}, {en_word2}, {de_word1} (missing embeddings)\n")

Analogy: Athens : Greece :: Berlin : ?
Expected: Deutschland, Predicted: Berlin

Analogy: Ottawa : Canada :: Paris : ?
Expected: Frankreich, Predicted: Paris

Analogy: Madrid : Spain :: Rom : ?
Expected: Italien, Predicted: Rom

Analogy: London : England :: Moskau : ?
Expected: Russland, Predicted: Russland

Analogy: Helsinki : Finland :: Stockholm : ?
Expected: Schweden, Predicted: Schweden

Analogy: man : woman :: Mann : ?
Expected: Frau, Predicted: Freundinnen

Analogy: king : queen :: König : ?
Expected: Königin, Predicted: Königin

Analogy: brother : sister :: Bruder : ?
Expected: Schwester, Predicted: Schwester

Analogy: teacher : student :: Lehrer : ?
Expected: Schüler, Predicted: Studenten

Analogy: cat : dog :: Katze : ?
Expected: Hund, Predicted: Hündin

Analogy: car : bicycle :: Auto : ?
Expected: Fahrrad, Predicted: Fahrrad

Analogy: book : pen :: Buch : ?
Expected: Stift, Predicted: Stift

Analogy: doctor : nurse :: Arzt : ?
Expected: Krankenschwester, Predicted: Hotelier
