# Homework 2: Word Embeddings
#### Introduction to Natural Language Processing

*Yevin Kim, kimyevin17@gmail.com*

In this homework we're going to be looking at word embeddings and their properties.

*Total Points: 20P*

## Section 1: Free Response Questions

**Question 1: Describe the word similarity evaluation task! (1P)**

*Your answer here*
: Word similarity is a numerical representation of the similarity between two words. The more similar or interchangeable the words, the higher the similarity, and the less similar the words with different meanings. It is one of the methods to evaluate embedding Model that allows to check the accuracy of the embedding process of turning natural language into vector form.

**Question 2: Describe the word analogy evaluation task! (1P)**

*Your answer here*
: Word analogy is also a means of evaluating embedding models. However, word analogy aims to look at pairs of words that are in a similar relationship, find words that are in the same relationship, and determine if there is a grammatical or semantic relationship between the words. For example, a gender relationship or a parent word.

## Section 2: Word Similarity Evaluation

In this section, we will follow part of: https://aclanthology.org/Q15-1016.pdf

1. We'll first load the GloVe embeddings. You can download the embeddings from here: https://nlp.stanford.edu/projects/glove/.
Download the 6B version and get the 50d dimension embedding. If this link is too slow (or too big), you can also try to google and directly get the 6B, 50d version.

2. Download the Word Embedding Similarity dataset here: http://alfonseca.org/eng/research/wordsim353.html. Unpack the file here.

Put both files (*glove.6B.50d.txt and wordsim_similarity_goldstandard.txt*) in this folder.

### Load the Word Similarity Data

In [1]:
# Specify the path to your dataset file
file_path = "wordsim353_sim_rel/wordsim_similarity_goldstandard.txt"

# Initialize an empty list to store the data
data = []

# Open the file and read its contents line by line
with open(file_path, 'r') as file:
    for line in file:
        # Split each line into a list of values using tab as the delimiter
        values = line.strip().split('\t')
        data.append(values)

**Task: Describe the structure of the dataset. (1P)**

*Your answer here*
: A structure that lists word pairs of two words and the similarity between them. The delimiter for each line is '\t'.

Now we load the GloVe embeddings from the text file that you just downloaded.

In [2]:
import numpy

glove_file = 'glove.6B.50d.txt'

embeddings_dict = {}

with open(glove_file, 'r', encoding='utf8') as f:
    for i, line in enumerate(f):
        line = line.strip().split(' ')
        word = line[0]
        embed = numpy.asarray(line[1:], "float")
        embeddings_dict[word] = embed

print('Loaded {} words from glove'.format(len(embeddings_dict)))

Loaded 400000 words from glove


**Task: Describe the structure of embeddings_dict (1P). Then get the embedding for the word 'vietnam'**

*Your answer here*
: Each line of the dataset has a set of values separated by 'spaces'. The first part of the line is the corresponding natural language word, and the rest is the vector representations of the word. In this case, embeddings_dict[word] uses the corresponding lexeme as the key and the word's embedding (vector representations) as the value. 

In [3]:
'''
Your Code here. (1P)

Get the embedding for the word 'vietnam'.

'''

target_word = 'vietnam'
if target_word in embeddings_dict:
    embedding = embeddings_dict[target_word]
    print(f'Embedding for "{target_word}": {embedding}')
else:
    print(f'Embedding for "{target_word}" is not found.')

Embedding for "vietnam": [ 0.015116  -0.10943   -0.27907    0.21508   -0.29031   -0.53296
  0.19078   -0.41052    0.44114   -0.62994    0.18833   -0.21939
  0.41152   -0.50776    0.43244   -0.53374   -0.64635    0.71545
  0.32507    1.2133    -0.28625   -0.70367    0.3289    -0.75187
  0.36866   -1.91      -0.045114  -0.29222   -0.13062   -0.085536
  2.4892     0.38918   -0.20016    0.6271    -0.70908    0.065848
 -0.6934     0.42858   -0.96318    0.21551   -1.26       0.10016
  0.88836   -1.2289     0.66346   -0.34112   -0.59078    0.12597
 -0.30163    0.0054462]


### Calculate the Cosine Similarity

Now that we downloaded our dataset and word embeddings, we can calculate the word similarities for each word pair in the dataset.

We first calculate the similarity based on the cosine similarity!

Cosine similarity is defined as:

**1 - cosine_distance**

In [4]:
'''
Your Code here. (3P)

Calculate the cosine similarity for each pair in the dataset. 
Ignore a word pair, if one of the words are not in the embedding dictionary (Out-of-Vocabulary case).
Print out the first 10 word pairs with their corresponding cosine similarity and ground truth, e.g.

Words: 'tiger' and 'cat', Cosine Similarity: 0.6150732418405161, Ground Truth: 7.35
Words: 'tiger' and 'tiger', Cosine Similarity: 1.0, Ground Truth: 10.00
Words: 'plane' and 'car', Cosine Similarity: 0.6656114206912427, Ground Truth: 5.77
Words: 'train' and 'car', Cosine Similarity: 0.7658227528614756, Ground Truth: 6.31
Words: 'television' and 'radio', Cosine Similarity: 0.8709275278715244, Ground Truth: 6.77
...

'''

cosine_similarities = [] # Save all (valid = Not Out-of-Vocabulary case) cosine similarities
ground_truth_similarities_valid = [] # Save all (valid = Not Out-of-Vocabulary case) ground truth similarities

# Define Cosine Similarity function based on Numpy
import numpy as np
from numpy import dot
from numpy.linalg import norm

def cos_sim(A, B):
    return dot(A, B) / (norm(A) * norm(B))
    
# Calculate cosine similarity for each pair of words in the dataset
for pair in data:
    word1, word2, ground_truth = pair[0], pair[1], float(pair[2])

    # Check word1, word2 are in the embedding dictionary
    if word1 in embeddings_dict and word2 in embeddings_dict:
        vec1 = embeddings_dict[word1]
        vec2 = embeddings_dict[word2]

        # Calcutate cosine similarity
        similarity = cos_sim(vec1, vec2)

        # Append to List
        cosine_similarities.append(similarity)
        ground_truth_similarities_valid.append(ground_truth)

# Print out the first 10 word pairs with their corresponding cosine similarity and gold standard
for i in range(10):
    word1 = data[i][0]
    word2 = data[i][1]
    similarity = cosine_similarities[i]
    ground_truth = ground_truth_similarities_valid[i]
    print(f"Words: '{word1}' and '{word2}', Cosine Similarity: {similarity}, Ground Truth: {ground_truth:.2f}")

Words: 'tiger' and 'cat', Cosine Similarity: 0.6150732418405163, Ground Truth: 7.35
Words: 'tiger' and 'tiger', Cosine Similarity: 1.0, Ground Truth: 10.00
Words: 'plane' and 'car', Cosine Similarity: 0.6656114206912428, Ground Truth: 5.77
Words: 'train' and 'car', Cosine Similarity: 0.7658227528614755, Ground Truth: 6.31
Words: 'television' and 'radio', Cosine Similarity: 0.8709275278715247, Ground Truth: 6.77
Words: 'media' and 'radio', Cosine Similarity: 0.7614966025988953, Ground Truth: 7.42
Words: 'bread' and 'butter', Cosine Similarity: 0.84021999893445, Ground Truth: 6.19
Words: 'cucumber' and 'potato', Cosine Similarity: 0.7118516503650524, Ground Truth: 5.92
Words: 'doctor' and 'nurse', Cosine Similarity: 0.7977497347874546, Ground Truth: 7.00
Words: 'professor' and 'doctor', Cosine Similarity: 0.5824731746059034, Ground Truth: 6.62


### Calculate Euclidean Distance

Now do the same, but this time, we calculate the euclidean distance!

In [5]:
'''
Your Code here. (2P)

Calculate the euclidean distance for each pair in the dataset. 
Ignore a word pair, if one of the words are not in the embedding dictionary (Out-of-Vocabulary case).
Print out the first 10 word pairs with their corresponding cosine similarity and ground truth

'''
euclidean_distances = [] # Save all (valid = Not Out-of-Vocabulary case) euclidean distances
ground_truth_similarities_valid = [] # Save all (valid = Not Out-of-Vocabulary case) ground truth similarities

# Define a function to calculate Euclidean distance based on numpy
import numpy as np

def euclidean_dis(A, B):
    return np.sqrt(np.sum((A-B)**2))

# Calculate Euclidean distances for each pair of words in the dataset
for pair in data:
    word1, word2, ground_truth = pair[0], pair[1], float(pair[2])

    # Check word1, word2 are in the embedding dictionary
    if word1 in embeddings_dict and word2 in embeddings_dict:
        vec1 = embeddings_dict[word1]
        vec2 = embeddings_dict[word2]

        # Calcutate cosine similarity
        distance = euclidean_dis(vec1, vec2)

        # Append to List
        euclidean_distances.append(distance)
        ground_truth_similarities_valid.append(ground_truth)


# Display the computed Euclidean distances for the first 10 samples
for i in range(10):
    word1 = data[i][0]
    word2 = data[i][1]
    distance = euclidean_distances[i]
    similarity = cosine_similarities[i]
    ground_truth = ground_truth_similarities_valid[i]
    print(f"Words: '{word1}' and '{word2}', Euclidean Distance: {distance}, Cosine Similarity: {similarity}, Ground Truth: {ground_truth:.2f}")

Words: 'tiger' and 'cat', Euclidean Distance: 4.122908683917116, Cosine Similarity: 0.6150732418405163, Ground Truth: 7.35
Words: 'tiger' and 'tiger', Euclidean Distance: 0.0, Cosine Similarity: 1.0, Ground Truth: 10.00
Words: 'plane' and 'car', Euclidean Distance: 4.709723268968757, Cosine Similarity: 0.6656114206912428, Ground Truth: 5.77
Words: 'train' and 'car', Euclidean Distance: 3.7620738285880053, Cosine Similarity: 0.7658227528614755, Ground Truth: 6.31
Words: 'television' and 'radio', Euclidean Distance: 2.9268013566606483, Cosine Similarity: 0.8709275278715247, Ground Truth: 6.77
Words: 'media' and 'radio', Euclidean Distance: 3.822406423714884, Cosine Similarity: 0.7614966025988953, Ground Truth: 7.42
Words: 'bread' and 'butter', Euclidean Distance: 3.308872389435863, Cosine Similarity: 0.84021999893445, Ground Truth: 6.19
Words: 'cucumber' and 'potato', Euclidean Distance: 3.8450463825822703, Cosine Similarity: 0.7118516503650524, Ground Truth: 5.92
Words: 'doctor' and 'nu

### Evaluate: Pearson Correlation Coefficient

Now that we calculated our predictions, how well does this correlate to the ground truth? We use the pearson coefficient for evaluation!

In [6]:
'''
Your Code here. (2P)

Calculate the pearson correlation coefficient for the cosine similarity and euclidean distance to the ground truth.
You are allowed to use the numpy implementation of pearson coefficient.
Print out both metric scores.
Which predictions are better, cosine similarity and euclidean distance? 
Take into account that one is measuring the similarity and the other distance!

'''
# Calculate the Pearson correlation coefficient for cosine similarity and ground truth
corr_cos_sim = np.corrcoef(cosine_similarities, ground_truth_similarities_valid)[0, 1]

# Calculate the Pearson correlation coefficient for Euclidean distance and ground truth
corr_euclidean_dis = np.corrcoef(euclidean_distances, ground_truth_similarities_valid)[0, 1]

# Print out the correlation coefficients
print(f"Pearson Correlation Coefficient for Cosine Similarity: {corr_cos_sim}")
print(f"Pearson Correlation Coefficient for Euclidean Distance: {corr_euclidean_dis}")

# Result: which is better?
if abs(corr_cos_sim) > abs(corr_euclidean_dis):
    print("Cosine Similarity is a better predictor.")
else:
    print("Euclidean Distance is a better predictor.")


Pearson Correlation Coefficient for Cosine Similarity: 0.5417623796761136
Pearson Correlation Coefficient for Euclidean Distance: -0.5748440058552572
Euclidean Distance is a better predictor.


## Section 3: Word Analogy

Now, we begin with our second evaluation task: Word Analogy! Download the dataset here http://www.fit.vutbr.cz/~imikolov/rnnlm/word-test.v1.txt, and unpack it in this folder.

In [7]:
# Specify the path to the word analogy file
file_path = "word-test.v1.txt"

# Initialize a list to store the analogies
analogies = []

# Open the file and read its contents line by line
with open(file_path, 'r') as file:
    analogy = []
    for line_number, line in enumerate(file):
        if line_number == 0: 
            continue  # Skip the first line
        line = line.strip()
        if line.startswith(":"):
            if analogy:
                analogies.append(analogy)  
            analogy = [line]
        else:
            analogy.append(line)


# Get all GT words
GT_WORDS = []
for analogy in analogies[0] +  analogies[4]:
    if not analogy.startswith(":"):  # Skip lines starting with ":"
        GT_WORDS.extend(analogy.lower().split())
GT_WORDS = list(set(GT_WORDS))

# Only keep country & family analogy and reduce number of examples
analogies = analogies[0][1::40] +  analogies[4][1::40]


# Display the first few analogies
for i, analogy in enumerate(analogies[:5]):  # Display the first 5 analogies for example
    print(f"Analogy {i + 1}: {analogy}")


Analogy 1: Athens Greece Baghdad Iraq
Analogy 2: Baghdad Iraq Stockholm Sweden
Analogy 3: Beijing China Paris France
Analogy 4: Bern Switzerland Oslo Norway
Analogy 5: Canberra Australia Madrid Spain


**Task: Explain the dataset's structure (stored as analogies), and how you would use this dataset to evaluate your word embeddings. (2P)**

*Your answer here*
: The "analogies" configuration stores word-to-word relationships, specifically related to capital-country and family arrangements, by the given code.
 The "analogies" array comprises four-word groupings, as evidenced by the code output below. The method for evaluating this word embedding involves computing the vector relation between the first two words, and subsequently measuring the precision with which the fourth word, connected in a similar manner whilst inserting the third word, can be derived.

### Get Projections

Now we calculate the projection with the first three words of each sample. The last word is the ground truth of the analogy task.

In [8]:
'''
Your Code here. (2P)

Calculate the projections as the following: word2 - word1 + word3 = projection

Tip: Make sure that all words are lower cased!

'''

# Function to calculate the projection. You are allowed to use numpy. 
import numpy as np #already included in this notebook

def calculate_projection(A, B, C):
    A = A.lower()
    B = B.lower()
    C = C.lower()

    if A in embeddings_dict and B in embeddings_dict and C in embeddings_dict:
        vec1 = embeddings_dict[A]
        vec2 = embeddings_dict[B]
        vec3 = embeddings_dict[C]
        return vec2 - vec1 + vec3
        
    else:
        return None

# Process and calculate projections for each analogy
projections = [] # save the projections here

for analogy in analogies:
    A, B, C, answer = analogy.lower().split()
    projection = calculate_projection(A, B, C)
    if projection is not None:
        projections.append(projection)

# Display the first few analogies
for i, analogy in enumerate(analogies[:5]):  # Display the first 5 analogies for example
    print(f"Projection {i + 1}: {projection}")

Projection 1: [ 0.071815   1.04173   -0.53635   -0.46468    1.26138    0.98833
 -0.68096    0.4872    -0.71339   -0.21135    0.02327    0.584147
 -0.19358   -1.20716    1.261404  -0.28098   -1.09004    0.05105
  0.36973    1.24253   -0.52907    1.45411   -0.16751    0.1921073
  0.912203  -2.38913   -0.17003   -0.80068   -0.24877    0.07134
  1.81577   -0.22733   -0.540853   0.88672    0.98941   -0.037356
  0.10904   -0.03734    0.84183   -0.42766   -0.477841   0.48637
 -0.19205   -0.60838    0.30147   -0.90495   -0.60697   -1.8407
  0.837017   0.108032 ]
Projection 2: [ 0.071815   1.04173   -0.53635   -0.46468    1.26138    0.98833
 -0.68096    0.4872    -0.71339   -0.21135    0.02327    0.584147
 -0.19358   -1.20716    1.261404  -0.28098   -1.09004    0.05105
  0.36973    1.24253   -0.52907    1.45411   -0.16751    0.1921073
  0.912203  -2.38913   -0.17003   -0.80068   -0.24877    0.07134
  1.81577   -0.22733   -0.540853   0.88672    0.98941   -0.037356
  0.10904   -0.03734    0.84183

### Get closest words

Now that we have our projections, we have to find the closest word embeddings (and their corresponding word) for each projection. We will calculate the 5 closest words for each projection.

We only consider the words in GT_WORDS as valid words! Otherwise, we would need to calculate the similarities to all words in the vocab, which is computational expensive.

In [9]:
print(GT_WORDS)

['princess', 'thailand', 'bern', 'her', 'spain', 'daughter', 'japan', 'man', 'policewoman', 'moscow', 'king', 'havana', 'grandmother', 'grandfather', 'egypt', 'queen', 'stepmother', 'sweden', 'rome', 'stepfather', 'athens', 'hanoi', 'stockholm', 'stepson', 'paris', 'policeman', 'tokyo', 'switzerland', 'madrid', 'italy', 'finland', 'baghdad', 'bangkok', 'daughters', 'mom', 'mother', 'russia', 'granddaughter', 'niece', 'stepbrother', 'vietnam', 'stepsister', 'berlin', 'pakistan', 'france', 'aunt', 'he', 'cairo', 'uncle', 'afghanistan', 'grandma', 'husband', 'islamabad', 'greece', 'dad', 'beijing', 'boy', 'sisters', 'england', 'london', 'iraq', 'prince', 'father', 'stepdaughter', 'tehran', 'ottawa', 'son', 'grandson', 'bride', 'helsinki', 'germany', 'wife', 'canberra', 'woman', 'oslo', 'his', 'sister', 'cuba', 'norway', 'canada', 'groom', 'nephew', 'she', 'iran', 'brother', 'australia', 'brothers', 'girl', 'grandpa', 'china', 'sons', 'kabul']


In [12]:
'''
Your Code here. (2p)

Calculate the first 5 closest words for each projection vector.
ONLY consider the words in GT_WORDS.
Print out the 5 closest word, the ground truth and the similarity score for each Analogy, e.g. 

Analogy 11: grandfather:grandmother :: prince:?
Ground Truth:  princess
Top 5 Closest Words:
- prince (Similarity: 0.8549)
- princess (Similarity: 0.8503)
- queen (Similarity: 0.7963)
- wife (Similarity: 0.7559)
- aunt (Similarity: 0.7441)


'''


# Function to find the top 5 closest words for a given projection
def find_top_closest_words(projection, embeddings):
    closest_words = []
    
    # We only consider the words in GT_WORDS
    relevant_embeddings = {word: embedding for word, embedding in embeddings.items() if word in GT_WORDS}
    # ... Fill in the rest...
    if not relevant_embeddings:
        return []

    # Caculate Cosine similarity: projection vs embedding
    similarities = {word: cos_sim(projection, embedding) for word, embedding in relevant_embeddings.items()}

    sorted_words = sorted(similarities.items(), key=lambda x: x[1], reverse=True)
    closest_words = sorted_words[:5]
    return closest_words

# Calculate and store the top 5 closest words for each projection
top_closest_words = []

for i, projection in enumerate(projections):
    analogy = analogies[i]
    A, B, C, word_truth = analogy.lower().split()
    closest_words = find_top_closest_words(projection, embeddings_dict)
    top_closest_words.append((word_truth, closest_words))


# Display the GT and the top 5 closest words
for i, (word_truth, closest_words) in enumerate(top_closest_words):
    analogy = analogies[i]
    A, B, C, _ = analogy.lower().split()
    print(f"Analogy {i + 1}: {A}:{B} :: {C}:?")
    print(f"Ground Truth: {word_truth}")
    print("Top 5 Closest Words:")
    for word, similarity in closest_words:
        print(f"- {word} (Similarity: {similarity:.4f})")
    print()


Analogy 1: athens:greece :: baghdad:?
Ground Truth: iraq
Top 5 Closest Words:
- iraq (Similarity: 0.8618)
- afghanistan (Similarity: 0.8519)
- baghdad (Similarity: 0.8091)
- kabul (Similarity: 0.7362)
- pakistan (Similarity: 0.7206)

Analogy 2: baghdad:iraq :: stockholm:?
Ground Truth: sweden
Top 5 Closest Words:
- sweden (Similarity: 0.7895)
- germany (Similarity: 0.7698)
- switzerland (Similarity: 0.7498)
- stockholm (Similarity: 0.7404)
- russia (Similarity: 0.7235)

Analogy 3: beijing:china :: paris:?
Ground Truth: france
Top 5 Closest Words:
- paris (Similarity: 0.8742)
- france (Similarity: 0.8701)
- spain (Similarity: 0.7322)
- italy (Similarity: 0.7062)
- switzerland (Similarity: 0.6807)

Analogy 4: bern:switzerland :: oslo:?
Ground Truth: norway
Top 5 Closest Words:
- oslo (Similarity: 0.7977)
- norway (Similarity: 0.7822)
- switzerland (Similarity: 0.7212)
- sweden (Similarity: 0.7031)
- helsinki (Similarity: 0.6638)

Analogy 5: canberra:australia :: madrid:?
Ground Truth: sp

### Evaluation of Projections

Now, calculate the accuracy of our predictions. One problem that you have to take care of is, you should exclude all 3 original vectors from the pool of candidates!

In [13]:
'''
Your Code here. (2P)

Calculate the accuracy of your predictions.
Your final prediction (from your top 5) is the top 1 word. However, you should exclude all 3 original words from the pool of candidates! E.g.

Analogy 11: grandfather:grandmother :: prince:?
Ground Truth:  princess
Top 5 Closest Words:
- prince (Similarity: 0.8549)
- princess (Similarity: 0.8503)
- queen (Similarity: 0.7963)
- wife (Similarity: 0.7559)
- aunt (Similarity: 0.7441)

--> Final Prediction = princess

'''

correct_predictions = 0

# Calculate the accuracy of prediction ((excluding original words))
for i, (word_truth, closest_words) in enumerate(top_closest_words):
    analogy = analogies[i]
    A, B, C, _ = analogy.lower().split()

    final_prediction = None
    for word, _ in closest_words:
        if word not in [A, B, C]: # exclude original words
            final_prediction = word
            break
    if final_prediction == word_truth:
        correct_predictions += 1

    print(f"Analogy {i + 1}: {A}:{B} :: {C}:?")
    print(f"Ground Truth: {word_truth}")
    print("Top 5 Closest Words:")
    for word, similarity in closest_words:
        print(f"- {word} (Similarity: {similarity:.4f})")
    print()
    print(f"--> Final Prediction = {final_prediction}")
    print()

accuracy = (correct_predictions / len(top_closest_words))

print(f"Accuracy: {accuracy * 100:.2f}%")

Analogy 1: athens:greece :: baghdad:?
Ground Truth: iraq
Top 5 Closest Words:
- iraq (Similarity: 0.8618)
- afghanistan (Similarity: 0.8519)
- baghdad (Similarity: 0.8091)
- kabul (Similarity: 0.7362)
- pakistan (Similarity: 0.7206)

--> Final Prediction = iraq

Analogy 2: baghdad:iraq :: stockholm:?
Ground Truth: sweden
Top 5 Closest Words:
- sweden (Similarity: 0.7895)
- germany (Similarity: 0.7698)
- switzerland (Similarity: 0.7498)
- stockholm (Similarity: 0.7404)
- russia (Similarity: 0.7235)

--> Final Prediction = sweden

Analogy 3: beijing:china :: paris:?
Ground Truth: france
Top 5 Closest Words:
- paris (Similarity: 0.8742)
- france (Similarity: 0.8701)
- spain (Similarity: 0.7322)
- italy (Similarity: 0.7062)
- switzerland (Similarity: 0.6807)

--> Final Prediction = france

Analogy 4: bern:switzerland :: oslo:?
Ground Truth: norway
Top 5 Closest Words:
- oslo (Similarity: 0.7977)
- norway (Similarity: 0.7822)
- switzerland (Similarity: 0.7212)
- sweden (Similarity: 0.7031)
