# Assignment: week 4

## Objectives
The objectives of this assignment are:

1. to learn how to obtain and use pretrained word embeddings
2. to gain a better understanding of word vectors

## Setup

The GloVe file is loaded and stored in a dictionary where each key is a word and each value is its vector representation. During loading, I noticed that a few lines contained artifacts such as ".", "..", "...", or strings like "1/2" that are not meaningful words for this task. These lines caused parsing errors because the non-numeric tokens could not be converted into floats. To ensure clean and valid embeddings, I filtered out these irregular entries and kept only the lines where all vector components were valid numeric values.

In [116]:
import numpy as np


glove_path = "./wiki100d.txt"

embedding_dim = 100

embeddings_index = {}
with open(glove_path, "r", encoding="utf-8") as f:
    for line in f:
        values = line.strip().split()
        # skip empty lines and lines with the wrong number of components
        if len(values) != embedding_dim + 1:
            continue
        word = values[0]
        try:
            embeddings_index[word] = np.asarray(values[1:], dtype=np.float32)
        except ValueError:
            # skip lines with non-numeric components instead of aborting the load
            continue
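
The filtering rule described above can also be isolated into a small helper, which makes it easy to try on individual lines. This is only a minimal sketch: the name `parse_embedding_line` and the 3-dimensional sample lines are hypothetical and are not taken from the actual GloVe file.

```python
import numpy as np


def parse_embedding_line(line, dim):
    """Return (word, vector) for a well-formed line, or None otherwise."""
    values = line.strip().split()
    # A valid line has one token for the word plus `dim` vector components.
    if len(values) != dim + 1:
        return None
    try:
        vector = np.asarray(values[1:], dtype=np.float32)
    except ValueError:
        # At least one component is not a number (e.g. "1/2").
        return None
    return values[0], vector


# Hypothetical sample lines with 3-dimensional vectors.
samples = [
    "cat 0.1 0.2 0.3",    # well-formed
    ". . . 0.1 0.2 0.3",  # too many tokens
    "half 1/2 0.2 0.3",   # non-numeric component
]
parsed = [parse_embedding_line(s, dim=3) for s in samples]
print([p[0] if p else None for p in parsed])
```

Returning `None` for a bad line, rather than raising, keeps the loading loop simple: it only has to skip entries that come back empty.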


In [51]:
print("Loaded vectors:", len(embeddings_index))

Loaded vectors: 1287614


Here are functions for calculating the cosine similarity of word vectors and for finding the words most similar to a given vector.

In [52]:
def cosine_similarity(a, b):
    return np.dot(a,b)/(np.linalg.norm(a)*np.linalg.norm(b))

def find_most_similar_word(vec, embeddings, exclude=None, top_n=10):
    """Return the top_n (word, similarity) pairs closest to vec."""
    if exclude is None:
        exclude = set()
    # cosine similarity between vec and every word that is not excluded
    similarities = {word: cosine_similarity(vec, emb)
                    for word, emb in embeddings.items() if word not in exclude}

    return sorted(similarities.items(), key=lambda x: x[1], reverse=True)[:top_n]
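
With more than a million vectors, computing the similarity one word at a time in a Python loop can be slow. One possible alternative, sketched below under the assumption that all vectors fit in memory as a single matrix, is to normalize every vector once and rank candidates with one matrix-vector product. The names `build_matrix` and `most_similar_vectorized` and the toy embeddings are hypothetical and serve only as illustration.

```python
import numpy as np


def build_matrix(embeddings):
    """Stack all vectors into one matrix with unit-length rows."""
    words = list(embeddings.keys())
    matrix = np.stack([embeddings[w] for w in words]).astype(np.float32)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Guard against division by zero for all-zero vectors.
    matrix = matrix / np.maximum(norms, 1e-12)
    return words, matrix


def most_similar_vectorized(vec, words, matrix, exclude=None, top_n=10):
    """Rank words by cosine similarity using one matrix-vector product."""
    exclude = set() if exclude is None else set(exclude)
    query = vec / max(np.linalg.norm(vec), 1e-12)
    scores = matrix @ query      # cosine similarity, since rows are unit length
    order = np.argsort(-scores)  # indices sorted by descending similarity
    results = []
    for idx in order:
        if words[idx] in exclude:
            continue
        results.append((words[idx], float(scores[idx])))
        if len(results) == top_n:
            break
    return results


# Toy embeddings (made-up values, for illustration only).
toy = {
    "a": np.array([1.0, 0.0], dtype=np.float32),
    "b": np.array([0.9, 0.1], dtype=np.float32),
    "c": np.array([0.0, 1.0], dtype=np.float32),
}
words, matrix = build_matrix(toy)
top = most_similar_vectorized(toy["a"], words, matrix, exclude={"a"}, top_n=2)
print(top)
```

The matrix only needs to be built once and can then be reused for every query, which is where the saving over the dictionary loop would come from.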

For woman − man + king, the word with the highest cosine similarity was queen. This suggests that the model encodes gender and royalty as directions in the vector space, so that “king” − “man” ≈ “queen” − “woman”. Of course, I excluded the three input words (king, man, woman) from the candidates.

In [73]:
man_vector = embeddings_index["man"]
woman_vector = embeddings_index["woman"]
king_vector = embeddings_index["king"]


true_vector = woman_vector - man_vector + king_vector
nearest_word = find_most_similar_word(true_vector, embeddings_index, exclude={"king", "man", "woman"}, top_n=10)

In [74]:
for word in nearest_word:
    print(word[0], word[1])

queen 0.8120952
throne 0.72555846
daughter 0.72254604
elizabeth 0.70558876
wife 0.7013361
mother 0.7000919
margaret 0.68769395
princess 0.6853057
monarch 0.67529035
niece 0.6662857
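
The relation “king” − “man” ≈ “queen” − “woman” can also be checked by comparing the two difference vectors directly. The sketch below uses made-up 3-dimensional toy vectors (hypothetical values, not GloVe data) whose axes loosely stand for royalty, gender, and an unrelated feature. With the real data, one would take the vectors from `embeddings_index` instead.

```python
import numpy as np


def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


# Made-up toy vectors with axes [royalty, gender, other].
toy = {
    "king":  np.array([0.9,  0.8, 0.1]),
    "queen": np.array([0.9, -0.8, 0.2]),
    "man":   np.array([0.1,  0.9, 0.0]),
    "woman": np.array([0.1, -0.9, 0.1]),
}

# If the offsets are nearly parallel, both pairs differ by a similar "royalty" direction.
offset_male = toy["king"] - toy["man"]
offset_female = toy["queen"] - toy["woman"]
offset_similarity = cosine_similarity(offset_male, offset_female)

# The analogy query itself: woman - man + king should land close to queen.
analogy_vector = toy["woman"] - toy["man"] + toy["king"]
analogy_similarity = cosine_similarity(analogy_vector, toy["queen"])

print(offset_similarity, analogy_similarity)
```

A similarity close to 1 between the two offsets indicates that they point in nearly the same direction, which is the condition under which the analogy arithmetic works.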


Here I calculated winter + sun − snow. As a playful intuition: when it is winter, the sun starts to shine, and the snow melts away, summer is coming. In other words, removing winter's characteristic feature (snow) is analogous to removing the sun from summer: winter − snow ≈ summer − sun.

In [114]:
snow_vector = embeddings_index["snow"]
sun_vector = embeddings_index["sun"]
winter_vector = embeddings_index["winter"]

true_vector = winter_vector + sun_vector - snow_vector
nearest_word = find_most_similar_word(true_vector, embeddings_index, exclude={"snow", "winter", "sun"}, top_n=10)

In [115]:
for word in nearest_word:
    print(word[0], word[1])

summer 0.7168366
spring 0.6835645
autumn 0.6784056
1997 0.6032326
1998 0.6010512
fall 0.5967329
2008 0.5878199
beginning 0.58745766
2001 0.58440286
. 0.5843619
