# Word Analogy with GloVe

With the CNN-related topics finished, we can move on to building more powerful networks. This notebook uses pre-trained GloVe word embeddings to solve word analogies.

In [7]:
from IPython.display import clear_output
import numpy as np

# Word Embedding

Before anything else, we need an embedding for each word. We can either train our own embeddings or use pre-trained ones.

In this exercise, we use well-known pre-trained embeddings (GloVe) to perform a simple task: solving word analogies.

![img](https://uupload.ir/files/ro4_1_uqw1pqumvzkm3geqtao5lq.png)

# First Part

In this exercise, we use GloVe word embeddings and download them with axel, a download accelerator, to speed up the transfer.

In [8]:
!apt-get install axel
clear_output()

In [9]:
!axel https://nlp.stanford.edu/data/glove.6B.zip
clear_output()

In [10]:
!unzip glove.6B.zip

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       
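
The archive holds four files, but only the 300-dimensional one is used in the rest of this notebook. As a minimal sketch, assuming Python's standard `zipfile` module is an acceptable substitute for `unzip`, a single member can be extracted on its own. The helper name `extract_member` and the tiny demo archive are hypothetical and exist only so the example runs without the real download.

```python
import os
import tempfile
import zipfile


def extract_member(zip_path, member, dest_dir):
    """Extract a single file from a zip archive and return its path on disk."""
    with zipfile.ZipFile(zip_path) as archive:
        if member not in archive.namelist():
            raise KeyError(f"{member!r} is not in {zip_path}")
        return archive.extract(member, path=dest_dir)


# Demo on a tiny stand-in archive (hypothetical contents, not real GloVe data).
demo_dir = tempfile.mkdtemp()
demo_zip = os.path.join(demo_dir, "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as archive:
    archive.writestr("small.txt", "the 0.1 0.2\n")
    archive.writestr("large.txt", "the 0.1 0.2 0.3 0.4\n")

extracted_path = extract_member(demo_zip, "large.txt", demo_dir)
with open(extracted_path, encoding="utf-8") as f:
    first_line = f.readline().strip()
print(first_line)
```

In this notebook the equivalent call would be `extract_member('glove.6B.zip', 'glove.6B.300d.txt', '/content')`.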


In [11]:
!head -10 /content/glove.6B.300d.txt

the 0.04656 0.21318 -0.0074364 -0.45854 -0.035639 0.23643 -0.28836 0.21521 -0.13486 -1.6413 -0.26091 0.032434 0.056621 -0.043296 -0.021672 0.22476 -0.075129 -0.067018 -0.14247 0.038825 -0.18951 0.29977 0.39305 0.17887 -0.17343 -0.21178 0.23617 -0.063681 -0.42318 -0.11661 0.093754 0.17296 -0.33073 0.49112 -0.68995 -0.092462 0.24742 -0.17991 0.097908 0.083118 0.15299 -0.27276 -0.038934 0.54453 0.53737 0.29105 -0.0073514 0.04788 -0.4076 -0.026759 0.17919 0.010977 -0.10963 -0.26395 0.07399 0.26236 -0.1508 0.34623 0.25758 0.11971 -0.037135 -0.071593 0.43898 -0.040764 0.016425 -0.4464 0.17197 0.046246 0.058639 0.041499 0.53948 0.52495 0.11361 -0.048315 -0.36385 0.18704 0.092761 -0.11129 -0.42085 0.13992 -0.39338 -0.067945 0.12188 0.16707 0.075169 -0.015529 -0.19499 0.19638 0.053194 0.2517 -0.34845 -0.10638 -0.34692 -0.19024 -0.2004 0.12154 -0.29208 0.023353 -0.11618 -0.35768 0.062304 0.35884 0.02906 0.0073005 0.0049482 -0.15048 -0.12313 0.19337 0.12173 0.44503 0.25147 0.10781 -0.17716 0.0386

In [13]:
!tail -10 /content/glove.6B.300d.txt

sigarms 0.14649 -0.47266 0.17144 0.26431 -0.13895 -0.20788 0.41624 0.078204 0.10015 1.1079 0.18251 -0.43063 0.045626 -0.026948 0.17895 -0.20265 0.089214 0.19252 -0.10675 -0.68545 -0.47164 -0.26379 -0.82508 -0.13879 0.22361 -0.51137 -0.16747 0.0029942 0.27512 -0.19599 0.09114 0.025339 -0.0082318 -0.36483 0.3991 -0.36836 0.016953 -0.046354 -0.071253 -0.40813 -0.015083 0.10156 -0.057742 0.27617 0.1839 -0.3379 -0.36662 -0.80425 0.15484 -0.29552 0.16137 0.023422 -0.18881 0.015709 0.31194 -0.2346 0.1114 -0.22749 -0.4252 -0.3593 0.069938 -0.18787 -0.39136 -0.18593 0.059576 0.34819 -0.52839 -0.0079031 0.13368 -0.24428 0.24038 0.11544 0.24686 -0.097644 0.36035 -0.39084 -0.4022 -0.014499 -0.05964 0.21511 -0.19882 0.60226 -0.43696 -0.08363 -0.0069641 -0.15582 0.13816 -0.67384 -0.15337 -0.14553 -0.0020241 -0.057246 0.11361 0.097527 0.075266 0.35942 0.14952 -0.011494 -0.15998 0.36846 0.033499 -0.30822 -0.51535 0.34124 -0.30532 -0.059391 0.05799 -0.46331 0.48521 0.14759 -0.18101 0.50789 0.2823 0.514

# Second Part

Now we need to create a dictionary where the keys are our words and the values are the embeddings of those words.

In [14]:
word2embedding = {}

In [15]:
with open('/content/glove.6B.300d.txt', encoding='utf-8') as glove_file:
  for line in glove_file:
    values = line.split()
    word = values[0]
    vector = np.array(values[1:], dtype='float32')
    word2embedding[word] = vector

In [16]:
len(word2embedding)

400000

In [17]:
np.shape(word2embedding['the'])

(300,)
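
The loading loop can also be wrapped in a function that checks the dimensionality of every line. The following is a sketch under stated assumptions rather than the notebook's own loader: `load_embeddings` and `expected_dim` are hypothetical names, and the two-line toy "file" uses made-up 3-dimensional values so the example runs without the real GloVe download. It also stacks the vectors into a single matrix, which is convenient for vectorized distance computations.

```python
import io

import numpy as np


def load_embeddings(lines, expected_dim=None):
    """Parse 'word v1 v2 ...' lines into a dict of float32 vectors."""
    embeddings = {}
    for line_number, line in enumerate(lines, start=1):
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        if expected_dim is not None and len(values) != expected_dim:
            raise ValueError(
                f"line {line_number}: expected {expected_dim} values, got {len(values)}"
            )
        embeddings[word] = np.asarray(values, dtype="float32")
    return embeddings


# Toy stand-in for the GloVe file (made-up numbers, 3 dimensions).
toy_file = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
toy_embeddings = load_embeddings(toy_file, expected_dim=3)

# Stack the vectors into one (vocab_size, dim) matrix for vectorized math later.
vocab = list(toy_embeddings)
matrix = np.stack([toy_embeddings[w] for w in vocab])
print(vocab, matrix.shape)
```

With the real file, the call would pass the open file object and `expected_dim=300`.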

# Word Analogy (3rd Part)

## Euclidean Distance
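
The function in this section can be summarized in one formula. Writing $e_w$ for the embedding of word $w$ (a symbol introduced here for illustration), the analogy $w_1 : w_2 :: w_3 : ?$ is answered by building a target vector and picking the nearest vocabulary word that is not one of the three inputs:

```latex
\[
t = e_{w_2} - e_{w_1} + e_{w_3},
\qquad
\hat{w} = \operatorname*{arg\,min}_{w \notin \{w_1, w_2, w_3\}} \lVert e_w - t \rVert_2
\]
```

For the query `iran : tehran :: germany : ?`, the target is $t = e_{\text{tehran}} - e_{\text{iran}} + e_{\text{germany}}$, and the search returns the word whose vector lies closest to $t$.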

In [21]:
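def find_analogy_euclidean(word1, word2, word3, embedding_dict):
    """
    Solves the analogy: word1 : word2 :: word3 : ???
    using Euclidean distance to the target vector.
    """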
    for word in [word1, word2, word3]:
        if word not in embedding_dict:
            print(f"'{word}' not found in embedding dictionary.")
            return None
    vec1 = embedding_dict[word1]
    vec2 = embedding_dict[word2]
    vec3 = embedding_dict[word3]
    target_vec = vec2 - vec1 + vec3

    # Initialize tracking variables
    best_word = None
    best_distance = float('inf')

    # Loop through all words in the embedding to find the closest one
    for word, vec in embedding_dict.items():
        # Skip input words to avoid returning one of them
        if word in [word1, word2, word3]:
            continue
        # Compute Euclidean distance
        distance = np.linalg.norm(target_vec - vec)
        if distance < best_distance:
            best_distance = distance
            best_word = word

    print(f"🔍 Closest word by Euclidean distance: {best_word}")
    return best_word

In [22]:
find_analogy_euclidean('iran', 'tehran', 'germany', word2embedding)

🔍 Closest word by Euclidean distance: berlin


'berlin'

In [24]:
find_analogy_euclidean('france', 'paris', 'italy', word2embedding)

🔍 Closest word by Euclidean distance: rome


'rome'

In [26]:
find_analogy_euclidean('man', 'king', 'woman', word2embedding)

🔍 Closest word by Euclidean distance: queen


'queen'

In [27]:
find_analogy_euclidean('japan', 'yen', 'usa', word2embedding)

🔍 Closest word by Euclidean distance: c1-spa


'c1-spa'

In [28]:
find_analogy_euclidean('fast', 'faster', 'strong', word2embedding)

🔍 Closest word by Euclidean distance: stronger


'stronger'

In [29]:
find_analogy_euclidean('car', 'cars', 'child', word2embedding)

🔍 Closest word by Euclidean distance: children


'children'
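
Looping over 400,000 words in Python is slow, and seeing only the single best word makes odd results such as `c1-spa` hard to inspect. One possible variant, sketched here under assumptions, computes all distances in one NumPy call and returns the top `k` candidates with their distances. The name `top_k_euclidean` and the 2-D toy embeddings are hypothetical; the toy values are chosen so that the analogy holds exactly.

```python
import numpy as np


def top_k_euclidean(word1, word2, word3, embedding_dict, k=3):
    """Return the k words closest (Euclidean) to vec(word2) - vec(word1) + vec(word3)."""
    words = list(embedding_dict)
    matrix = np.stack([embedding_dict[w] for w in words])  # (vocab, dim)
    target = embedding_dict[word2] - embedding_dict[word1] + embedding_dict[word3]

    # One vectorized pass: distance from every row to the target.
    distances = np.linalg.norm(matrix - target, axis=1)

    # Exclude the three query words, as the loop-based version does.
    for w in (word1, word2, word3):
        distances[words.index(w)] = np.inf

    order = np.argsort(distances)[:k]
    return [(words[i], float(distances[i])) for i in order]


# Toy 2-D embeddings (made-up values) where the analogy is exact by construction.
toy = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king": np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "apple": np.array([-2.0, -2.0]),
}
nearest = top_k_euclidean("man", "king", "woman", toy, k=2)
print(nearest)
```

Here `queen` comes first with distance 0, because the toy vectors were built so that `king - man + woman` equals `queen` exactly. On the real embeddings, rebuilding the matrix on every call would be wasteful, so it is better built once and reused.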

## Cosine Similarity

In [30]:
def cosine_similarity(vec1, vec2):
    """
    Computes the cosine similarity between two vectors.
    """
    dot_product = np.dot(vec1, vec2)
    norm_product = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    if norm_product == 0:
        return 0.0
    return dot_product / norm_product
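
A few hand-picked vectors illustrate how cosine similarity behaves. The sketch below restates the function so it runs on its own (with an added `float` cast) and uses made-up 2-D vectors: a vector pointing the same way but ten times longer scores close to 1, an orthogonal vector scores 0, and the negated vector scores close to -1. The last line shows that Euclidean distance, unlike cosine similarity, is sensitive to vector length.

```python
import numpy as np


def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    norm_product = np.linalg.norm(vec1) * np.linalg.norm(vec2)
    if norm_product == 0:
        return 0.0
    return float(np.dot(vec1, vec2) / norm_product)


a = np.array([1.0, 2.0])
same_direction = 10 * a             # parallel to a, but ten times longer
orthogonal = np.array([-2.0, 1.0])  # dot product with a is zero
opposite = -a

print(cosine_similarity(a, same_direction))  # close to 1
print(cosine_similarity(a, orthogonal))      # 0
print(cosine_similarity(a, opposite))        # close to -1
print(np.linalg.norm(a - same_direction))    # large, despite the identical direction
```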

In [31]:
def find_analogy_cosine(word1, word2, word3, embedding_dict):
    """
    Solves the analogy: word1 : word2 :: word3 : ???
    using cosine similarity.

    Parameters:
        word1, word2, word3: input words as strings
        embedding_dict: dictionary mapping words to embedding vectors

    Returns:
        best_word: the word most similar to the target vector
    """
    # Check all words exist in the embedding
    for word in [word1, word2, word3]:
        if word not in embedding_dict:
            print(f"'{word}' not found in embedding dictionary.")
            return None

    # Compute target vector
    vec1 = embedding_dict[word1]
    vec2 = embedding_dict[word2]
    vec3 = embedding_dict[word3]
    target_vec = vec2 - vec1 + vec3

    best_word = None
    best_similarity = -1.0

    for word, vec in embedding_dict.items():
        # Skip the original words
        if word in [word1, word2, word3]:
            continue
        sim = cosine_similarity(target_vec, vec)
        if sim > best_similarity:
            best_similarity = sim
            best_word = word

    print(f"🔍 Closest word by cosine similarity: {best_word}")
    return best_word


In [33]:
find_analogy_cosine('iran', 'tehran', 'germany', word2embedding)

🔍 Closest word by cosine similarity: berlin


'berlin'

In [34]:
find_analogy_euclidean('japan', 'yen', 'usa', word2embedding)

🔍 Closest word by Euclidean distance: c1-spa


'c1-spa'

In [36]:
find_analogy_cosine('france', 'french', 'italy', word2embedding)

🔍 Closest word by cosine similarity: italian


'italian'

In [37]:
find_analogy_cosine('dog', 'bark', 'cat', word2embedding)

🔍 Closest word by cosine similarity: twigs


'twigs'

In [38]:
find_analogy_cosine('tokyo', 'japan', 'paris', word2embedding)

🔍 Closest word by cosine similarity: france


'france'

In [40]:
find_analogy_cosine('cold', 'colder', 'fast', word2embedding)

🔍 Closest word by cosine similarity: faster


'faster'

In [41]:
find_analogy_cosine('run', 'ran', 'swim', word2embedding)

🔍 Closest word by cosine similarity: swam


'swam'

In [42]:
find_analogy_cosine('apple', 'fruit', 'carrot', word2embedding)

🔍 Closest word by cosine similarity: carrots


'carrots'

In [43]:
find_analogy_cosine('brother', 'sister', 'father', word2embedding)

🔍 Closest word by cosine similarity: mother


'mother'
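
Each call above recomputes the norm of every vocabulary vector. A possible speed-up, sketched here under assumptions and not taken from the notebook, is to normalize all vectors once and then score the whole vocabulary with a single matrix-vector product. The names `build_index` and `find_analogy_cosine_fast` are hypothetical, and the 2-D toy embeddings are made up so that the analogy holds exactly.

```python
import numpy as np


def build_index(embedding_dict):
    """Return (words, unit_matrix) where each row is an L2-normalized embedding."""
    words = list(embedding_dict)
    matrix = np.stack([embedding_dict[w] for w in words]).astype("float32")
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows as zeros instead of dividing by 0
    return words, matrix / norms


def find_analogy_cosine_fast(word1, word2, word3, words, unit_matrix, raw):
    """Vectorized word1 : word2 :: word3 : ? using cosine similarity."""
    target = raw[word2] - raw[word1] + raw[word3]
    target_norm = np.linalg.norm(target)
    if target_norm == 0:
        return None
    # Rows are unit length, so one matrix-vector product gives every cosine score.
    scores = unit_matrix @ (target / target_norm)
    for w in (word1, word2, word3):
        scores[words.index(w)] = -np.inf  # never return a query word
    return words[int(np.argmax(scores))]


# Toy 2-D embeddings (made-up values) where the analogy is exact by construction.
toy = {
    "paris": np.array([2.0, 1.0]),
    "france": np.array([2.0, 3.0]),
    "tokyo": np.array([5.0, 1.0]),
    "japan": np.array([5.0, 3.0]),
    "banana": np.array([-3.0, -1.0]),
}
words, unit_matrix = build_index(toy)
answer = find_analogy_cosine_fast("tokyo", "japan", "paris", words, unit_matrix, toy)
print(answer)
```

The index is built once and reused across queries, so the per-query cost is dominated by a single product over the vocabulary.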

## Comparison
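
The two criteria are closely related, which may help explain why they often agree. For unit-length vectors $u$ and $v$ (symbols introduced here for illustration), expanding the squared Euclidean distance gives:

```latex
\[
\lVert u - v \rVert_2^2
= \lVert u \rVert_2^2 + \lVert v \rVert_2^2 - 2\, u \cdot v
= 2 - 2\cos(u, v)
\qquad \text{when } \lVert u \rVert_2 = \lVert v \rVert_2 = 1
\]
```

So on normalized vectors, minimizing Euclidean distance and maximizing cosine similarity produce the same ranking. The functions in this notebook work on the raw GloVe vectors, which are not unit length, so the two methods can still disagree.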

In [52]:
def compare_analogy_methods(analogies, embedding_dict):
    """
    Compares results of cosine similarity and Euclidean distance
    for a list of word analogies.

    Parameters:
        analogies: List of lists, where each list contains 3 words [word1, word2, word3]
        embedding_dict: GloVe embedding dictionary
    """
    print(f"{'Word1':<12} {'Word2':<12} {'Word3':<12} {'Cosine Result':<20} {'Euclidean Result':<20}")
    print("-" * 80)

    for words in analogies:
        if len(words) != 3:
            print(f"Skipping invalid input: {words}")
            continue

        word1, word2, word3 = words

        cosine_result = find_analogy_cosine(word1, word2, word3, embedding_dict)
        euclidean_result = find_analogy_euclidean(word1, word2,
                                                  word3, embedding_dict)

        # str() keeps the format spec valid when a result is None (word not in vocabulary)
        print(f"{word1:<12} {word2:<12} {word3:<12} {str(cosine_result):<20} {str(euclidean_result):<20}")
        print("-" * 80)


In [53]:
test_analogies = [
    ['france', 'french', 'italy'],
    ['tokyo', 'japan', 'paris'],
    ['run', 'ran', 'swim'],
    ['apple', 'fruit', 'carrot'],
    ['brother', 'sister', 'father'],
    ['dog', 'bark', 'cat']
]
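
A printed table is easy to read but hard to summarize. One possible complement, given here as a sketch under assumptions, is to pair each analogy with an expected answer and report the fraction a method gets right. The helper `score_analogies` is hypothetical, and the expected answers are hand-picked for illustration. To stay runnable without the GloVe file, the stand-in solver replays cosine outputs recorded earlier in this notebook.

```python
def score_analogies(solver, labelled_analogies):
    """Return the fraction of analogies where solver(w1, w2, w3) equals the expected word."""
    if not labelled_analogies:
        return 0.0
    correct = 0
    for (word1, word2, word3), expected in labelled_analogies:
        if solver(word1, word2, word3) == expected:
            correct += 1
    return correct / len(labelled_analogies)


# Hypothetical expected answers chosen by hand for illustration.
labelled = [
    (("france", "french", "italy"), "italian"),
    (("run", "ran", "swim"), "swam"),
    (("dog", "bark", "cat"), "meow"),
]

# Stand-in solver that replays the cosine outputs recorded earlier in this notebook.
recorded = {
    ("france", "french", "italy"): "italian",
    ("run", "ran", "swim"): "swam",
    ("dog", "bark", "cat"): "twigs",
}
accuracy = score_analogies(lambda a, b, c: recorded[(a, b, c)], labelled)
print(f"{accuracy:.2f}")
```

With the real embeddings loaded, the solver could be `lambda a, b, c: find_analogy_cosine(a, b, c, word2embedding)`.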

In [54]:
compare_analogy_methods(test_analogies, word2embedding)

Word1        Word2        Word3        Cosine Result        Euclidean Result    
--------------------------------------------------------------------------------
🔍 Closest word by cosine similarity: italian
🔍 Closest word by Euclidean distance: italian
france       french       italy        italian              italian             
--------------------------------------------------------------------------------
🔍 Closest word by cosine similarity: france
🔍 Closest word by Euclidean distance: france
tokyo        japan        paris        france               france              
--------------------------------------------------------------------------------
🔍 Closest word by cosine similarity: swam
🔍 Closest word by Euclidean distance: swam
run          ran          swim         swam                 swam                
--------------------------------------------------------------------------------
🔍 Closest word by cosine similarity: carrots
🔍 Closest word by Euclidean distance: carr