# Homework 5 - Representation Learning and Word Embeddings

In this assignment, you will explore word embeddings using Word2Vec and other techniques. You will also visualize the learned embeddings and perform various tasks to understand the properties and applications of word embeddings.

You will:
1. Load pre-trained word embeddings using gensim.
2. Compute cosine similarity between word vectors.
3. Solve word analogy problems using word embeddings.
4. Visualize word embeddings using PCA.
5. Analyze the effect of singular/plural forms and comparative/superlative forms of words in the embedding space.


## Task 1: Load Pre-trained Word Embeddings (1 pt)

You will load pre-trained word embeddings using the gensim library. You will also get the embedding for a specific word and check its dimensionality.

In [None]:
!pip install --upgrade gensim
import gensim.downloader as api

# Load pre-trained word vectors
word_vectors = #TODO: Load pretrained embeddings from gensim
embedding_computer = #TODO: Get the embedding for the word 'computer'

print(f"Embedding for 'computer': {embedding_computer}")
print(f"Dimension of the embedding: {embedding_computer.shape[0]}")

## Task 2: Compute Cosine Similarity (1 pt)

In this task, you will implement a function that computes the cosine similarity between two vectors from scratch (i.e., without using built-in functions except `math.sqrt()`). Cosine similarity is the dot product of the two vectors divided by the product of their Euclidean norms, so it always lies between -1 and 1. You will then use this function to compute the similarity between several pairs of words. Understanding cosine similarity is crucial for working with word embeddings and other vector representations.

In [None]:
import numpy as np
import math

def cosine_similarity(vec1, vec2):
    # TODO: Implement function to compute cosine similarity between two embeddings vec1 and vec2
    raise NotImplementedError("Replace this line with your implementation")

# Compute similarity between pairs of words
words = [('computer', 'science'), ('boy', 'girl'), ('king', 'queen'), ('man', 'woman'), ('apple', 'orange'), ('cat', 'dog')]
for word1, word2 in words:
    vec1 = word_vectors[word1]
    vec2 = word_vectors[word2]
    similarity = cosine_similarity(vec1, vec2)
    print(f"Cosine similarity between '{word1}' and '{word2}': {similarity}")
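One way to sanity-check a from-scratch implementation is to compare it against a NumPy-based reference on vectors whose answer is easy to verify by hand. The sketch below is only a cross-check, not the intended solution for this task; the name `reference_cosine_similarity` and the toy vectors are illustrative assumptions.

```python
import numpy as np

def reference_cosine_similarity(vec1, vec2):
    """NumPy-based cosine similarity, intended only as a cross-check."""
    a = np.asarray(vec1, dtype=float)
    b = np.asarray(vec2, dtype=float)
    # Dot product divided by the product of the Euclidean norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors chosen so the expected relationship is obvious
orthogonal = reference_cosine_similarity([1.0, 0.0], [0.0, 1.0])          # perpendicular vectors
parallel = reference_cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # same direction
opposite = reference_cosine_similarity([1.0, 1.0], [-1.0, -1.0])          # opposite direction
print(orthogonal, parallel, opposite)
```

Your own `cosine_similarity` should agree with this reference up to floating-point rounding, both on the toy vectors and on the word pairs above.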

## Task 3: Solve Word Analogies (1 pt)

In this task, you will use the word embeddings to solve word analogy problems. For example, you can solve the analogy `queen : woman :: ? : man` by finding the word whose vector is most similar to the result of the operation `queen - woman + man`.

Use gensim's built-in function `most_similar` to find the word most similar to the result of each vector arithmetic operation.

This task illustrates the power of word embeddings in capturing semantic relationships.

In [None]:
# Solve word analogies
analogies = [
    ('queen', 'woman', 'man'),
    ('dad', 'man', 'woman'),
    ('paris', 'france', 'italy'),
    ('sun', 'day', 'night'),
    ('tree', 'forest', 'river'),
    ('happy', 'joy', 'sad')
]
for word1, word2, word3 in analogies:
    #TODO: Solve the analogies above by finding the word most similar (your_answer) to the result of vector arithmetic operations described.
    print(f"{word1} - {word2} + {word3} = {your_answer} ")
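The vector arithmetic behind these analogies can be illustrated on a tiny hand-made vocabulary, without downloading any model. The sketch below is a toy under stated assumptions: the four-dimensional vectors and the helper `solve_toy_analogy` are hypothetical and are not taken from any real embedding.

```python
import numpy as np

# Hypothetical 4-dimensional toy embeddings: [royalty, male, female, fruit]
toy_vectors = {
    "king":  np.array([1.0, 1.0, 0.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0, 0.0]),
    "apple": np.array([0.0, 0.0, 0.0, 1.0]),
}

def solve_toy_analogy(word1, word2, word3, vectors):
    """Return the word closest (by cosine) to word1 - word2 + word3."""
    target = vectors[word1] - vectors[word2] + vectors[word3]
    best_word, best_score = None, -np.inf
    for word, vec in vectors.items():
        if word in (word1, word2, word3):
            continue  # skip the query words themselves
        score = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

toy_answer = solve_toy_analogy("queen", "woman", "man", toy_vectors)
print(f"queen - woman + man = {toy_answer}")
```

With these toy vectors, `queen - woman + man` equals the `king` vector exactly, so `king` is returned. Real embeddings are noisier, which is why the nearest neighbor is used rather than an exact match. In gensim, `most_similar` accepts `positive` and `negative` word lists and returns `(word, similarity)` pairs, which covers the same idea.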

## Task 4: Visualize Word Embeddings

Next, we will visualize word embeddings using PCA (Principal Component Analysis). We will select a set of words, look up their embeddings, and project them onto the first two principal components. The resulting 2D plot can help us understand the structure and relationships in the embedding space.

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# List of words to visualize
words_to_visualize = ['king', 'queen', 'man', 'woman', 'boy', 'girl', 'computer', 'science', 'math', 'art', 'apple', 'orange', 'cat', 'dog', 'car', 'bicycle']
embeddings = [word_vectors[word] for word in words_to_visualize]

# Reduce dimensionality
pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(embeddings)

# Plot embeddings
plt.figure(figsize=(12, 8))
for i, word in enumerate(words_to_visualize):
    plt.scatter(embeddings_2d[i, 0], embeddings_2d[i, 1])
    plt.annotate(word, (embeddings_2d[i, 0], embeddings_2d[i, 1]))
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('2D Visualization of Word Embeddings')
plt.grid(True)
plt.show()
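To make the PCA step above less of a black box, here is a minimal NumPy sketch of the same projection, assuming the standard formulation of PCA as an SVD of the mean-centered data. The random matrix `fake_embeddings` is a stand-in, not real word vectors, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-in for 16 word embeddings of dimension 50 (not real embeddings)
fake_embeddings = rng.normal(size=(16, 50))

# 1. Center each feature so the components describe variance around the mean
centered = fake_embeddings - fake_embeddings.mean(axis=0)

# 2. SVD of the centered data; the rows of vt are the principal directions
u, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# 3. Project onto the first two principal directions
projected_2d = centered @ vt[:2].T

# Share of the total variance retained by the 2D projection
explained = singular_values**2 / np.sum(singular_values**2)
retained = float(explained[:2].sum())
print(projected_2d.shape, round(retained, 3))
```

The retained share is worth checking before reading too much into a 2D plot: if the first two components explain only a small fraction of the variance, distances in the plot can be misleading. On a fitted scikit-learn `PCA` object, the corresponding values are available as `explained_variance_ratio_`, and its projection should match this sketch up to the sign of each axis.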

## Task 5: Analyze Singular and Plural Forms

Next, we will analyze the singular and plural forms of several nouns and their relations in the embedding space. We will select a set of words, get their embeddings, and visualize them. This can help us understand how embeddings represent morphological variations.

In [None]:
# List of singular and plural words
singular_plural_words = ['king', 'kings', 'queen', 'queens', 'man', 'men', 'woman', 'women', 'boy', 'boys', 'girl', 'girls', 'cat', 'cats', 'dog', 'dogs']
embeddings_singular_plural = [word_vectors[word] for word in singular_plural_words]

# Reduce dimensionality (re-fits the `pca` object created in Task 4 on the new word set)
embeddings_singular_plural_2d = pca.fit_transform(embeddings_singular_plural)

# Plot embeddings
plt.figure(figsize=(12, 8))
for i, word in enumerate(singular_plural_words):
    plt.scatter(embeddings_singular_plural_2d[i, 0], embeddings_singular_plural_2d[i, 1])
    plt.annotate(word, (embeddings_singular_plural_2d[i, 0], embeddings_singular_plural_2d[i, 1]))
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('2D Visualization of Singular and Plural Word Embeddings')
plt.grid(True)
plt.show()
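A 2D plot can hide or distort directions, so one possible complement is to measure how consistent the singular-to-plural offsets are in the full embedding space. The sketch below is an assumption-laden illustration: the helper `mean_offset_cosine` and the toy vectors are hypothetical, and the toy data is built so that every plural differs from its singular by the same shift.

```python
import numpy as np
from itertools import combinations

def mean_offset_cosine(pairs, vectors):
    """Average pairwise cosine similarity between (second - first) offsets."""
    offsets = [vectors[b] - vectors[a] for a, b in pairs]
    scores = []
    for o1, o2 in combinations(offsets, 2):
        scores.append(np.dot(o1, o2) / (np.linalg.norm(o1) * np.linalg.norm(o2)))
    return float(np.mean(scores))

# Hypothetical toy vectors in which the plural adds the same shift to each noun
plural_shift = np.array([0.0, 0.0, 1.0])
toy_plural_vectors = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "dog": np.array([0.0, 1.0, 0.0]),
    "boy": np.array([1.0, 1.0, 0.0]),
}
for noun in list(toy_plural_vectors):
    toy_plural_vectors[noun + "s"] = toy_plural_vectors[noun] + plural_shift

toy_pairs = [("cat", "cats"), ("dog", "dogs"), ("boy", "boys")]
consistency = mean_offset_cosine(toy_pairs, toy_plural_vectors)
print(f"Mean offset cosine: {consistency:.3f}")
```

To try this on the real embeddings, pass `word_vectors` together with pairs such as `('king', 'kings')` and `('man', 'men')`. A value close to 1 would suggest that the plural corresponds to a roughly shared direction, while a value near 0 would suggest no consistent direction. The same helper could be applied to other word-form pairs, such as an adjective and its comparative.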

## Task 6: Analyze Adjectives and Their Forms

Next, we will analyze the comparative and superlative forms of several adjectives and their relations in the embedding space. We will select a set of words, get their embeddings, and visualize them.

This can help us understand how word embeddings capture comparative and superlative relations.

In [None]:
# List of adjectives and their comparative and superlative forms
adjectives = ['good', 'better', 'best', 'bad', 'worse', 'worst', 'slow', 'slower', 'slowest', 'fast', 'faster', 'fastest', 'happy', 'happier', 'happiest', 'sad', 'sadder', 'saddest']
embeddings_adjectives = [word_vectors[word] for word in adjectives]

# Reduce dimensionality (re-fits the `pca` object created in Task 4 on the adjectives)
embeddings_adjectives_2d = pca.fit_transform(embeddings_adjectives)

# Plot embeddings
plt.figure(figsize=(12, 8))
for i, word in enumerate(adjectives):
    plt.scatter(embeddings_adjectives_2d[i, 0], embeddings_adjectives_2d[i, 1])
    plt.annotate(word, (embeddings_adjectives_2d[i, 0], embeddings_adjectives_2d[i, 1]))
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('2D Visualization of Adjective Word Embeddings')
plt.grid(True)
plt.show()


## Summary (1 pt)

Briefly summarize your qualitative findings from Tasks 4, 5 and 6.