# Quiz: Word Embeddings

### Multiple Choice Questions

1. What is the main goal of word embeddings?  
a) To translate text into another language  
b) To convert words into dense, meaningful numerical vectors  
c) To apply grammar rules  
d) To increase vocabulary size

2. Which of the following is a **dense** representation of text?  
a) One-hot vector  
b) Bag-of-words  
c) TF-IDF  
d) Word2Vec

3. Which Word2Vec training objective learns embeddings by predicting the surrounding context words from a given center word?  
a) CBOW  
b) Skip-gram  
c) LSTM  
d) Transformer

4. What does cosine similarity measure in the context of word vectors?  
a) The cosine of the angle between two vectors  
b) Word frequency  
c) Euclidean distance  
d) Probability of occurrence

5. What is a limitation of static word embeddings like Word2Vec and GloVe?  
a) They are too fast to train  
b) They assign the same vector to every instance of a word regardless of context  
c) They cannot be used with RNNs  
d) They only work on images


### Analytical Questions

1. Why do one-hot vectors fail to capture semantic meaning?

2. What does it mean when we say two words are "close" in embedding space?

3. How can bias in training data affect word embeddings? What are ways to detect or mitigate this?

4. Explain how word embeddings can help downstream tasks like classification or translation.

5. Compare and contrast CBOW and Skip-gram in terms of their objectives and behavior.


# Assignment: Exploring and Visualizing Word Embeddings

### Task 1: Load Pretrained Word Embeddings

- Load GloVe embeddings (e.g., `glove.6B.100d.txt`)
- Write a function to find the most similar words to a given word using cosine similarity
- Test your function on several example words (e.g., “king”, “car”, “university”)


In [None]:
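# Your code here (starter)
import numpy as np
from numpy.linalg import norm

def load_glove(file_path):
    """Read a GloVe text file into a dict mapping word -> float32 vector."""
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            # Each line is "<word> <v1> <v2> ... <vd>", separated by spaces
            parts = line.rstrip().split(' ')
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            word = parts[0]
            vector = np.array(parts[1:], dtype=np.float32)
            embeddings[word] = vector
    return embeddings

def most_similar(word, embeddings, top_n=5):
    """Return the top_n (word, cosine similarity) pairs closest to `word`."""
    if word not in embeddings:
        return []
    word_vec = embeddings[word]
    word_norm = norm(word_vec)  # computed once instead of once per comparison
    sims = {}
    for other, vec in embeddings.items():
        if other == word:
            continue
        denom = word_norm * norm(vec)
        if denom == 0:
            continue  # cosine similarity is undefined for a zero vector
        sims[other] = float(np.dot(word_vec, vec) / denom)
    return sorted(sims.items(), key=lambda item: item[1], reverse=True)[:top_n]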
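
The loop above computes one cosine similarity per vocabulary word, which can be slow on a large vocabulary. One possible alternative, sketched here under assumptions, is to normalize all vectors once and use a single matrix product. The names `build_index` and `most_similar_fast` are hypothetical helpers, not part of the starter, and the toy vectors are made up for illustration rather than taken from GloVe.

```python
import numpy as np

def build_index(embeddings):
    """Stack all vectors into one matrix whose rows have unit length."""
    words = list(embeddings)
    word_to_row = {w: i for i, w in enumerate(words)}
    matrix = np.stack([embeddings[w] for w in words]).astype(np.float32)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep zero vectors as zeros instead of dividing by zero
    return words, word_to_row, matrix / norms

def most_similar_fast(word, words, word_to_row, unit_matrix, top_n=5):
    """Cosine similarity of `word` against every row in one matrix product."""
    if word not in word_to_row:
        return []
    idx = word_to_row[word]
    # For unit-length rows, the dot product equals the cosine similarity
    sims = unit_matrix @ unit_matrix[idx]
    sims[idx] = -np.inf  # exclude the query word itself
    best = np.argsort(-sims)[:top_n]
    return [(words[i], float(sims[i])) for i in best]

# Toy 2-D vectors, made up purely for illustration (not real GloVe values)
toy = {
    "king": np.array([1.0, 0.9]),
    "queen": np.array([0.9, 1.0]),
    "car": np.array([1.0, -1.0]),
}
toy_words, toy_rows, toy_unit = build_index(toy)
result = most_similar_fast("king", toy_words, toy_rows, toy_unit, top_n=2)
print(result)
```

With real embeddings, `build_index(embeddings)` would be called once and its outputs reused across queries, trading extra memory for faster lookups.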


### Task 2: Visualize Word Embeddings

- Pick 50–100 common words (e.g., animals, professions, places)
- Use t-SNE or PCA to reduce embedding dimensions to 2D
- Plot the 2D points with matplotlib, label each point with its word, and describe which semantic clusters appear


In [None]:
# Your code here (starter)
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

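# Define your word list and extract vectors from the `embeddings` dict loaded in Task 1
# words = [...]
# words = [word for word in words if word in embeddings]  # skip out-of-vocabulary words
# vectors = np.array([embeddings[word] for word in words])

# Note: perplexity (default 30) must be smaller than the number of words
# tsne = TSNE(n_components=2, random_state=0)
# reduced = tsne.fit_transform(vectors)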

# plt.figure(figsize=(10, 8))
# for i, word in enumerate(words):
#     plt.scatter(reduced[i, 0], reduced[i, 1])
#     plt.annotate(word, (reduced[i, 0], reduced[i, 1]))
# plt.title("t-SNE Visualization of Word Embeddings")
# plt.grid(True)
# plt.show()
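
The task also allows PCA, while the starter above only covers t-SNE. The following is a minimal sketch of a PCA projection built on NumPy's SVD, assuming the word vectors are stacked into a 2D array. The function name `pca_2d` is hypothetical, and the random data merely stands in for real embeddings.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=np.float64)
    X_centered = X - X.mean(axis=0)  # PCA operates on mean-centered data
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:2].T

# Random stand-in for 60 word vectors of dimension 100 (not real embeddings)
rng = np.random.default_rng(0)
demo_vectors = rng.normal(size=(60, 100))
demo_reduced = pca_2d(demo_vectors)
print(demo_reduced.shape)
```

The output has the same layout as `reduced` in the t-SNE starter, so the same plotting loop should apply. PCA is a linear, deterministic projection, whereas t-SNE emphasizes local neighborhoods, so the two plots can look quite different for the same words.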
