
# 🧠 Week 14 – Word Embeddings (Word2Vec, GloVe, etc.)

---

## 📌 Motivation: Why Do We Need Word Embeddings?

In previous weeks, we used **TF-IDF** or **CountVectorizer**, which treat each word as an independent token.

But:
- They ignore **word meanings** and **context**
- The vectors are **sparse** and **high-dimensional**
- "king" and "queen" look just as unrelated as "king" and "banana", because every word gets its own independent dimension

We need a better way to represent words ✨
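To see the last point concretely, here is a minimal sketch using a made-up three-word vocabulary (the vocabulary and variable names are assumptions for illustration). Each word gets its own dimension, so the dot product between any two different words is zero.

```python
import numpy as np

# Hypothetical three-word vocabulary: one dimension per word,
# which is how a bag-of-words model represents a single word.
vocab = ["king", "queen", "banana"]
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

# A dot product of zero means the two vectors share nothing.
print(float(one_hot["king"] @ one_hot["queen"]))   # 0.0
print(float(one_hot["king"] @ one_hot["banana"]))  # 0.0
```

Both pairs score exactly the same, so the representation carries no notion of "king" being closer to "queen" than to "banana".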

---

## 💡 What are Word Embeddings?

Word embeddings map each word to a **dense vector** in a continuous space where:
- Similar words are close in distance
- Relationships between words can be captured (e.g., analogies)

Example:
$$
\text{vec}(\text{king}) - \text{vec}(\text{man}) + \text{vec}(\text{woman}) \approx \text{vec}(\text{queen})
$$  


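This kind of vector arithmetic is NOT possible with TF-IDF, where each dimension is tied to a single vocabulary word rather than to a learned feature.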
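The analogy arithmetic can be illustrated with a toy sketch. The 2-dimensional vectors below are hand-made, not learned: the first axis is assumed to mean "royalty" and the second "gender". Real embeddings have many more dimensions, and the analogy only holds approximately.

```python
import numpy as np

# Hand-made toy vectors (NOT learned embeddings).
# Axis 0 = "royalty", axis 1 = "gender" (both axes are assumptions).
toy = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

# Remove the "man" direction from "king", then add the "woman" direction.
result = toy["king"] - toy["man"] + toy["woman"]
print(np.allclose(result, toy["queen"]))  # True
```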

---

## 🧰 Word2Vec: Predictive Embeddings (Mikolov et al., 2013)

Two main architectures:
- **CBOW (Continuous Bag of Words)**: Predicts current word based on context
- **Skip-Gram**: Predicts surrounding words given the current word
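The difference between the two architectures is easiest to see in the training examples they consume. The sketch below generates both kinds from one sentence; the function names, the sample sentence, and the `window` parameter are assumptions chosen for illustration, not part of any library.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-Gram: one (center, context) pair per neighbor in the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: one (context words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

tokens = "the king rules the land".split()
print(skipgram_pairs(tokens, window=1)[:3])
# [('the', 'king'), ('king', 'the'), ('king', 'rules')]
print(cbow_examples(tokens, window=1)[:2])
# [(['king'], 'the'), (['the', 'rules'], 'king')]
```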

### How It Works:
- Feed in one-hot encodings of the input words
- Use the weight matrix of the hidden layer as the **embedding matrix** (one row per vocabulary word)
- Train a shallow neural network to maximize the probability of the correct prediction

This leads to **semantic structure** in the learned vectors.
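Why the hidden layer acts as an embedding matrix can be checked numerically: multiplying a one-hot vector by the weight matrix simply selects one row. The sketch below uses a random, untrained matrix and a made-up four-word vocabulary, so the numbers themselves are meaningless; only the row-selection behavior matters.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman"]  # hypothetical vocabulary
embedding_dim = 3

# Input-to-hidden weights: one row per word (random here, learned in practice).
W = rng.normal(size=(len(vocab), embedding_dim))

idx = vocab.index("queen")
one_hot = np.zeros(len(vocab))
one_hot[idx] = 1.0

# The hidden activation is exactly the row of W for "queen".
hidden = one_hot @ W
print(np.allclose(hidden, W[idx]))  # True
```

This is why implementations can skip the matrix product and look the row up by index.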

---

## 🧮 GloVe: Global Vectors for Word Representation (Stanford, 2014)

GloVe uses:
- Global word-word **co-occurrence statistics** from a corpus
- Trains a model where **ratios of co-occurrence probabilities** reflect semantic relationships

Key insight:
$$
\text{word vector dot product} \approx \log(\text{number of co-occurrences})
$$

It combines global matrix factorization, as in LSA, with local context-window learning, as in Word2Vec.
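The co-occurrence statistics that GloVe starts from can be sketched on a tiny corpus. This is a simplified illustration, not the GloVe implementation: the corpus and the function name are made up, and every neighbor within the window counts as 1, whereas GloVe weights pairs by their distance and then fits vectors to the log counts.

```python
import math
from collections import Counter

# Hypothetical two-sentence corpus.
corpus = [
    "the king rules the land".split(),
    "the queen rules the land".split(),
]

def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, neighbor) pair appears within the window."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

counts = cooccurrence_counts(corpus, window=2)
print(counts[("king", "rules")])   # 1
print(counts[("queen", "rules")])  # 1
print(counts[("rules", "the")])    # 4

# The value the dot product of "rules" and "the" would be trained to approximate.
log_target = math.log(counts[("rules", "the")])
print(log_target)
```

"king" and "queen" never co-occur here, but they share the same neighbors with the same counts, which is the kind of signal that pushes their vectors together.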

---



## ⚖️ Word2Vec vs GloVe

| Feature         | Word2Vec     | GloVe          |
|----------------|--------------|----------------|
| Training       | Predictive   | Count-based    |
| Context Type   | Local window | Global matrix  |
| Developed By   | Google       | Stanford       |
| Pros           | Fast, scalable | Captures global patterns |
| Output         | Dense vectors| Dense vectors  |

---

## 🔍 Cosine Similarity

We often compare word embeddings using **cosine similarity**:
$$
\text{cos}(\theta) = \frac{\vec{A} \cdot \vec{B}}{\|\vec{A}\| \|\vec{B}\|}
$$

This is the cosine of the **angle** between the two vectors. A value closer to 1 means a smaller angle → more similar words.
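The formula translates directly into NumPy. The helper below is a minimal sketch (the function name `cosine` is an assumption), and it does not guard against zero-length vectors.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine([1, 0], [2, 0]))   # 1.0  (same direction, different length)
print(cosine([1, 0], [0, 3]))   # 0.0  (perpendicular)
print(cosine([1, 0], [-1, 0]))  # -1.0 (opposite direction)
```

The first case shows that vector length is ignored; only direction matters.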

---



## 🧪 Example: Using Pretrained GloVe Embeddings

We'll use a small GloVe model (100-dimensional vectors).

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load GloVe vectors
embeddings = {}
with open("glove.6B.100d.txt", encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype='float32')
        embeddings[word] = vector

# Compare words
def similarity(w1, w2):
    v1 = embeddings.get(w1)
    v2 = embeddings.get(w2)
    if v1 is None or v2 is None:
        return None
    return cosine_similarity([v1], [v2])[0][0]

print("Similarity (king, queen):", similarity("king", "queen"))
print("Similarity (man, woman):", similarity("man", "woman"))
print("Similarity (apple, banana):", similarity("apple", "banana"))
```

---



To run the example, first download and unzip the pretrained GloVe 6B vectors (about 822 MB). In a notebook cell:

```python
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip
```

The archive contains `glove.6B.50d.txt`, `glove.6B.100d.txt`, `glove.6B.200d.txt`, and `glove.6B.300d.txt`.


Running the similarity example with `glove.6B.100d.txt` prints:

```
Similarity (king, queen): 0.7507691
Similarity (man, woman): 0.8323495
Similarity (apple, banana): 0.505447
```


## 🔧 Real-World Use Cases of Word Embeddings in NLP

| Task                   | Usage                                 |
|------------------------|----------------------------------------|
| Text Classification    | Embed each word → average → classify   |
| Similarity Search      | Find nearest neighbors in embedding space |
| Clustering/Topic Models| Use dense vectors instead of TF-IDF    |
| Pretrained Layers      | Initialize neural nets with embeddings |
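
The first two rows of the table can be sketched in a few lines. The vectors below are a made-up stand-in for the `embeddings` dictionary loaded from GloVe, and the helper names `sentence_vector` and `nearest` are assumptions for illustration.

```python
import numpy as np

# Toy stand-in for the GloVe `embeddings` dict (values are made up).
toy_embeddings = {
    "good":  np.array([0.9, 0.1]),
    "great": np.array([0.8, 0.2]),
    "bad":   np.array([-0.9, 0.1]),
    "movie": np.array([0.0, 1.0]),
}

def sentence_vector(tokens, embeddings):
    """Average the vectors of known words; returns None if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    return np.mean(vectors, axis=0)

def nearest(word, embeddings, k=1):
    """Return the k words with the highest cosine similarity to `word`."""
    query = embeddings[word]
    scores = {}
    for other, vec in embeddings.items():
        if other != word:
            scores[other] = float(
                query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec))
            )
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Out-of-vocabulary tokens are skipped when averaging.
print(sentence_vector(["good", "movie", "unknownword"], toy_embeddings))
print(nearest("good", toy_embeddings, k=1))  # ['great']
```

The averaged vector can be fed to any standard classifier. Averaging discards word order, which is one reason it serves mainly as a simple baseline.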

---

## 🔚 Summary

Word embeddings revolutionized NLP by:
- Bringing meaning to vectors
- Enabling semantic reasoning
- Powering neural networks with better inputs

> Next week, we'll move on to Transformers and BERT 🔥