## Word2Vec (CBOW & Skip-Gram – Concept)

**Word2Vec** is a technique for learning **word embeddings**: each word is represented as a dense **vector** learned from the **contexts** in which it appears.

### Core Idea
- Words that appear in **similar contexts** tend to have **similar meanings**.
- Word2Vec learns a word's vector by **predicting neighboring words** within a fixed-size window around it.

In simple terms:
> **“Tell me a word’s neighbors, and I’ll tell you its meaning.”**
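The intuition above can be illustrated with a small sketch that simply counts each word's neighbors. This is not how Word2Vec itself works (it learns vectors by prediction rather than by counting), and the `neighbor_counts` helper and the toy corpus are hypothetical, introduced only for illustration.

```python
from collections import Counter, defaultdict

def neighbor_counts(sentences, window=2):
    """Count, for each word, the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # skip the word itself
                    counts[word][tokens[j]] += 1
    return counts

# Hypothetical toy corpus, chosen only for illustration.
toy_corpus = [
    ["the", "cat", "drinks", "milk"],
    ["the", "dog", "drinks", "water"],
]

counts = neighbor_counts(toy_corpus, window=1)
print(dict(counts["cat"]))
print(dict(counts["dog"]))
```

In this toy corpus, "cat" and "dog" end up with the same neighbor counts, which is the kind of signal that pushes their embeddings closer together.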

## Two Word2Vec Models

### 1. CBOW (Continuous Bag of Words)

**Idea:**  
Predict the **target word** using the **surrounding context words**.

### What Question Does CBOW Answer?
> **“Given these surrounding words, what is the missing word?”**

---

### Example Sentence

Sentence: **“I love natural language processing”**

If the **target word** is **“natural”** and the window covers two words on each side, CBOW uses:

- **Context words** → `["I", "love", "language", "processing"]`  
- **Target word**  → `"natural"`
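As a rough sketch of how such training examples could be built, the hypothetical helper `cbow_pairs` below slides over a tokenized sentence and pairs each target word with its surrounding context. It assumes a fixed window of two words per side; real implementations may vary the effective window size and subsample frequent words.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for one tokenized sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after
        pairs.append((left + right, target))
    return pairs

sentence = ["I", "love", "natural", "language", "processing"]
for context, target in cbow_pairs(sentence):
    print(context, "->", target)
```

The pair produced for "natural" matches the example above: all four remaining words form the context.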

---

### Key Points
- **Faster to train** than Skip-Gram
- Works well with **large datasets**
- Tends to represent **frequent words** better than Skip-Gram



### 2. Skip-Gram

**Idea:**  
Predict the **surrounding context words** using the **target word**.

### What Question Does Skip-Gram Answer?
> **“Given this word, which words are likely to appear around it?”**

---

### Example Sentence

Sentence: **“I love natural language processing”**

If the **target word** is **“natural”** and the window covers two words on each side, Skip-Gram predicts:

- **Target word**   → `"natural"`  
- **Context words** → `["I", "love", "language", "processing"]`
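A minimal sketch of the corresponding training pairs, assuming a fixed window of two words per side: the hypothetical helper `skipgram_pairs` emits one (target, context) pair for every neighbor of every word, so a single sentence yields more training examples than it does under CBOW.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target_word, context_word) pairs for one tokenized sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["I", "love", "natural", "language", "processing"]
pairs = skipgram_pairs(sentence)
print([p for p in pairs if p[0] == "natural"])
print(len(pairs), "pairs in total")
```

Each pair is a separate prediction task, which is one reason Skip-Gram is usually slower to train than CBOW.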

---

### Key Points
- **Slower to train** than CBOW
- Better for **rare words**
- Tends to produce **better embeddings** than CBOW on **small datasets**


In [25]:
from gensim.models import Word2Vec

In [26]:
sentences = [
    ["i", "love", "natural", "language", "processing"],
    ["i", "love", "ai"],
    ["natural", "language", "processing", "is", "fun"],
    ["ai", "is", "powerful"]
]

# CBOW Model

In [27]:
cbow_model = Word2Vec(
    sentences,
    vector_size=50,
    window=2,
    min_count=1,
    sg=0   # sg=0 means CBOW
)


In [28]:
cbow_vector = cbow_model.wv["language"]
print("CBOW vector length:", len(cbow_vector))
print(cbow_vector[:10])  # first 10 values

CBOW vector length: 50
[ 0.01563524 -0.01902128 -0.00041185  0.00693853 -0.00187765  0.01676509
  0.01802268  0.01307396 -0.00142376  0.01542243]


# Skip-Gram Model

In [29]:
skipgram_model = Word2Vec(
    sentences,
    vector_size=50,
    window=2,
    min_count=1,
    sg=1   # sg=1 means Skip-Gram
)

In [30]:
skipgram_vector = skipgram_model.wv["language"]
print("Skip-Gram vector length:", len(skipgram_vector))
print(skipgram_vector[:10])  # first 10 values

Skip-Gram vector length: 50
[ 0.01563529 -0.01902219 -0.00041269  0.00693866 -0.00187733  0.01676667
  0.01802381  0.01307491 -0.0014243   0.01542409]


In [31]:
print("CBOW similar words to 'ai':")
print(cbow_model.wv.most_similar("ai"))

print("\nSkip-Gram similar words to 'ai':")
print(skipgram_model.wv.most_similar("ai"))

CBOW similar words to 'ai':
[('natural', 0.12486250698566437), ('i', 0.07399576157331467), ('is', 0.04237300902605057), ('love', 0.018277151510119438), ('processing', 0.011071980930864811), ('fun', -0.1191045343875885), ('language', -0.1742543876171112), ('powerful', -0.1754782646894455)]

Skip-Gram similar words to 'ai':
[('natural', 0.12486250698566437), ('i', 0.07399576157331467), ('is', 0.04237300902605057), ('love', 0.018277151510119438), ('processing', 0.011071980930864811), ('fun', -0.11910455673933029), ('language', -0.17426103353500366), ('powerful', -0.1754782646894455)]
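The scores returned by `most_similar` are cosine similarities between word vectors, ranging from -1 to 1. The sketch below shows that computation in plain Python; `cosine_similarity` is a hypothetical helper written for illustration, not a gensim function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Scores close to 0, like the ones above, mean the vectors are nearly orthogonal, so the model has found little evidence that the words are related.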


## What You Should Notice

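- Both **CBOW** and **Skip-Gram** produce **one vector per word**.
- The vector for a word like **“AI”** is **the same everywhere** it appears.
- At lookup time, the **sentence a word appears in is not considered**.
- The CBOW and Skip-Gram outputs above are **nearly identical**. With only four short sentences, training most likely barely moves the vectors away from their initial values, so this toy corpus does not show the quality differences between the two models.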

### Conclusion
This highlights a key **limitation of Word2Vec**:
> It cannot handle **different meanings** of the same word based on context.
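The limitation can be sketched with a plain lookup table standing in for a trained model. The words, the three-dimensional vectors, and the `embed` helper below are all hypothetical and chosen only to illustrate the point.

```python
# Hypothetical 3-dimensional embeddings; the numbers are made up.
embedding_table = {
    "river": [0.9, 0.1, 0.0],
    "bank": [0.4, 0.5, 0.3],
    "loan": [0.0, 0.2, 0.8],
}

def embed(tokens, table):
    """Map each token to its fixed vector, ignoring the rest of the sentence."""
    return [table[token] for token in tokens]

vectors_a = embed(["river", "bank"], embedding_table)  # "bank" as a riverside
vectors_b = embed(["bank", "loan"], embedding_table)   # "bank" as an institution

# Both senses of "bank" receive exactly the same vector.
print(vectors_a[1] == vectors_b[0])
```

A static embedding model behaves like this table: the vector depends only on the word, never on the sentence around it.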

**Word2Vec** is still useful for basic NLP tasks such as **keyword search**, **text similarity**, and as **input features** for simple models.  
However, modern LLM and RAG systems generally rely on **contextual embeddings** instead, because Word2Vec produces **one fixed vector per word** and therefore **cannot represent context**.