# 📘 Skip-Gram Architecture in Word2Vec

## 🔍 What You'll Learn

In this notebook, we continue our journey into **Natural Language Processing (NLP)** by exploring the **Skip-Gram architecture**, a key component of the Word2Vec model.

We will:

- Revisit how **CBOW (Continuous Bag of Words)** works.
- Understand the **differences** between CBOW and Skip-Gram.
- Learn how Skip-Gram is structured in terms of:
  - Input and Output layers
  - Hidden layer and weight matrices
  - Forward and Backward propagation
- Visualize how word vectors are trained and extracted.
- Learn when to use CBOW vs. Skip-Gram depending on the dataset size.

---

## 🧠 Skip-Gram vs CBOW – Core Concept

- **CBOW** predicts the **target word** from a given **context**.
- **Skip-Gram** predicts the **context words** given a **target word**.

In other words:
- **CBOW:** Context → Target
- **Skip-Gram:** Target → Context

Both models use the same underlying neural network architecture but reverse the input-output mapping.
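
To make the reversed mapping concrete, here is a minimal sketch that builds training examples for both models from the same sentence. The helper names and `radius=2` (two words on each side of the target) are illustrative assumptions, not part of any library.

```python
def context_indices(center, n_tokens, radius):
    """Indices within `radius` positions of `center`, excluding `center` itself."""
    lo = max(0, center - radius)
    hi = min(n_tokens, center + radius + 1)
    return [j for j in range(lo, hi) if j != center]


def skipgram_pairs(tokens, radius):
    """One (target, context_word) pair per neighbour: Target -> Context."""
    return [
        (tokens[i], tokens[j])
        for i in range(len(tokens))
        for j in context_indices(i, len(tokens), radius)
    ]


def cbow_pairs(tokens, radius):
    """One (context_words, target) pair per position: Context -> Target."""
    return [
        ([tokens[j] for j in context_indices(i, len(tokens), radius)], tokens[i])
        for i in range(len(tokens))
    ]


tokens = "Enron company is related to data science".split()

# radius=2 is an assumption: two words on each side of the target
sg_pairs = skipgram_pairs(tokens, radius=2)
cb_pairs = cbow_pairs(tokens, radius=2)

print([pair for pair in sg_pairs if pair[0] == "is"])  # four pairs, one per neighbour
print(cb_pairs[2])                                     # all neighbours grouped, target "is"
```

Skip-Gram turns each position into several small examples, while CBOW turns it into a single example with a pooled context.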

---

## ⚙️ Architecture Breakdown

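- We use a window size (e.g., `window = 5`) to determine the number of context words. Conventions differ: a window of 5 can mean the target plus two words on each side, whereas Gensim's `window` parameter is the maximum distance between the target and a context word.
- The vocabulary size `V` determines the input/output dimensions; the hidden-layer size `N` is the dimension of the learned word vectors.
- We initialize weight matrices:
  - **Input to Hidden Layer:** `V x N` matrix (e.g., `7x5`)
  - **Hidden to Output Layer:** `N x V` matrix (e.g., `5x7`)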
- One-hot encoding is used for the input word.
- The **output** is processed with **softmax**, and **loss** is computed using cross-entropy.
- We train the model using **forward and backward propagation**.
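
The steps above can be sketched in plain Python for a single (target, context) pair. This is a simplified illustration under stated assumptions: the random initialization range, the learning rate, and the word indices are all made up, and full Skip-Gram training would sum the loss over every context word of every target.

```python
import math
import random

V, N = 7, 5        # vocabulary size and hidden (embedding) size from the text
random.seed(0)     # fixed seed so the sketch is repeatable

# W_in is V x N (input -> hidden); W_out is N x V (hidden -> output)
W_in = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(target_idx):
    """A one-hot input times W_in is simply row `target_idx` of W_in."""
    hidden = W_in[target_idx][:]
    scores = [sum(hidden[k] * W_out[k][j] for k in range(N)) for j in range(V)]
    return hidden, softmax(scores)


def train_step(target_idx, context_idx, lr=0.1):
    """One SGD step on one pair; returns the loss measured before the update."""
    hidden, probs = forward(target_idx)
    loss = -math.log(probs[context_idx])  # cross-entropy with a one-hot label
    # Gradient w.r.t. the output scores: probs - one_hot(context)
    err = [p - (1.0 if j == context_idx else 0.0) for j, p in enumerate(probs)]
    # Gradient w.r.t. the hidden layer, computed before W_out is modified
    grad_hidden = [sum(W_out[k][j] * err[j] for j in range(V)) for k in range(N)]
    for k in range(N):
        for j in range(V):
            W_out[k][j] -= lr * hidden[k] * err[j]
    for k in range(N):
        W_in[target_idx][k] -= lr * grad_hidden[k]
    return loss


# Hypothetical pair: word index 2 as the target, word index 3 as a context word
losses = [train_step(2, 3) for _ in range(50)]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Only one row of `W_in` is updated per step, because the one-hot input zeroes out the gradient for every other row.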

---

## 💡 Example: Enron and Data Science

Using the sentence:  
> “Enron company is related to data science.”

- Vocabulary size = 7
- Window size = 5
- Input word (e.g., “is”) → One-hot vector → Network predicts surrounding context words.
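
As a rough illustration of the one-hot step, the sketch below encodes “is” and shows that multiplying the one-hot vector by the input matrix just selects one row. The matrix values are placeholders chosen to be easy to read, not trained weights, and the index order (first appearance in the sentence) is an assumption.

```python
sentence = "Enron company is related to data science"
tokens = sentence.split()

# Map each distinct word to an index, in order of first appearance
vocab = {}
for word in tokens:
    vocab.setdefault(word, len(vocab))


def one_hot(word):
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec


x = one_hot("is")
print(x)  # index 2 is set because "is" is the third distinct word

# Placeholder V x N input matrix: row i is [i, i, i, i, i]
N = 5
W_in = [[float(i)] * N for i in range(len(vocab))]

# One-hot vector times W_in picks out exactly one row of W_in
hidden = [sum(x[i] * W_in[i][k] for i in range(len(vocab))) for k in range(N)]
print(hidden == W_in[vocab["is"]])
```

This row selection is why, once training finishes, the rows of the input matrix can be read off directly as the word vectors.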

---

## 📈 Training & Performance Tips

To improve Word2Vec (CBOW or Skip-Gram):

1. **Increase training data** – more text helps the model capture semantic relationships more reliably.
2. **Increase window size** – a larger context window yields more (target, context) training pairs per word, at the cost of longer training.
3. **Tune vector dimensions** – higher-dimensional embeddings can encode more nuanced meanings.
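
A quick way to see what tips 2 and 3 cost is to count training pairs and weights. The helper functions below are hypothetical, and `radius` again means the number of words taken on each side of the target. The counts describe the amount of work only; they say nothing about embedding quality.

```python
def skipgram_pair_count(n_tokens, radius):
    """Number of (target, context) pairs in one sentence of n_tokens words."""
    return sum(
        min(n_tokens - 1, i + radius) - max(0, i - radius)
        for i in range(n_tokens)
    )


def parameter_count(vocab_size, dim):
    """Weights in the V x N input matrix plus the N x V output matrix."""
    return 2 * vocab_size * dim


# Training pairs from the 7-word example sentence as the window grows
for radius in (1, 2, 3):
    print("radius", radius, "pairs", skipgram_pair_count(7, radius))

# Weights to learn as the vector dimension grows (vocabulary of 7 words)
for dim in (5, 100, 300):
    print("dim", dim, "weights", parameter_count(7, dim))
```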

---

## 🧠 When to Use CBOW vs Skip-Gram?

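| Model         | Best Used For                                               |
|---------------|-------------------------------------------------------------|
| **CBOW**      | Faster training; frequent words                             |
| **Skip-Gram** | Rare words; finer-grained representations (slower to train) |

Skip-Gram tends to learn better vectors for **rare words**, whereas CBOW trains several times faster and is slightly more accurate on **frequent words**. Dataset size is a weaker guide: a common rule of thumb pairs CBOW with smaller datasets and Skip-Gram with **large corpora**, but the original Word2Vec authors report that Skip-Gram also works well with small amounts of training data, so compare both on your own corpus when quality matters.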

---

## 🔍 What’s Next?

In the next notebook, we will:

- Use a **pre-trained Google Word2Vec model** trained on roughly **100 billion words** from Google News, with a vocabulary of about 3 million words and phrases.
- Each word will be represented as a **300-dimensional vector**.
- Implement Word2Vec using the **Gensim** library.
- Also, learn how to train a Word2Vec model from **scratch**.

---

## ✅ Why Do We Need This?

Understanding Skip-Gram and CBOW is essential because:

- These are foundational models in NLP for creating **dense word embeddings**.
- Word2Vec allows machines to understand **semantic similarity** between words.
- These embeddings are used in downstream tasks like **text classification**, **machine translation**, **chatbots**, and **semantic search**.
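
Semantic similarity between two embeddings is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; they are not trained embeddings, and real Word2Vec vectors would have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors, for illustration only
vectors = {
    "data": [0.9, 0.8, 0.1],
    "science": [0.8, 0.9, 0.2],
    "company": [0.1, 0.2, 0.9],
}

sim_close = cosine_similarity(vectors["data"], vectors["science"])
sim_far = cosine_similarity(vectors["data"], vectors["company"])
print(f"data vs science: {sim_close:.2f}")
print(f"data vs company: {sim_far:.2f}")
```

Vectors that point in nearly the same direction score close to 1, which is how downstream tasks such as semantic search rank related words.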

---

## 💬 Final Thoughts

Skip-Gram is a strong choice when you want to capture **fine-grained semantic details** between words, and it scales to **large datasets** at the cost of longer training. While CBOW is faster, Skip-Gram generally provides **better representations** for infrequent or rare words.

Having a solid understanding of these architectures, and of how they translate words into meaningful vectors, is crucial as you move forward in building NLP models or working with tools like **Gensim**, **spaCy**, or **transformers**.

---
