# Understanding Word2Vec with CBOW Architecture

## Prerequisites

Before diving into Word2Vec, ensure you're familiar with the following:

- **Artificial Neural Networks (ANN)**
- **Loss Functions**
- **Optimizers**

---

## What is Word2Vec?

Word2Vec is a popular word embedding technique that converts words into numerical vector representations. There are two primary architectures:

- **CBOW (Continuous Bag of Words)**
- **Skip-gram**

---

## Pre-trained vs. Custom Word2Vec

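- **Pre-trained** Word2Vec models (e.g., Google’s model trained on roughly 100 billion words of Google News text, with a vocabulary of about 3 million words and phrases) can be used directly, with no training required.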
- **Custom training** allows us to understand internal workings and tailor embeddings to specific datasets.

---

## Sample Corpus

Let's use a simple sentence for demonstration:

"I neuron company is related to data science"


This corpus consists of 8 words, all distinct, so the vocabulary also has 8 entries.

---

## Step 1: Defining Window Size

Let’s use a **window size of 5**: the center word plus two context words on each side (an odd size gives a clear center word).

From the first window:

I neuron company is related


- **Input**: `I`, `neuron`, `is`, `related`
- **Center Word (Output)**: `company`

Here’s how input-output pairs are generated as the window slides one word at a time:

1. **Input**: `I neuron is related` → **Output**: `company`
2. **Input**: `neuron company related to` → **Output**: `is`
3. **Input**: `company is to data` → **Output**: `related`
4. **Input**: `is related data science` → **Output**: `to`

With 8 words and only full windows counted, the corpus yields exactly these 4 pairs.
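The sliding window can be sketched in a few lines of Python. This is a minimal illustration, assuming that a window size of 5 means the center word plus two context words on each side, and that only full windows are used (real implementations typically also handle the edges of the text). The function name `cbow_pairs` is a hypothetical helper, not part of any library.

```python
def cbow_pairs(tokens, window_size=5):
    """Return (context_words, center_word) pairs for every full window."""
    half = window_size // 2  # number of context words on each side
    pairs = []
    for center in range(half, len(tokens) - half):
        # words before the center + words after the center
        context = tokens[center - half:center] + tokens[center + 1:center + half + 1]
        pairs.append((context, tokens[center]))
    return pairs


corpus = "I neuron company is related to data science".split()
pairs = cbow_pairs(corpus, window_size=5)

for context, target in pairs:
    print(" ".join(context), "->", target)
```

Each pair has `window_size - 1` context words, which is the number of inputs the network receives per training example.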

---

## Step 2: One-Hot Encoding

Each word in the vocabulary is represented as a **one-hot encoded vector**:

Example Vocabulary:

['I', 'neuron', 'company', 'is', 'related', 'to', 'data', 'science']


With vocabulary size `V = 8`, each word is encoded as a `1 x 8` vector that has a 1 at the word's index and 0 everywhere else:

- `neuron`: `[0, 1, 0, 0, 0, 0, 0, 0]`
- `company`: `[0, 0, 1, 0, 0, 0, 0, 0]`
- `is`: `[0, 0, 0, 1, 0, 0, 0, 0]`
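As a rough sketch, the encoding can be built from a word-to-index lookup. The names `word_to_index` and `one_hot` are illustrative choices, and the index order simply follows the example vocabulary above.

```python
vocab = ['I', 'neuron', 'company', 'is', 'related', 'to', 'data', 'science']

# Map each word to its position in the vocabulary.
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a list of length V with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec


print(one_hot("neuron"))   # [0, 1, 0, 0, 0, 0, 0, 0]
print(one_hot("company"))  # [0, 0, 1, 0, 0, 0, 0, 0]
```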

---

## Step 3: CBOW Architecture Using ANN

CBOW is implemented using a **fully connected neural network (ANN)**.

### Model Architecture

1. **Input Layer**:
    - Number of inputs = number of context words (`4` for a window size of 5)
    - Each word = one-hot vector of size `V`
    - Total input = `4 x V`

2. **Hidden Layer**:
    - Size = Embedding dimensions (`N`, e.g., 5)
    - Each word is projected to a `1 x N` vector
    - Hidden layer output = average of context word embeddings

3. **Output Layer**:
    - Outputs a vector of size `V`
    - Softmax is applied to produce a probability distribution over the vocabulary
    - Target is the center word’s one-hot vector

### Visualization:

Input Words (one-hot) → Hidden Layer (averaging) → Output Word (Softmax)


---

## Step 4: Training the CBOW Model

### Forward Propagation

- Each context word is converted to its one-hot vector.
- Each one-hot vector is multiplied by the weight matrix `W1` (`V x N`). Because only one entry is 1, this selects the matching row of `W1`, which is that word's embedding.
- The context embeddings are averaged to form the hidden layer representation (`1 x N`).
- The hidden layer output is multiplied by `W2` (`N x V`) to get the output logits (`1 x V`).
- **Softmax** is applied to the logits to get the prediction `ŷ`.
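The forward pass can be sketched with plain Python lists so that every step stays visible. This is an illustration under assumptions, not a reference implementation: the random initialization range, the fixed seed, and the helper names `softmax` and `forward` are all arbitrary choices, and the weights are untrained.

```python
import math
import random

V, N = 8, 5              # vocabulary size and embedding dimension from the text
rng = random.Random(0)   # fixed seed, chosen only for repeatability

# W1 is V x N (one embedding row per word); W2 is N x V.
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
W2 = [[rng.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]


def softmax(logits):
    """Turn logits into probabilities (max is subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(context_ids):
    """Return (hidden, y_hat) for a list of context word indices."""
    # one-hot x W1 selects a row of W1, so the rows are indexed directly
    embeddings = [W1[i] for i in context_ids]
    # average the context embeddings dimension by dimension -> 1 x N
    hidden = [sum(column) / len(context_ids) for column in zip(*embeddings)]
    # hidden x W2 -> 1 x V logits
    logits = [sum(hidden[k] * W2[k][j] for k in range(N)) for j in range(V)]
    return hidden, softmax(logits)


# Four context indices, e.g. the window "neuron company is related to"
# without its center word "is": neuron=1, company=2, related=4, to=5.
hidden, y_hat = forward([1, 2, 4, 5])
print(len(hidden), len(y_hat), round(sum(y_hat), 6))
```

Indexing `W1` directly instead of multiplying by a one-hot vector gives the same result and is how embedding layers are usually implemented in practice.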

### Loss Function

Use the **cross-entropy loss** between the predicted probability vector `ŷ` and the actual one-hot vector `y` of the center word.

\[
\mathcal{L} = -\sum_{i=1}^{V} y_i \log(\hat{y}_i)
\]
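Because `y` is one-hot, every term except the one for the true center word vanishes, so the loss reduces to the negative log of the probability assigned to that word. A small sketch with a made-up prediction `y_hat` (the numbers are invented for illustration only):

```python
import math


def cross_entropy(y, y_hat):
    """Cross-entropy between a one-hot target y and predicted probabilities y_hat."""
    # Terms with y_i = 0 contribute nothing, so they are skipped
    # (this also avoids evaluating log(0) for them).
    return -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0)


y = [0, 0, 0, 1, 0, 0, 0, 0]                               # one-hot target for "is"
y_hat = [0.05, 0.05, 0.10, 0.60, 0.05, 0.05, 0.05, 0.05]   # hypothetical prediction

loss = cross_entropy(y, y_hat)
print(loss, -math.log(y_hat[3]))  # the two values are equal
```

A prediction that puts more probability on the correct word yields a smaller loss, and a perfect prediction yields a loss of 0.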

### Backpropagation

- Compute the gradients of the loss with respect to `W1` and `W2`
- Update `W1` and `W2` using an **optimizer** like SGD or Adam
- Repeat until the loss converges
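Putting the pieces together, a complete training loop might look like the sketch below. It assumes plain SGD, a window size of 5 (two context words per side), full windows only, and arbitrary values for the learning rate, epoch count, and initialization. It relies on the standard result that, for softmax with cross-entropy, the gradient with respect to the logits is `ŷ - y`. Real Word2Vec implementations add optimizations such as negative sampling that are not shown here.

```python
import math
import random

corpus = "I neuron company is related to data science".split()
vocab = list(dict.fromkeys(corpus))            # unique words, first-seen order
word_to_index = {w: i for i, w in enumerate(vocab)}

V, N, HALF = len(vocab), 5, 2                  # window size 5 -> 2 words per side
LEARNING_RATE, EPOCHS = 0.1, 200               # arbitrary illustrative values

# (context indices, target index) for every full window
ids = [word_to_index[w] for w in corpus]
pairs = [(ids[c - HALF:c] + ids[c + 1:c + HALF + 1], ids[c])
         for c in range(HALF, len(ids) - HALF)]

rng = random.Random(0)
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
W2 = [[rng.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_epoch():
    """Run one SGD pass over all pairs and return the mean loss."""
    total_loss = 0.0
    for context, target in pairs:
        # Forward pass: average context embeddings, project, softmax.
        hidden = [sum(W1[i][k] for i in context) / len(context) for k in range(N)]
        y_hat = softmax([sum(hidden[k] * W2[k][j] for k in range(N))
                         for j in range(V)])
        total_loss += -math.log(y_hat[target])

        # Backward pass: gradient w.r.t. the logits is y_hat - y.
        d_logits = y_hat[:]
        d_logits[target] -= 1.0
        # Gradient w.r.t. hidden, computed before W2 is modified.
        d_hidden = [sum(W2[k][j] * d_logits[j] for j in range(V)) for k in range(N)]

        # SGD updates.
        for k in range(N):
            for j in range(V):
                W2[k][j] -= LEARNING_RATE * hidden[k] * d_logits[j]
        for i in context:  # each context row receives 1/C of the hidden gradient
            for k in range(N):
                W1[i][k] -= LEARNING_RATE * d_hidden[k] / len(context)
    return total_loss / len(pairs)


losses = [train_epoch() for _ in range(EPOCHS)]
print(f"mean loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

After training, row `i` of `W1` is the learned embedding for word `i`. With a single eight-word sentence the model can only memorize its few training pairs, so the resulting vectors are not meaningful; useful embeddings need a much larger corpus.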

---

## Step 5: Generating Word Vectors

Once training completes:

- Each word has an embedding (vector) of size `N`
- For example (illustrative values with `N = 5`):
    - `"neuron"` → `[0.94, 0.32, 0.56, 0.21, 0.88]`
    - `"company"` → `[0.65, 0.22, 0.77, 0.13, 0.43]`

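These vectors capture **semantic relationships** between words: when trained on a sufficiently large corpus, words that appear in similar contexts end up with similar vectors.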
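A common way to compare two embeddings is cosine similarity, which measures the angle between the vectors regardless of their length. The sketch below applies it to the two example vectors above; since those values are illustrative rather than trained, the resulting score only demonstrates the computation. The helper name `cosine_similarity` is an assumption for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Example vectors from the text (illustrative values, not trained output).
embeddings = {
    "neuron": [0.94, 0.32, 0.56, 0.21, 0.88],
    "company": [0.65, 0.22, 0.77, 0.13, 0.43],
}

sim = cosine_similarity(embeddings["neuron"], embeddings["company"])
print(f"similarity(neuron, company) = {sim:.3f}")
```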

---

## Recap

- CBOW predicts a center word using surrounding context words.
- It uses a basic feedforward neural network with:
    - One-hot encoded inputs
    - Hidden layer with reduced dimensionality
    - Output layer with Softmax
- The final embeddings are stored in the `W1` matrix (input-to-hidden weights): row `i` is the embedding of word `i`

---

## Advantages of CBOW

- Faster to train than Skip-gram
- Works well for frequent words

## Disadvantages

- May not perform well with rare words

---

## What’s Next?

In the next lesson, we will discuss the **Skip-gram** architecture and compare it with CBOW.

