# Foundational Concepts for Word2Vec

To understand how Word2Vec works under the hood, in both its CBOW and Skip-gram variants, it's essential to grasp these three foundational components:

---

## 🧠 1. Artificial Neural Networks (ANN)

### What is an ANN?
An **Artificial Neural Network** is a computational model inspired by the human brain. It's made up of layers of **neurons** (also called nodes) that process input data and learn to make predictions.

### Basic Structure
Input Layer → Hidden Layer(s) → Output Layer


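Each **layer** is made of neurons, and each neuron:
- Computes a weighted sum of its inputs
- Applies an **activation function** to that sum (usually non-linear; in Word2Vec the hidden layer is a plain linear projection with no activation)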
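The two steps a neuron performs can be sketched in a few lines of NumPy. This is a minimal illustration, not part of any Word2Vec implementation: the function name `dense_layer`, the choice of `tanh` as the activation, and the weight values are all assumptions made for the example.

```python
import numpy as np

def dense_layer(x, W, b, activation=np.tanh):
    """Apply one layer of neurons: activation(W @ x + b)."""
    z = W @ x + b          # weighted sum of the inputs, one entry per neuron
    return activation(z)   # element-wise activation function

x = np.array([1.0, 2.0, 3.0])        # three input features
W = np.array([[0.1, 0.2, 0.3],
              [0.0, -0.5, 0.5]])     # two neurons, three weights each
b = np.zeros(2)                      # one bias per neuron

h = dense_layer(x, W, b)             # output of the layer, shape (2,)
```

Each row of `W` holds the weights of one neuron, so a whole layer reduces to a single matrix-vector product followed by the activation.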

### Components:
- **Input layer**: Takes in feature data (e.g., one-hot vectors for words).
- **Hidden layer(s)**: Perform computations to learn patterns/features.
- **Output layer**: Produces the final prediction (e.g., a probability for each word in the vocabulary).

### Example:
For CBOW:
- Input = context words (one-hot vectors)
- Output = center word prediction
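The CBOW mapping from context words to a center-word prediction can be sketched as a forward pass. This is a simplified illustration under stated assumptions: a toy five-word vocabulary, randomly initialised weights, averaging of the context embeddings, and a full softmax output. The names `W_in`, `W_out`, and `cbow_forward` are hypothetical, and practical implementations usually replace the full softmax with cheaper approximations.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]   # toy vocabulary (assumption)
V, N = len(vocab), 3                         # vocabulary size, embedding size

W_in = rng.normal(scale=0.1, size=(V, N))    # input-to-hidden weights
W_out = rng.normal(scale=0.1, size=(N, V))   # hidden-to-output weights

def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

def cbow_forward(context_ids):
    """Predict a distribution over the center word from context word ids."""
    X = np.stack([one_hot(i, V) for i in context_ids])  # shape (C, V)
    h = (X @ W_in).mean(axis=0)    # average of the context embeddings, shape (N,)
    return softmax(h @ W_out)      # one probability per vocabulary word

# Context "the cat _ on mat", where the true center word is "sat"
probs = cbow_forward([0, 1, 3, 4])
```

Multiplying a one-hot vector by `W_in` simply selects one row of the matrix, which is why that row can be read as the word's embedding.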

---

## 📉 2. Loss Function

### What is a Loss Function?

A **loss function** measures how wrong the model's prediction is compared to the true value.

> It guides the model on how to adjust its internal weights during training.

### Common Loss Functions:

#### ✅ Cross-Entropy Loss (used in classification)
\[
\mathcal{L} = -\sum_{i=1}^{V} y_i \log(\hat{y}_i)
\]

Where:
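- \( V \) = number of classes (for Word2Vec, the vocabulary size)
- \( y \) = true label (one-hot encoded), with \( y_i \) its \( i \)-th entry
- \( \hat{y} \) = predicted probabilities from the model, with \( \hat{y}_i \) the probability assigned to class \( i \)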
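To make the formula concrete, here is a small sketch that evaluates the cross-entropy for a one-hot label against two hypothetical predictions. The function name `cross_entropy`, the clipping constant `eps`, and the example probabilities are assumptions chosen for illustration.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)        # avoid log(0)
    return -np.sum(y_true * np.log(y_pred))

y = np.array([0.0, 0.0, 1.0, 0.0])            # true class is index 2

confident = np.array([0.05, 0.05, 0.85, 0.05])  # most mass on the true class
mistaken = np.array([0.40, 0.30, 0.10, 0.20])   # little mass on the true class

loss_confident = cross_entropy(y, confident)  # equals -log(0.85)
loss_mistaken = cross_entropy(y, mistaken)    # equals -log(0.10), which is larger
```

Because \( y \) is one-hot, every term of the sum vanishes except the one for the true class, so the loss reduces to the negative log of the probability the model assigned to the correct word.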

### Why Important?
A lower loss means the model's predictions are closer to the true labels on the data being evaluated. Training aims to drive this value down.

---

## ⚙️ 3. Optimizers

### What is an Optimizer?

An **optimizer** adjusts the neural network's weights to reduce the loss.

> Think of it as the engine that drives the learning process in an ANN.

### How it works:
- Uses gradients (from **backpropagation**) to update weights
- Applies a **learning rate** to control step size

### Common Optimizers:

| Optimizer | Description |
|----------|-------------|
| **SGD** (Stochastic Gradient Descent) | Updates weights using one sample (or a small mini-batch) at a time |
| **Adam** (Adaptive Moment Estimation) | Combines momentum with per-parameter adaptive learning rates. Efficient and popular |
| **RMSProp** | Adapts the learning rate per weight using a moving average of recent squared gradients |

### Update Rule (SGD example):

\[
w := w - \eta \cdot \nabla \mathcal{L}
\]

Where:
- \( w \): weight
- \( \eta \): learning rate
- \( \nabla \mathcal{L} \): gradient of the loss with respect to the weights
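The update rule can be tried on a toy problem. The sketch below assumes a simple quadratic loss \( \mathcal{L}(w) = \lVert w - t \rVert^2 \), whose gradient is \( 2(w - t) \), instead of a real network loss; the names `sgd_step` and `target` and the learning rate of 0.1 are illustrative choices.

```python
import numpy as np

def sgd_step(w, grad, lr):
    """One gradient descent update: w := w - lr * grad."""
    return w - lr * grad

target = np.array([1.0, -2.0])   # minimiser of the toy loss (assumption)
w = np.zeros(2)                  # initial weights
lr = 0.1                         # learning rate (eta)

for _ in range(100):
    grad = 2.0 * (w - target)    # gradient of ||w - target||^2
    w = sgd_step(w, grad, lr)    # move against the gradient
```

With this learning rate each step shrinks the distance to `target` by a factor of 0.8, so `w` ends up very close to the minimiser. In a real network the gradient would come from backpropagation rather than a closed-form expression.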

---

## Summary Table

| Component     | Role                                                                 |
|---------------|----------------------------------------------------------------------|
| **ANN**        | Learns to map input to output through layers of neurons              |
| **Loss Function** | Tells the model **how wrong** it is                               |
| **Optimizer**     | Helps the model **improve** by adjusting weights                  |

These three together form the foundation of neural models like Word2Vec, which is itself a shallow network with a single hidden layer.