# Notes

**Topic:** Why Transformers are used in GenAI

**Why:** From classic NNs & backprop → architectures suited to modern GenAI

---

**1) What a neural network is and how it learns (recap)**

* Layers apply $z = Wx + b$ then a nonlinearity $a=f(z)$; stacking layers lets the model learn complex functions.
* **Backpropagation** uses the chain rule to compute gradients efficiently and update weights to minimize a loss.
* The choice of activation, optimizer, and loss affects training stability, convergence speed, and representation quality.
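
The recap above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the toy data, layer sizes, `tanh` activation, MSE loss, and learning rate are all assumptions chosen for brevity, and the batch is stored row-wise, so the layer is written `X @ W + b` rather than $Wx + b$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy regression data: 8 samples, 3 features, 1 target.
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))

# One hidden layer (tanh) followed by a linear output layer.
W1 = rng.normal(scale=0.5, size=(3, 4))
b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(4, 1))
b2 = np.zeros(1)

def forward(X, W1, b1, W2, b2):
    z1 = X @ W1 + b1          # z = Wx + b (row-batch form)
    a1 = np.tanh(z1)          # a = f(z)
    y_hat = a1 @ W2 + b2      # linear output
    return a1, y_hat

def mse(y_hat, y):
    return float(np.mean((y_hat - y) ** 2))

def backward(X, y, W2, a1, y_hat):
    n = X.shape[0]
    d_yhat = 2.0 * (y_hat - y) / n      # dL/dy_hat for the mean squared error
    dW2 = a1.T @ d_yhat
    db2 = d_yhat.sum(axis=0)
    d_a1 = d_yhat @ W2.T                # chain rule through the output layer
    d_z1 = d_a1 * (1.0 - a1 ** 2)       # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ d_z1
    db1 = d_z1.sum(axis=0)
    return dW1, db1, dW2, db2

a1, y_hat = forward(X, W1, b1, W2, b2)
loss_before = mse(y_hat, y)
dW1, db1, dW2, db2 = backward(X, y, W2, a1, y_hat)

# Finite-difference check of one gradient entry.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (mse(forward(X, W1_plus, b1, W2, b2)[1], y)
           - mse(forward(X, W1_minus, b1, W2, b2)[1], y)) / (2 * eps)

# One plain SGD step.
lr = 0.01
loss_after = mse(
    forward(X, W1 - lr * dW1, b1 - lr * db1, W2 - lr * dW2, b2 - lr * db2)[1], y
)

print(f"analytic dL/dW1[0,0] = {dW1[0, 0]:.6f}, numeric = {numeric:.6f}")
print(f"loss before = {loss_before:.4f}, after one step = {loss_after:.4f}")
```

The finite-difference comparison is a cheap way to confirm that the chain-rule gradients are right before trusting any training run.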

**2) Types of neural networks (at a glance)**

* **Perceptron / Feedforward (MLP)**

  * Fully connected layers; general-purpose function approximators.
  * Works well on tabular data or learned embeddings.
* **Convolutional Neural Networks (CNNs)**

  * Weight sharing + local receptive fields capture spatial patterns.
  * Parameter-efficient for images; features are translation-equivariant (a shifted input gives a shifted feature map).
* **Recurrent Neural Networks (RNNs)**

  * Process sequences step-by-step with a hidden state.
  * Struggle with long-range dependencies because gradients tend to vanish or explode over many time steps.
* **LSTM / GRU**

  * Gated RNNs that mitigate vanishing gradients.
  * Capture longer-range temporal dependencies than vanilla RNNs, though very long contexts remain hard.
* **Autoencoders (AEs)**

  * Encoder→latent→decoder for reconstruction/denoising.
  * Useful for compression, pretraining, and anomaly detection.
* **Generative Adversarial Networks (GANs)**

  * Generator vs discriminator in a minimax game.
  * Produce sharp samples; training can be unstable and prone to mode collapse.
* **Radial Basis Function Networks (RBFNs)**

  * Radial basis activations for smooth interpolation.
  * Practical in lower-dimensional settings.
* **Self-Organizing Maps (SOMs)**

  * Unsupervised, topology-preserving projections.
  * Handy for visualization and clustering.
* **Modular / Ensemble Networks**

  * Combine specialized subnets or multiple models.
  * Improve robustness and scalability via divide-and-conquer.
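
The parameter-efficiency point for CNNs can be made concrete with a little arithmetic. The sizes below are assumptions picked for illustration (a 32×32 RGB input mapped to a 32×32×16 output, 3×3 kernels with "same" padding), and the helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    # Every input connects to every output, plus one bias per output.
    return n_in * n_out + n_out

def conv2d_params(c_in, c_out, k):
    # One k x k kernel per (input channel, output channel) pair, shared
    # across all spatial positions, plus one bias per output channel.
    return c_in * c_out * k * k + c_out

h, w, c_in, c_out, k = 32, 32, 3, 16, 3

dense = dense_params(h * w * c_in, h * w * c_out)
conv = conv2d_params(c_in, c_out, k)

print(f"dense: {dense:,}  conv: {conv:,}")
# dense: 50,348,032  conv: 448
```

Weight sharing is what makes the convolution's count independent of the image size: doubling `h` and `w` changes `dense` but not `conv`.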

**3) Why this leads to Transformers for GenAI**

* Long-context language tasks demand **global dependencies** beyond what RNNs/CNNs handle easily.
* **Attention** learns token-to-token relevance directly: any two positions interact within a single layer, which is what enables long-range reasoning.
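* Transformer training is **highly parallelizable** across sequence positions, so it scales well to large datasets and models; autoregressive generation at inference time still proceeds one token at a time.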
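
To make the attention point concrete, here is a minimal sketch of single-head scaled dot-product self-attention in NumPy. It assumes random projection matrices, no masking, and no positional encodings, and the names (`self_attention`, `Wq`, `Wk`, `Wv`) are illustrative rather than taken from any library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over one sequence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # scores[i, j] = relevance of token j to token i, for all pairs at once.
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 6, 8, 4          # assumed toy sizes
X = rng.normal(size=(seq_len, d_model))  # stand-in for token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape, weights.shape)          # (6, 4) (6, 6)
```

The whole `seq_len × seq_len` score matrix comes from one matrix product, which is where the parallelism comes from; an RNN would need `seq_len` sequential steps to relate the first and last tokens.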

---

**WHAT I VERIFIED TODAY (quick, testable claims)**

- **V1 — Backprop mechanics on a tiny MLP**

  * Implemented forward $z=Wx+b$, activation $a=f(z)$, and loss $L$; then backward via chain rule.
  * A single SGD/Adam step **reduced $L$** as expected, confirming gradient flow and update logic.

- **V2 — Vanishing/exploding gradient intuition**

  * Measured gradient norms across deeper stacks: sigmoid layers shrank gradients more than ReLU layers did.
  * Observed more stable norms with ReLU/gated units, matching the known mitigation of vanishing gradients.

- **V3 — Architecture fit sanity check**

  * Mapped tasks to architectures: CNNs→vision; MLPs→tabular/embeddings; RNN/LSTM/GRU→sequences.
  * Noted limitations on long-range context, motivating attention/Transformers for GenAI workloads.
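
The V2 check can be reproduced roughly as follows. The original experiment code is not recorded in these notes, so this is a hypothetical reconstruction: the depth, width, He-style initialization, absence of biases, and the all-ones upstream gradient are assumptions, and gated units are not covered.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each entry: (activation, derivative of the activation w.r.t. z).
ACTIVATIONS = {
    "sigmoid": (sigmoid, lambda z: sigmoid(z) * (1.0 - sigmoid(z))),
    "relu": (lambda z: np.maximum(z, 0.0), lambda z: (z > 0).astype(z.dtype)),
}

def grad_norms_by_layer(name, depth=20, width=64, seed=0):
    """Norm of dL/dz at each layer; index 0 is the layer nearest the input."""
    f, f_prime = ACTIVATIONS[name]
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(scale=np.sqrt(2.0 / width), size=(width, width))
          for _ in range(depth)]

    # Forward pass, keeping the pre-activations for the backward pass.
    a = rng.normal(size=width)
    zs = []
    for W in Ws:
        z = W @ a
        zs.append(z)
        a = f(z)

    # Backward pass, starting from an all-ones gradient at the top.
    grad_a = np.ones(width)
    norms = []
    for W, z in zip(reversed(Ws), reversed(zs)):
        grad_z = grad_a * f_prime(z)     # chain rule through the activation
        norms.append(float(np.linalg.norm(grad_z)))
        grad_a = W.T @ grad_z            # chain rule through the linear map
    return norms[::-1]

for name in ACTIVATIONS:
    norms = grad_norms_by_layer(name)
    print(f"{name:8s} first layer: {norms[0]:.3e}   last layer: {norms[-1]:.3e}")
```

Since the sigmoid derivative never exceeds 0.25, each layer multiplies the gradient by a small factor, and the first-layer norm collapses with depth. The ReLU derivative is either 0 or 1, so with this initialization the norm stays at a comparable scale.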


# **The following image gives a brief overview of the types of neural networks:**
![image.png](attachment:image.png)