<a href="https://colab.research.google.com/github/danieleduardofajardof/DataSciencePrepMaterial/blob/main/8_EmergingAI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 8. Emerging AI

# Index


- [1. Transformer-Based Models](#1)
- [2. BERT](#2)
- [3. GPT](#3)
- [4. Variational Autoencoders](#4)
- [5. Visual Language Models](#5)
- [6. Stable Diffusion](#6)


# Emerging AI Technologies

---

## 1. Transformer-Based Models <a name="1"></a>

Transformers are a class of deep learning architectures introduced by Vaswani et al. in the landmark paper *"Attention is All You Need"* (2017). Unlike previous sequence models (RNNs, LSTMs), transformers do not require sequential processing — instead, they rely on attention mechanisms to model relationships between all elements of a sequence simultaneously, making them highly parallelizable and efficient for large-scale training.



### Key Concepts

- **Self-Attention**:
  - Computes attention weights between all positions in the input sequence.
  - Captures dependencies regardless of distance (long-range relationships).
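  - For each token, computes a weighted sum of the value vectors of all tokens (itself included), with weights derived from query-key similarity scores:

  $$
  \text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V
  $$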

- **Multi-Head Attention**:
  - Runs multiple self-attention operations (heads) in parallel.
  - Each head focuses on different types of relationships.
  - Outputs are concatenated and projected to combine diverse attention patterns.

  $$
  \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O
  $$

- **Positional Encoding**:
  - Since transformers lack recurrence or convolution, they add positional encodings to input embeddings.
  - Uses sinusoidal or learned vectors to represent token positions. The sinusoidal version, for position $pos$, dimension index $i$, and embedding size $d$, is:
  
  $$
  PE_{(pos, 2i)} = \sin \left( \frac{pos}{10000^{2i/d}} \right), \quad
  PE_{(pos, 2i+1)} = \cos \left( \frac{pos}{10000^{2i/d}} \right)
  $$
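
The sinusoidal formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `sinusoidal_positional_encoding` and the even-`d_model` restriction are assumptions made for this example.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    assert d_model % 2 == 0, "this sketch assumes an even embedding size"
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # the values 2i, shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
print(pe.shape)
```

The resulting matrix is added element-wise to the token embeddings, so it must have the same shape as the embedded sequence.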



### Architecture Components

- **Input Embedding**:
  - Converts tokens to dense vectors.
  - Positional encodings are added to embeddings to inject sequence order.

- **Encoder Stack** (repeated N times):
  - **Multi-Head Self-Attention Layer**
  - **Feed-Forward Network (FFN)**:
    $$
    \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
    $$
  - **Add & Layer Normalization** after each sub-layer (residual connection + normalization).

- **Decoder Stack** (repeated N times):
  - Masked self-attention (to prevent attending to future tokens).
  - Encoder-decoder attention (attends over encoder outputs).
  - Feed-forward network
  - Add & Layer Norm

- **Final Output**:
  - Decoder output goes through a linear projection and softmax to predict tokens.
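
To make the encoder sub-layers concrete, here is a rough single-head sketch of one encoder layer in NumPy. It is deliberately simplified: one attention head instead of several, layer normalization without learned scale and shift, and random weights. All names (`encoder_layer`, `params`, the weight keys) are hypothetical and chosen only for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance (learned scale/shift omitted).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

def encoder_layer(x, p):
    attn_out, _ = self_attention(x, p["Wq"], p["Wk"], p["Wv"])
    x = layer_norm(x + attn_out @ p["Wo"])   # Add & Norm after attention
    ffn_out = np.maximum(0, x @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]  # ReLU feed-forward
    return layer_norm(x + ffn_out)           # Add & Norm after the FFN

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 5, 8, 32
shapes = {"Wq": (d_model, d_model), "Wk": (d_model, d_model), "Wv": (d_model, d_model),
          "Wo": (d_model, d_model), "W1": (d_model, d_ff), "W2": (d_ff, d_model)}
params = {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}
params["b1"] = np.zeros(d_ff)
params["b2"] = np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))
y = encoder_layer(x, params)
_, attn_weights = self_attention(x, params["Wq"], params["Wk"], params["Wv"])
print(y.shape, attn_weights.shape)
```

The output keeps the input shape, which is what allows the layer to be stacked N times.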

### Advantages of Transformers

- **Scalability**: All positions in a sequence are processed in parallel during training, which makes efficient use of GPUs.
- **Long-Range Dependency Modeling**: Attention can relate distant tokens directly.
- **Modularity**: Easily extended (e.g., BERT, GPT, Vision Transformers).
- **Transfer Learning**: Pretrained transformers can be fine-tuned on various tasks.

### Common Applications

- Machine Translation (e.g., English-to-German and English-to-French in the original paper)
- Language Modeling (e.g., GPT series)
- Text Classification and Summarization
- Vision (e.g., Vision Transformers - ViT)
- Time Series Forecasting and Audio Processing


---

## 2. BERT (Bidirectional Encoder Representations from Transformers) <a name="2"></a>

BERT is a deep learning model introduced by Google in 2018. It is based solely on the **encoder** portion of the Transformer architecture and is pre-trained on large unlabeled text corpora with self-supervised objectives. BERT's key innovation is its **bidirectional self-attention**, which lets every token attend to context on both its left and its right.



###  Key Features

- **Bidirectional Attention**:
  - Considers both left and right context when encoding a word.
  - Unlike unidirectional models (e.g., GPT), BERT learns from the entire sequence at once.

- **Transfer Learning for NLP**:
  - Once pre-trained, BERT can be fine-tuned for downstream tasks with minimal changes.


###  Pretraining Tasks

- **Masked Language Modeling (MLM)**:
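  - Randomly selects 15% of the tokens for prediction; in the original recipe, 80% of these are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged.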
  - Model predicts the masked words using surrounding context.
  - Enables bidirectional encoding.

- **Next Sentence Prediction (NSP)**:
  - Learns sentence relationships by predicting whether one sentence follows another.
  - Later removed in some variants (e.g., RoBERTa).
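
The MLM masking step can be sketched as follows. This is a simplified illustration: it always substitutes `[MASK]`, whereas the original BERT recipe also sometimes inserts a random token or keeps the original one, and it does not protect special tokens. The helper name `mask_tokens` and the example sentence are made up for this sketch.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels hold the original token at masked positions, else None."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    masked_positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked, labels = [], []
    for i, tok in enumerate(tokens):
        if i in masked_positions:
            masked.append(MASK_TOKEN)
            labels.append(tok)   # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)  # this position is ignored by the loss
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog near the old river bank today".split()
masked, labels = mask_tokens(tokens)
print(masked)
```

Only the masked positions contribute to the training loss, which is why the labels are `None` everywhere else.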


###  Fine-Tuning Applications

- **Sentiment Analysis**
- **Question Answering** (e.g., SQuAD)
- **Named Entity Recognition (NER)**
- **Text Classification**
- **Natural Language Inference (NLI)**

Fine-tuning typically adds a task-specific head (e.g., classification layer) on top of the encoder stack.

###  Model Architecture

- **Input =** `[CLS] + Sentence A + [SEP] + Sentence B + [SEP]`
- Token, segment, and positional embeddings are summed as input.
- Stacked encoder layers (e.g., 12 in BERT-Base).
- Output of `[CLS]` used for classification tasks.
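
As an illustration of the input layout above, the following sketch packs two already-tokenized sentences and derives the segment and position ids that index the corresponding embedding tables. The function name `build_bert_input` is hypothetical, and real tokenizers additionally map tokens to vocabulary ids and handle padding and truncation.

```python
def build_bert_input(tokens_a, tokens_b):
    """Pack two tokenized sentences into BERT's pair format with segment and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, segment_ids, position_ids = build_bert_input(["how", "are", "you"], ["fine", "thanks"])
print(tokens)
print(segment_ids)
```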

### Modern BERT Variants

These models build on BERT to improve efficiency, performance, or scalability.

| Model         | Key Improvements                                  | Notes |
|---------------|----------------------------------------------------|-------|
| **RoBERTa**   | Trained longer, on more data, no NSP               | Better MLM performance |
| **DistilBERT**| 40% smaller, faster inference                      | 97% of BERT accuracy |
| **ALBERT**    | Parameter sharing + factorized embeddings          | Less memory use |
| **TinyBERT**  | Knowledge-distilled from BERT                      | Compact, edge devices |
| **MobileBERT**| Optimized for mobile hardware                      | Lightweight and fast |
| **ModernBERT**| Gated (GeGLU) feed-forward layers, rotary positional embeddings, alternating local/global attention, long context | Released in late 2024 |
| **BERT-of-Theseus** | Progressive module replacement: BERT modules are gradually swapped for smaller ones during training | Compression method |

###  ModernBERT: What's New?

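**ModernBERT** is used here as shorthand for **updated BERT-style architectures** that incorporate improvements from recent Transformer research (popularized by large decoder models such as PaLM and LLaMA). The released model of that name (Answer.AI and LightOn, late 2024) adopts several of these ideas, including gated activations and rotary embeddings: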

- **SwiGLU activation**:
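  - A gated activation function: `SwiGLU(x) = x₁ * SiLU(x₂)`, where `x₁` and `x₂` are two separate linear projections of the input.
  - The gate lets the layer modulate each feature, and it is widely reported to outperform plain GELU feed-forward layers.

- **RMSNorm instead of LayerNorm**:
  - Root-mean-square normalization rescales activations without subtracting the mean, which makes it cheaper than LayerNorm.

- **Rotary Positional Embeddings (RoPE)**:
  - Encode position by rotating query and key vectors inside attention; used in many recent models (e.g., LLaMA).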
  
- **No Dropout**:
  - Newer architectures often forgo dropout, relying on better regularization elsewhere.

- **Use Cases**:
  - Pretraining large foundation models (e.g., LLMs)
  - Enhancing downstream performance with updated inductive biases

These techniques are increasingly adopted in **modern language models** and are considered "best practices" for training Transformer encoders.
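
The SwiGLU and RMSNorm operations described above are small enough to write out directly. The sketch below uses NumPy with random weights; the names `W_gate`, `W_up`, and `gain` are assumptions for this example rather than names from any particular library.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))  # SiLU(x) = x * sigmoid(x)

def swiglu(x, W_gate, W_up):
    # Two linear projections of the same input; one passes through SiLU and gates the other.
    return silu(x @ W_gate) * (x @ W_up)

def rms_norm(x, gain, eps=1e-6):
    # Rescale by the root mean square of the features; unlike LayerNorm, the mean is not subtracted.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gain

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))
hidden = swiglu(x, rng.normal(size=(8, 16)), rng.normal(size=(8, 16)))
normed = rms_norm(x, gain=np.ones(8))
print(hidden.shape, normed.shape)
```

In a full feed-forward block, the SwiGLU output would be followed by a third projection back to the model dimension.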

###  Summary

BERT has revolutionized NLP by enabling deep bidirectional understanding of language. Over time, modern variants have improved performance, efficiency, and scalability:

- Use **DistilBERT** for speed, **ALBERT** for memory efficiency, and **RoBERTa** for performance.
- **ModernBERT** concepts are crucial in cutting-edge Transformer research and real-world production systems.

---


## 3. GPT (Generative Pre-trained Transformer) <a name="3"></a>

GPT is a family of large-scale transformer-based language models developed by **OpenAI**, designed for **generative tasks**. Unlike BERT, which uses a bidirectional encoder, GPT utilizes a **unidirectional decoder-only architecture** and predicts the next token in a sequence.


###  Training Approach

- **Causal Language Modeling (CLM)**:
  - Predict the next token based on previous tokens:
    $$
    P(x_t | x_1, x_2, ..., x_{t-1})
    $$
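  - Training uses teacher forcing with a causal mask: the model sees the true previous tokens and is optimized to predict each next token. At inference time it decodes **autoregressively**, feeding each generated token back in as input.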

- **Pretraining**:
  - Massive-scale datasets (books, websites, code).
  - Trained with self-supervision (the text itself provides the prediction targets) to learn general language patterns.

- **Fine-Tuning (Optional)**:
  - For specific tasks (e.g., summarization, translation, classification).
  - Also includes Reinforcement Learning from Human Feedback (RLHF) in newer versions (e.g., ChatGPT).
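
The causal language modeling objective amounts to a cross-entropy loss between the prediction at each position and the token that actually follows. Below is a minimal NumPy sketch; the function name `causal_lm_loss`, the toy vocabulary size, and the hand-built logits are assumptions for illustration only.

```python
import numpy as np

def causal_lm_loss(logits, token_ids):
    """Mean next-token cross-entropy. logits: (seq_len, vocab); token_ids: (seq_len,)."""
    pred_logits = logits[:-1]  # predictions made at positions 0 .. T-2
    targets = token_ids[1:]    # the tokens that actually came next
    shifted = pred_logits - pred_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

vocab_size = 10
token_ids = np.array([3, 1, 4, 1, 5])

# A model with no preference assigns probability 1/vocab_size to every token.
uniform_logits = np.zeros((len(token_ids), vocab_size))
uniform_loss = causal_lm_loss(uniform_logits, token_ids)

# A model that puts almost all mass on the correct next token has near-zero loss.
confident_logits = np.zeros((len(token_ids), vocab_size))
confident_logits[np.arange(len(token_ids) - 1), token_ids[1:]] = 20.0
confident_loss = causal_lm_loss(confident_logits, token_ids)
print(uniform_loss, confident_loss)
```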


###  Architecture

- **Decoder-Only Transformer Stack**:
  - Consists of:
    - Masked multi-head self-attention
    - Feed-forward layers with GELU/SwiGLU activations
    - LayerNorm or RMSNorm
    - Positional embeddings (or RoPE in modern variants)
  - No encoder block like in standard Transformer models.
  - Output is autoregressively generated one token at a time.
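
The masking in "masked multi-head self-attention" can be illustrated on a single head: scores for future positions are set to negative infinity before the softmax, so they receive zero weight. This is a sketch with random scores; the function name `causal_attention_weights` is made up for this example.

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask to raw (seq_len, seq_len) attention scores, then softmax each row."""
    seq_len = scores.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(future, -np.inf, scores)  # future positions get -inf
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)                          # exp(-inf) evaluates to 0
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
weights = causal_attention_weights(rng.normal(size=(4, 4)))
print(np.round(weights, 3))
```

The printed matrix is lower triangular: the first token can only attend to itself, the second to the first two, and so on.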


### Use Cases

- **Text Generation**: Stories, articles, poetry
- **Dialogue Systems**: ChatGPT, customer service bots
- **Code Completion**: Copilot, GPT-Engineer
- **Summarization and Translation**
- **Zero-shot and Few-shot Learning**: Prompts can guide the model to perform tasks without fine-tuning.

###  GPT Variants and Evolution

| Version | Parameters | Key Innovations |
|---------|------------|------------------|
| **GPT-1** | 117M | Proof of concept |
| **GPT-2** | 1.5B | Text generation, open-ended completion |
| **GPT-3** | 175B | Few-shot learning via prompting |
| **GPT-3.5** | Undisclosed | RLHF, ChatGPT foundation |
| **GPT-4** | Undisclosed | Multimodal input (text + images), advanced reasoning |
| **GPT-4-turbo** | Undisclosed | Faster, cheaper variant of GPT-4 |

###  Key Concepts

- **Prompt Engineering**: Using carefully crafted prompts to guide model behavior.
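- **Temperature**: Scales the logits before the softmax. Values near 0 approach greedy (deterministic) decoding, 1 samples from the model's unmodified distribution, and values above 1 increase randomness.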
- **Top-k / Top-p Sampling**: Strategies to control generation diversity.
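
A rough sketch of how temperature, top-k, and top-p interact when picking the next token is shown below. It assumes the model has already produced a vector of logits; the function name `sample_next_token` and the toy logits are hypothetical, and production decoders differ in details such as the order in which the filters are applied.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    if temperature == 0:
        return int(np.argmax(logits))      # greedy decoding
    probs = softmax(logits / temperature)  # lower temperature sharpens the distribution
    order = np.argsort(probs)[::-1]        # token ids sorted by descending probability
    keep = np.ones(len(probs), dtype=bool)
    if top_k is not None:
        keep[order[top_k:]] = False        # keep only the k most likely tokens
    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        keep[order[cutoff:]] = False
    probs = np.where(keep, probs, 0.0)
    probs = probs / probs.sum()            # renormalize over the surviving tokens
    return int(rng.choice(len(probs), p=probs))

logits = np.array([2.0, 1.0, 0.5, -1.0])
rng = np.random.default_rng(0)
print(sample_next_token(logits, temperature=0))
print(sample_next_token(logits, temperature=0.7, top_k=2, rng=rng))
```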

## Generative Autoencoders for Image Generation

Autoencoders are extended to **generative models** by adding probabilistic constraints and sampling capabilities. The main type used for this is the **Variational Autoencoder (VAE)**.

---



## 4. Variational Autoencoders (VAEs) <a name="4"></a>

VAEs are **generative autoencoders** that learn a **probabilistic latent space** to model the distribution of input data. Instead of encoding an input into a fixed latent vector, VAEs encode it into a **distribution**, typically Gaussian.


### Intuition

- In regular autoencoders, the encoder learns a deterministic latent representation.
- VAEs introduce **stochasticity** by encoding inputs into a distribution and sampling from it.
- This allows the model to generate **diverse and realistic new data** by sampling from the learned latent space.


### Architecture

1. **Encoder**:
   - Maps input $x$ to a latent distribution $q(z|x)$.
   - Assumes $q(z|x) = \mathcal{N}(\mu(x), \sigma^2(x))$.
   - Output: Two vectors, $\mu$ and $\log \sigma^2$ (log-variance for numerical stability).

2. **Reparameterization Trick**:
   - To enable backpropagation through stochastic nodes, we reparameterize:
     $$
     z = \mu + \sigma \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)
     $$
   - This allows gradients to flow through $\mu$ and $\sigma$.

3. **Decoder**:
   - Receives sampled $z$ and reconstructs input $\hat{x}$.
   - Learns the conditional distribution $p(x|z)$.
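
The reparameterization step can be checked numerically: sampling through `mu + sigma * eps` should reproduce the mean and standard deviation that the encoder outputs. The sketch below uses hand-picked values in place of a trained encoder; the function name `reparameterize` and the numbers are assumptions for illustration.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    sigma = np.exp(0.5 * log_var)        # convert log-variance to standard deviation
    eps = rng.standard_normal(mu.shape)  # all randomness lives in eps
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
log_var = np.log(np.array([0.25, 4.0]))  # variances 0.25 and 4.0, so sigma is 0.5 and 2.0
samples = np.stack([reparameterize(mu, log_var, rng) for _ in range(20000)])
print(samples.mean(axis=0), samples.std(axis=0))
```

Because `eps` does not depend on the network parameters, an automatic differentiation framework can propagate gradients through `mu` and `log_var`.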


### Loss Function: ELBO

The objective is to **maximize the evidence lower bound (ELBO)**:

$$
\mathcal{L}_{VAE}(x) = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z))
$$

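With a Gaussian decoder of fixed variance, the reconstruction term reduces to a squared error (up to constants), so we **minimize the negative ELBO**, which becomes the loss: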

$$
\mathcal{L}(x, \hat{x}, z) = \underbrace{\| x - \hat{x} \|^2}_{\text{Reconstruction Loss}} + \underbrace{D_{KL} \left( \mathcal{N}(\mu(x), \sigma^2(x)) \;||\; \mathcal{N}(0, I) \right)}_{\text{Regularization (KL Divergence)}}
$$

- **Reconstruction Loss**: Ensures the decoded sample $\hat{x}$ is close to the input $x$.
- **KL Divergence**: Regularizes the latent space to approximate the prior $p(z) = \mathcal{N}(0, I)$, encouraging generalization and smooth interpolation.
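
For a diagonal Gaussian encoder and a standard normal prior, the KL term has the closed form $-\tfrac{1}{2} \sum_j (1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2)$. The sketch below computes both loss terms for a single example in NumPy; the function name `vae_loss` and the toy inputs are assumptions, and the reconstruction term uses the squared error from the loss above.

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var):
    """Negative ELBO for one example: squared-error reconstruction plus closed-form KL."""
    reconstruction = np.sum((x - x_hat) ** 2)
    # KL( N(mu, sigma^2) || N(0, I) ) = -0.5 * sum(1 + log sigma^2 - mu^2 - sigma^2)
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    return reconstruction + kl, reconstruction, kl

x = np.array([0.0, 1.0, 0.5])
x_hat = np.array([0.1, 0.9, 0.5])

# The KL term vanishes when the encoder output equals the prior.
_, _, kl_at_prior = vae_loss(x, x_hat, mu=np.zeros(2), log_var=np.zeros(2))

# Moving the mean away from zero is penalized.
total, recon, kl_shifted = vae_loss(x, x_hat, mu=np.array([1.0, 0.0]), log_var=np.zeros(2))
print(kl_at_prior, kl_shifted, total)
```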



###  Why Use the KL Term?

- Without it, the encoder might learn a "memorization" strategy.
- The KL divergence term prevents overfitting and ensures **continuity** and **completeness** of the latent space.
- Enables sampling from the prior $p(z)$ to generate new, realistic outputs.



###  Summary of VAE Training

| Step | Action |
|------|--------|
| 1 | Encode input $x$ into $\mu(x)$ and $\log \sigma^2(x)$ |
| 2 | Sample latent vector $z$ using reparameterization |
| 3 | Decode $z$ to reconstruct $\hat{x}$ |
| 4 | Compute reconstruction loss + KL divergence |
| 5 | Backpropagate and update weights |



###  Visualizing Latent Space

- Latent space often shows **clusters of similar data** (e.g., digits in MNIST).
- Smooth interpolations between points produce **meaningful transitions** in the output space.
- Great for applications like:
  - **Face morphing**
  - **Style transfer**
  - **Data compression**
  - **Semi-supervised learning**
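
Latent interpolation itself is a one-liner; what makes the transitions meaningful is the trained decoder applied afterwards. The sketch below only builds the path between two latent vectors. The function name `interpolate_latents` and the example vectors are hypothetical, and decoding each point would require a trained VAE decoder that is not shown here.

```python
import numpy as np

def interpolate_latents(z_start, z_end, num_steps):
    """Return num_steps latent vectors evenly spaced on the line from z_start to z_end."""
    alphas = np.linspace(0.0, 1.0, num_steps)[:, None]
    return (1.0 - alphas) * z_start + alphas * z_end

z_a = np.array([0.0, 2.0])   # e.g., the encoded mean of one input
z_b = np.array([4.0, -2.0])  # e.g., the encoded mean of another input
path = interpolate_latents(z_a, z_b, num_steps=5)
print(path)
```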

---



## 5. Visual Language Models (VLMs) <a name="5"></a>

Visual Language Models (VLMs) are AI models designed to understand and generate both visual and textual data. They enable powerful cross-modal tasks such as image captioning, visual question answering (VQA), and image-text retrieval.

---

### Core Concept: Multimodal Learning

VLMs jointly process and learn from two data modalities:

- Visual data: images or videos  
- Text data: captions, descriptions, questions, instructions

They learn to align visual and textual information within a shared representation space, allowing for meaningful reasoning across modalities.

---

## Components of a Visual Language Model

### 1. Visual Encoder

Transforms visual input into feature representations.

- Earlier models: CNNs (e.g., ResNet)  
- Recent models: Vision Transformers (ViT)  
- Output: A sequence of embeddings representing image patches

### 2. Text Encoder / Language Model

Processes text input and outputs language embeddings.

- Examples: BERT, RoBERTa, GPT-2, T5  
- Can be pretrained and either frozen or fine-tuned

### 3. Fusion Module

Combines vision and language embeddings.

- Cross-attention layers (as in Transformers)  
- Simple concatenation  
- Contrastive learning objectives for alignment (e.g., CLIP)

---

## Key Visual Language Models

### CLIP (Contrastive Language–Image Pretraining, OpenAI)

- Trains on image-text pairs with a contrastive loss  
- Visual encoder: ResNet or ViT  
- Text encoder: Transformer  
- Learns alignment between images and text  
- Applications: zero-shot image classification, retrieval
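
The contrastive objective can be sketched as a symmetric cross-entropy over a batch similarity matrix, where row `i` of the image embeddings and row `i` of the text embeddings form a matching pair. This is a NumPy illustration under simplifying assumptions: the temperature is fixed at 0.07 (CLIP learns it during training), and the embeddings are toy one-hot vectors rather than encoder outputs.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities; row i of each input is a matching pair."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (batch, batch) similarity matrix
    diag = np.arange(len(logits))
    loss_image_to_text = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_image_to_text + loss_text_to_image)

emb = np.eye(4)  # four toy embeddings that are mutually orthogonal
aligned_loss = clip_contrastive_loss(emb, emb)
mismatched_loss = clip_contrastive_loss(emb, np.roll(emb, 1, axis=0))
print(aligned_loss, mismatched_loss)
```

The loss is near zero when each image is most similar to its own caption and large when the pairs are shuffled.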

### BLIP (Bootstrapping Language-Image Pre-training, Salesforce)

- Combines captioning and contrastive pretraining  
- Architecture: Vision Transformer + BERT-like language model  
- Supports tasks like captioning, VQA, image-text retrieval

### Flamingo (DeepMind)

- Few-shot model that uses a frozen language model (e.g., Chinchilla)  
- Processes sequences of alternating images and text  
- Effective for multimodal instruction-following and dialogue

### GIT (Generative Image-to-Text, Microsoft)

- Unified transformer for captioning, VQA, and more  
- Autoregressively generates text from visual input  
- Entirely transformer-based

### LLaVA (Large Language and Vision Assistant)

- Combines a pretrained language model (e.g., LLaMA) with a vision encoder  
- Enables instruction-tuned multimodal interaction  
- Useful for open-ended dialogue with visual grounding

---

## Simplified Architecture Flow


*Figure placeholder: VLM architecture diagram (`vlm-architecture-diagram.svg`).*

## Training diagram

*Figure placeholder: VLM training process diagram (`vlm-training-process-diagram.svg`).*

---

## 6. Stable Diffusion <a name="6"></a>

Stable Diffusion is a text-to-image generative model first released in 2022 by CompVis (LMU Munich), Runway, and Stability AI, with later versions developed by Stability AI. It uses latent diffusion to generate high-quality images from natural language descriptions.



### Overview

Stable Diffusion belongs to a class of models called Latent Diffusion Models (LDMs). Instead of operating on raw pixel data, it works in a compressed latent space, which makes it more computationally efficient.


### Key Components

#### 1. Variational Autoencoder (VAE)
- Encodes input images into a lower-dimensional latent space.
- Decodes latent vectors back into image space.

#### 2. U-Net
- A neural network trained to remove noise from latent representations.
- Uses attention mechanisms to capture image structure and detail.

#### 3. Text Encoder (CLIP or T5)
- Encodes input text prompts into dense vector embeddings.
- These embeddings guide the image generation process.

#### 4. Latent Diffusion
- The diffusion process is applied to the latent space instead of pixel space.
- This makes the training and generation process significantly faster.


### Diffusion Process

1. Forward Process: Adds noise to a latent vector over several steps.
2. Reverse Process: A U-Net learns to progressively denoise the latent vector.
3. Conditioning: The process is guided using the text embeddings from the encoder.
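
The forward process has a convenient closed form: a noisy latent at any step `t` can be sampled directly from the clean latent. The sketch below assumes a DDPM-style linear noise schedule, which is an illustrative choice and not necessarily the schedule used by Stable Diffusion. The function names and the small latent shape are also hypothetical.

```python
import numpy as np

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    betas = np.linspace(beta_start, beta_end, num_steps)  # per-step noise variances
    return np.cumprod(1.0 - betas)                        # fraction of signal variance kept up to step t

def add_noise(z0, t, alpha_bars, rng):
    """Sample z_t directly from z_0: z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return z_t, eps

rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars()
z0 = rng.standard_normal((4, 8, 8))  # stand-in for a clean latent from the VAE encoder
z_early, _ = add_noise(z0, 10, alpha_bars, rng)
z_late, _ = add_noise(z0, 999, alpha_bars, rng)
print(alpha_bars[10], alpha_bars[999])
```

Early steps leave the latent almost intact, while the final steps are close to pure Gaussian noise.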


### Training Objective

The model is trained to predict the noise added at each diffusion step.

Loss:

$$
L = \mathbb{E} \left[ \| \epsilon - \epsilon_\theta(z_t, t, c) \|^2 \right]
$$

Where:
- $z_t$: noisy latent vector at timestep $t$
- $\epsilon$: actual noise
- $\epsilon_\theta$: noise predicted by the U-Net
- $c$: text condition (embedding)
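
One training step of this objective can be sketched as follows: draw a timestep and a noise sample, build the noisy latent, and compare the model's prediction with the true noise. The noise schedule, the `eps_model(z_t, t, cond)` call signature, and the zero-output stand-in model are all assumptions for this example; a real setup would use a trained, text-conditioned U-Net. The squared norm is averaged over elements here, which differs from the formula only by a constant factor.

```python
import numpy as np

def diffusion_loss(eps_model, z0, cond, alpha_bars, rng):
    """One Monte Carlo sample of the noise-prediction loss, averaged over latent elements."""
    t = int(rng.integers(len(alpha_bars)))  # random timestep
    eps = rng.standard_normal(z0.shape)     # the noise the model must recover
    z_t = np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    eps_pred = eps_model(z_t, t, cond)
    return np.mean((eps - eps_pred) ** 2)

def zero_model(z_t, t, cond):
    # Hypothetical untrained stand-in that always predicts "no noise".
    return np.zeros_like(z_t)

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed linear schedule
z0 = rng.standard_normal((4, 16, 16))
cond = np.zeros(8)  # placeholder for a text embedding
untrained_loss = diffusion_loss(zero_model, z0, cond, alpha_bars, rng)
print(untrained_loss)
```

A model that always predicts zero scores a loss near 1, the variance of the noise; training drives the loss below that baseline.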

---

### Generation Process

1. Encode text prompt into embedding.
2. Sample random noise in latent space.
3. Iteratively denoise using the U-Net conditioned on text.
4. Decode the final latent vector into an image using the VAE decoder.
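
Steps 2 and 3 can be sketched with a DDPM-style ancestral sampler. This is an illustrative loop under several assumptions: a short linear noise schedule, a zero-output stand-in for the trained U-Net, and a placeholder text embedding. Practical Stable Diffusion pipelines typically use faster samplers and classifier-free guidance, and the text encoder and VAE decoder (steps 1 and 4) are not shown.

```python
import numpy as np

def generate_latent(eps_model, cond, shape, betas, rng):
    """DDPM-style ancestral sampling in latent space."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    z = rng.standard_normal(shape)         # step 2: start from pure noise
    for t in reversed(range(len(betas))):  # step 3: iterative denoising
        eps_pred = eps_model(z, t, cond)
        mean = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0  # no noise on the final step
        z = mean + np.sqrt(betas[t]) * noise
    return z

def zero_eps_model(z, t, cond):
    # Hypothetical stand-in for a trained, text-conditioned U-Net.
    return np.zeros_like(z)

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 50)
text_embedding = np.zeros(8)  # placeholder for the text encoder output (step 1)
latent = generate_latent(zero_eps_model, text_embedding, (4, 8, 8), betas, rng)
print(latent.shape)
```

With a trained noise predictor in place of `zero_eps_model`, the returned latent would be passed to the VAE decoder to obtain the final image.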

---

### Advantages

- Operates in latent space for speed and efficiency.
- Highly controllable with natural language input.
- Open-source and easy to fine-tune.
- Supports extensions like DreamBooth and ControlNet.

---

### Applications

- Text-to-image generation
- Image inpainting and outpainting
- Visual storytelling
- Concept art and design
- Data augmentation