# Addressing the Scale of Large Language Models: A Deep Dive into Quantization

**Author:** Youssef Khaled Ismail  
**Course:** Cellula Technologies  
**Task:** Task 0 - Research Part

---

## 1. Introduction

Large Language Models (LLMs) like **BERT** and **LLaMA** have revolutionized Natural Language Processing. However, their massive size presents significant challenges:

### Key Challenges:

**Memory Constraints:**
- A model like LLaMA-70B requires ~140GB of VRAM just to load in FP16
- Most consumer GPUs have 8-24GB VRAM
- A model of this size cannot be loaded on a single consumer GPU without compression or offloading

**Inference Latency:**
- Large models require more memory bandwidth
- Slows down token generation
- Poor user experience in real-time applications

**Deployment Costs:**
- High-end GPUs (A100/H100) are expensive (\$10,000-\$30,000)
- Often unavailable due to high demand
- Cloud costs can be prohibitive

### The Solution: Quantization

**Quantization** is a key technique for addressing these issues. It works by:
- Reducing the numerical precision of the model's weights and, optionally, its activations
- Shrinking the model's memory footprint accordingly
- Keeping the impact on model accuracy small

---

## 2. Theoretical Foundation of Quantization

Quantization maps high-precision values (typically **FP32** or **FP16**) to lower-precision discrete values (e.g., **INT8**, **INT4**).

### 2.1 Linear Quantization

The most common form is **linear quantization**, which can be expressed as:

$$Q(x) = \text{round}\left(\frac{x}{S} + Z\right)$$

Where:
- $x$: The original floating-point value
- $S$: The **Scale** factor (a positive floating-point number)
- $Z$: The **Zero-point** (an integer ensuring that the floating-point zero maps exactly to a quantized value)

### Dequantization Formula:

To recover the approximate floating-point value:

$$\hat{x} = S \cdot (Q(x) - Z)$$

### Mathematical Example:

Let's say we have:
- Original value: $x = 0.75$
- Scale: $S = 0.1$
- Zero-point: $Z = 0$

Then:
$$Q(0.75) = \text{round}\left(\frac{0.75}{0.1} + 0\right) = \text{round}(7.5) = 8$$

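Dequantization:
$$\hat{x} = 0.1 \cdot (8 - 0) = 0.8$$

The recovered value differs from the original by $|0.75 - 0.8| = 0.05 = S/2$, which is the largest rounding error possible for this scale.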
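
The worked example can be checked numerically. The following is a minimal sketch; the helper names `quantize` and `dequantize` are illustrative and do not come from any library.

```python
import numpy as np

def quantize(x, scale, zero_point):
    """Linear quantization: Q(x) = round(x / S + Z)."""
    return int(np.round(x / scale + zero_point))

def dequantize(q, scale, zero_point):
    """Dequantization: x_hat = S * (Q(x) - Z)."""
    return scale * (q - zero_point)

x, S, Z = 0.75, 0.1, 0
q = quantize(x, S, Z)
x_hat = dequantize(q, S, Z)
error = abs(x - x_hat)

print(f"Q(x) = {q}, x_hat = {x_hat:.2f}, |x - x_hat| = {error:.2f}")
```

The rounding error of any value inside the representable range is bounded by $S/2$, so a smaller scale (more bits, or a narrower range) gives a finer reconstruction.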

---

### 2.2 Symmetric vs. Asymmetric Quantization

| Feature | Symmetric | Asymmetric |
|---------|-----------|------------|
| **Zero-point ($Z$)** | Always 0 | Non-zero integer |
| **Range** | $[-r, r]$ | $[\min, \max]$ |
| **Efficiency** | Faster (simpler math) | Better utilization of bit-range |
| **Use Case** | Weights (often centered around 0) | Activations (arbitrary range) |
| **Formula** | $Q(x) = \text{round}(x/S)$ | $Q(x) = \text{round}(x/S + Z)$ |

### Visual Comparison:

**Symmetric Quantization:**
```
FP32: [-1.0, -0.5, 0.0, 0.5, 1.0]
         ↓     ↓    ↓    ↓    ↓
INT8: [-127, -64,   0,  64, 127]
```

**Asymmetric Quantization:**
```
FP32:  [0.0, 0.25, 0.5, 0.75, 1.0]
         ↓     ↓    ↓     ↓    ↓
UINT8: [0,   64,  128,  191, 255]
```
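
To see where such integer codes come from, the sketch below computes them for the two example inputs. It is an illustration only: the function names are hypothetical, the symmetric range is assumed to be $[-127, 127]$, and the asymmetric range is assumed to be unsigned $[0, 255]$.

```python
import numpy as np

def symmetric_map(x, bits=8):
    """Symmetric mapping: zero-point is 0, range is [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1             # 127 for 8 bits
    inv_scale = qmax / np.abs(x).max()     # 1 / S
    return np.round(x * inv_scale).astype(np.int32)

def asymmetric_map(x, bits=8):
    """Asymmetric mapping: [min, max] is stretched over [0, 2^bits - 1]."""
    qmin, qmax = 0, 2 ** bits - 1          # 0..255 for 8 bits
    inv_scale = (qmax - qmin) / (x.max() - x.min())
    zero_point = qmin - np.round(x.min() * inv_scale)
    q = np.round(x * inv_scale + zero_point)
    return np.clip(q, qmin, qmax).astype(np.int32)

sym = symmetric_map(np.array([-1.0, -0.5, 0.0, 0.5, 1.0]))
asym = asymmetric_map(np.array([0.0, 0.25, 0.5, 0.75, 1.0]))

print("Symmetric: ", sym.tolist())
print("Asymmetric:", asym.tolist())
```

Values that fall exactly halfway between two codes (such as $0.5 \cdot 127 = 63.5$) depend on the rounding rule; NumPy rounds halves to the nearest even integer.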

---

## 3. Advanced Quantization Techniques for LLMs

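Standard INT8 quantization often causes significant accuracy drops in LLMs because of **"outlier features"**: a few dimensions with very large magnitudes stretch the quantization range and leave little resolution for all other values. Modern techniques reduce this accuracy loss in different ways.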
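
The effect of an outlier on the quantization range can be illustrated with synthetic data. This is a minimal sketch: the values are random numbers rather than real model activations, and the outlier magnitude of 60 is a hypothetical choice.

```python
import numpy as np

def int8_symmetric_mae(x):
    """Mean absolute error of symmetric per-tensor INT8 round-to-nearest."""
    scale = np.abs(x).max() / 127
    q = np.clip(np.round(x / scale), -127, 127)
    return float(np.mean(np.abs(x - scale * q)))

rng = np.random.default_rng(0)
values = rng.normal(0.0, 0.5, size=1000)   # synthetic "typical" values

with_outlier = values.copy()
with_outlier[0] = 60.0                     # one hypothetical outlier

mae_clean = int8_symmetric_mae(values)
mae_outlier = int8_symmetric_mae(with_outlier)

print(f"MAE without outlier: {mae_clean:.5f}")
print(f"MAE with outlier:    {mae_outlier:.5f}")
```

A single large value inflates the scale $S$, so most of the remaining values collapse onto just a few integer codes.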

### 3.1 GPTQ (Post-Training Quantization)

**Key Innovation:**
- Uses second-order information (Hessian matrix) to quantize weights layer-by-layer
- Minimizes the reconstruction error
- Builds on Optimal Brain Quantization (OBQ), made efficient enough for models with billions of parameters

**Mathematical Foundation:**

Minimize the reconstruction error:

$$\min_{\mathbf{W}_q} \|\mathbf{W} \mathbf{X} - \mathbf{W}_q \mathbf{X}\|_2^2$$

Where:
- $\mathbf{W}$: Original weights
- $\mathbf{W}_q$: Quantized weights
- $\mathbf{X}$: Layer inputs collected from a small calibration dataset
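
The objective itself is easy to evaluate. The sketch below measures it for plain round-to-nearest quantization, which is the baseline GPTQ improves on; it does not implement GPTQ's Hessian-based weight updates. The layer shape, the random data, and the function names are assumptions made for illustration.

```python
import numpy as np

def round_to_nearest(W, bits=4):
    """Symmetric per-row round-to-nearest; returns dequantized weights."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return scale * np.clip(np.round(W / scale), -qmax, qmax)

def reconstruction_error(W, W_q, X):
    """Squared error between the original and quantized layer outputs."""
    return float(np.sum((W @ X - W_q @ X) ** 2))

rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, size=(16, 64))    # hypothetical layer: 16 outputs, 64 inputs
X = rng.normal(0, 1.0, size=(64, 128))   # hypothetical calibration inputs (128 samples)

err_4bit = reconstruction_error(W, round_to_nearest(W, bits=4), X)
err_8bit = reconstruction_error(W, round_to_nearest(W, bits=8), X)

print(f"4-bit reconstruction error: {err_4bit:.4f}")
print(f"8-bit reconstruction error: {err_8bit:.6f}")
```

GPTQ aims to lower this error at a fixed bit width by adjusting the not-yet-quantized weights to compensate for the error introduced by each quantized one.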

**Advantages:**
- ✅ High accuracy (1-2% loss even at 4-bit)
- ✅ Works with pre-trained models (no retraining)
- ✅ Fast quantization process

---

### 3.2 AWQ (Activation-aware Weight Quantization)

**Key Innovation:**
- Protects important weights by observing activation magnitudes
- Scales weights instead of just rounding them
- Identifies "salient" weights that are critical for model performance

**Algorithm:**

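1. Compute per-input-channel activation statistics from calibration data, e.g. the average magnitude $\bar{a}_i = \text{mean}(|\mathbf{X}_i|)$, and derive a scale $s_i = \bar{a}_i^{\alpha}$ with $\alpha \in [0, 1]$ chosen by a small search
2. Scale weights: $\mathbf{W}'_i = s_i \cdot \mathbf{W}_i$
3. Quantize the scaled weights
4. Divide the matching activations by $s_i$ at inference (in practice folded into the preceding operation), so the layer output is unchanged apart from quantization error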
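
The steps above can be sketched on a toy layer. This is a simplified illustration under stated assumptions, not the AWQ implementation: it uses per-row 4-bit round-to-nearest instead of group-wise quantization, synthetic data with four hypothetical high-magnitude input channels, and a small grid of exponents `alpha` applied to the mean activation magnitude.

```python
import numpy as np

def quantize_rows(W, bits=4):
    """Symmetric per-row round-to-nearest; returns dequantized weights."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return scale * np.clip(np.round(W / scale), -qmax, qmax)

rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, size=(32, 64))    # (out_features, in_features)
X = rng.normal(0, 1.0, size=(64, 256))   # (in_features, calibration samples)
X[:4] *= 20.0                            # hypothetical high-magnitude channels
Y = W @ X                                # full-precision reference output

mean_abs = np.mean(np.abs(X), axis=1)    # step 1: per-channel statistic
errors = {}
for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    s = mean_abs ** alpha                # alpha = 0 means no scaling
    W_q = quantize_rows(W * s)           # steps 2-3: scale column i by s[i], quantize
    Y_q = W_q @ (X / s[:, None])         # step 4: inverse scale on activations
    errors[alpha] = float(np.mean((Y - Y_q) ** 2))

best_alpha = min(errors, key=errors.get)
for alpha, err in errors.items():
    print(f"alpha = {alpha:.2f}: output MSE = {err:.6f}")
print(f"Best alpha on this data: {best_alpha}")
```

Before quantization the scaling is an exact identity, since `(W * s) @ (X / s[:, None])` equals `W @ X`; only the rounding error changes. Which `alpha` works best depends on the data, which is why it is searched rather than fixed.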

**Advantages:**
- ✅ Better preserves model accuracy
- ✅ Especially effective for 3-bit and 4-bit quantization
- ✅ Minimal overhead

---

### 3.3 bitsandbytes (NF4)

**Key Innovation:**
- Introduced with QLoRA
- Uses a **NormalFloat 4-bit** data type
- Optimal for normally distributed weights

**NF4 Data Type:**

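Instead of uniformly spaced levels, NF4 uses 16 levels derived from the quantiles of a standard normal distribution and normalized to $[-1, 1]$. The levels are not symmetric around zero, because zero must be represented exactly (values rounded to four decimals):

```python
NF4_BINS = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848,
            -0.0911, 0.0, 0.0796, 0.1609, 0.2461, 0.3379,
            0.4407, 0.5626, 0.7230, 1.0]
```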

**Mathematical Foundation:**

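For a block of weights assumed to follow $\mathcal{N}(0, \sigma^2)$, each weight is first normalized by the block's absolute maximum $c = \max_j |w_j|$ and then mapped to the nearest NF4 level:

$$\text{NF4}(w) = \arg\min_{q \in \text{NF4\_BINS}} \left|\frac{w}{c} - q\right|$$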
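
A minimal NumPy sketch of this lookup is shown below. It assumes the 16 NF4 levels published with QLoRA (rounded to four decimals), a block size of 64, and absmax normalization per block. The function names are illustrative; the actual `bitsandbytes` kernels also pack two 4-bit codes per byte, which is omitted here.

```python
import numpy as np

# 16 NF4 levels (rounded); assumed from the QLoRA paper.
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def nf4_quantize(block):
    """Return 4-bit level indices and the absmax constant for one block."""
    absmax = np.abs(block).max()
    normalized = block / absmax                                  # now in [-1, 1]
    distances = np.abs(normalized[:, None] - NF4_LEVELS[None, :])
    return distances.argmin(axis=1).astype(np.uint8), absmax

def nf4_dequantize(idx, absmax):
    """Look up each level and undo the absmax normalization."""
    return NF4_LEVELS[idx] * absmax

rng = np.random.default_rng(0)
block = rng.normal(0, 0.02, size=64)      # one hypothetical block of weights

idx, absmax = nf4_quantize(block)
recon = nf4_dequantize(idx, absmax)

print(f"Indices used: {np.unique(idx).size} of {NF4_LEVELS.size}")
print(f"MAE: {np.mean(np.abs(block - recon)):.6f}")
```

Each block stores 64 four-bit indices plus one floating-point constant, which is where the small per-parameter overhead above 4 bits comes from.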

**Advantages:**
- ✅ Minimal accuracy loss (<1%)
- ✅ Enables fine-tuning with QLoRA
- ✅ 4x memory reduction relative to FP16

---

## 4. Coding Examples

Let's implement quantization techniques with practical code examples.

### 4.1 Installing Required Libraries

```python
# Install required packages (notebook cell; drop the leading "!" in a regular shell)
!pip install -q transformers accelerate bitsandbytes torch numpy matplotlib
```

### 4.2 Import Libraries

```python
import torch
import numpy as np
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
```

### 4.3 Manual Quantization Implementation

Let's implement quantization from scratch to understand the fundamentals:

```python
def quantize_asymmetric(x, bits=8):
    """
    Asymmetric quantization

    Args:
        x: Input tensor (numpy array)
        bits: Number of bits for quantization

    Returns:
        Dequantized values, quantized integers, scale, zero_point
    """
    # Define quantization range (unsigned)
    qmin = 0
    qmax = 2**bits - 1

    # Calculate scale and zero-point
    x_min = x.min()
    x_max = x.max()
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = qmin - np.round(x_min / scale)

    # Quantize
    q_x = np.round(x / scale + zero_point)
    q_x = np.clip(q_x, qmin, qmax).astype(np.int32)

    # Dequantize
    dq_x = scale * (q_x - zero_point)

    return dq_x, q_x, scale, zero_point


def quantize_symmetric(x, bits=8):
    """
    Symmetric quantization (zero-point = 0)

    Args:
        x: Input tensor (numpy array)
        bits: Number of bits for quantization

    Returns:
        Dequantized values, quantized integers, scale
    """
    # Define quantization range (signed, symmetric around zero)
    qmax = 2**(bits - 1) - 1
    qmin = -qmax

    # Calculate scale
    abs_max = np.abs(x).max()
    scale = abs_max / qmax

    # Quantize
    q_x = np.round(x / scale)
    q_x = np.clip(q_x, qmin, qmax).astype(np.int32)

    # Dequantize
    dq_x = scale * q_x

    return dq_x, q_x, scale


# Test the quantization functions
np.random.seed(42)
test_data = np.random.randn(1000) * 0.5  # Simulated weights

# 8-bit quantization
dq_8bit, q_8bit, scale_8, zp_8 = quantize_asymmetric(test_data, bits=8)
error_8bit = np.mean(np.abs(test_data - dq_8bit))

# 4-bit quantization
dq_4bit, q_4bit, scale_4, zp_4 = quantize_asymmetric(test_data, bits=4)
error_4bit = np.mean(np.abs(test_data - dq_4bit))

print(f"8-bit Quantization Error (MAE): {error_8bit:.6f}")
print(f"4-bit Quantization Error (MAE): {error_4bit:.6f}")
# Compression ratios are relative to FP32 storage
print(f"\nCompression ratio vs FP32 (8-bit): {32 // 8}x")
print(f"Compression ratio vs FP32 (4-bit): {32 // 4}x")
```

### 4.4 4-bit Quantization with bitsandbytes

Now let's use the `bitsandbytes` library, through its Hugging Face Transformers integration, for NF4 quantization:

```python
# Model configuration
model_id = "facebook/opt-125m"  # Small model for demonstration

# Configure 4-bit quantization with NF4
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # Enable 4-bit loading
    bnb_4bit_quant_type="nf4",            # Use NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.float16, # Compute in FP16 for speed
    bnb_4bit_use_double_quant=True,       # Double quantization (quantize the quantization constants)
)

print("Loading model with 4-bit NF4 quantization...")
print("This may take a few moments...\n")

# Load quantized model
model_quantized = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # Automatically distribute layers
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

print("✅ Model loaded successfully!")
print(f"\nModel memory footprint: {model_quantized.get_memory_footprint() / 1e9:.2f} GB")
```
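
The `bnb_4bit_use_double_quant` option matters because every block of weights carries its own scaling constant. The sketch below estimates the effective bits per parameter, assuming the settings described in the QLoRA paper: blocks of 64 weights with a 32-bit constant each, and, with double quantization, 8-bit constants that are themselves grouped in blocks of 256 with one 32-bit constant per group. The function name and defaults are illustrative.

```python
def bits_per_param(weight_bits=4, block_size=64, const_bits=32,
                   double_quant=False, dq_const_bits=8, dq_block_size=256):
    """Effective storage per parameter, including quantization constants."""
    if not double_quant:
        # One full-precision constant shared by each block of weights
        return weight_bits + const_bits / block_size
    # 8-bit first-level constants plus a second-level 32-bit constant
    return (weight_bits
            + dq_const_bits / block_size
            + const_bits / (block_size * dq_block_size))

plain = bits_per_param()
double = bits_per_param(double_quant=True)

print(f"NF4 without double quantization: {plain:.3f} bits/param")
print(f"NF4 with double quantization:    {double:.3f} bits/param")
print(f"Saving for a 7B model: {7e9 * (plain - double) / 8 / 1e9:.2f} GB")
```

Under these assumptions, double quantization trims the overhead from 0.5 to roughly 0.127 bits per parameter.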

### 4.5 Testing the Quantized Model

Let's test if the quantized model still works properly:

```python
# Test prompt
prompt = "Quantization is a technique that"

# Tokenize
inputs = tokenizer(prompt, return_tensors="pt").to(model_quantized.device)

# Generate
print(f"Prompt: {prompt}")
print("\nGenerating...\n")

outputs = model_quantized.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated: {generated_text}")
# Successful generation is only a smoke test, not an accuracy evaluation
print("\n✅ Quantized model generated text successfully!")
```

## 5. Visualizing the Impact of Quantization

### 5.1 Weight Distribution Before and After Quantization

Let's visualize how quantization affects weight distributions:

```python
# Generate sample weights (simulating a neural network layer)
np.random.seed(42)
original_weights = np.random.randn(10000) * 0.15  # Mean=0, Std=0.15

# Quantize to 4-bit (the first return value is the dequantized tensor)
quantized_4bit, _, _, _ = quantize_asymmetric(original_weights, bits=4)

# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Original weights
axes[0].hist(original_weights, bins=50, color='#4285F4', alpha=0.8, edgecolor='black')
axes[0].set_title('Original Weights (FP32)', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Weight Value', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].grid(True, alpha=0.3)
axes[0].axvline(0, color='red', linestyle='--', linewidth=2, label='Zero')
axes[0].legend()

# Quantized weights
axes[1].hist(quantized_4bit, bins=50, color='#EA4335', alpha=0.8, edgecolor='black')
axes[1].set_title('Quantized Weights (4-bit)', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Weight Value', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].grid(True, alpha=0.3)
axes[1].axvline(0, color='red', linestyle='--', linewidth=2, label='Zero')
axes[1].legend()

plt.tight_layout()
plt.savefig('quantization_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("📊 Visualization saved as 'quantization_comparison.png'")
```

### 5.2 Quantization Error Analysis

```python
# Calculate errors for different bit widths
bit_widths = [2, 3, 4, 6, 8]
errors = []

for bits in bit_widths:
    dq, _, _, _ = quantize_asymmetric(original_weights, bits=bits)
    error = np.mean(np.abs(original_weights - dq))
    errors.append(error)
    print(f"{bits}-bit: MAE = {error:.6f}")

# Plot error vs bit width
plt.figure(figsize=(10, 6))
plt.plot(bit_widths, errors, marker='o', linewidth=2, markersize=10, color='#0F9D58')
plt.xlabel('Bit Width', fontsize=12, fontweight='bold')
plt.ylabel('Mean Absolute Error', fontsize=12, fontweight='bold')
plt.title('Quantization Error vs Bit Width', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.xticks(bit_widths)
plt.tight_layout()
plt.savefig('error_vs_bitwidth.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n📊 Error analysis saved as 'error_vs_bitwidth.png'")
```
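
The shape of this curve has a simple explanation. If rounding errors are roughly uniform within one quantization step of width $S$, the expected absolute error is $S/4$, and each extra bit halves $S$. The sketch below checks that approximation on synthetic weights; it uses a simplified min-offset uniform quantizer written for this check rather than the `quantize_asymmetric` function above.

```python
import numpy as np

rng = np.random.default_rng(42)
weights = rng.normal(0.0, 0.15, size=10_000)   # synthetic weights

ratios = {}
for bits in (4, 6, 8):
    levels = 2 ** bits - 1
    step = (weights.max() - weights.min()) / levels      # step size S
    codes = np.round((weights - weights.min()) / step)   # integer codes 0..levels
    dequantized = codes * step + weights.min()
    mae = np.mean(np.abs(weights - dequantized))
    ratios[bits] = mae / (step / 4)                      # measured vs predicted
    print(f"{bits}-bit: MAE = {mae:.6f}, S/4 = {step / 4:.6f}, ratio = {ratios[bits]:.3f}")
```

A ratio close to 1 indicates that the uniform-error model fits; it becomes less accurate at very low bit widths, where the step is large compared with the spread of the weights.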

## 6. Comparison Table

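Let's compare different quantization approaches for the LLaMA-7B model. The memory column follows from the parameter count; the accuracy and speed columns are rough, indicative figures that vary with the task, hardware, and kernels used:

| Model | Precision | Weight Memory (Approx.) | Accuracy Loss | Relative Inference Speed |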
|------------|-----------|-----------------|---------------|------------------|
| LLaMA-7B | FP16 | 14 GB | 0% (Baseline) | 1.0x |
| LLaMA-7B | INT8 | 7 GB | < 1% | 1.5-2x |
| LLaMA-7B | 4-bit (GPTQ) | 3.5 GB | ~1-2% | 2-3x |
| LLaMA-7B | 4-bit (NF4) | 3.5 GB | ~1% | 2-3x |
| LLaMA-7B | 3-bit (AWQ) | 2.6 GB | ~2-3% | 3-4x |

### Key Takeaways:

1. **Memory Reduction:**
   - FP16 → INT8: 2x reduction
   - FP16 → 4-bit: 4x reduction
   - FP16 → 3-bit: 5.3x reduction

2. **Accuracy vs Compression:**
   - 8-bit: Minimal accuracy loss (<1%)
   - 4-bit: Acceptable loss (1-2%)
   - 3-bit: Noticeable but usable (2-3%)

3. **Speed Improvements:**
   - Token generation is usually memory-bandwidth bound, so smaller weights tend to mean faster inference
   - Less memory bandwidth required per generated token
   - Better cache utilization
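
A back-of-the-envelope estimate shows why weight size drives generation speed. At batch size 1, every generated token has to read all weights from memory once, so memory bandwidth divided by model size gives an upper bound on tokens per second. The sketch below assumes a hypothetical bandwidth of 1000 GB/s and ignores the KV cache, dequantization cost, and compute time, so real speedups are smaller.

```python
def max_tokens_per_second(num_params_billion, bits, bandwidth_gb_s):
    """Upper bound on decode speed when reading the weights is the bottleneck."""
    model_gb = num_params_billion * bits / 8
    return bandwidth_gb_s / model_gb

BANDWIDTH_GB_S = 1000   # hypothetical GPU memory bandwidth

bounds = {}
for name, bits in [("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    bounds[name] = max_tokens_per_second(7, bits, BANDWIDTH_GB_S)
    print(f"LLaMA-7B {name:<6}: at most {bounds[name]:.1f} tokens/s")
```

The bound scales inversely with the bit width, which is the idealized limit behind the more modest measured speedups.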

---

## 7. Practical Memory Calculation

Let's calculate the actual memory requirements:

```python
def calculate_model_memory(num_parameters, precision_bits):
    """
    Calculate model memory in GB

    Args:
        num_parameters: Number of model parameters (in billions)
        precision_bits: Bit precision (32, 16, 8, 4, etc.)

    Returns:
        Memory in GB
    """
    bytes_per_param = precision_bits / 8
    total_bytes = num_parameters * 1e9 * bytes_per_param
    total_gb = total_bytes / 1e9
    return total_gb

# Example: LLaMA models
models = {
    "LLaMA-7B": 7,
    "LLaMA-13B": 13,
    "LLaMA-70B": 70,
}

precisions = {
    "FP32": 32,
    "FP16": 16,
    "INT8": 8,
    "4-bit": 4,
}

print("Memory Requirements (GB):\n")
print(f"{'Model':<15} {'FP32':<10} {'FP16':<10} {'INT8':<10} {'4-bit':<10}")
print("=" * 60)

for model_name, num_params in models.items():
    row = f"{model_name:<15}"
    for prec_name, prec_bits in precisions.items():
        memory_gb = calculate_model_memory(num_params, prec_bits)
        row += f"{memory_gb:<10.1f}"
    print(row)

print("\n💡 Note: Actual memory usage may be higher due to:")
print("   - Activation memory")
print("   - Optimizer states (during training)")
print("   - KV cache (during inference)")
print("   - Framework overhead")
```
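
Of the extra costs listed in that note, the KV cache is the easiest to estimate. The sketch below assumes standard multi-head attention, where every layer stores one key and one value vector of the full hidden size per token, and uses the published LLaMA-7B configuration (32 layers, hidden size 4096). The function name is illustrative.

```python
def kv_cache_gb(num_layers, hidden_size, seq_len, batch_size=1, bytes_per_value=2):
    """KV cache size in GB: keys and values for every layer and token."""
    # Factor 2 accounts for storing both K and V
    total_bytes = 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_value
    return total_bytes / 1e9

# LLaMA-7B with a 2048-token context, cache kept in FP16 (2 bytes per value)
cache_gb = kv_cache_gb(num_layers=32, hidden_size=4096, seq_len=2048)
print(f"KV cache for one 2048-token sequence: {cache_gb:.2f} GB")
```

Weight quantization does not shrink this cache, so with long contexts or large batches it can become a significant share of total memory.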

## 8. Conclusion

### Summary of Key Points:

1. **Quantization is Essential:**
   - Enables deployment of large models on consumer hardware
   - Reduces memory requirements by 2-8x
   - Speeds up inference

2. **Modern Techniques are Effective:**
   - GPTQ: Hessian-based post-training quantization that compensates for rounding error layer by layer
   - AWQ: Protects salient weights by scaling them according to activation statistics
   - NF4: Enables efficient fine-tuning with QLoRA

3. **Trade-offs Exist:**
   - Lower precision means smaller models but potential accuracy loss
   - 4-bit quantization often offers a good balance between size and accuracy
   - 8-bit is nearly lossless

4. **Practical Applications:**
   - Run LLaMA-7B on consumer GPUs (RTX 3090/4090)
   - Deploy models in production with lower costs
   - Fine-tune large models with QLoRA

### Future Directions:

- **Mixed Precision:** Combine different precisions in one model
- **2-bit and 1-bit:** Extreme quantization research
- **Hardware Support:** Specialized chips for quantized models
- **Quantization-Aware Training:** Train models with quantization in mind

---

## References:

1. Frantar, E., et al. (2023). "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers"
2. Lin, J., et al. (2023). "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration"
3. Dettmers, T., et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs"
4. Hugging Face Transformers Documentation
5. bitsandbytes Library Documentation

---

**End of Research Report**

**Date:** February 2026