# ✅ **Fine‑Tuning**

Fine‑tuning is the process of taking a pre‑trained language model and training it further on your own curated data so it learns specific skills, style, or formats that match your needs. It changes the model’s *default behavior* to follow your preferences without long prompts.

## Purpose

- **Task Specialization:** Boost accuracy for a specific task (e.g., classification, extraction).
- **Domain Adaptation:** Make the model fluent in your jargon and context.
- **Style & Format Control:** Ensure consistent tone, voice, and output structure.
- **Policy Adherence:** Encode organizational rules and compliance.
- **Efficiency:** Reduce prompt length, cost, and latency while improving reliability.

---

# Types of Fine‑Tuning

## 1. Full Fine‑Tuning

**Definition:**  
Update all parameters of the pre‑trained model on your dataset.

**Pros:**  
Best performance and flexibility; can deeply adapt to new tasks/domains.

**Cons:**  
Requires large compute/resources; risk of overfitting or catastrophic forgetting.

**Use Case:**  
When you have a large, high‑quality dataset and need maximum adaptation.

---

## 2. PEFT (Parameter‑Efficient Fine‑Tuning)

**Definition:**  
Freeze most of the model’s parameters and train only a small subset or added modules.

**Purpose:**  
Reduce compute cost, memory usage, and training time while keeping strong performance.

**Techniques:**

- **LoRA (Low‑Rank Adaptation):** Inject small trainable rank‑decomposition matrices into layers.
- **QLoRA:** LoRA applied on top of a quantized (typically 4‑bit) base model to save memory.
- **Adapters:** Small extra layers between existing ones, trained while the base model stays frozen.
- **Prefix Tuning:** Train continuous “virtual token” vectors (prefixes) that are prepended at each layer, steering the output without changing the main weights.

**Use Case:**  
When resources are limited, or you need multiple task‑specific adapters for the same base model.
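
To make the savings concrete, here is a rough sketch of the trainable-parameter count for LoRA on a single weight matrix. The layer size (4096 x 4096) and the rank (8) are illustrative assumptions, not values taken from any particular model.

```python
from typing import Tuple


def lora_param_counts(d_out: int, d_in: int, rank: int) -> Tuple[int, int]:
    """Return (full, lora) trainable parameter counts for one weight matrix.

    Full fine-tuning trains the whole d_out x d_in matrix W.
    LoRA freezes W and trains B (d_out x rank) and A (rank x d_in),
    so the effective weight is W + B @ A.
    """
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora


# Illustrative sizes only (assumed, not tied to a specific model).
full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

Under these assumptions the adapter trains well under 1% of the parameters that full fine-tuning would update for the same layer.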

---

## 3. Instruction Fine‑Tuning

**Definition:**  
Supervised fine‑tuning using datasets of instructions and ideal responses, teaching the model to follow natural language instructions reliably.

**Purpose:**  
Improve general usability, helpfulness, and compliance by training on diverse instruction‑response pairs.

**Use Case:**  
Creating general instruction‑following models or adapting to specific instruction styles.
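
As a small illustration, instruction-response pairs are often stored one JSON object per line and rendered into a single training string. The sketch below assumes a made-up template and made-up records; real projects should use whatever template their base model or training framework expects.

```python
import json

# Hypothetical prompt template, chosen only for illustration.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"


def format_example(record: dict) -> str:
    """Render one instruction-response pair as a single training string."""
    return TEMPLATE.format(
        instruction=record["instruction"].strip(),
        response=record["response"].strip(),
    )


# Two made-up records in JSON Lines form (one JSON object per line).
jsonl = "\n".join([
    json.dumps({"instruction": "Summarize: The cat sat on the mat.",
                "response": "A cat sat on a mat."}),
    json.dumps({"instruction": "Translate to French: Hello",
                "response": "Bonjour"}),
])

training_texts = [format_example(json.loads(line)) for line in jsonl.splitlines()]
print(training_texts[1])
```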

---

# ✅ **Prompt Engineering vs. Fine‑Tuning**

## Prompt Engineering

- **Definition:**  
  Controlling and steering a model’s behavior at runtime by crafting instructions and optionally providing examples.

- **Types:**
  - **Zero‑Shot:** Only instructions, no examples.
  - **One‑Shot:** One example to show the format or style.
  - **Few‑Shot:** Multiple examples (usually 2–10) to guide output.

- **Pros:**  
  No training required, quick to adjust, works well for many general tasks.

- **Cons:**  
  Prompts can become long, outputs may be inconsistent, harder to enforce strict formats.

- **Best For:**  
  Rapid prototyping, evolving requirements, or tasks where occasional variation is acceptable.
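
The three prompt types differ only in how many worked examples are included. A minimal sketch, assuming a simple `Input:`/`Output:` layout and made-up sentiment examples (both are illustrative choices, not a required format):

```python
from typing import List, Tuple


def build_prompt(instruction: str, examples: List[Tuple[str, str]], query: str) -> str:
    """Assemble a prompt: instruction, then k worked examples, then the query.

    k = 0 gives a zero-shot prompt, k = 1 one-shot, k >= 2 few-shot.
    """
    parts = [instruction]
    for text, label in examples:
        parts.append(f"Input: {text}\nOutput: {label}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


instruction = "Classify the sentiment as positive or negative."
examples = [
    ("Great battery life.", "positive"),
    ("Screen cracked in a week.", "negative"),
]

zero_shot = build_prompt(instruction, [], "Fast shipping!")
one_shot = build_prompt(instruction, examples[:1], "Fast shipping!")
few_shot = build_prompt(instruction, examples, "Fast shipping!")
print(few_shot)
```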

---

## Fine‑Tuning

- **Definition:**  
  Further training a pre‑trained model on curated examples so it learns your desired behavior, style, or format by default.

- **Pros:**  
  Consistent and reliable outputs, shorter prompts, better domain and style adherence.

- **Cons:**  
  Requires dataset preparation, training resources, and time.

- **Best For:**  
  When prompting alone can’t achieve the accuracy, style, or reliability you need.

---

## Key Difference

Prompting is *temporary steering*: instructions and examples are supplied at runtime, with no extra training.  
Fine‑tuning is *permanent adaptation*: the behavior is trained into the weights, which is worth the cost when prompting isn’t enough.

---

# ✅ **Quantization & Model Optimization**

## Goal
Make fine‑tuned models smaller, faster, and more efficient without significantly hurting performance.

---

## Concept

Quantization reduces the numerical precision of model weights and activations so they require less memory and compute:

- **FP32 → FP16 → INT8 → INT4**

  - **FP32 (32‑bit floating point):** High precision, large size, slower inference.
  - **FP16 (16‑bit floating point):** Half the size, faster compute, minimal accuracy loss.
  - **INT8 (8‑bit integer):** A quarter of the FP32 size and faster on hardware with INT8 support, with a somewhat higher risk of accuracy loss.
  - **INT4 (4‑bit integer):** Extremely compact, enables very large models on small hardware, but higher risk of accuracy drop.

**Example:**  
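A 13B‑parameter model needs about 52 GB for its weights in FP32 but only about 6.5 GB in INT4 (plus some overhead for quantization scales and runtime buffers), so it can fit on a single consumer GPU.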
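
A back-of-the-envelope way to estimate this is parameters times bits per parameter, divided by 8. The sketch below counts weights only and uses decimal gigabytes; real memory use is higher because of quantization metadata, activations, and runtime buffers.

```python
# Bits per parameter for each format discussed above.
BITS = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def weight_memory_gb(n_params: float, fmt: str) -> float:
    """Weights-only memory in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * BITS[fmt] / 8 / 1e9


for fmt in BITS:
    print(f"{fmt}: {weight_memory_gb(13e9, fmt):.1f} GB")
```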

---

## Types of Quantization

### 1. Post‑Training Quantization (PTQ)

- **Definition:** Apply quantization *after* the model is fully trained.
- **Pros:** Fast, easy, no retraining.
- **Cons:** Might cause higher accuracy loss for sensitive tasks.
- **Example:** Train the model in FP32, then convert the weights to INT8 using a tool such as `bitsandbytes`.
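
The core idea of PTQ can be sketched in a few lines of plain Python. This is a simplified symmetric (absmax) scheme with made-up weights, shown only to illustrate the round trip; it is not how any particular library implements INT8 quantization.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric (absmax) quantization of a weight list to INT8."""
    # One scale for the whole list; fall back to 1.0 if all weights are zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map INT8 codes back to approximate float weights."""
    return [v * scale for v in q]


# Made-up "trained" weights.
weights = [0.823, -1.27, 0.031, 0.457, -0.664]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"scale={scale:.4f}", f"max_err={max_err:.4f}")
```

With this scheme the round-trip error of each weight is at most half the scale, which is why outliers (which inflate the scale) hurt accuracy.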

### 2. Quantization‑Aware Training (QAT)

- **Definition:** Simulate quantization *during* training so the model learns to work within reduced precision.
- **Pros:** Higher accuracy than PTQ because the model adapts during training.
- **Cons:** More complex, requires retraining.
- **Example:** During fine‑tuning, activations and weights are simulated as INT8, so the model learns to tolerate it.
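
A toy sketch of the idea, assuming a single weight, a fixed INT8 scale, and the straight-through estimator (one common way to pass gradients through the rounding step; the source does not name a specific method). All values are made up for illustration.

```python
def fake_quant(w: float, scale: float) -> float:
    """Round w onto the INT8 grid, then map back to float (simulated INT8)."""
    q = max(-127, min(127, round(w / scale)))
    return q * scale


# Toy task: learn y = 0.3 * x with one weight.
xs = [1.0, 2.0, 3.0]
ys = [0.3 * x for x in xs]

scale = 1.0 / 127   # assumed fixed scale covering weights in [-1, 1]
w = 0.0             # latent full-precision weight kept by the optimizer
lr = 0.01

for _ in range(200):
    w_q = fake_quant(w, scale)  # the forward pass sees the quantized weight
    # Gradient of the mean squared error with respect to w_q.
    grad = sum(2 * (w_q * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    # Straight-through estimator: treat d(w_q)/d(w) as 1 and update w directly.
    w -= lr * grad

w_q = fake_quant(w, scale)
print(f"latent w={w:.4f}  quantized w={w_q:.4f}")
```

The optimizer keeps a full-precision weight, but the loss is always computed with the quantized value, so training settles on a grid point next to the target.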

### 3. Mixed Precision Training

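- **Definition:** Use different precisions for different operations. Strictly speaking, this is a speed and memory technique rather than a quantization method, but it relies on the same trade of precision for efficiency.
- **Pros:** Faster training and reduced memory without major accuracy drop.
- **Cons:** Slightly more complex setup (e.g., loss scaling to avoid FP16 underflow).
- **Example:** Use FP16 for matrix multiplications (speed) and FP32 for master weights, accumulations, and loss calculations (precision).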
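
Why keep some operations in higher precision? A small sketch using only the standard library: summing many small FP16 values with an FP16 accumulator stalls, while a higher-precision accumulator does not. Python floats are 64-bit, so they stand in for FP32 here (an assumption made to keep the example dependency-free).

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


values = [0.001] * 10_000  # the true sum is 10.0

# Everything in FP16: the running sum is rounded to FP16 after every add.
acc_fp16 = 0.0
for v in values:
    acc_fp16 = to_fp16(acc_fp16 + to_fp16(v))

# Mixed precision: FP16 inputs, but a higher-precision accumulator.
acc_mixed = 0.0
for v in values:
    acc_mixed += to_fp16(v)

print(f"fp16 accumulator:  {acc_fp16}")
print(f"mixed accumulator: {acc_mixed:.4f}")
```

Once the FP16 running sum grows large enough, each small addend falls below half the spacing between representable values and is rounded away, so the sum stops well short of 10.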

---

## Popular Libraries

- **bitsandbytes:**
  - Provides 8‑bit and 4‑bit quantization, usually with only a small accuracy loss.
  - Used by QLoRA to fine‑tune LoRA adapters on top of a 4‑bit base model.
  - **Benefit:** Train large models on smaller GPUs.

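- **GPTQ (post‑training quantization for generative pre‑trained transformers):**
  - Post‑training weight quantization method optimized for inference.
  - Quantizes layer by layer, using approximate second‑order information to minimize the output error introduced by rounding; typically uses per‑channel or group‑wise scales.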

- **AWQ (Activation‑aware Weight Quantization):**
  - Uses activation statistics to identify the most important weight channels and rescales them so they are protected during quantization.
  - Can match or outperform GPTQ on some models.
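
To see why per-channel scales matter, the sketch below compares one shared scale against one scale per row on a made-up 2 x 4 weight matrix. It uses plain rounding to a symmetric signed 4-bit range (-7..7, an assumed simplification); it is not an implementation of GPTQ or AWQ.

```python
from typing import List


def quant_error(row: List[float], scale: float, levels: int = 7) -> float:
    """Max round-trip error when rounding a row to signed 4-bit with a given scale."""
    err = 0.0
    for w in row:
        q = max(-levels, min(levels, round(w / scale)))
        err = max(err, abs(w - q * scale))
    return err


# Two output channels with very different magnitudes (made-up values).
W = [
    [0.011, -0.024, 0.017, 0.030],  # small-magnitude channel
    [1.90, -3.50, 2.20, 0.70],      # large-magnitude channel
]

# Per-tensor: one scale shared by the whole matrix.
tensor_scale = max(abs(w) for row in W for w in row) / 7
per_tensor_err = [quant_error(row, tensor_scale) for row in W]

# Per-channel: each row gets its own scale.
per_channel_err = [quant_error(row, max(abs(w) for w in row) / 7) for row in W]

print("per-tensor :", per_tensor_err)
print("per-channel:", per_channel_err)
```

With a shared scale, the small-magnitude channel is rounded to zero entirely; giving it its own scale preserves its values at much lower error.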

---

## Summary Table

| Method          | When Applied              | Precision     | Pros                     | Cons                       |
| --------------- | ------------------------- | ------------- | ------------------------ | -------------------------- |
| PTQ             | After training            | INT8/INT4     | Fast, simple             | Accuracy loss risk         |
| QAT             | During training           | INT8/INT4     | Better accuracy          | More compute/training time |
| Mixed Precision | During training/inference | FP16/FP32 mix | Speed + accuracy balance | Slight complexity          |


Quantization and model optimization make fine‑tuned models more accessible on smaller hardware, enabling faster, cheaper inference and training.

---

# ✅ **Adapter in Fine-Tuning**

An **adapter** is a small set of **extra trainable parameters** added inside a pre-trained model to learn new skills or styles **without changing the original model weights**.

Think of it like a **plug‑in** for a model:

- **Base model** stays frozen (unchanged).
- **Adapter** learns the new domain/task.
- At inference, inputs pass through both — base + adapter — to produce adapted outputs.

---

## Why Use Adapters?

1. **Efficient Training** — Only a few million parameters are trained, not billions.
2. **Multi-Task Flexibility** — Swap adapters for different tasks/domains.
3. **Low Storage** — Adapters are MBs in size vs GBs for full models.

---

## How It Works

1. Insert **small trainable weights** inside existing layers (often after attention or feed-forward blocks).
2. **Freeze** all original parameters.
3. Train **only** the adapter on your dataset.
4. Save just the adapter weights.
5. At inference, load base + adapter for the adapted behavior.
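
The steps above can be sketched with a toy example in plain Python: a frozen 2 x 2 base matrix plus a rank-1, LoRA-style adapter trained by hand-written gradient descent. The data, learning rate, and step count are made-up assumptions chosen only to keep the example small.

```python
import copy
from typing import List

Matrix = List[List[float]]


def forward(W: Matrix, B: List[float], A: List[float], x: List[float]) -> List[float]:
    """y = W x + B (A . x): frozen base plus a rank-1 adapter."""
    a_dot_x = sum(a * xi for a, xi in zip(A, x))
    return [sum(w * xi for w, xi in zip(row, x)) + b * a_dot_x
            for row, b in zip(W, B)]


def mse(W: Matrix, B: List[float], A: List[float], data) -> float:
    """Mean over samples of the squared error between prediction and target."""
    total = 0.0
    for x, y in data:
        pred = forward(W, B, A, x)
        total += sum((p - t) ** 2 for p, t in zip(pred, y))
    return total / len(data)


# Step 2: the base weights are frozen (never updated below).
W = [[1.0, 0.0], [0.0, 1.0]]
W_before = copy.deepcopy(W)

# Step 1: small trainable adapter. B starts at zero, so the adapted model
# initially behaves exactly like the base model.
B = [0.0, 0.0]
A = [0.1, 0.1]

# Made-up target behavior: base output plus a rank-1 shift.
data = [
    ([1.0, 0.0], [1.5, -0.5]),
    ([0.0, 1.0], [0.25, 0.75]),
    ([1.0, 1.0], [1.75, 0.25]),
]

initial_loss = mse(W, B, A, data)
lr = 0.1
for _ in range(500):
    grad_B = [0.0, 0.0]
    grad_A = [0.0, 0.0]
    for x, y in data:
        pred = forward(W, B, A, x)
        err = [p - t for p, t in zip(pred, y)]
        a_dot_x = sum(a * xi for a, xi in zip(A, x))
        b_dot_err = sum(b * e for b, e in zip(B, err))
        for i in range(2):
            grad_B[i] += 2 * err[i] * a_dot_x / len(data)
            grad_A[i] += 2 * b_dot_err * x[i] / len(data)
    # Step 3: only the adapter parameters are updated.
    B = [b - lr * g for b, g in zip(B, grad_B)]
    A = [a - lr * g for a, g in zip(A, grad_A)]

final_loss = mse(W, B, A, data)

# Step 4: only the adapter needs saving; the base model is shared.
adapter_state = {"A": A, "B": B}
print(f"loss: {initial_loss:.4f} -> {final_loss:.6f}")
```

Step 5 then amounts to calling `forward` with the unchanged `W` and the saved `A` and `B`.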

---

## Example

A base model is a **general writer**. To make it write **legal contracts**:

- Add small adapter weights that learn legal terms, tone, and format.
- For medical reports, train another adapter and swap it in.
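
Swapping can be pictured as a lookup: one shared, frozen base plus a dictionary of small adapters selected per request. The sketch below is a toy stand-in with made-up numbers; the adapter names and the additive-delta form are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Optional

# Frozen base "model": a single shared weight vector (toy stand-in).
BASE_W = [1.0, 2.0]

# Hypothetical pre-trained adapters: each is a small additive delta here.
ADAPTERS: Dict[str, List[float]] = {
    "legal": [0.5, 0.0],
    "medical": [0.0, -0.5],
}


def predict(x: List[float], adapter: Optional[str] = None) -> float:
    """Score x with the base weights, plus the named adapter if one is given."""
    delta = ADAPTERS[adapter] if adapter else [0.0] * len(BASE_W)
    return sum((w + d) * xi for w, d, xi in zip(BASE_W, delta, x))


x = [1.0, 1.0]
print(predict(x), predict(x, "legal"), predict(x, "medical"))
```

The base weights are never modified, so adding a new domain means training and storing only another small entry in the adapter table.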

---

## Popular Adapter Methods

- **LoRA** — Low‑Rank Adaptation; small matrices added to attention/FFN layers.
- **Prefix Tuning** — Trainable “virtual tokens” prepended in model layers.
- **Standard Adapters** — Tiny feed‑forward blocks between Transformer layers.

---