# 🧠 Titanesque Experiment – V2 Training Plan

**Purpose:**  
This document is a draft requirements note for the experiments described in paper **V2**.  
It guides the construction of the training configurations, working from concrete examples.

---

## Models to Evaluate (with Rationale)

| Model Name                                | Key Feature(s)                                                      |
|-------------------------------------------|----------------------------------------------------------------------|
| `meta-llama/Llama-3.2-1B`                 | Standard causal LM (Grouped-Query Attention)                        |
| `Qwen/Qwen2.5-1.5B`                        | RMSNorm architecture                                                |
| `apple/OpenELM-1_1B`                       | QKV combined in a **single Linear layer**                           |
| `mistralai/Mistral-7B-v0.3` (quantized)    | Sliding Window Attention                                            |
| `allenai/OLMoE-1B-7B-0924` (quantized)     | Mixture of Experts                                                  |
| `Dream-org/Dream-v0-Base-7B` (quantized)   | Diffusion-based model (non-causal, bidirectional)                   |

---

## Core Training Variants (All Models)

For **each listed model**, train with:

1. `delta_rule` – constant `mag_weight = 0.5`  
2. `delta_product` – constant `mag_weight = 0.5`  
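
The core grid above (every model × both operator modes) could be enumerated as in the minimal sketch below. `RunConfig` and `core_runs` are hypothetical names used only for illustration, not part of any library; the field names are assumptions, while the values come from this plan.

```python
from dataclasses import dataclass
from itertools import product

# Models and operator modes as listed in this plan.
MODELS = [
    "meta-llama/Llama-3.2-1B",
    "Qwen/Qwen2.5-1.5B",
    "apple/OpenELM-1_1B",
    "mistralai/Mistral-7B-v0.3",
    "allenai/OLMoE-1B-7B-0924",
    "Dream-org/Dream-v0-Base-7B",
]
OPERATOR_MODES = ["delta_rule", "delta_product"]


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for one training run (illustrative only)."""
    model_name: str
    operator_mode: str
    mag_weight: float = 0.5  # constant value used for the core variants
    use_lora: bool = True    # the LLaMA-specific variants would set this to False
    seed: int = 42           # fixed across all runs


def core_runs() -> list[RunConfig]:
    """One run per (model, operator mode) pair."""
    return [RunConfig(m, op) for m, op in product(MODELS, OPERATOR_MODES)]


runs = core_runs()
print(f"{len(runs)} core runs")
for run in runs[:2]:
    print(run)
```

Keeping the defaults in one frozen dataclass means that a variant only has to state what differs from the baseline.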

---

## LLaMA-Specific Variants

For **`meta-llama/Llama-3.2-1B`** add:

- Same as above **but without LoRA**:  
  - `delta_rule` – constant `mag_weight = 0.5`  
  - `delta_product` – constant `mag_weight = 0.5`  

---

## Operator Mode Experiments (*LLaMA only*, constant `mag_weight = 0.5`)

- ✅ `delta_rule` (*already done*)  
- `delta_rule + non-linearity`  
- ✅ `delta_product` (*already done*)  
- `delta_product + rotation`  
- `delta_product_derived + rotation` (combined approach)  

---

## Liza Callback Sweeps (*LLaMA only*)

Test various **mag_weight schedules** for both `delta_rule` and `delta_product`:

1. **Constant values**: 0.125, 0.25, 0.5 (✅), 0.75  
2. **Transition to 0.5** over: 10, 100, 1000 samples  
3. **Alternating schedules**:
    - [0, 0.5]  
    - [0, 1.0]  
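
The three schedule families could be written as plain functions of the sample index, as in the sketch below. Two details are assumptions, because the plan does not state them: the transition is taken to be a linear ramp starting at 0.0, and the alternating schedules are taken to switch value at every step. The function and key names are hypothetical and do not reflect the actual callback API.

```python
from typing import Callable, Sequence

Schedule = Callable[[int], float]  # maps a sample index to a mag_weight


def constant(value: float) -> Schedule:
    """Always return the same mag_weight."""
    return lambda step: value


def linear_transition(target: float, n_samples: int, start: float = 0.0) -> Schedule:
    """Ramp linearly from `start` to `target` over `n_samples`, then hold.

    Assumption: linear shape and a start value of 0.0.
    """
    def schedule(step: int) -> float:
        frac = min(step / n_samples, 1.0)
        return start + (target - start) * frac
    return schedule


def alternating(values: Sequence[float]) -> Schedule:
    """Cycle through `values`, one per step (assumed period of one step)."""
    return lambda step: values[step % len(values)]


# The nine schedules listed in this plan, to be run for each operator mode.
SCHEDULES: dict[str, Schedule] = {
    "const_0.125": constant(0.125),
    "const_0.25": constant(0.25),
    "const_0.5": constant(0.5),
    "const_0.75": constant(0.75),
    "ramp_10": linear_transition(0.5, 10),
    "ramp_100": linear_transition(0.5, 100),
    "ramp_1000": linear_transition(0.5, 1000),
    "alt_0_0.5": alternating([0.0, 0.5]),
    "alt_0_1.0": alternating([0.0, 1.0]),
}

for name, schedule in SCHEDULES.items():
    print(name, [schedule(step) for step in (0, 1, 10, 1000)])
```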

---

## Dataset Setup

- **Dataset:** 10% of **alpaca-cleaned** (~5000 samples)  
- **Padding:** `"longest"`, max length **386**  
- **Test set:** 2 examples per run (sanity check only, not critical)  
- **Seed:** `42` (fixed across all runs)  
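
A reproducible 10% subset can be drawn by sampling indices with the fixed seed, as in this minimal sketch. `subsample_indices` is a hypothetical helper, and `n_total = 50_000` is a placeholder chosen to match the "~5000 samples" estimate, not the exact size of the dataset.

```python
import random


def subsample_indices(n_total: int, fraction: float = 0.10, seed: int = 42) -> list[int]:
    """Return a sorted, seed-deterministic sample of row indices."""
    rng = random.Random(seed)          # local RNG, does not touch global state
    k = round(n_total * fraction)      # number of rows to keep
    return sorted(rng.sample(range(n_total), k))


# Placeholder size; replace with the real number of rows in the dataset.
n_total = 50_000
indices = subsample_indices(n_total)
print(len(indices), indices[:5])
```

Because the generator is local to the function, the same indices come back on every call, which keeps the subset identical across all runs.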

---

## Max Chunk Size Rules

- `delta_rule`: **64**  
- `delta_product`: **32** (due to doubled sequence length)  
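
The rule can be kept in a single lookup so that no run picks its chunk size by hand. The sketch below is illustrative: the `EXPANSION` table reads the "doubled sequence length" note as a factor of 2 for `delta_product`, which is an interpretation rather than something the plan states, and the helper names are hypothetical.

```python
# Maximum chunk size per operator mode, as stated in this plan.
MAX_CHUNK_SIZE = {"delta_rule": 64, "delta_product": 32}

# Assumption: delta_product doubles the internal sequence length.
EXPANSION = {"delta_rule": 1, "delta_product": 2}


def max_chunk_size(operator_mode: str) -> int:
    """Look up the chunk-size cap; fail loudly for modes without a rule."""
    try:
        return MAX_CHUNK_SIZE[operator_mode]
    except KeyError:
        raise ValueError(f"no chunk size rule for {operator_mode!r}") from None


def effective_chunk_length(operator_mode: str) -> int:
    """Internal steps per chunk under the assumed expansion factor."""
    return max_chunk_size(operator_mode) * EXPANSION[operator_mode]


for mode in MAX_CHUNK_SIZE:
    print(mode, max_chunk_size(mode), effective_chunk_length(mode))
```

Under that reading, both modes process the same number of internal steps per chunk, which would explain why the cap for `delta_product` is halved.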

---

## Training Parameters (Global Defaults)

- Same LoRA setup, training config, and tokenizer unless otherwise stated  
- Only reduce batch size when memory is insufficient  
- Special handling: diffusion models have their own requirements  
- Hardware Target: **2× T4 GPUs (Kaggle free tier)** to demonstrate *low-compute viability*  
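
The "only reduce batch size when memory is insufficient" rule could be automated with a retry loop such as the sketch below. Halving on each failure is an assumption (the plan does not say how to reduce), and `run_with_batch_fallback` and `train_fn` are hypothetical names. The built-in `MemoryError` stands in for the framework's out-of-memory exception; with PyTorch one would likely pass `torch.cuda.OutOfMemoryError` instead.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_batch_fallback(
    train_fn: Callable[[int], T],
    batch_size: int,
    min_batch_size: int = 1,
    oom_error: type[BaseException] = MemoryError,
) -> tuple[T, int]:
    """Call train_fn(batch_size); on an out-of-memory error, halve and retry.

    Returns the training result together with the batch size that worked.
    """
    while True:
        try:
            return train_fn(batch_size), batch_size
        except oom_error:
            if batch_size <= min_batch_size:
                raise  # nothing left to reduce, surface the error
            batch_size = max(min_batch_size, batch_size // 2)


# Usage with a stand-in training function that "fits" only up to batch size 4.
def fake_train(batch_size: int) -> str:
    if batch_size > 4:
        raise MemoryError("simulated out-of-memory")
    return "ok"


result, used_batch_size = run_with_batch_fallback(fake_train, batch_size=16)
print(result, used_batch_size)
```

Logging the batch size that finally worked alongside each run keeps the results comparable when some models needed a smaller value on the 2× T4 setup.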