# Double Integrator Training Notebook

This notebook provides a clean interface for training and evaluating models on the Double Integrator system.

**Sections:**
1. Setup & Data Generation
2. SFT Training (Optional)
3. GRPO Training (Optional) 
4. SFT + GRPO Training (Combined)
5. Model Evaluation
6. Results Analysis

## 1. Setup & Configuration

In [1]:
import sys
import os
from pathlib import Path

# Get the parent directory of the notebooks folder
notebook_dir = Path.cwd()
if notebook_dir.name == 'notebooks':
    parent_dir = notebook_dir.parent
else:
    # If we're already in the parent directory, use it
    parent_dir = notebook_dir

# Add parent directory to path and change to it
sys.path.insert(0, str(parent_dir))
os.chdir(parent_dir)

from config import ALL_CONFIG, AVAILABLE_SYSTEMS
from core.model_manager import UniversalModelManager
from core.data_pipeline import UniversalDataGenerator
from training.sft_training import train_sft_model, setup_universal_chat_template, save_sft_model
from training.grpo_training import train_grpo_model, save_grpo_model
from evaluation.inference import run_batch_inference
from evaluation.metrics import compute_batch_metrics
from evaluation.visualization import plot_comparison, plot_metrics_comparison
from environments import get_system
from data_utils import load_train_eval_datasets, list_available_datasets
from gpu_utils import auto_gpu_config
import matplotlib.pyplot as plt
import numpy as np

print("✅ All modules loaded successfully!")
print(f"Available systems: {AVAILABLE_SYSTEMS}")

Configuration validation passed!
Universal Control LLM configuration loaded from YAML files


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 07-31 17:05:37 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 07-31 17:05:38 [__init__.py:239] Automatically detected platform cuda.
✅ All modules loaded successfully!
Available systems: ['double_integrator', 'van_der_pol']


In [2]:
# Configuration
SYSTEM_NAME = "double_integrator"
DATASET_NAME = "di"  # Simple clean name
LORA_RANK = 8
MAX_SEQ_LENGTH = 1024

# Training Configuration (EDIT THESE PARAMETERS AS NEEDED)
USE_REDUCED_DATASET = True  # Set to False for full dataset training
SFT_SAMPLES = 50  # Number of samples for SFT training (reduced for stability)
GRPO_SAMPLES = 30  # Number of samples for GRPO training (reduced for stability)
SFT_MAX_STEPS = 10  # Max steps for SFT (reduced for stability)
GRPO_MAX_STEPS = 5  # Max steps for GRPO (reduced for stability)

print(f"🎯 Training system: {SYSTEM_NAME}")
print(f"📊 Dataset: {DATASET_NAME}")
print(f"🔧 LoRA rank: {LORA_RANK}")
print(f"📏 Max sequence length: {MAX_SEQ_LENGTH}")
print(f"🔢 Training mode: {'Reduced dataset' if USE_REDUCED_DATASET else 'Full dataset'}")
if USE_REDUCED_DATASET:
    print(f"   SFT samples: {SFT_SAMPLES}, max steps: {SFT_MAX_STEPS}")
    print(f"   GRPO samples: {GRPO_SAMPLES}, max steps: {GRPO_MAX_STEPS}")
    print("💡 Set USE_REDUCED_DATASET=False for full training")
    print("⚠️  Using conservative settings to avoid tensor dimension issues")

🎯 Training system: double_integrator
📊 Dataset: di
🔧 LoRA rank: 8
📏 Max sequence length: 1024
🔢 Training mode: Reduced dataset
   SFT samples: 50, max steps: 10
   GRPO samples: 30, max steps: 5
💡 Set USE_REDUCED_DATASET=False for full training
⚠️  Using conservative settings to avoid tensor dimension issues


## 2. Data Generation & Loading

In [3]:
# Check available datasets
print("📂 Available datasets:")
datasets = list_available_datasets("datasets")  # Changed from "../datasets" to "datasets"
if datasets:
    for dataset in datasets:
        print(f"   • {dataset}")
else:
    print("   No datasets found")

📂 Available datasets:
   • di
   • di_test


In [4]:
# Generate DI dataset (run this cell if you don't have the dataset)
GENERATE_NEW_DATA = False  # Set to True to generate new data

if GENERATE_NEW_DATA:
    print("🔄 Generating new Double Integrator dataset...")
    
    generator = UniversalDataGenerator(
        systems=[SYSTEM_NAME],
        dt=ALL_CONFIG["system"]["dt"],
        steps=ALL_CONFIG["system"]["steps"],
        reasoning_start=ALL_CONFIG["system"]["reasoning_start"],
        reasoning_end=ALL_CONFIG["system"]["reasoning_end"],
        solution_start=ALL_CONFIG["system"]["solution_start"],
        solution_end=ALL_CONFIG["system"]["solution_end"]
    )
    
    # Generate 2000 samples (1800 train + 200 eval)
    data = generator.generate_single_system_dataset(SYSTEM_NAME, 2000)
    train_data, eval_data = generator.split_dataset(data, 0.9)
    
    # Save dataset
    import pickle
    os.makedirs("../datasets", exist_ok=True)
    
    with open(f"../datasets/{DATASET_NAME}_train.pkl", 'wb') as f:
        pickle.dump(train_data, f)
    with open(f"../datasets/{DATASET_NAME}_eval.pkl", 'wb') as f:
        pickle.dump(eval_data, f)
    
    print(f"✅ Generated and saved dataset: {DATASET_NAME}")
    print(f"   📈 Train samples: {len(train_data)}")
    print(f"   📊 Eval samples: {len(eval_data)}")
else:
    print("⏭️  Skipping data generation (set GENERATE_NEW_DATA=True to generate)")

⏭️  Skipping data generation (set GENERATE_NEW_DATA=True to generate)


In [5]:
# Load existing dataset
try:
    train_data, eval_data, dataset_info = load_train_eval_datasets(
        DATASET_NAME, "datasets", SYSTEM_NAME  # Changed from "../datasets" to "datasets"
    )
    print(f"✅ Loaded dataset: {DATASET_NAME}")
    print(f"   📈 Train samples: {len(train_data)}")
    print(f"   📊 Eval samples: {len(eval_data)}")
    print(f"   ℹ️  Dataset info: {dataset_info.get('config', {})}")
except Exception as e:
    print(f"❌ Failed to load dataset: {e}")
    print("💡 Set GENERATE_NEW_DATA=True in the cell above to generate the dataset")

📂 Loading dataset from datasets/di_train.pkl (format: pickle)
   ✅ Loaded 1800 samples
📂 Loading dataset from datasets/di_eval.pkl (format: pickle)
   ✅ Loaded 200 samples
⚠️  No info file found for dataset: datasets/di_train.pkl
   🔍 Filtered to 1800 samples for system 'double_integrator'
   🔍 Filtered to 200 samples for system 'double_integrator'
📊 Dataset loaded: 1800 train, 200 eval samples
✅ Loaded dataset: di
   📈 Train samples: 1800
   📊 Eval samples: 200
   ℹ️  Dataset info: {}


## 3. Model Setup

In [6]:
# Setup GPU and model manager
print("🎯 Setting up GPU and model...")

# Auto-select best GPU
gpu_config = auto_gpu_config()
print(f"🖥️  Selected GPU: {gpu_config['gpu_id']}")

# Create model manager
manager = UniversalModelManager(ALL_CONFIG["model"]["base_model_name"])

# Setup model
model, tokenizer = manager.setup_model(
    max_seq_length=MAX_SEQ_LENGTH,
    lora_rank=LORA_RANK,
    gpu_id=gpu_config['gpu_id'],
    auto_select_gpu=False
)

print("✅ Model setup complete!")

🎯 Setting up GPU and model...
🖥️  GPU Status:
   GPU 0:  22491MB free /  81559MB total 🔴 BUSY
🧹 Cleared GPU memory cache
🎯 Selected GPU 0: 22491MB free (72.4% used)
🚀 Using GPU 0: NVIDIA H100 80GB HBM3
🖥️  Selected GPU: 0
📌 Using specified GPU 0
🚀 Using GPU 0: NVIDIA H100 80GB HBM3
🚀 Loading model: unsloth/Qwen3-4B-Base
   Max sequence length: 1024
   LoRA rank: 8
   GPU memory utilization: 0.4


==((====))==  Unsloth 2025.6.1: Fast Qwen3 patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.
   \\   /|    NVIDIA H100 80GB HBM3. Num GPUs = 1. Max memory: 79.097 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 9.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
🔧 Applying LoRA configuration...


Unsloth 2025.6.1 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.


✅ Model setup completed successfully!
✅ Model setup complete!


In [7]:
# Chat Template Setup (Optional - automatically handled by training functions)
# Note: The training functions will automatically set up the chat template if needed
# This cell is kept for reference but is no longer required

print("ℹ️  Chat template setup is now handled automatically by training functions")
print("✅ No manual chat template setup needed")

ℹ️  Chat template setup is now handled automatically by training functions
✅ No manual chat template setup needed


## 4. SFT Training (Optional)

Run this section to train only the SFT model.

In [8]:
# SFT Training
RUN_SFT_ONLY = False  # Set to True to run SFT training only

if RUN_SFT_ONLY:
    print("🚀 Starting SFT Training...")
    
    # Update SFT config
    sft_config = ALL_CONFIG["sft"].copy()
    sft_config["output_dir"] = f"../temp_training/{SYSTEM_NAME}/sft"
    
    # Train SFT
    sft_result = train_sft_model(
        manager, train_data, eval_data, sft_config
    )
    
    # Save SFT model
    sft_save_path = save_sft_model(
        manager, [SYSTEM_NAME], sft_result["metrics"]
    )
    
    print(f"✅ SFT model saved to: {sft_save_path}")
else:
    print("⏭️  Skipping SFT-only training (set RUN_SFT_ONLY=True to run)")

⏭️  Skipping SFT-only training (set RUN_SFT_ONLY=True to run)


## 5. SFT + GRPO Training (Combined)

Run this section to train both SFT and GRPO models in sequence.

In [9]:
# Combined SFT + GRPO Training - FIXED VERSION WITH TENSOR DIMENSION FIXES
RUN_COMBINED_TRAINING = True  # Set to True to run full training pipeline

if RUN_COMBINED_TRAINING:
    print("🚀 Starting Combined SFT + GRPO Training...")
    
    # Prepare datasets based on configuration
    if USE_REDUCED_DATASET:
        print(f"🔧 Using reduced dataset: SFT={SFT_SAMPLES}, GRPO={GRPO_SAMPLES} samples")
        sft_train_data = train_data[:SFT_SAMPLES]
        sft_eval_data = eval_data[:min(10, len(eval_data))]  # Small eval set
        grpo_train_data = train_data[:GRPO_SAMPLES] 
        grpo_eval_data = eval_data[:min(5, len(eval_data))]  # Small eval set
    else:
        print("📊 Using full dataset")
        sft_train_data = train_data
        sft_eval_data = eval_data
        grpo_train_data = train_data
        grpo_eval_data = eval_data
    
    # === SFT Phase ===
    print("\n" + "="*60)
    print("📚 SFT TRAINING PHASE")
    print("="*60)
    print(f"   Training samples: {len(sft_train_data)}")
    print(f"   Evaluation samples: {len(sft_eval_data)}")
    
    sft_config = ALL_CONFIG["sft"].copy()
    sft_config["output_dir"] = f"temp_training/{SYSTEM_NAME}/sft"
    
    if USE_REDUCED_DATASET:
        # Reduced training for debugging/testing
        sft_config.update({
            "num_train_epochs": 1,
            "max_steps": SFT_MAX_STEPS,  # Use configured max steps
            "logging_steps": max(1, SFT_MAX_STEPS // 5),  # Log every 20%
            "save_steps": SFT_MAX_STEPS + 10,  # Save at end
            "per_device_train_batch_size": 2,  # Conservative batch size
        })
        print(f"   Max steps: {SFT_MAX_STEPS} (reduced for testing)")
    
    sft_result = train_sft_model(
        manager, sft_train_data, sft_eval_data, sft_config
    )
    
    sft_save_path = save_sft_model(
        manager, [SYSTEM_NAME], sft_result["metrics"]
    )
    
    print(f"✅ SFT model saved to: {sft_save_path}")
    
    # === GRPO Phase ===
    print("\n" + "="*60)
    print("🎮 GRPO TRAINING PHASE (TENSOR-SAFE)")
    print("="*60)
    print(f"   Training samples: {len(grpo_train_data)}")
    print(f"   Evaluation samples: {len(grpo_eval_data)}")
    
    grpo_config = ALL_CONFIG["grpo"].copy()
    grpo_config["output_dir"] = f"temp_training/{SYSTEM_NAME}/grpo"
    
    if USE_REDUCED_DATASET:
        # Conservative settings to avoid tensor dimension issues
        grpo_config.update({
            "max_steps": GRPO_MAX_STEPS,  # Use configured max steps
            "num_generations": 1,  # Single generation to avoid tensor issues
            "per_device_train_batch_size": 1,  # Single batch to avoid issues
            "gradient_accumulation_steps": 2,  # Compensate with accumulation
            "max_completion_length": 256,  # Smaller completion length
            "temperature": 0.8,  # Lower temperature for stability
            "logging_steps": 1,
            "save_steps": GRPO_MAX_STEPS + 10,  # Save at end
        })
        print(f"   Max steps: {GRPO_MAX_STEPS} (reduced for testing)")
        print(f"   ⚠️  Using tensor-safe settings: batch_size=1, num_generations=1")
    
    # FIXED: Use the improved GRPO training function with tensor fixes
    grpo_result = train_grpo_model(
        manager, grpo_train_data, grpo_eval_data, grpo_config,
        ALL_CONFIG["system"]["reasoning_start"],
        ALL_CONFIG["system"]["reasoning_end"],
        ALL_CONFIG["system"]["solution_start"],
        ALL_CONFIG["system"]["solution_end"]
    )
    
    grpo_save_path = save_grpo_model(
        manager, [SYSTEM_NAME], grpo_result["metrics"]
    )
    
    print(f"✅ GRPO model saved to: {grpo_save_path}")
    
    print("\n" + "="*60)
    print("🎉 TRAINING COMPLETED SUCCESSFULLY!")
    print("="*60)
    print(f"📍 SFT model: {sft_save_path}")
    print(f"📍 GRPO model: {grpo_save_path}")
    
    if USE_REDUCED_DATASET:
        print("\n💡 Training completed with conservative tensor-safe settings")
        print("🔧 To scale up for production:")
        print("   1. In cell 3, set USE_REDUCED_DATASET = False")
        print("   2. Increase SFT_SAMPLES, GRPO_SAMPLES as needed")
        print("   3. The training will auto-adjust if tensor issues occur")
        print("   4. Or use SLURM scripts for production training")
    
else:
    print("⏭️  Skipping combined training (set RUN_COMBINED_TRAINING=True to run)")

🚀 Starting Combined SFT + GRPO Training...
🔧 Using reduced dataset: SFT=50, GRPO=30 samples

📚 SFT TRAINING PHASE
   Training samples: 50
   Evaluation samples: 10
   Max steps: 10 (reduced for testing)
⚠️  Chat template not set, setting up default template...
✅ Chat template set up successfully


Map:   0%|          | 0/50 [00:00<?, ? examples/s]

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

Training dataset size: 50
Evaluation dataset size: 10


Unsloth: Tokenizing ["text"]:   0%|          | 0/50 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"]:   0%|          | 0/10 [00:00<?, ? examples/s]

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Starting SFT training...
   Dataset size: 50
   Batch size: 2
   Max sequence length: 1024


==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 50 | Num Epochs = 1 | Total steps = 10
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 1 x 1) = 2
 "-____-"     Trainable parameters = 16,515,072/4,000,000,000 (0.41% trained)


Step,Training Loss,Validation Loss


Unsloth: Will smartly offload gradients to save VRAM!
SFT training completed!
Model saved to models/single_system/double_integrator/sft/run_20250731_170616
✅ SFT model saved to: models/single_system/double_integrator/sft/run_20250731_170616

🎮 GRPO TRAINING PHASE (TENSOR-SAFE)
   Training samples: 30
   Evaluation samples: 5
   Max steps: 5 (reduced for testing)
   ⚠️  Using tensor-safe settings: batch_size=1, num_generations=1
🔧 GRPO sequence length configuration:
   Model max length: 1024
   Max completion length: 512


Map:   0%|          | 0/30 [00:00<?, ? examples/s]

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


Training dataset size: 30
Evaluation dataset size: 5
Using 3 reward functions


ValueError: The global train batch size (1 x 1) must be evenly divisible by the number of generations per prompt (1). Given the current train batch size, the valid values for the number of generations are: [].

## 6. Model Evaluation

Evaluate your trained models on the test dataset.

In [None]:
# Model Evaluation
RUN_EVALUATION = True  # Set to True to run evaluation
EVALUATE_SFT = True   # Set to True to evaluate SFT model
EVALUATE_GRPO = True  # Set to True to evaluate GRPO model

if RUN_EVALUATION:
    print("📊 Starting Model Evaluation...")
    
    # Load models for evaluation
    eval_manager = UniversalModelManager()
    
    if EVALUATE_SFT:
        print("\n🔍 Evaluating SFT Model...")
        try:
            sft_model, sft_tokenizer, sft_lora, sft_metadata = eval_manager.load_single_system_model(
                SYSTEM_NAME, training_type="sft"  # Changed from model_type to training_type
            )
            
            # Generate test cases
            system = get_system(SYSTEM_NAME)()
            test_cases = []
            for _ in range(10):  # 10 test cases
                initial_state = system.generate_random_initial_state()
                test_cases.append(tuple(initial_state))
            
            # Run inference
            from vllm import SamplingParams
            sampling_params = SamplingParams(
                temperature=0.7,
                top_k=50,
                max_tokens=1024
            )
            
            sft_results = run_batch_inference(
                sft_model, sft_tokenizer, SYSTEM_NAME, test_cases,
                lora_request=sft_lora,
                sampling_params=sampling_params
            )
            
            # Compute metrics
            sft_metrics = compute_batch_metrics(sft_results)
            
            print(f"✅ SFT Evaluation Results:")
            print(f"   Success rate: {sft_metrics['success_rate']:.2%}")
            print(f"   Mean performance: {sft_metrics['mean_performance_score']:.4f}")
            
        except Exception as e:
            print(f"❌ SFT evaluation failed: {e}")
    
    if EVALUATE_GRPO:
        print("\n🔍 Evaluating GRPO Model...")
        try:
            grpo_model, grpo_tokenizer, grpo_lora, grpo_metadata = eval_manager.load_single_system_model(
                SYSTEM_NAME, training_type="grpo"  # Changed from model_type to training_type
            )
            
            # Run inference
            grpo_results = run_batch_inference(
                grpo_model, grpo_tokenizer, SYSTEM_NAME, test_cases,
                lora_request=grpo_lora,
                sampling_params=sampling_params
            )
            
            # Compute metrics
            grpo_metrics = compute_batch_metrics(grpo_results)
            
            print(f"✅ GRPO Evaluation Results:")
            print(f"   Success rate: {grpo_metrics['success_rate']:.2%}")
            print(f"   Mean performance: {grpo_metrics['mean_performance_score']:.4f}")
            
        except Exception as e:
            print(f"❌ GRPO evaluation failed: {e}")
            
else:
    print("⏭️  Skipping evaluation (set RUN_EVALUATION=True to run)")

## 7. Visualization & Analysis

In [None]:
# Plot results if evaluation was run
if RUN_EVALUATION and EVALUATE_GRPO and 'grpo_results' in locals():
    print("📈 Generating visualizations...")
    
    # Plot trajectory comparison
    fig1 = plot_comparison(grpo_results)
    if fig1:
        plt.figure(fig1.number)
        plt.suptitle(f"Double Integrator GRPO Model Results", fontsize=16)
        plt.tight_layout()
        plt.show()
    
    # Plot metrics comparison
    fig2 = plot_metrics_comparison(grpo_results)
    if fig2:
        plt.figure(fig2.number)
        plt.suptitle(f"Double Integrator Performance Metrics", fontsize=16)
        plt.tight_layout()
        plt.show()
        
    print("✅ Visualizations complete!")
else:
    print("⏭️  No results to visualize (run evaluation first)")

⏭️  No results to visualize (run evaluation first)


## 8. Summary

This notebook provides a complete workflow for training and evaluating Double Integrator control models:

- **Data Generation**: Create clean DI dataset (1800 train + 200 eval)
- **SFT Training**: Supervised fine-tuning for basic control knowledge
- **GRPO Training**: Reinforcement learning for optimal control
- **Evaluation**: Test model performance on unseen data
- **Visualization**: Plot trajectories and performance metrics

**Model Outputs:**
- SFT model: `models/single_system/double_integrator/sft/latest/`
- GRPO model: `models/single_system/double_integrator/grpo/latest/`

**Next Steps:**
- Use the VDP training notebook for Van der Pol oscillator
- Use the universal training notebook for multi-system models
- Load trained models in other notebooks for further analysis