<div align="center">

# MAROONED: RL Training Pipeline

### Self-Play PPO Training for Multi-Agent Deception

**OpenEnv Hackathon 2025**

[![OpenEnv](https://img.shields.io/badge/Framework-OpenEnv-blue)](https://github.com/openenv)
[![Llama](https://img.shields.io/badge/Model-Llama_3.1_8B-green)](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)
[![Hardware](https://img.shields.io/badge/Hardware-AMD_MI300X-red)](https://www.amd.com/en/products/accelerators/instinct/mi300.html)

</div>

---

## MAROONED: RL Training with PPO

This notebook trains a Llama 3.1 8B model to play a multi-agent survival game that requires deception, cooperation, and long-horizon planning.

**Approach:** Self-play with Proximal Policy Optimization
- Single model controls all 5 sailors (4 colonists + 1 traitor)
- Learns both honest cooperation and strategic deception
- Episodes span up to 10,000 sequential decisions

**Hardware:** AMD MI300X (192GB HBM) with ROCm optimizations

**Status:** Partial submission (100 training steps demonstrated)

For complete game mechanics and environment details, see [README.md](../README.md) and [game_plan.md](../game_plan.md).

---

## 1. Environment Setup

Install Unsloth, TRL, and their dependencies for the AMD MI300X environment.

In [None]:
%%capture
import os, importlib.util
!pip install --upgrade -qqq uv
if importlib.util.find_spec("torch") is None or "COLAB_" in "".join(os.environ.keys()):
    try: import numpy; get_numpy = f"numpy=={numpy.__version__}"
    except: get_numpy = "numpy"
    !uv pip install -qqq \
        "torch>=2.8.0" "triton>=3.4.0" {get_numpy} torchvision bitsandbytes "transformers==4.56.2" trackio \
        "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \
        "unsloth[base] @ git+https://github.com/unslothai/unsloth" \
        git+https://github.com/triton-lang/triton.git@05b2c186c1b6c9a08375389d5efe9cb4c401c075#subdirectory=python/triton_kernels
elif importlib.util.find_spec("unsloth") is None:
    !uv pip install -qqq unsloth trackio
!uv pip install --upgrade --no-deps transformers==4.56.2 tokenizers trl==0.22.2 unsloth unsloth_zoo

## 2. Load MAROONED Environment

Import the custom OpenEnv-compatible environment and verify reward configuration.

In [1]:
import sys
import json
import random
from typing import Dict, Any, List

# Clear cached modules to reload changes
modules_to_clear = [m for m in list(sys.modules.keys()) 
                   if 'marooned' in m or m in ['environment', 'config', 'models', 'game_state', 'view_map', 'llm_interface']]
for module in modules_to_clear:
    if module in sys.modules:
        del sys.modules[module]

sys.path.insert(0, '../marooned_env')

from environment import MaroonedEnv
from llm_interface import observation_to_prompt, parse_action_safe, get_system_prompt
from config import (
    ActionType, ResourceType, MapLevel, ShipComponent, BASE_CAMP_POSITION,
    # Colonist rewards
    REWARD_COLONIST_GATHER_RESOURCE,
    REWARD_COLONIST_DEPOSIT_RESOURCE,
    REWARD_COLONIST_BUILD_CONTRIBUTE,
    REWARD_COLONIST_SHIP_COMPLETE,
    REWARD_COLONIST_TRAITOR_ELIMINATED,
    REWARD_COLONIST_DEATH,
    REWARD_COLONIST_VOTE_CORRECT,
    REWARD_COLONIST_VOTE_WRONG,
    # Traitor rewards
    REWARD_TRAITOR_SABOTAGE_SUCCESS,
    REWARD_TRAITOR_POISON_DEATH,
    REWARD_TRAITOR_SHIP_INCOMPLETE,
    REWARD_TRAITOR_ELIMINATED,
    # Milestone rewards
    REWARD_SHIP_MILESTONE_25,
    REWARD_SHIP_MILESTONE_50,
    REWARD_SHIP_MILESTONE_75,
    # Base penalty
    REWARD_BASE_TURN_PENALTY,
)
from models import Action, Position, Observation

print("MAROONED environment successfully loaded.")
print(f"\nEnvironment Reward Configuration:")
print(f"  Colonist - Resource Gathering: +{REWARD_COLONIST_GATHER_RESOURCE}")
print(f"  Colonist - Resource Deposit: +{REWARD_COLONIST_DEPOSIT_RESOURCE}")
print(f"  Colonist - Ship Construction: +{REWARD_COLONIST_BUILD_CONTRIBUTE}")
print(f"  Colonist - Mission Success: +{REWARD_COLONIST_SHIP_COMPLETE}")
print(f"  Traitor - Sabotage: +{REWARD_TRAITOR_SABOTAGE_SUCCESS}")
print(f"  Traitor - Elimination: +{REWARD_TRAITOR_POISON_DEATH}")
print(f"  Traitor - Mission Success: +{REWARD_TRAITOR_SHIP_INCOMPLETE}")

✅ MAROONED environment loaded!

📋 Reward Configuration:
   Colonist gather: +0.1
   Colonist deposit: +0.2
   Colonist build: +0.5
   Ship complete: +100.0
   Traitor sabotage: +2.0
   Traitor poison kill: +10.0
   Ship incomplete: +100.0


## 3. Hardware Configuration

This cell verifies the ROCm setup on the AMD MI300X. The run uses BF16 precision and requests a Triton attention backend; expected generation throughput is 40-80 tok/s.

In [2]:
# ============================================================================
# Verify ROCm Setup
# ============================================================================
import torch
import os

print("ROCm Environment Check:")
print(f"  PyTorch version: {torch.__version__}")
print(f"  CUDA available: {torch.cuda.is_available()}")
print(f"  GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"  Total VRAM: {props.total_memory / 1024**3:.1f} GB")
    print(f"  Compute capability: {props.major}.{props.minor}")
    print(f"  Multi-processors: {props.multi_processor_count}")
    
    # Check if ROCm
    is_rocm = hasattr(torch.version, 'hip') and torch.version.hip is not None
    print(f"  ROCm detected: {is_rocm}")
    if is_rocm:
        print(f"  ROCm version: {torch.version.hip}")

print("\nEnvironment ready for MI300X optimization.")

🔍 ROCm Environment Check:
   PyTorch version: 2.9.0+rocm6.4
   CUDA available: True
   GPU: AMD Instinct MI300X VF
   Total VRAM: 191.7 GB
   Compute capability: 9.4
   Multi-processors: 304
   ROCm detected: True
   ROCm version: 6.4.43484-123eb5128

✅ Environment ready for MI300X optimization!


## 4. Load Base Language Model

Loading Llama 3.1 8B Instruct as the foundation model for policy learning.

In [3]:
from unsloth import FastLanguageModel
import torch
import os

# ============================================================================
# ROCm/AMD MI300X OPTIMIZATION - MAX PERFORMANCE MODE
# ============================================================================

# Force ROCm optimizations
os.environ["PYTORCH_ROCM_ARCH"] = "gfx942"  # MI300X architecture
os.environ["HSA_FORCE_FINE_GRAIN_PCIE"] = "1"
os.environ["NCCL_DEBUG"] = "WARN"

# Enable Flash Attention for AMD
os.environ["ATTN_BACKEND"] = "triton"  # Use Triton for attention on AMD

# Max out GPU utilization (ROCm-compatible settings)
# Note: TF32 is NVIDIA-specific and not available on AMD ROCm
torch.backends.cudnn.benchmark = True  # Auto-tune kernels for optimal performance

print("🚀 ROCm Optimizations Enabled!")
print(f"   GPU: {torch.cuda.get_device_name(0)}")
print(f"   VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

# MI300X has 192GB - use it ALL!
max_seq_length = 16384  # Increased to fit full observations (~8700) + completions (~300)
lora_rank = 16          # Increased from 4 - MI300X can handle it!

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct",  # Llama instead of GPT-OSS (10-20x faster generation)
    load_in_4bit = False,  # MI300X has 192GB - use full BF16!
    max_seq_length = max_seq_length,
    dtype = torch.bfloat16,  # BF16 for MI300X
    device_map = "auto",  # Let it auto-optimize for MI300X
)

print("✅ Llama 3.1 8B loaded in BF16!")
print("   Why Llama instead of GPT-OSS:")
print("   - GPT-OSS: 3-8 tok/s (chain-of-thought overhead)")
print("   - Llama 3.1 8B: 40-80 tok/s (optimized for speed)")
print("   - 10-20x FASTER for RL training!")


bitsandbytes library load error: Configured ROCm binary not found at /root/AIAC/colony-collapse/.venv/lib/python3.12/site-packages/bitsandbytes/libbitsandbytes_rocm64.so
Traceback (most recent call last):
  File "/root/AIAC/colony-collapse/.venv/lib/python3.12/site-packages/bitsandbytes/cextension.py", line 313, in <module>
    lib = get_native_library()
          ^^^^^^^^^^^^^^^^^^^^
  File "/root/AIAC/colony-collapse/.venv/lib/python3.12/site-packages/bitsandbytes/cextension.py", line 282, in get_native_library
    raise RuntimeError(f"Configured {BNB_BACKEND} binary not found at {cuda_binary_path}")
RuntimeError: Configured ROCm binary not found at /root/AIAC/colony-collapse/.venv/lib/python3.12/site-packages/bitsandbytes/libbitsandbytes_rocm64.so


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm
    PyTorch 2.8.0+cu128 with CUDA 1208 (you have 2.9.0+rocm6.4)
    Python  3.9.23 (you have 3.12.3)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


Switching to PyTorch attention since your Xformers is broken.

Unsloth: Xformers was not installed correctly.
Please install xformers separately first.
Then confirm if it's correctly installed by running:
python -m xformers.info

Longer error message:
xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.8.0+cu128 with CUDA 1208 (you have 2.9.0+rocm6.4)
    Python  3.9.23 (you have 3.12.3)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
🦥 Unsloth Zoo will now patch everything to make training faster!
🚀 ROCm Optimizations Enabled!
   GPU: AMD Instinct MI300X VF
   VRAM: 191.7 GB
Unsloth: AMD currently is not stable with 4bit bitsandbytes. Disabling for now.
==((====))==  Unsloth 2025

INFO:accelerate.utils.modeling: We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|██████████| 4/4 [00:05<00:00,  1.48s/it]



✅ Llama 3.1 8B loaded in BF16!
   Why Llama instead of GPT-OSS:
   - GPT-OSS: 3-8 tok/s (chain-of-thought overhead)
   - Llama 3.1 8B: 40-80 tok/s (optimized for speed)
   - 10-20x FASTER for RL training!


## 5. Configure LoRA

Parameter-efficient fine-tuning with rank-16 adapters.

In [4]:
# ============================================================================
# LoRA CONFIG FOR MI300X
# ============================================================================

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,  # 16 instead of 4 - MI300X can handle it!
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = lora_rank * 2,  # 32 for faster convergence
    lora_dropout = 0.0,  # Disable dropout for speed
    use_gradient_checkpointing = "unsloth",  # Memory efficient
    random_state = 3407,
    use_rslora = True,  # Rank-stabilized LoRA for better training
)


Unsloth 2025.10.9 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
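
As a rough check on what rank-16 adapters cost, the sketch below counts the LoRA parameters added to the seven target modules and compares the standard LoRA scaling factor with the rank-stabilized one enabled by `use_rslora = True`. The layer dimensions are assumptions taken from the public Llama 3.1 8B configuration, not values read from this notebook, and the helper names are hypothetical.

```python
import math

# Assumed Llama 3.1 8B dimensions (public model config, not read from this notebook).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 1024       # 8 key/value heads x 128 head dimension
NUM_LAYERS = 32

# (in_features, out_features) of each targeted projection in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int) -> int:
    """Each adapted layer adds A (rank x d_in) and B (d_out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS


def lora_scaling(alpha: float, rank: int, use_rslora: bool) -> float:
    """Standard LoRA scales by alpha / r; rank-stabilized LoRA by alpha / sqrt(r)."""
    return alpha / math.sqrt(rank) if use_rslora else alpha / rank


rank = 16
alpha = rank * 2
print(f"Trainable LoRA parameters: {lora_param_count(rank):,}")
print(f"rsLoRA scaling: {lora_scaling(alpha, rank, True)}")
print(f"Standard scaling: {lora_scaling(alpha, rank, False)}")
```

Under these assumptions the adapters stay well below 1% of the 8B base weights, which is why rank 16 is inexpensive on this hardware.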


## 6. Baseline Evaluation

Test untrained model performance on a sample scenario.

In [5]:
# Test environment
env = MaroonedEnv(render_mode="ansi", seed=42)
observations = env.reset(seed=42)

# Get Alice's observation
alice_obs = observations["Alice"]
alice_role = env.state.sailors["Alice"].role.value

# Generate prompt
system_prompt = get_system_prompt(alice_role)
user_prompt = observation_to_prompt(alice_obs)

# Enable inference mode
FastLanguageModel.for_inference(model)

# Create full chat messages: system (rules) + user (current state)
messages = [
    {"role": "system", "content": system_prompt},  # Game rules (constant)
    {"role": "user", "content": user_prompt}       # Current observation (changes each turn)
]

# Apply chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print(f"📏 Prompt length: {len(tokenizer(text)['input_ids'])} tokens\n")
print("🤖 Generating response...\n")

# Tokenize and move to GPU
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_seq_length).to("cuda")

# Generate response
outputs = model.generate(
    **inputs,
    max_new_tokens=256,        # Shorter for untrained model (reduce rambling)
    temperature=0.3,           # Balanced: not too creative, not too rigid
    do_sample=True,            # Enable sampling so temperature, top_p, and top_k take effect
    top_p=0.9,                 # Narrower sampling (was 0.95)
    top_k=40,                  # Fewer options (was 50)
    repetition_penalty=1.2,    # Stronger anti-repeat (was 1.1)
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

# Decode response (strip input prompt)
response = tokenizer.decode(outputs[0][len(inputs['input_ids'][0]):], skip_special_tokens=True).strip()

# Clean up any observation leakage (model sometimes echoes the prompt)
if "REASONING:" in response:
    # Extract only from REASONING onward
    reasoning_start = response.find("REASONING:")
    response = response[reasoning_start:]

print(response)


📏 Prompt length: 8852 tokens

🤖 Generating response...



REASONING: As the traitor, my goal is to prevent the group from escaping the island. Since we are currently at 0% total ship progress, sabotaging any part of the ship would significantly hinder our chances of completing it within the given timeframe. Additionally, poisoning Alice could weaken her and make her more vulnerable to elimination, which might not be ideal as she seems relatively strong. Therefore, I should focus on delaying progress rather than directly targeting individuals.

ACTION: SABOTAGE hull
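
The response above follows a `REASONING:` / `ACTION:` layout. The sketch below shows one way such text could be split into its two parts. It is not the project's `parse_action_safe`; the function name `split_response` and the regular expressions are illustrative assumptions.

```python
import re
from typing import Optional, Tuple

# Hypothetical patterns for the "REASONING: ... ACTION: ..." layout shown above.
_REASONING_RE = re.compile(r"REASONING:\s*(.*?)(?=^ACTION:|\Z)", re.DOTALL | re.MULTILINE)
_ACTION_RE = re.compile(r"^ACTION:\s*(.+)$", re.MULTILINE)


def split_response(response: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning, action); either part is None when its marker is missing."""
    reasoning = _REASONING_RE.search(response)
    action = _ACTION_RE.search(response)
    return (
        reasoning.group(1).strip() if reasoning else None,
        action.group(1).strip() if action else None,
    )


sample = "REASONING: I should delay progress.\n\nACTION: SABOTAGE hull"
print(split_response(sample))
```

Returning `None` for a missing marker lets the caller decide how to penalize malformed output instead of raising.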


## 7. Training Data Format

Real game observations structured as system + user prompts (~8,700 tokens total).
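
A minimal sketch of that structure, assuming each sample is a two-message chat (system rules plus the current observation) and that the prompt and completion must fit together inside the context window. `build_messages` and `fits_context` are hypothetical helpers; the token counts come from the baseline run in this notebook.

```python
from typing import Dict, List


def build_messages(system_prompt: str, user_prompt: str) -> List[Dict[str, str]]:
    """System message carries the game rules; user message carries the observation."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


def fits_context(prompt_tokens: int, max_new_tokens: int, max_seq_length: int) -> bool:
    """True when the prompt and the generated completion fit in the context window."""
    return prompt_tokens + max_new_tokens <= max_seq_length


messages = build_messages("<game rules>", "<current observation>")
print([m["role"] for m in messages])
print(fits_context(8852, 256, 16384))  # baseline prompt with the 16,384-token limit
print(fits_context(8852, 256, 8192))   # the same prompt would not fit in 8,192 tokens
```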

## 8. Episode Execution

Functions to play complete game episodes and collect training data.

In [7]:
from typing import Callable, Dict, Optional, Tuple
import random

def execute_game_episode(
    strategy_func: Callable[[str, str], str],
    max_turns: int = 50,
    sailor_id: str = "Alice",
    seed: Optional[int] = None,
    verbose: bool = False
) -> Tuple[float, int, bool, Dict]:
    """
    Execute a full game episode with the given strategy.

    Args:
        strategy_func: Function that takes (system_prompt, user_prompt) and returns the response string
        max_turns: Maximum turns to execute
        sailor_id: Which sailor the strategy controls
        seed: Random seed for reproducibility
        verbose: Print debug info
    
    Returns:
        (total_reward, turns_executed, game_won, info_dict)
    """
    try:
        # Initialize environment
        if seed is None:
            seed = random.randint(0, 999999)
        
        env = MaroonedEnv(render_mode="ansi", seed=seed)
        observations = env.reset(seed=seed)
        
        # Check if sailor exists
        if sailor_id not in env.agents:
            return -100.0, 0, False, {"error": f"Sailor {sailor_id} not in game"}
        
        sailor_role = env.state.sailors[sailor_id].role.value
        system_prompt = get_system_prompt(sailor_role)
        
        total_reward = 0.0
        turns_executed = 0
        game_won = False
        action_counts = {}
        
        for turn in range(max_turns):
            # Check if sailor is alive
            if not env.state.sailors[sailor_id].alive:
                if verbose:
                    print(f"   💀 {sailor_id} died at turn {turn}")
                break
            
            # Get observation
            current_obs = observations[sailor_id]
            
            # Generate response using strategy
            user_prompt = observation_to_prompt(current_obs)
            response = strategy_func(system_prompt, user_prompt)
            
            # Parse action
            action = parse_action_safe(response, sailor_id, current_obs.position)
            
            # Track action types
            action_type = action.action_type.value
            action_counts[action_type] = action_counts.get(action_type, 0) + 1
            
            # Execute action (all other sailors WAIT)
            actions_dict = {
                sid: Action(sailor_id=sid, action_type=ActionType.WAIT)
                for sid in env.agents
            }
            actions_dict[sailor_id] = action
            
            # Step environment
            observations, rewards, dones, truncated, info = env.step(actions_dict)
            
            # Accumulate reward
            reward = rewards[sailor_id]
            total_reward += reward
            turns_executed += 1
            
            if verbose and turn % 10 == 0:
                print(f"   Turn {turn}: Action={action_type}, Reward={reward:.2f}, Total={total_reward:.2f}")
            
            # Check if game ended
            if dones[sailor_id]:
                game_won = env.state.ship_progress.total_percentage >= 100.0
                if verbose:
                    print(f"   🏁 Game ended at turn {turn}")
                    print(f"   Ship: {env.state.ship_progress.total_percentage}%")
                break
        
        info_dict = {
            "turns": turns_executed,
            "final_reward": total_reward,
            "game_won": game_won,
            "action_counts": action_counts,
            "ship_progress": env.state.ship_progress.total_percentage,
            "alive": env.state.sailors[sailor_id].alive,
        }
        
        return total_reward, turns_executed, game_won, info_dict
        
    except Exception as e:
        if verbose:
            print(f"   ❌ Exception: {str(e)[:200]}")
        return -50.0, 0, False, {"error": str(e)[:200]}

print("✅ Game execution wrapper created!")

✅ Game execution wrapper created!
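
`execute_game_episode` calls `strategy_func(system_prompt, user_prompt)` and expects a response string back. The sketch below is a scripted baseline with that signature. Whether `parse_action_safe` accepts the literal text `ACTION: WAIT` is an assumption, and `always_wait_strategy` is a hypothetical name.

```python
def always_wait_strategy(system_prompt: str, user_prompt: str) -> str:
    """Scripted baseline: ignore both prompts and always ask to wait."""
    return "REASONING: Scripted baseline that never acts.\nACTION: WAIT"


# Intended use with the wrapper defined above (requires the MAROONED environment):
# total_reward, turns, won, info = execute_game_episode(
#     always_wait_strategy, max_turns=10, seed=42, verbose=True
# )
print(always_wait_strategy("<game rules>", "<current observation>"))
```

A fixed baseline like this gives a reference reward against which a model-driven strategy can be compared.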


## 9. Reward Shaping

Rewards combine three components: format validation (-2 to +1), action quality (-10 to +13), and environment rewards (±100).
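
One way the three components could be combined is sketched below. Only the ranges come from this notebook; the unweighted sum, the clamping, and the names `clamp` and `shaped_reward` are assumptions made for illustration.

```python
def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def shaped_reward(format_score: float, action_score: float, env_reward: float) -> float:
    """Sum the three components after clamping each to its documented range."""
    return (
        clamp(format_score, -2.0, 1.0)        # format validation
        + clamp(action_score, -10.0, 13.0)    # action quality
        + clamp(env_reward, -100.0, 100.0)    # environment reward
    )


# A well-formatted, mildly useful action with a small environment reward.
print(shaped_reward(1.0, 0.5, 0.1))
```

Clamping each term keeps a bug in one scorer from dominating the signal, although the ±100 terminal rewards still outweigh the other two components by design.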

## 10. Self-Play Training

Single model controls all 5 sailors. Enables role generalization and emergent strategies.
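
The idea can be sketched with a stub in place of the language model: one policy function is queried for every sailor, and only the role passed to it differs. The role strings, the stub actions, and all function names here are illustrative assumptions rather than the environment's API.

```python
import random
from typing import Callable, Dict, List

SAILORS = ["Alice", "Bob", "Charlie", "Diana", "Eve"]


def assign_roles(sailors: List[str], rng: random.Random) -> Dict[str, str]:
    """Pick exactly one traitor at random; everyone else is honest."""
    traitor = rng.choice(sailors)
    return {s: ("traitor" if s == traitor else "honest") for s in sailors}


def stub_policy(role: str, sailor: str) -> str:
    """Stand-in for the shared LLM policy: behavior depends only on the role."""
    return "ACTION: SABOTAGE hull" if role == "traitor" else "ACTION: WAIT"


def self_play_turn(policy: Callable[[str, str], str], roles: Dict[str, str]) -> Dict[str, str]:
    """Query the same policy once per sailor, conditioning on that sailor's role."""
    return {sailor: policy(role, sailor) for sailor, role in roles.items()}


roles = assign_roles(SAILORS, random.Random(0))
for sailor, action in self_play_turn(stub_policy, roles).items():
    print(f"{sailor} ({roles[sailor]}): {action}")
```

Because every transition comes from the same weights, experience gathered as the traitor and as a colonist updates one shared set of parameters.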

## 11. PPO Configuration

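PPO is a widely used algorithm for LLM reinforcement learning. It optimizes a clipped surrogate objective, uses a value network to estimate advantages, and reuses each batch of rollouts for several optimization epochs.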
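
For intuition, the clipped surrogate term for a single action can be written in a few lines. This is a scalar sketch of the standard PPO objective, not TRL's implementation; the function name is hypothetical and the default `cliprange` mirrors the 0.2 used in this notebook.

```python
import math


def clipped_surrogate(logp_new: float, logp_old: float, advantage: float,
                      cliprange: float = 0.2) -> float:
    """PPO objective for one action: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - cliprange, min(1.0 + cliprange, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# The new policy makes the action 1.5x more likely than the old policy did.
logp_old = 0.0
logp_new = math.log(1.5)
print(clipped_surrogate(logp_new, logp_old, advantage=1.0))   # gain is capped by the clip
print(clipped_surrogate(logp_new, logp_old, advantage=-1.0))  # penalty is not capped
```

The asymmetry is the point of the clip: the policy gets no extra credit for moving further than the clip range in a good direction, but it is still fully penalized for moving toward a bad action.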

In [27]:
# ============================================================================
# CORRECT IMPORT ORDER for Unsloth + TRL PPO
# ============================================================================
import unsloth  # ⚡ this patches TRL internally

from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead
from datasets import Dataset

# ============================================================================
# PPO TRAINING CONFIGURATION
# ============================================================================
ppo_config = PPOConfig(
    output_dir="outputs_marooned_rl",
    learning_rate=1e-5,
    batch_size=4,
    mini_batch_size=1,
    gradient_accumulation_steps=4,
    seed=42,
    num_ppo_epochs=4,
    kl_coef=0.2,
    kl_estimator='k1',
    vf_coef=0.1,
    cliprange=0.2,
    cliprange_value=0.2,
    gamma=0.99,
    lam=0.95,
    temperature=0.3,
    response_length=256,
)

print("✅ PPO Configuration ready")

# ============================================================================
# MODEL SETUP
# ============================================================================
print("\n🔧 Wrapping model with value head...")
model_with_value = AutoModelForCausalLMWithValueHead.from_pretrained(model)
print("✅ Model wrapped!")

# ============================================================================
# MINIMAL TRAIN DATASET
# ============================================================================
train_dataset = Dataset.from_dict({
    "prompt": ["stub prompt"],
    "response": ["stub response"],
    "reward": [0.0],
})
# ============================================================================
# PPO TRAINER INITIALIZATION (🚀 FINAL VERSION — FULLY COMPATIBLE)
# ============================================================================

print("\n🎯 Initializing PPO Trainer...")

base_model = model_with_value.pretrained_model  # inside Unsloth wrapper

# --- Compatibility Patches ---
if not hasattr(model_with_value, "base_model_prefix"):
    model_with_value.base_model_prefix = getattr(base_model, "base_model_prefix", "model")

setattr(model_with_value, model_with_value.base_model_prefix, base_model)

if not hasattr(model_with_value, "config"):
    model_with_value.config = base_model.config

if not hasattr(model_with_value, "generation_config"):
    model_with_value.generation_config = base_model.generation_config

# 🩵 Add gradient checkpointing compatibility
if hasattr(base_model, "is_gradient_checkpointing"):
    model_with_value.is_gradient_checkpointing = base_model.is_gradient_checkpointing
else:
    model_with_value.is_gradient_checkpointing = False  # default safe fallback

# --- Initialize PPO Trainer ---
ppo_trainer = PPOTrainer(
    args=ppo_config,
    model=model_with_value,
    ref_model=None,                 # ✅ Unsloth auto-handles freezing
    reward_model=model_with_value,
    value_model=model_with_value,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)





✅ PPO Configuration ready

🔧 Wrapping model with value head...
✅ Model wrapped!

🎯 Initializing PPO Trainer...
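
The `gamma` and `lam` settings in the configuration above control generalized advantage estimation (GAE). The sketch below computes GAE for one finished trajectory in plain Python. It assumes the value after the final step is zero and is meant as an illustration of the formula, not as the code path TRL runs; `compute_gae` is a hypothetical name.

```python
from typing import List


def compute_gae(rewards: List[float], values: List[float],
                gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Generalized advantage estimation over one episode that has terminated."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Bootstrap from the next value estimate; zero after the last step.
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Two steps with reward 1.0 each and a value network that predicts 0 everywhere.
print(compute_gae([1.0, 1.0], [0.0, 0.0]))
```

With `gamma = lam = 1` and zero value estimates, the result reduces to the plain reward-to-go, which is a handy sanity check.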


## 12. Training Loop

**100 steps demonstrated** (~30-60 minutes on MI300X). A full training run is estimated to need 500-1000 steps (~3-5 hours).

**Expected behavior during training**:
- **Reasoning evolution**: From random exploration → resource gathering → strategic deception
- **Action diversity**: Initially random moves, gradually learns GATHER → DEPOSIT → BUILD sequences
- **Emergent strategies**: 
  - Colonists learn to coordinate (gather nearby resources, return to base)
  - Traitor learns to blend in (gather resources publicly, sabotage when alone)
  - Social dynamics emerge (accusations based on evidence logs)
- **Reward progression**: Negative rewards early (movement penalties) → positive later (ship milestones)
- **Parse failures decrease**: Untrained model hallucinates invalid actions (NORTHEAST, CHECK_STATUS), trained model outputs valid game actions
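
Two of these signals, parse failures and reward progression, are easy to track numerically. The sketch below assumes action names are available as strings; `VALID_ACTIONS` is an illustrative subset rather than the environment's full action list, and both helper names are hypothetical.

```python
from typing import List, Set

# Illustrative subset of action names; the real list lives in the environment's ActionType.
VALID_ACTIONS: Set[str] = {
    "MOVE_NORTH", "MOVE_EAST", "GATHER", "DEPOSIT", "BUILD", "SABOTAGE", "WAIT",
}


def parse_failure_rate(action_names: List[str], valid_actions: Set[str]) -> float:
    """Fraction of emitted action names that are not valid game actions."""
    if not action_names:
        return 0.0
    failures = sum(1 for name in action_names if name.upper() not in valid_actions)
    return failures / len(action_names)


def moving_average(values: List[float], window: int = 10) -> float:
    """Mean of the most recent `window` values (all values if fewer exist)."""
    recent = values[-window:]
    return sum(recent) / len(recent) if recent else 0.0


emitted = ["GATHER", "NORTHEAST", "CHECK_STATUS", "WAIT"]
print(parse_failure_rate(emitted, VALID_ACTIONS))
print(moving_average([-1.0, -0.5, 0.5, 2.0], window=2))
```

Logging both values per training step makes it possible to check whether the trends described above actually appear in a given run.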


In [None]:
# ============================================================================
# PPO-BASED EPISODE ROLLOUT + TRAINING LOOP (FIXED FOR UNSLOTH PPO)
# ============================================================================
import torch, time, numpy as np
from datasets import Dataset

# --- CONFIG ---
NUM_TRAINING_STEPS = 100
EPISODE_MAX_SEQ_LENGTH = 16384  # Must fit the full prompt (~8,700 tokens) plus a 256-token completion

# ============================================================================
# EPISODE GENERATION (FIXED: Proper model generation)
# ============================================================================
def generate_episode_for_ppo(max_turns=100, verbose=False):
    """
    Play one episode of MaroonedEnv and format data for PPO training.
    
    Returns:
        query_tensors: list of tokenized prompts
        response_tensors: list of tokenized model outputs
        rewards_list: list of reward tensors
    """
    env = MaroonedEnv(render_mode="ansi")
    observations = env.reset()
    sailor_ids = list(env.agents)
    
    query_tensors, response_tensors, rewards_list = [], [], []
    
    if verbose:
        print(f"\n🎮 Starting episode...")
        for sid in sailor_ids:
            role = env.state.sailors[sid].role.value
            print(f"   - {sid}: {role.upper()}")
    
    # Enable inference mode for generation
    FastLanguageModel.for_inference(model)
    
    for turn in range(max_turns):
        for sailor_id in sailor_ids:
            if not env.state.sailors[sailor_id].alive:
                continue
            
            obs = observations[sailor_id]
            role = env.state.sailors[sailor_id].role.value
            
            # --- Prompt creation ---
            system_prompt = get_system_prompt(role)
            user_prompt = observation_to_prompt(obs)
            
            messages = [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ]
            
            text = tokenizer.apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=True
            )
            
            # --- Tokenize (truncates only prompts longer than EPISODE_MAX_SEQ_LENGTH) ---
            inputs = tokenizer(
                text,
                return_tensors="pt",
                truncation=True,
                max_length=EPISODE_MAX_SEQ_LENGTH  # ✅ Use full 16384 tokens
            ).to("cuda")
            
            query_tensor = inputs["input_ids"][0]
            
            # Validate prompt length (debug first episode)
            if verbose and turn == 0 and sailor_id == sailor_ids[0]:
                prompt_len = len(query_tensor)
                print(f"\n   📏 Prompt length: {prompt_len} tokens (max: {EPISODE_MAX_SEQ_LENGTH})")
                if prompt_len >= EPISODE_MAX_SEQ_LENGTH - 10:
                    print(f"   ⚠️  WARNING: Prompt may be truncated!")
            
            # --- Generate response (use BASE model, not wrapped) ---
            with torch.no_grad():
                outputs = model.generate(  # ✅ Use base model (has Unsloth optimizations)
                    **inputs,
                    max_new_tokens=256,
                    temperature=0.3,
                    do_sample=True,
                    top_p=0.9,
                    top_k=40,
                    repetition_penalty=1.2,
                    pad_token_id=tokenizer.eos_token_id,
                    eos_token_id=tokenizer.eos_token_id,
                )
                response_tensor = outputs[0]
            
            # --- Decode (extract only new tokens) ---
            response_text = tokenizer.decode(
                response_tensor[len(query_tensor):],
                skip_special_tokens=True
            ).strip()
            
            if verbose and turn == 0:
                print(f"\n   📝 Sample response for {sailor_id}:")
                print(f"      {response_text[:200]}...")
            
            # --- Action parsing ---
            action = parse_action_safe(response_text, sailor_id, obs.position)
            
            actions_dict = {
                sid: Action(sailor_id=sid, action_type=ActionType.WAIT)
                for sid in env.agents
            }
            actions_dict[sailor_id] = action
            
            # --- Step environment ---
            observations, rewards_dict, dones, truncated, info = env.step(actions_dict)
            
            # --- Store experience ---
            query_tensors.append(query_tensor)
            response_tensors.append(response_tensor[len(query_tensor):])
            rewards_list.append(torch.tensor(rewards_dict[sailor_id], dtype=torch.float32))
            
            if verbose and turn % 10 == 0:
                print(f"   Turn {turn:03d} | {sailor_id}: {action.action_type.value:<10} | "
                      f"Reward = {rewards_dict[sailor_id]:+.2f}")
            
            if dones[sailor_id]:
                if verbose:
                    print(f"\n🏁 {sailor_id} finished at turn {turn}")
                return query_tensors, response_tensors, rewards_list
    
    if verbose:
        print(f"\n⏱️ Max turns reached ({max_turns})")
    return query_tensors, response_tensors, rewards_list


# ============================================================================
# MAIN PPO TRAINING LOOP
# ============================================================================
print("Starting PPO training...\n")

stats_rewards, stats_lengths = [], []

for step in range(NUM_TRAINING_STEPS):
    start_time = time.time()
    batch_queries, batch_responses, batch_rewards = [], [], []
    
    for episode_idx in range(ppo_config.batch_size):
        queries, responses, rewards = generate_episode_for_ppo(
            max_turns=100,
            verbose=(step % 50 == 0 and episode_idx == 0)  # print 1 verbose episode every 50 steps
        )
        batch_queries.extend(queries)
        batch_responses.extend(responses)
        batch_rewards.extend(rewards)
    
    # --- PPO step ---
    stats = ppo_trainer.step(batch_queries, batch_responses, batch_rewards)
    
    # --- Track metrics ---
    episode_reward = sum([r.item() for r in batch_rewards])
    stats_rewards.append(episode_reward)
    stats_lengths.append(len(batch_rewards))
    
    elapsed = time.time() - start_time
    avg_reward = np.mean(stats_rewards[-10:]) if len(stats_rewards) >= 10 else np.mean(stats_rewards)
    
    print(f"Step {step+1}/{NUM_TRAINING_STEPS} | "
          f"Reward: {episode_reward:+.1f} | "
          f"Avg(10): {avg_reward:+.1f} | "
          f"Turns: {len(batch_rewards)} | "
          f"Time: {elapsed:.1f}s")
    
    # --- Checkpoint ---
    if (step + 1) % 50 == 0:
        checkpoint_path = f"outputs_marooned_rl/checkpoint_step{step+1}"
        ppo_trainer.save_pretrained(checkpoint_path)
        print(f"   💾 Saved checkpoint → {checkpoint_path}")

print("\n✅ PPO training complete!")


🚀 Starting PPO training...


🎮 Starting episode...
   - Alice: TRAITOR
   - Bob: HONEST
   - Charlie: HONEST
   - Diana: HONEST
   - Eve: HONEST

   📏 Prompt length: 8912 tokens (max: 16384)



   📝 Sample response for Alice:
      REASONING: As the traitor, my goal is to maintain secrecy and gain an upper hand in eliminating potential threats. Since there are four other sailors who could potentially discover my true identity, I...
   Turn 000 | Alice: move_east  | Reward = -0.01

   📝 Sample response for Bob:
      REASONING: I should move towards the nearby woods to collect more resources and contribute to our shared goal of escaping the island as soon as possible. The visible resources include several types su...
   Turn 000 | Bob: move_north | Reward = +0.04

   📝 Sample response for Charlie:
      REASONING: I need to start gathering resources as soon as possible since our ship progress 

## 13. Post-Training Evaluation

Evaluate the trained model on a complete game episode to assess learned behaviors.

In [None]:
print("="*80)
print("🧠 TESTING TRAINED PPO MODEL")
print("="*80)

# Play one test episode with trained PPO model
queries, responses, rewards = generate_episode_for_ppo(max_turns=100, verbose=True)


print("FINAL EPISODE STATISTICS")
print(f"   Total turns: {len(rewards)}")
print(f"   Total reward: {sum([r.item() for r in rewards]):.2f}")
print(f"   Average reward/turn: {np.mean([r.item() for r in rewards]):.2f}")


## 14. Model Persistence

Save the trained policy and value networks for deployment and further evaluation.

In [None]:
# Save final PPO model
save_path = "outputs_marooned_rl/final_ppo_model"

print(f"Saving final PPO model to {save_path}...")

# Save PPO trainer (includes model + value head)
ppo_trainer.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

print("\n🎉 Training complete!")
