# 🏴‍☠️ MAROONED x OpenEnv - RL Training Notebook

Train GPT-OSS 20B to play **Marooned** using Reinforcement Learning.

## 🎯 What This Notebook Does

### Training Flow:
```
1. Environment State → observation.to_text() [3500 chars of game context]
2. LLM reads context → reasons about strategy
3. LLM outputs: "ACTION: MOVE NORTH\nREASONING: ...\nMESSAGE: ..."
4. llm_interface.py parses → converts to Action object
5. env.step(actions) → executes & returns Phase 4 rewards
6. GRPO uses rewards to improve LLM reasoning
```
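
Steps 3 and 4 of the flow above can be sketched as follows. This is a minimal illustration of the three-field response format, not the actual code in `llm_interface.py`; the name `split_response` and the regular expression are assumptions made for this example. It mirrors the documented behaviour of falling back to WAIT when no ACTION field is present.

```python
import re

# Hypothetical stand-in for the real parser in llm_interface.py.
# Matches lines such as "ACTION: MOVE NORTH 3" at the start of a line.
FIELD_RE = re.compile(r"^(ACTION|REASONING|MESSAGE):[ \t]*(.*)$", re.MULTILINE)

def split_response(text):
    """Split a completion into its ACTION / REASONING / MESSAGE fields."""
    fields = {
        name.lower(): value.strip().strip('"')
        for name, value in FIELD_RE.findall(text)
    }
    # Fall back to WAIT when the ACTION field is missing or empty.
    if not fields.get("action"):
        fields["action"] = "WAIT"
    return fields

example = (
    "ACTION: MOVE NORTH 3\n"
    "REASONING: Moving toward visible wood\n"
    'MESSAGE: "Heading north"'
)
parsed = split_response(example)
print(parsed["action"])
print(split_response("I think we should go north.")["action"])
```

The real parser additionally converts the action string into an `Action` object, which this sketch does not attempt.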

## Game Summary
- **5 sailors** shipwrecked on a mysterious island
- **4 colonists** (cooperate to escape) vs **1 traitor** (sabotage mission)
- **100 days** to rebuild ship and escape
- **Social deduction** + survival + resource management

## What's Already Built ✅
- ✅ Full environment (`marooned_env/`)
- ✅ All actions (movement, gathering, building, voting, sabotage)
- ✅ **Phase 4 reward system** (11 different reward signals built-in!)
- ✅ Observation rendering (`observation.to_text()`)
- ✅ LLM interface (`parse_action_safe()`)
- ✅ Multi-agent support (5 simultaneous agents)

---


## 📦 Installation

In [None]:
%%capture
import os, importlib.util
!pip install --upgrade -qqq uv

if importlib.util.find_spec("torch") is None or "COLAB_" in "".join(os.environ.keys()):
    try: import numpy; get_numpy = f"numpy=={numpy.__version__}"
    except: get_numpy = "numpy"
    !uv pip install -qqq \
        "torch>=2.8.0" "triton>=3.4.0" {get_numpy} torchvision bitsandbytes "transformers==4.56.2" trackio \
        "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \
        "unsloth[base] @ git+https://github.com/unslothai/unsloth" \
        git+https://github.com/triton-lang/triton.git@05b2c186c1b6c9a08375389d5efe9cb4c401c075#subdirectory=python/triton_kernels
elif importlib.util.find_spec("unsloth") is None:
    !uv pip install -qqq unsloth trackio
    
!uv pip install --upgrade --no-deps transformers==4.56.2 tokenizers trl==0.22.2 unsloth unsloth_zoo

## 🗺️ Load Your Marooned Environment

Using your existing fully-implemented environment!

In [1]:
import sys
sys.path.insert(0, './marooned_env')

from environment import MaroonedEnv
from models import Action, Observation
from config import ActionType, ResourceType, MapLevel

# Create environment
env = MaroonedEnv(render_mode="ansi", seed=42)

print("✅ Marooned environment loaded!")
print(f"   Agents: {env.agents}")
print(f"   Metadata: {env.metadata}")

✅ Marooned environment loaded!
   Agents: ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve']
   Metadata: {'render_modes': ['human', 'rgb_array', 'ansi'], 'name': 'Marooned-v1'}


## 🧪 Test Environment - See Observation Format

In [2]:
# Reset and get initial observations
observations = env.reset(seed=42)

# Check Alice's observation (YOUR observation.to_text() method already exists!)
alice_obs = observations["Alice"]

print("🔍 Alice's Observation (using YOUR built-in to_text method):")
print("="*80)
print(alice_obs.to_text()[:2000])  # First 2000 chars
print("...\n(truncated for display)\n")
print(f"Ship Progress: {alice_obs.ship_progress.total_percentage}%")
print(f"Phase: {alice_obs.phase}")
print(f"Energy: {alice_obs.energy}/100")

🔍 Alice's Observation (using YOUR built-in to_text method):
DAY 1, TURN 1/100 - MORNING PHASE

PHASE CONTEXT:
  Location: All sailors at BASE CAMP
  Allowed: Planning, discussions, voting (if called)
  Restricted: Cannot explore or gather resources yet

YOUR STATUS (Alice):
  Position: (15, 15, <MapLevel.GROUND: 0>)
  Energy: 100/100 ⚡⚡⚡⚡⚡
  Health: healthy
  Backpack: 0/20 items
    (empty)

WHAT YOU SEE (within 5 tiles):
  Resources:
    - WOOD_34 (wood) at (16, 16, <MapLevel.GROUND: 0>) - 1 units [2 tiles away]
    - METAL_53 (metal) at (14, 11, <MapLevel.GROUND: 0>) - 1 units [5 tiles away]
    - METAL_56 (metal) at (18, 12, <MapLevel.GROUND: 0>) - 1 units [6 tiles away]
    - METAL_76 (metal) at (14, 11, <MapLevel.GROUND: 0>) - 1 units [5 tiles away]
    - METAL_79 (metal) at (13, 18, <MapLevel.GROUND: 0>) - 1 units [5 tiles away]
    - APPLE_84 (apple) at (15, 19, <MapLevel.GROUND: 0>) - 1 units [4 tiles away]
    - APPLE_88 (apple) at (14, 14, <MapLevel.GROUND: 0>) - 1 units [2 

## 🤖 Load GPT-OSS Model

In [3]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048  # Marooned observations can be long!
lora_rank = 8

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b-BF16",
    load_in_4bit = False,
    max_seq_length = max_seq_length,
)

print(f"✅ Model loaded! Max seq length: {max_seq_length}")

bitsandbytes library load error: Configured ROCm binary not found at /root/AIAC/colony-collapse/.venv/lib/python3.12/site-packages/bitsandbytes/libbitsandbytes_rocm64.so
Traceback (most recent call last):
  File "/root/AIAC/colony-collapse/.venv/lib/python3.12/site-packages/bitsandbytes/cextension.py", line 313, in <module>
    lib = get_native_library()
          ^^^^^^^^^^^^^^^^^^^^
  File "/root/AIAC/colony-collapse/.venv/lib/python3.12/site-packages/bitsandbytes/cextension.py", line 282, in get_native_library
    raise RuntimeError(f"Configured {BNB_BACKEND} binary not found at {cuda_binary_path}")
RuntimeError: Configured ROCm binary not found at /root/AIAC/colony-collapse/.venv/lib/python3.12/site-packages/bitsandbytes/libbitsandbytes_rocm64.so


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm
    PyTorch 2.8.0+cu128 with CUDA 1208 (you have 2.9.0+rocm6.4)
    Python  3.9.23 (you have 3.12.3)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


Switching to PyTorch attention since your Xformers is broken.

Unsloth: Xformers was not installed correctly.
Please install xformers separately first.
Then confirm if it's correctly installed by running:
python -m xformers.info

Longer error message:
xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.8.0+cu128 with CUDA 1208 (you have 2.9.0+rocm6.4)
    Python  3.9.23 (you have 3.12.3)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
🦥 Unsloth Zoo will now patch everything to make training faster!
Unsloth: AMD currently is not stable with 4bit bitsandbytes. Disabling for now.
==((====))==  Unsloth 20

Loading checkpoint shards: 100%|██████████| 9/9 [00:14<00:00,  1.66s/it]



✅ Model loaded! Max seq length: 2048


## 🎯 Add LoRA Adapters

In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = lora_rank * 2,
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

print("✅ LoRA adapters added!")

Unsloth: Making `model.base_model.model.model` require gradients
✅ LoRA adapters added!


## 🔧 LLM Interface - Already Built!

Your `llm_interface.py` handles everything:
- `observation_to_prompt()` → Converts game state to natural language
- `parse_llm_response()` → Converts LLM output to Action objects
- `parse_action_safe()` → Fallback to WAIT on errors

Let's import and test it:

In [5]:
from llm_interface import observation_to_prompt, parse_llm_response, parse_action_safe

# Test: Generate prompt for Alice, including her secret role (colonist or traitor, depending on the seed)
alice_sailor = env.state.sailors["Alice"]
alice_prompt = observation_to_prompt(alice_obs, include_role=True, sailor_role=alice_sailor.role.value)

print("📝 PROMPT GENERATED (using YOUR observation_to_prompt):")
print("=" * 80)
print(alice_prompt[:1500])
print("...")
print(f"\nTotal length: {len(alice_prompt)} characters")
print(f"Contains role info: {'COLONIST' in alice_prompt or 'TRAITOR' in alice_prompt}")

# Test: Parse sample LLM output
sample_llm_output = """
ACTION: MOVE NORTH 3
REASONING: Moving toward visible wood resources
MESSAGE: "Heading north to gather wood"
"""

action = parse_action_safe(sample_llm_output, "Alice", alice_obs.position)
print(f"\n✅ Parsed action: {action.action_type.value}")
if action.target_position:
    print(f"   Target position: {action.target_position.to_tuple()}")


📝 PROMPT GENERATED (using YOUR observation_to_prompt):
DAY 1, TURN 1/100 - MORNING PHASE

🎭 YOUR SECRET ROLE: TRAITOR

TRAITOR OBJECTIVES:
  - Prevent the ship from being completed before Day 100
  - Eliminate colonists (reduce to < 3 alive)
  - Use poison, sabotage, and deception
  - Avoid getting voted out
  - Lie convincingly to deflect suspicion


PHASE CONTEXT:
  Location: All sailors at BASE CAMP
  Allowed: Planning, discussions, voting (if called)
  Restricted: Cannot explore or gather resources yet

YOUR STATUS (Alice):
  Position: (15, 15, <MapLevel.GROUND: 0>)
  Energy: 100/100 ⚡⚡⚡⚡⚡
  Health: healthy
  Backpack: 0/20 items
    (empty)

WHAT YOU SEE (within 5 tiles):
  Resources:
    - WOOD_34 (wood) at (16, 16, <MapLevel.GROUND: 0>) - 1 units [2 tiles away]
    - METAL_53 (metal) at (14, 11, <MapLevel.GROUND: 0>) - 1 units [5 tiles away]
    - METAL_56 (metal) at (18, 12, <MapLevel.GROUND: 0>) - 1 units [6 tiles away]
    - METAL_76 (metal) at (14, 11, <MapLevel.GROUND: 0>) 

## 🏆 Reward Function - CORRECT APPROACH

**How it works:**
1. LLM reads game state (observation.to_text())
2. LLM outputs: ACTION + REASONING + MESSAGE
3. We parse that → execute in environment
4. Environment returns Phase 4 rewards (already built-in!)
5. GRPO uses those rewards to train the model
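
Step 5 can be illustrated with a small sketch of group-relative advantages, the idea at the core of GRPO: the completions sampled for one prompt are compared against each other rather than against a learned value function. The helper name `group_advantages`, the use of the sample standard deviation, and the `eps` value are assumptions for this example; exact normalisation details vary between implementations.

```python
from statistics import mean, stdev

def group_advantages(rewards, eps=1e-4):
    """Normalise rewards within one group of completions for the same prompt."""
    if len(rewards) < 2:
        # A single completion has nothing to be compared against.
        return [0.0 for _ in rewards]
    mu = mean(rewards)
    sigma = stdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Two completions for one prompt: one earned a reward, one did not.
mixed = group_advantages([0.5, 0.0])
# Both completions earned the same reward (for example, both fell back to WAIT).
tied = group_advantages([0.0, 0.0])
print(mixed)
print(tied)
```

When every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal. With a small group size this can happen often, for instance when every completion in the group fails to parse.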

In [6]:
def marooned_reward_function(completions, env=None, **kwargs):
    """
    Reward function for Marooned RL training.
    
    Flow:
    1. LLM generates: ACTION + REASONING + MESSAGE
    2. Parse it using YOUR llm_interface.parse_action_safe()
    3. Execute in YOUR environment via env.step()
    4. Return YOUR environment's Phase 4 rewards!
    
    No custom reward logic needed - everything is already in your environment!
    """
    if env is None:
        # No environment supplied: return neutral rewards instead of failing
        return [0.0] * len(completions)
    
    from llm_interface import parse_action_safe
    
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        
        try:
            # Get current active sailor (for simplicity, use first alive sailor)
            active_sailor = None
            for sid in env.agents:
                if env.state.sailors[sid].alive:
                    active_sailor = sid
                    break
            
            if not active_sailor:
                scores.append(-5.0)  # Game over
                continue
            
            # Get current observation
            current_obs = env._generate_observation(active_sailor)
            
            # Parse LLM output using YOUR parser
            action = parse_action_safe(response, active_sailor, current_obs.position)
            
            # Create actions for all agents (others wait)
            actions_dict = {
                sid: Action(sailor_id=sid, action_type=ActionType.WAIT) 
                for sid in env.agents
            }
            actions_dict[active_sailor] = action
            
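            # Execute in YOUR environment - returns YOUR Phase 4 rewards!
            # NOTE: env is shared and is not reset here, so each completion is scored
            # against the env's current state, not the state its prompt was rendered from.
            obs, rewards, dones, truncated, info = env.step(actions_dict)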
            
            # Use the reward YOUR environment calculated
            reward = rewards.get(active_sailor, 0.0)
            
            # Small bonus for valid action format (not just WAIT)
            if action.action_type != ActionType.WAIT:
                reward += 0.5
            
            scores.append(reward)
            
        except Exception as e:
            # Parse error or invalid action
            print(f"⚠️  Reward function error: {e}")
            scores.append(-2.0)
    
    return scores


print("✅ Reward function ready!")
print("\n📝 How it works during training:")
print("   1. GRPO loads a prompt from dataset")
print("   2. Model generates: 'ACTION: MOVE NORTH\\nREASONING: ...'")
print("   3. parse_action_safe() converts to Action object")
print("   4. env.step() executes and returns Phase 4 rewards")
print("   5. GRPO uses reward to update model weights")
print("\n🎯 Model learns: Good actions → High rewards → Repeat!")

✅ Reward function ready!

📝 How it works during training:
   1. GRPO loads a prompt from dataset
   2. Model generates: 'ACTION: MOVE NORTH\nREASONING: ...'
   3. parse_action_safe() converts to Action object
   4. env.step() executes and returns Phase 4 rewards
   5. GRPO uses reward to update model weights

🎯 Model learns: Good actions → High rewards → Repeat!


## 📊 Create Training Dataset

**What this does:**
- Plays 50 episodes with scripted random actions (no LLM yet!)
- Captures up to 5,000 game-state snapshots (50 episodes × 20 steps × 5 sailors)
- Converts each to natural language prompt using `observation_to_prompt()`
- These prompts become the training data for GRPO

**Why we need this:**
- GRPO requires a dataset to start training
- Each prompt is unique (different seeds, sailors, game states)
- **Smart random actions** create realistic gameplay situations
- Model will learn to respond to varied scenarios

**Actions included in random play:**
- ✅ Movement (NORTH, SOUTH, EAST, WEST)
- ✅ Gathering resources (when visible)
- ✅ Eating food (when energy < 50)
- ✅ Depositing items (when at base camp)
- ✅ Level changes (CLIMB_UP, CLIMB_DOWN)
- ✅ Waiting (most common)

**Actions NOT included** (too complex for random play, model will learn these):
- ❌ Ship building (requires coordination)
- ❌ Voting/social (requires reasoning)
- ❌ Traitor sabotage (requires strategy)
- ❌ Messaging (requires intent)

**Result:** Dataset with realistic gameplay situations for training!

In [7]:
from llm_interface import observation_to_prompt
from datasets import Dataset
import random

# Generate training prompts from actual game states
training_prompts = []

print("🎮 Generating training data from real game episodes...")
print("   (Playing games with RANDOM actions to create variety)\n")

for episode in range(50):  # 50 episodes with different seeds
    # Reset environment with new seed → different map layout
    observations = env.reset(seed=42 + episode)
    
    # Run 20 steps per episode
    for step in range(20):
        # Capture all 5 sailors' perspectives
        for sailor_id in env.agents:
            obs = observations[sailor_id]
            sailor_role = env.state.sailors[sailor_id].role.value
            
            # Convert game state to natural language prompt
            # This is what the model will see during training!
            prompt = observation_to_prompt(obs, include_role=True, sailor_role=sailor_role)
            
            training_prompts.append({
                "prompt": [{"role": "user", "content": prompt}],
                "answer": 0,
                "reasoning_effort": "medium",
            })
        
        # Advance game state with RANDOM actions to create variety!
        actions = {}
        for sid in env.agents:
            sailor = env.state.sailors[sid]
            if not sailor.alive:
                actions[sid] = Action(sailor_id=sid, action_type=ActionType.WAIT)
                continue
            
            try:
                obs = observations[sid]
                
                # Choose random action based on current situation
                possible_actions = [
                    ActionType.MOVE_NORTH,
                    ActionType.MOVE_SOUTH,
                    ActionType.MOVE_EAST,
                    ActionType.MOVE_WEST,
                    ActionType.WAIT,
                    ActionType.WAIT,  # Higher chance to wait
                ]
                
                # Add gathering if resources are visible
                if obs.spatial_view and obs.spatial_view.visible_resources:
                    resource = random.choice(obs.spatial_view.visible_resources)
                    # 20% chance to try gathering
                    if random.random() < 0.2:
                        actions[sid] = Action(
                            sailor_id=sid, 
                            action_type=ActionType.GATHER_RESOURCE,
                            target_resource_id=resource.resource_id
                        )
                        continue
                
                # Add eating if low energy and has food
                if obs.energy < 50 and obs.backpack:
                    food_items = [item for item in obs.backpack 
                                 if item.resource_type in [ResourceType.APPLE, ResourceType.BERRY, ResourceType.MUSHROOM]]
                    if food_items and random.random() < 0.3:  # 30% chance to eat
                        food = random.choice(food_items)
                        actions[sid] = Action(
                            sailor_id=sid,
                            action_type=ActionType.EAT_FOOD,
                            target_resource_id=food.resource_type
                        )
                        continue
                
                # Add depositing if at base camp and has items
                if obs.position.x == 15 and obs.position.y == 15 and obs.position.level == MapLevel.GROUND:
                    if obs.backpack and random.random() < 0.2:  # 20% chance to deposit
                        item = random.choice(obs.backpack)
                        actions[sid] = Action(
                            sailor_id=sid,
                            action_type=ActionType.DEPOSIT_ITEM,
                            resource_type=item.resource_type,
                            quantity=1
                        )
                        continue
                
                # Add level changes occasionally
                if random.random() < 0.1:  # 10% chance to change level
                    if obs.position.level == MapLevel.GROUND:
                        possible_actions.append(ActionType.CLIMB_UP if random.random() < 0.5 else ActionType.CLIMB_DOWN)
                    elif obs.position.level == MapLevel.MOUNTAIN:
                        possible_actions.append(ActionType.CLIMB_DOWN)
                    elif obs.position.level == MapLevel.CAVE:
                        possible_actions.append(ActionType.CLIMB_UP)
                
                # Default: pick from possible movement/wait actions
                random_action = random.choice(possible_actions)
                actions[sid] = Action(sailor_id=sid, action_type=random_action)
                
            except Exception as e:
                # If anything fails, fallback to WAIT
                actions[sid] = Action(sailor_id=sid, action_type=ActionType.WAIT)
        
        observations, _, dones, _, _ = env.step(actions)
        
        if any(dones.values()):
            break
    
    if episode % 10 == 0:
        print(f"  Episode {episode + 1}/50: {len(training_prompts)} prompts collected...")

print(f"\n✅ Created {len(training_prompts)} unique training prompts!")
print(f"   Source: 50 episodes × 20 steps × 5 sailors")
print(f"   Each prompt = different game state (seed, position, view, role)")
print(f"   Sailors took SMART RANDOM actions:")
print(f"     - Movement (north, south, east, west)")
print(f"     - Gathering resources (when visible)")
print(f"     - Eating food (when low energy)")
print(f"     - Depositing items (when at base camp)")
print(f"     - Climbing levels (mountain/cave exploration)")

# Create dataset for GRPO
dataset = Dataset.from_list(training_prompts)

print(f"\n📦 Dataset ready with {len(dataset)} prompts!")
print(f"   Each uses YOUR observation.to_text() method")
print(f"   Includes both colonist and traitor perspectives")
print(f"   Rich variety: exploration, gathering, eating, depositing!")


🎮 Generating training data from real game episodes...
   (Playing games with RANDOM actions to create variety)



  Episode 1/50: 100 prompts collected...
  Episode 11/50: 1100 prompts collected...
  Episode 21/50: 2100 prompts collected...
  Episode 31/50: 3100 prompts collected...
  Episode 41/50: 4100 prompts collected...

✅ Created 5000 unique training prompts!
   Source: 50 episodes × 20 steps × 5 sailors
   Each prompt = different game state (seed, position, view, role)
   Sailors took SMART RANDOM actions:
     - Movement (north, south, east, west)
     - Gathering resources (when visible)
     - Eating food (when low energy)
     - Depositing items (when at base camp)
     - Climbing levels (mountain/cave exploration)


## ⚙️ Training Configuration

In [8]:
from trl import GRPOConfig, GRPOTrainer

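# Estimate the max prompt length in tokens: tokenize the first 1000 characters of a
# sample observation, then add an 800-token buffer for the rest of the prompt.
# This is a rough heuristic; prompts longer than the result may be truncated.
test_obs_text = alice_obs.to_text()
max_prompt_length = len(tokenizer.apply_chat_template(
    [{"role": "user", "content": test_obs_text[:1000]}],
    add_generation_prompt=True
)) + 800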

## 🎯 GRPO Training Configuration

In [12]:
max_completion_length = max_seq_length - max_prompt_length

print(f"Max prompt length: {max_prompt_length}")
print(f"Max completion length: {max_completion_length}")

training_args = GRPOConfig(
    temperature = 1.0,
    learning_rate = 5e-5,
    weight_decay = 0.01,
    warmup_ratio = 0.1,
    lr_scheduler_type = "linear",
    optim = "adamw_torch",  # Changed from adamw_8bit for ROCm compatibility
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 2,
    num_generations = 2,
    max_prompt_length = max_prompt_length,
    max_completion_length = max_completion_length,
    max_steps = 300,  # Start small, increase later
    save_steps = 50,
    report_to = "none",  # or "trackio" for visualization
    output_dir = "outputs_marooned_rl",
)

print("✅ Training config ready!")
print("⚠️  Using adamw_torch optimizer (ROCm doesn't support 8-bit optimizers)")

Max prompt length: 1199
Max completion length: 849
Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 2
✅ Training config ready!
⚠️  Using adamw_torch optimizer (ROCm doesn't support 8-bit optimizers)
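
The budget arithmetic behind these two numbers can be checked with a small helper. This is a sketch only: `completion_budget` is a hypothetical name, and the token counts in the example are illustrative values, not measurements from the real dataset.

```python
def completion_budget(prompt_token_counts, max_seq_length, max_prompt_length):
    """Report the completion budget and how many prompts exceed the prompt limit."""
    if max_prompt_length >= max_seq_length:
        raise ValueError("max_prompt_length must leave room for a completion")
    over_limit = sum(1 for n in prompt_token_counts if n > max_prompt_length)
    return {
        "max_completion_length": max_seq_length - max_prompt_length,
        "longest_prompt": max(prompt_token_counts, default=0),
        "prompts_over_limit": over_limit,
    }

# Illustrative token counts (assumed, not measured).
report = completion_budget(
    [900, 1150, 1400],
    max_seq_length=2048,
    max_prompt_length=1199,
)
print(report)
```

In the notebook, real counts could be gathered by tokenizing each dataset prompt with `tokenizer.apply_chat_template(..., add_generation_prompt=True)` and taking the length of the result. A non-zero `prompts_over_limit` would suggest raising `max_seq_length` or shortening the prompts.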


## 🚀 Initialize GRPO Trainer

In [13]:
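# Initialize trainer with YOUR environment reward function.
# A named wrapper is used instead of a lambda because TRL labels each reward
# column with the function's __name__ (a lambda shows up as "<lambda>").
def marooned_env_reward(completions, **kwargs):
    return marooned_reward_function(completions, env=env, **kwargs)

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        marooned_env_reward,
    ],
    args = training_args,
    train_dataset = dataset,
)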

print("✅ GRPO Trainer initialized!")
print("\n📝 How Training Works:")
print("   1. Model reads game state (observation.to_text())")
print("   2. Model outputs: ACTION + REASONING + MESSAGE")
print("   3. Reward function parses → executes in env.step()")
print("   4. Gets YOUR Phase 4 rewards from environment")
print("   5. GRPO uses rewards to update model weights")
print("\n🎯 Goal: Teach model to deceive, cooperate, and strategize!")

✅ GRPO Trainer initialized!

📝 How Training Works:
   1. Model reads game state (observation.to_text())
   2. Model outputs: ACTION + REASONING + MESSAGE
   3. Reward function parses → executes in env.step()
   4. Gets YOUR Phase 4 rewards from environment
   5. GRPO uses rewards to update model weights

🎯 Goal: Teach model to deceive, cooperate, and strategize!


## 🏋️ Train the Model

**Start training!** This will take several hours.

Monitor the `reward` column - it should increase over time.

In [14]:
# Start training (interrupt the kernel to stop early)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,000 | Num Epochs = 1 | Total steps = 300
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 2 x 1) = 4
 "-____-"     Trainable parameters = 3,981,312 of 20,918,738,496 (0.02% trained)


⚠️  Action parsing failed: No ACTION field found in response
⚠️  Defaulting to WAIT action
⚠️  Action parsing failed: No ACTION field found in response
⚠️  Defaulting to WAIT action
⚠️  Action parsing failed: No ACTION field found in response
⚠️  Defaulting to WAIT action


Step,Training Loss,reward,reward_std,completions / mean_length,completions / min_length,completions / max_length,completions / clipped_ratio,completions / mean_terminated_length,completions / min_terminated_length,completions / max_terminated_length,kl,rewards / / mean,rewards / / std
1,0.0,0.115,0.176777,650.5,55.0,849.0,0.75,55.0,55.0,55.0,0.003655,0.115,0.25


⚠️  Action parsing failed: Unknown command: MOVE,
⚠️  Defaulting to WAIT action
⚠️  Action parsing failed: No ACTION field found in response
⚠️  Defaulting to WAIT action
⚠️  Action parsing failed: No ACTION field found in response
⚠️  Defaulting to WAIT action
⚠️  Action parsing failed: Unknown command: WE
⚠️  Defaulting to WAIT action
⚠️  Action parsing failed: No ACTION field found in response
⚠️  Defaulting to WAIT action
⚠️  Action parsing failed: Unknown command: WE
⚠️  Defaulting to WAIT action


KeyboardInterrupt: 
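
The warnings above show that many early completions contain no `ACTION:` line at all. One possible mitigation, sketched here as an assumption rather than as part of the original notebook, is a lightweight format reward that could be passed to `reward_funcs` next to the environment reward. The name `format_reward` and the 1.0 / 0.0 scores are illustrative choices; the completion structure (`completion[0]["content"]`) follows the reward function defined earlier.

```python
import re

# A line that starts with "ACTION:" followed by at least one non-space token.
ACTION_LINE = re.compile(r"^ACTION:[ \t]*\S+", re.MULTILINE)

def format_reward(completions, **kwargs):
    """Return 1.0 for completions that contain an ACTION line, else 0.0."""
    scores = []
    for completion in completions:
        text = completion[0]["content"]
        scores.append(1.0 if ACTION_LINE.search(text) else 0.0)
    return scores

sample = [
    [{"role": "assistant", "content": "ACTION: MOVE NORTH\nREASONING: wood nearby"}],
    [{"role": "assistant", "content": "We should probably head north."}],
]
format_scores = format_reward(sample)
print(format_scores)
```

This check only looks for the field itself. It does not validate the command, so an output such as `ACTION: MOVE,` would still score 1.0 here and be rejected later by the real parser.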

## 🧪 Test Trained Model in Real Gameplay

Run a full episode with your trained model!

In [None]:
def test_episode(model, tokenizer, env, max_turns=100):
    """
    Play one full Marooned episode with the trained model.
    Tests if model learned to:
    - Navigate and gather resources
    - Build the ship
    - Deceive others (if traitor)
    - Identify traitor (if colonist)
    """
    from llm_interface import observation_to_prompt, parse_action_safe
    
    observations = env.reset(seed=None)
    
    print("🏴‍☠️ MAROONED - TRAINED MODEL GAMEPLAY")
    print("=" * 80)
    print(f"Sailors: {', '.join(env.agents)}")
    print(f"Roles: {[(s, env.state.sailors[s].role.value) for s in env.agents]}")
    print("=" * 80)
    print()
    
    for turn in range(max_turns):
        actions = {}
        
        # Each sailor generates their action
        for sailor_id in env.agents:
            obs = observations[sailor_id]
            sailor_role = env.state.sailors[sailor_id].role.value
            
            # Skip dead sailors
            if obs.energy == 0 or not env.state.sailors[sailor_id].alive:
                actions[sailor_id] = Action(sailor_id=sailor_id, action_type=ActionType.WAIT)
                continue
            
            # Create prompt using YOUR observation_to_prompt function!
            prompt = observation_to_prompt(obs, include_role=True, sailor_role=sailor_role)
            
            # Generate response
            messages = [{"role": "user", "content": prompt}]
            text = tokenizer.apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=True,
            )
            
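            # Move inputs to wherever the model lives instead of hard-coding "cuda"
            inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_seq_length).to(model.device)
            outputs = model.generate(
                **inputs,
                max_new_tokens=150,
                temperature=0.8,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id,
            )
            
            # Decode only the newly generated tokens (skip the prompt tokens)
            prompt_len = inputs["input_ids"].shape[1]
            response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)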
            
            # Parse action using YOUR parser!
            action = parse_action_safe(response, sailor_id, obs.position)
            actions[sailor_id] = action
            
            if turn % 10 == 0 or action.action_type != ActionType.WAIT:
                print(f"[Day {obs.day}, Turn {turn}] {sailor_id} ({sailor_role}): {action.action_type.value}")
                if action.message_content:
                    print(f"  💬 \"{action.message_content}\"")
        
        # Step environment - uses YOUR Phase 4 rewards!
        observations, rewards, dones, truncated, info = env.step(actions)
        
        # Show progress every 10 turns
        if turn % 10 == 0:
            ship_prog = observations["Alice"].ship_progress.total_percentage
            print(f"\n📊 Turn {turn}: Ship Progress = {ship_prog}%\n")
        
        # Check if done
        if all(dones.values()):
            print("\n🏁 Episode complete!")
            break
    
    # Final stats
    final_obs = observations["Alice"]
    print("\n" + "="*80)
    print("📊 FINAL STATS")
    print(f"   Ship Progress: {final_obs.ship_progress.total_percentage}%")
    print(f"   Days Elapsed: {final_obs.day}")
    print("="*80)

# Run test
# test_episode(model, tokenizer, env, max_turns=50)