# MAROONED: Social Deduction Meets Survival RL
## Training LLMs to Master Cooperation, Deception, and Long-Horizon Planning

<a href="https://colab.research.google.com/github/atchudhansg/colony-collapse/blob/main/phase7_rl_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---

## The Game: Marooned

**Scenario**: Five sailors are shipwrecked on a mysterious three-level island. They have 100 days to rebuild their ship and escape. But there's a problem: one sailor is secretly a **traitor** working to sabotage the mission.

### Game Mechanics

**The Island**:
- **Ground Level** (30×30): Main exploration area with scattered resources
- **Mountain Level** (10×10): Rare resources but high energy cost to climb
- **Cave Level** (15×15): Unique resources in dark, dangerous passages

**The Objective**:
- **Colonists** (4 sailors): Gather wood, metal, and food. Build ship components (hull, mast, sail, rudder, supplies). Complete the ship to 100% before day 100.
- **Traitor** (1 sailor): Pretend to help while secretly sabotaging ship progress. Poison colonists. Prevent escape by any means necessary.

**The Twist - Information Asymmetry**:
- Only the traitor knows their role
- Everyone has private backpacks (hidden inventories)
- Sabotage only works when unobserved
- Poison takes 3 days to kill (delayed consequences)
- Voting can eliminate suspects (but might eliminate innocent sailors)

**How Sailors Interact**:
- **Communicate**: Send messages to coordinate or deceive
- **Gather Resources**: Collect wood/metal/food from island tiles
- **Share or Hoard**: Deposit in common storage or keep resources private
- **Build Together**: Ship construction requires multiple sailors working together
- **Vote**: Hold trials to eliminate suspected traitors
- **Survive**: Manage energy by eating food, avoid poison, stay alive

**Win Conditions**:
- **Colonists Win**: Complete ship to 100% OR successfully vote out the traitor
- **Traitor Wins**: Prevent ship completion for 100 days OR eliminate enough colonists (the traitor's prompt sets the threshold at fewer than 3 alive)
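
The win conditions above can be read as a small decision function. The following is only a sketch under stated assumptions: `check_winner` and its arguments are hypothetical names, the "fewer than 3 colonists alive" threshold is taken from the traitor objectives printed in the sample observation later in this notebook, and treating "day greater than 100" as the deadline is a guess at how the environment counts days.

```python
from typing import Optional

MAX_DAYS = 100           # from the scenario description
MIN_COLONISTS_ALIVE = 3  # assumption: traitor prompt says "reduce to < 3 alive"


def check_winner(ship_progress: float, day: int,
                 traitor_alive: bool, colonists_alive: int) -> Optional[str]:
    """Return "colonists", "traitor", or None while the game is still open."""
    # Colonists win by finishing the ship or by voting out the traitor.
    if ship_progress >= 100.0 or not traitor_alive:
        return "colonists"
    # Traitor wins by running out the clock or thinning out the crew.
    if colonists_alive < MIN_COLONISTS_ALIVE or day > MAX_DAYS:
        return "traitor"
    return None
```

The real environment exposes the outcome through `env.state.winner`; this sketch only makes the rule order explicit.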

### Why This Is Hard for AI

Unlike a simple single-player game such as 2048, Marooned demands:
- **Long-horizon planning**: Decisions made on day 1 affect outcomes on day 100
- **Hidden information**: Can't see others' inventories or the traitor's identity
- **Social deduction**: Must infer deception from behavior patterns
- **Cooperation required**: Ship building needs 2+ sailors working together
- **Multi-objective**: Balance exploration, resource gathering, building, and traitor detection

---

## Project Goal

Train **GPT-OSS 20B** using reinforcement learning to play both roles:
1. **As a Colonist**: Explore efficiently, gather resources, cooperate with teammates, detect suspicious behavior, vote strategically
2. **As a Traitor**: Blend in, sabotage secretly, lie convincingly, frame others, survive accusations

This creates a **dual-objective learning problem** where the model must master both cooperation and deception simultaneously.

---

## Technical Innovation for OpenEnv

**Creative Use** (50 points):
- **Multi-agent environment**: 5 sailors with conflicting goals (beyond single-agent 2048)
- **Custom OpenEnv implementation**: Built from scratch following Gymnasium spec
- **Novel mechanics**: Hidden roles, poison system, voting, private inventories
- **Social AI**: First OpenEnv environment teaching LLMs to deceive and detect deception

**Technical Excellence** (25 points):
- **Information asymmetry**: Partial observability, hidden state
- **Long-horizon rewards**: 100-day episodes = up to 10,000 turns
- **Complex action space**: 21 action types with natural language parsing
- **Multi-objective optimization**: Dual reward functions (colonist + traitor strategies)

**Storytelling** (25 points):
- **Narrative**: Pirates meet Among Us - shipwrecked sailors with a hidden traitor
- **Progression**: From random failures → strategic cooperation → emergent deception
- **Real-world relevance**: Social AI, negotiation, collaborative systems with adversaries

**Bonus Criteria Achieved**:
- ✅ **Multi-turn environment**: 100-day episodes (10,000 potential turns)
- ✅ **Longer horizon**: Far exceeds 2048's ~1,000 move episodes
- ✅ **Model vs model potential**: Framework ready for self-play between trained colonists and traitors
- ✅ **New environment from scratch**: Custom Marooned environment built on OpenEnv spec

---

# Setup & Installation

We use [Unsloth](https://github.com/unslothai/unsloth) for efficient RL training on GPT-OSS 20B (70% VRAM reduction, 2-6× speedup) and [OpenEnv](https://github.com/meta-pytorch/OpenEnv) for standardized environment interactions.

In [1]:
import os, importlib.util
!pip install --upgrade -qqq uv
if importlib.util.find_spec("torch") is None or "COLAB_" in "".join(os.environ.keys()):
    try: import numpy; get_numpy = f"numpy=={numpy.__version__}"
    except: get_numpy = "numpy"
    !uv pip install -qqq \
        "torch>=2.8.0" "triton>=3.4.0" {get_numpy} torchvision bitsandbytes "transformers==4.56.2" trackio \
        "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \
        "unsloth[base] @ git+https://github.com/unslothai/unsloth" \
        git+https://github.com/triton-lang/triton.git@05b2c186c1b6c9a08375389d5efe9cb4c401c075#subdirectory=python/triton_kernels
elif importlib.util.find_spec("unsloth") is None:
    !uv pip install -qqq unsloth trackio
!uv pip install --upgrade --no-deps transformers==4.56.2 tokenizers trl==0.22.2 unsloth unsloth_zoo trackio

[2K[2mResolved [1m6 packages[0m [2min 61ms[0m[0m                                          [0m
[2K[2mPrepared [1m1 package[0m [2min 0.21ms[0m[0m                                             
[2mUninstalled [1m1 package[0m [2min 1ms[0m[0m
[2K[2mInstalled [1m1 package[0m [2min 4ms[0m[0m                                  [0m
 [31m-[39m [1mtrl[0m[2m==0.23.0[0m
 [32m+[39m [1mtrl[0m[2m==0.22.2[0m


In [2]:
import torch

print("Torch version:", torch.__version__)
print("HIP version:", torch.version.hip)
print("ROCm available:", torch.version.hip is not None)
print("CUDA available:", torch.cuda.is_available())
print("MPS available:", torch.backends.mps.is_available())
print("Device count:", torch.cuda.device_count())

if torch.cuda.device_count() > 0:
    print("Device name:", torch.cuda.get_device_name(0))


Torch version: 2.9.0+rocm6.4
HIP version: 6.4.43484-123eb5128
ROCm available: True
CUDA available: True
MPS available: False
Device count: 1
Device name: AMD Instinct MI300X VF


In [3]:
%%capture
!pip install -qqq fastapi uvicorn requests open_spiel
!git clone https://github.com/meta-pytorch/OpenEnv.git > /dev/null 2>&1
%cd OpenEnv
import subprocess, sys, os
from pathlib import Path
sys.path.insert(0, './src')
working_directory = str(Path.cwd().parent.absolute() / "OpenEnv")

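We'll load GPT-OSS 20B and set some parameters:
* `max_seq_length = 2048`: the maximum context length (long enough to fit the complex game state)
* `lora_rank = 8`: the LoRA adapter rank; a higher rank gives the adapter more capacity but trains more slowly
* `load_in_4bit = True`: requests 4-bit quantization for memory efficiency (on the AMD GPU used here, Unsloth disables 4-bit and falls back to 16-bit LoRA, as the load log shows)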
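
To make the rank trade-off concrete, here is a small sketch of how the number of trainable LoRA parameters grows with the rank. LoRA adds two low-rank factors to each adapted weight matrix, so the count is linear in the rank. The layer dimensions below are hypothetical and chosen only for illustration; they are not the actual GPT-OSS 20B shapes.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one (d_out x d_in) weight matrix."""
    # LoRA learns B (d_out x rank) and A (rank x d_in); the base weight stays frozen.
    return rank * (d_in + d_out)


# Hypothetical layer size, for illustration only
d_in, d_out = 4096, 4096
for rank in (4, 8, 16):
    print(f"rank={rank}: {lora_params(d_in, d_out, rank):,} trainable parameters")
```

Doubling the rank doubles the adapter size per layer, which is why a larger rank costs more memory and time.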

In [3]:
from unsloth import FastLanguageModel
max_seq_length = 2048 # Longer context for complex game state
lora_rank = 8         # Larger rank = more adapter capacity, but slower
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",
    load_in_4bit = True,
    max_seq_length = max_seq_length,
)

Unsloth: AMD currently is not stable with 4bit bitsandbytes. Disabling for now.


==((====))==  Unsloth 2025.10.9: Fast Gpt_Oss patching. Transformers: 4.56.2.
   \\   /|    AMD Instinct MI300X VF. Num GPUs = 1. Max memory: 191.688 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.0+rocm6.4. ROCm Toolkit: 6.4.43484-123eb5128. Triton: 3.5.0
\        /    Bfloat16 = TRUE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: QLoRA and full finetuning all not selected. Switching to 16bit LoRA.


Loading checkpoint shards: 100%|██████████| 3/3 [00:05<00:00,  1.81s/it]


In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = lora_rank*2, # *2 speeds up training
    use_gradient_checkpointing = "unsloth", # Reduces memory usage
    random_state = 3407,
)

Unsloth: Making `model.base_model.model.model` require gradients


In [1]:
# Install OpenEnv dependencies (for client-side API calls)
!pip install -qqq requests

# Check if we're in Colab or need to clone OpenEnv
import os
if not os.path.exists('OpenEnv'):
    !git clone https://github.com/meta-pytorch/OpenEnv.git > /dev/null 2>&1

import subprocess, sys
from pathlib import Path

print("✅ OpenEnv dependencies installed")

✅ OpenEnv dependencies installed


## 🌐 OpenEnv Setup

**Prerequisites:** Make sure your Marooned server is already running:
```bash
python marooned_server.py
```

The server should be running on `http://localhost:8000` before proceeding with this notebook.
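
If the server is started from a script rather than by hand, it can take a moment to come up. The sketch below polls a URL until it answers. `wait_for_server` and `http_ok` are hypothetical helpers, not part of OpenEnv or the Marooned server; the only thing taken from this notebook is the `/health` endpoint. The probe is injectable so the retry logic can be exercised without a live server.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:  # URLError, HTTPError and socket timeouts are OSError subclasses
        return False


def wait_for_server(url: str, probe: Callable[[str], bool] = http_ok,
                    attempts: int = 10, delay: float = 1.0) -> bool:
    """Poll `url` until `probe` succeeds or `attempts` run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # back off before the next try
    return False
```

A typical call would be `wait_for_server("http://localhost:8000/health")`, followed by the connection check in the next section.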

### Connect to Marooned OpenEnv Server

Let's verify that your Marooned server is running and accessible.

In [5]:
import time
import requests

# Port for the server
PORT = "8000"
LOCALHOST = f"http://localhost:{PORT}"

print("🔗 Connecting to Marooned OpenEnv server...")
print(f"   URL: {LOCALHOST}")

# Test health check with shorter timeout
try:
    response = requests.get(f"{LOCALHOST}/health", timeout=5)
    if response.status_code == 200:
        print("\n✅ Server is running!")
        print(f"   Health check: {response.json()}")
        
        print(f"\n📡 Available endpoints:")
        print(f"   {LOCALHOST}/         - API info")
        print(f"   {LOCALHOST}/health   - Health check")
        print(f"   {LOCALHOST}/reset    - Reset environment")
        print(f"   {LOCALHOST}/step     - Execute action")
        print(f"   {LOCALHOST}/state    - Get game state")
    else:
        print(f"\n⚠️ Server responded with status {response.status_code}")
        print("   Please make sure the server is running: python marooned_server.py")
except requests.exceptions.ConnectionError:
    print(f"\n❌ Could not connect to server at {LOCALHOST}")
    print("\n🚨 IMPORTANT: Please start the Marooned server first:")
    print("   Run this command in a terminal:")
    print("   python marooned_server.py")
except requests.exceptions.Timeout:
    print(f"\n❌ Connection timeout - server is not responding")
    print("   Make sure marooned_server.py is running")
except Exception as e:
    print(f"\n❌ Unexpected error: {e}")
    print("   Check if the server is accessible")

🔗 Connecting to Marooned OpenEnv server...
   URL: http://localhost:8000

✅ Server is running!
   Health check: {'status': 'healthy', 'environment_initialized': False}

📡 Available endpoints:
   http://localhost:8000/         - API info
   http://localhost:8000/health   - Health check
   http://localhost:8000/reset    - Reset environment
   http://localhost:8000/step     - Execute action
   http://localhost:8000/state    - Get game state


In [7]:
# Test the server API
print("🧪 Testing Marooned OpenEnv server...\n")

# 1. Reset the environment
print("1️⃣ Resetting environment...")
reset_response = requests.post(f"{LOCALHOST}/reset")
print(f"   Status: {reset_response.status_code}")

if reset_response.status_code != 200:
    print(f"\n❌ Server error! Response:")
    print(f"   {reset_response.text}")  # error bodies are not always JSON
    print("\n🔍 Check your server terminal for detailed error messages")
    print("   The server is running but encountering errors when processing requests")
else:
    reset_data = reset_response.json()
    
    # Access the data correctly based on actual structure
    if 'observation' in reset_data:
        obs = reset_data['observation']
        print(f"   Active sailor: {obs['sailor_id']}")
        print(f"   Day {obs['day']}, Turn {obs['turn']}")
        print(f"   Ship progress: {obs['ship_progress']:.1f}%")
        sailor_id = obs['sailor_id']
    else:
        # Maybe the response is the observation directly
        print(f"   Active sailor: {reset_data.get('sailor_id', 'N/A')}")
        print(f"   Day {reset_data.get('day', 0)}, Turn {reset_data.get('turn', 0)}")
        print(f"   Ship progress: {reset_data.get('ship_progress', 0.0):.1f}%")
        sailor_id = reset_data.get('sailor_id', 'Alice')
    
    # 2. Get game state
    print("\n2️⃣ Getting game state...")
    state_response = requests.get(f"{LOCALHOST}/state")
    state_data = state_response.json()
    print(f"   Living sailors: {len(state_data.get('living_sailors', []))}")
    print(f"   Current phase: {state_data.get('phase', 'unknown')}")
    
    # 3. Take a step
    print("\n3️⃣ Taking a step (MOVE NORTH)...")
    step_response = requests.post(
        f"{LOCALHOST}/step",
        json={
            "sailor_id": sailor_id,
            "action": "ACTION: MOVE NORTH"
        }
    )
    
    if step_response.status_code != 200:
        print(f"   Status: {step_response.status_code}")
        print(f"   Error: {step_response.text}")  # error bodies are not always JSON
    else:
        step_data = step_response.json()
        print(f"   Status: {step_response.status_code}")
        
        # Handle step response structure
        if 'observation' in step_data:
            obs = step_data['observation']
            print(f"   Reward: {step_data.get('reward', 0.0)}")
            print(f"   Next sailor: {obs['sailor_id']}")
            print(f"   Energy: {obs['energy']}")
        else:
            print(f"   Reward: {step_data.get('reward', 0.0)}")
            print(f"   Next sailor: {step_data.get('sailor_id', 'N/A')}")
            print(f"   Energy: {step_data.get('energy', 0)}")
        
        print("\n✅ Server is working correctly!")

🧪 Testing Marooned OpenEnv server...

1️⃣ Resetting environment...
   Status: 200
   Active sailor: Alice
   Day 1, Turn 1
   Ship progress: 0.0%

2️⃣ Getting game state...
   Living sailors: 0
   Current phase: unknown

3️⃣ Taking a step (MOVE NORTH)...
   Status: 200
   Reward: 0.04
   Next sailor: Alice
   Energy: 99

✅ Server is working correctly!


## Custom Environment: Marooned

**Architecture Highlights:**
- **OpenEnv-Compatible**: Implements standard Gymnasium API (reset, step, render)
- **Multi-Agent Support**: Manages 5 independent agents with turn-based coordination
- **Rich State Space**: Position (3D), energy, inventory, ship progress, evidence logs, messages
- **Complex Observations**: Each agent receives personalized view (5-tile spatial radius, public info, role-specific data)
- **Natural Language Actions**: Parse LLM-generated text into 21 structured action types

**Novel Mechanics:**
- Poison system with delayed effects (3-day incubation)
- Voting sessions with majority elimination
- Private backpacks vs shared storage
- Energy management with food consumption
- Multi-component ship building (hull, mast, sail, rudder, supplies)
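
As an illustration of the voting mechanic, here is a minimal sketch of a majority tally. `tally_votes` is a hypothetical helper, not the environment's implementation, and it assumes "majority" means more than half of the living sailors; the real rule may differ (for example, a plurality of votes cast).

```python
from collections import Counter
from typing import Dict, Optional


def tally_votes(votes: Dict[str, str], living: int) -> Optional[str]:
    """Return the eliminated sailor, or None when nobody reaches a majority.

    `votes` maps each voter's name to the sailor they voted against.
    """
    if not votes:
        return None
    target, count = Counter(votes.values()).most_common(1)[0]
    # Assumption: elimination needs more than half of all living sailors.
    return target if count > living / 2 else None
```

With five living sailors, three votes against the same sailor eliminate them, while a 2-2-1 split eliminates no one.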

In [1]:
import sys
sys.path.insert(0, './marooned_env')

from environment import MaroonedEnv
from models import Action, Observation, Position
from llm_interface import observation_to_prompt, parse_llm_response, parse_action_safe
from config import (
    ActionType, ResourceType, MapLevel, SailorRole,
    MAX_DAYS, TURNS_PER_DAY, TOTAL_SAILORS
)

print("✅ Marooned environment loaded!")
print(f"📊 Game parameters: {TOTAL_SAILORS} sailors, {MAX_DAYS} days, {TURNS_PER_DAY} turns/day")

✅ Marooned environment loaded!
📊 Game parameters: 5 sailors, 100 days, 100 turns/day


## 🎮 Environment Demo: How Marooned Works

Let's initialize the environment and see what observations look like.

In [2]:
# Create environment
env = MaroonedEnv(seed=42)
observations = env.reset()

print("🏝️ Game initialized!")
print(f"\n👥 Sailors: {list(observations.keys())}")
print(f"🎭 Traitor: {env.state.traitor_id}")
print(f"\n📍 All sailors start at base camp: {env.state.sailors['Alice'].position.to_tuple()}")

🏝️ Game initialized!

👥 Sailors: ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve']
🎭 Traitor: Alice

📍 All sailors start at base camp: (15, 15, <MapLevel.GROUND: 0>)


### Visualize the Island

The island has 3 levels with different resources:

In [3]:
# Show the ground level map with all sailors at base camp
print("🗺️ GROUND LEVEL (30x30) - Main exploration area")
print(env.render_map(MapLevel.GROUND, use_emoji=True))

print("\n" + "="*80)
print("\n⛰️ MOUNTAIN LEVEL (10x10) - Rare resources, high energy cost")
print(env.render_map(MapLevel.MOUNTAIN, use_emoji=True))

print("\n" + "="*80)
print("\n🕳️ CAVE LEVEL (15x15) - Dark, unique resources")
print(env.render_map(MapLevel.CAVE, use_emoji=True))

🗺️ GROUND LEVEL (30x30) - Main exploration area

   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 
┌──────────────────────────────────────────────────────────────┐
│ 🏝️  GROUND LEVEL (Z=0)                                       │
├──────────────────────────────────────────────────────────────┤
│ Legend: 🟫 land | 🌲 wood | ⚙️ metal | 🍎 food | 🌿 antidote | ☠️ poison
│         ⬆️ stairs up | ⬇️ stairs down | 🏠 base | A/B/C/D/E sailors
└──────────────────────────────────────────────────────────────┘
 0 🟫🟫🟫🟫🍎🟫🟫🟫🌲🟫🟫🟫🟫🟫🟫⚙️🟫🟫🟫🍎🟫🟫🟫🟫🟫🟫🟫🍎🟫🟫
 1 🍎🟫🟫🟫⚙️🟫🟫🟫🍎🟫🟫🟫🟫🟫☠️🟫🟫🟫🟫🟫🟫🟫🟫🌲🟫🟫🟫🟫🟫🟫
 2 🟫🟫🟫🟫🌲🟫🍎🟫🟫🟫🟫🌲🟫🟫🟫🟫☠️🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫
 3 🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🍎🟫⚙️🟫⚙️🟫⚙️🍎🟫🟫🍎🟫🟫🟫🟫🟫🟫🟫🟫
 4 🟫🟫🌲🟫🍎🟫🌲🍎🟫🟫🟫🟫🟫🟫🟫🟫⬇️🟫🟫🟫🟫🟫🟫🟫🍎🟫🟫🟫🟫🟫
 5 🍎🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🌲🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🍎🌲🟫🟫🟫
 6 🟫🟫🟫🟫🟫🟫🟫🟫🍎🟫🟫🍎🟫🟫🟫🟫🟫🍎🍎🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫
 7 🟫🟫🟫⚙️🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫⚙️🟫🟫🌲🍎🟫🟫🟫🟫
 8 ⚙️🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🍎🟫🟫🟫🍎🟫🟫🟫⚙️🟫🟫🟫🟫🟫🟫🟫🌲
 9 🟫⚙️🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🍎🟫🟫🟫🟫🟫🌲🟫🟫🌲☠️🟫🌲🌲
10 🟫🟫🟫🟫🟫🟫🟫🍎🟫🟫🟫🟫🟫🟫🟫🟫🟫⚙️🍎🟫🟫🟫🟫🟫🟫🍎🟫🟫🟫🟫
11 🟫☠️🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫⚙️🟫🟫🟫🟫🟫🍎🍎🟫🟫🟫🌲🟫🟫🌲🟫
12 🟫🍎🟫🟫⚙️🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫⚙️🟫🟫🌲🟫🟫🟫⚙️🟫🟫🟫🟫
13 🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🍎🟫🟫🟫🟫🟫🟫🟫🟫🟫

### What an Observation Looks Like

Each sailor receives their own observation with:
- Their position, energy, inventory
- Local spatial view (5-tile radius)
- Public information (other sailors' energy, ship progress)
- Evidence log (suspicious behavior)
- Messages from other sailors
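
The local view can be sketched as a distance filter. The sample prompt printed below lists a resource "6 tiles away" under "within 5 tiles", which would be consistent with a square view window (Chebyshev distance) combined with Manhattan walking distances. That reading is an inference from the printed output, not something confirmed by the environment's source, and `visible_resources` is a hypothetical helper that ignores map levels.

```python
from typing import Dict, List, Tuple

Pos = Tuple[int, int]


def manhattan(a: Pos, b: Pos) -> int:
    """Walking distance when movement is restricted to the four compass directions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def chebyshev(a: Pos, b: Pos) -> int:
    """Distance that defines a square window around a tile."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def visible_resources(me: Pos, resources: Dict[str, Pos],
                      radius: int = 5) -> List[Tuple[str, int]]:
    """Return (resource_id, walking_distance) for tiles inside the view window."""
    seen = [
        (rid, manhattan(me, pos))
        for rid, pos in resources.items()
        if chebyshev(me, pos) <= radius  # assumed square window
    ]
    return sorted(seen, key=lambda item: item[1])  # nearest first


# Coordinates copied from the sample prompt; FAR_1 is a made-up out-of-range tile
resources = {"WOOD_34": (16, 16), "METAL_53": (14, 11), "METAL_56": (18, 12), "FAR_1": (25, 15)}
print(visible_resources((15, 15), resources))
```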

In [4]:
# Get Alice's observation
alice_obs = observations["Alice"]
alice_role = env.state.sailors["Alice"].role.value

# Convert to prompt text (what the LLM will see)
prompt_text = observation_to_prompt(alice_obs, include_role=True, sailor_role=alice_role)

print("📜 ALICE'S OBSERVATION (as prompt):")
print("="*80)
print(prompt_text[:2000])  # Show first 2000 chars
print("...")
print(f"\n(Total prompt length: {len(prompt_text)} characters, ~{len(prompt_text)//4} tokens)")

📜 ALICE'S OBSERVATION (as prompt):
DAY 1, TURN 1/100 - MORNING PHASE

🎭 YOUR SECRET ROLE: TRAITOR

TRAITOR OBJECTIVES:
  - Prevent the ship from being completed before Day 100
  - Eliminate colonists (reduce to < 3 alive)
  - Use poison, sabotage, and deception
  - Avoid getting voted out
  - Lie convincingly to deflect suspicion


PHASE CONTEXT:
  Location: All sailors at BASE CAMP
  Allowed: Planning, discussions, voting (if called)
  Restricted: Cannot explore or gather resources yet

YOUR STATUS (Alice):
  Position: (15, 15, <MapLevel.GROUND: 0>)
  Energy: 100/100 ⚡⚡⚡⚡⚡
  Health: healthy
  Backpack: 0/20 items
    (empty)

WHAT YOU SEE (within 5 tiles):
  Resources:
    - WOOD_34 (wood) at (16, 16, <MapLevel.GROUND: 0>) - 1 units [2 tiles away]
    - METAL_53 (metal) at (14, 11, <MapLevel.GROUND: 0>) - 1 units [5 tiles away]
    - METAL_56 (metal) at (18, 12, <MapLevel.GROUND: 0>) - 1 units [6 tiles away]
    - METAL_76 (metal) at (14, 11, <MapLevel.GROUND: 0>) - 1 units [5 tiles a

### Action Space Design

**Key Innovation**: Natural language interface allows LLM to express intent naturally, parsed into structured actions.

**Movement** (6 actions):
- `MOVE NORTH/SOUTH/EAST/WEST [distance]` - Navigate current level
- `CLIMB UP/DOWN` - Traverse between island levels

**Resource Management** (5 actions):
- `GATHER <resource_id>` - Collect wood, metal, food from tiles
- `DEPOSIT <type> <quantity>` - Store in shared inventory
- `EAT <food_type>` - Restore energy (apples +15, berries +10)

**Ship Construction** (1 action):
- `BUILD <component>` - Requires ≥2 sailors, specific materials

**Communication** (3 actions):
- `SAY <message>` - Broadcast to all sailors
- `CALL_SOS` - Request energy assistance
- `CALL_VOTE` - Initiate voting session

**Voting** (3 actions):
- `VOTE <sailor_name>` - Cast elimination vote
- `SHOW_BACKPACK` - Reveal inventory (prove innocence)
- `REFUSE_SHOW` - Decline (suspicious behavior)

**Traitor-Specific** (2 actions):
- `SABOTAGE` - Damage ship (stealth required, no witnesses)
- `FRAME <sailor_name>` - Plant false evidence

**Passive** (1 action):
- `WAIT` - Conserve energy, observe

**Total**: 21 distinct actions enabling rich strategic depth
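
To show what "parsed into structured actions" can look like, here is a minimal sketch that pulls the verb and arguments out of an `ACTION:` line. It is not the project's `parse_action_safe`; `parse_action_line` is a hypothetical stand-in that only splits the text and does no validation against the action list above.

```python
import re
from typing import List, Optional, Tuple

# Matches e.g. "ACTION: MOVE NORTH 5" and captures the verb plus the rest of the line
ACTION_RE = re.compile(r"ACTION:\s*([A-Z_]+)(?:\s+(.*))?", re.IGNORECASE)


def parse_action_line(text: str) -> Optional[Tuple[str, List[str]]]:
    """Return (verb, args) for the first 'ACTION: ...' line, or None if there is none."""
    for line in text.splitlines():
        match = ACTION_RE.search(line)
        if match:
            verb = match.group(1).upper()
            args = (match.group(2) or "").split()
            return verb, args
    return None


print(parse_action_line("I should scout first.\nACTION: MOVE NORTH 5"))
print(parse_action_line("ACTION: GATHER WOOD_123"))
```

Returning `None` for unparseable text lets the caller fall back to a safe default such as `WAIT`.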

In [5]:
# Example: Let's make Alice move north
test_action = Action(
    sailor_id="Alice",
    action_type=ActionType.MOVE_NORTH
)

print(f"🚶 Alice's action: {test_action.action_type.value}")
print(f"\nThis will cost 1 energy per tile moved")

# You can also try other movement actions
other_actions = [
    ActionType.MOVE_SOUTH,
    ActionType.MOVE_EAST, 
    ActionType.MOVE_WEST,
    ActionType.CLIMB_UP,    # To mountain level
    ActionType.CLIMB_DOWN   # To cave level
]
print(f"\nOther movement options: {[a.value for a in other_actions]}")

🚶 Alice's action: move_north

This will cost 1 energy per tile moved

Other movement options: ['move_south', 'move_east', 'move_west', 'climb_up', 'climb_down']


## 🤖 Load GPT-OSS 20B Model

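We'll use Unsloth to load GPT-OSS with LoRA for efficient RL training. The cell requests 4-bit quantization, but on AMD ROCm Unsloth disables it and falls back to 16-bit LoRA (see the log below).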

In [6]:
from unsloth import FastLanguageModel
import torch

# Configuration
max_seq_length = 2048  # Longer context for complex game state
lora_rank = 8          # Larger rank for strategy learning

# Check GPU availability
if not torch.cuda.is_available():
    print("⚠️ No GPU detected!")
    print("Please ensure ROCm/CUDA is properly installed.")
    raise RuntimeError("GPU required for training")

# Load model
print("🔄 Loading GPT-OSS 20B model...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",
    load_in_4bit = True,
    max_seq_length = max_seq_length,
)

print("✅ GPT-OSS 20B loaded in 4-bit mode!")

# Display GPU info
device_name = torch.cuda.get_device_name(0)
total_memory = torch.cuda.get_device_properties(0).total_memory / 1e9

print(f"🎮 GPU: {device_name}")
print(f"💾 VRAM: {total_memory:.1f} GB")

# Special message for AMD Mi300X
if "MI300" in device_name.upper() or "AMD" in device_name.upper():
    print(f"🚀 AMD ROCm detected - excellent for large-scale RL training!")
    print(f"   Your Mi300X's {total_memory:.0f}GB memory is perfect for this task")

bitsandbytes library load error: Configured ROCm binary not found at /root/AIAC/colony-collapse/.venv/lib/python3.12/site-packages/bitsandbytes/libbitsandbytes_rocm64.so
Traceback (most recent call last):
  File "/root/AIAC/colony-collapse/.venv/lib/python3.12/site-packages/bitsandbytes/cextension.py", line 313, in <module>
    lib = get_native_library()
          ^^^^^^^^^^^^^^^^^^^^
  File "/root/AIAC/colony-collapse/.venv/lib/python3.12/site-packages/bitsandbytes/cextension.py", line 282, in get_native_library
    raise RuntimeError(f"Configured {BNB_BACKEND} binary not found at {cuda_binary_path}")
RuntimeError: Configured ROCm binary not found at /root/AIAC/colony-collapse/.venv/lib/python3.12/site-packages/bitsandbytes/libbitsandbytes_rocm64.so


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm
    PyTorch 2.8.0+cu128 with CUDA 1208 (you have 2.9.0+rocm6.4)
    Python  3.9.23 (you have 3.12.3)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


Switching to PyTorch attention since your Xformers is broken.

Unsloth: Xformers was not installed correctly.
Please install xformers separately first.
Then confirm if it's correctly installed by running:
python -m xformers.info

Longer error message:
xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.8.0+cu128 with CUDA 1208 (you have 2.9.0+rocm6.4)
    Python  3.9.23 (you have 3.12.3)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
🦥 Unsloth Zoo will now patch everything to make training faster!
🔄 Loading GPT-OSS 20B model...
Unsloth: AMD currently is not stable with 4bit bitsandbytes. Disabling for now.
==((====))==  Unsloth 2025.10.9: Fast Gpt_Oss patching. Transformers: 4.56.2.
   \\   /|    AMD Instinct MI300X VF. Num GPUs = 1. Max memory: 191.688 GB.

Loading checkpoint shards: 100%|██████████| 3/3 [00:05<00:00,  1.83s/it]


✅ GPT-OSS 20B loaded in 4-bit mode!
🎮 GPU: AMD Instinct MI300X VF
💾 VRAM: 205.8 GB
🚀 AMD ROCm detected - excellent for large-scale RL training!
   Your Mi300X's 206GB memory is perfect for this task


## 🎯 Strategy Extraction Functions

We need to convert LLM text output into executable game actions. This is more complex than in the 2048 game, since Marooned actions take multiple parameters.

In [10]:
from typing import Optional

def extract_strategy_from_response(response_text: str, sailor_id: str, current_position: Position) -> Optional[Action]:
    """
    Extract an Action from LLM response text.
    
    The LLM should output something like:
    ACTION: MOVE NORTH 5
    or
    ACTION: GATHER WOOD_123
    
    This uses the parse_action_safe function from llm_interface which handles all action types.
    """
    try:
        # Use the built-in parser from llm_interface
        action = parse_action_safe(response_text, sailor_id, current_position)
        return action
    except Exception as e:
        # If parsing fails, return None (will be handled with WAIT action)
        return None

print("✅ Strategy extraction function ready")

✅ Strategy extraction function ready


## 🏃 Game Execution Engine

This function runs a full game episode with LLM-generated strategies.

In [11]:
from typing import Callable, Dict, List, Tuple
from unsloth import execute_with_time_limit
import time

def _execute_game_episode(strategy_fn: Callable, max_turns: int = 500) -> Tuple[bool, Dict]:
    """
    Execute one full game episode using the strategy function.
    
    Args:
        strategy_fn: Function that takes (observation, sailor_id, role) and returns action text
        max_turns: Maximum turns before timeout
    
    Returns:
        (success, info_dict) where success=True if colonists won
    """
    # Reset environment
    env = MaroonedEnv(seed=None)  # Random seed for variety
    observations = env.reset()
    
    total_turns = 0
    total_reward = {sailor_id: 0.0 for sailor_id in env.agents}
    
    done = False
    
    while not done and total_turns < max_turns:
        # Get current active sailor
        active_sailor = env.state.get_active_sailor()
        
        if active_sailor is None:
            break
            
        obs = observations[active_sailor]
        sailor_role = env.state.sailors[active_sailor].role.value
        sailor_position = env.state.sailors[active_sailor].position
        
        # Get action from strategy
        try:
            action_text = strategy_fn(obs, active_sailor, sailor_role)
            action = extract_strategy_from_response(action_text, active_sailor, sailor_position)
            
            if action is None:
                # Fallback to WAIT
                action = Action(sailor_id=active_sailor, action_type=ActionType.WAIT)
        except Exception as e:
            print(f"Strategy error: {e}")
            action = Action(sailor_id=active_sailor, action_type=ActionType.WAIT)
        
        # Execute action (Gymnasium API returns 5 values)
        observations, rewards, dones, truncated, info = env.step({active_sailor: action})
        
        # Track rewards
        for sailor_id, reward in rewards.items():
            total_reward[sailor_id] += reward
        
        # Check if any agent is done
        done = any(dones.values()) if isinstance(dones, dict) else dones
        
        total_turns += 1
    
    # Check win condition
    colonists_won = False
    if env.state.game_over:
        if env.state.winner == "colonists":
            colonists_won = True
    
    info = {
        "total_turns": total_turns,
        "total_rewards": total_reward,
        "ship_progress": env.state.ship_progress.total_percentage,
        "survivors": len(env.state.living_sailors),
        "winner": env.state.winner if env.state.game_over else "timeout",
    }
    
    return colonists_won, info

@execute_with_time_limit(30)  # 30 second timeout per episode
def execute_game_episode(strategy_fn: Callable, max_turns: int = 500) -> Tuple[bool, Dict]:
    return _execute_game_episode(strategy_fn, max_turns)

print("✅ Game execution engine ready")

✅ Game execution engine ready


## 🧪 Test: Baseline Random Strategy

Let's test with a random strategy before training.

In [12]:
import random

def random_strategy(obs: Observation, sailor_id: str, role: str) -> str:
    """Random action generator for baseline"""
    actions = [
        "ACTION: MOVE NORTH",
        "ACTION: MOVE SOUTH",
        "ACTION: MOVE EAST",
        "ACTION: MOVE WEST",
        "ACTION: WAIT",
    ]
    return random.choice(actions)

print("Testing random baseline strategy...")
try:
    success, info = execute_game_episode(random_strategy, max_turns=100)
    print(f"\n✅ Game completed!")
    print(f"  Winner: {info['winner']}")
    print(f"  Turns: {info['total_turns']}")
    print(f"  Ship progress: {info['ship_progress']:.1f}%")
    print(f"  Survivors: {info['survivors']}/5")
except TimeoutError:
    print("⏱️ Episode timed out (expected with random strategy)")
except Exception as e:
    print(f"❌ Error: {e}")

Testing random baseline strategy...

✅ Game completed!
  Winner: timeout
  Turns: 100
  Ship progress: 0.0%
  Survivors: 4/5


## Reward Engineering

**Multi-Objective Optimization**: Training requires balancing syntax correctness, security, and gameplay performance.

### Reward Functions

1. **`function_works`**: Code Validity
   - +1.0 for syntactically correct Python
   - +0.5 for valid action format
   - -2.0 for syntax errors

2. **`no_cheating`**: Security
   - +1.0 for using only allowed modules
   - -20.0 for attempting forbidden imports (prevents exploitation)

3. **`strategy_succeeds`**: Game Performance (Primary)
   - **+50.0** Colonists complete ship / eliminate traitor
   - **+20.0** Traitor successfully sabotages mission
   - **+0-10.0** Partial ship progress (linear scaling)
   - **+1.0 per survivor** Keeping sailors alive
   - **-1.0** Timeout/failure
   - **-3.0** Invalid strategy

**Innovation**: Rewards both cooperative (colonist) and competitive (traitor) strategies simultaneously, encouraging the model to learn dual objectives.
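
The `strategy_succeeds` terms listed above can be combined in several ways. The sketch below assumes they are simply added, which is a guess rather than the notebook's actual formula. `episode_reward` is a hypothetical name, and `"traitor"` as a winner label is also an assumption (the notebook's code only shows `"colonists"` and `"timeout"`).

```python
def episode_reward(winner: str, ship_progress: float,
                   survivors: int, valid: bool = True) -> float:
    """Combine the listed reward terms additively (assumed, not confirmed)."""
    if not valid:
        return -3.0  # invalid strategy
    if winner == "colonists":
        base = 50.0  # ship completed or traitor eliminated
    elif winner == "traitor":
        base = 20.0  # sabotage succeeded
    else:
        base = -1.0  # timeout or failure
    # Partial progress scales linearly from 0 to 10 over 0-100% ship completion
    progress_bonus = 10.0 * max(0.0, min(ship_progress, 100.0)) / 100.0
    return base + progress_bonus + 1.0 * survivors
```

Under these assumptions a timed-out game with no progress and four survivors scores 3.0, so the survivor term alone can outweigh the timeout penalty. That kind of interaction is worth checking when tuning the weights.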

In [13]:
from unsloth import check_python_modules

def extract_function_from_completion(text: str) -> Optional[str]:
    """
    Extract Python function from LLM completion if wrapped in backticks.
    For Marooned, we expect a strategy function.
    """
    if text.count("```") >= 2:
        first = text.find("```") + 3
        second = text.find("```", first)
        fx = text[first:second].strip()
        fx = fx.removeprefix("python\n")
        fx = fx[fx.find("def"):] if "def" in fx else fx
        return fx
    return None

def function_works(completions, **kwargs) -> List[float]:
    """
    Reward: +1.0 for valid Python code, +0.5 for a bare `ACTION:` line, -2.0 for anything else.
    """
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        function = extract_function_from_completion(response)
        
        if function is None:
            # Try parsing as direct action instead
            if "ACTION:" in response.upper():
                scores.append(0.5)  # Valid format, not a function
            else:
                scores.append(-2.0)  # Invalid output
        else:
            ok, info = check_python_modules(function)
            if "error" in info:
                scores.append(-2.0)
            else:
                scores.append(1.0)
    
    return scores

def no_cheating(completions, **kwargs) -> List[float]:
    """
    Penalty: -20.0 if LLM tried to import non-standard libraries.
    """
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        function = extract_function_from_completion(response)
        
        if function is not None:
            ok, info = check_python_modules(function)
            scores.append(1.0 if ok else -20.0)
        else:
            scores.append(0.0)  # Not a function, can't cheat
    
    return scores

print("✅ Basic reward functions defined")

✅ Basic reward functions defined
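
For a quick sanity check without Unsloth installed, the extraction and syntax-check steps can be approximated with the standard library. This is only a sketch: `ast.parse` stands in for `check_python_modules` (it checks syntax but not imports), and `extract_fenced_code` and `is_valid_python` are hypothetical helper names.

```python
import ast
from typing import Optional

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fenced block


def extract_fenced_code(text: str) -> Optional[str]:
    """Return the contents of the first fenced block, or None if there is none."""
    if text.count(FENCE) < 2:
        return None
    start = text.find(FENCE) + len(FENCE)
    end = text.find(FENCE, start)
    body = text[start:end].strip()
    body = body.removeprefix("python").lstrip()
    # Drop any chatter before the first function definition
    return body[body.find("def"):] if "def" in body else body


def is_valid_python(source: str) -> bool:
    """True if the source parses; a syntax-only stand-in for the real check."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


sample = (
    "Here is my plan:\n"
    f"{FENCE}python\n"
    "def strategy(observation, sailor_id, role):\n"
    '    return "ACTION: MOVE NORTH"\n'
    f"{FENCE}"
)
code = extract_fenced_code(sample)
print(is_valid_python(code))            # True
print(is_valid_python("def broken(:"))  # False
```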


### Game Performance Reward

This is the most important reward - did the strategy actually help win the game?

In [14]:
from unsloth import create_locked_down_function

# Module-level counter used to throttle logging inside the reward function
EPISODE_COUNTER = 0

def strategy_succeeds(completions, **kwargs) -> List[float]:
    """
    Main reward function:
    - Large reward (+50.0) if colonists win
    - Good reward (+20.0) if traitor wins
    - Partial credit otherwise: up to +10.0 for ship progress, +1.0 per survivor
    - Small penalty (-1.0) on episode timeout
    - Penalty (-2.0 to -3.0) for invalid or non-executable strategies
    """
    global EPISODE_COUNTER
    scores = []
    
    for completion in completions:
        printed = False
        response = completion[0]["content"]
        
        # Print every 10th episode for monitoring
        if EPISODE_COUNTER % 10 == 0:
            printed = True
            print(f"\n{'='*80}")
            print(f"Episode {EPISODE_COUNTER} - Testing strategy:")
            print(response[:500])
        
        EPISODE_COUNTER += 1
        
        # Try to extract and execute strategy
        function = extract_function_from_completion(response)
        
        if function is None:
            # Maybe it's a direct action format
            if "ACTION:" in response.upper():
                scores.append(0.0)  # Valid format but not testable as full strategy
            else:
                scores.append(-3.0)  # Invalid
            continue
        
        # Check for syntax errors
        ok, info = check_python_modules(function)
        if "error" in info:
            scores.append(-3.0)
            continue
        
        # Try to create executable function
        try:
            strategy_fn = create_locked_down_function(function)
        except Exception as e:
            if printed:
                print(f"  ❌ Function creation failed: {e}")
            scores.append(-2.0)
            continue
        
        # Run game episode with this strategy
        try:
            success, info = execute_game_episode(strategy_fn, max_turns=300)
            
            # Calculate reward
            score = 0.0
            
            if success:
                score += 50.0  # Colonists won!
                if printed:
                    print(f"  🎉 COLONISTS WON! Ship: {info['ship_progress']:.1f}%")
            elif info['winner'] == 'traitor':
                score += 20.0  # Traitor won!
                if printed:
                    print(f"  😈 TRAITOR WON! Ship: {info['ship_progress']:.1f}%")
            else:
                # Reward partial progress
                score += info['ship_progress'] / 10.0  # Up to +10.0 for 100% ship
                score += info['survivors']  # +1 per survivor
                if printed:
                    print(f"  ⚠️ No winner. Ship: {info['ship_progress']:.1f}%, Survivors: {info['survivors']}")
            
            scores.append(score)
            
        except TimeoutError:
            if printed:
                print("  ⏱️ Episode timeout")
            scores.append(-1.0)
        except Exception as e:
            if printed:
                print(f"  ❌ Execution error: {e}")
            scores.append(-3.0)
    
    return scores

print("✅ Main reward function defined")

✅ Main reward function defined


## Training Dataset Construction

**Diverse Prompt Engineering**: Four specialized prompt archetypes, each repeated 250 times, to encourage varied behaviors:

1. **Cooperative Colonist** (250 examples)
   - Exploration and resource gathering
   - Efficient energy management
   - Team coordination via communication
   - Traitor detection through observation

2. **Deceptive Traitor** (250 examples)
   - Pretending to cooperate
   - Stealth sabotage mechanics
   - Lying and misdirection
   - Avoiding suspicion

3. **Resource Optimizer** (250 examples)
   - Prioritizing wood/metal for ship building
   - Food management for energy
   - Deposit patterns and logistics

4. **Social Detective** (250 examples)
   - Evidence gathering and analysis
   - Strategic voting
   - Communication-based deduction

**Total**: 1,000 training examples (4 prompts × 250 copies), evenly split across the four archetypes

In [16]:
from datasets import Dataset

# Create varied prompts for different roles and scenarios
prompts = [
    # Colonist strategy prompts
    """You are a colonist sailor in the Marooned game. Create a Python strategy function that:
1. Explores the island to find resources (wood, metal, food)
2. Gathers resources efficiently while managing energy
3. Returns to base camp to deposit resources
4. Helps build the ship to escape
5. Watches for suspicious behavior from other sailors

The function receives (observation, sailor_id, role) and returns an action string.

Examples of valid actions:
- "ACTION: MOVE NORTH" - Move 1 tile north
- "ACTION: MOVE EAST 5" - Move 5 tiles east
- "ACTION: CLIMB UP" - Climb to mountain level
- "ACTION: GATHER WOOD_001" - Gather a specific resource
- "ACTION: DEPOSIT wood 10" - Deposit 10 wood at base
- "ACTION: BUILD hull" - Help build ship hull
- "ACTION: SAY Found resources at north" - Communicate

```python
def strategy(observation, sailor_id, role):
    # Your strategy here
    return "ACTION: MOVE NORTH"
```""",
    
    # Traitor strategy prompts  
    """You are the TRAITOR in the Marooned game. Create a Python strategy function that:
1. Pretends to help gather resources but sabotages secretly
2. Uses SABOTAGE action to damage ship progress
3. Lies in messages to deflect suspicion
4. Avoids getting caught and voted out

The function receives (observation, sailor_id, role) and returns an action string.

Traitor-specific actions:
- "ACTION: SABOTAGE" - Damage ship progress (stealth required)
- "ACTION: FRAME Bob" - Plant false evidence on Bob
- "ACTION: SAY <false claim>" - Lie to deflect suspicion

```python
def strategy(observation, sailor_id, role):
    # Your deceptive strategy here
    if role == "traitor":
        return "ACTION: SABOTAGE"
    return "ACTION: MOVE NORTH"
```""",
    
    # Resource management focus
    """Create an efficient resource gathering strategy for Marooned:
- Prioritize wood and metal for ship building
- Gather food when energy is below 50
- Deposit resources at base camp regularly
- Coordinate with teammates via SAY action

Available actions:
- MOVE <direction> [distance] - Navigate (NORTH/SOUTH/EAST/WEST)
- CLIMB UP/DOWN - Change levels
- GATHER <resource_id> - Collect resources
- DEPOSIT <type> <quantity> - Store at base
- EAT <food_type> - Restore energy
- SAY <message> - Communicate

```python
def strategy(observation, sailor_id, role):
    # Resource-focused strategy
    return "ACTION: GATHER WOOD_001"
```""",
    
    # Social deduction focus
    """Create a detective strategy for finding the traitor in Marooned:
- Monitor who deposits fewer resources than claimed
- Watch for suspicious behavior in observations
- Use VOTE action when evidence is strong
- Communicate suspicions with SAY action

Social actions:
- "ACTION: SAY I suspect Bob of sabotage" - Share suspicions
- "ACTION: CALL_VOTE" - Initiate voting session
- "ACTION: VOTE Bob" - Vote to eliminate Bob
- "ACTION: SHOW_BACKPACK" - Prove innocence

```python  
def strategy(observation, sailor_id, role):
    # Detective strategy
    return "ACTION: VOTE Bob"
```""",
]

# Create dataset with multiple copies for more training data
dataset_entries = []
for prompt in prompts:
    for _ in range(250):  # 250 copies of each = 1000 total
        dataset_entries.append({
            "prompt": [{"role": "user", "content": prompt.strip()}],
            "answer": 0,
            "reasoning_effort": "low"
        })

dataset = Dataset.from_list(dataset_entries)

# Calculate max prompt length
max_prompt_lengths = []
for entry in dataset:
    # Use tokenize=True to get token IDs directly
    tokens = tokenizer.apply_chat_template(
        entry["prompt"], 
        add_generation_prompt=True,
        tokenize=True
    )
    max_prompt_lengths.append(len(tokens))

maximum_prompt_length = max(max_prompt_lengths)

print(f"✅ Dataset created: {len(dataset)} entries")
print(f"📏 Max prompt length: {maximum_prompt_length} tokens")
print(f"\nSample prompt (first 500 chars):")
print(dataset[0]['prompt'][0]['content'][:500])

✅ Dataset created: 1000 entries
📏 Max prompt length: 301 tokens

Sample prompt (first 500 chars):
You are a colonist sailor in the Marooned game. Create a Python strategy function that:
1. Explores the island to find resources (wood, metal, food)
2. Gathers resources efficiently while managing energy
3. Returns to base camp to deposit resources
4. Helps build the ship to escape
5. Watches for suspicious behavior from other sailors

The function receives (observation, sailor_id, role) and returns an action string.

Examples of valid actions:
- "ACTION: MOVE NORTH" - Move 1 tile north
- "ACTIO


## GRPO Training Configuration

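**Group Relative Policy Optimization** (GRPO) scores each completion relative to the other completions sampled for the same prompt, so it needs no separate value model. That suits this setup, where each reward comes from a whole simulated game episode.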

**Optimizations for Long-Horizon Tasks:**
- `max_seq_length = 2048` - Extended context for complex game states
- `lora_rank = 8` - Balanced parameter efficiency vs learning capacity
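- `gradient_accumulation_steps = 2` - Accumulates gradients over 2 micro-batches before each optimizer step
- `num_generations = 2` - Samples 2 candidate strategies per prompt, the smallest group that allows a relative comparison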
- `max_steps = 400` - Hackathon-scoped training (expandable to 1000+)
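
As a rough illustration of what "group relative" means, the sketch below normalizes the rewards of completions sampled for one prompt. It uses the population standard deviation for simplicity; actual trainer implementations may differ in the normalization details, and `group_relative_advantages` is a hypothetical name.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Center and scale rewards within one group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two completions for the same prompt: one stalls, one wins as colonists
print(group_relative_advantages([6.0, 52.0]))

# Identical rewards carry no learning signal: every advantage is zero
print(group_relative_advantages([6.0, 6.0]))
```

With a group size of 2, a step only produces a gradient signal when the two sampled strategies score differently, which is one reason to consider a larger `num_generations` if memory allows.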

**Hardware Requirements:**
- GPU: AMD MI300X (192GB) or NVIDIA A100 (80GB)
- Training time: ~3-5 hours for 400 steps
- Checkpoints saved every 100 steps for ablation studies

In [17]:
from trl import GRPOConfig, GRPOTrainer

max_prompt_length = maximum_prompt_length + 10
max_completion_length = max_seq_length - max_prompt_length

print(f"Max prompt: {max_prompt_length} tokens")
print(f"Max completion: {max_completion_length} tokens")

training_args = GRPOConfig(
    temperature = 1.0,
    learning_rate = 5e-5,
    weight_decay = 0.01,
    warmup_ratio = 0.1,
    lr_scheduler_type = "linear",
    optim = "adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 2,  # Unsloth raises the per-device batch to 2 (see log below), so total batch = 2 x 2 = 4
    num_generations = 2,  # Generate 2 strategies per prompt
    max_prompt_length = max_prompt_length,
    max_completion_length = max_completion_length,
    max_steps = 400,  # Reduced for hackathon timeframe
    save_steps = 100,
    report_to = "trackio",
    output_dir = "outputs_marooned",
)

print("✅ Training config ready")

Max prompt: 311 tokens
Max completion: 1737 tokens
Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 2
✅ Training config ready
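
The token budget and batch size reported in this notebook's logs can be reproduced with simple arithmetic. The sketch below plugs in the logged values; the last line assumes each prompt contributes exactly `num_generations` completions to a batch, which is the usual reading but is not confirmed by the logs.

```python
# Values taken from the notebook's configuration and logs
max_seq_length = 2048
maximum_prompt_length = 301   # longest tokenized prompt in the dataset
num_generations = 2
per_device_batch = 2          # raised from 1 by Unsloth to match num_generations
gradient_accumulation_steps = 2
num_gpus = 1

max_prompt_length = maximum_prompt_length + 10              # small safety margin
max_completion_length = max_seq_length - max_prompt_length  # tokens left for the strategy

total_batch = per_device_batch * gradient_accumulation_steps * num_gpus
prompts_per_step = total_batch // num_generations           # assumption, see lead-in

print(max_prompt_length, max_completion_length, total_batch, prompts_per_step)
```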


## 🚀 Initialize Trainer

Combine everything: model, rewards, dataset, and config.

In [18]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        function_works,
        no_cheating,
        strategy_succeeds,  # Most important - game performance
    ],
    args = training_args,
    train_dataset = dataset,
)

print("✅ GRPO Trainer initialized!")
print(f"\n📊 Reward functions:")
print("  1. function_works: Valid Python syntax")
print("  2. no_cheating: No forbidden imports")
print("  3. strategy_succeeds: Actual game performance")

Loading checkpoint shards: 100%|██████████| 3/3 [00:12<00:00,  4.13s/it]



✅ GRPO Trainer initialized!

📊 Reward functions:
  1. function_works: Valid Python syntax
  2. no_cheating: No forbidden imports
  3. strategy_succeeds: Actual game performance


## Training Execution

**What to Monitor:**
- **Reward Trajectory**: Should increase from negative → positive over ~100-200 steps
- **Episode Counter**: Tracks simulated games (each step tests 2+ strategies)
- **Strategy Diversity**: Model explores colonist cooperation vs traitor deception
- **TrackIO Metrics**: Real-time visualization of training dynamics

**Expected Progression:**
- **Steps 0-100**: Learning valid syntax, basic actions
- **Steps 100-200**: Developing resource gathering patterns
- **Steps 200-300**: Emergent cooperation and deception strategies
- **Steps 300-400**: Refined ship building and traitor detection

**Timeline**: 3-5 hours on high-end GPU (MI300X/A100)
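
Per-step rewards are noisy when each step evaluates only a handful of game episodes, so a rolling mean makes the trajectory easier to read. The helper below is a sketch that works on any plain list of numbers; how the rewards are pulled out of the training logs (for example, from a logged reward field) is left as an assumption.

```python
from collections import deque
from typing import List


def rolling_mean(values: List[float], window: int = 20) -> List[float]:
    """Mean of the last `window` values at each position (shorter at the start)."""
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]   # value about to be evicted by the append below
        buf.append(v)
        total += v
        out.append(total / len(buf))
    return out


# Hypothetical reward trace moving from negative to positive
print(rolling_mean([-3.0, -1.0, 2.0, 6.0], window=2))
```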

In [19]:
trainer.train()

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'bos_token_id': 199998, 'pad_token_id': 200017}.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 1,000 | Num Epochs = 1 | Total steps = 400
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 2 x 1) = 4
 "-____-"     Trainable parameters = 0 of 20,914,757,184 (0.00% trained)


* Trackio project initialized: huggingface
* Trackio metrics logged to: /root/.cache/huggingface/trackio


`generation_config` default values have been modified to match model-specific defaults: {'max_length': 131072}. If this is not desired, please set these values explicitly.


* Created new run: dainty-sunset-0


UnboundLocalError: cannot access local variable 'tracer_output' where it is not associated with a value

## 🎮 Test the Trained Model

Let's see if the trained model can play better than random!

In [None]:
from transformers import TextStreamer

# Test with a colonist prompt
test_prompt = """Create a smart colonist strategy for Marooned that:
- Explores efficiently to find wood and metal
- Manages energy by eating food when low
- Deposits resources regularly at base camp
- Helps build the ship
- Watches for the traitor

```python
def strategy(observation, sailor_id, role):
    # Your optimized strategy
"""

text = tokenizer.apply_chat_template(
    [{"role": "user", "content": test_prompt}],
    tokenize = False,
    add_generation_prompt = True,
    reasoning_effort = "low",
)

print("🤖 Generating strategy with trained model...\n")
print("="*80)

_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    temperature = 1.0,
    max_new_tokens = 512,
    streamer = TextStreamer(tokenizer, skip_prompt = False),
)

## 💾 Save the Model

Save the trained model for later use or deployment.

In [None]:
# Save in 16-bit format
model.save_pretrained_merged("marooned_gpt_oss_trained", tokenizer, save_method = "merged_16bit")

print("✅ Model saved to ./marooned_gpt_oss_trained")

# Optional: Push to Hugging Face Hub
# model.push_to_hub_merged("your-username/marooned-gpt-oss", tokenizer, save_method = "merged_16bit", token = "hf_...")

## 🎯 Evaluation: Full Game Playthrough

Let's run a complete game with the trained model and visualize the results!

In [None]:
print("🏴‍☠️ Running complete game with trained model...\n")

# Create a simple strategy wrapper that uses the trained model
def trained_model_strategy(obs: Observation, sailor_id: str, role: str) -> str:
    """Generate action using trained GPT-OSS model"""
    prompt = observation_to_prompt(obs, include_role=True, sailor_role=role)
    
    # Add instruction to output action
    prompt += "\n\nOutput your next action in the format: ACTION: <action_type> <parameters>"
    
    text = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize = False,
        add_generation_prompt = True,
        reasoning_effort = "low",
    )
    
    # Generate with trained model
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    output = model.generate(
        **inputs,
        temperature=0.8,
        max_new_tokens=256,
        do_sample=True,
    )
    
    # Decode only the newly generated tokens so the prompt is not echoed back
    prompt_len = inputs["input_ids"].shape[1]
    response = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
    
    return response

# Run evaluation game
try:
    success, info = execute_game_episode(trained_model_strategy, max_turns=500)
    
    print("\n" + "="*80)
    print("🎮 GAME RESULTS")
    print("="*80)
    print(f"Winner: {info['winner']}")
    print(f"Total turns: {info['total_turns']}")
    print(f"Ship progress: {info['ship_progress']:.1f}%")
    print(f"Survivors: {info['survivors']}/5")
    print(f"\nTotal rewards by sailor:")
    for sailor, reward in info['total_rewards'].items():
        print(f"  {sailor}: {reward:.2f}")
    
except TimeoutError:
    print("⏱️ Game exceeded time limit")
except Exception as e:
    print(f"❌ Error during evaluation: {e}")

## Technical Achievements & Innovations

### Environment Design

**1. Custom OpenEnv Implementation**
- Full Gymnasium API compliance (reset, step, render, observation_space, action_space)
- Multi-agent turn-based coordination with 5 independent actors
- Three-level spatial representation (30×30 ground, 10×10 mountain, 15×15 cave)
- Rich state encoding: 100+ dimensional observation space per agent

**2. Information Asymmetry**
- **Hidden Roles**: Only traitor knows their identity
- **Private Backpacks**: Inventory visible only to owner
- **Stealth Mechanics**: Sabotage succeeds only when unobserved
- **Evidence System**: Suspicious actions logged with witness tracking

**3. Long-Horizon Complexity**
- Episodes span 100 days = up to 10,000 turns
- Credit assignment across hundreds of actions
- Delayed consequences (poison takes 3 days, ship requires 100+ resources)

### RL Training Innovations

**4. Dual-Objective Learning**
- Single model learns both cooperative (colonist) and competitive (traitor) policies
- Balanced reward structure prevents mode collapse
- Emergent social dynamics from self-play potential

**5. Natural Language Action Space**
- 21 distinct action types with variable parameters
- LLM-friendly interface (text → structured action parser)
- Extensible to future action types without retraining
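
A text-to-action parser along these lines can be quite small. The sketch below is not the environment's actual parser: it assumes the `ACTION: <type> <parameters>` format shown in the training prompts, and `ParsedAction` and `parse_action` are hypothetical names.

```python
import re
from typing import List, NamedTuple, Optional


class ParsedAction(NamedTuple):
    action_type: str
    params: List[str]


# Matches e.g. "ACTION: MOVE EAST 5" anywhere in a line, case-insensitively
ACTION_RE = re.compile(r"ACTION:\s*([A-Za-z_]+)(.*)", re.IGNORECASE)


def parse_action(response: str) -> Optional[ParsedAction]:
    """Return the first action found in a model response, or None."""
    for line in response.splitlines():
        match = ACTION_RE.search(line)
        if match:
            # Normalize the action type; keep parameters as whitespace-split tokens
            return ParsedAction(match.group(1).upper(), match.group(2).split())
    return None


print(parse_action("I should head back.\nACTION: DEPOSIT wood 10"))
```

Splitting parameters on whitespace is a simplification: a real parser would likely keep the full message text for `SAY` intact.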

**6. Multi-Component Rewards**
- Syntax correctness (coding)
- Security enforcement (sandboxing)
- Gameplay performance (strategy)
- Encourages valid, safe, and effective policies

### Comparison to Baseline (2048)

| Metric | 2048 | Marooned |
|--------|------|----------|
| Agents | 1 | 5 (multi-agent) |
| Actions | 4 (fixed) | 21+ (parameterized) |
| Horizon | ~1,000 moves | ~10,000 turns |
| Information | Perfect | Asymmetric (hidden roles) |
| Objectives | Single (score) | Dual (cooperate/deceive) |
| State Space | 16 tiles | 1,225 tiles (900 + 100 + 225) + inventories + social state |
| Complexity | Deterministic | Stochastic, social, strategic |

### Why This Advances OpenEnv

**Research Contributions:**
1. **Social AI**: First OpenEnv environment with deception mechanics
2. **Multi-Agent RL**: Demonstrates scaling beyond single-agent scenarios
3. **Long-Horizon Planning**: Pushes RL to 100-day planning problems
4. **LLM Integration**: Shows natural language can be effective action space
5. **Reusable Components**: Open-source environment for future research

**Potential Applications:**
- Negotiation and diplomacy training
- Collaborative AI with trust modeling
- Adversarial robustness testing
- Social simulation for game AI

---

## Results & Insights

**Baseline Performance** (Random Strategy):
- Ship Progress: 0.0% (100 turns)
- Survivors: 4/5 (energy depletion)
- Winner: Timeout

**Expected Trained Performance** (400 steps):
- Ship Progress: 15-30% (strategic resource gathering)
- Survivors: 5/5 (energy management learned)
- Win Rate: 5-10% colonist victories

**Projected Performance with Extended Training** (1000+ steps):
- Ship Progress: 50-80%
- Win Rate: 20-40% colonist victories
- Emergent Strategies: Role-based specialization, traitor detection patterns

---

## Future Directions

**Environment Enhancements:**
- Dynamic weather affecting resource availability
- Ship repair mechanics (traitor can damage, colonists must fix)
- Multiple traitors (e.g., 2 traitors in 7-sailor game)
- Skill specialization (woodcutter, builder, navigator roles)

**Training Improvements:**
- Self-play tournaments between checkpoints
- Colonist-only vs traitor-only specialized models
- Curriculum learning (start with shorter games, increase duration)
- Multi-environment training (forest island, desert island variants)

**Evaluation Metrics:**
- Win rate vs random baseline
- Win rate vs heuristic strategies
- Traitor detection accuracy
- Resource gathering efficiency
- Communication quality (informativeness, deception detection)
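
Several of these metrics can be computed directly from the per-episode `info` dictionaries. The sketch below assumes each episode record has the `winner`, `ship_progress`, and `survivors` keys used elsewhere in this notebook; the `"colonists"` winner label and the `summarize_episodes` helper are assumptions.

```python
from typing import Dict, List


def summarize_episodes(episodes: List[dict]) -> Dict[str, float]:
    """Aggregate win rates and averages over a batch of finished episodes."""
    n = len(episodes)
    if n == 0:
        raise ValueError("need at least one episode")
    return {
        "colonist_win_rate": sum(e["winner"] == "colonists" for e in episodes) / n,
        "traitor_win_rate": sum(e["winner"] == "traitor" for e in episodes) / n,
        "mean_ship_progress": sum(e["ship_progress"] for e in episodes) / n,
        "mean_survivors": sum(e["survivors"] for e in episodes) / n,
    }


# Made-up episode records, for illustration only
episodes = [
    {"winner": "timeout", "ship_progress": 0.0, "survivors": 4},
    {"winner": "colonists", "ship_progress": 100.0, "survivors": 5},
    {"winner": "traitor", "ship_progress": 40.0, "survivors": 2},
    {"winner": "timeout", "ship_progress": 20.0, "survivors": 5},
]
print(summarize_episodes(episodes))
```

Running the same summary over random, heuristic, and trained strategies would give directly comparable numbers for the first two metrics in the list.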

---

## Acknowledgments

**Built For**: OpenEnv Hackathon (Meta PyTorch + Unsloth AI)

**Technical Stack:**
- Environment: Custom Marooned (Gymnasium-compatible)
- Model: GPT-OSS 20B (4-bit quantization)
- Training: GRPO via Unsloth + TRL
- Hardware: AMD MI300X (192GB VRAM)

**Inspiration:**
- Game Design: *Among Us* (social deduction) × *Don't Starve Together* (survival) × *The Resistance* (hidden roles)
- RL Research: Multi-agent cooperation, emergent communication, deception learning

---

**License**: LGPL-3.0 (Environment code available at: https://github.com/atchudhansg/colony-collapse)

## 🛑 Note on Server Management

The Marooned server is running externally in a separate terminal. 
- To stop it, use `CTRL+C` in the terminal where you started it
- The server needs to keep running while this notebook executes
- You can monitor server logs in the terminal window

In [None]:
# The server is running externally, so no cleanup needed in this notebook
print("ℹ️ Server is running externally")
print("   To stop it, use CTRL+C in the terminal where you started marooned_server.py")