# 🏴‍☠️ MAROONED - Reinforcement Learning Training
## Training GPT-OSS to Play a Social Deduction Survival Game

<a href="https://colab.research.google.com/github/atchudhansg/colony-collapse/blob/main/phase7_rl_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---

## 🎮 The Game: Pirates Meet Among Us

**Scenario:** 5 sailors shipwrecked on a mysterious island must rebuild their ship within 100 days to escape. But one sailor is secretly a **traitor** sabotaging their efforts.

**The Twist:**
- 🏝️ **Multi-level island** (Ground, Mountain, Caves) with resources
- ⚡ **Energy management** - walk, climb, gather resources
- 🎒 **Hidden inventories** - private backpacks create information asymmetry  
- ☠️ **Poison system** - traitor can secretly eliminate colonists
- 🗳️ **Voting & accusations** - social deduction mechanics
- 🚢 **Cooperative building** - requires teamwork to complete ship

**Win Conditions:**
- **Colonists win:** Build ship to 100% OR eliminate the traitor
- **Traitor wins:** Prevent ship completion for 100 days OR reduce the colonists to fewer than 3 alive
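
The win conditions above can be sketched as a small check. This is an illustration, not the environment's actual code: `check_winner` and its parameters are hypothetical names, the "fewer than 3 colonists" threshold is taken from the traitor objectives shown in the sample prompt later in this notebook, and the order in which ties are resolved is an assumption.

```python
from typing import Optional

MAX_DAYS = 100       # from the scenario description
MIN_COLONISTS = 3    # assumption: traitor wins once fewer than 3 colonists remain


def check_winner(ship_percent: float, day: int,
                 traitor_alive: bool, colonists_alive: int) -> Optional[str]:
    """Return "colonists", "traitor", or None while the game is still running."""
    # Colonists win by finishing the ship or eliminating the traitor.
    if ship_percent >= 100.0 or not traitor_alive:
        return "colonists"
    # Traitor wins by running out the clock or thinning out the crew.
    if day > MAX_DAYS or colonists_alive < MIN_COLONISTS:
        return "traitor"
    return None
```

Checking the colonist conditions first means a ship finished on the last day counts as a colonist win; the real environment may resolve that case differently.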

---

## 🎯 RL Training Goal

We'll train **GPT-OSS 20B** to:
1. **Play as colonists** - cooperate, gather resources, build ship, detect traitor
2. **Play as traitor** - sabotage, deceive, poison, avoid detection
3. **Learn strategy** through reinforcement learning with GRPO

This is a **multi-agent, long-horizon, deception-based** RL challenge - far more complex than 2048!
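
For intuition about GRPO: it samples a group of completions for the same prompt and scores each one relative to the rest of its group, so no separate value model is needed. A minimal sketch of that normalization step, assuming population standard deviation and a small epsilon for stability (real implementations differ in such details; `group_relative_advantages` is an illustrative name, not a TRL function):

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every completion got the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is one reason the reward functions later in this notebook give graded scores rather than a single pass/fail.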

---

In [6]:
%%capture
import os, importlib.util
!pip install --upgrade -qqq uv
if importlib.util.find_spec("torch") is None or "COLAB_" in "".join(os.environ.keys()):
    try: import numpy; get_numpy = f"numpy=={numpy.__version__}"
    except: get_numpy = "numpy"
    !uv pip install -qqq \
        "torch>=2.8.0" "triton>=3.4.0" {get_numpy} torchvision bitsandbytes "transformers==4.56.2" trackio \
        "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo" \
        "unsloth[base] @ git+https://github.com/unslothai/unsloth" \
        git+https://github.com/triton-lang/triton.git@05b2c186c1b6c9a08375389d5efe9cb4c401c075#subdirectory=python/triton_kernels
elif importlib.util.find_spec("unsloth") is None:
    !uv pip install -qqq unsloth trackio
!uv pip install --upgrade --no-deps transformers==4.56.2 tokenizers trl==0.22.2 unsloth unsloth_zoo trackio

In [7]:
%%capture
!pip install -qqq fastapi uvicorn requests open_spiel
!git clone https://github.com/meta-pytorch/OpenEnv.git > /dev/null 2>&1
%cd OpenEnv
import subprocess, sys, os
from pathlib import Path
sys.path.insert(0, './src')
working_directory = str(Path.cwd().parent.absolute() / "OpenEnv")

In [9]:
import torch
print("ROCm available:", torch.version.hip)  # HIP version string on ROCm builds, None otherwise
print("CUDA available:", torch.cuda.is_available())
print("MPS available:", torch.backends.mps.is_available())
print("Devices:", torch.cuda.device_count())
print("Device name:", torch.cuda.get_device_name(0) if torch.cuda.device_count() > 0 else "None")


ROCm available: None
CUDA available: False
MPS available: False
Devices: 0
Device name: None


In [8]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 768 # Can increase for longer RL output
lora_rank = 4        # Larger rank = smarter, but slower
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",
    load_in_4bit = True,
    max_seq_length = max_seq_length,
)

ImportError: Unsloth: Please install unsloth_zoo via `pip install unsloth_zoo`

In [4]:
pip install unsloth_zoo


[notice] A new release of pip is available: 25.0.1 -> 25.3
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


## 📦 Installation

We'll install:
- **Unsloth** - 2-6x faster RL training, 70% less VRAM
- **Transformers, TRL** - For GPT-OSS and GRPO training
- **TrackIO** - Real-time training visualization

## 🌐 OpenEnv Setup

**Prerequisites:** Make sure your Marooned server is already running:
```bash
python marooned_server.py
```

The server should be running on `http://localhost:8000` before proceeding with this notebook.

In [1]:
# Install OpenEnv dependencies (for client-side API calls)
!pip install -qqq requests

# Clone OpenEnv if it isn't already present in the working directory
import os
if not os.path.exists('OpenEnv'):
    !git clone https://github.com/meta-pytorch/OpenEnv.git > /dev/null 2>&1

import subprocess, sys
from pathlib import Path

print("✅ OpenEnv dependencies installed")

✅ OpenEnv dependencies installed


### Connect to Marooned OpenEnv Server

Let's verify that your Marooned server is running and accessible.

In [5]:
import time
import requests

# Port for the server
PORT = "8000"
LOCALHOST = f"http://localhost:{PORT}"

print("🔗 Connecting to Marooned OpenEnv server...")
print(f"   URL: {LOCALHOST}")

# Test the health endpoint with a 5-second timeout
try:
    response = requests.get(f"{LOCALHOST}/health", timeout=5)
    if response.status_code == 200:
        print("\n✅ Server is running!")
        print(f"   Health check: {response.json()}")
        
        print(f"\n📡 Available endpoints:")
        print(f"   {LOCALHOST}/         - API info")
        print(f"   {LOCALHOST}/health   - Health check")
        print(f"   {LOCALHOST}/reset    - Reset environment")
        print(f"   {LOCALHOST}/step     - Execute action")
        print(f"   {LOCALHOST}/state    - Get game state")
    else:
        print(f"\n⚠️ Server responded with status {response.status_code}")
        print("   Please make sure the server is running: python marooned_server.py")
except requests.exceptions.ConnectionError:
    print(f"\n❌ Could not connect to server at {LOCALHOST}")
    print("\n🚨 IMPORTANT: Please start the Marooned server first:")
    print("   Run this command in a terminal:")
    print("   python marooned_server.py")
except requests.exceptions.Timeout:
    print(f"\n❌ Connection timeout - server is not responding")
    print("   Make sure marooned_server.py is running")
except Exception as e:
    print(f"\n❌ Unexpected error: {e}")
    print("   Check if the server is accessible")

🔗 Connecting to Marooned OpenEnv server...
   URL: http://localhost:8000

✅ Server is running!
   Health check: {'status': 'healthy', 'environment_initialized': False}

📡 Available endpoints:
   http://localhost:8000/         - API info
   http://localhost:8000/health   - Health check
   http://localhost:8000/reset    - Reset environment
   http://localhost:8000/step     - Execute action
   http://localhost:8000/state    - Get game state
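
If the server was started only a moment ago, a single health check can fail even though the server is about to come up. A minimal retry sketch, assuming the caller supplies the actual probe; `wait_until_healthy` and its parameters are hypothetical names, not part of OpenEnv or the Marooned server:

```python
import time
from typing import Callable


def wait_until_healthy(probe: Callable[[], bool],
                       attempts: int = 10, delay: float = 1.0) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except Exception:
            # Treat any failure (connection refused, timeout, ...) as "not ready yet"
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

With the variables from the cell above, the probe could be `lambda: requests.get(f"{LOCALHOST}/health", timeout=5).status_code == 200`.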


### Test the Server

**⚠️ IMPORTANT:** You need to **restart your server** to apply the fixes:
1. Go to the terminal where `marooned_server.py` is running
2. Press `CTRL+C` to stop it
3. Run `python marooned_server.py` again

**Fixes applied:**
- ✅ Changed `current_sailor_id` → `get_active_sailor()` 
- ✅ Changed `total_progress` → `total_percentage`
- ✅ Added `current_position` parameter to `parse_action_safe()`

Now let's verify the server works by resetting the environment and taking a few steps:

In [6]:
# Test the server API
print("🧪 Testing Marooned OpenEnv server...\n")

# 1. Reset the environment
print("1️⃣ Resetting environment...")
reset_response = requests.post(f"{LOCALHOST}/reset")
print(f"   Status: {reset_response.status_code}")

if reset_response.status_code != 200:
    print(f"\n❌ Server error! Response:")
    print(f"   {reset_response.json()}")
    print("\n🔍 Check your server terminal for detailed error messages")
    print("   The server is running but encountering errors when processing requests")
else:
    reset_data = reset_response.json()
    
    # Access the data correctly based on actual structure
    if 'observation' in reset_data:
        obs = reset_data['observation']
        print(f"   Active sailor: {obs['sailor_id']}")
        print(f"   Day {obs['day']}, Turn {obs['turn']}")
        print(f"   Ship progress: {obs['ship_progress']:.1f}%")
        sailor_id = obs['sailor_id']
    else:
        # Maybe the response is the observation directly
        print(f"   Active sailor: {reset_data.get('sailor_id', 'N/A')}")
        print(f"   Day {reset_data.get('day', 0)}, Turn {reset_data.get('turn', 0)}")
        print(f"   Ship progress: {reset_data.get('ship_progress', 0.0):.1f}%")
        sailor_id = reset_data.get('sailor_id', 'Alice')
    
    # 2. Get game state
    print("\n2️⃣ Getting game state...")
    state_response = requests.get(f"{LOCALHOST}/state")
    state_data = state_response.json()
    print(f"   Living sailors: {len(state_data.get('living_sailors', []))}")
    print(f"   Current phase: {state_data.get('phase', 'unknown')}")
    
    # 3. Take a step
    print("\n3️⃣ Taking a step (MOVE NORTH)...")
    step_response = requests.post(
        f"{LOCALHOST}/step",
        json={
            "sailor_id": sailor_id,
            "action": "ACTION: MOVE NORTH"
        }
    )
    
    if step_response.status_code != 200:
        print(f"   Status: {step_response.status_code}")
        print(f"   Error: {step_response.json()}")
    else:
        step_data = step_response.json()
        print(f"   Status: {step_response.status_code}")
        
        # Handle step response structure
        if 'observation' in step_data:
            obs = step_data['observation']
            print(f"   Reward: {step_data.get('reward', 0.0)}")
            print(f"   Next sailor: {obs['sailor_id']}")
            print(f"   Energy: {obs['energy']}")
        else:
            print(f"   Reward: {step_data.get('reward', 0.0)}")
            print(f"   Next sailor: {step_data.get('sailor_id', 'N/A')}")
            print(f"   Energy: {step_data.get('energy', 0)}")
        
        print("\n✅ Server is working correctly!")

🧪 Testing Marooned OpenEnv server...

1️⃣ Resetting environment...
   Status: 200
   Active sailor: Alice
   Day 1, Turn 1
   Ship progress: 0.0%

2️⃣ Getting game state...
   Living sailors: 0
   Current phase: unknown

3️⃣ Taking a step (MOVE NORTH)...
   Status: 200
   Reward: -0.01
   Next sailor: Alice
   Energy: 100

✅ Server is working correctly!
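
Two details in the test cell are worth a closer look. The two branches differ only in where the observation lives, and the defaults passed to `.get` (an empty list, `'unknown'`) are exactly what a missing key would print, so `Living sailors: 0` may simply mean the `/state` payload uses different field names. A small sketch of helpers for both cases; `unwrap_observation` and `missing_keys` are hypothetical names:

```python
from typing import Any, Dict, List


def unwrap_observation(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return the nested observation if present, else the payload itself."""
    obs = payload.get("observation")
    return obs if isinstance(obs, dict) else payload


def missing_keys(payload: Dict[str, Any], expected: List[str]) -> List[str]:
    """List the expected keys that the payload does not contain."""
    return [key for key in expected if key not in payload]
```

Calling `missing_keys(state_data, ["living_sailors", "phase"])` after the `/state` request would show whether those fields exist at all before trusting the printed values.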


In [None]:
import os, importlib.util

# For AMD ROCm GPUs, we need to install PyTorch with ROCm support
print("🔧 Installing packages for AMD ROCm GPU...")

# Install PyTorch with ROCm 6.0 support (for MI300X)
!pip install --upgrade -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0

# Install other dependencies
!pip install --upgrade -q transformers==4.56.2 tokenizers trl==0.22.2 bitsandbytes trackio datasets

# Install Unsloth and unsloth_zoo
!pip install --upgrade -q unsloth unsloth_zoo

print("✅ All packages installed!")

# Verify GPU detection
import torch
print(f"\n🔍 GPU Detection:")
print(f"   torch.cuda.is_available(): {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"   GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"   GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

Using Python 3.12.11 environment at: /workspace/AIAC/.venv
Resolved 6 packages in 78ms
Audited 6 packages in 0.15ms


## 🏝️ Load Marooned Environment

Our custom multi-agent environment is already built! Let's import it.

In [8]:
import sys
sys.path.insert(0, './marooned_env')

from environment import MaroonedEnv
from models import Action, Observation, Position
from llm_interface import observation_to_prompt, parse_llm_response, parse_action_safe
from config import (
    ActionType, ResourceType, MapLevel, SailorRole,
    MAX_DAYS, TURNS_PER_DAY, TOTAL_SAILORS
)

print("✅ Marooned environment loaded!")
print(f"📊 Game parameters: {TOTAL_SAILORS} sailors, {MAX_DAYS} days, {TURNS_PER_DAY} turns/day")

✅ Marooned environment loaded!
📊 Game parameters: 5 sailors, 100 days, 100 turns/day


## 🎮 Environment Demo: How Marooned Works

Let's initialize the environment and see what observations look like.

In [9]:
# Create environment
env = MaroonedEnv(seed=42)
observations = env.reset()

print("🏝️ Game initialized!")
print(f"\n👥 Sailors: {list(observations.keys())}")
print(f"🎭 Traitor: {env.state.traitor_id}")
print(f"\n📍 All sailors start at base camp: {env.state.sailors['Alice'].position.to_tuple()}")

🏝️ Game initialized!

👥 Sailors: ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve']
🎭 Traitor: Alice

📍 All sailors start at base camp: (15, 15, <MapLevel.GROUND: 0>)


### Visualize the Island

The island has 3 levels with different resources:

In [10]:
# Show the ground level map with all sailors at base camp
print("🗺️ GROUND LEVEL (30x30) - Main exploration area")
print(env.render_map(MapLevel.GROUND, use_emoji=True))

print("\n" + "="*80)
print("\n⛰️ MOUNTAIN LEVEL (10x10) - Rare resources, high energy cost")
print(env.render_map(MapLevel.MOUNTAIN, use_emoji=True))

print("\n" + "="*80)
print("\n🕳️ CAVE LEVEL (15x15) - Dark, unique resources")
print(env.render_map(MapLevel.CAVE, use_emoji=True))

🗺️ GROUND LEVEL (30x30) - Main exploration area

🏝️  GROUND LEVEL (Z=0)
Legend: 🟫 land | 🌲 wood | ⚙️ metal | 🍎 food | 🌿 antidote | ☠️ poison
        ⬆️ stairs up | ⬇️ stairs down | 🏠 base | A/B/C/D/E sailors | 5👥 group

   012345678901234567890123456789
 0 🟫🟫🟫🟫🍎🟫🟫🟫🌲🟫🟫🟫🟫🟫🟫⚙️🟫🟫🟫🍎🟫🟫🟫🟫🟫🟫🟫🍎🟫🟫
 1 🍎🟫🟫🟫⚙️🟫🟫🟫🍎🟫🟫🟫🟫🟫☠️🟫🟫🟫🟫🟫🟫🟫🟫🌲🟫🟫🟫🟫🟫🟫
 2 🟫🟫🟫🟫🌲🟫🍎🟫🟫🟫🟫🌲🟫🟫🟫🟫☠️🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫
 3 🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🍎🟫⚙️🟫⚙️🟫⚙️🍎🟫🟫🍎🟫🟫🟫🟫🟫🟫🟫🟫
 4 🟫🟫🌲🟫🍎🟫🌲🍎🟫🟫🟫🟫🟫🟫🟫🟫⬇️🟫🟫🟫🟫🟫🟫🟫🍎🟫🟫🟫🟫🟫
 5 🍎🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🌲🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🍎🌲🟫🟫🟫
 6 🟫🟫🟫🟫🟫🟫🟫🟫🍎🟫🟫🍎🟫🟫🟫🟫🟫🍎🍎🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫
 7 🟫🟫🟫⚙️🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫⚙️🟫🟫🌲🍎🟫🟫🟫🟫
 8 ⚙️🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🍎🟫🟫🟫🍎🟫🟫🟫⚙️🟫🟫🟫🟫🟫🟫🟫🌲
 9 🟫⚙️🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🍎🟫🟫🟫🟫🟫🌲🟫🟫🌲☠️🟫🌲🌲
10 🟫🟫🟫🟫🟫🟫🟫🍎🟫🟫🟫🟫🟫🟫🟫🟫🟫⚙️🍎🟫🟫🟫🟫🟫🟫🍎🟫🟫🟫🟫
11 🟫☠️🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫⚙️🟫🟫🟫🟫🟫🍎🍎🟫🟫🟫🌲🟫🟫🌲🟫
12 🟫🍎🟫🟫⚙️🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫⚙️🟫🟫🌲🟫🟫🟫⚙️🟫🟫🟫🟫
13 🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🍎🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫
14 🟫🟫🟫🌲🟫🟫🟫🟫🟫☠️🟫🟫🟫🟫🍎🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🌲🍎🟫🟫🟫
15 🟫🟫🍎🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫5👥🟫🟫🟫⬆️🟫🟫🟫🌲🟫🟫🌲🟫🟫🟫
16 🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🍎🟫🍎🍎🟫🌲🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫
17 🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫⚙️🟫🟫🌲🟫🟫⚙️🟫
18 🟫🌲🟫🟫🟫🟫🟫🟫☠️🟫🟫🟫🟫⚙️🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🟫🍎🟫🟫🟫🟫
19 🟫🟫🟫🟫🟫🟫🌲🍎🟫🟫🟫🟫🟫🟫🟫🍎🟫🟫🟫🟫🍎🟫🟫🟫🟫🟫🟫🟫🟫⚙️
20 🟫🟫⚙️🟫🌲🟫🍎⚙️🟫⚙️🟫🟫🟫🍎🍎🟫🟫🟫🟫🟫🌲🟫🟫🟫🟫🟫⚙️🟫🟫🟫


### What an Observation Looks Like

Each sailor receives their own observation with:
- Their position, energy, inventory
- Local spatial view (5-tile radius)
- Public information (other sailors' energy, ship progress)
- Evidence log (suspicious behavior)
- Messages from other sailors

In [11]:
# Get Alice's observation
alice_obs = observations["Alice"]
alice_role = env.state.sailors["Alice"].role.value

# Convert to prompt text (what the LLM will see)
prompt_text = observation_to_prompt(alice_obs, include_role=True, sailor_role=alice_role)

print("📜 ALICE'S OBSERVATION (as prompt):")
print("="*80)
print(prompt_text[:2000])  # Show first 2000 chars
print("...")
print(f"\n(Total prompt length: {len(prompt_text)} characters, ~{len(prompt_text)//4} tokens)")

📜 ALICE'S OBSERVATION (as prompt):
DAY 1, TURN 1/100 - MORNING PHASE

🎭 YOUR SECRET ROLE: TRAITOR

TRAITOR OBJECTIVES:
  - Prevent the ship from being completed before Day 100
  - Eliminate colonists (reduce to < 3 alive)
  - Use poison, sabotage, and deception
  - Avoid getting voted out
  - Lie convincingly to deflect suspicion


PHASE CONTEXT:
  Location: All sailors at BASE CAMP
  Allowed: Planning, discussions, voting (if called)
  Restricted: Cannot explore or gather resources yet

YOUR STATUS (Alice):
  Position: (15, 15, <MapLevel.GROUND: 0>)
  Energy: 100/100 ⚡⚡⚡⚡⚡
  Health: healthy
  Backpack: 0/20 items
    (empty)

WHAT YOU SEE (within 5 tiles):
  Resources:
    - WOOD_34 (wood) at (16, 16, <MapLevel.GROUND: 0>) - 1 units [2 tiles away]
    - METAL_53 (metal) at (14, 11, <MapLevel.GROUND: 0>) - 1 units [5 tiles away]
    - METAL_56 (metal) at (18, 12, <MapLevel.GROUND: 0>) - 1 units [6 tiles away]
    - METAL_76 (metal) at (14, 11, <MapLevel.GROUND: 0>) - 1 units [5 tiles a
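
The bracketed distances in the sample output above are consistent with Manhattan distance on the (x, y) grid. The environment's own distance function is not shown in this notebook, so the sketch below is an assumption that happens to reproduce the three distinct values printed for Alice at (15, 15):

```python
from typing import Tuple


def manhattan_distance(a: Tuple[int, int], b: Tuple[int, int]) -> int:
    """Grid distance when only horizontal and vertical steps are allowed."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


alice = (15, 15)
resources = [("WOOD_34", (16, 16)), ("METAL_53", (14, 11)), ("METAL_56", (18, 12))]
for name, position in resources:
    print(name, manhattan_distance(alice, position))
# WOOD_34 2
# METAL_53 5
# METAL_56 6
```

Note that METAL_56 is listed at 6 tiles under a "within 5 tiles" heading, so the visibility window itself probably uses a different rule than the displayed distance.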

### Action Space

Sailors can take complex actions expressed in natural language. The LLM outputs text commands that get parsed into specific action types:

**Movement Actions:**
- `MOVE NORTH/SOUTH/EAST/WEST [distance]` - Navigate on current level
- `CLIMB UP/DOWN` - Move between levels (Cave ↔ Ground ↔ Mountain)

**Resource Actions:**
- `GATHER <resource_id>` - Collect wood, metal, food from nearby tiles
- `DEPOSIT <type> <quantity>` - Store items in common inventory at base
- `EAT <food_type>` - Consume food to restore energy

**Ship Building:**
- `BUILD <component>` - Construct hull, mast, sail, rudder, or supplies
- Must be at ship site with ≥2 sailors present

**Communication:**
- `SAY <message>` - Broadcast message to all sailors
- `CALL_SOS` - Emergency energy request
- `CALL_VOTE` - Initiate voting session

**Voting:**
- `VOTE <sailor_name>` - Vote to eliminate suspected traitor
- `SHOW_BACKPACK` - Reveal inventory to prove innocence
- `REFUSE_SHOW` - Decline to show inventory (looks suspicious)

**Traitor-Only Actions:**
- `SABOTAGE` - Damage ship progress (must be unobserved)
- `FRAME <sailor_name>` - Plant false evidence

**Passive:**
- `WAIT` - Do nothing this turn
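
To make the text-to-action step concrete, here is a minimal sketch that pulls the verb and its arguments out of a response. It is not the `parse_action_safe` implementation from `llm_interface` (which also validates action types and positions); `split_action` is a hypothetical helper that only shows the general shape of the parsing:

```python
import re
from typing import List, Optional, Tuple

# Matches the first "ACTION: ..." line anywhere in the response, case-insensitively
ACTION_LINE = re.compile(r"ACTION:\s*(.+)", re.IGNORECASE)


def split_action(response_text: str) -> Optional[Tuple[str, List[str]]]:
    """Return (VERB, [args...]) for the first ACTION line, or None if absent."""
    match = ACTION_LINE.search(response_text)
    if match is None:
        return None
    tokens = match.group(1).strip().split()
    if not tokens:
        return None
    return tokens[0].upper(), tokens[1:]
```

For example, `split_action("ACTION: MOVE NORTH 5")` yields the verb `MOVE` with arguments `NORTH` and `5`, which a real parser would then map to an `ActionType` and a distance.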

In [12]:
# Example: Let's make Alice move north
test_action = Action(
    sailor_id="Alice",
    action_type=ActionType.MOVE_NORTH
)

print(f"🚶 Alice's action: {test_action.action_type.value}")
print(f"\nThis will cost 1 energy per tile moved")

# You can also try other movement actions
other_actions = [
    ActionType.MOVE_SOUTH,
    ActionType.MOVE_EAST, 
    ActionType.MOVE_WEST,
    ActionType.CLIMB_UP,    # To mountain level
    ActionType.CLIMB_DOWN   # To cave level
]
print(f"\nOther movement options: {[a.value for a in other_actions]}")

🚶 Alice's action: move_north

This will cost 1 energy per tile moved

Other movement options: ['move_south', 'move_east', 'move_west', 'climb_up', 'climb_down']


## 🤖 Load GPT-OSS 20B Model

We'll use Unsloth to load GPT-OSS with 4-bit quantization and LoRA for efficient RL training.

In [18]:
from unsloth import FastLanguageModel
import torch

# Configuration
max_seq_length = 2048  # Longer context for complex game state
lora_rank = 8          # Larger rank for strategy learning

# Check GPU availability
if not torch.cuda.is_available():
    print("⚠️ No GPU detected!")
    print("Please ensure ROCm/CUDA is properly installed.")
    raise RuntimeError("GPU required for training")

# Load model
print("🔄 Loading GPT-OSS 20B model...")
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",
    load_in_4bit = True,
    max_seq_length = max_seq_length,
)

print("✅ GPT-OSS 20B loaded in 4-bit mode!")

# Display GPU info
device_name = torch.cuda.get_device_name(0)
total_memory = torch.cuda.get_device_properties(0).total_memory / 1e9

print(f"🎮 GPU: {device_name}")
print(f"💾 VRAM: {total_memory:.1f} GB")

# Special message for AMD GPUs such as the MI300X
if "MI300" in device_name.upper() or "AMD" in device_name.upper():
    print(f"🚀 AMD ROCm detected - excellent for large-scale RL training!")
    print(f"   Your GPU's {total_memory:.0f} GB of memory is well suited to this task")

ImportError: Unsloth: Please install unsloth_zoo via `pip install unsloth_zoo`

In [None]:
pip install unsloth_zoo

### Add LoRA Adapters

LoRA trains only a small fraction of the model's parameters (the low-rank adapter matrices added to each target layer), which saves a large amount of memory while largely preserving performance.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = lora_rank * 2,
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

print("✅ LoRA adapters added!")
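
To see why the trainable fraction is so small: a rank-r adapter on a weight matrix of shape d_in x d_out adds two thin matrices with r * (d_in + d_out) parameters in total. A quick back-of-the-envelope sketch; the width of 4096 is an illustrative number, not GPT-OSS's actual layer size:

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank matrices A (d_in x r) and B (r x d_out)."""
    return rank * (d_in + d_out)


def trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Adapter parameters relative to the frozen weight matrix they sit on."""
    return lora_param_count(d_in, d_out, rank) / (d_in * d_out)


d = 4096  # illustrative layer width, an assumption for this example
print(lora_param_count(d, d, 8))               # 65536
print(f"{trainable_fraction(d, d, 8):.4%}")    # 0.3906%
```

Doubling `lora_rank` doubles the adapter size, which is the memory side of the rank trade-off noted in the configuration cell.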

## 🎯 Strategy Extraction Functions

We need to convert LLM text output into executable game actions. This is more complex than 2048 since actions have multiple parameters.

In [None]:
from typing import Optional

def extract_strategy_from_response(response_text: str, sailor_id: str, current_position: Position) -> Optional[Action]:
    """
    Extract an Action from LLM response text.
    
    The LLM should output something like:
    ACTION: MOVE NORTH 5
    or
    ACTION: GATHER WOOD_123
    
    This uses the parse_action_safe function from llm_interface which handles all action types.
    """
    try:
        # Use the built-in parser from llm_interface
        return parse_action_safe(response_text, sailor_id, current_position)
    except Exception:
        # If parsing fails, return None; the caller falls back to a WAIT action
        return None

print("✅ Strategy extraction function ready")

## 🏃 Game Execution Engine

This function runs a full game episode with LLM-generated strategies.

In [None]:
from typing import Callable, Dict, List, Tuple
from unsloth import execute_with_time_limit
import time

def _execute_game_episode(strategy_fn: Callable, max_turns: int = 500) -> Tuple[bool, Dict]:
    """
    Execute one full game episode using the strategy function.
    
    Args:
        strategy_fn: Function that takes (observation, sailor_id, role) and returns action text
        max_turns: Maximum turns before timeout
    
    Returns:
        (success, info_dict) where success=True if colonists won
    """
    # Reset environment
    env = MaroonedEnv(seed=None)  # Random seed for variety
    observations = env.reset()
    
    total_turns = 0
    total_reward = {sailor_id: 0.0 for sailor_id in env.agents}
    
    done = False
    
    while not done and total_turns < max_turns:
        # Get current active sailor
        active_sailor = env.state.get_active_sailor()
        
        if active_sailor is None:
            break
            
        obs = observations[active_sailor]
        sailor_role = env.state.sailors[active_sailor].role.value
        sailor_position = env.state.sailors[active_sailor].position
        
        # Get action from strategy
        try:
            action_text = strategy_fn(obs, active_sailor, sailor_role)
            action = extract_strategy_from_response(action_text, active_sailor, sailor_position)
            
            if action is None:
                # Fallback to WAIT
                action = Action(sailor_id=active_sailor, action_type=ActionType.WAIT)
        except Exception as e:
            print(f"Strategy error: {e}")
            action = Action(sailor_id=active_sailor, action_type=ActionType.WAIT)
        
        # Execute action
        observations, rewards, done, info = env.step({active_sailor: action})
        
        # Track rewards
        for sailor_id, reward in rewards.items():
            total_reward[sailor_id] += reward
        
        total_turns += 1
    
    # Check win condition
    colonists_won = False
    if env.state.game_over:
        if env.state.winner == "colonists":
            colonists_won = True
    
    info = {
        "total_turns": total_turns,
        "total_rewards": total_reward,
        "ship_progress": env.state.ship_progress.total_percentage,
        "survivors": len(env.state.living_sailors),
        "winner": env.state.winner if env.state.game_over else "timeout",
    }
    
    return colonists_won, info

@execute_with_time_limit(30)  # 30 second timeout per episode
def execute_game_episode(strategy_fn: Callable, max_turns: int = 500) -> Tuple[bool, Dict]:
    return _execute_game_episode(strategy_fn, max_turns)

print("✅ Game execution engine ready")
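
For readers curious what a per-episode time limit involves, here is a standard-library sketch of the idea. Unsloth's own `execute_with_time_limit` is not shown in this notebook and may work differently; `run_with_time_limit` is a hypothetical helper, and this thread-based version only stops waiting for the call rather than killing it:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_with_time_limit(fn, seconds, *args, **kwargs):
    """Return fn(*args, **kwargs), or raise TimeoutError after `seconds`."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=seconds)
    except FutureTimeout:
        # The worker thread cannot be killed; we only stop waiting for it.
        raise TimeoutError(f"call exceeded {seconds} seconds") from None
    finally:
        pool.shutdown(wait=False)


def slow_episode():
    """Stand-in for an episode that takes too long."""
    time.sleep(0.5)
    return "finished"
```

Because the abandoned call keeps running in the background, this approach suits short experiments; a stricter limit would run each episode in a separate process.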

## 🧪 Test: Baseline Random Strategy

Let's test with a random strategy before training.

In [None]:
import random

def random_strategy(obs: Observation, sailor_id: str, role: str) -> str:
    """Random action generator for baseline"""
    actions = [
        "ACTION: MOVE NORTH",
        "ACTION: MOVE SOUTH",
        "ACTION: MOVE EAST",
        "ACTION: MOVE WEST",
        "ACTION: WAIT",
    ]
    return random.choice(actions)

print("Testing random baseline strategy...")
try:
    success, info = execute_game_episode(random_strategy, max_turns=100)
    print(f"\n✅ Game completed!")
    print(f"  Winner: {info['winner']}")
    print(f"  Turns: {info['total_turns']}")
    print(f"  Ship progress: {info['ship_progress']:.1f}%")
    print(f"  Survivors: {info['survivors']}/{TOTAL_SAILORS}")
except TimeoutError:
    print("⏱️ Episode timed out (expected with random strategy)")
except Exception as e:
    print(f"❌ Error: {e}")

## 🎯 Reward Functions for RL

This is the heart of RL training. We need separate reward functions for:
1. **Valid action** - Did the LLM output a parseable strategy?
2. **No cheating** - Did it try to import external modules?
3. **Game progress** - Did the strategy lead to good outcomes?

In [None]:
from unsloth import check_python_modules

def extract_function_from_completion(text: str) -> Optional[str]:
    """
    Extract Python function from LLM completion if wrapped in backticks.
    For Marooned, we expect a strategy function.
    """
    if text.count("```") >= 2:
        first = text.find("```") + 3
        second = text.find("```", first)
        fx = text[first:second].strip()
        fx = fx.removeprefix("python\n")
        fx = fx[fx.find("def"):] if "def" in fx else fx
        return fx
    return None

def function_works(completions, **kwargs) -> List[float]:
    """
    Reward: +1.0 if the LLM generated valid Python code, +0.5 for a bare
    "ACTION:" reply, and -2.0 for a syntax error or otherwise invalid output.
    """
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        function = extract_function_from_completion(response)
        
        if function is None:
            # Try parsing as direct action instead
            if "ACTION:" in response.upper():
                scores.append(0.5)  # Valid format, not a function
            else:
                scores.append(-2.0)  # Invalid output
        else:
            ok, info = check_python_modules(function)
            if "error" in info:
                scores.append(-2.0)
            else:
                scores.append(1.0)
    
    return scores

def no_cheating(completions, **kwargs) -> List[float]:
    """
    Reward: +1.0 if the function passes the module check, -20.0 if the LLM
    tried to import non-standard libraries, 0.0 if no function was found.
    """
    scores = []
    for completion in completions:
        response = completion[0]["content"]
        function = extract_function_from_completion(response)
        
        if function is not None:
            ok, info = check_python_modules(function)
            scores.append(1.0 if ok else -20.0)
        else:
            scores.append(0.0)  # Not a function, can't cheat
    
    return scores

print("✅ Basic reward functions defined")
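
To see what the extraction step hands to the reward functions, here is a standalone copy of the same logic run on a made-up completion. The name `extract_function` and the sample text are illustrative; the fence marker is built with string multiplication only so that the example can sit inside a fenced block.

```python
from typing import Optional

FENCE = "`" * 3  # the triple-backtick marker that wraps code in a completion


def extract_function(text: str) -> Optional[str]:
    """Return the first fenced code block, trimmed to start at its first 'def'."""
    if text.count(FENCE) >= 2:
        first = text.find(FENCE) + 3
        second = text.find(FENCE, first)
        body = text[first:second].strip()
        body = body.removeprefix("python\n")  # requires Python 3.9+
        return body[body.find("def"):] if "def" in body else body
    return None


sample = (
    "Here is my plan.\n"
    f"{FENCE}python\n"
    "def strategy(observation, sailor_id, role):\n"
    "    return 'ACTION: WAIT'\n"
    f"{FENCE}\n"
    "Good luck!"
)
print(extract_function(sample))
```

A reply with no fenced block returns `None`, which is the case the reward functions handle with their `"ACTION:"` fallback.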

### Game Performance Reward

This is the most important reward - did the strategy actually help win the game?

In [None]:
from unsloth import create_locked_down_function

# Module-level counter; strategy_succeeds increments it to decide when to log
EPISODE_COUNTER = 0

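def strategy_succeeds(completions, **kwargs) -> List[float]:
    """
    Main reward function:
    - Massive reward (+50.0) if colonists win
    - Good reward (+20.0) if traitor wins
    - Otherwise partial credit: ship progress / 10 (up to +10.0) plus +1.0 per survivor
    - Small penalty (-1.0) for an episode timeout
    - Penalty (-2.0 to -3.0) for invalid output, syntax errors, or execution errors
    - 0.0 for a bare "ACTION:" reply, which cannot be run as a full strategy
    """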
    global EPISODE_COUNTER
    scores = []
    
    for completion in completions:
        printed = False
        response = completion[0]["content"]
        
        # Print every 10th episode for monitoring
        if EPISODE_COUNTER % 10 == 0:
            printed = True
            print(f"\n{'='*80}")
            print(f"Episode {EPISODE_COUNTER} - Testing strategy:")
            print(response[:500])
        
        EPISODE_COUNTER += 1
        
        # Try to extract and execute strategy
        function = extract_function_from_completion(response)
        
        if function is None:
            # Maybe it's a direct action format
            if "ACTION:" in response.upper():
                scores.append(0.0)  # Valid format but not testable as full strategy
            else:
                scores.append(-3.0)  # Invalid
            continue
        
        # Check for syntax errors
        ok, info = check_python_modules(function)
        if "error" in info:
            scores.append(-3.0)
            continue
        
        # Try to create executable function
        try:
            strategy_fn = create_locked_down_function(function)
        except Exception as e:
            if printed:
                print(f"  ❌ Function creation failed: {e}")
            scores.append(-2.0)
            continue
        
        # Run game episode with this strategy
        try:
            success, info = execute_game_episode(strategy_fn, max_turns=300)
            
            # Calculate reward
            score = 0.0
            
            if success:
                score += 50.0  # Colonists won!
                if printed:
                    print(f"  🎉 COLONISTS WON! Ship: {info['ship_progress']:.1f}%")
            elif info['winner'] == 'traitor':
                score += 20.0  # Traitor won!
                if printed:
                    print(f"  😈 TRAITOR WON! Ship: {info['ship_progress']:.1f}%")
            else:
                # Reward partial progress
                score += info['ship_progress'] / 10.0  # Up to +10.0 for 100% ship
                score += info['survivors']  # +1 per survivor
                if printed:
                    print(f"  ⚠️ No winner. Ship: {info['ship_progress']:.1f}%, Survivors: {info['survivors']}")
            
            scores.append(score)
            
        except TimeoutError:
            if printed:
                print("  ⏱️ Episode timeout")
            scores.append(-1.0)
        except Exception as e:
            if printed:
                print(f"  ❌ Execution error: {e}")
            scores.append(-3.0)
    
    return scores

print("✅ Main reward function defined")
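
The scoring branch inside `strategy_succeeds` can be read in isolation as a pure function. The sketch below mirrors the arithmetic in the cell above; `episode_score` is a hypothetical name used only for this illustration:

```python
def episode_score(colonists_won: bool, winner: str,
                  ship_progress: float, survivors: int) -> float:
    """Score one finished episode the same way strategy_succeeds does."""
    if colonists_won:
        return 50.0
    if winner == "traitor":
        return 20.0
    # No winner: partial credit for ship progress plus one point per survivor
    return ship_progress / 10.0 + survivors


print(episode_score(False, "timeout", 40.0, 3))  # 7.0
```

Since partial credit tops out around +15 (a nearly complete ship with all five sailors alive), any outright win scores higher than any unfinished game.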

## 📝 Create Training Dataset

We'll create prompts that ask the LLM to generate strategies for different scenarios.

In [None]:
from datasets import Dataset

# Create varied prompts for different roles and scenarios
prompts = [
    # Colonist strategy prompts
    """You are a colonist sailor in the Marooned game. Create a Python strategy function that:
1. Explores the island to find resources (wood, metal, food)
2. Gathers resources efficiently while managing energy
3. Returns to base camp to deposit resources
4. Helps build the ship to escape
5. Watches for suspicious behavior from other sailors

The function receives (observation, sailor_id, role) and returns an action string.

Examples of valid actions:
- "ACTION: MOVE NORTH" - Move 1 tile north
- "ACTION: MOVE EAST 5" - Move 5 tiles east
- "ACTION: CLIMB UP" - Climb to mountain level
- "ACTION: GATHER WOOD_001" - Gather a specific resource
- "ACTION: DEPOSIT wood 10" - Deposit 10 wood at base
- "ACTION: BUILD hull" - Help build ship hull
- "ACTION: SAY Found resources at north" - Communicate

```python
def strategy(observation, sailor_id, role):
    # Your strategy here
    return "ACTION: MOVE NORTH"
```""",
    
    # Traitor strategy prompts  
    """You are the TRAITOR in the Marooned game. Create a Python strategy function that:
1. Pretends to help gather resources but sabotages secretly
2. Uses SABOTAGE action to damage ship progress
3. Lies in messages to deflect suspicion
4. Avoids getting caught and voted out

The function receives (observation, sailor_id, role) and returns an action string.

Traitor-specific actions:
- "ACTION: SABOTAGE" - Damage ship progress (stealth required)
- "ACTION: FRAME Bob" - Plant false evidence on Bob
- "ACTION: SAY <false claim>" - Lie to deflect suspicion

```python
def strategy(observation, sailor_id, role):
    # Your deceptive strategy here
    if role == "traitor":
        return "ACTION: SABOTAGE"
    return "ACTION: MOVE NORTH"
```""",
    
    # Resource management focus
    """Create an efficient resource gathering strategy for Marooned:
- Prioritize wood and metal for ship building
- Gather food when energy is below 50
- Deposit resources at base camp regularly
- Coordinate with teammates via SAY action

Available actions:
- MOVE <direction> [distance] - Navigate (NORTH/SOUTH/EAST/WEST)
- CLIMB UP/DOWN - Change levels
- GATHER <resource_id> - Collect resources
- DEPOSIT <type> <quantity> - Store at base
- EAT <food_type> - Restore energy
- SAY <message> - Communicate

```python
def strategy(observation, sailor_id, role):
    # Resource-focused strategy
    return "ACTION: GATHER WOOD_001"
```""",
    
    # Social deduction focus
    """Create a detective strategy for finding the traitor in Marooned:
- Monitor who deposits fewer resources than claimed
- Watch for suspicious behavior in observations
- Use VOTE action when evidence is strong
- Communicate suspicions with SAY action

Social actions:
- "ACTION: SAY I suspect Bob of sabotage" - Share suspicions
- "ACTION: CALL_VOTE" - Initiate voting session
- "ACTION: VOTE Bob" - Vote to eliminate Bob
- "ACTION: SHOW_BACKPACK" - Prove innocence

```python  
def strategy(observation, sailor_id, role):
    # Detective strategy
    return "ACTION: VOTE Bob"
```""",
]

# Create dataset with multiple copies for more training data
dataset_entries = []
for prompt in prompts:
    for _ in range(250):  # 250 copies of each = 1000 total
        dataset_entries.append({
            "prompt": [{"role": "user", "content": prompt.strip()}],
            "answer": 0,
            "reasoning_effort": "low"
        })

dataset = Dataset.from_list(dataset_entries)

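# Calculate max prompt length
max_prompt_lengths = []
for entry in dataset:
    # apply_chat_template tokenizes by default, so it already returns token ids;
    # count those directly instead of encoding them a second time
    token_ids = tokenizer.apply_chat_template(entry["prompt"], add_generation_prompt=True, tokenize=True)
    max_prompt_lengths.append(len(token_ids))

maximum_prompt_length = max(max_prompt_lengths)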

print(f"✅ Dataset created: {len(dataset)} entries")
print(f"📏 Max prompt length: {maximum_prompt_length} tokens")
print(f"\nSample prompt (first 500 chars):")
print(dataset[0]['prompt'][0]['content'][:500])

## 🎓 Configure GRPO Trainer

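GRPO (Group Relative Policy Optimization) samples several completions for each prompt and scores every completion relative to the others in its group, so it needs no separate value model. That suits Marooned, where a generated strategy can only be judged by running it through the game.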
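
To make the "group relative" part concrete, here is a minimal sketch of how rewards inside one group might be turned into advantages. It is a simplified illustration, not TRL's actual code: the function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are all assumptions made for this example.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of completions that share one prompt.

    Simplified sketch: subtract the group mean and divide by the group
    standard deviation (eps avoids division by zero when all rewards tie).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Two completions for one prompt, as with num_generations = 2:
# the better-scoring strategy gets a positive advantage, the worse one negative.
advantages = group_relative_advantages([2.0, -1.0])

# If every completion in the group scores the same, nothing is learned from it.
tied = group_relative_advantages([1.0, 1.0])
```

With only two generations per prompt, a group carries a learning signal only when the two strategies score differently, which is one reason a larger `num_generations` can help if memory allows.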

In [None]:
from trl import GRPOConfig, GRPOTrainer

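# Leave a small buffer on top of the longest prompt, then give the rest of
# the context window to the completion
max_prompt_length = maximum_prompt_length + 10
max_completion_length = max_seq_length - max_prompt_length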

print(f"Max prompt: {max_prompt_length} tokens")
print(f"Max completion: {max_completion_length} tokens")

training_args = GRPOConfig(
    temperature = 1.0,
    learning_rate = 5e-5,
    weight_decay = 0.01,
    warmup_ratio = 0.1,
    lr_scheduler_type = "linear",
    optim = "adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 2,  # Effective batch size = 2
    num_generations = 2,  # Generate 2 strategies per prompt
    max_prompt_length = max_prompt_length,
    max_completion_length = max_completion_length,
    max_steps = 400,  # Reduced for hackathon timeframe
    save_steps = 100,
    report_to = "trackio",
    output_dir = "outputs_marooned",
)

print("✅ Training config ready")
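
The prompt and completion lengths above share one context window, so long prompts leave less room for the generated strategy. The sketch below restates that arithmetic as a small helper with a guard. The helper name `split_token_budget` and the example numbers (768 and 300) are hypothetical and only illustrate the calculation.

```python
def split_token_budget(max_seq_length, longest_prompt_tokens, buffer=10):
    """Split a context window into a prompt budget and a completion budget."""
    max_prompt = longest_prompt_tokens + buffer
    max_completion = max_seq_length - max_prompt
    if max_completion <= 0:
        # Nothing would be left for the model to generate a strategy
        raise ValueError(
            f"Prompt budget {max_prompt} leaves no room in a window of {max_seq_length}"
        )
    return max_prompt, max_completion

# Hypothetical numbers: a 768-token window and a 300-token longest prompt
budget = split_token_budget(768, 300)
```

If the completion budget turns out too small for a full strategy function, shortening the prompts or raising `max_seq_length` are the two levers.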

## 🚀 Initialize Trainer

Combine everything: model, rewards, dataset, and config.

In [None]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        function_works,
        no_cheating,
        strategy_succeeds,  # Most important - game performance
    ],
    args = training_args,
    train_dataset = dataset,
)

print("✅ GRPO Trainer initialized!")
print(f"\n📊 Reward functions:")
print("  1. function_works: Valid Python syntax")
print("  2. no_cheating: No forbidden imports")
print("  3. strategy_succeeds: Actual game performance")
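
With three reward functions, each completion receives three scores. As far as we understand, the trainer combines them into one reward per completion by a weighted sum, with equal weights unless configured otherwise. The sketch below shows that combination on made-up numbers. `combine_rewards` is a hypothetical helper, and the scores are illustrative rather than values produced by the functions defined earlier.

```python
def combine_rewards(per_function_scores, weights=None):
    """Sum the scores of several reward functions for each completion.

    per_function_scores: one list per reward function, each holding one
    score per completion (all lists must have the same length).
    """
    if weights is None:
        weights = [1.0] * len(per_function_scores)
    totals = []
    for scores in zip(*per_function_scores):  # scores for a single completion
        totals.append(sum(w * s for w, s in zip(weights, scores)))
    return totals

# Illustrative scores for two completions
example_scores = [
    [1.0, -2.0],  # function_works (hypothetical values)
    [1.0, 1.0],   # no_cheating (hypothetical values)
    [2.0, -3.0],  # strategy_succeeds (hypothetical values)
]
equal = combine_rewards(example_scores)
game_heavy = combine_rewards(example_scores, weights=[1.0, 1.0, 3.0])
```

Weighting the game-performance score more heavily is one way to keep the syntax and safety checks from dominating the signal once most completions pass them.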

## 🏋️ Train the Model!

This will take several hours. Watch the reward column increase over time!

**Expected timeline:** ~3-5 hours for 400 steps

**What to look for:**
- Reward should gradually increase from negative to positive
- Episode counter will show how many games were simulated
- TrackIO will visualize training metrics in real-time

In [None]:
print("🏴‍☠️ Starting Marooned RL Training...\n")
print("This will train GPT-OSS to:")
print("  - Play as colonists (cooperate, build ship, detect traitor)")
print("  - Play as traitor (sabotage, deceive, survive)")
print("  - Navigate a 3-level island with resource management")
print("  - Make social deduction decisions\n")
print("Expected training time: 3-5 hours for 400 steps\n")
print("="*80)

trainer.train()
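
Per-step rewards are noisy, so a moving average makes the trend easier to judge once training has finished. The sketch below smooths a list of log entries. It runs on synthetic data here; in practice the entries could come from `trainer.state.log_history`, and the `"reward"` key is an assumption about what the GRPO trainer logs.

```python
def reward_moving_average(log_history, window=20, key="reward"):
    """Return a trailing moving average of the logged reward values."""
    rewards = [entry[key] for entry in log_history if key in entry]
    averages = []
    for i in range(len(rewards)):
        chunk = rewards[max(0, i - window + 1): i + 1]
        averages.append(sum(chunk) / len(chunk))
    return averages

# Synthetic log: reward climbs from negative to positive over 40 steps
fake_history = [{"step": s, "reward": -3.0 + 0.1 * s} for s in range(1, 41)]
smoothed = reward_moving_average(fake_history, window=10)
```

A smoothed curve that stays flat or negative after a few hundred steps suggests the reward functions or prompts need another look before training longer.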

## 🎮 Test the Trained Model

Let's see if the trained model can play better than random!

In [None]:
from transformers import TextStreamer

# Test with a colonist prompt
test_prompt = """Create a smart colonist strategy for Marooned that:
- Explores efficiently to find wood and metal
- Manages energy by eating food when low
- Deposits resources regularly at base camp
- Helps build the ship
- Watches for the traitor

```python
def strategy(observation, sailor_id, role):
    # Your optimized strategy
"""

text = tokenizer.apply_chat_template(
    [{"role": "user", "content": test_prompt}],
    tokenize = False,
    add_generation_prompt = True,
    reasoning_effort = "low",
)

print("🤖 Generating strategy with trained model...\n")
print("="*80)

_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    temperature = 1.0,
    max_new_tokens = 512,
    streamer = TextStreamer(tokenizer, skip_prompt = False),
)

## 💾 Save the Model

Save the trained model for later use or deployment.

In [None]:
# Save in 16-bit format
model.save_pretrained_merged("marooned_gpt_oss_trained", tokenizer, save_method = "merged_16bit")

print("✅ Model saved to ./marooned_gpt_oss_trained")

# Optional: Push to Hugging Face Hub
# model.push_to_hub_merged("your-username/marooned-gpt-oss", tokenizer, save_method = "merged_16bit", token = "hf_...")

## 🎯 Evaluation: Full Game Playthrough

Let's run a complete game with the trained model and visualize the results!

In [None]:
print("🏴‍☠️ Running complete game with trained model...\n")

# Create a simple strategy wrapper that uses the trained model
def trained_model_strategy(obs: Observation, sailor_id: str, role: str) -> str:
    """Generate action using trained GPT-OSS model"""
    prompt = observation_to_prompt(obs, include_role=True, sailor_role=role)
    
    # Add instruction to output action
    prompt += "\n\nOutput your next action in the format: ACTION: <action_type> <parameters>"
    
    text = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize = False,
        add_generation_prompt = True,
        reasoning_effort = "low",
    )
    
    # Generate with trained model
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    output = model.generate(
        **inputs,
        temperature=0.8,
        max_new_tokens=256,
        do_sample=True,
    )
    
    # Decode only the newly generated tokens so the prompt is not echoed back
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)
    
    return response

# Run evaluation game
try:
    success, info = execute_game_episode(trained_model_strategy, max_turns=500)
    
    print("\n" + "="*80)
    print("🎮 GAME RESULTS")
    print("="*80)
    print(f"Winner: {info['winner']}")
    print(f"Total turns: {info['total_turns']}")
    print(f"Ship progress: {info['ship_progress']:.1f}%")
    print(f"Survivors: {info['survivors']}/5")
    print(f"\nTotal rewards by sailor:")
    for sailor, reward in info['total_rewards'].items():
        print(f"  {sailor}: {reward:.2f}")
    
except TimeoutError:
    print("⏱️ Game exceeded time limit")
except Exception as e:
    print(f"❌ Error during evaluation: {e}")
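
The model's reply may contain reasoning text around the requested `ACTION: ...` line. If the environment expects a bare action string, a small parser can pull it out first. This is a sketch under assumptions: the verb list is collected from the prompts in this notebook, and `extract_action` and its `MOVE NORTH` fallback are hypothetical choices, not part of the Marooned environment.

```python
import re

# Verbs that appear in the training prompts above
KNOWN_VERBS = {
    "MOVE", "CLIMB", "GATHER", "DEPOSIT", "BUILD", "EAT", "SAY",
    "SABOTAGE", "FRAME", "CALL_VOTE", "VOTE", "SHOW_BACKPACK",
}

# "ACTION:" followed by a verb and optional arguments on the same line
ACTION_RE = re.compile(r"ACTION:[ \t]*([A-Za-z_]+)(?:[ \t]+(.*))?")

def extract_action(response, fallback="ACTION: MOVE NORTH"):
    """Return the last well-formed action in the response, or a fallback."""
    for verb, args in reversed(ACTION_RE.findall(response)):
        if verb.upper() in KNOWN_VERBS:
            return f"ACTION: {verb.upper()} {args.strip()}".strip()
    return fallback  # hypothetical default when nothing parses

parsed = extract_action("Wood is nearby, so I will collect it.\nACTION: gather WOOD_001\n")
```

Preferring the last match means a model that first restates the instructions and then commits to a move is scored on the move it actually chose.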

## 📊 Summary & Story

### What We Built

We created an **advanced multi-agent RL environment** that goes far beyond simple games:

1. **Complex Environment**
   - 3-level island with navigation (ground, mountains, caves)
   - Resource gathering and management system
   - Energy system with survival mechanics
   - Cooperative ship building requiring teamwork

2. **Social Deception Mechanics**
   - Hidden roles (4 colonists vs 1 traitor)
   - Information asymmetry (private backpacks)
   - Poison system with delayed effects
   - Voting and accusation mechanics
   - Communication and lying

3. **RL Training Innovation**
   - Trained GPT-OSS 20B to play BOTH roles (colonist and traitor)
   - Multi-objective rewards (cooperation vs sabotage)
   - Long-horizon planning (100 days, 10,000 turns)
   - Natural language action space

### Technical Achievements

- ✅ **Multi-agent coordination** - 5 sailors with different goals
- ✅ **Deception learning** - Traitor learns to lie and sabotage
- ✅ **Long-horizon planning** - Episodes can last 500+ turns
- ✅ **Complex action space** - Natural language commands, not just 0-3
- ✅ **Information asymmetry** - Hidden roles and private information
- ✅ **Emergent behavior** - Social deduction strategies emerge from RL

### Why This Matters for OpenEnv

This demonstrates OpenEnv's power for:
1. **Social AI** - Training models to cooperate and deceive
2. **Multi-agent systems** - Coordination between multiple AI agents  
3. **Long-horizon tasks** - Planning over hundreds of steps
4. **Complex reasoning** - Resource management + social deduction

**Marooned** pushes RL beyond simple board games into rich, story-driven environments where agents must balance cooperation, competition, and deception.

---

## 🏆 Next Steps

- Train for more steps (1000+) for better strategies
- Test colonist-only vs traitor-only specialized models
- Add self-play between different checkpoints
- Visualize game replays with matplotlib
- Create tournament between different trained models

---

## 📚 Credits

- **Environment**: Custom Marooned multi-agent survival game
- **Training**: Unsloth + GPT-OSS 20B with GRPO
- **Inspiration**: Pirates of the Caribbean × Among Us × Alice in Borderland

---

*This notebook is licensed under LGPL-3.0*

## 🛑 Note on Server Management

The Marooned server is running externally in a separate terminal. 
- To stop it, use `CTRL+C` in the terminal where you started it
- The server needs to keep running while this notebook executes
- You can monitor server logs in the terminal window

In [None]:
# The server is running externally, so no cleanup needed in this notebook
print("ℹ️ Server is running externally")
print("   To stop it, use CTRL+C in the terminal where you started marooned_server.py")