# 🤖 MAROONED - AI Agent Inference Test

Test if an AI agent can successfully interact with the Marooned environment.

## 🎯 What This Notebook Does:

1. **Load the Marooned Environment** - Your fully implemented game
2. **Load GPT-OSS Model** - Base model with LoRA adapters
3. **Generate Observation** - Get current game state
4. **Create LLM Prompt** - Using your `observation_to_prompt()` (monolithic approach)
5. **Model Inference** - Generate action response
6. **Parse Action** - Convert LLM output to game action
7. **Execute in Environment** - Actually make the move!
8. **Validate** - Check if the action was valid and executed

## 📊 Success Criteria:

- ✅ Model generates valid action format
- ✅ Action is parsed successfully
- ✅ Action executes in environment without errors
- ✅ Environment state updates correctly

## 🔄 Use Cases:

- **Now:** Test base model's ability to interact with environment
- **After Training:** Validate that trained model learned better strategies
- **Debugging:** Identify issues with prompt format, parsing, or environment

## 🏗️ Architecture: Monolithic Prompts

This uses the **monolithic approach** where everything is in one prompt:
- Game state (`observation.to_text()`)
- Role information (colonist/traitor objectives)
- Action instructions (MOVE, GATHER, SAY, etc.)
- Output format (ACTION: / REASONING: / MESSAGE:)

All combined by `observation_to_prompt(obs, include_role=True, sailor_role=sailor_role)`.
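To make the four parts listed above concrete, here is a minimal sketch of how a monolithic prompt could be assembled. `build_monolithic_prompt` and its section wording are hypothetical stand-ins for illustration; the real `observation_to_prompt()` in `llm_interface` may order or phrase things differently.

```python
def build_monolithic_prompt(state_text: str, role: str, objectives: list[str]) -> str:
    """Hypothetical stand-in for observation_to_prompt(): one string holding every section."""
    # Role information (colonist/traitor objectives)
    role_block = f"YOUR ROLE: {role.upper()}\n" + "\n".join(f"- {o}" for o in objectives)
    # Action instructions (only a few action names, for illustration)
    actions_block = "AVAILABLE ACTIONS: MOVE, GATHER, SAY, WAIT"
    # Output format the parser will look for
    format_block = "Respond as:\nACTION: <action>\nREASONING: <why>\nMESSAGE: <optional text>"
    # Sections are separated by blank lines so the model can tell them apart
    return "\n\n".join([role_block, state_text, actions_block, format_block])


prompt = build_monolithic_prompt(
    state_text="DAY 1, TURN 1/100 - MORNING PHASE",
    role="colonist",
    objectives=["Rebuild the ship", "Find the traitor"],
)
print(prompt)
```

Because everything lives in one string, any change to the rules text changes every prompt, which is the trade-off the summary at the end of this notebook discusses.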

---

In [5]:
import sys
import json
from typing import Dict, Any

# Clear cached project modules so edits to the source files are picked up on re-run
project_modules = {'environment', 'config', 'models', 'game_state', 'view_map', 'llm_interface'}
for name in list(sys.modules):
    if 'marooned' in name or name in project_modules:
        del sys.modules[name]

sys.path.insert(0, '../marooned_env')

from environment import MaroonedEnv
from llm_interface import observation_to_prompt, parse_action_safe, parse_llm_response, get_system_prompt
from config import ActionType, ResourceType, MapLevel
from models import Action, Position, Observation

print("✅ Marooned environment modules loaded!")
print("✅ System prompts available: get_system_prompt('colonist' | 'traitor')")

✅ Marooned environment modules loaded!
✅ System prompts available: get_system_prompt('colonist' | 'traitor')


## 🗺️ Initialize Environment

In [6]:
# Create environment
env = MaroonedEnv(render_mode="ansi", seed=42)
observations = env.reset(seed=42)

print("✅ Marooned environment initialized!")
print(f"\n📋 Game Info:")
print(f"   Sailors: {env.agents}")
print(f"   Map Size: 30x30 (3 levels: Ground, Mountain, Cave)")
print(f"   Days to Escape: 100")
print(f"   Traitor: 1 (hidden)")
print(f"   Colonists: 4")

# Get Alice's initial observation and role
alice_obs = observations["Alice"]
alice_sailor = env.state.sailors["Alice"]
alice_role = alice_sailor.role.value

print(f"\n🔍 Alice's Starting Position: {alice_obs.position.to_tuple()}")
print(f"   Energy: {alice_obs.energy}/100")
print(f"   Backpack: {len(alice_obs.backpack)} items")
print(f"   Day: {alice_obs.day}")
print(f"   Role: {alice_role.upper()} {'🎭' if alice_role == 'traitor' else '⚓'}")

✅ Marooned environment initialized!

📋 Game Info:
   Sailors: ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve']
   Map Size: 30x30 (3 levels: Ground, Mountain, Cave)
   Days to Escape: 100
   Traitor: 1 (hidden)
   Colonists: 4

🔍 Alice's Starting Position: (15, 15, <MapLevel.GROUND: 0>)
   Energy: 100/100
   Backpack: 0 items
   Day: 1
   Role: TRAITOR 🎭


## 👀 View Raw Observation

This is the base observation data (without role or action instructions).

In [7]:
# Display raw observation (base data, no role/instructions)
obs_text = alice_obs.to_text()

print(obs_text)

DAY 1, TURN 1/100 - MORNING PHASE

PHASE CONTEXT:
  Location: All sailors at BASE CAMP
  Allowed: Planning, discussions, voting (if called)
  Restricted: Cannot explore or gather resources yet

YOUR STATUS (Alice):
  Position: (15, 15, <MapLevel.GROUND: 0>)
  Energy: 100/100 ⚡⚡⚡⚡⚡
  Health: healthy
  Backpack: 0/20 items
    (empty)

WHAT YOU SEE (within 5 tiles):
  Resources:
    - WOOD_34 (wood) at (16, 16, <MapLevel.GROUND: 0>) - 1 units [2 tiles away]
    - METAL_53 (metal) at (14, 11, <MapLevel.GROUND: 0>) - 1 units [5 tiles away]
    - METAL_56 (metal) at (18, 12, <MapLevel.GROUND: 0>) - 1 units [6 tiles away]
    - METAL_76 (metal) at (14, 11, <MapLevel.GROUND: 0>) - 1 units [5 tiles away]
    - METAL_79 (metal) at (13, 18, <MapLevel.GROUND: 0>) - 1 units [5 tiles away]
    - APPLE_84 (apple) at (15, 19, <MapLevel.GROUND: 0>) - 1 units [4 tiles away]
    - APPLE_88 (apple) at (14, 14, <MapLevel.GROUND: 0>) - 1 units [2 tiles away]
    - BERRY_127 (berry) at (13, 16, <MapLevel.GR

## 📝 Generate FULL LLM Prompt

Now let's create the **complete prompt** using `observation_to_prompt()`.

This adds:
- ✅ **Role information** (COLONIST or TRAITOR objectives)
- ✅ **Action format instructions** (how to respond)
- ✅ **Base observation** (game state from `to_text()`)

### 🔍 Understanding the Difference

**`observation.to_text()`** → Raw game state only
- Current position, energy, backpack
- Spatial view of surroundings
- Ship progress, team status
- (Traitors also see: all sailor positions)

**`observation_to_prompt(obs, include_role=True, sailor_role=sailor_role)`** → Complete LLM prompt
- **Role announcement** (COLONIST or TRAITOR)
- **Objectives** specific to role
- Raw observation (from `to_text()`)
- **Action format instructions** (MOVE, GATHER, SAY, etc.)
- **Output format** (ACTION: / REASONING: / MESSAGE:)

**This is the monolithic approach** - everything in one prompt!

In [8]:
# Generate observation prompt (user message)
# Note: This is JUST the observation, NOT the full prompt
user_prompt = observation_to_prompt(alice_obs)

print(user_prompt)

DAY 1, TURN 1/100 - MORNING PHASE

PHASE CONTEXT:
  Location: All sailors at BASE CAMP
  Allowed: Planning, discussions, voting (if called)
  Restricted: Cannot explore or gather resources yet

YOUR STATUS (Alice):
  Position: (15, 15, <MapLevel.GROUND: 0>)
  Energy: 100/100 ⚡⚡⚡⚡⚡
  Health: healthy
  Backpack: 0/20 items
    (empty)

WHAT YOU SEE (within 5 tiles):
  Resources:
    - WOOD_34 (wood) at (16, 16, <MapLevel.GROUND: 0>) - 1 units [2 tiles away]
    - METAL_53 (metal) at (14, 11, <MapLevel.GROUND: 0>) - 1 units [5 tiles away]
    - METAL_56 (metal) at (18, 12, <MapLevel.GROUND: 0>) - 1 units [6 tiles away]
    - METAL_76 (metal) at (14, 11, <MapLevel.GROUND: 0>) - 1 units [5 tiles away]
    - METAL_79 (metal) at (13, 18, <MapLevel.GROUND: 0>) - 1 units [5 tiles away]
    - APPLE_84 (apple) at (15, 19, <MapLevel.GROUND: 0>) - 1 units [4 tiles away]
    - APPLE_88 (apple) at (14, 14, <MapLevel.GROUND: 0>) - 1 units [2 tiles away]
    - BERRY_127 (berry) at (13, 16, <MapLevel.GR

In [9]:
# Get system prompt based on Alice's role
system_prompt = get_system_prompt(alice_role)

print(f"🎭 SYSTEM PROMPT FOR {alice_role.upper()}:")
print(system_prompt)

🎭 SYSTEM PROMPT FOR TRAITOR:
You are playing MAROONED - a strategic survival and social deduction game.

🎮 GAME OVERVIEW:
You are one of 5 sailors shipwrecked on a mysterious island. The others think you're working together to rebuild the ship and escape, but YOU have a different plan. You are the TRAITOR, and you must prevent their escape at all costs.

🎭 YOUR ROLE: TRAITOR (The Impostor)

🎯 YOUR OBJECTIVES:
1. PREVENT ship completion before Day 100 (keep it below 100%)
2. ELIMINATE colonists through poison or other means (reduce to <3 alive)
3. SABOTAGE ship construction and resource gathering
4. AVOID DETECTION - blend in, lie convincingly, deflect suspicion
5. SURVIVE - don't get voted out!

🏝️ THE ISLAND:
- 30×30 grid map with 3 levels: GROUND (main area), MOUNTAIN (high peaks), CAVE (underground)
- Resources scattered across the island: wood, metal, food, plant fiber
- Poison tablets hidden in various locations (COLLECT THESE!)
- Base camp at (15, 15, GROUND) - where the ship is 

## 🤖 Load GPT-OSS Model

Loading the base model with LoRA adapters (no training yet).

## Testing


In [1]:
from unsloth import FastLanguageModel
import torch

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/gpt-oss-20b-unsloth-bnb-4bit", # 20B model using bitsandbytes 4bit quantization
    "unsloth/gpt-oss-120b-unsloth-bnb-4bit",
    "unsloth/gpt-oss-20b", # 20B model using MXFP4 format
    "unsloth/gpt-oss-120b",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gpt-oss-20b",
    dtype = None, # None for auto detection
    max_seq_length = 4096, # Choose any for long context!
    load_in_4bit = False,  # Kept off: Unsloth disables 4-bit bitsandbytes on AMD (see the load log)
    full_finetuning = False, # [NEW!] We have full finetuning now!
    # token = "hf_...", # use one if using gated models
)

bitsandbytes library load error: Configured ROCm binary not found at /root/AIAC/colony-collapse/.venv/lib/python3.12/site-packages/bitsandbytes/libbitsandbytes_rocm64.so
Traceback (most recent call last):
  File "/root/AIAC/colony-collapse/.venv/lib/python3.12/site-packages/bitsandbytes/cextension.py", line 313, in <module>
    lib = get_native_library()
          ^^^^^^^^^^^^^^^^^^^^
  File "/root/AIAC/colony-collapse/.venv/lib/python3.12/site-packages/bitsandbytes/cextension.py", line 282, in get_native_library
    raise RuntimeError(f"Configured {BNB_BACKEND} binary not found at {cuda_binary_path}")
RuntimeError: Configured ROCm binary not found at /root/AIAC/colony-collapse/.venv/lib/python3.12/site-packages/bitsandbytes/libbitsandbytes_rocm64.so


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


  from .autonotebook import tqdm as notebook_tqdm
    PyTorch 2.8.0+cu128 with CUDA 1208 (you have 2.9.0+rocm6.4)
    Python  3.9.23 (you have 3.12.3)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


Switching to PyTorch attention since your Xformers is broken.

Unsloth: Xformers was not installed correctly.
Please install xformers separately first.
Then confirm if it's correctly installed by running:
python -m xformers.info

Longer error message:
xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.8.0+cu128 with CUDA 1208 (you have 2.9.0+rocm6.4)
    Python  3.9.23 (you have 3.12.3)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
🦥 Unsloth Zoo will now patch everything to make training faster!
Unsloth: AMD currently is not stable with 4bit bitsandbytes. Disabling for now.
Unsloth: AMD currently is not stable with 4bit bitsandbytes. Disabling for now.
==((====))==  Unsloth 2025.10.9: Fast Gpt_Oss patching. Transformers: 4.56.2.
   \\   /|    AMD Instinct MI300X VF. Num GPUs = 1. Max memory: 191.688 GB. Platform: Linux.
O^O/ \_/ \   

Loading checkpoint shards: 100%|██████████| 3/3 [00:05<00:00,  1.86s/it]


## ⚠️ GPT-OSS is TOO SLOW for RL!

**Problem:** GPT-OSS is a reasoning model: it writes its chain of thought to an `analysis` channel before emitting the answer on the `final` channel:
```
<|channel|>analysis<|message|>...thinking...
<|channel|>final<|message|>REASONING: ... ACTION: ...
```

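The analysis text is generated token by token like any other output, so each turn produces **~10x more tokens** than the final action alone. That makes it too slow here:
- **Measured speed:** 3.9 tokens/second
- **Needed for RL:** 20-50 tokens/second minimum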

**GPT-OSS is designed for:** Complex reasoning, math problems, coding challenges
**Not for:** Real-time RL gameplay where speed matters!

### ✅ Use Llama 3.1 8B Instead:

Expected to be much faster (20-40 tok/s), with no separate reasoning channel and good instruction following.
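If GPT-OSS is used anyway, only the `final` channel should reach the action parser. The sketch below pulls that channel out of a raw response using the markers shown above. It assumes the text was decoded with `skip_special_tokens=False` so the markers are still present; `extract_final_channel` is a hypothetical helper, not part of `llm_interface`.

```python
import re

# Capture everything after the final-channel marker up to <|end|> or end of string.
FINAL_RE = re.compile(r"<\|channel\|>final<\|message\|>(.*?)(?:<\|end\|>|\Z)", re.DOTALL)


def extract_final_channel(raw: str) -> str:
    """Return the `final` channel text, or the whole string if no channel markers exist."""
    match = FINAL_RE.search(raw)
    return match.group(1).strip() if match else raw.strip()


raw = (
    "<|channel|>analysis<|message|>thinking about the move...<|end|>"
    "<|start|>assistant<|channel|>final<|message|>REASONING: stay quiet ACTION: WAIT"
)
print(extract_final_channel(raw))
```

Falling back to the whole string keeps the helper usable with models such as Llama that emit no channel markers at all.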

---


In [None]:
# OPTION 1: Llama 3.1 8B (RECOMMENDED for RL)
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length = 2048,
    dtype = None,  # Auto-detect
    load_in_4bit = True,  # 4-bit mainly saves memory; Unsloth may disable it on AMD (see earlier log)
)

# Add LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r = 8,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

FastLanguageModel.for_inference(model)

print("✅ Llama 3.1 8B loaded!")
print("   Expected speed: 20-40 tokens/second on MI300X")
print("   5-10x faster than GPT-OSS for your use case!")


In [2]:
from transformers import TextStreamer

messages = [
    {"role": "user", "content": "Solve x^5 + 3x^4 - 10 = 3."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,
    return_tensors = "pt",
    return_dict = True,
    reasoning_effort = "low", # **NEW!** Set reasoning effort to low, medium or high
).to("cuda")

_ = model.generate(**inputs, max_new_tokens = 512, streamer = TextStreamer(tokenizer))

<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-10-27

Reasoning: low

# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|><|start|>user<|message|>Solve x^5 + 3x^4 - 10 = 3.<|end|><|start|>assistant<|channel|>analysis<|message|>Equation: x^5+3x^4-10=3 => x^5+3x^4-13=0. No simple roots. maybe numeric. try x=1:1+3-13=-9. x=2:32+48-13=67. root between1 and2. Try 1.2: (1.2^5=2.48832)+(3*1.2^4=3*2.0736=6.2208)-13= -4.29088. 1.4:1.4^5=5.37824;1.4^4=3.8416*3=11.5248 sum=16.90304-13=3.90304. root ~1.32:1.32^5= (1.32^2=1.7424, ^4= (1.7424)^2=3.033, *1.32=4.003) approx 4.0;1.32^4=3.033*? wait compute precisely: 1.32^3=2.299, ^4=3.036, times3=9.108; sum=13.108-13=0.108. 1.318: compute quickly maybe 1.317 gives zero. So approx 1.317.<|end|><|start|>assistant<|channel|>final<|message|>The equation can be written as  

\[
x^5+3x^4-10=3\quad\Longrightarrow\quad x^5+3x^4-13=0 

## ⚡ Optimize for Speed

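**Problem:** 65 seconds for 512 tokens = only ~7.9 tokens/second (VERY SLOW!)

**Expected:** MI300X should do 20-50 tokens/second for a 20B model

**Optimizations below:**
- Add LoRA adapters (needed for training later; they do not speed up inference on their own)
- Use 4-bit quantization where supported (mainly saves memory; Unsloth disabled it on this AMD GPU, see the load log)
- Enable Unsloth's inference mode with `FastLanguageModel.for_inference(model)`
- Tighter generation settings (lower `max_new_tokens`)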
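The throughput figure is simply generated tokens divided by wall-clock seconds (512 / 65 is about 7.88). The short sketch below reproduces that arithmetic and shows what the 20-50 tokens/second target would mean for a single 512-token turn. The helper names are illustrative only.

```python
def tokens_per_second(new_tokens: int, seconds: float) -> float:
    """Throughput of one generation call."""
    return new_tokens / seconds


def seconds_per_turn(new_tokens: int, tok_per_s: float) -> float:
    """Wall-clock time for one turn at a given throughput."""
    return new_tokens / tok_per_s


measured = tokens_per_second(512, 65.0)
print(f"measured: {measured:.1f} tok/s")
for target in (20, 50):
    print(f"at {target} tok/s a 512-token turn takes {seconds_per_turn(512, target):.1f} s")
```

At the measured rate a single turn takes about a minute, so a five-sailor game over many turns is dominated by generation time.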

---


In [3]:
# Add LoRA adapters (these prepare the model for fine-tuning; they do not speed up inference by themselves)
model = FastLanguageModel.get_peft_model(
    model,
    r = 8,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha = 16,
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

print("✅ LoRA adapters added!")
print("   This should improve inference speed")


Unsloth: Making `model.base_model.model.model` require gradients
✅ LoRA adapters added!
   This should improve inference speed


In [10]:
# Enable inference mode (IMPORTANT for speed!)
FastLanguageModel.for_inference(model)

import time

print("\n🚀 Testing inference speed...")
print("="*80)

# Time the generation
start_time = time.time()

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt}
]

# reasoning_effort is omitted here, so the GPT-OSS chat template uses its default effort.
# Passing reasoning_effort="low" (as in the earlier test cell) shortens the analysis channel.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

from transformers import TextStreamer
output = model.generate(
    **inputs,
    max_new_tokens=256,  # Reduced from 512
    temperature=0.1,     # Low temperature for near-deterministic output (no effect on speed)
    do_sample=True,
    repetition_penalty=1.2,
    streamer=TextStreamer(tokenizer, skip_prompt=True),
)

end_time = time.time()
elapsed = end_time - start_time

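# Calculate tokens/second from the generated token ids (re-encoding the decoded text can miscount)
prompt_len = inputs['input_ids'].shape[1]
new_token_ids = output[0][prompt_len:]
response = tokenizer.decode(new_token_ids, skip_special_tokens=True)
num_tokens = len(new_token_ids)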

print("\n" + "="*80)
print(f"\n⏱️  Performance:")
print(f"   Time: {elapsed:.2f} seconds")
print(f"   Tokens generated: {num_tokens}")
print(f"   Speed: {num_tokens/elapsed:.1f} tokens/second")
print(f"\n   Expected for MI300X: 20-50 tokens/second")
print(f"   Your speed: {'✅ GOOD' if num_tokens/elapsed > 15 else '⚠️ SLOW'}")



🚀 Testing inference speed...


<|channel|>analysis<|message|>We are starting morning phase; we should talk about something that seems like planning but also maybe sow distrust. As traitor, want to avoid detection early. Maybe say something trivial. But we could also gather some small resource to look cooperative. However, no exploration allowed. So best is to say something encouraging but not too revealing. Could propose exploring certain area next turn. Or call SOS? That would reveal our location but they already know we're there. Might cause suspicion. Better keep quiet. Perhaps say "Let's split tasks: I'll head east." This gives an alibi later. Also might mention resource found. But we haven't gathered anything. We can simply say "I'll check the west side". That's fine.

Thus REASONING: Provide plausible statement. ACTION: SAY "...".

Pick phrase: "Let’s divide duties—I'll scout the eastern ridge while you handle supplies."

Ok.<|end|><|start|>assistant<|channel|>final<|message|>REASONING: Offer a simple division

## 🔍 Why Was It So Slow?

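**Measured speed: ~7.9 tokens/second** (65 seconds for 512 tokens)

**MI300X should achieve: 20-50 tokens/second**

### Likely reasons for slowness:

1. **❌ `FastLanguageModel.for_inference(model)` not called**
   - The model stays in its training configuration
   - Unsloth's faster generation path is not enabled

2. **⚠️ LoRA adapters are not a speed fix**
   - LoRA adds small adapter matrices on top of the frozen base weights
   - All base parameters are still used at inference, so adding adapters does not make generation faster

3. **⚠️ `reasoning_effort` is valid for GPT-OSS**
   - The test cell's output shows `Reasoning: low` in the system message, so the chat template accepted it
   - Higher effort means a longer `analysis` channel and more tokens per turn
   - Keep it at `"low"` when speed matters

4. **⚠️ `max_new_tokens=512`**
   - This is an upper bound, not a target: generation stops earlier if the model emits its end-of-sequence token
   - A run that reaches 512 tokens hit the cap before the model finished, so the reasoning text is what costs time

5. **⚠️ Possible ROCm issues**
   - The load log shows xformers failed to load and Unsloth fell back to PyTorch attention
   - Check: `rocm-smi` to verify GPU utilization
   - Unsloth optimizations may not work perfectly on AMD

### After optimization:

Re-run the timing cell above and check whether speed improves to **15-30 tokens/second**.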
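Whether item 4 matters depends on whether the model emits its end-of-sequence token before the cap, because `generate` treats `max_new_tokens` as an upper bound. The toy loop below illustrates that stopping rule in plain Python. The `EOS` id and the scripted "model" are made up for the example and stand in for a real tokenizer and network.

```python
from typing import Callable, List

EOS = 0  # hypothetical end-of-sequence token id for this toy example


def generate(next_token: Callable[[List[int]], int], prompt: List[int], max_new_tokens: int) -> List[int]:
    """Toy decoding loop: stop at EOS or when the max_new_tokens cap is reached."""
    new_tokens: List[int] = []
    for _ in range(max_new_tokens):
        token = next_token(prompt + new_tokens)
        new_tokens.append(token)
        if token == EOS:  # early stop: the model decided it was done
            break
    return new_tokens


# A fake "model" that emits 7, 8, 9 and then EOS for a 2-token prompt.
script = [7, 8, 9, EOS]


def scripted_model(context: List[int]) -> int:
    return script[min(len(context) - 2, len(script) - 1)]


short = generate(scripted_model, prompt=[1, 2], max_new_tokens=512)
capped = generate(scripted_model, prompt=[1, 2], max_new_tokens=2)
print(len(short), len(capped))
```

With a generous cap the loop ends after four tokens; with a cap of two it is cut off mid-answer, which is what a truncated response in the outputs above looks like.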

---


In [None]:
# Parse using YOUR parser (handles errors gracefully)
action = parse_action_safe(
    response, 
    sailor_id="Alice",
    current_position=alice_obs.position
)

print("🎯 PARSED ACTION:")
print("=" * 80)
print(f"   Sailor: {action.sailor_id}")
print(f"   Action Type: {action.action_type.value}")

if action.target_position:
    print(f"   Target Position: {action.target_position.to_tuple()}")
if action.target_resource_id:
    print(f"   Target Resource: {action.target_resource_id}")
if action.resource_type:
    print(f"   Resource Type: {action.resource_type.value}")
if action.quantity:
    print(f"   Quantity: {action.quantity}")
if action.message_content:
    print(f"   Message: \"{action.message_content}\"")
if action.ship_component:
    print(f"   Ship Component: {action.ship_component.value}")
if action.target_sailor_id:
    print(f"   Target Sailor: {action.target_sailor_id}")

print("=" * 80)

# Validate action format
is_valid_format = action.action_type != ActionType.WAIT or "WAIT" in response.upper()
print(f"\n✅ Valid action format: {is_valid_format}")
print(f"   (WAIT is default fallback for parse errors)")
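The check above cannot always tell a deliberate WAIT from a parse failure that fell back to WAIT. One way to separate the two is to have the parser record the failure explicitly. The sketch below is a simplified, hypothetical parser for the `ACTION:` line only; it is not the notebook's `parse_action_safe`, and `ParsedAction` is much smaller than the real `Action` model.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedAction:  # hypothetical, simpler than the notebook's Action model
    action: str
    argument: Optional[str] = None
    parse_failed: bool = False


KNOWN_ACTIONS = {"MOVE", "GATHER", "SAY", "WAIT"}
# Action name after "ACTION:", then the rest of that line as a free-form argument.
ACTION_RE = re.compile(r"ACTION:\s*([A-Za-z_]+)\s*(.*)")


def parse_action_sketch(response: str) -> ParsedAction:
    match = ACTION_RE.search(response)
    if not match or match.group(1).upper() not in KNOWN_ACTIONS:
        # Fall back to WAIT, but remember that this was not the model's choice.
        return ParsedAction("WAIT", parse_failed=True)
    argument = match.group(2).strip() or None
    return ParsedAction(match.group(1).upper(), argument)


print(parse_action_sketch('REASONING: blend in\nACTION: SAY "I will check the west side"'))
print(parse_action_sketch("I think we should dance"))
print(parse_action_sketch("ACTION: WAIT"))
```

Counting `parse_failed` over many turns gives a direct measure of the "Model generates valid action format" success criterion.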

## 🎮 Execute Action in Environment

Actually make the move in your game!

In [None]:
# Create actions dict for all agents (others wait)
actions_dict = {
    sailor_id: Action(sailor_id=sailor_id, action_type=ActionType.WAIT)
    for sailor_id in env.agents
}
actions_dict["Alice"] = action

print("🎮 Executing action in environment...\n")

# Execute!
try:
    new_observations, rewards, dones, truncated, info = env.step(actions_dict)
    
    print("✅ ACTION EXECUTED SUCCESSFULLY!")
    print("=" * 80)
    
    # Show results
    alice_new_obs = new_observations["Alice"]
    alice_reward = rewards["Alice"]
    
    print(f"\n📊 RESULTS:")
    print(f"   Reward: {alice_reward:.2f}")
    print(f"   New Position: {alice_new_obs.position.to_tuple()}")
    print(f"   New Energy: {alice_new_obs.energy}/100 (was {alice_obs.energy}/100)")
    print(f"   Backpack Items: {len(alice_new_obs.backpack)} (was {len(alice_obs.backpack)})")
    print(f"   Ship Progress: {alice_new_obs.ship_progress.total_percentage}%")
    
    # Check if position changed
    position_changed = (
        alice_new_obs.position.x != alice_obs.position.x or
        alice_new_obs.position.y != alice_obs.position.y or
        alice_new_obs.position.level != alice_obs.position.level
    )
    
    print(f"\n🔍 VALIDATION:")
    print(f"   Position changed: {position_changed}")
    print(f"   Energy changed: {alice_new_obs.energy != alice_obs.energy}")
    print(f"   Inventory changed: {len(alice_new_obs.backpack) != len(alice_obs.backpack)}")
    print(f"   Game over: {dones['Alice']}")
    
    # Show any info messages
    if "Alice" in info and info["Alice"]:
        print(f"\n💬 Info Messages:")
        for key, value in info["Alice"].items():
            print(f"   {key}: {value}")
    
    print("=" * 80)
    
except Exception as e:
    print(f"❌ ERROR executing action: {e}")
    import traceback
    traceback.print_exc()
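The validation prints above compare fields one by one. A more compact option is to snapshot the fields of interest before and after the step and report only what changed. The sketch below uses a hypothetical `Snapshot` dataclass with made-up values; in the notebook the fields would be filled from `alice_obs` and `alice_new_obs`.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Snapshot:  # hypothetical flat summary of an observation
    position: tuple
    energy: int
    backpack_items: int


def diff_snapshots(before: Snapshot, after: Snapshot) -> dict:
    """Return {field: (old, new)} for every field that changed."""
    old, new = asdict(before), asdict(after)
    return {key: (old[key], new[key]) for key in old if old[key] != new[key]}


# Illustrative values only, not output from the environment.
before = Snapshot(position=(15, 15, 0), energy=100, backpack_items=0)
after = Snapshot(position=(16, 15, 0), energy=99, backpack_items=0)
print(diff_snapshots(before, after))
```

An empty diff is not necessarily a failure: during the morning phase sailors cannot explore or gather, so a `SAY` action would leave all three fields unchanged.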

## 🔄 Run Multiple Turns

Let the AI play for several turns to see its behavior!

In [None]:
def run_ai_turns(num_turns=5, sailor_id="Alice", verbose=True):
    """
    Run multiple turns with the AI agent.
    
    Args:
        num_turns: Number of turns to run
        sailor_id: Which sailor the AI controls
        verbose: Print detailed info
    
    Returns:
        List of (action, reward, observation) tuples
    """
    history = []
    
    # Get initial observation
    current_obs = new_observations[sailor_id] if 'new_observations' in globals() else observations[sailor_id]
    sailor_role = env.state.sailors[sailor_id].role.value
    
    # Get system prompt ONCE (game rules don't change)
    system_prompt = get_system_prompt(sailor_role)
    
    print(f"\n🎮 Running {num_turns} turns with AI controlling {sailor_id} ({sailor_role})...\n")
    print("=" * 80)
    
    for turn in range(num_turns):
        print(f"\n🔄 TURN {turn + 1}/{num_turns}")
        print("-" * 80)
        
        # Check if sailor is still alive
        if not env.state.sailors[sailor_id].alive:
            print(f"❌ {sailor_id} is dead. Stopping.")
            break
        
        # Generate observation prompt (changes each turn)
        user_prompt = observation_to_prompt(current_obs)
        
        # Create messages with system + user
        messages = [
            {"role": "system", "content": system_prompt},  # Game rules (constant)
            {"role": "user", "content": user_prompt}       # Current state (changes)
        ]
        
        text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
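        # max_seq_length was never defined in this notebook; cap the prompt at the 4096-token
        # context used when loading GPT-OSS (lower this if the model was loaded with less)
        inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096).to("cuda")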
        
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,       # Upper bound: room for reasoning plus the action
            temperature=0.1,          # VERY low for structure
            do_sample=True,
            top_p=0.95,
            top_k=50,
            repetition_penalty=1.1,   # Prevent loops
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
        
        response = tokenizer.decode(outputs[0][len(inputs['input_ids'][0]):], skip_special_tokens=True)
        
        # Parse action
        action = parse_action_safe(response, sailor_id, current_obs.position)
        
        if verbose:
            print(f"📍 Position: {current_obs.position.to_tuple()}")
            print(f"⚡ Energy: {current_obs.energy}/100")
            print(f"\n🤖 Model Response:\n{response}")
            print(f"\n🎯 Parsed Action: {action.action_type.value}")
            if action.message_content:
                print(f"💬 Message: \"{action.message_content}\"")
        
        # Execute
        actions_dict = {
            sid: Action(sailor_id=sid, action_type=ActionType.WAIT)
            for sid in env.agents
        }
        actions_dict[sailor_id] = action
        
        try:
            obs_dict, rewards, dones, _, info = env.step(actions_dict)
            current_obs = obs_dict[sailor_id]
            reward = rewards[sailor_id]
            
            history.append((action, reward, current_obs))
            
            if verbose:
                print(f"💰 Reward: {reward:.2f}")
                print(f"📊 Ship: {current_obs.ship_progress.total_percentage}%")
            
            if dones[sailor_id]:
                print(f"\n🏁 Game over for {sailor_id}")
                break
                
        except Exception as e:
            print(f"❌ Error: {e}")
            break
    
    print("\n" + "=" * 80)
    print(f"\n📊 SUMMARY ({len(history)} turns completed):")
    total_reward = sum(r for _, r, _ in history)
    print(f"   Total Reward: {total_reward:.2f}")
    print(f"   Average Reward: {total_reward/len(history):.2f}" if history else "   No turns completed")
    print(f"   Final Ship Progress: {current_obs.ship_progress.total_percentage}%")
    print(f"   Final Energy: {current_obs.energy}/100")
    
    return history

# Run 5 turns
history = run_ai_turns(num_turns=5, sailor_id="Alice", verbose=True)


## 📊 Analyze AI Behavior

What did the AI do? Did it make sense?

In [None]:
if history:
    print("🔍 AI BEHAVIOR ANALYSIS")
    print("=" * 80)
    
    # Count action types
    action_counts = {}
    for action, reward, obs in history:
        action_type = action.action_type.value
        action_counts[action_type] = action_counts.get(action_type, 0) + 1
    
    print("\n📈 Action Distribution:")
    for action_type, count in sorted(action_counts.items(), key=lambda x: x[1], reverse=True):
        print(f"   {action_type}: {count} times")
    
    # Analyze rewards
    rewards = [r for _, r, _ in history]
    print(f"\n💰 Reward Analysis:")
    print(f"   Max Reward: {max(rewards):.2f}")
    print(f"   Min Reward: {min(rewards):.2f}")
    print(f"   Avg Reward: {sum(rewards)/len(rewards):.2f}")
    
    # Check for movement
    positions = [(obs.position.x, obs.position.y, obs.position.level) for _, _, obs in history]
    unique_positions = len(set(positions))
    print(f"\n🗺️ Movement Analysis:")
    print(f"   Unique Positions Visited: {unique_positions}/{len(history)}")
    print(f"   Explored: {unique_positions > 1}")
    
    # Check for resource gathering
    gathered_resources = sum(1 for action, _, _ in history if action.action_type == ActionType.GATHER_RESOURCE)
    print(f"\n🌲 Resource Gathering:")
    print(f"   Gather Attempts: {gathered_resources}")
    
    # Check energy management
    energies = [obs.energy for _, _, obs in history]
    print(f"\n⚡ Energy Management:")
    print(f"   Starting Energy: {energies[0] if energies else 'N/A'}")
    print(f"   Ending Energy: {energies[-1] if energies else 'N/A'}")
    print(f"   Net Change: {energies[-1] - energies[0] if energies else 'N/A'}")
    
    print("\n" + "=" * 80)
else:
    print("❌ No history to analyze")

## 🎓 Testing Different Scenarios

Test the AI in various game situations!

In [None]:
# Reset environment with different seed for variety
print("🔄 Resetting environment with new seed...\n")
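observations = env.reset(seed=999)
# run_ai_turns() prefers the global new_observations when it exists, so point it at the fresh reset
new_observations = observations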

# Try with a different sailor
test_sailor = "Bob"
print(f"🧪 Testing with {test_sailor}...\n")

# Run 3 turns
history_bob = run_ai_turns(num_turns=3, sailor_id=test_sailor, verbose=True)

## 🔮 Future: Load Trained Model

After training, you can load your trained LoRA weights here:

```python
# Load trained adapter
from peft import PeftModel

model = PeftModel.from_pretrained(
    model,
    "outputs_marooned_rl/checkpoint-300",  # Path to your trained checkpoint
)

print("✅ Trained model loaded!")
```

Then run the same tests above to see if the trained model:
- ✅ Makes smarter moves
- ✅ Gathers resources more efficiently
- ✅ Builds the ship strategically
- ✅ Uses social deduction (lies as traitor, detects as colonist)
- ✅ Earns higher rewards
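To turn "earns higher rewards" into a number, the reward lists from two runs can be compared directly. The sketch below assumes each list comes from `[r for _, r, _ in history]` for a base-model run and a trained-model run on the same seed; the helper names and the sample values are illustrative, not measured results.

```python
from statistics import mean


def summarize(rewards: list[float]) -> dict:
    """Turn count, total, and mean reward for one run."""
    return {"turns": len(rewards), "total": sum(rewards), "mean": mean(rewards) if rewards else 0.0}


def compare(base_rewards: list[float], trained_rewards: list[float]) -> float:
    """Mean-reward improvement of trained over base (positive means training helped)."""
    return summarize(trained_rewards)["mean"] - summarize(base_rewards)["mean"]


# Illustrative numbers only, not measured results.
base = [0.0, -0.5, 0.5, 0.0]
trained = [0.5, 1.0, 0.5, 1.0]
print(summarize(base))
print(summarize(trained))
print(f"improvement: {compare(base, trained):+.2f}")
```

A handful of turns is a noisy sample, so comparing over several seeds and both roles gives a fairer picture than a single run.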

---

## ✅ Success Checklist

After running this notebook, you should see:

- [x] Environment loads successfully
- [x] Model generates responses
- [x] Actions are parsed correctly
- [x] Actions execute in environment
- [x] Rewards are calculated (Phase 4 system)
- [x] Multiple turns run without errors
- [x] Agent shows some coherent behavior

**Base model** (untrained) will likely:
- ❓ Make random or simple moves
- ❓ Not follow complex strategies
- ❓ Get low rewards

**After training**, the model should:
- ✅ Navigate purposefully toward resources
- ✅ Gather and deposit efficiently
- ✅ Coordinate ship building
- ✅ Use deception (if traitor) or detection (if colonist)
- ✅ Earn higher average rewards

---

## 🎯 Next Steps:

1. **Run this notebook** to test base model
2. **Train model** using `Train_Marooned_OpenEnv_RL.ipynb`
3. **Come back here** to test trained model
4. **Compare performance** - did training help?
5. **Iterate** - adjust rewards, prompts, training params

Good luck! 🏴‍☠️

---

## 📝 Summary: System Prompt vs User Prompt

### The Proper Two-Prompt Architecture:

**1. SYSTEM PROMPT** (Set ONCE at initialization):
```
Role: system
Content: Complete game rules, mechanics, objectives, win conditions, strategy tips

For Colonist:
- Game overview (5 sailors, rebuild ship, find traitor)
- Ship construction requirements
- Energy system, poison mechanics
- Detection strategies
- Win conditions

For Traitor:
- Game overview (same island, different goal)
- Sabotage tactics, deception techniques
- Poison strategy, special abilities
- Win conditions
```

**2. USER PROMPT** (Changes EVERY turn):
```
Role: user
Content: Current observation ONLY

- Day/Turn/Phase
- Current position, energy, backpack
- Spatial view (11×11 grid)
- Nearby resources, sailors
- Ship progress
- Team status
- Available actions
```

### Why This Is Better:

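✅ **Token Efficiency:**
- System prompt: ~1,500 tokens (identical every turn)
- User prompt: ~800 tokens (changes each turn)
- Both are still sent on every `generate()` call, so each turn processes ~2,300 prompt tokens, the same as the old single prompt
- The saving is potential: because the system prompt is a fixed prefix, a setup that caches prefixes only has to process the ~800 new tokens per turn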

✅ **Cleaner Separation:**
- Game rules = System (doesn't change)
- Current state = User (updates constantly)

✅ **Better Training:**
- Model learns game rules are constant context
- Observations are dynamic input
- Clearer prompt structure

✅ **Role-Specific Context:**
- Colonists get colonist strategies
- Traitor gets sabotage tactics
- No wasted tokens on irrelevant info
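The token figures above only become savings when the fixed system prompt is cached between turns; a plain `model.generate()` call on the full message list re-processes it every time. The sketch below works through the arithmetic for both cases, using the ~1,500 and ~800 token estimates from the text. `prefix_cached` is a hypothetical switch for illustration, not a setting in this notebook.

```python
SYSTEM_TOKENS = 1500  # estimate from the text
USER_TOKENS = 800     # estimate from the text


def prompt_tokens(turns: int, prefix_cached: bool) -> int:
    """Total prompt tokens processed over a number of turns."""
    if prefix_cached:
        # The system prompt is processed once, then only the new observation each turn.
        return SYSTEM_TOKENS + USER_TOKENS * turns
    # Without caching, system + user are both processed on every call.
    return (SYSTEM_TOKENS + USER_TOKENS) * turns


for turns in (1, 10, 100):
    print(turns, prompt_tokens(turns, prefix_cached=False), prompt_tokens(turns, prefix_cached=True))
```

Over 100 turns that is 230,000 prompt tokens without caching against 81,500 with it, so the split pays off only once prefix reuse is in place.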

---