Skip to content

drPod/ghostroom

Repository files navigation

ghostroom

A live AI escape room. Teams write prompts; an AI ghost operates a Linux desktop and tries to solve the puzzle. The spectacle of watching the ghost work is the experience.

Each team gets their own browser tab pointed at a full Linux desktop running inside a Modal container. They edit a SKILL.md prompt file in an in-desktop editor, hit Enter in the RUN AGENT terminal, and watch a local Qwen 2.5 7B (served by Ollama) autonomously navigate the filesystem, read clues, correlate information, and submit answers. The differentiator between teams is prompt quality, not CS skill.

ghostroom agent solving all three phases

What it does

  • 20 isolated Linux desktops deployed on Modal (L4 GPU per container).
  • Each container gets its own public HTTPS URL via modal.forward(6080) into an XFCE desktop served through noVNC — teams just open a browser tab.
  • A local Qwen 2.5 7B runs in each container via Ollama. No API keys, no cloud LLM billing.
  • 3 phases of filesystem puzzles gated by an unlock command. Each phase requires the agent to find and correlate information from different parts of the filesystem.
  • Live scoreboard: each container tails scores.jsonl and pushes events to a shared modal.Dict that the driver renders as a live-updating Rich table.
  • Unlimited retries. Phase unlock state persists across agent runs, so teams can iterate on their prompt without losing progress.

The puzzles

  1. "The Captain" — Correlate a departure log with a crew roster. Answer is a captain's last name.
  2. "The Marked Shelf" — Cross-reference a librarian's note against a book catalog and a shelf inventory. Answer is an author's first name.
  3. "The Shifted Message" — Decode a short Caesar cipher. The decoded word must appear in a provided dictionary. Requires telling the agent to actually compute the shift (e.g. "use python3 to decode") rather than doing the arithmetic in its head — this is the knob that separates winning prompts from losing ones.

Architecture

┌─ Modal L4 container (one per team, spawned by .spawn()) ─────────────┐
│                                                                      │
│  Ollama ── qwen2.5:7b (pre-baked into a modal.Volume)                │
│     │                                                                │
│  agent_runner.py  ── calls /api/chat with a tools array              │
│     │                    (exec, read, list, done)                    │
│     │                                                                │
│  Xvfb :99 ── xfce4-session                                           │
│     ├─ mousepad (auto-opened on /root/puzzle/SKILL.md)               │
│     └─ xfce4-terminal running run_agent_loop.sh                      │
│         "Press Enter to run the agent"                               │
│                                                                      │
│  x11vnc :5900 ─▶ websockify+noVNC :6080 ─▶ modal.forward(6080)       │
│                                                 │                    │
│                                                 ▼                    │
│                                      public HTTPS URL                │
│                                                                      │
│  /root/puzzle/                                                       │
│    ├─ README.md, SKILL.md, AGENTS.md                                 │
│    ├─ phase1/  (visible from the start)                              │
│    ├─ phase2/  (copied from /var/lib/puzzle-staging when unlocked)   │
│    ├─ phase3/  (copied when phase2 unlocks)                          │
│    ├─ state/phase{N}.unlocked  (persistent markers)                  │
│    ├─ scores.jsonl                                                   │
│    └─ unlock.sh                                                      │
│                                                                      │
│  score_watcher.py ── tails scores.jsonl ─▶ modal.Dict                │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Local driver (@app.local_entrypoint):
  1. Read teams.txt (20 names)
  2. .spawn() 20 run_team_desktop containers
  3. Collect 20 URLs from modal.Queue.from_name("ghostroom-urls")
  4. Print the team → URL table
  5. Poll modal.Dict.from_name("ghostroom-scores") every 15s
     and render a live Rich scoreboard until duration expires

Why not use OpenClaw's agent directly?

The original design called for OpenClaw's gateway + openclaw agent to drive the session. In practice, OpenClaw's Ollama tool-call wiring produces schemas that Qwen 2.5 7B emits as stringified JSON inside markdown code blocks rather than as real function calls. container/agent_runner.py bypasses the gateway and hits Ollama's /api/chat directly with a tools array, where Qwen 2.5 7B emits proper tool_calls 100% of the time. OpenClaw is still installed and runnable (best-effort gateway at :18789) for teams who want to explore it, but the puzzle-solving loop goes through the direct agent.

Layout

.
├── modal_app.py                     # Image, run_team_desktop, local_entrypoint driver
├── teams.txt                        # 20 team names, one per line (edit freely)
├── requirements-driver.txt          # Local deps for the driver (modal, rich)
├── container/                       # In-container assets (baked into image)
│   ├── boot.sh                      # Orchestration entrypoint: ollama, xfce, vnc, gateway, score watcher
│   ├── agent_runner.py              # Python agent loop over Ollama /api/chat with tools
│   ├── run_agent_loop.sh            # "Press Enter to run" terminal
│   ├── _agent_inner.sh              # One-shot launcher (alternative to run_agent_loop)
│   ├── run_agent.sh                 # Desktop-launcher target
│   ├── openclaw.config.json         # Pre-baked OpenClaw config (Ollama provider, coding profile)
│   ├── skill_template.md            # Starter SKILL.md teams edit
│   ├── score_watcher.py             # Tails scores.jsonl → modal.Dict
│   ├── autostart/                   # XFCE autostart entries
│   └── desktop/                     # Desktop launcher file
├── puzzle/                          # Puzzle content (baked into image)
│   ├── README.md                    # Agent entry briefing
│   ├── unlock.sh                    # The gate: validates answers, unlocks next phase
│   ├── phase1/                      # Crew roster + departure log
│   ├── phase2/                      # Library catalog + shelf inventory
│   └── phase3/                      # Caesar cipher + dictionary
├── docs/                            # Screenshots for this README
├── modal.md                         # Modal rules/guidelines reference
├── CLAUDE.md                        # Project-scoped Claude instructions
├── LICENSE
└── README.md

Quickstart

Prerequisites

  • Python 3.12+
  • A Modal account (modal setup)
  • Modal credits (~$100–200 for a 3-hour, 20-team event on L4)

Install

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements-driver.txt
modal setup   # first time only

Single-team dry run

The first modal run builds the image, which takes ~5–15 minutes (it installs Node 22, OpenClaw, XFCE, Ollama, and pre-pulls qwen2.5:7b into a shared modal.Volume). Subsequent runs use cached layers and are much faster.

modal run --detach modal_app.py::run_team_desktop \
    --team test-01 \
    --duration-s 1800 \
    --url-queue-name ghostroom-urls-test

In a second shell, grab the URL off the test queue:

python -c "
import modal
print(modal.Queue.from_name('ghostroom-urls-test').get()['url'])
"

Open that URL in a browser. You should see:

Run Agent terminal ready state

  • Mousepad open on /root/puzzle/SKILL.md (edit your prompt here).
  • Welcome terminal showing /root/puzzle/README.md.
  • RUN AGENT terminal with a banner, live SKILL.md preview, and Press Enter to run the agent....

Click into the RUN AGENT terminal, press Enter, and watch tool calls scroll in real time as the ghost reads files, reasons, and calls unlock <phase> <answer>.

Fleet rehearsal (3 teams, 20 minutes)

head -n3 teams.txt > teams.small.txt
modal run modal_app.py --duration-min 20 --teams-file teams.small.txt

The driver prints the URL table as containers come up, then switches to a live scoreboard refreshing every 15 seconds.

The real event (20 teams, 3 hours)

modal run modal_app.py --duration-min 180 --teams-file teams.txt

Kill switch

modal app stop ghostroom

Scoring

Each unlock is logged to /root/puzzle/scores.jsonl as a line of JSON:

{"ts": 1775772636, "phase": 3, "event": "unlock", "answer": "lantern"}

score_watcher.py tails this file and pushes updates into a shared modal.Dict keyed by team name. The driver reads the dict every 15 seconds and renders a table sorted by:

  1. Phases unlocked (descending)
  2. Sum of first-unlock timestamps (ascending — faster wins)
  3. Wrong-answer count (ascending — fewer wrong guesses breaks ties)

Only the first unlock of a phase counts. Wrong, out-of-order, and bad-phase attempts all increment the tiebreaker counter.

Puzzle answers (operator spoiler)

Click to reveal
  • Phase 1: delgado (captain of the ship that departed 2024-03-14)
  • Phase 2: dalia (first name of the author of the asterisked book on shelf B3)
  • Phase 3: lantern (Caesar −3 decode of odqwhuq, verified against the provided word list)

Keep these away from teams.

Budget

  • GPU: 20 × L4 × 3 h × ~$0.80/hr ≈ $48
  • Compute / egress / storage: ~$30–50
  • Total: ~$80–100 — comfortably under $200.

Extend to 4 hours for ~$16 more. The first build pulls the model (~5 GB) into a persistent modal.Volume, so subsequent deploys skip the download entirely.

Knobs worth tuning

  • Event duration--duration-min on the driver (default 180).
  • Default promptcontainer/skill_template.md. Make it weaker to widen the prompt-quality gap between teams; make it stronger if teams are struggling.
  • ModelMODEL_TAG in modal_app.py and container/boot.sh. qwen2.5:7b is the current sweet spot for Ollama native tool calling at 7B scale. Other models tested: qwen2.5-coder:7b and qwen3:8b both emit tool calls as stringified JSON rather than real function calls in Ollama.
  • Agent turn budget--max-turns in agent_runner.py (default 60), and the nudge budget (INITIAL_NUDGES = 6).
  • ResolutionXvfb :99 -screen 0 1280x800x24 in container/boot.sh. Drop to 1024×768 if noVNC is laggy.

Risks and gotchas

  • npm install -g openclaw without a version tag can resolve to a squatter placeholder (openclaw@0.0.1). Always pin to openclaw@2026.4.9 or later.
  • OpenClaw requires Node ≥22.12 at runtime. Debian's default Node 18 is too old; we install Node 22 LTS from NodeSource.
  • OpenClaw's gateway boot hook auto-drops BOOTSTRAP.md, SOUL.md, USER.md, etc. into any workspace on first run, which confuses the agent's role. boot.sh deletes them.
  • XFCE desktop icons require trust metadata (gio set … metadata::trusted true) to be launchable with a double-click. Rather than fight this, the RUN AGENT experience is an auto-opened terminal — more discoverable anyway.
  • Gate bypass — a curious team could ls /var/lib/puzzle-staging/ and peek at phase 2/3 before unlocking phase 1. Scoring only credits phases unlocked via unlock.sh, so cheating doesn't produce score events. Honor system for gameplay; hard-enforced for scoring.

Credits

Built over a single afternoon against Modal, Ollama, noVNC, XFCE, and OpenClaw. The puzzles, agent loop, and container plumbing were iterated live against a real Modal container with Claude Code.

License

MIT — see LICENSE.

About

A live AI escape room — teams write SKILL.md prompts and watch a local Qwen 2.5 7B operate a Linux desktop via noVNC to solve filesystem puzzles. One Modal container per team.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors