ghostroom

A live AI escape room. Teams write prompts; an AI ghost operates a Linux desktop and tries to solve the puzzle. The spectacle of watching the ghost work is the experience.

Each team gets their own browser tab pointed at a full Linux desktop running inside a Modal container. They edit a SKILL.md prompt file in an in-desktop editor, hit Enter in the RUN AGENT terminal, and watch a local Qwen 2.5 7B (served by Ollama) autonomously navigate the filesystem, read clues, correlate information, and submit answers. The differentiator between teams is prompt quality, not CS skill.

What it does

20 isolated Linux desktops deployed on Modal (L4 GPU per container).
Each container gets its own public HTTPS URL via modal.forward(6080) into an XFCE desktop served through noVNC — teams just open a browser tab.
A local Qwen 2.5 7B runs in each container via Ollama. No API keys, no cloud LLM billing.
3 phases of filesystem puzzles gated by an unlock command. Each phase requires the agent to find and correlate information from different parts of the filesystem.
Live scoreboard: each container tails scores.jsonl and pushes events to a shared modal.Dict that the driver renders as a live-updating Rich table.
Unlimited retries. Phase unlock state persists across agent runs, so teams can iterate on their prompt without losing progress.
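
The scoreboard item above describes tailing scores.jsonl and pushing events into a shared store. A minimal sketch of that pattern follows. It assumes a plain Python dict standing in for modal.Dict, and the helper names read_new_events and push_events are hypothetical, not taken from score_watcher.py.

```python
import json
from pathlib import Path


def read_new_events(path: Path, offset: int) -> tuple[list[dict], int]:
    """Return events appended since byte `offset`, plus the new offset."""
    if not path.exists():
        return [], offset
    events = []
    with path.open("rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            if not line.endswith(b"\n"):
                break  # EOF or a half-written line: retry on the next poll
            offset += len(line)
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                events.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # skip malformed lines rather than crash the watcher
    return events, offset


def push_events(store, team: str, events: list[dict]) -> None:
    """Append events under the team key (reassign so a remote dict sees it)."""
    if events:
        store[team] = store.get(team, []) + events
```

A watcher built this way would call read_new_events in a polling loop, carrying the returned offset forward, so a restart of the agent does not replay or lose events.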

The puzzles

"The Captain" — Correlate a departure log with a crew roster. Answer is a captain's last name.
"The Marked Shelf" — Cross-reference a librarian's note against a book catalog and a shelf inventory. Answer is an author's first name.
"The Shifted Message" — Decode a short Caesar cipher. The decoded word must appear in a provided dictionary. Requires telling the agent to actually compute the shift (e.g. "use python3 to decode") rather than doing the arithmetic in its head — this is the knob that separates winning prompts from losing ones.

Architecture

┌─ Modal L4 container (one per team, spawned by .spawn()) ─────────────┐
│                                                                      │
│  Ollama ── qwen2.5:7b (pre-baked into a modal.Volume)                │
│     │                                                                │
│  agent_runner.py  ── calls /api/chat with a tools array              │
│     │                    (exec, read, list, done)                    │
│     │                                                                │
│  Xvfb :99 ── xfce4-session                                           │
│     ├─ mousepad (auto-opened on /root/puzzle/SKILL.md)               │
│     └─ xfce4-terminal running run_agent_loop.sh                      │
│         "Press Enter to run the agent"                               │
│                                                                      │
│  x11vnc :5900 ─▶ websockify+noVNC :6080 ─▶ modal.forward(6080)       │
│                                                 │                    │
│                                                 ▼                    │
│                                      public HTTPS URL                │
│                                                                      │
│  /root/puzzle/                                                       │
│    ├─ README.md, SKILL.md, AGENTS.md                                 │
│    ├─ phase1/  (visible from the start)                              │
│    ├─ phase2/  (copied from /var/lib/puzzle-staging when unlocked)   │
│    ├─ phase3/  (copied when phase2 unlocks)                          │
│    ├─ state/phase{N}.unlocked  (persistent markers)                  │
│    ├─ scores.jsonl                                                   │
│    └─ unlock.sh                                                      │
│                                                                      │
│  score_watcher.py ── tails scores.jsonl ─▶ modal.Dict                │
│                                                                      │
└──────────────────────────────────────────────────────────────────────┘

Local driver (@app.local_entrypoint):
  1. Read teams.txt (20 names)
  2. .spawn() 20 run_team_desktop containers
  3. Collect 20 URLs from modal.Queue.from_name("ghostroom-urls")
  4. Print the team → URL table
  5. Poll modal.Dict.from_name("ghostroom-scores") every 15s
     and render a live Rich scoreboard until duration expires

Why not use OpenClaw's agent directly?

The original design called for OpenClaw's gateway + openclaw agent to drive the session. In practice, with OpenClaw's Ollama tool-call wiring, Qwen 2.5 7B emits its tool calls as stringified JSON inside markdown code blocks rather than as real function calls. container/agent_runner.py bypasses the gateway and calls Ollama's /api/chat directly with a tools array, where Qwen 2.5 7B emitted proper tool_calls 100% of the time in our testing. OpenClaw is still installed and runnable (best-effort gateway at :18789) for teams who want to explore it, but the puzzle-solving loop goes through the direct agent.
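
To make the direct route concrete, here is a rough sketch of a /api/chat request body with a tools array, and of reading tool_calls from the reply. The four tool names come from the architecture diagram. The parameter schemas, descriptions, and the helper names tool, build_chat_request, and extract_tool_calls are assumptions for illustration, not the contents of agent_runner.py.

```python
import json


def tool(name: str, description: str, properties: dict, required: list[str]) -> dict:
    """Build one function-style tool definition for Ollama's /api/chat."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Tool names are from the README; the argument shapes are assumed.
TOOLS = [
    tool("exec", "Run a shell command", {"command": {"type": "string"}}, ["command"]),
    tool("read", "Read a file", {"path": {"type": "string"}}, ["path"]),
    tool("list", "List a directory", {"path": {"type": "string"}}, ["path"]),
    tool("done", "Finish the run", {"summary": {"type": "string"}}, []),
]


def build_chat_request(model: str, messages: list[dict]) -> dict:
    """Request body for a non-streaming chat call with tools attached."""
    return {"model": model, "messages": messages, "tools": TOOLS, "stream": False}


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (name, arguments) pairs from a chat response, if any."""
    calls = []
    for call in response.get("message", {}).get("tool_calls") or []:
        fn = call.get("function", {})
        args = fn.get("arguments", {})
        if isinstance(args, str):  # tolerate stringified arguments
            args = json.loads(args)
        calls.append((fn.get("name"), args))
    return calls
```

The body would be POSTed as JSON to the local Ollama server's /api/chat endpoint. An empty list from extract_tool_calls means the model answered in plain text instead of calling a tool.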

Layout

.
├── modal_app.py                     # Image, run_team_desktop, local_entrypoint driver
├── teams.txt                        # 20 team names, one per line (edit freely)
├── requirements-driver.txt          # Local deps for the driver (modal, rich)
├── container/                       # In-container assets (baked into image)
│   ├── boot.sh                      # Orchestration entrypoint: ollama, xfce, vnc, gateway, score watcher
│   ├── agent_runner.py              # Python agent loop over Ollama /api/chat with tools
│   ├── run_agent_loop.sh            # "Press Enter to run" terminal
│   ├── _agent_inner.sh              # One-shot launcher (alternative to run_agent_loop)
│   ├── run_agent.sh                 # Desktop-launcher target
│   ├── openclaw.config.json         # Pre-baked OpenClaw config (Ollama provider, coding profile)
│   ├── skill_template.md            # Starter SKILL.md teams edit
│   ├── score_watcher.py             # Tails scores.jsonl → modal.Dict
│   ├── autostart/                   # XFCE autostart entries
│   └── desktop/                     # Desktop launcher file
├── puzzle/                          # Puzzle content (baked into image)
│   ├── README.md                    # Agent entry briefing
│   ├── unlock.sh                    # The gate: validates answers, unlocks next phase
│   ├── phase1/                      # Crew roster + departure log
│   ├── phase2/                      # Library catalog + shelf inventory
│   └── phase3/                      # Caesar cipher + dictionary
├── docs/                            # Screenshots for this README
├── modal.md                         # Modal rules/guidelines reference
├── CLAUDE.md                        # Project-scoped Claude instructions
├── LICENSE
└── README.md

Quickstart

Prerequisites

Python 3.12+
A Modal account (modal setup)
Modal credits (the Budget section below estimates ~$80–100 for a 3-hour, 20-team event on L4; ~$100–200 leaves headroom)

Install

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements-driver.txt
modal setup   # first time only

Single-team dry run

The first modal run builds the image, which takes ~5–15 minutes (it installs Node 22, OpenClaw, XFCE, Ollama, and pre-pulls qwen2.5:7b into a shared modal.Volume). Subsequent runs use cached layers and are much faster.

modal run --detach modal_app.py::run_team_desktop \
    --team test-01 \
    --duration-s 1800 \
    --url-queue-name ghostroom-urls-test

In a second shell, grab the URL off the test queue:

python -c "
import modal
print(modal.Queue.from_name('ghostroom-urls-test').get()['url'])
"

Open that URL in a browser. You should see:

Mousepad open on /root/puzzle/SKILL.md (edit your prompt here).
Welcome terminal showing /root/puzzle/README.md.
RUN AGENT terminal with a banner, a live SKILL.md preview, and the prompt "Press Enter to run the agent...".

Click into the RUN AGENT terminal, press Enter, and watch tool calls scroll in real time as the ghost reads files, reasons, and calls unlock <phase> <answer>.

Fleet rehearsal (3 teams, 20 minutes)

head -n3 teams.txt > teams.small.txt
modal run modal_app.py --duration-min 20 --teams-file teams.small.txt

The driver prints the URL table as containers come up, then switches to a live scoreboard refreshing every 15 seconds.

The real event (20 teams, 3 hours)

modal run modal_app.py --duration-min 180 --teams-file teams.txt

Kill switch

modal app stop ghostroom

Scoring

Each unlock is logged to /root/puzzle/scores.jsonl as a line of JSON:

{"ts": 1775772636, "phase": 3, "event": "unlock", "answer": "lantern"}

score_watcher.py tails this file and pushes updates into a shared modal.Dict keyed by team name. The driver reads the dict every 15 seconds and renders a table sorted by:

  1. Phases unlocked (descending)
  2. Sum of first-unlock timestamps (ascending; faster wins)
  3. Wrong-answer count (ascending; fewer wrong guesses breaks ties)

Only the first unlock of a phase counts. Wrong, out-of-order, and bad-phase attempts all increment the tiebreaker counter.
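
The ordering above maps naturally onto a tuple sort key. The sketch below assumes each team's events are dicts shaped like the scores.jsonl line shown earlier, and that any event whose "event" field is not "unlock" counts as a wrong attempt. That second assumption, and the names team_key and rank, are illustrative and may not match the driver's actual code.

```python
def team_key(events: list[dict]) -> tuple[int, int, int]:
    """Sort key: more phases first, then lower timestamp sum, then fewer misses."""
    first_unlock: dict[int, int] = {}
    wrong = 0
    for ev in events:
        if ev.get("event") == "unlock":
            phase, ts = ev["phase"], ev["ts"]
            # Only the earliest unlock of each phase counts.
            first_unlock[phase] = min(first_unlock.get(phase, ts), ts)
        else:
            wrong += 1  # assumed: every non-unlock event is a failed attempt
    return (-len(first_unlock), sum(first_unlock.values()), wrong)


def rank(scores: dict[str, list[dict]]) -> list[str]:
    """Return team names ordered from first place to last."""
    return sorted(scores, key=lambda team: team_key(scores[team]))
```

Negating the phase count lets a single ascending sort cover all three criteria.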

Puzzle answers (operator spoiler)

Phase 1: delgado (captain of the ship that departed 2024-03-14)
Phase 2: dalia (first name of the author of the asterisked book on shelf B3)
Phase 3: lantern (Caesar −3 decode of odqwhuq, verified against the provided word list)

Keep these away from teams.
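
Operators can sanity-check the Phase 3 answer with a short script. This is a minimal sketch: the function names caesar_shift and crack are made up for illustration, and the tiny word set stands in for the puzzle's real dictionary file, whose contents are not reproduced here.

```python
import string


def caesar_shift(text: str, shift: int) -> str:
    """Move each letter back by `shift` positions; leave other characters alone."""
    out = []
    for ch in text.lower():
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") - shift) % 26 + ord("a")))
        else:
            out.append(ch)
    return "".join(out)


def crack(ciphertext: str, words: set[str]) -> list[tuple[int, str]]:
    """Try every non-trivial shift and keep candidates found in the word set."""
    hits = []
    for shift in range(1, 26):
        candidate = caesar_shift(ciphertext, shift)
        if candidate in words:
            hits.append((shift, candidate))
    return hits


# Stand-in word list; the real puzzle ships its own dictionary.
sample_words = {"lantern", "anchor", "harbor"}
print(crack("odqwhuq", sample_words))  # → [(3, 'lantern')]
```

This is the same brute-force-and-verify approach a good prompt asks the agent to run with python3 instead of doing the shift arithmetic mentally.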

Budget

GPU: 20 × L4 × 3 h × ~$0.80/hr ≈ $48
Compute / egress / storage: ~$30–50
Total: ~$80–100 — comfortably under $200.

Extend to 4 hours for ~$16 more. The first build pulls the model (~5 GB) into a persistent modal.Volume, so subsequent deploys skip the download entirely.
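
The arithmetic behind these figures is simple enough to re-run with different team counts or durations. The sketch below uses the ~$0.80/hr L4 rate and the $30–50 overhead range quoted above. The function names are illustrative, and actual Modal pricing should be checked before an event.

```python
def gpu_cost(containers: int, hours: float, rate_per_hour: float) -> float:
    """GPU spend: one GPU per container for the whole event."""
    return containers * hours * rate_per_hour


def total_cost_range(gpu: float, overhead_low: float, overhead_high: float) -> tuple[float, float]:
    """Add the compute/egress/storage overhead range to the GPU spend."""
    return gpu + overhead_low, gpu + overhead_high


three_hours = gpu_cost(20, 3, 0.80)   # about $48
four_hours = gpu_cost(20, 4, 0.80)    # about $64, i.e. roughly $16 more
low, high = total_cost_range(three_hours, 30, 50)  # about $78 to $98
```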

Knobs worth tuning

Event duration — --duration-min on the driver (default 180).
Default prompt — container/skill_template.md. Make it weaker to widen the prompt-quality gap between teams; make it stronger if teams are struggling.
Model — MODEL_TAG in modal_app.py and container/boot.sh. qwen2.5:7b is the current sweet spot for Ollama native tool calling at 7B scale. Other models tested: qwen2.5-coder:7b and qwen3:8b both emit tool calls as stringified JSON rather than real function calls in Ollama.
Agent turn budget — --max-turns in agent_runner.py (default 60), and the nudge budget (INITIAL_NUDGES = 6).
Resolution — Xvfb :99 -screen 0 1280x800x24 in container/boot.sh. Drop to 1024×768 if noVNC is laggy.

Risks and gotchas

npm install -g openclaw without a version tag can resolve to a squatter placeholder (openclaw@0.0.1). Always pin to openclaw@2026.4.9 or later.
OpenClaw requires Node ≥22.12 at runtime. Debian's default Node 18 is too old; we install Node 22 LTS from NodeSource.
OpenClaw's gateway boot hook auto-drops BOOTSTRAP.md, SOUL.md, USER.md, etc. into any workspace on first run, which confuses the agent's role. boot.sh deletes them.
XFCE desktop icons require trust metadata (gio set … metadata::trusted true) to be launchable with a double-click. Rather than fight this, the RUN AGENT experience is an auto-opened terminal — more discoverable anyway.
Gate bypass — a curious team could ls /var/lib/puzzle-staging/ and peek at phase 2/3 before unlocking phase 1. Scoring only credits phases unlocked via unlock.sh, so cheating doesn't produce score events. Honor system for gameplay; hard-enforced for scoring.

Credits

Built over a single afternoon against Modal, Ollama, noVNC, XFCE, and OpenClaw. The puzzles, agent loop, and container plumbing were iterated live against a real Modal container with Claude Code.

License

MIT — see LICENSE.
