# LLMs vs. TextWorld

This notebook tests the hypothesis that modern, off-the-shelf LLMs, even non-reasoning models such as GPT-4o, can achieve high win rates in TextWorld when given a detailed prompt and a walkthrough of an example game, *without* any task-specific training. The results support it. Games are generated via `tw-cooking --recipe 3 --take 2 --go 12 --open --cook --cut --drop`, which configures them for essentially maximum difficulty. Of the four models tested, the best-performing is Llama-3.1-405B-Instruct-FP8, which won 89 of 100 seeds on its first attempt and all 100 within two attempts. Difficulty navigating the map is the most common source of failure.

## Environment setup

First we need to pip-install some prerequisites and locate the `tw-make` binary.

In [1]:
import os.path, sys, sysconfig
!{sys.executable} -m pip install textworld openai
TW_MAKE_BIN : str = os.path.join(sysconfig.get_path("scripts"), "tw-make")



This next cell defines a function for accessing API keys. It supports getting them from Colab's secret store or from environment variables. You may need to modify it if you want to access them through some other mechanism.

In [2]:
import os, warnings
def get_api_key(service : str) -> str:
    try:
        from google.colab import userdata

        secret = userdata.get(service)
        if secret is not None:
            return secret
    except ImportError:
        pass
    secret = os.environ.get(service.upper())
    if secret is None:
        warnings.warn(
            "Failed to retrieve API key for {}. Set it in Colab secrets "
            "or via the {} environment variable.".format(
                service, service.upper()
            )
        )
    return secret

Now we define `WORK_DIR`, where we'll store generated games and experimental results. We'll use Google Drive if we're running in Colab and the Jupyter data directory otherwise. You might want to customize this.

In [3]:
import os.path, jupyter_core.paths

USE_GOOGLE_DRIVE : bool = True
OFFLINE_WORK_DIR : str = os.path.join(
    jupyter_core.paths.jupyter_data_dir(),
    "textworld"
)
ONLINE_WORK_DIR : str = "/content/drive/MyDrive/textworld"

if USE_GOOGLE_DRIVE:
    try:
        from google.colab import drive
        drive.mount("/content/drive", force_remount=True)
        WORK_DIR : str = ONLINE_WORK_DIR
    except ImportError:
        WORK_DIR = OFFLINE_WORK_DIR
else:
    WORK_DIR = OFFLINE_WORK_DIR

Mounted at /content/drive


This notebook supports testing any model compatible with the [OpenAI chat completion API](https://platform.openai.com/docs/api-reference/chat). The two variable definitions below can be extended to configure additional models. `CLIENTS` is a dictionary whose keys are arbitrary identifiers and whose values are dictionaries of arguments to the `OpenAI` constructor. `MODELS` is a dictionary whose keys are arbitrary identifiers and whose values are dictionaries which use the following keys:

* `'client'` (required) takes a string value which points into `CLIENTS` to determine how to construct the client used with this model.
* `'developer_role'` (optional) gives a role name to associate with the developer prompt, if different from "developer". Some models call this role "system", and some models don't support customizing the developer prompt in which case this should be set to "user".
* Any other key is interpreted as a keyword argument to be provided to `OpenAI.chat.completions.create`. If you don't provide a `model` argument here, then the key that you used to name this `MODELS` entry is used.

In [4]:
from typing import Dict, Any

CLIENTS : Dict[str, Dict[str, Any]] = {
    'openai': {
        'base_url': "https://api.openai.com/v1",
        'api_key': get_api_key("openai_api_key"),
    },
    'lambdalabs': {
        'base_url': "https://api.lambdalabs.com/v1",
        'api_key': get_api_key("lambdalabs_api_key"),
    },
}

MODELS : Dict[str, Dict[str, Any]] = {
    'gpt-4o': {'client': 'openai'},
    'gpt-4o-mini': {'client': 'openai'},
    'llama3.1-405b-instruct-fp8': {
        'client': 'lambdalabs'
    },
    'llama3.3-70b-instruct-fp8': {
        'client': 'lambdalabs',
    }
}
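
To make the key conventions concrete, here is a minimal sketch of how one `MODELS` entry splits into a client selection, a developer role, and request arguments. The `local` client, its URL, the entry name `my-local-model`, and the helper `split_model_config` are all hypothetical and exist only for illustration; the splitting itself mirrors what `play_game` does later in this notebook.

```python
from typing import Any, Dict, Tuple

# Hypothetical configuration, for illustration only: neither this endpoint
# nor this model name is part of the experiment.
EXAMPLE_CLIENTS: Dict[str, Dict[str, Any]] = {
    'local': {'base_url': "http://localhost:8000/v1", 'api_key': "unused"},
}
EXAMPLE_MODELS: Dict[str, Dict[str, Any]] = {
    'my-local-model': {
        'client': 'local',           # must name a key of the clients dictionary
        'developer_role': 'system',  # this model calls the developer role "system"
        'temperature': 0.0,          # passed through as a request argument
    },
}

def split_model_config(
        name: str, models: Dict[str, Dict[str, Any]]
    ) -> Tuple[str, str, Dict[str, Any]]:
    """Return (client key, developer role, request keyword arguments)."""
    config = dict(models[name])       # copy so the original entry stays intact
    client_key = config.pop('client')
    developer_role = config.pop('developer_role', 'developer')
    config.setdefault('model', name)  # the entry name doubles as the model name
    return client_key, developer_role, config

client_key, developer_role, request_args = split_model_config(
    'my-local-model', EXAMPLE_MODELS
)
```

Since the entry has no `model` key, the request arguments end up with `model` set to `'my-local-model'`.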

## Experiment

Now our environment is set up and we can configure the parameters of the experiment. `TW_MAKE_ARGS` lists the arguments we'll pass to `tw-make` when generating games; see the [command reference](https://textworld.readthedocs.io/en/latest/textworld.challenges.cooking.html). This configures our games' difficulty, and we're going for more-or-less the maximum:

* The recipe has three ingredients.
* One is already in the inventory; two must be searched for.
* The map has 12 rooms.
* The map is full of doors which must be opened before proceeding.
* Recipe preparation requires cooking steps and cutting steps.
* The agent will have to drop items in order to juggle around inventory limits.

In [5]:
from typing import List

TW_MAKE_ARGS : List[str] = [
    "tw-cooking", "--recipe", "3", "--take", "2", "--go", "12", "--open", "--cook",
    "--cut", "--drop",
]

`MAX_TURNS` controls how many turns the agent gets to complete the game before the run is considered a failure. A limit of 100 gives a decent amount of leeway while still creating some time pressure.

In [6]:
MAX_TURNS : int = 100

`INSTRUCTIONS` defines the developer prompt. A well-worded prompt is key to getting good results. The one here follows conventional best practices for prompt engineering but has not been particularly optimized.

In [7]:
INSTRUCTIONS : str = \
"""You are going to play a text adventure game. All of the user's input
represents text printed by the game. Your output will be interpreted
as commands issued to the game. However, the game will ignore
anything you put in parentheses. Use parentheticals to record your
thoughts as you think step-by-step through these instructions before
you decide what move to make. Every time you enter a new room, begin
your next thought by listing what exits you see.

The game you will be playing is randomly generated, but all games
follow the same simple template. Your goal is to prepare and eat a
meal. To do this, you will perform the following, three-phase
procedure.

# Phase I: Find the kitchen

Explore the map until you locate the kitchen. Use compass commands
`N`, `S`, `E`, and `W` to move around. There may be doors in your
way. Use the command `OPEN <adjective> DOOR` to open them,
substituting whatever adjective you see in the room description.

When you explore, pay careful attention to any occurrence of "north",
"south", "east", or "west" in a room description. These words always
indicate room exits. Explore all exits until you've located
everything you need.

During this phase, make a note of any items you come across, but do
not pick any of them up.

# Phase II: Search for ingredients

Once you find the kitchen, use the command `READ COOKBOOK`. The
cookbook will contain a list of ingredients, and then a list of
directions for preparation. Right now, just pay attention to the
list of ingredients. Don't worry about preparation until Phase III.

Next, use the command `I` to list your inventory. You may already be
carrying some of the ingredients you need. If you are carrying
anything that you *don't* need, use the `DROP` command to drop it.
You may find some items in your inventory that you didn't pick up,
which were there at the start of the game. This is normal.

Third, use the command `OPEN FRIDGE`. If the fridge contains any
ingredients you need, use the `GET` command to pick them up.

Now, figure out what ingredients you are still missing, and then
continue exploring the map until you find them.

Pick up only the ingredients that the recipe calls for. Pay
attention to their entire description. For example, if the recipe
calls for a red pepper, don't pick up a yellow pepper unless the
recipe needs that too.

If you try to pick something up and the game tells you, "You're
carrying too many things already", you've messed up: either you're
carrying something you don't need, or you don't need the thing
you're trying to pick up.  Check your inventory again (use `I`),
compare your inventory against the recipe, and `DROP` any
unnecessary items.

If you get lost or stuck in a loop while exploring in Phase II,
remember the principles from Phase I: look for "north", "south",
"east", and "west" in room descriptions to make sure you haven't
overlooked any exits. If you notice a closed door, that's certainly
a way you haven't explored yet.

# Phase III: Prepare your meal

Once you have all your ingredients, return to the kitchen. Now
`READ COOKBOOK` again, and then check your inventory again.
Double-check that your inventory contains all the required
ingredients and nothing else. If you made a mistake, return to
phase II and correct it.

Now you are ready to follow the directions in the cookbook. Each
step except the last one in the list of directions is either a
cutting step, or a cooking step. A cutting step calls for you either
to *slice*, *dice*, or *chop* an ingredient. A cooking step calls
for you either to *roast*, *fry*, or *grill* an ingredient.

Determine if any steps involve cutting. If and only if you need to
cut anything, use `GET KNIFE` to pick up the knife. You might have
to drop something first to make room for the knife in your inventory.

Now, follow the directions in order and follow them exactly. Use the
following commands:

* If the cookbook says to *slice* an ingredient, use `SLICE <ingredient> WITH KNIFE`.
* If the cookbook says to *dice* an ingredient, use `DICE <ingredient> WITH KNIFE`.
* If the cookbook says to *chop* an ingredient, use `CHOP <ingredient> WITH KNIFE`.
* If the cookbook says to *roast* an ingredient, use `COOK <ingredient> WITH OVEN`.
* If the cookbook says to *fry* an ingredient, use `COOK <ingredient> WITH STOVE`.
* If the cookbook says to *grill* an ingredient, use `COOK <ingredient> WITH BBQ`.

Be careful to use the right verb for cutting and the right appliance
for cooking! If you slice something that you should have diced or
chopped, or if you roast something that you should have fried or
grilled, you'll lose the game. Also be sure not to cook any
ingredient more than once, or you'll burn it.

Everything you need for meal preparation is in the kitchen, except
for the BBQ. The BBQ is in the backyard. So, if you have to grill
something, you will have to go to the backyard first and then return
to the kitchen afterward. After you get back to the kitchen, you
should `READ COOKBOOK` again so you can remember where you left off.

The final step in the list of directions will be "prepare meal", so
once you have done everything else, use `PREPARE MEAL` to put the
ingredients together; make sure you're still holding them all. You
can drop the knife now if you need to. Finally, use the command
`EAT MEAL` to win the game, and then `QUIT`."""

After the developer prompt comes a walkthrough of a sample game, generated based on what we specify in `SAMPLE_GAMES`. Each sample game can either be sourced from a URL or be generated from a `tw-make` command line and a seed. To demonstrate that we haven't cherry-picked anything here, we use the demo game from the TextWorld website. This game is missing a number of difficulty features that are enabled in our evaluation games, but it turns out to be good enough.

The agent will see `INSTRUCTIONS` as a developer prompt, followed by this sample game as an assistant/user interaction, followed by a repetition of the instructions.

In [8]:
from typing import List, Union, TypedDict, NotRequired

UrlSampleGame = TypedDict('UrlSampleGame', {
    'url': str,
    'solution': List[str],
})
SeedSampleGame = TypedDict('SeedSampleGame', {
    'tw_make_args': NotRequired[List[str]],
    'seed': int,
    'solution': List[str],
})
SampleGame = Union[UrlSampleGame, SeedSampleGame]

SAMPLE_GAMES : List[SampleGame] = [
    {
        'url': (
            "https://textworldparchmentserver.z13.web.core.windows.net/games/game1.z8"
        ),
        'solution': [
            "(I see exits to the south, east, and west. I'm in Phase I, searching for "
                "the kitchen. I've started the game in the backyard, where the BBQ is. "
                "That isn't important right now, but I'll remember it for later in "
                "case I need to grill something. Let's try the east exit first.) E",
            "(I see an exit to the west. Looks like this is a dead end with nothing "
                "useful, so I'll go back west) W",
            "(I see exits to the south, east, and west.  I'll search westward next, "
                "but first I need to open the door in my way) OPEN WOODEN DOOR",
            "(Now that the door is open, the path west is clear) W",
            "(I see an exit to the east. Another dead end. The toolbox and workbench "
                "aren't useful.) E",
            "(I see exits to the south, east, and west.  I've checked east and west, "
                "so only south remains. There's another door there I need to open.) "
                "OPEN SCREEN DOOR",
            "(Now my way south is clear.) S",
            "(I see exits to the north, south, east, and west. I found the kitchen, "
                "so I'm in Phase II now. The first thing to do in Phase II is read "
                "the cookbook.) READ COOKBOOK",
            "(Okay, I'm looking for a black papper, a chicken leg, and milk. Do I "
                "already have any of that?) I",
            "(No, there's nothing in my inventory. So there's nothing I need to drop, "
                "but three ingredients to search for. The fridge is the first place "
                "to search.) OPEN FRIDGE",
            "(That raw chicken leg is one of my three ingredients, so I should pick "
                "it up. The wing isn't an ingredient, so I'll ignore that.) "
                "GET RAW CHICKEN LEG",
            "(That's one ingredient down, two I'm still searching for. I see there "
                "are exits from here in all four directions, but I've already explored "
                "everything to the north. Let's try west.)  OPEN FROSTED-GLASS DOOR",
            "(Now my path west is clear.) W",
            "(I see an exit to the east. I don't need flour, so there's nothing useful "
                "here, and it's a dead end.) E",
            "(I see exits to the north, south, east, and west. The east and south "
                "exits are still unexplored. Let's go east.) E",
            "(I see exits to the south and west. There's nothing useful here, but "
                "there's an unexplored exit to the south.) S",
            "(I see exits to the north and west. Again nothing useful here, but "
                "another unexplored exit to the west.) W",
            "(I see exits to the north, east, and west. In my last three turns I went "
                "east, south, and west from the kitchen, so that north exit probably "
                "takes me back to the kitchen. West seems like the most promising "
                "direction for further exploration, but I need to open the door "
                "first.) OPEN FRONT DOOR",
            "(Now my way west is clear.) W",
            "(I see exits to the east and west. There's nothing useful here, but "
                "there's an explored exit to the west.) W",
            "(I see exits to the north and east. Again nothing useful here, but "
                "there's a door to the north.) OPEN SLIDING DOOR",
            "(Now my way north is clear.) N",
            "(I see an exit to the south. Jackpot! This room has both of the "
                "ingredients I'm still searching for. First, the black pepper.) "
                "GET BLACK PEPPER",
            "(I also need the milk.) GET MILK",
            "(I don't need the red hot pepper or the salt. I have all my ingredients "
                "now, which means I'm in Phase III and should return to the kitchen "
                "by going back the way I came.) S",
            "(I see exits to the north and east. The house is back east.) E",
            "(I see exits to the east and west. I'll go east again to re-enter the "
                "house.) E",
            "(I see exits to the north, east, and west. Okay, now I'm back in this "
                "corridor. I speculated earlier that north would be a shortcut back "
                "to the kitchen.) N",
            "(I see exits to the north, south, east, and west. I'm back in the "
                "kitchen, so it's time to re-read the cookbook.) READ COOKBOOK",
            "(Now I'll check my inventory to double-check that I have all the "
                "ingredients.) I",
            "(The ingredients I need are a black pepper, chicken leg, and milk. That "
                "exactly matches my inventory, so that means I did everything right "
                "so far and can proceed with the directions. Sice none of the "
                "directions involve cutting, I can ignore the knife. The first "
                "direction says to roast the chicken leg. When I need to roast "
                "something, I use the oven. I mustn't use the stove or the BBQ; "
                "I'd lose the game if I did that.)  COOK RAW CHICKEN LEG WITH OVEN",
            "(Okay, checking my inventory one more time: I should still be holding "
                "all the now-cooked ingredients.) I",
            "(Right. I'm still holding the now-roasted chicken leg, as well as milk "
                "and the black pepper which didn't need any preparation. Now, the "
                "final step in the directions is to prepare the meal, and I should be "
                "ready for that.) PREPARE MEAL",
            "(Finally, I can eat the meal. This should win the game.) EAT MEAL",
        ],
    }
]

Now we get into our implementation proper. We'll start by developing two utility functions, respectively for cleaning up the game's output before the agent sees it, and cleaning up the agent's input before the game sees it. `clean_obs` removes some junk that the game tries to print into the status bar but ends up in the main output instead. `strip_parentheticals` removes the parenthesized chain-of-thought output that the agent is instructed to provide but which the game can't parse.

In [9]:
import re

def clean_obs(obs : str) -> str:
    """Observations returned by the TextWorld gym include, without demarcation, things
    that the Z machine meant to put into the status bar rather than the main text
    window. This is ugly and unhelpful, so we use this function to strip it out by
    removing everything after the last '>'."""
    return re.sub(r">[^>]*$", "> ", obs)


def strip_parentheticals(action : str) -> str:
    """The LLM is instructed to place its chain-of-thought in parentheses. This function
    strips away parenthetical content so it doesn't get passed to the game parser."""
    # Strip balanced, innermost parentheticals until none remain
    while True:
        new_action = re.sub(r"[(][^()]*[)]", "", action)
        if new_action == action:
            break
        action = new_action
    # Strip to/from unbalanced parens
    action = re.sub(r"^.*[)]", "", action)
    action = re.sub(r"[(].*$", "", action)
    return action.lstrip(" \r\n").strip(" \r\n.")
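
A few sample inputs show what the two helpers do. The functions are restated here in condensed form so that the snippet runs on its own. The observation text and the status-bar residue are made up and only approximate what TextWorld prints.

```python
import re

def clean_obs(obs: str) -> str:
    # Same substitution as above: drop everything after the last '>'
    return re.sub(r">[^>]*$", "> ", obs)

def strip_parentheticals(action: str) -> str:
    # Same logic as above, condensed
    while True:
        new_action = re.sub(r"[(][^()]*[)]", "", action)
        if new_action == action:
            break
        action = new_action
    action = re.sub(r"^.*[)]", "", action)
    action = re.sub(r"[(].*$", "", action)
    return action.strip(" \r\n.")

examples = [
    "(I see an exit to the west. Dead end.) W",                    # balanced
    "(Outer thought (nested aside) continues) OPEN WOODEN DOOR.",  # nested
    "unfinished thought) N",                                       # stray ')'
    "S (trailing thought that never closes",                       # stray '('
    "(Only a thought, no command.)",                               # nothing left
]
commands = [strip_parentheticals(e) for e in examples]
# commands == ["W", "OPEN WOODEN DOOR", "N", "S", ""]

cleaned = clean_obs("You open the fridge.\n\n> -= Kitchen =-0/3")
# cleaned == "You open the fridge.\n\n> "
```

The last example strips down to an empty string, which is the case `play_game` later treats as silence from the agent.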

Another utility function. `extend_messages` takes a transcript (i.e., a list of chat-completion chunks), a game environment, and some input from the agent, executes the input, and extends the transcript with both the input and the game's response to it.

In [10]:
import openai, sys, textworld.gym
from typing import Any, Iterable, List, Optional, Tuple, TypedDict

TranscriptChunk = TypedDict('TranscriptChunk', {'role': str, 'content': str})
Transcript = List[TranscriptChunk]
TranscriptIter = Iterable[TranscriptChunk]

def extend_messages(
        messages : Transcript,
        action : str,
        env : textworld.gym.envs.TextworldGymEnv,
        render=False,
        prestripped : Optional[str] = None
    ) -> Tuple[float, bool, Any]:
    """Executes `action` in `env` and then extends `messages` with the resulting
    assistant/user exchange. If `render` is `True`, prints the exchange to stdout.
    If `prestripped` is given, it is sent to the game instead of re-stripping `action`."""
    messages.append({"role": "assistant", "content": action})
    if prestripped is None:
        prestripped = strip_parentheticals(action)
    if render:
        print("> " + action)
    obs, score, done, infos = env.step(prestripped)
    obs = clean_obs(obs)
    if render:
        print(obs, end="")
        sys.stdout.flush()
    messages.append({"role": "user", "content": obs})
    return (score, done, infos)

The `make_game` function takes care of invoking `tw-make` to generate a game. It stores games in `WORK_DIR/games` and names them according to a hash of the command line and the seed that was used. This provides for memoization: if the file already exists, then the game has already been generated and the function can return early.

In [11]:
import hashlib, random, os, os.path, subprocess
from typing import List, Optional

def game_hash(tw_make_args : List[str]) -> str:
    """Returns a hash of the arguments to `tw-make`, as a hex digest."""
    return hashlib.sha256(" ".join(tw_make_args).encode("utf-8")).hexdigest()

def make_game(seed : Optional[int] = None, tw_make_args : Optional[List[str]] = None) -> str:
    """Invokes 'tw-make' with `tw_make_args` (defaulting to `TW_MAKE_ARGS`) and an
    additional '--seed' argument of `seed` (default random). Stores the result in a
    file named based on the seed and a hash of `tw_make_args`, and returns that
    filename. If the file already exists, returns it immediately without
    regenerating."""
    if seed is None:
        seed = random.randint(1, 65535)
    if tw_make_args is None:
        tw_make_args = TW_MAKE_ARGS

    game_dir = os.path.join(WORK_DIR, "games")
    game_file = "{}-{}.z8".format(
        game_hash(tw_make_args), seed
    )
    os.makedirs(game_dir, exist_ok=True)
    out = os.path.join(game_dir, game_file)
    if os.path.exists(out):
        return out
    args = [TW_MAKE_BIN] + tw_make_args + ["--output", out, "--seed", str(seed)]
    subprocess.run(args, check=True)
    return out
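
The memoization works because the file name is a pure function of the arguments and the seed. Here is a minimal sketch of that naming scheme. `game_filename` is a hypothetical helper written for this illustration, and the argument list is the one used in this notebook.

```python
import hashlib
from typing import List

def game_filename(tw_make_args: List[str], seed: int) -> str:
    """Hypothetical helper: reproduces the naming scheme used by make_game."""
    digest = hashlib.sha256(" ".join(tw_make_args).encode("utf-8")).hexdigest()
    return "{}-{}.z8".format(digest, seed)

args = ["tw-cooking", "--recipe", "3", "--take", "2", "--go", "12",
        "--open", "--cook", "--cut", "--drop"]

same_a = game_filename(args, 7)
same_b = game_filename(list(args), 7)     # equal arguments, same seed
other_seed = game_filename(args, 8)       # same arguments, different seed
other_args = game_filename(args[:-1], 7)  # one flag fewer, same seed
```

Identical arguments and seed map to the same file, so a second call finds the game on disk. Changing either one produces a new name. Because the arguments are joined with spaces before hashing, reordering them also changes the name.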

`make_prompt` builds the prompt by interpreting what we've specified in `INSTRUCTIONS` and `SAMPLE_GAMES`.

In [12]:
import shutil, tempfile, urllib.request, textworld.gym
from typing import List

def make_prompt(instructions : str, sample_games: List[SampleGame]) -> Transcript:
    """Runs the sample games to produce transcripts, then combines those with the
    instructions to produce a complete prompt."""

    prompt : List[TranscriptChunk] = [{"role": "developer", "content": instructions}]

    for game in sample_games:
        if 'url' in game:
            with urllib.request.urlopen(game['url']) as response:
                with tempfile.NamedTemporaryFile(suffix=".z8") as tmp_file:
                    shutil.copyfileobj(response, tmp_file)
                    # Flush so the game loader sees the complete file on disk
                    tmp_file.flush()
                    env_id = textworld.gym.register_game(
                        tmp_file.name,
                        name=tmp_file.name
                    )
                    env = textworld.gym.make(env_id)
                    obs, _ = env.reset()
                    prompt += [{"role": "user", "content": clean_obs(obs)}]
                    for step in game["solution"]:
                        extend_messages(prompt, step, env)
        elif 'seed' in game:
            path = make_game(game['seed'], game.get('tw_make_args', TW_MAKE_ARGS))
            env_id = textworld.gym.register_game(path, name=path)
            env = textworld.gym.make(env_id)
            obs, _ = env.reset()
            prompt += [{"role": "user", "content": clean_obs(obs)}]
            for step in game["solution"]:
                extend_messages(prompt, step, env)
        else:
            raise ValueError("Each sample game must contain either a url or a seed")

        prompt += [
            {"role": "assistant", "content": "QUIT"},
            {
                "role": "developer",
                "content": (
                    "Well done! Now play a new game. Your instructions are the "
                    "same as before:\n\n" + instructions
                ),
            },
        ]
    return prompt
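
The layout of the resulting prompt is easier to see with canned data. The sketch below assembles the same sequence of roles from hard-coded observations instead of a real game. `sketch_prompt_layout` and the room texts are hypothetical stand-ins used only for this illustration.

```python
from typing import Dict, List, Tuple

def sketch_prompt_layout(
        instructions: str,
        first_obs: str,
        exchanges: List[Tuple[str, str]],
    ) -> List[Dict[str, str]]:
    """Hypothetical stand-in for make_prompt that takes (action, observation)
    pairs directly instead of stepping a TextWorld environment."""
    prompt = [
        {"role": "developer", "content": instructions},
        {"role": "user", "content": first_obs},
    ]
    for action, obs in exchanges:
        prompt.append({"role": "assistant", "content": action})
        prompt.append({"role": "user", "content": obs})
    # Each walkthrough ends with QUIT and a restatement of the instructions
    prompt.append({"role": "assistant", "content": "QUIT"})
    prompt.append({
        "role": "developer",
        "content": "Well done! Now play a new game. Your instructions are the "
                   "same as before:\n\n" + instructions,
    })
    return prompt

layout = sketch_prompt_layout(
    "INSTRUCTIONS",
    "You are in the backyard.\n\n> ",
    [("(Try east.) E", "You are in the garden.\n\n> "),
     ("(Dead end.) W", "You are in the backyard.\n\n> ")],
)
roles = [m["role"] for m in layout]
# roles == ['developer', 'user', 'assistant', 'user',
#           'assistant', 'user', 'assistant', 'developer']
```

When `play_game` appends the first observation of the real game as a `user` message, the restated instructions sit directly before it.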

`make_prompt` includes a developer prompt, but not all models support it. This next function will replace it with something else when needed.

In [13]:
def replace_dev_prompt(
        prompt : Transcript,
        replacement : str = "user"
    ) -> TranscriptIter:
    """Iterates over `prompt` and replaces every instance of the developer role with
    the `replacement` role."""
    for item in prompt:
        if item['role'] == "developer":
            newitem = dict(item)
            newitem['role'] = replacement
            yield newitem
        else:
            yield item
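
A short usage sketch with a made-up three-message prompt. The function is repeated so that the snippet runs on its own.

```python
def replace_dev_prompt(prompt, replacement="user"):
    # Repeated from the cell above
    for item in prompt:
        if item['role'] == "developer":
            newitem = dict(item)
            newitem['role'] = replacement
            yield newitem
        else:
            yield item

# Made-up prompt for illustration
prompt = [
    {"role": "developer", "content": "Instructions go here."},
    {"role": "user", "content": "You are in the backyard.\n\n> "},
    {"role": "assistant", "content": "(Let's look around.) E"},
]
as_system = list(replace_dev_prompt(prompt, "system"))
new_roles = [m["role"] for m in as_system]
# new_roles == ['system', 'user', 'assistant']; prompt itself is unchanged
```

The function is a generator, so wrap it in `list(...)` when a list is needed, as `play_game` does. Developer messages are copied before being modified, and all other messages are passed through as the same objects.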

This type definition specifies how the result of a game is represented. `model`, `tw_make_args`, and `seed` are known at the start of the game and indicate which model was playing and which game it was playing. `messages` is the complete game transcript, and `turns` counts the agent's turns. `outcome` specifies how the game ended: whether it was won, lost, ended due to hitting the turn limit, ended due to repeated silence from the agent, ended due to the agent quitting out of the game, or ended due to an API error. `error` holds the error message when the outcome is `error` and is `None` otherwise.

In [14]:
from typing import TypedDict, List, Optional, Literal
GameResult = TypedDict('GameResult', {
    'model': str,
    'tw_make_args': List[str],
    'seed': int,
    'error': Optional[str],
    'outcome': Literal['won', 'lost', 'turnmax', 'silence', 'quit', 'error'],
    'messages': Transcript,
    'turns': int,
})
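
Most analyses key on the `outcome` field. Here is a small sketch of tallying outcomes, assuming the attempts for each seed are gathered into a list in the order they were played. The records are fabricated placeholders carrying only the `outcome` key; they are not experimental data.

```python
from collections import Counter

# Placeholder records, one inner list of attempts per seed (not real results)
results_by_seed = [
    [{'outcome': 'won'}],
    [{'outcome': 'turnmax'}, {'outcome': 'won'}],
    [{'outcome': 'lost'}, {'outcome': 'quit'}],
]

# Outcome of the first attempt at each seed
first_attempt = Counter(attempts[0]['outcome'] for attempts in results_by_seed)

# Number of seeds won on any attempt
eventually_won = sum(
    any(r['outcome'] == 'won' for r in attempts) for attempts in results_by_seed
)
```

With these placeholders, one of three seeds is won on the first attempt and two of three are won eventually.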

Now we're finally ready to define our `play_game` function which interfaces the agent to the game, producing a `GameResult`.

In [15]:
import textworld, textworld.gym, openai, re, sys, time
from typing import Optional, List

MEMO_INSTRUCTIONS : Optional[str] = None
MEMO_SAMPLE_GAMES : Optional[List[SampleGame]] = None
DEFAULT_PROMPT : Optional[Transcript] = None

def play_game(
        seed : int,
        model : str,
        tw_make_args : Optional[List[str]] = None,
        prompt : Optional[Transcript] = None,
        max_turns : Optional[int] = None,
        render : bool = False
    ) -> GameResult:
    global MEMO_INSTRUCTIONS, MEMO_SAMPLE_GAMES, DEFAULT_PROMPT
    if MEMO_INSTRUCTIONS != INSTRUCTIONS or MEMO_SAMPLE_GAMES != SAMPLE_GAMES:
        DEFAULT_PROMPT = make_prompt(INSTRUCTIONS, SAMPLE_GAMES)
        MEMO_INSTRUCTIONS = str(INSTRUCTIONS)
        MEMO_SAMPLE_GAMES = list(SAMPLE_GAMES)
    if prompt is None:
        prompt = DEFAULT_PROMPT
    if max_turns is None:
        max_turns = MAX_TURNS
    if tw_make_args is None:
        tw_make_args = TW_MAKE_ARGS

    path = make_game(seed, tw_make_args)

    request_infos = textworld.core.EnvInfos(lost=True, won=True)
    env_id = textworld.gym.register_game(
        path, request_infos=request_infos, max_episode_steps=max_turns,
        name=path
    )

    status = {
        'model': model,
        'tw_make_args': tw_make_args,
        'seed': seed,
        'error': None,
    }

    game_env = textworld.gym.make(env_id)
    obs, infos = game_env.reset()
    obs = clean_obs(obs)
    done = False
    turns = 0
    silences = 0

    if render:
        print(obs, end="")
        sys.stdout.flush()

    model_args = dict(MODELS[model])
    client_args = dict(CLIENTS[model_args["client"]])
    del model_args["client"]
    if "developer_role" in model_args:
        developer_role = model_args["developer_role"]
        prompt = list(replace_dev_prompt(prompt, developer_role))
        del model_args["developer_role"]
    else:
        developer_role = "developer"
    if "model" not in model_args:
        model_args["model"] = model

    messages : Transcript = prompt + [{"role": "user", "content": obs}]

    client = openai.OpenAI(**client_args)
    while not done:
        turns += 1
        status['turns'] = turns
        backoff = 30.0
        consecutive_failures = 0
        while True:
            try:
                model_args["messages"] = messages
                action = (
                    client.chat.completions.create(**model_args)
                    .choices[0]
                    .message.content
                )
                break
            except openai.RateLimitError:
                time.sleep(backoff)
                backoff *= 2.0
            except openai.APIError as e:
                if render:
                    print(e)
                consecutive_failures += 1
                time.sleep(backoff)
                backoff *= 2.0
                if consecutive_failures > 3:
                    status['outcome'] = 'error'
                    status['error'] = str(e)
                    status['messages'] = messages
                    return status

        stripped = strip_parentheticals(action)
        if re.match(r"QUIT\b", stripped, re.I):
            messages += [{"role": "assistant", "content": action}]
            if render:
                print("> " + action)

            status['outcome'] = 'quit'
            status['messages'] = messages
            return status
        elif '.' in stripped or ',' in stripped or ';' in stripped or \
                '\n' in stripped or re.search(r"\bthen\b", stripped, re.I):
            silences += 1
            correction = [
                {"role": "assistant", "content": action},
                {
                    "role": developer_role,
                    "content": (
                        "Your last message looks like you may have tried to issue "
                        "multiple commands at once. The game doesn't support this. "
                        "Try again, one command at a time.\n\n> "
                    ),
                },
            ]
            if render:
                print("> " + correction[0]['content'])
                print(correction[1]['content'], end='')
            messages += correction
            if silences >= 3:
                status['outcome'] = 'silence'
                status['messages'] = messages
                return status
        elif stripped == "":
            silences += 1
            correction = [
                {"role": "assistant", "content": action},
                {
                    "role": developer_role,
                    "content": (
                        "Your last message contained nothing except a parenthetical, "
                        "so it couldn't be provided to the game. Try again. If you've "
                        "completely given up, just say QUIT.\n\n> "
                    ),
                },
            ]

            if render:
                print("> " + correction[0]['content'])
                print(correction[1]['content'], end='')
            messages += correction
            if silences >= 3:
                status['outcome'] = 'silence'
                status['messages'] = messages
                return status
        else:
            silences = 0
            try:
                _, done, infos = extend_messages(
                    messages, action, game_env, render, prestripped=stripped
                )
            except Exception as e:
                status['outcome'] = 'error'
                status['messages'] = messages
                status['error'] = str(e)
                return status

    if infos['lost']:
        status['outcome'] = 'lost'
    elif infos['won']:
        status['outcome'] = 'won'
    else:
        status['outcome'] = 'turnmax'
    status['messages'] = messages
    return status
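
The branching on the agent's reply inside `play_game` can be read as a four-way classification. The helper below, `classify_reply`, is a hypothetical restatement of those checks in the same order. The notebook does not define or call it, and the category labels are invented for this sketch.

```python
import re

def classify_reply(stripped: str) -> str:
    """Hypothetical helper: classify a reply after its parentheticals have
    been stripped, using the same tests in the same order as play_game."""
    if re.match(r"QUIT\b", stripped, re.I):
        return "quit"
    if any(ch in stripped for ch in ".,;\n") or re.search(r"\bthen\b", stripped, re.I):
        return "multiple"   # looks like several commands at once
    if stripped == "":
        return "empty"      # nothing but a parenthetical
    return "command"        # forwarded to the game

labels = [classify_reply(s) for s in [
    "quit",
    "GET MILK, GET BLACK PEPPER",
    "go north then east",
    "",
    "OPEN FRIDGE",
]]
# labels == ['quit', 'multiple', 'multiple', 'empty', 'command']
```

In `play_game`, the `multiple` and `empty` cases both increment the `silences` counter, and three of them in a row end the game with the `silence` outcome.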

Now we set up some machinery to run games in parallel batches and let the agent retry up to a fixed number of attempts if it doesn't win the first time. `run_experiment` will return a list of lists of `GameResult`s, where each inner list is a series of attempts at the same game seed. Results are stored on disk, and any model/seed/`tw_make_args` combination that has already been run will have its stored result returned rather than being re-run.

In [16]:
import json, multiprocessing, os, os.path
from typing import List, Optional

def experiment_worker(
        seed : int,
        model : str,
        max_attempts : int,
        tw_make_args : List[str],
        prompt : Transcript,
        max_turns : int,
        out_dir : str
    ) -> List[GameResult]:
    os.makedirs(out_dir, exist_ok=True)
    results = list()
    nonerror_attempts = 0
    for attempt in range(5*max_attempts):
        if nonerror_attempts >= max_attempts:
            break
        out_path = os.path.join(out_dir, str(attempt) + ".json")
        try:
            with open(out_path, 'r') as f:
                saved_result = json.load(f)
                results.append(saved_result)
                if saved_result['outcome'] == 'won':
                    return results
                elif saved_result['outcome'] != 'error':
                    nonerror_attempts += 1
                continue
        # No usable stored result (missing or unreadable file): play the game.
        except Exception:
            pass
        result = play_game(seed, model,
                           tw_make_args=tw_make_args,
                           prompt=prompt,
                           max_turns=max_turns)
        results.append(result)
        with open(out_path, 'w') as f:
            json.dump(result, f, indent=4, sort_keys=True)
        if result['outcome'] == 'won':
            return results
        elif result['outcome'] != 'error':
            nonerror_attempts += 1
    return results

def pregenerate_games(
        start_seed : int = 1,
        n_games : int = 100,
        tw_make_args : Optional[List[str]] = None
    ):
    with multiprocessing.Pool() as p:
        p.starmap(
            make_game,
            [
                (seed, tw_make_args)
                for seed in range(start_seed, start_seed + n_games)
            ]
        )

def run_experiment(
        models : List[str],
        experiment_name : str = 'experiment',
        processes : int = 1,
        start_seed : int = 1,
        n_games : int = 100,
        max_attempts : int = 3,
        tw_make_args : Optional[List[str]] = None,
        prompt : Optional[Transcript] = None,
        max_turns : Optional[int] = None,
) -> List[List[GameResult]]:
    if tw_make_args is None:
        tw_make_args = TW_MAKE_ARGS
    if prompt is None:
        prompt = DEFAULT_PROMPT
    if max_turns is None:
        max_turns = MAX_TURNS

    pregenerate_games(start_seed, n_games, tw_make_args)

    worker_arguments = [
        (
            seed,
            model,
            max_attempts,
            tw_make_args,
            prompt,
            max_turns,
            os.path.join(
                WORK_DIR,
                experiment_name,
                game_hash(tw_make_args),
                model,
                str(seed)
            )
        )
        for model in models
        for seed in range(start_seed, start_seed + n_games)
    ]

    with multiprocessing.Pool(processes) as p:
        return p.starmap(experiment_worker, worker_arguments)
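
The store-then-reuse behavior of `experiment_worker` can be seen in isolation in the following minimal sketch. The names `cached_attempt` and `fake_game` are hypothetical and exist only for illustration; `fake_game` stands in for `play_game`, and the sketch leaves out the retry loop entirely.

```python
import json
import os
import tempfile
from typing import Callable, Dict

def cached_attempt(out_path: str, run: Callable[[], Dict]) -> Dict:
    """Return the stored result at out_path if present, else run and store it."""
    try:
        with open(out_path, "r") as f:
            return json.load(f)
    except (OSError, json.JSONDecodeError):
        pass  # nothing usable on disk; fall through and run
    result = run()
    with open(out_path, "w") as f:
        json.dump(result, f, indent=4, sort_keys=True)
    return result

calls = []

def fake_game() -> Dict:
    # Hypothetical stand-in for play_game; records that it was invoked.
    calls.append(1)
    return {"outcome": "won", "seed": 1}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "0.json")
    first = cached_attempt(path, fake_game)   # runs the game and writes 0.json
    second = cached_attempt(path, fake_game)  # reads 0.json instead of re-running

print(first == second, len(calls))  # prints: True 1
```

Because results are keyed by file path, re-running an interrupted experiment only pays for the attempts that were never written to disk.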

One last utility function to make it convenient to download experimental results.

In [17]:
import os.path, tarfile, tempfile
def download_experiment(experiment: str = 'experiment'):
    try:
        from google.colab import files
    except ImportError as e:
        raise RuntimeError(
            "Download failed: this notebook is not running in Google Colab."
        ) from e

    experiment_dir = os.path.join(WORK_DIR, experiment)
    if not os.path.exists(experiment_dir):
        raise FileNotFoundError(
            f"Download failed: experiment directory '{experiment_dir}' not found."
        )

    with tempfile.TemporaryDirectory() as temp_dir:
        tar_filename = os.path.join(temp_dir, f"{experiment}.tar.gz")
        with tarfile.open(tar_filename, "w:gz") as tar:
            tar.add(experiment_dir, arcname=os.path.basename(experiment_dir))

        files.download(tar_filename)

And now we're ready to run! `processes=12` is about right for running close to the rate limit on my Tier 4 OpenAI account.

In [18]:
MAX_ATTEMPTS = 3
ALL_RESULTS = run_experiment(
    ['gpt-4o', 'gpt-4o-mini', 'llama3.1-405b-instruct-fp8', 'llama3.3-70b-instruct-fp8'],
    processes=12,
    n_games=100,
    max_attempts=MAX_ATTEMPTS
)

## Analysis

We'll build a table of cumulative outcome frequencies over each attempt.

In [19]:
from typing import Dict, List

FILTERED_RESULTS = []
for result_list in ALL_RESULTS:
    filtered = [ result for result in result_list if result['outcome'] != 'error']
    FILTERED_RESULTS.append(filtered)

RESULTS_BY_MODEL : Dict[str, List[List[GameResult]]] = dict()
for results in FILTERED_RESULTS:
    model = results[0]['model']
    if model not in RESULTS_BY_MODEL:
        RESULTS_BY_MODEL[model] = list()
    RESULTS_BY_MODEL[model].append(results)

OUTCOMES : Dict[str, List[Dict[str, int]]] = dict()
CUMULATIVE_OUTCOMES : Dict[str, List[Dict[str, int]]] = dict()

for model, all_model_results in RESULTS_BY_MODEL.items():
    model_outcomes = [dict() for _ in range(MAX_ATTEMPTS)]
    cumulative_model_outcomes = [None for _ in range(MAX_ATTEMPTS)]
    cumulative_model_outcomes[0] = dict()
    for attempt in range(MAX_ATTEMPTS):
        for results in all_model_results:
            if attempt < len(results):
                outcome = results[attempt]['outcome']
                if outcome not in model_outcomes[attempt]:
                    model_outcomes[attempt][outcome] = 0
                if outcome not in cumulative_model_outcomes[attempt]:
                    cumulative_model_outcomes[attempt][outcome] = 0
                model_outcomes[attempt][outcome] += 1
                cumulative_model_outcomes[attempt][outcome] += 1
        # Carry the running totals forward once per attempt.
        if attempt + 1 < MAX_ATTEMPTS:
            cumulative_model_outcomes[attempt + 1] = \
                dict(cumulative_model_outcomes[attempt])
    OUTCOMES[model] = model_outcomes
    CUMULATIVE_OUTCOMES[model] = cumulative_model_outcomes

CUMULATIVE_OUTCOMES

{'gpt-4o': [{'won': 89, 'lost': 4, 'quit': 7},
  {'won': 98, 'lost': 4, 'quit': 9},
  {'won': 100, 'lost': 4, 'quit': 9}],
 'gpt-4o-mini': [{'won': 23,
   'lost': 13,
   'turnmax': 46,
   'silence': 10,
   'quit': 8},
  {'won': 30, 'lost': 20, 'turnmax': 91, 'silence': 16, 'quit': 20},
  {'won': 35, 'lost': 34, 'turnmax': 132, 'silence': 20, 'quit': 26}],
 'llama3.1-405b-instruct-fp8': [{'won': 89,
   'quit': 8,
   'turnmax': 2,
   'silence': 1},
  {'won': 100, 'quit': 8, 'turnmax': 2, 'silence': 1},
  {'won': 100, 'quit': 8, 'turnmax': 2, 'silence': 1}],
 'llama3.3-70b-instruct-fp8': [{'quit': 19,
   'won': 57,
   'turnmax': 18,
   'lost': 5,
   'silence': 1},
  {'quit': 24, 'won': 80, 'turnmax': 28, 'lost': 9, 'silence': 2},
  {'quit': 30, 'won': 85, 'turnmax': 36, 'lost': 10, 'silence': 2}]}

Llama3.1-405B-Instruct-FP8 won 89/100 games on its first attempt and the rest on its second attempt. Nice!
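
One thing to keep in mind when reading the table: cumulative counts of non-winning outcomes can exceed the number of games, because a game that fails several attempts contributes one count per failed attempt. Here is a toy sketch of the same counting scheme using `collections.Counter`; the three seeds and their outcomes are made up for illustration.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical outcomes for three seeds, one inner list per seed.
toy_results: List[List[str]] = [
    ["won"],
    ["turnmax", "turnmax", "won"],
    ["quit", "turnmax", "turnmax"],
]
max_attempts = 3

cumulative: List[Dict[str, int]] = []
running: Counter = Counter()
for attempt in range(max_attempts):
    for outcomes in toy_results:
        if attempt < len(outcomes):
            running[outcomes[attempt]] += 1
    # Snapshot the running totals after each attempt.
    cumulative.append(dict(running))

print(cumulative[-1])  # prints: {'won': 2, 'turnmax': 4, 'quit': 1}
```

With only three seeds, `turnmax` still reaches 4, since the second and third seeds each hit the turn limit twice.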

Now we're going to compute Bayesian credible intervals on these win rates. Consider each game as a biased coin independently drawn from some population. Each coin in the population has a different bias, and the distribution of biases is unknown. We toss each coin either `MAX_ATTEMPTS` times or until it comes up heads, whichever comes first. We want to compute a credible interval on the probability that a newly drawn coin from this population will come up heads by its $n$th toss, for each $n$ from 1 to `MAX_ATTEMPTS`.
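
To make the coin model concrete, here is a small simulation sketch using only the standard library. The function name `simulate_first_heads` is hypothetical, and the choice of a Beta(2, 1) population of biases is an arbitrary assumption for illustration; in the real experiment that distribution is unknown.

```python
import random
from typing import List

def simulate_first_heads(n_coins: int, max_tosses: int, rng: random.Random) -> List[int]:
    """Return, for each toss index, how many coins first came up heads on that toss."""
    first_heads = [0] * max_tosses
    for _ in range(n_coins):
        # Assumed population of biases, for illustration only.
        bias = rng.betavariate(2.0, 1.0)
        for toss in range(max_tosses):
            if rng.random() < bias:
                first_heads[toss] += 1
                break  # stop tossing once the coin lands heads
    return first_heads

rng = random.Random(0)
first_heads = simulate_first_heads(n_coins=100, max_tosses=3, rng=rng)

# Running total: coins that have come up heads by toss n.
cumulative_heads = []
total = 0
for count in first_heads:
    total += count
    cumulative_heads.append(total)

print(first_heads, cumulative_heads)
```

The per-toss counts in `first_heads` play the same role as the per-attempt win counts in `OUTCOMES`, and `cumulative_heads` corresponds to the `'won'` entries of `CUMULATIVE_OUTCOMES`.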

Computing this for $n=1$ is straightforward: we can just evaluate the inverse CDF of a beta distribution at 0.025 and 0.975 (e.g. for a 95% credible interval), with parameters (e.g. for 89 wins out of 100 games and a Beta(0.5, 0.5) Jeffreys prior) $\alpha = 89 + 0.5 = 89.5,\ \beta = 11 + 0.5 = 11.5$. However, for $n>1$ it gets a lot messier and we're best off resorting to a Monte Carlo method. At each stage we draw beta random variates from the posterior for the probability that a coin which has reached that stage comes up heads on it, and combine them with the cumulative probability from the previous stages.
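
As a quick check of the $n=1$ case, the closed-form interval can be computed directly with `scipy.stats.beta.ppf`. This sketch assumes 89 first-attempt wins out of 100 games and the same 0.5/0.5 prior parameters that `compute_credible_intervals` uses as defaults.

```python
from scipy.stats import beta

wins, games = 89, 100
prior_a, prior_b = 0.5, 0.5   # assumed prior, matching the defaults in this notebook
a = wins + prior_a            # 89.5
b = (games - wins) + prior_b  # 11.5

# Inverse CDF (percent point function) at the two tail probabilities.
lower, upper = beta.ppf([0.025, 0.975], a, b)
print(lower, upper)
```

The result should agree closely with the first-stage interval produced by the Monte Carlo method for a model with 89 first-attempt wins, up to sampling noise.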

In [20]:
# If you don't already have numpy and scipy installed, add them to
# the pip install line at the very top of this notebook.
import numpy as np
from scipy.stats import beta

def compute_credible_intervals(
        N, ns,
        prior_a=0.5,
        prior_b=0.5,
        conf_level=0.95,
        nsamples=1000000
):
    intervals = []
    # cumulative probability so far; initially, no chance of head.
    cumulative = np.zeros(nsamples)
    # Number of coins that reach the current stage
    remaining = N
    lower_percentile = (1 - conf_level) / 2.0 * 100
    upper_percentile = (1 + conf_level) / 2.0 * 100

    for n in ns:
        # At each stage, we have 'remaining' coins available.
        # Our Beta posterior parameters are:
        a = n + prior_a
        b = (remaining - n) + prior_b
        # Draw samples for the probability of head at this stage.
        p_samples = beta.rvs(a, b, size=nsamples)
        # Update the cumulative probability.
        # A coin gets head by this stage either if it already got head before,
        # or, if not, it gets head in this stage.
        cumulative += (1 - cumulative) * p_samples
        # Compute the credible interval from the samples.
        ci = np.percentile(cumulative, [lower_percentile, upper_percentile])
        intervals.append(ci)
        # Update the remaining coins for the next stage.
        remaining -= n

    return intervals

In [21]:
from typing import Dict, List, Tuple
WIN_RATE_CREDIBLE_INTERVALS : Dict[str, List[Tuple[float, float]]] = dict()

for model, outcome_dicts in OUTCOMES.items():
    wins = [ outcome_dict.get('won', 0) for outcome_dict in outcome_dicts ]
    cis = compute_credible_intervals(len(RESULTS_BY_MODEL[model]), wins)
    WIN_RATE_CREDIBLE_INTERVALS[model] = [ (ci[0], ci[1]) for ci in cis ]

WIN_RATE_CREDIBLE_INTERVALS

{'gpt-4o': [(0.8175083884197585, 0.940097598387077),
  (0.9397861454121765, 0.9960414688750009),
  (0.9795307423009577, 0.9999963657003693)],
 'gpt-4o-mini': [(0.15593107754352875, 0.31925219958167633),
  (0.22089395764801617, 0.399218937102621),
  (0.2701572993858289, 0.4556640786012336)],
 'llama3.1-405b-instruct-fp8': [(0.817651980052568, 0.9400877173788655),
  (0.9763106420616923, 0.9999952779039336),
  (0.9848036098758056, 0.9999997362772812)],
 'llama3.3-70b-instruct-fp8': [(0.47209598901986566, 0.6641005885884872),
  (0.7167100851648446, 0.8709264424212316),
  (0.7774182475862635, 0.913407572566829)]}