***This notebook assumes familiarity with the documentation on [how to add new games](https://github.com/clp-research/clembench/blob/main/docs/howto_add_games.md) and how to [log events and build records](https://github.com/clp-research/clembench/blob/main/docs/logdoc.md). Please read those first. Also have a look first at [how to prototype games](https://github.com/clp-research/clembench/blob/main/docs/howto_prototype_games.ipynb), which shows the basic commands to interact with the LLMs.***

# Adding a new game to the framework


Let's implement a simple two-player game we will call ```firstlast```. 

The players should engage in a turn-based conversation about a predefined topic. Player A starts with an utterance whose first and last tokens must start with a predefined letter, say ```d```. Player B must then reply with an utterance whose first and last tokens start with the next letter in the alphabet (here, ```e```). And so on, for ```n``` turns (where each turn comprises an utterance from A and an utterance from B). If an utterance does not conform to these rules (i.e. it is incorrect), the players lose the game. We also define a move rule: if an utterance does not start with 'I SAY: ' (i.e. it is invalid), the game is immediately aborted. If all utterances up to turn ```n``` are valid and correct, the game is successful.
 
For instance, if the topic is ```birds```, the initial letter is ```h``` and the number of turns is 2, this would be a successful game:

- Hi! I love birds, but it's hard to identify them. I need help. (h: hi / help)
- I know what you mean. I can try to help, please describe it. (i: I / it)
- Just a moment... Ok, it's blue but looks like a Eurasian jay. (j: just / jay)
- Kick in more details, otherwise I don't know. (k: kick / know)

At each turn, we need to check two aspects:
- does the utterance have a valid form that can be parsed? Failing to meet this leads to the game being aborted.
- does the utterance fulfil the game rules for a successful turn? Failing to meet this leads to the game being lost.

To make things simple, in this example we will check only these conditions:
- validity: does the utterance start with ```I SAY:```?
- correctness: do the first and last tokens begin with the same, correct letter at play?
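
The two checks can be sketched as a small standalone function. This is only an illustration of the rules above, not the framework code: the name ```judge_utterance``` and the returned labels are made up for this example, and the comparison is assumed to be case-insensitive so that an utterance starting with "Hi" counts for the letter ```h```.

```python
TAG = 'I SAY:'


def judge_utterance(utterance: str, letter: str) -> str:
    """Return 'invalid', 'incorrect' or 'correct' for one utterance."""
    # validity: the utterance must start with the tag
    if not utterance.startswith(TAG):
        return 'invalid'
    tokens = utterance[len(TAG):].split()
    # a tag with nothing after it cannot be parsed either
    if not tokens:
        return 'invalid'
    # correctness: first and last tokens begin with the letter at play
    first, last = tokens[0][0].lower(), tokens[-1][0].lower()
    if first == letter and last == letter:
        return 'correct'
    return 'incorrect'


print(judge_utterance("I SAY: Hi! I need help.", 'h'))   # correct
print(judge_utterance("I SAY: Hi! I love birds.", 'h'))  # incorrect
print(judge_utterance("Hi! I need help.", 'h'))          # invalid
```

An invalid utterance would abort the game, an incorrect one would lose it, and only a correct one lets the game continue.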

To add a new game to the framework, we need at least the following components:

1. Game resources: all data, prompt templates and text files that are necessary to create instances of a game and to group these instances into experiments.
2. Instances: a JSON file containing the configuration of each instance of this game, grouped into experiments. This must be done by a script named ```instancegenerator.py```, with a class that inherits from ```GameInstanceGenerator```.
3. Game Master: a script that controls and enforces the defined move and game rules and dynamics, inheriting from ```GameMaster```. This must be implemented in a file ```master.py```.
4. Players: a script that defines the programmatic behaviour and any other attributes of a player, inheriting from ```Player```. This can be implemented in a file named ```players.py```.
5. Game Benchmark: a class that realises the game, inheriting from ```GameBenchmark```. This can also live in the file ```master.py```.

In this example, defining an episode requires instantiating the initial prompts and 3 additional parameters: the topic, the letter for the first player and the number of turns for that game play. 

Let's walk through the implementation step by step. We'll write the contents of ```instancegenerator.py```, ```master.py``` and ```players.py```, that you should save as files to run the game.

**Note**: For the mandatory methods, always check the parent class documentation to be sure about the required and optional arguments.

In [1]:
import os
import sys
from pathlib import Path

sys.path.append('..')

GAME_NAME = 'firstlast'

We should start by creating a new directory inside the ```games/``` directory, naming it after our game. Everything our game needs will live there. Inside it, we create a directory ```in``` where game instances will be saved and ```resources``` for other resources, containing at least one directory for templates (initial prompts).

In [2]:
path = Path(f'../games/{GAME_NAME}')

# create the whole directory tree; exist_ok makes this cell safe to re-run
os.makedirs(path / 'in', exist_ok=True)
os.makedirs(path / 'resources' / 'initial_prompts', exist_ok=True)

## Defining prompts with game rules

The players need to be instructed on what the rules of the game are, possibly with some examples, at the beginning of the game. For that, we must define the initial prompts passed to player A and player B. 

In the template text, we can define variables that will later be filled with our chosen values (here: topic, first letter, number of rounds). The prompts need to be adjusted for player A and B according to their roles. For example:

Player A:
> "Let's play a game. You must have a conversation about $topic with your partner. Your first turn must start and end with words that begin with the letter $letter. The reply of your partner must be similar, with the letter that comes after $letter in the alphabet. Then it's your turn again with the next letter, and so on. You'll do it for $nturns turns. Always start your utterance with I SAY: and then give your answer. If you break the rules, you lose."

Player B:
> "Let's play a game. You must have a conversation about $topic with your partner. Their first turn must start and end with words that begin with the letter $letter. Your reply must be similar, with the letter that comes after $letter in the alphabet. Then it's their turn again with the next letter, and so on. You'll do it for $nturns turns. Always start your utterance with I SAY: and then give your answer. If you break the rules, you lose."

Decide what amount of prompt engineering you will do at this step. Once you are satisfied with your preliminary results (i.e. the model can process the instructions well enough for your purposes), save the initial prompts as plain texts using ```.template``` as an extension. You can save many templates if you wish to test different prompts, and read each of them when you generate game instances (see below). Note that this template is just an example and has not been tested.
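
As a minimal sketch of how such slots get filled, the standard library class ```string.Template``` replaces every ```$name``` with a value passed by keyword. The template text below is a shortened, made-up stand-in for the prompts above, using the same three slot names.

```python
from string import Template

# a shortened, made-up template with the same three slots
template = Template("Talk about $topic. Start with the letter $letter. "
                    "Play for $nturns turns.")

prompt = template.substitute(topic='birds', letter='h', nturns=2)
print(prompt)

# substitute() raises KeyError when a slot is left unfilled,
# which helps to catch typos in slot names early
try:
    template.substitute(topic='birds')
except KeyError as err:
    print(f'missing slot: {err}')
```

If some slots should deliberately stay unfilled until later, ```safe_substitute()``` leaves them in place instead of raising an error.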

Let's write one template file for player A and one for player B:

In [3]:
with open(path / 'resources' / 'initial_prompts' / 'initial_prompt_a.template', 'w') as file:
    file.write(
        "Let's play a game. You must have a conversation about $topic with your partner. Your first turn must start and end with words that begin with the letter $letter. The reply of your partner must be similar, with the letter that comes after $letter in the alphabet. Then it's your turn again with the next letter, and so on. You'll do it for $nturns turns. Always start your utterance with I SAY: and then give your answer. If you break the rules, you lose."
    ) 

with open(path / 'resources' / 'initial_prompts' /  'initial_prompt_b.template', 'w') as file:
    file.write(
        "Let's play a game. You must have a conversation about $topic with your partner. Their first turn must start and end with words that begin with the letter $letter. Your reply must be similar, with the letter that comes after $letter in the alphabet. Then it's their turn again with the next letter, and so on. You'll do it for $nturns turns. Always start your utterance with I SAY: and then give your answer. If you break the rules, you lose."
    ) 

## Additional game resources

Everything else needed to create instances of the game or to be accessed by the game master should be saved into ```games/firstlast/resources```. 

Let's create a ```topics.txt``` file with a list of topics that we can later sample from to create our instances.

In [4]:
topics = ['dogs', 'cats', 'birds', 'trees']

with open(path / 'resources' / 'topics.txt', 'w') as file:
    for topic in topics:
        file.write(topic + '\n') 

## Creating game instances

Create a Python script called ```instancegenerator.py``` in ```games/firstlast/```. Running this file will create a ```JSON``` file in ```games/firstlast/in/```, called ```instances.json```. We can use the framework to organise that. All we need to do is write a class that inherits from ```GameInstanceGenerator``` and write its ```on_generate``` method according to our needs. Then, in the main call, instantiate this class and call its ```.generate()``` method.

In ```on_generate```, we define experiments and then define instances of each experiment. An instance is a configuration of one specific game play and an experiment is a set of related instances. We can define what is an experiment and what is an instance depending on what dimensions we want to evaluate later. 

For our game, let's define an experiment as a set of instances about the same topic. To define an instance in an experiment, we define the initial letter, the initial prompts and the number of turns. This is useful if we wish to evaluate the performance of LLMs on variations of the same topic. Another possibility would be to define an experiment as a set of instances with the same number of turns, and then each instance could be about a different topic. That's our choice.

Running this will automatically create a ```JSON``` file with a key ```experiments```, which is a list of experiments. Each element has a name and a list of ```game_instances```. A game instance should have at least a ```game_id``` assigning an index to that instance and other keys and values that are necessary to play the game. Here, the initial prompts can have their slots already filled with the instance's values (or we can leave that for the ```setup``` method of the game master).

Here is an example of the structure we need:

```JSON
{
    "experiments": [
        {
            "name": "NAME_1",
            "game_instances": [
                {
                    "game_id": 0,
                    "first_letter": "LETTER",
                    "n_turns": "N",
                    "prompt_player_a": "PROMPT_A",
                    "prompt_player_b": "PROMPT_B"
                },
                {
                    "game_id": 1,
                    "first_letter": "LETTER",
                    "n_turns": "N",
                    "prompt_player_a": "PROMPT_A",
                    "prompt_player_b": "PROMPT_B"
                }
            ]
        },
        {
            "name": "NAME_2",
            "game_instances": [
                {
                    "game_id": 0,
                    "first_letter": "LETTER",
                    "n_turns": "N",
                    "prompt_player_a": "PROMPT_A",
                    "prompt_player_b": "PROMPT_B"
                },
                {
                    "game_id": 1,
                    "first_letter": "LETTER",
                    "n_turns": "N",
                    "prompt_player_a": "PROMPT_A",
                    "prompt_player_b": "PROMPT_B"
                }
            ]
        }
    ]
}
```

***Note***: The ```instances.json``` file should contain everything that the game master needs to set up the configuration of a game play! We can add as many keys and values as we need.

Here is an example:

In [None]:
# save the contents of this cell as games/firstlast/instancegenerator.py
import random
import string

from clemgame.clemgame import GameInstanceGenerator

# set the name of the game in the script, as you named the directory
# this name will be used everywhere, including in the table of results
GAME_NAME = 'firstlast'
# we will create 10 instances for each experiment; vary this as you wish
N_INSTANCES = 10
# if the generation involves randomness, remember to set a random seed
SEED = 123

class FirstLastGameInstanceGenerator(GameInstanceGenerator):
    def __init__(self):
        # always do this to initialise GameInstanceGenerator
        super().__init__(GAME_NAME)
    
    # define on_generate, a mandatory method
    def on_generate(self):
        # get the list of topics, which will be our experiments
        topics = self.load_file('resources/topics.txt').strip('\n').split('\n')
        # get the prompts for player a and player b
        # we'll keep the prompts fixed in all instances, replacing only the
        # necessary slots (but you can do it differently)
        prompt_a = self.load_template('resources/initial_prompts/initial_prompt_a')
        prompt_b = self.load_template('resources/initial_prompts/initial_prompt_b')

        # building the file, one experiment at a time
        for topic in topics:
            # create an experiment (for us, named after a topic)
            experiment = self.add_experiment(topic)
            # build N_INSTANCES instances for each experiment
            for game_id in range(N_INSTANCES):
                # set the parameters
                # here we do it randomly, but that can also be read from a file
                # one of the first 5 letters in the alphabet
                letter = random.choice(string.ascii_lowercase[:5])
                # up to 8 turns, so that we don't run out of letters
                n_turns = random.randint(3, 8)
                # create a game instance, using a game_id counter/index
                instance = self.add_game_instance(experiment, game_id)
                # populate the game instance with its parameters
                instance['first_letter'] = letter
                instance['n_turns'] = n_turns
                instance['prompt_player_a'] = self.create_prompt(
                    topic, prompt_a, letter, n_turns)
                instance['prompt_player_b'] = self.create_prompt(
                    topic, prompt_b, letter, n_turns)
    
    # an additional method, specific for our example
    def create_prompt(self,
                      topic: str,
                      prompt: str,
                      letter: str,
                      n_turns: int) -> str:
        """Replace a prompt template with slot values."""
        text = string.Template(prompt).substitute(topic=topic, letter=letter,
                                                  nturns=n_turns)
        return text


if __name__ == '__main__':
    random.seed(SEED)
    # always call this, which will actually generate and save the JSON file
    FirstLastGameInstanceGenerator().generate()


Summary:

- write a class that inherits from ```GameInstanceGenerator```.
- you must implement the ```on_generate``` method, which should call ```self.add_experiment()``` to add experiments and ```self.add_game_instance()``` to add instances. Populate the game instance with keys and values.
- ```GameInstanceGenerator``` has methods to load various files inside the game directory, for example ```self.load_template()``` and  ```self.load_file()```.
- in ```'__main__'```, call ```FirstLastGameInstanceGenerator().generate()```.
- set a random seed if your generation relies on randomness; when you need new instances, change the random seed.
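
Once the generator has run, it can be worth checking that the file has the expected layout before playing. The function below is a hypothetical helper, not part of the framework; it only assumes the structure and the key names shown in the JSON example above, and it is demonstrated on a tiny in-memory stand-in for the file.

```python
import json


def check_instances(data: dict) -> int:
    """Check the expected layout and return the number of instances."""
    required = {'game_id', 'first_letter', 'n_turns',
                'prompt_player_a', 'prompt_player_b'}
    n_instances = 0
    for experiment in data['experiments']:
        assert 'name' in experiment
        for instance in experiment['game_instances']:
            missing = required - instance.keys()
            assert not missing, f'missing keys: {missing}'
            n_instances += 1
    return n_instances


# a tiny in-memory stand-in for the contents of instances.json
example = json.loads("""
{"experiments": [{"name": "birds", "game_instances": [
    {"game_id": 0, "first_letter": "h", "n_turns": 2,
     "prompt_player_a": "PROMPT_A", "prompt_player_b": "PROMPT_B"}]}]}
""")
print(check_instances(example))
```

To check the real file, load ```games/firstlast/in/instances.json``` with ```json.load()``` and pass the result to the same function.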

## Creating the Game

Now, create a file called ```master.py``` in ```games/firstlast/```. In this script, we must define a class that inherits from ```GameMaster``` and will define how the game is played, enforce the rules and log the necessary actions so that the interaction can later be reconstructed and evaluated. We also need to define the two players with classes that inherit from ```Player```, which we can do in a separate script, e.g. ```players.py```.

We'll first explain it step by step, and later we provide the full code in one cell.


### Defining the Player class

In our game, the roles of players A and B are symmetric, i.e. they behave the same way and have the same tasks and goals. So we can define one class and instantiate both players from it. If in your game players have different roles, then define two types of Player objects. The only method that we must implement is ```_custom_response()```, which must define a programmatic behaviour for this player. The rest (getting and generating utterances via API calls to LLMs) is taken care of by the framework (we'll see below how to use it). Of course, you can add more methods that relate to the behaviour of the player in your game.

The programmatic behaviour is useful in two cases: when the player is really a program (i.e. it sends only predefined messages, e.g. read from a file, not retrieved from an LLM agent) or for testing your program, using the ```mock``` setting that does not make API calls to any LLMs. For the first case, the argument ```model_name``` in the initialisation should be set to ```"programmatic"```.

We also initialise a list to represent the dialogue history of this player. It will be incrementally built during the game play by appending new utterances to it.

In [6]:
# save the contents of this cell as games/firstlast/players.py

import random
from string import ascii_lowercase as letters
from typing import List

from clemgame.clemgame import Player


class Speaker(Player):
    def __init__(self, model_name: str, player: str, letter: str):
        # always initialise the Player class with the model_name argument
        # if the player is a program and you don't want to make API calls to
        # LLMs, use model_name="programmatic"
        super().__init__(model_name)
        self.player: str = player
        self.initial_letter: str = letter

        # a list to keep the dialogue history
        self.history: List = []

    # implement this method as you prefer, with these same arguments
    def _custom_response(self, messages, turn_idx) -> str:
        """Return a mock message with the suitable letter and format."""
        # get the first letter of the content of the last message
        # messages is a list of dictionaries with messages in openai API format
        if turn_idx == 1 and self.player == 'A':
            letter = 'I SAY: ' + self.initial_letter
        else:
            previous_letter = messages[-1]['content'][7].lower()
            # introduce a small probability that the player fails
            letter = self._sample_letter(previous_letter)
        # return a string whose first and last tokens start with the next letter     
        return f"{letter}xxx from {self.player}, turn {turn_idx} {letter.replace('I SAY: ', '')}xxx."

    # an additional method specific for this game
    # for testing, we want the utterances to be invalid or incorrect sometimes
    def _sample_letter(self, letter: str) -> str:
        """Randomly decide which letter to use in the message."""
        prob = random.random() 
        index = letters.index(letter)
        if prob < 0.05:
            # correct but invalid (no tag)
            return letters[index + 1]
        if prob < 0.1:
            # valid tag but wrong letter
            return 'I SAY: ' + letter
        # valid and correct
        return 'I SAY: ' + letters[index + 1]


Summary:

- write a class that inherits from ```Player```.
- define its ```_custom_response()``` method, which implements the programmatic behaviour of the player (for testing, or because it is really a program) and returns a string.
- ```model_name``` is the name of the LLM, or ```"programmatic"```.
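
When testing with the mock player, it helps to know how often a full game can be expected to succeed. Here is a rough estimate, assuming that the game master passes turn index 1 in the first turn (so that player A's first utterance is always valid and correct) and that every other utterance is sampled independently with the thresholds used in ```_sample_letter()```. The names below are introduced only for this calculation.

```python
P_INVALID = 0.05    # prob < 0.05: correct letter but no tag
P_INCORRECT = 0.05  # 0.05 <= prob < 0.1: tag but wrong letter
P_OK = 1 - P_INVALID - P_INCORRECT


def p_success(n_turns: int) -> float:
    """Probability that a mock game with n_turns turns is successful."""
    # player A's first utterance is fixed, all the others are sampled
    sampled_utterances = 2 * n_turns - 1
    return P_OK ** sampled_utterances


for n_turns in (3, 8):
    print(n_turns, round(p_success(n_turns), 3))
```

With these thresholds, a bit more than half of the 3-turn mock games should succeed and roughly one in five of the 8-turn games, so a test run over all instances should exercise the aborted, lost and successful outcomes.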

### Defining the GameMaster class

To define the game master, we need to write a class that inherits from ```GameMaster``` and implement its ```setup()``` and ```play()``` methods, which allow the episode to be created and run, and then its ```compute_scores()``` method, which computes the metrics that are required for evaluation. Scores are computed after the game is finished, using the separate ```score``` argument in the CLI script.

The metrics that every game must compute are listed at ```clemgame/metrics.py``` and described in more detail in the appendix. Note: You should **not** implement ```METRIC_PLAYED``` if you use the provided evaluation scripts, because this metric is inferred from ```METRIC_ABORTED``` there. Besides, any number of additional game-specific metrics can also be logged (see more below).

These are the mandatory methods. However, for readability, we will also write auxiliary methods.

**IMPORTANT**: The game master has to log ***every event*** that is relevant to reconstruct the interaction, build the transcript and evaluate the game. 

The ```GameMaster``` is also a ```GameResourceLocator```, which has special methods to access and write files in the game's local directory, and a ```GameRecorder```, which knows how to log events. We'll see below how to use it.

#### Initialisation

The first step is to define how to initialise the game master. The ```__init__``` method gets the experiment object and a list of player names as strings. Initialise any needed attributes here. In our example, we define variables for whether the game gets aborted or lost and the number of actually completed game turns.

```python
import copy
from typing import List, Dict, Tuple
from string import ascii_lowercase as letters

import numpy as np

import clemgame.metrics as ms
from clemgame.clemgame import GameMaster, GameBenchmark
from clemgame import get_logger

from games.firstlast.players import Speaker
from games.firstlast.instancegenerator import GAME_NAME

class FirstLast(GameMaster):
    """Implement mechanisms for playing FirstLast."""
    def __init__(self, experiment: Dict, player_backends: List[str]):
        super().__init__(GAME_NAME, experiment, player_backends)

        # save experiment and player attributes that will be necessary later
        self.topic = experiment['name']
        self.model_a = player_backends[0]
        self.model_b = player_backends[1]

        # initialise attributes that will be used for the evaluation scores
        self.aborted: bool = False
        self.lose: bool = False
        self.complete_turns: int = 0
```

### Keeping records

Note: see details about keeping records in ```logdoc.md```.

All events that occur during the game, i.e. all actions by the game master and by the players, must be documented. This is done by the methods of ```GameRecorder```, inherited by the ```GameMaster```. The essential ones are:

- at the beginning of every turn, call ```log_next_turn()``` (what counts as a turn is up to you; here, it means one utterance by player A and one utterance by player B)
- in the game setup, call ```log_players()``` in order to log the models that are playing this episode of the game
- use ```log_event()``` to log all types of actions with a ```to``` and a ```from_```.
- the ```action``` object passed to ```log_event()``` must contain at least a key ```type``` and a key ```content```. The first can be send message, get message, metadata, parse, error, invalid format or any game-specific types. Content is the actual message to be displayed in the transcript.
- use only the values 'Player 1', 'Player 2' or 'GM' for the ```from_``` and ```to``` arguments. Messages that the game master emits to itself should have 'GM' both in ```from_``` and ```to```.
- all events that involve making an API call should pass an additional ```call``` argument to ```log_event()``` containing the actual and exact API input and output objects, for posterior inspection if necessary.
- any other object needed for scoring or documentation can be logged with ```log_key()```.

Every action that is logged gets saved into the ```interactions.json``` file after the game is played. This file is then used to build game transcripts and to compute evaluation scores. The file ```requests.json``` contains the API calls, saved when ```log_event()``` is called with a ```call``` argument. Make sure that you save the exact API input and output. If you use a list, make deep copies to guarantee that you are not logging an object that mutates.
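
The advice about deep copies can be illustrated without the framework. In the sketch below, the two lists are made-up stand-ins for a log (the real records are kept by ```GameRecorder```); it only shows what happens when a dialogue history is stored by reference and keeps growing afterwards.

```python
import copy

history = [{'role': 'user', 'content': 'initial prompt'}]

# made-up stand-ins for a log of API inputs
log_by_reference = []
log_by_copy = []

log_by_reference.append(history)            # stores the same list object
log_by_copy.append(copy.deepcopy(history))  # stores a snapshot

# the game goes on and the history grows
history.append({'role': 'assistant', 'content': 'I SAY: dogs dig.'})

print(len(log_by_reference[0]))  # 2: the logged input changed after the fact
print(len(log_by_copy[0]))       # 1: the snapshot still shows what was sent
```

Only the deep copy preserves the exact input of the call at the time it was made.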

Besides game events, the game master must also compute and log scores. Please read ```logdoc.md``` for details. In summary, there are two types of scores: episode-level and turn-level. These should be computed inside the method ```compute_scores()``` using ```log_episode_score()``` and ```log_turn_score()```, respectively. 

```compute_scores()``` gets the ```interactions.json``` dictionary as argument, so every key and value that is necessary to compute scores should be logged into the interaction file.

### IMPORTANT: Inspecting the game records

During development, always check the generated ```interactions.json``` and ```requests.json``` to make sure that the API calls are passing the correct structure and that the records are being correctly saved.

```interactions.json``` is built by the game master as a way to represent the actual interaction (with all its meta-events like parsing messages or checking game rules). This is used to create the transcripts, which are a user-friendly visualisation of the interaction. But remember that this does not reflect the actual API calls; it only reflects what the game master makes of the game!

The actual prompts and responses from the model are saved into ```requests.json``` when an action is logged with its corresponding prompt and response object (see below how to do it). This file will reflect what was actually passed to and from the LLM. Remember that LLMs do not keep an internal state, so every call to a model must contain its full dialogue history. Also remember that when there are two LLMs playing at once, each will have its own dialogue history, which may be different! That's why, for debugging purposes, looking only at ```interactions.json``` is not enough: it may not reflect exactly what the LLMs consumed and output.

#### Logging framework level events

If you'd like to log runtime events for the framework, you can define and use the framework standard logger: ```logger = get_logger(__name__)```. We won't use it here.

### Setup

The setup method gets all keys and values in the instance dictionary, as we defined above. Use this method to set up everything that is needed so that the game can be played.

In our example, we instantiate both players (and thus an empty dialogue history for each of them), the initial turn index and letter and some variables to keep track of the (in)valid requests (which are mandatory scores).

```python
    def setup(self, first_letter: str, n_turns: int, prompt_player_a: str,
              prompt_player_b: str, game_id: int) -> None:
        """Setup the episode (mandatory)."""

        self.n_turns = n_turns

        # instantiate both players
        self.player_a = Speaker(self.model_a, 'A', first_letter)
        self.player_b = Speaker(self.model_b, 'B', first_letter)

        # initialise game variables
        self.current_turn: int = 0
        self.current_letter: str = first_letter

        # initialise common metrics
        self.request_counts = [0] * (n_turns + 1)
        self.parsed_request_counts = [0] * (n_turns + 1)
        self.violated_request_counts = [0] * (n_turns + 1)

        # add initial prompts to each player's messages
        self.initiate(prompt_player_a, prompt_player_b)

        # always log the details of the players in this format (see logdoc)
        self.log_players({
            'GM': 'Game master for FirstLast',
            'Player 1': f'Player A: {self.model_a}',
            'Player 2': f'Player B: {self.model_b}'
            })

        # log any additional keys that will be relevant for evaluation
        self.log_key('n_turns', n_turns)
```

The setup calls an ```initiate()``` method that creates turn 0. For us, turn 0 consists of the initial prompts sent to each player.

```python
    def initiate(self, prompt_player_a: str, prompt_player_b: str) -> None:
        """Initialise the dialogue history (firstlast specific)."""
        # always call log_next_turn when a turn starts
        self.log_next_turn()

        # append the initial message of each player to their history
        # the value user means the message is from an interlocutor of the model
        self.player_a.history.append({'role': 'user', 'content': prompt_player_a})
        self.player_b.history.append({'role': 'user', 'content': prompt_player_b})

        # also log the messages as events for the transcriptions
        action = {'type': 'send message', 'content': prompt_player_a}
        self.log_event(from_='GM', to='Player 1', action=action)
        action = {'type': 'send message', 'content': prompt_player_b}
        self.log_event(from_='GM', to='Player 2', action=action)
```

Summary:

- the setup must define players, call ```log_players()``` and log other game-specific keys.

### Playing the game

Implement the ```play()``` method, which should run the game from beginning to end. To facilitate readability, we'll break this method into various components.

Here, we have a loop over turns, which runs until any stop condition is met (see below). Once the loop is over, we log an 'end game' action and save all temporary variables that will be needed for evaluation.

```python
    def play(self) -> None:
        """Play the game until the end (mandatory)."""
        # play the game
        while self.proceed():
            self.current_turn += 1
            # always call log_next_turn when a new turn starts
            self.log_next_turn()
            self.turn()
        
        if self.complete_turns == self.n_turns:
            # log a message informing that the game was successfully played
            action = {'type': 'info', 'content': 'game successful'}
            self.log_event(from_='GM', to='GM', action=action)

        # log a final message saying that the game came to an end
        action = {'type': 'info', 'content': 'end game'}
        self.log_event(from_='GM', to='GM', action=action)
        # log all temporary game variables that are needed for evaluation
        self.log_eval_assets()
```

Summary:

- ```play()``` must perform a full run of the game; once it is finished, the game is also finished and the records and requests are saved by the framework.

Let's also define a few auxiliary methods.

In ```proceed()```, we check the conditions for continuing to the next turn. For our example, we check that the maximum number of turns has not been reached and that the game was not aborted or lost in the previous turn:

```python
    def proceed(self) -> bool:
        """Check if the game loop should continue (firstlast specific)."""
        return (self.current_turn < self.n_turns
                and not self.aborted
                and not self.lose)
```

A method to change the current letter to the next one, after one player has produced a correct utterance:

```python
    def update_letter(self) -> None:
        """Update the letter being played (firstlast specific)."""
        current_index = letters.index(self.current_letter)
        self.current_letter = letters[current_index + 1]
```
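
Note that ```letters[current_index + 1]``` raises an ```IndexError``` once the current letter is ```z```. The instance generator above avoids this by sampling the first letter from the first five letters and allowing at most 8 turns. A small sketch of that arithmetic, with a hypothetical helper name, assuming that each turn consumes two letters (one for player A and one for player B):

```python
from string import ascii_lowercase as letters


def last_letter_needed(first_letter: str, n_turns: int) -> str:
    """Return the letter of the final utterance of player B."""
    # each turn consumes two letters (one for A, one for B)
    last_index = letters.index(first_letter) + 2 * n_turns - 1
    if last_index >= len(letters):
        raise ValueError('this configuration runs past the letter z')
    return letters[last_index]


# worst case produced by the generator above: letter e, 8 turns
print(last_letter_needed('e', 8))  # t
# the example from the introduction: letter h, 2 turns
print(last_letter_needed('h', 2))  # k
```

If you change the sampling ranges in the generator, a check like this tells you whether all instances can still be played to the end.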

A method that appends an utterance to the history of a player. The format must be a dictionary that contains the keys ```role``` (user or assistant, see below) and ```content``` (the actual utterance as a string).

```python
    def _append_utterance(self, utterance: str, player: str, role: str) -> None:
        """Add an utterance to the history of a player (firstlast specific)."""
        assert player in ('a', 'b')
        if player == 'a':
            self.player_a.history.append({'role': role, 'content': utterance})
        else:
            self.player_b.history.append({'role': role, 'content': utterance})
```

A method to parse the utterance, first checking whether it starts with ```I SAY:```. This will be our condition to abort the game (i.e., it relates to the *form* of the reply, not its content).

```python
    @staticmethod
    def parse(utterance: str) -> Tuple[str, str]:
        """Check if the utterance is valid and return first and last tokens (firstlast specific)."""
        if not utterance.startswith('I SAY:'):
            return None, None
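        # drop the tag, then split the rest on whitespace
        tokens = utterance[len('I SAY:'):].split()
        if not tokens:
            # nothing follows the tag, so the utterance cannot be parsed
            return None, None
        return tokens[0], tokens[-1]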
```

A method to check for correctness of the utterance. This will be our condition for deciding whether a turn is successful.

```python
    def check_correctness(self, first_token: str, last_token: str) -> bool:
        """Check if the utterance conforms to rules (firstlast specific)."""
        return first_token[0] == self.current_letter and first_token[0] == last_token[0]
```
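The comparison above is case-sensitive, which is worth keeping in mind when writing prompts or programmatic players. The sketch below reproduces the comparison as a plain function and contrasts it with a hypothetical case-insensitive variant; both function names are made up for this illustration.

```python
def first_last_match(first_token: str, last_token: str, letter: str) -> bool:
    """Same comparison as check_correctness, with the letter passed in."""
    return first_token[0] == letter and first_token[0] == last_token[0]


def first_last_match_ci(first_token: str, last_token: str, letter: str) -> bool:
    """Hypothetical case-insensitive variant (not what the game master uses)."""
    first, last = first_token[0].lower(), last_token[0].lower()
    return first == letter.lower() and first == last


print(first_last_match('hi', 'help', 'h'))       # True
print(first_last_match('Hi!', 'help.', 'h'))     # False: 'H' != 'h'
print(first_last_match_ci('Hi!', 'help.', 'h'))  # True
```

Which behaviour is appropriate is a design decision; if capitalised sentence starts should count, the tokens need to be normalised before the comparison.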

Finally, a method that logs some counts and variables that are used later for scoring.

```python
    def log_eval_assets(self) -> None:
        """Aux to log variables needed for scoring (firstlast specific)"""
        self.log_key('Played turns', self.current_turn)
        self.log_key('Complete turns', self.complete_turns)
        self.log_key(ms.METRIC_ABORTED, self.aborted)
        self.log_key(ms.METRIC_LOSE, self.lose)
        self.log_key(ms.METRIC_REQUEST_COUNT, self.request_counts)
        self.log_key(ms.METRIC_REQUEST_COUNT_PARSED, self.parsed_request_counts)
        self.log_key(ms.METRIC_REQUEST_COUNT_VIOLATED, self.violated_request_counts)

```

### Defining a turn

Let's structure what a turn of our game must do:

1. Send the initial prompt to player A
2. Get player A's response
3. Check if the response conforms to the desired form, i.e. if the response can be parsed (here, we only check if it starts with 'I SAY: '). If not, abort the game immediately.
4. Check if the response conforms to the game rules, i.e. what a player is supposed to do for a successful turn (here, we only check if the first and last tokens begin with the desired letter). If not, the game is lost immediately.
5. Repeat the same for player B.
6. Log all events in between each step.

Note: instead of aborting the game immediately, you can also define a loop that allows re-prompting up to a maximum number of tries. In these cases, you can append some extra instructions to the current prompt, like ```Please answer carefully!```. We won't do it here.
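For reference, such a re-prompting loop could look roughly like the sketch below. It is independent of the framework: `ask_with_retries`, `get_response` and `is_valid` are hypothetical names, and the player is replaced by a scripted function.

```python
from typing import Callable, List, Optional

RETRY_HINT = ' Please answer carefully!'


def ask_with_retries(get_response: Callable[[str], str],
                     is_valid: Callable[[str], bool],
                     prompt: str,
                     max_tries: int = 3) -> Optional[str]:
    """Re-prompt until the reply is valid or the tries are used up.

    Returns the first valid reply, or None if every try failed (the
    caller would then abort the game).
    """
    current_prompt = prompt
    for _ in range(max_tries):
        reply = get_response(current_prompt)
        if is_valid(reply):
            return reply
        # append the extra instruction before trying again
        current_prompt = prompt + RETRY_HINT
    return None


# usage with a scripted stand-in for a player
replies = iter(['hello there', 'I SAY: hello there'])
sent: List[str] = []


def scripted_player(prompt: str) -> str:
    sent.append(prompt)
    return next(replies)


answer = ask_with_retries(scripted_player,
                          lambda r: r.startswith('I SAY: '),
                          'Talk about birds.')
print(answer)  # I SAY: hello there
print(sent)    # the second prompt carries the extra instruction
```

In a real game master, every try would also have to be logged and counted as a request.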

Don't forget that you have to build the dialogue history of each player as the game goes on. Here, we do that by appending messages to the player's ```history``` attribute. Messages that the player produces should be appended with the role ```assistant```. Messages that the player receives should be appended with the role ```user```. The framework then takes care of putting this list in a suitable format for each LLM. The records of player A and B should have the role keys swapped (i.e., what is assistant for player A becomes user for player B, and vice versa).
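The mirrored roles are easy to get wrong, so here is a small stand-alone sketch using plain lists in place of the players' ```history``` attributes. The `exchange` helper and the placeholder prompts are made up for this illustration.

```python
from typing import Dict, List

Message = Dict[str, str]


def exchange(history_a: List[Message], history_b: List[Message],
             reply_a: str, reply_b: str) -> None:
    """Record one A/B exchange in both histories with mirrored roles."""
    # A speaks: 'assistant' in A's own history, 'user' in B's
    history_a.append({'role': 'assistant', 'content': reply_a})
    history_b.append({'role': 'user', 'content': reply_a})
    # B speaks: 'assistant' in B's own history, 'user' in A's
    history_b.append({'role': 'assistant', 'content': reply_b})
    history_a.append({'role': 'user', 'content': reply_b})


# both histories start with the initial prompt from the game master
history_a = [{'role': 'user', 'content': 'prompt for A'}]
history_b = [{'role': 'user', 'content': 'prompt for B'}]
exchange(history_a, history_b, 'I SAY: hi, help', 'I SAY: indeed it')

print([m['role'] for m in history_a])  # ['user', 'assistant', 'user']
print([m['role'] for m in history_b])  # ['user', 'user', 'assistant']
```

With this scheme, player B's history begins with two consecutive ```user``` messages: its initial prompt followed by A's first utterance.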

We implement that as a ```turn()``` method. Note: this is a design decision, do it as you prefer.

```python
    def turn(self) -> None:
        """Perform a game turn, utterances by A and B (firstlast specific)."""
        # get player A's reply and add it to its history
        answer_a = self._get_utterance('a')

        # check if the game should be aborted or lost
        is_valid_turn = self._check_validity(answer_a)
        if not is_valid_turn:
            # stop game
            return None

        # add A's reply to B's history
        self._append_utterance(answer_a, 'b', 'user')
        # also add the reply to the transcript
        action = {'type': 'send message', 'content': answer_a}
        self.log_event(from_='GM', to='Player 2', action=action)

        # next player gets the next letter in the alphabet
        self.update_letter()

        # now do the same for player B

        # get player B's reply and add it to its history
        answer_b = self._get_utterance('b')

        # check if the game should be aborted or lost
        is_valid_turn = self._check_validity(answer_b)
        if not is_valid_turn:
            # stop game
            return None

        # add B's reply to A's history
        self._append_utterance(answer_b, 'a', 'user')
        # also add the reply to the transcript
        action = {'type': 'send message', 'content': answer_b}
        self.log_event(from_='GM', to='Player 1', action=action)

        self.update_letter()
        self.complete_turns += 1
```


In the method above, we used some auxiliary methods.

A method that makes an API call (or gets a programmatic response) for a chosen player and adds the utterance to that player's own history. Note that we use the role ```assistant```.

```python
    def _get_utterance(self, player: str) -> str:
        """Get utterance from a player and log it (firstlast specific)."""
        assert player in ('a', 'b')
        if player == 'a':
            # make an API call (or get a programmatic response) from player a
            prompt, raw_answer, answer = self.player_a(self.player_a.history,
                                                    self.current_turn)
            # add API call to the records
            action = {'type': 'get message', 'content': answer}
            self.log_event(from_='Player 1', to='GM', action=action,
                        call=(copy.deepcopy(prompt), raw_answer))
            # add reply to its own memory
            self._append_utterance(answer, 'a', 'assistant')

        else:
            # make an API call (or get a programmatic response) from player b
            prompt, raw_answer, answer = self.player_b(self.player_b.history,
                                            self.current_turn)
            # add API call to the records
            action = {'type': 'get message', 'content': answer}
            self.log_event(from_='Player 2', to='GM', action=action,
                        call=(copy.deepcopy(prompt), raw_answer))
            # add reply to its own memory
            self._append_utterance(answer, 'b', 'assistant')

        # increase the number of API requests 
        self.request_counts[self.current_turn] += 1
        return answer
```

A method that checks if an utterance conforms to the game rules (being valid and being correct):

```python
    def _check_validity(self, answer: str) -> bool:
        """Check if answer is valid and correct (firstlast specific)."""
        # parse answer
        first_token, last_token = self.parse(answer)

        # if invalid tag, abort game
        if first_token is None or last_token is None:
            self.aborted = True
            # log the abortion event
            action = {'type': 'invalid format', 'content': 'abort'}
            self.log_event(from_='GM', to='GM', action=action)
            # increase the counter of requests that violate form rules
            self.violated_request_counts[self.current_turn] += 1
            return False

        # increase the counter of requests that conform to form rules
        self.parsed_request_counts[self.current_turn] += 1
        # log the event that the string was valid (it has the expected form)
        action = {'type': 'metadata', 'content': 'valid string'}
        self.log_event(from_='GM', to='GM', action=action)

        # if the form is valid, check correctness wrt game rules
        is_correct_reply = self.check_correctness(first_token, last_token)

        # if not correct, lost game
        if not is_correct_reply:
            self.lose = True
            # log the fact that the game is now lost
            action = {'type': 'parse',
                      'content': f'{first_token}/{last_token} violates rules'}
            self.log_event(from_='GM', to='GM', action=action)

            return False

        # log the fact that the answer was correct
        action = {'type': 'parse',
                  'content': f'{first_token}/{last_token} conforms to rules'}
        self.log_event(from_='GM', to='GM', action=action)

        return True

```
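Stripped of logging and counters, the method boils down to a three-way decision. The sketch below expresses it as a pure function; the `Outcome` enum and `classify_reply` are hypothetical names used only for this illustration.

```python
from enum import Enum


class Outcome(Enum):
    ABORT = 'abort'        # invalid form: cannot be parsed
    LOSE = 'lose'          # valid form, but breaks the game rule
    CONTINUE = 'continue'  # valid and correct


def classify_reply(answer: str, letter: str) -> Outcome:
    """Map a reply to one of the three outcomes described above."""
    prefix = 'I SAY: '
    tokens = answer[len(prefix):].split() if answer.startswith(prefix) else []
    if not tokens:
        return Outcome.ABORT
    first, last = tokens[0], tokens[-1]
    if first[0] == letter and first[0] == last[0]:
        return Outcome.CONTINUE
    return Outcome.LOSE


print(classify_reply('hi, I need help', 'h'))           # Outcome.ABORT
print(classify_reply('I SAY: hi, I need a bird', 'h'))  # Outcome.LOSE
print(classify_reply('I SAY: hi, I need help', 'h'))    # Outcome.CONTINUE
```

Keeping the decision separate from its side effects makes it easy to unit-test the game rules without a running game master.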

### Computing scores for the evaluation

During game play, some attributes kept track of counters that are used for the evaluation, but we have not computed all evaluation scores yet. This is done by the mandatory ```compute_scores()``` method. It gets the contents of the ```interaction.json``` file as input and must compute and log both turn-level and episode-level scores. Scoring is a separate step that does not run in the same process as the game play. Therefore, all relevant information must be saved into ```interaction.json``` during play and read back by ```compute_scores()``` when scoring is called (see the CLI commands below).

Here, we log only the common metrics for all games, but other metrics can be logged the same way.

**Important**: If the game is aborted, the main score (```BENCH_SCORE```) and any game-specific episode-level scores must be set to ```numpy.nan```; turn-level scores can still be computed for the valid turns before the abort.

```python
    def compute_scores(self, episode_interactions: Dict) -> None:
        """Compute episode-level and turn-level scores (mandatory)."""
        played_turns = episode_interactions['Played turns']
        complete_turns = episode_interactions['Complete turns']
        # turn 0 was only the initial prompts, so we disregard it here
        reqs = episode_interactions[ms.METRIC_REQUEST_COUNT][1:]
        p_reqs = episode_interactions[ms.METRIC_REQUEST_COUNT_PARSED][1:]
        v_reqs = episode_interactions[ms.METRIC_REQUEST_COUNT_VIOLATED][1:]
        n_turns = len(reqs)

        for turn in range(0, played_turns):
            self.log_turn_score(turn, ms.METRIC_REQUEST_COUNT, reqs[turn])
            self.log_turn_score(turn, ms.METRIC_REQUEST_COUNT_PARSED, p_reqs[turn])
            self.log_turn_score(turn, ms.METRIC_REQUEST_COUNT_VIOLATED, v_reqs[turn])

        aborted = int(episode_interactions[ms.METRIC_ABORTED])
        lose = int(episode_interactions[ms.METRIC_LOSE]) if not aborted else 0
        success = 1 - lose if not aborted else 0
        # scale to the 0-100 range expected for BENCH_SCORE
        bench_score = 100 * complete_turns / n_turns if not aborted else np.nan
        
        self.log_episode_score(ms.METRIC_ABORTED, aborted)
        self.log_episode_score(ms.METRIC_LOSE, lose)
        self.log_episode_score(ms.METRIC_SUCCESS, success)
        self.log_episode_score(ms.METRIC_REQUEST_COUNT, sum(reqs))
        self.log_episode_score(ms.METRIC_REQUEST_COUNT_PARSED, sum(p_reqs))
        self.log_episode_score(ms.METRIC_REQUEST_COUNT_VIOLATED, sum(v_reqs))
        self.log_episode_score(ms.METRIC_REQUEST_SUCCESS, sum(p_reqs) / sum(reqs))
        self.log_episode_score(ms.BENCH_SCORE, bench_score)
```
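To make the handling of the per-turn counters concrete, the following stand-alone sketch works through a small aborted episode. It does not use the framework: `summarise_requests` and the dictionary keys are hypothetical, and `math.nan` stands in for ```np.nan``` to keep the example free of dependencies.

```python
import math
from typing import Dict, List


def summarise_requests(request_counts: List[int],
                       parsed_counts: List[int],
                       aborted: bool,
                       complete_turns: int) -> Dict[str, float]:
    """Turn per-turn counters into episode-level numbers."""
    # index 0 holds the initial prompts only, so drop it
    reqs, p_reqs = request_counts[1:], parsed_counts[1:]
    n_turns = len(reqs)
    return {
        'requests': sum(reqs),
        'parsed': sum(p_reqs),
        'request_success': sum(p_reqs) / sum(reqs),
        # fraction of complete turns; undefined for aborted episodes
        'completed_fraction': math.nan if aborted else complete_turns / n_turns,
    }


# 3 planned turns; player A's reply in turn 3 could not be parsed
scores = summarise_requests(request_counts=[0, 2, 2, 1],
                            parsed_counts=[0, 2, 2, 0],
                            aborted=True,
                            complete_turns=2)
print(scores['requests'], scores['parsed'])      # 5 4
print(scores['request_success'])                 # 0.8
print(math.isnan(scores['completed_fraction']))  # True
```

The request counters remain meaningful for an aborted episode, whereas the quality-related value becomes NaN.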

Summary:

- use ```log_episode_score``` to log computed episode-level scores (measuring success at the whole game play) and ```log_turn_score``` to log computed turn-level scores (measuring success or progress at each turn)
- all games should preferably implement the common metrics (see ```clemgame/metrics.py``` and the appendix in the paper); ```METRIC_PLAYED``` must not be logged, because it is inferred by the provided evaluation script
- minimally, all games must compute ```METRIC_ABORTED```, the binary ```METRIC_SUCCESS``` and the ```BENCH_SCORE``` (which ranges from 0=fail to 100=success)
- games can have as many additional game-specific evaluation metrics as you want; make sure to use different names
- if the game is aborted, all game-specific scores should be set to ```np.nan```

### Full script

In [None]:
# save the contents of this cell as games/firstlast/master.py

import copy
from typing import List, Dict, Tuple
from string import ascii_lowercase as letters

import numpy as np

import clemgame.metrics as ms
from clemgame.clemgame import GameMaster, GameBenchmark
from clemgame import get_logger

from games.firstlast.players import Speaker
from games.firstlast.instancegenerator import GAME_NAME


# use the framework logger to log events relevant at runtime;
# this is independent from the game records / transcript
logger = get_logger(__name__)


class FirstLast(GameMaster):
    """Implement mechanisms for playing FirstLast."""
    def __init__(self, experiment: Dict, player_backends: List[str]):
        super().__init__(GAME_NAME, experiment, player_backends)

        # save experiment and player attributes that will be necessary later
        self.topic = experiment['name']
        self.model_a = player_backends[0]
        self.model_b = player_backends[1]

        # initialise attributes that will be used for the evaluation scores
        self.aborted: bool = False
        self.lose: bool = False
        self.complete_turns: int = 0

    def setup(self, first_letter: str, n_turns: int, prompt_player_a: str,
              prompt_player_b: str, game_id: int) -> None:
        """Setup the episode (mandatory)."""

        self.n_turns = n_turns

        # instantiate both players
        self.player_a = Speaker(self.model_a, 'A', first_letter)
        self.player_b = Speaker(self.model_b, 'B', first_letter)

        # initialise game variables
        self.current_turn: int = 0
        self.current_letter: str = first_letter

        # initialise common metrics
        self.request_counts = [0] * (n_turns + 1)
        self.parsed_request_counts = [0] * (n_turns + 1)
        self.violated_request_counts = [0] * (n_turns + 1)

        # add initial prompts to each player's messages
        self.initiate(prompt_player_a, prompt_player_b)

        # always log the details of the players in this format (see logdoc)
        self.log_players({
            'GM': 'Game master for FirstLast',
            'Player 1': f'Player A: {self.model_a}',
            'Player 2': f'Player B: {self.model_b}'
            })

        # log any additional keys that will be relevant for evaluation
        self.log_key('n_turns', n_turns)
        
    def play(self) -> None:
        """Play the game until the end (mandatory)."""
        # play the game
        while self.proceed():
            self.current_turn += 1
            # always call log_next_turn when a new turn starts
            self.log_next_turn()
            self.turn()

        if self.complete_turns == self.n_turns:
            # log a message informing that the game was successfully played
            action = {'type': 'info', 'content': 'game successful'}
            self.log_event(from_='GM', to='GM', action=action)

        # log a final message saying that the game came to an end
        action = {'type': 'info', 'content': 'end game'}
        self.log_event(from_='GM', to='GM', action=action)
        # log all temporary game variables that are needed for evaluation
        self.log_eval_assets()


    def initiate(self, prompt_player_a: str, prompt_player_b: str) -> None:
        """Initialise the dialogue history (firstlast specific)."""
        # always call log_next_turn when a turn starts
        self.log_next_turn()

        # append the initial message of each player to their history
        # the value user means the message is from an interlocutor of the model
        self.player_a.history.append({'role': 'user', 'content': prompt_player_a})
        self.player_b.history.append({'role': 'user', 'content': prompt_player_b})

        # also log the messages as events for the transcriptions
        action = {'type': 'send message', 'content': prompt_player_a}
        self.log_event(from_='GM', to='Player 1', action=action)
        action = {'type': 'send message', 'content': prompt_player_b}
        self.log_event(from_='GM', to='Player 2', action=action)

    def proceed(self) -> bool:
        """Check if the game loop should continue (firstlast specific)."""
        return (self.current_turn < self.n_turns
                and not self.aborted
                and not self.lose)

    def update_letter(self) -> None:
        """Update the letter being played (firstlast specific)."""
        current_index = letters.index(self.current_letter)
        self.current_letter = letters[current_index + 1]

    def _get_utterance(self, player: str) -> str:
        """Get utterance from a player and log it (firstlast specific)."""
        assert player in ('a', 'b')
        if player == 'a':
            # make an API call (or get a programmatic response) from player a
            prompt, raw_answer, answer = self.player_a(self.player_a.history,
                                                    self.current_turn)
            # add API call to the records
            action = {'type': 'get message', 'content': answer}
            self.log_event(from_='Player 1', to='GM', action=action,
                        call=(copy.deepcopy(prompt), raw_answer))
            # add reply to its own memory
            self._append_utterance(answer, 'a', 'assistant')

        else:
            # make an API call (or get a programmatic response) from player b
            prompt, raw_answer, answer = self.player_b(self.player_b.history,
                                            self.current_turn)
            # add API call to the records
            action = {'type': 'get message', 'content': answer}
            self.log_event(from_='Player 2', to='GM', action=action,
                        call=(copy.deepcopy(prompt), raw_answer))
            # add reply to its own memory
            self._append_utterance(answer, 'b', 'assistant')

        # increase the number of API requests 
        self.request_counts[self.current_turn] += 1
        return answer

    def _check_validity(self, answer: str) -> bool:
        """Check if answer is valid and correct (firstlast specific)."""
        # parse answer
        first_token, last_token = self.parse(answer)

        # if invalid tag, abort game
        if first_token is None or last_token is None:
            self.aborted = True
            # log the abortion event
            action = {'type': 'invalid format', 'content': 'abort'}
            self.log_event(from_='GM', to='GM', action=action)
            # increase the counter of requests that violate form rules
            self.violated_request_counts[self.current_turn] += 1
            return False

        # increase the counter of requests that conform to form rules
        self.parsed_request_counts[self.current_turn] += 1
        # log the event that the string was valid (it has the expected form)
        action = {'type': 'metadata', 'content': 'valid string'}
        self.log_event(from_='GM', to='GM', action=action)

        # if the form is valid, check correctness wrt game rules
        is_correct_reply = self.check_correctness(first_token, last_token)

        # if not correct, lost game
        if not is_correct_reply:
            self.lose = True
            # log the fact that the game is now lost
            action = {'type': 'parse',
                      'content': f'{first_token}/{last_token} violates rules'}
            self.log_event(from_='GM', to='GM', action=action)

            return False

        # log the fact that the answer was correct
        action = {'type': 'parse',
                  'content': f'{first_token}/{last_token} conforms to rules'}
        self.log_event(from_='GM', to='GM', action=action)

        return True

    def turn(self) -> None:
        """Perform a game turn, utterances by A and B (firstlast specific)."""
        # get player A's reply and add it to its history
        answer_a = self._get_utterance('a')

        # check if the game should be aborted or lost
        is_valid_turn = self._check_validity(answer_a)
        if not is_valid_turn:
            # stop game
            return None

        # add A's reply to B's history
        self._append_utterance(answer_a, 'b', 'user')
        # also add the reply to the transcript
        action = {'type': 'send message', 'content': answer_a}
        self.log_event(from_='GM', to='Player 2', action=action)

        # next player gets the next letter in the alphabet
        self.update_letter()

        # now do the same for player B

        # get player B's reply and add it to its history
        answer_b = self._get_utterance('b')

        # check if the game should be aborted or lost
        is_valid_turn = self._check_validity(answer_b)
        if not is_valid_turn:
            # stop game
            return None

        # add B's reply to A's history
        self._append_utterance(answer_b, 'a', 'user')
        # also add the reply to the transcript
        action = {'type': 'send message', 'content': answer_b}
        self.log_event(from_='GM', to='Player 1', action=action)

        self.update_letter()
        self.complete_turns += 1

    def _append_utterance(self, utterance: str, player: str, role: str) -> None:
        """Add an utterance to the history of a player (firstlast specific)."""
        assert player in ('a', 'b')
        if player == 'a':
            self.player_a.history.append({'role': role, 'content': utterance})
        else:
            self.player_b.history.append({'role': role, 'content': utterance})

    @staticmethod
    def parse(utterance: str) -> Tuple[str, str]:
        """Check if utterance is valid and return first/last tokens (firstlast specific)."""
        if not utterance.startswith('I SAY: '):
            return None, None
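        # drop the 7-character prefix and split the rest into tokens
        tokens = utterance[7:].split()
        if not tokens:
            # nothing follows the prefix: treat as invalid as well
            return None, None
        return tokens[0], tokens[-1]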
            
    def check_correctness(self, first_token: str, last_token: str) -> bool:
        """Check if the utterance conforms to rules (firstlast specific)."""
        return first_token[0] == self.current_letter and first_token[0] == last_token[0]

    def compute_scores(self, episode_interactions: Dict) -> None:
        """Compute episode-level and turn-level scores (mandatory)."""
        played_turns = episode_interactions['Played turns']
        complete_turns = episode_interactions['Complete turns']
        # turn 0 was only the initial prompts, so we disregard it here
        reqs = episode_interactions[ms.METRIC_REQUEST_COUNT][1:]
        p_reqs = episode_interactions[ms.METRIC_REQUEST_COUNT_PARSED][1:]
        v_reqs = episode_interactions[ms.METRIC_REQUEST_COUNT_VIOLATED][1:]
        n_turns = len(reqs)

        for turn in range(0, played_turns):
            self.log_turn_score(turn, ms.METRIC_REQUEST_COUNT, reqs[turn])
            self.log_turn_score(turn, ms.METRIC_REQUEST_COUNT_PARSED, p_reqs[turn])
            self.log_turn_score(turn, ms.METRIC_REQUEST_COUNT_VIOLATED, v_reqs[turn])

        aborted = int(episode_interactions[ms.METRIC_ABORTED])
        lose = int(episode_interactions[ms.METRIC_LOSE]) if not aborted else 0
        success = 1 - lose if not aborted else 0
        # scale to the 0-100 range expected for BENCH_SCORE
        bench_score = 100 * complete_turns / n_turns if not aborted else np.nan
        
        self.log_episode_score(ms.METRIC_ABORTED, aborted)
        self.log_episode_score(ms.METRIC_LOSE, lose)
        self.log_episode_score(ms.METRIC_SUCCESS, success)
        self.log_episode_score(ms.METRIC_REQUEST_COUNT, sum(reqs))
        self.log_episode_score(ms.METRIC_REQUEST_COUNT_PARSED, sum(p_reqs))
        self.log_episode_score(ms.METRIC_REQUEST_COUNT_VIOLATED, sum(v_reqs))
        self.log_episode_score(ms.METRIC_REQUEST_SUCCESS, sum(p_reqs) / sum(reqs))
        self.log_episode_score(ms.BENCH_SCORE, bench_score)

    def log_eval_assets(self) -> None:
        """Aux to log variables needed for scoring (firstlast specific)"""
        self.log_key('Played turns', self.current_turn)
        self.log_key('Complete turns', self.complete_turns)
        self.log_key(ms.METRIC_ABORTED, self.aborted)
        self.log_key(ms.METRIC_LOSE, self.lose)
        self.log_key(ms.METRIC_REQUEST_COUNT, self.request_counts)
        self.log_key(ms.METRIC_REQUEST_COUNT_PARSED, self.parsed_request_counts)
        self.log_key(ms.METRIC_REQUEST_COUNT_VIOLATED, self.violated_request_counts)


### Defining the GameBenchmark

Finally, we need to "inform" the framework about our game. This can be done in the same file where the GameMaster lives. We need to create a child of ```GameBenchmark``` which defines whether the game is single-player or not. We also add a short description of the game and the method that creates the FirstLast game master.

In single player games, either the other player is a program or the player interacts only with the game master.

In [None]:
# put this at the end of games/firstlast/master.py

# always add the GameBenchmark child with this structure
class FirstLastGameBenchmark(GameBenchmark):
    """Integrate the game into the benchmark run."""
    def __init__(self):
        super().__init__(GAME_NAME)

    # defines whether the game is single player or not
    def is_single_player(self):
        return False

    # add a description of your game
    def get_description(self):
        return "A simple game in which utterances must follow alphabetical rules."

    # copy this, replacing the name of the game master in the return statement
    def create_game_master(self,
                           experiment: Dict,
                           player_backends: List[str]
                           ) -> GameMaster:
        return FirstLast(experiment, player_backends)


## Running your game

After you follow the initial steps in this notebook to create the game directory and then create the three Python files (instancegenerator, players and master), you can test the results by running:

```bash
python3 games/firstlast/instancegenerator.py 
python3 scripts/cli.py -m mock--mock run firstlast
python3 scripts/cli.py transcribe firstlast
python3 scripts/cli.py score firstlast
```