# Running experiments with custom evaluations

This notebook illustrates how to run custom scorers that automatically evaluate responses when sending prompts to an API. We use the Anthropic API to query a model and evaluate the results with a custom evaluation function, but feel free to adapt the provided input experiment file to use another API.

In the [evaluation docs](https://alan-turing-institute.github.io/prompto/docs/evaluation/#automatic-evaluation-using-a-scoring-function), we provide an explanation of scoring functions and how they can be applied to evaluate responses from models. In this notebook, we will show how to use a custom scorer to evaluate responses from a model in Python.

In [1]:
from prompto.settings import Settings
from prompto.experiment import Experiment
from dotenv import load_dotenv
import os

## Environment setup

In this experiment, we will use the Anthropic API, but feel free to edit the input file provided to use a different API and model.

When using `prompto` to query models from the Anthropic API, lines in our experiment `.jsonl` files must have `"api": "anthropic"` in the prompt dict. 
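As a rough illustration, input lines of this shape could be generated with the standard `json` module. The keys mirror those visible in the completed responses later in this notebook. The file name `input-evaluation-example.jsonl` and the helper `write_jsonl` are hypothetical names chosen for this sketch, not part of `prompto`.

```python
import json
import os
import tempfile

# Prompt dicts modelled on the ones that appear in this notebook's output.
prompt_dicts = [
    {
        "id": 0,
        "api": "anthropic",
        "model_name": "claude-3-haiku-20240307",
        "prompt": "How does technology impact us?",
        "parameters": {"temperature": 1, "max_tokens": 100},
    },
    {
        "id": 1,
        "api": "anthropic",
        "model_name": "claude-3-5-sonnet-20240620",
        "prompt": "How does technology impact us? Keep the response to less than 10 words.",
        "parameters": {"temperature": 1, "max_tokens": 100},
    },
]


def write_jsonl(path, rows):
    """Write one JSON object per line (hypothetical helper for this sketch)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


# A temporary directory keeps the sketch free of side effects on your data folder.
example_path = os.path.join(tempfile.mkdtemp(), "input-evaluation-example.jsonl")
write_jsonl(example_path, prompt_dicts)
```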

For the [Anthropic API](https://alan-turing-institute.github.io/prompto/docs/anthropic/), there is one environment variable that can be set:
- `ANTHROPIC_API_KEY`: the API key for the Anthropic API

As mentioned in the [environment variables docs](https://alan-turing-institute.github.io/prompto/docs/environment_variables/#model-specific-environment-variables), model-specific environment variables can also be used. In particular, when you specify a `model_name` key in a prompt dict, you can also set an `ANTHROPIC_API_KEY_model_name` environment variable to indicate the API key used for that particular model (where "model_name" is replaced with the value of the `model_name` key). We will see a concrete example of this later.

To set environment variables, you can list them as key-value pairs in a `.env` file:
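The lookup order described here can be pictured with a small sketch. This is not `prompto`'s implementation: `resolve_api_key` is a hypothetical helper, and it assumes the model name is appended to the variable name verbatim. The environment variables docs linked above remain the reference for the exact naming rule.

```python
import os


def resolve_api_key(model_name=None, env=os.environ):
    """Hypothetical helper: prefer a model-specific key, else the general one.

    Assumes the variable is named ANTHROPIC_API_KEY_<model_name> with the
    model name used as-is; prompto's actual naming rule may differ.
    """
    if model_name is not None:
        specific = env.get(f"ANTHROPIC_API_KEY_{model_name}")
        if specific is not None:
            return specific
    # Fall back to the general key (None if it is not set either).
    return env.get("ANTHROPIC_API_KEY")


# Example with a plain dict standing in for the environment (made-up values).
fake_env = {
    "ANTHROPIC_API_KEY": "general-key",
    "ANTHROPIC_API_KEY_my_model": "model-specific-key",
}
key_for_my_model = resolve_api_key("my_model", env=fake_env)
key_for_other_model = resolve_api_key("other_model", env=fake_env)
```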
```
ANTHROPIC_API_KEY=<YOUR-ANTHROPIC-KEY>
```

If you create this file, you can run the following cell, which should return `True` if the file is found and `False` otherwise:

In [2]:
load_dotenv(dotenv_path=".env")

True

Next, we read the value and raise an error if the `ANTHROPIC_API_KEY` environment variable hasn't been set:

In [3]:
ANTHROPIC_API_KEY = os.environ.get("ANTHROPIC_API_KEY")
if ANTHROPIC_API_KEY is None:
    raise ValueError("ANTHROPIC_API_KEY is not set")

If you get any errors or warnings in the above two cells, update your `.env` file to follow the example above so that the variable is set.

## Writing a custom evaluation function

The only rule when writing custom evaluations is that the function must take a single argument, the `prompt_dict` containing the response from the API, and return the same dictionary with any additional keys that you want to add.

The following example is not a particularly useful evaluation in most cases: it simply performs a rough word count of the response by splitting on spaces. In a real-world scenario, you might want to compare the response to some reference text (which could be provided in the prompt dictionary as an "expected_response" key) or use a more sophisticated evaluation, e.g. a regex-based check.

In [4]:
def count_words_in_response(response_dict: dict) -> dict:
    """
    This function is an example of an evaluation function that can be used to evaluate the response of an experiment.
    It counts the number of words in the response and adds it to the response_dict. It also adds a boolean value to
    the response_dict that is True if the response has more than 10 words and False otherwise.
    """
    # Approximate the word count as the number of spaces plus one
    response_dict["word_count"] = response_dict["response"].count(" ") + 1
    response_dict["more_than_10_words"] = response_dict["word_count"] > 10
    return response_dict
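
Along the lines suggested above, a scorer that compares the response against reference text, or applies a regular expression, might look like the following sketch. The function names, the `exact_match` and `contains_number` keys, and the case-insensitive comparison are choices made for this illustration; the "expected_response" key is assumed to have been included in the input prompt dict.

```python
import re


def matches_expected_response(prompt_dict: dict) -> dict:
    """Add "exact_match": True if the response equals the expected response.

    The comparison ignores case and surrounding whitespace. The value is None
    when either text is missing, so unscored rows remain distinguishable.
    """
    expected = prompt_dict.get("expected_response")
    response = prompt_dict.get("response")
    if isinstance(expected, str) and isinstance(response, str):
        prompt_dict["exact_match"] = (
            response.strip().lower() == expected.strip().lower()
        )
    else:
        prompt_dict["exact_match"] = None
    return prompt_dict


def contains_number(prompt_dict: dict) -> dict:
    """Add "contains_number": True if the response contains at least one digit."""
    response = prompt_dict.get("response")
    prompt_dict["contains_number"] = (
        isinstance(response, str) and re.search(r"\d", response) is not None
    )
    return prompt_dict
```

Both functions follow the same contract as `count_words_in_response`: one dictionary in, the same dictionary out with extra keys.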

Now we run the experiment in the same way as normal, but pass the evaluation function to the `process` method of the `Experiment` object.

Note that more than one function can be passed, and they will be executed in the order in which they are passed.
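
To see why the order can matter, here is a minimal plain-Python sketch of applying a list of evaluation functions in sequence. It only illustrates the idea and is not `prompto`'s internal code; `apply_evaluations`, `count_words` and `flag_long_response` are hypothetical names.

```python
def count_words(prompt_dict: dict) -> dict:
    """Add a rough word count of the response."""
    prompt_dict["word_count"] = len(prompt_dict["response"].split())
    return prompt_dict


def flag_long_response(prompt_dict: dict) -> dict:
    """Relies on "word_count", so it must run after count_words."""
    prompt_dict["more_than_10_words"] = prompt_dict["word_count"] > 10
    return prompt_dict


def apply_evaluations(prompt_dict: dict, evaluation_funcs: list) -> dict:
    """Apply each function in list order, feeding the result to the next one."""
    for func in evaluation_funcs:
        prompt_dict = func(prompt_dict)
    return prompt_dict


scored = apply_evaluations(
    {"response": "Technology reshapes daily life."},
    [count_words, flag_long_response],
)
```

Reversing the list in this sketch would raise a `KeyError`, because `flag_long_response` would run before "word_count" exists.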

In [6]:
settings = Settings(data_folder="./data", max_queries=30)
experiment = Experiment(file_name="input-evaluation.jsonl", settings=settings)

In [8]:
responses, avg_query_processing_time = await experiment.process(
    evaluation_funcs=[count_words_in_response]
)

Sending 2 queries at 30 QPM with RI of 2.0s  (attempt 1/3): 100%|██████████| 2/2 [00:04<00:00,  2.00s/query]
Waiting for responses  (attempt 1/3): 100%|██████████| 2/2 [00:01<00:00,  1.09query/s]


In [9]:
experiment.completed_responses

[{'id': 1,
  'api': 'anthropic',
  'model_name': 'claude-3-5-sonnet-20240620',
  'prompt': 'How does technology impact us? Keep the response to less than 10 words.',
  'parameters': {'temperature': 1, 'max_tokens': 100},
  'timestamp_sent': '30-08-2024-08-58-02',
  'response': 'Technology revolutionizes communication, work, and daily life, reshaping human experiences.',
  'word_count': 10,
  'more_than_10_words': False},
 {'id': 0,
  'api': 'anthropic',
  'model_name': 'claude-3-haiku-20240307',
  'prompt': 'How does technology impact us?',
  'parameters': {'temperature': 1, 'max_tokens': 100},
  'timestamp_sent': '30-08-2024-08-58-00',
  'response': 'Technology has had a profound impact on our lives in both positive and negative ways. Here are some of the key ways technology has influenced us:\n\nPositive impacts:\n- Increased connectivity and communication - Technology has made it easier to stay in touch with loved ones, coordinate with colleagues, and access information.\n- Advancem

The keys added by the evaluation function now appear in each completed response.
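
Once the responses carry these extra keys, they can be summarised with ordinary Python. The sketch below assumes the `word_count` and `more_than_10_words` keys written by `count_words_in_response` as defined above; `summarise_word_counts` is a hypothetical helper and the sample rows are made up for illustration.

```python
def summarise_word_counts(responses: list) -> dict:
    """Summarise the word-count keys across a list of scored responses."""
    counts = [r["word_count"] for r in responses if "word_count" in r]
    return {
        "n_scored": len(counts),
        "n_more_than_10": sum(1 for r in responses if r.get("more_than_10_words")),
        "mean_word_count": sum(counts) / len(counts) if counts else None,
    }


# Made-up rows standing in for experiment.completed_responses.
sample_responses = [
    {"id": 1, "word_count": 10, "more_than_10_words": False},
    {"id": 0, "word_count": 42, "more_than_10_words": True},
]
summary = summarise_word_counts(sample_responses)
```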

## Running a scorer automatically from the command line

In the [evaluation docs](https://alan-turing-institute.github.io/prompto/docs/evaluation/#running-a-scorer-evaluation-automatically-using-prompto_run_experiment), we discuss how you can use the `prompto_run_experiment` command line tool to run experiments and automatically evaluate responses using a scorer.

In this case, we would need to define the above function in a Python file and add it to the `SCORING_FUNCTIONS` dictionary in the [src/prompto/scorer.py](https://github.com/alan-turing-institute/prompto/blob/main/src/prompto/scorer.py) file. We could add the following key and value to the dictionary:
    
```python
"count_words_in_response": count_words_in_response
```

Then, we could run the following command to run the experiment and evaluate the responses using the custom scorer:
```bash
prompto_run_experiment --file <path-to-experiment-file> --scorer count_words_in_response
```
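
The general pattern behind a name-to-function dictionary can be sketched as follows. This is only an illustration of looking scorers up by name: `EXAMPLE_SCORING_FUNCTIONS` and `get_scorers` are hypothetical, and how `prompto` itself resolves scorer names is described in the evaluation docs linked above.

```python
def count_words_in_response(response_dict: dict) -> dict:
    """Same rough word count as earlier in this notebook."""
    response_dict["word_count"] = response_dict["response"].count(" ") + 1
    response_dict["more_than_10_words"] = response_dict["word_count"] > 10
    return response_dict


# Hypothetical registry mapping scorer names to functions.
EXAMPLE_SCORING_FUNCTIONS = {
    "count_words_in_response": count_words_in_response,
}


def get_scorers(names: list, registry: dict) -> list:
    """Return the functions registered under the given names, in order."""
    missing = [name for name in names if name not in registry]
    if missing:
        raise KeyError(f"Unknown scorer(s): {missing}")
    return [registry[name] for name in names]


scorers = get_scorers(["count_words_in_response"], EXAMPLE_SCORING_FUNCTIONS)
```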