# Demo for launching a simple experiment in evalap

*Objective*: **Compare some baseline models on a few metrics**

- code: https://github.com/etalab-ia/evalap
- api documentation: https://evalap.etalab.gouv.fr/redoc

In [9]:
import os
import sys
import time

import dotenv
from IPython.display import HTML
import numpy as np
import pandas as pd
import requests

dotenv.load_dotenv("../.env")
sys.path.append("..")
from evalap.utils import log_and_raise_for_status

#EVALAP_API_URL = "http://localhost:8000/v1"
EVALAP_API_URL = "https://evalap.etalab.gouv.fr/v1"
EVALAP_API_KEY = os.getenv("EVALAP_API_KEY") 
ALBERT_API_URL = "https://albert.api.etalab.gouv.fr/v1"
ALBERT_API_KEY = os.getenv("ALBERT_API_KEY")
MFS_API_URL = "https://franceservices.etalab.gouv.fr/api/v1"
MFS_API_KEY = os.getenv("MFS_API_KEY")
headers = {"Authorization": f"Bearer {EVALAP_API_KEY}"}

## Load a dataset on which your evaluation is grounded

evalap currently uses only 3 fields (more soon):
- query: the input question sent to the LLM.
- output (optional): an answer to the question, pre-generated by the user.
- output_true (optional): the ground-truth answer.

-> Each metric declares a set of required inputs in its specification. The metrics used in an experiment therefore constrain which fields are expected in the dataset you use.
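The constraint above can be checked locally before launching anything. The sketch below is only an illustration: `missing_fields` and `METRIC_REQUIREMENTS` are hypothetical names, and the two requirement sets are copied by hand from the `require` field of the metrics listing rather than fetched from the API. It also assumes that a model attached to the experiment supplies the `output` field.

```python
# Required fields per metric, copied by hand for illustration.
# In practice they would come from the `require` field of the /metrics route.
METRIC_REQUIREMENTS = {
    "answer_relevancy": {"output", "query"},
    "contextual_precision": {"output", "output_true", "query", "retrieval_context"},
}


def missing_fields(columns, metrics, model_generates_output=True):
    """Return {metric: sorted missing fields} for metrics the dataset cannot feed."""
    available = set(columns)
    if model_generates_output:
        # Assumption: a model attached to the experiment produces "output".
        available.add("output")
    report = {}
    for metric in metrics:
        missing = METRIC_REQUIREMENTS[metric] - available
        if missing:
            report[metric] = sorted(missing)
    return report


columns = ["query", "output_true", "top_valid", "operateur", "thematique"]
report = missing_fields(columns, ["answer_relevancy", "contextual_precision"])
print(report)  # {'contextual_precision': ['retrieval_context']}
```

An empty report means every selected metric can find its inputs in the dataset columns.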

In [2]:
mfs_dataset = pd.read_csv("_data/evaluation_set_qo_qcm.csv", delimiter=";")
mfs_dataset.rename(columns={'true_answer': 'output_true'}, inplace=True) # to be compliant with evalap
mfs_dataset.head()

Unnamed: 0,query,output_true,top_valid,operateur,thematique
0,Comment contester un PV reçu depuis l’Italie ?,"Pour contester un PV reçu depuis l'Italie, il ...",1,ANTS,Amendes
1,Quelles aides judiciaires trouver en France po...,Vous pouvez bénéficier de l'aide juridictionne...,0,ANTS,Amendes
2,Une mairie peut-elle refuser un formulaire cer...,"Non, une mairie ne peut pas refuser un formula...",1,ANTS,CNI/passeport
3,Comment renouveler une Carte Nationale d’Ident...,Pour renouveler une Carte Nationale d'Identité...,1,ANTS,CNI/passeport
4,Quel formulaire cerfa utiliser pour renouveler...,"Si vous ne faites pas la pré-demande, vous dev...",1,ANTS,CNI/passeport


## Publish the dataset on evalap

If the dataset already exists, you'll get a DuplicateEntry error. That is expected.

In [8]:
# Publish a dataset
dataset = {
    "name": "MFS_questions_v01",
    "readme": "MFS dataset with ground truth",
    "default_metric": "judge_notator",
    "df": mfs_dataset.to_json(),
}
response = requests.post(f'{EVALAP_API_URL}/dataset', json=dataset, headers=headers)
resp = response.json()
resp

{'name': 'MFS_questions_v01',
 'readme': 'MFS dataset with ground truth',
 'default_metric': 'judge_notator',
 'columns_map': None,
 'id': 1,
 'created_at': '2025-08-07T21:52:28.949956',
 'size': 39,
 'columns': ['query', 'output_true', 'top_valid', 'operateur', 'thematique'],
 'parquet_size': 0,
 'parquet_columns': []}

## List all available metrics and their information

- the `require` field indicates which fields are required in the dataset for the metric to operate.
- the `type` field is ignored for now. It will later be associated with the type of the observation you get in the results output.

In [3]:
# Show available metrics
# - each field listed in `require` must exist in the dataset used with the metric.
# - the "output" field can be generated from the query by a model (see below)
response = requests.get(f'{EVALAP_API_URL}/metrics', headers=headers)
all_metrics = response.json()
df = pd.DataFrame(all_metrics).sort_values(by=["type", "name"])
HTML(df.to_html(index=False))

name,description,type,require
judge_complexity,"[0-10] score complexity of query, thematic...",dataset,[query]
answer_relevancy,see https://docs.confident-ai.com/docs/metrics-introduction,deepeval,"[output, query]"
bias,see https://docs.confident-ai.com/docs/metrics-introduction,deepeval,"[output, query]"
contextual_precision,see https://docs.confident-ai.com/docs/metrics-introduction,deepeval,"[output, output_true, query, retrieval_context]"
contextual_recall,see https://docs.confident-ai.com/docs/metrics-introduction,deepeval,"[output, output_true, query, retrieval_context]"
contextual_relevancy,see https://docs.confident-ai.com/docs/metrics-introduction,deepeval,"[output, query, retrieval_context]"
faithfulness,see https://docs.confident-ai.com/docs/metrics-introduction,deepeval,"[output, query, retrieval_context]"
hallucination,see https://docs.confident-ai.com/docs/metrics-introduction,deepeval,"[context, output, query]"
ragas,see https://docs.confident-ai.com/docs/metrics-introduction,deepeval,"[output_true, query, retrieval_context]"
toxicity,see https://docs.confident-ai.com/docs/metrics-introduction,deepeval,"[output, query]"


## Launching a couple of experiments

Here we launch the experiments independently with the `/experiment` route. Another notebook shows how to launch grouped experiments, grid-search style, with the `/experiment_set` route.

In [16]:
# Launch an experiment with a given **dataset**, **model** and set of **metrics** to compute.
# - the model generates the "output" from the "query"
# - you can also pass the "output" column instead of a model if you generated the answers yourself...

# Designing my experiments
dataset = "MFS_questions_v01"
models_to_test = [
    {"name": "meta-llama/Llama-3.1-8B-Instruct",         "base_url": ALBERT_API_URL, "api_key": ALBERT_API_KEY},
    {"name": "meta-llama/Meta-Llama-3.1-8B-Instruct",  "base_url": ALBERT_API_URL, "api_key": ALBERT_API_KEY},
    {"name": "meta-llama/Meta-Llama-3.1-70B-Instruct", "base_url": ALBERT_API_URL, "api_key": ALBERT_API_KEY},
    {
      "_name": "mfs-rag-baseline",
      "name": "AgentPublic/llama3-instruct-8b", "extra_params": {"rag": {"mode":"rag", "limit":7}}, 
      "base_url": MFS_API_URL, "api_key": MFS_API_KEY, 
    },    
]
sampling_params = {"temperature": 0.2} # use the same sampling params for all models in this evaluation
metrics = ["output_length", "answer_relevancy"]

# Launching the experiments
experiment_ids = []
for i, model in enumerate(models_to_test):
    name = model.get("_name") or model["name"]
    model = model.copy()  # work on a copy so that models_to_test is left untouched
    model.pop("_name", None)  # "_name" is only used to label the experiment, it is not sent to the API
    model["sampling_params"] = sampling_params
    experiment = {
        "name" : f"MFS_questions>{name}_{i}_v0", 
        "dataset": dataset,
        "model": model,
        "metrics": metrics,
    }
    response = requests.post(f'{EVALAP_API_URL}/experiment', json=experiment, headers=headers)
    resp = response.json()
    if "id" in resp:
        experiment_ids.append(resp["id"])
        print(f'Created experiment: {resp["name"]} ({resp["id"]}), status: {resp["experiment_status"]}')
    else:
        print(resp)

Created experiment: MFS_questions>AgentPublic/llama3-instruct-8b_0_v0 (25), status: running_answers
Created experiment: MFS_questions>mfs-rag-baseline_1_v0 (26), status: running_answers


In [141]:
# Add or recompute a metric on existing experiment(s)
# - Useful to update one or more metrics without relaunching the answer generation
# -> In this example we add the generation_time metric to the list of computed metrics (generation time is recorded at inference time)
for exp_id in experiment_ids:
    experiment = {
        "metrics": ["generation_time"],
        "rerun_answers": False,
    }
    response = requests.patch(f'{EVALAP_API_URL}/experiment/{exp_id}', json=experiment, headers=headers)
    resp = response.json()
    if "id" in resp:
        print(f'Updated experiment: {resp["name"]} ({resp["id"]}), status: {resp["experiment_status"]}')
    else:
        print(resp)

Updated experiment: MFS_questions>AgentPublic/llama3-instruct-8b_0_v0 (5), status: finished
Updated experiment: MFS_questions>meta-llama/Meta-Llama-3.1-8B-Instruct_1_v0 (6), status: finished
Updated experiment: MFS_questions>meta-llama/Meta-Llama-3.1-70B-Instruct_2_v0 (7), status: finished
Updated experiment: MFS_questions>mfs-rag-baseline_3_v0 (8), status: finished


## Reading and showing results

-> The table in the output shows the mean and std of the score, for each metric, across the dataset questions. It therefore shows the variability over the dataset questions for one run, not the "natural" variability of the model (random variation across multiple generations).
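To make the distinction concrete, here is a small sketch on synthetic scores. The numbers are made up and do not come from evalap. It assumes 5 repeated runs over 39 questions, where each question has its own difficulty and reruns only add a little noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic scores, for illustration only: 5 repeated runs x 39 questions.
n_runs, n_questions = 5, 39
difficulty = rng.uniform(0.2, 1.0, size=n_questions)  # per-question baseline score
scores = difficulty + rng.normal(0.0, 0.02, size=(n_runs, n_questions))

# What the results table reports: spread across questions within one run.
std_across_questions = scores[0].std()

# Run-to-run variability: spread of the per-run mean score across repeated runs.
std_across_runs = scores.mean(axis=1).std()

print(f"std across questions (one run): {std_across_questions:.3f}")
print(f"std of the mean across runs:    {std_across_runs:.3f}")
```

The first value is dominated by how much questions differ from one another. Estimating the second one requires running the same experiment several times. Note that `np.std` uses `ddof=0` (population standard deviation) by default.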

In [128]:
# Read results
# --
df_all = [] # list of results per experiment/model
supports = [] # Store the support for debugging: if some answer/observation failed, the support can be less than the dataset size
for model, exp_id in zip(models_to_test, experiment_ids):
    # Get an experiment result
    response = requests.get(f'{EVALAP_API_URL}/experiment/{exp_id}?with_results=true', headers=headers)
    experiment = response.json()
    # experiment["name"] # Name of the experiment
    if experiment["experiment_status"] != "finished":
        print(f"Warning: experiment {exp_id} is not finished yet...")
    results = experiment["results"]
    
    # Build a result dataframe  from the observation_table (mean, std etc)
    df_metrics = {}
    metric_support = []
    supports.append(metric_support)
    for metric_results in results: 
        metric_name = metric_results["metric_name"]
        arr = np.array([x["score"] for x in metric_results["observation_table"] if pd.notna(x["score"])])
        df = pd.DataFrame([[
                np.mean(arr), # mean
                np.std(arr), # std
                np.median(arr), # median
                f"{arr.mean():.2f} ± {arr.std():.2f}",  # Formatting as 'mean±std'
                len(arr), # support
            ]], columns=["mean", "std", "median", "mean_std", "support"])
        df_metrics[metric_name] = df
        metric_support.append(len(arr))
    
    # Stack the mean_std final measure
    name = model["_name"] if model.get("_name") else model["name"]
    df = pd.DataFrame({metric_name:df["mean_std"].iloc[0] for metric_name, df in sorted(df_metrics.items())}, index=[name])
    df_all.append(df)

final_df = pd.concat(df_all)
#final_df["support"] = supports # for debugging
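# Reorder columns (metrics that were not computed are skipped to avoid a KeyError)
column_order = ['answer_relevancy', 'judge_completude', 'judge_exactness', 'judge_notator', 'output_length', 'generation_time', 'toxicity', 'bias']
final_df = final_df[[c for c in column_order if c in final_df.columns]]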
final_df

Unnamed: 0,answer_relevancy,judge_completude,judge_exactness,judge_notator,output_length,generation_time,toxicity,bias
AgentPublic/llama3-instruct-8b,0.90 ± 0.23,26.41 ± 22.27,0.03 ± 0.16,3.41 ± 1.97,308.33 ± 93.23,6.64 ± 1.99,0.00 ± 0.00,0.00 ± 0.00
meta-llama/Meta-Llama-3.1-8B-Instruct,0.95 ± 0.15,30.00 ± 20.16,0.05 ± 0.22,3.97 ± 2.27,296.13 ± 113.57,3.82 ± 1.43,0.00 ± 0.00,0.00 ± 0.00
meta-llama/Meta-Llama-3.1-70B-Instruct,0.92 ± 0.22,35.00 ± 24.73,0.05 ± 0.22,4.85 ± 2.63,261.89 ± 109.02,9.42 ± 4.01,0.00 ± 0.00,0.00 ± 0.00
mfs-rag-baseline,0.90 ± 0.18,42.05 ± 26.91,0.21 ± 0.40,5.36 ± 2.59,119.41 ± 66.23,4.31 ± 1.80,0.00 ± 0.00,0.00 ± 0.00


## What is inside an experiment result?

In [125]:
# See what's inside an experiment result
# --
metric_index = 2
len(results) # number of metrics
list(results[metric_index]) # the keys of the dict representing the result object
len(results[metric_index]["observation_table"]) # number of observations -> one per dataset line.
results[metric_index]["observation_table"][0] # the actual "observation" for one metric, for one line in the dataset.

{'id': 1276,
 'created_at': '2024-11-09T12:57:32.560445',
 'score': 1.0,
 'observation': 'The score is 1.00 because the response perfectly addresses the question without any irrelevant information. Great job!',
 'num_line': 21,
 'error_msg': None,
 'execution_time': 4}
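Since every observation is a flat dict, an `observation_table` loads directly into a DataFrame, which makes it easy to spot failed or low-scoring lines. The sketch below uses hand-written observations that mimic the structure shown above. Apart from the keys, the values (including the error message) are invented, and it assumes that a failed observation has a null `score`, as the filtering in the results cell suggests.

```python
import pandas as pd

# Hand-written observations with the same keys as the example above (values are made up).
observation_table = [
    {"id": 1276, "created_at": "2024-11-09T12:57:32.560445", "score": 1.0,
     "observation": "The score is 1.00 because ...", "num_line": 21,
     "error_msg": None, "execution_time": 4},
    {"id": 1277, "created_at": "2024-11-09T12:57:35.000000", "score": 0.5,
     "observation": "The score is 0.50 because ...", "num_line": 3,
     "error_msg": None, "execution_time": 5},
    {"id": 1278, "created_at": "2024-11-09T12:57:40.000000", "score": None,
     "observation": None, "num_line": 7,
     "error_msg": "hypothetical error message", "execution_time": None},
]

obs = pd.DataFrame(observation_table)

# Observations without a score (assumed here to be the failed ones).
failed = obs[obs["score"].isna()]

# The two lowest-scoring observations among those that succeeded.
lowest = obs.dropna(subset=["score"]).nsmallest(2, "score")

print(failed[["num_line", "error_msg"]])
print(lowest[["num_line", "score", "observation"]])
```

If `num_line` refers to the row position in the dataset, which is an assumption here, it can be used to look up the corresponding question.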