# Demo for launching an experiment set  

*Objective*: **Comparing the impact of the `limit` parameter on a RAG model**

An experiment set is a collection of experiments that are part of the same evaluation scenario. 
In this notebook, we're comparing how the (maximum) number of chunks influences the model's performance.

To conduct these experiments, one approach is to create an empty experiment set (via POST /experiment_set) and then add a list of experiments to it (via POST /experiment, with a reference to the experimentset_id). All experiments should share the same parameters, except for the (maximum) number of chunks.
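The manual approach can be sketched as follows. This is only an illustration under assumptions: `launch_manually` and the `post` callable are hypothetical helpers, and the payload fields (`name`, `experimentset_id`, `extra_params`) mirror the names used in this notebook rather than a verified API schema, so they should be checked against the EvalAP documentation before use.

```python
import copy
from typing import Callable

# Hypothetical transport: any callable taking (path, payload) and returning decoded JSON.
# In this notebook it could wrap requests.post(f"{EVALAP_API_URL}{path}", json=payload, headers=headers).json()
PostFn = Callable[[str, dict], dict]


def launch_manually(post: PostFn, name: str, readme: str, base_experiment: dict, limits: list) -> list:
    """Create an empty experiment set, then attach one experiment per `limit` value."""
    expset = post("/experiment_set", {"name": name, "readme": readme})
    created = []
    for limit in limits:
        # Copy the shared parameters so each experiment differs only by `limit`
        experiment = copy.deepcopy(base_experiment)
        experiment["name"] = f"{name}__limit{limit}"  # assumed naming convention
        experiment["experimentset_id"] = expset["id"]  # field name as written in the text; verify it
        experiment["model"]["extra_params"] = {"rag": {"mode": "rag", "limit": limit}}
        created.append(post("/experiment", experiment))
    return created
```

Injecting `post` keeps the sketch independent of the network, which makes it easy to dry-run the payloads before sending anything.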

Alternatively, the /experiment_set endpoint offers a convenient feature called cv (short for cross-validation). This feature includes two key parameters:

- **common_params**: These are the parameters that will be shared across all experiments in the set.
- **grid_params**: This allows you to specify a list of varying values for any parameter.


Both **common_params** and **grid_params** accept all the parameters defined by the ExperimentSetCreate schema.  
The experiments will be generated by combining the **common_params** with each unique set of values from the cartesian product of the lists provided in **grid_params**.
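To make the combination rule concrete, here is a minimal local sketch of how such an expansion could work. It is not the server's implementation: `merge` and `expand_grid` are hypothetical helpers, and the recursive merge of dictionary values (needed when a key such as `model` appears in both **common_params** and **grid_params**) is an assumption inferred from how this notebook uses the feature.

```python
import copy
import itertools


def merge(base, override):
    """Recursively merge `override` into a copy of `base` (dicts merged key by key, other values replaced)."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def expand_grid(common_params, grid_params):
    """Return one parameter dict per element of the Cartesian product of the grid lists."""
    keys = list(grid_params)
    experiments = []
    for combo in itertools.product(*(grid_params[k] for k in keys)):
        experiments.append(merge(common_params, dict(zip(keys, combo))))
    return experiments


# Illustrative values only: two `limit` values crossed with two judge models
common = {"dataset": "MFS_questions_v01", "model": {"name": "some-model"}}
grid = {
    "model": [{"extra_params": {"rag": {"limit": i}}} for i in [1, 5]],
    "judge_model": ["judge-a", "judge-b"],
}
experiments = expand_grid(common, grid)
print(len(experiments))  # 2 x 2 = 4 combinations
```

With a single list in the grid, as in the experiment set designed below, the product reduces to one experiment per listed value.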

In [1]:
import os
import sys
import time

import dotenv
from IPython.display import HTML
import numpy as np
import pandas as pd
import requests

dotenv.load_dotenv("../.env")
sys.path.append("..")

#EVALAP_API_URL = "http://localhost:8000/v1"
EVALAP_API_URL = "https://evalap.etalab.gouv.fr/v1"
EVALAP_API_KEY = os.getenv("EVALAP_API_KEY") 
ALBERT_API_URL = "https://albert.api.etalab.gouv.fr/v1"
ALBERT_API_KEY = os.getenv("ALBERT_API_KEY")
MFS_API_URL = "https://franceservices.etalab.gouv.fr/api/v1"
MFS_API_KEY = os.getenv("MFS_API_KEY")
headers = {"Authorization": f"Bearer {EVALAP_API_KEY}"}

In [2]:
# Various utility functions
# --
def format_metrics(row):
    # Format a row indexed by (metric, stat) pairs as a series of "mean ± std" strings.
    # The metric names are read from the row itself rather than from a global DataFrame.
    metrics = {}
    for metric in row.index.get_level_values(0).unique():
        mean_value = row[(metric, 'mean')]
        std_value = row[(metric, 'std')]
        metrics[metric] = f"{mean_value:.2f} ± {std_value:.2f}"
    return pd.Series(metrics)
    
def highlight_cells(s):
    # Highlight the entries with the highest (green) and lowest (salmon) mean value in a column
    means = s.apply(lambda x: float(str(x).split('±')[0].strip()))
    # Build a mask: 1 marks the max, 0 marks the min, NaN everywhere else
    max_mean_index = means.idxmax()
    min_mean_index = means.idxmin()
    mask = pd.Series({max_mean_index: 1, min_mean_index: 0}, index=s.index)
    return [
        'font-weight: bold; color: salmon' if mask_value == 0 else
        'font-weight: bold; color: green' if mask_value == 1 else
        ''
        for mask_value in mask
    ]

## Designing and running an experiment set


In [14]:
# Designing my experiments
# --
expset_name = "mfs_rag_limit_v1"
expset_readme = "Comparing the impact of the `limit` parameters on a RAG model."
metrics = ["answer_relevancy", "judge_exactness", "judge_notator", "output_length", "generation_time"]
common_params = {
    "dataset" : "MFS_questions_v01",
    "model" : {"name": "meta-llama/Llama-3.1-8B-Instruct", "sampling_params": {"temperature": 0.2}, "base_url": MFS_API_URL, "api_key": MFS_API_KEY},
    "metrics" : metrics,
    "judge_model": "gpt-4o-mini", # the default is not given
}
grid_params = {
    "model": [{"extra_params": {"rag": {"mode":"rag", "limit":i}}} for i in [1, 2, 3, 4, 5, 7, 10, 15, 20]],
}

# Launching the experiment set
expset = {
    "name" : expset_name, 
    "readme": expset_readme,
    "cv": {"common_params": common_params, "grid_params": grid_params}
}
response = requests.post(f'{EVALAP_API_URL}/experiment_set', json=expset, headers=headers)
resp = response.json()
if "id" in resp:
    expset_id = resp["id"]
    print(f'Created expset: {resp["name"]} ({resp["id"]})')
else:
    print(resp)

Created expset: mfs_rag_limit_v1 (7)
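The experiments run asynchronously, so results read immediately after creation may be incomplete. Below is a hedged sketch of a polling loop that waits until every experiment reports the `finished` status checked later in this notebook. `wait_until_finished` and `fetch_status` are hypothetical names, and any status value other than `"finished"` is treated as "still pending".

```python
import time


def wait_until_finished(fetch_status, experiment_ids, interval=10.0, timeout=1800.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll until all experiments are finished or the timeout expires.

    `fetch_status` is any callable mapping an experiment id to its status string.
    Returns the set of ids that were still unfinished when polling stopped.
    """
    pending = set(experiment_ids)
    deadline = clock() + timeout
    while pending:
        # Keep only the experiments that have not reached the "finished" status yet
        pending = {exp_id for exp_id in pending if fetch_status(exp_id) != "finished"}
        if not pending or clock() >= deadline:
            break
        sleep(interval)
    return pending
```

In this notebook, `fetch_status` could wrap the `GET /experiment/{exp_id}` call used in the next section and return its `experiment_status` field. The timeout guards against experiments that never reach `finished`.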


## Reading and showing results

In [15]:
# Read results
# --
df_all = None # multi-dimensional DataFrame
arr_all = {} # raw score arrays, keyed by model name, then by metric name

# Fetch results and compute macro metrics (mean, std etc).
# --
response = requests.get(f'{EVALAP_API_URL}/experiment_set/{expset_id}', headers=headers)
expset = response.json()
rows = []
for i, exp in enumerate(expset["experiments"]):
    # Get an experiment result
    exp_id = exp["id"]
    response = requests.get(f'{EVALAP_API_URL}/experiment/{exp_id}?with_results=true', headers=headers)
    experiment = response.json()
    # experiment["name"] # Name of the experiment
    if experiment["experiment_status"] != "finished":
        print(f"Warning: experiment {exp_id} is not finished yet...")
    results = experiment["results"]
    model = experiment["model"]["name"] + "_limit" + str(experiment["model"]["extra_params"]["rag"]["limit"])
    
    # Add an observation row from the observation_table (mean, std etc)
    row = {"model": model}
    rows.append(row)
    metric_arrs = {}
    arr_all[model] = metric_arrs
    for metric_results in results: 
        metric = metric_results["metric_name"]
        arr = np.array([x["score"] for x in metric_results["observation_table"] if pd.notna(x["score"])])
        row[(metric, 'mean')] = np.mean(arr)
        row[(metric, 'std')] = np.std(arr)
        row[(metric, 'median')] = np.median(arr)
        row[(metric, 'mean_std')] = f"{arr.mean():.2f} ± {arr.std():.2f}"  # Formatting as 'mean±std'
        row[(metric, 'support')] = len(arr)
        metric_arrs[metric] = arr
    
df_all = pd.DataFrame(rows)
df_all.set_index('model', inplace=True)
df_all.columns = pd.MultiIndex.from_tuples(df_all.columns)
final_df = df_all.xs('mean', axis=1, level=1) # pick the "macro" metric to show (mean, std, support etc)

final_df = final_df.sort_values(by='judge_exactness', ascending=False)
final_df = final_df[metrics] # reorder columns
final_df = final_df.style.apply(highlight_cells, axis=0)
final_df

| model | answer_relevancy | judge_exactness | judge_notator | output_length | generation_time |
|---|---|---|---|---|---|
| AgentPublic/llama3-instruct-8b_limit20 | 0.877041 | 0.589744 | 6.307692 | 138.230769 | 5.769231 |
| AgentPublic/llama3-instruct-8b_limit7 | 0.880916 | 0.564103 | 6.487179 | 121.538462 | 4.74359 |
| AgentPublic/llama3-instruct-8b_limit10 | 0.848411 | 0.564103 | 6.487179 | 131.692308 | 4.641026 |
| AgentPublic/llama3-instruct-8b_limit15 | 0.853233 | 0.564103 | 6.641026 | 127.871795 | 4.897436 |
| AgentPublic/llama3-instruct-8b_limit5 | 0.786707 | 0.487179 | 6.230769 | 112.948718 | 4.230769 |
| AgentPublic/llama3-instruct-8b_limit4 | 0.842524 | 0.435897 | 5.923077 | 111.666667 | 4.102564 |
| AgentPublic/llama3-instruct-8b_limit2 | 0.785439 | 0.410256 | 5.717949 | 102.076923 | 3.846154 |
| AgentPublic/llama3-instruct-8b_limit3 | 0.8552 | 0.358974 | 5.717949 | 107.307692 | 4.205128 |
| AgentPublic/llama3-instruct-8b_limit1 | 0.775834 | 0.205128 | 4.25641 | 82.692308 | 3.333333 |
