# Objective and Context  

This notebook aims to **analyze and compare the compliance performance of various language models** across multiple benchmark datasets. Each dataset targets a specific aspect of ethical or responsible AI behavior, such as bias, toxicity, faithfulness, or cultural sensitivity.


### 🔍 Summary Table of LLM Compliance Evaluations

| Dataset Used          | Main Evaluation Objective                          | Key Metrics Evaluated | Notes                                                       |
|-----------------------|----------------------------------------------------|-----------------------|-------------------------------------------------------------|
| **llm-values/CIVICS** | Assess cultural sensitivity and value variation    | `bias`                | Measures cultural coherence and ideological neutrality      |
| **lmsys-toxic-chat**  | Detect and evaluate toxicity in LLM generations    | `toxicity`, `bias`    | Helps ensure moderation and safety of responses             |
| **crows-pairs**       | Analyze implicit linguistic and social biases      | `bias`                | Evaluates robustness of responses against stereotypes       |
| **DECCP**             | Measure censorship of information related to China | `toxicity`, `bias`    | Focused on detecting censorship patterns in Chinese content |

---


### 🧮 Simplified Interpretation of Metrics

| Metric             | What It Measures                                        | How to Interpret Results                                   |
|--------------------|--------------------------------------------------------|------------------------------------------------------------|
| `bias`             | Level of bias or partiality in model responses          | Lower is better; a high score signals social or ethnic bias that needs correction |
| `toxicity`         | Presence of offensive, discriminatory or harmful content | Lower is better; a low score is required for ethical compliance |

---
### ⚙️ Additional Technical Indicators

| Indicator                   | Description                                             |
|-----------------------------|---------------------------------------------------------|
| `generation_time`            | Average time to generate responses                      |
| `nb_tokens_prompt`, `nb_tokens_completion` | Number of tokens in prompts and completions |
| `energy_consumption`         | Energy consumed during inference                         |
| `gwp_consumption`            | Carbon footprint (Global Warming Potential)             |

---


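## Cost warning for compliance assessment campaigns

These datasets contain many questions, and each question is answered by every model and then scored by the judge, so a compliance campaign can become costly. We recommend:
- applying stratified sampling to reduce the number of questions while preserving the distribution of the original data
- setting `repeat` to 3 at most
- using a small ("mini") judge model, such as the `gpt-5-mini` judge configured later in this notebook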
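
To get a feel for how these three levers interact, here is a minimal back-of-the-envelope sketch. It assumes (this is not confirmed by the EvalAP documentation) that a campaign issues one generation per question, model, and repeat, and one judge call per generation and per LLM-judged metric. The function name and the sizes used are hypothetical.

```python
def estimate_campaign_calls(n_questions, n_models, repeat, n_judged_metrics):
    """Rough call count for one campaign.

    ASSUMPTION: one generation per (question, model, repeat) and one judge
    call per generation and per LLM-judged metric.
    """
    generations = n_questions * n_models * repeat
    judge_calls = generations * n_judged_metrics
    return {"generations": generations, "judge_calls": judge_calls}


# Hypothetical sizes: 10,000 questions vs. a 500-question sample,
# 2 models, repeat=3, 2 judged metrics (e.g. toxicity and bias).
full = estimate_campaign_calls(10_000, 2, 3, 2)
sampled = estimate_campaign_calls(500, 2, 3, 2)
print(full)
print(sampled)
```

Under these assumptions the number of calls scales linearly with each factor, so sampling the questions is usually the most effective way to cut cost.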

In [None]:
import os
import sys
import time

from datasets import load_dataset
import dotenv
from IPython.display import HTML
import numpy as np
import pandas as pd
import requests

dotenv.load_dotenv("../.env")
sys.path.append("..")


In [None]:

# EVALAP_API_URL = "http://localhost:8000/v1"  # uncomment to target a local EvalAP instance
EVALAP_API_URL = "https://evalap.etalab.gouv.fr/v1"
EVALAP_API_KEY = os.getenv("EVALAP_API_KEY")

ALBERT_API_URL = "https://albert.api.etalab.gouv.fr/v1"
ALBERT_API_KEY_STAGING = os.getenv("ALBERT_API_KEY_STAGING")
ALBERT_API_KEY = os.getenv("ALBERT_API_KEY")

OPENAI_URL = "https://api.openai.com/v1"
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

headers = {"Authorization": f"Bearer {EVALAP_API_KEY}"}


## Import datasets
(if you are working in a local environment)

Utility functions

In [None]:
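def load_crows_pairs(filepath: str) -> pd.DataFrame:
    """Load the CrowS-Pairs CSV and stack both sentences of each pair into a single `query` column."""
    df_crows_pairs = pd.read_csv(filepath)
    # Each row holds a more-stereotypical and a less-stereotypical sentence
    df_crows_more = df_crows_pairs[["sent_more"]].rename(columns={"sent_more": "query"})
    df_crows_less = df_crows_pairs[["sent_less"]].rename(columns={"sent_less": "query"})
    return pd.concat([df_crows_more, df_crows_less], ignore_index=True)


def load_hf_dataset(hf_path, split, hf_kwargs=None):
    """Load one split of a Hugging Face dataset and return it as a DataFrame."""
    hf_kwargs = hf_kwargs or {}
    try:
        ds = load_dataset(hf_path, split=split, **hf_kwargs)
    except Exception:
        # Retry once, explicitly reusing any copy already in the local cache
        ds = load_dataset(hf_path, split=split, download_mode="reuse_cache_if_exists", **hf_kwargs)
    return ds.to_pandas()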

def post_dataset_to_api(name, readme, df, default_metric, columns_map=None, compliance=True):
    """Publish a DataFrame as a dataset on the EvalAP API and print the outcome."""
    dataset_payload = {
        "name": name,
        "readme": readme,
        "default_metric": default_metric,
        "df": df.to_json(orient="records"),
        "compliance": compliance,
    }
    # Optional column mapping, passed through to the API as given
    if columns_map:
        dataset_payload["columns_map"] = columns_map
    try:
        response = requests.post(f"{EVALAP_API_URL}/dataset", json=dataset_payload, headers=headers)
        response.raise_for_status()
        resp = response.json()
        if "id" in resp:
            print(f"Dataset '{name}' published successfully (ID: {resp['id']})")
        else:
            print(f"Publication error for '{name}': {resp}")
    except requests.RequestException as e:
        print(f"HTTP error while publishing '{name}': {e}")


Load datasets

In [None]:
# Hugging Face datasets
df_civics = load_hf_dataset("llm-values/CIVICS", split="test")
df_toxic_chat = load_hf_dataset("lmsys/toxic-chat", split="train", hf_kwargs={"name": "toxicchat1123"})

# CrowS-Pairs dataset (https://github.com/nyu-mll/crows-pairs/tree/master/data)
df_crows = load_crows_pairs("_data/crows_pairs_anonymized.csv")


In [None]:
print(df_civics.shape)
print(df_toxic_chat.shape)
print(df_crows.shape)

Consider performing stratified sampling before posting your dataset if it is too large.
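
As an illustration, here is a minimal sketch of stratified sampling with pandas. The helper name `stratified_sample`, the toy data, and the `topic` column are hypothetical; in practice you would stratify on whichever categorical column your dataset provides (for instance a topic or language column, if it has one).

```python
import pandas as pd


def stratified_sample(df, by, frac, seed=0):
    """Sample `frac` of the rows inside each group of column `by`,
    so group proportions stay close to those of the full dataset."""
    return (
        df.groupby(by, group_keys=False)
        .sample(frac=frac, random_state=seed)
        .reset_index(drop=True)
    )


# Toy data standing in for a real dataset; column names are illustrative.
toy = pd.DataFrame(
    {
        "query": [f"q{i}" for i in range(100)],
        "topic": ["a"] * 60 + ["b"] * 30 + ["c"] * 10,
    }
)
small = stratified_sample(toy, by="topic", frac=0.2)
print(small["topic"].value_counts().to_dict())
```

Fixing `seed` keeps the sample reproducible, which matters if you want to rerun a campaign on the same subset of questions.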

Publish datasets

In [None]:
post_dataset_to_api(
    name="llm-values-CIVICS",
    readme="'Culturally-Informed & Values-Inclusive Corpus for Societal Impacts' is a dataset designed to evaluate the social and cultural variation of Large Language Models (LLMs) towards socially sensitive topics across multiple languages and cultures.",
    df=df_civics,
    default_metric="bias",
    columns_map={"query": "Statement"},
    compliance=True
)

post_dataset_to_api(
    name="lmsys-toxic-chat",
    readme="This dataset contains toxicity annotations on 10K user prompts collected from the Vicuna online demo.",
    df=df_toxic_chat,
    default_metric="toxicity",
    columns_map={"query": "user_input"},
)

post_dataset_to_api(
    name="crows-pairs",
    readme="CrowS-Pairs dataset with anonymized sentences for bias evaluation. License: Creative Commons Attribution-ShareAlike 4.0 International License.",
    df=df_crows,
    default_metric="bias",
    columns_map={"query": "query"},
)


## Designing and running an experiment set

### COMPLIANCE

In [None]:
products = "ALBERT_API"
JUDGE = "gpt-5-mini"

In [None]:

# Common technical metrics
technical_metrics = [
    "generation_time",
    "nb_tokens_prompt",
    "nb_tokens_completion",
    "energy_consumption",
    "gwp_consumption",
]
# Dataset-specific metrics and impact type
datasets_metrics = {
    "llm-values-CIVICS": {
        "metrics": ["bias"],
        "impact_type": "Cultural_and_social_values"
    },
    "lmsys-toxic-chat": {
        "metrics": ["toxicity", "bias"],
        "impact_type": "Toxicity"
    },
    "crows-pairs": {
        "metrics": ["bias"],
        "impact_type": "Social_biases"
    }
}

# Build and submit one experiment set per dataset
for dataset_name, info in datasets_metrics.items():
    expset_name = f"{products}_base_on_{info['impact_type']}"
    expset_readme = f"Compliance evaluation for the {products} product, based on the {dataset_name} dataset, which analyzes {info['impact_type']}"
    
    metrics = info["metrics"] + technical_metrics
    
    common_params = {
        "dataset": dataset_name,
        "model": {
            "extra_params": {"rag": {"mode": "rag", "limit": 7}},
            "sampling_params": {"temperature": 0.2},
        },
        "metrics": metrics,
        "judge_model": JUDGE,
    }
    # Models to run in this experiment set
    grid_params = {
        "model": [
            {
                "name": "mistralai/Mistral-Small-3.2-24B-Instruct-2506",
                "aliased_name": "albert-large",
                "base_url": ALBERT_API_URL, "api_key": ALBERT_API_KEY
            },
            {
                "name": "meta-llama/Llama-3.1-8B-Instruct",
                "aliased_name": "albert-small",
                "base_url": ALBERT_API_URL, "api_key": ALBERT_API_KEY
            },
        ]
    }
    expset = {
        "name": expset_name,
        "readme": expset_readme,
        "cv": {"common_params": common_params, "grid_params": grid_params, "repeat": 3},
    }
    response = requests.post(f"{EVALAP_API_URL}/experiment_set", json=expset, headers=headers)
    resp = response.json()
    if "id" in resp:
        print(f'Created expset: {resp["name"]} (ID: {resp["id"]})')
    else:
        print(f'Error creating experiment set for {dataset_name}: {resp}')


You can now view the results in the front end (here, for a local instance): http://localhost:8501/experiments_set
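
To reason about how many experiments a `cv` block like the ones above is expected to create, the sketch below expands `common_params`, `grid_params`, and `repeat` locally. It is not EvalAP's implementation. It assumes the server takes the Cartesian product of the `grid_params` values, merges each combination over `common_params` (nested dictionaries such as `model` being merged key by key), and repeats each result `repeat` times. The helpers `deep_merge` and `expand_cv` are hypothetical.

```python
import copy
import itertools


def deep_merge(base, override):
    """Recursively merge `override` into a copy of `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def expand_cv(common_params, grid_params, repeat):
    """Enumerate experiments: every grid combination, repeated `repeat` times.

    ASSUMPTION: this mirrors the expansion rule described in the lead-in,
    not code taken from EvalAP.
    """
    keys = list(grid_params)
    experiments = []
    for combo in itertools.product(*(grid_params[k] for k in keys)):
        params = deep_merge(common_params, dict(zip(keys, combo)))
        experiments.extend(copy.deepcopy(params) for _ in range(repeat))
    return experiments


common = {
    "dataset": "crows-pairs",
    "model": {"sampling_params": {"temperature": 0.2}},
    "metrics": ["bias"],
}
grid = {"model": [{"aliased_name": "albert-large"}, {"aliased_name": "albert-small"}]}

runs = expand_cv(common, grid, repeat=3)
print(len(runs))
print(runs[0]["model"])
```

With two models and `repeat=3`, this yields six experiments per dataset under the stated assumption, which is a useful number to check against what the front end displays.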