# `RankingPipeline`

This notebook exercises the complete pipeline, whose parameters control how the scores are computed.

## Score computation

The `run` method launches the score computation.

In [None]:
import os
from getpass import getpass

cache_dir = input("Indicate path to all Hugging Face caches:")
os.environ["HF_DATASETS_CACHE"] = cache_dir
os.environ["HF_HUB_CACHE"] = cache_dir
os.environ["HF_TOKEN"] = getpass("Enter your HuggingFace token:")

In [21]:
from pathlib import Path
from rank_comparia.pipeline import RankingPipeline

### `RankingPipeline` parameters

- `method`: ranking method to use, one of `elo_random`, `elo_ordered` or `ml`
- `include_votes`: whether to use the votes data
- `include_reactions`: whether to use the reactions data
- `bootstrap_samples`: number of samples used to compute the *bootstrap* version of the scores
- `mean_how`: whether to average by number of generated tokens or by number of matches played
- `batch`: whether the matches are processed in batches
- `export_path`: path to the folder where the plots and the final scores are exported

In [None]:
pipeline = RankingPipeline(
    method="elo_random",
    include_votes=True,
    include_reactions=True,
    bootstrap_samples=5,
    mean_how="token",
    batch=False,
    export_path=Path("output"),
)

Using the latest cached version of the dataset since ministere-culture/comparia-votes couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-votes/default/0.0.0/679f56e14f413546403b3468d717c4417e394326 (last modified on Mon Jul 28 10:06:44 2025).
Using the latest cached version of the dataset since ministere-culture/comparia-conversations couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-conversations/default/0.0.0/dc40af6af1c14e68bf39d55f6e1573d2d6582f19 (last modified on Wed Jun  4 17:40:30 2025).
Using the latest cached version of the dataset since ministere-culture/comparia-reactions couldn't be found on the Hugging Face Hub (offline mode is enabled).
Fou

Final votes dataset contains 55617 conversations pairs.
Reactions data originally contains 22993 conversations pairs.
Final reactions dataset contains 21244 conversations pairs.


In [28]:
pipeline.matches

conversation_pair_id,model_a,model_b,score,categories,model_a_active_params,model_b_active_params,total_conv_a_output_tokens,total_conv_a_kwh,total_conv_b_output_tokens,total_conv_b_kwh
str,str,str,i32,list[str],f64,f64,f64,f64,f64,f64
"""e0a4afe42906427fb44dab8f7d8b66…","""gpt-4o-mini-2024-07-18""","""gemini-2.0-flash-001""",2,"[""Business & Economics & Finance"", ""Education""]",35.0,40.0,317.0,0.002396,542.0,0.004476
"""c0ce05dddccd4feaa35118d9f58891…","""gemma-2-9b-it""","""hermes-3-llama-3.1-405b""",2,"[""Health & Wellness & Medicine"", ""Society & Social Issues & Human Rights""]",9.0,405.0,566.0,0.002212,767.0,0.182489
"""fa0f3f8dd8b1438cbdc748fab481dc…","""gpt-4o-2024-08-06""","""aya-expanse-8b""",1,"[""Business & Economics & Finance""]",200.0,8.0,526.0,0.03231,729.0,0.002747
"""e85752b0f339459385bfc064a35ec3…","""mixtral-8x7b-instruct-v0.1""","""gemini-1.5-pro-001""",0,"[""Natural Science & Formal Science & Technology""]",,,292.0,0.001346,668.0,0.089563
"""1a03ebdfe29a4762a7d97d6247041d…","""gpt-4o-mini-2024-07-18""","""gemini-2.0-flash-001""",2,"[""Education"", ""Natural Science & Formal Science & Technology""]",35.0,40.0,822.0,0.006212,731.0,0.006037
…,…,…,…,…,…,…,…,…,…,…
"""712ff01209d846c9b786eb544e3c59…","""gemini-2.0-flash-001""","""claude-3-5-sonnet-v2""",2,"[""Food & Drink & Cooking"", ""Health & Wellness & Medicine""]",40.0,300.0,1851.0,0.015287,338.0,0.045373
"""2bc5df446aa74c5786c10a3e135259…","""claude-3-5-sonnet-v2""","""deepseek-v3-chat""",0,"[""Natural Science & Formal Science & Technology"", ""Education""]",300.0,37.0,785.0,0.105377,882.0,0.041477
"""2000efa884d74caca883ba98efc9f5…","""mistral-small-24b-instruct-250…","""gpt-4o-mini-2024-07-18""",2,"[""Natural Science & Formal Science & Technology""]",24.0,35.0,454.0,0.00273,846.0,0.006393
"""6c9edcb7c7844eb495711fc1ab4818…","""gemini-1.5-pro-002""","""llama-3.1-70b""",0,"[""Health & Wellness & Medicine"", ""Natural Science & Formal Science & Technology""]",220.0,70.0,827.0,0.110882,812.0,0.010125


## `match_list()`

This method builds the list of matches played in the arena. Each element of the list is one match, described by:
- the two models that played the match
- the outcome of the match: 0 if model B wins, 2 if model A wins, and 1 for a draw
- the id of the match's conversation
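
The structure described above can be sketched with a dataclass and an enum. The class and member names are taken from the output of `pipeline.match_list()` in this notebook, but the definitions themselves are a reconstruction for illustration and may differ from the ones in `rank_comparia`; `row_to_match` is a hypothetical helper.

```python
from dataclasses import dataclass
from enum import Enum


class MatchScore(Enum):
    """Outcome of a match, using the 0 / 1 / 2 encoding of the `score` column."""

    A = 2     # model A wins
    Draw = 1  # tie
    B = 0     # model B wins


@dataclass
class Match:
    model_a: str
    model_b: str
    score: MatchScore
    id: str


def row_to_match(row: dict) -> Match:
    """Hypothetical helper: build a Match from one row of the matches table."""
    return Match(
        model_a=row["model_a"],
        model_b=row["model_b"],
        score=MatchScore(row["score"]),  # raises ValueError on an unknown code
        id=row["conversation_pair_id"],
    )
```

Using an enum rather than a bare integer keeps the meaning of 0, 1 and 2 in one place and rejects any other value at construction time.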

In [29]:
pipeline.match_list()

[Match(model_a='gpt-4o-mini-2024-07-18', model_b='gemini-2.0-flash-001', score=<MatchScore.A: 2>, id='e0a4afe42906427fb44dab8f7d8b66af-72f12096597742c494f0c2bd0aa7d8be'),
 Match(model_a='gemma-2-9b-it', model_b='hermes-3-llama-3.1-405b', score=<MatchScore.A: 2>, id='c0ce05dddccd4feaa35118d9f58891ff-bdf8062d6abb4732b768c354b0975523'),
 Match(model_a='gpt-4o-2024-08-06', model_b='aya-expanse-8b', score=<MatchScore.Draw: 1>, id='fa0f3f8dd8b1438cbdc748fab481dcdc-a07a1b31957b4df889ca9cdac69b0d13'),
 Match(model_a='mixtral-8x7b-instruct-v0.1', model_b='gemini-1.5-pro-001', score=<MatchScore.B: 0>, id='e85752b0f339459385bfc064a35ec3d5-adcf3308040f49c7afd1c9d863e7f7d8'),
 Match(model_a='gpt-4o-mini-2024-07-18', model_b='gemini-2.0-flash-001', score=<MatchScore.A: 2>, id='1a03ebdfe29a4762a7d97d6247041dee-0af0467081434c0eb7ede68fd2ccb76b'),
 Match(model_a='phi-4', model_b='c4ai-command-r-08-2024', score=<MatchScore.B: 0>, id='0ff20c842d7d482ab5810d0f84b78971-9f0cbe06c0b14384a8ebf835f0fdef73'),
 

In [30]:
scores = pipeline.run()

Computing bootstrap scores from a sample of 76861 matches.


Processing bootstrap samples: 100%|██████████| 5/5 [00:00<00:00,  5.48it/s]


In [31]:
scores

model_name,median,p2.5,p97.5,total_output_tokens,conso_all_conv,n_match,mean_conso_per_match,mean_conso_per_token
str,f64,f64,f64,f64,f64,u32,f64,f64
"""Yi-1.5-9B-Chat""",774.241171,723.795812,816.145812,47531.0,0.182743,65,0.002811,0.000004
"""aya-expanse-8b""",1009.908116,974.484095,1147.204097,1.05781e6,3.98568,1302,0.003061,0.000004
"""c4ai-command-r-08-2024""",931.936808,876.647061,976.315158,2.747666e6,20.763975,3598,0.005771,0.000008
"""chocolatine-14b-instruct-dpo-v…",786.607742,648.945278,829.370282,131187.0,0.602249,309,0.001949,0.000005
"""chocolatine-2-14b-instruct-v2.…",841.683681,788.800519,914.163245,533384.0,1.934863,1796,0.001077,0.000004
…,…,…,…,…,…,…,…,…
"""qwen2-7b-instruct""",779.862264,686.265866,864.205276,43550.0,0.153078,80,0.001913,0.000004
"""qwen2.5-32b-instruct""",1015.134656,922.150664,1045.036653,75812.0,0.531085,142,0.00374,0.000007
"""qwen2.5-7b-instruct""",933.475457,835.779738,1015.902373,1.186919e6,4.305576,1420,0.003032,0.000004
"""qwen2.5-coder-32b-instruct""",922.406688,908.776973,983.476495,4.0819e6,29.128193,4954,0.00588,0.000007


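### Another way to compute the scores

Here only the reactions data is used (`include_votes=False`, `include_reactions=True`). Note that the result is nonetheless stored in a variable named `scores_votes`.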

In [None]:
pipeline = RankingPipeline(
    method="elo_random",
    include_votes=False,
    include_reactions=True,
    bootstrap_samples=5,
    mean_how="token",
    batch=False,
    export_path=None,  # Path("output"),
)
scores_votes = pipeline.run()

Using the latest cached version of the dataset since ministere-culture/comparia-reactions couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-reactions/default/0.0.0/80befa851337d9f295096cef3d100b40d220dc07 (last modified on Mon Jul 28 10:06:54 2025).
Using the latest cached version of the dataset since ministere-culture/comparia-conversations couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-conversations/default/0.0.0/dc40af6af1c14e68bf39d55f6e1573d2d6582f19 (last modified on Wed Jun  4 17:40:30 2025).


Reactions data originally contains 22993 conversations pairs.
Final reactions dataset contains 21244 conversations pairs.
Computing bootstrap scores from a sample of 21244 matches.


Processing bootstrap samples: 100%|██████████| 5/5 [00:00<00:00, 25.20it/s]


In [12]:
scores_votes

model_name,median,p2.5,p97.5,total_output_tokens,conso_all_conv,n_match,mean_conso_per_match,mean_conso_per_token
str,f64,f64,f64,f64,f64,u32,f64,f64
"""aya-expanse-8b""",940.817651,924.158808,1022.448303,327851.0,1.235297,467,0.002645,0.000004
"""c4ai-command-r-08-2024""",949.314328,913.310484,1005.993491,741453.0,5.603123,1079,0.005193,0.000008
"""chocolatine-2-14b-instruct-v2.…",807.375319,705.541094,888.0187,167003.0,0.605807,556,0.00109,0.000004
"""claude-3-5-sonnet-v2""",979.612297,871.392429,1044.974816,992729.0,133.262452,1834,0.072662,0.000134
"""claude-3-7-sonnet""",1086.632401,988.473159,1129.083656,287583.0,38.604711,296,0.130421,0.000134
…,…,…,…,…,…,…,…,…
"""phi-3.5-mini-instruct""",931.243001,797.687846,966.268693,343212.0,1.052349,430,0.002447,0.000003
"""phi-4""",1030.029215,964.638968,1077.368166,1.149348e6,5.298356,1498,0.003537,0.000005
"""qwen2.5-7b-instruct""",946.331531,833.547935,1022.996342,313742.0,1.138106,392,0.002903,0.000004
"""qwen2.5-coder-32b-instruct""",858.68396,800.259613,977.292582,1.168947e6,8.341536,1490,0.005598,0.000007


## Pipeline with an alternative ranker

Using the `MaximumLikelihood` ranker.
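
The comparison plot at the end of this notebook labels the `ml` scores as "BT", which suggests a Bradley-Terry model fitted by maximum likelihood. The sketch below shows one common way to fit such a model, the minorization-maximization (MM) update. It is an assumption about the general technique, not the code of the `MaximumLikelihood` ranker: the function names, the handling of draws as half a win for each side, and the conversion to an Elo-like scale are all choices made for this example. It also assumes two distinct models per match and at least one (fractional) win per model, since it applies no regularization.

```python
import math
from collections import defaultdict


def bradley_terry(matches, n_iter=200, tol=1e-10):
    """Fit Bradley-Terry strengths from (model_a, model_b, score) tuples.

    `score` uses the notebook's convention: 2 = A wins, 1 = draw, 0 = B wins.
    """
    wins = defaultdict(float)   # total (fractional) wins per model
    games = defaultdict(float)  # number of games per unordered pair of models
    for model_a, model_b, score in matches:
        wins[model_a] += score / 2.0
        wins[model_b] += 1.0 - score / 2.0
        games[frozenset((model_a, model_b))] += 1.0

    models = sorted(wins)
    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(n_iter):
        updated = {}
        for m in models:
            denom = 0.0
            for pair, n in games.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom
        total = sum(updated.values())
        updated = {m: s / total for m, s in updated.items()}  # normalize to sum to 1
        change = max(abs(updated[m] - strength[m]) for m in models)
        strength = updated
        if change < tol:
            break
    return strength


def to_elo_scale(strength, base=1000.0):
    """Map strengths to an Elo-like scale centered on `base` (illustrative choice)."""
    mean_log = sum(math.log10(s) for s in strength.values()) / len(strength)
    return {m: base + 400.0 * (math.log10(s) - mean_log) for m, s in strength.items()}
```

Unlike sequential Elo updates, this fit does not depend on the order of the matches, which is one reason to compare both families of scores.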

In [None]:
pipeline = RankingPipeline(
    method="ml",
    include_votes=True,
    include_reactions=True,
    mean_how="token",
    bootstrap_samples=5,
    batch=False,
    export_path=Path("output"),
)

Using the latest cached version of the dataset since ministere-culture/comparia-votes couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-votes/default/0.0.0/679f56e14f413546403b3468d717c4417e394326 (last modified on Mon Jul 28 10:06:44 2025).
Using the latest cached version of the dataset since ministere-culture/comparia-conversations couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-conversations/default/0.0.0/dc40af6af1c14e68bf39d55f6e1573d2d6582f19 (last modified on Wed Jun  4 17:40:30 2025).
Using the latest cached version of the dataset since ministere-culture/comparia-reactions couldn't be found on the Hugging Face Hub (offline mode is enabled).
Fou

Final votes dataset contains 55617 conversations pairs.
Reactions data originally contains 22993 conversations pairs.
Final reactions dataset contains 21244 conversations pairs.


In [14]:
scores_ml = pipeline.run()

Computing bootstrap scores from a sample of 76861 matches.


Processing bootstrap samples: 100%|██████████| 5/5 [00:00<00:00,  6.09it/s]


In [None]:
pipeline = RankingPipeline(
    method="ml",
    include_votes=True,
    include_reactions=False,
    mean_how="token",
    bootstrap_samples=5,
    batch=False,
)

scores_ml_votes = pipeline.run()

Using the latest cached version of the dataset since ministere-culture/comparia-votes couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-votes/default/0.0.0/679f56e14f413546403b3468d717c4417e394326 (last modified on Mon Jul 28 10:06:44 2025).
Using the latest cached version of the dataset since ministere-culture/comparia-conversations couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-conversations/default/0.0.0/dc40af6af1c14e68bf39d55f6e1573d2d6582f19 (last modified on Wed Jun  4 17:40:30 2025).


Final votes dataset contains 55617 conversations pairs.
Computing bootstrap scores from a sample of 55617 matches.


Processing bootstrap samples: 100%|██████████| 5/5 [00:00<00:00,  7.98it/s]


## Comparing the different methods

In [16]:
import polars as pl

pl.concat(
    [
        scores.select("model_name", "median").rename(mapping={"median": "score_elo"}),
        scores_votes.select("model_name", "median").rename(mapping={"median": "score_elo_votes"}),
        scores_ml.select("model_name", "median").rename(mapping={"median": "score_ml"}),
        scores_ml_votes.select("model_name", "median").rename(mapping={"median": "score_ml_votes"}),
    ],
    how="align",
)

model_name,score_elo,score_elo_votes,score_ml,score_ml_votes
str,f64,f64,f64,f64
"""Yi-1.5-9B-Chat""",820.155853,,813.571738,755.436748
"""aya-expanse-8b""",1050.031121,940.817651,1005.933372,1002.482744
"""c4ai-command-r-08-2024""",955.966302,949.314328,953.636286,954.025745
"""chocolatine-14b-instruct-dpo-v…",765.275949,,789.4,771.642505
"""chocolatine-2-14b-instruct-v2.…",863.132705,807.375319,848.395798,852.66589
…,…,…,…,…
"""qwen2-7b-instruct""",780.921448,,746.920477,770.749264
"""qwen2.5-32b-instruct""",940.966675,,998.703232,1002.393794
"""qwen2.5-7b-instruct""",966.826827,946.331531,956.055307,978.358183
"""qwen2.5-coder-32b-instruct""",946.653857,858.68396,952.915885,962.0599


In [17]:
import polars as pl
import altair as alt

df_pl = pl.concat(
    [
        scores.select("model_name", "median").rename(mapping={"median": "score_elo"}),
        scores_votes.select("model_name", "median").rename(mapping={"median": "score_elo_votes"}),
        scores_ml.select("model_name", "median").rename(mapping={"median": "score_ml"}),
        scores_ml_votes.select("model_name", "median").rename(mapping={"median": "score_ml_votes"}),
    ],
    how="align",
).sort("score_elo", descending=True)

df = df_pl.to_pandas()
df_long = df.melt(
    id_vars=["model_name"],
    value_vars=["score_elo", "score_elo_votes", "score_ml", "score_ml_votes"],
    var_name="score_type",
    value_name="score",
)
legend_labels = {
    "score_elo": "Elo score (all data)",
    "score_elo_votes": "Elo score (votes data)",
    "score_ml": "BT score (all data)",
    "score_ml_votes": "BT score (votes data)",
}
df_long["score_type"] = df_long["score_type"].map(legend_labels)

chart = (
    alt.Chart(df_long)
    .mark_circle(size=80)
    .encode(
        x=alt.X("model_name:N", sort=df["model_name"].tolist(), title="model_name"),
        y=alt.Y("score:Q", title="Score", scale=alt.Scale(domain=[500, 1300])),
        color=alt.Color("score_type:N", title="Score Type"),
        tooltip=["model_name", "score", "score_type"],
    )
    .properties(width=600, height=400)
)

chart

## Scores per category

The `run_category` and `run_all_categories` methods compute scores for one given category or for all categories whose total number of matches exceeds a threshold.
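
The thresholding can be pictured as follows. This is a minimal sketch in plain Python, not the library's code: the function names are hypothetical, matches are represented as dictionaries with a `categories` list (as in the `categories` column of `pipeline.matches`), and the default threshold of 1000 is inferred from the "Skipping ..." messages in the output rather than from the source of `rank_comparia`.

```python
from collections import Counter


def matches_per_category(matches):
    """Count matches per category; a match with several categories counts once in each."""
    counts = Counter()
    for match in matches:
        counts.update(set(match["categories"]))
    return counts


def split_by_category(matches, min_matches=1000):
    """Group matches by category, skipping categories below the threshold."""
    kept = {}
    for category, n in matches_per_category(matches).items():
        if n < min_matches:
            print(f"Skipping {category} which has less than {min_matches} matches.")
            continue
        kept[category] = [m for m in matches if category in m["categories"]]
    return kept
```

Each list in the returned dictionary could then be scored independently, in the same way as the full set of matches.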

In [None]:
pipeline = RankingPipeline(
    method="elo_random",
    include_votes=True,
    include_reactions=True,
    mean_how="token",
    bootstrap_samples=5,
    batch=False,
)

Using the latest cached version of the dataset since ministere-culture/comparia-votes couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-votes/default/0.0.0/679f56e14f413546403b3468d717c4417e394326 (last modified on Mon Jul 28 10:06:44 2025).
Using the latest cached version of the dataset since ministere-culture/comparia-conversations couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-conversations/default/0.0.0/dc40af6af1c14e68bf39d55f6e1573d2d6582f19 (last modified on Wed Jun  4 17:40:30 2025).
Using the latest cached version of the dataset since ministere-culture/comparia-reactions couldn't be found on the Hugging Face Hub (offline mode is enabled).
Fou

Final votes dataset contains 55617 conversations pairs.
Reactions data originally contains 22993 conversations pairs.
Final reactions dataset contains 21244 conversations pairs.


In [17]:
pipeline.run_category("Education")

Computing bootstrap scores from a sample of 23033 matches.


Processing bootstrap samples:   0%|          | 0/5 [00:00<?, ?it/s]

Processing bootstrap samples: 100%|██████████| 5/5 [00:00<00:00, 14.37it/s]


model_name,median,p2.5,p97.5,total_output_tokens,conso_all_conv,n_match,mean_conso_per_match,mean_conso_per_token
str,f64,f64,f64,f64,f64,u32,f64,f64
"""Yi-1.5-9B-Chat""",800.817442,741.422315,850.615496,47531.0,0.182743,65,0.002811,0.000004
"""aya-expanse-8b""",971.809061,919.628331,1045.417157,1.05781e6,3.98568,1302,0.003061,0.000004
"""c4ai-command-r-08-2024""",918.126747,902.835228,1019.960401,2.747666e6,20.763975,3598,0.005771,0.000008
"""chocolatine-14b-instruct-dpo-v…",810.954755,730.302308,924.922786,131187.0,0.602249,309,0.001949,0.000005
"""chocolatine-2-14b-instruct-v2.…",805.31044,763.368634,919.795801,533384.0,1.934863,1796,0.001077,0.000004
…,…,…,…,…,…,…,…,…
"""qwen2-7b-instruct""",826.097891,752.052838,863.03069,43550.0,0.153078,80,0.001913,0.000004
"""qwen2.5-32b-instruct""",1068.580006,1032.987712,1171.712044,75812.0,0.531085,142,0.00374,0.000007
"""qwen2.5-7b-instruct""",932.091519,881.377309,1079.192671,1.186919e6,4.305576,1420,0.003032,0.000004
"""qwen2.5-coder-32b-instruct""",946.687908,899.577859,1033.987153,4.0819e6,29.128193,4954,0.00588,0.000007


In [18]:
results = pipeline.run_all_categories()

Computing bootstrap scores from a sample of 23033 matches.


Processing bootstrap samples:   0%|          | 0/5 [00:00<?, ?it/s]

Processing bootstrap samples: 100%|██████████| 5/5 [00:00<00:00, 17.62it/s]


Computing bootstrap scores from a sample of 8046 matches.


Processing bootstrap samples: 100%|██████████| 5/5 [00:00<00:00, 49.23it/s]


Computing bootstrap scores from a sample of 12297 matches.


Processing bootstrap samples: 100%|██████████| 5/5 [00:00<00:00, 34.22it/s]

Computing bootstrap scores from a sample of 10069 matches.



Processing bootstrap samples: 100%|██████████| 5/5 [00:00<00:00, 37.29it/s]


Computing bootstrap scores from a sample of 11206 matches.


Processing bootstrap samples: 100%|██████████| 5/5 [00:00<00:00, 34.16it/s]


Computing bootstrap scores from a sample of 5281 matches.


Processing bootstrap samples: 100%|██████████| 5/5 [00:00<00:00, 68.62it/s]


Computing bootstrap scores from a sample of 5714 matches.


Processing bootstrap samples: 100%|██████████| 5/5 [00:00<00:00, 67.92it/s]


Computing bootstrap scores from a sample of 31303 matches.


Processing bootstrap samples: 100%|██████████| 5/5 [00:00<00:00, 12.20it/s]


Computing bootstrap scores from a sample of 17013 matches.


Processing bootstrap samples: 100%|██████████| 5/5 [00:00<00:00, 23.52it/s]


Computing bootstrap scores from a sample of 13220 matches.


Processing bootstrap samples: 100%|██████████| 5/5 [00:00<00:00, 28.90it/s]

Skipping Other which has less than 1000 matches.
Computing bootstrap scores from a sample of 7920 matches.



Processing bootstrap samples: 100%|██████████| 5/5 [00:00<00:00, 48.83it/s]


Computing bootstrap scores from a sample of 5642 matches.


Processing bootstrap samples: 100%|██████████| 5/5 [00:00<00:00, 81.46it/s]


Computing bootstrap scores from a sample of 5913 matches.


Processing bootstrap samples: 100%|██████████| 5/5 [00:00<00:00, 73.85it/s]


Skipping Shopping & Commodity which has less than 1000 matches.
Skipping Daily Life & Home & Lifestyle which has less than 1000 matches.
Skipping Religion & Spirituality which has less than 1000 matches.
Skipping Sports which has less than 1000 matches.
Skipping History which has less than 1000 matches.
Skipping Real Estate which has less than 1000 matches.
Skipping Philosophy which has less than 1000 matches.
Skipping International which has less than 1000 matches.
Skipping Psychology which has less than 1000 matches.
Skipping Security which has less than 1000 matches.
Skipping Philosophy & Spirituality which has less than 1000 matches.
Skipping Fashion which has less than 1000 matches.
Skipping Music which has less than 1000 matches.
Skipping Marketing which has less than 1000 matches.
Skipping Ethics & Debate which has less than 1000 matches.
Skipping Philosophy & logic which has less than 1000 matches.
Skipping Philosophy & Ethics which has less than 1000 matches.
Skipping Industry

In [19]:
results

{'Education': shape: (52, 4)
 ┌─────────────────────────────────┬─────────────┬─────────────┬─────────────┐
 │ model_name                      ┆ median      ┆ p2.5        ┆ p97.5       │
 │ ---                             ┆ ---         ┆ ---         ┆ ---         │
 │ str                             ┆ f64         ┆ f64         ┆ f64         │
 ╞═════════════════════════════════╪═════════════╪═════════════╪═════════════╡
 │ command-a                       ┆ 1165.939453 ┆ 1039.06224  ┆ 1175.586129 │
 │ claude-3-7-sonnet               ┆ 1163.001257 ┆ 1132.657457 ┆ 1185.869103 │
 │ gemini-2.0-flash-exp            ┆ 1151.385467 ┆ 1061.135761 ┆ 1183.146645 │
 │ gemini-2.0-flash-001            ┆ 1145.885573 ┆ 1089.733703 ┆ 1232.851487 │
 │ gemini-1.5-pro-001              ┆ 1129.806102 ┆ 1035.871364 ┆ 1159.290317 │
 │ …                               ┆ …           ┆ …           ┆ …           │
 │ chocolatine-2-14b-instruct-v2.… ┆ 832.347216  ┆ 797.186406  ┆ 900.729485  │
 │ Yi-1.5-9B-Chat      