# Computing scores from the `comparia` dataset

This notebook illustrates how to use the `Ranker` classes to compute scores from the `comparia` data.

## Loading the data

In [None]:
import os
from getpass import getpass

cache_dir = input("Indicate path to all Hugging Face caches:")
os.environ["HF_DATASETS_CACHE"] = cache_dir
os.environ["HF_HUB_CACHE"] = cache_dir
os.environ["HF_TOKEN"] = getpass("Enter your HuggingFace token:")

In [2]:
from rank_comparia.utils import load_comparia

reactions = load_comparia("ministere-culture/comparia-reactions")

Using the latest cached version of the dataset since ministere-culture/comparia-reactions couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-reactions/default/0.0.0/80befa851337d9f295096cef3d100b40d220dc07 (last modified on Mon Jul 28 10:06:54 2025).
Using the latest cached version of the dataset since ministere-culture/comparia-conversations couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-conversations/default/0.0.0/dc40af6af1c14e68bf39d55f6e1573d2d6582f19 (last modified on Wed Jun  4 17:40:30 2025).


## Formatting the data

Here we use *legacy* functions with a simple heuristic to determine the result of a *match* (a pair of conversations) from the associated reactions. For each model, we subtract the number of negative reactions from the number of positive reactions. The model with the higher difference wins the match. If the differences are equal for both models, the *match* is a draw (draws are filtered out in the `get_winners` function).
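As a rough illustration of this heuristic, independent of the `rank_comparia` functions, the sketch below scores a match from two lists of reactions. The function name `match_outcome` and the boolean encoding of reactions (`True` for positive, `False` for negative) are assumptions made for the example, not the library's data model.

```python
def match_outcome(reactions_a: list[bool], reactions_b: list[bool]) -> str:
    """Return "a", "b" or "draw" from per-model reactions.

    Hypothetical encoding: True is a positive reaction, False a negative one.
    """
    # Net score = number of positive reactions minus number of negative reactions.
    score_a = sum(1 if liked else -1 for liked in reactions_a)
    score_b = sum(1 if liked else -1 for liked in reactions_b)
    if score_a > score_b:
        return "a"
    if score_b > score_a:
        return "b"
    return "draw"


# Model A gets two negative reactions (-2), model B two positive ones (+2).
print(match_outcome([False, False], [True, True]))
# Equal net scores give a draw, which would later be filtered out.
print(match_outcome([False, False], [False, False]))
```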

In [3]:
from rank_comparia.data_transformation import get_matches_with_score, get_winners, get_winrates

matches = get_matches_with_score(reactions)

matches.head(5)

model_a_name,model_b_name,conversation_pair_id,score_a,score_b
str,str,str,i64,i64
"""gpt-4o-2024-08-06""","""gemini-2.0-flash-exp""","""a9d7ba9f60914ca3807fb9534834b9…",-2,2
"""llama-3.3-70b""","""c4ai-command-r-08-2024""","""778cd8622030442db5a52c82a8ce35…",-2,-2
"""gemma-2-9b-it""","""llama-3.3-70b""","""dad8a77a13754042994d06dab41b11…",2,-1
"""deepseek-v3-chat""","""claude-3-5-sonnet-v2""","""26accee6499848d599d152723533d8…",2,0
"""gemini-2.0-flash-exp""","""mistral-large-2411""","""ea434cd5413c4b2ea5c00acb5bc133…",1,-1


In [4]:
winners = get_winners(matches)

We compute a win rate for each model.
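For reference, a win rate expressed as a percentage is the number of wins divided by the number of matches played, times 100. The sketch below computes it on toy data in plain Python; the function `win_rates`, the tuple layout and the model names are hypothetical and do not reflect how `get_winrates` is implemented.

```python
from collections import defaultdict


def win_rates(results: list[tuple[str, str, str]]) -> dict[str, float]:
    """Compute win rates in percent from (model_a, model_b, winner) tuples.

    Draws are assumed to have been removed beforehand.
    """
    played: dict[str, int] = defaultdict(int)
    won: dict[str, int] = defaultdict(int)
    for model_a, model_b, winner in results:
        played[model_a] += 1
        played[model_b] += 1
        won[winner] += 1
    return {model: 100 * won[model] / played[model] for model in played}


toy_results = [
    ("x", "y", "x"),
    ("x", "y", "x"),
    ("x", "z", "z"),
    ("y", "z", "y"),
]
toy_rates = win_rates(toy_results)
print(toy_rates)
```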

In [5]:
winrates = get_winrates(winners)
winrates.sort("winrate", descending=True)

model_name,len,wins,winrate
str,u32,u32,f64
"""gemini-2.0-flash-exp""",856,647,75.584112
"""deepseek-v3-chat""",1530,1080,70.588235
"""gemma-3-27b""",495,348,70.30303
"""gemini-2.0-flash-001""",688,480,69.767442
"""gemini-1.5-pro-001""",328,222,67.682927
…,…,…,…
"""mixtral-8x7b-instruct-v0.1""",585,222,37.948718
"""lfm-40b""",887,321,36.189402
"""mixtral-8x22b-instruct-v0.1""",1459,445,30.500343
"""mistral-nemo-2407""",1461,435,29.774127


## Computing the scores

Each *match* is assigned a result (a `MatchScore`), the matches are shuffled, and they are added one by one to an `ELORanker`, which updates the scores after each added match.
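The internals of `ELORanker` are not shown in this notebook. As a minimal sketch, assuming it follows the standard Elo update rule, each match moves both ratings by `K` times the gap between the observed and the expected result. The helper names and the initial rating of 1000 are assumptions for the example; only `K=40` comes from the notebook. Because each update depends on the current ratings, processing the same matches in a different order gives different final scores, which is why the matches are shuffled.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(ratings: dict[str, float], model_a: str, model_b: str, score_a: float, k: float = 40.0) -> None:
    """Update ratings in place; score_a is 1.0 (A wins), 0.0 (B wins) or 0.5 (draw)."""
    delta = k * (score_a - expected_score(ratings[model_a], ratings[model_b]))
    ratings[model_a] += delta
    ratings[model_b] -= delta  # zero-sum: what A gains, B loses


def run_elo(matches: list[tuple[str, str, float]], initial: float = 1000.0) -> dict[str, float]:
    """Process matches sequentially, starting every model at `initial` (assumed value)."""
    ratings = {model: initial for match in matches for model in match[:2]}
    for model_a, model_b, score_a in matches:
        elo_update(ratings, model_a, model_b, score_a)
    return ratings


toy_matches = [("x", "y", 1.0), ("y", "z", 1.0), ("z", "x", 1.0), ("x", "y", 0.5)]
in_order = run_elo(toy_matches)
in_reverse = run_elo(toy_matches[::-1])
print(in_order)
print(in_reverse)  # same matches, different order, different final ratings
```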

In [6]:
from rank_comparia.elo import ELORanker
from rank_comparia.ranker import Match, MatchScore
import random


def compute_match_score(score_a: int, score_b: int) -> MatchScore:
    """Map the two net reaction scores of a match to its outcome."""
    final_score = score_b - score_a
    if final_score > 0:
        return MatchScore.B
    elif final_score < 0:
        return MatchScore.A
    else:
        return MatchScore.Draw


def get_shuffled_results(matches: list[Match], model_names: list[str], seed: int = 0):
    """Compute Elo scores after adding the matches in a seed-dependent random order."""
    random.seed(seed)
    ranker_shuffle = ELORanker(K=40)
    # random.sample with k=len(matches) returns a shuffled copy and leaves `matches` untouched
    matches_shuffle = random.sample(matches, k=len(matches))
    ranker_shuffle.add_players(model_names)
    ranker_shuffle.compute_scores(matches=matches_shuffle)
    return ranker_shuffle.players

100 sets of scores are computed, each with a different order of match insertion.

In [7]:
model_names = set(matches["model_a_name"].unique()) | set(matches["model_b_name"].unique())
matches = [
    Match(
        match_dict["model_a_name"],
        match_dict["model_b_name"],
        compute_match_score(match_dict["score_a"], match_dict["score_b"]),
    )
    for match_dict in matches.to_dicts()
]

player_results = {
    seed: get_shuffled_results(matches=matches, model_names=model_names, seed=seed) for seed in range(100)  # type: ignore
}

The mean scores over the 100 runs are computed:

In [8]:
players_avg_ranking = {
    player_name: sum(results[player_name] for results in player_results.values()) / len(player_results)
    for player_name in model_names
}

In [9]:
for player, ranking in sorted(players_avg_ranking.items(), key=lambda x: -x[1]):
    print(f"{player} : {ranking}")

gemini-2.0-flash-exp : 1147.9599849089927
gemini-2.0-flash-001 : 1143.8844295922418
gemma-3-27b : 1143.6980445411996
deepseek-v3-chat : 1118.4919941658666
deepseek-v3-0324 : 1117.7935322007172
claude-3-7-sonnet : 1109.6578565056934
command-a : 1103.506541041654
gpt-4.1-mini : 1089.0134217999946
gemma-3-12b : 1087.4332177526528
llama-3.1-nemotron-70b-instruct : 1082.8862324152165
deepseek-r1 : 1067.707567154525
grok-3-mini-beta : 1064.5175499116738
gemini-1.5-pro-002 : 1059.2637543153353
gemma-3-4b : 1051.2248313858229
gemini-1.5-pro-001 : 1039.7767494567863
llama-4-scout : 1038.538638236755
mistral-small-3.1-24b : 1034.7727254076663
mistral-large-2411 : 1031.9773440478325
o4-mini : 1007.7047558790124
claude-3-5-sonnet-v2 : 999.6028528708318
o3-mini : 998.0999711625981
gpt-4o-mini-2024-07-18 : 997.3395258174513
mistral-saba : 996.1870618799895
llama-3.1-405b : 993.7309560027419
llama-3.3-70b : 989.636112392419
jamba-1.5-large : 986.4594007385728
gpt-4.1-nano : 983.3776018378088
mistral-

Two score computations with different match orders:

In [10]:
ranker_shuffle = ELORanker(K=40)

random.seed(42)
matches_shuffle = random.sample(matches, k=len(matches))
ranker_shuffle.add_players(model_names)  # type: ignore
ranker_shuffle.compute_scores(matches=matches_shuffle)
ranker_shuffle.get_scores()

{'deepseek-v3-chat': 1222.3659076645158,
 'deepseek-v3-0324': 1160.022327528974,
 'gemma-3-27b': 1148.2471033702627,
 'llama-3.1-nemotron-70b-instruct': 1129.19821147745,
 'grok-3-mini-beta': 1129.1226774110612,
 'claude-3-7-sonnet': 1110.4562506943903,
 'gemini-2.0-flash-001': 1093.4435764629218,
 'gemma-3-12b': 1093.4037570373907,
 'llama-4-scout': 1080.9682101315454,
 'llama-3.1-70b': 1073.1794687278423,
 'gpt-4o-2024-08-06': 1066.928754250905,
 'gpt-4o-mini-2024-07-18': 1054.3153814444802,
 'o4-mini': 1052.2915707673058,
 'deepseek-r1': 1042.1160324385348,
 'gemini-2.0-flash-exp': 1040.7819248617755,
 'gpt-4.1-mini': 1039.8891571909273,
 'gemini-1.5-pro-002': 1030.0832258204975,
 'gpt-4.1-nano': 1025.5068295180092,
 'mistral-large-2411': 1010.2553504632021,
 'o3-mini': 1008.0472387426064,
 'gemma-3-4b': 1007.8624774286593,
 'jamba-1.5-large': 1003.3622032217606,
 'gemma-2-27b-it-q8': 1002.9330250557166,
 'qwq-32b': 998.4497816797184,
 'claude-3-5-sonnet-v2': 997.1559291576199,
 'mi

In [11]:
ranker_shuffle = ELORanker(K=40)

random.seed(1337)
matches_shuffle = random.sample(matches, k=len(matches))
ranker_shuffle.add_players(model_names)  # type: ignore
ranker_shuffle.compute_scores(matches=matches_shuffle)
ranker_shuffle.get_scores()

{'gemini-2.0-flash-001': 1209.8342618515323,
 'gemma-3-27b': 1195.8299467148802,
 'deepseek-v3-chat': 1139.0750302684273,
 'mistral-small-3.1-24b': 1138.0141516770327,
 'deepseek-v3-0324': 1129.758660061055,
 'gemini-2.0-flash-exp': 1118.4754331032336,
 'claude-3-7-sonnet': 1116.0175756795334,
 'deepseek-r1': 1114.1205449381769,
 'llama-3.1-nemotron-70b-instruct': 1081.1405222523958,
 'command-a': 1076.0820191004175,
 'llama-3.3-70b': 1063.9149934964255,
 'claude-3-5-sonnet-v2': 1059.8741758182616,
 'llama-3.1-70b': 1054.1928910862732,
 'grok-3-mini-beta': 1052.6146703291633,
 'gemma-3-12b': 1046.9029354750498,
 'mistral-small-24b-instruct-2501': 1046.6210369707717,
 'llama-4-scout': 1028.7571908209454,
 'gpt-4.1-nano': 1019.2522832667461,
 'gpt-4.1-mini': 1015.7552936399269,
 'gemini-1.5-pro-002': 1013.8739456511503,
 'gemini-1.5-pro-001': 1008.0056676253727,
 'gemma-3-4b': 1004.4250174730746,
 'phi-4': 1002.1162429467676,
 'o3-mini': 999.0945843715662,
 'gpt-4o-mini-2024-07-18': 995.

## Using a maximum likelihood Ranker

Here we compute the scores with an alternative `Ranker`, defined in `src/rank_comparia/maximum_likelihood.py`.
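The notebook does not show how `MaximumLikelihoodRanker` is implemented. As an illustration only, a common maximum likelihood model for pairwise comparisons is the Bradley-Terry model, fitted below with a simple fixed-point (minorization-maximization) iteration. The function name, the omission of draws and the conversion to an Elo-like scale centred on 1000 are all assumptions for this sketch. Unlike the sequential Elo update, the result does not depend on the order of the matches.

```python
import math
from collections import defaultdict


def bradley_terry(matches: list[tuple[str, str]], n_iter: int = 200) -> dict[str, float]:
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Sketch only: every player needs at least one win, draws are not handled.
    Returns scores on an assumed Elo-like scale (1000 + 400 * log10(strength)).
    """
    wins: dict[str, float] = defaultdict(float)
    pair_counts: dict[frozenset, float] = defaultdict(float)  # matches per unordered pair
    players: set[str] = set()
    for winner, loser in matches:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        players.update((winner, loser))

    strength = {p: 1.0 for p in players}
    for _ in range(n_iter):
        updated = {}
        for p in players:
            # Sum over opponents q of n_pq / (strength_p + strength_q).
            denom = sum(
                count / (strength[p] + strength[q])
                for pair, count in pair_counts.items()
                if p in pair
                for q in pair - {p}
            )
            updated[p] = wins[p] / denom
        # Strengths are defined up to a constant factor: fix the geometric mean to 1.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {p: s / math.exp(log_mean) for p, s in updated.items()}
    return {p: 1000 + 400 * math.log10(s) for p, s in strength.items()}


# "x" beats "y" three times out of four, so the fitted strength ratio is 3.
toy_scores = bradley_terry([("x", "y")] * 3 + [("y", "x")])
print(toy_scores)
```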

In [12]:
from rank_comparia.maximum_likelihood import MaximumLikelihoodRanker

ranker = MaximumLikelihoodRanker()
ranker.compute_scores(matches=matches)
ranker.get_scores()

{'gemini-2.0-flash-exp': np.float64(1143.1512358130012),
 'gemma-3-27b': np.float64(1136.3184203836174),
 'gemini-2.0-flash-001': np.float64(1133.400570032988),
 'deepseek-v3-chat': np.float64(1114.0328789363393),
 'deepseek-v3-0324': np.float64(1113.2067915235539),
 'claude-3-7-sonnet': np.float64(1105.1612084450092),
 'command-a': np.float64(1101.8786858020478),
 'gpt-4.1-mini': np.float64(1093.6685242271394),
 'gemma-3-12b': np.float64(1081.4653106195133),
 'llama-3.1-nemotron-70b-instruct': np.float64(1078.196872612909),
 'grok-3-mini-beta': np.float64(1071.2724072426324),
 'deepseek-r1': np.float64(1064.2931624791943),
 'gemma-3-4b': np.float64(1055.3168065207053),
 'gemini-1.5-pro-002': np.float64(1047.884270532969),
 'llama-4-scout': np.float64(1038.2019664020102),
 'gemini-1.5-pro-001': np.float64(1037.4923002507746),
 'mistral-small-3.1-24b': np.float64(1030.0445144875346),
 'mistral-large-2411': np.float64(1028.6925455484807),
 'o3-mini': np.float64(1006.2210001085826),
 'cla

## Bootstrap

The `Ranker` classes have a `compute_bootstrap_scores` method that computes scores and bootstrap confidence intervals (the matches used to compute the scores for each bootstrap sample are obtained by resampling the initial sample of matches with replacement).
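The resampling idea can be sketched in a few lines of plain Python. This is a generic percentile bootstrap under stated assumptions, not the code of `compute_bootstrap_scores`: the function `bootstrap_interval` is hypothetical, and the statistic used in the toy example is a simple win rate rather than a ranker score.

```python
import random
import statistics


def bootstrap_interval(sample, statistic, n_bootstrap: int = 100, seed: int = 0):
    """Return (median, 2.5th percentile, 97.5th percentile) of a statistic over bootstrap resamples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_bootstrap):
        # Resample with replacement, keeping the original sample size.
        resample = rng.choices(sample, k=len(sample))
        estimates.append(statistic(resample))
    # With n=40 the cut points fall at 2.5 %, 5 %, ..., 97.5 %.
    cuts = statistics.quantiles(estimates, n=40)
    return statistics.median(estimates), cuts[0], cuts[-1]


# Toy sample: 70 wins (1) and 30 losses (0) for a single model.
toy_outcomes = [1] * 70 + [0] * 30
boot_median, boot_low, boot_high = bootstrap_interval(toy_outcomes, statistics.mean)
print(boot_median, boot_low, boot_high)
```

In the ranker setting, the statistic would be the full score computation applied to each resampled list of matches, yielding one interval per model.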

In [13]:
ranker = ELORanker(K=40)

ranker.add_players(model_names)  # type: ignore
scores = ranker.compute_bootstrap_scores(matches=matches)

scores

Computing bootstrap scores from a sample of 22993 matches.


Processing bootstrap samples: 100%|██████████| 100/100 [00:04<00:00, 21.52it/s]


model_name,median,p2.5,p97.5
str,f64,f64,f64
"""gemini-2.0-flash-exp""",1153.390147,1031.890169,1244.877391
"""gemini-2.0-flash-001""",1152.431767,1052.093245,1251.837456
"""gemma-3-27b""",1134.143863,1041.371622,1231.061902
"""deepseek-v3-chat""",1123.290963,1018.621306,1233.623067
"""deepseek-v3-0324""",1114.157011,1011.227499,1207.584693
…,…,…,…
"""mixtral-8x7b-instruct-v0.1""",896.434294,771.549341,983.017173
"""phi-3.5-mini-instruct""",871.098077,783.43975,996.576985
"""mixtral-8x22b-instruct-v0.1""",847.277576,759.236246,984.498975
"""mistral-nemo-2407""",846.145031,737.419622,963.055363


In [14]:
ranker = MaximumLikelihoodRanker()
scores = ranker.compute_bootstrap_scores(matches=matches)

scores

Computing bootstrap scores from a sample of 22993 matches.


Processing bootstrap samples: 100%|██████████| 100/100 [00:06<00:00, 14.87it/s]


model_name,median,p2.5,p97.5
str,f64,f64,f64
"""gemini-2.0-flash-exp""",1141.046782,1118.549819,1161.670632
"""gemma-3-27b""",1137.791747,1110.83339,1168.632882
"""gemini-2.0-flash-001""",1136.193398,1112.836702,1155.937855
"""deepseek-v3-chat""",1115.718863,1097.233955,1131.381339
"""deepseek-v3-0324""",1110.786993,1074.965941,1157.845912
…,…,…,…
"""phi-3.5-mini-instruct""",890.608256,851.020407,915.327981
"""mixtral-8x7b-instruct-v0.1""",886.7996,860.324786,910.511334
"""mixtral-8x22b-instruct-v0.1""",858.512945,843.320127,875.31544
"""mistral-nemo-2407""",854.711631,835.924169,869.653397


# Adding frugality to the score

In [15]:
from rank_comparia.frugality import get_normalized_log_cost, calculate_frugality_score
from rank_comparia.plot import plot_elo_against_frugal_elo

conversations = load_comparia("ministere-culture/comparia-conversations")
conversations = conversations.rename({"model_a_name": "model_a", "model_b_name": "model_b"})

frugality_score = calculate_frugality_score(conversations, None)
graph = plot_elo_against_frugal_elo(
    frugal_log_score=get_normalized_log_cost(frugality_score, mean="token"), bootstraped_scores=scores
)

Using the latest cached version of the dataset since ministere-culture/comparia-conversations couldn't be found on the Hugging Face Hub (offline mode is enabled).
Found the latest cached dataset configuration 'default' at /home/jupyterhub-users/shared/projet_comparia/huggingface_hub/ministere-culture___comparia-conversations/default/0.0.0/dc40af6af1c14e68bf39d55f6e1573d2d6582f19 (last modified on Wed Jun  4 17:40:30 2025).


In [16]:
graph