Add metric identity and deterministic aggregate-instance linkage fields by yananlong · Pull Request #52 · evaleval/every_eval_ever

yananlong · 2026-02-23T10:13:22Z

This PR updates the evaluation result schemas to improve metric identity and enable deterministic aggregate <-> instance joins.

Changes

Add evaluation_results[].evaluation_result_id (aggregate) and evaluation_result_id (instance) as a stable join key.
Add optional metric identity fields under evaluation_results[].metric_config: metric_id, metric_name, metric_kind, metric_unit, metric_parameters.

Metric identity guidance (all optional)

metric_id: stable identifier for joining/deduping/querying. Use a canonical global id when applicable (e.g. accuracy, f1_macro, auroc, rmse, pass_at_k). For benchmark/leaderboard-specific metrics, use a namespaced id (e.g. rewardbench.overall, lmarena.elo).
metric_kind: normalized metric family used for safe aggregation (e.g. accuracy, f1, auroc, rmse, mae, pass_rate, elo).
metric_name: display name for the metric (e.g. Score, Accuracy, pass@1).
metric_unit: representation of the numeric values (e.g. proportion, percent, points, ms, tokens).
metric_parameters: metric-specific configuration (e.g. { "k": 1 } for pass@k).

Deterministic evaluation_result_id recipe
Suggested construction:

evaluation_result_id = "er_" + sha256(canonical_json(payload))
canonical_json: JSON with sorted keys and stable separators (no whitespace)
payload: at minimum { evaluation_id, evaluation_name, source_data.dataset_name, metric_id, metric_kind, metric_unit, metric_parameters } (include generation_config if you need to disambiguate multiple computations of the same metric)

Before/after examples

Example A (RewardBench overall, where evaluation_name: "Score" is ambiguous across sources):

Before:

{
  "evaluation_name": "Score",
  "metric_config": {
    "evaluation_description": "Overall RewardBench Score",
    "lower_is_better": false,
    "score_type": "continuous",
    "min_score": 0.0,
    "max_score": 1.0
  },
  "score_details": { "score": 0.612599 }
}

After:

{
  "evaluation_result_id": "er_<sha256(...)>",
  "evaluation_name": "Score",
  "metric_config": {
    "metric_id": "rewardbench.overall",
    "metric_name": "Score",
    "metric_kind": "preference_rate",
    "metric_unit": "proportion",
    "metric_parameters": {},
    "evaluation_description": "Overall RewardBench Score",
    "lower_is_better": false,
    "score_type": "continuous",
    "min_score": 0.0,
    "max_score": 1.0
  },
  "score_details": { "score": 0.612599 }
}

Example B (pass@k, where k is indispensable):

Before:

{
  "evaluation_name": "multiple-js",
  "metric_config": {
    "evaluation_description": "pass@1 on multiple-js",
    "lower_is_better": false,
    "score_type": "continuous",
    "min_score": 0.0,
    "max_score": 1.0
  },
  "score_details": { "score": 0.189814 }
}

After:

{
  "evaluation_result_id": "er_<sha256(...)>",
  "evaluation_name": "multiple-js",
  "metric_config": {
    "metric_id": "pass_at_k",
    "metric_name": "pass@1",
    "metric_kind": "pass_rate",
    "metric_unit": "proportion",
    "metric_parameters": { "k": 1 },
    "evaluation_description": "pass@1 on multiple-js",
    "lower_is_better": false,
    "score_type": "continuous",
    "min_score": 0.0,
    "max_score": 1.0
  },
  "score_details": { "score": 0.189814 }
}

Copilot

Pull request overview

This pull request enhances the evaluation schema with metric identity fields and deterministic linkage capabilities between aggregate and instance-level evaluation records.

Changes:

Added evaluation_result_id field to both aggregate and instance-level schemas as a stable join key
Added metric identity fields (metric_id, metric_name, metric_kind, metric_unit) to support cross-source metric normalization
Added metric_parameters field to capture metric-specific configuration (e.g., k value for pass@k metrics)

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
instance_level_eval.schema.json	Added `evaluation_result_id` as preferred foreign key and updated `evaluation_name` description to clarify its role when the new ID is unavailable
eval.schema.json	Added `evaluation_result_id` to evaluation results and five new metric metadata fields (`metric_id`, `metric_name`, `metric_kind`, `metric_unit`, `metric_parameters`) to `metric_config`

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

nelaturuharsha · 2026-02-25T10:30:49Z

@yananlong Thanks for the PR. Looks already good to me, but what would be useful if you could parse/generate 1-2 concrete examples of how the proposed entries would look from JSONs that are on previous versions of the schema and how they would look different. Thanks!

borgr · 2026-02-25T19:52:16Z

The matching keys make a lot of sense

"cross-source normalization" is that a known term?

There's quite a lot of fields in the metric, it is unclear to me which of those are necessary and how would I get about to fill them if I was contributing a new evaluation.
Why do we need a "kind" as opposed to name and id?
Also unit, is that something that we need to write ourselves? won't it always be deducible from the rest?
parameters and "more" sounds good

janbatzner · 2026-02-26T16:33:42Z

Thank you very much, excited!
+1 on @borgr and @nelaturuharsha - examples would really help! 👍 Also finding it hard to tell metric_id apart from metric_kind based on the descriptions.

For evaluation_result_id, how should converters build it? "Deterministic" without a formula would risk everyone generating different IDs, no?

For the breaking changes beyond metric/linkage (interactions → messages, type narrowing, additionalProperties: false, new required eval_library). Let's start maybe add a migration script to this PR as @nelaturuharsha and @evijit suggested?

mrshu · 2026-02-26T17:01:57Z

Building on @nelaturuharsha, @borgr, and @janbatzner: +1 that examples and clearer guidance would make this much easier to adopt safely.

I think we should merge with three concrete additions:

Add 1-2 before/after JSON examples (old payload -> new payload) showing which of the new fields are optional vs expected.
Define a deterministic construction recipe for evaluation_result_id (inputs + normalization rules), so converters don’t generate incompatible IDs.
Add a migration note/script for existing records, as suggested in-thread.

One concrete integration gap I verified locally: the repo’s generated Pydantic types are currently out of sync with these schema additions (evaluation_result_id / new metric_* fields), so converters using typed models won’t yet preserve/accept the new fields end-to-end. It would help to either regenerate those files in this PR or explicitly scope that as immediate follow-up.

dmjoy · 2026-03-02T21:13:25Z

+1 for adding some way to join the InstanceLevelEvaluationLog records back to the EvaluationResult(s) they're associated with. Though I believe a single instance record can be associated with more than one EvaluationResult based on what I've seen in some outputs converted from HELM where there are normally several metrics reported across a run, or where you might have metrics on a full dataset (and then in the same file, on subsets of the same dataset, e.g.: https://huggingface.co/datasets/evaleval/EEE_datastore/blob/main/data/helm_mmlu/allenai/olmo-1.7-7b/c1c79360-60bd-4f5d-a746-e0411b94f69b.json)

yananlong · 2026-03-04T18:24:54Z

Thanks for the detailed feedback @nelaturuharsha @borgr @janbatzner @mrshu @dmjoy.

I pushed a couple updates based on this thread:

PR description now includes: (1) a concrete metric_id convention (global ids when applicable, otherwise namespaced ids like rewardbench.overall / lmarena.elo), (2) a deterministic evaluation_result_id recipe, and (3) 2 before/after JSON examples.
Schema text: metric_id description explicitly mentions namespacing (eval.schema.json), and instance-to-aggregate linkage semantics are clarified (instance_level_eval.schema.json) so one instance record links to one aggregate result; emit multiple instance rows when a sample contributes to multiple aggregate results.

If you want the evaluation_result_id recipe to always include/exclude generation_config (vs only when needed to disambiguate), I can tighten that wording.

yananlong · 2026-03-04T18:31:22Z

@yananlong Thanks for the PR. Looks already good to me, but what would be useful if you could parse/generate 1-2 concrete examples of how the proposed entries would look from JSONs that are on previous versions of the schema and how they would look different. Thanks!

Example A (RewardBench overall, where evaluation_name: "Score" is ambiguous across sources):

Before:
{ "evaluation_name": "Score", "metric_config": { "evaluation_description": "Overall RewardBench Score", "lower_is_better": false, "score_type": "continuous", "min_score": 0.0, "max_score": 1.0 }, "score_details": { "score": 0.612599 } }
After:
{ "evaluation_result_id": "er_<sha256(...)>", "evaluation_name": "Score", "metric_config": { "metric_id": "rewardbench_overall", "metric_name": "Score", "metric_kind": "preference_rate", "metric_unit": "proportion", "metric_parameters": {}, "evaluation_description": "Overall RewardBench Score", "lower_is_better": false, "score_type": "continuous", "min_score": 0.0, "max_score": 1.0 }, "score_details": { "score": 0.612599 } }

Example B (BigCode-style pass@k, where k is indispensable):

Before: { "evaluation_name": "multiple-js", "metric_config": { "evaluation_description": "pass@1 on multiple-js", "lower_is_better": false, "score_type": "continuous", "min_score": 0.0, "max_score": 1.0 }, "score_details": { "score": 0.189814 } }
After: { "evaluation_result_id": "er_<sha256(...)>", "evaluation_name": "multiple-js", "metric_config": { "metric_id": "pass_at_k", "metric_name": "pass@1", "metric_kind": "pass_rate", "metric_unit": "proportion", "metric_parameters": { "k": 1 }, "evaluation_description": "pass@1 on multiple-js", "lower_is_better": false, "score_type": "continuous", "min_score": 0.0, "max_score": 1.0 }, "score_details": { "score": 0.189814 } }

nelaturuharsha · 2026-03-05T16:14:47Z

LGTM, thanks for comments and addressing the questions!

dmjoy · 2026-03-05T17:43:06Z

@yananlong I'm not sure about the approach to handling instances that are associated with more than one evaluation result (comment quote):

if a single underlying interaction/sample contributes to multiple aggregate results, emit multiple instance records with different evaluation_result_id values.

At least for some of the HELM outputs & conversions I've played with, there can be 10 or so evaluation result records (for different metrics) across the same set of instances. It seems excessive to just duplicate each instance 10 times with a different evaluation result ID. I understand there are constraints here trying to fit everything into two JSON files, but in a relation database for example you might have a join table since this is a many-to-many relationship (between instances and evaluation results). Maybe a compromise here would be to have the evaluation_result_id foreign key in the instances be a list instead that can reference all of the EvaluationResults that they're associated with. (There are of course other ways to handle this but the list of evaluation_result_ids seems like the shortest path from where this PR is currently)

EDIT: It may also make sense to have this relationship to be implicit if every instance is associated with every evaluation result (and it's only necessary to include the evaluation_result_id foreign key(s) in the results where this isn't true)

yananlong · 2026-03-06T04:15:55Z

@dmjoy Great point, and I agree this is the main tradeoff.

I kept evaluation_result_id as a single FK because each instance row currently has a single evaluation.score / evaluation.is_correct. If we make evaluation_result_id a list, the row no longer says which score belongs to which aggregate metric result (unless we also redesign evaluation into a per-metric map/list).

So in this PR the model is intentionally: one instance-level row = one metric result for one sample, with explicit duplication when the same sample contributes to multiple aggregate results.

I agree this is not storage-optimal for HELM-like runs with many metrics. A cleaner many-to-many shape would be a follow-up normalization (e.g., stable instance_id + separate instance↔metric-result table/object), but that is a larger schema change than this PR.

I can open a follow-up issue/PR sketch for that normalized design right after this lands if that sounds good.

dmjoy · 2026-03-06T13:00:58Z

@yananlong Ah I see. I suppose I was assuming that evaluation.score and evaluation.is_correct were somehow agnostic of the evaluation_result, but if that's not the case then yes I suppose the approach you've taken here is kind of necessary (without a bigger restructure / refactor of the schema).

Regarding a more normalized design for the schema, it might be a good exercise to see how that looks / feels, but I don't want to impose. For our use case at least being able to associate instances back to evaluation results as you've done here is sufficient.

Add metric identity and deterministic aggregate-instance linkage fields

e5a77f2

Copilot AI review requested due to automatic review settings February 23, 2026 10:13

Copilot started reviewing on behalf of yananlong February 23, 2026 10:13 View session

Copilot AI reviewed Feb 23, 2026

View reviewed changes

nelaturuharsha assigned akornilo, borgr, nelaturuharsha, janbatzner, damian1996 and evijit Feb 25, 2026

yananlong added 2 commits March 4, 2026 23:23

Clarify instance-to-aggregate linkage semantics

3a3667f

Clarify metric_id namespacing guidance

8c616fd

nelaturuharsha merged commit 589c45b into evaleval:main Mar 6, 2026

yananlong deleted the schema-metric-linkage branch April 18, 2026 18:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add metric identity and deterministic aggregate-instance linkage fields#52

Add metric identity and deterministic aggregate-instance linkage fields#52
nelaturuharsha merged 3 commits intoevaleval:mainfrom
yananlong:schema-metric-linkage

yananlong commented Feb 23, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

nelaturuharsha commented Feb 25, 2026

Uh oh!

borgr commented Feb 25, 2026

Uh oh!

janbatzner commented Feb 26, 2026

Uh oh!

mrshu commented Feb 26, 2026

Uh oh!

dmjoy commented Mar 2, 2026

Uh oh!

yananlong commented Mar 4, 2026

Uh oh!

yananlong commented Mar 4, 2026

Uh oh!

nelaturuharsha commented Mar 5, 2026

Uh oh!

dmjoy commented Mar 5, 2026 •

edited

Loading

Uh oh!

yananlong commented Mar 6, 2026

Uh oh!

dmjoy commented Mar 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

10 participants

Conversation

yananlong commented Feb 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

nelaturuharsha commented Feb 25, 2026

Uh oh!

borgr commented Feb 25, 2026

Uh oh!

janbatzner commented Feb 26, 2026

Uh oh!

mrshu commented Feb 26, 2026

Uh oh!

dmjoy commented Mar 2, 2026

Uh oh!

yananlong commented Mar 4, 2026

Uh oh!

yananlong commented Mar 4, 2026

Example A (RewardBench overall, where evaluation_name: "Score" is ambiguous across sources):

Example B (BigCode-style pass@k, where k is indispensable):

Uh oh!

nelaturuharsha commented Mar 5, 2026

Uh oh!

dmjoy commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yananlong commented Mar 6, 2026

Uh oh!

dmjoy commented Mar 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

10 participants

yananlong commented Feb 23, 2026 •

edited

Loading

dmjoy commented Mar 5, 2026 •

edited

Loading