Add metric identity and deterministic aggregate-instance linkage fields#52

Merged
nelaturuharsha merged 3 commits into evaleval:main from yananlong:schema-metric-linkage
Mar 6, 2026

Conversation

@yananlong
Contributor

@yananlong yananlong commented Feb 23, 2026

This PR updates the evaluation result schemas to improve metric identity and enable deterministic aggregate <-> instance joins.

Changes

  • Add evaluation_results[].evaluation_result_id (aggregate) and evaluation_result_id (instance) as a stable join key.
  • Add optional metric identity fields under evaluation_results[].metric_config: metric_id, metric_name, metric_kind, metric_unit, metric_parameters.

Metric identity guidance (all optional)

  • metric_id: stable identifier for joining/deduping/querying. Use a canonical global id when applicable (e.g. accuracy, f1_macro, auroc, rmse, pass_at_k). For benchmark/leaderboard-specific metrics, use a namespaced id (e.g. rewardbench.overall, lmarena.elo).
  • metric_kind: normalized metric family used for safe aggregation (e.g. accuracy, f1, auroc, rmse, mae, pass_rate, elo).
  • metric_name: display name for the metric (e.g. Score, Accuracy, pass@1).
  • metric_unit: representation of the numeric values (e.g. proportion, percent, points, ms, tokens).
  • metric_parameters: metric-specific configuration (e.g. { "k": 1 } for pass@k).

Deterministic evaluation_result_id recipe
Suggested construction:

  • evaluation_result_id = "er_" + sha256(canonical_json(payload))
  • canonical_json: JSON with sorted keys and stable separators (no whitespace)
  • payload: at minimum { evaluation_id, evaluation_name, source_data.dataset_name, metric_id, metric_kind, metric_unit, metric_parameters } (include generation_config if you need to disambiguate multiple computations of the same metric)
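
The recipe above can be sketched in a few lines of Python (a minimal illustration; the payload fields follow the suggested construction and the values are placeholders, not a normative list):

```python
import hashlib
import json

def make_evaluation_result_id(payload: dict) -> str:
    # Canonical JSON: keys sorted, compact separators, no whitespace.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "er_" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Illustrative payload for the pass@1 example below (field values are made up):
payload = {
    "evaluation_id": "bigcode/multiple-js",
    "evaluation_name": "multiple-js",
    "source_data": {"dataset_name": "multiple-js"},
    "metric_id": "pass_at_k",
    "metric_kind": "pass_rate",
    "metric_unit": "proportion",
    "metric_parameters": {"k": 1},
}
result_id = make_evaluation_result_id(payload)  # "er_" + 64 hex chars
```

Because the serialization is canonical, any converter that hashes the same payload produces the same ID regardless of key insertion order.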

Before/after examples

Example A (RewardBench overall, where evaluation_name: "Score" is ambiguous across sources):

Before:

{
  "evaluation_name": "Score",
  "metric_config": {
    "evaluation_description": "Overall RewardBench Score",
    "lower_is_better": false,
    "score_type": "continuous",
    "min_score": 0.0,
    "max_score": 1.0
  },
  "score_details": { "score": 0.612599 }
}

After:

{
  "evaluation_result_id": "er_<sha256(...)>",
  "evaluation_name": "Score",
  "metric_config": {
    "metric_id": "rewardbench.overall",
    "metric_name": "Score",
    "metric_kind": "preference_rate",
    "metric_unit": "proportion",
    "metric_parameters": {},
    "evaluation_description": "Overall RewardBench Score",
    "lower_is_better": false,
    "score_type": "continuous",
    "min_score": 0.0,
    "max_score": 1.0
  },
  "score_details": { "score": 0.612599 }
}

Example B (pass@k, where k is indispensable):

Before:

{
  "evaluation_name": "multiple-js",
  "metric_config": {
    "evaluation_description": "pass@1 on multiple-js",
    "lower_is_better": false,
    "score_type": "continuous",
    "min_score": 0.0,
    "max_score": 1.0
  },
  "score_details": { "score": 0.189814 }
}

After:

{
  "evaluation_result_id": "er_<sha256(...)>",
  "evaluation_name": "multiple-js",
  "metric_config": {
    "metric_id": "pass_at_k",
    "metric_name": "pass@1",
    "metric_kind": "pass_rate",
    "metric_unit": "proportion",
    "metric_parameters": { "k": 1 },
    "evaluation_description": "pass@1 on multiple-js",
    "lower_is_better": false,
    "score_type": "continuous",
    "min_score": 0.0,
    "max_score": 1.0
  },
  "score_details": { "score": 0.189814 }
}
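
With the shared key in place, linking instance records back to their aggregate result becomes a plain dictionary join. A sketch (record shapes are abbreviated from the examples above):

```python
from collections import defaultdict

def group_instances_by_result(aggregates: list, instances: list) -> dict:
    """Map each aggregate's evaluation_result_id to its instance-level records."""
    buckets = defaultdict(list)
    for inst in instances:
        buckets[inst["evaluation_result_id"]].append(inst)
    return {
        agg["evaluation_result_id"]: buckets.get(agg["evaluation_result_id"], [])
        for agg in aggregates
    }

aggregates = [{"evaluation_result_id": "er_abc", "evaluation_name": "multiple-js"}]
instances = [
    {"evaluation_result_id": "er_abc", "evaluation": {"score": 1.0}},
    {"evaluation_result_id": "er_abc", "evaluation": {"score": 0.0}},
]
linked = group_instances_by_result(aggregates, instances)
print(len(linked["er_abc"]))  # → 2
```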

Copilot AI review requested due to automatic review settings February 23, 2026 10:13
Contributor

Copilot AI left a comment


Pull request overview

This pull request enhances the evaluation schema with metric identity fields and deterministic linkage capabilities between aggregate and instance-level evaluation records.

Changes:

  • Added evaluation_result_id field to both aggregate and instance-level schemas as a stable join key
  • Added metric identity fields (metric_id, metric_name, metric_kind, metric_unit) to support cross-source metric normalization
  • Added metric_parameters field to capture metric-specific configuration (e.g., k value for pass@k metrics)

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

Files reviewed:

  • instance_level_eval.schema.json — Added evaluation_result_id as the preferred foreign key and updated the evaluation_name description to clarify its role when the new ID is unavailable
  • eval.schema.json — Added evaluation_result_id to evaluation results and five new metric metadata fields (metric_id, metric_name, metric_kind, metric_unit, metric_parameters) to metric_config


@nelaturuharsha
Collaborator

@yananlong Thanks for the PR. Already looks good to me, but it would be useful if you could parse/generate 1-2 concrete examples of how the proposed entries would look for JSONs on previous versions of the schema, and how they would differ. Thanks!

@borgr
Collaborator

borgr commented Feb 25, 2026

The matching keys make a lot of sense.

Is "cross-source normalization" a known term?

There are quite a lot of fields in the metric; it is unclear to me which of those are necessary and how I would go about filling them if I were contributing a new evaluation.
Why do we need a "kind" as opposed to a name and id?
Also, unit: is that something we need to write ourselves? Won't it always be deducible from the rest?
parameters and "more" sound good.

@janbatzner
Collaborator

Thank you very much, excited!
+1 on @borgr and @nelaturuharsha - examples would really help! 👍 Also finding it hard to tell metric_id apart from metric_kind based on the descriptions.

For evaluation_result_id, how should converters build it? "Deterministic" without a formula would risk everyone generating different IDs, no?

For the breaking changes beyond metric/linkage (interactions → messages, type narrowing, additionalProperties: false, new required eval_library): maybe let's start by adding a migration script to this PR, as @nelaturuharsha and @evijit suggested?

@mrshu
Contributor

mrshu commented Feb 26, 2026

Building on @nelaturuharsha, @borgr, and @janbatzner: +1 that examples and clearer guidance would make this much easier to adopt safely.

I think we should merge with three concrete additions:

  1. Add 1-2 before/after JSON examples (old payload -> new payload) showing which of the new fields are optional vs expected.
  2. Define a deterministic construction recipe for evaluation_result_id (inputs + normalization rules), so converters don’t generate incompatible IDs.
  3. Add a migration note/script for existing records, as suggested in-thread.

One concrete integration gap I verified locally: the repo’s generated Pydantic types are currently out of sync with these schema additions (evaluation_result_id / new metric_* fields), so converters using typed models won’t yet preserve/accept the new fields end-to-end. It would help to either regenerate those files in this PR or explicitly scope that as immediate follow-up.

@dmjoy

dmjoy commented Mar 2, 2026

+1 for adding some way to join the InstanceLevelEvaluationLog records back to the EvaluationResult(s) they're associated with. Though I believe a single instance record can be associated with more than one EvaluationResult, based on what I've seen in some outputs converted from HELM, where several metrics are normally reported across a run, or where you might have metrics on a full dataset (and then, in the same file, on subsets of that dataset), e.g.: https://huggingface.co/datasets/evaleval/EEE_datastore/blob/main/data/helm_mmlu/allenai/olmo-1.7-7b/c1c79360-60bd-4f5d-a746-e0411b94f69b.json

@yananlong
Contributor Author

Thanks for the detailed feedback @nelaturuharsha @borgr @janbatzner @mrshu @dmjoy.

I pushed a couple of updates based on this thread:

  • PR description now includes: (1) a concrete metric_id convention (global ids when applicable, otherwise namespaced ids like rewardbench.overall / lmarena.elo), (2) a deterministic evaluation_result_id recipe, and (3) 2 before/after JSON examples.
  • Schema text: metric_id description explicitly mentions namespacing (eval.schema.json), and instance-to-aggregate linkage semantics are clarified (instance_level_eval.schema.json) so one instance record links to one aggregate result; emit multiple instance rows when a sample contributes to multiple aggregate results.

If you want the evaluation_result_id recipe to always include/exclude generation_config (vs only when needed to disambiguate), I can tighten that wording.

@yananlong
Contributor Author

@yananlong Thanks for the PR. Already looks good to me, but it would be useful if you could parse/generate 1-2 concrete examples of how the proposed entries would look for JSONs on previous versions of the schema, and how they would differ. Thanks!

Example A (RewardBench overall, where evaluation_name: "Score" is ambiguous across sources):

Before:

{
  "evaluation_name": "Score",
  "metric_config": {
    "evaluation_description": "Overall RewardBench Score",
    "lower_is_better": false,
    "score_type": "continuous",
    "min_score": 0.0,
    "max_score": 1.0
  },
  "score_details": { "score": 0.612599 }
}

After:

{
  "evaluation_result_id": "er_<sha256(...)>",
  "evaluation_name": "Score",
  "metric_config": {
    "metric_id": "rewardbench.overall",
    "metric_name": "Score",
    "metric_kind": "preference_rate",
    "metric_unit": "proportion",
    "metric_parameters": {},
    "evaluation_description": "Overall RewardBench Score",
    "lower_is_better": false,
    "score_type": "continuous",
    "min_score": 0.0,
    "max_score": 1.0
  },
  "score_details": { "score": 0.612599 }
}

Example B (BigCode-style pass@k, where k is indispensable):

Before:

{
  "evaluation_name": "multiple-js",
  "metric_config": {
    "evaluation_description": "pass@1 on multiple-js",
    "lower_is_better": false,
    "score_type": "continuous",
    "min_score": 0.0,
    "max_score": 1.0
  },
  "score_details": { "score": 0.189814 }
}

After:

{
  "evaluation_result_id": "er_<sha256(...)>",
  "evaluation_name": "multiple-js",
  "metric_config": {
    "metric_id": "pass_at_k",
    "metric_name": "pass@1",
    "metric_kind": "pass_rate",
    "metric_unit": "proportion",
    "metric_parameters": { "k": 1 },
    "evaluation_description": "pass@1 on multiple-js",
    "lower_is_better": false,
    "score_type": "continuous",
    "min_score": 0.0,
    "max_score": 1.0
  },
  "score_details": { "score": 0.189814 }
}

@nelaturuharsha
Collaborator

LGTM, thanks for comments and addressing the questions!

@dmjoy

dmjoy commented Mar 5, 2026

@yananlong I'm not sure about the approach to handling instances that are associated with more than one evaluation result (comment quote):

if a single underlying interaction/sample contributes to multiple aggregate results, emit multiple instance records with different evaluation_result_id values.

At least for some of the HELM outputs & conversions I've played with, there can be 10 or so evaluation result records (for different metrics) across the same set of instances, and it seems excessive to just duplicate each instance 10 times with a different evaluation result ID. I understand there are constraints here in trying to fit everything into two JSON files, but in a relational database, for example, you might have a join table, since this is a many-to-many relationship (between instances and evaluation results). Maybe a compromise would be to make the evaluation_result_id foreign key in the instances a list instead, so it can reference all of the EvaluationResults they're associated with. (There are of course other ways to handle this, but a list of evaluation_result_ids seems like the shortest path from where this PR currently stands.)

EDIT: It may also make sense to have this relationship to be implicit if every instance is associated with every evaluation result (and it's only necessary to include the evaluation_result_id foreign key(s) in the results where this isn't true)

@yananlong
Contributor Author

@dmjoy Great point, and I agree this is the main tradeoff.

I kept evaluation_result_id as a single FK because each instance row currently has a single evaluation.score / evaluation.is_correct. If we make evaluation_result_id a list, the row no longer says which score belongs to which aggregate metric result (unless we also redesign evaluation into a per-metric map/list).

So in this PR the model is intentionally: one instance-level row = one metric result for one sample, with explicit duplication when the same sample contributes to multiple aggregate results.

I agree this is not storage-optimal for HELM-like runs with many metrics. A cleaner many-to-many shape would be a follow-up normalization (e.g., stable instance_id + separate instance↔metric-result table/object), but that is a larger schema change than this PR.
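
Concretely, the one-row-per-metric model fans out like this (a hypothetical sketch; the sample_id field and record shapes are illustrative, not part of the schema):

```python
def fan_out_instance_rows(sample_id: str, per_metric_scores: dict) -> list:
    """Emit one instance-level row per (sample, aggregate metric result) pair."""
    return [
        {
            "sample_id": sample_id,  # hypothetical field, for illustration only
            "evaluation_result_id": result_id,
            "evaluation": {"score": score},
        }
        for result_id, score in per_metric_scores.items()
    ]

# One sample contributing to two aggregate results yields two instance rows:
rows = fan_out_instance_rows("q42", {"er_acc": 1.0, "er_f1": 0.5})
print(len(rows))  # → 2
```

Each row keeps a single evaluation.score tied unambiguously to one aggregate metric result, at the cost of duplication for multi-metric runs.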

I can open a follow-up issue/PR sketch for that normalized design right after this lands if that sounds good.

@dmjoy

dmjoy commented Mar 6, 2026

@yananlong Ah I see. I suppose I was assuming that evaluation.score and evaluation.is_correct were somehow agnostic of the evaluation_result, but if that's not the case then yes I suppose the approach you've taken here is kind of necessary (without a bigger restructure / refactor of the schema).

Regarding a more normalized design for the schema, it might be a good exercise to see how that looks / feels, but I don't want to impose. For our use case at least being able to associate instances back to evaluation results as you've done here is sufficient.

@nelaturuharsha nelaturuharsha merged commit 589c45b into evaleval:main Mar 6, 2026
@yananlong yananlong deleted the schema-metric-linkage branch April 18, 2026 18:23

10 participants