Add metric identity and deterministic aggregate-instance linkage fields #52
Conversation
Pull request overview
This pull request enhances the evaluation schema with metric identity fields and deterministic linkage capabilities between aggregate and instance-level evaluation records.
Changes:
- Added `evaluation_result_id` field to both aggregate and instance-level schemas as a stable join key
- Added metric identity fields (`metric_id`, `metric_name`, `metric_kind`, `metric_unit`) to support cross-source metric normalization
- Added `metric_parameters` field to capture metric-specific configuration (e.g., k value for pass@k metrics)
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| instance_level_eval.schema.json | Added evaluation_result_id as preferred foreign key and updated evaluation_name description to clarify its role when the new ID is unavailable |
| eval.schema.json | Added evaluation_result_id to evaluation results and five new metric metadata fields (metric_id, metric_name, metric_kind, metric_unit, metric_parameters) to metric_config |
@yananlong Thanks for the PR. Looks already good to me, but it would be useful if you could parse/generate 1-2 concrete examples of how the proposed entries would look from JSONs that are on previous versions of the schema, and how they would look different. Thanks!
The matching keys make a lot of sense. Is "cross-source normalization" a known term? There are quite a lot of fields in the metric; it is unclear to me which of those are necessary, and how I would go about filling them if I were contributing a new evaluation.
Thank you very much, excited! For the breaking changes beyond metric/linkage (interactions → messages, type narrowing, `additionalProperties: false`, new required `eval_library`): maybe let's start by adding a migration script to this PR, as @nelaturuharsha and @evijit suggested?
Building on @nelaturuharsha, @borgr, and @janbatzner: +1 that examples and clearer guidance would make this much easier to adopt safely. I think we should merge with three concrete additions:
One concrete integration gap I verified locally: the repo's generated Pydantic types are currently out of sync with these schema additions.
+1 for adding some way to join the aggregate and instance-level records.
Thanks for the detailed feedback @nelaturuharsha @borgr @janbatzner @mrshu @dmjoy. I pushed a couple updates based on this thread:
If you want the full before/after payloads, the examples below show them.
Example A (RewardBench overall, where `evaluation_name: "Score"` is ambiguous across sources) and Example B (BigCode-style `pass@k`, where `k` is indispensable) are included with full before/after JSON in the updated PR description.
LGTM, thanks for the comments and for addressing the questions!
@yananlong I'm not sure about the approach to handling instances that are associated with more than one evaluation result (comment quote):
At least for some of the HELM outputs & conversions I've played with, there can be 10 or so evaluation result records (for different metrics) across the same set of instances. It seems excessive to just duplicate each instance 10 times with a different evaluation result ID. I understand there are constraints here trying to fit everything into two JSON files, but in a relational database, for example, you might have a join table, since this is a many-to-many relationship (between instances and evaluation results). Maybe a compromise here would be to have the instance-level field accept an array of evaluation result IDs.

EDIT: It may also make sense for this relationship to be implicit if every instance is associated with every evaluation result (and it's only necessary to include the IDs explicitly when that is not the case).
@dmjoy Great point, and I agree this is the main tradeoff. I kept `evaluation_result_id` as a single scalar key so the join stays simple and deterministic.

So in this PR the model is intentionally: one instance-level row = one metric result for one sample, with explicit duplication when the same sample contributes to multiple aggregate results. I agree this is not storage-optimal for HELM-like runs with many metrics. A cleaner many-to-many shape would be a follow-up normalization (e.g., stable instance IDs plus a separate instance-to-result mapping).

I can open a follow-up issue/PR sketch for that normalized design right after this lands if that sounds good.
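The intentional duplication described here can be sketched as follows (a hypothetical minimal illustration of the data shape, not the actual schema; the sample and result ids are invented):

```python
# Each sample is emitted once per aggregate result it contributes to,
# carrying that result's id as the join key.
samples = [{"instance_id": "i1"}, {"instance_id": "i2"}]
result_ids = ["er_exact_match", "er_f1"]  # hypothetical evaluation_result_ids

# One instance-level row = one metric result for one sample.
instance_rows = [
    {**sample, "evaluation_result_id": rid}
    for rid in result_ids
    for sample in samples
]
# 2 samples x 2 metrics -> 4 instance-level rows
```

With 10 metrics over the same instances (the HELM-like case above), this grows linearly in the number of metrics, which is exactly the storage cost being discussed.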
@yananlong Ah I see. I suppose I was assuming that each instance would appear only once rather than being duplicated per evaluation result.

Regarding a more normalized design for the schema, it might be a good exercise to see how that looks / feels, but I don't want to impose. For our use case, at least, being able to associate instances back to evaluation results as you've done here is sufficient.
This PR updates the evaluation result schemas to improve metric identity and enable deterministic aggregate <-> instance joins.
Changes
- Added `evaluation_results[].evaluation_result_id` (aggregate) and `evaluation_result_id` (instance) as a stable join key.
- Added to `evaluation_results[].metric_config`: `metric_id`, `metric_name`, `metric_kind`, `metric_unit`, `metric_parameters`.

Metric identity guidance (all optional)
- `metric_id`: stable identifier for joining/deduping/querying. Use a canonical global id when applicable (e.g. `accuracy`, `f1_macro`, `auroc`, `rmse`, `pass_at_k`). For benchmark/leaderboard-specific metrics, use a namespaced id (e.g. `rewardbench.overall`, `lmarena.elo`).
- `metric_kind`: normalized metric family used for safe aggregation (e.g. `accuracy`, `f1`, `auroc`, `rmse`, `mae`, `pass_rate`, `elo`).
- `metric_name`: display name for the metric (e.g. `Score`, `Accuracy`, `pass@1`).
- `metric_unit`: representation of the numeric values (e.g. `proportion`, `percent`, `points`, `ms`, `tokens`).
- `metric_parameters`: metric-specific configuration (e.g. `{ "k": 1 }` for `pass@k`).

Deterministic `evaluation_result_id` recipe

Suggested construction:
```
evaluation_result_id = "er_" + sha256(canonical_json(payload))
```

- `canonical_json`: JSON with sorted keys and stable separators (no whitespace)
- `payload`: at minimum `{ evaluation_id, evaluation_name, source_data.dataset_name, metric_id, metric_kind, metric_unit, metric_parameters }` (include `generation_config` if you need to disambiguate multiple computations of the same metric)

Before/after examples
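Before turning to the JSON examples, the construction above can be sketched in Python using only the stdlib. The payload values here are illustrative, not normative, and the flat `dataset_name` key stands in for `source_data.dataset_name`:

```python
import hashlib
import json

def make_evaluation_result_id(payload: dict) -> str:
    # Canonical JSON: sorted keys, compact separators (no whitespace).
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "er_" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical payload following the suggested minimum fields.
payload = {
    "evaluation_id": "rewardbench",
    "evaluation_name": "Score",
    "dataset_name": "rewardbench",
    "metric_id": "rewardbench.overall",
    "metric_kind": "preference_rate",
    "metric_unit": "proportion",
    "metric_parameters": {},
}
result_id = make_evaluation_result_id(payload)  # "er_" + 64 hex chars
```

Because keys are sorted before hashing, the id is independent of field insertion order, which is what makes the linkage deterministic across producers.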
Example A (RewardBench overall, where `evaluation_name: "Score"` is ambiguous across sources)

Before:
```json
{
  "evaluation_name": "Score",
  "metric_config": {
    "evaluation_description": "Overall RewardBench Score",
    "lower_is_better": false,
    "score_type": "continuous",
    "min_score": 0.0,
    "max_score": 1.0
  },
  "score_details": { "score": 0.612599 }
}
```

After:
```json
{
  "evaluation_result_id": "er_<sha256(...)>",
  "evaluation_name": "Score",
  "metric_config": {
    "metric_id": "rewardbench.overall",
    "metric_name": "Score",
    "metric_kind": "preference_rate",
    "metric_unit": "proportion",
    "metric_parameters": {},
    "evaluation_description": "Overall RewardBench Score",
    "lower_is_better": false,
    "score_type": "continuous",
    "min_score": 0.0,
    "max_score": 1.0
  },
  "score_details": { "score": 0.612599 }
}
```

Example B (`pass@k`, where `k` is indispensable)

Before:
```json
{
  "evaluation_name": "multiple-js",
  "metric_config": {
    "evaluation_description": "pass@1 on multiple-js",
    "lower_is_better": false,
    "score_type": "continuous",
    "min_score": 0.0,
    "max_score": 1.0
  },
  "score_details": { "score": 0.189814 }
}
```

After:
```json
{
  "evaluation_result_id": "er_<sha256(...)>",
  "evaluation_name": "multiple-js",
  "metric_config": {
    "metric_id": "pass_at_k",
    "metric_name": "pass@1",
    "metric_kind": "pass_rate",
    "metric_unit": "proportion",
    "metric_parameters": { "k": 1 },
    "evaluation_description": "pass@1 on multiple-js",
    "lower_is_better": false,
    "score_type": "continuous",
    "min_score": 0.0,
    "max_score": 1.0
  },
  "score_details": { "score": 0.189814 }
}
```
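Once both files carry the key, reassembling instance rows under their aggregate result is a plain group-by join. A minimal sketch (the record shapes are heavily trimmed illustrations, not the full `eval.schema.json` / `instance_level_eval.schema.json` structures):

```python
from collections import defaultdict

# Illustrative trimmed records; ids and scores are invented.
aggregate_results = [
    {"evaluation_result_id": "er_aaa", "evaluation_name": "Score"},
    {"evaluation_result_id": "er_bbb", "evaluation_name": "multiple-js"},
]
instances = [
    {"evaluation_result_id": "er_aaa", "instance_id": "i1", "score": 1.0},
    {"evaluation_result_id": "er_aaa", "instance_id": "i2", "score": 0.0},
    {"evaluation_result_id": "er_bbb", "instance_id": "i1", "score": 1.0},
]

# Group instance rows by the stable join key.
by_result = defaultdict(list)
for inst in instances:
    by_result[inst["evaluation_result_id"]].append(inst)

# Attach each aggregate result to its instance rows.
joined = {
    agg["evaluation_result_id"]: {
        "aggregate": agg,
        "instances": by_result.get(agg["evaluation_result_id"], []),
    }
    for agg in aggregate_results
}
```

This is the deterministic aggregate ↔ instance join the PR title refers to: no name-matching heuristics on `evaluation_name` are needed once `evaluation_result_id` is present on both sides.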