feat(eval): add dataset-level evaluator framework with precision/recall/f-score #1669
Closed
ajay-kesavan wants to merge 2 commits into
…ll/f-score
Introduces a new BaseDatasetEvaluator concept that runs once per evaluation set after all per-datapoint evaluators complete. It consumes per-datapoint EvaluationResultDto values from a named source evaluator and emits a single run-level EvaluationResult.
Includes three starter evaluators for multiclass classification metrics:
- PrecisionDatasetEvaluator
- RecallDatasetEvaluator
- FScoreDatasetEvaluator (configurable beta)
Each takes a required classes list (populated from the UI), supports micro or macro averaging, and emits per-class TP/TN/FP/FN plus the confusion matrix in details. Binary is the 2-class case; there is no separate binary path.
Architecture: BaseDatasetEvaluator is a parallel hierarchy to GenericBaseEvaluator (not a subclass) so the per-datapoint dispatch loop cannot accidentally pick up a dataset evaluator. Each dataset evaluator declares a single source_evaluator by name; the runtime groups per-datapoint results by evaluator name and routes the right list to each dataset evaluator.
Configs load from <eval_set>/../dataset_evaluators/*.json mirroring the evaluators directory layout.
Patch version bumped: 2.10.68 -> 2.10.69.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
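The grouping and routing described in this commit message can be sketched roughly as follows. This is a minimal illustration and not the PR's code: ResultDto, DatasetEvaluatorSketch, MeanScore, and run_dataset_evaluators are hypothetical names, and raising ValueError for a source with no results is an assumption made only for this sketch.

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ResultDto:
    """Hypothetical stand-in for one per-datapoint EvaluationResultDto."""
    evaluator_name: str
    score: float


class DatasetEvaluatorSketch(ABC):
    """Separate base class: evaluate() takes a list of results, not one datapoint."""

    def __init__(self, name: str, source_evaluator: str) -> None:
        self.name = name
        self.source_evaluator = source_evaluator

    @abstractmethod
    def evaluate(self, results: list[ResultDto]) -> float:
        ...


class MeanScore(DatasetEvaluatorSketch):
    """Toy aggregator, used only to exercise the routing."""

    def evaluate(self, results: list[ResultDto]) -> float:
        return sum(r.score for r in results) / len(results) if results else 0.0


def run_dataset_evaluators(per_datapoint, dataset_evaluators):
    # Group per-datapoint results by evaluator name exactly once.
    grouped = defaultdict(list)
    for result in per_datapoint:
        grouped[result.evaluator_name].append(result)

    scores = {}
    for evaluator in dataset_evaluators:
        # Assumption: an unknown source is an error rather than an empty list.
        if evaluator.source_evaluator not in grouped:
            raise ValueError(f"no results for source {evaluator.source_evaluator!r}")
        # Route only the matching list to this dataset evaluator.
        scores[evaluator.name] = evaluator.evaluate(grouped[evaluator.source_evaluator])
    return scores


results = [
    ResultDto("intent", 1.0),
    ResultDto("intent", 0.0),
    ResultDto("latency", 0.5),
]
scores = run_dataset_evaluators(results, [MeanScore("intent_mean", "intent")])
print(scores)
```

Because the sketch's dataset evaluators never inherit from a per-datapoint base class, a loop that iterates over per-datapoint evaluators has no way to pick one up by type.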
…10.69
examples/dataset_evaluators_demo.py walks the new dataset-level evaluators
(Precision / Recall / F-score) through five scenarios that exercise the
math end-to-end at the SDK layer:
1. Balanced 3-class — symmetric confusion matrix, macro == micro
2. Imbalanced 2-class — shows where macro and micro diverge
3. Same data, four metrics (Precision, Recall, F1, F2) — proves the
F-beta knob actually moves per-class numbers
4. Out-of-vocab + malformed details — n_skipped surfaces, no silent drops
5. Realistic 4-class intent classifier — uneven per-class performance
Each scenario prints the confusion matrix as a table, the per-class
TP/TN/FP/FN + the metric, and a snippet of the wire JSON that AutoMapper
will surface to the frontend.
Run::
cd packages/uipath && uv run python examples/dataset_evaluators_demo.py
uv.lock reflects the pyproject.toml version bump (2.10.68 -> 2.10.69)
already in this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
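To make scenarios 2 and 4 concrete, here is a small sketch of how macro and micro precision can diverge on imbalanced two-class data while skipped rows stay visible. It is not the demo script: build_confusion, per_class_counts, and precision are hypothetical helpers, the (expected, predicted) tuple shape is an assumption, and returning 0.0 for an empty denominator is an assumption as well.

```python
from collections import Counter


def build_confusion(pairs, classes):
    """Count (expected, predicted) pairs; count, rather than drop, unusable rows."""
    known = set(classes)
    matrix = Counter()
    n_skipped = 0
    for pair in pairs:
        well_formed = isinstance(pair, tuple) and len(pair) == 2
        # Malformed rows and out-of-vocab labels are skipped but tallied.
        if not well_formed or pair[0] not in known or pair[1] not in known:
            n_skipped += 1
            continue
        matrix[pair] += 1
    return matrix, n_skipped


def per_class_counts(matrix, classes):
    """Derive one-vs-rest TP/FP/FN/TN for each class from the matrix."""
    total = sum(matrix.values())
    counts = {}
    for c in classes:
        tp = matrix[(c, c)]
        fp = sum(matrix[(e, c)] for e in classes if e != c)
        fn = sum(matrix[(c, p)] for p in classes if p != c)
        counts[c] = {"tp": tp, "fp": fp, "fn": fn, "tn": total - tp - fp - fn}
    return counts


def precision(counts, average):
    def ratio(tp, fp):
        # Assumption: precision is 0.0 when a class was never predicted.
        return tp / (tp + fp) if tp + fp else 0.0

    if average == "micro":
        tp = sum(c["tp"] for c in counts.values())
        fp = sum(c["fp"] for c in counts.values())
        return ratio(tp, fp)
    # Macro: unweighted mean of per-class precision.
    return sum(ratio(c["tp"], c["fp"]) for c in counts.values()) / len(counts)


classes = ["pos", "neg"]
pairs = (
    [("neg", "neg")] * 8
    + [("pos", "pos"), ("pos", "neg")]
    + [("pos", "maybe"), None]  # one out-of-vocab label, one malformed row
)
matrix, n_skipped = build_confusion(pairs, classes)
counts = per_class_counts(matrix, classes)
macro = precision(counts, "macro")
micro = precision(counts, "micro")
print(n_skipped, counts, round(macro, 3), round(micro, 3))
```

With this data two rows are skipped, per-class precision is 1.0 for pos and 8/9 for neg, so macro precision is 17/18 (about 0.944) while micro precision is 9/10.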
Superseded by #1674 (ClassifierEvaluator pure-metadata aggregator). The earlier dataset-level evaluator framework was replaced by the cleaner Classifier design that delegates aggregation math to the C# layer.

Summary
Adds a new dataset-level evaluator concept that runs once per evaluation set, after all per-datapoint evaluators complete. Each dataset evaluator consumes the per-datapoint EvaluationResultDto values from one named source evaluator and emits a single run-level EvaluationResult.
Three starter evaluators for multiclass classification metrics ship with this PR:
- PrecisionDatasetEvaluator: macro or micro precision
- RecallDatasetEvaluator: macro or micro recall
- FScoreDatasetEvaluator: F-beta (configurable f_value)
All three share a single confusion-matrix builder and emit a structured ClassificationDetails payload: per-class {tp, tn, fp, fn, support, value}, the full confusion matrix, micro, macro, and skip counts. Binary classification is the two-class case; there is no separate binary path.
How to test
The SDK side is self-contained — you can verify everything without a backend, worker, or browser.
1. Unit tests (18 tests, all passing)
Covers per-evaluator math (2-class, 3-class), micro vs macro divergence, F-beta weighting, out-of-vocab skipping, malformed-details skipping, case sensitivity, factory error paths, runtime-level routing by source_evaluator, and is_line_result exclusion.
2. Full eval suite: confirm no regressions
833 tests should pass.
3. Runnable end-to-end demo (recommended)
Walks the framework through five scenarios and prints the confusion matrix as a table, per-class TP/TN/FP/FN, the chosen metric, and a snippet of the wire JSON for each:
n_skipped surfaces so a 1.000 score on tiny n_scored is visibly suspect.
Sample output (Scenario 5):
4. Lint + typecheck
All clean.
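For reference, the F-beta weighting and the micro/macro split exercised by the tests and the demo follow the standard textbook definitions, written out below for a class set C. This is a reference sketch only; the PR text does not say how zero denominators are handled, so that case is left out.

```latex
\[
P_c = \frac{TP_c}{TP_c + FP_c}, \qquad
R_c = \frac{TP_c}{TP_c + FN_c}, \qquad
F_{\beta,c} = \frac{(1 + \beta^2)\, P_c R_c}{\beta^2 P_c + R_c}
\]
\[
P_{\text{macro}} = \frac{1}{|C|} \sum_{c \in C} P_c, \qquad
P_{\text{micro}} = \frac{\sum_{c \in C} TP_c}{\sum_{c \in C} (TP_c + FP_c)}
\]
% Worked example with P_c = 0.5 and R_c = 1.0:
\[
F_1 = \frac{2 \cdot 0.5 \cdot 1.0}{0.5 + 1.0} = \frac{2}{3} \approx 0.667, \qquad
F_2 = \frac{5 \cdot 0.5 \cdot 1.0}{4 \cdot 0.5 + 1.0} = \frac{2.5}{3} \approx 0.833
\]
```

Raising beta from 1 to 2 shifts weight toward recall, which is why the same class scores higher under F2 than F1 when its recall exceeds its precision.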
Architecture
BaseDatasetEvaluator is a parallel hierarchy to GenericBaseEvaluator, not a subclass: its evaluate(results: list[EvaluationResultDto]) signature is fundamentally different from per-datapoint evaluate(execution, criteria), so a separate base prevents accidental dispatch through the per-datapoint loop (LSP-safe).
Each dataset evaluator declares one source_evaluator by name. The runtime groups per-datapoint results by evaluator name once, then routes the right list to each dataset evaluator. Configs load from <eval_set>/../dataset_evaluators/*.json, mirroring the existing evaluator layout. EvalHelpers.load_dataset_evaluators validates each source_evaluator is declared in the same eval set up front.
Files
New
- eval/evaluators/base_dataset_evaluator.py: BaseDatasetEvaluator ABC, BaseDatasetEvaluatorConfig
- eval/evaluators/classification_dataset_evaluators.py: Precision / Recall / FScore + shared _build_confusion + ClassificationDetails
- eval/evaluators/dataset_evaluator_factory.py: type-discriminator registry + build_dataset_evaluator
- tests/evaluators/test_dataset_classification_evaluators.py: 18 tests
- examples/dataset_evaluators_demo.py: 5-scenario runnable demo
Modified
- eval/models/models.py: EvaluatorType.DATASET_PRECISION / .DATASET_RECALL / .DATASET_F_SCORE
- eval/models/evaluation_set.py: EvaluationSet.dataset_evaluator_refs
- eval/runtime/context.py: UiPathEvalContext.dataset_evaluators
- eval/runtime/_types.py: UiPathEvalOutput.dataset_evaluator_results
- eval/runtime/runtime.py: compute_dataset_evaluator_results(...), invoked after compute_evaluator_scores
- eval/helpers.py: EvalHelpers.load_dataset_evaluators(...) with source_evaluator validation
- _cli/cli_eval.py: load + attach to context
Companion PRs in UiPath/Agents
This SDK is the foundation of a three-PR stack:
Both companion PRs depend on this SDK being published as uipath>=2.10.69.
What this does not do
- BinaryClassificationEvaluator / MulticlassClassificationEvaluator per-datapoint evaluators. Their per-datapoint evaluate is unchanged. The dataset-level evaluators are an additive surface; the classifier evaluators can be removed in a follow-up PR if/when desired.
- source_evaluator: str leaves room for a future source_evaluators: list[str] extension without breaking the v1 shape.
Backwards compatibility
Purely additive. All new fields are optional with defaults; existing eval sets without datasetEvaluatorRefs are unaffected. Patch bump 2.10.68 → 2.10.69.
Test plan checklist
- pytest tests/evaluators/test_dataset_classification_evaluators.py: 18 tests passing.
- pytest tests/evaluators tests/cli/eval: 833 passing, zero regressions.
- ruff check / ruff format --check / mypy clean on all changed files.
- examples/dataset_evaluators_demo.py runs to completion, all 5 scenarios output correct numbers (see PR body for the actual output).
- uipath eval against a real eval set with datasetEvaluatorRefs: pending Agents #5307 backend pieces (loading configs from storage + Temporal child workflow invocation).
🤖 Generated with Claude Code