feat(eval): add dataset-level evaluator framework with precision/recall/f-score#1669

Closed
ajay-kesavan wants to merge 2 commits into main from feat/eval-dataset-evaluators

@ajay-kesavan ajay-kesavan commented May 20, 2026

Summary

Adds a new dataset-level evaluator concept that runs once per evaluation set, after all per-datapoint evaluators complete. Each dataset evaluator consumes the per-datapoint EvaluationResultDto values from one named source evaluator and emits a single run-level EvaluationResult.

Three starter evaluators for multiclass classification metrics ship with this PR:

  • PrecisionDatasetEvaluator — macro or micro precision
  • RecallDatasetEvaluator — macro or micro recall
  • FScoreDatasetEvaluator — F-beta (configurable f_value)

All three share a single confusion-matrix builder and emit a structured ClassificationDetails payload: per-class {tp, tn, fp, fn, support, value}, the full confusion matrix, micro, macro, and skip counts. Binary classification is the two-class case — no separate binary path.
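As a rough illustration of the shared counting step described above, the sketch below derives per-class tp/tn/fp/fn and support from (expected, predicted) pairs and skips any pair whose label falls outside the configured classes. The function names, the dictionary shape, and the choice to report 0.0 when a denominator is zero are assumptions made for this example, not the PR's actual `_build_confusion` or `ClassificationDetails` code.

```python
from collections import Counter


def build_confusion(pairs, classes):
    """Count (expected, predicted) pairs; skip pairs with a label outside `classes`."""
    known = set(classes)
    counts = Counter()
    skipped = 0
    for expected, predicted in pairs:
        if expected not in known or predicted not in known:
            skipped += 1
            continue
        counts[(expected, predicted)] += 1
    return counts, skipped


def per_class_counts(counts, classes):
    """Derive tp/fp/fn/tn/support for each class from the pair counts."""
    n_scored = sum(counts.values())
    stats = {}
    for c in classes:
        tp = counts[(c, c)]
        fp = sum(counts[(e, c)] for e in classes if e != c)  # predicted c, expected other
        fn = sum(counts[(c, p)] for p in classes if p != c)  # expected c, predicted other
        stats[c] = {
            "tp": tp,
            "fp": fp,
            "fn": fn,
            "tn": n_scored - tp - fp - fn,
            "support": tp + fn,
        }
    return stats


def precision(tp, fp):
    # Assumption: an empty denominator is reported as 0.0.
    return tp / (tp + fp) if tp + fp else 0.0


def macro_micro_precision(stats):
    """Macro averages per-class precision; micro pools the counts first."""
    macro = sum(precision(s["tp"], s["fp"]) for s in stats.values()) / len(stats)
    micro = precision(
        sum(s["tp"] for s in stats.values()),
        sum(s["fp"] for s in stats.values()),
    )
    return macro, micro


pairs = [("yes", "yes"), ("yes", "yes"), ("no", "yes"), ("no", "no"), ("maybe", "yes")]
counts, skipped = build_confusion(pairs, ["yes", "no"])
stats = per_class_counts(counts, ["yes", "no"])
macro, micro = macro_micro_precision(stats)
print(skipped, stats, round(macro, 3), micro)
```

The two-class input here goes through the same code path as any larger class list, which is the sense in which binary classification needs no separate branch.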

How to test

The SDK side is self-contained — you can verify everything without a backend, worker, or browser.

1. Unit tests (18 tests, all passing)

cd packages/uipath
uv run pytest tests/evaluators/test_dataset_classification_evaluators.py -v

Covers per-evaluator math (2-class, 3-class), micro vs macro divergence, F-beta weighting, out-of-vocab skipping, malformed-details skipping, case sensitivity, factory error paths, runtime-level routing by source_evaluator, and is_line_result exclusion.

2. Full eval suite — confirm no regressions

uv run pytest tests/evaluators tests/cli/eval --no-cov

833 tests should pass.

3. Runnable end-to-end demo (recommended)

uv run python examples/dataset_evaluators_demo.py

Walks the framework through five scenarios and prints the confusion matrix as a table, per-class TP/TN/FP/FN, the chosen metric, and a snippet of the wire JSON for each:

| Scenario | What it demonstrates |
| --- | --- |
| 1. Balanced 3-class intent recognition | Symmetric data → macro = micro = 0.667. TN math (= n_scored − TP − FP − FN) verified per class. |
| 2. Imbalanced 2-class, macro vs micro | Same data, two configs. macro=0.633 vs micro=0.750, which surfaces the bias of micro on imbalanced sets. |
| 3. Precision / Recall / F1 / F2 on the same dataset | Recall on "yes" = 1.0 but precision = 0.500. F1 is the harmonic mean of the two; F2 (β=2) shifts toward recall. |
| 4. Out-of-vocab + malformed details | 6 datapoints in, 2 scored, 4 skipped. n_skipped surfaces so a 1.000 score on tiny n_scored is visibly suspect. |
| 5. Realistic 4-class intent classifier | macro F1 = 0.6535, micro F1 = 0.7600. Macro surfaces "modify" (F=0.333) weakness that micro hides under "book" (F=0.909) wins. |
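To make the F-beta weighting in Scenario 3 concrete, here is a minimal sketch of the standard F-beta formula applied to the "yes" class figures quoted above (precision 0.500, recall 1.0). The function name `f_beta` and the zero-denominator convention are assumptions for illustration and are not taken from the PR's code.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta: beta > 1 weights recall more, beta < 1 weights precision more."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    b2 = beta * beta
    denom = b2 * precision + recall
    # Assumption: report 0.0 when both precision and recall are zero.
    return (1 + b2) * precision * recall / denom if denom else 0.0


f1 = f_beta(0.5, 1.0, beta=1.0)  # harmonic mean of precision and recall
f2 = f_beta(0.5, 1.0, beta=2.0)  # pulled toward the higher recall
print(round(f1, 3), round(f2, 3))  # 0.667 0.833
```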

Sample output (Scenario 5):

 Scenario 5 — Realistic 4-class intent classifier
══════════════════════════════════════════════════════════════════════════════
  metric = f_score   average = macro   score (headline) = 0.6535
  micro = 0.7600   macro = 0.6535   scored = 25/25   skipped = 0

            │     book    │    cancel   │  reschedule │    modify   │  ← expected
─────────────────────────────────────────────────────────────────────────────────
book        │          10 │           1 │           0 │           0 │
cancel      │           1 │           6 │           0 │           0 │
reschedule  │           0 │           0 │           2 │           1 │
modify      │           0 │           1 │           2 │           1 │
           ↑ predicted

  class       │  TP  TN  FP  FN  support  f_score
  ───────────────────────────────────────────────
  book        │  10  13   1   1       11  0.909
  cancel      │   6  16   1   2        8  0.800
  reschedule  │   2  20   1   2        4  0.571
  modify      │   1  20   3   1        2  0.333
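The Scenario 5 figures can be checked by hand from the printed confusion matrix. The sketch below is an independent recomputation in plain Python, not the PR's evaluator code; it reads rows as predicted labels and columns as expected labels, as in the table above.

```python
classes = ["book", "cancel", "reschedule", "modify"]

# Rows are predicted labels, columns are expected labels.
matrix = [
    [10, 1, 0, 0],
    [1, 6, 0, 0],
    [0, 0, 2, 1],
    [0, 1, 2, 1],
]


def f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


per_class = {}
total_tp = total_fp = total_fn = 0
for i, name in enumerate(classes):
    tp = matrix[i][i]
    fp = sum(matrix[i]) - tp                  # rest of the predicted row
    fn = sum(row[i] for row in matrix) - tp   # rest of the expected column
    per_class[name] = f1(tp, fp, fn)
    total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn

macro = sum(per_class.values()) / len(classes)  # mean of per-class F1
micro = f1(total_tp, total_fp, total_fn)        # F1 over pooled counts

print({k: round(v, 3) for k, v in per_class.items()})
print(round(macro, 4), round(micro, 4))  # 0.6535 0.76
```

With every datapoint scored and exactly one label per datapoint, the pooled FP and FN totals are equal, so micro F1 reduces to accuracy (19 of 25 correct).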

4. Lint + typecheck

uv run ruff check src/uipath/eval/evaluators src/uipath/eval/runtime tests/evaluators
uv run ruff format --check src/uipath/eval/evaluators src/uipath/eval/runtime tests/evaluators
uv run mypy src/uipath/eval/evaluators src/uipath/eval/runtime src/uipath/_cli/cli_eval.py

All clean.

Architecture

EvaluationSet (JSON)
  evaluatorRefs:            datasetEvaluatorRefs:   <-- NEW
    intent_match              precision_intent
                              recall_intent
                              f1_intent
        |                            |
        v                            v
UiPathEvalContext
  evaluators           dataset_evaluators           <-- NEW
        |                            |
        v                            v
RUN
  (1) per-datapoint pass  (existing; local workers OR per-job in serverless)
  ----- JOIN -----
  (2) group per-datapoint results by evaluator name
  (3) dataset-level pass: for each ds_eval, route grouped[source_evaluator]
        |
        v
UiPathEvalOutput
  evaluation_set_results      (existing)
  dataset_evaluator_results   <-- NEW

BaseDatasetEvaluator is a parallel hierarchy to GenericBaseEvaluator, not a subclass — its evaluate(results: list[EvaluationResultDto]) signature is fundamentally different from per-datapoint evaluate(execution, criteria), so a separate base prevents accidental dispatch through the per-datapoint loop (LSP-safe).
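The separation can be pictured with two unrelated abstract bases. Everything in the sketch below is a simplified stand-in: the class names carry a `Sketch` suffix to mark them as hypothetical, and the real `GenericBaseEvaluator`, `BaseDatasetEvaluator`, and `EvaluationResultDto` will have different fields and signatures.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


@dataclass
class ResultSketch:
    """Hypothetical stand-in for a per-datapoint or run-level result."""
    score: float


class PerDatapointEvaluatorSketch(ABC):
    """Stand-in for the per-datapoint base: one call per datapoint."""

    @abstractmethod
    def evaluate(self, execution: Any, criteria: Any) -> ResultSketch: ...


class DatasetEvaluatorSketch(ABC):
    """Separate base: one call per evaluation set, over a list of results."""

    def __init__(self, source_evaluator: str) -> None:
        self.source_evaluator = source_evaluator

    @abstractmethod
    def evaluate(self, results: list[ResultSketch]) -> ResultSketch: ...


class MeanScoreSketch(DatasetEvaluatorSketch):
    """Toy dataset evaluator: averages the source evaluator's scores."""

    def evaluate(self, results: list[ResultSketch]) -> ResultSketch:
        if not results:
            return ResultSketch(score=0.0)
        return ResultSketch(score=sum(r.score for r in results) / len(results))


def per_datapoint_only(candidates: list[object]) -> list[PerDatapointEvaluatorSketch]:
    """A type filter like this cannot select a dataset evaluator."""
    return [c for c in candidates if isinstance(c, PerDatapointEvaluatorSketch)]


mean = MeanScoreSketch(source_evaluator="intent_match")
run_level = mean.evaluate([ResultSketch(1.0), ResultSketch(0.0)])
print(run_level.score, per_datapoint_only([mean]))  # 0.5 []
```

Because the two bases share no ancestor other than `ABC`, a loop typed against the per-datapoint base cannot receive a dataset evaluator without a type error or an explicit filter miss.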

Each dataset evaluator declares one source_evaluator by name. The runtime groups per-datapoint results by evaluator name once, then routes the right list to each dataset evaluator. Configs load from <eval_set>/../dataset_evaluators/*.json mirroring the existing evaluator layout. EvalHelpers.load_dataset_evaluators validates each source_evaluator is declared in the same eval set up front.
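A minimal sketch of the group-then-route step and the up-front source check might look like the following. The dictionary-based result and config shapes, and the function names, are assumptions for illustration; the actual `compute_dataset_evaluator_results` and `EvalHelpers.load_dataset_evaluators` are not reproduced here.

```python
from collections import defaultdict


def validate_sources(dataset_configs, declared_evaluators):
    """Fail early if a dataset evaluator names a source that the eval set lacks."""
    declared = set(declared_evaluators)
    unknown = sorted({c["source_evaluator"] for c in dataset_configs} - declared)
    if unknown:
        raise ValueError(f"unknown source_evaluator(s): {unknown}")


def group_by_evaluator(results):
    """Group per-datapoint results by evaluator name in a single pass."""
    grouped = defaultdict(list)
    for result in results:
        grouped[result["evaluator"]].append(result)
    return grouped


def run_dataset_pass(dataset_evaluators, results):
    """Route each dataset evaluator the list produced by its source evaluator."""
    grouped = group_by_evaluator(results)
    return {
        name: evaluate(grouped.get(source, []))
        for name, (source, evaluate) in dataset_evaluators.items()
    }


results = [
    {"evaluator": "intent_match", "score": 1.0},
    {"evaluator": "intent_match", "score": 0.0},
    {"evaluator": "other", "score": 1.0},
]
validate_sources([{"source_evaluator": "intent_match"}], ["intent_match", "other"])
out = run_dataset_pass({"count_intent": ("intent_match", len)}, results)
print(out)  # {'count_intent': 2}
```

Grouping once and indexing by name keeps the dataset pass linear in the number of per-datapoint results, regardless of how many dataset evaluators share a source.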

Files

New

  • eval/evaluators/base_dataset_evaluator.py — BaseDatasetEvaluator ABC, BaseDatasetEvaluatorConfig
  • eval/evaluators/classification_dataset_evaluators.py — Precision / Recall / FScore + shared _build_confusion + ClassificationDetails
  • eval/evaluators/dataset_evaluator_factory.py — type-discriminator registry + build_dataset_evaluator
  • tests/evaluators/test_dataset_classification_evaluators.py — 18 tests
  • examples/dataset_evaluators_demo.py — 5-scenario runnable demo

Modified

  • eval/models/models.py — EvaluatorType.DATASET_PRECISION / .DATASET_RECALL / .DATASET_F_SCORE
  • eval/models/evaluation_set.py — EvaluationSet.dataset_evaluator_refs
  • eval/runtime/context.py — UiPathEvalContext.dataset_evaluators
  • eval/runtime/_types.py — UiPathEvalOutput.dataset_evaluator_results
  • eval/runtime/runtime.py — compute_dataset_evaluator_results(...), invoked after compute_evaluator_scores
  • eval/helpers.py — EvalHelpers.load_dataset_evaluators(...) with source_evaluator validation
  • _cli/cli_eval.py — load + attach to context

Companion PRs in UiPath/Agents

This SDK PR is the foundation of a three-PR stack; the other two are:

  • UiPath/Agents#5306 — Studio Web UI (picker section + Aggregations panel)
  • UiPath/Agents#5307 — python-eval-worker workflow + C# storage/DTO/mapper + workflow wiring (the activity body is a structural no-op pending three deferred pieces — see that PR's body)

Both companion PRs depend on this SDK being published as uipath>=2.10.69.

What this does not do

  • Does not touch the existing BinaryClassificationEvaluator / MulticlassClassificationEvaluator per-datapoint evaluators. Their per-datapoint evaluate is unchanged. The dataset-level evaluators are an additive surface; the classifier evaluators can be removed in a follow-up PR if/when desired.
  • No CLI flag changes, no new event types, no span/exporter integration for the dataset-level results in this pass.
  • Multi-source dataset evaluators (e.g. DisagreementEvaluator) are out of scope — source_evaluator: str leaves room for a future source_evaluators: list[str] extension without breaking the v1 shape.
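One way the single-source shape could later grow into a multi-source one without breaking existing configs is sketched below. This is purely hypothetical: the class name, the `source_evaluators` field, and the `sources()` helper are invented for illustration and do not exist in this PR.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DatasetEvaluatorConfigSketch:
    """Hypothetical config accepting either the v1 field or a future list field."""

    name: str
    source_evaluator: Optional[str] = None                      # v1 shape
    source_evaluators: list[str] = field(default_factory=list)  # hypothetical extension

    def sources(self) -> list[str]:
        """Normalize both shapes to a list so callers handle one case."""
        if self.source_evaluator is not None and self.source_evaluators:
            raise ValueError("set source_evaluator or source_evaluators, not both")
        if self.source_evaluator is not None:
            return [self.source_evaluator]
        return list(self.source_evaluators)


v1 = DatasetEvaluatorConfigSketch("precision_intent", source_evaluator="intent_match")
v2 = DatasetEvaluatorConfigSketch("disagreement", source_evaluators=["a", "b"])
print(v1.sources(), v2.sources())  # ['intent_match'] ['a', 'b']
```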

Backwards compatibility

Purely additive. All new fields are optional with defaults; existing eval sets without datasetEvaluatorRefs are unaffected. Patch bump 2.10.68 → 2.10.69.

Test plan checklist

  • pytest tests/evaluators/test_dataset_classification_evaluators.py — 18 tests passing.
  • pytest tests/evaluators tests/cli/eval — 833 passing, zero regressions.
  • ruff check / ruff format --check / mypy clean on all changed files.
  • examples/dataset_evaluators_demo.py runs to completion, all 5 scenarios output correct numbers (see PR body for the actual output).
  • End-to-end via uipath eval against a real eval set with datasetEvaluatorRefs — pending Agents #5307 backend pieces (loading configs from storage + Temporal child workflow invocation).

🤖 Generated with Claude Code

…ll/f-score

Introduces a new BaseDatasetEvaluator concept that runs once per evaluation
set after all per-datapoint evaluators complete. It consumes per-datapoint
EvaluationResultDto values from a named source evaluator and emits a single
run-level EvaluationResult.

Includes three starter evaluators for multiclass classification metrics:

- PrecisionDatasetEvaluator
- RecallDatasetEvaluator
- FScoreDatasetEvaluator (configurable beta)

Each takes a required classes list (populated from the UI), supports micro
or macro averaging, and emits per-class TP/TN/FP/FN plus the confusion
matrix in details. Binary is the 2-class case — no separate binary path.

Architecture: BaseDatasetEvaluator is a parallel hierarchy to
GenericBaseEvaluator (not a subclass) so the per-datapoint dispatch loop
cannot accidentally pick up a dataset evaluator. Each dataset evaluator
declares a single source_evaluator by name; the runtime groups
per-datapoint results by evaluator name and routes the right list to each
dataset evaluator. Configs load from <eval_set>/../dataset_evaluators/*.json
mirroring the evaluators directory layout.

Patch version bumped: 2.10.68 -> 2.10.69.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions bot added the test:uipath-langchain (triggers tests in the uipath-langchain-python repository) and test:uipath-integrations labels May 20, 2026
…10.69

examples/dataset_evaluators_demo.py walks the new dataset-level evaluators
(Precision / Recall / F-score) through five scenarios that exercise the
math end-to-end at the SDK layer:

  1. Balanced 3-class — symmetric confusion matrix, macro == micro
  2. Imbalanced 2-class — shows where macro and micro diverge
  3. Same data, four metrics (Precision, Recall, F1, F2) — proves the
     F-beta knob actually moves per-class numbers
  4. Out-of-vocab + malformed details — n_skipped surfaces, no silent drops
  5. Realistic 4-class intent classifier — uneven per-class performance

Each scenario prints the confusion matrix as a table, the per-class
TP/TN/FP/FN + the metric, and a snippet of the wire JSON that AutoMapper
will surface to the frontend.

Run::

    cd packages/uipath && uv run python examples/dataset_evaluators_demo.py

uv.lock reflects the pyproject.toml version bump (2.10.68 -> 2.10.69)
already in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sonarqubecloud

Quality Gate failed

Failed conditions
83.5% Coverage on New Code (required ≥ 90%)
C Reliability Rating on New Code (required ≥ A)

@ajay-kesavan
Author

Superseded by #1674 (ClassifierEvaluator pure-metadata aggregator). The earlier dataset-level evaluator framework was replaced by the cleaner Classifier design that delegates aggregation math to the C# layer.
