feat(eval): add dataset-level evaluator framework with precision/recall/f-score by ajay-kesavan · Pull Request #1669 · UiPath/uipath-python

ajay-kesavan · 2026-05-20T21:06:23Z

Summary

Adds a new dataset-level evaluator concept that runs once per evaluation set, after all per-datapoint evaluators complete. Each dataset evaluator consumes the per-datapoint EvaluationResultDto values from one named source evaluator and emits a single run-level EvaluationResult.

Three starter evaluators for multiclass classification metrics ship with this PR:

PrecisionDatasetEvaluator — macro or micro precision
RecallDatasetEvaluator — macro or micro recall
FScoreDatasetEvaluator — F-beta (configurable f_value)

All three share a single confusion-matrix builder and emit a structured ClassificationDetails payload: per-class {tp, tn, fp, fn, support, value}, the full confusion matrix, micro, macro, and skip counts. Binary classification is the two-class case — no separate binary path.
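As a rough illustration of the shared counting step, the sketch below derives per-class counts from (expected, predicted) pairs and computes macro and micro precision. The names `per_class_counts` and `precision` are hypothetical, not the PR's `_build_confusion` API, and returning 0.0 on an empty denominator is an assumption about the zero-division convention.

```python
from collections import Counter


def per_class_counts(pairs, classes):
    """Per-class tp/fp/fn/tn/support from (expected, predicted) pairs."""
    confusion = Counter(pairs)  # (expected, predicted) -> count
    n = len(pairs)
    counts = {}
    for cls in classes:
        tp = confusion[(cls, cls)]
        fp = sum(c for (exp, pred), c in confusion.items() if pred == cls and exp != cls)
        fn = sum(c for (exp, pred), c in confusion.items() if exp == cls and pred != cls)
        # TN is everything scored that is not TP, FP, or FN for this class.
        counts[cls] = {"tp": tp, "fp": fp, "fn": fn, "tn": n - tp - fp - fn, "support": tp + fn}
    return counts


def precision(counts, average="macro"):
    """Macro: mean of per-class precision. Micro: pooled TP / (TP + FP)."""
    if average == "micro":
        tp = sum(c["tp"] for c in counts.values())
        fp = sum(c["fp"] for c in counts.values())
        return tp / (tp + fp) if tp + fp else 0.0  # assumed zero-division convention
    per_class = [
        c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        for c in counts.values()
    ]
    return sum(per_class) / len(per_class)


# Binary classification is just the two-class case of the same code path.
pairs = [("yes", "yes")] * 3 + [("no", "yes"), ("no", "no")]
counts = per_class_counts(pairs, ["yes", "no"])
macro_p = precision(counts, "macro")  # (0.75 + 1.0) / 2
micro_p = precision(counts, "micro")  # 4 / 5
```

Here precision on "yes" is 3/4 and on "no" is 1/1, so macro and micro already differ on five datapoints.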

How to test

The SDK side is self-contained — you can verify everything without a backend, worker, or browser.

1. Unit tests (18 tests, all passing)

cd packages/uipath
uv run pytest tests/evaluators/test_dataset_classification_evaluators.py -v

Covers per-evaluator math (2-class, 3-class), micro vs macro divergence, F-beta weighting, out-of-vocab skipping, malformed-details skipping, case sensitivity, factory error paths, runtime-level routing by source_evaluator, and is_line_result exclusion.
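The skipping behaviour named above might look roughly like the following. This is a sketch under assumptions: the `expected` / `predicted` keys in the per-datapoint details and the function name `collect_pairs` are hypothetical, and exact (case-sensitive) string matching against the class list is assumed rather than confirmed by the PR.

```python
def collect_pairs(details_list, classes):
    """Return (scored_pairs, n_skipped) for a list of per-datapoint details.

    A datapoint is skipped, not failed, when its details are malformed or
    when either label is outside the configured class list.
    """
    allowed = set(classes)
    pairs = []
    n_skipped = 0
    for details in details_list:
        if not isinstance(details, dict):
            n_skipped += 1  # malformed details
            continue
        expected = details.get("expected")    # hypothetical key
        predicted = details.get("predicted")  # hypothetical key
        if not isinstance(expected, str) or not isinstance(predicted, str):
            n_skipped += 1  # missing or non-string label
            continue
        if expected not in allowed or predicted not in allowed:
            n_skipped += 1  # out-of-vocab, including case mismatches
            continue
        pairs.append((expected, predicted))
    return pairs, n_skipped


details_list = [
    {"expected": "yes", "predicted": "yes"},
    {"expected": "no", "predicted": "no"},
    {"expected": "yes", "predicted": "maybe"},  # out-of-vocab prediction
    {"expected": "Yes", "predicted": "yes"},    # case mismatch
    None,                                       # malformed details
    {"predicted": "no"},                        # missing expected label
]
pairs, n_skipped = collect_pairs(details_list, ["yes", "no"])
```

Returning the skip count alongside the scored pairs keeps a perfect score on a tiny scored subset from passing unnoticed.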

2. Full eval suite — confirm no regressions

uv run pytest tests/evaluators tests/cli/eval --no-cov

833 tests should pass.

3. Runnable end-to-end demo (recommended)

uv run python examples/dataset_evaluators_demo.py

Walks the framework through five scenarios and prints the confusion matrix as a table, per-class TP/TN/FP/FN, the chosen metric, and a snippet of the wire JSON for each:

Scenario	What it demonstrates
1. Balanced 3-class intent recognition	Symmetric data → macro = micro = 0.667. TN math (= n_scored − TP − FP − FN) verified per class.
2. Imbalanced 2-class, macro vs micro	Same data, two configs. macro=0.633 vs micro=0.750 — surfaces the bias of micro on imbalanced sets.
3. Precision / Recall / F1 / F2 on the same dataset	Recall on "yes" = 1.0 but precision = 0.500. F1 is their harmonic mean (0.667); F2 (β=2) weights recall more heavily (0.833).
4. Out-of-vocab + malformed details	6 datapoints in, 2 scored, 4 skipped. `n_skipped` surfaces so a 1.000 score on tiny `n_scored` is visibly suspect.
5. Realistic 4-class intent classifier	macro F1 = 0.6535, micro F1 = 0.7600. Macro surfaces "modify" (F=0.333) weakness that micro hides under "book" (F=0.909) wins.
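For Scenario 3, the F-beta weighting can be checked by hand with the standard formula F_β = (1 + β²)·P·R / (β²·P + R). The helper below is a minimal sketch; the name `f_beta` and the 0.0 fallback for an empty denominator are assumptions, not the PR's code.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta: beta > 1 favours recall, beta < 1 favours precision."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Scenario 3, class "yes": precision 0.500, recall 1.0.
f1 = f_beta(0.5, 1.0, beta=1.0)    # 1.0 / 1.5
f2 = f_beta(0.5, 1.0, beta=2.0)    # 2.5 / 3.0
f05 = f_beta(0.5, 1.0, beta=0.5)   # 0.625 / 1.125
```

With recall higher than precision, the score rises as β grows (F0.5 ≈ 0.556, F1 ≈ 0.667, F2 ≈ 0.833).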

Sample output (Scenario 5):

 Scenario 5 — Realistic 4-class intent classifier
══════════════════════════════════════════════════════════════════════════════
  metric = f_score   average = macro   score (headline) = 0.6535
  micro = 0.7600   macro = 0.6535   scored = 25/25   skipped = 0

            │     book    │    cancel   │  reschedule │    modify   │  ← expected
─────────────────────────────────────────────────────────────────────────────────
book        │          10 │           1 │           0 │           0 │
cancel      │           1 │           6 │           0 │           0 │
reschedule  │           0 │           0 │           2 │           1 │
modify      │           0 │           1 │           2 │           1 │
           ↑ predicted

  class       │  TP  TN  FP  FN  support  f_score
  ───────────────────────────────────────────────
  book        │  10  13   1   1       11  0.909
  cancel      │   6  16   1   2        8  0.800
  reschedule  │   2  20   1   2        4  0.571
  modify      │   1  20   3   1        2  0.333
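The numbers in this sample output can be reproduced from the confusion matrix alone. The snippet below is an independent check, not the demo script itself; rows are predicted labels and columns are expected labels, as in the table above.

```python
classes = ["book", "cancel", "reschedule", "modify"]
# matrix[i][j] = datapoints predicted as classes[i] whose expected label is classes[j]
matrix = [
    [10, 1, 0, 0],
    [1, 6, 0, 0],
    [0, 0, 2, 1],
    [0, 1, 2, 1],
]
n = sum(sum(row) for row in matrix)

table = {}
f1_by_class = {}
for i, cls in enumerate(classes):
    tp = matrix[i][i]
    fp = sum(matrix[i]) - tp                 # rest of the predicted row
    fn = sum(row[i] for row in matrix) - tp  # rest of the expected column
    tn = n - tp - fp - fn
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    f1_by_class[cls] = f1
    table[cls] = (tp, tn, fp, fn, tp + fn, round(f1, 3))

macro_f1 = sum(f1_by_class.values()) / len(classes)

# Micro pools the counts across classes before computing F1.
total_tp = sum(row[0] for row in table.values())
total_fp = sum(row[2] for row in table.values())
total_fn = sum(row[3] for row in table.values())
micro_f1 = 2 * total_tp / (2 * total_tp + total_fp + total_fn)

for cls, row in table.items():
    print(cls, row)
print(round(macro_f1, 4), round(micro_f1, 4))
```

Because every datapoint carries exactly one expected and one predicted label, pooled FP equals pooled FN, so micro F1 reduces to accuracy (19/25).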

4. Lint + typecheck

uv run ruff check src/uipath/eval/evaluators src/uipath/eval/runtime tests/evaluators
uv run ruff format --check src/uipath/eval/evaluators src/uipath/eval/runtime tests/evaluators
uv run mypy src/uipath/eval/evaluators src/uipath/eval/runtime src/uipath/_cli/cli_eval.py

All clean.

Architecture

EvaluationSet (JSON)
  evaluatorRefs:            datasetEvaluatorRefs:   <-- NEW
    intent_match              precision_intent
                              recall_intent
                              f1_intent
        |                            |
        v                            v
UiPathEvalContext
  evaluators           dataset_evaluators           <-- NEW
        |                            |
        v                            v
RUN
  (1) per-datapoint pass  (existing; local workers OR per-job in serverless)
  ----- JOIN -----
  (2) group per-datapoint results by evaluator name
  (3) dataset-level pass: for each ds_eval, route grouped[source_evaluator]
        |
        v
UiPathEvalOutput
  evaluation_set_results      (existing)
  dataset_evaluator_results   <-- NEW

BaseDatasetEvaluator is a parallel hierarchy to GenericBaseEvaluator, not a subclass — its evaluate(results: list[EvaluationResultDto]) signature is fundamentally different from per-datapoint evaluate(execution, criteria), so a separate base prevents accidental dispatch through the per-datapoint loop (LSP-safe).
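A minimal sketch of what a separate base with this signature could look like. `ResultDto`, `RunResult`, `DatasetEvaluatorSketch`, and `MeanScore` are hypothetical stand-ins rather than the SDK's classes, and a synchronous `evaluate` is an assumption.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ResultDto:
    """Stand-in for a per-datapoint EvaluationResultDto."""
    score: float
    details: Any = None


@dataclass
class RunResult:
    """Stand-in for the single run-level EvaluationResult."""
    score: float
    details: dict = field(default_factory=dict)


class DatasetEvaluatorSketch(ABC):
    """Separate base class: evaluate() takes the whole result list, not one execution."""

    def __init__(self, name: str, source_evaluator: str) -> None:
        self.name = name
        self.source_evaluator = source_evaluator

    @abstractmethod
    def evaluate(self, results: list[ResultDto]) -> RunResult:
        ...


class MeanScore(DatasetEvaluatorSketch):
    """Toy dataset evaluator: averages the per-datapoint scores."""

    def evaluate(self, results: list[ResultDto]) -> RunResult:
        scores = [r.score for r in results]
        mean = sum(scores) / len(scores) if scores else 0.0
        return RunResult(score=mean, details={"n_scored": len(scores)})


run_result = MeanScore("mean_intent", "intent_match").evaluate(
    [ResultDto(1.0), ResultDto(0.0), ResultDto(1.0), ResultDto(1.0)]
)
```

Since the sketch shares no base class with the per-datapoint evaluators, an `isinstance` check in the per-datapoint loop would never match a dataset evaluator.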

Each dataset evaluator declares one source_evaluator by name. The runtime groups per-datapoint results by evaluator name once, then routes the right list to each dataset evaluator. Configs load from <eval_set>/../dataset_evaluators/*.json mirroring the existing evaluator layout. EvalHelpers.load_dataset_evaluators validates each source_evaluator is declared in the same eval set up front.
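The group-once-then-route step and the up-front validation could be sketched as follows. The function names and the plain-tuple and dict shapes are illustrative assumptions; the real runtime works on `EvaluationResultDto` objects and evaluator instances.

```python
from collections import defaultdict


def group_by_evaluator(results):
    """Group (evaluator_name, result) tuples by evaluator name, in one pass."""
    grouped = defaultdict(list)
    for evaluator_name, result in results:
        grouped[evaluator_name].append(result)
    return dict(grouped)


def validate_sources(dataset_evaluators, declared_evaluators):
    """Fail fast if a dataset evaluator names a source the eval set does not declare."""
    declared = set(declared_evaluators)
    unknown = sorted({src for src in dataset_evaluators.values() if src not in declared})
    if unknown:
        raise ValueError(f"unknown source_evaluator(s): {unknown}")


def route(grouped, dataset_evaluators):
    """Map each dataset evaluator name to the result list of its source evaluator."""
    return {name: grouped.get(src, []) for name, src in dataset_evaluators.items()}


# dataset evaluator name -> source_evaluator it consumes
dataset_evaluators = {
    "precision_intent": "intent_match",
    "recall_intent": "intent_match",
    "f1_intent": "intent_match",
}
results = [("intent_match", 1.0), ("other_eval", 0.5), ("intent_match", 0.0)]

validate_sources(dataset_evaluators, ["intent_match", "other_eval"])
routed = route(group_by_evaluator(results), dataset_evaluators)
```

Grouping once means the three intent evaluators all read the same list instead of each re-scanning the full result set.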

Files

New

eval/evaluators/base_dataset_evaluator.py — BaseDatasetEvaluator ABC, BaseDatasetEvaluatorConfig
eval/evaluators/classification_dataset_evaluators.py — Precision / Recall / FScore + shared _build_confusion + ClassificationDetails
eval/evaluators/dataset_evaluator_factory.py — type-discriminator registry + build_dataset_evaluator
tests/evaluators/test_dataset_classification_evaluators.py — 18 tests
examples/dataset_evaluators_demo.py — 5-scenario runnable demo
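A type-discriminator registry of the kind named for dataset_evaluator_factory.py might look roughly like this. The discriminator value `"dataset_precision"`, the `PrecisionSketch` class, and `build_sketch` are all hypothetical; the real factory's config models and error types are not shown in this PR description.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PrecisionSketch:
    """Stand-in for a configured dataset evaluator."""
    name: str
    source_evaluator: str
    average: str = "macro"


# Discriminator string -> constructor. The key below is illustrative only.
_REGISTRY: dict[str, Callable[..., object]] = {
    "dataset_precision": PrecisionSketch,
}


def build_sketch(config: dict) -> object:
    """Pick the constructor from the 'type' field and pass the remaining keys to it."""
    params = dict(config)  # copy so the caller's dict is not mutated
    type_name = params.pop("type", None)
    if type_name is None:
        raise ValueError("config is missing the 'type' discriminator")
    if type_name not in _REGISTRY:
        raise ValueError(f"unknown type {type_name!r}; known: {sorted(_REGISTRY)}")
    return _REGISTRY[type_name](**params)


evaluator = build_sketch(
    {"type": "dataset_precision", "name": "precision_intent", "source_evaluator": "intent_match"}
)
```

Adding a new evaluator type then only requires a new registry entry, and unknown or missing types fail with an explicit error at load time.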

Modified

eval/models/models.py — EvaluatorType.DATASET_PRECISION / .DATASET_RECALL / .DATASET_F_SCORE
eval/models/evaluation_set.py — EvaluationSet.dataset_evaluator_refs
eval/runtime/context.py — UiPathEvalContext.dataset_evaluators
eval/runtime/_types.py — UiPathEvalOutput.dataset_evaluator_results
eval/runtime/runtime.py — compute_dataset_evaluator_results(...), invoked after compute_evaluator_scores
eval/helpers.py — EvalHelpers.load_dataset_evaluators(...) with source_evaluator validation
_cli/cli_eval.py — load + attach to context

Companion PRs in UiPath/Agents

This SDK PR is the foundation of a three-PR stack. The other two are:

UiPath/Agents#5306 — Studio Web UI (picker section + Aggregations panel)
UiPath/Agents#5307 — python-eval-worker workflow + C# storage/DTO/mapper + workflow wiring (the activity body is a structural no-op pending three deferred pieces — see that PR's body)

Both companion PRs depend on this SDK being published as uipath>=2.10.69.

What this does not do

Does not touch the existing BinaryClassificationEvaluator / MulticlassClassificationEvaluator per-datapoint evaluators. Their per-datapoint evaluate is unchanged. The dataset-level evaluators are an additive surface; the classifier evaluators can be removed in a follow-up PR if/when desired.
No CLI flag changes, no new event types, no span/exporter integration for the dataset-level results in this pass.
Multi-source dataset evaluators (e.g. DisagreementEvaluator) are out of scope — source_evaluator: str leaves room for a future source_evaluators: list[str] extension without breaking the v1 shape.

Backwards compatibility

Purely additive. All new fields are optional with defaults; existing eval sets without datasetEvaluatorRefs are unaffected. Patch bump 2.10.68 → 2.10.69.
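The additive behaviour can be pictured with a small parsing sketch. `EvalSetSketch` and `parse_eval_set` are hypothetical, and an empty-list default is an assumption; only the `evaluatorRefs` and `datasetEvaluatorRefs` keys come from this PR.

```python
from dataclasses import dataclass, field


@dataclass
class EvalSetSketch:
    evaluator_refs: list[str]
    # New optional field: absent in older eval sets, so it defaults to empty.
    dataset_evaluator_refs: list[str] = field(default_factory=list)


def parse_eval_set(raw: dict) -> EvalSetSketch:
    """Read an eval set dict, tolerating a missing datasetEvaluatorRefs key."""
    return EvalSetSketch(
        evaluator_refs=list(raw.get("evaluatorRefs", [])),
        dataset_evaluator_refs=list(raw.get("datasetEvaluatorRefs", [])),
    )


old_style = parse_eval_set({"evaluatorRefs": ["intent_match"]})
new_style = parse_eval_set(
    {"evaluatorRefs": ["intent_match"], "datasetEvaluatorRefs": ["f1_intent"]}
)
```

An older eval set parses to an empty `dataset_evaluator_refs`, so the dataset-level pass has nothing to run and the per-datapoint results are unchanged.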

Test plan checklist

pytest tests/evaluators/test_dataset_classification_evaluators.py — 18 tests passing.
pytest tests/evaluators tests/cli/eval — 833 passing, zero regressions.
ruff check / ruff format --check / mypy clean on all changed files.
examples/dataset_evaluators_demo.py runs to completion, all 5 scenarios output correct numbers (see PR body for the actual output).
End-to-end via uipath eval against a real eval set with datasetEvaluatorRefs — pending Agents #5307 backend pieces (loading configs from storage + Temporal child workflow invocation).

🤖 Generated with Claude Code

…ll/f-score

Introduces a new BaseDatasetEvaluator concept that runs once per evaluation set after all per-datapoint evaluators complete. It consumes per-datapoint EvaluationResultDto values from a named source evaluator and emits a single run-level EvaluationResult.

Includes three starter evaluators for multiclass classification metrics:

- PrecisionDatasetEvaluator
- RecallDatasetEvaluator
- FScoreDatasetEvaluator (configurable beta)

Each takes a required classes list (populated from the UI), supports micro or macro averaging, and emits per-class TP/TN/FP/FN plus the confusion matrix in details. Binary is the 2-class case, with no separate binary path.

Architecture: BaseDatasetEvaluator is a parallel hierarchy to GenericBaseEvaluator (not a subclass) so the per-datapoint dispatch loop cannot accidentally pick up a dataset evaluator. Each dataset evaluator declares a single source_evaluator by name; the runtime groups per-datapoint results by evaluator name and routes the right list to each dataset evaluator. Configs load from <eval_set>/../dataset_evaluators/*.json mirroring the evaluators directory layout.

Patch version bumped: 2.10.68 -> 2.10.69.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…10.69

examples/dataset_evaluators_demo.py walks the new dataset-level evaluators (Precision / Recall / F-score) through five scenarios that exercise the math end-to-end at the SDK layer:

1. Balanced 3-class: symmetric confusion matrix, macro == micro
2. Imbalanced 2-class: shows where macro and micro diverge
3. Same data, four metrics (Precision, Recall, F1, F2): proves the F-beta knob actually moves per-class numbers
4. Out-of-vocab + malformed details: n_skipped surfaces, no silent drops
5. Realistic 4-class intent classifier: uneven per-class performance

Each scenario prints the confusion matrix as a table, the per-class TP/TN/FP/FN + the metric, and a snippet of the wire JSON that AutoMapper will surface to the frontend.

Run::

cd packages/uipath && uv run python examples/dataset_evaluators_demo.py

uv.lock reflects the pyproject.toml version bump (2.10.68 -> 2.10.69) already in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

sonarqubecloud · 2026-05-20T23:17:06Z

Quality Gate failed

Failed conditions
83.5% Coverage on New Code (required ≥ 90%)
C Reliability Rating on New Code (required ≥ A)


ajay-kesavan · 2026-05-22T03:38:15Z

Superseded by #1674 (ClassifierEvaluator pure-metadata aggregator). The earlier dataset-level evaluator framework was replaced by the cleaner Classifier design that delegates aggregation math to the C# layer.

github-actions bot added the test:uipath-langchain (triggers tests in the uipath-langchain-python repository) and test:uipath-integrations labels on May 20, 2026

ajay-kesavan mentioned this pull request on May 21, 2026, in feat(eval): add ClassifierEvaluator (pure-metadata aggregator) #1674 (open, 6 tasks)

ajay-kesavan closed this May 22, 2026

ajay-kesavan wants to merge 2 commits into main from feat/eval-dataset-evaluators
