
feat(eval): classification evaluator schemas + sample projects + e2e tests#1663

Closed
ajay-kesavan wants to merge 2 commits into main from feat/classification-evaluator-types


@ajay-kesavan ajay-kesavan commented May 20, 2026

Summary

Completes the classification evaluator feature shipped in #1397 by adding the three pieces that PR didn't carry:

  1. Generated type schemas: BinaryClassificationEvaluator.json and MulticlassClassificationEvaluator.json under packages/uipath/src/uipath/eval/evaluators_types/, produced by python -m uipath.eval.evaluators_types.generate_types. These are the machine-readable schemas that external tooling (Flow UI evaluator picker, uip maestro flow eval) uses to learn each evaluator's config / criteria / justification shape.

  2. Sample projects under packages/uipath/samples/:

    • binary_classification_agent/ — rule-based spam/ham classifier wired to the binary classification evaluator with metric_type=precision. Eval set is designed so 4/5 datapoints pass but precision is 2/3 because of one deliberate false positive — demonstrates the dataset-level metric diverging from a simple per-row pass rate.
    • multiclass_classification_simple/ — rule-based 3-class router (payments / support / spam) wired to the multiclass classification evaluator with averaging=macro. Eval set forces a misroute that hurts both payments precision and support recall, giving macro F1 = (0.8 + 0.8 + 1.0) / 3.
  3. End-to-end test at packages/uipath/tests/cli/eval/test_classification_samples_e2e.py — loads each sample's eval set, wires its main.py into a stand-in runtime, calls evaluate(), and asserts both the per-row scores and the aggregated metric produced by reduce_scores. Locks in the dataset-level math.
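
To make the pass-rate versus precision gap concrete, here is a minimal sketch of the arithmetic. The five rows are hypothetical stand-ins chosen to reproduce the 4/5 and 2/3 figures above; they are not the sample's actual eval set, and the code does not call the evaluator itself.

```python
from fractions import Fraction

# Hypothetical (expected, predicted) pairs, chosen only to match the
# numbers quoted in the summary; not the sample's real datapoints.
rows = [
    ("spam", "spam"),  # true positive
    ("spam", "spam"),  # true positive
    ("ham", "spam"),   # the deliberate false positive
    ("ham", "ham"),    # true negative
    ("ham", "ham"),    # true negative
]
positive_class = "spam"

# Per-row pass rate: fraction of rows where prediction equals expectation.
pass_rate = Fraction(sum(e == p for e, p in rows), len(rows))

# Dataset-level precision: TP / (TP + FP) for the positive class.
tp = sum(e == positive_class and p == positive_class for e, p in rows)
fp = sum(e != positive_class and p == positive_class for e, p in rows)
precision = Fraction(tp, tp + fp)

print(pass_rate, precision)  # 4/5 2/3
```

True negatives raise the pass rate but never enter the precision formula, which is why the two numbers diverge.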

Why split this PR

PR #1397 added the Python implementation and registered the new evaluator type IDs (uipath-binary-classification, uipath-multiclass-classification) in the coded-evaluator discriminator, but didn't regenerate the JSON type files or add a runnable example. Without these, the evaluators are merged in name only.

Test plan

  • pytest tests/cli/eval/test_classification_samples_e2e.py — both samples pass
  • ruff check tests/cli/eval/test_classification_samples_e2e.py — clean
  • ruff format --check — clean
  • cat packages/uipath/src/uipath/eval/evaluators_types/BinaryClassificationEvaluator.json exposes positive_class, metric_type, f_value in evaluatorConfigSchema.properties
  • cat packages/uipath/src/uipath/eval/evaluators_types/MulticlassClassificationEvaluator.json exposes classes, averaging, metric_type, f_value
  • CI passes
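
The two cat checks above could also be scripted. The following is a sketch only: the helper name missing_config_properties is hypothetical, and it assumes the multiclass schema lists its fields under evaluatorConfigSchema.properties in the same way the binary one is described as doing.

```python
import json
from pathlib import Path

# Directory and file names are taken from the test plan above.
SCHEMA_DIR = Path("packages/uipath/src/uipath/eval/evaluators_types")
EXPECTED = {
    "BinaryClassificationEvaluator.json": {"positive_class", "metric_type", "f_value"},
    # Assumption: these also live under evaluatorConfigSchema.properties.
    "MulticlassClassificationEvaluator.json": {
        "classes", "averaging", "metric_type", "f_value",
    },
}


def missing_config_properties(schema: dict, expected: set[str]) -> set[str]:
    """Return the expected property names absent from the config schema."""
    props = schema.get("evaluatorConfigSchema", {}).get("properties", {})
    return expected - set(props)


for name, expected in EXPECTED.items():
    path = SCHEMA_DIR / name
    if not path.exists():
        # Run from the repository root; otherwise the file is skipped.
        print(f"{name}: not found, skipped")
        continue
    missing = missing_config_properties(json.loads(path.read_text()), expected)
    print(f"{name}: {'ok' if not missing else 'missing ' + ', '.join(sorted(missing))}")
```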

Related PRs

  • chore(eval): resync evaluator type schemas with Python source #1664 — companion PR that refreshes the 11 unrelated stale schemas in the same directory (split out for review hygiene; no functional overlap with this PR).
  • UiPath/cli#2128 — TypeScript-side flow-tool registry entries that wire these evaluators into the Flow UI evaluator picker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

🤖 Generated with Claude Code

Generates BinaryClassificationEvaluator.json and MulticlassClassificationEvaluator.json
from the new evaluators added in #1397 so external tooling (Flow UI evaluator
picker, `uip maestro flow eval`) can read the config / criteria / justification
schemas.

Files produced by `python -m uipath.eval.evaluators_types.generate_types`,
restricted to the two new evaluator types. A companion PR refreshes the other
11 stale schemas in evaluators_types/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ajay-kesavan ajay-kesavan force-pushed the feat/classification-evaluator-types branch from 6931598 to 6b11767 May 20, 2026 00:54
@ajay-kesavan ajay-kesavan changed the title from "chore(eval): regenerate evaluator type schemas with classification evaluators" to "feat(eval): add evaluator type schemas for classification evaluators" May 20, 2026
…tors

Adds two sample projects under packages/uipath/samples/ that double as
end-to-end test fixtures for the binary and multiclass classification
evaluators added in #1397:

- binary_classification_agent — rule-based spam/ham classifier wired up
  to the binary classification evaluator with metric_type=precision.
  Eval set is designed so 4/5 datapoints pass but precision is 2/3
  because of one deliberate false positive.
- multiclass_classification_simple — rule-based 3-class router (payments
  / support / spam) wired up to the multiclass classification evaluator
  with macro-averaged F1. Eval set forces a misroute that hurts both
  payments precision and support recall, giving macro F1 = 26/30.

Adds tests/cli/eval/test_classification_samples_e2e.py which loads each
sample's eval-sets/default.json, wires its main.py into a stand-in runtime,
calls evaluate(), and asserts both the per-row scores and the aggregated
metric produced by reduce_scores. Locks in the dataset-level math, not just
per-row correct/incorrect.
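
As an illustration of the dataset-level math for the multiclass case, the sketch below reproduces macro F1 = 26/30 from a hypothetical six-row set with one support ticket misrouted to payments. The rows are invented to match the quoted per-class F1 values (0.8, 0.8, 1.0); they are not the sample's real eval set, and the evaluator's own reduce_scores is not called.

```python
from fractions import Fraction

# Hypothetical (expected, predicted) pairs; not the sample's real datapoints.
rows = [
    ("payments", "payments"),
    ("payments", "payments"),
    ("support", "support"),
    ("support", "support"),
    ("support", "payments"),  # the forced misroute
    ("spam", "spam"),
]
classes = ["payments", "support", "spam"]


def f1_for(cls: str) -> Fraction:
    """One-vs-rest F1 for a single class: 2TP / (2TP + FP + FN)."""
    tp = sum(e == cls and p == cls for e, p in rows)
    fp = sum(e != cls and p == cls for e, p in rows)
    fn = sum(e == cls and p != cls for e, p in rows)
    denom = 2 * tp + fp + fn
    return Fraction(2 * tp, denom) if denom else Fraction(0)


per_class = {c: f1_for(c) for c in classes}
# Macro averaging: unweighted mean of the per-class F1 scores.
macro_f1 = sum(per_class.values()) / len(classes)

print(per_class["payments"], per_class["support"], per_class["spam"])  # 4/5 4/5 1
print(macro_f1)  # 13/15
```

The single misroute counts as a false positive for payments and a false negative for support, so it lowers two per-class scores at once; 13/15 is the reduced form of 26/30.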

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@ajay-kesavan ajay-kesavan changed the title from "feat(eval): add evaluator type schemas for classification evaluators" to "feat(eval): classification evaluator schemas + sample projects + e2e tests" May 20, 2026
@ajay-kesavan

Superseded by #1674 (ClassifierEvaluator). The schema/sample work here was replaced by the simpler single-evaluator approach.
