feat(eval): Add span output attributes and metadata for evaluation spans #1079

AAgnihotry · 2026-01-09T10:53:59Z

Summary

This PR adds output attributes and metadata to the three evaluation span types ("Evaluation Set Run", "Evaluation", and "Evaluation output"), following the AgentOutputSpanAttributes pattern with Pydantic models.

Changes

New Module: `_span_utils.py`

Created a centralized utility module for evaluation span configuration:

Pydantic Models:

EvalSetRunOutput: Output model for "Evaluation Set Run" spans (score as int)
EvaluationOutput: Output model for "Evaluation" spans (score as int)
EvaluationOutputSpanOutput: Output model for "Evaluation output" spans (type, value, justification)
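
For readers who have not opened the diff, the three models above could look roughly like the sketch below. It assumes Pydantic v2. Only the class names and the field names (score, type, value, justification) come from this description; the types chosen for type, value, and justification, and the use of `model_dump_json()` to build the span's output string, are assumptions rather than the actual contents of `_span_utils.py`.

```python
import json
from typing import Optional

from pydantic import BaseModel


class EvalSetRunOutput(BaseModel):
    """Output for "Evaluation Set Run" spans."""

    score: int


class EvaluationOutput(BaseModel):
    """Output for "Evaluation" spans."""

    score: int


class EvaluationOutputSpanOutput(BaseModel):
    """Output for "Evaluation output" spans."""

    # The field types below are assumptions; the PR text only names the fields.
    type: str
    value: float
    justification: Optional[str] = None


# Serialize a model to the JSON string that would be stored on the span.
run_output_json = EvalSetRunOutput(score=85).model_dump_json()
print(json.loads(run_output_json))
```

Keeping the payload in a model means the JSON shape is validated once, at construction, instead of being assembled by hand at every call site.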

Calculation Functions:

calculate_overall_score(): Calculates average across all evaluators
calculate_evaluation_average_score(): Calculates average for a single evaluation
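
The averaging described above can be sketched as follows. The function names, the input shape (a mapping from evaluator name to its scores), the choice to average per-evaluator means rather than all scores at once, and the 0.0 fallback for empty input are all assumptions made for illustration; the real signatures in `_span_utils.py` may differ.

```python
from statistics import fmean
from typing import Mapping, Sequence


def sketch_evaluation_average(scores: Sequence[float]) -> float:
    """Mean score for a single evaluation; 0.0 when there are no scores (assumed)."""
    return fmean(scores) if scores else 0.0


def sketch_overall_score(scores_by_evaluator: Mapping[str, Sequence[float]]) -> float:
    """Mean of the per-evaluator means, skipping evaluators without results (assumed)."""
    per_evaluator = [fmean(s) for s in scores_by_evaluator.values() if s]
    return fmean(per_evaluator) if per_evaluator else 0.0


print(sketch_evaluation_average([80, 90, 100]))
print(sketch_overall_score({"exact_match": [100, 50], "llm_judge": [25]}))
```

Averaging the per-evaluator means weights every evaluator equally, regardless of how many data points each one scored. A flat mean over all scores is the other obvious choice and would favor evaluators with more results.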

Low-level Attribute Setters:

set_eval_set_run_output_and_metadata(): Sets output and metadata for eval set run spans
set_evaluation_output_and_metadata(): Sets output and metadata for evaluation spans
set_evaluation_output_span_output(): Sets output for evaluation output spans

High-level Configuration Functions:

configure_eval_set_run_span(): Complete configuration including schema retrieval and score calculation
configure_evaluation_span(): Complete configuration with error handling

Updated: `_runtime.py`

Refactored to use utility functions from _span_utils.py
Simplified code from ~30 lines to ~6 lines per span type
All three span types now properly set output and metadata attributes

Span Attributes Added

All evaluation spans now include:

output: JSON string containing score (for eval set run and evaluation) or type/value/justification (for evaluation output)
agentId: execution ID
agentName: "N/A"
inputSchema: runtime input schema as JSON string
outputSchema: runtime output schema as JSON string
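
To make the attribute layout concrete, here is a minimal sketch of a setter that writes the five attributes listed above. The attribute keys and the "N/A" agent name come from this description. The function name, its parameters, and the `RecordingSpan` stand-in are hypothetical; the only thing assumed of a real span is OpenTelemetry's `set_attribute(key, value)` method.

```python
import json
from typing import Any, Dict


def sketch_set_output_and_metadata(
    span: Any,
    *,
    output: Dict[str, Any],
    execution_id: str,
    input_schema: Dict[str, Any],
    output_schema: Dict[str, Any],
) -> None:
    """Write output and metadata as string attributes on a span-like object."""
    # Span attribute values must be primitives, so dicts are stored as JSON strings.
    span.set_attribute("output", json.dumps(output))
    span.set_attribute("agentId", execution_id)
    span.set_attribute("agentName", "N/A")
    span.set_attribute("inputSchema", json.dumps(input_schema))
    span.set_attribute("outputSchema", json.dumps(output_schema))


class RecordingSpan:
    """Hypothetical stand-in that records attributes instead of exporting them."""

    def __init__(self) -> None:
        self.attributes: Dict[str, Any] = {}

    def set_attribute(self, key: str, value: Any) -> None:
        self.attributes[key] = value


span = RecordingSpan()
sketch_set_output_and_metadata(
    span,
    output={"score": 85},
    execution_id="exec-123",  # placeholder execution ID
    input_schema={"type": "object"},
    output_schema={"type": "object"},
)
print(sorted(span.attributes))
```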

Tests

Unit Tests (test_eval_span_utils.py): 19 new tests

Pydantic model serialization tests
Calculation function tests
Low-level span attribute setting tests
High-level configuration function tests with async/await

Integration Tests (test_eval_tracing_integration.py): 3 new tests

Verification that "Evaluation Set Run" span has output with score
Verification that "Evaluation" span has metadata attributes
Verification that "Evaluation output" span has type, value, and justification

Span Attribute Tests (test_eval_runtime_spans.py): 13 new tests

Tests for output attributes on all three span types
Tests for metadata attributes (agentId, agentName, schemas)
Tests for proper JSON structure and types

Test Infrastructure Fix:

Fixed SpanCapturingTracer to capture attributes set via span.set_attribute() (not just initial attributes)
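
The kind of problem this fix addresses can be illustrated with a small test double. The sketch below is not the repository's SpanCapturingTracer; the class names and the `start_span` signature are hypothetical. It shows one way to make attributes set after span creation visible to assertions: the tracer keeps a reference to the span object itself, and the span updates its own attribute dict on every `set_attribute()` call.

```python
from typing import Any, Dict, List, Optional


class CapturedSpan:
    """Hypothetical test span that keeps a live, mutable attribute dict."""

    def __init__(self, name: str, attributes: Optional[Dict[str, Any]] = None) -> None:
        self.name = name
        # Copy the initial attributes so later writes do not mutate the caller's dict.
        self.attributes: Dict[str, Any] = dict(attributes or {})

    def set_attribute(self, key: str, value: Any) -> None:
        # Attributes set after creation land in the same dict the test inspects.
        self.attributes[key] = value


class CapturingTracerSketch:
    """Hypothetical tracer that remembers every span it starts."""

    def __init__(self) -> None:
        self.spans: List[CapturedSpan] = []

    def start_span(self, name: str, attributes: Optional[Dict[str, Any]] = None) -> CapturedSpan:
        span = CapturedSpan(name, attributes)
        # Store the span object, not a snapshot of its initial attributes.
        self.spans.append(span)
        return span


tracer = CapturingTracerSketch()
span = tracer.start_span("Evaluation", attributes={"span_type": "evaluation"})
span.set_attribute("output", '{"score": 85}')
print(tracer.spans[0].attributes)
```

A capturing tracer that stores only a snapshot of the initial attributes would miss the `output` key here, which is why tests for attributes written late in a span's lifetime need the live reference.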

Test Results

✅ All 1531 tests passing (7 skipped for authentication)
✅ 19 new unit tests for span utilities
✅ 3 new integration tests for span attributes
✅ 13 new span attribute tests
✅ Linting: ruff check passed
✅ Formatting: ruff format passed
✅ Type checking: mypy passed

Files Changed

src/uipath/_cli/_evals/_span_utils.py (new, 290 lines)
src/uipath/_cli/_evals/_runtime.py (40 lines modified)
tests/cli/eval/test_eval_span_utils.py (new, 462 lines)
tests/cli/eval/test_eval_runtime_spans.py (184 lines added)
tests/cli/eval/test_eval_tracing_integration.py (297 lines added)

Total: 1,268 lines added, 5 lines removed

🤖 Generated with Claude Code

Development Package

Use uipath pack --nolock to get the latest dev build from this PR (requires version range).
Add this package as a dependency in your pyproject.toml:

[project]
dependencies = [
  # Exact version:
  "uipath==2.4.9.dev1010793680",

  # Any version from PR
  "uipath>=2.4.9.dev1010790000,<2.4.9.dev1010800000"
]

[[tool.uv.index]]
name = "testpypi"
url = "https://test.pypi.org/simple/"
publish-url = "https://test.pypi.org/legacy/"
explicit = true

[tool.uv.sources]
uipath = { index = "testpypi" }

[tool.uv]
override-dependencies = [
    "uipath>=2.4.9.dev1010790000,<2.4.9.dev1010800000",
]

- Created _span_utils.py module with Pydantic models for span outputs
  - EvalSetRunOutput: for "Evaluation Set Run" spans
  - EvaluationOutput: for "Evaluation" spans
  - EvaluationOutputSpanOutput: for "Evaluation output" spans
- Added calculation functions for overall and evaluation average scores
- Added low-level functions to set span attributes (output, agentId, agentName, schemas)
- Added high-level configuration functions for complete span setup
- Refactored _runtime.py to use utility functions (reduced from ~30 to ~6 lines per span)
- Added comprehensive unit tests (19 tests in test_eval_span_utils.py)
- Added integration tests (3 tests in test_eval_tracing_integration.py)
- Added span attribute tests (13 tests in test_eval_runtime_spans.py)
- Fixed SpanCapturingTracer to capture attributes set via set_attribute()

All spans now include:
- output: JSON with score for eval set run and evaluation spans
- output: JSON with type, value, justification for evaluation output spans
- agentId: execution ID
- agentName: "N/A"
- inputSchema: runtime input schema as JSON
- outputSchema: runtime output schema as JSON

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Changed import from opentelemetry.sdk.trace.Span to opentelemetry.trace.Span (protocol)
- Added proper type annotations to MockSpan class
- Added None checks before accessing Status attributes (status_code, description)
- Fixed __str__ mock configuration with proper lambda signature
- Added type: ignore comments for MockSpan arg-type compatibility in tests

All mypy checks now pass with no errors.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

github-actions bot added the test:uipath-langchain (triggers tests in the uipath-langchain-python repository) and test:uipath-llamaindex (triggers tests in the uipath-llamaindex-python repository) labels Jan 9, 2026

Chibionos approved these changes Jan 9, 2026


AAgnihotry added the build:dev (create a dev build from the PR) label Jan 9, 2026

AAgnihotry added 4 commits January 9, 2026 11:02

fix: bump the version

35284b7

feat: add input and evaluatorId

c400a52

fix: remove schemas from evaluation

2899544

fix: the rendering of schemas in evaluation set run span

3325cd3

AAgnihotry force-pushed the feat/spanAttr branch from 3cc9992 to 3325cd3 Compare January 9, 2026 19:53

AAgnihotry merged commit 626f0c2 into main Jan 9, 2026
120 of 121 checks passed

AAgnihotry deleted the feat/spanAttr branch January 9, 2026 20:10
