[OPIK-4957] [SDK] feat: improve evaluation suite run experience and performance #5677
Conversation
Defaults to "low" to reduce latency. Silently dropped for models that don't support it (e.g. gpt-4o-mini). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
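A minimal sketch of the "silently dropped" behavior described in this commit. The helper name, the static model set, and `DEFAULT_REASONING_EFFORT` are illustrative assumptions, not the SDK's actual API (in practice the capability check would likely query LiteLLM's supported-parameter metadata rather than a hardcoded set):

```python
DEFAULT_REASONING_EFFORT = "low"

# Assumption: models known to reject the reasoning_effort parameter.
_MODELS_WITHOUT_REASONING_EFFORT = {"gpt-4o-mini", "gpt-4o"}


def build_completion_kwargs(
    model_name: str,
    reasoning_effort: str = DEFAULT_REASONING_EFFORT,
) -> dict:
    """Build completion kwargs, dropping reasoning_effort for models
    that don't support it instead of raising an error."""
    kwargs = {"model": model_name}
    if model_name not in _MODELS_WITHOUT_REASONING_EFFORT:
        kwargs["reasoning_effort"] = reasoning_effort
    return kwargs
```

With this shape, a judge configured against gpt-4o-mini simply never sends the parameter, while reasoning-capable models default to `"low"` for lower latency.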
…ent link Add a hint message to the progress bar while waiting for the first item to complete, and show a clickable experiment link at the top of the suite results panel. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
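The hint-then-progress flow from this commit can be sketched as follows. This is an illustrative stdlib-only version, not the SDK's implementation (which presumably uses its own progress-bar component); the function and parameter names are hypothetical:

```python
import sys


def run_with_hint(items, process, experiment_url):
    """Show a waiting hint until the first item completes, then a
    plain progress counter; print the experiment link up front."""
    print(f"View the experiment: {experiment_url}")
    results = []
    for i, item in enumerate(items):
        if i == 0:
            # Hint shown while the first (often slowest) item runs.
            sys.stderr.write("Waiting for the first item to complete...\n")
        results.append(process(item))
        sys.stderr.write(f"\rProgress: {i + 1}/{len(items)}")
    sys.stderr.write("\n")
    return results
```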
Use ---BEGIN/END--- delimiters around input and output sections so short agent responses don't blend into the description text. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
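A sketch of the delimiting described here; the helper name is hypothetical, but the `---BEGIN/END---` marker style matches the commit message:

```python
def format_section(label: str, content: str) -> str:
    """Wrap a judge-prompt section in explicit delimiters so a short
    agent response can't be mistaken for description text."""
    return f"---BEGIN {label}---\n{content}\n---END {label}---"
```

For example, a one-word agent output still reads as a clearly bounded block inside the judge prompt rather than running into the surrounding instructions.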
Print "Running evaluation suite, results will be available in Opik dashboard" with a bold cyan clickable link before the progress bar appears, so users can follow along in the UI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
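A bold cyan clickable link can be produced with ANSI styling plus an OSC 8 hyperlink sequence, sketched below. This is an assumption about the mechanism, not the SDK's actual code (the SDK may use a terminal library instead of raw escapes); the URL is a placeholder:

```python
def clickable_link(url: str) -> str:
    """Render url as a bold-cyan OSC 8 hyperlink. Terminals without
    OSC 8 support still display the raw URL text."""
    bold_cyan = "\033[1;36m"
    reset = "\033[0m"
    return f"{bold_cyan}\033]8;;{url}\033\\{url}\033]8;;\033\\{reset}"


print(
    "Running evaluation suite, results will be available in Opik dashboard: "
    + clickable_link("http://localhost:5173/")  # placeholder URL
)
```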
Python SDK Unit Tests Results (Python 3.14): 2 207 tests, 2 207 ✅, 41s ⏱️. Results for commit 26d85c3.
Python SDK Unit Tests Results (Python 3.12): 2 207 tests, 2 207 ✅, 46s ⏱️. Results for commit 26d85c3.
Python SDK Unit Tests Results (Python 3.13): 2 207 tests, 2 207 ✅, 52s ⏱️. Results for commit 26d85c3.
Python SDK Unit Tests Results (Python 3.11): 2 207 tests, 2 207 ✅, 46s ⏱️. Results for commit 26d85c3.
Python SDK Unit Tests Results (Python 3.10): 2 207 tests, 2 207 ✅, 53s ⏱️. Results for commit 26d85c3.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Store reasoning_effort in model.customParameters so persisted online-evaluation configs preserve it. Old configs without it fall back to DEFAULT_REASONING_EFFORT. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
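The persistence fallback in this commit can be sketched as a read against the stored config. The dict shape mirrors the `model.customParameters` path named above, but the accessor is hypothetical (written in Python here for consistency, though the persisted config likely lives outside the SDK):

```python
DEFAULT_REASONING_EFFORT = "low"


def get_reasoning_effort(config: dict) -> str:
    """Read reasoning_effort from a persisted online-evaluation config.
    Old configs written before this change lack the key and fall back
    to DEFAULT_REASONING_EFFORT."""
    return (
        config.get("model", {})
        .get("customParameters", {})
        .get("reasoning_effort", DEFAULT_REASONING_EFFORT)
    )
```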
Python SDK E2E Tests Results (Python 3.11): 244 tests ±0, 242 ✅ ±0, 8m 25s ⏱️ (−1m 16s). Results for commit 26d85c3; comparison against base commit 5204da2. This pull request removes 1 and adds 1 test; renamed tests count towards both.
Python SDK E2E Tests Results (Python 3.12): 244 tests ±0, 242 ✅ ±0, 8m 37s ⏱️ (−1m 10s). Results for commit 26d85c3; comparison against base commit 5204da2. This pull request removes 1 and adds 1 test; renamed tests count towards both.
Python SDK E2E Tests Results (Python 3.14): 244 tests ±0, 242 ✅ ±0, 8m 49s ⏱️ (+18s). Results for commit 8bf6ecc; comparison against base commit 5204da2. This pull request removes 1 and adds 1 test; renamed tests count towards both.
Python SDK E2E Tests Results (Python 3.13): 244 tests ±0, 242 ✅ ±0, 7m 59s ⏱️ (−1m 16s). Results for commit 8bf6ecc; comparison against base commit 5204da2. This pull request removes 1 and adds 1 test; renamed tests count towards both.
Python SDK E2E Tests Results (Python 3.10): 244 tests ±0, 242 ✅ ±0, 8m 26s ⏱️ (−30s). Results for commit 26d85c3; comparison against base commit 5204da2. This pull request removes 1 and adds 1 test; renamed tests count towards both.
📋 PR Linter Failed ❌ Incomplete Issues Section. You must reference at least one GitHub issue (
…udge prompt Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Details
Improve the evaluation suite run experience with several UX and performance enhancements:
- Add `reasoning_effort` support to LLMJudge (defaults to `"low"`) to speed up assertion checking, with graceful filtering for models that don't support it
- Use `---BEGIN/END---` delimiters so short agent outputs don't blend into description text

Change checklist
Issues
AI-WATERMARK
AI-WATERMARK: yes
Testing
Tested manually by running `sdks/python/src/internal_sandbox/evaluation_suite_advanced_example.py` against a local Opik instance and verifying:

- `reasoning_effort=low` is passed through to LiteLLM and filtered out for unsupported models (e.g. gpt-4o-mini)
- `---BEGIN/END---` delimiters around input/output

Documentation