
[OPIK-4957] [SDK] feat: improve evaluation suite run experience and performance#5677

Merged
alexkuzmik merged 8 commits into main from alexkuzmik/improve-evaluation-suite-run-experience-and-performance on Mar 17, 2026

Conversation


alexkuzmik (Collaborator) commented Mar 16, 2026

Details

Improve the evaluation suite run experience with several UX and performance enhancements:

  • Add reasoning_effort support to LLMJudge (defaults to "low") to speed up assertion checking, with graceful filtering for models that don't support it
  • Show a progress bar hint ("there might be a delay before the first items are processed") until the first item completes, so it doesn't look hung
  • Show a clickable dashboard link before evaluation starts so users can follow along in the UI
  • Move the experiment link inside the results panel (bold cyan, at the top) and remove the redundant standalone link
  • Clarify LLM judge prompt with explicit ---BEGIN/END--- delimiters so short agent outputs don't blend into description text
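
The graceful filtering described in the first bullet can be sketched roughly as follows. This is an illustrative sketch only: the helper name, the unsupported-model set, and the kwargs shape are assumptions, not the SDK's actual code.

```python
# Sketch: drop reasoning_effort before calling models that reject it.
# UNSUPPORTED_REASONING_MODELS and build_completion_kwargs are hypothetical names.
UNSUPPORTED_REASONING_MODELS = {"gpt-4o-mini", "gpt-4o"}

DEFAULT_REASONING_EFFORT = "low"  # the PR's default, chosen to reduce latency


def build_completion_kwargs(model: str, **kwargs) -> dict:
    """Build LiteLLM-style kwargs, silently filtering unsupported parameters."""
    params = {"model": model, "reasoning_effort": DEFAULT_REASONING_EFFORT, **kwargs}
    if model in UNSUPPORTED_REASONING_MODELS:
        # Graceful filtering: drop the parameter instead of letting the call fail.
        params.pop("reasoning_effort", None)
    return params
```

The key design point is that the filter is silent: callers never need to know whether the target model accepts `reasoning_effort`.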

Change checklist

  • User facing
  • Documentation update

Issues

  • Resolves #
  • OPIK-4957

AI-WATERMARK

AI-WATERMARK: yes

  • If yes:
    • Tools: Claude Code
    • Model(s): Claude Opus 4.6
    • Scope: implementation
    • Human verification: yes, manual testing with evaluation_suite_advanced_example.py

Testing

Tested manually by running sdks/python/src/internal_sandbox/evaluation_suite_advanced_example.py against a local Opik instance and verifying:

  • Progress bar shows hint message while at 0%, then reverts to "Evaluation" after first item completes
  • Clickable "Opik dashboard" link appears before the progress bar
  • Results panel shows bold cyan clickable link at the top
  • reasoning_effort=low is passed through to LiteLLM and filtered out for unsupported models (e.g. gpt-4o-mini)
  • LLM judge prompt uses ---BEGIN/END--- delimiters around input/output
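
The delimiter change from the last bullet can be illustrated with a minimal prompt builder. The surrounding template text here is a placeholder, not the SDK's real LLMJudge prompt; only the `---BEGIN/END---` fencing idea comes from the PR.

```python
def wrap_section(name: str, content: str) -> str:
    """Fence a prompt section so short content doesn't blend into prose."""
    return f"---BEGIN {name}---\n{content}\n---END {name}---"


def build_judge_prompt(task_input: str, task_output: str) -> str:
    # Hypothetical template; the real judge prompt differs.
    return (
        "Evaluate the agent output against the input.\n\n"
        + wrap_section("INPUT", task_input)
        + "\n\n"
        + wrap_section("OUTPUT", task_output)
    )
```

Explicit fences matter most for very short outputs (e.g. a one-word answer), which otherwise read as part of the instruction text.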

Documentation

alexkuzmik and others added 5 commits March 13, 2026 21:39
Defaults to "low" to reduce latency. Silently dropped for
models that don't support it (e.g. gpt-4o-mini).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ent link

Add a hint message to the progress bar while waiting for the first item
to complete, and show a clickable experiment link at the top of the
suite results panel.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
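The hint behavior this commit describes can be modeled as a tiny state function (a hypothetical sketch, not the SDK's actual progress-bar code; the hint wording is taken from the PR description):

```python
HINT = "Evaluation (there might be a delay before the first items are processed)"


def progress_description(completed_items: int) -> str:
    """Show the delay hint until the first item completes, then revert."""
    return HINT if completed_items == 0 else "Evaluation"
```

The bar's renderer would re-read this description on every tick, so the label flips automatically once the first result arrives.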
Use ---BEGIN/END--- delimiters around input and output sections so
short agent responses don't blend into the description text.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Print "Running evaluation suite, results will be available in Opik
dashboard" with a bold cyan clickable link before the progress bar
appears, so users can follow along in the UI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
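Clickable terminal links like the one this commit adds are typically rendered either through a library such as Rich or directly via the OSC 8 escape sequence. A dependency-free sketch of the underlying escape (the URL is a placeholder; bold-cyan styling would be layered on separately):

```python
def osc8_link(url: str, text: str) -> str:
    """Wrap text in an OSC 8 terminal hyperlink escape sequence."""
    return f"\x1b]8;;{url}\x1b\\{text}\x1b]8;;\x1b\\"


# Example: print a clickable "Opik dashboard" link (URL is illustrative).
# print("Results will be available in "
#       + osc8_link("http://localhost:5173/default/experiments", "Opik dashboard"))
```

Terminals without OSC 8 support simply show the plain text, so the fallback is harmless.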
alexkuzmik requested a review from a team as a code owner March 16, 2026 12:20
github-actions bot added the python (Pull requests that update Python code) and Python SDK labels Mar 16, 2026

github-actions bot commented Mar 16, 2026

Python SDK Unit Tests Results (Python 3.14)

2 207 tests · 2 207 ✅ · 0 💤 · 0 ❌ · 1 suite · 1 file · 41s ⏱️

Results for commit 26d85c3.

♻️ This comment has been updated with latest results.


github-actions bot commented Mar 16, 2026

Python SDK Unit Tests Results (Python 3.12)

2 207 tests · 2 207 ✅ · 0 💤 · 0 ❌ · 1 suite · 1 file · 46s ⏱️

Results for commit 26d85c3.

♻️ This comment has been updated with latest results.


github-actions bot commented Mar 16, 2026

Python SDK Unit Tests Results (Python 3.13)

2 207 tests · 2 207 ✅ · 0 💤 · 0 ❌ · 1 suite · 1 file · 52s ⏱️

Results for commit 26d85c3.

♻️ This comment has been updated with latest results.


github-actions bot commented Mar 16, 2026

Python SDK Unit Tests Results (Python 3.11)

2 207 tests · 2 207 ✅ · 0 💤 · 0 ❌ · 1 suite · 1 file · 46s ⏱️

Results for commit 26d85c3.

♻️ This comment has been updated with latest results.


github-actions bot commented Mar 16, 2026

Python SDK Unit Tests Results (Python 3.10)

2 207 tests · 2 207 ✅ · 0 💤 · 0 ❌ · 1 suite · 1 file · 53s ⏱️

Results for commit 26d85c3.

♻️ This comment has been updated with latest results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
github-actions bot added the tests (Including test files, or tests related like configuration) label Mar 16, 2026
Store reasoning_effort in model.customParameters so persisted
online-evaluation configs preserve it. Old configs without it
fall back to DEFAULT_REASONING_EFFORT.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
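The fallback this commit describes can be sketched as follows. Field names follow the commit message (`customParameters`, `DEFAULT_REASONING_EFFORT`); the function name and config shape are illustrative.

```python
DEFAULT_REASONING_EFFORT = "low"


def resolve_reasoning_effort(model_config: dict) -> str:
    """Read reasoning_effort from a persisted online-evaluation config,
    falling back to the default for old configs that never stored it."""
    custom = model_config.get("customParameters") or {}
    return custom.get("reasoning_effort", DEFAULT_REASONING_EFFORT)
```

Storing the value in the persisted config means newly saved evaluations round-trip their setting, while pre-existing configs keep working unchanged.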

github-actions bot commented Mar 16, 2026

Python SDK E2E Tests Results (Python 3.11)

244 tests (±0) · 242 ✅ (±0) · 2 💤 (±0) · 0 ❌ (±0) · 1 suite · 1 file · 8m 25s ⏱️ (−1m 16s)

Results for commit 26d85c3. ± Comparison against base commit 5204da2.

This pull request removes 1 test and adds 1 test. Note that renamed tests count towards both.
tests.e2e.test_tracing ‑ test_opik_client__update_trace__happy_flow[None-None-None-None-019cf68c-13e6-74b5-9f64-66d9f84dd81e]
tests.e2e.test_tracing ‑ test_opik_client__update_trace__happy_flow[None-None-None-None-019cfb31-955b-7ea9-b8c6-5f787aaf7547]

♻️ This comment has been updated with latest results.


github-actions bot commented Mar 16, 2026

Python SDK E2E Tests Results (Python 3.12)

244 tests (±0) · 242 ✅ (±0) · 2 💤 (±0) · 0 ❌ (±0) · 1 suite · 1 file · 8m 37s ⏱️ (−1m 10s)

Results for commit 26d85c3. ± Comparison against base commit 5204da2.

This pull request removes 1 test and adds 1 test. Note that renamed tests count towards both.
tests.e2e.test_tracing ‑ test_opik_client__update_trace__happy_flow[None-None-None-None-019cf687-f22e-7121-9a98-0fc09339fb50]
tests.e2e.test_tracing ‑ test_opik_client__update_trace__happy_flow[None-None-None-None-019cfb31-bf6c-779e-a5eb-8e2949333474]

♻️ This comment has been updated with latest results.

@github-actions

Python SDK E2E Tests Results (Python 3.14)

244 tests (±0) · 242 ✅ (±0) · 2 💤 (±0) · 0 ❌ (±0) · 1 suite · 1 file · 8m 49s ⏱️ (+18s)

Results for commit 8bf6ecc. ± Comparison against base commit 5204da2.

This pull request removes 1 test and adds 1 test. Note that renamed tests count towards both.
tests.e2e.test_tracing ‑ test_opik_client__update_trace__happy_flow[None-None-None-None-019cf67d-7f62-74f1-8fbb-a7955bd6fb95]
tests.e2e.test_tracing ‑ test_opik_client__update_trace__happy_flow[None-None-None-None-019cf6e1-57d4-78fd-8f63-aadd6c3a35c9]

@github-actions

Python SDK E2E Tests Results (Python 3.13)

244 tests (±0) · 242 ✅ (±0) · 2 💤 (±0) · 0 ❌ (±0) · 1 suite · 1 file · 7m 59s ⏱️ (−1m 16s)

Results for commit 8bf6ecc. ± Comparison against base commit 5204da2.

This pull request removes 1 test and adds 1 test. Note that renamed tests count towards both.
tests.e2e.test_tracing ‑ test_opik_client__update_trace__happy_flow[None-None-None-None-019cf686-0f7a-742e-a9d8-3834416991b2]
tests.e2e.test_tracing ‑ test_opik_client__update_trace__happy_flow[None-None-None-None-019cf6e2-94b6-7386-9b61-84423e7441f2]


github-actions bot commented Mar 16, 2026

Python SDK E2E Tests Results (Python 3.10)

244 tests (±0) · 242 ✅ (±0) · 2 💤 (±0) · 0 ❌ (±0) · 1 suite · 1 file · 8m 26s ⏱️ (−30s)

Results for commit 26d85c3. ± Comparison against base commit 5204da2.

This pull request removes 1 test and adds 1 test. Note that renamed tests count towards both.
tests.e2e.test_tracing ‑ test_opik_client__update_trace__happy_flow[None-None-None-None-019cf67f-ffba-7f49-9784-0623efef0db2]
tests.e2e.test_tracing ‑ test_opik_client__update_trace__happy_flow[None-None-None-None-019cfb31-d453-70ba-a836-89e1a4a413a0]

♻️ This comment has been updated with latest results.

alexkuzmik changed the title from "[NA] [SDK] feat: improve evaluation suite run experience and performance" to "[OPIK-4957] [SDK] feat: improve evaluation suite run experience and performance" Mar 17, 2026
@github-actions

📋 PR Linter Failed

Incomplete Issues Section. You must reference at least one GitHub issue (#xxxx), Jira ticket (OPIK-xxxx), CUST ticket (CUST-xxxx), DEV ticket (DEV-xxxx), or DND ticket (DND-xxxx) under the ## Issues section.

…udge prompt

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
alexkuzmik merged commit 01f8953 into main Mar 17, 2026
120 checks passed
alexkuzmik deleted the alexkuzmik/improve-evaluation-suite-run-experience-and-performance branch March 17, 2026 10:01

Labels

python (Pull requests that update Python code) · Python SDK · tests (Including test files, or tests related like configuration)
