[OPIK-4957] [SDK] feat: improve evaluation suite run experience and performance #5677
Conversation
Defaults to "low" to reduce latency. Silently dropped for models that don't support it (e.g. gpt-4o-mini). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
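A minimal sketch of the "silently dropped" behavior described in this commit. The helper name, the static model set, and `DEFAULT_REASONING_EFFORT` are illustrative assumptions, not the SDK's actual API (in practice the capability check would likely query LiteLLM's supported-parameter metadata rather than a hardcoded set):

```python
DEFAULT_REASONING_EFFORT = "low"

# Assumption: models known to reject the reasoning_effort parameter.
_MODELS_WITHOUT_REASONING_EFFORT = {"gpt-4o-mini", "gpt-4o"}


def build_completion_kwargs(
    model_name: str,
    reasoning_effort: str = DEFAULT_REASONING_EFFORT,
) -> dict:
    """Build completion kwargs, dropping reasoning_effort for models
    that don't support it instead of raising an error."""
    kwargs = {"model": model_name}
    if model_name not in _MODELS_WITHOUT_REASONING_EFFORT:
        kwargs["reasoning_effort"] = reasoning_effort
    return kwargs
```

With this shape, a judge configured against gpt-4o-mini simply never sends the parameter, while reasoning-capable models default to `"low"` for lower latency.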
…ent link Add a hint message to the progress bar while waiting for the first item to complete, and show a clickable experiment link at the top of the suite results panel. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
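The hint-then-progress flow from this commit can be sketched as follows. This is an illustrative stdlib-only version, not the SDK's implementation (which presumably uses its own progress-bar component); the function and parameter names are hypothetical:

```python
import sys


def run_with_hint(items, process, experiment_url):
    """Show a waiting hint until the first item completes, then a
    plain progress counter; print the experiment link up front."""
    print(f"View the experiment: {experiment_url}")
    results = []
    for i, item in enumerate(items):
        if i == 0:
            # Hint shown while the first (often slowest) item runs.
            sys.stderr.write("Waiting for the first item to complete...\n")
        results.append(process(item))
        sys.stderr.write(f"\rProgress: {i + 1}/{len(items)}")
    sys.stderr.write("\n")
    return results
```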
Use ---BEGIN/END--- delimiters around input and output sections so short agent responses don't blend into the description text. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
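A sketch of the delimiting described here; the helper name is hypothetical, but the `---BEGIN/END---` marker style matches the commit message:

```python
def format_section(label: str, content: str) -> str:
    """Wrap a judge-prompt section in explicit delimiters so a short
    agent response can't be mistaken for description text."""
    return f"---BEGIN {label}---\n{content}\n---END {label}---"
```

For example, a one-word agent output still reads as a clearly bounded block inside the judge prompt rather than running into the surrounding instructions.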
Print "Running evaluation suite, results will be available in Opik dashboard" with a bold cyan clickable link before the progress bar appears, so users can follow along in the UI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
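A bold cyan clickable link can be produced with ANSI styling plus an OSC 8 hyperlink sequence, sketched below. This is an assumption about the mechanism, not the SDK's actual code (the SDK may use a terminal library instead of raw escapes); the URL is a placeholder:

```python
def clickable_link(url: str) -> str:
    """Render url as a bold-cyan OSC 8 hyperlink. Terminals without
    OSC 8 support still display the raw URL text."""
    bold_cyan = "\033[1;36m"
    reset = "\033[0m"
    return f"{bold_cyan}\033]8;;{url}\033\\{url}\033]8;;\033\\{reset}"


print(
    "Running evaluation suite, results will be available in Opik dashboard: "
    + clickable_link("http://localhost:5173/")  # placeholder URL
)
```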
Python SDK Unit Tests Results (Python 3.14): 2 207 tests, 2 207 ✅, 41s ⏱️. Results for commit 26d85c3.
Python SDK Unit Tests Results (Python 3.12): 2 207 tests, 2 207 ✅, 46s ⏱️. Results for commit 26d85c3.
Python SDK Unit Tests Results (Python 3.13): 2 207 tests, 2 207 ✅, 52s ⏱️. Results for commit 26d85c3.
Python SDK Unit Tests Results (Python 3.11): 2 207 tests, 2 207 ✅, 46s ⏱️. Results for commit 26d85c3.
Python SDK Unit Tests Results (Python 3.10): 2 207 tests, 2 207 ✅, 53s ⏱️. Results for commit 26d85c3.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Store reasoning_effort in model.customParameters so persisted online-evaluation configs preserve it. Old configs without it fall back to DEFAULT_REASONING_EFFORT. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
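The persistence fallback in this commit can be sketched as a read against the stored config. The dict shape mirrors the `model.customParameters` path named above, but the accessor is hypothetical (written in Python here for consistency, though the persisted config likely lives outside the SDK):

```python
DEFAULT_REASONING_EFFORT = "low"


def get_reasoning_effort(config: dict) -> str:
    """Read reasoning_effort from a persisted online-evaluation config.
    Old configs written before this change lack the key and fall back
    to DEFAULT_REASONING_EFFORT."""
    return (
        config.get("model", {})
        .get("customParameters", {})
        .get("reasoning_effort", DEFAULT_REASONING_EFFORT)
    )
```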
Python SDK E2E Tests Results (Python 3.11): 244 tests ±0, 242 ✅ ±0, 8m 25s ⏱️ (−1m 16s). Results for commit 26d85c3; comparison against base commit 5204da2. This pull request removes 1 and adds 1 test; renamed tests count towards both.
Python SDK E2E Tests Results (Python 3.12): 244 tests ±0, 242 ✅ ±0, 8m 37s ⏱️ (−1m 10s). Results for commit 26d85c3; comparison against base commit 5204da2. This pull request removes 1 and adds 1 test; renamed tests count towards both.
Python SDK E2E Tests Results (Python 3.14): 244 tests ±0, 242 ✅ ±0, 8m 49s ⏱️ (+18s). Results for commit 8bf6ecc; comparison against base commit 5204da2. This pull request removes 1 and adds 1 test; renamed tests count towards both.
Python SDK E2E Tests Results (Python 3.13): 244 tests ±0, 242 ✅ ±0, 7m 59s ⏱️ (−1m 16s). Results for commit 8bf6ecc; comparison against base commit 5204da2. This pull request removes 1 and adds 1 test; renamed tests count towards both.
Python SDK E2E Tests Results (Python 3.10): 244 tests ±0, 242 ✅ ±0, 8m 26s ⏱️ (−30s). Results for commit 26d85c3; comparison against base commit 5204da2. This pull request removes 1 and adds 1 test; renamed tests count towards both.
📋 PR Linter Failed ❌ Incomplete Issues Section. You must reference at least one GitHub issue (
…udge prompt Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Details
Improve the evaluation suite run experience with several UX and performance enhancements:
- Add `reasoning_effort` support to LLMJudge (defaults to `"low"`) to speed up assertion checking, with graceful filtering for models that don't support it
- Use `---BEGIN/END---` delimiters so short agent outputs don't blend into description text

Change checklist
Issues
AI-WATERMARK
AI-WATERMARK: yes
Testing
Tested manually by running `sdks/python/src/internal_sandbox/evaluation_suite_advanced_example.py` against a local Opik instance and verifying:

- `reasoning_effort=low` is passed through to LiteLLM and filtered out for unsupported models (e.g. gpt-4o-mini)
- `---BEGIN/END---` delimiters around input/output

Documentation