[test] Mitigate flaky live-LLM e2e/cross-language CI tests #717
Merged
The Python-wrapped Ollama connection in ChatModelCrossLanguageAgent fell back to the Python default request timeout of 30s, which was intermittently exceeded under CI load and surfaced as httpx.ReadTimeout in the cross-language job. Set it to 240s, matching the Java-native connection in the same agent. The snake_case key is forwarded as a kwarg to the Python OllamaChatModelConnection constructor. Part of apache#716.
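As an illustration of why the missing key mattered, here is a minimal sketch of kwarg forwarding with a constructor default. It assumes a stand-in class, not the real OllamaChatModelConnection: FakeOllamaConnection and build_connection are hypothetical names, and only the request_timeout key and the 30 s / 240 s values come from the commit message above.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class FakeOllamaConnection:
    """Hypothetical stand-in for the Python connection class (not the real API)."""

    request_timeout: float = 30.0  # the default named in the commit message


def build_connection(config: dict[str, Any]) -> FakeOllamaConnection:
    """Forward every config key to the constructor as a keyword argument."""
    return FakeOllamaConnection(**config)


# Key omitted: the constructor default silently applies.
before = build_connection({})
# snake_case key present: it reaches the constructor unchanged.
after = build_connection({"request_timeout": 240.0})

print(before.request_timeout, after.request_timeout)
```

Because the key name must match the constructor parameter exactly, a differently cased key would raise a TypeError in this sketch instead of being applied.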
The live-LLM e2e and cross-language tests run a small Ollama model (qwen3:1.7b) and intermittently fail on non-deterministic tool-call results or Ollama read timeouts, turning CI red on unrelated changes. Retry these suites automatically, scoped to the e2e/cross-language invocations only so unit and style runs stay deterministic:

- Python: pytest-rerunfailures with --reruns 2 --reruns-delay 5 on the e2e pytest calls in tools/ut.sh and tools/e2e.sh.
- Java: -Dsurefire.rerunFailingTestsCount=2 on the e2e mvn calls in tools/ut.sh and test_resource_cross_language.sh.

A test that passes on retry yields a green build but is reported as a flake, so the signal is preserved rather than masked. Part of apache#716.
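The retry-and-report behaviour described above can be modelled in a few lines. This is a simplified sketch of the semantics only, not how pytest-rerunfailures or Surefire are implemented; run_with_reruns and make_flaky are hypothetical helpers, and the delay is set to zero so the example runs instantly.

```python
import time
from typing import Callable


def run_with_reruns(test: Callable[[], None], reruns: int = 2, delay: float = 0.0) -> str:
    """Run `test` once, then up to `reruns` more times if it raises AssertionError.

    Returns "passed", "flaky" (passed only on a rerun), or "failed".
    """
    for attempt in range(reruns + 1):
        try:
            test()
        except AssertionError:
            if attempt < reruns:
                time.sleep(delay)  # pause before the next attempt
        else:
            return "passed" if attempt == 0 else "flaky"
    return "failed"


def make_flaky(failures: int) -> Callable[[], None]:
    """Build a test that fails its first `failures` calls, then passes."""
    calls = {"n": 0}

    def test() -> None:
        calls["n"] += 1
        assert calls["n"] > failures, "simulated non-deterministic result"

    return test


stable = run_with_reruns(make_flaky(0))   # passes on the first attempt
flaky = run_with_reruns(make_flaky(1))    # fails once, passes on the first rerun
broken = run_with_reruns(make_flaky(99))  # fails every attempt

print(stable, flaky, broken)
```

The distinct "flaky" outcome is the point: a test that needed a rerun does not fail the run, but it is not indistinguishable from a clean pass either.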
Linked issue: #716
Purpose of change
The live-LLM e2e and cross-language CI tests run a small Ollama model (qwen3:1.7b) and fail intermittently: either the model returns a wrong tool-call result (e.g. assert <varies> == 1386528) or an Ollama call exceeds its read timeout (httpx.ReadTimeout). These flakes reproduce across many branches including main, turning CI red on unrelated PRs. See #716 for the failure statistics and evidence.

This PR mitigates the flakiness without masking real, deterministic failures:

- Per-test retry, scoped to the live-LLM e2e/cross-language suites only. Python uses pytest-rerunfailures (--reruns 2 --reruns-delay 5); Java uses Surefire -Dsurefire.rerunFailingTestsCount=2. Both are applied only at the e2e/cross-language test invocations; the unit and style invocations are untouched, so a genuine regression still fails immediately. A test that passes on retry produces a green build but is reported as a flake (pytest R markers / Surefire "Flakes"), so the signal is preserved, not hidden.
- Close the one remaining 30 s timeout gap. The cross-language test's Python-wrapped Ollama connection fell back to the Python default request_timeout of 30 s; this sets it to 240 s, matching the Java-native connection already configured in the same test agent.

Loosening exact-equality assertions on LLM output (asserting tool-call shape rather than an exact value) is noted in #716 as longer-term hardening and is intentionally out of scope here.
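The scoping rule for the Python side can be sketched as a small argument builder. The real invocations live in shell scripts, so this Python helper is purely illustrative: pytest_argv and the suite names are hypothetical, while the flags themselves are the ones quoted in this description.

```python
RERUN_FLAGS = ["--reruns", "2", "--reruns-delay", "5"]


def pytest_argv(suite: str) -> list[str]:
    """Build a pytest command line; only the e2e suite gets rerun flags."""
    argv = ["pytest"]
    if suite == "e2e":
        # Live-LLM tests: tolerate non-deterministic failures via reruns.
        argv += RERUN_FLAGS
    elif suite == "unit":
        # Unit tests: no reruns, so a genuine regression fails immediately.
        argv += ["-k", "not e2e_tests"]
    else:
        raise ValueError(f"unknown suite: {suite}")
    return argv


print(pytest_argv("e2e"))
print(pytest_argv("unit"))
```

Keeping the rerun flags out of the unit command line means a deterministic failure there is never retried into a pass.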
Tests
This is a CI/test-configuration change, so no new product tests are added (a test asserting "retry is configured" would be tautological). Verified:
- The unit pytest (-k "not e2e_tests") and unit mvn (-pl "${exclude_list}") invocations are unchanged.
- pytest-rerunfailures==16.3 resolves against pytest==9.0.3, and the e2e jobs install it via tools/build.sh's uv sync --extra dev (the dev extra composes test) before the --no-sync pytest runs.
- mvn ... -Dsurefire.rerunFailingTestsCount=2 test-compile is accepted on Surefire 3.5.2; the cross-language module compiles with the timeout change.
API
No public API change. All changes are test-configuration (CI scripts, the Python test extra, and one e2e test agent).
Documentation
- doc-needed
- doc-not-needed
- doc-included