
[test] Mitigate flaky live-LLM e2e/cross-language CI tests#717

Merged
xintongsong merged 2 commits into apache:main from weiqingy:716-impl on May 31, 2026

@weiqingy
Collaborator

Linked issue: #716

Purpose of change

The live-LLM e2e and cross-language CI tests run a small Ollama model (qwen3:1.7b) and fail intermittently — either the model returns a wrong tool-call result (e.g. assert <varies> == 1386528) or an Ollama call exceeds its read timeout (httpx.ReadTimeout). These flakes reproduce across many branches including main, turning CI red on unrelated PRs. See #716 for the failure statistics and evidence.

This PR mitigates the flakiness without masking real, deterministic failures:

  1. Per-test retry, scoped to the live-LLM e2e/cross-language suites only. Python uses pytest-rerunfailures (--reruns 2 --reruns-delay 5); Java uses Surefire -Dsurefire.rerunFailingTestsCount=2. Both are applied only at the e2e/cross-language test invocations — the unit and style invocations are untouched, so a genuine regression still fails immediately. A test that passes on retry produces a green build but is reported as a flake (pytest R markers / Surefire "Flakes"), so the signal is preserved, not hidden.

  2. Close the one remaining 30 s timeout gap. The cross-language test's Python-wrapped Ollama connection fell back to the Python default request_timeout of 30 s; this sets it to 240 s, matching the Java-native connection already configured in the same test agent.
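The retry behavior described in item 1 can be pictured with a small, self-contained simulation. This is only a sketch of the general "rerun on failure" idea, not the actual logic of pytest-rerunfailures or Surefire; the names `run_with_reruns`, `make_flaky`, and `always_fails` are hypothetical and exist only for illustration.

```python
from typing import Callable, Tuple


def run_with_reruns(test: Callable[[], None], reruns: int = 2) -> Tuple[str, int]:
    """Run `test` once, then up to `reruns` more times while it keeps failing.

    Returns (outcome, attempts), where outcome is one of:
      "passed" - succeeded on the first attempt
      "flaky"  - failed at least once, then succeeded on a rerun
      "failed" - failed on every attempt
    """
    attempts = 0
    for attempt in range(1 + reruns):
        attempts += 1
        try:
            test()
        except AssertionError:
            # A real runner could also sleep here (compare --reruns-delay).
            continue
        return ("passed" if attempt == 0 else "flaky", attempts)
    return ("failed", attempts)


def make_flaky(fail_first_n: int) -> Callable[[], None]:
    """Hypothetical test that fails its first `fail_first_n` calls, then passes."""
    calls = {"n": 0}

    def test() -> None:
        calls["n"] += 1
        assert calls["n"] > fail_first_n, "simulated non-deterministic failure"

    return test


def always_fails() -> None:
    """Hypothetical deterministic regression: no number of reruns helps."""
    assert False, "deterministic failure"
```

Under these assumptions, `run_with_reruns(make_flaky(1))` is classified as "flaky" rather than "passed", which mirrors the point that a pass on retry stays visible, while `run_with_reruns(always_fails)` still ends as "failed" after all attempts are used.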

Loosening exact-equality assertions on LLM output (asserting tool-call shape rather than an exact value) is noted in #716 as longer-term hardening and is intentionally out of scope here.
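For readers unfamiliar with the idea, asserting the shape of a tool call instead of an exact value might look roughly like the sketch below. The dictionary layout (`name`, `arguments`), the tool name `multiply`, and the helper `assert_tool_call_shape` are all assumptions made for illustration; they are not taken from the project's tests or from #716.

```python
from typing import Any, Iterable, Mapping


def assert_tool_call_shape(
    call: Mapping[str, Any],
    expected_name: str,
    required_args: Iterable[str],
) -> None:
    """Check that a tool call targets the expected tool and carries the
    expected argument names, without pinning the exact values."""
    assert call.get("name") == expected_name, f"unexpected tool: {call.get('name')!r}"
    args = call.get("arguments")
    assert isinstance(args, Mapping), "tool call has no argument mapping"
    missing = set(required_args) - set(args)
    assert not missing, f"missing arguments: {sorted(missing)}"


# Hypothetical model output: the values may vary from run to run,
# but the shape (tool name plus argument names) is stable.
example_call = {"name": "multiply", "arguments": {"a": 3, "b": 4}}
assert_tool_call_shape(example_call, "multiply", ["a", "b"])
```

A check like this tolerates a small model producing different values between runs, at the cost of no longer verifying the final result.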

Tests

This is a CI/test-configuration change, so no new product tests are added (a test asserting "retry is configured" would be tautological). Verified:

  • The retry flags are scoped correctly: a deliberately failing test runs 3 times (the initial run plus 2 reruns) under an e2e selector but only once under the unit selector. The unit pytest (-k "not e2e_tests") and unit mvn (-pl "${exclude_list}") invocations are unchanged.
  • pytest-rerunfailures==16.3 resolves against pytest==9.0.3, and the e2e jobs install it via tools/build.sh's uv sync --extra dev (the dev extra includes the test extra) before the --no-sync pytest runs.
  • mvn ... -Dsurefire.rerunFailingTestsCount=2 test-compile is accepted on Surefire 3.5.2; the cross-language module compiles with the timeout change.
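The scoping check in the first bullet can be sketched as follows. The real invocations live in shell scripts (tools/ut.sh, tools/e2e.sh); this Python version only models the argument lists. The `-k "not e2e_tests"` selector and the rerun flags come from the description above, while the e2e selector `-k e2e_tests` and the helper names `pytest_args` and `max_attempts` are assumptions for illustration.

```python
from typing import List

# Retry flags applied only to the live-LLM e2e invocation.
E2E_RETRY_FLAGS = ["--reruns", "2", "--reruns-delay", "5"]


def pytest_args(suite: str) -> List[str]:
    """Build a pytest argument list for the given suite (hypothetical helper)."""
    if suite == "unit":
        # Unit selector as quoted in the PR description; no retry flags.
        return ["pytest", "-k", "not e2e_tests"]
    if suite == "e2e":
        # Assumed e2e selector; the actual scripts may select tests differently.
        return ["pytest", "-k", "e2e_tests", *E2E_RETRY_FLAGS]
    raise ValueError(f"unknown suite: {suite!r}")


def max_attempts(args: List[str]) -> int:
    """Upper bound on executions of one failing test: 1 initial run + reruns."""
    if "--reruns" in args:
        return 1 + int(args[args.index("--reruns") + 1])
    return 1
```

With these definitions, `max_attempts(pytest_args("e2e"))` evaluates to 3 and `max_attempts(pytest_args("unit"))` to 1, matching the "3 times versus once" observation.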

API

No public API change. All changes are test-configuration (CI scripts, the Python test extra, and one e2e test agent).

Documentation

  • doc-needed
  • doc-not-needed
  • doc-included

weiqingy added 2 commits May 30, 2026 22:04
The Python-wrapped Ollama connection in ChatModelCrossLanguageAgent fell
back to the Python default request timeout of 30s, which was intermittently
exceeded under CI load and surfaced as httpx.ReadTimeout in the
cross-language job. Set it to 240s, matching the Java-native connection
in the same agent. The snake_case key forwards as a kwarg to the Python
OllamaChatModelConnection constructor.

Part of apache#716.

The live-LLM e2e and cross-language tests run a small Ollama model
(qwen3:1.7b) and intermittently fail on non-deterministic tool-call
results or Ollama read timeouts, turning CI red on unrelated changes.

Retry these suites automatically, scoped to the e2e/cross-language
invocations only so unit and style runs stay deterministic:
- Python: pytest-rerunfailures with --reruns 2 --reruns-delay 5 on the
  e2e pytest calls in tools/ut.sh and tools/e2e.sh.
- Java: -Dsurefire.rerunFailingTestsCount=2 on the e2e mvn calls in
  tools/ut.sh and test_resource_cross_language.sh.

A test that passes on retry yields a green build but is reported as a
flake, so the signal is preserved rather than masked.

Part of apache#716.
@github-actions github-actions bot added the doc-not-needed, fixVersion/0.3.0, and priority/major labels on May 31, 2026
Contributor

@xintongsong xintongsong left a comment

LGTM

@xintongsong xintongsong merged commit e209672 into apache:main May 31, 2026
26 checks passed