[fix][test] Add surefire forked JVM timeout to prevent FE UT from han…#61543
[fix][test] Add surefire forked JVM timeout to prevent FE UT from han…#61543hello-stephen wants to merge 1 commit intoapache:masterfrom
Conversation
…ging indefinitely
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run feut |
FE UT Coverage ReportIncrement line coverage `` 🎉 |
|
/review |
There was a problem hiding this comment.
Code Review Summary
Critical Issue: Wrong System Property Names (PR is a no-op)
The system property names used on the Maven command line are incorrect and will be silently ignored by Maven Surefire, making this entire PR ineffective.
Per the Maven Surefire source code (@Parameter annotations in SurefireMojo.java):
| Parameter | Used in PR (incorrect) | Correct user property |
|---|---|---|
forkedProcessTimeoutInSeconds |
-Dsurefire.forkedProcessTimeoutInSeconds |
-Dsurefire.timeout |
forkedProcessExitTimeoutInSeconds |
-Dsurefire.forkedProcessExitTimeoutInSeconds |
-Dsurefire.exitTimeout |
Maven's @Parameter(property = "...") annotation defines the only system property name that maps to a mojo field. There is no fallback or aliasing. Using the wrong name means the values are never bound to the plugin parameters.
Secondary Issue: Inaccurate Comment
The script comment says the timeout triggers when a JVM "produces no output for FE_UT_FORK_TIMEOUT seconds". Per the Surefire docs, forkedProcessTimeoutInSeconds is an absolute timeout — it kills the forked JVM after N seconds regardless of output. This means a legitimate test taking >300s would be killed. The comment should accurately describe this as an absolute wall-clock timeout.
Checkpoint Conclusions
- Goal accomplished? No. Due to wrong property names, the timeout is never actually set. The forked JVMs will still hang indefinitely.
- Minimal and focused? Yes, the change is small and scoped to one file.
- Concurrency? N/A.
- Configuration items? Two new env vars (
FE_UT_FORK_TIMEOUT,FE_UT_FORK_EXIT_TIMEOUT) are added with sensible defaults and follow the existing pattern (FE_UT_PARALLEL). However, the 300s absolute timeout may be too aggressive for some FE tests — this should be validated against the longest-running normal test. - Parallel code paths? All four
mvn testinvocations are updated consistently — good. - Test coverage? Not directly testable, but the fix should be validated by running FE UT with a known-hanging test to confirm the timeout fires.
- Other issues? The
FE_UT_FORK_TIMEOUTandFE_UT_FORK_EXIT_TIMEOUTvariables are notexported (unlikeFE_UT_PARALLEL), but since they're only used in the same script, this is fine.
Required Fix
Replace all occurrences:
# Before (wrong):
-Dsurefire.forkedProcessTimeoutInSeconds="${FE_UT_FORK_TIMEOUT}"
-Dsurefire.forkedProcessExitTimeoutInSeconds="${FE_UT_FORK_EXIT_TIMEOUT}"
# After (correct):
-Dsurefire.timeout="${FE_UT_FORK_TIMEOUT}"
-Dsurefire.exitTimeout="${FE_UT_FORK_EXIT_TIMEOUT}"And fix the comment to accurately describe forkedProcessTimeoutInSeconds as an absolute wall-clock timeout, not an idle/no-output timeout.
|
|
||
| # Timeout settings for surefire forked JVM processes. | ||
| # If a forked test JVM produces no output for FE_UT_FORK_TIMEOUT seconds, | ||
| # surefire will send SIGQUIT (which triggers a JVM thread dump) and then |
There was a problem hiding this comment.
Inaccurate comment: Per the Surefire docs, forkedProcessTimeoutInSeconds is an absolute wall-clock timeout — it kills the forked JVM after N seconds from fork start, regardless of whether output is being produced.
The comment says "produces no output for FE_UT_FORK_TIMEOUT seconds" which implies an idle timeout. This is misleading. A legitimate test that runs for >300s will be killed even if it's actively producing output.
Suggested fix:
# Timeout settings for surefire forked JVM processes.
# If a forked test JVM runs for more than FE_UT_FORK_TIMEOUT seconds (wall clock),
# surefire will kill the process, marking the test as ERROR rather than hanging forever.| "${MVN_CMD}" test jacoco:report -DfailIfNoTests=false -Dtest="$1" | ||
| "${MVN_CMD}" test jacoco:report -DfailIfNoTests=false -Dtest="$1" \ | ||
| -Dsurefire.forkedProcessTimeoutInSeconds="${FE_UT_FORK_TIMEOUT}" \ | ||
| -Dsurefire.forkedProcessExitTimeoutInSeconds="${FE_UT_FORK_EXIT_TIMEOUT}" |
There was a problem hiding this comment.
Bug (Critical): Wrong system property name — this is silently ignored.
-Dsurefire.forkedProcessTimeoutInSeconds is not a recognized Maven Surefire user property. The @Parameter annotation in SurefireMojo.java declares:
@Parameter(property = "surefire.timeout")
private int forkedProcessTimeoutInSeconds;The correct command-line property is -Dsurefire.timeout. Same issue with forkedProcessExitTimeoutInSeconds — the correct property is -Dsurefire.exitTimeout.
As written, both values are silently ignored and the forked JVM timeout remains at its default (0 = wait forever), making this PR a complete no-op.
Fix all four invocations:
-Dsurefire.timeout="${FE_UT_FORK_TIMEOUT}" \
-Dsurefire.exitTimeout="${FE_UT_FORK_EXIT_TIMEOUT}"
…ging indefinitely
Problem:
FE UT occasionally hangs when a test's forked JVM process gets stuck (e.g. a mocked FE background thread that never exits, or a deadlock during test initialization). When this happens, the entire build silently stalls until the CI global timeout is hit (3h+), causing all subsequent tests in the same Surefire batch to never run.
Two concrete examples identified from CI logs:
org.apache.doris.statistics.AnalysisTaskExecutorTest — mocked FE started but JVM never exited, hung for ~1h39m
org.apache.doris.planner.FederationBackendPolicyTest — zero output after "Running", hung silently
Solution:
Add -Dsurefire.forkedProcessTimeoutInSeconds and -Dsurefire.forkedProcessExitTimeoutInSeconds to all mvn test invocations in run-fe-ut.sh.
When a forked JVM exceeds the timeout with no progress:
Surefire sends SIGQUIT → the JVM prints a full thread dump to stdout (visible in CI logs), making the root cause immediately diagnosable
Surefire then kills the process and marks the test as ERROR
The rest of the test suite continues normally
The timeout values are configurable via env vars FE_UT_FORK_TIMEOUT (default: 300s) and FE_UT_FORK_EXIT_TIMEOUT (default: 60s), so CI pipelines can override them without touching the script.
Impact:
No change to normal test execution
Hung tests now fail fast with a thread dump instead of silently consuming the entire CI budget
Overall FE UT wall-clock time reduced from 3h+ (timeout-killed) back to the expected ~1.5h