Skip to content

[fix][test] Add surefire forked JVM timeout to prevent FE UT from han…#61543

Draft
hello-stephen wants to merge 1 commit intoapache:masterfrom
hello-stephen:0320-feut
Draft

[fix][test] Add surefire forked JVM timeout to prevent FE UT from han…#61543
hello-stephen wants to merge 1 commit intoapache:masterfrom
hello-stephen:0320-feut

Conversation

@hello-stephen
Copy link
Contributor

@hello-stephen hello-stephen commented Mar 20, 2026

…ging indefinitely

Problem:
FE UT occasionally hangs when a test's forked JVM process gets stuck (e.g. a mocked FE background thread that never exits, or a deadlock during test initialization). When this happens, the entire build silently stalls until the CI global timeout is hit (3h+), causing all subsequent tests in the same Surefire batch to never run.
Two concrete examples identified from CI logs:

org.apache.doris.statistics.AnalysisTaskExecutorTest — mocked FE started but JVM never exited, hung for ~1h39m
org.apache.doris.planner.FederationBackendPolicyTest — zero output after "Running", hung silently

Solution:
Add -Dsurefire.forkedProcessTimeoutInSeconds and -Dsurefire.forkedProcessExitTimeoutInSeconds to all mvn test invocations in run-fe-ut.sh.
When a forked JVM exceeds the timeout with no progress:

Surefire sends SIGQUIT → the JVM prints a full thread dump to stdout (visible in CI logs), making the root cause immediately diagnosable
Surefire then kills the process and marks the test as ERROR
The rest of the test suite continues normally

The timeout values are configurable via env vars FE_UT_FORK_TIMEOUT (default: 300s) and FE_UT_FORK_EXIT_TIMEOUT (default: 60s), so CI pipelines can override them without touching the script.
Impact:

No change to normal test execution
Hung tests now fail fast with a thread dump instead of silently consuming the entire CI budget
Overall FE UT wall-clock time reduced from 3h+ (timeout-killed) back to the expected ~1.5h

@Thearas
Copy link
Contributor

Thearas commented Mar 20, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@hello-stephen
Copy link
Contributor Author

run feut

@hello-stephen
Copy link
Contributor Author

FE UT Coverage Report

Increment line coverage `` 🎉
Increment coverage report
Complete coverage report

@hello-stephen hello-stephen marked this pull request as draft March 20, 2026 06:30
@hello-stephen
Copy link
Contributor Author

/review

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

Critical Issue: Wrong System Property Names (PR is a no-op)

The system property names used on the Maven command line are incorrect and will be silently ignored by Maven Surefire, making this entire PR ineffective.

Per the Maven Surefire source code (@Parameter annotations in SurefireMojo.java):

Parameter Used in PR (incorrect) Correct user property
forkedProcessTimeoutInSeconds -Dsurefire.forkedProcessTimeoutInSeconds -Dsurefire.timeout
forkedProcessExitTimeoutInSeconds -Dsurefire.forkedProcessExitTimeoutInSeconds -Dsurefire.exitTimeout

Maven's @Parameter(property = "...") annotation defines the only system property name that maps to a mojo field. There is no fallback or aliasing. Using the wrong name means the values are never bound to the plugin parameters.

Secondary Issue: Inaccurate Comment

The script comment says the timeout triggers when a JVM "produces no output for FE_UT_FORK_TIMEOUT seconds". Per the Surefire docs, forkedProcessTimeoutInSeconds is an absolute timeout — it kills the forked JVM after N seconds regardless of output. This means a legitimate test taking >300s would be killed. The comment should accurately describe this as an absolute wall-clock timeout.

Checkpoint Conclusions

  • Goal accomplished? No. Due to wrong property names, the timeout is never actually set. The forked JVMs will still hang indefinitely.
  • Minimal and focused? Yes, the change is small and scoped to one file.
  • Concurrency? N/A.
  • Configuration items? Two new env vars (FE_UT_FORK_TIMEOUT, FE_UT_FORK_EXIT_TIMEOUT) are added with sensible defaults and follow the existing pattern (FE_UT_PARALLEL). However, the 300s absolute timeout may be too aggressive for some FE tests — this should be validated against the longest-running normal test.
  • Parallel code paths? All four mvn test invocations are updated consistently — good.
  • Test coverage? Not directly testable, but the fix should be validated by running FE UT with a known-hanging test to confirm the timeout fires.
  • Other issues? The FE_UT_FORK_TIMEOUT and FE_UT_FORK_EXIT_TIMEOUT variables are not exported (unlike FE_UT_PARALLEL), but since they're only used in the same script, this is fine.

Required Fix

Replace all occurrences:

# Before (wrong):
-Dsurefire.forkedProcessTimeoutInSeconds="${FE_UT_FORK_TIMEOUT}"
-Dsurefire.forkedProcessExitTimeoutInSeconds="${FE_UT_FORK_EXIT_TIMEOUT}"

# After (correct):
-Dsurefire.timeout="${FE_UT_FORK_TIMEOUT}"
-Dsurefire.exitTimeout="${FE_UT_FORK_EXIT_TIMEOUT}"

And fix the comment to accurately describe forkedProcessTimeoutInSeconds as an absolute wall-clock timeout, not an idle/no-output timeout.


# Timeout settings for surefire forked JVM processes.
# If a forked test JVM produces no output for FE_UT_FORK_TIMEOUT seconds,
# surefire will send SIGQUIT (which triggers a JVM thread dump) and then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inaccurate comment: Per the Surefire docs, forkedProcessTimeoutInSeconds is an absolute wall-clock timeout — it kills the forked JVM after N seconds from fork start, regardless of whether output is being produced.

The comment says "produces no output for FE_UT_FORK_TIMEOUT seconds" which implies an idle timeout. This is misleading. A legitimate test that runs for >300s will be killed even if it's actively producing output.

Suggested fix:

# Timeout settings for surefire forked JVM processes.
# If a forked test JVM runs for more than FE_UT_FORK_TIMEOUT seconds (wall clock),
# surefire will kill the process, marking the test as ERROR rather than hanging forever.

"${MVN_CMD}" test jacoco:report -DfailIfNoTests=false -Dtest="$1"
"${MVN_CMD}" test jacoco:report -DfailIfNoTests=false -Dtest="$1" \
-Dsurefire.forkedProcessTimeoutInSeconds="${FE_UT_FORK_TIMEOUT}" \
-Dsurefire.forkedProcessExitTimeoutInSeconds="${FE_UT_FORK_EXIT_TIMEOUT}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug (Critical): Wrong system property name — this is silently ignored.

-Dsurefire.forkedProcessTimeoutInSeconds is not a recognized Maven Surefire user property. The @Parameter annotation in SurefireMojo.java declares:

@Parameter(property = "surefire.timeout")
private int forkedProcessTimeoutInSeconds;

The correct command-line property is -Dsurefire.timeout. Same issue with forkedProcessExitTimeoutInSeconds — the correct property is -Dsurefire.exitTimeout.

As written, both values are silently ignored and the forked JVM timeout remains at its default (0 = wait forever), making this PR a complete no-op.

Fix all four invocations:

-Dsurefire.timeout="${FE_UT_FORK_TIMEOUT}" \
-Dsurefire.exitTimeout="${FE_UT_FORK_EXIT_TIMEOUT}"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants