fix(pd,store): fg mode exit code propagation in startup scripts #3047
…-pd.sh

In foreground mode (-d false), start-hugegraph-pd.sh had no foreground branch: the script always backgrounded Java with exec ... &, wrote $! to the pid file, and exited 0, losing Java's exit code entirely.

Fix: add a DAEMON="true" default and a -d flag to getopts. In the daemon branch, keep the existing exec ... & pattern. In the foreground branch, write $$ to the pid file before exec (exec replaces the shell with Java, so $$ == Java's PID after exec), then exec java without & so the process blocks and Java's exit code propagates out directly.

No trap is needed in the foreground branch: exec replaces the shell process with Java, so signals from Docker/systemd go directly to Java without a wrapper to forward through.

Add test-start-hugegraph-pd.sh with 4 tests (daemon regression, foreground blocking, exit code propagation on SIGKILL, SIGTERM forwarding via exec), 12 assertions in total, all passing after the fix. Baseline on unmodified code: 3 passed, 9 failed. After fix: 12 passed, 0 failed.

Wire the test into pd-store-ci.yml for the RocksDB backend.

Related to: apache#3043
…aph-store.sh

Same structural bug as start-hugegraph-pd.sh: there was no foreground branch, so the script always backgrounded Java with exec ... &, wrote $! to the pid file, and exited 0, losing Java's exit code entirely.

Fix: add a DAEMON="true" default and a -d flag to getopts. The daemon branch keeps exec ... & with $! as before. The foreground branch writes $$ to the pid file before exec, then runs exec java without & so the process blocks and Java's exit code propagates out directly.

Add test-start-hugegraph-store.sh with 4 tests (daemon regression, foreground blocking, exit code propagation on SIGKILL, SIGTERM forwarding via exec) and wire it into pd-store-ci.yml.

Note: the Store health check is skipped when PD is not running; this is expected and handled gracefully in the test.

Baseline on unmodified code: 3 passed, 5 failed. After fix: 11 passed, 0 failed.

Related to: apache#3043
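The branch structure that both commit messages describe can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the launcher name `start-demo.sh`, the `PID_FILE` location, and the stand-in commands (`sh -c 'exit 7'`, `sleep 1`) are all assumptions, and the real scripts launch Java with many more options.

```shell
#!/usr/bin/env bash
# Write a stand-in launcher to a temp dir so it runs as its own process,
# which is what makes $$ meaningful. All names here are hypothetical.
work_dir="$(mktemp -d)"
cat > "$work_dir/start-demo.sh" <<'EOF'
#!/usr/bin/env bash
PID_FILE="$(dirname "$0")/pid"
DAEMON="true"                      # default preserves the daemon behavior
while getopts "d:" opt; do
    case "$opt" in
        d) DAEMON="$OPTARG" ;;
        *) echo "usage: $0 [-d true|false] command..." >&2; exit 2 ;;
    esac
done
shift $((OPTIND - 1))

if [ "$DAEMON" = "true" ]; then
    # Daemon branch: background the command and record the child's PID.
    exec "$@" > /dev/null 2>&1 &
    echo "$!" > "$PID_FILE"
else
    # Foreground branch: record our own PID, then exec. The PID survives
    # exec, and the command's exit code becomes the script's exit code.
    echo "$$" > "$PID_FILE"
    exec "$@"
fi
EOF

# Foreground: the caller sees the command's own exit code.
fg_status=0
bash "$work_dir/start-demo.sh" -d false sh -c 'exit 7' || fg_status=$?

# Daemon (default): the launcher returns 0 right away.
daemon_status=0
bash "$work_dir/start-demo.sh" sleep 1 || daemon_status=$?

echo "foreground exit code: $fg_status"
echo "daemon exit code:     $daemon_status"
rm -rf "$work_dir"
```

The foreground call reports 7, the stand-in command's exit code, while the daemon call reports 0 regardless of what the backgrounded command later does.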
…lure

GC log files written by Java during tests have no Apache license header. Apache RAT scans the dist logs/ directory and fails. Delete logs/ in cleanup() so RAT does not see them.

Related to: apache#3043
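A cleanup step of this kind might look like the sketch below. `DIST_DIR` and the sample `gc.log` file are hypothetical placeholders, and the real cleanup() in the test scripts may do more than this.

```shell
#!/usr/bin/env bash
# DIST_DIR is a hypothetical stand-in for the unpacked distribution directory.
DIST_DIR="$(mktemp -d)"
mkdir -p "$DIST_DIR/logs"
echo "gc output" > "$DIST_DIR/logs/gc.log"   # a file without a license header

cleanup() {
    # Remove runtime logs so a later license scan of the tree does not see them.
    # ${DIST_DIR:?} aborts instead of expanding to "/logs" if DIST_DIR is unset.
    rm -rf "${DIST_DIR:?}/logs"
}

cleanup
if [ -d "$DIST_DIR/logs" ]; then logs_present="yes"; else logs_present="no"; fi
echo "logs directory present after cleanup: $logs_present"
```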
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## master #3047 +/- ##
============================================
- Coverage 36.07% 0.07% -36.00%
+ Complexity 338 22 -316
============================================
Files 803 781 -22
Lines 68095 65712 -2383
Branches 8918 8515 -403
============================================
- Hits 24563 51 -24512
- Misses 40897 65659 +24762
+ Partials 2635 2 -2633
imbajin
left a comment
I re-reviewed the current head a6c38e8 and did not find any blocking issue in the PD/Store foreground-mode changes.
The implementation preserves the existing daemon behavior by default, while -d false writes the current shell PID before exec, so the Java process owns the same PID and exit/signal behavior propagates to the caller. The new CI foreground suites also exercise the important process semantics: blocking behavior, pid-file behavior, SIGKILL exit propagation, and SIGTERM termination for both PD and Store.
A couple of non-blocking follow-ups may still be worth considering:
- The PD/Store README files do not document the new -d true|false option yet.
- The Store foreground test mainly validates the process model; it intentionally tolerates health-check timeout when PD is not running, while the later workflow steps still cover the normal PD+Store path.
Overall this looks merge-safe to me as the foreground-mode chunk for #3043. The Docker entrypoints still need to be wired to this foreground mode in a follow-up change before the original Docker restart/lifecycle issue is fully closed.
Fix foreground mode in start-hugegraph-pd.sh and start-hugegraph-store.sh (chunks 2–3 of #3043)
Purpose of the PR
Main Changes
Problem
Both start-hugegraph-pd.sh and start-hugegraph-store.sh had no foreground branch: the scripts always backgrounded Java with exec ... &, wrote $! to the pid file, and exited 0, losing Java's exit code entirely. Docker/systemd supervisors never saw a non-zero exit, so containers were never restarted on a Java crash.
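The failure mode can be reproduced in miniature. The sketch below assumes `sh -c 'exit 3'` as a stand-in for a crashing Java process and a hypothetical `old_style_start` function in place of the old launcher; it is not the original script.

```shell
#!/usr/bin/env bash
# Stand-in for the old launcher: background the child, record $!, report success.
# `sh -c 'exit 3'` plays the role of a JVM that crashes (hypothetical placeholder).
old_style_start() {
    local pid_file="$1"
    sh -c 'exit 3' &
    echo "$!" > "$pid_file"
    return 0   # the wrapper reports success regardless of what the child does
}

pid_file="$(mktemp)"
wrapper_status=0
old_style_start "$pid_file" || wrapper_status=$?

# Only an explicit wait on the recorded PID reveals the real exit code.
child_status=0
wait "$(cat "$pid_file")" || child_status=$?

echo "wrapper exit code: $wrapper_status"
echo "child exit code:   $child_status"
rm -f "$pid_file"
```

A supervisor that only watches the wrapper sees 0 here, even though the child exited with 3.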
Fix
Add a DAEMON="true" default and a -d true|false flag to getopts in both scripts. The daemon branch keeps exec ... & with $! as before. The foreground branch writes $$ to the pid file before exec (exec replaces the shell with Java, so $$ == Java's PID after exec), then runs exec java without & so the process blocks and Java's exit code propagates out directly.

No trap is needed in the foreground branch: exec replaces the shell process with Java, so signals from Docker/systemd go directly to Java without a wrapper to forward through (unlike chunk 1, where the & + wait pattern required an explicit trap).

Tests
- start-hugegraph-pd.sh: test-start-hugegraph-pd.sh
- start-hugegraph-store.sh: test-start-hugegraph-store.sh

Both test scripts are wired into pd-store-ci.yml (no backend guard needed, since pd/store jobs always run).

Note: the Store health check is skipped when PD is not running; this is handled gracefully with a warning, not a failure.
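The signal-related checks in these tests can be illustrated with a small sketch. It is not taken from the test scripts: `fg-launcher.sh`, `run_and_signal`, and the `sleep 30` stand-in for Java are hypothetical, and the real tests run the actual start scripts against a distribution directory.

```shell
#!/usr/bin/env bash
work_dir="$(mktemp -d)"
pid_file="$work_dir/pid"

# Hypothetical foreground launcher: record $$, then exec a long-running command.
cat > "$work_dir/fg-launcher.sh" <<'EOF'
#!/usr/bin/env bash
echo "$$" > "$1"
exec sleep 30
EOF

# Start the launcher, signal it, and capture what the caller sees.
# Sets two globals: last_status (exit code seen by wait) and pid_matches (yes/no).
run_and_signal() {
    local signal="$1" launcher_pid tries=0
    last_status=0
    rm -f "$pid_file"
    bash "$work_dir/fg-launcher.sh" "$pid_file" &
    launcher_pid=$!
    # Poll for the pid file for up to about 5 seconds.
    until [ -s "$pid_file" ] || [ "$tries" -ge 50 ]; do
        sleep 0.1
        tries=$((tries + 1))
    done
    # Because of exec, the recorded PID should equal the launcher's PID.
    if [ "$(cat "$pid_file")" = "$launcher_pid" ]; then
        pid_matches="yes"
    else
        pid_matches="no"
    fi
    kill "-$signal" "$launcher_pid"
    wait "$launcher_pid" || last_status=$?
}

run_and_signal KILL
kill_status=$last_status
kill_pid_ok=$pid_matches

run_and_signal TERM
term_status=$last_status

echo "SIGKILL -> exit $kill_status (pid file matches launcher: $kill_pid_ok)"
echo "SIGTERM -> exit $term_status"
rm -rf "$work_dir"
```

Bash reports a process killed by a signal as 128 plus the signal number, so the two statuses come out as 137 (SIGKILL) and 143 (SIGTERM). With a wrapper that backgrounds the process and exits 0, neither value would reach the caller.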
What's NOT in this PR
Docker entrypoints, HEALTHCHECK, and cron monitor removal are in chunks 4–8 (separate PR).
Verifying these changes
- test-start-hugegraph-pd.sh $PD_DIR: 12 assertions, all pass after fix
- test-start-hugegraph-store.sh $STORE_DIR: 11 assertions, all pass after fix

Does this PR potentially affect the following parts?
Documentation Status
Doc - No Need