Skip to content

fix(pd,store): fg mode exit code propagation in startup scripts#3047

Merged
imbajin merged 4 commits into
apache:masterfrom
bitflicker64:fix/pd-store-foreground-mode
Jun 4, 2026
Merged

fix(pd,store): fg mode exit code propagation in startup scripts#3047
imbajin merged 4 commits into
apache:masterfrom
bitflicker64:fix/pd-store-foreground-mode

Conversation

@bitflicker64
Copy link
Copy Markdown
Contributor

Fix foreground mode in start-hugegraph-pd.sh and start-hugegraph-store.sh (chunks 2–3 of #3043)

Purpose of the PR

Main Changes

Problem

Both start-hugegraph-pd.sh and start-hugegraph-store.sh had no foreground
branch — the scripts always backgrounded Java with exec ... &, wrote $! to
the pid file, and exited 0, losing Java's exit code entirely. Docker/systemd
supervisors never saw a non-zero exit, so containers were never restarted on
Java crash.

Fix

Add DAEMON="true" default and -d true|false flag to getopts in both scripts.
Daemon branch keeps exec ... & with $! as before. Foreground branch writes
$$ to the pid file before exec (exec replaces the shell with Java, so
$$ == Java's PID after exec), then exec java without & so the process
blocks and Java's exit code propagates out directly.

No trap needed in the foreground branch — exec replaces the shell process with
Java, so signals from Docker/systemd go directly to Java without a wrapper to
forward through (unlike chunk 1 where & + wait required an explicit trap).

Tests

Script Test file Tests Baseline After fix
start-hugegraph-pd.sh test-start-hugegraph-pd.sh 4 (daemon regression, foreground blocks, exit code 137, SIGTERM 143) 3 passed, 9 failed 12 passed, 0 failed
start-hugegraph-store.sh test-start-hugegraph-store.sh 4 (same) 3 passed, 5 failed 11 passed, 0 failed

Both test scripts wired into pd-store-ci.yml (no backend guard needed — pd/store jobs always run).

Note: Store health check is skipped when PD is not running — handled gracefully with a warning, not a failure.

What's NOT in this PR

Docker entrypoints, HEALTHCHECK, and cron monitor removal are in chunks 4–8 (separate PR).

Verifying these changes

  • Need tests and can be verified as follows:
    • test-start-hugegraph-pd.sh $PD_DIR — 12 assertions, all pass after fix
    • test-start-hugegraph-store.sh $STORE_DIR — 11 assertions, all pass after fix

Does this PR potentially affect the following parts?

  • Dependencies
  • Modify configurations
  • The public API
  • Other affects
  • Nope

Documentation Status

  • Doc - No Need

…-pd.sh

In foreground mode (-d false), start-hugegraph-pd.sh had no foreground
branch — the script always backgrounded Java with exec ... &, wrote $!
to the pid file, and exited 0, losing Java's exit code entirely.

Fix: add DAEMON="true" default and -d flag to getopts. In the daemon
branch, keep the existing exec ... & pattern. In the foreground branch,
write $$ to the pid file before exec (exec replaces the shell with Java,
so $$ == Java's PID after exec), then exec java without & so the process
blocks and Java's exit code propagates out directly.

No trap needed in the foreground branch — exec replaces the shell
process with Java, so signals from Docker/systemd go directly to Java
without a wrapper to forward through.

Add test-start-hugegraph-pd.sh with 4 tests (daemon regression,
foreground blocking, exit code propagation on SIGKILL, SIGTERM
forwarding via exec) — 12 assertions, all pass after the fix.

Baseline on unmodified code: 3 passed, 9 failed.
After fix: 12 passed, 0 failed.

Wire test into pd-store-ci.yml for the RocksDB backend.

Related to: apache#3043
…aph-store.sh

Same structural bug as start-hugegraph-pd.sh: no foreground branch,
script always backgrounded Java with exec ... &, wrote $! to the pid
file, and exited 0, losing Java's exit code entirely.

Fix: add DAEMON="true" default and -d flag to getopts. Daemon branch
keeps exec ... & with $! as before. Foreground branch writes $$ to the
pid file before exec, then exec java without & so the process blocks
and Java's exit code propagates out directly.

Add test-start-hugegraph-store.sh with 4 tests (daemon regression,
foreground blocking, exit code propagation on SIGKILL, SIGTERM
forwarding via exec) and wire into pd-store-ci.yml.

Note: Store health check is skipped when PD is not running — this is
expected and handled gracefully in the test.

Baseline on unmodified code: 3 passed, 5 failed.
After fix: 11 passed, 0 failed.

Related to: apache#3043
@dosubot dosubot Bot added size:XL This PR changes 500-999 lines, ignoring generated files. bug Something isn't working pd PD module store Store module tests Add or improve test cases labels Jun 4, 2026
…lure

GC log files written by Java during tests have no Apache license header.
Apache RAT scans the dist logs/ directory and fails. Delete logs/ in
cleanup() so RAT does not see them.

Related to: apache#3043
@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 0.07%. Comparing base (d6e6216) to head (a6c38e8).
⚠️ Report is 1 commits behind head on master.

❗ There is a different number of reports uploaded between BASE (d6e6216) and HEAD (a6c38e8). Click for more details.

HEAD has 2 uploads less than BASE
Flag BASE (d6e6216) HEAD (a6c38e8)
3 1
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #3047       +/-   ##
============================================
- Coverage     36.07%   0.07%   -36.00%     
+ Complexity      338      22      -316     
============================================
  Files           803     781       -22     
  Lines         68095   65712     -2383     
  Branches       8918    8515      -403     
============================================
- Hits          24563      51    -24512     
- Misses        40897   65659    +24762     
+ Partials       2635       2     -2633     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Member

@imbajin imbajin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-reviewed the current head a6c38e8 and did not find any blocking issue in the PD/Store foreground-mode changes.

The implementation preserves the existing daemon behavior by default, while -d false writes the current shell PID before exec, so the Java process owns the same PID and exit/signal behavior propagates to the caller. The new CI foreground suites also exercise the important process semantics: blocking behavior, pid-file behavior, SIGKILL exit propagation, and SIGTERM termination for both PD and Store.

A couple of non-blocking follow-ups may still be worth considering:

  1. The PD/Store README files do not document the new -d true|false option yet.
  2. The Store foreground test mainly validates the process model; it intentionally tolerates health-check timeout when PD is not running, while the later workflow steps still cover the normal PD+Store path.

Overall this looks merge-safe to me as the foreground-mode chunk for #3043. The Docker entrypoints still need to be wired to this foreground mode in a follow-up change before the original Docker restart/lifecycle issue is fully closed.

@github-project-automation github-project-automation Bot moved this from In progress to In review in HugeGraph PD-Store Tasks Jun 4, 2026
@dosubot dosubot Bot added the lgtm This PR has been approved by a maintainer label Jun 4, 2026
@imbajin imbajin changed the title fix(pd,store): fix foreground mode exit code propagation in startup scripts fix(pd,store): fg mode exit code propagation in startup scripts Jun 4, 2026
@imbajin imbajin merged commit 16382cc into apache:master Jun 4, 2026
16 of 17 checks passed
@github-project-automation github-project-automation Bot moved this from In review to Done in HugeGraph PD-Store Tasks Jun 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working lgtm This PR has been approved by a maintainer pd PD module size:XL This PR changes 500-999 lines, ignoring generated files. store Store module tests Add or improve test cases

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants