[Bug] Docker containers never restart when Java process crashes — broken cron monitor + tail -f /dev/null supervision
Before submit
Environment
Expected & Actual behavior
Expected:
When the HugeGraph Java process crashes inside a Docker container, the container should exit and Docker's restart policy (restart: unless-stopped) should automatically bring it back up.
Actual:
The container stays in the Up state permanently even after Java crashes. docker ps still reports it as running, but clients get Connection refused. The container never restarts on its own, so manual intervention is required every time.
# Java crashes at T=0:30
$ docker ps
CONTAINER ID IMAGE STATUS PORTS
abc123 hugegraph/server Up 2 hours 0.0.0.0:8080->8080/tcp
# Container looks healthy but:
$ curl http://localhost:8080/versions
curl: (7) Failed to connect to localhost port 8080: Connection refused
Root Cause Analysis
There are five compounding problems:
1. crond is never started — the watchdog is completely dead
cron is installed in all four Dockerfiles, but dumb-init only launches docker-entrypoint.sh, and nothing starts crond. Even if start-hugegraph.sh -m true were called, start-monitor.sh would register the crontab job, but because crond is not running, monitor-hugegraph.sh would never fire. The entire watchdog silently does nothing in containers.
What happens on a VM:                 What happens in Docker:
crond reads crontab every minute      crond is NOT running
monitor-hugegraph.sh fires            monitor-hugegraph.sh NEVER fires
HugeGraph gets restarted              HugeGraph stays dead forever
2. tail -f /dev/null means zero supervision
All three docker-entrypoint.sh files background the Java process then sleep forever:
# hugegraph-server entrypoint (current)
./bin/start-hugegraph.sh -j "${JAVA_OPTS:-}" -t 120
# ... post-startup checks ...
tail -f /dev/null # ← keeps container alive with NO watchdog
When Java crashes, tail -f /dev/null keeps running. The container never exits. Docker's restart: unless-stopped only triggers on container exit — since the container never exits, the restart policy never fires. The container stays Up (unhealthy) forever.
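One way to tie the container's lifetime to the Java process is to have the entrypoint wait on the child instead of on tail. The following is a minimal sketch, not the project's actual fix: run_supervised is a hypothetical helper, and it assumes the entrypoint can launch the server through a command that stays in the foreground.

```shell
# Sketch: the entrypoint lives exactly as long as the process it started.
# run_supervised is a hypothetical helper, not part of the HugeGraph scripts.
run_supervised() {
    "$@" &                        # start the long-running process
    local child=$!
    # Forward stop signals so `docker stop` still reaches the child.
    trap 'kill -TERM "$child" 2>/dev/null' TERM INT
    local status=0
    wait "$child" || status=$?
    # If wait was interrupted by a forwarded signal, let the child finish.
    if [ "$status" -gt 128 ]; then
        wait "$child" 2>/dev/null || true
    fi
    trap - TERM INT
    return "$status"
}
```

An entrypoint could end with run_supervised followed by the foreground start command and then exit with its status, so a crash makes the container exit and lets the restart policy act.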
3. HEALTHCHECK only exists in docker-compose.yml, not in the Dockerfiles
Health checks are defined per service in docker-compose.yml but none of the four Dockerfiles have a HEALTHCHECK instruction. So docker run without compose has no health reporting at all. depends_on: condition: service_healthy only works because compose injects the check at runtime — it is not baked into the image.
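As an illustration of what a baked-in check could call, here is a small probe sketch. The function name and the 5-second timeout are assumptions made for this example; the /versions endpoint on port 8080 is the one used in the reproduction above.

```shell
# Sketch of a probe that a Dockerfile HEALTHCHECK instruction could invoke.
# probe and the 5-second timeout are illustrative assumptions.
probe() {
    # -f: non-zero exit on HTTP errors; -sS: quiet, but still print failures;
    # --max-time: never hang longer than the given number of seconds.
    curl -fsS --max-time 5 "${1:-http://localhost:8080/versions}" > /dev/null
}
```

A script containing this function and ending with probe "$@" returns 0 when the server answers and non-zero otherwise, which is the contract a HEALTHCHECK CMD expects. Note that a health check only reports status; it does not by itself make a plain Docker restart policy fire.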
4. Foreground mode is broken in start-hugegraph.sh
start-hugegraph.sh has a -d false foreground flag, but it does not work. The $! capture, the pid file write, trap, wait_for_startup, disown, and the OPEN_MONITOR handling all run unconditionally after the daemon/foreground if/else block. In foreground mode they therefore execute only after Java has already exited, with empty or stale values. Java's exit code is lost and the script always exits 0.
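The shape of a possible correction is to keep the daemon-only bookkeeping inside the daemon branch and let the foreground branch return Java's own exit status. This is a sketch under assumptions: launch and its arguments are illustrative names, not the identifiers used in start-hugegraph.sh.

```shell
# Sketch: daemon-only steps stay in the daemon branch.
# launch, daemon and pid_file are hypothetical names for illustration.
launch() {
    local daemon="$1" pid_file="$2"
    shift 2
    if [ "$daemon" = "true" ]; then
        "$@" &
        # $! is only meaningful right after a background start.
        echo "$!" > "$pid_file"
        # Startup polling and monitor registration would belong here.
        return 0
    fi
    # Foreground: block until the process ends; its exit status becomes
    # the function's return value because it is the last command run.
    "$@"
}
```

With this structure, a foreground caller sees the real exit code, which is what an entrypoint needs in order to exit non-zero after a crash.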
5. No foreground mode exists at all in start-hugegraph-pd.sh and start-hugegraph-store.sh
Both scripts always background Java unconditionally with exec java ... & regardless of any flag. There is no -d flag and no foreground path.
Impact
CURRENT — Java crashes inside container:
T=0:30 Java crashes (OOM, segfault, deadlock, etc.)
T=0:30 tail -f /dev/null keeps running
T=0:30 Container stays "Up" — Docker sees nothing wrong
T=1:00 HEALTHCHECK marks container "unhealthy" (compose only)
T=∞ Container stays unhealthy forever, never restarts
docker ps shows: Up 2 hours (unhealthy)
Users get: Connection refused
AFTER FIX — Java crashes inside container:
T=0:30 Java crashes
T=0:30 Entrypoint exits → dumb-init exits → container exits
T=0:30 Docker restart policy fires immediately
T=0:31 New container starts
T=1:41 docker ps shows: Up 1 min (healthy)
Additional bug found during investigation
The shipped default conf/rest-server.properties has:
restserver.url=127.0.0.1:8080
No http:// scheme. On macOS, curl fails immediately with "Protocol not supported" causing wait_for_startup to always time out and start-hugegraph.sh to exit 1 even though the server starts fine. Every other config in the repo uses http:// explicitly — raft CI configs, the Dockerfile sed patch, cluster test templates, and the Java ServerOptions default. The shipped default is inconsistent and breaks local macOS development.
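Changing the shipped default to include the scheme is the direct fix. A defensive alternative for the startup check might look like the sketch below; normalize_url is a hypothetical helper, and defaulting to http:// is an assumption based on the other configs mentioned above.

```shell
# Sketch: prepend a scheme only when the configured value lacks one.
# normalize_url is a hypothetical helper, not an existing function.
normalize_url() {
    case "$1" in
        http://*|https://*) printf '%s\n' "$1" ;;
        *)                  printf 'http://%s\n' "$1" ;;
    esac
}
```

Passing the configured restserver.url through such a helper before handing it to curl would make the startup probe behave the same whether or not the scheme is present.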
Related