
[Bug] Docker containers never restart when Java process crashes — broken cron monitor + tail -f /dev/null supervision #3043

@bitflicker64

Before submit

  • I have confirmed and searched that there are no similar problems in the historical issue and documents

Environment


Expected & Actual behavior

Expected:
When the HugeGraph Java process crashes inside a Docker container, the container should exit and Docker's restart policy (restart: unless-stopped) should automatically bring it back up.

Actual:
The container stays in the Up state permanently even after Java crashes. docker ps still lists it as running, while users get Connection refused. The container never restarts on its own, so manual intervention is required every time.

# Java crashes at T=0:30
$ docker ps
CONTAINER ID   IMAGE              STATUS               PORTS
abc123         hugegraph/server   Up 2 hours           0.0.0.0:8080->8080/tcp

# Container looks healthy but:
$ curl http://localhost:8080/versions
curl: (7) Failed to connect to localhost port 8080: Connection refused

Root Cause Analysis

There are five compounding problems:

1. crond is never started — the watchdog is completely dead

cron is installed in all four Dockerfiles, but dumb-init only launches docker-entrypoint.sh and nothing starts crond. Even if start-hugegraph.sh -m true were called, start-monitor.sh would register the crontab job, but with no crond running, monitor-hugegraph.sh would never fire. The entire watchdog silently does nothing in containers.

What happens on a VM:          What happens in Docker:
  crond reads crontab every      crond is NOT running
  minute                         monitor-hugegraph.sh NEVER fires
  monitor-hugegraph.sh fires     HugeGraph stays dead forever
  HugeGraph gets restarted
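
One way to confirm this is to look for a cron daemon process inside the running container. The helper below is only a sketch: it assumes pgrep is available in the image, and the function names are made up for illustration. The daemon is typically called crond on Alpine/CentOS-style images and cron on Debian-style ones, so both are checked.

```shell
# Returns 0 if a process with exactly the given name is running.
# Assumption: pgrep (procps) is installed; if it is missing, this reports "not running".
process_running() {
  pgrep -x "$1" >/dev/null 2>&1
}

# Reports whether a cron daemon is alive under either common name.
cron_status() {
  if process_running crond || process_running cron; then
    echo "cron daemon is running"
  else
    echo "cron daemon is NOT running: crontab entries will never fire"
    return 1
  fi
}
```

Run inside the container (for example through docker exec with a shell), cron_status should print the "NOT running" line if the analysis above holds.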

2. tail -f /dev/null means zero supervision

All three docker-entrypoint.sh files background the Java process then sleep forever:

# hugegraph-server entrypoint (current)
./bin/start-hugegraph.sh -j "${JAVA_OPTS:-}" -t 120
# ... post-startup checks ...
tail -f /dev/null   # ← keeps container alive with NO watchdog

When Java crashes, tail -f /dev/null keeps running. The container never exits. Docker's restart: unless-stopped only triggers on container exit — since the container never exits, the restart policy never fires. The container stays Up (unhealthy) forever.
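
As a rough illustration of what supervision could look like while the start script still daemonizes Java, the entrypoint could block on the Java PID instead of on tail. This is a sketch under assumptions, not the project's fix: watch_pid is a hypothetical helper, and where the PID comes from (for example a pid file written by the start script) is left as a placeholder.

```shell
# Blocks until the process with the given PID no longer exists.
# kill -0 sends no signal; it only tests whether the PID can be signalled.
# Usage: watch_pid <pid> [poll_interval_seconds]
watch_pid() {
  pid="$1"
  interval="${2:-5}"
  while kill -0 "$pid" 2>/dev/null; do
    sleep "$interval"
  done
}
```

In an entrypoint, the final tail -f /dev/null line would become something like watch_pid "$JAVA_PID" followed by exit 1, so the container exits once Java is gone and the restart policy can fire. Polling adds up to one interval of delay, which is one reason a true foreground mode is the cleaner option.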

3. HEALTHCHECK only exists in docker-compose.yml, not in the Dockerfiles

Health checks are defined per service in docker-compose.yml but none of the four Dockerfiles have a HEALTHCHECK instruction. So docker run without compose has no health reporting at all. depends_on: condition: service_healthy only works because compose injects the check at runtime — it is not baked into the image.
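
A health probe that could be shipped inside the image might look like the sketch below. It assumes curl is installed in the image and reuses the /versions endpoint from the example above. The function name and the 5-second timeout are illustrative choices.

```shell
# Probes the REST endpoint; returns 0 when it answers with a success status, 1 otherwise.
# Assumption: curl is available in the image.
check_health() {
  url="${1:-http://localhost:8080/versions}"
  # -f: treat HTTP errors as failure, -sS: quiet but keep errors, --max-time: hard timeout
  curl -fsS --max-time 5 "$url" >/dev/null 2>&1 || return 1
}
```

A Dockerfile could then reference a script wrapping this function via an instruction of the form HEALTHCHECK --interval=30s --timeout=10s --retries=3 CMD <script>, where the interval, timeout, and retry values are placeholders to tune.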

4. Foreground mode is broken in start-hugegraph.sh

start-hugegraph.sh has a -d false foreground flag but it is broken. $!, pid file write, trap, wait_for_startup, disown, and OPEN_MONITOR all run unconditionally after the daemon/foreground if/else block — meaning in foreground mode they all execute after Java has already exited, with empty/stale values. Java's exit code is lost and the script always exits 0.
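
The intended control flow can be sketched as follows: daemon-only steps stay inside the daemon branch, and the foreground branch returns the process's own exit code. This is a simplified illustration, not the actual script. launch and its arguments are hypothetical, and steps such as wait_for_startup and the monitor hook are omitted.

```shell
# Usage: launch <daemon:true|false> <pid_file> <command...>
launch() {
  daemon="$1"
  pid_file="$2"
  shift 2
  if [ "$daemon" = "true" ]; then
    "$@" &
    # $! is only meaningful here, right after backgrounding the command
    echo "$!" > "$pid_file"
  else
    # Foreground: block until the command exits, then propagate its status
    "$@"
    return $?
  fi
}
```

In a container entrypoint, the foreground branch would more likely use exec so that the Java process replaces the shell and receives signals directly. The plain call is used here only to keep the function reusable.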

5. No foreground mode exists at all in start-hugegraph-pd.sh and start-hugegraph-store.sh

Both scripts always background Java unconditionally with exec java ... & regardless of any flag. There is no -d flag and no foreground path.
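
If a foreground path were added to these scripts, flag parsing could mirror the -d convention of start-hugegraph.sh. The sketch below is hypothetical: the function name and the DAEMON variable are assumptions for illustration and do not exist in the current PD or Store scripts.

```shell
# Parses an optional "-d true|false" flag; DAEMON defaults to "true"
# so existing callers keep today's backgrounding behaviour.
parse_daemon_flag() {
  DAEMON="true"
  OPTIND=1                      # reset so the function can be called repeatedly
  while getopts "d:" opt; do
    case "$opt" in
      d) DAEMON="$OPTARG" ;;
      *) return 2 ;;            # unknown flag
    esac
  done
}
```

After parsing, the script would background Java only when DAEMON is true and otherwise run it in the foreground (without the trailing &), so the exit code reaches the entrypoint.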


Impact

CURRENT — Java crashes inside container:
  T=0:30  Java crashes (OOM, segfault, deadlock, etc.)
  T=0:30  tail -f /dev/null keeps running
  T=0:30  Container stays "Up" — Docker sees nothing wrong
  T=1:00  HEALTHCHECK marks container "unhealthy" (compose only)
  T=∞     Container stays unhealthy forever, never restarts
          docker ps shows: Up 2 hours (unhealthy)
          Users get: Connection refused

AFTER FIX — Java crashes inside container:
  T=0:30  Java crashes
  T=0:30  Entrypoint exits → dumb-init exits → container exits
  T=0:30  Docker restart policy fires immediately
  T=0:31  New container starts
  T=1:41  docker ps shows: Up 1 min (healthy)

Additional bug found during investigation

The shipped default conf/rest-server.properties has:

restserver.url=127.0.0.1:8080

No http:// scheme. On macOS, curl fails immediately with "Protocol not supported" causing wait_for_startup to always time out and start-hugegraph.sh to exit 1 even though the server starts fine. Every other config in the repo uses http:// explicitly — raft CI configs, the Dockerfile sed patch, cluster test templates, and the Java ServerOptions default. The shipped default is inconsistent and breaks local macOS development.
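
Besides changing the shipped default, the startup check could defensively add a scheme when one is missing. A minimal sketch follows, where normalize_url is a hypothetical helper rather than something in the current scripts:

```shell
# Prints the URL unchanged if it already has an http(s) scheme,
# otherwise prefixes it with http://
normalize_url() {
  case "$1" in
    http://*|https://*) printf '%s\n' "$1" ;;
    *) printf 'http://%s\n' "$1" ;;
  esac
}
```

With this, normalize_url 127.0.0.1:8080 yields http://127.0.0.1:8080, while a value that already carries a scheme passes through untouched.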

