[Improvement]: Support graceful shutdown for in-progress optimizer tasks #4198

@j1wonpark

Description

Search before asking

  • I have searched in the issues and found no similar issues.

What would you like to be improved?

When an optimizer process receives SIGTERM (e.g., a K8s pod termination or rolling restart), in-progress tasks are silently dropped and AMS later re-schedules the same tasks, doubling the work and potentially producing duplicate commits when the original task had already passed its commit point.

This happens due to three compounding root causes:

  1. Dropped results: stopOptimizing() only flips a stopped flag, causing any subsequent completeTask() call to be silently skipped by the gated retry loop.
  2. Signal not reaching the JVM: The optimizer container runs as sh -c <args>, leaving sh as PID 1. SIGTERM from K8s is delivered to sh and never forwarded to the Java process, so the pod is killed by SIGKILL before graceful shutdown can run.
  3. Shutdown ordering: Hadoop's FileSystem cache cleanup and the graceful shutdown hook run concurrently via JVM Runtime shutdown hooks, causing in-flight HDFS writers to hit ClosedChannelException during row-group flush.
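Root cause 1 can be reproduced with a minimal sketch. The class and method names below are illustrative, not the actual Amoro code; it only shows the failure shape: once `stopOptimizing()` flips the flag, the gated retry loop in `completeTask()` exits before ever reporting, so a finished task's result vanishes without an error.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical minimal model of the dropped-result bug (not the real Amoro classes).
class ResultReporter {
    final AtomicBoolean stopped = new AtomicBoolean(false);
    final List<String> reported = new ArrayList<>();

    void completeTask(String result) {
        // Gated retry loop: every attempt is skipped once shutdown has begun.
        for (int attempt = 0; attempt < 3 && !stopped.get(); attempt++) {
            if (report(result)) {
                return;
            }
        }
        // Falls through silently after stop: no error raised, no fallback report.
    }

    boolean report(String result) {
        reported.add(result);
        return true;
    }

    void stopOptimizing() {
        stopped.set(true); // only flips the flag; in-flight results are not drained
    }
}
```

A task completing after `stopOptimizing()` is simply lost, which is exactly the window AMS later fills by re-scheduling the same task.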

How should we improve?

  • Drain in-progress tasks: stopOptimizing() should wait for executor threads to finish up to a configurable timeout (--shutdown-timeout-ms, default 10 min), keeping the toucher alive during the drain so AMS heartbeats continue.
  • Best-effort result reporting: After shutdown is requested, completeTask() should fall back to a single direct call instead of silently dropping the result.
  • Signal delivery: Wrap the container command with exec so the JVM replaces the shell and receives SIGTERM directly. Apply to both KubernetesOptimizerContainer and optimizer.sh start-foreground.
  • Shutdown ordering: Register the graceful shutdown hook on Hadoop's ShutdownHookManager with higher priority than FS_CACHE (and SPARK_CONTEXT_SHUTDOWN_PRIORITY for Spark), with an explicit timeout that matches shutdown-timeout-ms.
  • K8s grace period: Derive terminationGracePeriodSeconds automatically from shutdown-timeout-ms plus a buffer so the pod is not SIGKILL'd before the drain completes.
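The drain step could be sketched as below, assuming the optimizer's executor threads run on a `java.util.concurrent.ExecutorService`; the `drain` helper and `shutdownTimeoutMs` parameter are illustrative names, not the actual Amoro API.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of draining in-progress tasks with a bounded wait, assuming an
// ExecutorService runs the optimizer tasks. Returns true if the drain completed
// within the timeout, false if remaining tasks had to be interrupted.
class GracefulShutdown {
    static boolean drain(ExecutorService executor, long shutdownTimeoutMs) {
        executor.shutdown(); // stop accepting new tasks; let in-progress ones finish
        try {
            // Wait up to --shutdown-timeout-ms (default 10 min) for the drain.
            if (!executor.awaitTermination(shutdownTimeoutMs, TimeUnit.MILLISECONDS)) {
                executor.shutdownNow(); // timeout expired: interrupt what is left
                return false;
            }
            return true;
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

For the ordering fix, this drain would be wrapped in a `Runnable` and registered via Hadoop's `ShutdownHookManager.get().addShutdownHook(hook, priority, timeout, unit)` with a priority above `FileSystem.SHUTDOWN_HOOK_PRIORITY`, so it runs before the FileSystem cache is closed.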

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
