From 7b619801c13621abc8de39dad3e0db8c7db206ad Mon Sep 17 00:00:00 2001
From: Hemanth <74316298+hsavasere@users.noreply.github.com>
Date: Sat, 11 Oct 2025 11:54:53 +0530
Subject: [PATCH 1/3] docs(maintenance): add graceful shutdown procedures
 documentation

Add comprehensive documentation for graceful shutdown procedures in Fluss,
covering server shutdown processes, component-specific shutdown sequences,
best practices, and troubleshooting guidelines. The document provides
implementation details for both Coordinator and Tablet servers, along
with configuration references and monitoring recommendations.
---
 .../operations/graceful-shutdown.md           | 238 ++++++++++++++++++
 1 file changed, 238 insertions(+)
 create mode 100644 website/docs/maintenance/operations/graceful-shutdown.md
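
Reviewer note (ignored by `git am`): the "Executor Services" and "Remote Log Manager" sections in this patch describe the standard `shutdown()` / `awaitTermination()` / `shutdownNow()` sequence. A minimal sketch of that sequence with plain `java.util.concurrent` is shown here for reference. The class and method names are hypothetical, and this is not claimed to be how `ExecutorUtils.gracefulShutdown` is implemented in Fluss.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration class; not part of Fluss.
final class ExecutorShutdownSketch {

    /** Returns true if the pool drained within the timeout, false if it had to be forced. */
    static boolean shutdownWithTimeout(ExecutorService executor, long timeout, TimeUnit unit) {
        executor.shutdown(); // 1. stop accepting new tasks
        try {
            if (executor.awaitTermination(timeout, unit)) { // 2. wait for running tasks
                return true;
            }
            executor.shutdownNow(); // 3. timeout exceeded: interrupt remaining tasks
            return false;
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt(); // 4. preserve the interrupt status
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(() -> System.out.println("draining in-flight task"));
        boolean clean = shutdownWithTimeout(pool, 5, TimeUnit.SECONDS);
        System.out.println("clean shutdown: " + clean);
    }
}
```

Returning a boolean rather than throwing lets a caller log a warning and continue with the remaining shutdown steps.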
diff --git a/website/docs/maintenance/operations/graceful-shutdown.md b/website/docs/maintenance/operations/graceful-shutdown.md
new file mode 100644
index 0000000000..9614cf34e3
--- /dev/null
+++ b/website/docs/maintenance/operations/graceful-shutdown.md
@@ -0,0 +1,238 @@
+# Graceful Shutdown
+
+Fluss provides comprehensive graceful shutdown mechanisms to ensure data integrity and proper resource cleanup when stopping servers or services. This document covers the various shutdown procedures and best practices for different Fluss components.
+
+## Overview
+
+Graceful shutdown in Fluss ensures that:
+- All ongoing operations complete safely
+- Resources are properly released
+- Data consistency is maintained
+- Network connections are cleanly closed
+- Background tasks are terminated properly
+
+## Server Shutdown
+
+### Coordinator Server Shutdown
+
+The Coordinator Server implements a multi-stage shutdown process:
+
+1. **Shutdown Hook Registration**: The server registers a JVM shutdown hook that triggers graceful shutdown on process termination
+2. **Service Termination**: All services are stopped in a specific order to maintain consistency:
+
+   **Coordinator Server Shutdown Order:**
+   1. Server Metric Group → Metric Registry (async)
+   2. Auto Partition Manager → IO Executor (5s timeout)
+   3. Coordinator Event Processor → Coordinator Channel Manager
+   4. RPC Server (async) → Coordinator Service
+   5. Coordinator Context → Lake Table Tiering Manager
+   6. ZooKeeper Client → Authorizer
+   7. Dynamic Config Manager → Lake Catalog Dynamic Loader
+   8. RPC Client → Client Metric Group
+
+   **Tablet Server Shutdown Order:**
+   1. Tablet Server Metric Group → Metric Registry (async)
+   2. RPC Server (async) → Tablet Service
+   3. ZooKeeper Client → RPC Client → Client Metric Group
+   4. Scheduler → KV Manager → Remote Log Manager
+   5. Log Manager → Replica Manager
+   6. Authorizer → Dynamic Config Manager → Lake Catalog Dynamic Loader
+3. **Resource Cleanup**: Executors, connections, and other resources are properly closed
+
+```bash
+# Graceful shutdown via SIGTERM
+kill -TERM <coordinator-pid>
+
+# Or using the shutdown script (if available)
+./bin/stop-coordinator.sh
+```
+
+### Tablet Server Shutdown
+
+The Tablet Server supports **controlled shutdown** to minimize data unavailability:
+
+#### Controlled Shutdown Process
+
+1. **Leadership Transfer**: The server attempts to transfer leadership of all buckets it leads to other replicas
+2. **Retry Logic**: If leadership transfer fails, the server retries with configurable intervals
+3. **Timeout Handling**: After maximum retries, the server proceeds with unclean shutdown if necessary
+
+```bash
+# Initiate controlled shutdown
+kill -TERM <tablet-server-pid>
+```
+
+#### Configuration Options
+
+- **Controlled Shutdown Retries**: Number of attempts to transfer leadership
+- **Retry Interval**: Time between retry attempts (default: configurable via `CONTROLLED_SHUTDOWN_RETRY_INTERVAL_MS`)
+
+## Component-Specific Shutdown
+
+### Executor Services
+
+Fluss uses the `ExecutorUtils` class for graceful executor shutdown:
+
+```java
+// Graceful shutdown with timeout
+ExecutorUtils.gracefulShutdown(timeout, TimeUnit.SECONDS, executorService);
+
+// Non-blocking shutdown
+CompletableFuture<Void> shutdownFuture = 
+    ExecutorUtils.nonBlockingShutdown(timeout, TimeUnit.SECONDS, executorService);
+```
+
+**Shutdown Process**:
+1. Call `shutdown()` to stop accepting new tasks
+2. Wait for existing tasks to complete within timeout
+3. Force termination with `shutdownNow()` if timeout exceeded
+
+### Network Components
+
+#### RPC Server Shutdown
+
+The Netty-based RPC server implements asynchronous shutdown:
+
+```java
+CompletableFuture<Void> shutdownFuture = rpcServer.closeAsync();
+```
+
+**Shutdown Steps**:
+1. Stop accepting new connections
+2. Close existing channels gracefully
+3. Shutdown event loop groups
+4. Release worker pools
+
+#### Event Loop Groups
+
+Network event loops are shut down using Netty's graceful shutdown:
+
+```java
+group.shutdownGracefully()
+    .addListener(finished -> {
+        // Handle completion
+    });
+```
+
+### Remote Log Manager
+
+For components with thread pools, Fluss follows the standard Java pattern:
+
+1. **Disable New Tasks**: Call `shutdown()` to prevent new task submission
+2. **Wait for Completion**: Use `awaitTermination()` with timeout
+3. **Force Cancellation**: Call `shutdownNow()` if tasks don't complete
+4. **Handle Interruption**: Properly handle `InterruptedException`
+
+## Best Practices
+
+### 1. Use Shutdown Hooks
+
+Register shutdown hooks for critical services to ensure cleanup on JVM termination:
+
+```java
+Thread shutdownHook = ShutdownHookUtil.addShutdownHook(
+    service, 
+    "ServiceName", 
+    logger
+);
+```
+
+### 2. Implement Timeout Handling
+
+Always specify timeouts for shutdown operations to prevent indefinite blocking:
+
+```java
+// Good: With timeout
+ExecutorUtils.gracefulShutdown(30, TimeUnit.SECONDS, executor);
+
+// Avoid: Without timeout (may block indefinitely)
+executor.shutdown();
+executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
+```
+
+### 3. Order of Shutdown
+
+Shut down components in reverse order of their dependencies:
+
+1. Stop accepting new requests
+2. Complete ongoing operations
+3. Close client connections
+4. Shutdown background services
+5. Release system resources
+
+### 4. Handle Exceptions
+
+Properly handle exceptions during shutdown to ensure all cleanup steps execute:
+
+```java
+Throwable exception = null;
+try {
+    // Shutdown component 1
+} catch (Throwable t) {
+    exception = ExceptionUtils.firstOrSuppressed(t, exception);
+}
+try {
+    // Shutdown component 2
+} catch (Throwable t) {
+    exception = ExceptionUtils.firstOrSuppressed(t, exception);
+}
+// Continue for all components...
+```
+
+## Monitoring Shutdown
+
+### Logging
+
+Fluss provides detailed logging during shutdown processes:
+
+- **INFO**: Normal shutdown progress
+- **WARN**: Retry attempts or timeout warnings
+- **ERROR**: Shutdown failures or exceptions
+
+### Metrics
+
+Monitor shutdown-related metrics:
+
+- Shutdown duration
+- Failed shutdown attempts
+- Resource cleanup status
+
+## Troubleshooting
+
+### Common Issues
+
+1. **Hanging Shutdown**: 
+   - Check for blocking operations without timeouts
+   - Verify thread pool configurations
+   - Look for deadlocks in application code
+
+2. **Resource Leaks**:
+   - Ensure all `AutoCloseable` resources are properly closed
+   - Check for unclosed network connections
+   - Verify file handle cleanup
+
+3. **Data Loss**:
+   - Use controlled shutdown for Tablet Servers
+   - Ensure proper leadership transfer
+   - Verify replication factor settings
+
+### Debug Steps
+
+1. Enable debug logging for shutdown components
+2. Monitor JVM thread dumps during shutdown
+3. Check system resource usage
+4. Verify network connection states
+
+## Configuration Reference
+
+| Configuration | Description | Default |
+|---------------|-------------|---------|
+| `controlled.shutdown.max.retries` | Maximum retries for controlled shutdown | 3 |
+| `controlled.shutdown.retry.interval.ms` | Interval between retry attempts | 5000 |
+| `shutdown.timeout.ms` | General shutdown timeout | 30000 |
+
+## See Also
+
+- [Configuration](../configuration.md)
+- [Monitoring and Observability](../observability/monitor-metrics.md)
+- [Upgrading Fluss](upgrading.md)
\ No newline at end of file

From a4da5027e43fc1a50a05b43416097400e371dbd6 Mon Sep 17 00:00:00 2001
From: Hemanth <74316298+hsavasere@users.noreply.github.com>
Date: Sat, 11 Oct 2025 12:10:58 +0530
Subject: [PATCH 2/3] Remove unnecessary graceful shutdown docs

---
 .../operations/graceful-shutdown.md           | 129 ++----------------
 1 file changed, 9 insertions(+), 120 deletions(-)

diff --git a/website/docs/maintenance/operations/graceful-shutdown.md b/website/docs/maintenance/operations/graceful-shutdown.md
index 9614cf34e3..05ad88855c 100644
--- a/website/docs/maintenance/operations/graceful-shutdown.md
+++ b/website/docs/maintenance/operations/graceful-shutdown.md
@@ -30,13 +30,6 @@ The Coordinator Server implements a multi-stage shutdown process:
    7. Dynamic Config Manager → Lake Catalog Dynamic Loader
    8. RPC Client → Client Metric Group
 
-   **Tablet Server Shutdown Order:**
-   1. Tablet Server Metric Group → Metric Registry (async)
-   2. RPC Server (async) → Tablet Service
-   3. ZooKeeper Client → RPC Client → Client Metric Group
-   4. Scheduler → KV Manager → Remote Log Manager
-   5. Log Manager → Replica Manager
-   6. Authorizer → Dynamic Config Manager → Lake Catalog Dynamic Loader
 3. **Resource Cleanup**: Executors, connections, and other resources are properly closed
 
 ```bash
@@ -49,7 +42,15 @@ kill -TERM <coordinator-pid>
 
 ### Tablet Server Shutdown
 
-The Tablet Server supports **controlled shutdown** to minimize data unavailability:
+The Tablet Server supports **controlled shutdown** to minimize data unavailability. The shutdown process ensures that all services are stopped in a specific order to maintain consistency:
+
+   **Tablet Server Shutdown Order:**
+   1. Tablet Server Metric Group → Metric Registry (async)
+   2. RPC Server (async) → Tablet Service
+   3. ZooKeeper Client → RPC Client → Client Metric Group
+   4. Scheduler → KV Manager → Remote Log Manager
+   5. Log Manager → Replica Manager
+   6. Authorizer → Dynamic Config Manager → Lake Catalog Dynamic Loader
 
 #### Controlled Shutdown Process
 
@@ -67,118 +68,6 @@ kill -TERM <tablet-server-pid>
 - **Controlled Shutdown Retries**: Number of attempts to transfer leadership
 - **Retry Interval**: Time between retry attempts (default: configurable via `CONTROLLED_SHUTDOWN_RETRY_INTERVAL_MS`)
 
-## Component-Specific Shutdown
-
-### Executor Services
-
-Fluss uses the `ExecutorUtils` class for graceful executor shutdown:
-
-```java
-// Graceful shutdown with timeout
-ExecutorUtils.gracefulShutdown(timeout, TimeUnit.SECONDS, executorService);
-
-// Non-blocking shutdown
-CompletableFuture<Void> shutdownFuture = 
-    ExecutorUtils.nonBlockingShutdown(timeout, TimeUnit.SECONDS, executorService);
-```
-
-**Shutdown Process**:
-1. Call `shutdown()` to stop accepting new tasks
-2. Wait for existing tasks to complete within timeout
-3. Force termination with `shutdownNow()` if timeout exceeded
-
-### Network Components
-
-#### RPC Server Shutdown
-
-The Netty-based RPC server implements asynchronous shutdown:
-
-```java
-CompletableFuture<Void> shutdownFuture = rpcServer.closeAsync();
-```
-
-**Shutdown Steps**:
-1. Stop accepting new connections
-2. Close existing channels gracefully
-3. Shutdown event loop groups
-4. Release worker pools
-
-#### Event Loop Groups
-
-Network event loops are shut down using Netty's graceful shutdown:
-
-```java
-group.shutdownGracefully()
-    .addListener(finished -> {
-        // Handle completion
-    });
-```
-
-### Remote Log Manager
-
-For components with thread pools, Fluss follows the standard Java pattern:
-
-1. **Disable New Tasks**: Call `shutdown()` to prevent new task submission
-2. **Wait for Completion**: Use `awaitTermination()` with timeout
-3. **Force Cancellation**: Call `shutdownNow()` if tasks don't complete
-4. **Handle Interruption**: Properly handle `InterruptedException`
-
-## Best Practices
-
-### 1. Use Shutdown Hooks
-
-Register shutdown hooks for critical services to ensure cleanup on JVM termination:
-
-```java
-Thread shutdownHook = ShutdownHookUtil.addShutdownHook(
-    service, 
-    "ServiceName", 
-    logger
-);
-```
-
-### 2. Implement Timeout Handling
-
-Always specify timeouts for shutdown operations to prevent indefinite blocking:
-
-```java
-// Good: With timeout
-ExecutorUtils.gracefulShutdown(30, TimeUnit.SECONDS, executor);
-
-// Avoid: Without timeout (may block indefinitely)
-executor.shutdown();
-executor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
-```
-
-### 3. Order of Shutdown
-
-Shut down components in reverse order of their dependencies:
-
-1. Stop accepting new requests
-2. Complete ongoing operations
-3. Close client connections
-4. Shutdown background services
-5. Release system resources
-
-### 4. Handle Exceptions
-
-Properly handle exceptions during shutdown to ensure all cleanup steps execute:
-
-```java
-Throwable exception = null;
-try {
-    // Shutdown component 1
-} catch (Throwable t) {
-    exception = ExceptionUtils.firstOrSuppressed(t, exception);
-}
-try {
-    // Shutdown component 2
-} catch (Throwable t) {
-    exception = ExceptionUtils.firstOrSuppressed(t, exception);
-}
-// Continue for all components...
-```
-
 ## Monitoring Shutdown
 
 ### Logging

From b05515b6609b2d37816cff7feef0811ae60cf05f Mon Sep 17 00:00:00 2001
From: ipolyzos <ipolyzos.se@gmail.com>
Date: Sun, 12 Oct 2025 18:55:23 +0200
Subject: [PATCH 3/3] docs(maintenance): polish graceful shutdown guide

---
 .../operations/graceful-shutdown.md           | 50 ++++++++-----------
 1 file changed, 22 insertions(+), 28 deletions(-)
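
Reviewer note (ignored by `git am`): the "Controlled Shutdown Process" section describes leadership transfer with bounded retries, and this patch documents defaults of 3 retries and a 1000 ms interval. The loop below is a minimal sketch of that behavior, assuming "retries" means total attempts. `ControlledShutdownSketch`, `controlledShutdown`, and the `transferLeadership` callback are hypothetical names, not Fluss APIs.

```java
import java.util.function.BooleanSupplier;

// Hypothetical illustration class; not part of Fluss.
final class ControlledShutdownSketch {

    /**
     * Calls transferLeadership up to maxRetries times, sleeping retryIntervalMs
     * between attempts. Returns true once a transfer succeeds; false means the
     * caller would fall back to an unclean shutdown.
     */
    static boolean controlledShutdown(
            BooleanSupplier transferLeadership, int maxRetries, long retryIntervalMs) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            if (transferLeadership.getAsBoolean()) {
                return true;
            }
            if (attempt < maxRetries) {
                try {
                    Thread.sleep(retryIntervalMs); // wait before the next attempt
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false; // interrupted: stop retrying
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated transfer that only succeeds on the third attempt.
        boolean ok = controlledShutdown(() -> ++calls[0] >= 3, 3, 10);
        System.out.println("transferred=" + ok + ", attempts=" + calls[0]);
    }
}
```

With the documented defaults, a server that never manages to hand over leadership would spend roughly two seconds sleeping between attempts before giving up, plus however long the attempts themselves take.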

diff --git a/website/docs/maintenance/operations/graceful-shutdown.md b/website/docs/maintenance/operations/graceful-shutdown.md
index 05ad88855c..7a183c3a5c 100644
--- a/website/docs/maintenance/operations/graceful-shutdown.md
+++ b/website/docs/maintenance/operations/graceful-shutdown.md
@@ -1,6 +1,8 @@
 # Graceful Shutdown
 
-Fluss provides comprehensive graceful shutdown mechanisms to ensure data integrity and proper resource cleanup when stopping servers or services. This document covers the various shutdown procedures and best practices for different Fluss components.
+Apache Fluss provides a **comprehensive graceful shutdown mechanism** to ensure data integrity and proper resource cleanup when stopping servers or services.
+
+This guide describes the shutdown procedures, configuration options, and best practices for each Fluss component.
 
 ## Overview
 
@@ -11,12 +13,14 @@ Graceful shutdown in Fluss ensures that:
 - Network connections are cleanly closed
 - Background tasks are terminated properly
 
+These guarantees help prevent data corruption and allow the system to restart cleanly.
+
 ## Server Shutdown
 
 ### Coordinator Server Shutdown
 
-The Coordinator Server implements a multi-stage shutdown process:
-
+The **Coordinator Server** uses a multi-stage shutdown process to safely terminate all services in the correct order.
+#### Shutdown Process
 1. **Shutdown Hook Registration**: The server registers a JVM shutdown hook that triggers graceful shutdown on process termination
 2. **Service Termination**: All services are stopped in a specific order to maintain consistency:
 
@@ -42,15 +46,15 @@ kill -TERM <coordinator-pid>
 
 ### Tablet Server Shutdown
 
-The Tablet Server supports **controlled shutdown** to minimize data unavailability. The shutdown process ensures that all services are stopped in a specific order to maintain consistency:
+The **Tablet Server** supports a **controlled shutdown process** designed to minimize data unavailability and ensure leadership handover before termination.
 
-   **Tablet Server Shutdown Order:**
-   1. Tablet Server Metric Group → Metric Registry (async)
-   2. RPC Server (async) → Tablet Service
-   3. ZooKeeper Client → RPC Client → Client Metric Group
-   4. Scheduler → KV Manager → Remote Log Manager
-   5. Log Manager → Replica Manager
-   6. Authorizer → Dynamic Config Manager → Lake Catalog Dynamic Loader
+**Shutdown Order:**
+1. Tablet Server Metric Group → Metric Registry (async)
+2. RPC Server (async) → Tablet Service
+3. ZooKeeper Client → RPC Client → Client Metric Group
+4. Scheduler → KV Manager → Remote Log Manager
+5. Log Manager → Replica Manager
+6. Authorizer → Dynamic Config Manager → Lake Catalog Dynamic Loader
 
 #### Controlled Shutdown Process
 
@@ -65,8 +69,8 @@ kill -TERM <tablet-server-pid>
 
 #### Configuration Options
 
-- **Controlled Shutdown Retries**: Number of attempts to transfer leadership
-- **Retry Interval**: Time between retry attempts (default: configurable via `CONTROLLED_SHUTDOWN_RETRY_INTERVAL_MS`)
+- **Controlled Shutdown Retries**: Number of attempts to transfer leadership (default: `3`)
+- **Retry Interval**: Time between retry attempts, in milliseconds (default: `1000`)
 
 ## Monitoring Shutdown
 
@@ -89,21 +93,11 @@ Monitor shutdown-related metrics:
 ## Troubleshooting
 
 ### Common Issues
-
-1. **Hanging Shutdown**: 
-   - Check for blocking operations without timeouts
-   - Verify thread pool configurations
-   - Look for deadlocks in application code
-
-2. **Resource Leaks**:
-   - Ensure all `AutoCloseable` resources are properly closed
-   - Check for unclosed network connections
-   - Verify file handle cleanup
-
-3. **Data Loss**:
-   - Use controlled shutdown for Tablet Servers
-   - Ensure proper leadership transfer
-   - Verify replication factor settings
+| Issue                | Possible Causes                                                 | Recommended Actions                                                             |
+| -------------------- | --------------------------------------------------------------- | ------------------------------------------------------------------------------- |
+| **Hanging shutdown** | Blocking operations, thread pool misconfiguration, or deadlocks | Check for blocking calls without timeouts, inspect thread dumps                 |
+| **Resource leaks**   | Unclosed resources or connections                               | Verify all `AutoCloseable` resources and file handles are closed                |
+| **Data loss**        | Unclean shutdown or failed leadership transfer                  | Always use controlled shutdown for Tablet Servers and verify replication factor |
 
 ### Debug Steps