# Auto-Restart DockMon's intelligent auto-restart system automatically recovers containers from failures, ensuring high availability for critical services. ## Overview The auto-restart feature provides: - **Per-container configuration** - Fine-grained control over which containers auto-restart - **Configurable retry logic** - Set maximum attempts and delays - **Exponential backoff** - Prevent restart storms during persistent failures - **Integration with alerts** - Get notified when containers repeatedly fail - **Desired state tracking** - Ensure containers stay in intended state - **Global defaults** - Set system-wide auto-restart behavior ## How It Works ### Detection DockMon continuously monitors container states via Docker events and periodic polling. When a container transitions to an exited, dead, or stopped state, the auto-restart system evaluates whether to attempt recovery. ### Decision Process For each stopped container, DockMon checks: 1. **Is auto-restart enabled for this container?** - Checks database configuration - Falls back to global default if not explicitly configured 2. **Has max retry limit been reached?** - Tracks attempt count per container - Resets counter after successful restart - Gives up after max attempts exceeded 3. **Is retry delay satisfied?** - Enforces minimum delay between attempts - Applies backoff strategy (linear or exponential) - Prevents rapid restart loops 4. **Is blackout window active?** - Defers restart during maintenance windows - Queues for execution after blackout ends 5. **Is container already restarting?** - Prevents concurrent restart attempts - Tracks in-progress operations ### Restart Action When all conditions are met: 1. **Wait for retry delay** (respecting backoff strategy) 2. **Attempt container start** via Docker API 3. **Monitor result**: - Success: Reset attempt counter, resume monitoring - Failure: Increment counter, calculate next retry delay 4. **Send alert** if configured (on failure or max retries exceeded) ## Configuration ### Per-Container Configuration **Access**: Container Details → Auto-Restart Tab **Settings**: | Setting | Range | Default | Description | |---------|-------|---------|-------------| | **Enabled** | On/Off | Global default | Enable auto-restart for this container | | **Max Retries** | 0-10 | 3 | Maximum restart attempts before giving up | | **Retry Delay** | 5-300 seconds | 30 | Base delay between restart attempts | | **Backoff Strategy** | Linear/Exponential | Linear | How delay increases with each retry | ### Global Default Configuration **Access**: Settings → System → Auto-Restart Defaults **Settings**: - **Default Auto-Restart**: Enable/disable auto-restart for new containers - **Default Max Retries**: System-wide max retry count - **Default Retry Delay**: System-wide base retry delay **Note**: Global defaults only apply to containers without explicit configuration. Changing global defaults does not affect containers with existing configurations. ## Retry Strategies ### Linear Backoff (Default) Delay remains constant for all retry attempts. **Example** (30-second delay): - Attempt 1: Wait 30 seconds - Attempt 2: Wait 30 seconds - Attempt 3: Wait 30 seconds **Use cases**: - Services with predictable startup time - Network-dependent services (waiting for DNS/routing) - Database-dependent applications ### Exponential Backoff Delay doubles with each retry attempt. **Example** (30-second base delay): - Attempt 1: Wait 30 seconds - Attempt 2: Wait 60 seconds (2^1 × 30) - Attempt 3: Wait 120 seconds (2^2 × 30) - Attempt 4: Wait 240 seconds (2^3 × 30) **Use cases**: - Flaky services with intermittent failures - Services that need time to recover (memory leaks, cache warmup) - Protecting against restart storms **Maximum delay cap**: 300 seconds (5 minutes) to prevent indefinite waits ## Desired State Management Desired state works in conjunction with auto-restart to maintain container availability. ### Desired States | State | Behavior | Icon | |-------|----------|------| | **Should Run** | Container should always be running | Green play icon | | **On-Demand** | Container runs only when manually started | Gray clock icon | | **Unspecified** | No desired state (legacy containers) | No icon | ### Should Run When a container's desired state is "Should Run": - **Auto-restart activates** when container stops unexpectedly - **Warning icon displays** if container is stopped but should be running - **Alerts trigger** if container remains stopped beyond retry limit **Use cases**: - Web servers - API services - Background workers - Databases ### On-Demand When a container's desired state is "On-Demand": - **Auto-restart does NOT activate** (even if enabled) - **No warnings** when container is stopped - **Manual start required** each time **Use cases**: - One-time migration scripts - Development/testing containers - Scheduled jobs (cron-like containers) - Manual intervention tools **Note**: Setting desired state to "On-Demand" effectively disables auto-restart for that container, even if auto-restart is enabled. ## Status Tracking ### Restart Attempt Counter DockMon tracks restart attempts per container using composite keys (`host_id:container_id`): - **Increments** on each failed restart attempt - **Resets to zero** when: - Container successfully starts and runs for 60+ seconds - Auto-restart is manually disabled - Max retries exceeded (gives up) - **Persists across DockMon restarts** (stored in memory, not database) ### Restarting Status Containers in the process of being restarted show: - **Blue spinning circle** in status column - **"Restarting" label** - **Disabled action buttons** (cannot start/stop during restart) ### Failure Tracking When auto-restart fails: 1. **Event logged** to Events table (visible in Event Viewer) 2. **Alert triggered** if alert rule configured for exit state 3. **Attempt counter incremented** 4. **Next retry scheduled** based on backoff strategy ## Integration with Alerts Auto-restart failures can trigger alert notifications. ### Alert Rule Configuration **Create alert rule**: 1. Navigate to Settings → Alerts 2. Create new rule or edit existing 3. Configure: - **Scope**: Container - **Kind**: State Change - **Trigger States**: `exited`, `dead` - **Container Selector**: All containers or specific containers - **Notify Channels**: Your preferred channels (Telegram, Discord, etc.) **Alert message includes**: - Container name and ID - Host name - Exit code - Restart attempt count - Max retries remaining - Timestamp ### Suppression During Blackout Windows Auto-restart attempts are deferred during blackout windows: - **Stops accumulating during blackout**: Containers that stop during maintenance windows are not immediately restarted - **Queued for evaluation**: After blackout ends, DockMon checks all stopped containers - **Bulk restart**: Containers with `desired_state: should_run` are restarted after blackout - **Alerts sent for failures**: Post-blackout checks trigger alerts for containers still in failed state See [Blackout Windows](Blackout-Windows.md) for details. ## Best Practices ### When to Enable Auto-Restart **DO enable for**: - Production web servers - Critical API services - Database containers - Message queue workers - Reverse proxies and load balancers - Monitoring and logging services **DON'T enable for**: - One-time migration scripts - Data import/export jobs - Development containers - Test/CI containers - Containers that should fail-fast ### Retry Configuration Guidelines **Low-risk services** (can restart frequently): - Max retries: 5-10 - Retry delay: 10-15 seconds - Backoff: Linear **Example**: Static file servers, caches **Medium-risk services** (restart has some cost): - Max retries: 3-5 - Retry delay: 30-60 seconds - Backoff: Exponential **Example**: Application servers, APIs **High-risk services** (restart is expensive): - Max retries: 1-3 - Retry delay: 60-120 seconds - Backoff: Exponential **Example**: Databases, stateful services ### Desired State Best Practices **Always set desired state** for new containers: - Production services: "Should Run" - Development/testing: "On-Demand" - One-shot tasks: "On-Demand" **Benefits**: - Clear operational intent - Visual indicators for mismatches - Better alert targeting - Easier troubleshooting ### Alert Integration **Alert on**: - First restart attempt (informational) - Max retries exceeded (critical) - Exit code changes (error) - Restart patterns (warning - possible crash loop) **Alert channels**: - Critical services: Multiple channels (Telegram + Email) - Standard services: Primary channel only - Development: Low-priority channel or disabled ## Monitoring and Troubleshooting ### Viewing Auto-Restart Status **Dashboard**: - Auto-restart icon in Policy column (blue refresh = enabled) - Desired state icon in Policy column - Warning triangle if container should be running but isn't **Container Details**: - Auto-Restart tab shows full configuration - Current restart attempt count - Last restart timestamp - Next retry time (if in retry loop) **Events Tab**: - Filter by container - Look for "container_stopped", "container_started" events - Check exit codes for patterns ### Common Issues #### Container in Restart Loop **Symptoms**: - Container repeatedly starts and stops - High restart attempt count - Spinning "restarting" status **Diagnosis**: 1. Check logs for startup errors 2. Review exit codes in Events tab 3. Look for resource constraints (CPU, memory, disk) 4. Check dependencies (database, network, volumes) **Solutions**: - Disable auto-restart temporarily - Fix root cause (config, resources, code) - Increase retry delay to allow debugging - Set desired state to "On-Demand" until fixed #### Auto-Restart Not Working **Symptoms**: - Container stops but doesn't restart - No restart attempts logged **Diagnosis**: 1. Verify auto-restart is enabled (Container Details → Auto-Restart tab) 2. Check desired state (should be "Should Run") 3. Look for blackout window (Settings → Alerts) 4. Review max retries (may have been exceeded) 5. Check DockMon logs for errors **Solutions**: - Enable auto-restart if disabled - Set desired state to "Should Run" - Wait for blackout window to end - Reset retry counter (disable/re-enable auto-restart) - Check DockMon container logs: `docker logs dockmon` #### Too Many Restart Attempts **Symptoms**: - Service frequently restarts - High resource usage from restart overhead - Alert storm from repeated failures **Diagnosis**: 1. Check logs for error patterns 2. Review container resource limits (memory, CPU) 3. Check host resource availability 4. Look for external dependency failures (database, API) **Solutions**: - Increase resource limits if needed - Fix application bugs causing crashes - Add health checks to detect issues earlier - Use exponential backoff to reduce restart frequency - Reduce max retries to fail faster ### Resetting Retry Counter To reset a container's restart attempt counter: 1. **Disable auto-restart**: - Container Details → Auto-Restart tab - Toggle "Enabled" to Off - Save 2. **Re-enable auto-restart**: - Toggle "Enabled" to On - Save This clears the attempt counter and resets retry timing. ## Advanced Scenarios ### Coordinated Restarts For multi-container applications requiring startup order: 1. **Disable auto-restart** for dependent containers 2. **Use desired state** instead ("Should Run") 3. **Create health checks** for dependency detection 4. **Monitor via events** for manual intervention 5. **Or use Docker Compose** with `depends_on` and health checks **Why**: Auto-restart doesn't coordinate between containers. Use orchestration tools (Docker Compose, Kubernetes) for complex startup dependencies. ### Flapping Detection To detect containers that repeatedly restart (flapping): 1. **Create alert rule**: - Trigger: Container state change to "exited" - Occurrences: 3 within 60 seconds - Severity: Warning 2. **Review alerts** for patterns 3. **Investigate logs** for root cause 4. **Disable auto-restart** for flapping containers until fixed ### Graceful Degradation For non-critical services that should fail gracefully: 1. **Set low max retries** (1-2 attempts) 2. **Use exponential backoff** 3. **Create informational alerts** (not critical) 4. **Set desired state** to "On-Demand" **Example**: Optional caching services, analytics collectors ## Related Documentation - [Container Operations](Container-Operations.md) - Managing container lifecycle - [Blackout Windows](Blackout-Windows.md) - Maintenance periods and alert suppression - [Alerts](https://github.com/darthnorse/dockmon/wiki/Alerts) - Alert rules and notifications - [Settings](Settings.md) - Global configuration options