Epic: Implement Proper Reconciliation for Kubernetes and Filesystem Modes #83

@teemow

Problem Statement

Currently, muster uses an event-driven architecture without proper reconciliation loops. This means:

  • No automatic change detection: Changes to CRDs/YAML files are not automatically detected and applied
  • No drift correction: System state can diverge from desired state without detection
  • Manual synchronization: Users must manually trigger operations to apply configuration changes
  • Inconsistent behavior: Kubernetes and filesystem modes behave differently regarding state synchronization

Current State Analysis

Based on code analysis, muster currently:

  • ✅ Has unified client interface (MusterClient) supporting both K8s and filesystem modes
  • ✅ Uses event-driven updates for service state changes
  • Missing: Kubernetes controllers/watchers for CRD changes
  • Missing: Filesystem watchers for YAML file changes
  • Missing: Reconciliation loops to ensure desired state

Desired End State

Implement a unified reconciliation system that:

  1. Automatically detects changes in both Kubernetes and filesystem modes
  2. Reconciles desired vs actual state for all resource types
  3. Maintains consistency across different deployment modes
  4. Provides feedback on reconciliation status and errors
  5. Integrates seamlessly with existing event-driven architecture

Success Criteria

  • CRD changes in Kubernetes are automatically detected and reconciled
  • YAML file changes in filesystem mode are automatically detected and reconciled
  • System automatically corrects drift between desired and actual state
  • Reconciliation status is visible through API and CLI
  • Performance impact is minimal (efficient watching/polling; the non-functional acceptance criteria set the concrete latency, memory, and CPU targets)
  • Backward compatibility is maintained

Architecture Overview

Kubernetes Mode

K8s API Server → Controller/Informer → Reconciler → Internal State → Services
     ↑                                                                    ↓
     └─────────────── Status Updates ←──────────────── Events ←─────────────┘

Filesystem Mode

File Watcher → Change Detector → Reconciler → Internal State → Services
     ↑                                                             ↓
     └─────────── File Updates ←───────────── Events ←─────────────┘

Epic Breakdown

Phase 1: Foundation (4-6 weeks)

  • Reconciliation Framework: Core interfaces and patterns
  • Change Detection: File watchers and Kubernetes informers
  • Reconciler Registry: Plugin system for different resource types

Phase 2: Resource Reconcilers (6-8 weeks)

  • MCPServer Reconciler: Handle MCPServer lifecycle
  • ServiceClass Reconciler: Manage ServiceClass definitions
  • Workflow Reconciler: Process Workflow definitions
  • Service Instance Reconciler: Coordinate running services

Phase 3: Integration & Polish (3-4 weeks)

  • Event Integration: Connect with existing event system
  • Performance Optimization: Efficient batching and caching
  • Monitoring & Observability: Metrics and logging
  • Documentation: Architecture and operational guides

Key Components to Implement

1. Reconciliation Framework

  • ReconcileManager - Central coordination
  • Reconciler interface - Standard reconciliation pattern
  • ChangeDetector interface - Unified change detection
  • ReconcileLoop - Generic reconciliation loop implementation
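
To make the component list above more concrete, here is a minimal sketch of how the pieces could fit together. Only the names ReconcileManager, Reconciler, and ChangeDetector come from this epic; the method signatures, the ChangeEvent type, and the staticDetector and recorder helpers are assumptions made for illustration, not muster's actual API.

```go
package main

import (
	"context"
	"fmt"
)

// ChangeEvent describes one detected change (hypothetical shape).
type ChangeEvent struct {
	Kind string // e.g. "MCPServer", "ServiceClass", "Workflow"
	Name string
}

// Reconciler brings one resource type to its desired state.
type Reconciler interface {
	Kind() string
	Reconcile(ctx context.Context, name string) error
}

// ChangeDetector hides whether changes come from informers or file watchers.
type ChangeDetector interface {
	Changes(ctx context.Context) <-chan ChangeEvent
}

// ReconcileManager routes each event to the reconciler registered for its kind.
type ReconcileManager struct {
	reconcilers map[string]Reconciler
}

func NewReconcileManager() *ReconcileManager {
	return &ReconcileManager{reconcilers: make(map[string]Reconciler)}
}

func (m *ReconcileManager) Register(r Reconciler) {
	m.reconcilers[r.Kind()] = r
}

// Run is the generic loop: it consumes events until the detector closes its
// channel or ctx is cancelled, and returns the errors it collected.
func (m *ReconcileManager) Run(ctx context.Context, d ChangeDetector) []error {
	var errs []error
	events := d.Changes(ctx)
	for {
		select {
		case <-ctx.Done():
			return append(errs, ctx.Err())
		case ev, ok := <-events:
			if !ok {
				return errs
			}
			r, found := m.reconcilers[ev.Kind]
			if !found {
				errs = append(errs, fmt.Errorf("no reconciler registered for kind %q", ev.Kind))
				continue
			}
			if err := r.Reconcile(ctx, ev.Name); err != nil {
				errs = append(errs, fmt.Errorf("reconcile %s/%s: %w", ev.Kind, ev.Name, err))
			}
		}
	}
}

// staticDetector replays a fixed list of events; a stand-in for a real detector.
type staticDetector []ChangeEvent

func (s staticDetector) Changes(ctx context.Context) <-chan ChangeEvent {
	ch := make(chan ChangeEvent, len(s))
	for _, ev := range s {
		ch <- ev
	}
	close(ch)
	return ch
}

// recorder is a demo reconciler that only records what it was asked to do.
type recorder struct {
	kind string
	seen *[]string
}

func (r recorder) Kind() string { return r.kind }

func (r recorder) Reconcile(_ context.Context, name string) error {
	*r.seen = append(*r.seen, r.kind+"/"+name)
	return nil
}

func runDemo() (seen []string, errs []error) {
	m := NewReconcileManager()
	m.Register(recorder{kind: "MCPServer", seen: &seen})
	m.Register(recorder{kind: "Workflow", seen: &seen})
	errs = m.Run(context.Background(), staticDetector{
		{Kind: "MCPServer", Name: "github"},
		{Kind: "Workflow", Name: "deploy"},
		{Kind: "Unknown", Name: "x"},
	})
	return seen, errs
}

func main() {
	seen, errs := runDemo()
	fmt.Println(seen, len(errs))
}
```

Keeping the loop ignorant of where events originate is what would let Kubernetes and filesystem modes share the same reconcilers.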

2. Kubernetes Controllers

  • MCPServerController - Watch and reconcile MCPServer CRDs
  • ServiceClassController - Watch and reconcile ServiceClass CRDs
  • WorkflowController - Watch and reconcile Workflow CRDs
  • Integration with controller-runtime framework

3. Filesystem Watchers

  • FileSystemWatcher - Monitor YAML file changes
  • DirectoryScanner - Periodic validation scans
  • FileChangeProcessor - Handle file system events
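
The periodic validation scan could work roughly as sketched below: hash every YAML file, then compare the result with the previous scan to catch changes a watcher may have missed. This uses only the Go standard library. The function names Snapshot and DiffSnapshots, the Diff type, and the choice of SHA-256 are assumptions for illustration; the epic does not specify how DirectoryScanner detects changes.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"sort"
)

// Diff lists the paths that changed between two scans.
type Diff struct {
	Added, Modified, Removed []string
}

// Snapshot maps every .yaml/.yml file under dir to a hash of its content.
func Snapshot(dir string) (map[string]string, error) {
	snap := make(map[string]string)
	err := filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		ext := filepath.Ext(path)
		if d.IsDir() || (ext != ".yaml" && ext != ".yml") {
			return nil
		}
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		sum := sha256.Sum256(data)
		snap[path] = hex.EncodeToString(sum[:])
		return nil
	})
	return snap, err
}

// DiffSnapshots compares two scans and returns sorted path lists.
func DiffSnapshots(prev, next map[string]string) Diff {
	var d Diff
	for path, hash := range next {
		old, ok := prev[path]
		switch {
		case !ok:
			d.Added = append(d.Added, path)
		case old != hash:
			d.Modified = append(d.Modified, path)
		}
	}
	for path := range prev {
		if _, ok := next[path]; !ok {
			d.Removed = append(d.Removed, path)
		}
	}
	sort.Strings(d.Added)
	sort.Strings(d.Modified)
	sort.Strings(d.Removed)
	return d
}

// scanDemo edits files in a temp directory between two scans.
func scanDemo() (Diff, error) {
	dir, err := os.MkdirTemp("", "muster-scan")
	if err != nil {
		return Diff{}, err
	}
	defer os.RemoveAll(dir)
	write := func(files map[string]string) error {
		for name, content := range files {
			if err := os.WriteFile(filepath.Join(dir, name), []byte(content), 0o600); err != nil {
				return err
			}
		}
		return nil
	}
	if err := write(map[string]string{"a.yaml": "replicas: 1", "b.yaml": "replicas: 1", "notes.txt": "ignored"}); err != nil {
		return Diff{}, err
	}
	before, err := Snapshot(dir)
	if err != nil {
		return Diff{}, err
	}
	if err := write(map[string]string{"a.yaml": "replicas: 2", "c.yml": "replicas: 1"}); err != nil {
		return Diff{}, err
	}
	if err := os.Remove(filepath.Join(dir, "b.yaml")); err != nil {
		return Diff{}, err
	}
	after, err := Snapshot(dir)
	if err != nil {
		return Diff{}, err
	}
	return DiffSnapshots(before, after), nil
}

func main() {
	d, err := scanDemo()
	if err != nil {
		fmt.Println("scan failed:", err)
		return
	}
	fmt.Printf("added=%d modified=%d removed=%d\n", len(d.Added), len(d.Modified), len(d.Removed))
}
```

Hashing content rather than comparing modification times avoids reprocessing files that were touched without a real change, at the cost of reading every file on each scan.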

4. State Synchronization

  • StateComparator - Compare desired vs actual state
  • DriftDetector - Identify configuration drift
  • ActionPlanner - Determine reconciliation actions
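
The comparison and planning steps above can be illustrated with a small pure function. This is a sketch under simplifying assumptions: each resource is reduced to a name and an opaque spec string, and the Action type, its constants, and the Plan function are hypothetical names rather than part of muster.

```go
package main

import (
	"fmt"
	"sort"
)

// ActionType is the kind of step needed to converge one resource.
type ActionType string

const (
	ActionCreate ActionType = "create"
	ActionUpdate ActionType = "update"
	ActionDelete ActionType = "delete"
)

// Action is one planned reconciliation step.
type Action struct {
	Type ActionType
	Name string
}

// Plan compares desired and actual specs, both keyed by resource name, and
// returns the actions needed to converge. The result is sorted by name so
// the plan is deterministic.
func Plan(desired, actual map[string]string) []Action {
	var actions []Action
	for name, want := range desired {
		have, exists := actual[name]
		switch {
		case !exists:
			actions = append(actions, Action{ActionCreate, name})
		case have != want:
			// Drift: the resource exists but differs from its definition.
			actions = append(actions, Action{ActionUpdate, name})
		}
	}
	for name := range actual {
		if _, wanted := desired[name]; !wanted {
			actions = append(actions, Action{ActionDelete, name})
		}
	}
	sort.Slice(actions, func(i, j int) bool { return actions[i].Name < actions[j].Name })
	return actions
}

func main() {
	desired := map[string]string{"github": "v2", "slack": "v1"}
	actual := map[string]string{"github": "v1", "jira": "v1"}
	for _, a := range Plan(desired, actual) {
		fmt.Printf("%s %s\n", a.Type, a.Name)
	}
}
```

Because Plan returns nothing once actual matches desired, running it repeatedly is harmless, which is the idempotency property the risk table asks for.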

Technical Considerations

Performance

  • Use Kubernetes informers with proper caching
  • Implement file watching with debouncing
  • Batch operations where possible
  • Configurable reconciliation intervals
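
Debouncing matters because editors and tools often emit several write events for a single save. The sketch below shows one way to collapse such a burst into one callback per path using only the standard library. The Debouncer type and its methods are hypothetical; a real implementation would feed it from fsnotify events, which are left out here so the example stays self-contained.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Debouncer delays a callback until no new trigger for the same key has
// arrived within the configured window.
type Debouncer struct {
	mu     sync.Mutex
	delay  time.Duration
	timers map[string]*time.Timer
	fire   func(key string)
}

func NewDebouncer(delay time.Duration, fire func(key string)) *Debouncer {
	return &Debouncer{delay: delay, timers: make(map[string]*time.Timer), fire: fire}
}

// Trigger restarts the quiet-period timer for key.
func (d *Debouncer) Trigger(key string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if old, ok := d.timers[key]; ok {
		old.Stop()
	}
	var t *time.Timer
	t = time.AfterFunc(d.delay, func() {
		d.mu.Lock()
		// Only the most recent timer for this key may fire; a stale timer
		// that lost the race with Stop is ignored.
		current := d.timers[key] == t
		if current {
			delete(d.timers, key)
		}
		d.mu.Unlock()
		if current {
			d.fire(key)
		}
	})
	d.timers[key] = t
}

// debounceDemo sends a burst of events and reports callbacks per path.
func debounceDemo() map[string]int {
	var mu sync.Mutex
	counts := make(map[string]int)
	d := NewDebouncer(50*time.Millisecond, func(path string) {
		mu.Lock()
		counts[path]++
		mu.Unlock()
	})
	for i := 0; i < 5; i++ {
		d.Trigger("mcpservers/github.yaml")
	}
	d.Trigger("workflows/deploy.yaml")
	time.Sleep(300 * time.Millisecond) // wait well past the debounce window

	mu.Lock()
	defer mu.Unlock()
	out := make(map[string]int, len(counts))
	for k, v := range counts {
		out[k] = v
	}
	return out
}

func main() {
	for path, n := range debounceDemo() {
		fmt.Printf("%s reconciled %d time(s)\n", path, n)
	}
}
```

The window length trades responsiveness against wasted work, so it is a natural candidate for the configurable intervals listed above.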

Error Handling

  • Exponential backoff for failed reconciliations
  • Dead letter queue for persistently failing items
  • Detailed error reporting and logging
  • Circuit breaker patterns for external dependencies
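
A rough sketch of the first two points, exponential backoff that hands persistently failing items to a dead letter record, might look like this. The names Backoff, RetryWithBackoff, and DeadLetter, as well as the chosen limits, are illustrative assumptions. The sleep function is injected so the example runs instantly; production code would pass time.Sleep and would typically add jitter.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Backoff returns the delay before retry number attempt (0-based):
// base doubled attempt times, capped at maxDelay.
func Backoff(base, maxDelay time.Duration, attempt int) time.Duration {
	d := base
	for i := 0; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	return d
}

// DeadLetter records an item that kept failing after all attempts.
type DeadLetter struct {
	Item     string
	Err      error
	Attempts int
}

// RetryWithBackoff runs op up to maxAttempts times, sleeping between
// failures. It returns nil on success, or a DeadLetter once attempts are
// exhausted so the caller can park the item instead of retrying forever.
func RetryWithBackoff(item string, maxAttempts int, base, maxDelay time.Duration,
	sleep func(time.Duration), op func() error) *DeadLetter {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if attempt < maxAttempts-1 {
			sleep(Backoff(base, maxDelay, attempt))
		}
	}
	return &DeadLetter{Item: item, Err: err, Attempts: maxAttempts}
}

// retryDemo records the requested delays instead of actually sleeping.
func retryDemo() ([]int64, *DeadLetter) {
	var delaysMs []int64
	sleep := func(d time.Duration) { delaysMs = append(delaysMs, d.Milliseconds()) }
	op := func() error { return errors.New("backend unavailable") }
	dl := RetryWithBackoff("MCPServer/github", 6, 100*time.Millisecond, time.Second, sleep, op)
	return delaysMs, dl
}

func main() {
	delays, dl := retryDemo()
	fmt.Println("delays (ms):", delays)
	if dl != nil {
		fmt.Printf("dead-lettered %s after %d attempts: %v\n", dl.Item, dl.Attempts, dl.Err)
	}
}
```

With a 100 ms base and a 1 s cap, six failing attempts wait 100, 200, 400, 800, and 1000 ms before the item is dead-lettered.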

Backward Compatibility

  • Maintain existing API interfaces
  • Support both reconciled and manual operation modes
  • Graceful degradation when reconciliation is disabled
  • Migration path for existing deployments

Dependencies

  • External:
    • controller-runtime (Kubernetes controllers)
    • fsnotify (filesystem watching)
    • golang.org/x/sync (coordination primitives)
  • Internal:
    • Existing MusterClient interface
    • Current event system integration
    • Service orchestrator patterns

Risks & Mitigation

Risk                      | Impact | Mitigation
Performance degradation   | High   | Benchmarking, caching, configurable intervals
Race conditions           | High   | Proper locking, event ordering, idempotent operations
Complex state management  | Medium | Clear interfaces, extensive testing, state machines
Backward compatibility    | Medium | Feature flags, graceful degradation, migration guides

Acceptance Criteria

Functional

  • Changes to CRDs are automatically applied to running services
  • Changes to YAML files are automatically detected and processed
  • System recovers automatically from configuration drift
  • Reconciliation can be disabled/enabled per resource type
  • Reconciliation status is exposed via API

Non-Functional

  • Reconciliation adds <100ms latency to normal operations
  • Memory usage increases by <50MB for typical workloads
  • CPU usage increases by <5% for typical workloads
  • 99.9% uptime maintained during reconciliation operations
  • Zero data loss during reconciliation failures

Testing

  • Unit tests for all reconciliation components

Epic Owner: @teemow
Estimated Effort: 13-18 weeks
Priority: High
Labels: epic, enhancement, kubernetes, reconciliation, architecture
