-
Notifications
You must be signed in to change notification settings - Fork 0
Closed
Labels
area/distributedDistributed coordination and TiKVDistributed coordination and TiKVpriority/highHigh priorityHigh prioritysize/MMedium: 3-5 daysMedium: 3-5 daystype/featureNew feature or functionalityNew feature or functionality
Description
Summary
Implement worker heartbeat for liveness tracking and zombie detection to reclaim jobs from dead workers.
Parent Epic
- [Epic] Distributed Roboflow with Alibaba Cloud (OSS + ACK) #9 Distributed Roboflow with TiKV Coordination
Dependencies
- Depends on: feat: add LeRobot 2.1 dataset support #28 (TiKV Client), chore: update .gitignore #31 (Worker Loop)
- Related: [Phase 5] Frame-level checkpoint with TiKV and multipart resume #19 (Checkpoint)
Design
Heartbeat
- Each worker writes to
/heartbeat/{pod_id}periodically - Contains timestamp and current job info
- Combined with checkpoint for efficiency
Zombie Detection
- Any worker can scan for stale heartbeats
- Reclaim jobs from dead workers
- Preserve checkpoint for resume
Tasks
4.5.1 Define Heartbeat Thread
- Create
src/distributed/heartbeat.rs - Define
HeartbeatManager:- pod_id: String
- tikv_client: TikvClient
- current_job: Arc<Mutex<Option>>
- Define configuration:
- heartbeat_interval (default: 30s)
- stale_threshold (default: 5 minutes)
4.5.2 Implement Heartbeat Update
update_heartbeat():- Write to
/heartbeat/{pod_id} - Include: pod_id, hostname, current_job, last_seen, started_at
- Write to
start_background_thread():- Spawn thread
- Loop: update, sleep(interval)
- Stop on shutdown signal
4.5.3 Integrate with Checkpoint
- When processing a job:
- Heartbeat included in checkpoint transaction
- Single TiKV round trip for both
- When idle:
- Standalone heartbeat update
- Reduce TiKV load
4.5.4 Define Zombie Reaper
- Create
src/distributed/reaper.rs - Define
ZombieReaper:- tikv_client: TikvClient
- stale_threshold: Duration
- Runs periodically (every 60s)
- Not leader-elected (all workers run it)
4.5.5 Implement Zombie Detection
find_stale_workers() -> Vec<String>:- Scan
/heartbeat/prefix - Filter where last_seen < now - threshold
- Or heartbeat key missing
- Scan
find_orphaned_jobs() -> Vec<JobRecord>:- Query jobs with status=Processing
- Check if owner's heartbeat is stale
4.5.6 Implement Job Reclamation
reclaim_job(job_id: &str) -> Result<bool>:- Transaction:
- Read job (verify still Processing)
- Read owner heartbeat (verify stale)
- Write: status=Pending, owner=null
- Commit
- Return true if reclaimed
- Transaction:
- Preserve checkpoint (crucial!)
- Log reclamation event
4.5.7 Implement Reaper Loop
run():loop { orphaned = find_orphaned_jobs() for job in orphaned: if reclaim_job(job): log("Reclaimed job {job} from dead worker") sleep(60s) }- Limit reclaims per iteration (prevent thundering herd)
4.5.8 Cleanup on Shutdown
- On graceful shutdown:
- Stop heartbeat thread
- Delete heartbeat key
- Release any held jobs (return to Pending)
- On crash: Heartbeat goes stale naturally
4.5.9 Add Metrics
heartbeat_updates_totalheartbeat_age_seconds(gauge)reaper_jobs_reclaimed_totalreaper_stale_workers_found_total
Acceptance Criteria
- Heartbeat thread updates periodically
- Heartbeat combined with checkpoint
- Zombie reaper finds stale workers
- Orphaned jobs reclaimed correctly
- Checkpoint preserved on reclaim
- Graceful shutdown cleans up
- Metrics exported
- Integration test: Kill worker, verify job reclaimed
Files to Create
src/distributed/heartbeat.rssrc/distributed/reaper.rs
Files to Modify
src/distributed/mod.rssrc/distributed/worker.rs(integrate heartbeat)src/distributed/checkpoint.rs(combine with heartbeat)
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
area/distributedDistributed coordination and TiKVDistributed coordination and TiKVpriority/highHigh priorityHigh prioritysize/MMedium: 3-5 daysMedium: 3-5 daystype/featureNew feature or functionalityNew feature or functionality