Skip to content

Conversation

@abhishekbafna
Copy link
Collaborator

Fix Minion Task Metrics Collection Gaps

Problem Statement

The Minion subtask metrics collection had two scenarios where metrics were not being reported accurately:

  1. Previously in-progress tasks that completed between collection cycles: Tasks that were in-progress in one collection run but completed before the next run were not included in metrics, causing their final states (especially errors) to be missed.

  2. Short-lived tasks that started and completed between collection cycles: Tasks that started and completed entirely between two 15-minute collection cycles were never captured in metrics.

Solution

1. Track Previously In-Progress Tasks

  • Added _previousInProgressTasks map to maintain state of tasks that were in-progress in the previous execution cycle
  • Compare previous and current in-progress tasks to detect completed tasks
  • Include completed tasks in current cycle's metrics collection

2. Detect Short-Lived Tasks

  • Added getTasksStartedAfter() method in PinotHelixTaskResourceManager that uses WorkflowContext.getJobStartTimes() to efficiently find tasks that started after a timestamp
  • Filter out tasks already tracked to avoid duplicates
  • Include short-lived tasks in metrics collection

3. Dynamic Timestamp Initialization

  • Replaced static _initializationTimestamp (set at object construction) with dynamic calculation using currentTime - frequencyMs on first execution
  • Ensures accurate timestamps when controller becomes leader for the first time or after leader changes
  • Each task type maintains its own timestamp for independent lifecycle tracking

Key Changes

Files Modified

  1. TaskMetricsEmitter.java

    • Added _previousInProgressTasks map to track tasks across cycles
    • Changed from global _previousExecutionTimestamp to per-task-type _previousExecutionTimestamps map
    • Added dynamic timestamp initialization using _taskMetricsEmitterFrequencyMs
    • Implemented logic to detect and include completed tasks
    • Implemented logic to detect and include short-lived tasks
    • Added cleanup for removed task types
  2. PinotHelixTaskResourceManager.java

    • Added getTasksStartedAfter(taskType, afterTimestampMs) method
    • Uses WorkflowContext.getJobStartTimes() for efficient batch retrieval
    • Returns tasks that started after the specified timestamp
  3. TaskMetricsEmitterTest.java

    • Added comprehensive test for previously in-progress tasks that completed between runs
    • Added comprehensive test for short-lived tasks that started and completed between runs
    • Updated all comments to use contextual naming instead of "gap" terminology
    • Updated variable names for clarity (taskCompletedBetweenRuns, taskShortLived)

Testing

  • ✅ All 11 tests pass (including 2 new tests for the gaps)
  • ✅ No checkstyle violations
  • ✅ Comprehensive test coverage for both scenarios

Benefits

  1. Complete Metrics: No metrics are missed for tasks that complete quickly or between cycles
  2. Accurate Error Reporting: Errors in fast-failing tasks are now captured
  3. Leader Change Resilience: Dynamic timestamp initialization handles controller leadership changes correctly
  4. Per-Task-Type Independence: Each task type maintains its own tracking, preventing cross-contamination

Backward Compatibility

✅ Fully backward compatible - existing functionality is preserved, only extended to cover the missing scenarios.

This commit fixes two scenarios where Minion subtask metrics were not being
reported accurately:

1. Previously in-progress tasks that completed between collection cycles:
   - Track tasks that were in-progress in the previous execution cycle
   - Include completed tasks in current cycle's metrics even if they're
     no longer in getTasksInProgress()
   - Ensures final metrics are captured for tasks that complete quickly

2. Short-lived tasks that started and completed between collection cycles:
   - Use getTasksStartedAfter() to detect tasks that started after the
     previous execution timestamp
   - Filter out tasks already tracked to avoid duplicates
   - Ensures metrics for very short-lived tasks are captured

Key changes:
- Added _previousInProgressTasks map to track tasks across cycles
- Added getTasksStartedAfter() method in PinotHelixTaskResourceManager
- Implemented per-task-type timestamp tracking for independent lifecycle
- Changed initialization timestamp to use dynamic calculation (currentTime - frequencyMs)
  instead of object construction time, handling leader changes correctly
- Updated test cases with contextual naming and comprehensive comments

All tests pass (11/11), including new tests for both scenarios.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes gaps in Minion task metrics collection by tracking tasks across collection cycles and detecting short-lived tasks. The implementation ensures no metrics are missed for tasks that complete quickly or between collection cycles, with particular focus on capturing error states.

Key Changes:

  • Added tracking of previously in-progress tasks to capture final metrics for tasks that completed between cycles
  • Implemented detection of short-lived tasks using job start timestamps from WorkflowContext
  • Changed from global to per-task-type timestamp tracking with dynamic initialization

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
TaskMetricsEmitter.java Added state tracking maps (_previousInProgressTasks, _previousExecutionTimestamps), dynamic timestamp initialization, and logic to include completed and short-lived tasks in metrics
PinotHelixTaskResourceManager.java Added getTasksStartedAfter() method using WorkflowContext.getJobStartTimes() for efficient batch retrieval of tasks by start time
TaskMetricsEmitterTest.java Added comprehensive tests for both previously in-progress tasks that completed and short-lived tasks scenarios

@codecov-commenter
Copy link

codecov-commenter commented Nov 3, 2025

Codecov Report

❌ Patch coverage is 56.36364% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.17%. Comparing base (84505f8) to head (dfc2ac9).

Files with missing lines Patch % Lines
...lix/core/minion/PinotHelixTaskResourceManager.java 0.00% 22 Missing ⚠️
...ntroller/helix/core/minion/TaskMetricsEmitter.java 93.93% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17128      +/-   ##
============================================
- Coverage     63.17%   63.17%   -0.01%     
- Complexity     1428     1429       +1     
============================================
  Files          3114     3114              
  Lines        183908   183945      +37     
  Branches      28167    28173       +6     
============================================
+ Hits         116187   116210      +23     
- Misses        58738    58760      +22     
+ Partials       8983     8975       -8     
Flag Coverage Δ
custom-integration1 ?
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.13% <56.36%> (-0.01%) ⬇️
java-21 63.14% <56.36%> (+<0.01%) ⬆️
temurin 63.17% <56.36%> (-0.01%) ⬇️
unittests 63.17% <56.36%> (-0.01%) ⬇️
unittests1 56.10% <ø> (-0.01%) ⬇️
unittests2 33.65% <56.36%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

@abhishekbafna abhishekbafna requested a review from KKcorps November 5, 2025 07:49
@KKcorps KKcorps merged commit 4c8ac19 into apache:master Nov 6, 2025
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants