Add segment-level failure details capture to reload status tracking #17234

suvodeep-pyne · 2025-11-18T21:56:03Z

This PR enhances the reload status tracking system to capture detailed failure information for individual segments during table reload operations. Building on the existing in-memory reload job status cache infrastructure, this change provides operators with actionable debugging information directly via the reload status API.

Key Changes

1. Enhanced Error Response Structure

ApiErrorResponse: New DTO in pinot-common encapsulating error information:
- errorMsg: Exception message
- stacktrace: Full stack trace
SegmentReloadFailureResponse: Captures segment name, server name, error details (via ApiErrorResponse), and failure timestamp
Shared between server (serialization) and controller (deserialization)
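
As a rough illustration of the nested structure described above, the sketch below models an error payload built from a `Throwable` and wrapped in a per-segment failure record. The class names `ApiErrorSketch` and `SegmentFailureSketch` and the factory `fromThrowable` are hypothetical stand-ins, not the actual classes added in this PR; only the field names are taken from the description.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical stand-in for the ApiErrorResponse DTO; field names follow the PR description.
final class ApiErrorSketch {
  final String errorMsg;
  final String stacktrace;

  ApiErrorSketch(String errorMsg, String stacktrace) {
    this.errorMsg = errorMsg;
    this.stacktrace = stacktrace;
  }

  // Assumed helper: captures the exception message and the full stack trace as text.
  static ApiErrorSketch fromThrowable(Throwable t) {
    StringWriter buffer = new StringWriter();
    t.printStackTrace(new PrintWriter(buffer, true));
    return new ApiErrorSketch(t.getMessage(), buffer.toString());
  }
}

// Hypothetical stand-in for SegmentReloadFailureResponse.
final class SegmentFailureSketch {
  final String segmentName;
  final String serverName;
  final ApiErrorSketch error;
  final long failedAtMs;

  SegmentFailureSketch(String segmentName, String serverName, ApiErrorSketch error, long failedAtMs) {
    this.segmentName = segmentName;
    this.serverName = serverName;
    this.error = error;
    this.failedAtMs = failedAtMs;
  }
}
```

Grouping the message and stack trace in one nested object means further error metadata could later be added without changing the outer failure record.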

2. Server-Side Failure Detail Capture

ServerReloadJobStatusCache: Enhanced with recordFailure() method that:
- Always increments failure count (exact counting)
- Stores detailed failure information for the first N failures (configurable, default: 5)
- Populates server name using instance ID for server context
- Thread-safe with synchronized access to failure details list
ReloadJobStatus: Added _failedSegmentDetails list to track failed segment information
ServerReloadJobStatusCacheConfig: Added segmentFailureDetailsCount configuration (default: 5)
- ZK config key: pinot.server.table.reload.status.cache.segment.failure.details.count
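
The counting and bounded-storage behaviour listed above can be sketched roughly as follows. This is a simplified, hypothetical model (`JobFailureLog` is not a Pinot class, and details are stored as plain strings rather than DTOs); it only shows the pattern of always incrementing the counter while keeping details for the first N failures under a lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical simplification of a per-job failure log; not Pinot's ReloadJobStatus.
final class JobFailureLog {
  private final AtomicInteger failureCount = new AtomicInteger();
  private final List<String> details = new ArrayList<>();
  private final int maxDetails;

  JobFailureLog(int maxDetails) {
    this.maxDetails = maxDetails;
  }

  void recordFailure(String segmentName, Throwable t) {
    // The count is always exact, regardless of how many details are stored.
    failureCount.incrementAndGet();
    synchronized (details) {
      // Only the first maxDetails failures keep their details, bounding memory per job.
      if (details.size() < maxDetails) {
        details.add(segmentName + ": " + t.getMessage());
      }
    }
  }

  int getFailureCount() {
    return failureCount.get();
  }

  // Returns an immutable snapshot so callers cannot mutate the shared list.
  List<String> getDetails() {
    synchronized (details) {
      return List.copyOf(details);
    }
  }
}
```

With a limit of 5 and 8 recorded failures, this sketch reports a failure count of 8 but returns only the first 5 detail entries.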

3. Integration Point

BaseTableDataManager (line 804): Changed from simple counter increment to full failure recording
- Before: _reloadJobStatusCache.getOrCreate(reloadJobId).incrementAndGetFailureCount()
- After: _reloadJobStatusCache.recordFailure(reloadJobId, segmentName, t)

4. API Response Enhancement

Server API: ServerReloadStatusResponse (formerly SegmentReloadStatusValue)
- Moved to pinot-common for sharing between modules
- Added sampleSegmentReloadFailures field with fluent setters
- Returns failed segment details (with server name populated)
Controller API: PinotTableReloadStatusResponse
- Added sampleSegmentReloadFailures field
- Aggregates ALL failures from all servers (NO deduplication)
- Preserves server context: same segment failures on different servers kept separately
- Limited to 500 failures max to prevent huge responses

5. Controller Aggregation Logic

PinotTableReloadStatusReporter: Enhanced to:
- Collect failed segment details from all server responses
- NO deduplication: Preserves server-specific context for debugging
- Apply 500-segment limit across all servers
- Enables pattern detection (e.g., "Server A failing many segments due to OOM")
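
A minimal sketch of this aggregation, assuming each server response has already been reduced to a list of failed segment names: entries from every server are appended as-is, with no deduplication, until a cap is reached. `FailureAggregationSketch` and `ServerSegmentFailure` are illustrative names, not the code in `PinotTableReloadStatusReporter`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical pair of server name and failed segment name.
final class ServerSegmentFailure {
  final String serverName;
  final String segmentName;

  ServerSegmentFailure(String serverName, String segmentName) {
    this.serverName = serverName;
    this.segmentName = segmentName;
  }
}

final class FailureAggregationSketch {
  // Cap taken from the PR description (500 failures across all servers).
  static final int MAX_FAILURES = 500;

  // Merges per-server failures without deduplication, stopping once the limit is reached.
  static List<ServerSegmentFailure> aggregate(Map<String, List<String>> failuresByServer, int limit) {
    List<ServerSegmentFailure> merged = new ArrayList<>();
    for (Map.Entry<String, List<String>> entry : failuresByServer.entrySet()) {
      for (String segment : entry.getValue()) {
        if (merged.size() >= limit) {
          return merged;
        }
        // The same segment failing on two servers yields two separate entries.
        merged.add(new ServerSegmentFailure(entry.getKey(), segment));
      }
    }
    return merged;
  }
}
```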

Design Rationale

Why NO Deduplication?

The same segment can fail on Server A (disk full) but succeed on Server B. Keeping each failure as a separate entry:

Enables root cause analysis (infrastructure vs. data corruption)
Preserves server context for targeted troubleshooting
Allows pattern detection across servers
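
One way an operator-side tool might use the non-deduplicated list for pattern detection is to count failures per server. The helper below is purely illustrative (`FailurePatternSketch` is a hypothetical name and the input is assumed to be `{serverName, segmentName}` pairs); it is not part of this PR.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FailurePatternSketch {
  // Each input row is assumed to be a {serverName, segmentName} pair.
  static Map<String, Integer> countByServer(List<String[]> failures) {
    Map<String, Integer> counts = new TreeMap<>();
    for (String[] failure : failures) {
      // Increment the per-server counter, starting at 1 for a new server.
      counts.merge(failure[0], 1, Integer::sum);
    }
    return counts;
  }
}
```

A server whose count is far above the others points toward an infrastructure problem on that host, while a segment failing on every server points toward the data itself.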

Nested Error Response Structure

The ApiErrorResponse object encapsulates error details:

Cleaner API design with logical grouping
Easier to extend with additional error metadata in the future
Consistent with API design patterns

Memory Impact

Per segment failure: ~2KB (stack trace + metadata)
Per job (default 5 failures): ~10.4KB
Cache-wide (worst case 10,000 jobs): ~108MB
Percentage of heap: ~0.67% on a 16GB heap, ~0.34% on a 32GB heap

Thread Safety

Cache layer handles all business logic and synchronization
Data classes remain simple POJOs
Synchronized access to failure details list per job status

Testing

Failure recording under/over limit
Concurrent failure recording (thread safety)
Config changes and cache rebuilds
Server name population
All tests passing ✅

Backward Compatibility

Servers without reload job ID continue working (null handling)
Old API clients ignore new fields (JSON serialization)
No breaking changes to existing functionality

Configuration

New configuration property (dynamic via ZooKeeper):

pinot.server.table.reload.status.cache.segment.failure.details.count = 5
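
For illustration, resolving this property from a generic key-value config map could look like the sketch below. The key and default come from this PR; the class `FailureDetailsConfigSketch`, the `Map`-based source, and the choice to fall back to the default on unparsable or negative values are assumptions, not the behaviour of `ServerReloadJobStatusCacheConfig`.

```java
import java.util.Map;

final class FailureDetailsConfigSketch {
  static final String KEY = "pinot.server.table.reload.status.cache.segment.failure.details.count";
  static final int DEFAULT_COUNT = 5;

  // Returns the configured detail count, or the default when the key is absent or invalid.
  static int resolve(Map<String, String> clusterConfig) {
    String raw = clusterConfig.get(KEY);
    if (raw == null) {
      return DEFAULT_COUNT;
    }
    try {
      int parsed = Integer.parseInt(raw.trim());
      // Assumption: zero is allowed (store no details); negative values fall back to the default.
      return parsed >= 0 ? parsed : DEFAULT_COUNT;
    } catch (NumberFormatException e) {
      return DEFAULT_COUNT;
    }
  }
}
```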

Example API Response

{
  "totalSegmentCount": 300,
  "successCount": 285,
  "failureCount": 15,
  "sampleSegmentReloadFailures": [
    {
      "segmentName": "myTable__0__123__20240101T0000Z",
      "serverName": "Server_192.168.1.10_8098",
      "error": {
        "errorMsg": "IOException: Disk full",
        "stacktrace": "java.io.IOException: Disk full\n  at ..."
      },
      "failedAtMs": 1704067200000
    }
  ]
}

Related Work

Built on Phase 1 reload status cache infrastructure
Part of enhanced segment reload status tracking initiative
Addresses need for actionable debugging information during reload operations

Next Steps

Future enhancements may include:

Success/in-progress tracking with aggregate statistics (Phase 2)
Query parameters for detail level control
Filtering and pagination for large failure lists

Enhances the reload status cache to capture detailed failure information for failed segments including segment name, error message, stack trace, and failure timestamp.

Key changes:
- Add SegmentReloadStatus class to capture individual segment failure details
- Add recordFailure() method to ServerReloadJobStatusCache with bounded storage
- Add segment.failure.details.count config (default: 5) to limit stored details
- Always count all failures, but only store details for first N failures
- Update BaseTableDataManager to call recordFailure() instead of incrementAndGetFailureCount()
- Add comprehensive unit tests for failure recording and bounded storage

The implementation ensures thread-safe concurrent failure recording while maintaining predictable memory bounds by limiting stored failure details.

…into `SegmentReloadFailureResponse`

Refactored how segment failure details are tracked across the reload job pipeline.

Key changes:
- Removed `SegmentReloadStatus` in favor of the new `SegmentReloadFailureResponse` DTO.
- Enhanced failure tracking with server context, stack traces, and JSON serialization.
- Updated APIs and internal classes to use `SegmentReloadFailureResponse` consistently.
- Improved debugging by aggregating detailed failure data from all servers.

- Removed `serverName` parameter from `recordFailure` to simplify signature.
- Improved segment failure tracking by dynamically setting server context in `PinotTableReloadStatusReporter`.
- Updated relevant unit tests to align with the changes.

- Updated `ServerReloadJobStatusCache` to require an `instanceId` during initialization.
- Modified constructors and call sites to pass the instance-specific ID, improving context tracking.
- Enhanced logging to include the instance ID for better debugging.
- Updated unit tests to align with the new constructor and ensure compatibility.

- Renamed `SegmentReloadFailureResponse` to `SegmentReloadFailure` for clarity and brevity.
- Updated all references and method names across the codebase to use the new class name.
- Improved segment reload failure detail handling by streamlining data structures and ensuring consistency.
- Removed unnecessary exception throwing in `BaseTableDataManager`.

- Consolidated iteration for count aggregation and segment failure collection in `PinotTableReloadStatusReporter`.
- Changed `_successCount` type from `long` to `int` in `ServerReloadStatusResponse` to match usage and simplify handling.
- Updated references and method signatures accordingly across impacted files.
- Improved code readability and reduced redundant handling of server responses.

…onsistency

- Updated all occurrences, references, and methods to reflect the new name.
- Improved clarity in segment reload failure handling across the codebase.
- Annotated relevant classes with stability annotations to reflect intended usage.

- Added new `ApiErrorResponse` class to encapsulate error message and stack trace in a structured manner.
- Updated `SegmentReloadFailureResponse` to use `ApiErrorResponse` for error representation.
- Refactored `ServerReloadJobStatusCache` to set errors using `ApiErrorResponse`.
- Removed direct error message and stack trace fields from `SegmentReloadFailureResponse` for better modularity.

- Included missing ASF license header for compliance.

…larity

- Updated method and variable names in `PinotTableReloadStatusResponse` and `PinotTableReloadStatusReporter` to improve consistency.
- Streamlined naming to better align with the purpose and content of the field.

codecov-commenter · 2025-11-18T23:14:21Z

Codecov Report

❌ Patch coverage is 44.79167% with 53 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.20%. Comparing base (3224ce6) to head (f649253).
⚠️ Report is 39 commits behind head on master.

Files with missing lines	Patch %	Lines
...oller/services/PinotTableReloadStatusReporter.java	0.00%	16 Missing ⚠️
...on/response/server/ServerReloadStatusResponse.java	0.00%	13 Missing ⚠️
...ver/api/resources/ControllerJobStatusResource.java	0.00%	10 Missing ⚠️
.../response/server/SegmentReloadFailureResponse.java	76.92%	3 Missing ⚠️
...roller/api/dto/PinotTableReloadStatusResponse.java	0.00%	3 Missing ⚠️
.../local/utils/ServerReloadJobStatusCacheConfig.java	40.00%	3 Missing ⚠️
...pinot/common/response/server/ApiErrorResponse.java	71.42%	2 Missing ⚠️
.../pinot/core/data/manager/BaseTableDataManager.java	0.00%	1 Missing ⚠️
...che/pinot/segment/local/utils/ReloadJobStatus.java	83.33%	1 Missing ⚠️
.../pinot/server/starter/helix/BaseServerStarter.java	0.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #17234      +/-   ##
============================================
+ Coverage     63.17%   63.20%   +0.02%     
- Complexity     1428     1432       +4     
============================================
  Files          3121     3130       +9     
  Lines        184814   185793     +979     
  Branches      28332    28391      +59     
============================================
+ Hits         116760   117428     +668     
- Misses        59033    59310     +277     
- Partials       9021     9055      +34

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (ø)`
integration	`100.00% <ø> (ø)`
integration1	`100.00% <ø> (ø)`
integration2	`0.00% <ø> (ø)`
java-11	`63.16% <44.79%> (+<0.01%)`	⬆️
java-21	`63.17% <44.79%> (+0.02%)`	⬆️
temurin	`63.20% <44.79%> (+0.02%)`	⬆️
unittests	`63.20% <44.79%> (+0.02%)`	⬆️
unittests1	`55.61% <7.57%> (-0.33%)`	⬇️
unittests2	`33.86% <43.75%> (+0.16%)`	⬆️


Copilot

Pull Request Overview

This PR enhances the reload status tracking system to capture detailed failure information for individual segments during table reload operations, building on the existing in-memory reload job status cache infrastructure.

Key Changes:

Introduced new DTOs (ApiErrorResponse, SegmentReloadFailureResponse, ServerReloadStatusResponse) to capture and communicate segment failure details
Enhanced server-side failure tracking with configurable limits and thread-safe recording
Modified controller aggregation logic to collect failure details from all servers without deduplication

Reviewed Changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

File	Description
`BaseServerStarter.java`	Updated to pass instance ID when initializing `ServerReloadJobStatusCache`
`ControllerJobStatusResource.java`	Enhanced to return detailed segment failure information via the new response DTOs
`ServerReloadJobStatusCacheTest.java`	Added comprehensive tests for failure recording, concurrency, and configuration changes
`ServerReloadJobStatusCacheConfig.java`	Added `segmentFailureDetailsCount` configuration field with default value of 5
`ServerReloadJobStatusCache.java`	Implemented `recordFailure()` method with thread-safe failure detail capture and limit enforcement
`ReloadJobStatus.java`	Added `_failedSegmentDetails` list to track individual segment failures
`BenchmarkDimensionTableOverhead.java`	Updated benchmark to pass instance ID to cache constructor
`TableDataManagerProvider.java`	Updated test helper to pass instance ID to cache constructor
`BaseTableDataManager.java`	Changed from simple counter increment to full failure recording with segment details
`PinotTableReloadStatusReporter.java`	Enhanced controller aggregation to collect and limit failed segment details from all servers
`PinotTableReloadStatusResponse.java`	Added `segmentReloadFailures` field to controller response DTO
`ServerReloadStatusResponse.java`	New DTO for server-side reload status responses with failure details
`SegmentReloadFailureResponse.java`	New DTO representing individual segment reload failures
`ApiErrorResponse.java`	New DTO encapsulating error message and stack trace information

pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/ReloadJobStatus.java

xiangfu0

lgtm otherwise

Thread Safety Improvements:
- Make ReloadJobStatus thread-safe with synchronized methods
- Replace ArrayList with synchronized access pattern
- Return unmodifiable list from getFailedSegmentDetails()
- Simplify ControllerJobStatusResource by removing redundant defensive copies

Test Improvements:
- Remove 3 redundant tests (testConfigUpdateOverwritesPrevious, testCacheRebuildWithDifferentSize, testCacheRebuildWithDifferentTTL)
- Consolidate limit tests into parameterized test with @DataProvider
- Add 7 new critical tests: concurrent getOrCreate, unmodifiable list enforcement, null parameter validation, zero limit edge case
- Remove unnecessary helper method getFailedSegmentDetails()
- Simplify test assertions for better readability

Net result: 18 tests -> 22 tests with better coverage and quality

suvodeep-pyne added 10 commits November 13, 2025 15:30

Refactor segment reload failure handling

4113c2e

- Removed `serverName` parameter from `recordFailure` to simplify signature.
- Improved segment failure tracking by dynamically setting server context in `PinotTableReloadStatusReporter`.
- Updated relevant unit tests to align with the changes.

Add Apache License header to ApiErrorResponse file

e279014

- Included missing ASF license header for compliance.

xiangfu0 requested review from Jackie-Jiang, Copilot and xiangfu0 November 20, 2025 21:59

Copilot AI reviewed Nov 20, 2025

xiangfu0 added enhancement rest-api labels Nov 20, 2025

xiangfu0 approved these changes Nov 21, 2025

suvodeep-pyne added 5 commits November 21, 2025 09:18

Make getFailedSegmentDetails return an unmodifiable list

338d7cb

- Ensures thread-safe access by wrapping `_failedSegmentDetails` with `Collections.unmodifiableList`.

Revert "Make getFailedSegmentDetails return an unmodifiable list"

aef3a82

This reverts commit 338d7cb.

Remove unused imports in ServerReloadJobStatusCache and its test

58da582

Fixed linter errors

f649253

xiangfu0 merged commit b918196 into apache:master Nov 22, 2025
18 checks passed
