@suvodeep-pyne suvodeep-pyne commented Nov 18, 2025

This PR enhances the reload status tracking system to capture detailed failure information for individual segments during table reload operations. Building on the existing in-memory reload job status cache infrastructure, this change provides operators with actionable debugging information directly via the reload status API.

Key Changes

1. Enhanced Error Response Structure

  • ApiErrorResponse: New DTO in pinot-common encapsulating error information:
    • errorMsg: Exception message
    • stacktrace: Full stack trace
  • SegmentReloadFailureResponse: Captures segment name, server name, error details (via ApiErrorResponse), and failure timestamp
  • Shared between server (serialization) and controller (deserialization)
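
The two DTOs described above could look roughly like the following. This is a minimal sketch, not the classes from the PR: the `...Sketch` class names, constructors, and the `fromThrowable` helper are hypothetical, and only the field names (`errorMsg`, `stacktrace`, `segmentName`, `serverName`, `failedAtMs`) are taken from the description and the example response.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical stand-in for ApiErrorResponse: an error message plus a full stack trace.
final class ApiErrorResponseSketch {
  private final String errorMsg;
  private final String stacktrace;

  ApiErrorResponseSketch(String errorMsg, String stacktrace) {
    this.errorMsg = errorMsg;
    this.stacktrace = stacktrace;
  }

  // Assumed helper: renders the Throwable's stack trace into a String.
  static ApiErrorResponseSketch fromThrowable(Throwable t) {
    StringWriter buffer = new StringWriter();
    t.printStackTrace(new PrintWriter(buffer, true));
    return new ApiErrorResponseSketch(t.getMessage(), buffer.toString());
  }

  String getErrorMsg() {
    return errorMsg;
  }

  String getStacktrace() {
    return stacktrace;
  }
}

// Hypothetical stand-in for SegmentReloadFailureResponse: one failed segment on one server.
final class SegmentReloadFailureSketch {
  private final String segmentName;
  private final String serverName;
  private final ApiErrorResponseSketch error;
  private final long failedAtMs;

  SegmentReloadFailureSketch(String segmentName, String serverName,
      ApiErrorResponseSketch error, long failedAtMs) {
    this.segmentName = segmentName;
    this.serverName = serverName;
    this.error = error;
    this.failedAtMs = failedAtMs;
  }

  String getSegmentName() {
    return segmentName;
  }

  String getServerName() {
    return serverName;
  }

  ApiErrorResponseSketch getError() {
    return error;
  }

  long getFailedAtMs() {
    return failedAtMs;
  }
}
```

Nesting the error object keeps the message and stack trace together, so further error metadata could be added later without changing the shape of the failure entry.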

2. Server-Side Failure Detail Capture

  • ServerReloadJobStatusCache: Enhanced with recordFailure() method that:

    • Always increments failure count (exact counting)
    • Stores detailed failure information for the first N failures (configurable, default: 5)
    • Populates server name using instance ID for server context
    • Thread-safe with synchronized access to failure details list
  • ReloadJobStatus: Added _failedSegmentDetails list to track failed segment information

  • ServerReloadJobStatusCacheConfig: Added segmentFailureDetailsCount configuration (default: 5)

    • ZK config key: pinot.server.table.reload.status.cache.segment.failure.details.count
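
The "count everything, keep details for the first N" behaviour of recordFailure() can be sketched as below. This is an illustration under stated assumptions rather than the actual ServerReloadJobStatusCache: the class name, the ConcurrentHashMap backing store (no eviction or TTL), and the plain-string detail format are all hypothetical simplifications.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of bounded failure recording; not the real cache.
final class BoundedFailureCacheSketch {
  // Per-job state: an exact failure counter plus a capped list of details.
  private static final class JobStatus {
    final AtomicInteger failureCount = new AtomicInteger();
    final List<String> failureDetails = new ArrayList<>();
  }

  private final ConcurrentHashMap<String, JobStatus> jobs = new ConcurrentHashMap<>();
  private final String instanceId;
  private final int maxDetailsPerJob;

  BoundedFailureCacheSketch(String instanceId, int maxDetailsPerJob) {
    this.instanceId = instanceId;
    this.maxDetailsPerJob = maxDetailsPerJob;
  }

  void recordFailure(String reloadJobId, String segmentName, Throwable t) {
    JobStatus status = jobs.computeIfAbsent(reloadJobId, id -> new JobStatus());
    // The counter is always incremented, so the total stays exact.
    status.failureCount.incrementAndGet();
    // Details are stored only while the per-job list is under the limit.
    synchronized (status) {
      if (status.failureDetails.size() < maxDetailsPerJob) {
        status.failureDetails.add(segmentName + "@" + instanceId + ": " + t);
      }
    }
  }

  int failureCount(String reloadJobId) {
    JobStatus status = jobs.get(reloadJobId);
    return status == null ? 0 : status.failureCount.get();
  }

  // Returns a copy taken under the same lock that guards writes.
  List<String> failureDetails(String reloadJobId) {
    JobStatus status = jobs.get(reloadJobId);
    if (status == null) {
      return new ArrayList<>();
    }
    synchronized (status) {
      return new ArrayList<>(status.failureDetails);
    }
  }
}
```

With a limit of 5, recording 8 failures for one job leaves the counter at 8 while only the first 5 detail entries are retained, which is what keeps the per-job memory bounded.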

3. Integration Point

  • BaseTableDataManager (line 804): Changed from simple counter increment to full failure recording
    • Before: _reloadJobStatusCache.getOrCreate(reloadJobId).incrementAndGetFailureCount()
    • After: _reloadJobStatusCache.recordFailure(reloadJobId, segmentName, t)

4. API Response Enhancement

  • Server API: ServerReloadStatusResponse (formerly SegmentReloadStatusValue)

    • Moved to pinot-common for sharing between modules
    • Added sampleSegmentReloadFailures field with fluent setters
    • Returns failed segment details (with server name populated)
  • Controller API: PinotTableReloadStatusResponse

    • Added sampleSegmentReloadFailures field
    • Aggregates ALL failures from all servers (NO deduplication)
    • Preserves server context: same segment failures on different servers kept separately
    • Limited to 500 failures max to prevent huge responses

5. Controller Aggregation Logic

  • PinotTableReloadStatusReporter: Enhanced to:
    • Collect failed segment details from all server responses
    • NO deduplication: Preserves server-specific context for debugging
    • Apply 500-segment limit across all servers
    • Enables pattern detection (e.g., "Server A failing many segments due to OOM")
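
The aggregation rule (collect from every server, no deduplication, stop at the overall cap) can be sketched as follows. This is a simplified illustration, not the code in PinotTableReloadStatusReporter: the class and method names are hypothetical, and failures are reduced to plain strings keyed by server name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of controller-side aggregation of per-server failures.
final class ReloadFailureAggregatorSketch {
  // Cap taken from the PR description.
  static final int MAX_FAILURES = 500;

  // Input: server name -> failed segment names reported by that server.
  // Output: one "segment@server" entry per reported failure, capped at maxFailures.
  static List<String> aggregate(Map<String, List<String>> failuresByServer, int maxFailures) {
    List<String> aggregated = new ArrayList<>();
    for (Map.Entry<String, List<String>> entry : failuresByServer.entrySet()) {
      for (String segmentName : entry.getValue()) {
        if (aggregated.size() >= maxFailures) {
          return aggregated;
        }
        // No deduplication: the same segment on two servers yields two entries.
        aggregated.add(segmentName + "@" + entry.getKey());
      }
    }
    return aggregated;
  }
}
```

Because entries are never merged by segment name, a segment that fails on two servers appears twice, each time with its own server context.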

Design Rationale

Why NO Deduplication?

The same segment can fail on Server A (disk full) but succeed on Server B. Keeping every failure as a separate entry:

  • Enables root cause analysis (infrastructure vs. data corruption)
  • Preserves server context for targeted troubleshooting
  • Allows pattern detection across servers

Nested Error Response Structure

The ApiErrorResponse object encapsulates error details:

  • Cleaner API design with logical grouping
  • Easier to extend with additional error metadata in the future
  • Consistent with API design patterns

Memory Impact

  • Per segment failure: ~2KB (stack trace + metadata)
  • Per job (default 5 failures): ~10.4KB
  • Cache-wide (worst case 10,000 jobs): ~108MB
  • Percentage of heap: 0.34% - 0.67% (on 16-32GB heap)

Thread Safety

  • Cache layer handles all business logic and synchronization
  • Data classes remain simple POJOs
  • Synchronized access to failure details list per job status

Testing

  • Failure recording under/over limit
  • Concurrent failure recording (thread safety)
  • Config changes and cache rebuilds
  • Server name population
  • All tests passing ✅

Backward Compatibility

  • Servers without reload job ID continue working (null handling)
  • Old API clients ignore new fields (JSON serialization)
  • No breaking changes to existing functionality

Configuration

New configuration property (dynamic via ZooKeeper):

pinot.server.table.reload.status.cache.segment.failure.details.count = 5
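
As an illustration of how such a property might be resolved with its default, here is a small sketch. The class, the method, and the fallback to the default on missing, malformed, or negative values are assumptions for the example; only the key and the default of 5 come from this PR.

```java
import java.util.Map;

// Hypothetical sketch of resolving the new property from a key/value config map.
final class ReloadStatusCacheConfigSketch {
  static final String SEGMENT_FAILURE_DETAILS_COUNT_KEY =
      "pinot.server.table.reload.status.cache.segment.failure.details.count";
  static final int DEFAULT_SEGMENT_FAILURE_DETAILS_COUNT = 5;

  // Assumption: missing, malformed, or negative values fall back to the default.
  static int segmentFailureDetailsCount(Map<String, String> clusterConfigs) {
    String raw = clusterConfigs.get(SEGMENT_FAILURE_DETAILS_COUNT_KEY);
    if (raw == null) {
      return DEFAULT_SEGMENT_FAILURE_DETAILS_COUNT;
    }
    try {
      int value = Integer.parseInt(raw.trim());
      return value >= 0 ? value : DEFAULT_SEGMENT_FAILURE_DETAILS_COUNT;
    } catch (NumberFormatException e) {
      return DEFAULT_SEGMENT_FAILURE_DETAILS_COUNT;
    }
  }
}
```

A value of 0 is accepted here, which would keep exact failure counts while storing no per-segment details.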

Example API Response

{
  "totalSegmentCount": 300,
  "successCount": 285,
  "failureCount": 15,
  "sampleSegmentReloadFailures": [
    {
      "segmentName": "myTable__0__123__20240101T0000Z",
      "serverName": "Server_192.168.1.10_8098",
      "error": {
        "errorMsg": "IOException: Disk full",
        "stacktrace": "java.io.IOException: Disk full\n  at ..."
      },
      "failedAtMs": 1704067200000
    }
  ]
}

Related Work

  • Built on Phase 1 reload status cache infrastructure
  • Part of enhanced segment reload status tracking initiative
  • Addresses need for actionable debugging information during reload operations

Next Steps

Future enhancements may include:

  • Success/in-progress tracking with aggregate statistics (Phase 2)
  • Query parameters for detail level control
  • Filtering and pagination for large failure lists

Enhances the reload status cache to capture detailed failure information
for failed segments including segment name, error message, stack trace,
and failure timestamp.

Key changes:
- Add SegmentReloadStatus class to capture individual segment failure details
- Add recordFailure() method to ServerReloadJobStatusCache with bounded storage
- Add segment.failure.details.count config (default: 5) to limit stored details
- Always count all failures, but only store details for first N failures
- Update BaseTableDataManager to call recordFailure() instead of incrementAndGetFailureCount()
- Add comprehensive unit tests for failure recording and bounded storage

The implementation ensures thread-safe concurrent failure recording while
maintaining predictable memory bounds by limiting stored failure details.
…into `SegmentReloadFailureResponse`

Refactored how segment failure details are tracked across the reload job pipeline. Key changes:
- Removed `SegmentReloadStatus` in favor of the new `SegmentReloadFailureResponse` DTO.
- Enhanced failure tracking with server context, stack traces, and JSON serialization.
- Updated APIs and internal classes to use `SegmentReloadFailureResponse` consistently.
- Improved debugging by aggregating detailed failure data from all servers.
- Removed `serverName` parameter from `recordFailure` to simplify signature.
- Improved segment failure tracking by dynamically setting server context in `PinotTableReloadStatusReporter`.
- Updated relevant unit tests to align with the changes.
- Updated `ServerReloadJobStatusCache` to require an `instanceId` during initialization.
- Modified constructors and call sites to pass the instance-specific ID, improving context tracking.
- Enhanced logging to include the instance ID for better debugging.
- Updated unit tests to align with the new constructor and ensure compatibility.
- Renamed `SegmentReloadFailureResponse` to `SegmentReloadFailure` for clarity and brevity.
- Updated all references and method names across the codebase to use the new class name.
- Improved segment reload failure detail handling by streamlining data structures and ensuring consistency.
- Removed unnecessary exception throwing in `BaseTableDataManager`.
- Consolidated iteration for count aggregation and segment failure collection in `PinotTableReloadStatusReporter`.
- Changed `_successCount` type from `long` to `int` in `ServerReloadStatusResponse` to match usage and simplify handling.
- Updated references and method signatures accordingly across impacted files.
- Improved code readability and reduced redundant handling of server responses.
…onsistency

- Updated all occurrences, references, and methods to reflect the new name.
- Improved clarity in segment reload failure handling across the codebase.
- Annotated relevant classes with stability annotations to reflect intended usage.
- Added new `ApiErrorResponse` class to encapsulate error message and stack trace in a structured manner.
- Updated `SegmentReloadFailureResponse` to use `ApiErrorResponse` for error representation.
- Refactored `ServerReloadJobStatusCache` to set errors using `ApiErrorResponse`.
- Removed direct error message and stack trace fields from `SegmentReloadFailureResponse` for better modularity.
- Included missing ASF license header for compliance.
…larity

- Updated method and variable names in `PinotTableReloadStatusResponse` and `PinotTableReloadStatusReporter` to improve consistency.
- Streamlined naming to better align with the purpose and content of the field.

codecov-commenter commented Nov 18, 2025

Codecov Report

❌ Patch coverage is 44.79167% with 53 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.20%. Comparing base (3224ce6) to head (f649253).
⚠️ Report is 39 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...oller/services/PinotTableReloadStatusReporter.java | 0.00% | 16 Missing ⚠️ |
| ...on/response/server/ServerReloadStatusResponse.java | 0.00% | 13 Missing ⚠️ |
| ...ver/api/resources/ControllerJobStatusResource.java | 0.00% | 10 Missing ⚠️ |
| .../response/server/SegmentReloadFailureResponse.java | 76.92% | 3 Missing ⚠️ |
| ...roller/api/dto/PinotTableReloadStatusResponse.java | 0.00% | 3 Missing ⚠️ |
| .../local/utils/ServerReloadJobStatusCacheConfig.java | 40.00% | 3 Missing ⚠️ |
| ...pinot/common/response/server/ApiErrorResponse.java | 71.42% | 2 Missing ⚠️ |
| .../pinot/core/data/manager/BaseTableDataManager.java | 0.00% | 1 Missing ⚠️ |
| ...che/pinot/segment/local/utils/ReloadJobStatus.java | 83.33% | 1 Missing ⚠️ |
| .../pinot/server/starter/helix/BaseServerStarter.java | 0.00% | 1 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17234      +/-   ##
============================================
+ Coverage     63.17%   63.20%   +0.02%     
- Complexity     1428     1432       +4     
============================================
  Files          3121     3130       +9     
  Lines        184814   185793     +979     
  Branches      28332    28391      +59     
============================================
+ Hits         116760   117428     +668     
- Misses        59033    59310     +277     
- Partials       9021     9055      +34     

| Flag | Coverage Δ |
|---|---|
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-11 | 63.16% <44.79%> (+<0.01%) ⬆️ |
| java-21 | 63.17% <44.79%> (+0.02%) ⬆️ |
| temurin | 63.20% <44.79%> (+0.02%) ⬆️ |
| unittests | 63.20% <44.79%> (+0.02%) ⬆️ |
| unittests1 | 55.61% <7.57%> (-0.33%) ⬇️ |
| unittests2 | 33.86% <43.75%> (+0.16%) ⬆️ |

Copilot AI left a comment

Pull Request Overview

This PR enhances the reload status tracking system to capture detailed failure information for individual segments during table reload operations, building on the existing in-memory reload job status cache infrastructure.

Key Changes:

  • Introduced new DTOs (ApiErrorResponse, SegmentReloadFailureResponse, ServerReloadStatusResponse) to capture and communicate segment failure details
  • Enhanced server-side failure tracking with configurable limits and thread-safe recording
  • Modified controller aggregation logic to collect failure details from all servers without deduplication

Reviewed Changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| BaseServerStarter.java | Updated to pass instance ID when initializing ServerReloadJobStatusCache |
| ControllerJobStatusResource.java | Enhanced to return detailed segment failure information via the new response DTOs |
| ServerReloadJobStatusCacheTest.java | Added comprehensive tests for failure recording, concurrency, and configuration changes |
| ServerReloadJobStatusCacheConfig.java | Added segmentFailureDetailsCount configuration field with default value of 5 |
| ServerReloadJobStatusCache.java | Implemented recordFailure() method with thread-safe failure detail capture and limit enforcement |
| ReloadJobStatus.java | Added _failedSegmentDetails list to track individual segment failures |
| BenchmarkDimensionTableOverhead.java | Updated benchmark to pass instance ID to cache constructor |
| TableDataManagerProvider.java | Updated test helper to pass instance ID to cache constructor |
| BaseTableDataManager.java | Changed from simple counter increment to full failure recording with segment details |
| PinotTableReloadStatusReporter.java | Enhanced controller aggregation to collect and limit failed segment details from all servers |
| PinotTableReloadStatusResponse.java | Added segmentReloadFailures field to controller response DTO |
| ServerReloadStatusResponse.java | New DTO for server-side reload status responses with failure details |
| SegmentReloadFailureResponse.java | New DTO representing individual segment reload failures |
| ApiErrorResponse.java | New DTO encapsulating error message and stack trace information |

@xiangfu0 xiangfu0 left a comment

lgtm otherwise

- Ensures thread-safe access by wrapping `_failedSegmentDetails` with `Collections.unmodifiableList`.
Thread Safety Improvements:
- Make ReloadJobStatus thread-safe with synchronized methods
- Replace ArrayList with synchronized access pattern
- Return unmodifiable list from getFailedSegmentDetails()
- Simplify ControllerJobStatusResource by removing redundant defensive copies
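
The synchronized-accessor pattern listed above can be sketched like this. It is a hypothetical illustration, not the actual ReloadJobStatus: the class name and the string-typed details are simplifications, and copying the list before wrapping it is an extra assumption made here so that callers get a stable snapshot.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a status holder with synchronized access and a read-only getter.
final class ReloadJobStatusSketch {
  private final List<String> failedSegmentDetails = new ArrayList<>();

  // All mutations go through a synchronized method.
  synchronized void addFailedSegmentDetail(String detail) {
    failedSegmentDetails.add(detail);
  }

  // Callers receive an unmodifiable view of a copy taken under the lock,
  // so they can neither mutate nor observe later changes to the internal list.
  synchronized List<String> getFailedSegmentDetails() {
    return Collections.unmodifiableList(new ArrayList<>(failedSegmentDetails));
  }
}
```

Returning a read-only list is also what makes defensive copies at call sites unnecessary.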

Test Improvements:
- Remove 3 redundant tests (testConfigUpdateOverwritesPrevious, testCacheRebuildWithDifferentSize, testCacheRebuildWithDifferentTTL)
- Consolidate limit tests into parameterized test with @DataProvider
- Add 7 new critical tests: concurrent getOrCreate, unmodifiable list enforcement, null parameter validation, zero limit edge case
- Remove unnecessary helper method getFailedSegmentDetails()
- Simplify test assertions for better readability

Net result: 18 tests -> 22 tests with better coverage and quality
@xiangfu0 xiangfu0 merged commit b918196 into apache:master Nov 22, 2025
18 checks passed