-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Add segment-level failure details capture to reload status tracking #17234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Enhances the reload status cache to capture detailed failure information for failed segments including segment name, error message, stack trace, and failure timestamp. Key changes: - Add SegmentReloadStatus class to capture individual segment failure details - Add recordFailure() method to ServerReloadJobStatusCache with bounded storage - Add segment.failure.details.count config (default: 5) to limit stored details - Always count all failures, but only store details for first N failures - Update BaseTableDataManager to call recordFailure() instead of incrementAndGetFailureCount() - Add comprehensive unit tests for failure recording and bounded storage The implementation ensures thread-safe concurrent failure recording while maintaining predictable memory bounds by limiting stored failure details.
…into `SegmentReloadFailureResponse` Refactored how segment failure details are tracked across the reload job pipeline. Key changes: - Removed `SegmentReloadStatus` in favor of the new `SegmentReloadFailureResponse` DTO. - Enhanced failure tracking with server context, stack traces, and JSON serialization. - Updated APIs and internal classes to use `SegmentReloadFailureResponse` consistently. - Improved debugging by aggregating detailed failure data from all servers.
- Removed `serverName` parameter from `recordFailure` to simplify signature. - Improved segment failure tracking by dynamically setting server context in `PinotTableReloadStatusReporter`. - Updated relevant unit tests to align with the changes.
- Updated `ServerReloadJobStatusCache` to require an `instanceId` during initialization. - Modified constructors and call sites to pass the instance-specific ID, improving context tracking. - Enhanced logging to include the instance ID for better debugging. - Updated unit tests to align with the new constructor and ensure compatibility.
- Renamed `SegmentReloadFailureResponse` to `SegmentReloadFailure` for clarity and brevity. - Updated all references and method names across the codebase to use the new class name. - Improved segment reload failure detail handling by streamlining data structures and ensuring consistency. - Removed unnecessary exception throwing in `BaseTableDataManager`.
- Consolidated iteration for count aggregation and segment failure collection in `PinotTableReloadStatusReporter`. - Changed `_successCount` type from `long` to `int` in `ServerReloadStatusResponse` to match usage and simplify handling. - Updated references and method signatures accordingly across impacted files. - Improved code readability and reduced redundant handling of server responses.
…onsistency - Updated all occurrences, references, and methods to reflect the new name. - Improved clarity in segment reload failure handling across the codebase. - Annotated relevant classes with stability annotations to reflect intended usage.
- Added new `ApiErrorResponse` class to encapsulate error message and stack trace in a structured manner. - Updated `SegmentReloadFailureResponse` to use `ApiErrorResponse` for error representation. - Refactored `ServerReloadJobStatusCache` to set errors using `ApiErrorResponse`. - Removed direct error message and stack trace fields from `SegmentReloadFailureResponse` for better modularity.
- Included missing ASF license header for compliance.
…larity - Updated method and variable names in `PinotTableReloadStatusResponse` and `PinotTableReloadStatusReporter` to improve consistency. - Streamlined naming to better align with the purpose and content of the field.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #17234 +/- ##
============================================
+ Coverage 63.17% 63.20% +0.02%
- Complexity 1428 1432 +4
============================================
Files 3121 3130 +9
Lines 184814 185793 +979
Branches 28332 28391 +59
============================================
+ Hits 116760 117428 +668
- Misses 59033 59310 +277
- Partials 9021 9055 +34
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR enhances the reload status tracking system to capture detailed failure information for individual segments during table reload operations, building on the existing in-memory reload job status cache infrastructure.
Key Changes:
- Introduced new DTOs (
ApiErrorResponse,SegmentReloadFailureResponse,ServerReloadStatusResponse) to capture and communicate segment failure details - Enhanced server-side failure tracking with configurable limits and thread-safe recording
- Modified controller aggregation logic to collect failure details from all servers without deduplication
Reviewed Changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
BaseServerStarter.java |
Updated to pass instance ID when initializing ServerReloadJobStatusCache |
ControllerJobStatusResource.java |
Enhanced to return detailed segment failure information via the new response DTOs |
ServerReloadJobStatusCacheTest.java |
Added comprehensive tests for failure recording, concurrency, and configuration changes |
ServerReloadJobStatusCacheConfig.java |
Added segmentFailureDetailsCount configuration field with default value of 5 |
ServerReloadJobStatusCache.java |
Implemented recordFailure() method with thread-safe failure detail capture and limit enforcement |
ReloadJobStatus.java |
Added _failedSegmentDetails list to track individual segment failures |
BenchmarkDimensionTableOverhead.java |
Updated benchmark to pass instance ID to cache constructor |
TableDataManagerProvider.java |
Updated test helper to pass instance ID to cache constructor |
BaseTableDataManager.java |
Changed from simple counter increment to full failure recording with segment details |
PinotTableReloadStatusReporter.java |
Enhanced controller aggregation to collect and limit failed segment details from all servers |
PinotTableReloadStatusResponse.java |
Added segmentReloadFailures field to controller response DTO |
ServerReloadStatusResponse.java |
New DTO for server-side reload status responses with failure details |
SegmentReloadFailureResponse.java |
New DTO representing individual segment reload failures |
ApiErrorResponse.java |
New DTO encapsulating error message and stack trace information |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/ReloadJobStatus.java
Outdated
Show resolved
Hide resolved
xiangfu0
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm otherwise
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/ReloadJobStatus.java
Outdated
Show resolved
Hide resolved
- Ensures thread-safe access by wrapping `_failedSegmentDetails` with `Collections.unmodifiableList`.
This reverts commit 338d7cb.
Thread Safety Improvements: - Make ReloadJobStatus thread-safe with synchronized methods - Replace ArrayList with synchronized access pattern - Return unmodifiable list from getFailedSegmentDetails() - Simplify ControllerJobStatusResource by removing redundant defensive copies Test Improvements: - Remove 3 redundant tests (testConfigUpdateOverwritesPrevious, testCacheRebuildWithDifferentSize, testCacheRebuildWithDifferentTTL) - Consolidate limit tests into parameterized test with @dataProvider - Add 7 new critical tests: concurrent getOrCreate, unmodifiable list enforcement, null parameter validation, zero limit edge case - Remove unnecessary helper method getFailedSegmentDetails() - Simplify test assertions for better readability Net result: 18 tests -> 22 tests with better coverage and quality
This PR enhances the reload status tracking system to capture detailed failure information for individual segments during table reload operations. Building on the existing in-memory reload job status cache infrastructure, this change provides operators with actionable debugging information directly via the reload status API.
Key Changes
1. Enhanced Error Response Structure
ApiErrorResponse: New DTO inpinot-commonencapsulating error information:errorMsg: Exception messagestacktrace: Full stack traceSegmentReloadFailureResponse: Captures segment name, server name, error details (viaApiErrorResponse), and failure timestamp2. Server-Side Failure Detail Capture
ServerReloadJobStatusCache: Enhanced withrecordFailure()method that:ReloadJobStatus: Added_failedSegmentDetailslist to track failed segment informationServerReloadJobStatusCacheConfig: AddedsegmentFailureDetailsCountconfiguration (default: 5)pinot.server.table.reload.status.cache.segment.failure.details.count3. Integration Point
BaseTableDataManager(line 804): Changed from simple counter increment to full failure recording_reloadJobStatusCache.getOrCreate(reloadJobId).incrementAndGetFailureCount()_reloadJobStatusCache.recordFailure(reloadJobId, segmentName, t)4. API Response Enhancement
Server API:
ServerReloadStatusResponse(formerlySegmentReloadStatusValue)pinot-commonfor sharing between modulessampleSegmentReloadFailuresfield with fluent settersController API:
PinotTableReloadStatusResponsesampleSegmentReloadFailuresfield5. Controller Aggregation Logic
PinotTableReloadStatusReporter: Enhanced to:Design Rationale
Why NO Deduplication?
Same segment can fail on Server A (disk full) but succeed on Server B. Keeping all failures separately:
Nested Error Response Structure
The
ApiErrorResponseobject encapsulates error details:Memory Impact
Thread Safety
Testing
Backward Compatibility
Configuration
New configuration property (dynamic via ZooKeeper):
Example API Response
{ "totalSegmentCount": 300, "successCount": 285, "failureCount": 15, "sampleSegmentReloadFailures": [ { "segmentName": "myTable__0__123__20240101T0000Z", "serverName": "Server_192.168.1.10_8098", "error": { "errorMsg": "IOException: Disk full", "stacktrace": "java.io.IOException: Disk full\n at ..." }, "failedAtMs": 1704067200000 } ] }Related Work
Next Steps
Future enhancements may include: