Skip to content

make buildSegmentInternal raise error out explicitly#7896

Closed
walterddr wants to merge 3 commits intoapache:masterfrom
walterddr:fix_rtsegmanager_error
Closed

make buildSegmentInternal raise error out explicitly#7896
walterddr wants to merge 3 commits intoapache:masterfrom
walterddr:fix_rtsegmanager_error

Conversation

@walterddr
Copy link
Copy Markdown
Contributor

this fixes the issue when datamanager doesn't log segment error properly and shown to user. stacktrace basically only returns a null value and not useful for debugging.

this fixes the issue when datamanager doesn't log segment error properly and shown to user
@walterddr walterddr force-pushed the fix_rtsegmanager_error branch from af76bec to 5aebcb1 Compare December 13, 2021 18:42
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Dec 13, 2021

Codecov Report

Merging #7896 (1b38345) into master (fb12a50) will decrease coverage by 6.42%.
The diff coverage is 7.69%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #7896      +/-   ##
============================================
- Coverage     71.34%   64.91%   -6.43%     
  Complexity     4087     4087              
============================================
  Files          1587     1544      -43     
  Lines         82071    80222    -1849     
  Branches      12267    12062     -205     
============================================
- Hits          58553    52080    -6473     
- Misses        19550    24417    +4867     
+ Partials       3968     3725     -243     
Flag Coverage Δ
integration1 ?
integration2 ?
unittests1 68.30% <7.69%> (-0.03%) ⬇️
unittests2 14.38% <0.00%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...manager/realtime/LLRealtimeSegmentDataManager.java 47.81% <7.69%> (-24.95%) ⬇️
...a/org/apache/pinot/common/metrics/MinionMeter.java 0.00% <0.00%> (-100.00%) ⬇️
...g/apache/pinot/common/metrics/ControllerMeter.java 0.00% <0.00%> (-100.00%) ⬇️
.../apache/pinot/common/metrics/BrokerQueryPhase.java 0.00% <0.00%> (-100.00%) ⬇️
.../apache/pinot/common/metrics/MinionQueryPhase.java 0.00% <0.00%> (-100.00%) ⬇️
...he/pinot/common/messages/SegmentReloadMessage.java 0.00% <0.00%> (-100.00%) ⬇️
...pinot/core/data/manager/realtime/TimerService.java 0.00% <0.00%> (-100.00%) ⬇️
...pinot/minion/exception/TaskCancelledException.java 0.00% <0.00%> (-100.00%) ⬇️
...not/common/exception/HttpErrorStatusException.java 0.00% <0.00%> (-100.00%) ⬇️
...t/core/startree/plan/StarTreeDocIdSetPlanNode.java 0.00% <0.00%> (-100.00%) ⬇️
... and 371 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fb12a50...1b38345. Read the comment docs.

Copy link
Copy Markdown
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a bigger refactoring and make all methods propagating the error through the exception

}

protected boolean buildSegmentAndReplace() {
protected void buildSegmentAndReplace() throws Exception {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


@Override
protected boolean buildSegmentAndReplace() {
protected void buildSegmentAndReplace() throws Exception {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(code format) same


protected boolean buildSegmentAndReplace() {
protected void buildSegmentAndReplace() throws Exception {
SegmentBuildDescriptor descriptor = buildSegmentInternal(false);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(major) when segment build failed, it will return null descriptor instead of throwing exception. In order to make it work, we need to change buildSegmentInternal() to throw exception as well

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. buildSegmentInternal has already been changed to throw exception.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I missed that. When we throw the exception, we should not log the error. The catcher is responsible for logging the error, or we will log multiple errors for the same exception

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup. working on it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

success = buildSegmentAndReplace();
if (success) {
try {
buildSegmentAndReplace();
Copy link
Copy Markdown
Contributor

@sajjad-moradi sajjad-moradi Dec 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the only place buildSegmentAndReplace is being used. It's also being used two times in goOnlineFromConsuming method. If we throw exception in buildSegmentAndRepalce, then the segment goes to error state in external view upon receiving helix transition message consuming -> online.
The existing behaviour is even worse because segment is not built but external view will move to online state!!
I suggest we proceed with throwing exception in buildSegmentAndReplace, but we try-catch its usage in goOnlineFromConsuming and in the catch part, we use downloadSegmentAndReplace.
@mcvsubbu @Jackie-Jiang what do you guys think?

Copy link
Copy Markdown
Contributor Author

@walterddr walterddr Dec 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes I thought about exactly the same thing. the return null also make the caller alter the state --> if we do not try catch it properly the states wont change.

Ultimately I think we need to rethink when should _realtimeTableDataManager.addSegmentError occur, my opinion is that we should make a single state machine function for each functionality, in this case: the only places we catch and process exception should be: PartitionConsumer and goOnlineFromConsuming, in which both modifies the internal state. the rest of the internal methods should all bubble up the exception by catch and rethrowing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adjusted plz kindly take a look

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The state should be set to DISCARDED if buildSegmentAndReplace fails for whatever reason (It really should not, since another host was able to build). This will allow us to replace the segment by downloading, as a last effort to keep the service going.

What exception are you seeing? Can you paste a stack?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, this is not an addSegment error (unless that method in RealtimeTableDataManager has been misnamed). Secondly, this is an "error" that is totally recoverable, since some other host has already completed the segment, and it can be downloaded. So, no "error" should be raised in this condition.

If you can please specify the problem you are trying to solve, we can help.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I realized that the addSegmentError is a method that adds all segment errors. But my question still stands. When we detect a recoverable error, do we still want to add an error? I don't think so. I think we should call addSegmentError() API only when the error is unrecoverable.

.error("Failed to find file: {} under index directory: {}", V1Constants.MetadataKeys.METADATA_FILE_NAME,
indexDir);
return null;
throw new IllegalStateException(String.format("Failed to find file: %s under index directory: %s",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IllegalStateException doesn't get caught in goOnlineFromConsuming so the desired segment download step wouldn't happen in this case.

@mcvsubbu
Copy link
Copy Markdown
Contributor

@walterddr is there an issue created that has some details on what really happened? Or else, if you can describe what the problem was, that will help as well (maybe it was discussed in the chat channel, but it would help to have it in this PR). Thanks

success = buildSegmentAndReplace();
if (success) {
try {
buildSegmentAndReplace();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The state should be set to DISCARDED if buildSegmentAndReplace fails for whatever reason (It really should not, since another host was able to build). This will allow us to replace the segment by downloading, as a last effort to keep the service going.

What exception are you seeing? Can you paste a stack?

_realtimeTableDataManager.addSegmentError(_segmentNameStr,
new SegmentErrorInfo(System.currentTimeMillis(), "Could not build segment", null));
new SegmentErrorInfo(System.currentTimeMillis(), "Could not build segment!", e));
_state = State.ERROR;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comment above regarding state

@walterddr
Copy link
Copy Markdown
Contributor Author

@walterddr is there an issue created that has some details on what really happened? Or else, if you can describe what the problem was, that will help as well (maybe it was discussed in the chat channel, but it would help to have it in this PR). Thanks

this was actually a very great question @mcvsubbu .
so initially the PR was only to address the fact that 2 of the place where _realtimeTableDataManager.addSegmentError only logs a single string "Could not build segment", with no stack trace attached.
I later discover that this is because the several protected functions that returns "null" indicating the segment creation fail doesn't bubble up the stack trace therefore there's no stacktrace to report.

Thanks for calling this out I think we deviated from the original purpose of this PR. I think we are now addressing a related but different (and bigger) issue.

I can create an issue for what's being discussed here. But for the null-stacktrace issue we can simple move the reporting to where the actual exception was thrown. what do you guys think? (@sajjad-moradi @Jackie-Jiang )

@sajjad-moradi
Copy link
Copy Markdown
Contributor

I can create an issue for what's being discussed here. But for the null-stacktrace issue we can simple move the reporting to where the actual exception was thrown. what do you guys think? (@sajjad-moradi @Jackie-Jiang )

That's a good idea. It makes the changes in this PR small. We can create a separate PR for the other issue.

@walterddr
Copy link
Copy Markdown
Contributor Author

sounds good. i will go ahead and summarize and create the issue; meanwhile I created a simpler PR to address the null stacktrace here: #7909

@mcvsubbu
Copy link
Copy Markdown
Contributor

I still don't understand the problem that was faced in the installation (or, is this something you noticed in the source and want to fix)? The reason I ask is that there are many things going on when the segment finishes, and changing the behavior around that time needs to be done carefully.

Assuming this was a problem in some installation, can you elaborate the issue? (e.g. you noticed segment was not consuming, you looked at the debug api, but found null stack traces, you looked at the server logs and found the segment build had failed due to X -- put down the reason please). Thanks! And if you found a stack trace in the server that you would like to see in debug API, please add that as well. We need to know if the stack trace was while transitioning from consuming to online, or whether it was before that during segment completion protocol.

thanks for your patience.

@walterddr
Copy link
Copy Markdown
Contributor Author

closing this. please move to issue for discussion.

@walterddr walterddr closed this Dec 15, 2021
@walterddr walterddr deleted the fix_rtsegmanager_error branch December 6, 2023 16:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants