make buildSegmentInternal raise error out explicitly#7896
make buildSegmentInternal raise error out explicitly#7896walterddr wants to merge 3 commits intoapache:masterfrom
Conversation
this fixes the issue when datamanager doesn't log segment error properly and shown to user
af76bec to
5aebcb1
Compare
Codecov Report
@@ Coverage Diff @@
## master #7896 +/- ##
============================================
- Coverage 71.34% 64.91% -6.43%
Complexity 4087 4087
============================================
Files 1587 1544 -43
Lines 82071 80222 -1849
Branches 12267 12062 -205
============================================
- Hits 58553 52080 -6473
- Misses 19550 24417 +4867
+ Partials 3968 3725 -243
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
Jackie-Jiang
left a comment
There was a problem hiding this comment.
We need a bigger refactoring and make all methods propagating the error through the exception
| } | ||
|
|
||
| protected boolean buildSegmentAndReplace() { | ||
| protected void buildSegmentAndReplace() throws Exception { |
There was a problem hiding this comment.
(code format) Please setup the code format following: https://docs.pinot.apache.org/developers/developers-and-contributors/code-setup#setup-ide
|
|
||
| @Override | ||
| protected boolean buildSegmentAndReplace() { | ||
| protected void buildSegmentAndReplace() throws Exception { |
|
|
||
| protected boolean buildSegmentAndReplace() { | ||
| protected void buildSegmentAndReplace() throws Exception { | ||
| SegmentBuildDescriptor descriptor = buildSegmentInternal(false); |
There was a problem hiding this comment.
(major) when segment build failed, it will return null descriptor instead of throwing exception. In order to make it work, we need to change buildSegmentInternal() to throw exception as well
There was a problem hiding this comment.
yes. buildSegmentInternal has already been changed to throw exception.
There was a problem hiding this comment.
Sorry I missed that. When we throw the exception, we should not log the error. The catcher is responsible for logging the error, or we will log multiple errors for the same exception
There was a problem hiding this comment.
yup. working on it.
| success = buildSegmentAndReplace(); | ||
| if (success) { | ||
| try { | ||
| buildSegmentAndReplace(); |
There was a problem hiding this comment.
This is not the only place buildSegmentAndReplace is being used. It's also being used two times in goOnlineFromConsuming method. If we throw exception in buildSegmentAndRepalce, then the segment goes to error state in external view upon receiving helix transition message consuming -> online.
The existing behaviour is even worse because segment is not built but external view will move to online state!!
I suggest we proceed with throwing exception in buildSegmentAndReplace, but we try-catch its usage in goOnlineFromConsuming and in the catch part, we use downloadSegmentAndReplace.
@mcvsubbu @Jackie-Jiang what do you guys think?
There was a problem hiding this comment.
yes I thought about exactly the same thing. the return null also make the caller alter the state --> if we do not try catch it properly the states wont change.
Ultimately I think we need to rethink when should _realtimeTableDataManager.addSegmentError occur, my opinion is that we should make a single state machine function for each functionality, in this case: the only places we catch and process exception should be: PartitionConsumer and goOnlineFromConsuming, in which both modifies the internal state. the rest of the internal methods should all bubble up the exception by catch and rethrowing.
There was a problem hiding this comment.
adjusted plz kindly take a look
There was a problem hiding this comment.
The state should be set to DISCARDED if buildSegmentAndReplace fails for whatever reason (It really should not, since another host was able to build). This will allow us to replace the segment by downloading, as a last effort to keep the service going.
What exception are you seeing? Can you paste a stack?
There was a problem hiding this comment.
First of all, this is not an addSegment error (unless that method in RealtimeTableDataManager has been misnamed). Secondly, this is an "error" that is totally recoverable, since some other host has already completed the segment, and it can be downloaded. So, no "error" should be raised in this condition.
If you can please specify the problem you are trying to solve, we can help.
There was a problem hiding this comment.
OK, I realized that the addSegmentError is a method that adds all segment errors. But my question still stands. When we detect a recoverable error, do we still want to add an error? I don't think so. I think we should call addSegmentError() API only when the error is unrecoverable.
| .error("Failed to find file: {} under index directory: {}", V1Constants.MetadataKeys.METADATA_FILE_NAME, | ||
| indexDir); | ||
| return null; | ||
| throw new IllegalStateException(String.format("Failed to find file: %s under index directory: %s", |
There was a problem hiding this comment.
IllegalStateException doesn't get caught in goOnlineFromConsuming so the desired segment download step wouldn't happen in this case.
|
@walterddr is there an issue created that has some details on what really happened? Or else, if you can describe what the problem was, that will help as well (maybe it was discussed in the chat channel, but it would help to have it in this PR). Thanks |
| success = buildSegmentAndReplace(); | ||
| if (success) { | ||
| try { | ||
| buildSegmentAndReplace(); |
There was a problem hiding this comment.
The state should be set to DISCARDED if buildSegmentAndReplace fails for whatever reason (It really should not, since another host was able to build). This will allow us to replace the segment by downloading, as a last effort to keep the service going.
What exception are you seeing? Can you paste a stack?
| _realtimeTableDataManager.addSegmentError(_segmentNameStr, | ||
| new SegmentErrorInfo(System.currentTimeMillis(), "Could not build segment", null)); | ||
| new SegmentErrorInfo(System.currentTimeMillis(), "Could not build segment!", e)); | ||
| _state = State.ERROR; |
There was a problem hiding this comment.
see comment above regarding state
this was actually a very great question @mcvsubbu . Thanks for calling this out I think we deviated from the original purpose of this PR. I think we are now addressing a related but different (and bigger) issue. I can create an issue for what's being discussed here. But for the null-stacktrace issue we can simple move the reporting to where the actual exception was thrown. what do you guys think? (@sajjad-moradi @Jackie-Jiang ) |
That's a good idea. It makes the changes in this PR small. We can create a separate PR for the other issue. |
|
sounds good. i will go ahead and summarize and create the issue; meanwhile I created a simpler PR to address the null stacktrace here: #7909 |
|
I still don't understand the problem that was faced in the installation (or, is this something you noticed in the source and want to fix)? The reason I ask is that there are many things going on when the segment finishes, and changing the behavior around that time needs to be done carefully. Assuming this was a problem in some installation, can you elaborate the issue? (e.g. you noticed segment was not consuming, you looked at the debug api, but found null stack traces, you looked at the server logs and found the segment build had failed due to X -- put down the reason please). Thanks! And if you found a stack trace in the server that you would like to see in debug API, please add that as well. We need to know if the stack trace was while transitioning from consuming to online, or whether it was before that during segment completion protocol. thanks for your patience. |
|
closing this. please move to issue for discussion. |
this fixes the issue when datamanager doesn't log segment error properly and shown to user. stacktrace basically only returns a null value and not useful for debugging.