Fix Kafka Indexing task pause forever if no events in taskDuration (#5656) #5899
* Fix NullPointerException in overlord if taskGroups does not contain the groupId
* If the endOffset is the same as startOffset, still let the task resume instead of returning endOffsets early, which causes the tasks to pause forever and ultimately fail on timeout
@@ -888,6 +888,10 @@ String generateSequenceName(
@VisibleForTesting
String generateSequenceName(int groupId)
{
if (taskGroups.get(groupId) == null) { |
It looks like this should never happen. Could you elaborate on when this can happen?
This happens if no events were passed to the Kafka tasks; this is the issue from #5666. checkTaskDuration()
removes the groupId with taskGroups.remove(groupId);
but checkPendingCompletionTasks() later
tries to look up the same groupId in sequenceTaskGroup.remove(generateSequenceName(groupId));
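A minimal sketch of the ordering described in this comment, not the supervisor's actual code: `taskGroups` is modeled as a plain map of strings, and the body of `generateSequenceName` is a hypothetical stand-in that simply dereferences the looked-up group without a null check.

```java
import java.util.HashMap;
import java.util.Map;

class SequenceNameNpeSketch
{
  // Hypothetical stand-in for the supervisor's taskGroups map (groupId -> group state).
  static final Map<Integer, String> taskGroups = new HashMap<>();

  // Simplified stand-in: dereferences the group without checking for null.
  static String generateSequenceName(int groupId)
  {
    String group = taskGroups.get(groupId);
    return "sequence_" + group.length(); // throws NullPointerException if the group is gone
  }

  // Replays the order from the comment: the group is removed first, then looked up again.
  static boolean lookupAfterRemovalThrows(int groupId)
  {
    taskGroups.put(groupId, "group-state");
    taskGroups.remove(groupId);          // what checkTaskDuration() is described as doing
    try {
      generateSequenceName(groupId);     // what checkPendingCompletionTasks() is described as doing
      return false;
    }
    catch (NullPointerException e) {
      return true;
    }
  }
}
```

In this simplified model, `SequenceNameNpeSketch.lookupAfterRemovalThrows(0)` returns `true`, because the lookup runs after the entry has already been removed.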
@jihoonson I tried again with the null check removed, and I cannot reproduce the NPE
now that the task resume fix is in, but I think this null check couldn't hurt.
@surekhasaharan thanks. A null check is always good, but I'm not sure about returning null in this method. If groupId
is never expected to be null, we should throw an exception. Otherwise, this method can return null, but all callers should check whether the returned sequenceName is null. What do you think?
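The two options raised here can be sketched side by side. This is only an illustration under assumptions: the map contents, the method names `generateSequenceNameStrict` and `generateSequenceNameOrNull`, and the way the name is derived are all hypothetical and not taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;

class SequenceNameContractSketch
{
  // Hypothetical stand-in for the supervisor's taskGroups map.
  static final Map<Integer, String> taskGroups = new HashMap<>();

  // Option 1: a missing group is treated as a bug, so fail fast with a descriptive exception.
  static String generateSequenceNameStrict(int groupId)
  {
    String group = taskGroups.get(groupId);
    if (group == null) {
      throw new IllegalStateException("No task group for groupId [" + groupId + "]");
    }
    return "sequence_" + Integer.toHexString(group.hashCode());
  }

  // Option 2: a missing group is allowed, so the method may return null
  // and every caller has to check the result before using it.
  static String generateSequenceNameOrNull(int groupId)
  {
    String group = taskGroups.get(groupId);
    return group == null ? null : "sequence_" + Integer.toHexString(group.hashCode());
  }
}
```

With the strict variant, a stale groupId surfaces immediately at the lookup site; with the nullable variant, the burden moves to each caller, which is the concern raised in the comment.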
I agree with @jihoonson -- let's fix this in a different patch, since it's really a different bug from the main one we're fixing. The main one being:
If the endOffset is same as startOffset, still let the task resume instead of returning
endOffsets early which causes the tasks to pause forever and ultimately fail on timeout
I raised a new issue for this NPE: #5900
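For readers following the main fix, here is a rough sketch of the behavior change as described in the PR summary. It is not the actual Druid code: `PausedTask`, `setEndOffsetsBeforeFix`, and `setEndOffsetsAfterFix` are hypothetical names, and offsets are modeled as a simple partition-to-offset map.

```java
import java.util.Collections;
import java.util.Map;

class ResumeOnEmptyRangeSketch
{
  // Hypothetical stand-in for a Kafka indexing task that the supervisor has paused.
  static class PausedTask
  {
    boolean paused = true;

    void resume()
    {
      paused = false;
    }
  }

  // Before the fix (as described): return early when nothing was read, leaving the task paused.
  static Map<Integer, Long> setEndOffsetsBeforeFix(PausedTask task, Map<Integer, Long> start, Map<Integer, Long> end)
  {
    if (end.equals(start)) {
      return end; // early return: resume() is never called
    }
    task.resume();
    return end;
  }

  // After the fix (as described): resume the task even when endOffset equals startOffset.
  static Map<Integer, Long> setEndOffsetsAfterFix(PausedTask task, Map<Integer, Long> start, Map<Integer, Long> end)
  {
    task.resume();
    return end;
  }

  static boolean staysPausedBeforeFix()
  {
    PausedTask task = new PausedTask();
    Map<Integer, Long> offsets = Collections.singletonMap(0, 100L);
    setEndOffsetsBeforeFix(task, offsets, offsets);
    return task.paused;
  }

  static boolean resumesAfterFix()
  {
    PausedTask task = new PausedTask();
    Map<Integer, Long> offsets = Collections.singletonMap(0, 100L);
    setEndOffsetsAfterFix(task, offsets, offsets);
    return !task.paused;
  }
}
```

In this model, a task that received no events stays paused under the old path and is resumed under the new one, which matches the "pause forever and ultimately fail on timeout" symptom in the PR title.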
@jihoonson Agreed that all the callers would need to handle null in this case. I will address the possible NPE in the separate issue raised by @gianm. Removing the null check here.
* Remove the null check and do not return null from generateSequenceName
@surekhasaharan The test failure seems legit:
All tests in
@surekhasaharan Ah okay cool - I restarted it and it passed.
…pache#5656) (apache#5899) * Fix Kafka Indexing task pause forever (apache#5656) * Fix Nullpointer Exception in overlord if taskGroups does not contain the groupId * If the endOffset is same as startOffset, still let the task resume instead of returning endOffsets early which causes the tasks to pause forever and ultimately fail on timeout * Address PR comment *Remove the null check and do not return null from generateSequenceName
…5656) (#5899) (#5971) * Fix Kafka Indexing task pause forever (#5656) * Fix Nullpointer Exception in overlord if taskGroups does not contain the groupId * If the endOffset is same as startOffset, still let the task resume instead of returning endOffsets early which causes the tasks to pause forever and ultimately fail on timeout * Address PR comment *Remove the null check and do not return null from generateSequenceName