Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Kafka Indexing task pause forever if no events in taskDuration (#5656) #5899

Merged
merged 3 commits into from
Jun 26, 2018

Conversation

surekhasaharan
Copy link

@surekhasaharan surekhasaharan commented Jun 22, 2018

If the endOffset is same as startOffset, still let the task resume instead of returning
endOffsets early which causes the tasks to pause forever and ultimately fail on timeout

* Fix Nullpointer Exception in overlord if taskGroups does not contain the groupId
* If the endOffset is same as startOffset, still let the task resume instead of returning
   endOffsets early which causes the tasks to pause forever and ultimately fail on timeout
@@ -888,6 +888,10 @@ String generateSequenceName(
@VisibleForTesting
String generateSequenceName(int groupId)
{
if (taskGroups.get(groupId) == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks that this should never happen. Would you elaborate more on when this can happen?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens if no events were passed to kafka tasks, this is the issue from #5666. So checkTaskDuration() removes the groupId taskGroups.remove(groupId); but later checkPendingCompletionTasks() tries to get the groupId in sequenceTaskGroup.remove(generateSequenceName(groupId));

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jihoonson I tried again, removing the null check, and I cannot reproduce the NPE now with the task resume fix in, but I think, this null check couldn't hurt.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@surekhasaharan thanks. Null check is always good, but I'm not sure about returning null in this method. If groupId is never expected to be null, we should throw an exception. Otherwise, this method can return null, but all callers should check the returned sequenceName is null or not. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @jihoonson -- let's fix this in a different patch, since it's really a different bug from the main one we're fixing. The main one being:

If the endOffset is same as startOffset, still let the task resume instead of returning
endOffsets early which causes the tasks to pause forever and ultimately fail on timeout

I raised a new issue for this NPE: #5900

Copy link
Author

@surekhasaharan surekhasaharan Jun 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jihoonson Agree that all the callers will need to handle null in this case. Will address the possible NPE in separate issue raised by @gianm . Removing the null check here.

Surekha Saharan added 2 commits June 22, 2018 22:54
*Remove the null check and do not return null from generateSequenceName
@gianm
Copy link
Contributor

gianm commented Jun 25, 2018

@surekhasaharan The test failure seems legit:

testIncrementalHandOff[isIncrementalHandoffSupported = true](io.druid.indexing.kafka.KafkaIndexTaskTest)  Time elapsed: 2.147 sec  <<< FAILURE!
java.lang.AssertionError: expected:<SUCCESS> but was:<FAILED>
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:743)
	at org.junit.Assert.assertEquals(Assert.java:118)
	at org.junit.Assert.assertEquals(Assert.java:144)
	at io.druid.indexing.kafka.KafkaIndexTaskTest.testIncrementalHandOff(KafkaIndexTaskTest.java:517)

@surekhasaharan
Copy link
Author

All tests in KafkaIndexTaskTest seem to pass locally.

@gianm
Copy link
Contributor

gianm commented Jun 26, 2018

@surekhasaharan Ah okay cool - I restarted it and it passed.

@gianm gianm merged commit 0f42929 into apache:master Jun 26, 2018
@surekhasaharan surekhasaharan deleted the issue-5656 branch June 27, 2018 16:56
@jihoonson jihoonson added this to the 0.12.2 milestone Jul 5, 2018
jihoonson pushed a commit to implydata/druid-public that referenced this pull request Jul 5, 2018
…pache#5656) (apache#5899)

* Fix Kafka Indexing task pause forever (apache#5656)

* Fix Nullpointer Exception in overlord if taskGroups does not contain the groupId
* If the endOffset is same as startOffset, still let the task resume instead of returning
   endOffsets early which causes the tasks to pause forever and ultimately fail on timeout

* Address PR comment

*Remove the null check and do not return null from generateSequenceName
jihoonson pushed a commit to jihoonson/druid that referenced this pull request Jul 7, 2018
…pache#5656) (apache#5899)

* Fix Kafka Indexing task pause forever (apache#5656)

* Fix Nullpointer Exception in overlord if taskGroups does not contain the groupId
* If the endOffset is same as startOffset, still let the task resume instead of returning
   endOffsets early which causes the tasks to pause forever and ultimately fail on timeout

* Address PR comment

*Remove the null check and do not return null from generateSequenceName
gianm pushed a commit that referenced this pull request Jul 9, 2018
…5656) (#5899) (#5971)

* Fix Kafka Indexing task pause forever (#5656)

* Fix Nullpointer Exception in overlord if taskGroups does not contain the groupId
* If the endOffset is same as startOffset, still let the task resume instead of returning
   endOffsets early which causes the tasks to pause forever and ultimately fail on timeout

* Address PR comment

*Remove the null check and do not return null from generateSequenceName
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants