Introduce segment limit for indexing tasks #7514

Closed

Conversation

justinborromeo (Contributor) commented Apr 19, 2019

[WIP]

Closes #7238; see that issue for the full proposal.

  }
} else {
  boolean rowInNewSegment = true;
  for (SegmentIdWithShardSpec segment : appenderator.getUnpublishedSegments()) {
justinborromeo (Contributor, Author) commented:

This can be done in O(log n) instead of O(n) if the list of segments is sorted by time. If a reviewer thinks this approach for enforcing max total segments is reasonable, I can optimize this part (it'll take a bit more work).
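A minimal sketch of the O(log n) lookup, assuming each segment exposes its time interval and that segment intervals don't overlap; the helper class and names below are illustrative, not the PR's code:

import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import org.joda.time.DateTime;
import org.joda.time.Interval;

// Hypothetical helper, not from the PR: index the unpublished segments'
// intervals by start time once, then answer "does this row need a new
// segment?" with a single floor lookup instead of a linear scan.
class SegmentIntervalIndex
{
  private final TreeMap<DateTime, Interval> byStart = new TreeMap<>();

  SegmentIntervalIndex(List<Interval> unpublishedSegmentIntervals)
  {
    for (Interval interval : unpublishedSegmentIntervals) {
      byStart.put(interval.getStart(), interval);
    }
  }

  boolean rowWouldCreateNewSegment(DateTime rowTimestamp)
  {
    // The only interval that can contain the row is the one with the
    // greatest start time <= rowTimestamp (intervals are disjoint).
    Map.Entry<DateTime, Interval> candidate = byStart.floorEntry(rowTimestamp);
    return candidate == null || !candidate.getValue().contains(rowTimestamp);
  }
}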

@@ -313,6 +316,23 @@ public TaskStatus run(final TaskToolbox toolbox)
  log.debug("Discarded null row, considering thrownAway.");
  rowIngestionMeters.incrementThrownAway();
} else {
  // Check if the upcoming row will result in the creation of a new segment. If so and the new number of segments …
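For context, a stripped-down sketch of the pre-check shape under discussion, assuming a hypothetical maxTotalSegments limit and reusing the linear scan from the excerpt above (illustrative, not the actual diff):

// Illustrative approximation of the pre-check, not the PR's exact diff.
boolean rowInNewSegment = true;
for (SegmentIdWithShardSpec segment : appenderator.getUnpublishedSegments()) {
  if (segment.getInterval().contains(inputRow.getTimestamp())) {
    rowInNewSegment = false;  // the row lands in an already-open segment
    break;
  }
}
if (rowInNewSegment && appenderator.getUnpublishedSegments().size() + 1 > maxTotalSegments) {
  // Hypothetical handling: push and clear the open segments before adding
  // the row, so the task's total segment count stays within the limit.
  driver.pushAllAndClear(pushTimeout);
}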
A reviewer (Contributor) commented:

I don't love the added complexity and the code duplication (this similar block is repeated 4 times) introduced by trying to pre-determine whether the row will cause a new segment to be generated before it's sent to the appenderator. Also, AppenderatorDriverAddResult.isPushRequired() feels like the right place for this decision to be made, and not in every indexing task type that uses an appenderator.

I understand that after Stream/BatchAppenderatorDriver.add() is called, if you determine that you're now over maxTotalSegments, you wind up pushing a segment that has only one row. But that is only true if you try to push all of the open segments. What if instead you made appropriate modifications to methods like StreamAppenderatorDriver.publish() and BatchAppenderatorDriver.pushAllAndClear() (or introduced new methods) that allow you to provide a list of segmentIds to exclude, which you would use to exclude the newly created segment that pushed you over the limit?
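A rough sketch of the suggested exclusion-aware variant, assuming the driver can enumerate its open segments via appenderator.getSegments(); the method below is hypothetical, since the existing publish/pushAllAndClear methods take no exclusion list:

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not an existing Druid API: compute the set of
// segments to push, leaving out the newly created segment whose creation
// breached the limit so it stays open and keeps accepting rows.
static List<SegmentIdWithShardSpec> segmentsToPush(
    Appenderator appenderator,
    Set<SegmentIdWithShardSpec> excluded
)
{
  List<SegmentIdWithShardSpec> toPush = new ArrayList<>();
  for (SegmentIdWithShardSpec segment : appenderator.getSegments()) {
    if (!excluded.contains(segment)) {
      toPush.add(segment);  // eligible: not in the exclusion set
    }
  }
  return toPush;
}

The driver would then hand toPush to its internal push path instead of pushing every open segment, avoiding the one-row-segment problem described above.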

justinborromeo (Contributor, Author) replied:

I think that's doable.

stale bot commented Aug 15, 2019

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

stale bot added the stale label Aug 15, 2019
stale bot commented Sep 12, 2019

This pull request/issue has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

stale bot closed this Sep 12, 2019
Successfully merging this pull request may close these issues.

[PROPOSAL] Add segment limit for native and streaming index tasks