
Clean up pendingSegments table #3565

Closed
dclim opened this issue Oct 13, 2016 · 9 comments

Comments

@dclim
Contributor

dclim commented Oct 13, 2016

The pendingSegments table should have entries purged when they're no longer required to determine sequences of segments.

See: https://groups.google.com/forum/#!topic/druid-user/O0yxORw92VM

@haoxiang47

Ok, I will try to fix this issue

@dclim
Contributor Author

dclim commented Nov 7, 2016

@haoxiang47 expressed interest in submitting a PR to fix this issue - @gianm do you have any thoughts about how best to do this?

@gianm
Contributor

gianm commented Nov 7, 2016

Hmm, no immediate thoughts, other than we should make sure that we don't accidentally clean up pending segments that might be used by ingestions that are either long running or have been paused for a long time.

@dclim
Contributor Author

dclim commented Nov 7, 2016

Do you see any issue with cleaning the entries up after the first task in a set of replicas successfully completes handoff in FiniteAppenderatorDriver.finish()? At least in the KafkaIndexTask case this shouldn't be problematic, because a) all the replicas should already be in a publishing state and not allocating any more segments, and b) if one replica completes handoff, the rest of the replicas will be stopped shortly afterward.

Alternatively, KafkaSupervisor could manage the entry removal once all the tasks associated with a sequenceName have completed or been stopped, but it feels cleaner to do it in FiniteAppenderatorDriver since it was the one that created the entry in the first place.

@gianm
Contributor

gianm commented Nov 7, 2016

Ideally we don't rely on something acting at a single point in time for cleanup, because failures of that thing could leave rows dangling around forever. If we can remove all stale pending segments though, and not just the ones for our current sequence, that'd work.

@gianm
Contributor

gianm commented Nov 7, 2016

(subject to being careful not to remove pending segments that are for still active sequences)

@dclim
Contributor Author

dclim commented Nov 7, 2016

Hm, alright. Another way we could handle this is to have a cleanup thread on the overlord that periodically compares the druid_pendingSegments table to the druid_tasks table and removes any entries in druid_pendingSegments whose sequenceName doesn't have a corresponding 'active' entry in druid_tasks.
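The cleanup pass described above could be sketched roughly as follows, using SQLite in place of the real MySQL metadata store. The table layout here is a deliberately simplified, hypothetical version of druid_pendingSegments and druid_tasks (in the real schema the task's sequenceName lives inside the task payload, not in its own column), so this only illustrates the shape of the query, not Druid's actual code:

```python
import sqlite3

# Stand-in metadata store with a simplified, hypothetical schema:
# pending segments keyed by sequence_name, tasks with an 'active' flag.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE druid_pendingSegments (id TEXT PRIMARY KEY, sequence_name TEXT);
CREATE TABLE druid_tasks (id TEXT PRIMARY KEY, sequence_name TEXT, active INTEGER);
INSERT INTO druid_pendingSegments VALUES ('seg1', 'seq_a'), ('seg2', 'seq_b');
INSERT INTO druid_tasks VALUES ('task1', 'seq_a', 1), ('task2', 'seq_b', 0);
""")

# The proposed periodic cleanup: drop pending segments whose sequenceName
# has no corresponding active task.
conn.execute("""
DELETE FROM druid_pendingSegments
WHERE sequence_name NOT IN (
    SELECT sequence_name FROM druid_tasks WHERE active = 1
)
""")

remaining = [row[0] for row in conn.execute("SELECT id FROM druid_pendingSegments")]
# seq_b has no active task, so its pending segment is removed and seg1 survives
print(remaining)
```

This also addresses gianm's concern above, since the condition is "no active task for the sequence" rather than "my task finished", so a crashed cleanup run just gets retried on the next period.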

@haoxiang47

Well, in our system we have lots of datasources, so the overlord creates lots of tasks. As time goes on, the table grows larger and larger, which slows down queries against MySQL. We just added an index to the druid_pendingSegments table, and it made the queries much faster. So I think we can first simply add indexes to this table, and then clean up the table automatically.
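The interim indexing idea could look like the following, again with SQLite standing in for MySQL. The column choice (dataSource, sequence_name) and the index name are illustrative assumptions, not the index actually added in the commenter's deployment:

```python
import sqlite3

# Stand-in for the metadata store; columns are an illustrative subset
# of druid_pendingSegments.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE druid_pendingSegments (id TEXT PRIMARY KEY, "
    "dataSource TEXT, sequence_name TEXT, created_date TEXT)"
)

# Index the columns that segment-allocation lookups filter on, so reads
# stay fast even while the table keeps growing unpurged.
conn.execute(
    "CREATE INDEX idx_pending_datasource "
    "ON druid_pendingSegments (dataSource, sequence_name)"
)

indexes = [
    r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND name LIKE 'idx%'"
    )
]
print(indexes)
```

An index mitigates the query-latency symptom but not the unbounded growth itself, which is why the thread still converges on automatic cleanup.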

@jon-wei
Contributor

jon-wei commented Jan 23, 2018

Addressed by #5149
