
Clean up pendingSegments table #3565

Closed
dclim opened this issue Oct 13, 2016 · 9 comments

Comments

@dclim
Contributor

dclim commented Oct 13, 2016

The pendingSegments table should have entries purged when they're no longer required to determine sequences of segments.

See: https://groups.google.com/forum/#!topic/druid-user/O0yxORw92VM

@haoxiang47

Ok, I will try to fix this issue

@dclim
Contributor Author

dclim commented Nov 7, 2016

@haoxiang47 expressed interest in submitting a PR to fix this issue - @gianm do you have any thoughts about how best to do this?

@gianm
Contributor

gianm commented Nov 7, 2016

Hmm, no immediate thoughts, other than we should make sure that we don't accidentally clean up pending segments that might be used by ingestions that are either long running or have been paused for a long time.

@dclim
Contributor Author

dclim commented Nov 7, 2016

Do you see any issue with cleaning the entries up after the first task in a set of replicas successfully completes handoff in FiniteAppenderatorDriver.finish()? At least in the KafkaIndexTask case this shouldn't be problematic, because a) all the replicas should already be in a publishing state and not allocating any more segments, and b) if one replica completes handoff, the rest of the replicas will be stopped shortly afterward.

Alternatively, KafkaSupervisor could manage the entry removal once all the tasks associated with a sequenceName have completed or been stopped, but it feels cleaner to do it in FiniteAppenderatorDriver since it was the one that created the entry in the first place.

@gianm
Contributor

gianm commented Nov 7, 2016

Ideally we don't rely on something acting at a single point in time for cleanup, because failures of that thing could leave rows dangling around forever. If we can remove all stale pending segments though, and not just the ones for our current sequence, that'd work.

@gianm
Contributor

gianm commented Nov 7, 2016

(subject to being careful not to remove pending segments that are for still active sequences)

@dclim
Contributor Author

dclim commented Nov 7, 2016

Hm, alright. Another way we could handle this is to have a cleanup thread on the overlord that periodically compares the druid_pendingSegments table to the druid_tasks table and removes any entries in druid_pendingSegments whose sequenceName doesn't have a corresponding 'active' entry in druid_tasks.
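The cleanup pass described above could be sketched roughly as follows, using SQLite in place of the real MySQL metadata store. The table layout here is a deliberately simplified, hypothetical version of druid_pendingSegments and druid_tasks (in the real schema the task's sequenceName lives inside the task payload, not in its own column), so this only illustrates the shape of the query, not Druid's actual code:

```python
import sqlite3

# Stand-in metadata store with a simplified, hypothetical schema:
# pending segments keyed by sequence_name, tasks with an 'active' flag.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE druid_pendingSegments (id TEXT PRIMARY KEY, sequence_name TEXT);
CREATE TABLE druid_tasks (id TEXT PRIMARY KEY, sequence_name TEXT, active INTEGER);
INSERT INTO druid_pendingSegments VALUES ('seg1', 'seq_a'), ('seg2', 'seq_b');
INSERT INTO druid_tasks VALUES ('task1', 'seq_a', 1), ('task2', 'seq_b', 0);
""")

# The proposed periodic cleanup: drop pending segments whose sequenceName
# has no corresponding active task.
conn.execute("""
DELETE FROM druid_pendingSegments
WHERE sequence_name NOT IN (
    SELECT sequence_name FROM druid_tasks WHERE active = 1
)
""")

remaining = [row[0] for row in conn.execute("SELECT id FROM druid_pendingSegments")]
# seq_b has no active task, so its pending segment is removed and seg1 survives
print(remaining)
```

This also addresses gianm's concern above, since the condition is "no active task for the sequence" rather than "my task finished", so a crashed cleanup run just gets retried on the next period.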

@haoxiang47

Well, in our system we have lots of datasources, so the overlord creates lots of tasks. As time goes on, the table grows larger and larger, which slows down queries against MySQL. We just added an index to the druid_pendingSegments table, and it made the queries much faster. So I think we can first simply add indexes to this table, and then clean up the table automatically.
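The interim indexing idea could look like the following, again with SQLite standing in for MySQL. The column choice (dataSource, sequence_name) and the index name are illustrative assumptions, not the index actually added in the commenter's deployment:

```python
import sqlite3

# Stand-in for the metadata store; columns are an illustrative subset
# of druid_pendingSegments.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE druid_pendingSegments (id TEXT PRIMARY KEY, "
    "dataSource TEXT, sequence_name TEXT, created_date TEXT)"
)

# Index the columns that segment-allocation lookups filter on, so reads
# stay fast even while the table keeps growing unpurged.
conn.execute(
    "CREATE INDEX idx_pending_datasource "
    "ON druid_pendingSegments (dataSource, sequence_name)"
)

indexes = [
    r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND name LIKE 'idx%'"
    )
]
print(indexes)
```

An index mitigates the query-latency symptom but not the unbounded growth itself, which is why the thread still converges on automatic cleanup.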

@jon-wei
Contributor

jon-wei commented Jan 23, 2018

Addressed by #5149
