[Backport] Add SegmentAllocationQueue to batch SegmentAllocateActions (#13369) #13493

Merged
kfaraz merged 1 commit into apache:25.0.0 from kfaraz:backport_batch_alloc
Dec 5, 2022

@kfaraz (Contributor) commented Dec 5, 2022

Backports #13369

In a cluster with a large number of streaming tasks (~1000), SegmentAllocateActions
on the overlord can often take a long time to finish, causing spikes in the
`task/action/run/time` metric. This may result in lag building up while a task waits
for a segment to get allocated.

The root causes are:
- large number of metadata calls made to the segments and pending segments tables
- `giant` lock held in `TaskLockbox.tryLock()` to acquire task locks and allocate segments

Since the contention typically arises when several tasks of the same datasource try
to allocate segments for the same interval/granularity, the allocation run times can be
improved by batching the requests together.
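
As a rough illustration of the batching idea, the sketch below groups pending requests by datasource, interval and granularity so that each group can be handled together. The classes `AllocateRequest`, `BatchKey` and `AllocationBatcher` are hypothetical, simplified stand-ins and are not the classes added by this PR.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for an allocation request; not Druid's actual class.
final class AllocateRequest {
    final String taskId;
    final String dataSource;
    final String interval;     // kept as a plain string for simplicity
    final String granularity;

    AllocateRequest(String taskId, String dataSource, String interval, String granularity) {
        this.taskId = taskId;
        this.dataSource = dataSource;
        this.interval = interval;
        this.granularity = granularity;
    }
}

// Hypothetical key: requests with equal keys can share one batch.
final class BatchKey {
    final String dataSource;
    final String interval;
    final String granularity;

    BatchKey(String dataSource, String interval, String granularity) {
        this.dataSource = dataSource;
        this.interval = interval;
        this.granularity = granularity;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof BatchKey)) {
            return false;
        }
        BatchKey other = (BatchKey) o;
        return dataSource.equals(other.dataSource)
            && interval.equals(other.interval)
            && granularity.equals(other.granularity);
    }

    @Override
    public int hashCode() {
        return Objects.hash(dataSource, interval, granularity);
    }
}

final class AllocationBatcher {
    // Groups requests by key, preserving the order in which keys were first seen.
    static Map<BatchKey, List<AllocateRequest>> groupByKey(List<AllocateRequest> requests) {
        Map<BatchKey, List<AllocateRequest>> batches = new LinkedHashMap<>();
        for (AllocateRequest r : requests) {
            BatchKey key = new BatchKey(r.dataSource, r.interval, r.granularity);
            batches.computeIfAbsent(key, k -> new ArrayList<>()).add(r);
        }
        return batches;
    }
}
```

With this grouping, many tasks of one datasource asking for the same interval end up in a single batch, which is the case where contention is described as highest.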

Changes
- Add flags
   - `druid.indexer.tasklock.batchSegmentAllocation` (default `false`)
   - `druid.indexer.tasklock.batchAllocationMaxWaitTime` (in millis) (default `1000`)
- Add methods `canPerformAsync` and `performAsync` to `TaskAction`
- Submit each allocate action to a `SegmentAllocationQueue`, which adds it to the matching batch
- Process batch after `batchAllocationMaxWaitTime`
- Acquire `giant` lock just once per batch in `TaskLockbox`
- Reduce metadata calls by batching statements together and updating query filters
- Except for batching, retain the existing behaviour (order of steps, retries, etc.)
- Respond to leadership changes and fail items in queue when not leader
- Emit batch and request level metrics
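
The queueing steps listed above could look roughly like the following sketch. It is not the `SegmentAllocationQueue` from this PR: the class `SketchAllocationQueue`, its methods, the string batch keys and the generated segment ids are all hypothetical. Time is passed in explicitly so the behaviour is deterministic. The sketch only shows three ideas: requests wait in a batch until the maximum wait time has elapsed, the lock is taken once per batch rather than once per request, and queued requests are failed on loss of leadership.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, simplified queue; not Druid's implementation.
final class SketchAllocationQueue {
    private static final class Batch {
        final long createdAtMillis;
        final List<CompletableFuture<String>> futures = new ArrayList<>();

        Batch(long createdAtMillis) {
            this.createdAtMillis = createdAtMillis;
        }
    }

    private final long maxWaitMillis;
    private final ReentrantLock giant = new ReentrantLock(); // stand-in for the lockbox lock
    private final Map<String, Batch> batches = new LinkedHashMap<>();
    private int lockAcquisitions = 0;
    private int nextSegmentNumber = 0;

    SketchAllocationQueue(long maxWaitMillis) {
        this.maxWaitMillis = maxWaitMillis;
    }

    // Adds a request to the batch for its key, creating the batch if needed.
    synchronized CompletableFuture<String> add(String batchKey, long nowMillis) {
        CompletableFuture<String> future = new CompletableFuture<>();
        batches.computeIfAbsent(batchKey, k -> new Batch(nowMillis)).futures.add(future);
        return future;
    }

    // Processes every batch that has waited at least maxWaitMillis.
    // Returns the number of batches processed.
    synchronized int processDueBatches(long nowMillis) {
        int processed = 0;
        Iterator<Map.Entry<String, Batch>> it = batches.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Batch> entry = it.next();
            String key = entry.getKey();
            Batch batch = entry.getValue();
            if (nowMillis - batch.createdAtMillis < maxWaitMillis) {
                continue; // not due yet
            }
            it.remove();
            giant.lock(); // acquired once for the whole batch
            try {
                lockAcquisitions++;
                for (CompletableFuture<String> future : batch.futures) {
                    // Made-up id format, purely for illustration.
                    future.complete(key + "_segment_" + nextSegmentNumber++);
                }
            } finally {
                giant.unlock();
            }
            processed++;
        }
        return processed;
    }

    // Fails everything still queued, e.g. when this node stops being leader.
    synchronized void failAll(String reason) {
        for (Batch batch : batches.values()) {
            for (CompletableFuture<String> future : batch.futures) {
                future.completeExceptionally(new IllegalStateException(reason));
            }
        }
        batches.clear();
    }

    synchronized int lockAcquisitions() {
        return lockAcquisitions;
    }
}
```

In this sketch a caller receives a future from `add` and gets its result once `processDueBatches` has been called with a time at least `maxWaitMillis` after the batch was created. A longer wait allows larger batches at the cost of extra latency for each request.
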
kfaraz added the Backport label Dec 5, 2022
kfaraz added this to the 25.0 milestone Dec 5, 2022

lgtm-com bot commented Dec 5, 2022

This pull request introduces 2 alerts when merging 3e26b96 into baf6ca4 - view on LGTM.com

new alerts:

  • 2 for User-controlled data in numeric cast

Heads-up: LGTM.com's PR analysis will be disabled on the 5th of December, and LGTM.com will be shut down ⏻ completely on the 16th of December 2022. Please enable GitHub code scanning, which uses the same CodeQL engine ⚙️ that powers LGTM.com. For more information, please check out our post on the GitHub blog.

kfaraz merged commit c04ecde into apache:25.0.0 Dec 5, 2022
kfaraz deleted the backport_batch_alloc branch December 5, 2022 14:04