
Add SegmentAllocationQueue to batch allocation actions #13369

Merged
merged 36 commits into apache:master from seg_alloc_queue on Dec 5, 2022

Conversation

kfaraz
Contributor

@kfaraz kfaraz commented Nov 15, 2022

Description

In a cluster with a large number of streaming tasks (~1000), SegmentAllocateActions on the overlord can often take a very long time to finish, causing spikes in task/action/run/time. This may result in lag building up while a task waits for a segment to be allocated.

The root causes are:

  • large number of metadata calls made to the segments and pending segments tables
  • giant lock held in TaskLockbox.tryLock() to acquire task locks and allocate segments

Since the contention typically arises when several tasks of the same datasource try to allocate segments for the same interval/granularity, the allocation run times can be improved by batching the requests together.

Changes

Broad strokes

  • Add flags
    • druid.indexer.tasklock.batchSegmentAllocation (default false)
    • druid.indexer.tasklock.batchAllocationMaxWaitTime (in millis) (default 1000)
  • Put a bound on the amount of time an allocation action can take by configuring the batchAllocationMaxWaitTime
  • Add SegmentAllocationQueue to queue and batch allocate actions
  • Identify a batch uniquely with these fields (see the sketch after this list)
    • dataSource
    • groupId
    • preferredAllocationInterval
    • skipSegmentLineageCheck
    • useNonRootGenerationPartitionSpace (as a proxy for PartialShardSpec.getClass())
    • lockGranularity
  • Metadata calls are now batched
  • The giant lock is acquired just once per batch per tryInterval in TaskLockbox
  • The rest of the behaviour (order of steps, retries, etc.) in the new flow remains exactly the same as in the old flow
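For illustration only, here is a minimal sketch of how such a batch key might be modeled; the class name and field types are assumptions, not the PR's actual implementation:

```java
import org.joda.time.Interval;

// Hypothetical sketch (not the PR's actual class): a key that uniquely
// identifies an allocation batch. Allocate actions with equal keys are grouped
// into the same batch and served with a single round of metadata calls.
record AllocationBatchKey(
    String dataSource,
    String groupId,
    Interval preferredAllocationInterval,
    boolean skipSegmentLineageCheck,
    boolean useNonRootGenerationPartitionSpace,
    String lockGranularity
)
{
  // The record-generated equals() and hashCode() over these six fields are
  // what decide whether two allocate actions land in the same batch.
}
```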

Adding a request

  • Add methods canPerformAsync and performAsync to TaskAction
  • performAsync currently throws an UnsupportedOperationException for other action types
  • Submit each allocate action to a SegmentAllocationQueue (see the sketch after this list)
  • Add the action to a batch and execute it when it is due
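A rough sketch of what the submit path might look like (all names below are placeholders, not the PR's actual API): the queue appends the request to the batch for its key and hands back a future the action client can wait on.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of the submit path: each allocate request is appended
// to the batch for its key, and the caller gets back a future that completes
// when the batch is eventually processed.
class SketchAllocationQueue
{
  private final Map<String, Queue<SketchRequest>> keyToBatch = new ConcurrentHashMap<>();

  CompletableFuture<String> add(String batchKey, SketchRequest request)
  {
    keyToBatch.computeIfAbsent(batchKey, key -> new ConcurrentLinkedQueue<>()).add(request);
    return request.future;
  }

  static class SketchRequest
  {
    final String taskId;
    final CompletableFuture<String> future = new CompletableFuture<>();

    SketchRequest(String taskId)
    {
      this.taskId = taskId;
    }
  }
}
```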

Processing a request batch

  • Wait for batchAllocationMaxWaitTime before processing a batch (see the sketch after this list)
  • Send all requests in the batch to TaskLockbox to acquire task locks and allocate segments
  • Respond to leadership changes and fail items in queue when not leader
  • Emit batch and request level metrics
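A minimal sketch of the processing side, assuming a simple scheduled executor (the class and the emitBatchMetric placeholder are hypothetical, not the PR's actual code): a batch is only picked up once its max wait time has elapsed, its requests are served by one batched allocation call, and per-batch metrics are emitted.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of batch processing: wait for batchAllocationMaxWaitTime,
// then complete all futures in the batch and emit per-batch metrics.
class SketchBatchProcessor
{
  private final ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
  private final long maxWaitTimeMillis;

  SketchBatchProcessor(long maxWaitTimeMillis)
  {
    this.maxWaitTimeMillis = maxWaitTimeMillis;
  }

  void schedule(List<CompletableFuture<String>> batchFutures)
  {
    final long queuedAt = System.currentTimeMillis();
    exec.schedule(() -> {
      final long startedAt = System.currentTimeMillis();
      // A single batched lock + allocation call would happen here; each future
      // is then completed with the segment id allocated for its request.
      batchFutures.forEach(f -> f.complete("<allocated-segment-id>"));
      emitBatchMetric("task/action/batch/queueTime", startedAt - queuedAt);
      emitBatchMetric("task/action/batch/runTime", System.currentTimeMillis() - startedAt);
      emitBatchMetric("task/action/batch/size", batchFutures.size());
    }, maxWaitTimeMillis, TimeUnit.MILLISECONDS);
  }

  private void emitBatchMetric(String metric, long value)
  {
    // Placeholder for the actual metric emitter.
    System.out.println(metric + "=" + value);
  }
}
```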

Allocating segments

  • Add TaskLockbox.allocateSegments() retaining the logic from TaskLockbox.tryLock()
  • Add IndexerMetadataStorageCoordinator.allocatePendingSegments() retaining the logic from allocatePendingSegment()

New metrics

Per batch (dims: dataSource, taskActionType=segmentAllocate):

  • task/action/batch/runTime
  • task/action/batch/queueTime
  • task/action/batch/size
  • task/action/batch/attempts

Per request (dims: taskId, taskType, dataSource, taskActionType=segmentAllocate):

  • task/action/failed/count
  • task/action/success/count

Pending changes

  • Docs for metrics
  • Perf eval of new flow against old flow

Further improvements

  • An improvement could be to free up the HTTP threads on which the request is received by using HttpServletRequest.startAsync(). This would allow batching of even more requests.
  • There are some metadata queries which have not been batched in this PR, such as updating the acquired task lock in the metadata store and fetching a matching pending segment when skipSegmentLineageCheck = false.

Release note

Speed up segment allocation and reduce metadata calls by batching multiple requests together.
Set druid.indexer.tasklock.batchSegmentAllocation=true to enable this feature.
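For reference, enabling the feature on the overlord would look roughly like this in runtime.properties (the wait time shown is just the documented default, included as an example):

```properties
# Overlord runtime.properties
druid.indexer.tasklock.batchSegmentAllocation=true
# Maximum time (in millis) an allocate request may wait in the queue before its batch is processed
druid.indexer.tasklock.batchAllocationMaxWaitTime=1000
```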

@kfaraz kfaraz added the WIP label Nov 16, 2022
@lgtm-com

lgtm-com bot commented Nov 16, 2022

This pull request introduces 2 alerts when merging 0524e83 into 78d0b0a - view on LGTM.com

new alerts:

  • 2 for User-controlled data in numeric cast

Heads-up: LGTM.com's PR analysis will be disabled on the 5th of December, and LGTM.com will be shut down ⏻ completely on the 16th of December 2022. Please enable GitHub code scanning, which uses the same CodeQL engine ⚙️ that powers LGTM.com. For more information, please check out our post on the GitHub blog.

@lgtm-com

lgtm-com bot commented Nov 17, 2022

This pull request introduces 2 alerts when merging b546f6c into 78d0b0a - view on LGTM.com

new alerts:

  • 2 for User-controlled data in numeric cast


@lgtm-com

lgtm-com bot commented Nov 17, 2022

This pull request introduces 2 alerts when merging 4c66afd into bf10ff7 - view on LGTM.com

new alerts:

  • 2 for User-controlled data in numeric cast


@lgtm-com

lgtm-com bot commented Nov 18, 2022

This pull request introduces 2 alerts when merging e07e852 into bf10ff7 - view on LGTM.com

new alerts:

  • 2 for User-controlled data in numeric cast


@lgtm-com

lgtm-com bot commented Nov 29, 2022

This pull request introduces 2 alerts when merging 92479da into edd076c - view on LGTM.com

new alerts:

  • 2 for User-controlled data in numeric cast


final Set<DataSegment> usedSegments = retrieveUsedSegments(requestKey);
final int successCount = allocateSegmentsForBatch(requestBatch, usedSegments);

emitBatchMetric("task/action/batch/retries", 1L, requestKey);
Contributor

nit: task/action/batch/attempts seems more accurate

Contributor Author

Thanks! Yeah, I had originally used attempts but then moved the code around a bit and this got missed.

return true;
}

// Requeue the batch only if used segments have changed
Contributor

Could you please explain a bit more about this?

Contributor Author

This is to retain behaviour from the existing flow. If the set of used segments has changed since we started processing the request, retrying might allow it to succeed.
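In other words, the requeue decision boils down to roughly the following check (a simplified sketch, not the actual code):

```java
import java.util.Set;

// Simplified sketch of the requeue decision described above: retry only if a
// fresh snapshot of used segments differs from the one the batch was processed
// against, since only then can a retry produce a different outcome.
class RequeueDecision
{
  static <T> boolean shouldRequeue(Set<T> usedSegmentsAtProcessing, Set<T> usedSegmentsNow)
  {
    return !usedSegmentsNow.equals(usedSegmentsAtProcessing);
  }
}
```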

holderList.getPending().forEach(holder -> acquireTaskLock(holder, false));
}

// TODO: for failed allocations, cleanup newly created locks from the posse map
Contributor

Does this need to be addressed as part of these changes?

Contributor Author

Yeah, let me check this. If not done, this might leave some garbage locks in the posse map.

emitTaskMetric("task/action/success/count", 1L, request);
requestToFuture.remove(request).complete(result.getSegmentId());
} else if (request.canRetry()) {
log.debug(
Contributor

This should be info or warn, I think. If the retry is causing a delay, we will only know it from the logs.

holderList.getPending().forEach(holder -> acquireTaskLock(holder, false));
}

// TODO: for failed allocations, cleanup newly created locks from the posse map
Contributor

Can this TODO be resolved?

Contributor Author

@kfaraz kfaraz Dec 1, 2022

I need to verify whether this is an actual issue. Also, this behaviour is the same in the old flow too.
I think I should tackle this in an immediate follow-up PR, so that this PR just adds the new flow without modifying any behaviour.

log.debug("Added a new batch [%s] to queue.", batch.key);
return true;
} else {
log.warn("Cannot add batch [%s] as queue is full. Failing [%d] requests.", batch.key, batch.size());
Contributor

What happens as a result of this? I assume it will be caused when the metadata store is slow. Can there be any other reason? Should we add some possible corrective action here, like checking metadata store sizing?

Contributor Author

It can also happen when things keep getting retried due to frequent updates to the set of used segments. It would be up to the client to retry in such cases.

The corrective action can be to look at the emitted metrics batch/runTime and batch/attempts. A high value of the first would point to metadata store slowness, a high value of the second to excessive retries. The user can then determine if the metadata store needs resizing. What do you think?

Comment on lines 225 to 227
if (!isLeader.get()) {
return;
}
Contributor

Is this necessary? What if some items were added to the queue just before line 225?

Contributor Author

@kfaraz kfaraz Nov 30, 2022

I guess that is not likely to happen. New items are rejected if we are not the leader.
But maybe it could happen if the leader is flapping?

  • t1: we are not leader, we clear the queue
  • t2: become leader
  • t3: add a request
  • t4: stop being leader
  • t5: we are not leader, return

The request added at t3 would thus remain stuck in the queue until we become leader again and process it. The future will time out after 5 minutes.

A simpler alternative to all of this would be to do the leadership check for each batch separately in processBatch (this check should probably exist anyway) and fail the batch there if we are not the leader.

The differences would be:

  • even non-leaders would keep polling the queue
  • requests would wait until their due time to be failed

Please let me know what you think.
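Sketched out, the proposed per-batch check would amount to something like this (hypothetical names, not the actual implementation):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of the proposed per-batch leadership check: even a
// non-leader keeps polling the queue, but every batch it picks up is failed
// immediately instead of being left to time out.
class PerBatchLeaderCheck
{
  static void processBatch(List<CompletableFuture<String>> batch, BooleanSupplier isLeader)
  {
    if (!isLeader.getAsBoolean()) {
      batch.forEach(f -> f.completeExceptionally(new IllegalStateException("Overlord is not the leader")));
      return;
    }
    // ... otherwise proceed with lock acquisition and segment allocation for the batch ...
  }
}
```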

++numProcessedBatches;
}
catch (Throwable t) {
processed = true;
Contributor

Well, there could be DB failures, for example. If we don't retry, we should surface the failure to the caller.


for (SegmentCreateRequest request : requests) {
// Check if the required segment has already been created in this batch
final String sequenceHash = getSequenceNameAndPrevIdSha(request, interval, skipSegmentLineageCheck);
Contributor

When can this occur?
Also, SegmentCreateRequest states that requests must be treated distinctly even if all fields are the same.

Contributor Author

Multiple requests for the same segment, in the case of task replicas, I should think.

Contributor Author

@kfaraz kfaraz Nov 30, 2022

SegmentCreateRequest states that requests must be treated distinctly even if all fields are the same.

The SegmentCreateRequests are distinct; it is the created segment id that is not different.
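Conceptually, the in-batch reuse for such replica requests is just a lookup keyed on the sequence hash, roughly as in this simplified sketch (names and helper are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the in-batch dedup discussed above: if a request's
// sequence hash has already produced a segment id earlier in the batch
// (e.g. task replicas), reuse that id instead of creating a new pending segment.
class InBatchDedup
{
  private final Map<String, String> sequenceHashToSegmentId = new HashMap<>();

  String allocate(String sequenceHash)
  {
    return sequenceHashToSegmentId.computeIfAbsent(sequenceHash, hash -> createNewPendingSegment(hash));
  }

  private String createNewPendingSegment(String sequenceHash)
  {
    // Placeholder for the actual metadata insert that creates a pending segment.
    return "pending-segment-for-" + sequenceHash;
  }
}
```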

// If there is an existing chunk, find the max id with the same version as the existing chunk.
// There may still be a pending segment with a higher version (but no corresponding used segments)
// which may generate a clash with an existing segment once the new id is generated
final SegmentIdWithShardSpec overallMaxId =
Contributor

The overallMaxId is computed assuming that committedMaxId has already been added to the set of pendingSegments.
I think it's better not to make that assumption and to add committedMaxId explicitly in this method.

@lgtm-com

lgtm-com bot commented Nov 30, 2022

This pull request introduces 2 alerts when merging a03ba2e into 50963ed - view on LGTM.com

new alerts:

  • 2 for User-controlled data in numeric cast


@kfaraz
Contributor Author

kfaraz commented Dec 1, 2022

@AmatyaAvadhanula , @abhishekagarwal87 , thanks for the review. I have addressed your feedback.

Changes made:

  • Queue polling is now always active, irrespective of whether we are the leader. This can be improved later if needed.
  • Queue polling starts on lifecycle start and not on becoming leader.
  • A future now either completes successfully or exceptionally, i.e. it never completes with null. This allows propagation of actual error message to the caller. Added tests for this.
  • Clarify usage of committedMaxId in the set of pendingSegments
  • Rename metric batch/retries to batch/attempts

Changes in follow up PR:

  • docs (mostly just metrics)
  • cleanup of empty locks from posse map (both old and new flow)
  • limit number of requests in each batch
  • cancel the future if it times out on LocalTaskActionClient

@lgtm-com

lgtm-com bot commented Dec 1, 2022

This pull request introduces 2 alerts when merging 0d09a6e into f6f625e - view on LGTM.com

new alerts:

  • 2 for User-controlled data in numeric cast


final Map<Interval, List<SegmentAllocateRequest>> overlapIntervalToRequests = new HashMap<>();

for (SegmentAllocateRequest request : allRequests) {
// If there is an overlapping used segment, the interval of the used segment
Contributor

The allocation for a single segment considers overlap with queryGranularity.bucket(row.timestamp()).
This code fetches all used segments overlapping with segmentGranularity.bucket(row.timestamp()). I think this could lead to a case where the row is allocated to a used segment interval to which it doesn't belong.

Contributor Author

@kfaraz kfaraz Dec 2, 2022

The used segments are fetched for the allocate interval (segmentGranularity.bucket(row.timestamp())) so that we don't have to make the metadata call multiple times.
The logic hasn't changed, because we still use the rowInterval (queryGranularity.bucket(row.timestamp())) to find the actual overlapping interval. Please look at the line after this one.
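Roughly, the grouping works as in the simplified sketch below (placeholder types, not the actual code): used segments are fetched once for the wider allocate interval, and each request's narrower rowInterval decides which used segment interval it maps to.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.joda.time.Interval;

// Simplified sketch of the grouping discussed above: used segments are fetched
// once for the wider allocate interval, but each request's narrower rowInterval
// decides which used segment interval (if any) it actually overlaps with.
class OverlapGrouping
{
  static Map<Interval, List<Interval>> groupByOverlap(
      List<Interval> usedSegmentIntervals,
      List<Interval> rowIntervals
  )
  {
    final Map<Interval, List<Interval>> overlapIntervalToRows = new HashMap<>();
    for (Interval rowInterval : rowIntervals) {
      for (Interval usedInterval : usedSegmentIntervals) {
        if (usedInterval.overlaps(rowInterval)) {
          overlapIntervalToRows.computeIfAbsent(usedInterval, i -> new ArrayList<>()).add(rowInterval);
          break;
        }
      }
    }
    return overlapIntervalToRows;
  }
}
```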

@lgtm-com
Copy link

lgtm-com bot commented Dec 2, 2022

This pull request introduces 2 alerts when merging c132748 into 138a6de - view on LGTM.com

new alerts:

  • 2 for User-controlled data in numeric cast


@lgtm-com
Copy link

lgtm-com bot commented Dec 2, 2022

This pull request introduces 2 alerts when merging 29bc0d4 into 30498c1 - view on LGTM.com

new alerts:

  • 2 for User-controlled data in numeric cast


@kfaraz
Contributor Author

kfaraz commented Dec 5, 2022

All unit and integration tests have run successfully with batched segment allocation enabled in test PR #13484.
There was only one flaky test, which I couldn't retry due to some Travis access issues.

Contributor

@AmatyaAvadhanula AmatyaAvadhanula left a comment

Thank you @kfaraz. LGTM

@kfaraz
Contributor Author

kfaraz commented Dec 5, 2022

Merging as the build is green. The latest commit is only a comment change.

@kfaraz kfaraz merged commit 45a8fa2 into apache:master Dec 5, 2022
kfaraz added a commit to kfaraz/druid that referenced this pull request Dec 5, 2022
)

In a cluster with a large number of streaming tasks (~1000), SegmentAllocateActions 
on the overlord can often take very long intervals of time to finish thus causing spikes 
in the `task/action/run/time`. This may result in lag building up while a task waits for a
segment to get allocated.

The root causes are:
- large number of metadata calls made to the segments and pending segments tables
- `giant` lock held in `TaskLockbox.tryLock()` to acquire task locks and allocate segments

Since the contention typically arises when several tasks of the same datasource try
to allocate segments for the same interval/granularity, the allocation run times can be
improved by batching the requests together.

Changes
- Add flags
   - `druid.indexer.tasklock.batchSegmentAllocation` (default `false`)
   - `druid.indexer.tasklock.batchAllocationMaxWaitTime` (in millis) (default `1000`)
- Add methods `canPerformAsync` and `performAsync` to `TaskAction`
- Submit each allocate action to a `SegmentAllocationQueue`, and add to correct batch
- Process batch after `batchAllocationMaxWaitTime`
- Acquire `giant` lock just once per batch in `TaskLockbox`
- Reduce metadata calls by batching statements together and updating query filters
- Except for batching, retain the whole behaviour (order of steps, retries, etc.)
- Respond to leadership changes and fail items in queue when not leader
- Emit batch and request level metrics
@lgtm-com
Copy link

lgtm-com bot commented Dec 5, 2022

This pull request introduces 2 alerts when merging 88c4cca into 9177419 - view on LGTM.com

new alerts:

  • 2 for User-controlled data in numeric cast


kfaraz added a commit that referenced this pull request Dec 5, 2022
…13493)

@kfaraz kfaraz deleted the seg_alloc_queue branch May 3, 2023 15:49