
Rework shard snapshot workers #88209

Merged: 74 commits into elastic:main on Sep 8, 2022

Conversation

pxsalehi
Member

@pxsalehi pxsalehi commented Jun 30, 2022

Currently, when starting the shard snapshot tasks, we loop through the
shards to be snapshotted back-to-back. For each shard this calculates
whether (and which) changes in the shard need to be snapshotted, writes
some metadata, and ONLY if there are changes and files to be uploaded,
forks off the file upload calls. This could be further improved by
parallelizing the shard snapshot tasks themselves (currently, only file
uploads happen in parallel).

This change uses a worker pool and a queue to (sketched below):

  1. Parallelize shard snapshotting (with a limited number of workers) and
     limit the number of concurrently running snapshot tasks.
  2. Prioritize shard snapshot tasks over file snapshot tasks.

Closes #83408
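
A minimal, self-contained sketch of the idea (the class and method names here are illustrative, not the ones introduced by this PR): both kinds of work go into one priority queue, so a bounded set of SNAPSHOT-pool workers always picks up pending shard-level tasks before file uploads.

```java
// Illustrative sketch only; the actual classes added by this PR differ.
abstract class SnapshotTask implements Runnable, Comparable<SnapshotTask> {
    /** Lower value = higher priority, i.e. runs earlier when queued. */
    public abstract int priority();

    @Override
    public int compareTo(SnapshotTask other) {
        return Integer.compare(priority(), other.priority());
    }
}

final class ShardSnapshotTask extends SnapshotTask {
    @Override
    public int priority() {
        return 0; // per-shard work: detect changes, write shard metadata
    }

    @Override
    public void run() {
        // compute which segment files changed and write the shard-level metadata
    }
}

final class FileSnapshotTask extends SnapshotTask {
    @Override
    public int priority() {
        return 1; // upload of one file of a changed shard; always sorts after shard tasks
    }

    @Override
    public void run() {
        // upload a single segment file to the repository
    }
}
```

Draining a single PriorityBlockingQueue of such tasks with at most N workers then gives both properties at once: bounded concurrency and shard-level tasks ahead of file uploads.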

@pxsalehi pxsalehi added the >enhancement and :Distributed/Snapshot/Restore labels Jun 30, 2022
@elasticsearchmachine
Collaborator

Hi @pxsalehi, I've created a changelog YAML for you.

@pxsalehi pxsalehi marked this pull request as ready for review June 30, 2022 13:05
@elasticmachine elasticmachine added the Team:Distributed label Jun 30, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@henningandersen
Contributor

I wonder if this approach is necessarily better. IIUC, we currently keep the one thread looping until the end. In the new approach, we risk the first (or an early) shard filling up the snapshot queue before the last tasks are added to the queue. This could mean that shards with no changes get scheduled after large file uploads?

@original-brownbear
Member

This could mean that shards with no changes get scheduled after large file uploads?

We figured this was somewhat unlikely (to happen a lot) and outweighed by the fact that we now multi-thread the starting of shard snapshots (which is the main motivation for this change, as we realized that running this single-threaded could take a long time on large data nodes).
But I guess if we want to be stricter about this, we could implement this differently by putting all the shards to start in a queue and then running threads (one of which would be the original thread that we fork from) that poll from that queue. That way we get even better ordering at the price of a little more code? Think it's worth it?

@pxsalehi
Member Author

pxsalehi commented Jul 4, 2022

(Apologies for using too many vague/relative terms like many/few/large! But I need to somehow ask my question!)

For me, one missing piece in evaluating the worst-case impact of a file upload on unchanged shard snapshots is the following: could single files (to be uploaded) be extremely large? What would be a realistic "average" case? Or is it hard to tell, since it could differ wildly depending on the settings/load? It seemed to me that uploading many not-huge files is the common case. (Also, another assumption from our conversation: it seems that during a snapshot most shards often have no changes, and only a few of them need to upload files.)

@henningandersen
Contributor

We generally try not to fill the queue with all tasks; for instance, BlobStoreRepository.snapshotShard only schedules as many upload tasks as there are threads. I assume this is to allow other work to interleave?

If we do all shards in parallel, we risk getting #shards * thread_pool_size tasks into the queue. It seems like the "throttling" we do per shard is no longer useful then? This could mean many minutes, if not hours, of no response on the SNAPSHOT thread pool. I think that could block restores, new followers and single-index snapshots? Am I missing something essential here?

I think this would be true also if we queue up the shard-level execution and only do n jobs in parallel.

@original-brownbear
Member

Am I missing something essential here?

You're not directly missing anything, actually. But I think there is a tradeoff here. This change makes it so that running a snapshot on a 10k-shard warm node may block the snapshot pool for ~15 min (assuming ~0.5s per shard snapshot and 5 snapshot threads max) vs. taking ~75 min to run a single snapshot without blocking the pool.
Taking the 15 min over the 75 min and accepting the blocking seems like the right tradeoff to me, especially when it's this easy to implement?
Also, note that on the delete side we're not being clever about any sort of hand-crafted work-stealing queue like we do for file uploads, and as far as I'm aware no complaints about that have ever come in :)

I think this would be true also if we queue up the shard level execution and only do n jobs in parallel.

True, the advantage of that approach would be that we'd have a stronger guarantee of running the metadata work right before the file uploads (and it's potentially a little faster, I guess).

@pxsalehi
Member Author

pxsalehi commented Jul 5, 2022

I'm trying to understand how much of this discussion relates to the original issue (#83408).

If blocking the snapshot threadpool during a snapshot is a concern, then what would be an acceptable approach that addresses the issue and avoids blocking the threadpool? One thread going through the shard snapshot tasks, calculating which snapshots do not upload a file, doing all of those on the same thread, and then, again on the same thread, going through the shard snapshots that do have a change and only forking off the file uploads?

Basically, everything other than file upload would be done on the thread going through the shards, and actually in two passes? No snapshotting of unchanged shards in parallel? Considering that having many thousands of unchanged shard snapshots and only a few shards actually uploading files is a normal occurrence, as Armin mentioned, parallelizing the snapshotting of unchanged shards seems to reduce snapshot time, which I guess is an improvement.

My question is: if we do not want to parallelize shard snapshot tasks, would the approach I mentioned above bring any real improvement for the issue? Currently, all unchanged snapshots happen on the same thread, which doesn't seem that different from the two-phase approach!

@henningandersen
Contributor

Armin and I had a brief conversation on this, and while the existing mechanism for limiting the number of upload threads is somewhat broken, running the outer-level tasks in parallel would slightly increase this brokenness. We prefer to instead fix this for good by maintaining enough data structures to just fill the snapshot pool with the right amount of work without exhausting it for potentially hours.

Something like 2 queues (the outer-level work and the actual file upload work) and a counter for how many jobs are active should do. When any job finishes, it checks the outer-level queue first, and once that is depleted it does the actual file upload jobs.

This also ensures that all no-change shards are done prior to any file uploads. I think Armin intends to sync with you on this.
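
As a rough sketch of the selection rule described above (a hypothetical helper, not the code that was eventually merged), a worker that finishes a job would pick its next task like this:

```java
import java.util.Queue;

final class SnapshotWorkSelection {
    private SnapshotWorkSelection() {}

    /**
     * Outer-level (shard) work always wins; actual file uploads are only picked up
     * once the outer-level queue is depleted. Returns null when both queues are empty.
     */
    static Runnable nextTask(Queue<Runnable> outerLevelTasks, Queue<Runnable> fileUploadTasks) {
        Runnable task = outerLevelTasks.poll();
        return task != null ? task : fileUploadTasks.poll();
    }
}
```

Combined with a counter of active jobs that is capped at the SNAPSHOT pool size, this keeps the pool busy without queueing up hours of work in it.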

@pxsalehi pxsalehi marked this pull request as draft July 11, 2022 10:32
@pxsalehi pxsalehi changed the title Fork off shard snapshot calls in parallel [WIP] Rework shard snapshot workers Jul 13, 2022
@pxsalehi
Member Author

pxsalehi commented Jul 19, 2022

@elasticmachine please run elasticsearch-ci/part-1

(I think the test failure was unrelated to this PR. I opened #88615.)

@pxsalehi pxsalehi changed the base branch from master to 8.3 July 20, 2022 09:18
@pxsalehi pxsalehi changed the base branch from 8.3 to master July 20, 2022 09:18
@pxsalehi
Member Author

pxsalehi commented Sep 1, 2022

@henningandersen Thanks for the detailed feedback. I addressed all your comments. Please have another look. Meanwhile I'll look into that previous CI failure to see if it is related or I was just paranoid!

@pxsalehi pxsalehi marked this pull request as ready for review September 1, 2022 09:13
@pxsalehi
Member Author

pxsalehi commented Sep 5, 2022

@henningandersen Thanks for the detailed feedback. I addressed all your comments. Please have another look. Meanwhile I'll look into that previous CI failure to see if it is related or I was just paranoid!

I am not able to reproduce that issue on the branch. For the record, the couple of times it happened it timed out on different asserts in the test (not just a specific one), and that was on a pretty slow system. I think it is safe to dismiss it.

Contributor

@henningandersen henningandersen left a comment

This looks good to me now, but I'd like @original-brownbear to do the final review on this.


@Override
public int compareTo(TestTask o) {
    return Integer.compare(priority, o.priority);
}
Contributor

Can we return priority - o.priority instead? Integer.compare normalizes to -1, 0, 1, thus we do not test that any other values have the right meaning.

Member Author

@henningandersen the compareTo is only used by the priority queue, which doesn't care about the actual distance between the two priorities, I think! So I don't understand why this matters. Could you please elaborate on why this is important?

Contributor

This is part of a test verifying that PrioritizedThrottledTaskRunner executes tasks in the right order when presented with any comparable object. But the test here only verifies this using a subset of such comparable objects: those that return "normalized" values from compareTo.

Arguing in terms of the implementation seems invalid; the purpose of the tests is to demonstrate that the implementation works under as many circumstances as possible. Given how simple the change is to demonstrate this under more circumstances, I think we should make it.

Member Author

I've made the change already. Thanks for the explanation! :)
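
For reference, the change being discussed amounts to roughly this (assuming TestTask keeps an int priority field, as in the excerpt above):

```java
@Override
public int compareTo(TestTask o) {
    // Raw difference instead of Integer.compare: yields arbitrary negative/positive
    // values rather than just -1, 0, 1, so the runner's ordering is also exercised
    // with non-normalized comparator results. (Fine here because test priorities are
    // small; the subtraction could overflow for extreme int values.)
    return priority - o.priority;
}
```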

Member

@original-brownbear original-brownbear left a comment

This looks good to me. Just one point where I asked for some docs, because I'm having trouble understanding the code, and one rather trivial detail, and this should be good to go :)

}

// visible for testing
protected void pollAndSpawn() {
Member

Can we add some commentary on how this loop works, in particular on why we need to peek the queue? This is somewhat hard for me to follow and will be even harder for future readers of this code.

Member Author

Good point!
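
Something along these lines is what such commentary could cover; this is a simplified, hypothetical version of the runner (field and method names assumed, not the exact merged code). The peek() after a failed poll() closes the race where another thread enqueues a task, sees all worker slots taken, and relies on a worker that is just about to give its slot back:

```java
// Simplified sketch for illustration; not the exact code from this PR.
import java.util.concurrent.Executor;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

class ThrottledRunnerSketch<T extends Comparable<T> & Runnable> {
    private final int maxRunningTasks;
    private final Executor executor; // e.g. the SNAPSHOT thread pool
    private final PriorityBlockingQueue<T> tasks = new PriorityBlockingQueue<>();
    private final AtomicInteger runningTasks = new AtomicInteger();

    ThrottledRunnerSketch(int maxRunningTasks, Executor executor) {
        this.maxRunningTasks = maxRunningTasks;
        this.executor = executor;
    }

    void enqueueTask(T task) {
        tasks.add(task);
        // If all slots are busy this call spawns nothing and a running worker has to
        // pick the task up later -- which is exactly why pollAndSpawn() peeks below.
        pollAndSpawn();
    }

    // visible for testing
    protected void pollAndSpawn() {
        // Claim execution slots while we are below the concurrency limit.
        while (incrementRunningTasks()) {
            T task = tasks.poll();
            if (task == null) {
                // Nothing to run: release the slot we just claimed ...
                runningTasks.decrementAndGet();
                // ... but peek again: a concurrent enqueueTask() may have added a task
                // after our poll() and, seeing all slots taken, assumed we would run it.
                if (tasks.peek() == null) {
                    return; // queue is really empty, safe to stop
                }
                // A task arrived concurrently; loop and try to claim a slot for it.
            } else {
                executor.execute(() -> {
                    try {
                        task.run();
                    } finally {
                        runningTasks.decrementAndGet();
                        pollAndSpawn(); // this worker is done; start the next queued task
                    }
                });
            }
        }
    }

    private boolean incrementRunningTasks() {
        // Atomically claim a slot only if we are below the limit.
        return runningTasks.getAndUpdate(v -> v < maxRunningTasks ? v + 1 : v) < maxRunningTasks;
    }
}
```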

this.context = context;
}

public abstract short priority();
Member

Should we just use int here? short is kind of pointless, isn't it? It might even be a net negative for comparison performance.

@pxsalehi
Member Author

pxsalehi commented Sep 6, 2022

@original-brownbear @henningandersen All done! Please check again.

Member

@original-brownbear original-brownbear left a comment

LGTM, I'm good with this one now as is. Performance seems alright too; I gave it a quick benchmarking run. Unless @henningandersen has anything open, I think we're good to go here.

for (int i = 0; i < enqueued; i++) {
    new Thread(() -> {
        try {
            threadBlocker.countDown();
Member

NIT: we could use a CyclicBarrier here instead, which seems to be the correct primitive?

Member Author

We could. But since we need only a one-time barrier, the latch is enough I think. I've seen it used like that in several places in the code.

Contributor

I'd agree with Armin here: CyclicBarrier is good for a rendezvous interaction, and it saves a line of code too.

But fine to leave as is...
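
For reference, a hypothetical variant of the test setup using a CyclicBarrier as the rendezvous point (variable names and the enqueue call are assumptions; the merged test kept the CountDownLatch). Every enqueuing thread blocks at await() until all of them have arrived, then they proceed together:

```java
// Requires java.util.concurrent.CyclicBarrier and BrokenBarrierException.
CyclicBarrier barrier = new CyclicBarrier(enqueued); // only the enqueuing threads rendezvous here
for (int i = 0; i < enqueued; i++) {
    new Thread(() -> {
        try {
            barrier.await(); // all enqueuing threads start at (roughly) the same time
            // ... enqueue the task under test here ...
        } catch (InterruptedException | BrokenBarrierException e) {
            throw new AssertionError(e);
        }
    }).start();
}
```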

@pxsalehi pxsalehi merged commit c7e36c3 into elastic:main Sep 8, 2022
@pxsalehi
Member Author

pxsalehi commented Sep 8, 2022

Thanks Henning and Armin!

pxsalehi added a commit that referenced this pull request Sep 9, 2022
This PR fixes a bug introduced in #88209 while refactoring how file upload
tasks run in a shard snapshot. The corner case where the queue of files
to snapshot gets cleared when a file snapshot runs into an exception was
not addressed in that PR.

Closes #89927
Closes #89956
Labels
:Distributed/Snapshot/Restore, >enhancement, Team:Distributed, v8.5.0
Successfully merging this pull request may close these issues.

Order shards by "changes" when snapshotting