
Async task client for SeekableStreamSupervisors. #13354

Merged
merged 10 commits into apache:master from sss-chat-async
Nov 21, 2022

Conversation


@gianm gianm commented Nov 11, 2022

Motivations:

  • Contact all tasks at once, instead of using synchronous communication in chatThreads. This shortens the time it takes to execute RunNotices when there are many tasks, as well as other operations that require contacting all tasks, such as getting task reports.
  • Eliminate the need for chatThreads.
  • Reduce demands on workerThreads, mainly due to the restructuring of discoverTasks.

Main changes:

  1. Convert SeekableStreamIndexTaskClient to an interface, move old code
    to SeekableStreamIndexTaskClientSyncImpl, and add new implementation
    SeekableStreamIndexTaskClientAsyncImpl that uses ServiceClient (see the
    sketch after this list).

  2. Add "chatAsync" parameter to seekable stream supervisors that causes
    the supervisor to use an async task client.

  3. In SeekableStreamSupervisor.discoverTasks, adjust logic to avoid making
    blocking RPC calls in workerExec threads.

  4. In SeekableStreamSupervisor generally, switch from Futures.successfulAsList
    to FutureUtils.coalesce, so we can better capture the errors that occur
    when contacting individual tasks.
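
As a rough sketch of the shape this refactor produces (the method list is abridged and the signatures are assumptions based on the description above, not the exact Druid interface):

    import com.google.common.util.concurrent.ListenableFuture;
    import java.util.Map;

    // Sketch only: abridged method set with assumed signatures.
    // SeekableStreamIndexTaskRunner.Status is Druid's task status enum.
    public interface SeekableStreamIndexTaskClient<PartitionIdType, SequenceOffsetType>
    {
      ListenableFuture<SeekableStreamIndexTaskRunner.Status> getStatusAsync(String taskId);

      ListenableFuture<Map<PartitionIdType, SequenceOffsetType>> getCurrentOffsetsAsync(String taskId, boolean retry);

      ListenableFuture<Boolean> stopAsync(String taskId, boolean publish);
    }

    // The pre-existing blocking code moves to SeekableStreamIndexTaskClientSyncImpl, and the new
    // SeekableStreamIndexTaskClientAsyncImpl implements the same interface on top of ServiceClient.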

Other, related changes:

  1. Add ServiceRetryPolicy.retryNotAvailable, which controls whether
    ServiceClient retries unavailable services. Useful since we do not
    want to retry calls to unavailable tasks within the service client. (The
    supervisor does its own higher-level retries.)

  2. Add FutureUtils.transformAsync, a more lambda-friendly version of
    Futures.transform(f, AsyncFunction).

  3. Add FutureUtils.coalesce. Similar to Futures.successfulAsList, but
    returns Either instead of using null on error (see the sketch after
    this list).

  4. Add JacksonUtils.readValue overloads for JavaType and TypeReference.
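
To illustrate item 3, here is a minimal sketch of the coalesce semantics, written directly against Guava. Either here refers to org.apache.druid.java.util.common.Either; the sketch shows the intended behavior and is not the actual Druid implementation:

    import com.google.common.util.concurrent.Futures;
    import com.google.common.util.concurrent.ListenableFuture;
    import com.google.common.util.concurrent.MoreExecutors;
    import org.apache.druid.java.util.common.Either;

    import java.util.ArrayList;
    import java.util.List;

    public final class CoalesceSketch
    {
      // Like Futures.successfulAsList, but each element carries either the value or the
      // Throwable that failed it, instead of null on error.
      public static <T> ListenableFuture<List<Either<Throwable, T>>> coalesce(
          final List<ListenableFuture<T>> futures
      )
      {
        final List<ListenableFuture<Either<Throwable, T>>> wrapped = new ArrayList<>(futures.size());

        for (final ListenableFuture<T> future : futures) {
          // Success -> Either.value(result); failure -> Either.error(throwable). Every wrapped
          // future now "succeeds", so the combined future below never drops an individual error.
          final ListenableFuture<Either<Throwable, T>> valued =
              Futures.transform(future, v -> Either.<Throwable, T>value(v), MoreExecutors.directExecutor());
          wrapped.add(
              Futures.catching(valued, Throwable.class, t -> Either.<Throwable, T>error(t), MoreExecutors.directExecutor())
          );
        }

        return Futures.allAsList(wrapped);
      }
    }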


gianm commented Nov 11, 2022

In this patch, chatAsync defaults to false. I would like to change the default to true after some more testing. After a release cycle, if the async client continues to work out well, I'd like to remove the sync client.

@gianm gianm added this to the 25.0 milestone Nov 14, 2022
@kfaraz kfaraz left a comment

Thanks for the improvement @gianm, the changes look good.
I have added some minor suggestions.

The only thing I am not clear on is the CONNECT_EXEC_THREADS being hardcoded to 4. Since these threads are all meant to do network IO and would basically remain blocked until a response is received, wouldn't we benefit from a higher/configurable number to allow more concurrency when chatting to multiple tasks? Or maybe I have misunderstood how the CONNECT_EXEC_THREADS are used by the ServiceClient.

* Like {@link Futures#transform(ListenableFuture, AsyncFunction)}, but works better with lambdas due to not having
* overloads.
*
* One can write {@code FutureUtils.transform(future, v -> ...)} instead of

Suggested change
* One can write {@code FutureUtils.transform(future, v -> ...)} instead of
* One can write {@code FutureUtils.transformAsync(future, v -> ...)} instead of

}

/**
* Like {@link Futures#successfulAsList}, but returns {@link Either} instead of using {@code} null in case of error.

Suggested change
* Like {@link Futures#successfulAsList}, but returns {@link Either} instead of using {@code} null in case of error.
* Like {@link Futures#successfulAsList}, but returns {@link Either} instead of using {@code null} in case of error.


@JsonProperty("chatAsync")
@JsonInclude(JsonInclude.Include.NON_NULL)
Boolean getChatAsyncConfigured()
@kfaraz:

Do we want to retain the boxed value in order to have the desired behaviour when we switch the default value of chatAsync to true?

Style-wise, I think it might be simpler, and more similar to the other fields, if we just named this method getChatAsync() and marked it as @JsonProperty, with the other one just being called chatAsync.

@gianm:

I did it this way because I didn't want chatAsync to appear in serialized tuningConfigs unless it was actually set by the user. And, I wanted us to be able to change the default in the future, and have everyone get the new default if they hadn't explicitly set this. Let me know what you think of that rationale.
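
For illustration, a minimal sketch of the pattern being discussed follows; the class name, constructor shape, and default constant are illustrative, not the exact Druid code:

    import com.fasterxml.jackson.annotation.JsonInclude;
    import com.fasterxml.jackson.annotation.JsonProperty;

    public class TuningConfigSketch
    {
      private static final boolean DEFAULT_CHAT_ASYNC = false;

      // null means "not explicitly configured by the user".
      private final Boolean chatAsync;

      public TuningConfigSketch(@JsonProperty("chatAsync") final Boolean chatAsync)
      {
        this.chatAsync = chatAsync;
      }

      // Serialized form: written only when the user actually set the property, so a future
      // change to DEFAULT_CHAT_ASYNC takes effect for everyone who left it unset.
      @JsonProperty("chatAsync")
      @JsonInclude(JsonInclude.Include.NON_NULL)
      Boolean getChatAsyncConfigured()
      {
        return chatAsync;
      }

      // Resolved value used by the supervisor.
      public boolean getChatAsync()
      {
        return chatAsync != null ? chatAsync : DEFAULT_CHAT_ASYNC;
      }
    }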

{
return Futures.transform(
taskClient.getStatusAsync(taskId),
new AsyncFunction<SeekableStreamIndexTaskRunner.Status, Pair<SeekableStreamIndexTaskRunner.Status, Map<PartitionIdType, SequenceOffsetType>>>()
@kfaraz:

Suggestion: Maybe use the new FutureUtils.transform here and create a new class for the Pair<Status, Map> to make this more readable.

@gianm:

Hmm. It's only used in one place and a new class would be a bunch of boilerplate. I tried switching to FutureUtils.transformAsync and adding javadocs. Let me know if you think it's clear enough.
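
For reference, the lambda-friendly form reads roughly like this (a sketch: the surrounding class is generic in PartitionIdType and SequenceOffsetType, and the helper name getStatusAndOffsets is hypothetical):

    // Sketch of the chained call using FutureUtils.transformAsync and FutureUtils.transform;
    // not the final Druid code.
    private ListenableFuture<Pair<SeekableStreamIndexTaskRunner.Status, Map<PartitionIdType, SequenceOffsetType>>>
        getStatusAndOffsets(final String taskId)
    {
      return FutureUtils.transformAsync(
          taskClient.getStatusAsync(taskId),
          status -> FutureUtils.transform(
              taskClient.getCurrentOffsetsAsync(taskId, false),
              offsets -> Pair.of(status, offsets)
          )
      );
    }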

@kfaraz:

Makes sense, thanks.

{
final ServiceRetryPolicy retryPolicy = makeRetryPolicy(taskId, retry);
final SeekableStreamTaskLocator locator = new SeekableStreamTaskLocator(taskInfoProvider, taskId);
return serviceClientFactory.makeClient(taskId, locator, retryPolicy);
@kfaraz:

Would this create a new client on every call? Would it make sense to cache this to help with the preferred location in case of redirects?

@gianm:

I did it this way because I didn't want to deal with figuring out how long to keep the clients for, and they're pretty cheap to create. The only state they have is the cache of redirects. But, tasks don't do redirects — only leadery things like Coordinators and Overlords do that. So I think it's OK. I added some comments:

      // We're creating a new locator for each request and not closing it. This is OK, since SeekableStreamTaskLocator
      // is stateless, cheap to create, and its close() method does nothing.
      final SeekableStreamTaskLocator locator = new SeekableStreamTaskLocator(taskInfoProvider, taskId);

      // We're creating a new client for each request. This is OK, clients are cheap to create and do not contain
      // state that is important for us to retain across requests. (The main state they retain is preferred location
      // from prior redirects; but tasks don't do redirects.)
      return serviceClientFactory.makeClient(taskId, locator, retryPolicy);

@kfaraz:

Thanks!


gianm commented Nov 21, 2022

> The only thing I am not clear on is the CONNECT_EXEC_THREADS being hardcoded to 4. Since these threads are all meant to do network IO and would basically remain blocked until a response is received, wouldn't we benefit from a higher/configurable number to allow more concurrency when chatting to multiple tasks? Or maybe I have misunderstood how the CONNECT_EXEC_THREADS are used by the ServiceClient.

The idea is that they aren't doing very much, just handling scheduled retries, so we don't need very many. I'm not sure if 4 threads is always going to be enough but I don't have a better idea about what to set it to. I did do some testing with MSQ shuffle (another user of this) and found that it was not limiting to set this to 4, even when sending a lot of data around. I figure that if we ever discover it isn't enough, we could adjust it at that time.
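
To illustrate why a small pool can suffice: the connect-exec threads never block on the network. The I/O itself is asynchronous, and these threads only run short callbacks that schedule the next attempt. A rough sketch of that pattern, with all names hypothetical (this is not Druid's actual ServiceClient code):

    import com.google.common.util.concurrent.FutureCallback;
    import com.google.common.util.concurrent.Futures;
    import com.google.common.util.concurrent.ListenableFuture;
    import com.google.common.util.concurrent.MoreExecutors;
    import com.google.common.util.concurrent.SettableFuture;

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public class RetrySchedulingSketch
    {
      // A small pool is fine: tasks submitted here finish in microseconds, since they only
      // schedule the next attempt rather than perform I/O themselves.
      private final ScheduledExecutorService connectExec = Executors.newScheduledThreadPool(4);

      public <T> void sendWithRetries(
          final AsyncCall<T> call,
          final SettableFuture<T> result,
          final int attempt,
          final int maxAttempts
      )
      {
        final ListenableFuture<T> f = call.go(); // non-blocking network I/O; no thread waits here
        Futures.addCallback(
            f,
            new FutureCallback<T>()
            {
              @Override
              public void onSuccess(final T value)
              {
                result.set(value);
              }

              @Override
              public void onFailure(final Throwable t)
              {
                if (attempt >= maxAttempts) {
                  result.setException(t);
                } else {
                  // The only work connectExec ever does: re-issue the request after a backoff.
                  connectExec.schedule(
                      () -> sendWithRetries(call, result, attempt + 1, maxAttempts),
                      100L << attempt,
                      TimeUnit.MILLISECONDS
                  );
                }
              }
            },
            MoreExecutors.directExecutor()
        );
      }

      public interface AsyncCall<T>
      {
        ListenableFuture<T> go();
      }
    }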


kfaraz commented Nov 21, 2022

> The idea is that they aren't doing very much, just handling scheduled retries, so we don't need very many. I'm not sure if 4 threads is always going to be enough but I don't have a better idea about what to set it to. I did do some testing with MSQ shuffle (another user of this) and found that it was not limiting to set this to 4, even when sending a lot of data around. I figure that if we ever discover it isn't enough, we could adjust it at that time.

Thanks for the clarification. Since this has been working fine on a cluster with ~1k tasks, I guess it'll hold up.

@kfaraz kfaraz merged commit bfffbab into apache:master Nov 21, 2022
gianm added a commit to gianm/druid that referenced this pull request Dec 5, 2022
This functionality was originally added in apache#13354.
gianm added a commit that referenced this pull request Dec 6, 2022
This functionality was originally added in #13354.
@gianm gianm deleted the sss-chat-async branch May 3, 2023 04:36