
Add direct path code path #30764

Closed
wants to merge 11 commits into from

Conversation

m-trieu
Contributor

@m-trieu m-trieu commented Mar 27, 2024

Add direct path code path

We will be able to run direct path pipelines by passing in the IsWindmillServiceDirectPathEnabled option.

  • I added some new components (they just encapsulate existing logic from StreamingDataflowWorker.java) and will backport this to StreamingDataflowWorker in a separate PR
  • The PR that added the channel caches was rolled back; I am working on reintegrating that
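As a rough sketch of what that flag check could look like (the Options interface below is a stand-in rather than Beam's actual options class; only the flag name IsWindmillServiceDirectPathEnabled comes from this PR, and the assumption that direct path only applies to Streaming Engine pipelines is inferred from the StreamingEngineDirectPathWorkerHarness name):

```java
// Sketch only: selecting the direct path based on a pipeline option.
// The Options interface is a stand-in for the real worker options class.
class HarnessSelection {
  interface Options {
    boolean isWindmillServiceDirectPathEnabled();
    boolean isEnableStreamingEngine();
  }

  // Assumption: direct path is only meaningful for Streaming Engine jobs.
  static boolean useDirectPath(Options options) {
    return options.isEnableStreamingEngine()
        && options.isWindmillServiceDirectPathEnabled();
  }
}
```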

R: @scwhittle


Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@m-trieu m-trieu force-pushed the mt-dp branch 4 times, most recently from 6a6e737 to 3c88dfa Compare April 2, 2024 07:13
@m-trieu
Contributor Author

m-trieu commented Apr 2, 2024

@scwhittle still wrapping up the unit tests, just would like an initial pass
thanks!

@m-trieu m-trieu force-pushed the mt-dp branch 8 times, most recently from 87a043e to ec12c72 Compare April 4, 2024 06:40
Contributor

github-actions bot commented Apr 4, 2024

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @Abacn added as fallback since no labels match configuration

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@codecov-commenter

codecov-commenter commented Apr 4, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.27%. Comparing base (96dc16a) to head (ec12c72).
Report is 234 commits behind head on master.

❗ Current head ec12c72 differs from pull request most recent head e60cb1e. Consider uploading reports for the commit e60cb1e to get more accurate results

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #30764      +/-   ##
============================================
+ Coverage     70.95%   71.27%   +0.32%     
+ Complexity     4470     1485    -2985     
============================================
  Files          1257      904     -353     
  Lines        140917   112898   -28019     
  Branches       4305     1076    -3229     
============================================
- Hits          99989    80471   -19518     
+ Misses        37451    30408    -7043     
+ Partials       3477     2019    -1458     


@scwhittle scwhittle self-requested a review April 4, 2024 08:31
Contributor

@scwhittle scwhittle left a comment

See the direct path harness comment first; I think further work needs to be done to reduce the duplication, because we're going to have both direct/non-direct paths for a while.

@@ -67,7 +67,7 @@ public DataflowWorkProgressUpdater(
super(worker, Integer.MAX_VALUE);
this.workItemStatusClient = workItemStatusClient;
this.workItem = workItem;
this.hotKeyLogger = new HotKeyLogger();
this.hotKeyLogger = HotKeyLogger.ofSystemClock();
Contributor

Can cleanups be moved to separate PRs? Less churn if things are reverted and easier to review and summarize with commit description.

StreamingWorkerHarness worker =
isDirectPathPipeline(options)
? StreamingEngineDirectPathWorkerHarness.fromOptions(options)
: StreamingDataflowWorker.fromOptions(options);
Contributor

how about moving the harness portions of StreamingDataflowWorker to a new StreamingSingleEndpointWorkerHarness class? (open to better name ideas).

I think that is clearer which parts of this file are non-direct path specific and which are shared.

Contributor Author

reused code across different harness impls
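The reviewer's suggestion of pulling the harness portions behind a common interface might look roughly like this. StreamingWorkerHarness and StreamingEngineDirectPathWorkerHarness appear in the PR; the method set and the single-endpoint class name are illustrative only:

```java
// Sketch of a common harness interface with one implementation per path.
// Method names and the single-endpoint class are illustrative, not the
// PR's actual API.
interface StreamingWorkerHarness {
  void start();
  void shutdown();
}

class StreamingEngineDirectPathHarness implements StreamingWorkerHarness {
  public void start() { /* open streams directly to backend workers */ }
  public void shutdown() { /* drain and close the direct streams */ }
}

class SingleEndpointHarness implements StreamingWorkerHarness {
  public void start() { /* talk to the single Windmill service endpoint */ }
  public void shutdown() { /* stop polling the single endpoint */ }
}
```

With this shape the selection site only picks an implementation, and everything downstream of it is path-agnostic.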

.build());
}

private static Stream<DirectHeartbeatRequest> toHeartbeatRequestStreamDirectPath(
Contributor

It would be nice to have a single helper used by both direct/non-direct methods since they are largely the same and could otherwise drift.

Contributor Author

Done, added HeartbeatRequests.java
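A shared helper along the lines of the HeartbeatRequests.java mentioned above might be sketched like this, with the Windmill protos replaced by simplified stand-in records so the example is self-contained:

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch: one helper builds heartbeat requests for both the direct and
// non-direct paths so the two cannot drift. ActiveWork and
// HeartbeatRequest are simplified stand-ins for the real Windmill types.
class HeartbeatRequests {
  record ActiveWork(String workToken, long startTimeMillis) {}
  record HeartbeatRequest(String workToken, long startTimeMillis) {}

  // Convert the currently active work into heartbeat requests.
  static List<HeartbeatRequest> create(List<ActiveWork> activeWork) {
    return activeWork.stream()
        .map(w -> new HeartbeatRequest(w.workToken(), w.startTimeMillis()))
        .collect(Collectors.toList());
  }
}
```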

executionStateQueue.offer(executionState);
}

public Optional<ExecutionState> getExecutionState() {
Contributor

name acquireExecutionState or pollExecutionState?
get makes it sound like a simple accessor

Contributor Author

done
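The rename makes the maybe-empty semantics visible at the call site; a minimal sketch of the queue-backed version (ExecutionState is a placeholder here for the PR's class):

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch: "poll" signals the call may come back empty, unlike a plain
// getter. ExecutionState is a placeholder for the PR's real class.
class ExecutionStatePool {
  static class ExecutionState {}

  private final ConcurrentLinkedQueue<ExecutionState> executionStateQueue =
      new ConcurrentLinkedQueue<>();

  void offer(ExecutionState state) {
    executionStateQueue.offer(state);
  }

  // Returns a cached execution state if one is available, else empty.
  Optional<ExecutionState> pollExecutionState() {
    return Optional.ofNullable(executionStateQueue.poll());
  }
}
```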

import org.slf4j.LoggerFactory;

@Internal
public final class StreamingApplianceComputationStateCacheLoader
Contributor

Let's put the config loading changes in a separate PR

Contributor Author

currently needed for code dedup since the ConfigLoaders are used in the ComputationStateCache cache loaders

LoggerFactory.getLogger(StreamingEngineDirectPathWorkerHarness.class);
// Controls processing parallelism. Maximum number of threads for processing. Currently, each
// thread processes one key at a time.
private static final int MAX_PROCESSING_THREADS = 300;
Contributor

There is too much duplication between this and the other harness, which will make it difficult to add new features (such as current PR to make processing threads dynamic).

It seems like a lot of the things: executor, metrics, reporting, cache etc are not affected by how work is obtained, committed or state fetched. It would be better if we could instead keep the logic and just inject different work obtainer, committer, state fetcher.

Or alternatively we could make everything work with direct-path by always plumbing something to use for getdata/commitwork and in the non-direct path cases just having a single one.

Contributor Author

done, reused components (mainly around config loading, computationStateCache, and work execution) across all streaming worker harness impls

@github-actions github-actions bot added core and removed core labels Apr 9, 2024
…ween harness to their own classes/files. Add different harness implementations for Dispatched, Appliance, and Direct Path streaming jobs. StreamingDataflowWorker is now just a main method
… in work processing context, guard for null dataflowServiceOptions in StreamingEngineClient
@github-actions github-actions bot added core and removed core labels Apr 17, 2024
@github-actions github-actions bot added core and removed core labels Apr 17, 2024
@github-actions github-actions bot added core and removed core labels Apr 17, 2024
@@ -257,3 +257,4 @@ checkstyleMain.enabled = false
checkstyleTest.enabled = false
//TODO(https://github.com/apache/beam/issues/19119): javadoc task should be enabled in the future.
javadoc.enabled = false
test.outputs.upToDateWhen {false}
Contributor

remove?


HotKeyLogger() {}
public static HotKeyLogger ofSystemClock() {
Contributor

just name create?

Contributor Author

done

super(stepName);
}

public static MetricsLogger createUnboundedMetricsLogger() {
Contributor

unbounded is confusing with bounded/unbounded pcollection.
How about workerMetricsLogger if it is metrics not scoped to a step?

Contributor Author

done

@@ -17,15 +17,19 @@
*/
package org.apache.beam.runners.dataflow.worker.logging;

import org.checkerframework.checker.nullness.qual.Nullable;

/** Mapped diagnostic context for the Dataflow worker. */
@SuppressWarnings({
"nullness" // TODO(https://github.com/apache/beam/issues/20497)
Contributor

can this be removed with your changes?

Contributor Author

done

Contributor Author

done

@@ -90,11 +83,11 @@ private ActiveWorkState(
}

static ActiveWorkState create(WindmillStateCache.ForComputation computationStateCache) {
return new ActiveWorkState(new HashMap<>(), computationStateCache);
return new ActiveWorkState(new ConcurrentHashMap<>(), computationStateCache);
Contributor

why is the concurrent hash map required? It seems getReadOnlyActivework is synchronized below

Contributor Author

Yeah, this was unintended; changed it back.
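For reference, the reason a plain HashMap suffices here: every access goes through synchronized methods on the enclosing class, so the map itself never sees concurrent access. A condensed sketch of that pattern (not ActiveWorkState's actual fields):

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Sketch: a plain HashMap is safe when the enclosing class does all the
// locking via synchronized methods, as ActiveWorkState does. Fields and
// types here are simplified stand-ins.
class ActiveWorkStateSketch {
  private final Map<String, Queue<String>> activeWork = new HashMap<>();

  synchronized void activateWork(String computationKey, String workToken) {
    activeWork.computeIfAbsent(computationKey, k -> new ArrayDeque<>())
        .add(workToken);
  }

  synchronized int activeCount(String computationKey) {
    Queue<String> queue = activeWork.get(computationKey);
    return queue == null ? 0 : queue.size();
  }
}
```

Swapping in a ConcurrentHashMap on top of the synchronized methods would add cost without adding safety.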


Windmill.GetConfigResponse response = maybeResponse.get();

// The max work item commit bytes should be modified to be dynamic once it is available in
Contributor

odd spot for this comment, the global id has this field but not individual computations.

workUnitExecutor,
transformUserNameToStateFamilyByComputationId.getOrDefault(
computationId, ImmutableMap.of()),
perComputationStateCacheViewFactory.apply(computationId));
Contributor

it seems like the appliance and SE differ just in how they get the map task and username->sf map.

What if we had an interface instead of StreamingConfigLoader<Windmill.GetConfigResponse> that can vend the MapTask and usertransform to statefamilymap for a computation?

Then the ComputationStateCacheLoader could just be a final class taking that loader. The StreamingEngineConfigLoader could implement that interface as well as the other one for the dynamic config and the appliance could just implement that.

I'm not sure the StreamingConfigLoader templated interface is buying us much since only the SE needs the background threads.
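The interface suggested here, vending per-computation config so the cache loader no longer cares about appliance vs. Streaming Engine, might be sketched as follows (all names and types are illustrative; MapTask is reduced to a String stand-in):

```java
import java.util.Map;
import java.util.Optional;

// Sketch of the suggested split: the ComputationStateCacheLoader would
// depend only on this narrow interface, and both the SE and appliance
// config loaders would implement it. Names are illustrative.
interface ComputationConfigFetcher {
  // The real code would hold a MapTask proto; a String stands in here.
  record ComputationConfig(
      String mapTask, Map<String, String> userTransformToStateFamily) {}

  // Fetch the config for one computation, empty if it is unknown.
  Optional<ComputationConfig> getConfig(String computationId);
}
```

Since the interface has a single abstract method, each backend can supply it as a small lambda or class, and only the Streaming Engine implementation needs the background refresh threads.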

// the request.
for (Windmill.GetConfigResponse.SystemNameToComputationIdMapEntry entry :
response.getSystemNameToComputationIdMapList()) {
systemNameToComputationIdMap.put(entry.getSystemName(), entry.getComputationId());
Contributor

can you get rid of this?
It seems we just request a single item, so should we just get a single response back? Can we then just pass the computationId to createComputationState instead of trying to map back from the response?

Contributor Author

Is there a reason why this was previously being done? It looks like there might be multiple map tasks returned from GetConfigResponse, and we create a different ComputationState based on each MapTask.


public abstract Builder setGetDataStream(GetDataStream value);

abstract WorkProcessingContext autoBuild();
Contributor

private?

Contributor Author

autoBuild needs to be abstract since it is handled by autovalue codegen

Contributor

could it be protected then? we don't want this to be called directly correct?

Contributor Author

will try to use protected and see if it works
previously was following this guide https://github.com/google/auto/blob/main/value/userguide/builders-howto.md#-validate-property-values
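The pattern from that AutoValue guide keeps autoBuild() non-public and validates in build(); stripped of the codegen, the shape is roughly this (the subclass below hand-writes what AutoValue would generate, and the types are simplified stand-ins):

```java
// Sketch of the validate-in-build() pattern from the AutoValue builders
// how-to, with the generated code replaced by a hand-written subclass so
// the example is self-contained. Types are simplified stand-ins.
abstract class WorkProcessingContextBuilder {
  record BuiltContext(String workToken) {}

  // In AutoValue this is implemented by generated code, so it must stay
  // abstract and visible to the subclass (package-private or protected).
  abstract BuiltContext autoBuild();

  // Public entry point: validate the properties, then delegate.
  final BuiltContext build() {
    BuiltContext context = autoBuild();
    if (context.workToken() == null) {
      throw new IllegalStateException("workToken is required");
    }
    return context;
  }
}

class SimpleBuilder extends WorkProcessingContextBuilder {
  private String workToken;

  SimpleBuilder setWorkToken(String token) {
    this.workToken = token;
    return this;
  }

  @Override
  BuiltContext autoBuild() {
    return new BuiltContext(workToken);
  }
}
```

Package-private works for autoBuild() when the generated subclass lives in the same package; protected is the alternative when it does not.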

FileSystems.setDefaultPipelineOptions(options);
}

public static ApplianceWorkerHarness fromOptions(DataflowWorkerHarnessOptions options) {
Contributor

At a high-level the difference between appliance/dispatcher/direct paths seem to be limited to:

  • how to get the config
  • how to get/schedule WorkProcessingContext (which then knows how to fetch/commit/heartbeat to right worker).

I think that moving things out of StreamingWorkerHarness into classes can help testing/readability but that there is still a lot of duplicated setup between the different harnesses which will make it harder to maintain.

Can we instead have a shared StreamingDataflowWorker that uses the objects and which just minimal injection differences are made when constructing for appliance/fanout/direct?

Contributor Author

sgtm, will redo some of the components to manage that
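The shared-worker idea above, where one worker class takes only the path-specific pieces by injection, might be sketched as follows (the WorkProvider/WorkCommitter interfaces and names are illustrative, not the PR's actual types):

```java
// Sketch: a single shared worker, with only the path-specific behavior
// injected. Appliance, dispatcher, and direct path would each supply
// their own WorkProvider/WorkCommitter; names are illustrative.
class StreamingWorkerSketch {
  interface WorkProvider { String getWork(); }
  interface WorkCommitter { void commit(String result); }

  private final WorkProvider workProvider;
  private final WorkCommitter workCommitter;

  StreamingWorkerSketch(WorkProvider provider, WorkCommitter committer) {
    this.workProvider = provider;
    this.workCommitter = committer;
  }

  // Shared processing loop body; executor, metrics, caches, etc. would
  // all live here, unaffected by how work is obtained or committed.
  void processOne() {
    String work = workProvider.getWork();
    workCommitter.commit("processed:" + work);
  }
}
```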

Contributor

Thanks! if you see ways to break up this change that would be great too. It's really big and github reviews are particularly painful. If we can just pull out various things from StreamingDataflowWorker to their own files/tests before getting to the direct path stuff I think it will help reviewing. I will try to stay on top of the reviews so hopefully we don't get bogged down too much due to more PRs.

Contributor Author

sgtm!

@github-actions github-actions bot added core and removed core labels Apr 29, 2024
Contributor

github-actions bot commented May 6, 2024

Reminder, please take a look at this pr: @Abacn

@scwhittle
Contributor

@Abacn not sure how to turn off the reminders but this doesn't require a look. It is being broken apart into smaller PRs which I am reviewing.

@scwhittle
Contributor

@m-trieu Should we close this one for now? We can reopen after rebasing on top of the broken off PRs

Contributor

Reminder, please take a look at this pr: @Abacn
