[FLINK-10887] [jobmaster] Add source watermark tracking to the JobMaster #7099

jgrier · 2018-11-15T00:21:35Z

What is the purpose of the change

This commit adds a JobMaster RPC endpoint that is used to for global information sharing. One use case will be event time source synchronization where it will be used to share watermarks but there are others. It takes the form of a set of named aggregates that can be updated by a client-supplied AggregateFunction.

Note that the RPC endpoint accepts a serialized AggregateFunction in the form of a byte array. We need to do this so that we can deserialize this using the UserCodeClassLoader. The normal RpcService path does not use the UserCodeClassLoader nor is there any easy way to make it do so.

This PR also includes the code/wiring neccessary to expose this functionality to user functions via the StreamingRuntimeEnvironment.

The PR seems large but it is mostly wiring. To quickly assess the changes I suggest looking at the following classes:

GlobalAggregateManager (to understand the API)
RpcGlobalAggregateManager (to see the client-side RPC with the JobMaster)
JobMaster / JobMasterGateway (server side implementation of the above)
GlobalAggregateManagerITCase (for typical usage from user code)

Most of the rest of the PR is just wiring it all up.

Brief change log

New RPC endpoint on JobMaster to create, update, and retrieve named aggregates.
Updated JobMaster Tests
Client side exposure of above endpoint via the StreamingRuntimeEnvironment and GlobalAggregateManager classes.
Integration test exercising typical usage from user code.

Verifying this change

This change added tests and can be verified as follows:

JobMasterTest
GlobalAggregateManagerITCase

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): No
The public API, i.e., is any changed class annotated with @Public(Evolving): Yes
The serializers: No
The runtime per-record code paths (performance sensitive): No
Anything that affects deployment or recovery: No
The S3 file system connector: No

Documentation

Does this pull request introduce a new feature? No
If yes, how is the feature documented? not applicable

tweise · 2018-11-15T05:13:48Z

flink-runtime/src/main/java/org/apache/flink/runtime/watermark/SourceWatermark.java

+public class SourceWatermark implements Serializable {
+
+	private static final long serialVersionUID = 1L;
+	private long timestamp;


What does the timestamp represent? Is it when the watermark last changed or when it was last communicated by the subtask (even if it did not change, for example because the subtask is just reading a lot of data under the same watermark). We will need a way to detect that a source subtask is idle so we can avoid waiting for it (similar to how we has to identify idle within a subtask).

The timestamp here is meant to represent the watermark itself -- the current low watermark for the sub-task that sent it.

I do agree, however, that we will also need to know at what time the watermark was sent so that we can ignore it if it hasn't been updated in some configurable amount of time.

Very good point.

tweise · 2018-11-15T05:17:33Z

flink-runtime/src/main/java/org/apache/flink/runtime/watermark/SourceWatermark.java

+/**
+ * This represents the watermark for a single source partition.
+ */
+public class SourceWatermark implements Serializable {


I wonder if it should be qualified as SourceWatermark vs. just Watermark? Perhaps there are use cases for exchanging watermarks across subtasks that don't necessarily belong to a source. One such example could be operators that perform asynchronous operations. Related, do we want to allow for an identifier for the watermark so that within an application multiple independent groupings could be formed?

This may be a bit far fetched, but can it be generalized further to something like a named counter/metric? Currently there isn't anything watermark specific here?

One problem is that there is already a Watermark class, but I agree with Thomas' comment. In the future, not all "sources" might be actual physical sources in the pipeline.

Yeah, my intention was to keep this very focused on the exact use case at hand -- to provide simple state sharing for watermarks in the service of the source synchronization effort. This is why the very specific naming and lack of additional features like namespaces, etc.

If we were to generalize this more it would be good to understand some other specific use cases -- and also to consider whether it's important to tackle that here or just go with the simplest interface we need for the task at hand.

@tweise @aljoscha If we do something more general what are you thinking? Something more like a hash table or a collection of namespaced hashtables? Would we need to make the key and value types generic, etc? Would we want to then distribute the entire hashtable to every sub-task?

I could imagine scenarios where different sources have different synchronization. That could be supported with a grouping mechanism for the tasks that participate in the watermark sync. The RPC would pass the group/namespace identifier as additional parameter and only get back the watermark for that (hash table would remain internal).

Sounds good. Will update shortly.

StephanEwen · 2018-11-25T22:15:33Z

I think this is a very nice feature, +1 to have this.

We have seen other use cases that need a similar mechanism, so I am wondering if we can generify this to a some transient aggregator. One of those use case would need the max across all values and is otherwise almost the same.

jgrier · 2018-12-03T15:00:45Z

Sorry I haven't responded to this. We had a baby boy this week so that has kept me pretty busy ;)

Okay, so I'm on board with generifying this further. @StephanEwen if we're to do a generic transient aggregator do you mean to allow the client to provide the aggregation function? In this case the API would look something like this:

/**
 * Update the aggregate and return the new value.
 *
 * @param aggregateName The name of the aggregate to update
 * @param aggregand The value to add to the aggregate
 * @param aggregationFunction The function to apply to the current aggregate and aggregand to obtain the new aggregate value
 * @return The updated aggregate
CompletableFuture<Object> updateAggregate(
      String aggregateName,
      Object aggregand,
      AggregateFunction aggregationFunction);

Is something like this what you had in mind?

jgrier · 2018-12-06T15:19:49Z

@aljoscha @tweise Can you guys comment on the above generic aggregator proposal? I'd like to keep this moving forward.

aljoscha · 2018-12-07T10:27:47Z

Can we always assume that the user-jar/class loader will be available where the AggregateFunction is needed? If yes, I think this is a nice approach! (We can probably use concrete types in the interface, though)

aljoscha · 2018-12-07T10:28:05Z

Also, congratulations, I guess! 🎉

jgrier · 2018-12-07T18:07:09Z

Can we always assume that the user-jar/class loader will be available where the AggregateFunction is needed? If yes, I think this is a nice approach! (We can probably use concrete types in the interface, though)

@aljoscha I'm actually not sure if the user code classloader is available from the JobMaster but I would think that's reasonable since there's a 1:1 relationship between the JobMaster and a single job.

WRT concrete types in the RPC interface I'm not sure what you're thinking there. The concrete types are not known in this approach. The types are up to the user/client and can be different for each named aggregate.

tweise · 2018-12-07T18:14:34Z

@aljoscha @tweise Can you guys comment on the above generic aggregator proposal? I'd like to keep this moving forward.

@jgrier +1 as suggested earlier. We just need to confirm that the class of aggregationFunction is available.

jgrier · 2019-01-25T02:13:21Z

@tweise Have a look.. Also, I didn't ignore the input about "idleness" issues but that will be handled by the particular aggregation function used. For the example of event time source sync we will want to do this but not in general.

tweise

@jgrier looks good! I also think the watermark and timeout specific logic can be handled with an aggregation function that retains the latest entry for each subtask ID, just like we do in the ZK based implementation.

tweise · 2019-01-28T19:30:04Z

flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java

+
+		AggregateFunction aggregateFunction = InstantiationUtil.deserializeObject(serializedAggregateFunction, userCodeLoader);
+
+		Object accumulator = accumulators.get(aggregateName);


Needs synchronization? Or is all access to JobMaster already synchronized?

I believe it's already all synchronized. The RpcService is implemented as as a single Akka actor and thus access is already serialized.

Also: https://github.com/jgrier/flink/blob/d9b28e817351eb2eb6b4cdd9597061713d9160e8/flink-runtime/src/main/java/org/apache/flink/runtime/rpc/RpcEndpoint.java#L47-L47

jgrier · 2019-01-31T16:56:45Z

@tweise @aljoscha @StephanEwen I think this is in a good state for final review and merge. Take a look when you get the chance please.

…to share information across source subtasks. This will be used implement things like event time source synchronization across sources. This functionality can be accessed from user code via the StreamingRuntimeEnvironment.

tweise · 2019-02-01T16:29:24Z

@jgrier thanks for the update, will take a look soon.

Meanwhile, we will put this to work internally.

Summary: This closes apache#7662. [FILNK-11597][test] Remove legacy JobManagerActorTestUtils (apache#7700) [FLINK-11081][rest] Support server port range [hotfix][runtime] Fix mistake in RestfulGateway and JobMasterGateway Javadoc Method requestOperatorBackPressureStats never returns a future that completes with null. [FLINK-11578][runtime] Expose MiniCluster#getDispatcherGatewayFuture for testing [FLINK-11578][tests] Port BackPressureStatsTrackerImplITCase to new code base This closes apache#7697. [FLINK-11626][build] Bump flink-shaded to 6.0 [FLINK-11628][travis] Cache maven [FLINK-11424][metrics] Properly remove string/failing gauges [hotfix] Reorder StandaloneJobClusterEntryPoint class members [hotfix] Improve error message in JobID.fromHexString [FLINK-11545] [container] Add job ID to StandaloneJobClusterConfiguration [FLINK-11545] [container] Add null checks and reduce visibility in StandaloneJobClusterConfiguration Fix line breaks in StandaloneJobClusterConfiguration [FLINK-11545] [container] Parse job ID in StandaloneJobClusterConfigurationParserFactory [pr-review] Make getJobId static [pr-review] Add expected format and example to error message Fix checkstyle Don't wrap JobID.fromHexString error message in StandaloneJobClusterConfigurationParserFactory [pr-review] Add test for short options [FLINK-11545] [container] Pass job ID to ClassPathJobGraphRetriever [FLINK-11545] [container] Catch Exception before exiting with error [FLINK-11545] [container] Add docs for job-id argument [hotfix][tests] Remove mocking from ResourceManagerJobMasterTest [hotfix][tests] Remove mocking from ResourceManagerTaskExecutorTest [hotfix][tests] Refactor ResourceManagerTest to use Before and After methods [FLINK-11596][test] Remove legacy ResourceManagerTest [FLINK-11596][tests] Add ResourceManagerTaskExecutorTest#testDisconnectTaskExecutor [FLINK-11596][tests] Add heartbeat timeout test to ResourceManagerTest - Add ResourceManagerTest#testHeartbeatTimeoutWithJobMaster - Add ResourceManagerTest#testHeartbeatTimeoutWithTaskExecutor This closes apache#7698. [hotfix][travis] Remove stray slash [hotfix][build] Remove hard-coded scala version [FLINK-11154][network] Bump Netty to 4.1.32 Notable changes since 4.1.24: - big improvements (performance, feature set) for using openSSL based SSL engine (useful for FLINK-9816) - allow multiple shaded versions of the same netty artifact (as long as the shaded prefix is different) - Ensure ByteToMessageDecoder.Cumulator implementations always release - Don't re-arm timerfd each epoll_wait - Use a non-volatile read for ensureAccessible() whenever possible to reduce overhead and allow better inlining. - Do not fail on runtime when an older version of Log4J2 is on the classpath - Fix leak and corruption bugs in CompositeByteBuf - Add support for TLSv1.3 - Harden ref-counting concurrency semantics - bug fixes - Java 9-12 related fixes - no license changes - no changes in Netty's NOTICE file [FLINK-11577][tests, runtime] Improve test coverage of stack trace sampling in TM - Extract stack trace sampling logic to StackTraceSampleService - Add unit tests for StackTraceSampleService [FLINK-11577][tests] Delete obsolete test StackTraceSampleCoordinatorITCase - Test could silently fail - Test had no assertions - Test is superseded by StackTraceSampleServiceTest [FLINK-10887] [jobmaster] Add global aggregate tracking to the JobMaster (apache#7099) This adds a JobMaster RPC endpoint that is used to share information across source subtasks. This will be used implement things like event time source synchronization across sources. This functionality can be accessed from user code via the StreamingRuntimeEnvironment. init commit without migration test Differential Revision: https://aone.alibaba-inc.com/code/D839631

tweise reviewed Nov 15, 2018

View reviewed changes

aljoscha requested a review from tillrohrmann November 16, 2018 10:20

tweise reviewed Jan 28, 2019

View reviewed changes

jgrier force-pushed the FLINK-10887-wm-jm branch from 3200d5e to d4a1ef5 Compare January 31, 2019 19:38

This was referenced Jan 31, 2019

Backport [FLINK-10887] Add JobMaster RPC endpoint that is used to share information across source subtasks. #7625

Closed

Backport [FLINK-10887] Add JobMaster RPC endpoint that is used to share information across source subtasks. lyft/flink#22

Merged

jgrier force-pushed the FLINK-10887-wm-jm branch from d4a1ef5 to efc7627 Compare January 31, 2019 22:08

jgrier closed this Feb 1, 2019

jgrier reopened this Feb 1, 2019

tweise approved these changes Feb 16, 2019

View reviewed changes

tweise added 2 commits February 16, 2019 15:49

Merge branch 'master' into FLINK-10887-wm-jm

89d3435

Fix whitespace

5ba18a9

tweise merged commit 32c822a into apache:master Feb 17, 2019

rmetzger added the component=Runtime/Coordination label Mar 18, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FLINK-10887] [jobmaster] Add source watermark tracking to the JobMaster #7099

[FLINK-10887] [jobmaster] Add source watermark tracking to the JobMaster #7099

jgrier commented Nov 15, 2018 •

edited

Loading

tweise Nov 15, 2018

jgrier Nov 15, 2018

tweise Nov 15, 2018

tweise Nov 15, 2018

aljoscha Nov 15, 2018

jgrier Nov 15, 2018

tweise Nov 21, 2018

jgrier Nov 21, 2018

StephanEwen commented Nov 25, 2018

jgrier commented Dec 3, 2018 •

edited

Loading

jgrier commented Dec 6, 2018 •

edited

Loading

aljoscha commented Dec 7, 2018

aljoscha commented Dec 7, 2018

jgrier commented Dec 7, 2018

tweise commented Dec 7, 2018

jgrier commented Jan 25, 2019

tweise left a comment

tweise Jan 28, 2019

jgrier Jan 29, 2019

jgrier Jan 29, 2019

jgrier commented Jan 31, 2019

tweise commented Feb 1, 2019


		AggregateFunction aggregateFunction = InstantiationUtil.deserializeObject(serializedAggregateFunction, userCodeLoader);

		Object accumulator = accumulators.get(aggregateName);

[FLINK-10887] [jobmaster] Add source watermark tracking to the JobMaster #7099

[FLINK-10887] [jobmaster] Add source watermark tracking to the JobMaster #7099

Conversation

jgrier commented Nov 15, 2018 • edited Loading

What is the purpose of the change

Brief change log

Verifying this change

Does this pull request potentially affect one of the following parts:

Documentation

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

StephanEwen commented Nov 25, 2018

jgrier commented Dec 3, 2018 • edited Loading

jgrier commented Dec 6, 2018 • edited Loading

aljoscha commented Dec 7, 2018

aljoscha commented Dec 7, 2018

jgrier commented Dec 7, 2018

tweise commented Dec 7, 2018

jgrier commented Jan 25, 2019

tweise left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jgrier commented Jan 31, 2019

tweise commented Feb 1, 2019

jgrier commented Nov 15, 2018 •

edited

Loading

jgrier commented Dec 3, 2018 •

edited

Loading

jgrier commented Dec 6, 2018 •

edited

Loading