[FLINK-9295][kafka] Fix transactional.id collisions for FlinkKafkaProducer011 #5977

pnowojski · 2018-05-09T12:17:47Z

Previously if there were two completely independent FlinkKafkaProducer011 data sinks in the job graph, their transactional.id would collide with one another. Fix is to use operator's unique ID as well along task name and subtask id.

In order to do that, operator's unique ID has to be exposed to UDF via RuntimeContext.

This change is backward compatible for recovering from older savepoints, since transactional.ids generated by the old generator still will be used after restoring from state.

Brief change log

Please check individual commit messages

Verifying this change

This change adds a new FlinkKafaProducer011 test case that covers bug fix.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (yes / no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
The serializers: (yes / no / don't know)
The runtime per-record code paths (performance sensitive): (yes / no / don't know)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / no / don't know)
The S3 file system connector: (yes / no / don't know)

Documentation

Does this pull request introduce a new feature? (yes / no)
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

zentol · 2018-05-10T06:32:21Z

flink-streaming-java/src/test/java/org/apache/flink/streaming/util/MockStreamConfig.java

+	public MockStreamConfig() {
+		super(new Configuration());
+
+		setTypeSerializerIn1(IntSerializer.INSTANCE);


not sure about this one. I would expect a MockStreamConfig to be generally usable, but this one can (unexpectedly) mess things up in regards to serializers. I would remove this line and replace the usage in AsyncWaitOperatorTestwith

StreamConfig streamConfig = new MockStreamConfig(); streamConfig.setTypeSerializerIn1(IntSerializer.INSTANCE);

Good point, fixed.

StephanEwen

Good fix, with two comments. Otherwise good to merge...

StephanEwen · 2018-05-11T14:56:06Z

...ka-0.11/src/main/java/org/apache/flink/streaming/connectors/kafka/FlinkKafkaProducer011.java

@@ -837,7 +837,7 @@ public void initializeState(FunctionInitializationContext context) throws Except
 		nextTransactionalIdHintState = context.getOperatorStateStore().getUnionListState(
 			NEXT_TRANSACTIONAL_ID_HINT_DESCRIPTOR);
 		transactionalIdsGenerator = new TransactionalIdsGenerator(
-			getRuntimeContext().getTaskName(),
+			getRuntimeContext().getTaskName() + "-" + getRuntimeContext().getOperatorUniqueID(),


You could probably use only the Operator ID here - the task name does not add to the uniqueness. Unless the purpose of the task name is "human readability" in log files or metrics.

Yes, I wanted to keep the task name for the readability reasons.

StephanEwen · 2018-05-11T14:58:39Z

flink-core/src/main/java/org/apache/flink/api/common/functions/RuntimeContext.java

+	 * @return String representation of the operator's unique id.
+	 */
+	@PublicEvolving
+	String getOperatorUniqueID();


Rather than adding this here and failing for DataSet programs, hoe about adding this to StreamingRuntimeContext and casting inside the Kafka Producer? Not super pretty, but nicer than having something that looks like a pretty generic concept (operator id) throwing an exception in a whole class of programs (batch jobs). This problem should go away anyways with the batch / streaming unification later.

Probably better solution from the perspective of this particular use case, but do the UDF's have an access to StreamingRuntimeContext? If not, casting requirement would be huge blocker for any users from using this, thus making this almost a private API.

However I personally do not mind one way or the other.

PS, There are already operations/methods that do not work in streaming/batch

I'm slightly leaning towards Stephan's suggestion, which I also agree is the better solution for this case.

It might be ok to have this as a "hidden" API for now anyways, since 1) it is marked @PublicEvolving, and 2) the API was added in quite a short timeframe.
If we want this fix in 1.5, I wouldn't suggest "fully" exposing it.

I slightly prefer the current approach. Looking at other methods, this seems to be the "established" way for methods that do not exist in batch. Why mixing up two different ways of dealing with them?

Different point here, I think we would usually like to return OperatorID here and not string, but it is probably in a wrong module. When moving OperatorID seems to much, I would suggest to call the method something like getOperatorIDAsString to make this a bit more explicit.

I am still much in favor of not exposing this in the RuntimeContext:

Having the state accesses in the RuntimeContext was a necessity of that moment, because there was no initializeState() and it is crucial to be exposed to users.

This operatorID is not crucial to be exposed to users, hence a very different case to me.

It is super easy to expose it later, it is much harder (even if marked as PublicEvolving) to hide it later. For a quick move, not exposing an addition publicly should always be the default choice, also beyond this specific case here.

StefanRRichter · 2018-05-15T09:27:07Z

Overall LGTM 👍

…nvironment constructors

…grams This allows to uniquely and stably across multiple job submissions identify operators. Previously two different operators that were executed by tasks that had the same name were indistinguishable.

…ducer011 Previously if there were two completely independent FlinkKafkaProducer011 data sinks in the job graph, their transactional.id would collide with one another. Fix is to use operator's unique ID as well along task name and subtask id. This change is backward compatible for recovering from older savepoints, since transactional.ids generated by the old generator still will be used after restoring from state.

pnowojski · 2018-05-17T12:09:40Z

I have removed getOperatorUniqueID() from RuntimeContext and our FlinkKafkaProducer011 is now casting RuntimeContext to StreamingRuntimeContext.

tzulitai · 2018-05-22T08:11:03Z

Thanks for the update @pnowojski.
Changes LGTM, +1.

Merging this ..

…ducer011 Previously if there were two completely independent FlinkKafkaProducer011 data sinks in the job graph, their transactional.id would collide with one another. Fix is to use operator's unique ID as well along task name and subtask id. This change is backward compatible for recovering from older savepoints, since transactional.ids generated by the old generator still will be used after restoring from state. This closes #5977.

…ducer011 Previously if there were two completely independent FlinkKafkaProducer011 data sinks in the job graph, their transactional.id would collide with one another. Fix is to use operator's unique ID as well along task name and subtask id. This change is backward compatible for recovering from older savepoints, since transactional.ids generated by the old generator still will be used after restoring from state. This closes apache#5977.

pnowojski force-pushed the f9295 branch from fe34837 to ad86d8e Compare May 9, 2018 16:07

zentol reviewed May 10, 2018

View reviewed changes

pnowojski force-pushed the f9295 branch 2 times, most recently from 8ed2ecd to 7cb5f5a Compare May 11, 2018 11:53

StephanEwen reviewed May 11, 2018

View reviewed changes

pnowojski added 4 commits May 17, 2018 13:03

[hotfix][tests] Introduce MockEnvironmentBuilder to deduplicate MockE…

3756cf7

…nvironment constructors

[hotfix][tests] Reduce mockito usage in tests

5e930ca

[FLINK-9316][streaming] Expose operator's unique ID in DataStream pro…

14e0639

…grams This allows to uniquely and stably across multiple job submissions identify operators. Previously two different operators that were executed by tasks that had the same name were indistinguishable.

pnowojski force-pushed the f9295 branch from 7cb5f5a to ac4b326 Compare May 17, 2018 12:08

asfgit closed this in 9b7493d May 22, 2018

rmetzger added the component=Connectors/Kafka label Mar 18, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FLINK-9295][kafka] Fix transactional.id collisions for FlinkKafkaProducer011 #5977

[FLINK-9295][kafka] Fix transactional.id collisions for FlinkKafkaProducer011 #5977

pnowojski commented May 9, 2018

zentol May 10, 2018 •

edited

pnowojski May 11, 2018

StephanEwen left a comment

StephanEwen May 11, 2018

pnowojski May 11, 2018

StephanEwen May 11, 2018

pnowojski May 11, 2018 •

edited

tzulitai May 15, 2018

StefanRRichter May 15, 2018

StefanRRichter May 15, 2018

StephanEwen May 16, 2018

StefanRRichter commented May 15, 2018

pnowojski commented May 17, 2018

tzulitai commented May 22, 2018

[FLINK-9295][kafka] Fix transactional.id collisions for FlinkKafkaProducer011 #5977

[FLINK-9295][kafka] Fix transactional.id collisions for FlinkKafkaProducer011 #5977

Conversation

pnowojski commented May 9, 2018

Brief change log

Verifying this change

Does this pull request potentially affect one of the following parts:

Documentation

zentol May 10, 2018 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

StephanEwen left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pnowojski May 11, 2018 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

StefanRRichter commented May 15, 2018

pnowojski commented May 17, 2018

tzulitai commented May 22, 2018

zentol May 10, 2018 •

edited

pnowojski May 11, 2018 •

edited