[FLINK-7637] [kinesis] Fix at-least-once guarantee in FlinkKinesisProducer #4871
tzulitai wants to merge 7 commits into apache:master from
Conversation
…ducer Prior to this commit, there is no flushing of KPL outstanding records on checkpoints in the FlinkKinesisProducer. As with the earlier at-least-once issue in the Flink Kafka producer, this may lead to data loss if records fail asynchronously after the checkpoint they were part of has completed.
```java
if (this.producer == null) {
    throw new RuntimeException("Kinesis producer has been closed");
}
if (thrownException != null) {
```
nit: it would be easier to review the code if the refactor (like extracting this code to a method) were in a separate commit from the "real" production changes, especially since those production changes are pretty minimal in terms of changed lines :)
Will keep that in mind for the future 👍
```java
        break;
    }
}
flushSync(kp);
```
why do we need this kp local variable? Without it, there would be no need to pass the KinesisProducer as a param to flushSync(), because flushSync() could just use this.producer.
Makes sense, will change.
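The suggested shape (dropping the kp parameter so flushSync() reads this.producer directly) can be sketched as follows. This is an illustrative stand-in, not Flink's or the KPL's actual code: FakeProducer mimics only the two KinesisProducer methods used here.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class FlushSyncSketch {

    // Minimal stand-in for the KPL KinesisProducer: it tracks a count of
    // records that have been submitted but not yet acknowledged.
    static class FakeProducer {
        final AtomicInteger outstanding = new AtomicInteger(3);

        int getOutstandingRecordsCount() {
            return outstanding.get();
        }

        // The real flush() is asynchronous; here each call completes one record.
        void flush() {
            outstanding.updateAndGet(n -> Math.max(0, n - 1));
        }
    }

    final FakeProducer producer = new FakeProducer();

    // Per the review comment, flushSync() takes no parameter and uses
    // this.producer: flush repeatedly until nothing is left in flight.
    void flushSync() {
        while (producer.getOutstandingRecordsCount() > 0) {
            producer.flush();
            try {
                Thread.sleep(10); // back off between flush attempts
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
    }

    public static void main(String[] args) {
        FlushSyncSketch sketch = new FlushSyncSketch();
        sketch.flushSync();
        System.out.println("outstanding=" + sketch.producer.getOutstandingRecordsCount());
    }
}
```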
```java
/**
 * Check if there are any asynchronous exceptions. If so, rethrow the exception.
 */
private void checkAndPropagateAsyncError() throws Exception {
```
I am assuming that during this move - copy/pasting - there were no changes in the code? (Btw, that's another reason why having refactors in separate commits makes reviewing easier ;) )
yes, only a refactoring of the code to a separate method.
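For context, the pattern behind such an extracted helper can be sketched like this (illustrative stand-in, not the PR's exact code): an async send callback records the first failure, and checkAndPropagateAsyncError() rethrows it on the task thread.

```java
public class AsyncErrorSketch {

    // First async failure seen by the producer's callback thread, if any.
    volatile Throwable thrownException;

    // Invoked from the producer's callback thread on a failed record.
    void onAsyncFailure(Throwable t) {
        if (thrownException == null) {
            thrownException = t;
        }
    }

    // Called from invoke() and snapshotState() to surface async failures
    // on the task thread instead of silently dropping them.
    void checkAndPropagateAsyncError() throws Exception {
        if (thrownException != null) {
            Throwable t = thrownException;
            thrownException = null; // clear so the error is reported once
            throw new RuntimeException(
                    "An exception was thrown while processing a record: " + t.getMessage(), t);
        }
    }

    public static void main(String[] args) {
        AsyncErrorSketch sketch = new AsyncErrorSketch();
        sketch.onAsyncFailure(new IllegalStateException("record failed"));
        try {
            sketch.checkAndPropagateAsyncError();
            System.out.println("no error");
        } catch (Exception e) {
            System.out.println("rethrown: " + e.getCause().getMessage());
        }
    }
}
```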
```java
/**
 * Test ensuring that if an async exception is caught for one of the flushed requests on checkpoint,
 * it should be rethrown; we set a timeout because the test will not finish if the logic is broken.
```
Did you forget to set the timeout?
```java
 */
@SuppressWarnings("unchecked")
@Test(timeout = 10000)
public void testAtLeastOnceProducer() throws Throwable {
```
I am not sure if this test is good enough:

1. It is not testing the code that it should :( It is using an overridden version of the FlinkKinesisProducer - a DummyFlinkKinesisProducer can hide bugs in the real implementation.
2. Implementing it as a unit test with mocks doesn't test our integration with Kinesis. You made some assumptions about how at-least-once should be implemented, you implemented them in production code, and here you are repeating the same code using the same assumptions :(

However, looking at the Kafka tests' instability, I'm not sure which approach is worse. Unless those are not test instabilities but bugs in our code, which Kafka's ITCases are triggering from time to time - this mockito-based test would not discover such bugs.
@pnowojski I can see your point. Regarding your concerns:
For 1.: I think it is still ok, since the DummyFlinkKinesisProducer only overrides the getKinesisProducer method to implement a mock producer. Also, while the snapshotState method is overridden, I'm only overriding it to inject sync-point latches. The flushing behaviour of the snapshotState implementation should still be guarded by this test.
For 2.: The lack of integration tests with Kinesis has always been an issue. There simply is no simple way to implement IT tests for that.
What do you think?
I am still not convinced on the value of such tests (reverse implementing the production code), but I will not press it since:
There simply is no simple way to implement IT tests for that.
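The latch-based sync-point pattern discussed above can be sketched as a self-contained toy (none of these names are Flink's actual test code; MockProducer stands in for the mocked KinesisProducer): a snapshot thread blocks flushing until the test injects an async failure, and the snapshot then rethrows it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class AtLeastOnceTestSketch {

    // Stand-in for the mocked KinesisProducer: one record is in flight,
    // and its completion is driven by the test rather than by flush().
    static class MockProducer {
        final AtomicInteger outstanding = new AtomicInteger(1);

        int getOutstandingRecordsCount() {
            return outstanding.get();
        }

        void flush() {
            // no-op: the test decides when the record resolves
        }
    }

    final MockProducer producer = new MockProducer();
    volatile Throwable thrownException;
    final CountDownLatch flushStarted = new CountDownLatch(1); // sync point

    // Mimics snapshotState(): flush until no outstanding records, then
    // surface any async error so the checkpoint fails instead of losing data.
    void snapshot() throws Exception {
        flushStarted.countDown();
        while (producer.getOutstandingRecordsCount() > 0) {
            producer.flush();
            Thread.sleep(5);
        }
        if (thrownException != null) {
            throw new RuntimeException("snapshot failed", thrownException);
        }
    }

    public static void main(String[] args) throws Exception {
        AtLeastOnceTestSketch sink = new AtLeastOnceTestSketch();
        final Throwable[] seen = new Throwable[1];
        Thread snapshotThread = new Thread(() -> {
            try {
                sink.snapshot();
            } catch (Exception e) {
                seen[0] = e;
            }
        });
        snapshotThread.start();

        sink.flushStarted.await();            // wait until the snapshot is flushing
        sink.thrownException = new IllegalStateException("async put failed");
        sink.producer.outstanding.set(0);     // the failed record is now resolved
        snapshotThread.join();

        System.out.println("snapshot rethrew async error: " + (seen[0] != null));
    }
}
```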
```java
try {
    testHarness.snapshot(123L, 123L);
} catch (Exception e) {
    // the next invoke should rethrow the async exception
```
nit: the comment refers to invoke, which is probably copy-pasted from above
Good catch, will fix.
```java
    snapshotThread.sync();
} catch (Exception e) {
    // the next invoke should rethrow the async exception
    e.printStackTrace();
```
```java
checkAndPropagateAsyncError();

flushSync(producer);
if (producer.getOutstandingRecordsCount() > 0) {
```
what if records are added by another thread between the calls of flushSync() and producer.getOutstandingRecordsCount()?
@bowenli86 I don't think that would happen. Records are added to the producer only in invoke, which is guaranteed to not be executed concurrently with snapshotState.
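To illustrate that point: because invoke() and snapshotState() run on the same task thread, nothing can enqueue records between the flush and the outstanding-count check. A rough sketch of the checkpoint-time ordering, with illustrative names that stand in for (but are not) Flink's actual code:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SnapshotStateSketch {

    // Records submitted but not yet acknowledged by the (simulated) producer.
    final AtomicInteger outstanding = new AtomicInteger(2);
    volatile Throwable thrownException;

    void flushSync() {
        while (outstanding.get() > 0) {
            outstanding.decrementAndGet(); // real code calls producer.flush() and sleeps
        }
    }

    void checkAndPropagateAsyncError() throws Exception {
        if (thrownException != null) {
            throw new RuntimeException("async failure", thrownException);
        }
    }

    // Order matters: surface earlier async errors, flush everything, verify
    // nothing is left in flight, then surface errors raised during the flush.
    // Safe only because invoke() cannot run concurrently with this method.
    void snapshotState() throws Exception {
        checkAndPropagateAsyncError();
        flushSync();
        if (outstanding.get() > 0) {
            throw new IllegalStateException(
                    "Number of outstanding records must be zero at this point.");
        }
        checkAndPropagateAsyncError();
    }

    public static void main(String[] args) throws Exception {
        SnapshotStateSketch sketch = new SnapshotStateSketch();
        sketch.snapshotState();
        System.out.println("outstanding=" + sketch.outstanding.get());
    }
}
```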
@pnowojski @aljoscha @bowenli86 thanks a lot for the reviews.

LGTM 👍

Thanks. I made one last change: allow direct import of non-shaded guava in Will proceed to merge if Travis gives green.

Thanks for the heads-up but I think this looks good!

Merging ..
…ducer Prior to this commit, there is no flushing of KPL outstanding records on checkpoints in the FlinkKinesisProducer. As with the earlier at-least-once issue in the Flink Kafka producer, this may lead to data loss if records fail asynchronously after the checkpoint they were part of has completed. This closes apache#4871.
What is the purpose of the change

Prior to this PR, there is no flushing of KPL outstanding records on checkpoints in the FlinkKinesisProducer. As with the earlier at-least-once issue in the Flink Kafka producer, this may lead to data loss if records fail asynchronously after the checkpoint they were part of has completed.

Brief change log

Verifying this change

New unit tests are added to FlinkKinesisProducerTest to verify at-least-once.

Does this pull request potentially affect one of the following parts:

@Public(Evolving): no

Documentation