[FLINK-7637] [kinesis] Fix at-least-once guarantee in FlinkKinesisProducer #4871
tzulitai wants to merge 7 commits into apache:master from
Conversation
…ducer Prior to this commit, there is no flushing of KPL outstanding records on checkpoints in the FlinkKinesisProducer. As with the earlier at-least-once issue in the Flink Kafka producer, this may lead to data loss if records fail asynchronously after the checkpoint they were part of has completed.
```java
if (this.producer == null) {
    throw new RuntimeException("Kinesis producer has been closed");
}
if (thrownException != null) {
```
nit: it would be easier to review the code if the refactor (like extracting this code to a method) were in a separate commit from the "real" production changes, especially since those production changes are pretty minimal in terms of changed lines :)
Will keep that in mind for the future 👍
```java
        break;
    }
}
flushSync(kp);
```
why do we need this kp local variable? Without it, there would be no need to pass the KinesisProducer as a param to flushSync(), because flushSync() could just use this.producer.
Makes sense, will change.
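The suggested shape (dropping the kp parameter so flushSync() reads this.producer directly) can be sketched as follows. This is an illustrative stand-in, not Flink's or the KPL's actual code: FakeProducer mimics only the two KinesisProducer methods used here.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class FlushSyncSketch {

    // Minimal stand-in for the KPL KinesisProducer: it tracks a count of
    // records that have been submitted but not yet acknowledged.
    static class FakeProducer {
        final AtomicInteger outstanding = new AtomicInteger(3);

        int getOutstandingRecordsCount() {
            return outstanding.get();
        }

        // The real flush() is asynchronous; here each call completes one record.
        void flush() {
            outstanding.updateAndGet(n -> Math.max(0, n - 1));
        }
    }

    final FakeProducer producer = new FakeProducer();

    // Per the review comment, flushSync() takes no parameter and uses
    // this.producer: flush repeatedly until nothing is left in flight.
    void flushSync() {
        while (producer.getOutstandingRecordsCount() > 0) {
            producer.flush();
            try {
                Thread.sleep(10); // back off between flush attempts
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
    }

    public static void main(String[] args) {
        FlushSyncSketch sketch = new FlushSyncSketch();
        sketch.flushSync();
        System.out.println("outstanding=" + sketch.producer.getOutstandingRecordsCount());
    }
}
```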
```java
/**
 * Check if there are any asynchronous exceptions. If so, rethrow the exception.
 */
private void checkAndPropagateAsyncError() throws Exception {
```
I am assuming that during this move - copy/pasting - there were no changes in the code? (Btw, that's another reason why having refactors in separate commits makes reviewing easier ;) )
yes, only a refactoring of the code to a separate method.
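For context, the pattern behind such an extracted helper can be sketched like this (illustrative stand-in, not the PR's exact code): an async send callback records the first failure, and checkAndPropagateAsyncError() rethrows it on the task thread.

```java
public class AsyncErrorSketch {

    // First async failure seen by the producer's callback thread, if any.
    volatile Throwable thrownException;

    // Invoked from the producer's callback thread on a failed record.
    void onAsyncFailure(Throwable t) {
        if (thrownException == null) {
            thrownException = t;
        }
    }

    // Called from invoke() and snapshotState() to surface async failures
    // on the task thread instead of silently dropping them.
    void checkAndPropagateAsyncError() throws Exception {
        if (thrownException != null) {
            Throwable t = thrownException;
            thrownException = null; // clear so the error is reported once
            throw new RuntimeException(
                    "An exception was thrown while processing a record: " + t.getMessage(), t);
        }
    }

    public static void main(String[] args) {
        AsyncErrorSketch sketch = new AsyncErrorSketch();
        sketch.onAsyncFailure(new IllegalStateException("record failed"));
        try {
            sketch.checkAndPropagateAsyncError();
            System.out.println("no error");
        } catch (Exception e) {
            System.out.println("rethrown: " + e.getCause().getMessage());
        }
    }
}
```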
```java
/**
 * Test ensuring that if an async exception is caught for one of the flushed requests on checkpoint,
 * it should be rethrown; we set a timeout because the test will not finish if the logic is broken.
```
Did you forget to set the timeout?
```java
 */
@SuppressWarnings("unchecked")
@Test(timeout = 10000)
public void testAtLeastOnceProducer() throws Throwable {
```
I am not sure if this test is good enough:

1. It is not testing the code that it should :( It is using an overridden version of the FlinkKinesisProducer - a DummyFlinkKinesisProducer can hide bugs in the real implementation.
2. Implementing it as a unit test with mocks doesn't test our integration with Kinesis. You made some assumptions about how at-least-once should be implemented, you implemented them in production code, and here you are repeating the same code using the same assumptions :(

However, looking at the Kafka tests' instability, I'm not sure which approach is worse. Unless those are not test instabilities but bugs in our code, which Kafka's ITCases are triggering from time to time - this mockito-based test would not discover such bugs.
@pnowojski I can see your point. Regarding your concerns:
For 1.: I think it is still ok, since the DummyFlinkKinesisProducer only overrides the getKinesisProducer method to implement a mock producer. Also, while the snapshotState method is overridden, I'm only overriding it to inject sync-point latches. The flushing behaviour of the snapshotState implementation should still be guarded by this test.
For 2.: The lack of integration tests with Kinesis has always been an issue. There simply is no simple way to implement IT tests for that.
What do you think?
I am still not convinced on the value of such tests (reverse implementing the production code), but I will not press it since:
There simply is no simple way to implement IT tests for that.
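The latch-based sync-point pattern discussed above can be sketched as a self-contained toy (none of these names are Flink's actual test code; MockProducer stands in for the mocked KinesisProducer): a snapshot thread blocks flushing until the test injects an async failure, and the snapshot then rethrows it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class AtLeastOnceTestSketch {

    // Stand-in for the mocked KinesisProducer: one record is in flight,
    // and its completion is driven by the test rather than by flush().
    static class MockProducer {
        final AtomicInteger outstanding = new AtomicInteger(1);

        int getOutstandingRecordsCount() {
            return outstanding.get();
        }

        void flush() {
            // no-op: the test decides when the record resolves
        }
    }

    final MockProducer producer = new MockProducer();
    volatile Throwable thrownException;
    final CountDownLatch flushStarted = new CountDownLatch(1); // sync point

    // Mimics snapshotState(): flush until no outstanding records, then
    // surface any async error so the checkpoint fails instead of losing data.
    void snapshot() throws Exception {
        flushStarted.countDown();
        while (producer.getOutstandingRecordsCount() > 0) {
            producer.flush();
            Thread.sleep(5);
        }
        if (thrownException != null) {
            throw new RuntimeException("snapshot failed", thrownException);
        }
    }

    public static void main(String[] args) throws Exception {
        AtLeastOnceTestSketch sink = new AtLeastOnceTestSketch();
        final Throwable[] seen = new Throwable[1];
        Thread snapshotThread = new Thread(() -> {
            try {
                sink.snapshot();
            } catch (Exception e) {
                seen[0] = e;
            }
        });
        snapshotThread.start();

        sink.flushStarted.await();            // wait until the snapshot is flushing
        sink.thrownException = new IllegalStateException("async put failed");
        sink.producer.outstanding.set(0);     // the failed record is now resolved
        snapshotThread.join();

        System.out.println("snapshot rethrew async error: " + (seen[0] != null));
    }
}
```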
```java
try {
    testHarness.snapshot(123L, 123L);
} catch (Exception e) {
    // the next invoke should rethrow the async exception
```
nit: the comment refers to invoke, which is probably copy-pasted from above
Good catch, will fix.
```java
    snapshotThread.sync();
} catch (Exception e) {
    // the next invoke should rethrow the async exception
    e.printStackTrace();
```
```java
checkAndPropagateAsyncError();

flushSync(producer);
if (producer.getOutstandingRecordsCount() > 0) {
```
what if records are added by another thread between the calls of flushSync() and producer.getOutstandingRecordsCount()?
@bowenli86 I don't think that would happen. Records are added to the producer only in invoke, which is guaranteed to not be executed concurrently with snapshotState.
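To illustrate that point: because invoke() and snapshotState() run on the same task thread, nothing can enqueue records between the flush and the outstanding-count check. A rough sketch of the checkpoint-time ordering, with illustrative names that stand in for (but are not) Flink's actual code:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SnapshotStateSketch {

    // Records submitted but not yet acknowledged by the (simulated) producer.
    final AtomicInteger outstanding = new AtomicInteger(2);
    volatile Throwable thrownException;

    void flushSync() {
        while (outstanding.get() > 0) {
            outstanding.decrementAndGet(); // real code calls producer.flush() and sleeps
        }
    }

    void checkAndPropagateAsyncError() throws Exception {
        if (thrownException != null) {
            throw new RuntimeException("async failure", thrownException);
        }
    }

    // Order matters: surface earlier async errors, flush everything, verify
    // nothing is left in flight, then surface errors raised during the flush.
    // Safe only because invoke() cannot run concurrently with this method.
    void snapshotState() throws Exception {
        checkAndPropagateAsyncError();
        flushSync();
        if (outstanding.get() > 0) {
            throw new IllegalStateException(
                    "Number of outstanding records must be zero at this point.");
        }
        checkAndPropagateAsyncError();
    }

    public static void main(String[] args) throws Exception {
        SnapshotStateSketch sketch = new SnapshotStateSketch();
        sketch.snapshotState();
        System.out.println("outstanding=" + sketch.outstanding.get());
    }
}
```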
@pnowojski @aljoscha @bowenli86 thanks a lot for the reviews.

LGTM 👍

Thanks. I made one last change: allow direct import of non-shaded guava in Will proceed to merge if Travis gives green.

Thanks for the heads-up but I think this looks good!

Merging ..
…ducer Prior to this commit, there is no flushing of KPL outstanding records on checkpoints in the FlinkKinesisProducer. As with the earlier at-least-once issue in the Flink Kafka producer, this may lead to data loss if records fail asynchronously after the checkpoint they were part of has completed. This closes apache#4871.
What is the purpose of the change

Prior to this PR, there is no flushing of KPL outstanding records on checkpoints in the FlinkKinesisProducer. As with the earlier at-least-once issue in the Flink Kafka producer, this may lead to data loss if records fail asynchronously after the checkpoint they were part of has completed.

Brief change log

Verifying this change

New unit tests are added to FlinkKinesisProducerTest to verify at-least-once.

Does this pull request potentially affect one of the following parts:

@Public(Evolving): no

Documentation