[FLINK-30202][tests] Do not assert on checkpointId #21416

zentol · 2022-11-29T08:55:09Z

Capturing the checkpointId for a generated record in a subsequent map function is impossible since the notifyCheckpointComplete notification may arrive at any time (or not at all). Instead just assert that each subtask got exactly as many records as expected, which can only happen (reliably) if the rate-limiting works as expected.

flinkbot · 2022-11-29T08:59:38Z

CI report:

e52d31e Azure: FAILURE

XComp

> which can only happen (reliably) if the rate-limiting works as expected.

You're saying that we would introduce a test instability here if the RateLimitedStrategy wouldn't perform as expected?

zentol · 2022-11-29T13:40:54Z

> You're saying that we would introduce a test instability here if the RateLimitedStrategy wouldn't perform as expected?

Yes. At least that was the idea. Now I'm not so sure anymore whether this makes sense. Given that we limit the count we invariably end up with capacityPerCycle * numCycles elements, regardless of whether rate-limiting was applied or not.
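The point about the count bound can be illustrated with a minimal, hypothetical model in plain Java. `BoundedCountSketch` and its `run` method are made-up names for illustration and are not part of the Flink API; the sketch only assumes that the source is bounded to `capacityPerCycle * numCycles` records.

```java
/** Hypothetical model of a count-bounded generator; not the DataGeneratorSource API. */
class BoundedCountSketch {

    /** Returns {recordsEmitted, cyclesUsed} for a source bounded to capacityPerCycle * numCycles records. */
    static int[] run(int capacityPerCycle, int numCycles, boolean rateLimited) {
        int bound = capacityPerCycle * numCycles; // the count limit configured on the source
        int emitted = 0;
        int cycles = 0;
        while (emitted < bound) {
            cycles++;
            // With rate limiting, at most capacityPerCycle records pass per cycle;
            // without it, everything that is left passes at once.
            int budget = rateLimited ? capacityPerCycle : bound - emitted;
            emitted += Math.min(budget, bound - emitted);
        }
        return new int[] {emitted, cycles};
    }
}
```

In this model the total count is identical with and without rate limiting; only the number of cycles differs, so an assertion on the total alone cannot detect a missing rate limit.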

XComp · 2022-11-29T13:44:08Z

yeah, that's something I was wondering as well. But the behavior of the RateLimitedStrategy doesn't necessarily need to be tested here, I guess. It feels like we're missing a RateLimitedSourceReaderTest for that kind of functionality. 🤔

zentol · 2022-11-29T13:58:53Z

There is a RateLimitedSourceReaderITCase.

I'll try finding another way to test this; my current thinking goes towards using a FlatMapFunction that stops emitting values after the first call to snapshotState, so it should truly only emit the values of the first checkpoint (and then you can assert the total number of records emitted in a single checkpoint). But so far that's also not working; too many values get emitted...

zentol · 2022-11-29T14:12:51Z

I think we actually found a bug.

If a split was already assigned to a reader, then the first call to SourceReader#pollNext (which happens before SourceReader#isAvailable) circumvents rate-limiting.
We need to force this first call to also go through isAvailable.
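A simplified sketch of the idea, assuming a reader that refuses to emit until `isAvailable()` has been called. `GatedReaderSketch`, `PollStatus`, and `addPermits` are hypothetical stand-ins written for this illustration; they are not the actual RateLimitedSourceReader code or Flink's `InputStatus`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

/** Hypothetical stand-in for a poll result; not Flink's InputStatus. */
enum PollStatus { MORE_AVAILABLE, NOTHING_AVAILABLE, END_OF_INPUT }

/** Hypothetical reader that only models the pollNext()/isAvailable() ordering. */
class GatedReaderSketch {
    private final Queue<Integer> pending = new ArrayDeque<>();
    final List<Integer> emitted = new ArrayList<>();
    private int permits; // stand-in for the rate limiter's budget
    private CompletableFuture<Void> availabilityFuture; // null until isAvailable() is called

    GatedReaderSketch(int numRecords, int initialPermits) {
        for (int i = 0; i < numRecords; i++) {
            pending.add(i);
        }
        this.permits = initialPermits;
    }

    CompletableFuture<Void> isAvailable() {
        if (availabilityFuture == null) {
            availabilityFuture = new CompletableFuture<>();
            completeIfPermitted();
        }
        return availabilityFuture;
    }

    PollStatus pollNext() {
        if (pending.isEmpty()) {
            return PollStatus.END_OF_INPUT;
        }
        // No future yet (e.g. the very first call) or permit not yet granted:
        // emit nothing, so the caller has to go through isAvailable().
        if (availabilityFuture == null || !availabilityFuture.isDone()) {
            return PollStatus.NOTHING_AVAILABLE;
        }
        availabilityFuture = null; // each record needs a fresh permit
        emitted.add(pending.remove());
        return PollStatus.MORE_AVAILABLE;
    }

    /** Simulates the rate limiter releasing n permits. */
    void addPermits(int n) {
        permits += n;
        completeIfPermitted();
    }

    private void completeIfPermitted() {
        if (availabilityFuture != null && !availabilityFuture.isDone() && permits > 0) {
            permits--;
            availabilityFuture.complete(null);
        }
    }
}
```

With this shape, a `pollNext()` call that arrives before any `isAvailable()` call returns `NOTHING_AVAILABLE` instead of emitting a record past the limiter.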

zentol · 2022-11-29T14:14:22Z

Additionally, the RateLimitedSourceReader may reset the checkpoint limit at the wrong time. We don't really want that to happen when the checkpoint is complete, but rather when the next checkpoint starts (== when snapshotState was called).
That said I haven't seen a test failure because of this (yet).
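A minimal sketch of the proposed reset point, under the assumption that the limit is a simple per-checkpoint budget. `PerCheckpointGateSketch` and its method names are hypothetical and only mirror the callbacks mentioned above; this is not how the existing reader is implemented.

```java
/** Hypothetical per-checkpoint budget; illustrates the reset point only. */
class PerCheckpointGateSketch {
    private final int capacityPerCheckpoint;
    private int emittedSinceLastSnapshot;

    PerCheckpointGateSketch(int capacityPerCheckpoint) {
        this.capacityPerCheckpoint = capacityPerCheckpoint;
    }

    /** Returns true if a record may be emitted within the current budget. */
    boolean tryEmit() {
        if (emittedSinceLastSnapshot < capacityPerCheckpoint) {
            emittedSinceLastSnapshot++;
            return true;
        }
        return false;
    }

    /** Proposed behavior: the budget resets when the next checkpoint starts. */
    void onSnapshotState() {
        emittedSinceLastSnapshot = 0;
    }

    /** Deliberately no reset here: the completion notification may arrive late or not at all. */
    void onCheckpointComplete(long checkpointId) {
        // intentionally empty
    }
}
```

Tying the reset to the snapshot call makes the budget independent of when, or whether, the completion notification is delivered.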

XComp

I really should push up reading up on FLIP-27 in my todo list. 8) Anyway, after some code reading, the change in pollNext() makes sense. Initially, I thought of initializing availabilityFuture in pollNext() instead of returning NOTHING_AVAILABLE. But that was a wrong train of thought. I still don't get your 2nd comment, though. Please find my remarks below.

...tagen/src/test/java/org/apache/flink/connector/datagen/source/DataGeneratorSourceITCase.java

XComp

The issue where we complete the gatingFuture when receiving the completed checkpoint notification instead of when a new checkpoint is triggered sounds like a separate issue. I think it would make sense to create a new Jira ticket for that. WDYT?

zentol · 2022-11-30T15:45:16Z

> I think it would make sense to create a new Jira ticket for that. WDYT?

yes, it's a separate issue (and for the one in this ticket we at least already have a test that shows the issue).

- the test was never calling isAvailable(), relying on the previous (bugged) behavior of rate-limiting not being enforced
- The loop was difficult to understand in terms of how many records are actually being processed and was refactored accordingly
- there were a series of math errors in here; 563-177=386, but 128(elementsPerCycle)*3 = 384. This was hidden by the final call to pollNext() in the while loop (emitting 1 additional record), and the final range assertion also incrementing to by 1.

zentol · 2022-11-30T16:29:10Z

Another test relied on the previous (bugged) behavior :(

- use 0-383 to make off-by-one error obvious (the splits included 385 values, not 384)
- assert that we reach END_OF_INPUT
- correctly assert all 384 elements
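The off-by-one in the list above comes down to counting an inclusive range. A small sketch of that arithmetic follows; `InclusiveRangeSketch` is a hypothetical helper for illustration, not code from the test.

```java
/** Hypothetical helper showing how many values an inclusive range holds. */
class InclusiveRangeSketch {

    /** Number of values in [from, to], both ends included. */
    static long countInclusive(long from, long to) {
        return to - from + 1;
    }
}
```

With three cycles of 128 elements the expected total is 384, which is exactly the size of the inclusive range 0 to 383; a range whose end is 384 larger than its start holds 385 values.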

...datagen/src/test/java/org/apache/flink/connector/datagen/source/DataGeneratorSourceTest.java

more assertions

XComp

LGTM 👍 ...just a few minor things

...datagen/src/test/java/org/apache/flink/connector/datagen/source/DataGeneratorSourceTest.java

...tagen/src/test/java/org/apache/flink/connector/datagen/source/DataGeneratorSourceITCase.java

zentol requested a review from XComp November 29, 2022 08:55

flinkbot added the component=Connectors/Common label Nov 29, 2022

XComp reviewed Nov 29, 2022

Force first call to go through isAvailable

03d91f4

XComp reviewed Nov 30, 2022

XComp approved these changes Nov 30, 2022

- more loop fixes

2e3fc19

zentol requested a review from XComp December 1, 2022 12:45

XComp reviewed Dec 2, 2022

zentol added 4 commits December 2, 2022 14:13

Add numCycles

c2d400e

assert number of splits

d3d72be

Update DataGeneratorSourceTest.java

8640f76

more assertions

comments

1493da0

zentol requested a review from XComp December 2, 2022 13:15

XComp approved these changes Dec 2, 2022

comments

e52d31e

zentol merged commit 81ed6c6 into apache:master Dec 4, 2022

zentol deleted the 30202 branch April 20, 2023 07:51
