-
Notifications
You must be signed in to change notification settings - Fork 13.8k
[FLINK-19473] Implement multi inputs sorting DataInput #13529
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community Automated ChecksLast check on commit 6270bf0 (Thu Oct 01 14:35:15 UTC 2020) Warnings:
Mention the bot in a comment to re-run the automated checks. Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. DetailsThe Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commandsThe @flinkbot bot supports the following commands:
|
d91430e to
e1aaede
Compare
|
@flinkbot run azure |
0adec04 to
0678c76
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like most of the commits here. Well done @dawidwys with the refactoring/cleanups/deduplicating the code.
I think my biggest concerns are about the bug fix commit hidden in this PR.
Note, I didn't review the MultiInputSortingDataInputs code itself, I hope @aljoscha will be able to do the most work here. Please let me know if you would like me to take a look at something there after all.
...g-java/src/main/java/org/apache/flink/streaming/runtime/io/StreamMultipleInputProcessor.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/apache/flink/streaming/runtime/io/StreamMultipleInputProcessorFactory.java
Show resolved
Hide resolved
...src/main/java/org/apache/flink/streaming/runtime/io/StreamMultipleInputProcessorFactory.java
Show resolved
Hide resolved
...k-streaming-java/src/main/java/org/apache/flink/streaming/api/operators/EndOfInputAware.java
Outdated
Show resolved
Hide resolved
...ming-java/src/test/java/org/apache/flink/streaming/api/operators/sort/SortedInputITCase.java
Outdated
Show resolved
Hide resolved
...aming-java/src/main/java/org/apache/flink/streaming/runtime/io/TwoInputSelectionHandler.java
Outdated
Show resolved
Hide resolved
...eaming-java/src/main/java/org/apache/flink/streaming/runtime/io/StreamTwoInputProcessor.java
Outdated
Show resolved
Hide resolved
...eaming-java/src/main/java/org/apache/flink/streaming/runtime/io/StreamTwoInputProcessor.java
Outdated
Show resolved
Hide resolved
...eaming-java/src/main/java/org/apache/flink/streaming/runtime/io/StreamTwoInputProcessor.java
Outdated
Show resolved
Hide resolved
| @SuppressWarnings("unchecked") | ||
| MultiInputSortingDataInputs<Object> multiInputs = new MultiInputSortingDataInputs<Object>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I remember someone complaining about unchecked warnings 🙈 and Object? 😈 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I must admit I was a bit harsh on this one 😅
|
Just one comment regarding commits tags. I mostly use |
aljoscha
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall the changes look very good! I had one comment before about commit tags and I had some inline comments about exception messages and some possible code changes.
| import static org.apache.flink.util.Preconditions.checkNotNull; | ||
|
|
||
| /** | ||
| * An entry class for creating coupled, sorting inputs. The inputs are sorted independently and afterwards |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forget if we mentioned this in the one-input implementations but we should mention that this does sorting by key and timestamp (in that order). It's not some generic sorting utility.
| public CompletableFuture<Void> prepareSnapshot( | ||
| ChannelStateWriter channelStateWriter, | ||
| long checkpointId) { | ||
| throw new UnsupportedOperationException(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, not sure I caught this on the one-input version but we should give some information about why it's not supported.
| */ | ||
| private class SortingPhaseDataOutput implements PushingAsyncDataInput.DataOutput<Object> { | ||
|
|
||
| int currentIdx; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it maybe be better to have one SortingPhaseDataOutput per indexed input. It could also have "hard links" to the things it needs like key selector, sorter, etc.
And then we also wouldn't need this mutable field.
| public void close() throws IOException { | ||
| IOException ex = null; | ||
| try { | ||
| inputs[idx].close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using array accesses for inputs and sorters, could we just get them once in the constructor and store in fields? This is related to the question below about using one SortingPhaseDataOutput per input.
aljoscha
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks very good! +1 to merge once CI is green.
I had two nitpicks about indentation and Javadocs.
| * serialized key is constant, or {@link VariableLengthByteKeyComparator} otherwise. | ||
| * | ||
| * <p>Watermarks, stream statuses, nor latency markers are propagated downstream as they do not make | ||
| * sense with buffered records. The input emits a MAX_WATERMARK after all records. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It emits the maximum watermark it has seen, MAX_WATERMARK could be interpreted as Long.MAX_VALUE.
| .mapToObj( | ||
| idx -> { | ||
| try { | ||
| KeyAndValueSerializer<Object> keyAndValueSerializer = new KeyAndValueSerializer<>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indentation seems off here. (Another point for having an automatic formatting tool... 😅)
|
@pnowojski I'd like to double check with you, are you fine with the current state of the PR? |
pnowojski
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM % a couple of nits
...eaming-java/src/main/java/org/apache/flink/streaming/runtime/io/StreamOneInputProcessor.java
Outdated
Show resolved
Hide resolved
...eaming-java/src/main/java/org/apache/flink/streaming/runtime/io/StreamOneInputProcessor.java
Show resolved
Hide resolved
.../src/main/java/org/apache/flink/streaming/api/operators/sort/MultiInputSortingDataInput.java
Outdated
Show resolved
Hide resolved
…cessors I replaced the OperatorChain in StreamOneInputProcessor and other InputProcessors with a BoundedMultiInput. Moroever I made the OperatorChain implement BoundedMultiInput interface. It makes it easier to instantiate StreamOneInputProcessor in tests without the need to instantiate a whole OperatorChain.
I implement a MultiInputSortingDataInputs which is kind of a factory for multiple, related inputs. In case of sorting inputs of Two/Multiple input operators not only the independent inputs should be sorted, but also records across different inputs. Therefore the produced inputs share a common context to synchronize which input should emit records next. The coordination is done via inputs Availability. Only one of the inputs, with the current smallest element, is available at a time.
pnowojski
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one optional nit besides, LGTM (assuming azure will be green).
Thanks for the changes @dawidwys.
| inputSelectionHandler.nextSelection(); | ||
| } else { | ||
| secondInputStatus = processor2.processInput(); | ||
| checkFinished(secondInputStatus); | ||
| inputSelectionHandler.nextSelection(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitty nit: move inputSelectionHandler.nextSelection(); outside of the if/else?
What is the purpose of the change
This PR implements sorting inputs for multi input operators/tasks. It is based on the work in #13521
First I sort all inputs individually and after that I do a sorting merge of all the inputs to make sure that operator sees all incoming records with the same key from all inputs.
Brief change log
For a more thorough explanation see the commits messages. Some highlights:
AvailabilityProviderhandling logic, which did not work before.Verifying this change
Added tests in:
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (yes / no)Documentation