Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-32336][tests] PartitionITCase#ComparablePojo now public #22778

Merged
merged 2 commits into from
Jun 15, 2023

Conversation

zentol
Copy link
Contributor

@zentol zentol commented Jun 14, 2023

POJOs should be public so they are serialized with the pojo serializer.

The modification to the MinMaxSelector is necessary because by being treated as a pojo the partitioning is being changed. Seemingly the first field of the ComparablePojo dominates the partitioning (ranging from 0-2), causing only 3 partitions to be created but the job runs with p=4.

I don't really know what the semantics for range partitioning of Kryo records are, and am somewhat confused that this test passed with Kryo in the first place because it does check for order of elements within partitions. 馃

Anyhow, since this is a DataSet test I'm not too inclined to dig any deeper into the why's. The hasNext check is useful anyway because the test should pass with any parallelism, even one greater than the number of elements.

@zentol zentol requested a review from XComp June 14, 2023 11:03
@flinkbot
Copy link
Collaborator

flinkbot commented Jun 14, 2023

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

Copy link
Contributor

@XComp XComp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 馃憤

Out of curiosity: Why would the serialization have an effect on the execution of this test. Isn't the serialization only used to generate the objects and the actual MinMaxSelector works on the serialized object? In this sense, I don't understand your confusion on why the test succeeded with the kryo serialization. 馃

@zentol
Copy link
Contributor Author

zentol commented Jun 15, 2023

MinMaxSelector

I assume you are referring to the partitioner.

Why would the serialization have an effect on the execution of this test.

POJO vs Kryo isn't just about serialization, it is also about the amount of information that Flink has about the type.

I don't understand your confusion on why the test succeeded with the kryo serialization.

My assumption is that, based on the javadocs, range partitioning takes the value ranges into account (aka, isn't just hash-partitioning data), but Kryo records are opaque and there isn't a real value that Flink can inspect, in contrast to the POJO type where Flink can directly read the long values.

If you don't have access to the values, then you either must use hashes or base it on the serialized data; using the serialized data of an opaque data type doesn't really make sense to me, and hashes shouldn't (?) be able to ensure that the partitioning works as nicely as the test requires.

Maybe we just sort the set locally (since the POJO does implement Comparable) and slice that into partitions 馃し

@zentol zentol merged commit 6df09d8 into apache:master Jun 15, 2023
@zentol zentol deleted the 32336 branch June 15, 2023 14:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants