New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-19981][core][table] Add name-based field mode for Row #14420
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community Automated ChecksLast check on commit d5d48b9 (Fri May 28 07:00:33 UTC 2021) Warnings:
Mention the bot in a comment to re-run the automated checks. Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commandsThe @flinkbot bot supports the following commands:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, this looks very good! I had some inline questions.
And we should also wait until we have some benchmarking results.
public RowSerializer( | ||
TypeSerializer<?>[] fieldSerializers, | ||
@Nullable LinkedHashMap<String, Integer> positionByName, | ||
boolean legacyModeEnabled) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering if it wouldn't make sense to drop legacy mode now, for Flink 1.13.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 I can open a PR for it
@@ -90,8 +88,6 @@ private boolean deepEquals0(Object e1, Object e2) { | |||
return deepEqualsArray(e1, e2); | |||
} else if (e1 instanceof Tuple && e2 instanceof Tuple) { | |||
return deepEqualsTuple((Tuple) e1, (Tuple) e2); | |||
} else if (e1 instanceof Row && e2 instanceof Row) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not needed anymore because the Row itself now does the deep equals, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the Row implementation supports that now. And if a custom implementation is still needed the checker allows to override it if necessary.
final Object value = external.getField(pos); | ||
genericRow.setField(pos, fieldConverters[pos].toInternalOrNull(value)); | ||
|
||
final Set<String> fieldNames = external.getFieldNames(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be interesting to see in benchmarks what the impact of this is, and of name-based field access in general.
Thanks for the review @aljoscha. I will prepare some benchmarks as the next step. |
I added a commit for with benchmarks in this PR. I will remove it again during merging. As expected, the name-based rows perform 35 % worse then position-based rows. But I think for the sake of code readability and not having to deal with indices, this is acceptable. Users can always fallback to position-based fields. I also added a comment to the JavaDoc about this. Here are the benchmark results:
Interestingly, the new refactoring of the serializer seems to slightly improve the position-based rows as well. This might be due to a modified |
Changes still look good to me! And thanks for taking the time to do the benchmark. And yes, it's really interesting that this also increases perf for positional rows... 😅 |
This adds a name-based field mode to the Row class. A row can operate in 3 different modes: name-based, position-based, or a hybrid of both when leaving the Flink runtime. It simplifies the handling of large rows (possibly with hundreds of fields) and will make it easier to switch between DataStream API and Table API. See the documentation of the Row class for more information. This closes apache#14420.
This adds a name-based field mode to the Row class. A row can operate in 3 different modes: name-based, position-based, or a hybrid of both when leaving the Flink runtime. It simplifies the handling of large rows (possibly with hundreds of fields) and will make it easier to switch between DataStream API and Table API. See the documentation of the Row class for more information. This closes apache#14420.
What is the purpose of the change
This adds a name-based field mode to the
Row
class. It simplifies the handling of large rows (possibly with hundreds of fields) and will make it easier to switch between DataStream API and Table API.See the documentation of Row:
Brief change log
Row
RowSerializer
RowRowConverter
Verifying this change
This change added tests and can be verified as follows:
RowTest
RowSerializerTest
DataStructureConverterTest
RowFunctionTest
Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: yesDocumentation