
[FLINK-24684][table-planner] Add to string cast rules using the new CastRule stack #17658

Closed
wants to merge 14 commits

Conversation

slinkydeveloper
Contributor

@slinkydeveloper slinkydeveloper commented Nov 3, 2021

What is the purpose of the change

This PR ports all the "to string" casting logic to the new CastRule stack, as discussed in https://issues.apache.org/jira/browse/FLINK-24684.

Brief change log

Major PR changes

  • Ported all cast-to-string logic to the new CastRule stack
  • Cleaned up the old to-string casting logic in ScalarOperatorGens
  • CastRuleUtils now provides a CodeWriter to simplify and organize Java code generation
  • Rewrote the existing rules using the new writer
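The CodeWriter idea can be illustrated with a minimal sketch. This is a hypothetical fluent API for assembling generated Java statements, not the actual CastRuleUtils class, whose methods may differ:

```java
import java.util.function.Consumer;

// Minimal sketch of a fluent code writer for assembling generated Java
// statements with indentation; the real CastRuleUtils.CodeWriter may differ.
public class CodeWriterSketch {
    private final StringBuilder code = new StringBuilder();
    private int indent = 0;

    // Append a single statement, terminated with a semicolon.
    public CodeWriterSketch stmt(String statement) {
        writeIndent();
        code.append(statement).append(";\n");
        return this;
    }

    // Append an if block; the body is written through a nested callback.
    public CodeWriterSketch ifStmt(String condition, Consumer<CodeWriterSketch> body) {
        writeIndent();
        code.append("if (").append(condition).append(") {\n");
        indent++;
        body.accept(this);
        indent--;
        writeIndent();
        code.append("}\n");
        return this;
    }

    private void writeIndent() {
        for (int i = 0; i < indent; i++) {
            code.append("    ");
        }
    }

    @Override
    public String toString() {
        return code.toString();
    }

    public static void main(String[] args) {
        String generated = new CodeWriterSketch()
                .stmt("builder.setLength(0)")
                .ifStmt("isNull$0", w -> w.stmt("result$1 = NULL_STRING"))
                .toString();
        System.out.println(generated);
    }
}
```

The fluent style keeps the rule implementations close in shape to the code they emit, which is what makes the generated-code comments below easy to cross-check.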

Included minor changes

  • Write a comment in the generated functions identifying which rule generated that specific code
  • Add logging in OperatorCodeGen to ease debugging of codegen issues
  • ExpressionTestBase now provides more verbose assertion error logging

Verifying this change

Each new rule is tested with new test cases in CastRulesTest, and they are also covered by the integration tests in CastFunctionsITCase.

In some tests I had to change null to NULL, because null values are now represented as an uppercase string.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@flinkbot
Collaborator

flinkbot commented Nov 3, 2021

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit aabf462 (Wed Nov 03 10:43:20 UTC 2021)

Warnings:

  • No documentation files were touched! Remember to keep the Flink docs up to date!

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@flinkbot
Collaborator

flinkbot commented Nov 3, 2021

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure to re-run the last Azure build

Contributor

@matriv matriv left a comment

Thank you @slinkydeveloper, great stuff!

I left some comments. Could you also add the missing integration tests in CastFunctionITCase or CastFunctionMiscITCase for the structured type -> string casts?

@slinkydeveloper
Contributor Author

@matriv regarding testing the rules in CastFunctionsITCase, I found out that the row, array and map rules don't work, because some earlier validation steps (SqlValidatorImpl and LogicalTypeCasts) consider the cast combination invalid. Because the goal of this PR is simply to port the logic from ScalarOperatorGens to the new rules, perhaps we should tackle the issue of "exposing" the rules in the context of https://issues.apache.org/jira/browse/FLINK-21456?

@matriv
Contributor

matriv commented Nov 3, 2021

@matriv regarding testing the rules in CastFunctionsITCase, I found out that the row, array and map rules don't work, because some earlier validation steps (SqlValidatorImpl and LogicalTypeCasts) consider the cast combination invalid. Because the goal of this PR is simply to port the logic from ScalarOperatorGens to the new rules, perhaps we should tackle the issue of "exposing" the rules in the context of https://issues.apache.org/jira/browse/FLINK-21456?

Yes of course, my mistake; I thought those casts worked end to end now.

Contributor

@matriv matriv left a comment

@slinkydeveloper Thanks a lot for adding those generated code examples. I think they are really useful for debugging, or for changing something in the future.

Contributor

@twalthr twalthr left a comment

Thanks @slinkydeveloper. This is a great PR. Very readable code with good utilities that we can generalize to common code gen utils in Java. But this is future work. I added some feedback.

    private static LogicalType sanitizeTargetType(
            ArrayType inputArrayType, ArrayType targetArrayType) {
        LogicalType innerTargetType = targetArrayType.getElementType();
        // TODO this seems rather a bug of the planner that generates/allows/doesn't sanitize
Contributor

See LogicalTypeCasts:

    private static boolean supportsCasting(
            LogicalType sourceType, LogicalType targetType, boolean allowExplicit) {
        // a NOT NULL type cannot store a NULL type
        // but it might be useful to cast explicitly with knowledge about the data
        if (sourceType.isNullable() && !targetType.isNullable() && !allowExplicit) {
            return false;
        }
        // ignore nullability during compare
        if (sourceType.copy(true).equals(targetType.copy(true))) {
            return true;
        }

For explicit casts, this is totally fine if the user knows the data.
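In other words, an implicit cast from a nullable to a NOT NULL type is rejected, while an explicit CAST is allowed if the user knows the data. A standalone sketch of just that nullability rule, with the remaining type checks elided:

```java
// Standalone sketch of the nullability rule quoted above from
// LogicalTypeCasts: implicit casts from a nullable source to a NOT NULL
// target are rejected, while explicit casts are allowed.
public class NullabilityCastCheck {
    static boolean supportsCasting(
            boolean sourceNullable, boolean targetNullable, boolean allowExplicit) {
        // a NOT NULL type cannot store a NULL type,
        // but explicit casts trust the user's knowledge of the data
        if (sourceNullable && !targetNullable && !allowExplicit) {
            return false;
        }
        return true; // remaining structural checks elided in this sketch
    }

    public static void main(String[] args) {
        System.out.println(supportsCasting(true, false, false)); // implicit: rejected
        System.out.println(supportsCasting(true, false, true));  // explicit: allowed
    }
}
```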

Contributor Author

Gotcha, I've modified the comment to explain why we need that logic.


/* Example generated code for ARRAY<INT>:

isNull$0 = _myInputIsNull;
Contributor

use <pre>?

Contributor Author

Why? This is not a Javadoc comment.


    // Write the comma
    if (fieldIndex != 0) {
        writer.stmt(methodCall(builderTerm, "append", strLiteral(",")));
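The generated code follows the usual separator pattern: append a comma before every element except the first. A plain-Java equivalent of what the writer emits, with element rendering simplified (the NULL literal reflects this PR's uppercase representation):

```java
// Plain-Java equivalent of the generated separator logic in the array cast
// rule: a comma is appended before every element except the first, and null
// elements are rendered as the uppercase NULL literal.
public class ArrayToString {
    static String arrayToString(Object[] elements) {
        StringBuilder builder = new StringBuilder("[");
        for (int fieldIndex = 0; fieldIndex < elements.length; fieldIndex++) {
            if (fieldIndex != 0) {
                builder.append(",");
            }
            builder.append(elements[fieldIndex] == null ? "NULL" : elements[fieldIndex].toString());
        }
        return builder.append("]").toString();
    }

    public static void main(String[] args) {
        System.out.println(arrayToString(new Object[] {1, null, 3})); // [1,NULL,3]
    }
}
```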
Contributor

use a space after the comma

Contributor Author

As for #17658 (comment), choosing which representation to use is out of scope of this PR. We have an issue open for that already: https://issues.apache.org/jira/browse/FLINK-17321

Contributor

A row is not a collection type. But let me create a bunch of issues to not forget about it.

Contributor

I opened FLINK-24802.

"delete",
0,
methodCall(builderTerm, "length")))
.stmt(methodCall(builderTerm, "append", strLiteral("(")));
Contributor

Maybe document why we chose this representation. I was wondering whether we should synchronize this with Row.toString, but maybe ( ) is OK to distinguish it from arrays and from rows, which use +I[] where we also have a change flag.

Contributor Author

Same as above

(inputLogicalType.is(LogicalTypeRoot.TIMESTAMP_WITH_LOCAL_TIME_ZONE))
? context.getSessionTimeZoneTerm()
: className(DateTimeUtils.class) + ".UTC_ZONE";
Contributor

Maybe use the old method for performance? I don't see a reason to declare and access a time zone for a type without a time zone.

Contributor Author

The old method was invoking the same method, defaulting the zone to UTC_ZONE. Note that we don't declare any variable for UTC_ZONE; we just access it statically. The variable is only declared for TIMESTAMP_LTZ.
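The distinction being described can be sketched as follows. This is an illustration using java.time rather than Flink's DateTimeUtils, with hypothetical names: TIMESTAMP_LTZ values are rendered in the session time zone, while zone-less TIMESTAMP values are rendered against a statically accessed UTC constant, so no per-record zone variable is needed for them:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Sketch: TIMESTAMP_LTZ is rendered in the session time zone, while a
// zone-less TIMESTAMP is rendered against a statically accessed UTC
// constant. Names and formatting pattern are illustrative, not Flink's.
public class TimestampRendering {
    static final ZoneId UTC_ZONE = ZoneId.of("UTC"); // accessed statically, never declared per record

    static String format(Instant instant, boolean isLtz, ZoneId sessionZone) {
        ZoneId zone = isLtz ? sessionZone : UTC_ZONE;
        return ZonedDateTime.ofInstant(instant, zone)
                .format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
    }

    public static void main(String[] args) {
        Instant ts = Instant.parse("2021-11-03T10:00:00Z");
        // zone-less TIMESTAMP: formatted against UTC
        System.out.println(format(ts, false, ZoneId.of("Europe/Berlin"))); // 2021-11-03 10:00:00
        // TIMESTAMP_LTZ: formatted in the session zone (Berlin is UTC+1 in November)
        System.out.println(format(ts, true, ZoneId.of("Europe/Berlin")));  // 2021-11-03 11:00:00
    }
}
```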

RAW(LocalDateTime.class, new LocalDateTimeSerializer()),
RawValueData.fromObject(
LocalDateTime.parse("2020-11-11T18:08:01.123")),
StringData.fromString("2020-11-11T18:08:01.123")),
Contributor

please also add tests for the NULL type, MULTISET, and structured types. Structured types have the speciality that they can either be backed by a POJO (in which case we call toString on them) or by Row, in which case I would use the RowData representation.

@slinkydeveloper slinkydeveloper force-pushed the FLINK-24684 branch 2 times, most recently from 2b2ea98 to ff2edf0 Compare November 8, 2021 08:18
Contributor

@twalthr twalthr left a comment

Thanks for the update @slinkydeveloper. I think the PR should be good in the next iteration. I thought about the changes in this PR again and I think we should use null as the string representation instead of NULL for now. The change from null to NULL should be hidden behind a feature flag that we can enable by default once all casting rules have been reworked. We should introduce a boolean option use-old-casting which is set to false by default in 1.15.
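The effect of the proposed flag can be sketched as follows. The option name use-old-casting comes from the suggestion above and is not final, and the real option would be declared through Flink's configuration machinery rather than a plain boolean:

```java
// Sketch of the proposed feature flag: when use-old-casting is true, the
// legacy lowercase "null" representation is kept; when false (the proposed
// 1.15 default), the new uppercase NULL representation applies. The option
// name and wiring are illustrative, taken from the review suggestion.
public class CastBehaviorFlag {
    static String nullLiteral(boolean useOldCasting) {
        return useOldCasting ? "null" : "NULL";
    }

    public static void main(String[] args) {
        boolean useOldCasting = false; // proposed default in 1.15
        System.out.println(nullLiteral(useOldCasting)); // NULL
        System.out.println(nullLiteral(true));          // null
    }
}
```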

Contributor

@twalthr twalthr left a comment

LGTM waiting for a green build.

…generated the casting

Signed-off-by: slinkydeveloper <francescoguard@gmail.com>
…astRule stack

Signed-off-by: slinkydeveloper <francescoguard@gmail.com>
…o reset the builder

Signed-off-by: slinkydeveloper <francescoguard@gmail.com>
@twalthr twalthr closed this in 43d8a4a Nov 8, 2021
@slinkydeveloper slinkydeveloper deleted the FLINK-24684 branch November 8, 2021 16:55
hililiwei pushed a commit to hililiwei/flink that referenced this pull request Nov 15, 2021
…astRule stack

Signed-off-by: slinkydeveloper <francescoguard@gmail.com>

This closes apache#17658.
niklassemmler pushed a commit to niklassemmler/flink that referenced this pull request Feb 3, 2022
…astRule stack

Signed-off-by: slinkydeveloper <francescoguard@gmail.com>

This closes apache#17658.