-
Notifications
You must be signed in to change notification settings - Fork 13.8k
[FLINK-24419][table-planner] Trim to precision when casting to BINARY/VARBINARY #17919
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community Automated ChecksLast check on commit aa962bd (Fri Nov 26 03:25:12 UTC 2021) Warnings:
Mention the bot in a comment to re-run the automated checks. Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commandsThe @flinkbot bot supports the following commands:
|
} else if (targetLogicalType instanceof VarBinaryType) { | ||
targetLength = ((VarBinaryType) targetLogicalType).getLength(); | ||
} | ||
targetLength = Math.min(targetLength, inputLength); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove that, and instead below have a check like:
if (targetLength >= inputLength) {
return methodCall(inputTerm, "toBytes");
} else {
return staticCall(
Arrays.class, "copyOfRange", methodCall(inputTerm, "toBytes"), 0, targetLength);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @matriv , thanks for your review! I will update the code here.
@shenzhu Could please adjust the title of your PR to:
? |
87fd0c5
to
8e8aaa7
Compare
Sure, just updated PR and commit message, thank you! |
Hey @matriv , I updated this PR based on your review, would you mind taking a look at it when you have a moment? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @shenzhu I left a couple more comments, but overall the approach is not 100% correct, because it applies the trimming only when casting from string types. It should also apply it when casting between binary types themselves, i.e. a byte array with 10 bytes and type BINARY(10)
, casted to let's say a BINARY(5)
(or VARBINARY(5)
) should be trimmed to 5 bytes. So this logic should be applied in for all supported casts to BINARY/VARBINARY.
LogicalType inputLogicalType, | ||
LogicalType targetLogicalType) { | ||
return methodCall(inputTerm, "toBytes"); | ||
int inputLength = Integer.MAX_VALUE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use int inputLength = LogicalTypeChecks.getLength(inputLogicalType)
directly without instanceof
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks!
inputLength = ((VarCharType) inputLogicalType).getLength(); | ||
} | ||
|
||
int targetLength = Integer.MAX_VALUE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
Hey @matriv , thanks for your review! I checked the codebase and found currently we support casting from If I understand it correctly, we have two options for this task:
I'm a little prefer Option 1 because seems that's the direction the community is trying to move forward(?), what do you think about this? Thanks for your help! |
@shenzhu Thx again for your time and effort here. |
d182991
to
e7d585c
Compare
Hey @matriv , thanks for your feedback! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thank you @shenzhu !
import static org.apache.flink.table.planner.functions.casting.CastRuleUtils.staticCall; | ||
|
||
/** {@link LogicalTypeFamily#BINARY_STRING} to {@link LogicalTypeFamily#BINARY_STRING} cast rule. */ | ||
public class BinaryToBinaryCastRule |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to make it public
targetLength | ||
+ " >= " | ||
+ accessField(deserializedByteArrayTerm, "length"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you swap this condition as it's easier to read?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But why swap? To me at least, it's easier to say, if (condition) then <normal> else <doSomethingmore>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With swap I mean accessField(deserializedByteArrayTerm, "length") + " <= " + targetLength
, which IMO is more readable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More than adding the method CastRuleUtils#accessField
, I think it makes sense to just add a method like CastRuleUtils#accessArrayLength
. Can you also check if in the other rules the array length is used and replace it with CastRuleUtils#accessArrayLength
/* Example generated code for BINARY(3): | ||
byte[] deserializedByteArray$0 = result$2.toBytes(typeSerializer$5); | ||
if (deserializedByteArray$0 != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we need this null check, perhaps can you add a test case where the raw value is null and see if this null check is really needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your review! I think you are right, I printed the generated code and found this null check is not required
externalResult$0 = (byte[]) function_org$apache$flink$table$planner$functions$CastFunctionMiscITCase$IntegerToRaw$fd3c18d5f34fde99503159b4b5be6594
.eval(false ? null : ((java.lang.Integer)((int) 123456)));
isNull$2 = externalResult$0 == null;
result$2 = null;
if (!isNull$2) {
result$2 = (org.apache.flink.table.data.binary.BinaryRawValueData) converter$1.toInternalOrNull((byte[]) externalResult$0);
}
// --- Cast section generated by org.apache.flink.table.planner.functions.casting.RawToBinaryCastRule
isNull$3 = isNull$2;
if (!isNull$3) {
byte[] deserializedByteArray$0 = result$2.toBytes(typeSerializer$5);
if (deserializedByteArray$0 != null) {
if (deserializedByteArray$0.length <= 3) {
result$4 = deserializedByteArray$0;
} else {
result$4 = java.util.Arrays.copyOfRange(deserializedByteArray$0, 0, 3);
}
} else {
result$4 = null;
}
isNull$3 = result$4 == null;
} else {
result$4 = null;
}
// --- End cast section
int inputLength = LogicalTypeChecks.getLength(inputLogicalType); | ||
int targetLength = LogicalTypeChecks.getLength(targetLogicalType); | ||
|
||
if (targetLength >= inputLength) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Swap this condition
int inputLength = LogicalTypeChecks.getLength(inputLogicalType); | ||
int targetLength = LogicalTypeChecks.getLength(targetLogicalType); | ||
|
||
if (targetLength >= inputLength) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Swap this condition
int inputLength = LogicalTypeChecks.getLength(inputLogicalType); | ||
int targetLength = LogicalTypeChecks.getLength(targetLogicalType); | ||
|
||
if (targetLength >= inputLength) { | ||
return inputTerm; | ||
} else { | ||
return staticCall(Arrays.class, "copyOfRange", inputTerm, 0, targetLength); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately you can't assume that the input length is the same as the one defined in the type. In our codegen domain we can have a byte[]
of length 10 with type defined as VARBINARY(20)
or even BINARY(20)
. If you look carefully, even the RawToBinaryCastRule
you introduced behaves the same when casting to BINARY(20)
a raw which array has just size 10.
Hence you need the check at runtime to avoid an unnecessary copy, for example in this case:
- input
VARBINARY(30)
with valuebyte[].length == 10
- target
VARBINARY(20)
The first branch of the if is fine IMO, because it's reasonable to assume that if the input length is minor or equal to the target length, then it's fine to just return the same term. The second branch on the other hand should do a length check with the ternary operator and skip the copy in case the runtime length is already less or equal to the target length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx @slinkydeveloper! I totally missed that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I will update here to check during runtime
import static org.apache.flink.table.planner.functions.casting.CastRuleUtils.staticCall; | ||
|
||
/** {@link LogicalTypeRoot#RAW} to {@link LogicalTypeFamily#BINARY_STRING} cast rule. */ | ||
public class RawToBinaryCastRule extends AbstractNullAwareCodeGeneratorCastRule<Object, byte[]> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove public
int inputLength = LogicalTypeChecks.getLength(inputLogicalType); | ||
int targetLength = LogicalTypeChecks.getLength(targetLogicalType); | ||
|
||
if (targetLength >= inputLength) { | ||
return methodCall(inputTerm, "toBytes"); | ||
} else { | ||
return staticCall( | ||
Arrays.class, "copyOfRange", methodCall(inputTerm, "toBytes"), 0, targetLength); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above about the unnecessary copy, but because you invoke toBytes()
which is potentially expensive, convert this rule to a AbstractNullAwareCodeGeneratorCastRule
and save the toBytes
result in a local variable
Unless I'm mistaken, this PR is also fixing this issue https://issues.apache.org/jira/browse/FLINK-25051 correct? Can you double check @shenzhu? My understanding is that the existing codebase doesn't have an inverse BINARY -> RAW casting, right? |
Yep, true, I asked @shenzhu to do it (see previous comments). @shenzhu Could you please split your PR in 2 commits?
This is not supported yet: https://issues.apache.org/jira/browse/FLINK-24577 is still open |
bd5b627
to
56515bd
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have just some minor comments and then the PR looks good! Thank you for your contribution, it looks great!
org.apache.flink.connector.file.src.reader.TextLineFormat.createReader(org.apache.flink.configuration.Configuration, org.apache.flink.core.fs.FSDataInputStream): Returned leaf type org.apache.flink.connector.file.src.reader.StreamFormat$Reader does not satisfy: reside outside of package 'org.apache.flink..' or annotated with @Public or annotated with @PublicEvolving or annotated with @Deprecated | ||
org.apache.flink.connector.file.src.reader.TextLineFormat.createReader(org.apache.flink.configuration.Configuration, org.apache.flink.core.fs.FSDataInputStream): Returned leaf type org.apache.flink.connector.file.src.reader.TextLineFormat$Reader does not satisfy: reside outside of package 'org.apache.flink..' or annotated with @Public or annotated with @PublicEvolving or annotated with @Deprecated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can remove this since it's being addressed already https://issues.apache.org/jira/browse/FLINK-25150
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got thanks! I will remove these lines.
.fromCase(CHAR(3), "foo", new byte[] {102, 111}) | ||
.fromCase(VARCHAR(5), "Flink", new byte[] {70, 108}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test for the non trim case where input length < target length? Both for the "type case" and for the "runtime case"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your detailed review!
I added more test casts to cover "type case" and "runtime case" when converting to VARBINARY
, however, when converting to BINARY(3)
, seems Calcite 1.26 will pad the result to make sure it has the same length as target type, so converting CHAR(1)
with value f
to BINARY(3)
will still result in {102, 0, 0}
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shenzhu Yep, this is something we have to implement as well. I'll open a separate issue for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, in case of this CastFunctionITCase
, if we have padding it's totally fine, as calcite performs optimizations out of our control.
But i'm pretty sure that test case should work as described by me in CastRulesTest
, without padding
e356372
to
45b7547
Compare
605c0ab
to
0095d26
Compare
@shenzhu Again thx for your effort! Please rebase with master and then:
|
STRING(), | ||
fromString("Apache"), | ||
new byte[] {65, 112, 97, 99, 104, 101}), | ||
.fromCase(CHAR(3), StringData.fromString("foo"), new byte[] {102, 111}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use the fromCase
with the legacy behaviour flag and keep testing both with false and with true.
0095d26
to
28b096c
Compare
Sure, thanks for your feedback! I will update this PR. |
Hey @AHeise , would you mind taking a look at this PR when you have a moment? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx for addressing the comments!
I've left a couple of minor comments but otherwise LGTM
return className(clazz) + "." + fieldName; | ||
} | ||
|
||
static String accessArrayLength(String instanceTerm) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I'd prefer: arrayLength
without the access
.
.fromCase(CHAR(3), fromString("foo"), new byte[] {102, 111}) | ||
.fromCase(CHAR(1), fromString("f"), new byte[] {102}) | ||
.fromCase(CHAR(3), fromString("f"), new byte[] {102}) | ||
.fromCase(VARCHAR(5), fromString("Flink"), new byte[] {70, 108}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Here and below, I would use the last arg with false
to be more visible how the test behaves with legacy behaviour enabled and disabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Here and below, I would use the last arg with
false
to be more visible how the test behaves with legacy behaviour enabled and disabled.
Good point! Thanks for your review, I will update here.
28b096c
to
0a89d73
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, Thank you @shenzhu !
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM I will fix my minor comments while merging
.build()); | ||
} | ||
|
||
/* Example generated code for BINARY(@): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@
=> 2
String returnVariable, | ||
LogicalType inputLogicalType, | ||
LogicalType targetLogicalType) { | ||
// Get length of target |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unnecessary comment
LogicalType inputLogicalType, | ||
LogicalType targetLogicalType) { | ||
return methodCall(inputTerm, "toBytes"); | ||
// Get length of target |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unnecessary comment
0a89d73
to
7dca486
Compare
…RBINARY This closes apache#17919.
What is the purpose of the change
FLINK-24413: Fix trimming when casting to CHAR and VARCHAR
Brief change log
StringToBinaryCastRule.java
Verifying this change
Please make sure both new and modified tests in this PR follows the conventions defined in our code quality guide: https://flink.apache.org/contributing/code-style-and-quality-common.html#testing
(Please pick either of the following options)
This change is already covered by existing tests, such as
CastRulesTest.java
.Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: (no)Documentation