-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Fix: Do not normalize table names when deserializing from protobuf #18187
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This is my first PR in rust and datafusion, so please let me know if there's any style/design changes needed (I just realized I didn't run a linter either) |
We also want this fix to be applied to branch-48, which I've done in #18188. Let me know if this is not the correct way to do this and I can delete that PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just leaving some comments for my colleague :)
@drin I don't think we need the 48 backport, we can work around that for the time being (probably by updating iceberg-rust to the rev where this lands).
fe4d8de
to
35a08bc
Compare
Also, just for clarification, I changed |
This fixes a semantic error when reading a logical plan from protobuf. Existing behavior is to use the `relation` field of `ColumnRelation` message to construct a `TableReference`. However, the `relation` field is a string and `From<String> for TableReference` always calls parse_identifiers_normalized with `ignore_case: False`, which always normalizes the identifier to lower case. New behavior is to implement `From<protobuf::ColumnRelation> for TableReference` which calls a new method, `parse_str_normalized`, with `ignore_case: True`. Overall, if normalization occurs, it should happen prior to serialization to protobuf; thus, deserialization from protobuf should not normalize (if it is desirable, though, `parse_str_normalized` propagates its boolean parameter to `parse_identifiers_normalized` unlike `parse_str`). Issue: apache#18122
A new test case, `roundtrip_mixed_case_table_reference`, exercises a scenario where a logical plan containing a table reference with uppercase characters is roundtripped through protobuf and the deserialization side erroneously normalizes the table reference to lowercase. Issue: apache#18122
For parse_identifiers_normalized, the `ignore_case` parameter controls whether the parsing should be case-sensitive (ignore_case: true) or insensitive (ignore_case: false). The name of the parameter is counter-intuitive to the behavior, so this adds a clarifying comment for the method.
35a08bc
to
0daf927
Compare
I ran |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me 👍
/// ```no_rust | ||
/// [<catalog>, <schema>, table] | ||
/// ``` | ||
pub fn from_slice(parts: &[impl AsRef<str>]) -> Option<Self> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this meant to be used somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it was just in response to feedback from @colinmarc to be a flexible equivalent to from_vec
. I can remove it if you'd prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's not in use I'd prefer it to be removed 👍
/// ```no_rust | ||
/// [<catalog>, <schema>, table] | ||
/// ``` | ||
pub fn from_vec(mut parts: Vec<String>) -> Option<Self> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should keep it private until we have a need to expose it? Or would it be generally useful enough to users to make it public now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just made it public because that always makes sense to me. But this logic was only being done in parse_str previously, and if anyone ever calls parse_identifiers_normalized then this logic is the next step for getting a TableReference. whether that's generally useful, I would have no idea (too new to the ecosystem)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
imo given this PR is focused on fixing a proto bug, we should keep this private unless we need to expose it, to minimize our already somewhat large public API
Which issue does this PR close?
Closes #18122
Rationale for this change
Existing behavior is to use the
relation
field ofColumnRelation
message to construct aTableReference
(mod.rs#L146, mod.rs#L171). However, therelation
fieldis a string and
From<String> for TableReference
always callsparse_identifiers_normalized with
ignore_case: False
, which alwaysnormalizes the identifier to lower case (TableReference::parse_str).
For a description of the bug at a bit of a higher level, see #18122.
What changes are included in this PR?
This PR introduces the following:
From<protobuf::ColumnRelation>
andFrom<&protobuf::ColumnRelation>
forTableReference
.TryFrom<&protobuf::DFSchema>
forDFSchema
and inFrom<protobuf::Column>
forColumn
that correctly leads to the newFrom
impls forTableReference
to be invoked.TableReference::parse_str_normalized
, that parses an identifier without normalizing it, with some logic fromTableReference::parse_str
being refactored to accommodate code reuse.Are these changes tested?
Commit a355196 adds a new test case,
roundtrip_mixed_case_table_reference
, that tests the desired behavior.The existing behavior (without the fix in 0616df2 and with the extra line
println!("{}", server_logical_plan.display_indent_schema());
):With the fix implemented (0616df2):
Are there any user-facing changes?
None.