@drin drin commented Oct 20, 2025

Which issue does this PR close?

Closes #18122

Rationale for this change

Existing behavior is to use the relation field of the ColumnRelation message to construct a TableReference (mod.rs#L146, mod.rs#L171). However, the relation field
is a string, and From<String> for TableReference always calls
parse_identifiers_normalized with ignore_case: false, which
normalizes the identifier to lowercase (TableReference::parse_str).

For a description of the bug at a bit of a higher level, see #18122.
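
To make the effect of the flag concrete, here is a minimal, std-only sketch. It is not DataFusion's parser: `normalize_parts_sketch` is a hypothetical helper that only splits on `.` and ignores quoting rules. It mirrors just the flag semantics described above, where `ignore_case: false` lowercases and `ignore_case: true` leaves the text alone.

```rust
/// Hypothetical, simplified model of identifier normalization.
/// Assumption: splitting on '.' is enough for illustration; the real
/// parser also handles quoted identifiers, which this sketch ignores.
fn normalize_parts_sketch(s: &str, ignore_case: bool) -> Vec<String> {
    s.split('.')
        .map(|part| {
            if ignore_case {
                // Case is preserved as written.
                part.to_string()
            } else {
                // Case is folded to lowercase (the behavior behind the bug).
                part.to_ascii_lowercase()
            }
        })
        .collect()
}
```

Under these assumptions, `normalize_parts_sketch("TestData", false)` yields `testdata`, while passing `true` keeps `TestData`, which matches the mismatch shown in the failing test output.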

What changes are included in this PR?

This PR introduces the following:

  1. Implementations of From<protobuf::ColumnRelation> and From<&protobuf::ColumnRelation> for
    TableReference.
  2. Updated logic in TryFrom<&protobuf::DFSchema> for DFSchema and in From<protobuf::Column> for Column so that the new From impls for TableReference are invoked.
  3. A new method, TableReference::parse_str_normalized, that parses an identifier without normalizing it, with some logic from TableReference::parse_str refactored for reuse.
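
The shape of item 1 can be sketched roughly as follows. The types here are hypothetical stand-ins, not the real `protobuf::ColumnRelation` or `TableReference`, and `parse_lowercased` / `parse_preserving_case` are illustrative names that only model the old and new paths.

```rust
/// Hypothetical stand-in for the protobuf `ColumnRelation` message.
#[derive(Debug, Clone)]
struct ColumnRelation {
    relation: String,
}

/// Hypothetical, bare-table-only stand-in for `TableReference`.
#[derive(Debug, Clone, PartialEq)]
struct TableReference {
    table: String,
}

impl TableReference {
    /// Models the old path: the name is lowercased while parsing.
    fn parse_lowercased(s: &str) -> Self {
        Self { table: s.to_ascii_lowercase() }
    }

    /// Models the new path: the name is kept exactly as serialized.
    fn parse_preserving_case(s: &str) -> Self {
        Self { table: s.to_string() }
    }
}

/// Borrowed conversion: deserialization does not normalize.
impl From<&ColumnRelation> for TableReference {
    fn from(rel: &ColumnRelation) -> Self {
        TableReference::parse_preserving_case(&rel.relation)
    }
}

/// Owned conversion delegates to the borrowed one.
impl From<ColumnRelation> for TableReference {
    fn from(rel: ColumnRelation) -> Self {
        TableReference::from(&rel)
    }
}
```

With a dedicated conversion from the message type, callers go through `From<&ColumnRelation>` rather than the generic string conversion, so a name such as `TestData` survives the round trip in this model.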

Are these changes tested?

Commit a355196 adds a new test case, roundtrip_mixed_case_table_reference, that tests the desired behavior.

The existing behavior (without the fix in 0616df2 and with the extra line println!("{}", server_logical_plan.display_indent_schema());):

cargo test "roundtrip_mixed_case_table_reference" --test proto_integration -- --nocapture
   Compiling datafusion-proto v48.0.1 (/Users/aldrinm/code/bauplanlabs/datafusion/octalene-datafusion/datafusion/proto)
    Finished `test` profile [unoptimized + debuginfo] target(s) in 1.56s
     Running tests/proto_integration.rs (target/debug/deps/proto_integration-775454d70979734b)

running 1 test

thread 'cases::roundtrip_logical_plan::roundtrip_mixed_case_table_reference' panicked at datafusion/proto/tests/cases/roundtrip_logical_plan.rs:2690:5:
assertion `left == right` failed
  left: "Filter: TestData.a = Int64(1) [a:Int64;N]\n  TableScan: TestData projection=[a], partial_filters=[TestData.a = Int64(1)] [a:Int64;N]"
 right: "Filter: testdata.a = Int64(1) [a:Int64;N]\n  TableScan: TestData projection=[a], partial_filters=[testdata.a = Int64(1)] [a:Int64;N]"
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
test cases::roundtrip_logical_plan::roundtrip_mixed_case_table_reference ... FAILED

failures:

failures:
    cases::roundtrip_logical_plan::roundtrip_mixed_case_table_reference

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 112 filtered out; finished in 0.09s

With the fix implemented (0616df2):

running 1 test
Filter: TestData.a = Int64(1) [a:Int64;N]
  TableScan: TestData projection=[a], partial_filters=[TestData.a = Int64(1)] [a:Int64;N]
test cases::roundtrip_logical_plan::roundtrip_mixed_case_table_reference ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 112 filtered out; finished in 0.06s

Are there any user-facing changes?

None.

@github-actions github-actions bot added common Related to common crate proto Related to proto crate labels Oct 20, 2025
Author

drin commented Oct 20, 2025

This is my first PR in Rust and DataFusion, so please let me know if there are any style/design changes needed (I just realized I didn't run a linter either).

@drin drin changed the title Fix: Do not normalize table names when deserializing from protobuf Fix: Do not normalize table names when deserializing from protobuf (ported from branch-48) Oct 20, 2025
Author

drin commented Oct 20, 2025

We also want this fix to be applied to branch-48, which I've done in #18188. Let me know if this is not the correct way to do this and I can delete that PR.

Contributor

@colinmarc colinmarc left a comment

Just leaving some comments for my colleague :)

@drin I don't think we need the 48 backport, we can work around that for the time being (probably by updating iceberg-rust to the rev where this lands).

@drin drin changed the title Fix: Do not normalize table names when deserializing from protobuf (ported from branch-48) Fix: Do not normalize table names when deserializing from protobuf Oct 21, 2025
@drin drin force-pushed the octalene.fix-normalized-tablename-main branch 3 times, most recently from fe4d8de to 35a08bc Compare October 21, 2025 17:33
Author

drin commented Oct 21, 2025

Also, just for clarification: I changed from_vec to use pop instead of remove, since shifting elements every time seemed unnecessary. It also reads more consistently to me this way: every extra part is an extra prefix, and the table name is always the last element.
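
A rough sketch of that pop-based approach, using a hypothetical `TableRef` enum and `table_ref_from_vec` function rather than the actual from_vec in this PR:

```rust
/// Hypothetical stand-in for the three shapes of a table reference.
#[derive(Debug, PartialEq)]
enum TableRef {
    Bare { table: String },
    Partial { schema: String, table: String },
    Full { catalog: String, schema: String, table: String },
}

/// Builds a reference from `[<catalog>, <schema>, table]` by popping from
/// the end, so no elements are shifted (unlike `Vec::remove(0)`).
fn table_ref_from_vec(mut parts: Vec<String>) -> Option<TableRef> {
    // The table name is always the last element; an empty input yields None.
    let table = parts.pop()?;
    // Anything left is a prefix: first the schema, then the catalog.
    let schema = parts.pop();
    let catalog = parts.pop();
    // More than three parts is not a valid reference in this sketch.
    if !parts.is_empty() {
        return None;
    }
    match (catalog, schema) {
        (Some(catalog), Some(schema)) => Some(TableRef::Full { catalog, schema, table }),
        (None, Some(schema)) => Some(TableRef::Partial { schema, table }),
        (None, None) => Some(TableRef::Bare { table }),
        // Cannot occur: the catalog is only popped after the schema.
        (Some(_), None) => None,
    }
}
```

Each `pop` is O(1), whereas removing from the front shifts the remaining elements. For at most three parts the cost difference is negligible, so the main benefit is the readability point made above.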

drin added 4 commits October 21, 2025 11:21
This fixes a semantic error when reading a logical plan from protobuf.
Existing behavior is to use the `relation` field of the `ColumnRelation`
message to construct a `TableReference`. However, the `relation` field
is a string and `From<String> for TableReference` always calls
parse_identifiers_normalized with `ignore_case: false`, which
normalizes the identifier to lowercase.

New behavior is to implement `From<protobuf::ColumnRelation> for
TableReference` which calls a new method, `parse_str_normalized`, with
`ignore_case: true`.

Overall, if normalization occurs, it should happen prior to
serialization to protobuf; thus, deserialization from protobuf should
not normalize (if it is desirable, though, `parse_str_normalized`
propagates its boolean parameter to `parse_identifiers_normalized`
unlike `parse_str`).

Issue: apache#18122

A new test case, `roundtrip_mixed_case_table_reference`, exercises a
scenario where a logical plan containing a table reference with
uppercase characters is roundtripped through protobuf and the
deserialization side erroneously normalizes the table reference to
lowercase.

Issue: apache#18122

For parse_identifiers_normalized, the `ignore_case` parameter controls
whether the parsing should be case-sensitive (ignore_case: true) or
insensitive (ignore_case: false). The name of the parameter is
counter-intuitive to the behavior, so this adds a clarifying comment for
the method.
@drin drin force-pushed the octalene.fix-normalized-tablename-main branch from 35a08bc to 0daf927 Compare October 21, 2025 18:22
Author

drin commented Oct 21, 2025

I ran dev/linter.sh for formatting. I added an extra commit for miscellaneous changes to satisfy the linter (unrelated to the core of this PR).

Contributor

@Jefffrey Jefffrey left a comment

Makes sense to me 👍

/// ```no_rust
/// [<catalog>, <schema>, table]
/// ```
pub fn from_slice(parts: &[impl AsRef<str>]) -> Option<Self> {
Contributor

Is this meant to be used somewhere?

Author

It was added in response to feedback from @colinmarc, as a more flexible equivalent to from_vec. I can remove it if you'd prefer.

Contributor

If it's not in use I'd prefer it to be removed 👍

/// ```no_rust
/// [<catalog>, <schema>, table]
/// ```
pub fn from_vec(mut parts: Vec<String>) -> Option<Self> {
Contributor

Maybe we should keep it private until we have a need to expose it? Or would it be generally useful enough to users to make it public now?

Author

I just made it public because that always makes sense to me. But this logic was previously only done in parse_str, and if anyone ever calls parse_identifiers_normalized, then this logic is the next step for getting a TableReference. Whether that's generally useful, I have no idea (too new to the ecosystem).

Contributor

imo given this PR is focused on fixing a proto bug, we should keep this private unless we need to expose it, to minimize our already somewhat large public API

Labels

common Related to common crate proto Related to proto crate

Development

Successfully merging this pull request may close these issues.

Table names are normalized when roundtripping through protobuf

3 participants