Describe the bug
SQL is case-insensitive language, but the way case-insensitivity is implemented for column identifiers in DataFusion often leads to non-intuitive behavior:
To Reproduce
Create test.json:
{"A": "a_upper", "B": "b_upper", "b": "b_lower"}
Run datafusion-cli:
❯ create external table test stored as json location 'test.json';
❯ select * from test;
+---------+---------+---------+
| A | B | b |
+---------+---------+---------+
| a_upper | b_upper | b_lower |
+---------+---------+---------+
❯ select A from test;
Schema error: No field named a. Valid fields are test."A", test."B", test.b.
❯ select b from test;
+---------+
| b |
+---------+
| b_lower |
+---------+
❯ select B from test;
+---------+
| b |
+---------+
| b_lower |
+---------+
❯ select "B" from test;
+---------+
| B |
+---------+
| b_upper |
+---------+
Expected behavior
In Spark:
- When I load json like
{"A": "a_upper"} both select a from test and select A from test will return the "A" column.
- When I save to Parquet - the schema will preserve the column's original case
- Loading
{"A": "a_upper", "B": "b_upper", "b": "b_lower"} however fails with duplicate column error - points to DF!
What I would expect as a user:
a and A return "A" column
b and B fail as ambiguous
"B" and "b" work, returning upper and lower case columns respectively
- The schema always preserves the original case of identifier
In other words:
- Column names always stored in their original form and never
.to_lower()'ed
- identifiers are matched in a case-insensitive way
- when there is an ambiguity - it can be resolved with quoted identifiers.
Additional context
Currently after switching our ingest from Spark to Datafusion our preprocessing code is full of "X" as x, "Y" as y just to restore case-insensitivity for downstream queries.
I could add an automatic step that lower-cases all columns, but I'd really like to preserve the original case in schemas for aesthetic reasons.
Describe the bug
SQL is case-insensitive language, but the way case-insensitivity is implemented for column identifiers in DataFusion often leads to non-intuitive behavior:
To Reproduce
Create
test.json:{"A": "a_upper", "B": "b_upper", "b": "b_lower"}Run
datafusion-cli:Expected behavior
In Spark:
{"A": "a_upper"}bothselect a from testandselect A from testwill return the "A" column.{"A": "a_upper", "B": "b_upper", "b": "b_lower"}however fails with duplicate column error - points to DF!What I would expect as a user:
aandAreturn "A" columnbandBfail as ambiguous"B"and"b"work, returning upper and lower case columns respectivelyIn other words:
.to_lower()'edAdditional context
Currently after switching our ingest from Spark to Datafusion our preprocessing code is full of
"X" as x, "Y" as yjust to restore case-insensitivity for downstream queries.I could add an automatic step that lower-cases all columns, but I'd really like to preserve the original case in schemas for aesthetic reasons.