Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panic when reading from partitioned datasets with columns that have ' in them #9269

Open
alamb opened this issue Feb 19, 2024 · 2 comments
Open
Labels
bug Something isn't working

Comments

@alamb
Copy link
Contributor

alamb commented Feb 19, 2024

Describe the bug

There is a bug when reading from partitioned tables that have commas in their names

Here is the test
https://github.com/apache/arrow-datafusion/blob/b2a04519da97c2ff81789ef41dd652870794a73a/datafusion/sqllogictest/test_files/copy.slt#L109

To Reproduce

Run this script

-- create a table with quotes in the column names
create table test ("'test'" varchar, "'test2'" varchar, "'test3'" varchar);
insert into test VALUES ('a', 'x', 'aa'), ('b','y', 'bb'), ('c', 'z', 'cc');
copy test to '/tmp/escape_quote' (format csv, partition_by '''test2'',''test3''');

-- read back from the table
CREATE EXTERNAL TABLE validate_partitioned_escape_quote STORED AS CSV
LOCATION '/tmp/escape_quote/' PARTITIONED BY ("'test2'", "'test3'");

-- This panics
select * from validate_partitioned_escape_quote;

Here is an example:

-- create a table with quotes in the column names
create table test ("'test'" varchar, "'test2'" varchar, "'test3'" varchar);
insert into test VALUES ('a', 'x', 'aa'), ('b','y', 'bb'), ('c', 'z', 'cc');
copy test to '/tmp/escape_quote' (format csv, partition_by '''test2'',''test3''');

0 rows in set. Query took 0.008 seconds.

+-------+
| count |
+-------+
| 3     |
+-------+
1 row in set. Query took 0.009 seconds.

+-------+
| count |
+-------+
| 3     |
+-------+
1 row in set. Query took 0.029 seconds.

❯ -- read back from the table
CREATE EXTERNAL TABLE validate_partitioned_escape_quote STORED AS CSV
LOCATION '/tmp/escape_quote/' PARTITIONED BY ("'test2'", "'test3'");

0 rows in set. Query took 0.004 seconds.

❯ -- This panics
select * from validate_partitioned_escape_quote;

thread 'thread 'tokio-runtime-workertokio-runtime-worker' panicked at ' panicked at /Users/andrewlamb/Software/arrow-datafusion/datafusion/core/src/datasource/physical_plan/file_scan_config.rs/Users/andrewlamb/Software/arrow-datafusion/datafusion/core/src/datasource/physical_plan/file_scan_config.rs::248:thread '54248:
:tokio-runtime-workerindex out of bounds: the len is 0 but the index is 054' panicked at
/Users/andrewlamb/Software/arrow-datafusion/datafusion/core/src/datasource/physical_plan/file_scan_config.rs:248:
:index out of bounds: the len is 0 but the index is 054
:
index out of bounds: the len is 0 but the index is 0
stack backtrace:
   0: rust_begin_unwind
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:72:14
   2: core::panicking::panic_bounds_check
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/core/src/panicking.rs:208:5
   3: datafusion::datasource::physical_plan::file_scan_config::PartitionColumnProjector::project
   4: <datafusion::datasource::physical_plan::file_stream::FileStream<F> as futures_core::stream::Stream>::poll_next
   5: datafusion_physical_plan::stream::RecordBatchReceiverStreamBuilder::run_input::{{closure}}
   6: tokio::runtime::task::core::Core<T,S>::poll
   7: tokio::runtime::task::harness::Harness<T,S>::poll
   8: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
   9: tokio::runtime::scheduler::multi_thread::worker::Context::run
  10: tokio::runtime::context::runtime::enter_runtime
  11: tokio::runtime::scheduler::multi_thread::worker::run
  12: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
  13: tokio::runtime::task::core::Core<T,S>::poll
  14: tokio::runtime::task::harness::Harness<T,S>::poll
  15: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Expected behavior

Note the data is written correctly

andrewlamb@Andrews-MacBook-Pro:~/Software/influxdb_iox$ find /tmp/escape_quote
/tmp/escape_quote
/tmp/escape_quote/'test2'=x
/tmp/escape_quote/'test2'=x/'test3'=aa
/tmp/escape_quote/'test2'=x/'test3'=aa/3zMw255TXFQxId14.csv
/tmp/escape_quote/'test2'=y
/tmp/escape_quote/'test2'=y/'test3'=bb
/tmp/escape_quote/'test2'=y/'test3'=bb/3zMw255TXFQxId14.csv
/tmp/escape_quote/'test2'=z
/tmp/escape_quote/'test2'=z/'test3'=cc
/tmp/escape_quote/'test2'=z/'test3'=cc/3zMw255TXFQxId14.csv
andrewlamb@Andrews-MacBook-Pro:~/Software/influxdb_iox$ cat /tmp/escape_quote/\'test2\'\=x/\'test3\'\=aa/3zMw255TXFQxId14.csv
'test'
a

Additional context

@devinjdangelo found this in #9240

@alamb alamb added the bug Something isn't working label Feb 19, 2024
alamb added a commit that referenced this issue Feb 19, 2024
@devinjdangelo
Copy link
Contributor

devinjdangelo commented Feb 19, 2024

I tried searching documentation of various engines to see if ' is allowed in partition columns. I didn't find anything concrete.

However, I tried the equivalent example in DuckDB and it does work (see below). It would probably be best to tighten up our parsing of non standard column names. It may be easier to make this robust by extending sqlparser-rs upstream. The recent discussion on the mailing list is relevant https://lists.apache.org/thread/q80j49poyg99x2c01900312qz7ps9wgp

devinjd@devinjd$ ./duckdb 
v0.10.0 20b1486d11
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D create table test ("'test'" varchar, "'test2'" varchar, "'test3'" varchar);
D insert into test VALUES ('a', 'x', 'aa'), ('b','y', 'bb'), ('c', 'z', 'cc');
D select * from test;
┌─────────┬─────────┬─────────┐
│ 'test''test2''test3' │
│ varchar │ varchar │ varchar │
├─────────┼─────────┼─────────┤
│ a       │ x       │ aa      │
│ b       │ y       │ bb      │
│ c       │ z       │ cc      │
└─────────┴─────────┴─────────┘
D copy test to '/tmp/escape_quote' (format csv, partition_by ('''test2''','''test3'''));
D select * from read_csv('/tmp/escape_quote/*/*/*.csv', hive_partitioning=1, header=true);
┌─────────┬─────────┬─────────┐
│ 'test''test2''test3' │
│ varchar │ varchar │ varchar │
├─────────┼─────────┼─────────┤
│ a       │ x       │ aa      │
│ b       │ y       │ bb      │
│ c       │ z       │ cc      │
└─────────┴─────────┴─────────┘

andygrove pushed a commit that referenced this issue Feb 21, 2024
@Jefffrey
Copy link
Contributor

Jefffrey commented Apr 5, 2024

Another case:

DataFusion CLI v37.0.0
❯ create external table test123(a string, ```a=b``` string) stored as parquet location '/tmp/test123/' partitioned by (`a=b`);
0 row(s) fetched.
Elapsed 0.002 seconds.

❯ insert into test123 select 'a', 'b';
+-------+
| count |
+-------+
| 1     |
+-------+
1 row(s) fetched.
Elapsed 0.007 seconds.

❯ select * from test123;
thread 'main' panicked at /home/jeffrey/Code/arrow-datafusion/datafusion/core/src/datasource/physical_plan/file_scan_config.rs:261:54:
index out of bounds: the len is 0 but the index is 0
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
datafusion-cli$

Files exist on disk:

datafusion-cli$ ll /tmp/test123/
total 0
drwxr-xr-x   2 jeffrey jeffrey   60 Apr  5 20:43 '%60a=b%60=b'/
drwxrwxrwt 119 root    root    2.6K Apr  5 20:43  ../
drwxr-xr-x   3 jeffrey jeffrey   60 Apr  5 20:43  ./
datafusion-cli$ ll /tmp/test123/\%60a=b\%60=b/
total 4.0K
-rw-r--r-- 1 jeffrey jeffrey 282 Apr  5 20:43 33T7kTcVyecaVk07.parquet
drwxr-xr-x 3 jeffrey jeffrey  60 Apr  5 20:43 ../
drwxr-xr-x 2 jeffrey jeffrey  60 Apr  5 20:43 ./
datafusion-cli$

Note I had to get cheeky with the column and partition names, as just this didn't work:

DataFusion CLI v37.0.0
❯ create external table test123(a string, `a=b` string) stored as parquet location '/tmp/test123/' partitioned by (`a=b`);
Arrow error: Schema error: Unable to get field named "`a=b`". Valid fields: ["a", "a=b"]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants