Skip to content

Conversation

@devinjdangelo
Copy link
Contributor

@devinjdangelo devinjdangelo commented Sep 3, 2023

Which issue does this PR close?

Closes #7442
Closes #7463 (this PR implements a different syntax than 7463 which turned out much easier to implement. Leaving 7463 draft open to show as possible alternative)

Rationale for this change

Extends syntax and allowed options for SQL statement options to configure parquet column level options (e.g. different compression for each possibly nested column).

What changes are included in this PR?

Implements new parsing utils and options for parquet column specific options. Example:

copy my_table
to my_file.parquet
(compression snappy
'compression::col1' 'zstd(5)',
'compression::col2.nested' 'zstd(10)'

The example defaults all columns to snappy compression and sets col1 to zstd level 5 and the nested column (col2.nested) to zstd level 10.

Using "::" to separate the setting from the column path is obviously rust syntax inspired. We could use any other separator here if there is a more natural one for SQL. Dots (.) would have been the natural choice, but those are being used for separating nested column paths here.

Are these changes tested?

Yes, added a new unit test to verify settings are actually set as expected.

Are there any user-facing changes?

New syntax for specifying column level options, but this PR is backward compatible/no breaking changes unlike 7463

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Sep 3, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me -- thank you @devinjdangelo 🙏

The only other thing I think is needed for this PR is to an end to end test showing this syntax being used (perhaps in copy.slt)

Also, we should add documentation for this syntax to dml.md, but I think that can be done in a follow on PR (I will file a ticket to track it).

Also again sorry for the delay in reviews

@devinjdangelo
Copy link
Contributor Author

This looks great to me -- thank you @devinjdangelo 🙏

The only other thing I think is needed for this PR is to an end to end test showing this syntax being used (perhaps in copy.slt)

Also, we should add documentation for this syntax to dml.md, but I think that can be done in a follow on PR (I will file a ticket to track it).

Also again sorry for the delay in reviews

Np, hope you had a good vacation! 🌴

I just pushed up an expanded test in copy.slt that uses all available options.

COPY source_table 
TO 'test_files/scratch/copy/table_with_options' 
(format parquet,
single_file_output false,
compression snappy,
'compression::col1' 'zstd(5)',
'compression::col2' snappy,
max_row_group_size 12345,
data_pagesize_limit 1234,
write_batch_size 1234,
writer_version 2.0,
dictionary_page_size_limit 123,
created_by 'DF copy.slt',
column_index_truncate_length 123,
data_page_row_count_limit 1234,
bloom_filter_enabled true,
'bloom_filter_enabled::col1' false,
'bloom_filter_fpp::col2' 0.456,
'bloom_filter_ndv::col2' 456,
encoding plain,
'encoding::col1' DELTA_BINARY_PACKED,
'dictionary_enabled::col2' true,
dictionary_enabled false,
statistics_enabled page,
'statistics_enabled::col2' none,
max_statistics_size 123,
bloom_filter_fpp 0.001,
bloom_filter_ndv 100
)

The only thing that bothers me a bit with the syntax is that you have to put column level options inside single quotes (i.e. 'compression::col1') otherwise the :: generates a syntax error I think coming from sqlparser-rs.

External error: query failed: DataFusion error: SQL error: ParserError("Expected string or numeric value, found: ::")

Would there be an easy way to relax the syntax requirements to allow :: without a string literal quote?

@alamb
Copy link
Contributor

alamb commented Sep 7, 2023

Would there be an easy way to relax the syntax requirements to allow :: without a string literal quote?

I am not sure -- we may have to look into updating sqlparser

@alamb
Copy link
Contributor

alamb commented Sep 7, 2023

Filed #7499 to track documentation of this feature

@alamb alamb merged commit ec1c8b8 into apache:main Sep 7, 2023
@andygrove andygrove added the enhancement New feature or request label Sep 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow Setting Column Specific Parquet Write Options via SQL StatementOptions

3 participants