Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conditionally allow to keep partition_by columns when using PARTITIONED BY #10971

Closed
hveiga opened this issue Jun 18, 2024 · 3 comments · Fixed by #11107
Closed

Conditionally allow to keep partition_by columns when using PARTITIONED BY #10971

hveiga opened this issue Jun 18, 2024 · 3 comments · Fixed by #11107
Labels
enhancement New feature or request

Comments

@hveiga
Copy link
Contributor

hveiga commented Jun 18, 2024

Is your feature request related to a problem or challenge?

I am using Datafusion to partition some data stored in parquet files into a different set of parquet files. I would like those newly created files to contain the columns I am partitioning by, however currently the column gets removed as it becomes part of the file directory structure. Something like:

COPY (SELECT col1, col2, col3, col4 FROM my_external_table) TO '/output' STORED AS PARQUET PARTITIONED BY (col1) OPTIONS (compression 'zstd(2)');

/output/col1=val1/some_random_file_name.parquet
/output/col1=val2/some_random_file_name.parquet
/output/col1=val3/some_random_file_name.parquet

some_random_file_name.parquet will not contain the column col1. I would like to keep it as it might be needed later for other use cases.

Describe the solution you'd like

I would like to have a flag as part of the OPTIONS of the COPY statement to conditionally allow the column to remain in the partitioned files. For example: keep_partition_by_columns, set to false by default.

Describe alternatives you've considered

None. I don't think there is an alternative at the moment.

Additional context

Related discussion: #10962

I also don't know if this might have implications when reading a hive-partitioned directory structure as you would have a given column in the parquet files and also as part of the directory structure, but it's worth pointing it out in case there might be a collision.

@hveiga hveiga added the enhancement New feature or request label Jun 18, 2024
@alamb
Copy link
Contributor

alamb commented Jun 18, 2024

Seems like a good idea to me

@hveiga
Copy link
Contributor Author

hveiga commented Jun 20, 2024

I spent some time looking into how to implement this. I have a very basic change that I think would bring the functionality:

main...hveiga:datafusion:issue-10971

I am looking for guidance on two aspects:

  • Where is there best place in the code base to add a test case for this?
  • If we want to allow to pass this flag as part of the COPY command (ie COPY (...) TO (...) STORED AS FILE_FORMAT (OPTIONS keep_partition_by_columns true)), what would be the right way to pass this flag to hive_style_partitions_demuxer?

I see that sql/src/parser.rs can parse the option and keep it the CopyTo struct under options, and then those get used to configure the different FormatOptions for the different output formats. This particular flag would apply to all formats I believe. Should I add it as another config to TableOptions ?

Thanks!

@devinjdangelo
Copy link
Contributor

Where is there best place in the code base to add a test case for this?

sqllogictests are a good place to start. You can do a COPY with partitions and the new option toggled, then create a table on just a single partition path, and do a select that verifies the partition column is in the file.

If we want to allow to pass this flag as part of the COPY command (ie COPY (...) TO (...) STORED AS FILE_FORMAT (OPTIONS keep_partition_by_columns true)), what would be the right way to pass this flag to hive_style_partitions_demuxer?

You could check for the new option in source_option_tuples and then update FileSinkConfig to hold that option and flow it through to the relevant functions. The relevant physical planner code is here

let config = FileSinkConfig {
.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
3 participants