Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove ListingTable single_file option #8604

Merged
merged 3 commits into from
Dec 22, 2023

Conversation

devinjdangelo
Copy link
Contributor

Which issue does this PR close?

closes #8548

Rationale for this change

See issue and linked discussion

What changes are included in this PR?

Removes ListingTable single_file option and removes references in the documentation (also cleaned out some references in the docs to other removed options).

The single_file option is discarded if present to maintain backwards compatibility.

Are these changes tested?

Yes by existing tests

Are there any user-facing changes?

No (support for inserting to single_file listing table was already removed, this PR just cleans up docs and removes unneeded code).

@github-actions github-actions bot added the core Core datafusion crate label Dec 20, 2023
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @devinjdangelo -- it is great to remove the old options that are no longer necessary.

@@ -797,7 +787,7 @@ impl TableProvider for ListingTable {
// all of the data at the input is sink after execution finishes. See discussion for rationale:
// https://github.com/apache/arrow-datafusion/pull/7610#issuecomment-1728979918
unbounded_input: is_plan_streaming(&input)?,
single_file_output: self.options.single_file,
single_file_output: false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this got me thinking we can remove the option from the FileSInkConfig itself.

However I see how that this PR is removing the single_file option from the read side (the ListingTable) but it can also potentially be important for the write side (aka a COPY statement)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, COPY still uses single_file_output option of the FileSinkConfig struct. We could potentially update COPY to instead rely on inference based on the path rather than an explicit option. E.g. copy table to file.parquet vs copy table to folder/. Then single_file_output could be removed entirely.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 that certainly would be a better UX in my opinion (figuring out to use SINGLE_FILE in https://github.com/apache/arrow-datafusion/blob/98a5a4eb1ea1277f5fe001e1c7602b37592452f1/datafusion/sqllogictest/test_files/repartition_scan.slt#L35-L38 was actually quite painful for me)

Shall I propose a ticket to do it?

Copy link
Contributor

@alamb alamb Dec 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #8621 to track removing SINGLE_FILE_OUTPUT completely

.unwrap_or(false);

// Backwards compatibility (#8547)
// Backwards compatibility (#8547), discard deprecated options
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@metesynnada metesynnada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@alamb
Copy link
Contributor

alamb commented Dec 21, 2023

I merged up from main to resolve a conflict in this PR

@comphead
Copy link
Contributor

Is there any way to specify the number of output files?

@devinjdangelo
Copy link
Contributor Author

Is there any way to specify the number of output files?

Yes, datafusion.execution.minimum_parallel_output_files and datafusion.execution.soft_max_rows_per_output_file control the number of files output (when not writing hive partitions, otherwise 1 file is written per partition). These are session level configs and are currently not configurable per table or per COPY statement, but they could be added as statement level options.

@alamb alamb merged commit 39e9f41 into apache:main Dec 22, 2023
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core datafusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove ListingTable Single File Option
4 participants