Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support unprefixed config format options #9594

Closed

Conversation

tinfoil-knight
Copy link
Contributor

Which issue does this PR close?

Closes #9575

Rationale for this change

#9382 standardised write configuration but broke backwards compatibility for the format specific options of the COPY command.

Earlier users could have a query like this without prefixing all options:

COPY source_table TO 'test_files/scratch/copy/format_table/' (
    format parquet,
    compression snappy,
    'compression::col1' 'zstd(5)'
);

But after #9382, they'd need to prefix every format-specific options in this verbose way:

COPY source_table TO 'test_files/scratch/copy/format_table/' (
    format parquet,
    'parquet.compression' snappy,
    'parquet.compression::col1' 'zstd(5)'
);

This PR adds back the support for format-specific options that worked without providing the format prefix.

What changes are included in this PR?

This PR renames keys without the file_type/format prefix for file types supported by the COPY command for backwards compatibility.

Are these changes tested?

Are there any user-facing changes?

Yes. But the current documentation (https://arrow.apache.org/datafusion/user-guide/sql/write_options.html#available-options) lists the legacy format so this PR doesn't require a doc update.

If there are any breaking changes to public APIs, please add the api change label.

This PR restores breaking changes so I'm not adding the api change label. Please let me know if you think we should add the label nonetheless.

Copy link
Contributor

@devinjdangelo devinjdangelo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this @tinfoil-knight! The overall approach looks good and is well tested. I have proposed a few changes. Let me know what you think.

Comment on lines 858 to 877
match &file_type {
// Renames un-prefixed keys to support legacy format specific options
FileType::CSV | FileType::JSON | FileType::PARQUET => {
let prefix = format!("{}", file_type);
let keys_to_rename: Vec<_> = options
.keys()
.filter(|key| !key.starts_with(&prefix))
.cloned()
.collect();

for key in keys_to_rename {
if let Some(value) = options.remove(&key) {
let new_key = format!("{}.{}", prefix, key);
options.insert(new_key, value);
}
}
}
_ => {}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
match &file_type {
// Renames un-prefixed keys to support legacy format specific options
FileType::CSV | FileType::JSON | FileType::PARQUET => {
let prefix = format!("{}", file_type);
let keys_to_rename: Vec<_> = options
.keys()
.filter(|key| !key.starts_with(&prefix))
.cloned()
.collect();
for key in keys_to_rename {
if let Some(value) = options.remove(&key) {
let new_key = format!("{}.{}", prefix, key);
options.insert(new_key, value);
}
}
}
_ => {}
}
let options = match &file_type {
// Renames un-prefixed keys to support legacy format specific options
FileType::CSV | FileType::JSON | FileType::PARQUET => {
options
.into_iter()
.map(|(k, v)| {
// If config does not belong to any namespace, assume it is
// a legacy option and apply file_type namespace for backwards
// compatibility.
if !k.contains('.') {
let new_key = format!("{}.{}", file_type, k);
(new_key, v)
} else {
(k, v)
}
})
.collect()
}
_ => options,
};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall strategy here looks good, but I propose two changes:

  1. We can make the logic more concise by modifying options in a single pass with into_iter, map, collect. This also avoids clones.
  2. I think we should be a bit more conservative about the keys we modify. If the key has a namespace at all (rather than specifically the file format namespace), we can leave it as is. Downstream users may for example add a custom namespace and we wouldn't wan't this code to modify it unexpectedly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented both changes.

Converted the PR to draft for the while being due to #9594 (comment).

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @tinfoil-knight and @devinjdangelo

I think once @devinjdangelo 's comments are addressed this PR will be ready to go

cc @metesynnada @ozankabak and @mustafasrepo as you commented on the original PR

@@ -484,3 +484,35 @@ COPY (select col2, sum(col1) from source_table
# Copy from table with non literal
query error DataFusion error: SQL error: ParserError\("Expected ',' or '\)' after option definition, found: \+"\)
COPY source_table to '/tmp/table.parquet' (row_group_size 55 + 102);


# Legacy Format Options Support
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@ozankabak
Copy link
Contributor

Thank you @tinfoil-knight for this. We will do a quick follow-up finalizing the design in #9382. Let's get the follow-up in and then modify this/get this in afterwards for backwards compatibility.

@tinfoil-knight
Copy link
Contributor Author

We will do a quick follow-up finalizing the design in #9382. Let's get the follow-up in and then modify this/get this in afterwards for backwards compatibility.

Okay. Sounds good.

@tinfoil-knight tinfoil-knight marked this pull request as draft March 13, 2024 18:16
@devinjdangelo
Copy link
Contributor

@tinfoil-knight I think this should be unblocked now that #9382 is closed. If you have a moment to rebase/resolve any conflicts and CI failures, I think we can get this merged after a final review.

@alamb
Copy link
Contributor

alamb commented Mar 19, 2024

I think we should also consider holding the release for this PR (see #9682 ) - so if you need help finishing it up @tinfoil-knight please just let us know

@tinfoil-knight
Copy link
Contributor Author

I think we should also consider holding the release for this PR (see #9682 ) - so if you need help finishing it up @tinfoil-knight please just let us know

I'll fix the tests by tomorrow & will let you know if anything comes up.

@tinfoil-knight
Copy link
Contributor Author

tinfoil-knight commented Mar 20, 2024

@devinjdangelo I'm not sure if the original issue is valid anymore. After #9604, you can use formatting options without prefixing them with the format name. See this for an example:
https://github.com/apache/arrow-datafusion/blob/89efc4a7e06bd0295ca72dd6ec5fe987d1ac246b/datafusion/sqllogictest/test_files/copy.slt#L22-L32

I think this PR & the linked issue should be closed.

@alamb noted an issue here: #9604 (review) with the format prefix but IMO it's separate from the original issue.


I also found a CLI bug while working on this issue & I've filed: #9707 & pushed a PR for it separately.

@tinfoil-knight
Copy link
Contributor Author

Update:

#9716 is opened now for the format prefix issue.

@alamb I'm closing this PR since it doesn't seem relevant anymore. See the previous comment (#9594 (comment)) for more details.

Please close the issue: #9575 now as it's stale.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support backwards compatible / un prefixed config options (compression rather than parquet.compression)
4 participants