Make COPY TO align with CREATE EXTERNAL TABLE #9604

metesynnada · 2024-03-14T07:58:46Z

Which issue does this PR close?

Closes #9369.

Rationale for this change

Changing

OPTIONS (
    format X,
    X.foo.bar baz
)

into

STORED AS X
OPTIONS (
    format.foo.bar baz
)
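
As an illustration of the renaming above, here is a minimal sketch of how an old-style key could be mapped to the new `format.` prefix. The function name `to_format_prefixed` is hypothetical and is not part of DataFusion; it only mirrors the `X.foo.bar` to `format.foo.bar` pattern shown in the two snippets.

```rust
/// Hypothetical helper (not a DataFusion API): rewrites an old-style option
/// key such as `parquet.compression` into the `format.`-prefixed form
/// `format.compression`. Keys that do not start with the given format name
/// are returned unchanged.
fn to_format_prefixed(format_name: &str, key: &str) -> String {
    // Old-style keys were prefixed with the format name followed by a dot.
    let old_prefix = format!("{}.", format_name.to_lowercase());
    let lower = key.to_lowercase();
    match lower.strip_prefix(old_prefix.as_str()) {
        // Replace the format-specific prefix with the shared `format.` prefix.
        Some(rest) => format!("format.{}", rest),
        // Not a format-specific key: leave it as written.
        None => key.to_string(),
    }
}
```

Under these assumptions, `to_format_prefixed("parquet", "parquet.compression")` yields `format.compression`, while a key without the format prefix, such as `partition_by`, passes through untouched.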

What changes are included in this PR?

"format" prefix is defined for all formats.
COPY TO statement is now closer to CREATE EXTERNAL TABLE syntax.

More talk on synnada-ai#10

Are these changes tested?

Are there any user-facing changes?

devinjdangelo

Overall, I think the PR looks good, but I have some concerns about the syntax we are dropping. Most of all, I think queries like this should continue to work:

COPY table to 'file.csv'

COPY is useful on the command line for quickly moving files around, and this syntax is more compact and intuitive than

COPY table to 'file.csv' STORED AS CSV

So, I think we should continue supporting the ability to infer the format when copying to a single file with an unambiguous extension.
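
A rough sketch of the inference being asked for, assuming an explicit STORED AS value takes precedence and the extension is only a fallback. The function `resolve_format` and its list of recognised extensions are illustrative assumptions, not the code in this PR.

```rust
use std::path::Path;

/// Hypothetical sketch (not the PR's implementation): choose the output
/// format from an explicit `STORED AS` value when one is given, otherwise
/// fall back to the extension of the target path.
fn resolve_format(stored_as: Option<&str>, target: &str) -> Option<String> {
    // An explicit STORED AS always wins.
    if let Some(explicit) = stored_as {
        return Some(explicit.to_lowercase());
    }
    // No extension (for example a directory target) means nothing to infer.
    let ext = Path::new(target).extension()?.to_str()?.to_lowercase();
    // Assumed set of unambiguous extensions; anything else is rejected.
    match ext.as_str() {
        "csv" | "json" | "parquet" => Some(ext),
        _ => None,
    }
}
```

With this shape, `COPY table to 'file.csv'` resolves to `csv` without any keyword, while a directory target such as `'partitioned_csv/'` returns `None` and would still need STORED AS.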

devinjdangelo · 2024-03-14T19:41:15Z

datafusion/sql/tests/sql_integration.rs

@@ -397,7 +397,7 @@ CopyTo: format=csv output_url=output.csv options: ()

 #[test]
 fn plan_explain_copy_to() {
-    let sql = "EXPLAIN COPY test_decimal to 'output.csv'";
+    let sql = "EXPLAIN COPY test_decimal to 'output.csv' STORED AS CSV";


I think we should continue to support the original syntax here, which is, in my opinion, much clearer and more concise.

I agree with this; I made STORED AS optional.

devinjdangelo · 2024-03-14T19:42:52Z

datafusion/sqllogictest/test_files/clickbench.slt

@@ -23,7 +23,7 @@
 # create.sql came from
 # https://github.com/ClickHouse/ClickBench/blob/8b9e3aa05ea18afa427f14909ddc678b8ef0d5e6/datafusion/create.sql
 # Data file made with DuckDB:
-# COPY (SELECT * FROM 'hits.parquet' LIMIT 10) TO 'clickbench_hits_10.parquet' (FORMAT PARQUET);
+# COPY (SELECT * FROM 'hits.parquet' LIMIT 10) TO 'clickbench_hits_10.parquet' STORED AS PARQUET;


Here I think we should also maintain backwards compatibility. It is nice to have a dedicated keyword for the often-required format option, but I think users' queries in the old syntax should continue working.

I firmly believe that having a single way to specify the format is the most logical and maintainable route. Implementing a breaking change therefore aligns better with our goals for clarity and efficiency.

For this particular highlight, I hadn't read the DuckDB reminder; that is why I changed it back.

I agree that supporting the DuckDB style would be nice, but this PR's consistency is also nice.

I wonder if it would be possible (as a follow-on PR) to add some special-case backwards compatibility code to support the (FORMAT PARQUET) style? It seems to me like this could be a small amount of additional, localized code, and then, going forward, we could use the unified approach in this PR by default?
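
One way such localized compatibility code might look, sketched under the assumption that the parser has already collected the parenthesised options as key/value pairs. The function `split_legacy_format` is hypothetical and does not describe what the follow-on PR does.

```rust
/// Hypothetical sketch (not DataFusion code): pull a legacy `format <X>`
/// entry out of the option list and treat it as if the user had written
/// `STORED AS <X>`. All other options are returned in their original order.
fn split_legacy_format(
    stored_as: Option<String>,
    options: Vec<(String, String)>,
) -> (Option<String>, Vec<(String, String)>) {
    let mut format = stored_as;
    let mut rest = Vec::new();
    for (key, value) in options {
        if key.eq_ignore_ascii_case("format") {
            // Assumption: an explicit STORED AS takes precedence over the
            // legacy option, so the legacy value only fills a gap.
            if format.is_none() {
                format = Some(value.to_uppercase());
            }
        } else {
            rest.push((key, value));
        }
    }
    (format, rest)
}
```

Letting STORED AS silently win over a conflicting legacy `format` entry is a design choice made here for brevity; a real implementation might prefer to report an error instead.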

Filed #9713 to track

alamb

Thanks @metesynnada -- other than the commented out test, this PR looks good to go in my opinion.

It would be good to get @devinjdangelo's opinion as well.

I think the follow ups would be:

make the format. prefix optional
Support (FORMAT PARQUET) style option for backwards compatibility

alamb · 2024-03-15T14:45:40Z

datafusion/sqllogictest/test_files/clickbench.slt

@@ -23,7 +23,7 @@
 # create.sql came from
 # https://github.com/ClickHouse/ClickBench/blob/8b9e3aa05ea18afa427f14909ddc678b8ef0d5e6/datafusion/create.sql
 # Data file made with DuckDB:
-# COPY (SELECT * FROM 'hits.parquet' LIMIT 10) TO 'clickbench_hits_10.parquet' (FORMAT PARQUET);
+# COPY (SELECT * FROM 'hits.parquet' LIMIT 10) TO 'clickbench_hits_10.parquet';


I think we should revert this change as it is a comment showing what command was used in duckdb (not a datafusion command)

ozankabak · 2024-03-15T15:16:06Z

Thanks @metesynnada -- other than the commented out test, this PR looks good to go in my opinion.

It would be good to get @devinjdangelo's opinion as well.

I think the follow ups would be:

make the format. prefix optional

Support (FORMAT PARQUET) style option for backwards compatibility

Sounds good. Let's get the consistent base syntax in with this PR, and have follow-ons for shortcuts (1) and backwards compatibility (2).

alamb

Thank you @metesynnada -- I think this PR is good to go from my perspective after:

We file a ticket to track backwards compatible (FORMAT PARQUET) syntax
We file a ticket about what is going on with escaped quotes (aka COPY TO allign with CREATE EXTERNAL TABLE synnada-ai/datafusion-upstream#10 (comment))
@devinjdangelo gives it a final review

alamb · 2024-03-15T20:45:06Z

datafusion/sqllogictest/test_files/copy.slt

-CREATE EXTERNAL TABLE validate_partitioned_escape_quote STORED AS CSV 
-LOCATION 'test_files/scratch/copy/escape_quote/' PARTITIONED BY ("'test2'", "'test3'");
-
+## Until the partition by parsing uses ColumnDef, this test is meaningless since it becomes an overfit. Even in


Here is the explanation of why this test is commented out: synnada-ai#10 (comment)

Filed #9714 to track

devinjdangelo

Thank you @metesynnada this looks great! It definitely is nice now that copying data into files and then creating a table over the files have very similar syntaxes. And it remains easy/compact to copy to a single file, which is also very nice! 🚀

DataFusion CLI v36.0.0
❯ copy (values (1, 2, 3), (4, 5, 6)) to 'file.csv';
+-------+
| count |
+-------+
| 2     |
+-------+
1 row in set. Query took 0.020 seconds.

❯ select * from 'file.csv';
+---------+---------+---------+
| column1 | column2 | column3 |
+---------+---------+---------+
| 1       | 2       | 3       |
| 4       | 5       | 6       |
+---------+---------+---------+
2 rows in set. Query took 0.005 seconds.
❯ copy (values ('1', 2, 3), ('4', 5, 6)) to 'partitioned_csv/' partitioned by (column1) stored as csv;
+-------+
| count |
+-------+
| 2     |
+-------+
1 row in set. Query took 0.027 seconds.
❯ create external table partitioned_csv stored as CSV location 'partitioned_csv' partitioned by (column1) with header row;
0 rows in set. Query took 0.001 seconds.

❯ select * from partitioned_csv;
+---------+---------+---------+
| column2 | column3 | column1 |
+---------+---------+---------+
| 5       | 6       | 4       |
| 2       | 3       | 1       |
+---------+---------+---------+
2 rows in set. Query took 0.001 seconds.

devinjdangelo · 2024-03-16T12:59:25Z

datafusion/sql/src/parser.rs

+    /// CSV Header row?
+    pub has_header: bool,
+    /// File type (Parquet, NDJSON, CSV, etc)
+    pub stored_as: Option<String>,


devinjdangelo · 2024-03-16T13:06:39Z

datafusion/sqllogictest/test_files/copy.slt

@@ -54,8 +54,8 @@ select * from validate_partitioned_parquet_bar order by col1;

 # Copy to directory as partitioned files
 query ITT
-COPY (values (1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z')) TO 'test_files/scratch/copy/partitioned_table2/' 
-(format parquet, partition_by 'column2, column3', 'parquet.compression' 'zstd(10)');
+COPY (values (1, 'a', 'x'), (2, 'b', 'y'), (3, 'c', 'z')) TO 'test_files/scratch/copy/partitioned_table2/' STORED AS parquet PARTITIONED BY (column2, column3)


This looks pretty cool. Definitely an improvement in readability over the prior syntax 👍

ozankabak

LGTM, left three minor comments

datafusion/common/src/config.rs

datafusion/sql/src/parser.rs

ozankabak · 2024-03-18T17:41:53Z

This seems good to go - I will wait for a little bit more in case there are any more comments and then merge this today.

ozankabak · 2024-03-18T22:45:56Z

Let's address any remaining issues in quick follow-ons.

alamb · 2024-03-20T19:14:53Z

Filed #9716 to track the enhancement to avoid repeating .format over and over

alamb · 2024-03-23T11:47:25Z

Thanks to @devinjdangelo and @tinfoil-knight, I think we have completed all the follow-on items: #9716 and #9713.

I filed #9754 to update the documentation

metesynnada added 2 commits March 13, 2024 17:33

COPY TO allign with CREATE EXTERNAL TABLE

7af77ce

Resolve datafusion-cli error

536e291

github-actions bot added sql core Core datafusion crate sqllogictest labels Mar 14, 2024

alamb mentioned this pull request Mar 14, 2024

DataFusion weekly project plan (Andrew Lamb) - March 11, 2024 #9555

Closed

5 tasks

devinjdangelo suggested changes Mar 14, 2024

Make STORED AS optional

68105cf

alamb reviewed Mar 15, 2024

metesynnada mentioned this pull request Mar 15, 2024

COPY TO allign with CREATE EXTERNAL TABLE synnada-ai/datafusion-upstream#10

Closed

metesynnada requested a review from alamb March 15, 2024 14:52

alamb approved these changes Mar 15, 2024

devinjdangelo approved these changes Mar 16, 2024

Review

8df4b0e

ozankabak approved these changes Mar 16, 2024

metesynnada added 4 commits March 18, 2024 09:56

Review resolved

59e51b5

Merge remote-tracking branch 'upstream/main' into copy-to-parser

9efaae0

Merge resolve

8c82a94

Enhancing comments, solving some bugs

065295f

metesynnada mentioned this pull request Mar 18, 2024

Systematic Configuration in 'Create External Table' and 'Copy To' Options #9382

Merged

ozankabak merged commit b137f60 into apache:main Mar 18, 2024
24 checks passed

alamb mentioned this pull request Mar 19, 2024

Release DataFusion 37.0.0 #9682

Closed

8 tasks

tinfoil-knight mentioned this pull request Mar 20, 2024

support unprefixed config format options #9594

Closed

alamb mentioned this pull request Mar 23, 2024

Update COPY documentation to reflect changes #9754

Merged

alamb mentioned this pull request May 7, 2024

DISCUSSION: remove CREATE EXTERNAL TABLE syntax: DELIMITER, WITH HEADER ROW and COMPRESSION #10414

Closed

davisp mentioned this pull request May 24, 2024

Pass BigQuery options to the ArrowSchema #10590

Draft
