Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DISCUSSION: remove CREATE EXTERNAL TABLE syntax: DELIMITER, WITH HEADER ROW and COMPRESSION #10414

Closed
alamb opened this issue May 7, 2024 · 6 comments · Fixed by #10404
Closed
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented May 7, 2024

Is your feature request related to a problem or challenge?

CREATE EXTERNAL TABLE (see docs) has several syntax items that only make sense for certain formats, specifically

CREATE [UNBOUNDED] EXTERNAL TABLE
[ IF NOT EXISTS ]
<TABLE_NAME>[ (<column_definition>) ]
STORED AS <file_type>
[ WITH HEADER ROW ]                             <-- only makes sense for CSV
[ DELIMITER <char> ]                            <-- only makes sense for CSV, JSON
[ COMPRESSION TYPE <GZIP | BZIP2 | XZ | ZSTD> ] <-- only makes sense for CSV, JSON
[ PARTITIONED BY (<column list>) ]
[ WITH ORDER (<ordered column list>) ]
[ OPTIONS (<key_value_list>) ]
LOCATION <literal>

Now, thanks to @metesynnada @devinjdangelo and others, we have a way to support arbitrary format specific options such as has_header:

CREATE EXTERNAL TABLE aggregate_simple (
  c1 FLOAT NOT NULL,
  c2 DOUBLE NOT NULL,
  c3 BOOLEAN NOT NULL
)
STORED AS CSV
LOCATION '../core/tests/data/aggregate_simple.csv'
OPTIONS('format.has_header' 'true')

However, the old inconsistent syntax is still supported, which causes additional code complexity as noted by @ozankabak in #10404 (comment) and potential confusion with users (who don't know how to look for the options)

Describe the solution you'd like

  1. I think we should at least update the documentation so it sends people to the new preferred syntax (OPTIONS(..)).
  2. We should consider deprecating / removing the old CREATE TABLE syntax DELIMITER, WITH HEADER ROW and COMPRESSION

Describe alternatives you've considered

No response

Additional context

@metesynnada did some great work to get COPY to align to external table: #9604

@berkaysynnada has noticed issues related to inconsistency in the EXTERNAL TABLE syntax and config options #9945

@ozankabak
Copy link
Contributor

ozankabak commented May 7, 2024

Thank you for creating this ticket to facilitate the discussion.

For the general audience: The good news is that we found a way to support the old syntax simultaneously with the new options syntax -- the bad news is that it results in code complexity/maintenance burden.

Basically, the aim of this ticket is to gather information on usage so that we can decide whether it is worth to have both syntaxes supported for a while -- or should we just pull the band-aid and remove the old syntax.

@phillipleblanc
Copy link
Contributor

I agree that it makes sense to "pull the band-aid" and remove the syntax from this code.

One place where I think this might get difficult is if an project that relies on DataFusion has some sort of user/external input that relies on this syntax, that the project itself doesn't have the ability to easily change. But the nice thing about DataFusion is its extensible, so I could imagine if someone has a strong requirement to keep the old syntax, we could write a custom parser that wraps DFParser (maybe lives in another repo, or in examples) that adds it back and just sets OPTIONS('format.has_header' 'true'). (See an example of a custom parser at https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/sql_dialect.rs#L45)

@ozankabak
Copy link
Contributor

ozankabak commented May 8, 2024

But the nice thing about DataFusion is its extensible, so I could imagine if someone has a strong requirement to keep the old syntax, we could write a custom parser that wraps DFParser (maybe lives in another repo, or in examples) that adds it back and just sets OPTIONS('format.has_header' 'true'). (See an example of a custom parser at https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/sql_dialect.rs#L45)

Agreed, sounds like a reasonable way forward in case removing the old syntax creates challenges for certain users that can't change their SQL.

@andygrove
Copy link
Member

IIRC, some of this original syntax came from a desire to suppor Hive SQL, but as @phillipleblanc said, if anyone needs this then they can add it back under a specific dialect.

@alamb
Copy link
Contributor Author

alamb commented May 8, 2024

If there is consensus here, given we just branched for the 38 release, maybe we can remove support on main now as part of #10404 and let that sit on main for a while 🤔

@ozankabak
Copy link
Contributor

Sounds good. We will update the PR tomorrow 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
4 participants