[SPARK-29908][SQL] Support partitioning and bucketing through DataFrameWriter.save for V2 Tables #25822
Conversation
Hi, @brkyvz. Could you file a JIRA issue and use it in the PR title, please?
JIRA was down when I created the PR. Will update asap
Oh, I see. Thanks, @brkyvz!
Test build #110832 has finished for PR 25822 at commit …
Test build #110833 has finished for PR 25822 at commit …
This PR is related to #25651 but targets a different use case. Both PRs need to update/extend `TableProvider`. In #25651, what we need is the following:
This use case is very similar to Hive's EXTERNAL TABLE: the table metadata is stored in Spark's metastore while the table data is stored outside of Spark (i.e. external data). However, people may ask about Hive's MANAGED TABLE. What's the corresponding concept in Spark? In Hive, what gets managed is the file directories, so the concept only applies to file sources (we can also call them path-based data sources). Note that this doesn't mean file sources can only be used with MANAGED TABLE; Hive can still create an EXTERNAL TABLE pointing to a file directory. To support a use case like Hive's MANAGED TABLE, we need a variant of `TableProvider`.
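For concreteness, here is a minimal sketch of how that distinction already surfaces through Spark's existing API; the table and path names are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val df = spark.range(5).toDF("id")

// EXTERNAL-like: an explicit path is given, so only the metadata lives in
// Spark's catalog and dropping the table leaves the files in place.
df.write.format("parquet").option("path", "/tmp/ext_tbl").saveAsTable("ext_tbl")

// MANAGED-like: no path is given, so Spark owns a directory under its
// warehouse location and dropping the table deletes the data.
df.write.format("parquet").saveAsTable("managed_tbl")
```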
Back to this PR: however, when it comes to …, I think we can still support …
That said, with the change in #25651, we need to add one mixin trait of `TableProvider`.
@cloud-fan Let's look at the interfaces we have thus far:
`TableProvider` is currently missing the passing of partitioning info. This can be passed as part of `DataFrameWriter`, but unfortunately not as part of `DataFrameReader`. This means that for file-based sources, where there is no catalog to store the partitioning info, Spark cannot initialize a complete and correct `Table` definition through user input. I had a more radical idea, and I've started working on it here: #25833. Why don't we make …?
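To make that asymmetry concrete, here is a minimal sketch using the built-in Parquet source; the path is made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val data = spark.range(100).withColumn("part", col("id") % 4)

// Writer side: DataFrameWriter can carry the partitioning information.
data.write.format("parquet").partitionBy("part").save("/tmp/partitioned")

// Reader side: there is no partitionBy counterpart, so a source without a
// catalog must recover the partitioning from the directory layout alone.
val roundTrip = spark.read.format("parquet").load("/tmp/partitioned")
```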
It's worthwhile to discuss the usefulness of …
BTW, I also thought about the save mode problem before. If we want to support all save modes, …
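For reference, here is a minimal sketch of the four save modes in question; `ErrorIfExists` and `Ignore` are the ones whose semantics hinge on an existence check before writing (the path is made up for illustration):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val df = spark.range(10).toDF("id")

df.write.mode(SaveMode.ErrorIfExists).parquet("/tmp/out") // fails if output already exists
df.write.mode(SaveMode.Ignore).parquet("/tmp/out")        // silently skips if output exists
df.write.mode(SaveMode.Append).parquet("/tmp/out")        // adds new files to existing output
df.write.mode(SaveMode.Overwrite).parquet("/tmp/out")     // replaces any existing output
```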
cc @dbtsai
What changes were proposed in this pull request?
We add a new interface `SupportsCreateTable` to support the passing of partitioning transforms and table properties for tables that can be created without the existence of a catalog. Traditionally, data sources were passed all necessary information to define a table through the options in `DataFrameWriter` in conjunction with `save`.
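As an illustration of that options-only flow, here is a sketch with the built-in JDBC source; the connection URL and table name are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val df = spark.range(10).toDF("id")

// Everything the source needs to define the target table travels as string
// options; there is no first-class slot for partitioning or bucketing here.
df.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/db") // made-up URL
  .option("dbtable", "events")                     // made-up table name
  .mode("append")
  .save()
```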
Through this new interface, we can continue to perform the necessary checks for `SaveMode.ErrorIfExists` and `SaveMode.Ignore` through `save` for V2 tables. For example, a file-based data source such as Parquet can check whether the target directory is empty as part of `SupportsCreateTable.canCreateTable` to support these save modes. In addition, if metadata is available for a table (e.g. the schema of a JDBC data source would be available), the data source can check in `SupportsCreateTable.buildTable` whether the correct schema and partitioning transforms have been provided if a table already exists for the given options. The `buildTable` method also takes in table properties; while these aren't available through `save`, they can be provided through the `DataFrameWriterV2` API.
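A rough sketch of what such an interface could look like is below. The method names `canCreateTable` and `buildTable` come from the description above, but the exact signatures and the connector imports are assumptions for illustration, not the actual patch:

```scala
import java.util

import org.apache.spark.sql.connector.catalog.Table
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Hypothetical mixin for TableProvider implementations; the shape is inferred
// from the PR description, not taken from the merged code.
trait SupportsCreateTable {
  // For ErrorIfExists/Ignore: e.g. a file source can report whether the
  // target directory is empty for the given options.
  def canCreateTable(options: CaseInsensitiveStringMap): Boolean

  // Builds (or validates against) the table for the given options, checking
  // that the schema and partitioning match any existing metadata.
  def buildTable(
      options: CaseInsensitiveStringMap,
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table
}
```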
Thoughts about DataFrameWriterV2:
I'm also thinking that there could be a separate API that can map back and forth between `options <-> Identifier`. This can make sure that these data sources can also leverage the `DataFrameWriterV2` API without requiring a catalog.

Why are the changes needed?
Currently, partitioning and bucketing information cannot be passed through the `DataFrameWriter.save` method, one of the most commonly used methods in Apache Spark, for data sources that migrate to DataSource V2.
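For example, at the time of this PR, bucketing simply cannot travel through this path at all (the path is made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val df = spark.range(10).toDF("id")

// Bucketing cannot be expressed through the generic save path: this throws
// an AnalysisException rather than writing bucketed output.
df.write.format("parquet").bucketBy(4, "id").save("/tmp/bucketed")
```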
Does this PR introduce any user-facing change?
This adds a new interface `SupportsCreateTable`, which data source developers can implement as part of their `TableProvider` interface to support the creation of tables when a catalog is not available.
Tests in `DataSourceV2DataFrameSuite`.