Skip to content

[SPARK-29324][SQL] Fix overwrite behaviour for saveAsTable #25995

Closed
karuppayya wants to merge 2 commits intoapache:masterfrom
karuppayya:SPARK-29324
Closed

[SPARK-29324][SQL] Fix overwrite behaviour for saveAsTable #25995
karuppayya wants to merge 2 commits intoapache:masterfrom
karuppayya:SPARK-29324

Conversation

@karuppayya
Copy link
Contributor

What changes were proposed in this pull request?

When saveAstable is used in overwrite mode, the metadata of the table gets overwritten.
In this PR, adding changes to retain the metadata after overwrite to an existing table

Why are the changes needed?

The tables metadata gets overwritten without this change

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added Unit tests

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@karuppayya
Copy link
Contributor Author

Can someone help review this change

@maropu
Copy link
Member

maropu commented Oct 3, 2019

Sorry, but I miss you point; this is not an expected behaivour? metadata also should be overwritten in the overwrite mode?

@karuppayya
Copy link
Contributor Author

karuppayya commented Oct 3, 2019

Without the change, an external table become a managed table, the file format could change as well(Say table was created using Orc file format, after overwrite it will change to parquet which is the default FF)
Few issues

  1. Any other user who is consuming the table will be querying the data at an older location, instead of the new location(in this case the warehouse location, since it became a managed table).
  2. In case using overwriting a partitioned table with partitionOverwriteMode set to dynamic,
    I think the expectation would be to overwrite the data of the only affected partitions with the same file format(Please correct me if I am wrong). In this case, we will end up writing in a completely different location and maybe with different file format.

@dongjoon-hyun dongjoon-hyun changed the title SPARK-29324: Fix overwrite behaviour for saveAsTable [SPARK-29324][SQL] Fix overwrite behaviour for saveAsTable Oct 3, 2019
@karuppayya
Copy link
Contributor Author

Can someone review this change?
Tagging couple of users from the history of the file @cloud-fan @gatorsmile @maropu

@cloud-fan
Copy link
Contributor

saveAsTable with overwrite mode is kind of REPLACE OR CREATE TABLE AS SELECT. This is the expected behavior. I think what you need is insertInto, but you would need to deal with non-existing table manually.

@karuppayya
Copy link
Contributor Author

  1. When I write to a partitioned table(with fileformat other than parquet) using saveAstable and partitionOverwriteMode set to dynamic, the affected partitions may have data with Parquet format(default) while other partitions will have data in a different format. User needs to specify the fileformat when writing using this API.
    I think this operation should be disallowed when fileformat is not specifie or use the format from the dropped table.

Also during overwrite + saveAsTable operation, we drop the table and recreate it

  1. In the case of Hive table, we might transparently change a Hive(provider = HIve) table to datasource table. Can this have any implication when the table is read with Spark version where there are issues with Orc's data source flow and user wanting to use the Hive read flow?
  2. In the case of partitionOverwriteMode=dynamic, we write to particular partitions only. Due to drop, we lose the partition information. ALso, all the partition level stats is lost(If already analyzed). These are expensive operations in case of larger tables.

@cloud-fan Any thoughts on these?

@cloud-fan
Copy link
Contributor

User needs to specify the fileformat when writing using this API.

This is a bug that we should fix. But I'm not sure this would happen as we explicitly drop and re-create the table.

we might transparently change a Hive(provider = Hive) table to datasource table

This behavior makes sense if you think about it as REPLACE OR CREATE TABLE AS SELECT. I know saveAsTable is not a good name, but that's what it is today. You can try the new DataFrameWriterV2 which has clear semantic after 3.0 preview is released. also cc @rdblue

@HyukjinKwon
Copy link
Member

Closing this per discussed above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants