[SPARK-54462][SQL] Add isDataFrameWriterV1 option for Delta datasource compatibility
#53173
base: master
Conversation
Thank you for reporting the issue, @juliuszsompolski. I have a few questions.
- Which Apache Spark preview version and Delta version did you test this with? Specifically, in which preview did this start to happen?
- Why don't we implement this in io.delta.sql.DeltaSparkSessionExtension or document it, instead of changing Apache Spark source code?
- Do you think we can have test coverage with a dummy data source?
- If this is an emergency fix, what would be the non-emergency fix?
This is an emergency fix to prevent a breaking change resulting in data corruption with Delta data sources in Spark 4.1.
```scala
    serde = None,
    external = false,
    constraints = Seq.empty)
val writeOptions = if (source == "delta") {
```
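For context, a minimal sketch of what this hunk is doing, per the PR description; the helper shape and the `extraOptions` name are assumptions, only the option name `isDataFrameWriterV1` and the `source == "delta"` check come from the PR:

```scala
// Sketch: when the V1 writer targets a "delta" source, attach a marker option
// so the data source can tell a DFWV1 call apart from a DFWV2 one.
def withWriterV1Marker(
    source: String,
    extraOptions: Map[String, String]): Map[String, String] = {
  if (source == "delta") {
    extraOptions + ("isDataFrameWriterV1" -> "true")
  } else {
    extraOptions
  }
}
```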
Does Apache Spark source code have this kind of Delta-specific logic anywhere else, @juliuszsompolski ?
This looks like the first proposal to have a 3rd-party company's data source named in Apache Spark source code. At first glance, this string match looks a little fragile to me.
@dongjoon-hyun I am currently researching cleaner solutions. I raised this PR as a straw man, given the Spark 4.1 timeline, trying to propose the most narrowly scoped change possible that would prevent a behaviour change leading to table corruption in Delta (unintentionally overwriting a table's metadata in operations that previously required an explicit overwriteSchema option for that). In any case, if this is allowed in as a stopgap fix, I would like to replace it with a proper solution.
It could also be done by always adding this option and not mentioning Delta here, as @HyukjinKwon suggested. It would still be a "strange" piece of code attaching this kind of option just for this particular case, and kind of "leaky". Some people suggested to me that having Spark attach options that always point to the API a command originated from could be useful in other cases too, but that's also a much bigger change to design and make.
Another option: since the semantics of saveAsTable in DFWV1 can be interpreted differently than createOrReplace / replace of DFWV2, maybe it could have a new plan node SaveAsV2TableCommand, just like it has its own node for SaveAsV1TableCommand? But again, this is a lot of changes.
Or, the existing CreateTableAsSelect / ReplaceTableAsSelect could take a flag parameter indicating that it's actually a saveAsTable command? But again, this is not really clean...
I am researching options.
```scala
// To retain compatibility of the Delta datasource with Spark 4.1 in connect mode, Spark
// provides this explicit storage option to indicate to Delta datasource that this call
// is coming from DataFrameWriter V1.
//
```
Per:
```scala
// FIXME: Since the details of the documented semantics of Spark's DataFrameWriter V1
// saveAsTable API differs from that of CREATE/REPLACE TABLE AS SELECT, Spark should
// not be reusing the exact same logical plan for these APIs.
// Existing Datasources which have been implemented following Spark's documentation of
// these APIs should have a way to differentiate between these APIs.
```
Why don't we just always append the option? The downstream datasources that care about this behaviour will make the change accordingly.
I have been testing it with Delta master built against Spark master.
It would make me very happy to find a way to fix it without changing Spark code. It seems to me, however, that currently I lose all ability to distinguish DFWV1 from DFWV2.
I could have a test with a dummy data source that verifies that this option is added for saveAsTable, but the e2e behavior change is Delta-specific.
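Roughly the kind of test I have in mind (a sketch: it assumes the analyzed plan captured by a QueryExecutionListener still contains the ReplaceTableAsSelect node, and that a writable table/catalog is available in the test session):

```scala
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, ReplaceTableAsSelect}
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

// Capture analyzed plans of executed commands.
val captured = ArrayBuffer.empty[LogicalPlan]
spark.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
    captured += qe.analyzed
  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
})

// Overwrite-mode saveAsTable against a source that takes the v2 path.
spark.range(3).write.format("delta").mode("overwrite").saveAsTable("t")

// Assert the marker option was attached to the ReplaceTableAsSelect plan.
assert(captured.collect { case r: ReplaceTableAsSelect => r }
  .exists(_.writeOptions.get("isDataFrameWriterV1").contains("true")))
```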
See my other comment #53173 (comment) for some of the options I am exploring.
@dongjoon-hyun I investigated it more and raised a PR delta-io/delta#5569 with a sketch of trying to move it to a Delta extension rule. The problem is that in connect mode Spark directly returns the ReplaceTableAsSelect plan created by DataFrameWriter, without running any resolution or analysis on it. At the point of returning it we have already returned from the DataFrameWriter call, at which point the plan is identical for DFWV1 saveAsTable and DFWV2 createOrReplace.
According to @juliuszsompolski's analysis, do you think we need to revise or revert the original patch (SPARK-52451)?
@dongjoon-hyun The change is essentially two lines: adding the option if the source is delta. Or, to not mention Delta by name, maybe a new interface that data sources could implement to opt in.
To be clear, @juliuszsompolski, the Apache Spark 4.1 release process does not allow for additional development. If there is no reasonable solution, the release process is simply supposed to find any new problems like the above and revert the offending changes to unblock the release. It's a little sad because we missed this regression at Apache Spark 4.1.0-preview2 (which is the first release including the root causes, SPARK-52451 and SPARK-53097).
Also, cc @aokolnychyi.
I think the root cause is that Delta Lake claims to be a v2 source, which means saveAsTable goes through the v2 ReplaceTableAsSelect plan. I understand the intention to avoid behavior changes, and Delta used to rely on a hack (inspecting the stack trace) to identify calls coming from the v1 API. Instead of reverting that reasonable change from Spark, I'm more in favor of adding a way to identify the v2 commands that come from v1 APIs.
Thank you for all your feedback. AFAIK, we have 5 ways to move forward so far.
For 1 ~ 4, the Delta community needs to change their code in some way. Is that okay for them?
Could you make alternative working PRs (which pass all CIs) for further decision, @cloud-fan or @juliuszsompolski?
2 and 3 are basically the same, just different API flavors. I think a new bool flag is simpler. @juliuszsompolski what do you think?
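To illustrate the two flavors with a standalone sketch (every name here is hypothetical, not an actual Spark API):

```scala
// One flavor: the v2 plan node carries an explicit boolean that the V1 writer
// sets when it builds the plan.
final case class ReplaceTableAsSelectSketch(
    tableName: String,
    writeOptions: Map[String, String],
    isDataFrameWriterV1: Boolean = false)

// The other flavor: a marker interface; Spark attaches the option only for
// data sources that opt in by implementing it.
trait RequiresWriterV1Marker

def maybeAttachMarker(
    source: AnyRef,
    options: Map[String, String]): Map[String, String] = source match {
  case _: RequiresWriterV1Marker => options + ("isDataFrameWriterV1" -> "true")
  case _ => options
}
```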
I have a slight preference for 2, which is a bit more verbose, but:
For 4, the Delta community is actively working on developing a proper V2 datasource; see already-closed PRs in https://github.com/delta-io/delta/pulls?q=is%3Apr+dsv2+is%3Aclosed. But it won't be ready now, for 4.1. For 5, I agree with @cloud-fan that we shouldn't be reverting reasonable changes to Spark. Delta, and the fact that we let that horrible hack be there for 6 years, is at fault here. Because the issue was detected this late in the 4.1 process, and the behaviour change is very unfriendly (causing silent overwrite of metadata where it was not overwritten before) to users of open source Spark with open source Delta, I'd much prefer to be allowed to move forward with a narrowly scoped fix. I will prepare a PR for option 2.
I raised #53215 with option 2.
Thank you, @cloud-fan and @juliuszsompolski.
What changes were proposed in this pull request?
Make DataFrameWriter saveAsTable add a write option `isDataFrameWriterV1 = true` when using Overwrite mode with a `delta` data source.
This is an emergency fix to prevent a breaking change resulting in data corruption with Delta data sources in Spark 4.1.
Why are the changes needed?
Spark's SaveMode.Overwrite is documented as:
> Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame.
It does not define the behaviour of overwriting the table metadata (schema, etc.). The Delta datasource's interpretation of this DataFrameWriter V1 API documentation is to not replace the table schema unless the Delta-specific option overwriteSchema is set to true.
However, DataFrameWriter V1 creates a ReplaceTableAsSelect plan, which is the same as the plan of the DataFrameWriterV2 createOrReplace API, which is documented as:
> Create a new table or replace an existing table with the contents of the data frame. The output table's schema, partition layout, properties, and other configuration will be based on the contents of the data frame and the configuration set on this writer. If the table exists, its configuration and data will be replaced.

Therefore, for calls via DataFrameWriter V2 createOrReplace, the metadata always needs to be replaced, and the Delta datasource doesn't use the overwriteSchema option.
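Concretely, the difference looks like this from the user's side (assuming a SparkSession with the Delta datasource on the classpath):

```scala
val df = spark.range(3).toDF("i")

// DFWV1: overwrites the data; Delta keeps the existing table schema unless the
// Delta-specific overwriteSchema option is set.
df.write.format("delta").mode("overwrite").saveAsTable("t")
df.write.format("delta").mode("overwrite")
  .option("overwriteSchema", "true")
  .saveAsTable("t")

// DFWV2: createOrReplace always replaces the table definition, schema included,
// so no overwriteSchema option is needed.
df.writeTo("t").using("delta").createOrReplace()
```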
Since the created plan is exactly the same, Delta had used a very ugly hack to detect which API the call is coming from, based on the stack trace of the call.
In Spark 4.1 in connect mode, this stopped working because planning and execution of the commands are decoupled, and the stack trace no longer contains the point where the plan got created.
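The hack amounted to roughly the following sketch (not Delta's actual code); in connect mode the plan is built on the client and executed on the server, so no DataFrameWriter frame ever appears on the server-side stack:

```scala
// Sketch: guess "did this plan come from DFWV1?" by scanning the call stack
// for a DataFrameWriter frame. Fragile, and broken once plan creation and
// execution are decoupled.
def looksLikeDataFrameWriterV1Call: Boolean =
  Thread.currentThread().getStackTrace.exists { frame =>
    frame.getClassName.endsWith(".DataFrameWriter")
  }
```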
To retain compatibility of the Delta datasource with Spark 4.1 in connect mode, Spark provides this explicit storage option to indicate to the Delta datasource that the call is coming from DataFrameWriter V1.
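On the data source side, consuming the option would then look roughly like this sketch of the intent (not Delta's actual code):

```scala
// DFWV2 createOrReplace always replaces the schema; a DFWV1 overwrite keeps it
// unless the user explicitly passed overwriteSchema=true.
def shouldReplaceSchema(writeOptions: Map[String, String]): Boolean = {
  val fromWriterV1 = writeOptions.getOrElse("isDataFrameWriterV1", "false").toBoolean
  val overwriteSchema = writeOptions.getOrElse("overwriteSchema", "false").toBoolean
  !fromWriterV1 || overwriteSchema
}
```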
Followup: since the documented semantics of Spark's DataFrameWriter V1 saveAsTable API differ from those of CREATE/REPLACE TABLE AS SELECT, Spark should not be reusing the exact same logical plan for these APIs.
Existing datasources which have been implemented following Spark's documentation of these APIs should have a way to differentiate between them.
However, at this point this is an emergency fix: releasing Spark 4.1 as is would cause data corruption issues with Delta in DataFrameWriter saveAsTable in overwrite mode, as it would not correctly interpret its overwriteSchema option.
Does this PR introduce any user-facing change?
No
How was this patch tested?
It has been tested with tests that are not part of the PR. To properly test in connect mode, changes are needed on both the Spark and Delta side; integrating them will be done as followup work.
Was this patch authored or co-authored using generative AI tooling?
Assisted by Claude Code.
Generated-by: claude code, model sonnet 4.5