[SPARK-26915][SQL] DataFrameWriter.save() should write without schema validation #23836
Conversation
@gengliangwang: The community agreed to remove the v2 write paths using SaveMode before the next release. The problem is that SaveMode is ambiguous and doesn't have reliable behavior. That's why we are introducing the new logical plans: to set expectations for behavior. Part of the challenge while standardizing behavior across sources is to support what Spark already does in v2. In this case, we need to define how a table can opt out of schema validation, or at least have relaxed validation rules that allow things like adding new columns by writing a DF with a new column. I don't think that the right way to do that is to use a write path that has no validation (the WriteToDataSourceV2 plan), especially when that write path is set to be removed. As I've said before, the right way to do this is to:
I think number 1 is the most important. What you have here removes validation for all v2 writes, but I think the behavior you are trying to mimic from v1 applies only when writing to path-based tables. That's a big unintended consequence, and why it is important to state what you're trying to accomplish and have a design for how you're going to do it. Please consider this a -1.
Test build #102493 has finished for PR 23836 at commit
I would argue that putting back SaveMode should be on the table, if that's the best way to ensure that queries which work today will work with a v2 file source implementation. I agree that this PR isn't the right way to go about it. If the v2 ORC implementation doesn't work because of problems in the API, we need to go back to the drawing board and fix the API, not make ad-hoc changes to work around the problem.
@jose-torres, SaveMode is used by v1. That is the best way to ensure that queries don't break. SaveMode should be translated to concrete plans for v2. Otherwise, v2 is just as unpredictable as v1 and we don't gain anything.
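To make the translation idea concrete, here is a minimal sketch in plain Python. The class names are modeled on Spark's v2 logical plans (`AppendData`, `OverwriteByExpression`, `CreateTableAsSelect`), but everything else — the function name, the string-based mode, the `table_exists` flag — is an assumption for illustration, not Spark's actual code:

```python
from dataclasses import dataclass
from typing import Optional, Union

# Illustrative stand-ins for Spark's v2 logical plans; not the real classes.
@dataclass
class AppendData:
    table: str

@dataclass
class OverwriteByExpression:
    table: str
    delete_expr: str

@dataclass
class CreateTableAsSelect:
    table: str

def translate_save_mode(mode: str, table_exists: bool, table: str):
    """Map an ambiguous SaveMode to one concrete, predictable plan."""
    if mode == "append":
        # v1 "append" really means: create the table if needed, then append.
        return AppendData(table) if table_exists else CreateTableAsSelect(table)
    if mode == "overwrite":
        # Overwrite everything: the delete expression is simply "true".
        return (OverwriteByExpression(table, "true") if table_exists
                else CreateTableAsSelect(table))
    if mode == "errorifexists":
        if table_exists:
            raise ValueError(f"table {table} already exists")
        return CreateTableAsSelect(table)
    if mode == "ignore":
        # Do nothing if the table exists; otherwise create it.
        return None if table_exists else CreateTableAsSelect(table)
    raise ValueError(f"unknown mode: {mode}")
```

Each branch produces exactly one well-defined plan, which is the predictability argument: the ambiguity lives in the translation step, not in every source's write path.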
We eventually have to ensure that queries don't break even with v2, unless your proposal is to have format("orc") and such invoke v1 forever. |
@jose-torres, I'm not saying that the default should be v1 forever. The right way to move over is to develop them in parallel and switch over when we can validate that the behavior is the same. Right now, v2 can't run CTAS plans, so we clearly can't switch. But when v2 has all of the necessary logical plans, then we can start running the existing behavior tests on v2 to see what changes remain, like changing validation for path-based tables. Continuing to use SaveMode actually inhibits the move to v2: if write paths use SaveMode, then they can pass behavior tests and appear to work when they actually don't.

Also, let me clarify my comment on using v1. I think we need to keep v1 around until the process of moving to v2 is complete, because there are code paths that we know can't be changed to v2 without altering behavior. For example, we've agreed to standardize behavior on what file sources do. Users will have to choose between existing behavior and using v2 for other sources.

I'm not confident that all v1 behaviors will be available in v2. In v1, a CTAS plan can be validated against an existing table. In some cases, that CTAS should fail because the table exists (SQL), and in some cases, the plan that is created should be AppendData instead of CTAS (DataFrameWriter). Does the validation for AppendData work exactly the same way as validating a CTAS that is actually an append? My guess is that it doesn't, and that we might not want it to.

I think the final solution is to introduce a new write API that always uses v2 and makes it obvious what plan will be used. I've proposed such an API in the logical plans SPIP. Moving users to that API and eventually deprecating the DataFrameWriter API will take care of migrating the last few cases (which should be minor) from v1 to v2.
I definitely agree with the direction: translate SaveMode into concrete logical plans. However, I think the current translation is not precise: append mode doesn't mean append; it's actually "create table if not exists, or append to the table". At least this is the case for the file source and the JDBC source, and I believe it's true for most of the v1 sources. The next problem is how to implement "create table if not exists, or append" with DS v2 APIs. I have 2 proposals:
For proposal 1, the file source doesn't work because it can't create an empty table (it doesn't have a metastore). I guess other data sources will face the same issue. And it requires the catalog API, which is not done yet. I think proposal 2 is better: it's useful even after we have the catalog API, to implement atomic CTAS.
@rdblue Sorry if I caused some misunderstanding.
The above solution from @cloud-fan is a good direction to go. We can discuss it in tomorrow's meetup.
Agreed. This is why we need to get CTAS finished.
My understanding is that the plan is to do both. If a catalog supports staged tables, then Spark uses them to perform an atomic operation. If it doesn't, then Spark uses the create/append/drop-on-error strategy. I agree that option 2 is "better" in that the operation is atomic. But sources are not required to support atomic CTAS. We need both options, so they are not mutually exclusive. |
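The two strategies described above can be sketched side by side. This is a hypothetical illustration with an invented in-memory catalog — it is not Spark's actual `TableCatalog` or staged-table API, just a model of the control flow:

```python
class InMemoryCatalog:
    """Toy catalog used only to illustrate the two CTAS strategies."""
    def __init__(self):
        self.tables = {}

    def exists(self, name):
        return name in self.tables

    def create(self, name):
        self.tables[name] = []

    def drop(self, name):
        self.tables.pop(name, None)

    def append(self, name, rows):
        self.tables[name].extend(rows)

def ctas_non_atomic(catalog, name, rows):
    """Fallback strategy: create, append, drop on error (not atomic)."""
    if catalog.exists(name):
        raise ValueError(f"table {name} already exists")
    catalog.create(name)
    try:
        catalog.append(name, rows)
    except Exception:
        catalog.drop(name)  # best-effort cleanup; a crash here leaks the table
        raise

def ctas_staged(catalog, name, rows):
    """Staged strategy: buffer the write, then commit in one step (atomic)."""
    if catalog.exists(name):
        raise ValueError(f"table {name} already exists")
    staged = list(rows)            # writes go to a staging area first
    catalog.tables[name] = staged  # single-step "commit" makes the table visible
```

The non-atomic version can leave an empty or partial table behind if cleanup fails, which is why a catalog that supports staged tables gets the atomic path and the create/append/drop-on-error path remains only as a fallback.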
Close this one. |
…t path before delete it

## What changes were proposed in this pull request?

This is a followup PR to resolve comment: apache#23601 (review)

When Spark writes a DataFrame with "overwrite" mode, it deletes the output path before the actual write. To safely handle the case where the output path doesn't exist, it is suggested to follow the V1 code by checking for the path's existence first.

## How was this patch tested?

Apply apache#23836 and run unit tests

Closes apache#23889 from gengliangwang/checkFileBeforeOverwrite.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
Spark supports writing to file data sources without fetching the existing table schema or validating against it. For example, the schema of `newDF` can be different from the original table schema. However, this behavior changed with #23606: currently, data source V2 always validates the output query against the table schema. Even after catalog support for DS V2 is implemented, I think it is hard to support both behaviors with the current API/framework.
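As a rough illustration of what changed, the v2 write path now performs a schema check of roughly this shape, while v1 path-based writes skipped it entirely. This is plain Python modeling the idea, not Spark's actual analyzer code; the function name and the `(column, type)` pair representation are assumptions:

```python
def validate_append_schema(table_schema, query_schema):
    """Sketch of a by-name schema check like the one DSv2 applies to writes.

    Schemas are lists of (column_name, type) pairs. Returns a list of
    error strings; an empty list means the write would be allowed.
    """
    table_cols = dict(table_schema)
    query_cols = dict(query_schema)
    errors = []
    for name in table_cols:
        if name not in query_cols:
            errors.append(f"missing column: {name}")
        elif table_cols[name] != query_cols[name]:
            errors.append(f"type mismatch for {name}: "
                          f"{query_cols[name]} vs {table_cols[name]}")
    for name in query_cols:
        if name not in table_cols:
            errors.append(f"unexpected column: {name}")
    return errors
```

Under a check like this, overwriting a path with a DataFrame that has an extra or retyped column fails, whereas the v1 behavior the PR wants to preserve simply replaces whatever was there.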
To me, `DataFrameWriter.save` is more like a simple IO API (e.g. file IO). It doesn't have to be involved with the `table` concept; otherwise an overwrite can be `Drop table + CTAS`, can be `Insert overwrite`, can be `CTAS`... and an append can be `Insert`, can be `Alter table add column + Insert`, can be `CTAS`...

Things can be too complex if we decide to allow both with and without schema validation in one API, `DataFrameWriter.save`. That is to say, let's remove the expressions `AppendData` and `OverwriteByExpression` from `DataFrameWriter.save`, since their behaviors are different from the API's. The expressions are still useful: we can use `AppendData` and `OverwriteByExpression` in `DataFrameWriter.saveAsTable`, which is more appropriate.

This PR proposes to remove the new expressions from `DataFrameWriter.save` and re-enable ORC V2. I am aware that the interface `SupportsSaveMode` might be removed in the future, but at the current stage we should prevent the regression and make sure the behavior is unchanged and predictable in future development.

## How was this patch tested?
Unit test