
[SPARK-23348][SQL] append data using saveAsTable should adjust the data types #20527

Closed

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

For inserting/appending data into an existing table, Spark should adjust the data types of the input query to match the table schema, or fail fast if they cannot be cast.

There are several ways to insert/append data: the SQL API, `DataFrameWriter.insertInto`, and `DataFrameWriter.saveAsTable`. The first two create an `InsertIntoTable` plan, while the last creates a `CreateTable` plan. However, the input query's data types are only adjusted for `InsertIntoTable`, so users may hit confusing errors when appending data with `saveAsTable`. See the JIRA for the error case.

This PR fixes the bug by adjusting data types for `CreateTable` as well.
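For illustration, a minimal sketch of the scenario (table name, column names, and values follow the new test added in this PR; the setup is assumed, not a verbatim excerpt):

```scala
import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell

// Create a data source table t with schema (i int, j string).
Seq(1 -> "a", 2 -> "b").toDF("i", "j").write.saveAsTable("t")

// Append a DataFrame whose column types do not match the table schema.
// Before this fix, the saveAsTable append path did not adjust the input
// query's types, which could surface a confusing error. With the fix, the
// input is cast to the table schema: "c" becomes a null int, 3 becomes "3".
Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t")
```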

How was this patch tested?

Added a new test.

@cloud-fan
Contributor Author

cc @gatorsmile @dongjoon-hyun

@SparkQA

SparkQA commented Feb 7, 2018

Test build #87151 has finished for PR 20527 at commit ad19125.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class PreprocessTableInsertion(conf: SQLConf) extends Rule[LogicalPlan]


Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t")
checkAnswer(spark.table("t"), Row(1, "a") :: Row(2, "b") :: Row(3, "c")
:: Row(null, "3") :: Nil)
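A brief note on the expected answer above (an assumed reading of the test, matching Spark's default non-ANSI cast behavior): the append is resolved by position, and each input column is cast to the corresponding table column type, so the string "c" becomes a null int and the integer 3 becomes the string "3". The same conversions can be seen directly:

```scala
// Illustration only: the casts implied by the checkAnswer expectation above.
// CAST('c' AS INT) yields null; CAST(3 AS STRING) yields "3".
spark.sql("SELECT CAST('c' AS INT) AS i, CAST(3 AS STRING) AS j").show()
```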
Member
Thank you for pinging me, @cloud-fan. +1 for the patch, LGTM.

@@ -346,37 +349,11 @@ case class PreprocessTableInsertion(conf: SQLConf) extends Rule[LogicalPlan] wit
         """.stripMargin)
     }
 
-    castAndRenameChildOutput(insert.copy(partition = normalizedPartSpec), expectedColumns)
+    insert.copy(query = newQuery, partition = normalizedPartSpec)
Contributor
Nit: we don't need to copy `newQuery` if it is the same as `query`.

Contributor Author
It's also OK to always copy it, and the code is neater.
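For readers without the surrounding source, the pattern under discussion is roughly the following (a simplified sketch of the cast-and-rename idea; the real `castAndRenameChildOutput` helper in Spark is more involved):

```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, Cast, NamedExpression}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}

// Simplified sketch (an assumption, not Spark's exact implementation): cast
// and rename each output column of the input query to match the expected
// attribute, and only wrap the query in a Project (a new plan node) when
// something actually changed, which is the nit raised above.
def castAndRenameSketch(query: LogicalPlan, expected: Seq[Attribute]): LogicalPlan = {
  val projectList: Seq[NamedExpression] = query.output.zip(expected).map {
    case (actual, exp) if actual.dataType == exp.dataType && actual.name == exp.name =>
      actual
    case (actual, exp) =>
      Alias(Cast(actual, exp.dataType), exp.name)()
  }
  if (projectList == query.output) query else Project(projectList, query)
}
```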

@@ -132,6 +134,32 @@ class InMemoryCatalogedDDLSuite extends DDLSuite with SharedSQLContext with Befo
       checkAnswer(spark.table("t"), Row(Row("a", 1)) :: Nil)
     }
   }
 
+  // TODO: This test is copied from HiveDDLSuite, unify it later.
+  test("SPARK-23348: append data to data source table with saveAsTable") {
Contributor

@jiangxb1987 commented Feb 8, 2018
Do we also want to cover the following case: 2) Target tables have column metadata?

Contributor Author

Maybe we can add this when unifying the test cases?

@jiangxb1987
Contributor

LGTM, only some nits.

@gatorsmile
Member

LGTM

Thanks! Merged to master/2.3

asfgit pushed a commit that referenced this pull request Feb 8, 2018
[SPARK-23348][SQL] append data using saveAsTable should adjust the data types

Author: Wenchen Fan <wenchen@databricks.com>

Closes #20527 from cloud-fan/saveAsTable.

(cherry picked from commit 7f5f5fb)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
asfgit closed this in 7f5f5fb on Feb 8, 2018
robert3005 pushed a commit to palantir/spark that referenced this pull request Feb 12, 2018
[SPARK-23348][SQL] append data using saveAsTable should adjust the data types

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#20527 from cloud-fan/saveAsTable.