[SPARK-22356][SQL] data source table should support overlapped columns between data and partition schema #19579

Closed

cloud-fan
Contributor

What changes were proposed in this pull request?

This is a regression introduced by #14207. Since Spark 2.1, we store the inferred schema when creating the table, to avoid inferring the schema again at the read path. However, there is one special case: columns that overlap between the data and partition schema. This breaks the assumption that a table schema has no overlap between data and partition columns and that the partition columns come at the end. As a result, in Spark 2.1 the table scan uses an incorrect schema that puts the partition columns at the end, and in Spark 2.2 the check added in CatalogTable to validate the table schema fails for this case.

To fix this issue, a simple and safe approach is to fall back to the old behavior when overlapping columns are detected, i.e. store an empty schema in the metastore and infer the schema at runtime.

How was this patch tested?

A new regression test was added.
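For context, the scenario being covered looks roughly like the sketch below (the path, table name, and column names are made up for illustration; it assumes a `spark` session and `import spark.implicits._`):

```
// The files under the partition directory p=1 also contain a column named "p",
// so the inferred data schema and the discovered partition schema overlap.
val dir = "/tmp/overlap_demo"
Seq((1, 1, "a"), (2, 1, "b")).toDF("i", "p", "j")
  .write.mode("overwrite").parquet(s"$dir/p=1")

spark.sql(s"CREATE TABLE t USING parquet OPTIONS (path '$dir')")
// With this fix, an empty schema is stored in the metastore and the real schema
// is inferred at read time instead of failing the CatalogTable validation.
spark.table("t").printSchema()
```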

@cloud-fan
Contributor Author

cc @gatorsmile

@SparkQA

SparkQA commented Oct 26, 2017

Test build #83073 has finished for PR 19579 at commit 18907cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
// empty schema in metastore and infer it at runtime. Note that this also means the new
// scalable partitioning handling feature (introduced at Spark 2.1) is disabled in this case.
case r: HadoopFsRelation if r.overlappedPartCols.nonEmpty =>
  table.copy(schema = new StructType(), partitionColumnNames = Nil)
```
Member

Log a warning message here? When data columns and partition columns have the same names, the values could be inconsistent. Thus, we do not suggest users create such a table, and it might not perform well because we infer the schema at runtime.
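A minimal sketch of such a warning, mirroring the case clause above (the message wording is illustrative, not the text that was merged, and it assumes the enclosing class mixes in Spark's Logging trait):

```
case r: HadoopFsRelation if r.overlappedPartCols.nonEmpty =>
  // Illustrative warning only; exact wording and placement are hypothetical.
  logWarning("The data schema and the partition schema have overlapping columns. " +
    "Falling back to storing an empty schema in the metastore and inferring the schema " +
    "at runtime; this may hurt read performance, and overlapping column values may be " +
    "inconsistent.")
  table.copy(schema = new StructType(), partitionColumnNames = Nil)
```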

@SparkQA

SparkQA commented Oct 27, 2017

Test build #83094 has finished for PR 19579 at commit 2fe7ec4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM

@gatorsmile
Member

gatorsmile commented Oct 27, 2017

Thanks! Merged to master/2.2

@asfgit asfgit closed this in 9b262f6 Oct 27, 2017
asfgit pushed a commit that referenced this pull request Oct 27, 2017
…s between data and partition schema

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19579 from cloud-fan/bug2.
asfgit pushed a commit that referenced this pull request Feb 15, 2018
## What changes were proposed in this pull request?
#19579 introduces a behavior change. We need to document it in the migration guide.

## How was this patch tested?
Also update the HiveExternalCatalogVersionsSuite to verify it.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #20606 from gatorsmile/addMigrationGuide.

(cherry picked from commit a77ebb0)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
…s between data and partition schema

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19579 from cloud-fan/bug2.
cloud-fan pushed a commit that referenced this pull request Nov 4, 2020
…using a hard-coded location to store localized Spark binaries

### What changes were proposed in this pull request?
This PR changes `HiveExternalCatalogVersionsSuite` to use, by default, a standard temporary directory to store the Spark binaries that it localizes. It additionally adds a new system property, `spark.test.cache-dir`, which can be used to define a static location for the localized Spark binaries so that they can be shared between test executions. If the system property is used, the downloaded binaries won't be deleted after the test runs.
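A rough sketch of that selection logic, assuming the property is read via `sys.props` (the exact code in the PR may differ):

```
import java.io.File
import org.apache.spark.util.Utils

// Sketch only: use the spark.test.cache-dir system property when set,
// otherwise fall back to a per-run temporary directory.
private val isTestCacheDirSet = sys.props.contains("spark.test.cache-dir")
private val sparkTestingDir: File = sys.props.get("spark.test.cache-dir")
  .map(new File(_))
  .getOrElse(Utils.createTempDir(namePrefix = "test-spark"))

override def afterAll(): Unit = {
  try {
    // Only delete a directory we created ourselves; a user-provided cache dir is
    // kept so later runs can reuse the downloaded Spark binaries.
    if (!isTestCacheDirSet) Utils.deleteRecursively(sparkTestingDir)
  } finally {
    super.afterAll()
  }
}
```

Keeping the delete conditional preserves the reuse-across-runs intent of the original hard-coded directory without leaving files behind in the default case.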

### Why are the changes needed?
In SPARK-22356 (PR #19579), the `sparkTestingDir` used by `HiveExternalCatalogVersionsSuite` became hard-coded to enable re-use of the downloaded Spark tarball between test executions:
```
  // For local test, you can set `sparkTestingDir` to a static value like `/tmp/test-spark`, to
  // avoid downloading Spark of different versions in each run.
  private val sparkTestingDir = new File("/tmp/test-spark")
```
However, this doesn't work, since the directory is deleted after every run:
```
  override def afterAll(): Unit = {
    try {
      Utils.deleteRecursively(wareHousePath)
      Utils.deleteRecursively(tmpDataDir)
      Utils.deleteRecursively(sparkTestingDir)
    } finally {
      super.afterAll()
    }
  }
```

It's also bad that we hard-code a `/tmp` directory, as in some cases this is not the proper place to store temporary files, and since the directory is deleted after every run we're not currently making any good use of it.

### Does this PR introduce _any_ user-facing change?
Developer-facing changes only, as this is in a test.

### How was this patch tested?
The test continues to execute as expected.

Closes #30122 from xkrogen/xkrogen-SPARK-33214-hiveexternalversioncatalogsuite-fix.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>