[SPARK-22356][SQL] data source table should support overlapped columns between data and partition schema #19579

Closed

cloud-fan
Contributor

What changes were proposed in this pull request?

This is a regression introduced by #14207. Since Spark 2.1, we store the inferred schema when creating the table, to avoid inferring the schema again at the read path. However, there is one special case: columns that overlap between the data and partition schema. This breaks the assumption that a table schema has no overlap between data and partition columns and that the partition columns come at the end. As a result, in Spark 2.1 the table scan uses an incorrect schema that puts the partition columns at the end, and in Spark 2.2 the check added in CatalogTable to validate the table schema fails for this case.

To fix this issue, a simple and safe approach is to fall back to the old behavior when overlapping columns are detected, i.e. store an empty schema in the metastore and infer the schema at runtime.

How was this patch tested?

A new regression test was added.
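For context, the scenario being covered looks roughly like the sketch below (the path, table name, and column names are made up for illustration; it assumes a `spark` session and `import spark.implicits._`):

```
// The files under the partition directory p=1 also contain a column named "p",
// so the inferred data schema and the discovered partition schema overlap.
val dir = "/tmp/overlap_demo"
Seq((1, 1, "a"), (2, 1, "b")).toDF("i", "p", "j")
  .write.mode("overwrite").parquet(s"$dir/p=1")

spark.sql(s"CREATE TABLE t USING parquet OPTIONS (path '$dir')")
// With this fix, an empty schema is stored in the metastore and the real schema
// is inferred at read time instead of failing the CatalogTable validation.
spark.table("t").printSchema()
```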

@cloud-fan
Contributor Author

cc @gatorsmile

@SparkQA

SparkQA commented Oct 26, 2017

Test build #83073 has finished for PR 19579 at commit 18907cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
// empty schema in metastore and infer it at runtime. Note that this also means the new
// scalable partitioning handling feature (introduced at Spark 2.1) is disabled in this case.
case r: HadoopFsRelation if r.overlappedPartCols.nonEmpty =>
  table.copy(schema = new StructType(), partitionColumnNames = Nil)
```
Member

Log a warning message here? When data columns and partition columns have the same names, the values could be inconsistent. Thus, we do not suggest users create such a table, and it might not perform well because we infer the schema at runtime.
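A minimal sketch of such a warning, mirroring the case clause above (the message wording is illustrative, not the text that was merged, and it assumes the enclosing class mixes in Spark's Logging trait):

```
case r: HadoopFsRelation if r.overlappedPartCols.nonEmpty =>
  // Illustrative warning only; exact wording and placement are hypothetical.
  logWarning("The data schema and the partition schema have overlapping columns. " +
    "Falling back to storing an empty schema in the metastore and inferring the schema " +
    "at runtime; this may hurt read performance, and overlapping column values may be " +
    "inconsistent.")
  table.copy(schema = new StructType(), partitionColumnNames = Nil)
```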

@SparkQA

SparkQA commented Oct 27, 2017

Test build #83094 has finished for PR 19579 at commit 2fe7ec4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM

@gatorsmile
Member

gatorsmile commented Oct 27, 2017

Thanks! Merged to master/2.2

@asfgit asfgit closed this in 9b262f6 Oct 27, 2017
asfgit pushed a commit that referenced this pull request Oct 27, 2017
…s between data and partition schema

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19579 from cloud-fan/bug2.
asfgit pushed a commit that referenced this pull request Feb 15, 2018
## What changes were proposed in this pull request?
#19579 introduces a behavior change. We need to document it in the migration guide.

## How was this patch tested?
Also update the HiveExternalCatalogVersionsSuite to verify it.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #20606 from gatorsmile/addMigrationGuide.

(cherry picked from commit a77ebb0)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
…s between data and partition schema

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19579 from cloud-fan/bug2.
cloud-fan pushed a commit that referenced this pull request Nov 4, 2020
…using a hard-coded location to store localized Spark binaries

### What changes were proposed in this pull request?
This PR changes `HiveExternalCatalogVersionsSuite` to use, by default, a standard temporary directory to store the Spark binaries that it localizes. It additionally adds a new system property, `spark.test.cache-dir`, which can be used to define a static location for the localized Spark binaries so that they can be shared between test executions. If the system property is used, the downloaded binaries won't be deleted after the test runs.
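A rough sketch of that selection logic, assuming the property is read via `sys.props` (the exact code in the PR may differ):

```
import java.io.File
import org.apache.spark.util.Utils

// Sketch only: use the spark.test.cache-dir system property when set,
// otherwise fall back to a per-run temporary directory.
private val isTestCacheDirSet = sys.props.contains("spark.test.cache-dir")
private val sparkTestingDir: File = sys.props.get("spark.test.cache-dir")
  .map(new File(_))
  .getOrElse(Utils.createTempDir(namePrefix = "test-spark"))

override def afterAll(): Unit = {
  try {
    // Only delete a directory we created ourselves; a user-provided cache dir is
    // kept so later runs can reuse the downloaded Spark binaries.
    if (!isTestCacheDirSet) Utils.deleteRecursively(sparkTestingDir)
  } finally {
    super.afterAll()
  }
}
```

Keeping the delete conditional preserves the reuse-across-runs intent of the original hard-coded directory without leaving files behind in the default case.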

### Why are the changes needed?
In SPARK-22356 (PR #19579), the `sparkTestingDir` used by `HiveExternalCatalogVersionsSuite` became hard-coded to enable re-use of the downloaded Spark tarball between test executions:
```
  // For local test, you can set `sparkTestingDir` to a static value like `/tmp/test-spark`, to
  // avoid downloading Spark of different versions in each run.
  private val sparkTestingDir = new File("/tmp/test-spark")
```
However, this doesn't work, since the directory is deleted after every run:
```
  override def afterAll(): Unit = {
    try {
      Utils.deleteRecursively(wareHousePath)
      Utils.deleteRecursively(tmpDataDir)
      Utils.deleteRecursively(sparkTestingDir)
    } finally {
      super.afterAll()
    }
  }
```

It's also bad that we hard-code a `/tmp` directory, as in some cases this is not the proper place to store temporary files, and since the directory is deleted after every run we're not currently making any good use of it.

### Does this PR introduce _any_ user-facing change?
Developer-facing changes only, as this is in a test.

### How was this patch tested?
The test continues to execute as expected.

Closes #30122 from xkrogen/xkrogen-SPARK-33214-hiveexternalversioncatalogsuite-fix.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>