[SPARK-11301][SQL] Fix case sensitivity for filter on partitioned col… #14970

Closed

@dongjoon-hyun
Member

dongjoon-hyun commented Sep 6, 2016

What changes were proposed in this pull request?

DataSourceStrategy does not consult the SQLConf in the Context and always matches column names case-sensitively. For instance, HiveContext uses a case-insensitive configuration, but DataSourceStrategy ignores it. This issue was originally filed as SPARK-11301 against 1.6.0 and seemed to be fixed at that time, but Apache Spark 1.6.2 still always handles partitioned column names in a case-sensitive way. As a result, the following query returns both rows even though its filter should keep only the row with b='P'.

scala> sql("CREATE TABLE t(a int) PARTITIONED BY (b string) STORED AS PARQUET")
scala> sql("INSERT INTO TABLE t PARTITION(b='P') SELECT * FROM (SELECT 1) t")
scala> sql("INSERT INTO TABLE t PARTITION(b='Q') SELECT * FROM (SELECT 2) t")
scala> sql("SELECT * FROM T WHERE B='P'").show
+---+---+
|  a|  b|
+---+---+
|  1|  P|
|  2|  Q|
+---+---+

The result is the same after running set spark.sql.caseSensitive=false. Here is the result in Databricks CE.

This PR reads the configuration and handles the column name comparison accordingly.
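To illustrate the idea of configuration-aware name comparison, here is a minimal, Spark-free sketch. It is not the code from this patch: the object `CaseAwareResolution`, the `Resolver` alias, and the helpers `resolverFor` and `referencesPartitionColumn` are hypothetical names introduced for illustration, and the `caseSensitive` flag merely stands in for the value of spark.sql.caseSensitive.

```scala
// Hypothetical sketch, not the actual DataSourceStrategy change.
object CaseAwareResolution {
  // A resolver decides whether two identifiers name the same column.
  type Resolver = (String, String) => Boolean

  // Exact match: "B" and "b" are different columns.
  val caseSensitiveResolution: Resolver = (a, b) => a == b

  // Case-insensitive match: "B" and "b" are the same column.
  val caseInsensitiveResolution: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // Pick the resolver from a flag that stands in for spark.sql.caseSensitive.
  def resolverFor(caseSensitive: Boolean): Resolver =
    if (caseSensitive) caseSensitiveResolution else caseInsensitiveResolution

  // True when the column named in a filter matches one of the partition columns.
  def referencesPartitionColumn(
      filterColumn: String,
      partitionColumns: Seq[String],
      resolver: Resolver): Boolean =
    partitionColumns.exists(resolver(_, filterColumn))
}
```

With `resolverFor(caseSensitive = false)`, the filter column `B` from the query above matches the partition column `b`, so the filter can be recognized as a partition filter. With `resolverFor(caseSensitive = true)` it does not match, which corresponds to the behavior reported here.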

How was this patch tested?

Pass the Jenkins test with a modified test.

@dongjoon-hyun
Member Author

Hi, @cloud-fan.
Could you review this PR?

@dongjoon-hyun
Member Author

Also, cc @yhuai.

@cloud-fan
Contributor

Thanks. Can you also put the cause of this bug in the PR description (not just the symptom)? Is it a bug in 1.6 only?

@dongjoon-hyun
Member Author

Thank you for the review, @cloud-fan. Yes, this is a bug in the 1.6 branch only. I'll update the PR description.

@SparkQA

SparkQA commented Sep 6, 2016

Test build #64977 has finished for PR 14970 at commit a86de9b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

@cloud-fan, I updated the PR description with a more detailed reason.

asfgit pushed a commit that referenced this pull request Sep 6, 2016
## What changes were proposed in this pull request?

`DataSourceStrategy` does not consult the `SQLConf` in the `Context` and always matches column names case-sensitively. For instance, `HiveContext` uses a case-insensitive configuration, but `DataSourceStrategy` ignores it. This issue was originally filed as SPARK-11301 against 1.6.0 and seemed to be fixed at that time, but Apache Spark 1.6.2 still always handles **partitioned column names** in a case-sensitive way. As a result, the following query returns both rows even though its filter should keep only the row with `b='P'`.

```scala
scala> sql("CREATE TABLE t(a int) PARTITIONED BY (b string) STORED AS PARQUET")
scala> sql("INSERT INTO TABLE t PARTITION(b='P') SELECT * FROM (SELECT 1) t")
scala> sql("INSERT INTO TABLE t PARTITION(b='Q') SELECT * FROM (SELECT 2) t")
scala> sql("SELECT * FROM T WHERE B='P'").show
+---+---+
|  a|  b|
+---+---+
|  1|  P|
|  2|  Q|
+---+---+
```

The result is the same after running `set spark.sql.caseSensitive=false`. Here is the result in [Databricks CE](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6660119172909095/3421754458488607/5162191866050912/latest.html).

This PR reads the configuration and handles the column name comparison accordingly.

## How was this patch tested?

Pass the Jenkins test with a modified test.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14970 from dongjoon-hyun/SPARK-11301.
@cloud-fan
Contributor

thanks, merging to 1.6!

@dongjoon-hyun
Member Author

Thank you, @cloud-fan!

@cloud-fan
Contributor

Can you close it? It's not merged to master, so it will not be closed automatically.

@dongjoon-hyun
Member Author

Sure!

zzcclp pushed a commit to zzcclp/spark that referenced this pull request Sep 7, 2016
(cherry picked from commit 958039a)
@dongjoon-hyun dongjoon-hyun deleted the SPARK-11301 branch January 7, 2019 07:02