[SPARK-11301][SQL] Fix case sensitivity for filter on partitioned col… #14970

dongjoon-hyun · 2016-09-06T07:08:31Z

What changes were proposed in this pull request?

DataSourceStrategy does not consider SQLConf in Context and always matches column names case-sensitively. For instance, HiveContext uses a case-insensitive configuration, but DataSourceStrategy ignores it. This issue was originally registered as SPARK-11301 against 1.6.0 and seemed to be fixed at that time, but Apache Spark 1.6.2 still always handles partitioned column names in a case-sensitive way. As a result, the query below returns incorrect results: the filter B='P' should return only the first row, but both rows come back.

scala> sql("CREATE TABLE t(a int) PARTITIONED BY (b string) STORED AS PARQUET")
scala> sql("INSERT INTO TABLE t PARTITION(b='P') SELECT * FROM (SELECT 1) t")
scala> sql("INSERT INTO TABLE t PARTITION(b='Q') SELECT * FROM (SELECT 2) t")
scala> sql("SELECT * FROM T WHERE B='P'").show
+---+---+
|  a|  b|
+---+---+
|  1|  P|
|  2|  Q|
+---+---+

The result is the same with set spark.sql.caseSensitive=false. Here is the result in Databricks CE.

This PR reads the configuration and handles the column name comparison accordingly.
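To illustrate the idea of choosing the name comparison from the configuration, here is a minimal, self-contained sketch. It is not the actual patch and does not use Spark's classes: the object `PartitionColumnMatching`, the `Resolver` alias, and the helper names are hypothetical, and the Boolean parameter merely stands in for `spark.sql.caseSensitive`.

```scala
// Illustrative sketch only: every name here is hypothetical, not Spark's code.
object PartitionColumnMatching {
  // A resolver decides whether two column names refer to the same column.
  type Resolver = (String, String) => Boolean

  val caseSensitive: Resolver = (a, b) => a == b
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // Pick the resolver from a flag (a stand-in for spark.sql.caseSensitive).
  def resolverFor(caseSensitiveConf: Boolean): Resolver =
    if (caseSensitiveConf) caseSensitive else caseInsensitive

  // Keep only the filter references that name a partition column
  // under the chosen resolver.
  def partitionColumnsIn(
      filterRefs: Seq[String],
      partitionCols: Seq[String],
      resolver: Resolver): Seq[String] =
    filterRefs.filter(ref => partitionCols.exists(col => resolver(col, ref)))
}
```

With the case-insensitive resolver, a filter on `B` is recognized as a filter on partition column `b`; with the case-sensitive resolver it is not, which mirrors the symptom in the query above.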

How was this patch tested?

Pass the Jenkins test with a modified test.


dongjoon-hyun · 2016-09-06T07:11:28Z

Hi, @cloud-fan .
Could you review this PR?

dongjoon-hyun · 2016-09-06T07:22:42Z

Also, cc @yhuai .

cloud-fan · 2016-09-06T08:12:36Z

Thanks, can you also put the reason for this bug in the PR description (not just the symptom)? Is it a bug for 1.6 only?

dongjoon-hyun · 2016-09-06T08:44:34Z

Thank you for the review, @cloud-fan . Yes, this is a bug in the 1.6 branch only. I'll update the PR description.

SparkQA · 2016-09-06T08:48:45Z

Test build #64977 has finished for PR 14970 at commit a86de9b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2016-09-06T08:56:39Z

@cloud-fan , I updated the PR description with a more detailed reason.

## What changes were proposed in this pull request?

`DataSourceStrategy` does not consider `SQLConf` in `Context` and always match column names. For instance, `HiveContext` uses case insensitive configuration, but it's ignored in `DataSourceStrategy`. This issue was originally registered at SPARK-11301 against 1.6.0 and seemed to be fixed at that time, but Apache Spark 1.6.2 still handles **partitioned column name** in a case-sensitive way always. This is incorrect like the following.

```scala
scala> sql("CREATE TABLE t(a int) PARTITIONED BY (b string) STORED AS PARQUET")
scala> sql("INSERT INTO TABLE t PARTITION(b='P') SELECT * FROM (SELECT 1) t")
scala> sql("INSERT INTO TABLE t PARTITION(b='Q') SELECT * FROM (SELECT 2) t")
scala> sql("SELECT * FROM T WHERE B='P'").show
+---+---+
|  a|  b|
+---+---+
|  1|  P|
|  2|  Q|
+---+---+
```

The result is the same with `set spark.sql.caseSensitive=false`. Here is the result in [Databricks CE](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6660119172909095/3421754458488607/5162191866050912/latest.html).

This PR reads the configuration and handle the column name comparison accordingly.

## How was this patch tested?

Pass the Jenkins test with a modified test.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14970 from dongjoon-hyun/SPARK-11301.

cloud-fan · 2016-09-06T11:36:49Z

thanks, merging to 1.6!

dongjoon-hyun · 2016-09-06T13:32:22Z

Thank you, @cloud-fan !

cloud-fan · 2016-09-06T13:34:50Z

can you close it? It's not merged to master and will not be closed automatically

dongjoon-hyun · 2016-09-06T13:35:14Z

Sure!

Same commit message as above, ending with: Closes apache#14970 from dongjoon-hyun/SPARK-11301. (cherry picked from commit 958039a)

[SPARK-11301][SQL] Fix case sensitivity for filter on partitioned col…

a86de9b

…umns

dongjoon-hyun closed this Sep 6, 2016

dongjoon-hyun deleted the SPARK-11301 branch January 7, 2019 07:02
