[SPARK-11301][SQL] Fix case sensitivity for filter on partitioned col… #14970
Hi, @cloud-fan.
Also, cc @yhuai.
Thanks. Can you also put the reason for this bug in the PR description (not just the symptom)? Is it a bug for 1.6 only?
Thank you for the review, @cloud-fan. Yes, this is a bug in the 1.6 branch only. I'll update the PR description.
Test build #64977 has finished for PR 14970 at commit
@cloud-fan, I updated the PR description with a more detailed reason.
## What changes were proposed in this pull request?

`DataSourceStrategy` does not consider `SQLConf` in `Context` and always matches column names case-sensitively. For instance, `HiveContext` uses a case-insensitive configuration, but it is ignored in `DataSourceStrategy`. This issue was originally registered as SPARK-11301 against 1.6.0 and seemed to be fixed at that time, but Apache Spark 1.6.2 still always handles **partitioned column names** in a case-sensitive way. This produces incorrect results, as in the following example.

```scala
scala> sql("CREATE TABLE t(a int) PARTITIONED BY (b string) STORED AS PARQUET")
scala> sql("INSERT INTO TABLE t PARTITION(b='P') SELECT * FROM (SELECT 1) t")
scala> sql("INSERT INTO TABLE t PARTITION(b='Q') SELECT * FROM (SELECT 2) t")
scala> sql("SELECT * FROM T WHERE B='P'").show
+---+---+
|  a|  b|
+---+---+
|  1|  P|
|  2|  Q|
+---+---+
```

The result is the same with `set spark.sql.caseSensitive=false`. Here is the result in [Databricks CE](https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6660119172909095/3421754458488607/5162191866050912/latest.html).

This PR reads the configuration and handles the column name comparison accordingly.

## How was this patch tested?

Pass the Jenkins test with a modified test.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #14970 from dongjoon-hyun/SPARK-11301.
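To illustrate the idea of comparing column names according to configuration, here is a minimal standalone sketch. It is not the code from this patch and does not depend on Spark. The object `ResolverSketch`, the helper `resolverFor`, and the helper `referencesPartitionColumn` are hypothetical names introduced only for this example, and the boolean flag stands in for the value of `spark.sql.caseSensitive`.

```scala
object ResolverSketch {
  // A resolver decides whether two identifiers name the same column.
  type Resolver = (String, String) => Boolean

  val caseSensitiveResolution: Resolver = (a, b) => a == b
  val caseInsensitiveResolution: Resolver = (a, b) => a.equalsIgnoreCase(b)

  // Hypothetical helper: choose a resolver from a flag that stands in
  // for the spark.sql.caseSensitive setting.
  def resolverFor(caseSensitive: Boolean): Resolver =
    if (caseSensitive) caseSensitiveResolution else caseInsensitiveResolution

  // Hypothetical helper: does a filter attribute refer to a partition column?
  def referencesPartitionColumn(
      attr: String,
      partitionColumns: Seq[String],
      resolver: Resolver): Boolean =
    partitionColumns.exists(col => resolver(col, attr))

  def main(args: Array[String]): Unit = {
    val partitionColumns = Seq("b")
    // With exact matching, the filter on "B" is not tied to partition column "b".
    println(referencesPartitionColumn("B", partitionColumns, resolverFor(caseSensitive = true)))
    // With case-insensitive matching, "B" resolves to "b".
    println(referencesPartitionColumn("B", partitionColumns, resolverFor(caseSensitive = false)))
  }
}
```

In the example from the description, the filter `B='P'` refers to partition column `b`. Under exact string comparison the two names do not match, which is consistent with the symptom shown, where both rows come back.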
Thanks, merging to 1.6!
Thank you, @cloud-fan!
Can you close it? It's not merged to master and will not be closed automatically.
Sure!
Closes apache#14970 from dongjoon-hyun/SPARK-11301. (cherry picked from commit 958039a)