[SPARK-9407] [SQL] Relaxes Parquet ValidTypeMap to allow ENUM predicates to be pushed down #8107
Conversation
Well, I added 3 hack alerts in 1.5 including this PR, just trying to make the alerts a little bit more consistent :)
Force-pushed: ed5b26c to dedb3b6 (Compare)
retest this please

I'm a little bit confused; not sure whether the Jenkins build failed or passed, since there are no links to the builds...

retest this please

Test build #40615 has finished for PR 8107 at commit

Test build #1490 has finished for PR 8107 at commit

Merging to master and branch-1.5.
    
…tes to be pushed down

This PR adds a hacky workaround for PARQUET-201, and should be removed once we upgrade to parquet-mr 1.8.1 or a higher version.

In Parquet, not all column types can be used for filter push-down optimization. The set of valid column types is controlled by `ValidTypeMap`. Unfortunately, in parquet-mr 1.7.0 and prior versions this limitation is too strict and doesn't allow `BINARY (ENUM)` columns to be pushed down, even though `BINARY (ENUM)` is commonly seen in Parquet files written by libraries like `parquet-avro`.

This restriction is problematic for Spark SQL, because Spark SQL doesn't have a type that maps to Parquet `BINARY (ENUM)` directly and always converts `BINARY (ENUM)` to Catalyst `StringType`. Thus, a predicate involving a `BINARY (ENUM)` column is recognized as one involving a string field and can be pushed down by the query optimizer. Such predicates are perfectly legal except that they fail the `ValidTypeMap` check.

The workaround added here relaxes `ValidTypeMap` to include `BINARY (ENUM)`. I also took the chance to simplify `ParquetCompatibilityTest` a little bit when adding the regression test.

Author: Cheng Lian <lian@databricks.com>

Closes #8107 from liancheng/spark-9407/parquet-enum-filter-push-down.

(cherry picked from commit 3ecb379)

Signed-off-by: Cheng Lian <lian@databricks.com>
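The effect of the relaxation can be sketched with a small self-contained model. This is not the actual parquet-mr API; `ColumnType`, `strictMap`, `relaxedMap`, and `canPushDown` are hypothetical names standing in for the idea behind `ValidTypeMap`: a whitelist of (physical type, logical annotation) pairs that a column must match before a predicate on it may be pushed down.

```scala
// Hypothetical model of Parquet's ValidTypeMap check; names and types are
// illustrative, not the parquet-mr API.
object ValidTypeMapSketch {
  // A column's physical type plus its optional logical (original) annotation.
  case class ColumnType(physical: String, logical: Option[String])

  // Before the fix: BINARY (UTF8) is pushable, but BINARY (ENUM) is not.
  val strictMap: Set[ColumnType] = Set(
    ColumnType("INT32", None),
    ColumnType("INT64", None),
    ColumnType("BINARY", Some("UTF8"))
  )

  // After the fix: BINARY (ENUM) is also accepted, which is safe for Spark SQL
  // because enum columns are read as Catalyst StringType anyway.
  val relaxedMap: Set[ColumnType] =
    strictMap + ColumnType("BINARY", Some("ENUM"))

  // A predicate on a column may be pushed down only if the column's type
  // appears in the whitelist.
  def canPushDown(map: Set[ColumnType], col: ColumnType): Boolean =
    map.contains(col)
}
```

Under the strict map a string predicate on an enum column would be rejected at this check even though the predicate itself is legal; the relaxed map lets it through.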