[SPARK-9407] [SQL] Relaxes Parquet ValidTypeMap to allow ENUM predicates to be pushed down #8107
Conversation
Well, I added 3 hack alerts in 1.5 including this PR, just trying to make the alerts a little bit more consistent :)
Force-pushed: ed5b26c to dedb3b6 (Compare)
retest this please

I'm a little bit confused; not sure whether the Jenkins build failed or passed, since there are no links to the builds...

retest this please

Test build #40615 has finished for PR 8107 at commit

Test build #1490 has finished for PR 8107 at commit

Merging to master and branch-1.5.
    
…tes to be pushed down

This PR adds a hacky workaround for PARQUET-201, and should be removed once we upgrade to parquet-mr 1.8.1 or a higher version.

In Parquet, not all column types can be used for filter push-down optimization. The set of valid column types is controlled by `ValidTypeMap`. Unfortunately, in parquet-mr 1.7.0 and prior versions this limitation is too strict and doesn't allow `BINARY (ENUM)` columns to be pushed down, even though `BINARY (ENUM)` is commonly seen in Parquet files written by libraries like `parquet-avro`.

This restriction is problematic for Spark SQL, because Spark SQL doesn't have a type that maps to Parquet `BINARY (ENUM)` directly and always converts `BINARY (ENUM)` to Catalyst `StringType`. Thus, a predicate involving a `BINARY (ENUM)` column is recognized as one involving a string field and can be pushed down by the query optimizer. Such predicates are perfectly legal except that they fail the `ValidTypeMap` check.

The workaround added here relaxes `ValidTypeMap` to include `BINARY (ENUM)`. I also took the chance to simplify `ParquetCompatibilityTest` a little bit when adding the regression test.

Author: Cheng Lian <lian@databricks.com>

Closes #8107 from liancheng/spark-9407/parquet-enum-filter-push-down.

(cherry picked from commit 3ecb379)

Signed-off-by: Cheng Lian <lian@databricks.com>
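The effect of the relaxation can be sketched with a small self-contained model. This is not the actual parquet-mr API; `ColumnType`, `strictMap`, `relaxedMap`, and `canPushDown` are hypothetical names standing in for the idea behind `ValidTypeMap`: a whitelist of (physical type, logical annotation) pairs that a column must match before a predicate on it may be pushed down.

```scala
// Hypothetical model of Parquet's ValidTypeMap check; names and types are
// illustrative, not the parquet-mr API.
object ValidTypeMapSketch {
  // A column's physical type plus its optional logical (original) annotation.
  case class ColumnType(physical: String, logical: Option[String])

  // Before the fix: BINARY (UTF8) is pushable, but BINARY (ENUM) is not.
  val strictMap: Set[ColumnType] = Set(
    ColumnType("INT32", None),
    ColumnType("INT64", None),
    ColumnType("BINARY", Some("UTF8"))
  )

  // After the fix: BINARY (ENUM) is also accepted, which is safe for Spark SQL
  // because enum columns are read as Catalyst StringType anyway.
  val relaxedMap: Set[ColumnType] =
    strictMap + ColumnType("BINARY", Some("ENUM"))

  // A predicate on a column may be pushed down only if the column's type
  // appears in the whitelist.
  def canPushDown(map: Set[ColumnType], col: ColumnType): Boolean =
    map.contains(col)
}
```

Under the strict map a string predicate on an enum column would be rejected at this check even though the predicate itself is legal; the relaxed map lets it through.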