Conversation

@liancheng
Contributor

This PR adds a hacky workaround for PARQUET-201, and should be removed once we upgrade to parquet-mr 1.8.1 or higher versions.

In Parquet, not all types of columns can be used for filter push-down optimization. The set of valid column types is controlled by `ValidTypeMap`. Unfortunately, in parquet-mr 1.7.0 and prior versions, this limitation is too strict, and doesn't allow `BINARY (ENUM)` columns to be pushed down. On the other hand, `BINARY (ENUM)` is commonly seen in Parquet files written by libraries like `parquet-avro`.

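For concreteness, here is roughly what such a column looks like in a Parquet schema. The record and field names below are made up for illustration; `parquet-avro` writes an Avro enum as a `binary` field annotated with the `ENUM` original type:

```
message card {
  required binary suit (ENUM);
}
```
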
This restriction is problematic for Spark SQL, because Spark SQL doesn't have a type that maps to Parquet `BINARY (ENUM)` directly, and always converts `BINARY (ENUM)` to Catalyst `StringType`. Thus, a predicate involving a `BINARY (ENUM)` column is recognized as one involving a string field, and the query optimizer tries to push it down. Such predicates are perfectly legal, except that they fail the `ValidTypeMap` check.

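To illustrate from the Spark SQL side, here is a minimal sketch; the file path and column name are hypothetical, assuming a Parquet file whose `suit` column is `BINARY (ENUM)`:

```scala
// Spark 1.5-era API.  The `suit` column is BINARY (ENUM) in the Parquet file,
// but Spark SQL reads it as StringType, so this is an ordinary string predicate.
val df = sqlContext.read.parquet("/path/to/cards.parquet")  // hypothetical path

// The optimizer sees a plain string comparison and hands it to the Parquet
// filter push-down code, where parquet-mr 1.7.0's ValidTypeMap check rejects
// the underlying BINARY (ENUM) column.
df.filter(df("suit") === "SPADES").show()
```
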
The workaround added here relaxes `ValidTypeMap` to include `BINARY (ENUM)`. I also took the chance to simplify `ParquetCompatibilityTest` a little when adding the regression test.
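
For reference, a minimal sketch of the kind of reflection-based relaxation described above; this is not necessarily the exact diff, and it assumes the parquet-mr 1.7.0 internals it touches (the private `FullTypeDescriptor` nested class and the private static `add` method of `ValidTypeMap`):

```scala
import org.apache.parquet.filter2.predicate.ValidTypeMap
import org.apache.parquet.io.api.Binary
import org.apache.parquet.schema.OriginalType
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName

// Build a type descriptor for BINARY (ENUM) through the private nested class
// ValidTypeMap$FullTypeDescriptor (assumed parquet-mr 1.7.0 internal).
val ctor = Class
  .forName(classOf[ValidTypeMap].getName + "$FullTypeDescriptor")
  .getDeclaredConstructor(classOf[PrimitiveTypeName], classOf[OriginalType])
ctor.setAccessible(true)
val enumDescriptor = ctor
  .newInstance(PrimitiveTypeName.BINARY, OriginalType.ENUM)
  .asInstanceOf[AnyRef]

// Register Binary -> BINARY (ENUM) via the private static `add` method, so that
// predicates over BINARY (ENUM) columns pass the push-down type check.
val addMethod = classOf[ValidTypeMap].getDeclaredMethods.find(_.getName == "add").get
addMethod.setAccessible(true)
addMethod.invoke(null, classOf[Binary], enumDescriptor)
```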

@liancheng
Contributor Author

Well, I've added 3 hack alerts in 1.5, including this PR; I'm just trying to make the alerts a little bit more consistent :)

@liancheng force-pushed the spark-9407/parquet-enum-filter-push-down branch from ed5b26c to dedb3b6 on August 12, 2015 00:21
@liancheng
Contributor Author

retest this please

@liancheng
Contributor Author

I'm a little bit confused; I'm not sure whether the Jenkins build failed or passed, since there are no links to the builds...

@liancheng
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 12, 2015

Test build #40615 has finished for PR 8107 at commit dedb3b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 12, 2015

Test build #1490 has finished for PR 8107 at commit dedb3b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

Merging to master and branch-1.5.

@asfgit closed this in 3ecb379 on Aug 12, 2015
asfgit pushed a commit that referenced this pull request Aug 12, 2015
…tes to be pushed down

Author: Cheng Lian <lian@databricks.com>

Closes #8107 from liancheng/spark-9407/parquet-enum-filter-push-down.

(cherry picked from commit 3ecb379)
Signed-off-by: Cheng Lian <lian@databricks.com>
@liancheng deleted the spark-9407/parquet-enum-filter-push-down branch on August 12, 2015 12:07
CodingCat pushed a commit to CodingCat/spark that referenced this pull request Aug 17, 2015
…tes to be pushed down
