[SPARK-27698][SQL] Add new method convertibleFilters for getting pushed down filters in Parquet file reader #24597

Closed

@gengliangwang gengliangwang commented May 14, 2019

What changes were proposed in this pull request?

To return accurate pushed filters in the Parquet file scan (#24327 (review)), we can process the original data source filters in the following way:

  1. For "And" operators, split the conjunctive predicates and try converting each of them. After that:
    1.1 if partial predicate pushdown is allowed, return the convertible results;
    1.2 otherwise, return the whole predicate if it is convertible, or an empty result if it is not.

  2. For "Or" operators, the predicate is partially or totally convertible only if both children can be pushed down; otherwise, return an empty result.

  3. Other operators cannot be partially pushed down:
    3.1 if the entire predicate is convertible, return the predicate itself;
    3.2 otherwise, return an empty result.
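
The three rules above can be sketched as a small recursive function. This is only an illustration under stated assumptions, not the code in this PR: the `Filter` types are simplified stand-ins for Spark's data source filters, and `isLeafConvertible`, `convertible`, and `StringContains` being unsupported are hypothetical choices made for the example.

```scala
// Simplified stand-ins for data source filters (hypothetical, not Spark's classes).
sealed trait Filter
final case class And(left: Filter, right: Filter) extends Filter
final case class Or(left: Filter, right: Filter) extends Filter
final case class Not(child: Filter) extends Filter
final case class EqualTo(attribute: String, value: Any) extends Filter
final case class StringContains(attribute: String, value: String) extends Filter

object ConvertibleFiltersSketch {
  // Assumption for this sketch: only EqualTo leaves can be converted.
  def isLeafConvertible(filter: Filter): Boolean = filter match {
    case _: EqualTo => true
    case _ => false
  }

  // Returns the part of `filter` that can be pushed down, or None.
  def convertible(filter: Filter, canPartialPushDown: Boolean): Option[Filter] =
    filter match {
      // Rule 1: convert each side of a conjunction separately.
      case And(left, right) =>
        val l = convertible(left, canPartialPushDown)
        val r = convertible(right, canPartialPushDown)
        (l, r) match {
          case (Some(a), Some(b)) => Some(And(a, b))
          // Keeping one side is only safe when partial pushdown is allowed.
          case (Some(a), None) if canPartialPushDown => Some(a)
          case (None, Some(b)) if canPartialPushDown => Some(b)
          case _ => None
        }

      // Rule 2: a disjunction needs both children to be (at least partially) convertible.
      case Or(left, right) =>
        for {
          a <- convertible(left, canPartialPushDown)
          b <- convertible(right, canPartialPushDown)
        } yield Or(a, b)

      // Rule 3: other operators are all-or-nothing. Under Not, dropping part of the
      // child would make the negated filter stricter, so partial pushdown is disabled.
      case Not(child) =>
        convertible(child, canPartialPushDown = false).map(c => Not(c))

      case leaf =>
        if (isLeafConvertible(leaf)) Some(leaf) else None
    }
}
```

With these stand-ins, `And(EqualTo("a", 1), StringContains("b", "x"))` yields `Some(EqualTo("a", 1))` when partial pushdown is allowed and `None` when it is not.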

This PR also contains code refactoring. Currently, ParquetFilters.createFilter accepts the parameter schema: MessageType and creates the field mapping for every input filter. We can make the schema a class member and avoid creating the nameToParquetField mapping for every input filter.
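
The shape of that refactoring can be sketched roughly as follows. The names `FieldInfo`, `PerCallFilters`, and `MemberFilters` are hypothetical stand-ins invented for this example; the real code works with Parquet's MessageType and builds Parquet filter predicates rather than returning a field.

```scala
// Stand-in for a Parquet schema field (hypothetical; the real code uses MessageType).
final case class FieldInfo(name: String, typeName: String)

// Before: the schema is a method parameter, so the name-to-field map
// is rebuilt for every input filter.
object PerCallFilters {
  def createFilter(schema: Seq[FieldInfo], attribute: String): Option[FieldInfo] = {
    val nameToField = schema.map(f => f.name -> f).toMap
    nameToField.get(attribute)
  }
}

// After: the schema is a constructor parameter, so the map is built once
// per instance and shared by every createFilter call.
class MemberFilters(schema: Seq[FieldInfo]) {
  private val nameToField: Map[String, FieldInfo] =
    schema.map(f => f.name -> f).toMap

  def createFilter(attribute: String): Option[FieldInfo] =
    nameToField.get(attribute)
}
```

Both variants return the same result for a given attribute; the second simply moves the map construction out of the per-filter path.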

How was this patch tested?

Unit test

@gengliangwang
Member Author

Marking this as WIP until #24598 is merged.

@gengliangwang gengliangwang changed the title [SPARK-27698] Add new method for getting pushed down filters in Parquet file reader [WIP][SPARK-27698] Add new method for getting pushed down filters in Parquet file reader May 14, 2019
@SparkQA

SparkQA commented May 14, 2019

Test build #105376 has finished for PR 24597 at commit 75fe737.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang gengliangwang changed the title [WIP][SPARK-27698] Add new method for getting pushed down filters in Parquet file reader [SPARK-27698] Add new method for getting pushed down filters in Parquet file reader May 19, 2019
@SparkQA

SparkQA commented May 19, 2019

Test build #105527 has finished for PR 24597 at commit fa8f48c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 19, 2019

Test build #105529 has finished for PR 24597 at commit b22ea80.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

This is ready for review @dongjoon-hyun @wangyum @rdblue @cloud-fan

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-27698] Add new method for getting pushed down filters in Parquet file reader [SPARK-27698][SQL] Add new method for getting pushed down filters in Parquet file reader May 20, 2019
@dongjoon-hyun
Member

dongjoon-hyun commented May 20, 2019

Thank you for pinging me, @gengliangwang. Shall we wait for one day? Currently, after SPARK-27699, a HiveOrcFilterSuite failure is reported in the Hadoop 3.2 profile. The fix is under testing and will be merged tomorrow. @wangyum and @HyukjinKwon are actively working on that.

Member

@wangyum wangyum left a comment

Looks fine to me.

@gengliangwang
Member Author

retest this please.

@gengliangwang
Member Author

@dongjoon-hyun @cloud-fan Please review this, so that we can continue the migration of Parquet V2.

@SparkQA

SparkQA commented May 21, 2019

Test build #105613 has finished for PR 24597 at commit b22ea80.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang gengliangwang changed the title [SPARK-27698][SQL] Add new method for getting pushed down filters in Parquet file reader [SPARK-27698][SQL] Add new method convertibleFilters for getting pushed down filters in Parquet file reader May 22, 2019
@cloud-fan cloud-fan closed this in c3c443c May 22, 2019
@cloud-fan
Contributor

thanks, merging to master!

/**
 * Returns a map, which contains parquet field name and data type, if predicate push down applies.
 */
private def getFieldMap(dataType: MessageType): Map[String, ParquetField] = {
Member

@gengliangwang, we don't have to move code around; that makes it easier to track ...

@HyukjinKwon
Member

Looks fine, but can you clarify the relation between convertibleFilters and createFilters?

6 participants