[SPARK-30323][SQL] Support filters pushdown in CSV datasource #26973

Closed
wants to merge 50 commits into from


MaxGekk
Member

@MaxGekk MaxGekk commented Dec 20, 2019

What changes were proposed in this pull request?

In this PR, I propose supporting pushed-down filters in the CSV datasource. The reason for pushing a filter down to UnivocityParser is to apply it as soon as all of its attributes become available, i.e. have been converted from CSV fields to the desired values according to the schema. This allows skipping the conversion of the remaining values if the filter returns false. This can improve performance when the pushed filters are highly selective and converting CSV string fields to the desired values is comparatively expensive (for example, conversion to TIMESTAMP values).

Here are details of the implementation:

  • UnivocityParser.convert() converts parsed CSV tokens one by one, sequentially from index 0 up to parsedSchema.length - 1. At the current index i, it applies the filters that refer to attributes at row field indexes 0..i. If any filter returns false, it skips the conversion of the remaining input tokens.
  • Pushed filters are converted to expressions. The expressions are bound to row positions according to requiredSchema and compiled to predicates by generating Java code.
  • To be able to apply predicates to partially initialized rows, the predicates are grouped and combined via the And expression. The final predicate at index N can refer to row fields at positions 0..N, and can be applied to a row even if the other fields at positions N+1..requiredSchema.length - 1 are not set.
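
The steps above can be illustrated with a small standalone sketch. It is not the code from this PR: it uses plain Scala closures instead of Catalyst expressions and generated Java code, and the names CsvFilterSketch, PushedFilter, groupByMaxIndex and the simplified convert signature are hypothetical. It only shows the idea of grouping filters by the highest row position they reference and stopping conversion as soon as one group rejects the row.

```scala
object CsvFilterSketch {
  // A partially initialized row: positions after the current index are still null.
  type Row = Array[Any]

  // Hypothetical stand-in for a pushed filter: the row positions it reads
  // (assumed non-empty) and a test that only touches those positions.
  final case class PushedFilter(refs: Set[Int], eval: Row => Boolean)

  // Group filters by the highest position they reference and AND each group,
  // so the predicate at index N only needs fields 0..N to be set.
  def groupByMaxIndex(filters: Seq[PushedFilter], numFields: Int): Array[Row => Boolean] = {
    val grouped = Array.fill[Row => Boolean](numFields)(_ => true)
    filters.groupBy(_.refs.max).foreach { case (idx, fs) =>
      grouped(idx) = row => fs.forall(_.eval(row))
    }
    grouped
  }

  // Convert tokens left to right; after converting field i, apply the
  // predicate for index i and skip the remaining conversions if it fails.
  def convert(
      tokens: Array[String],
      converters: Array[String => Any],
      predicates: Array[Row => Boolean]): Option[Row] = {
    val row = new Array[Any](tokens.length)
    var i = 0
    while (i < tokens.length) {
      row(i) = converters(i)(tokens(i))
      if (!predicates(i)(row)) return None
      i += 1
    }
    Some(row)
  }
}
```

With a filter on position 0 only, a rejected row costs a single conversion, which is where the saving comes from when the later columns are expensive to convert.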

Why are the changes needed?

The changes improve performance on synthetic benchmarks by more than 9 times (on JDK 8 and 11):

OpenJDK 64-Bit Server VM 11.0.5+10 on Mac OS X 10.15.2
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Filters pushdown:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters                                       11889          11945          52          0.0      118893.1       1.0X
pushdown disabled                                 11790          11860         115          0.0      117902.3       1.0X
w/ filters                                         1240           1278          33          0.1       12400.8       9.6X

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Added new test suite CSVFiltersSuite
  • Added tests to CSVSuite and UnivocityParserSuite

@SparkQA

SparkQA commented Dec 20, 2019

Test build #115633 has finished for PR 26973 at commit f0cc83c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 21, 2019

Test build #115636 has finished for PR 26973 at commit f24e873.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 13, 2020

Test build #116621 has finished for PR 26973 at commit 4a25815.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 13, 2020

Test build #116624 has finished for PR 26973 at commit c03ae06.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 13, 2020

Test build #116647 has finished for PR 26973 at commit e302fa4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 13, 2020

Test build #116649 has finished for PR 26973 at commit 96e9554.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Looks pretty good, but I will take a final look after the comments are addressed.

@SparkQA

SparkQA commented Jan 14, 2020

Test build #116694 has finished for PR 26973 at commit 9217536.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 14, 2020

Test build #116722 has finished for PR 26973 at commit 15c9648.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CSVFilters(filters: Seq[sources.Filter], requiredSchema: StructType)

@SparkQA

SparkQA commented Jan 15, 2020

Test build #116755 has finished for PR 26973 at commit 06be013.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Jan 15, 2020

jenkins, retest this, please

@SparkQA

SparkQA commented Jan 15, 2020

Test build #116764 has finished for PR 26973 at commit 06be013.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@MaxGekk MaxGekk deleted the csv-filters-pushdown branch June 5, 2020 19:42