[SPARK-30323][SQL] Support filters pushdown in CSV datasource #26973

MaxGekk · 2019-12-20T19:03:37Z

What changes were proposed in this pull request?

In this PR, I propose to support pushed-down filters in the CSV datasource. The reason for pushing a filter down to UnivocityParser is to apply the filter as soon as all of its attributes become available, i.e. converted from CSV fields to the desired values according to the schema. This allows skipping the conversion of the remaining values if the filter returns false. This can improve performance when the pushed filters are highly selective and the conversion of CSV string fields to the desired values is comparatively expensive (for example, conversion to TIMESTAMP values).

Here are details of the implementation:

UnivocityParser.convert() converts parsed CSV tokens one by one, sequentially, from index 0 up to parsedSchema.length - 1. At the current index i, it applies the filters that refer to attributes at row field indexes 0..i. If any filter returns false, it skips the conversion of the remaining input tokens.
Pushed filters are converted to expressions. The expressions are bound to row positions according to requiredSchema. The expressions are compiled to predicates by generating Java code.
To be able to apply predicates to partially initialized rows, the predicates are grouped and combined via the And expression. The final predicate at index N can refer to row fields at positions 0..N, and can be applied to a row even if the other fields at positions N+1..requiredSchema.length-1 are not set.
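
The steps above can be illustrated with a minimal sketch in plain Scala. It is a simplified model and not the code from this PR: the names FilterPushdownSketch, PushedFilter, groupByMaxIndex and the shape of convert() are hypothetical, rows are modeled as Array[Any] instead of InternalRow, and predicates are ordinary closures rather than generated Java code.

```scala
// Hypothetical, simplified model of the idea; not Spark's CSVFilters/UnivocityParser.
object FilterPushdownSketch {
  type Predicate = Array[Any] => Boolean

  // A pushed filter: the row field indexes it reads, plus its predicate.
  final case class PushedFilter(refs: Set[Int], eval: Predicate)

  // Group filters by the highest field index they reference and AND each group,
  // so the predicate stored at index i only needs fields 0..i to be set.
  // Filters without references are assumed to be checkable at index 0.
  def groupByMaxIndex(filters: Seq[PushedFilter], numFields: Int): Array[Option[Predicate]] = {
    val grouped = filters.groupBy(f => if (f.refs.isEmpty) 0 else f.refs.max)
    Array.tabulate(numFields) { i =>
      grouped.get(i).map { fs => (row: Array[Any]) => fs.forall(_.eval(row)) }
    }
  }

  // Convert tokens left to right and stop at the first index whose predicate
  // is false. Returns the row (None if skipped) and the number of converter
  // calls actually made, which is only tracked here for illustration.
  def convert(
      tokens: Array[String],
      converters: Array[String => Any],
      predicates: Array[Option[Predicate]]): (Option[Array[Any]], Int) = {
    val row = new Array[Any](tokens.length)
    var i = 0
    var skip = false
    var conversions = 0
    while (i < tokens.length && !skip) {
      row(i) = converters(i)(tokens(i))
      conversions += 1
      // Apply the combined predicate for this index, if any.
      skip = predicates(i).exists(p => !p(row))
      i += 1
    }
    (if (skip) None else Some(row), conversions)
  }
}
```

In this sketch, with an integer field at index 0, a date-time field at index 1 and a single filter requiring field 0 to be greater than 10, converting the tokens "5" and "2019-12-20T19:03:37" returns (None, 1): the row is skipped after one conversion and the date-time token is never parsed.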

Why are the changes needed?

The changes improve performance on synthetic benchmarks by more than 9 times (on JDK 8 & 11):

OpenJDK 64-Bit Server VM 11.0.5+10 on Mac OS X 10.15.2
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Filters pushdown:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters                                       11889          11945          52          0.0      118893.1       1.0X
pushdown disabled                                 11790          11860         115          0.0      117902.3       1.0X
w/ filters                                         1240           1278          33          0.1       12400.8       9.6X

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added new test suite CSVFiltersSuite
Added tests to CSVSuite and UnivocityParserSuite

SparkQA · 2019-12-20T21:05:32Z

Test build #115633 has finished for PR 26973 at commit f0cc83c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-12-21T02:41:31Z

Test build #115636 has finished for PR 26973 at commit f24e873.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-01-13T13:22:57Z

Test build #116621 has finished for PR 26973 at commit 4a25815.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-01-13T13:53:48Z

Test build #116624 has finished for PR 26973 at commit c03ae06.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-01-13T18:45:07Z

Test build #116647 has finished for PR 26973 at commit e302fa4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-01-13T19:45:35Z

Test build #116649 has finished for PR 26973 at commit 96e9554.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVFilters.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala

HyukjinKwon · 2020-01-14T05:06:33Z

Looks pretty good, but I will take a final look after the comments are addressed.

SparkQA · 2020-01-14T12:32:12Z

Test build #116694 has finished for PR 26973 at commit 9217536.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-01-14T22:51:41Z

Test build #116722 has finished for PR 26973 at commit 15c9648.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class CSVFilters(filters: Seq[sources.Filter], requiredSchema: StructType)

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala

SparkQA · 2020-01-15T08:05:02Z

Test build #116755 has finished for PR 26973 at commit 06be013.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2020-01-15T08:09:51Z

jenkins, retest this, please

SparkQA · 2020-01-15T12:27:42Z

Test build #116764 has finished for PR 26973 at commit 06be013.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-01-16T04:09:37Z

Merged to master.

MaxGekk added 26 commits December 16, 2019 20:31

Return Seq[InternalRow] from convert()

895638f

Pass filters to CSV datasource v1

4bc8d9b

Add CSVFilters

0124199

Add filterToExpression

fb8912e

Initial impl of CSVFilters

c2515b6

Support filters push down in CSV v2

9ced607

Add a test to CSVSuite

20dbef0

Keep only one predicate per field

becfe1e

Add a benchmark

77e7d54

SQL config spark.sql.csv.filterPushdown.enabled

415e4ce

Use SQL config in CSVBenchmark

3db517f

Refactoring

98963bc

Add comments for skipRow

05111a5

Apply filters only on CSV level

899cf17

Add a comment for predicates

d08fe58

Add a comment for CSVFilters

b0a34b3

Add a comment for unsupportedFilters

5fe5600

Add comments

a7f3006

Add tests to UnivocityParserSuite

c989bee

Support AlwaysTrue and AlwaysFalse filters

124c45d

Add tests for filterToExpression()

d7932c2

Add tests for readSchema

bb0abf4

Add tests for skipRow()

1c707e5

Benchmarks at the commit 67b644c

11bcbc6

Revert "Benchmarks at the commit 67b644c"

a5088bd

This reverts commit 11bcbc6.

Update benchmarks results

f0cc83c

MaxGekk added 2 commits December 21, 2019 00:14

Merge remote-tracking branch 'remotes/origin/master' into csv-filters-pushdown

e7b3304

Add equals(), hashCode() and description() to CSVScan

f24e873

Test more options/modes in the end-to-end test

18389b0

MaxGekk added 2 commits January 13, 2020 17:30

Bug fix: malformed input + permissive mode + columnNameOfCorruptRecord option

e302fa4

Remove unnecessary setNullAt

96e9554

HyukjinKwon reviewed Jan 14, 2020

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVFilters.scala

HyukjinKwon reviewed Jan 14, 2020

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala

HyukjinKwon reviewed Jan 14, 2020

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala

MaxGekk added 2 commits January 14, 2020 11:15

Remove checkFilters()

1be5534

Remove private[sql] for parsedSchema

9217536

Simplify code assuming that requireSchema contains all filter refs

15c9648

HyukjinKwon reviewed Jan 15, 2020

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala

HyukjinKwon reviewed Jan 15, 2020

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/UnivocityParserSuite.scala

HyukjinKwon approved these changes Jan 15, 2020

MaxGekk added 3 commits January 15, 2020 09:12

Merge remote-tracking branch 'remotes/origin/master' into csv-filters-pushdown

f2c3b3e

Use intercept in UnivocityParserSuite

df30439

Remove nested getSchema() in UnivocityParserSuite

06be013

HyukjinKwon closed this in 4e50f02 Jan 16, 2020

MaxGekk deleted the csv-filters-pushdown branch June 5, 2020 19:42
