Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32792][SQL] Improve Parquet In filter pushdown #29642

Closed
wants to merge 21 commits into from
Closed

[SPARK-32792][SQL] Improve Parquet In filter pushdown #29642

wants to merge 21 commits into from

Conversation

wangyum
Copy link
Member

@wangyum wangyum commented Sep 3, 2020

What changes were proposed in this pull request?

Support push down GreaterThanOrEqual minimum value and LessThanOrEqual maximum value for Parquet when sources.In's values exceeds spark.sql.optimizer.inSetRewriteMinMaxThreshold. For example:

SELECT * FROM t WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15)

We will push down id >= 1 and id <= 15.

Impala also has this improvement: https://issues.apache.org/jira/browse/IMPALA-3654

Why are the changes needed?

Improve query performance.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test, manual test and benchmark test.

Before this PR:

================================================================================================
Pushdown benchmark for InSet -> InFilters
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 10, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                5995           6026          53          2.6         381.2       1.0X
Parquet Vectorized (Pushdown)                                      423            440          11         37.2          26.9      14.2X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 10, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                5767           5887         154          2.7         366.7       1.0X
Parquet Vectorized (Pushdown)                                      419            428           6         37.6          26.6      13.8X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 10, distribution: 90):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                5764           5857          96          2.7         366.4       1.0X
Parquet Vectorized (Pushdown)                                      408            419           9         38.6          25.9      14.1X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 100, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                 5895           5949          41          2.7         374.8       1.0X
Parquet Vectorized (Pushdown)                                      5908           5986         114          2.7         375.6       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 100, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                 5893           5988         106          2.7         374.7       1.0X
Parquet Vectorized (Pushdown)                                      5875           5939          57          2.7         373.5       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 100, distribution: 90):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                 5891           5954          42          2.7         374.5       1.0X
Parquet Vectorized (Pushdown)                                      5901           5976          99          2.7         375.2       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 2000, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                  6128           6158          40          2.6         389.6       1.0X
Parquet Vectorized (Pushdown)                                       6145           6190          37          2.6         390.7       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 2000, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                  6142           6217          64          2.6         390.5       1.0X
Parquet Vectorized (Pushdown)                                       6149           6235          90          2.6         391.0       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 2000, distribution: 90):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                  6148           6218          64          2.6         390.9       1.0X
Parquet Vectorized (Pushdown)                                       6145           6177          30          2.6         390.7       1.0X

After this PR:

================================================================================================
Pushdown benchmark for InSet -> InFilters
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 10, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                5745           5768          28          2.7         365.2       1.0X
Parquet Vectorized (Pushdown)                                      401            412          12         39.2          25.5      14.3X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 10, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                5796           5861          61          2.7         368.5       1.0X
Parquet Vectorized (Pushdown)                                      417            482          37         37.7          26.5      13.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 10, distribution: 90):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                5754           5777          20          2.7         365.8       1.0X
Parquet Vectorized (Pushdown)                                      408            418           9         38.6          25.9      14.1X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 100, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                 5878           5915          40          2.7         373.7       1.0X
Parquet Vectorized (Pushdown)                                       929            940          10         16.9          59.1       6.3X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 100, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                 5886           5917          29          2.7         374.2       1.0X
Parquet Vectorized (Pushdown)                                      3091           3114          20          5.1         196.5       1.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 100, distribution: 90):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                 5913           5948          48          2.7         375.9       1.0X
Parquet Vectorized (Pushdown)                                      5330           5427          98          3.0         338.9       1.1X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 2000, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                  6147           6228          72          2.6         390.8       1.0X
Parquet Vectorized (Pushdown)                                       1023           1029           4         15.4          65.1       6.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 2000, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                  6164           6224          47          2.6         391.9       1.0X
Parquet Vectorized (Pushdown)                                       3332           3360          45          4.7         211.9       1.8X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 2000, distribution: 90):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                                  6154           6192          38          2.6         391.3       1.0X
Parquet Vectorized (Pushdown)                                       5588           5679          92          2.8         355.3       1.1X

@SparkQA
Copy link

SparkQA commented Sep 3, 2020

Test build #128266 has finished for PR 29642 at commit 648e8e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Copy link
Member

@wangyum Do you have any further comments? If not, shall we close this one?

@SparkQA
Copy link

SparkQA commented Oct 25, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34844/

@SparkQA
Copy link

SparkQA commented Oct 25, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34844/

@SparkQA
Copy link

SparkQA commented Oct 25, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34845/

@SparkQA
Copy link

SparkQA commented Oct 25, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34845/

@SparkQA
Copy link

SparkQA commented Oct 25, 2020

Test build #130244 has finished for PR 29642 at commit c5ab656.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 25, 2020

Test build #130245 has finished for PR 29642 at commit 0169114.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Copy link
Member Author

wangyum commented Oct 26, 2020

@SparkQA
Copy link

SparkQA commented Nov 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35840/

@SparkQA
Copy link

SparkQA commented Nov 18, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35840/

@SparkQA
Copy link

SparkQA commented Nov 18, 2020

Test build #131236 has finished for PR 29642 at commit ebb13cc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment on lines 613 to 618
case Some(dataType) =>
val sortedValues = values.sorted(TypeUtils.getInterpretedOrdering(dataType))
createFilterHelper(
sources.And(sources.GreaterThanOrEqual(name, sortedValues.head),
sources.LessThanOrEqual(name, sortedValues.last)),
canPartialPushDownConjuncts)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic is same to HiveShim.scala#L746-L750.

case InSet(child, values) if useAdvanced && values.size > inSetThreshold =>
val dataType = child.dataType
val sortedValues = values.toSeq.sorted(TypeUtils.getInterpretedOrdering(dataType))
convert(And(GreaterThanOrEqual(child, Literal(sortedValues.head, dataType)),
LessThanOrEqual(child, Literal(sortedValues.last, dataType))))

@cloud-fan @dongjoon-hyun @HyukjinKwon It can be improved by 6.6X in InSet -> InFilters (values count: 100, distribution: 10):

Parquet Vectorized (Pushdown)                      9520           9560          27          1.7         605.3       1.0X
Parquet Vectorized (Pushdown)                      873             885           11        18.0          55.5       6.6X

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, then can we turn it into a util method and use it in all the filter pushdown place?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, Added a new function to TypeUtils.

@SparkQA
Copy link

SparkQA commented Nov 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35893/

@SparkQA
Copy link

SparkQA commented Nov 18, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35893/

@SparkQA
Copy link

SparkQA commented Nov 18, 2020

Test build #131289 has finished for PR 29642 at commit b8cb1f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

shall we implement the logic in FileSourceStrategy? Then it's not parquet only.

@wangyum
Copy link
Member Author

wangyum commented Nov 20, 2020

It seems only Parquet not well supported In predicate pushdown. @MaxGekk What do you think?

This is the benchmark of CSV:

val rowsNum = 100 * 1000
val numIters = 3
val colsNum = 100
val fields = Seq.tabulate(colsNum)(i => StructField(s"col$i", TimestampType))
val schema = StructType(StructField("key", IntegerType) +: fields)
def columns(): Seq[Column] = {
  val ts = Seq.tabulate(colsNum) { i =>
    lit(Instant.ofEpochSecond(i * 12345678)).as(s"col$i")
  }
  ($"id" % 1000).as("key") +: ts
}
withTempPath { path =>
  spark.range(rowsNum).select(columns(): _*)
    .write.option("header", true)
    .csv(path.getAbsolutePath)
  def readback = {
    spark.read
      .option("header", true)
      .schema(schema)
      .csv(path.getAbsolutePath)
  }

  def withFilter(filer: String, configEnabled: Boolean): Unit = {
    withSQLConf(SQLConf.CSV_FILTER_PUSHDOWN_ENABLED.key -> configEnabled.toString()) {
      readback.filter(filer).noop()
    }
  }

  Seq(5, 10, 50, 100, 500).foreach { count =>
    Seq(10, 50).foreach { distribution =>
      val title = s"InSet -> InFilters (values count: $count, distribution: $distribution)"
      val benchmark = new Benchmark(title, rowsNum, output = output)
      Seq(false, true).foreach { pushDownEnabled =>
        val name = s"Native CSV Vectorized ${if (pushDownEnabled) s"(Pushdown)" else ""}"
        benchmark.addCase(name, numIters) { _ =>
          val filter =
            Range(0, count).map(_ => scala.util.Random.nextInt(rowsNum * distribution / 100))
          val whereExpr = s"key in(${filter.mkString(",")})"
          withFilter(whereExpr, configEnabled = pushDownEnabled)
        }
      }
      benchmark.run()
    }
  }
}

Result:

================================================================================================
Benchmark to measure CSV read performance
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 5, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                           13082          17077        1674          0.0      130815.6       1.0X
Native CSV Vectorized (Pushdown)                                 1172           1192          35          0.1       11719.5      11.2X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 5, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                           11858          12028         237          0.0      118576.9       1.0X
Native CSV Vectorized (Pushdown)                                 1165           1172           6          0.1       11652.4      10.2X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 10, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                            11883          12180         494          0.0      118834.3       1.0X
Native CSV Vectorized (Pushdown)                                  1142           1156          21          0.1       11418.6      10.4X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 10, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                            11857          11878          19          0.0      118570.4       1.0X
Native CSV Vectorized (Pushdown)                                  1169           1174           7          0.1       11692.9      10.1X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 50, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                            11923          11962          66          0.0      119228.0       1.0X
Native CSV Vectorized (Pushdown)                                  1196           1225          26          0.1       11960.7      10.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 50, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
---------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                            11910          11917           7          0.0      119095.3       1.0X
Native CSV Vectorized (Pushdown)                                  1191           1194           5          0.1       11908.0      10.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 100, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                             11948          12097         201          0.0      119484.5       1.0X
Native CSV Vectorized (Pushdown)                                   1250           1284          32          0.1       12501.4       9.6X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 100, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                             11938          11978          39          0.0      119378.8       1.0X
Native CSV Vectorized (Pushdown)                                   1176           1188          11          0.1       11756.0      10.2X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 500, distribution: 10):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                             11954          12051         124          0.0      119542.9       1.0X
Native CSV Vectorized (Pushdown)                                   1762           1833         104          0.1       17620.6       6.8X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.15.7
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
InSet -> InFilters (values count: 500, distribution: 50):  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------------
Native CSV Vectorized                                             11860          12166         484          0.0      118597.8       1.0X
Native CSV Vectorized (Pushdown)                                   1417           1434          15          0.1       14171.7       8.4X

makeEq.lift(nameToParquetField(name).fieldType)
.map(_(nameToParquetField(name).fieldNames, v))
}.reduceLeftOption(FilterApi.or)
case sources.In(name, values) if pushDownInFilterThreshold > 0 &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wangyum, the impala reference sounds good. Can we make it general and push the range filter to other data sources as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is supposed to be beneficial in other sources as well, I think it makes more sense to push it to other sources as well anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems only Parquet is not well supported In predicate pushdown.
Parquet vs ORC:

OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
InSet -> InFilters (values count: 50, distribution: 50): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized 9281 9298 12 1.7 590.1 1.0X
Parquet Vectorized (Pushdown) 9546 9561 17 1.6 606.9 1.0X
Native ORC Vectorized 6877 6897 18 2.3 437.2 1.3X
Native ORC Vectorized (Pushdown) 661 668 15 23.8 42.0 14.0X
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
InSet -> InFilters (values count: 50, distribution: 90): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized 9322 9335 22 1.7 592.7 1.0X
Parquet Vectorized (Pushdown) 9551 9573 18 1.6 607.2 1.0X
Native ORC Vectorized 6902 6915 13 2.3 438.8 1.4X
Native ORC Vectorized (Pushdown) 659 680 25 23.9 41.9 14.1X
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
InSet -> InFilters (values count: 100, distribution: 10): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized 9278 9294 18 1.7 589.9 1.0X
Parquet Vectorized (Pushdown) 9520 9560 27 1.7 605.3 1.0X
Native ORC Vectorized 6855 6870 16 2.3 435.9 1.4X
Native ORC Vectorized (Pushdown) 795 808 16 19.8 50.5 11.7X
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
InSet -> InFilters (values count: 100, distribution: 50): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized 9306 9311 4 1.7 591.6 1.0X
Parquet Vectorized (Pushdown) 9529 9551 16 1.7 605.8 1.0X
Native ORC Vectorized 6875 6882 7 2.3 437.1 1.4X
Native ORC Vectorized (Pushdown) 853 865 15 18.4 54.2 10.9X
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
InSet -> InFilters (values count: 100, distribution: 90): Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized 9256 9271 9 1.7 588.5 1.0X
Parquet Vectorized (Pushdown) 9500 9520 13 1.7 604.0 1.0X
Native ORC Vectorized 6843 6857 9 2.3 435.1 1.4X
Native ORC Vectorized (Pushdown) 858 870 14 18.3 54.6 10.8X

CSV:
#29642 (comment)

@@ -704,8 +704,8 @@ object SQLConf {
val PARQUET_FILTER_PUSHDOWN_INFILTERTHRESHOLD =
buildConf("spark.sql.parquet.pushdown.inFilterThreshold")
.doc("The maximum number of values to filter push-down optimization for IN predicate. " +
"Large threshold won't necessarily provide much better performance. " +
"The experiment argued that 300 is the limit threshold. " +
"Spark will push-down a value greater than or equal to its minimum value and " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the default value 10 is small here. What is the default threshold in IMPLA?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impala only optimize it to >= minimum value and <= maximum value: apache/impala@aa05c64

Parquet Vectorized 10287 10449 144 1.5 654.0 1.0X
Parquet Vectorized (Pushdown) 467 494 20 33.7 29.7 22.0X
Native ORC Vectorized 6781 6848 58 2.3 431.1 1.5X
Native ORC Vectorized (Pushdown) 428 440 10 36.8 27.2 24.1X
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto. 17 vs 17 -> 22 vs 24.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Github action runs on different machines, there is a performance difference between them.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the benchmark result, I'm a little confused if this PR is improvement or not. Could you add some explanation about the improvement part, @wangyum ? Maybe, is it affected by the master branch instead of this PR?

Also, cc @huaxingao since this is Parquet filter pushdown.

@SparkQA
Copy link

SparkQA commented May 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42829/

@SparkQA
Copy link

SparkQA commented May 9, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42829/

@SparkQA
Copy link

SparkQA commented May 9, 2021

Test build #138307 has finished for PR 29642 at commit f0bfb06.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Copy link
Member Author

wangyum commented May 10, 2021

@dongjoon-hyun This pr only improve the In predicate. I have added the improvement part to PR description.

@wangyum
Copy link
Member Author

wangyum commented May 13, 2021

@dongjoon-hyun Do you have more comments?

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented May 13, 2021

No. Github action runs on different machines, there is a performance difference between them.

No, @wangyum . I'm meaning the ratio between ORC and Parquet on the same machine run. Previously, ORC and Parquet shows the similar performance but now Parquet looks like more slower than ORC after this PR by increasing the gap. For example, the following.

- Parquet Vectorized                                10512          10572          58          1.5         668.4       1.0X
- Parquet Vectorized (Pushdown)                       596            621          19         26.4          37.9      17.6X
- Native ORC Vectorized                              8555           8723          97          1.8         543.9       1.2X
- Native ORC Vectorized (Pushdown)                    592            609          11         26.6          37.7      17.8X
+ Parquet Vectorized                                 9788          10231         259          1.6         622.3       1.0X
+ Parquet Vectorized (Pushdown)                       493            536          29         31.9          31.3      19.9X
+ Native ORC Vectorized                              6487           6575         137          2.4         412.4       1.5X
+ Native ORC Vectorized (Pushdown)                    436            447          14         36.1          27.7      22.4X

Although the values are too small, this generate result shows a slowdown of Parquet compared with ORC. That was my question.

@wangyum
Copy link
Member Author

wangyum commented May 14, 2021

@dongjoon-hyun I think this performance issue is not caused by this change. This PR only changes the In predicate. It is also slow without this change:

OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.4.0-1047-azure
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz
Select 0 string row (value IS NULL):      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Parquet Vectorized                                10623          10994         272          1.5         675.4       1.0X
Parquet Vectorized (Pushdown)                       627            657          24         25.1          39.9      16.9X
Native ORC Vectorized                              7490           7653         203          2.1         476.2       1.4X
Native ORC Vectorized (Pushdown)                    553            606          34         28.4          35.2      19.2X

https://github.com/wangyum/spark/runs/2580852093

@cloud-fan
Copy link
Contributor

Yea these benchmark results are not updated in time. Let's post the benchmark result before and after this PR in the PR description.

@wangyum
Copy link
Member Author

wangyum commented May 14, 2021

@dongjoon-hyun @cloud-fan Please see the latest benchmark result: 27a2bf6

@SparkQA
Copy link

SparkQA commented May 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43078/

@SparkQA
Copy link

SparkQA commented May 14, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43078/

@SparkQA
Copy link

SparkQA commented May 14, 2021

Test build #138557 has finished for PR 29642 at commit 27a2bf6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented May 15, 2021

Thank you for updating, @wangyum . At the last commit, yes, I agree that it looks like there is no regression by this PR.

One last question: could you spot what is the improvement in the the last commit by this PR? It's not clear to me in the last commit. Do we need to add some specific additional benchmark case for your contribution?

@SparkQA
Copy link

SparkQA commented May 16, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43103/

@SparkQA
Copy link

SparkQA commented May 16, 2021

Test build #138582 has finished for PR 29642 at commit 2545c1e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Copy link
Member Author

wangyum commented May 16, 2021

@dongjoon-hyun I think current benchmark is enough. I have updated the benchmark to PR description.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you so much, @wangyum and all!
Merged to master.

cc @aokolnychyi

@HyukjinKwon
Copy link
Member

uhoh .. seems like there's a logical conflict with #31776:

[error] /home/runner/work/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:489:27: wrong number of arguments for pattern ParquetFilters.this.ParquetSchemaType(logicalTypeAnnotation: org.apache.parquet.schema.LogicalTypeAnnotation, primitiveTypeName: org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName, length: Int)
[error]     case ParquetSchemaType(DECIMAL, INT32, _, _) if pushDownDecimal =>
[error]                           ^
[error] /home/runner/work/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:496:27: wrong number of arguments for pattern ParquetFilters.this.ParquetSchemaType(logicalTypeAnnotation: org.apache.parquet.schema.LogicalTypeAnnotation, primitiveTypeName: org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName, length: Int)
[error]     case ParquetSchemaType(DECIMAL, INT64, _, _) if pushDownDecimal =>
[error]                           ^
[error] /home/runner/work/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala:503:27: wrong number of arguments for pattern ParquetFilters.this.ParquetSchemaType(logicalTypeAnnotation: org.apache.parquet.schema.LogicalTypeAnnotation, primitiveTypeName: org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName, length: Int)
[error]     case ParquetSchemaType(DECIMAL, FIXED_LEN_BYTE_ARRAY, length, _) if pushDownDecimal =>
[error]                           ^
[warn] /home/runner/work/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:127:39: [deprecation @ org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite | origin=org.apache.parquet.hadoop.ParquetOutputFormat.ENABLE_JOB_SUMMARY | version=] value ENABLE_JOB_SUMMARY in class ParquetOutputFormat is deprecated
[warn]       && conf.get(ParquetOutputFormat.ENABLE_JOB_SUMMARY) == null) {
[warn]                                       ^
[warn] /home/runner/work/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetWrite.scala:89:39: [deprecation @ org.apache.spark.sql.execution.datasources.v2.parquet.ParquetWrite.prepareWrite | origin=org.apache.parquet.hadoop.ParquetOutputFormat.ENABLE_JOB_SUMMARY | version=] value ENABLE_JOB_SUMMARY in class ParquetOutputFormat is deprecated
[warn]       && conf.get(ParquetOutputFormat.ENABLE_JOB_SUMMARY) == null) {

@HyukjinKwon
Copy link
Member

@wangyum are you online? can you take a quick look and fix or revert?

@dongjoon-hyun
Copy link
Member

Oops.

dongjoon-hyun pushed a commit that referenced this pull request May 17, 2021
### What changes were proposed in this pull request?

This fixes the compilation error due to the logical conflicts between #31776 and #29642 .

### Why are the changes needed?

To recover compilation.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Closes #32568 from wangyum/HOT-FIX.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@wangyum wangyum deleted the SPARK-32792 branch May 17, 2021 05:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
6 participants