[SPARK-29065][SQL][TEST] Extend EXTRACT benchmark (#25772)
Conversation
@dongjoon-hyun @cloud-fan @maropu Could you take a look at the PR? Should I extract the cast-related code to a separate PR? WDYT?
Ur, I see. I didn't notice we've already accepted explicit casts from long to timestamp, though. Any reason to support this explicit cast? It seems pgSQL doesn't accept that.
cc: @gengliangwang
Sorry, @MaxGekk . I'm a little negative with this PR.
- First of all, this is not a TEST PR because it touches SQL behavior. You need to split this PR into two.
- In addition, the new behavior is not consistent with PostgreSQL, too.
postgres=# select cast(1 as date);
ERROR: cannot cast type integer to date
LINE 1: select cast(1 as date);
Could you compare this new behavior with the other DBMSs, please?
cc @maropu
Oops. I'm late. :)
hahaha, I was faster
I think long to timestamp is a legacy behavior and we shouldn't go further in this direction. If we do want to convert integrals to date in the benchmark, we can create a UDF/expression to do this in the test package, and use it in the benchmark.
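The conversion such a test-only helper could perform might look like the sketch below: interpret an integral value as a number of days since the epoch and turn it into a date. This is plain `java.time`, not Spark; the helper name and the idea of registering it as a UDF (e.g. via `spark.udf.register`) are assumptions for illustration, not code from this PR.

```scala
import java.time.LocalDate

// Hypothetical test-package helper: treat an integral value as a count of
// days since 1970-01-01 and convert it to a date, avoiding any cast from
// long to timestamp in the benchmark queries themselves.
def toDateFromDays(days: Long): LocalDate = LocalDate.ofEpochDay(days)

println(toDateFromDays(0L))     // 1970-01-01
println(toDateFromDays(18262L)) // 2020-01-01
```

In a benchmark's test package this function could then be wrapped as a Spark UDF, keeping the questionable integral-to-date cast out of user-facing SQL semantics.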
@maropu @dongjoon-hyun @cloud-fan Thank you for your quick feedback. I reverted changes in
case other => throw new IllegalArgumentException(
  s"Unsupported function '$other'. Valid functions are 'extract' and 'date_part'.")
}
codegenBenchmark(s"$field of $from", cardinality) {
do we really need to test with whole-stage-codegen on and off? The benchmark is just a SELECT query and I don't think whole-stage-codegen can help here. We can also see from the benchmark result that there is no difference.
The code is the same, so we could benchmark it with the default settings only: `spark.sql.codegen.wholeStage = true`.
Yea I know the code is the same, but since we are refining this benchmark, we can fix this issue as well.
I left only the codegen benchmarks. From my point of view, it became better with all cases in one table.
LGTM except one comment
Test build #110502 has finished for PR 25772 at commit
Thanks! Merged to master.
Test build #110509 has finished for PR 25772 at commit
Test build #110511 has finished for PR 25772 at commit
private val spark: SparkSession = SparkSession.builder()
  .master("local[1]")
  .appName(this.getClass.getCanonicalName)
  .getOrCreate()
Ur, is there any other reason not to use `SqlBasedBenchmark`? You can `override def getSparkSession` from `SqlBasedBenchmark` like `DatasetBenchmark` does.
Not having an unused dependency is better, from my point of view. If you think it makes sense, I could change that in a separate PR, and override `getSparkSession` in other benchmarks like `FilterPushdownBenchmark`, `OrcReadBenchmark`, `PrimitiveArrayBenchmark`.
Maybe later. For now, never mind. We have more important things to do~ 3.0.0-preview is coming! 😄
> We have more important things to do~ 3.0.0-preview is coming!
Please, ping me if you think I could help.
Here is the PR #25828, just in case.
### What changes were proposed in this pull request?

In the PR, I propose to extend `ExtractBenchmark` and add new ones for:
- `EXTRACT` and `DATE` as input column
- the `DATE_PART` function and `DATE`/`TIMESTAMP` input column

### Why are the changes needed?

The `EXTRACT` expression is rebased on the `DATE_PART` expression by the PR apache#25410, where some of the sub-expressions take a `DATE` column as the input (`Millennium`, `Year`, etc.) but others require a `TIMESTAMP` column (`Hour`, `Minute`). Separate benchmarks for `DATE` should exclude the overhead of implicit `DATE` <-> `TIMESTAMP` conversions.

### Does this PR introduce any user-facing change?

No, it doesn't.

### How was this patch tested?

- Regenerated results of `ExtractBenchmark`

Closes apache#25772 from MaxGekk/date_part-benchmark.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
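The motivation above — that some fields exist on a `DATE` directly while others need a `TIMESTAMP` — can be illustrated with plain `java.time` (not Spark; a sketch for intuition only): year-like fields can be read off a date as-is, whereas hour/minute only exist after the date is promoted to a timestamp, which is exactly the conversion overhead the separate `DATE` benchmarks are meant to exclude.

```scala
import java.time.{LocalDate, LocalDateTime}

val d: LocalDate = LocalDate.of(2019, 9, 12)

// A YEAR-like field can be extracted from the date directly.
val year: Int = d.getYear

// An HOUR-like field requires a timestamp; the date is first promoted
// to the start of the day, which models the implicit DATE -> TIMESTAMP
// conversion whose cost the benchmark isolates.
val ts: LocalDateTime = d.atStartOfDay()
val hour: Int = ts.getHour

println(s"year=$year hour=$hour") // year=2019 hour=0
```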