
[SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on #46266

Closed
wants to merge 8 commits

Conversation

yaooqinn
Member

@yaooqinn yaooqinn commented Apr 28, 2024

What changes were proposed in this pull request?

This PR aims to fix benchmark errors and regenerate benchmark results for Apache Spark 4.0.0 after turning ANSI on.

The latest baseline has been updated by SPARK-47513.

Benchmarks fixed:

  • AvroReadBenchmark
  • DataSourceReadBenchmark
  • DateTimeBenchmark (Fixed before this PR)
  • InExpressionBenchmark
  • MetadataStructBenchmark
  • OrcReadBenchmark
  • TPCDSQueryBenchmark (ANSI OFF)

Why are the changes needed?

SPARK-44444 turns ANSI on by default, which could introduce performance-related issues.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manual review, plus the "Run benchmarks" GitHub Actions workflow.

Was this patch authored or co-authored using generative AI tooling?

No.

@yaooqinn
Member Author

@dongjoon-hyun
Member

Thank you!

@dongjoon-hyun
Member

Looking forward to seeing the result.

@yaooqinn
Member Author

Thank you @dongjoon-hyun.

Unfortunately, they timed out.

@yaooqinn
Member Author

yaooqinn commented Apr 29, 2024

Pending CI results: Run benchmarks

@yaooqinn
Member Author

Witnessed an extremely slow benchmark #45453 (comment)

@yaooqinn
Member Author

There is a failure in TPCDS query 90:

Running benchmark: TPCDS
  Running case: q90
24/04/29 22:09:37 ERROR Utils: Aborting task
org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== SQL (line 1, position 8) ==
SELECT cast(amc AS DECIMAL(15, 4)) / cast(pmc AS DECIMAL(15, 4)) am_pm_ratio
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	at org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:203)
	...

In this PR, I temporarily disable ANSI for the TPCDS benchmark.
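The error message above points to `try_divide` as the ANSI-safe alternative. As a minimal Python sketch of its semantics (illustrative only, not Spark code): `try_divide` returns NULL (`None` here) instead of raising when the divisor is zero.

```python
def try_divide(numerator, denominator):
    # Sketch of Spark SQL's try_divide semantics: under ANSI mode a
    # plain division raises DIVIDE_BY_ZERO, while try_divide tolerates
    # a zero divisor and yields NULL (None) instead.
    if denominator == 0:
        return None
    return numerator / denominator

# A q90-style am/pm ratio: NULL instead of an aborted task when pmc is 0
print(try_divide(3.0, 0))    # None
print(try_divide(3.0, 2.0))  # 1.5
```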

Comment on lines -9 to +10
agg w/o group wholestage off 30915 32941 2865 67.8 14.7 1.0X
agg w/o group wholestage on 717 720 2 2924.3 0.3 43.1X
agg w/o group wholestage off 38161 38820 933 55.0 18.2 1.0X
agg w/o group wholestage on 2472 2488 10 848.5 1.2 15.4X
Member Author

mark

@@ -58,6 +58,7 @@ object TPCDSQueryBenchmark extends SqlBasedBenchmark with Logging {
.set("spark.sql.crossJoin.enabled", "true")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryo.registrationRequired", "true")
.set("spark.sql.ansi.enabled", "false")
Member Author

cc @dongjoon-hyun @cloud-fan Now we seem to lose TPCDS compliance by default.

Member

Oh, could you file a new JIRA issue to track that, @yaooqinn ?

Member

Just for my understanding, TPCDS data have some overflows?

Member Author

DIVIDE_BY_ZERO in q90 so far

Member

Thank you!

@yaooqinn yaooqinn changed the title [WIP][SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on [SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on Apr 30, 2024
@@ -87,7 +87,7 @@ object AvroReadBenchmark extends SqlBasedBenchmark {

prepareTable(
dir,
spark.sql("SELECT CAST(value AS INT) AS c1, CAST(value as STRING) AS c2 FROM t1"))
spark.sql(s"SELECT value % ${Int.MaxValue} AS c1, CAST(value as STRING) AS c2 FROM t1"))
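The change above replaces `CAST(value AS INT)` with `value % ${Int.MaxValue}`, presumably because the generated BIGINT values can exceed the Int range, and under ANSI mode an out-of-range cast raises an overflow error instead of silently truncating. A hedged Python sketch of the difference (not Spark internals; the function name is hypothetical):

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1  # Scala's Int range

def ansi_cast_to_int(value):
    # Sketch: ANSI mode rejects out-of-range casts rather than
    # wrapping/truncating like the legacy non-ANSI behavior.
    if not (INT_MIN <= value <= INT_MAX):
        raise OverflowError("CAST_OVERFLOW: value out of INT range")
    return value

# The benchmark's replacement expression keeps values castable:
value = 2**40                            # a BIGINT well past Int.MaxValue
assert 0 <= value % INT_MAX <= INT_MAX   # safe under ANSI
```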
Member

This looks like a different benchmark. Why do we use a static value here instead of a column of t1, @yaooqinn?

Member

Oh, nvm.

OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1016-azure
SQL CSV 11095 11134 54 1.4 705.4 1.0X
SQL Json 9688 9701 18 1.6 616.0 1.1X
SQL Parquet Vectorized: DataPageV1 293 297 4 53.7 18.6 37.9X
Member

DataPageV1 seems to show a regression in the BIGINT column; it is slower than Parquet's DataPageV2.

OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1016-azure
SQL CSV 11485 11545 86 1.4 730.2 1.0X
SQL Json 11591 11597 8 1.4 737.0 1.0X
SQL Parquet Vectorized: DataPageV1 269 288 18 58.5 17.1 42.7X
Member

This one also has a slightly different DataPageV1 vs DataPageV2 ratio.

SQL ORC Vectorized (Nested Column Enabled) 285 305 38 55.1 18.1 7.7X
SQL Parquet MR: DataPageV1 2807 2812 8 5.6 178.4 0.8X
SQL Parquet Vectorized: DataPageV1 (Nested Column Disabled) 3405 3407 3 4.6 216.5 0.6X
SQL Parquet Vectorized: DataPageV1 (Nested Column Enabled) 291 321 20 54.0 18.5 7.6X
Member

This also shows a regression on DataPageV1 (according to the ratio change of DataPageV1 vs DataPageV2): DataPageV1 becomes slower than DataPageV2.

case ByteType => "cast(value % 128 as byte)"
case ShortType => "cast(value % 32768 as short)"
case _ => s"cast(value % ${Int.MaxValue} as ${dataType.sql})"
}
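The per-type modulo expressions above keep each generated value inside its target type's range, so the ANSI cast can never overflow. A small Python check of the bounds (the byte/short/int limits mirror the Scala types; this holds for the non-negative values the benchmark generates, since `%` can differ between languages for negative operands):

```python
# Ranges of the integral types targeted by the casts
BYTE_RANGE  = (-128, 127)
SHORT_RANGE = (-32768, 32767)
INT_RANGE   = (-2**31, 2**31 - 1)

def in_range(v, bounds):
    lo, hi = bounds
    return lo <= v <= hi

# For non-negative values, `value % m` lands in [0, m - 1],
# which always fits the corresponding type:
for value in (0, 1, 127, 128, 32768, 2**31, 2**40):
    assert in_range(value % 128, BYTE_RANGE)
    assert in_range(value % 32768, SHORT_RANGE)
    assert in_range(value % (2**31 - 1), INT_RANGE)
```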
Member

Thank you!

Member

@dongjoon-hyun dongjoon-hyun left a comment

Thank you, @yaooqinn . This is a nice data point.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@dongjoon-hyun
Member

Merged to master.

@yaooqinn
Member Author

yaooqinn commented May 1, 2024

Thank you @dongjoon-hyun

@yaooqinn yaooqinn deleted the SPARK-48028 branch May 1, 2024 01:06
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024

Closes apache#46266 from yaooqinn/SPARK-48028.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>