
[SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on #46266

Closed
wants to merge 8 commits

Conversation

yaooqinn
Member

@yaooqinn yaooqinn commented Apr 28, 2024

What changes were proposed in this pull request?

This PR aims to fix benchmark errors and regenerate benchmark results for Apache Spark 4.0.0 after turning ANSI on.

The latest baseline has been updated by SPARK-47513.

Benchmarks fixed:

  • AvroReadBenchmark
  • DataSourceReadBenchmark
  • DateTimeBenchmark (Fixed before this PR)
  • InExpressionBenchmark
  • MetadataStructBenchmark
  • OrcReadBenchmark
  • TPCDSQueryBenchmark (ANSI OFF)

Why are the changes needed?

SPARK-44444 turns ANSI on by default, which could introduce performance-related issues.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manual review, plus the "Run benchmarks" GitHub Actions workflow.

Was this patch authored or co-authored using generative AI tooling?

No.

@yaooqinn
Member Author

@dongjoon-hyun
Member

Thank you!

@dongjoon-hyun
Member

Looking forward to seeing the result.

@yaooqinn
Member Author

Thank you @dongjoon-hyun.

Unfortunately, they timed out.

@yaooqinn
Member Author

yaooqinn commented Apr 29, 2024

Pending CI results: Run benchmarks

@yaooqinn
Member Author

Witnessed an extremely slow benchmark #45453 (comment)

@yaooqinn
Member Author

There is a failure in TPCDS query 90:

Running benchmark: TPCDS
  Running case: q90
24/04/29 22:09:37 ERROR Utils: Aborting task
org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== SQL (line 1, position 8) ==
SELECT cast(amc AS DECIMAL(15, 4)) / cast(pmc AS DECIMAL(15, 4)) am_pm_ratio
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	at org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:203)
	...

In this PR, I temporarily disable ANSI for the TPCDS benchmark.
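The error message above points to `try_divide` as the ANSI-safe alternative. As a minimal Python sketch of its semantics (illustrative only, not Spark code): `try_divide` returns NULL (`None` here) instead of raising when the divisor is zero.

```python
def try_divide(numerator, denominator):
    # Sketch of Spark SQL's try_divide semantics: under ANSI mode a
    # plain division raises DIVIDE_BY_ZERO, while try_divide tolerates
    # a zero divisor and yields NULL (None) instead.
    if denominator == 0:
        return None
    return numerator / denominator

# A q90-style am/pm ratio: NULL instead of an aborted task when pmc is 0
print(try_divide(3.0, 0))    # None
print(try_divide(3.0, 2.0))  # 1.5
```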

Comment on lines -9 to +10
agg w/o group wholestage off 30915 32941 2865 67.8 14.7 1.0X
agg w/o group wholestage on 717 720 2 2924.3 0.3 43.1X
agg w/o group wholestage off 38161 38820 933 55.0 18.2 1.0X
agg w/o group wholestage on 2472 2488 10 848.5 1.2 15.4X
Member Author

mark

@@ -58,6 +58,7 @@ object TPCDSQueryBenchmark extends SqlBasedBenchmark with Logging {
.set("spark.sql.crossJoin.enabled", "true")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryo.registrationRequired", "true")
.set("spark.sql.ansi.enabled", "false")
Member Author

cc @dongjoon-hyun @cloud-fan Now we seem to lose TPCDS compliance by default.

Member

Oh, could you file a new JIRA issue to track that, @yaooqinn ?

Member

Just for my understanding, TPCDS data have some overflows?

Member Author

DIVIDE_BY_ZERO in q90 so far

Member

Thank you!

@yaooqinn yaooqinn changed the title [WIP][SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on [SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on Apr 30, 2024
@@ -87,7 +87,7 @@ object AvroReadBenchmark extends SqlBasedBenchmark {

prepareTable(
dir,
spark.sql("SELECT CAST(value AS INT) AS c1, CAST(value as STRING) AS c2 FROM t1"))
spark.sql(s"SELECT value % ${Int.MaxValue} AS c1, CAST(value as STRING) AS c2 FROM t1"))
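The change above replaces `CAST(value AS INT)` with `value % ${Int.MaxValue}`, presumably because the generated BIGINT values can exceed the Int range, and under ANSI mode an out-of-range cast raises an overflow error instead of silently truncating. A hedged Python sketch of the difference (not Spark internals; the function name is hypothetical):

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1  # Scala's Int range

def ansi_cast_to_int(value):
    # Sketch: ANSI mode rejects out-of-range casts rather than
    # wrapping/truncating like the legacy non-ANSI behavior.
    if not (INT_MIN <= value <= INT_MAX):
        raise OverflowError("CAST_OVERFLOW: value out of INT range")
    return value

# The benchmark's replacement expression keeps values castable:
value = 2**40                            # a BIGINT well past Int.MaxValue
assert 0 <= value % INT_MAX <= INT_MAX   # safe under ANSI
```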
Member

This looks like a different benchmark. Why do we use a static value here instead of a column of t1, @yaooqinn?

Member

Oh, nvm.

OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1016-azure
SQL CSV 11095 11134 54 1.4 705.4 1.0X
SQL Json 9688 9701 18 1.6 616.0 1.1X
SQL Parquet Vectorized: DataPageV1 293 297 4 53.7 18.6 37.9X
Member

DataPageV1 seems to show a regression in the BIGINT column; it is slower than Parquet's DataPageV2.

OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1016-azure
SQL CSV 11485 11545 86 1.4 730.2 1.0X
SQL Json 11591 11597 8 1.4 737.0 1.0X
SQL Parquet Vectorized: DataPageV1 269 288 18 58.5 17.1 42.7X
Member

This one also has a slightly different DataPageV1 vs DataPageV2 ratio.

SQL ORC Vectorized (Nested Column Enabled) 285 305 38 55.1 18.1 7.7X
SQL Parquet MR: DataPageV1 2807 2812 8 5.6 178.4 0.8X
SQL Parquet Vectorized: DataPageV1 (Nested Column Disabled) 3405 3407 3 4.6 216.5 0.6X
SQL Parquet Vectorized: DataPageV1 (Nested Column Enabled) 291 321 20 54.0 18.5 7.6X
Member

This also shows a regression on DataPageV1 (according to the ratio change of DataPageV1 vs DataPageV2): DataPageV1 becomes slower than DataPageV2.

case ByteType => "cast(value % 128 as byte)"
case ShortType => "cast(value % 32768 as short)"
case _ => s"cast(value % ${Int.MaxValue} as ${dataType.sql})"
}
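The per-type modulo expressions above keep each generated value inside its target type's range, so the ANSI cast can never overflow. A small Python check of the bounds (the byte/short/int limits mirror the Scala types; this holds for the non-negative values the benchmark generates, since `%` can differ between languages for negative operands):

```python
# Ranges of the integral types targeted by the casts
BYTE_RANGE  = (-128, 127)
SHORT_RANGE = (-32768, 32767)
INT_RANGE   = (-2**31, 2**31 - 1)

def in_range(v, bounds):
    lo, hi = bounds
    return lo <= v <= hi

# For non-negative values, `value % m` lands in [0, m - 1],
# which always fits the corresponding type:
for value in (0, 1, 127, 128, 32768, 2**31, 2**40):
    assert in_range(value % 128, BYTE_RANGE)
    assert in_range(value % 32768, SHORT_RANGE)
    assert in_range(value % (2**31 - 1), INT_RANGE)
```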
Member

Thank you!

Member

@dongjoon-hyun dongjoon-hyun left a comment

Thank you, @yaooqinn . This is a nice data point.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@dongjoon-hyun
Member

Merged to master.

@yaooqinn
Member Author

yaooqinn commented May 1, 2024

Thank you @dongjoon-hyun

@yaooqinn yaooqinn deleted the SPARK-48028 branch May 1, 2024 01:06
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024

Closes apache#46266 from yaooqinn/SPARK-48028.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>