[SPARK-48028][TESTS] Regenerate benchmark results after turning ANSI on #46266
Conversation
Thank you!
Looking forward to seeing the result.
Thank you @dongjoon-hyun. Unfortunately, they timed out.
Witnessed an extremely slow benchmark #45453 (comment)
…DataSourceReadBenchmark.scala
…InExpressionBenchmark.scala
There is a failure in TPCDS query 90:
In this PR, I tend to disable ANSI for the TPCDS benchmark temporarily.
```
Benchmark                       Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------------------
agg w/o group wholestage off            30915          32941        2865        67.8          14.7       1.0X
agg w/o group wholestage on               717            720           2      2924.3           0.3      43.1X

agg w/o group wholestage off            38161          38820         933        55.0          18.2       1.0X
agg w/o group wholestage on              2472           2488          10       848.5           1.2      15.4X
```
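The `Relative` column in these tables is the baseline case's best time divided by each case's best time. A quick Python check against the rows above (the helper name is mine, not Spark's):

```python
# Reproduce the "Relative" column: baseline best time / case best time.
def relative(baseline_best_ms: float, case_best_ms: float) -> float:
    return baseline_best_ms / case_best_ms

# First run: wholestage off (baseline) vs wholestage on.
print(round(relative(30915, 717), 1))   # 43.1, matching 43.1X above
# Second run: same comparison.
print(round(relative(38161, 2472), 1))  # 15.4, matching 15.4X above
```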
mark
```diff
@@ -58,6 +58,7 @@ object TPCDSQueryBenchmark extends SqlBasedBenchmark with Logging {
       .set("spark.sql.crossJoin.enabled", "true")
       .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
       .set("spark.kryo.registrationRequired", "true")
+      .set("spark.sql.ansi.enabled", "false")
```
cc @dongjoon-hyun @cloud-fan Now we seem to lose TPCDS compliance by default.
Oh, could you file a new JIRA issue to track that, @yaooqinn ?
Just for my understanding, TPCDS data have some overflows?
DIVIDE_BY_ZERO in q90 so far
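For context, a minimal Python model (illustrative only, not Spark code; the function name is mine) of the semantic change behind the q90 failure: with ANSI off, division by zero yields NULL, while ANSI mode raises a DIVIDE_BY_ZERO error:

```python
def divide(x, y, ansi_enabled):
    """Toy model of Spark SQL division semantics under the two modes."""
    if y == 0:
        if ansi_enabled:
            raise ArithmeticError("DIVIDE_BY_ZERO")
        return None  # non-ANSI: the result is NULL
    return x / y

print(divide(10, 0, ansi_enabled=False))  # None
try:
    divide(10, 0, ansi_enabled=True)
except ArithmeticError as err:
    print(err)  # DIVIDE_BY_ZERO
```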
https://issues.apache.org/jira/browse/SPARK-48066 has been filed
Thank you!
```diff
@@ -87,7 +87,7 @@ object AvroReadBenchmark extends SqlBasedBenchmark {
     prepareTable(
       dir,
-      spark.sql("SELECT CAST(value AS INT) AS c1, CAST(value as STRING) AS c2 FROM t1"))
+      spark.sql(s"SELECT value % ${Int.MaxValue} AS c1, CAST(value as STRING) AS c2 FROM t1"))
```
This looks like a different benchmark. Why do we use a static value here instead of a column of `t1`, @yaooqinn?
Oh, nvm.
```
OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1016-azure
Benchmark                            Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------
SQL CSV                                      11095          11134          54         1.4         705.4       1.0X
SQL Json                                      9688           9701          18         1.6         616.0       1.1X
SQL Parquet Vectorized: DataPageV1             293            297           4        53.7          18.6      37.9X
```
`DataPageV1` seems to show a regression in the `BIGINT` column. It's slower than `DataPageV2` of Parquet.
```
OpenJDK 64-Bit Server VM 21.0.2+13-LTS on Linux 6.5.0-1016-azure
Benchmark                            Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------
SQL CSV                                      11485          11545          86         1.4         730.2       1.0X
SQL Json                                     11591          11597           8         1.4         737.0       1.0X
SQL Parquet Vectorized: DataPageV1             269            288          18        58.5          17.1      42.7X
```
This also has a slightly different ratio: `DataPageV1` vs `DataPageV2`.
```
Benchmark                                                     Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Nested Column Enabled)                              285            305          38        55.1          18.1       7.7X
SQL Parquet MR: DataPageV1                                             2807           2812           8         5.6         178.4       0.8X
SQL Parquet Vectorized: DataPageV1 (Nested Column Disabled)            3405           3407           3         4.6         216.5       0.6X
SQL Parquet Vectorized: DataPageV1 (Nested Column Enabled)              291            321          20        54.0          18.5       7.6X
```
This also shows a regression on `DataPageV1` (according to the ratio change of `DataPageV1` vs `DataPageV2`). `DataPageV1` becomes slower than `DataPageV2`.
```scala
  case ByteType => "cast(value % 128 as byte)"
  case ShortType => "cast(value % 32768 as short)"
  case _ => s"cast(value % ${Int.MaxValue} as ${dataType.sql})"
}
```
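As a sanity check on the pattern above, a small Python sketch (illustrative only, not the benchmark code) of why the modulo matters: for non-negative inputs it keeps every generated value inside the target type's range, so an ANSI-mode CAST can never overflow.

```python
# Mirror of `cast(value % 128 as byte)` / `cast(value % 32768 as short)`
# for non-negative inputs: the modulo clamps values into the target range.
BYTE_LIMIT, SHORT_LIMIT = 128, 32768  # byte is [-128, 127], short is [-32768, 32767]

def clamp(value, limit):
    return value % limit

values = [0, 127, 128, 100_000]
print([clamp(v, BYTE_LIMIT) for v in values])   # [0, 127, 0, 32]
print([clamp(v, SHORT_LIMIT) for v in values])  # [0, 127, 128, 1696]
```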
Thank you!
Thank you, @yaooqinn . This is a nice data point.
+1, LGTM.
Merged to master.
Thank you @dongjoon-hyun
### What changes were proposed in this pull request?

This PR aims to fix benchmark errors and regenerate benchmark results for Apache Spark 4.0.0 after turning ANSI on. The latest baseline has been updated by SPARK-47513.

Benchmarks fixed:

- AvroReadBenchmark
- DataSourceReadBenchmark
- DateTimeBenchmark (fixed before this PR)
- InExpressionBenchmark
- MetadataStructBenchmark
- OrcReadBenchmark
- TPCDSQueryBenchmark (ANSI off)

### Why are the changes needed?

SPARK-44444 turns ANSI on by default; there could be performance-related issues.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review. [![Run benchmarks](https://github.com/yaooqinn/spark/actions/workflows/benchmark.yml/badge.svg?branch=SPARK-48028)](https://github.com/yaooqinn/spark/actions/workflows/benchmark.yml)

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46266 from yaooqinn/SPARK-48028.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>