
[#1607] docs: Performance report with partial TPC-DS(SF=40000) queries #1650

Merged: 9 commits into apache:master, Apr 17, 2024

Conversation

rickyma (Contributor) commented Apr 16, 2024

What changes were proposed in this pull request?

Add a performance report using TPC-DS.

Why are the changes needed?

For #1607.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

No tests are needed; this is a documentation-only change.

rickyma (Author) commented Apr 16, 2024

@zuston @jerqi PTAL.


github-actions bot commented Apr 16, 2024

Test Results

 2 363 files  ±0   2 363 suites  ±0   4h 31m 1s ⏱️ -8s
   912 tests ±0     911 ✅ ±0   1 💤 ±0  0 ❌ ±0 
10 585 runs  ±0  10 571 ✅ ±0  14 💤 ±0  0 ❌ ±0 

Results for commit c492db1. ± Comparison against base commit b4c92b8.

♻️ This comment has been updated with latest results.


Review thread:

> We can draw the following conclusions:
> 1. At 1400 concurrency, Spark Native is already unable to successfully complete tasks, and at 5600 concurrency, Spark [...]
Member:
Spark Native -> vanilla spark

rickyma (Author):
Done.

rickyma force-pushed the issue-1607 branch 3 times, most recently from 7df03fa to 7dacc18, on April 16, 2024 11:28
docs/benchmark_netty.md: 4 outdated review threads, all resolved.
Review thread:

> 2. The calculation formula for `Netty(SSD) Performance Improvement` is as follows:
>
> ```
> Netty(SSD) Performance Improvement = (Tasks Total Time - Tasks Total Time(Netty(SSD))) / Tasks Total Time * 100%
> ```
Contributor:
Hmmm, I would call it Total task time reduction.
BTW, for performance improvement we usually use speedup to indicate that. Speedup is defined as s = Time of old / Time of new; see https://en.wikipedia.org/wiki/Speedup#Using_execution_times for how this is defined.

rickyma (Author):
I added a new column, Netty(SSD) Speedup. I also kept the original column and renamed it to Netty(SSD) Total Task Time Reduction.
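
For concreteness, here is a minimal sketch of how the two columns relate, using hypothetical task times rather than figures from the report:

```python
# Hypothetical total task times in seconds (NOT numbers from the report).
time_baseline = 10_000.0   # Tasks Total Time (baseline)
time_netty_ssd = 8_000.0   # Tasks Total Time(Netty(SSD))

# Total Task Time Reduction, per the formula in the document:
# (old - new) / old * 100%
reduction_pct = (time_baseline - time_netty_ssd) / time_baseline * 100
print(f"Total Task Time Reduction: {reduction_pct:.1f}%")  # 20.0%

# Speedup, as defined at
# https://en.wikipedia.org/wiki/Speedup#Using_execution_times:
# s = Time of old / Time of new
speedup = time_baseline / time_netty_ssd
print(f"Speedup: {speedup:.2f}x")  # 1.25x
```

The two metrics carry the same information, since reduction = (1 - 1/speedup) * 100%: a 1.25x speedup corresponds to a 20% reduction in total task time.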


Review thread:

> We can draw the following conclusions:
> 1. At 1400 concurrency, Vanilla Spark is already unable to complete tasks successfully, and at 5600 concurrency, Spark [...]
Contributor:
Nit: Vanilla Spark is already incapable of successfully completing tasks,

rickyma (Author):
Done.

advancedxy (Contributor) commented

BTW, thank you for the case report. I think it's quite compelling for users that Uniffle is capable of handling ten TBs of shuffle data while improving job stability and performance overall.

zuston (Member) commented Apr 16, 2024

Great work! One question: the peak write speed looks too slow to me under such high concurrency, because I have done a similar test with a Rust-based server.

So, is the client backpressured?

rickyma changed the title from [#1607] docs: Performance Benchmark Report using TPC-DS to [#1607] docs: Performance report with partial TPC-DS(SF=40000) queries on Apr 16, 2024
rickyma (Author) commented Apr 16, 2024

> Great work! One question: the peak write speed looks too slow to me under such high concurrency, because I have done a similar test with a Rust-based server. So, is the client backpressured?

You mean your peak write speed was too slow when you tested your Rust-based servers? Perhaps the number of concurrent tasks was not high enough in your case?

rickyma requested a review from advancedxy on April 16, 2024 14:37
advancedxy (Contributor) left a review
Generally LGTM from my side. @zuston Please take another look.

docs/benchmark_netty_case_report.md: 1 outdated review thread, resolved.
rickyma requested a review from advancedxy on April 16, 2024 14:53
zuston (Member) commented Apr 17, 2024

> > Great work! One question: the peak write speed looks too slow to me under such high concurrency, because I have done a similar test with a Rust-based server. So, is the client backpressured?
>
> You mean your peak write speed was too slow when you tested your Rust-based servers? Perhaps the number of concurrent tasks was not high enough in your case?

You misunderstood me: I think the write speed in your test report is slow.

rickyma (Author) commented Apr 17, 2024

> You misunderstood me: I think the write speed in your test report is slow.

There could be many reasons: differences in data scale, SQL, hardware, shuffle server configuration, etc. For example, with more memory, blocks are larger when flushed, resulting in less random IO, which can have an impact even on SSDs. You could try testing again following the same method described in my document; that comparison might help clarify the issue. At the moment I cannot tell the reason either, but the current result is indeed what it is.
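
To make the random-IO point concrete, here is a back-of-the-envelope sketch (all numbers hypothetical, not taken from the report): for a fixed volume of shuffle data, a larger flush block size means proportionally fewer write operations, each more sequential.

```python
# Back-of-the-envelope: write operations needed for a fixed amount of
# shuffle data at different flush block sizes. All numbers are hypothetical.
total_shuffle_bytes = 10 * 1024**4  # assume 10 TiB of shuffle data

for flush_block_mib in (1, 8, 64, 256):
    flush_block_bytes = flush_block_mib * 1024**2
    num_writes = total_shuffle_bytes // flush_block_bytes
    print(f"{flush_block_mib:>4} MiB flush blocks -> {num_writes:>12,} writes")

# With a bigger in-memory buffer, data accumulates into larger blocks before
# flushing, so the disk sees fewer, larger, more sequential writes; with a
# small buffer it sees many more small random writes, which can hurt
# throughput even on SSDs.
```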

zuston (Member) commented Apr 17, 2024

> There could be many reasons: differences in data scale, SQL, hardware, shuffle server configuration, etc. [...]

Yes, it was just a small question on my side. Anyway, this report is good for the community!

zuston merged commit 5ab625b into apache:master on Apr 17, 2024
41 checks passed
zuston pushed a commit that referenced this pull request Apr 19, 2024
### What changes were proposed in this pull request?

Update `README.md`.

### Why are the changes needed?

A follow-up PR for: #1650.
Easier for users to find out the performance report.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unnecessary.
rickyma deleted the issue-1607 branch on May 5, 2024 08:33