
[#1607] docs: Performance report with partial TPC-DS(SF=40000) queries #1650

Merged: 9 commits into apache:master, Apr 17, 2024

Conversation

rickyma (Contributor) commented Apr 16, 2024

What changes were proposed in this pull request?

Add a performance report using TPC-DS.

Why are the changes needed?

For #1607.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

No tests are needed; this is a documentation-only change.

rickyma (Author) commented Apr 16, 2024

@zuston @jerqi PTAL.


github-actions bot commented Apr 16, 2024

Test Results

 2 363 files  ±0   2 363 suites  ±0   4h 31m 1s ⏱️ -8s
   912 tests ±0     911 ✅ ±0   1 💤 ±0  0 ❌ ±0 
10 585 runs  ±0  10 571 ✅ ±0  14 💤 ±0  0 ❌ ±0 

Results for commit c492db1. ± Comparison against base commit b4c92b8.

♻️ This comment has been updated with latest results.


Review thread:

> We can draw the following conclusions:
> 1. At 1400 concurrency, Spark Native is already unable to successfully complete tasks, and at 5600 concurrency, Spark [...]
Member:
Spark Native -> vanilla spark

rickyma (Author):
Done.

rickyma force-pushed the issue-1607 branch 3 times, most recently from 7df03fa to 7dacc18, on April 16, 2024 11:28
docs/benchmark_netty.md: 4 outdated review threads, all resolved.
Review thread:

> 2. The calculation formula for `Netty(SSD) Performance Improvement` is as follows:
>
> ```
> Netty(SSD) Performance Improvement = (Tasks Total Time - Tasks Total Time(Netty(SSD))) / Tasks Total Time * 100%
> ```
Contributor:
Hmmm, I would call it Total task time reduction.
BTW, for performance improvement we usually use speedup to indicate that. Speedup is defined as s = Time of old / Time of new; see https://en.wikipedia.org/wiki/Speedup#Using_execution_times for how this is defined.

rickyma (Author):
I added a new column, Netty(SSD) Speedup. I also kept the original column and renamed it to Netty(SSD) Total Task Time Reduction.
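
For concreteness, here is a minimal sketch of how the two columns relate, using hypothetical task times rather than figures from the report:

```python
# Hypothetical total task times in seconds (NOT numbers from the report).
time_baseline = 10_000.0   # Tasks Total Time (baseline)
time_netty_ssd = 8_000.0   # Tasks Total Time(Netty(SSD))

# Total Task Time Reduction, per the formula in the document:
# (old - new) / old * 100%
reduction_pct = (time_baseline - time_netty_ssd) / time_baseline * 100
print(f"Total Task Time Reduction: {reduction_pct:.1f}%")  # 20.0%

# Speedup, as defined at
# https://en.wikipedia.org/wiki/Speedup#Using_execution_times:
# s = Time of old / Time of new
speedup = time_baseline / time_netty_ssd
print(f"Speedup: {speedup:.2f}x")  # 1.25x
```

The two metrics carry the same information, since reduction = (1 - 1/speedup) * 100%: a 1.25x speedup corresponds to a 20% reduction in total task time.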


Review thread:

> We can draw the following conclusions:
> 1. At 1400 concurrency, Vanilla Spark is already unable to complete tasks successfully, and at 5600 concurrency, Spark [...]
Contributor:
Nit: Vanilla Spark is already incapable of successfully completing tasks,

rickyma (Author):
Done.

advancedxy (Contributor) commented

BTW, thank you for the case report. I think it's quite compelling for users that Uniffle is capable of handling ten TBs of shuffle data while improving job stability and performance overall.

zuston (Member) commented Apr 16, 2024

Great work! One question: the peak write speed looks too slow to me under such high concurrency, because I have done a similar test with a Rust-based server.

So, is the client backpressured?

rickyma changed the title from [#1607] docs: Performance Benchmark Report using TPC-DS to [#1607] docs: Performance report with partial TPC-DS(SF=40000) queries on Apr 16, 2024
rickyma (Author) commented Apr 16, 2024

> Great work! One question: the peak write speed looks too slow to me under such high concurrency, because I have done a similar test with a Rust-based server. So, is the client backpressured?

You mean your peak write speed was too slow when you tested your Rust-based servers? Perhaps the number of concurrent tasks was not high enough in your case?

rickyma requested a review from advancedxy on April 16, 2024 14:37
advancedxy (Contributor) left a review
Generally LGTM from my side. @zuston Please take another look.

docs/benchmark_netty_case_report.md: 1 outdated review thread, resolved.
rickyma requested a review from advancedxy on April 16, 2024 14:53
zuston (Member) commented Apr 17, 2024

> > Great work! One question: the peak write speed looks too slow to me under such high concurrency, because I have done a similar test with a Rust-based server. So, is the client backpressured?
>
> You mean your peak write speed was too slow when you tested your Rust-based servers? Perhaps the number of concurrent tasks was not high enough in your case?

You misunderstood me: I think the write speed in your test report is slow.

rickyma (Author) commented Apr 17, 2024

> You misunderstood me: I think the write speed in your test report is slow.

There could be many reasons: differences in data scale, SQL, hardware, shuffle server configuration, etc. For example, with more memory, blocks are larger when flushed, resulting in less random IO, which can have an impact even on SSDs. You could try testing again following the same method described in my document; that comparison might help clarify the issue. At the moment I cannot tell the reason either, but the current result is indeed what it is.
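
To make the random-IO point concrete, here is a back-of-the-envelope sketch (all numbers hypothetical, not taken from the report): for a fixed volume of shuffle data, a larger flush block size means proportionally fewer write operations, each more sequential.

```python
# Back-of-the-envelope: write operations needed for a fixed amount of
# shuffle data at different flush block sizes. All numbers are hypothetical.
total_shuffle_bytes = 10 * 1024**4  # assume 10 TiB of shuffle data

for flush_block_mib in (1, 8, 64, 256):
    flush_block_bytes = flush_block_mib * 1024**2
    num_writes = total_shuffle_bytes // flush_block_bytes
    print(f"{flush_block_mib:>4} MiB flush blocks -> {num_writes:>12,} writes")

# With a bigger in-memory buffer, data accumulates into larger blocks before
# flushing, so the disk sees fewer, larger, more sequential writes; with a
# small buffer it sees many more small random writes, which can hurt
# throughput even on SSDs.
```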

zuston (Member) commented Apr 17, 2024

> There could be many reasons: differences in data scale, SQL, hardware, shuffle server configuration, etc. [...]

Yes, it was just a small question on my side. Anyway, this report is good for the community!

zuston merged commit 5ab625b into apache:master on Apr 17, 2024
41 checks passed
zuston pushed a commit that referenced this pull request Apr 19, 2024
### What changes were proposed in this pull request?

Update `README.md`.

### Why are the changes needed?

A follow-up PR for: #1650.
Easier for users to find out the performance report.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unnecessary.
rickyma deleted the issue-1607 branch on May 5, 2024 08:33