
Performance degrades when symbol count is large #8

Closed

arseny30 opened this issue Mar 31, 2019 · 8 comments

@arseny30

Hi, I've got a C++ implementation of RaptorQ. It's part of another project and private for now, but it will be released eventually.

First, great work. You've achieved really nice performance on 10K bytes.

But I have concerns about performance with a bigger symbol_count.
For 10K of data and symbol_size=512, symbol_count should be 20.

I tried changing elements to 512 * 50000 in your encode 10KB benchmark, and it doesn't seem to finish in a reasonable amount of time.

Am I doing something wrong?
I'm running it with cargo bench --features benchmarking.
Have you verified that the codec scales linearly to larger blocks?

As I recall, the RFC is quite cryptic about the details critical for performance.
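
For reference, a minimal sketch of the enlarged input, written against raptorq's README-style public API (Encoder::with_defaults / get_encoded_packets); the actual benchmark harness may set things up differently:

```rust
// A sketch of the enlarged encode, assuming raptorq's public Encoder API;
// the crate's benchmark internals may construct the encoder differently.
use raptorq::Encoder;

fn main() {
    // 512 * 50000 bytes (~25.6 MB) instead of the original 10 KB
    let elements: usize = 512 * 50_000;
    let data: Vec<u8> = (0..elements).map(|i| (i % 256) as u8).collect();

    // 512-byte symbols => on the order of 50000 source symbols
    let encoder = Encoder::with_defaults(&data, 512);

    // Generate the source packets plus a few repair packets per block
    let packets = encoder.get_encoded_packets(10);
    println!("generated {} packets", packets.len());
}
```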

@cberner
Owner

cberner commented Mar 31, 2019

Thanks for the report! Yes, you're totally right about this. I've tested it up to a symbol count of ~200 and it works well there. I should have noted that it's only linear for another order of magnitude or so. The reason it doesn't keep scaling past that, at the moment, is that I haven't implemented the sparse matrix support that you need at that size. Right now everything is using dense matrices.
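
For context on dense vs. sparse here: in the RFC 6330 precode solve, most constraint rows have only a handful of nonzero entries, so a dense symbol_count-wide row makes every row operation cost O(symbol_count) and the whole solve roughly cubic. A sparse row stores just its nonzero columns, so binary (XOR) row additions cost only the number of nonzeros. A minimal sketch of the idea, not the crate's actual data structure:

```rust
// Illustrative sparse GF(2) row: only the nonzero column indices are stored,
// so XOR-adding one row into another is a sorted merge whose cost tracks the
// number of nonzeros rather than the full row width.
struct SparseBinaryRow {
    cols: Vec<u32>, // sorted, deduplicated column indices of the 1-entries
}

impl SparseBinaryRow {
    fn xor_assign(&mut self, other: &SparseBinaryRow) {
        let mut merged = Vec::with_capacity(self.cols.len() + other.cols.len());
        let (mut i, mut j) = (0, 0);
        while i < self.cols.len() && j < other.cols.len() {
            match self.cols[i].cmp(&other.cols[j]) {
                std::cmp::Ordering::Less => { merged.push(self.cols[i]); i += 1; }
                std::cmp::Ordering::Greater => { merged.push(other.cols[j]); j += 1; }
                // Matching entries cancel over GF(2): 1 XOR 1 = 0.
                std::cmp::Ordering::Equal => { i += 1; j += 1; }
            }
        }
        merged.extend_from_slice(&self.cols[i..]);
        merged.extend_from_slice(&other.cols[j..]);
        self.cols = merged;
    }
}

fn main() {
    let mut a = SparseBinaryRow { cols: vec![1, 4, 7] };
    let b = SparseBinaryRow { cols: vec![4, 9] };
    a.xor_assign(&b); // the shared column 4 cancels
    assert_eq!(a.cols, vec![1, 7, 9]);
}
```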

@enricobacis

Do you have any plans to implement the sparse matrix version anytime soon?

@cberner
Owner

cberner commented Apr 6, 2019

Ya, I started on a sparse matrix implementation. Hopefully, I'll have something working this weekend.

@cberner
Owner

cberner commented Apr 8, 2019

I merged a bunch of optimizations. You can reproduce these numbers with the encode_benchmark binary.

Before optimizations:

symbol count = 10, encoded 100 MB in 0.789secs, throughput: 1013.9Mbit/s
symbol count = 100, encoded 100 MB in 0.883secs, throughput: 906.0Mbit/s
symbol count = 250, encoded 99 MB in 1.598secs, throughput: 500.5Mbit/s
symbol count = 500, encoded 99 MB in 2.242secs, throughput: 356.3Mbit/s
symbol count = 1000, encoded 99 MB in 4.372secs, throughput: 182.3Mbit/s
symbol count = 2000, encoded 99 MB in 11.666secs, throughput: 68.3Mbit/s
symbol count = 4000, encoded 99 MB in 34.964secs, throughput: 22.8Mbit/s
symbol count = 10000, encoded 97 MB in 120.642secs, throughput: 6.5Mbit/s
symbol count = 20000, encoded 97 MB in 401.891secs, throughput: 1.9Mbit/s
symbol count = 40000, encoded 97 MB in 1248.080secs, throughput: 0.6Mbit/s
symbol count = 56403, encoded 82 MB in 1720.578secs, throughput: 0.4Mbit/s

After optimizations

symbol count = 10, encoded 100 MB in 0.842secs, throughput: 950.1Mbit/s
symbol count = 100, encoded 100 MB in 0.938secs, throughput: 852.9Mbit/s
symbol count = 250, encoded 99 MB in 1.733secs, throughput: 461.5Mbit/s
symbol count = 500, encoded 99 MB in 2.420secs, throughput: 330.1Mbit/s
symbol count = 1000, encoded 99 MB in 3.470secs, throughput: 229.6Mbit/s
symbol count = 2000, encoded 99 MB in 4.609secs, throughput: 172.9Mbit/s
symbol count = 4000, encoded 99 MB in 5.591secs, throughput: 142.5Mbit/s
symbol count = 10000, encoded 97 MB in 9.043secs, throughput: 86.4Mbit/s
symbol count = 20000, encoded 97 MB in 15.487secs, throughput: 50.4Mbit/s
symbol count = 40000, encoded 97 MB in 24.356secs, throughput: 32.1Mbit/s
symbol count = 56403, encoded 82 MB in 28.757secs, throughput: 23.0Mbit/s

There's still a bit of work to be done finishing the optimizations, but it scales much better now.
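
A note on reading these tables: the throughput column appears to be bytes * 8 / seconds with decimal MB and Mbit (10^6), which reproduces, e.g., the first post-optimization row:

```rust
// Sanity check of the throughput column, assuming decimal MB and Mbit:
// Mbit/s = bytes * 8 / secs / 1e6.
fn main() {
    let bytes = 100_000_000u64; // "100 MB"
    let secs = 0.842_f64;       // first row after optimizations
    println!("{:.1} Mbit/s", bytes as f64 * 8.0 / secs / 1e6); // prints 950.1
}
```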

@arseny30
Author

arseny30 commented Apr 8, 2019

I tried to reproduce your benchmark with my implementation.

symbol count = 10, encoded 100 MB in 0.751secs, throughput: 1064.6Mbit/s
symbol count = 100, encoded 100 MB in 0.567secs, throughput: 1409.9Mbit/s
symbol count = 250, encoded 99 MB in 0.616secs, throughput: 1297.9Mbit/s
symbol count = 500, encoded 99 MB in 0.600secs, throughput: 1332.5Mbit/s
symbol count = 1000, encoded 99 MB in 0.585secs, throughput: 1361.8Mbit/s
symbol count = 2000, encoded 99 MB in 0.724secs, throughput: 1100.0Mbit/s
symbol count = 4000, encoded 99 MB in 0.828secs, throughput: 962.0Mbit/s
symbol count = 10000, encoded 97 MB in 0.899secs, throughput: 868.6Mbit/s
symbol count = 20000, encoded 97 MB in 1.034secs, throughput: 755.7Mbit/s
symbol count = 40000, encoded 97 MB in 1.237secs, throughput: 631.4Mbit/s
symbol count = 56403, encoded 82 MB in 1.157secs, throughput: 571.4Mbit/s

There is still room for improvement.

@cberner
Owner

cberner commented Apr 8, 2019

Cool, good to know! Edit: forgot yours is private. Any tips on what you did to scale to high symbol counts?

@cberner
Owner

cberner commented Apr 12, 2019

Just pushed some further optimizations. This seems like it's about as far as it's easy to push the implementation, so I'm going to close this. Happy to merge PRs that improve it further, though:

symbol count = 10, encoded 100 MB in 0.797secs, throughput: 1003.8Mbit/s
symbol count = 100, encoded 100 MB in 0.932secs, throughput: 858.4Mbit/s
symbol count = 250, encoded 99 MB in 1.589secs, throughput: 503.3Mbit/s
symbol count = 500, encoded 99 MB in 1.572secs, throughput: 508.2Mbit/s
symbol count = 1000, encoded 99 MB in 1.779secs, throughput: 447.9Mbit/s
symbol count = 2000, encoded 99 MB in 2.120secs, throughput: 375.9Mbit/s
symbol count = 4000, encoded 99 MB in 2.413secs, throughput: 330.2Mbit/s
symbol count = 10000, encoded 97 MB in 3.132secs, throughput: 249.4Mbit/s
symbol count = 20000, encoded 97 MB in 5.191secs, throughput: 150.5Mbit/s
symbol count = 40000, encoded 97 MB in 7.197secs, throughput: 108.6Mbit/s
symbol count = 56403, encoded 82 MB in 7.612secs, throughput: 86.8Mbit/s

@cberner cberner closed this as completed Apr 12, 2019
@ycopy

ycopy commented May 5, 2020

Hi arseny30, I'm looking for a C++ implementation of RaptorQ. You said your implementation would be released eventually; do you have a release date yet?
