
Using openmp to speed up further #9

Closed
htot opened this issue Feb 18, 2016 · 27 comments
@htot
Contributor

htot commented Feb 18, 2016

I have made some small modifications to the code to take advantage of multi-threading using OpenMP, achieving a further 3x improvement with the plain codec and 2x with the SSE3 codec. With 8 threads on a quad-core with hyperthreading, both codecs now run at approximately the same throughput, which seems to be memory-bandwidth limited (see the attachment, under the heading ferry-quad).

TRP-Base64-00.pdf

I uploaded the code to my repository https://github.com/htot/base64, branch openmp, but I'm not really familiar with working with GitHub, so I don't know how to create a pull request. Also, as it stands, the code fails to build without the option -fopenmp.

Are you interested in pulling this? How should we proceed?
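As a rough illustration of the OpenMP approach described above (hypothetical names, not the actual patch), splitting a buffer into per-thread chunks might look like this; without OpenMP the pragma is ignored and the whole buffer is processed by one thread:

```c
/* Minimal sketch of per-thread buffer chunking with OpenMP.
 * xor_transform stands in for a codec pass; names are illustrative. */
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

static void xor_transform(unsigned char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		buf[i] ^= 0x5A;
}

void transform_parallel(unsigned char *buf, size_t len)
{
	/* Without -fopenmp this pragma is silently ignored: */
	#pragma omp parallel
	{
		int nt = 1, id = 0;
		#ifdef _OPENMP
		nt = omp_get_num_threads();
		id = omp_get_thread_num();
		#endif
		size_t chunk = len / (size_t) nt;
		size_t off = (size_t) id * chunk;
		/* Last thread picks up the remainder: */
		size_t n = (id == nt - 1) ? len - off : chunk;
		xor_transform(buf + off, n);
	}
}
```

Because the chunks are disjoint, no synchronization is needed inside the loop; the only cost is thread startup, which matters for small buffers.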

@aklomp
Owner

aklomp commented Feb 18, 2016

Hi Ferry, noticed you guys are from my old neighbourhood in Delft. Greetings from The Hague :)

I must admit that I don't have any experience with OpenMP, but I get from this that it's a library that can split the work over multiple cores? Looks good, and is in line with the aim of this project, which is to be fast most of all. ("Correctness" is also important, but sort of meaningless when a base64 string is either correct or not.)

I'd agree with you that the results look bounded by memory latency rather than raw throughput, but to me that's a good thing, it means the code is getting the theoretical maximal performance on the target hardware. If you wanted to get more accurate benchmarks, you could try to work on data only in the L1 or L2 caches.

I'd certainly be interested to pull, but there is one showstopper with the current code: the library would require OpenMP. That's not worth it; I'd like to remain cross-platform and portable. You could fix this by passing a variable to Make, such as ENABLE_OPENMP, to conditionally include omp.h and link against the OpenMP runtime. I also noticed some spacing issues, which I will comment on separately. If you fix these (I could fix them myself, but then I'd become the author of your commit), I'll pull it.
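A hypothetical sketch of such a Make toggle (the variable name and flags are assumptions, not the project's actual Makefile):

```make
# Hypothetical fragment, not the project's actual Makefile:
# build with `make ENABLE_OPENMP=1` to opt in.
ifdef ENABLE_OPENMP
  CFLAGS  += -fopenmp -DENABLE_OPENMP
  LDFLAGS += -fopenmp
endif
```

The default build stays completely OpenMP-free, so portability is unaffected unless the user opts in.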

Not being familiar with pull requests is no problem, as long as I have your commits and can merge them manually.

@htot
Contributor Author

htot commented Feb 19, 2016

For me this use of OpenMP is new as well, so I needed to learn a few things to get it right. It works mostly through compiler pragmas, so it is ignored when the compiler doesn't support it. Yesterday I already put a conditional around the only OpenMP library function that is called from the code, so it now also builds without -fopenmp. But you are right, it should then not include omp.h either. I'll fix that up, along with the spacing.
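For reference, guarding the header and the single library call on the compiler-defined `_OPENMP` macro (which OpenMP-capable compilers predefine when OpenMP is enabled) might look like this sketch; `worker_count` is a hypothetical name:

```c
/* Sketch: guard the OpenMP header and the one library call, so the
 * file compiles cleanly with or without -fopenmp. */
#ifdef _OPENMP
#include <omp.h>
#endif

int worker_count(void)
{
#ifdef _OPENMP
	return omp_get_max_threads();  /* actual thread count */
#else
	return 1;                      /* serial fallback */
#endif
}
```

No Make-level flag is strictly needed for this part, since `_OPENMP` tracks whether the compiler itself has OpenMP active.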

@htot
Contributor Author

htot commented Feb 20, 2016

I've split up the code, fixed formatting and permissions and added some benchmarks.

@htot
Contributor Author

htot commented Feb 20, 2016

BTW, I checked my memory configuration and found it was running at DDR-1333 while I actually have DDR-1600; I fixed that and got some more performance.

@aklomp
Owner

aklomp commented Feb 20, 2016

Thanks for the contribution! I'm amazed, on my Core i5 the SSSE3 and AVX2 decoders are now equally fast and bounded only by memory latency. This is a very cool technology to have in my toolbox, thanks.

I just pushed an 'openmp' branch which contains your work, plus a merge commit where I make some small cleanups. I fixed some typos, I tried to keep the line length at 80 characters, and I made some minor whitespace changes. If you agree on the changes, I'll push it into master.

@aklomp
Owner

aklomp commented Feb 20, 2016

Also, I think you deserve to be in the Acknowledgements section of the readme, so I'll add a mention if you agree.

@aklomp
Owner

aklomp commented Feb 20, 2016

Added a shout-out to the Readme.

@htot
Contributor Author

htot commented Feb 20, 2016

Thanks for this.

Today I modified the benchmark program to benchmark with different buffer sizes. I found that a buffer of 100 kB - 1 MB seems to be the best case, so my previous benchmarks at 10 MB were not really comparable with yours. I fixed that up and added a graph at the bottom of Readme.md.

You can see that above 10 MB the cache is not very effective, so you become memory constrained, while below 10 kB it's not really worth creating threads. In the sweet spot you can achieve 27 GB/s.

@htot
Contributor Author

htot commented Feb 20, 2016

Er, that would be previously 100 MB. Now I repeat a 10 MB buffer 10 times to measure the timing accurately, and repeat proportionally more often when the block is smaller.

@htot
Contributor Author

htot commented Feb 20, 2016

With this out of the way, I can get to doing something useful with the code: using it to encode binary data over a 2 Mb/s serial connection. Using base64 I can add control/synchronization characters (0x00 error, 0xFF preamble, STX, ETX) with almost no performance loss. With the Edison doing 150 MB/s even without threading, encoding will increase communication time by only ~1% (+33% in data size due to base64, of course).

Thanks!

@aklomp
Owner

aklomp commented Feb 22, 2016

Ah, I saw that you merged your branch on top of my master. Could you maybe try to rebase it onto master instead? Rebasing will replay your commits on top of my master branch and "fake history" so that it looks like you were working off the current HEAD all the time. That gives a nice linear history without merge commits.

With rebase you can also squash commits so that it looks like they never happened. You could for instance suppress the plaintext ASCII table that appeared and disappeared in the README. The "recorded history" becomes a clean set of additive patches. Faking history in git is an awesome perfectionistic superpower :)

It's OK if you don't want to rebase, because I'm happy to merge your changes. Let me know if you're OK with the mention I put in the README and with the small tweaks I made in my merge commit.

@aklomp
Owner

aklomp commented Feb 22, 2016

The question of which benchmark size to choose is tricky. If you choose 100 MB, the tests take a very long time to run on a Raspberry Pi or other slow hardware. If you choose 10 MB, you'll be constrained by the speed of memory on desktop systems. (Which I think is an awesome achievement, btw :) If you choose a smaller size, you get fairer benchmarks for very fast hardware, but the timing variation is larger. I guess I'm not the first to note that meaningful benchmarking is both very difficult and quite subjective.

@aklomp
Owner

aklomp commented Feb 22, 2016

I think it's cool that you're using this library for serial comms. It's nice that you stretch out 3x8 bits over 4x6 so that you can fit in control characters in the top two bits. But if that's the approach you're taking, I have to note that base64-encoded data is not 6-bit clean, because after reshuffling, it gets translated to the base64 alphabet. (0x00 becomes 'A', and so on.) So the top two bits are not free for use. If you want to patch that out, get rid of the "translate" sections during encoding/decoding.

@htot
Contributor Author

htot commented Feb 22, 2016

:-) I think this was my first C code in 10 years, and now I'm attempting OpenMP, git, GitHub and markdown all at the same time. As soon as I figure out how to rebase, I'll attempt to make the history linear.
What I modified in the benchmark is to use the same buffer multiple times, in such a way that the total time is constant regardless of the buffer size. This way we should get the same accuracy for small and large buffers. It takes a bit of time on the Edison and probably a bit more on a Raspberry Pi.
The correct way would of course be to select a constant time and do as many buffer conversions as possible within that time. But I think this is good enough for now.
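The constant-total-work idea could be sketched like this (`TOTAL_BYTES` and `repetitions` are illustrative names, not the benchmark's actual code):

```c
/* Sketch: every buffer size processes the same total number of bytes,
 * so small buffers are simply repeated more often. */
#include <stddef.h>

#define TOTAL_BYTES (10u * 1024u * 1024u)  /* 10 MB per measurement */

size_t repetitions(size_t bufsize)
{
	size_t n = TOTAL_BYTES / bufsize;
	return n ? n : 1;  /* at least one pass for oversized buffers */
}
```

This keeps the measured wall-clock interval roughly constant across sizes, which equalizes the timing accuracy.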
I measured on the Edison today that the cache has no effect (or doesn't exist), so there is no peak for small buffers. And thread creation costs 9 us for one thread and 12 us for two. So in my application (2 kB buffers at 2 Mb/s = 0.2 MB/s, with the SSE3 codec running at 150 MB/s) not using OpenMP for encoding/decoding will probably be optimal. Irony.
Actually, base64 should be just fine, I think: in ASCII the control characters occupy 0x00 through 0x1F, and 0xFF is also reserved. I think I saw in your code that 0xFF is used for error signalling, so that would be just fine for me. All I actually need is 0x00 (the serial port should be able to generate that in case of parity errors), 0xFF for preambles, and STX (0x02) and ETX (0x03) to enclose the base64 text.
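A quick way to convince yourself the framing is safe: every symbol of the standard base64 alphabet (plus the '=' padding) lies outside the ASCII control range and below 0xFF, so those framing bytes can never appear inside encoded text. A small self-contained check:

```c
/* Verify that no base64 output byte collides with the framing bytes
 * 0x00-0x1F (control range, includes STX/ETX) or 0xFF. */
#include <string.h>

static const char b64_alphabet[] =
	"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
	"abcdefghijklmnopqrstuvwxyz"
	"0123456789+/";

int alphabet_is_frame_safe(void)
{
	for (size_t i = 0; i < strlen(b64_alphabet); i++) {
		unsigned char c = (unsigned char) b64_alphabet[i];
		if (c <= 0x1F || c == 0xFF)
			return 0;
	}
	return '=' > 0x1F;  /* the padding character is safe too */
}
```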

@aklomp aklomp mentioned this issue Feb 23, 2016
@htot
Contributor Author

htot commented Feb 28, 2016

With respect to rebasing and rewriting history: please tell me how to do that and I'll get it done. I'm a little afraid to mess things up.

Otherwise, just go ahead and merge as you see fit. Thanks for the mention in the Readme!

@BurningEnlightenment
Contributor

Take a look at the Pro Git Book, especially at the sections about Rebasing and Rewriting History.

Cheers

@htot
Contributor Author

htot commented Feb 28, 2016

So, how would I continue then? Create a new branch starting from my initial branch point (beaf4e3) and then reorder and squash commits into 1) add openmp stuff + 2) benchmarks?
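One possible way to do that squash non-interactively, demonstrated in a throwaway repository (in the real branch you would run `git rebase -i beaf4e3` and edit the todo list by hand; names here are illustrative):

```shell
# Hypothetical demo: squash a fixup commit into its predecessor.
set -e
d=$(mktemp -d) && cd "$d"
git init -q
git config user.email you@example.com
git config user.name you
echo openmp > lib.c && git add lib.c && git commit -qm "add openmp support"
echo tweak  > lib.c && git commit -qam "fix whitespace"
# Mark the second todo entry as a fixup of the first, without an editor:
GIT_SEQUENCE_EDITOR='sed -i "2s/^pick/fixup/"' git rebase -i -q --root
git log --oneline   # history is now a single clean commit
```

In interactive use you would change `pick` to `fixup` (or `squash`) in the todo list yourself, and reorder lines to group related commits.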

@htot
Contributor Author

htot commented Feb 28, 2016

Alright, I think I sorted things out locally. Now, how do I fix it on GitHub? Delete the openmp branch and push it up again? Or create a new branch?

@BurningEnlightenment
Contributor

You can overwrite a remote branch's content by doing a force push, i.e. git push -f. You shouldn't delete a branch with an open pull request, because the pull request will be irreversibly closed and you'd have to open a new one.

EDIT: whoops, I just noticed that this is an issue and not a pull request 😧; in that case it doesn't really matter whether you force push, recreate the branch, or branch again.

@htot
Contributor Author

htot commented Mar 1, 2016

I force pushed. Hope that didn't screw anything up.

@aklomp
Owner

aklomp commented Mar 2, 2016

Hey, I just merged your work into my openmp branch. I'll probably merge that into master tomorrow after I sleep on it.

I added two things: a merge commit and a new commit on top. The merge commit basically changes nothing except some spacing/style/syntax issues. The new commit shuffles things a little bit so that the plain functions are back in lib/lib.c, and lib/lib_plain.c is no longer needed. I think that makes the code cleaner, because lib/lib_openmp.c was basically copying the bodies of the plain codecs for the case where the buffer size was below the OMP threshold.

@htot
Contributor Author

htot commented Mar 2, 2016

That's great news. Thanks for this great project and for letting me contribute something back!

@aklomp
Owner

aklomp commented Mar 3, 2016

Merged into master branch, thanks a lot for a great contribution!

@aklomp aklomp closed this as completed Mar 3, 2016
@aklomp aklomp reopened this Mar 3, 2016
@aklomp
Owner

aklomp commented Mar 3, 2016

Actually, I think I have one last question about the OpenMP code, regarding this line. base64_stream_decode returns -1 if the requested codec is not implemented, 0 if decoding failed because of invalid input, and 1 if everything is OK. What happens if we spawn four OpenMP threads and one thread encounters a decode error? All of the return values are added, so the accumulated result will be 3, which is truncated to 1, which means "success". Shouldn't any value of result other than num_threads indicate an error?

@htot
Contributor Author

htot commented Mar 4, 2016

You're right, I didn't realize there is no test case for this.
We could replace '+=' with '&=' (assuming -1 cannot occur) and initialize 'result' to 1?

@aklomp
Owner

aklomp commented Mar 4, 2016

Well, if there are no errors, the sum must be equal to num_threads. The -1 case will come out as -num_threads since all threads encounter the same missing codec. Any result other than those two indicates that one or more threads hit an error.
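The check described above could be sketched as follows (illustrative names, not the library's actual code): per-thread results are summed, and only the two exact totals count as non-error outcomes.

```c
/* Sketch of the aggregated-result check: each thread returns 1 (ok),
 * 0 (invalid input) or -1 (codec missing), and the values are summed. */
enum { DECODE_OK = 1, DECODE_INVALID = 0, DECODE_NO_CODEC = -1 };

int combined_result(int sum, int num_threads)
{
	if (sum == num_threads)
		return DECODE_OK;        /* every thread succeeded */
	if (sum == -num_threads)
		return DECODE_NO_CODEC;  /* every thread missed the codec */
	return DECODE_INVALID;           /* at least one thread hit bad input */
}
```

With four threads, a sum of 3 (one failure) now correctly maps to an error instead of being truncated to "success".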

@aklomp
Owner

aklomp commented Mar 6, 2016

Pushed fix to master.

@aklomp aklomp closed this as completed Mar 6, 2016