Add templated memcpy/memcmp #2655
Conversation
Wow, incredible that you've made that kind of improvement even after the initial algorithm change! Maybe it's worth re-posting the blog with a short update/new benchmark at the top?
Thanks for the PR! Looks good. Two questions:
//! but only when you are calling memcpy with a const size in a loop.
//! For instance `while (<cond>) { memcpy(<dest>, <src>, const_size); ... }`
static inline void FastMemcpy(void *dest, const void *src, const size_t size) {
	D_ASSERT(size % 8 == 0);
If size%8 has to be 0, do we need to add the cases for size=1, size=2, etc?
This can be removed but I forgot it was there, my bad. We may want to use this code elsewhere where the assertion does not hold.
case 256:
	return MemcpyFixed<256>(dest, src);
default:
	MemcpyFixed<256>(dest, src);
Can't we just call memcpy with a variable size at this point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found a slight performance improvement by calling the fixed memcpy like this beyond 256 bytes, that's why I kept it there. It's not much though. Would you prefer I change it?
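One plausible reading of "calling the fixed memcpy like this beyond 256 bytes" is to copy in fixed 256-byte chunks and hand the remainder to a plain `memcpy`. This is a hedged sketch of that idea, not the PR's actual code: `MemcpyLarge` is a made-up name for illustration, and the real logic lives in `FastMemcpy`'s `default` case.

```cpp
#include <cstddef>
#include <cstring>

// Fixed-size copy: SIZE is a compile-time constant, so the compiler can
// expand this memcpy into straight-line load/store code.
template <size_t SIZE>
static inline void MemcpyFixed(void *dest, const void *src) {
	memcpy(dest, src, SIZE);
}

// Hypothetical helper: copy whole 256-byte chunks with the pre-compiled
// fixed-size version, then copy the tail with a variable-size memcpy.
static inline void MemcpyLarge(void *dest, const void *src, size_t size) {
	while (size >= 256) {
		MemcpyFixed<256>(dest, src);
		dest = (char *)dest + 256;
		src = (const char *)src + 256;
		size -= 256;
	}
	if (size != 0) {
		memcpy(dest, src, size);
	}
}
```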
@Alex-Monahan That is a good idea. Maybe I should hold off on that until @Mytherin finishes the local storage rework that will allow for parallel creation of ordered tables? These two things together will give a great performance boost, especially for strings.
Even more goodies coming than I knew about!! :-)
Thanks!
This PR adds templated versions of `memcpy` and `memcmp`, wrapped in methods called `FastMemcpy` and `FastMemcmp`, which select the specific templated version with a large switch statement. This allows the compiler to pre-compile these functions for the specified number of bytes and emit more optimized machine code. This yields a speedup specifically when these functions are used in a loop with a constant size, e.g. `while (<cond>) { memcpy(<dest>, <src>, const_size); ... }`.
The speedup can be up to 5-6x, depending on `size`. Of course, these functions are heavily used in the sort code, and adding them yields a significant performance boost.
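`FastMemcmp` presumably follows the same pattern as the copy side; a hedged sketch under that assumption (abbreviated case list, names taken from the PR description):

```cpp
#include <cstddef>
#include <cstring>

// Fixed-size compare: with SIZE known at compile time, the compiler can
// emit a direct word comparison instead of a generic memcmp call.
template <size_t SIZE>
static inline int MemcmpFixed(const void *a, const void *b) {
	return memcmp(a, b, SIZE);
}

// Dispatch the runtime size onto the pre-compiled instantiations.
static inline int FastMemcmp(const void *a, const void *b, const size_t size) {
	switch (size) {
	case 1:
		return MemcmpFixed<1>(a, b);
	case 4:
		return MemcmpFixed<4>(a, b);
	case 8:
		return MemcmpFixed<8>(a, b);
	// ... many more cases in the actual PR ...
	default:
		return memcmp(a, b, size);
	}
}
```

This matters for sorting in particular, where comparator keys of a known fixed width are compared in a tight loop.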
I sorted 100M random integers:
And here are the results:
As we can see, the performance gain is mostly felt with more threads, because the change affects merge sort more than it affects radix sort.
The fun part is that single-threaded performance is now as fast as 4-threaded performance on this specific benchmark was when the sorting blog post came out.