Add templated memcpy/memcmp #2655
Conversation
Wow, incredible that you've made that kind of improvement even after the initial algorithm change! Maybe it's worth re-posting the blog with a short update/new benchmark at the top?
Thanks for the PR! Looks good. Two questions:
//! but only when you are calling memcpy with a const size in a loop.
//! For instance `while (<cond>) { memcpy(<dest>, <src>, const_size); ... }`
static inline void FastMemcpy(void *dest, const void *src, const size_t size) {
	D_ASSERT(size % 8 == 0);
If size%8 has to be 0, do we need to add the cases for size=1, size=2, etc?
This can be removed but I forgot it was there, my bad. We may want to use this code elsewhere where the assertion does not hold.
case 256:
	return MemcpyFixed<256>(dest, src);
default:
	MemcpyFixed<256>(dest, src);
Can't we just call memcpy with a variable size at this point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found a slight performance improvement by calling the fixed memcpy like this beyond 256 bytes, that's why I kept it there. It's not much though. Would you prefer I change it?
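One plausible reading of "calling the fixed memcpy like this beyond 256 bytes" is to copy in fixed 256-byte chunks and hand the remainder to a plain `memcpy`. This is a hedged sketch of that idea, not the PR's actual code: `MemcpyLarge` is a made-up name for illustration, and the real logic lives in `FastMemcpy`'s `default` case.

```cpp
#include <cstddef>
#include <cstring>

// Fixed-size copy: SIZE is a compile-time constant, so the compiler can
// expand this memcpy into straight-line load/store code.
template <size_t SIZE>
static inline void MemcpyFixed(void *dest, const void *src) {
	memcpy(dest, src, SIZE);
}

// Hypothetical helper: copy whole 256-byte chunks with the pre-compiled
// fixed-size version, then copy the tail with a variable-size memcpy.
static inline void MemcpyLarge(void *dest, const void *src, size_t size) {
	while (size >= 256) {
		MemcpyFixed<256>(dest, src);
		dest = (char *)dest + 256;
		src = (const char *)src + 256;
		size -= 256;
	}
	if (size != 0) {
		memcpy(dest, src, size);
	}
}
```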
@Alex-Monahan That is a good idea. Maybe I should hold off on that until @Mytherin finishes the local storage rework that will allow for parallel creation of ordered tables? These two things together will give a great performance boost, especially for strings.
Even more goodies coming than I knew about!! :-)
Thanks!
This PR adds templated versions of `memcpy` and `memcmp`, wrapped in methods called `FastMemcpy` and `FastMemcmp`, which select the specific templated version with a large switch statement. This allows the compiler to pre-compile these functions for the specified number of bytes and emit more optimized machine code. This yields a speedup specifically when these functions are used in a loop with a constant size, e.g. `while (<cond>) { memcpy(<dest>, <src>, const_size); ... }`.
The speedup can be up to 5-6x, depending on `size`. Of course, these functions are heavily used in the sort code, and adding them yields a significant performance boost.
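`FastMemcmp` presumably follows the same pattern as the copy side; a hedged sketch under that assumption (abbreviated case list, names taken from the PR description):

```cpp
#include <cstddef>
#include <cstring>

// Fixed-size compare: with SIZE known at compile time, the compiler can
// emit a direct word comparison instead of a generic memcmp call.
template <size_t SIZE>
static inline int MemcmpFixed(const void *a, const void *b) {
	return memcmp(a, b, SIZE);
}

// Dispatch the runtime size onto the pre-compiled instantiations.
static inline int FastMemcmp(const void *a, const void *b, const size_t size) {
	switch (size) {
	case 1:
		return MemcmpFixed<1>(a, b);
	case 4:
		return MemcmpFixed<4>(a, b);
	case 8:
		return MemcmpFixed<8>(a, b);
	// ... many more cases in the actual PR ...
	default:
		return memcmp(a, b, size);
	}
}
```

This matters for sorting in particular, where comparator keys of a known fixed width are compared in a tight loop.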
I sorted 100M random integers:
And here are the results:
As we can see, the performance gain is mostly felt with more threads, because the change affects merge sort more than it affects radix sort.
The fun part is that single-threaded performance is now as fast as 4-threaded performance on this specific benchmark was when the sorting blog post came out.