Bump downwards #37
Conversation
This changes `bumpalo`'s implementation from
* initializing the bump pointer at the start of the chunk, and
* incrementing the bump pointer to allocate an object
to
* initializing the bump pointer at the end of the chunk, and
* decrementing the bump pointer to allocate an object
This means that we now round down to align the pointer, which is just masking
off the bottom bits. Rounding up, as we used to do, required an addition that
could overflow, which meant an extra conditional branch in the generated code.
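To make the distinction concrete, here is a sketch of the two alignment roundings over plain integer addresses (the helper names are illustrative, not bumpalo's internals):

```rust
// Round an address down to `align` (a power of two): a single mask,
// with no possibility of overflow.
fn round_down_to(addr: usize, align: usize) -> usize {
    debug_assert!(align.is_power_of_two());
    addr & !(align - 1)
}

// Round an address up to `align`: the addition can overflow, which is
// why the upward-bumping code needed an extra conditional branch.
fn round_up_to(addr: usize, align: usize) -> Option<usize> {
    debug_assert!(align.is_power_of_two());
    addr.checked_add(align - 1).map(|a| a & !(align - 1))
}

fn main() {
    assert_eq!(round_down_to(0x1007, 8), 0x1000);
    assert_eq!(round_up_to(0x1001, 8), Some(0x1008));
    assert_eq!(round_up_to(usize::MAX, 8), None); // the overflow case
}
```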
Furthermore, once the bump pointer is decremented, it is now pointing directly
at the allocated space. Previously, we had to save a copy of the original
pointer in a temporary, update the bump pointer, and then return the
temporary. That requires the use of an extra register, so the new approach
should help lower register pressure at call sites, producing slightly better
code.
The decrement also requires fewer instructions to implement, which is better
for code size and, all else being equal, should imply a speedup in its own
right as well.
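Putting those pieces together, a minimal sketch of downward bump allocation over a single chunk (using integer addresses and made-up names, not bumpalo's actual code) looks like:

```rust
// A toy downward bump allocator over one chunk of address space.
struct Chunk {
    start: usize, // lowest address of the chunk
    ptr: usize,   // bump pointer, initialized at the chunk's *end*
}

impl Chunk {
    fn new(start: usize, size: usize) -> Chunk {
        Chunk { start, ptr: start + size }
    }

    // `align` must be a power of two.
    fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        let ptr = self.ptr.checked_sub(size)?;
        let ptr = ptr & !(align - 1); // round down: just a mask
        if ptr < self.start {
            return None; // chunk exhausted
        }
        self.ptr = ptr;
        // The updated bump pointer *is* the allocation's address, so
        // no temporary copy of the old pointer is needed.
        Some(ptr)
    }
}

fn main() {
    let mut c = Chunk::new(0x1000, 0x100);
    assert_eq!(c.alloc(0x10, 8), Some(0x10F0));
    assert_eq!(c.alloc(4, 4), Some(0x10EC));
    assert_eq!(c.alloc(0x1000, 8), None); // does not fit
}
```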
Put all this together and it looks like allocation speeds up 3-19% depending on
the workload! See the benchmark results below.
Note that there is a ~4% regression in `realloc` performance. This is because
the new, decrementing-the-bump-pointer implementation cannot grow the last
allocation in place by only updating the bump pointer. It has to do a copy since
the beginning of the allocation moves, even when we get to reuse the original
allocation's space. I think this is worth the trade-off for the speedup to
allocation, however.
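A bit of toy address arithmetic (hypothetical numbers, not bumpalo's code) shows why the most recent allocation can no longer grow in place:

```rust
fn main() {
    let chunk_end = 0x1100_usize;
    let old_size = 0x10;
    let old_start = chunk_end - old_size; // bump pointer after allocating
    let new_size = 0x20;
    let new_start = chunk_end - new_size; // bump pointer after growing it
    // The allocation's start address moved, so its bytes must be copied.
    assert_ne!(new_start, old_start);
    // Bumping upward, the start would stay fixed and only the end would
    // move, permitting in-place growth of the last allocation.
}
```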
--------------------------------------------------------------------------------
alloc/small time: [26.129 us 26.168 us 26.208 us]
thrpt: [381.56 Melem/s 382.15 Melem/s 382.71 Melem/s]
change:
time: [-9.2069% -8.7900% -8.3936%] (p = 0.00 < 0.05)
thrpt: [+9.1627% +9.6372% +10.141%]
Performance has improved.
Found 123 outliers among 1000 measurements (12.30%)
51 (5.10%) high mild
72 (7.20%) high severe
alloc/big time: [348.03 us 348.21 us 348.41 us]
thrpt: [28.702 Melem/s 28.718 Melem/s 28.733 Melem/s]
change:
time: [-3.1144% -3.0057% -2.8915%] (p = 0.00 < 0.05)
thrpt: [+2.9776% +3.0989% +3.2145%]
Performance has improved.
Found 150 outliers among 1000 measurements (15.00%)
58 (5.80%) low mild
46 (4.60%) high mild
46 (4.60%) high severe
alloc-with/small time: [26.446 us 26.477 us 26.508 us]
thrpt: [377.25 Melem/s 377.69 Melem/s 378.12 Melem/s]
change:
time: [-16.499% -16.191% -15.898%] (p = 0.00 < 0.05)
thrpt: [+18.904% +19.318% +19.759%]
Performance has improved.
Found 57 outliers among 1000 measurements (5.70%)
43 (4.30%) high mild
14 (1.40%) high severe
alloc-with/big time: [313.26 us 313.75 us 314.35 us]
thrpt: [31.811 Melem/s 31.872 Melem/s 31.922 Melem/s]
change:
time: [-6.5853% -6.2957% -6.0163%] (p = 0.00 < 0.05)
thrpt: [+6.4014% +6.7187% +7.0495%]
Performance has improved.
Found 166 outliers among 1000 measurements (16.60%)
70 (7.00%) low mild
44 (4.40%) high mild
52 (5.20%) high severe
format-realloc/format-realloc/10
time: [84.850 ns 85.002 ns 85.162 ns]
thrpt: [117.42 Melem/s 117.64 Melem/s 117.86 Melem/s]
change:
time: [+4.8825% +5.4527% +6.2553%] (p = 0.00 < 0.05)
thrpt: [-5.8870% -5.1707% -4.6552%]
Performance has regressed.
Found 299 outliers among 1000 measurements (29.90%)
1 (0.10%) low severe
78 (7.80%) low mild
22 (2.20%) high mild
198 (19.80%) high severe
format-realloc/format-realloc/80
time: [85.144 ns 85.353 ns 85.571 ns]
thrpt: [934.89 Melem/s 937.29 Melem/s 939.58 Melem/s]
change:
time: [+4.6040% +5.5085% +6.1615%] (p = 0.00 < 0.05)
thrpt: [-5.8039% -5.2209% -4.4014%]
Performance has regressed.
Found 168 outliers among 1000 measurements (16.80%)
40 (4.00%) high mild
128 (12.80%) high severe
format-realloc/format-realloc/270
time: [84.940 ns 85.080 ns 85.225 ns]
thrpt: [3.1681 Gelem/s 3.1735 Gelem/s 3.1787 Gelem/s]
change:
time: [+3.7967% +4.2268% +4.6452%] (p = 0.00 < 0.05)
thrpt: [-4.4390% -4.0554% -3.6579%]
Performance has regressed.
Found 229 outliers among 1000 measurements (22.90%)
8 (0.80%) low severe
2 (0.20%) low mild
11 (1.10%) high mild
208 (20.80%) high severe
format-realloc/format-realloc/640
time: [85.917 ns 86.199 ns 86.497 ns]
thrpt: [7.3991 Gelem/s 7.4247 Gelem/s 7.4490 Gelem/s]
change:
time: [+2.2676% +3.1780% +3.8626%] (p = 0.00 < 0.05)
thrpt: [-3.7190% -3.0801% -2.2173%]
Performance has regressed.
Found 169 outliers among 1000 measurements (16.90%)
62 (6.20%) high mild
107 (10.70%) high severe
@fitzgen Wow, those are some nice numbers! It might also mean that I can use this for flatbuffers!
Great! :)
let new_ptr = footer.ptr.get();
// NB: we know it is non-overlapping because of the size check
// in the `if` condition.
ptr::copy_nonoverlapping(ptr.as_ptr(), new_ptr.as_ptr(), new_size);
Have you tested how much we lose by using `ptr::copy` instead and not having the `new_size <= old_size / 2` check?
Ah, I see. If we are shrinking but not by a lot, we can just use the same pointer. We could reclaim it, but it is a lot of trouble for very few bytes saved. I agree with this implementation! 👍
Exactly, and since we are already doing this calculus, we might as well choose the threshold where we get to do a faster copy as well. That said, if you want to experiment with other implementations and benchmark them, I'm happy to accept results-driven PRs! :)
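For anyone following along, a small sketch (illustrative addresses, not bumpalo's actual layout) of why a `new_size <= old_size / 2` threshold makes `copy_nonoverlapping` sound: moving `new_size` bytes to the far end of their own `old_size`-byte region cannot overlap when the copy is at most half the region.

```rust
// Two half-open address ranges overlap iff each starts before the
// other ends.
fn regions_overlap(a: usize, a_len: usize, b: usize, b_len: usize) -> bool {
    a < b + b_len && b < a + a_len
}

fn main() {
    let old_size = 64;
    let new_size = 32; // new_size <= old_size / 2
    let src = 0x1000;                    // start of the old allocation
    let dst = src + old_size - new_size; // far end of the same region
    // Because new_size <= old_size / 2, source and destination cannot
    // overlap, so a `ptr::copy_nonoverlapping` would be sound here.
    assert!(!regions_overlap(src, new_size, dst, new_size));
}
```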
So I have tried bumping downwards in a linear allocator, and even though it saves a couple of instructions, it was always slower on Jaguar CPUs.