Bump downwards #37

Merged
fitzgen merged 2 commits into master from bump-downwards on Nov 1, 2019

fitzgen (Owner) commented Nov 1, 2019

This changes bumpalo's implementation from

  • initializing the bump pointer at the start of the chunk, and
  • incrementing the bump pointer to allocate an object

to

  • initializing the bump pointer at the end of the chunk, and
  • decrementing the bump pointer to allocate an object

This means that we are now rounding down to align the pointer, which is just masking the bottom bits. Rounding up, which is what we used to have to do, required an addition that could overflow, which meant that we had an extra conditional branch in the generated code.
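To make the difference concrete, here is a minimal sketch of the two rounding operations on plain `usize` addresses. The function names `round_down` and `round_up` are hypothetical, chosen for illustration; they are not bumpalo's internals.

```rust
/// Round `addr` down to a multiple of `align` (assumed to be a power of two).
/// Clearing the low bits cannot overflow, so no failure path is needed.
fn round_down(addr: usize, align: usize) -> usize {
    debug_assert!(align.is_power_of_two());
    addr & !(align - 1)
}

/// Round `addr` up to a multiple of `align` (assumed to be a power of two).
/// The addition can overflow near the top of the address space, so the
/// result is optional and the caller has to branch on it.
fn round_up(addr: usize, align: usize) -> Option<usize> {
    debug_assert!(align.is_power_of_two());
    let bumped = addr.checked_add(align - 1)?;
    Some(bumped & !(align - 1))
}
```

The `?` in `round_up` is the extra conditional branch in question; `round_down` has no equivalent.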

Furthermore, once the bump pointer is decremented, it is now pointing directly at the allocated space. Previously, we had to save a copy of the original pointer in a temporary, update the bump pointer, and then return the temporary. That requires the use of an extra register, so the new approach should help lower register pressure at call sites, producing slightly better code.
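The two allocation strategies can be sketched side by side, modelling the chunk with plain addresses instead of real pointers. The names `alloc_up` and `alloc_down` and their parameters are assumptions made for this illustration and do not mirror bumpalo's actual code.

```rust
/// Upward bumping (the old scheme), sketched on plain addresses.
/// `ptr` is the bump pointer and `end` is one past the chunk's last byte.
fn alloc_up(ptr: &mut usize, end: usize, size: usize, align: usize) -> Option<usize> {
    debug_assert!(align.is_power_of_two());
    // Rounding up needs an addition that may overflow.
    let result = (*ptr).checked_add(align - 1)? & !(align - 1);
    let new_ptr = result.checked_add(size)?;
    if new_ptr > end {
        return None;
    }
    // The aligned address has to be kept in a temporary (`result`)
    // while the bump pointer moves past the new object.
    *ptr = new_ptr;
    Some(result)
}

/// Downward bumping (the new scheme), sketched on plain addresses.
/// `ptr` is the bump pointer and `start` is the chunk's first byte.
fn alloc_down(ptr: &mut usize, start: usize, size: usize, align: usize) -> Option<usize> {
    debug_assert!(align.is_power_of_two());
    // Subtract the size, then mask the low bits to round down.
    let new_ptr = (*ptr).checked_sub(size)? & !(align - 1);
    if new_ptr < start {
        return None;
    }
    // The updated bump pointer is itself the address of the allocation.
    *ptr = new_ptr;
    Some(new_ptr)
}
```

In `alloc_down` the value stored and the value returned are the same, which is why no extra temporary is needed.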

The decrement also requires fewer instructions to implement, which is better for code size and, all else being equal, should also yield a speedup in its own right.

Put all this together and it looks like allocation speeds up 3-19% depending on the workload! See the benchmark results below.

Note that there is a ~4% regression in realloc performance. This is because the new implementation, which decrements the bump pointer, cannot grow the last allocation in place by only updating the bump pointer. It has to do a copy because the beginning of the allocation moves, even when we get to reuse the original allocation's space. I think this is worth the trade-off for the speedup in allocation, however.
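A rough sketch of why growing the last allocation differs between the two schemes, using offsets into a byte slice and ignoring alignment. The helpers `grow_last_up` and `grow_last_down` are hypothetical and only illustrate the idea; the real `realloc` handles more cases.

```rust
/// Upward bumping: the last allocation ends at `bump`, so growing it
/// only moves the bump pointer. The start address is unchanged and no
/// bytes are copied.
fn grow_last_up(
    bump: &mut usize,
    cap: usize,
    start: usize,
    old_size: usize,
    new_size: usize,
) -> Option<usize> {
    debug_assert!(start + old_size == *bump && new_size >= old_size);
    let new_bump = start.checked_add(new_size)?;
    if new_bump > cap {
        return None;
    }
    *bump = new_bump;
    Some(start)
}

/// Downward bumping: the last allocation starts at `bump`, so the extra
/// space can only come from below it. The start moves down and the old
/// contents must be copied to the new start, even though the old space
/// is reused.
fn grow_last_down(
    chunk: &mut [u8],
    bump: &mut usize,
    old_size: usize,
    new_size: usize,
) -> Option<usize> {
    debug_assert!(new_size >= old_size);
    let old_start = *bump;
    let new_start = old_start.checked_sub(new_size - old_size)?;
    // Source and destination may overlap; `copy_within` behaves like memmove.
    chunk.copy_within(old_start..old_start + old_size, new_start);
    *bump = new_start;
    Some(new_start)
}
```

The copy in `grow_last_down` is the additional work that the upward scheme avoids in this case.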

Benchmark Results
alloc/small             time:   [26.129 us 26.168 us 26.208 us]
                        thrpt:  [381.56 Melem/s 382.15 Melem/s 382.71 Melem/s]
                 change:
                        time:   [-9.2069% -8.7900% -8.3936%] (p = 0.00 < 0.05)
                        thrpt:  [+9.1627% +9.6372% +10.141%]
                        Performance has improved.
Found 123 outliers among 1000 measurements (12.30%)
  51 (5.10%) high mild
  72 (7.20%) high severe

alloc/big               time:   [348.03 us 348.21 us 348.41 us]
                        thrpt:  [28.702 Melem/s 28.718 Melem/s 28.733 Melem/s]
                 change:
                        time:   [-3.1144% -3.0057% -2.8915%] (p = 0.00 < 0.05)
                        thrpt:  [+2.9776% +3.0989% +3.2145%]
                        Performance has improved.
Found 150 outliers among 1000 measurements (15.00%)
  58 (5.80%) low mild
  46 (4.60%) high mild
  46 (4.60%) high severe

alloc-with/small        time:   [26.446 us 26.477 us 26.508 us]
                        thrpt:  [377.25 Melem/s 377.69 Melem/s 378.12 Melem/s]
                 change:
                        time:   [-16.499% -16.191% -15.898%] (p = 0.00 < 0.05)
                        thrpt:  [+18.904% +19.318% +19.759%]
                        Performance has improved.
Found 57 outliers among 1000 measurements (5.70%)
  43 (4.30%) high mild
  14 (1.40%) high severe

alloc-with/big          time:   [313.26 us 313.75 us 314.35 us]
                        thrpt:  [31.811 Melem/s 31.872 Melem/s 31.922 Melem/s]
                 change:
                        time:   [-6.5853% -6.2957% -6.0163%] (p = 0.00 < 0.05)
                        thrpt:  [+6.4014% +6.7187% +7.0495%]
                        Performance has improved.
Found 166 outliers among 1000 measurements (16.60%)
  70 (7.00%) low mild
  44 (4.40%) high mild
  52 (5.20%) high severe

format-realloc/format-realloc/10
                        time:   [84.850 ns 85.002 ns 85.162 ns]
                        thrpt:  [117.42 Melem/s 117.64 Melem/s 117.86 Melem/s]
                 change:
                        time:   [+4.8825% +5.4527% +6.2553%] (p = 0.00 < 0.05)
                        thrpt:  [-5.8870% -5.1707% -4.6552%]
                        Performance has regressed.
Found 299 outliers among 1000 measurements (29.90%)
  1 (0.10%) low severe
  78 (7.80%) low mild
  22 (2.20%) high mild
  198 (19.80%) high severe

format-realloc/format-realloc/80
                        time:   [85.144 ns 85.353 ns 85.571 ns]
                        thrpt:  [934.89 Melem/s 937.29 Melem/s 939.58 Melem/s]
                 change:
                        time:   [+4.6040% +5.5085% +6.1615%] (p = 0.00 < 0.05)
                        thrpt:  [-5.8039% -5.2209% -4.4014%]
                        Performance has regressed.
Found 168 outliers among 1000 measurements (16.80%)
  40 (4.00%) high mild
  128 (12.80%) high severe

format-realloc/format-realloc/270
                        time:   [84.940 ns 85.080 ns 85.225 ns]
                        thrpt:  [3.1681 Gelem/s 3.1735 Gelem/s 3.1787 Gelem/s]
                 change:
                        time:   [+3.7967% +4.2268% +4.6452%] (p = 0.00 < 0.05)
                        thrpt:  [-4.4390% -4.0554% -3.6579%]
                        Performance has regressed.
Found 229 outliers among 1000 measurements (22.90%)
  8 (0.80%) low severe
  2 (0.20%) low mild
  11 (1.10%) high mild
  208 (20.80%) high severe

format-realloc/format-realloc/640
                        time:   [85.917 ns 86.199 ns 86.497 ns]
                        thrpt:  [7.3991 Gelem/s 7.4247 Gelem/s 7.4490 Gelem/s]
                 change:
                        time:   [+2.2676% +3.1780% +3.8626%] (p = 0.00 < 0.05)
                        thrpt:  [-3.7190% -3.0801% -2.2173%]
                        Performance has regressed.
Found 169 outliers among 1000 measurements (16.90%)
  62 (6.20%) high mild
  107 (10.70%) high severe
fitzgen added 2 commits Nov 1, 2019
@fitzgen fitzgen merged commit 38054c7 into master Nov 1, 2019
3 checks passed
fitzgen.bumpalo Build #20191101.2 succeeded
fitzgen.bumpalo (Check that benches build) Check that benches build succeeded
fitzgen.bumpalo (Tests) Tests succeeded
@fitzgen fitzgen deleted the bump-downwards branch Nov 1, 2019
TethysSvensson (Contributor) commented Nov 1, 2019

@fitzgen Wow, those are some nice numbers! It might also mean that I can use this for flatbuffers!

fitzgen (Owner, Author) commented Nov 1, 2019

Great! :)

let new_ptr = footer.ptr.get();
// NB: we know it is non-overlapping because of the size check
// in the `if` condition.
ptr::copy_nonoverlapping(ptr.as_ptr(), new_ptr.as_ptr(), new_size);

TethysSvensson (Contributor) commented Nov 1, 2019

Have you tested how much we lose by using ptr::copy instead and not having the new_size <= old_size / 2 check?

TethysSvensson (Contributor) commented Nov 1, 2019

Ah, I see. If we are shrinking but not by a lot, we can just use the same pointer. We could reclaim it, but it is a lot of trouble for very few bytes saved. I agree with this implementation! 👍

fitzgen (Author, Owner) commented Nov 4, 2019

Exactly, and since we are already doing this calculus, we might as well choose the threshold where we get to do a faster copy as well. That said, if you want to experiment with other implementations and benchmark them, I'm happy to accept results-driven PRs! :)
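The threshold under discussion can be sketched as follows, using offsets into a byte slice and ignoring alignment. The helper `shrink_last_down` is hypothetical and is not the code under review. It assumes the allocation being shrunk is the most recent one, so it starts at the bump pointer.

```rust
/// Shrink the most recent allocation in a downward-bumping chunk.
/// `bump` is the offset where that allocation starts.
fn shrink_last_down(
    chunk: &mut [u8],
    bump: &mut usize,
    old_size: usize,
    new_size: usize,
) -> usize {
    debug_assert!(new_size <= old_size);
    let old_start = *bump;
    if new_size <= old_size / 2 {
        // Reclaim the freed bytes by moving the allocation's start up.
        let new_start = old_start + (old_size - new_size);
        // Because new_size <= old_size / 2, new_start >= old_start + new_size,
        // so the source and destination ranges are disjoint and a plain
        // non-overlapping copy is enough.
        let (lo, hi) = chunk.split_at_mut(new_start);
        hi[..new_size].copy_from_slice(&lo[old_start..old_start + new_size]);
        *bump = new_start;
        new_start
    } else {
        // Shrinking by less than half: keep the same address and accept
        // that a few bytes go unused.
        old_start
    }
}
```

With a threshold above half the old size, the two ranges could overlap and the copy would need memmove semantics (`ptr::copy`) instead.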
