Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a custom FixedBitSet + optimize Page::iter_fixed_len #1160

Merged
merged 1 commit into from
Apr 30, 2024

Conversation

Centril
Copy link
Contributor

@Centril Centril commented Apr 26, 2024

Description of Changes

  • Shrinks present_rows to only take up 10 bytes. Fixes Reduce footprint of present_rows to 10 bytes #1152.
  • Drops the bit-vec dependency in favor of FixedBitSet so we can achieve the above.
  • Changes the algorithm of Page::iter_fixed_len to use a trailing_zeros approach.
    This improves performance of the iter_read_fixed_len for all cases significantly (see results below).
  • Removes some dead code and addresses lints to make the nightly compiler happier (to pacify miri).

Benchmarks as compared to HEAD~1 (branched master) on i7-7700K, 64GB RAM.

iter_read_fixed_len 8_byte/iter_read_fixed_len 8_byte/0.1
                        time:   [1.2453 µs 1.2476 µs 1.2504 µs]
                        thrpt:  [4.9933 GiB/s 5.0044 GiB/s 5.0138 GiB/s]
                 change:
                        time:   [-90.640% -90.594% -90.557%] (p = 0.00 < 0.05)
                        thrpt:  [+958.94% +963.11% +968.42%]
                        Performance has improved.

iter_read_fixed_len 8_byte/iter_read_fixed_len 8_byte/0.25
                        time:   [3.1358 µs 3.1572 µs 3.1847 µs]
                        thrpt:  [4.8825 GiB/s 4.9251 GiB/s 4.9586 GiB/s]
                 change:
                        time:   [-84.648% -84.560% -84.426%] (p = 0.00 < 0.05)
                        thrpt:  [+542.11% +547.68% +551.37%]
                        Performance has improved.

iter_read_fixed_len 8_byte/iter_read_fixed_len 8_byte/0.5
                        time:   [5.4504 µs 5.4618 µs 5.4775 µs]
                        thrpt:  [5.6068 GiB/s 5.6230 GiB/s 5.6347 GiB/s]
                 change:
                        time:   [-84.835% -84.755% -84.652%] (p = 0.00 < 0.05)
                        thrpt:  [+551.56% +555.96% +559.43%]
                        Performance has improved.

iter_read_fixed_len 8_byte/iter_read_fixed_len 8_byte/0.75
                        time:   [7.6559 µs 7.6630 µs 7.6772 µs]
                        thrpt:  [5.9500 GiB/s 5.9610 GiB/s 5.9666 GiB/s]
                 change:
                        time:   [-71.593% -71.571% -71.540%] (p = 0.00 < 0.05)
                        thrpt:  [+251.37% +251.75% +252.03%]
                        Performance has improved.

iter_read_fixed_len 8_byte/iter_read_fixed_len 8_byte/0.9
                        time:   [9.0964 µs 9.1057 µs 9.1182 µs]
                        thrpt:  [6.0172 GiB/s 6.0255 GiB/s 6.0316 GiB/s]
                 change:
                        time:   [-61.702% -61.475% -61.299%] (p = 0.00 < 0.05)
                        thrpt:  [+158.39% +159.57% +161.11%]
                        Performance has improved.

iter_read_fixed_len 8_byte/iter_read_fixed_len 8_byte/1
                        time:   [10.002 µs 10.007 µs 10.012 µs]
                        thrpt:  [6.0903 GiB/s 6.0932 GiB/s 6.0960 GiB/s]
                 change:
                        time:   [-54.012% -53.966% -53.916%] (p = 0.00 < 0.05)
                        thrpt:  [+116.99% +117.23% +117.45%]
                        Performance has improved.

iter_read_fixed_len 32_byte/iter_read_fixed_len 32_byte/0.1
                        time:   [261.06 ns 261.42 ns 261.80 ns]
                        thrpt:  [24.816 GiB/s 24.853 GiB/s 24.887 GiB/s]
                 change:
                        time:   [-90.917% -90.870% -90.838%] (p = 0.00 < 0.05)
                        thrpt:  [+991.51% +995.34% +1001.0%]
                        Performance has improved.

iter_read_fixed_len 32_byte/iter_read_fixed_len 32_byte/0.25
                        time:   [757.95 ns 758.98 ns 760.35 ns]
                        thrpt:  [20.813 GiB/s 20.850 GiB/s 20.879 GiB/s]
                 change:
                        time:   [-74.311% -74.261% -74.208%] (p = 0.00 < 0.05)
                        thrpt:  [+287.72% +288.51% +289.27%]
                        Performance has improved.

iter_read_fixed_len 32_byte/iter_read_fixed_len 32_byte/0.5
                        time:   [1.3360 µs 1.3364 µs 1.3368 µs]
                        thrpt:  [22.628 GiB/s 22.636 GiB/s 22.642 GiB/s]
                 change:
                        time:   [-61.588% -61.510% -61.409%] (p = 0.00 < 0.05)
                        thrpt:  [+159.13% +159.81% +160.33%]
                        Performance has improved.
                        
iter_read_fixed_len 32_byte/iter_read_fixed_len 32_byte/0.75
                        time:   [1.9166 µs 1.9174 µs 1.9183 µs]
                        thrpt:  [23.800 GiB/s 23.812 GiB/s 23.822 GiB/s]
                 change:
                        time:   [-50.428% -50.281% -50.131%] (p = 0.00 < 0.05)
                        thrpt:  [+100.53% +101.13% +101.73%]
                        Performance has improved.
                        
iter_read_fixed_len 32_byte/iter_read_fixed_len 32_byte/0.9
                        time:   [2.2676 µs 2.2698 µs 2.2735 µs]
                        thrpt:  [24.133 GiB/s 24.173 GiB/s 24.196 GiB/s]
                 change:
                        time:   [-46.345% -46.272% -46.199%] (p = 0.00 < 0.05)
                        thrpt:  [+85.870% +86.123% +86.375%]
                        Performance has improved.
                        
iter_read_fixed_len 32_byte/iter_read_fixed_len 32_byte/1
                        time:   [2.4979 µs 2.4998 µs 2.5028 µs]
                        thrpt:  [24.363 GiB/s 24.392 GiB/s 24.410 GiB/s]
                 change:
                        time:   [-39.172% -39.109% -39.052%] (p = 0.00 < 0.05)
                        thrpt:  [+64.074% +64.227% +64.397%]
                        Performance has improved.
                        

iter_read_fixed_len 256_byte/iter_read_fixed_len 256_byte/0.1
                        time:   [31.214 ns 31.237 ns 31.264 ns]
                        thrpt:  [175.40 GiB/s 175.55 GiB/s 175.68 GiB/s]
                 change:
                        time:   [-89.765% -89.756% -89.747%] (p = 0.00 < 0.05)
                        thrpt:  [+875.28% +876.16% +877.00%]
                        Performance has improved.
                        
iter_read_fixed_len 256_byte/iter_read_fixed_len 256_byte/0.25
                        time:   [67.296 ns 67.340 ns 67.424 ns]
                        thrpt:  [194.49 GiB/s 194.73 GiB/s 194.85 GiB/s]
                 change:
                        time:   [-79.814% -79.798% -79.777%] (p = 0.00 < 0.05)
                        thrpt:  [+394.50% +395.00% +395.38%]
                        Performance has improved.
                        
iter_read_fixed_len 256_byte/iter_read_fixed_len 256_byte/0.5
                        time:   [163.99 ns 164.26 ns 164.56 ns]
                        thrpt:  [188.35 GiB/s 188.69 GiB/s 189.00 GiB/s]
                 change:
                        time:   [-58.921% -58.858% -58.791%] (p = 0.00 < 0.05)
                        thrpt:  [+142.66% +143.06% +143.44%]
                        Performance has improved.
iter_read_fixed_len 256_byte/iter_read_fixed_len 256_byte/0.75
                        time:   [247.31 ns 247.47 ns 247.78 ns]
                        thrpt:  [189.56 GiB/s 189.80 GiB/s 189.92 GiB/s]
                 change:
                        time:   [-46.170% -46.134% -46.080%] (p = 0.00 < 0.05)
                        thrpt:  [+85.460% +85.645% +85.770%]
                        Performance has improved.
                        
iter_read_fixed_len 256_byte/iter_read_fixed_len 256_byte/0.9
                        time:   [293.82 ns 294.04 ns 294.30 ns]
                        thrpt:  [187.95 GiB/s 188.11 GiB/s 188.25 GiB/s]
                 change:
                        time:   [-40.039% -39.889% -39.644%] (p = 0.00 < 0.05)
                        thrpt:  [+65.685% +66.358% +66.776%]
                        Performance has improved.
                        
iter_read_fixed_len 256_byte/iter_read_fixed_len 256_byte/1
                        time:   [318.79 ns 319.17 ns 319.92 ns]
                        thrpt:  [190.04 GiB/s 190.49 GiB/s 190.71 GiB/s]
                 change:
                        time:   [-38.455% -38.353% -38.268%] (p = 0.00 < 0.05)
                        thrpt:  [+61.991% +62.213% +62.482%]
                        Performance has improved.

API and ABI breaking changes

None

Expected complexity level and risk

2, contained changes, but some unsafe included.

Testing

Proptests are included and they have been run in miri

@Centril Centril requested review from kazimuth and gefjon April 26, 2024 00:13
@Centril Centril changed the title Use a custom FixedBitSet + optimize Page::iter_fixed_len Use a custom FixedBitSet + optimize Page::iter_fixed_len Apr 26, 2024
@bfops bfops added the release-any To be landed in any release window label Apr 29, 2024
Copy link
Contributor

@gefjon gefjon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat! Minor comment-ey suggestions.

Why did you choose to make FixedBitSet generic over B: BitBlock, rather than just using u64 directly? Did you run any experiments with different block sizes? Do you foresee us wanting to choose a different block size?

crates/table/src/fixed_bit_set.rs Show resolved Hide resolved
crates/table/src/fixed_bit_set.rs Show resolved Hide resolved
crates/table/src/fixed_bit_set.rs Show resolved Hide resolved
crates/table/src/fixed_bit_set.rs Show resolved Hide resolved
@Centril
Copy link
Contributor Author

Centril commented Apr 30, 2024

Why did you choose to make FixedBitSet generic over B: BitBlock, rather than just using u64 directly?

Initially, I started off with a type alias type BitBlock = u32;, but eventually I felt that B being generic helped enforce "unit correctness" and I also heard that James might want a custom bitset for other places.

Did you run any experiments with different block sizes?

Initially, I tried u32 just because that was the previous block size for BitVec. However, changing to u64 did give some slight improvements in benchmarks. They were not huge, but they were notable. I don't recall how significant they were however.

Do you foresee us wanting to choose a different block size?

Might be that James's use case wants u32 to be more dense, but u64 might also be all we want.

@Centril Centril force-pushed the centril/faster-smaller-present_rows branch from e6a7719 to 2482c62 Compare April 30, 2024 21:40
@Centril Centril enabled auto-merge April 30, 2024 21:43
@Centril Centril added this pull request to the merge queue Apr 30, 2024
Merged via the queue into master with commit b55121c Apr 30, 2024
6 checks passed
@Centril Centril deleted the centril/faster-smaller-present_rows branch April 30, 2024 22:15
@Shubham8287
Copy link
Contributor

Shubham8287 commented May 2, 2024

Just went through code, It was fun to read FixedBitSet code. I really appreciate how well the code is written. Thanks! 🙌

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-any To be landed in any release window
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reduce footprint of present_rows to 10 bytes
4 participants