
Bit reverse #8

Open · wants to merge 8 commits into master
Conversation

@ayushkarnawat (Owner) commented Sep 17, 2020

Description

Rotate the bitarray in place, moving every bit (at most) twice and using no auxiliary memory. Treating the string to be rotated as ab, observe the identity (a^R b^R)^R = ba, where R is the operation that reverses a string. A reversal can be done in place with only constant storage, so the string can be rotated with three reversals.
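A minimal sketch of the three-reversal rotation (not code from this PR; shown on a plain char array for clarity, rather than a packed bitarray):

```c
#include <string.h>

/* Reverse s[lo..hi] in place using constant extra storage. */
static void reverse(char *s, int lo, int hi) {
    while (lo < hi) {
        char tmp = s[lo];
        s[lo++] = s[hi];
        s[hi--] = tmp;
    }
}

/* Rotate the first n chars of s left by k positions: ab -> ba,
 * via the identity (a^R b^R)^R = ba. */
static void rotate(char *s, int n, int k) {
    reverse(s, 0, k - 1);   /* a  -> a^R        */
    reverse(s, k, n - 1);   /* b  -> b^R        */
    reverse(s, 0, n - 1);   /* a^R b^R -> b a   */
}
```

For example, `rotate("abcdefg", 7, 2)` yields `"cdefgab"`. The same structure applies to a packed bitarray once `reverse` operates on bit indices instead of char indices.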

Checklist

  • Bit-by-bit reversal
  • Byte reversal (for large bitarray rotations)
    • Lookup table (8 bits) -> 256 entries total
    • Lookup table (4 bits) -> 16 entries total
    • Vectorized lookup
  • Combination of byte and bit reversal (for the remaining bits that are less than a byte in length)
  • Check
    • Memory management (i.e. leaks)
    • Cache misses
    • Cycles performed (performance)
  • Driver code to compare different approaches
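The bit-by-bit reversal in the checklist is the baseline the other approaches are measured against; a sketch (illustrative, not the PR's implementation):

```c
#include <stdint.h>

/* Reverse the bits of one byte, one bit at a time:
 * shift the result left and append the next lowest bit of b. */
static uint8_t reverse_byte(uint8_t b) {
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint8_t)((r << 1) | (b & 1));
        b >>= 1;
    }
    return r;
}
```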

@ayushkarnawat ayushkarnawat self-assigned this Sep 17, 2020
@ayushkarnawat (Owner, Author) commented Sep 18, 2020

Byte reversal methods

Some interesting real-world tests using different approaches:

Lookup table

A standard lookup table covering every 8-bit value has 2^8 = 256 entries. Since the table is relatively large (256 bytes), it cannot fit in a single cache line. For example, my Intel Core 2 Duo has 64-byte cache lines, so reading the byte table may require more than one fetch from L2/L3 per byte reversal (i.e., a cache miss). For a large bitarray this can happen many times, resulting in many cache misses.
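A sketch of the 256-entry table approach (names like `rev8` and `init_rev8` are illustrative, not from the PR); the table is filled once and each byte reversal is then a single indexed load:

```c
#include <stdint.h>

static uint8_t rev8[256];  /* rev8[b] = bit-reversed b; 256 bytes total */

/* Fill the table once, using the bit-by-bit method. */
static void init_rev8(void) {
    for (int b = 0; b < 256; b++) {
        uint8_t r = 0, v = (uint8_t)b;
        for (int i = 0; i < 8; i++) {
            r = (uint8_t)((r << 1) | (v & 1));
            v >>= 1;
        }
        rev8[b] = r;
    }
}
```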

As such, it is more practical to use the more memory-efficient 4-bit (16-entry) flip table, since it easily fits into a single L1 cache line. This comes with a (hopefully negligible) performance hit, since each byte reversal must now be assembled on the fly from two nibble lookups.
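The 4-bit variant can be sketched as follows (again illustrative names): reverse each nibble via the 16-entry table, then swap the two nibbles.

```c
#include <stdint.h>

/* 16-entry table of bit-reversed 4-bit nibbles; fits in one cache line. */
static const uint8_t rev4[16] = {
    0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};

/* Reverse a byte: look up each nibble, then swap nibble positions. */
static uint8_t reverse_byte4(uint8_t b) {
    return (uint8_t)((rev4[b & 0x0F] << 4) | rev4[b >> 4]);
}
```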

Is there a cache-oblivious implementation? According to Silvesteri, there does not exist an optimal cache-oblivious algorithm, but there might be a cache-aware one.

Vectorized

While a vectorized lookup, in practice, takes less time than an equivalent serial implementation, it requires AVX2 instructions. Intel CPUs from Haswell onward support AVX2, so most recent devices are covered. Despite this, it might not be as fast as the 4-bit lookup table.
