[RFC] introduce Roaring bitmaps to Git #1357

Open
wants to merge 5 commits into
base: master

Abhra303

Git currently uses EWAH bitmaps (which are based on run-length encoding) to compress bitmaps. EWAH stores a bitmap as run-length words: instead of storing every bit, it finds runs of consecutive bits that share the same value and replaces each run with that value and the length of the run. It is simple and efficient. One downside of this approach is that we have to decompress the whole bitmap in order to find the bit at a given position.
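
To illustrate why a run-length encoding makes single-bit lookups expensive, here is a minimal sketch in C. It is not EWAH's actual word layout; the `rle_run` struct and `rle_get_bit()` are hypothetical names used only for this illustration. The point is that, without an index, a lookup has to walk the runs from the beginning.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical run: `length` consecutive bits that all have the value `bit`. */
struct rle_run {
	uint8_t bit;
	uint32_t length;
};

/*
 * Return the bit at position `pos`, or -1 if `pos` is past the end.
 * There is no index, so the runs are walked from the start until the
 * run covering `pos` is reached.
 */
int rle_get_bit(const struct rle_run *runs, size_t nr, uint64_t pos)
{
	uint64_t start = 0; /* position of the first bit of runs[i] */
	size_t i;

	assert(runs || !nr);
	for (i = 0; i < nr; i++) {
		if (pos < start + runs[i].length)
			return runs[i].bit;
		start += runs[i].length;
	}
	return -1;
}
```

The cost of one lookup grows with the number of runs that precede the requested position, which is what hurts on very large bitmaps.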

For small or medium-sized bitmaps this is not an issue, but it can be one for large or extra-large bitmaps. In that case Roaring bitmaps are generally more efficient[1] than EWAH. Some benchmarks suggest that Roaring bitmaps perform better than EWAH and other similar compression techniques.
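
For comparison, a rough sketch of the Roaring idea: a 32-bit position is split into its high 16 bits, which select a container, and its low 16 bits, which are looked up inside that container. The sketch below models only sorted-array containers and uses hypothetical names (`sketch_container`, `sketch_contains()`); it is not CRoaring's API or data layout, which also has dense bitmap and run containers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sparse container: sorted low-16-bit values sharing one key. */
struct sketch_container {
	uint16_t key;           /* high 16 bits shared by every value */
	const uint16_t *values; /* sorted low 16 bits */
	size_t nr;
};

/* Binary search for `want` in a sorted array of 16-bit values. */
int sorted_u16_contains(const uint16_t *v, size_t nr, uint16_t want)
{
	size_t lo = 0, hi = nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		if (v[mid] == want)
			return 1;
		if (v[mid] < want)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}

/*
 * Test whether bit `pos` is set. `c` must be sorted by key. Only the
 * one container whose key matches is inspected; the rest of the bitmap
 * is never touched.
 */
int sketch_contains(const struct sketch_container *c, size_t nr, uint32_t pos)
{
	uint16_t key = (uint16_t)(pos >> 16);
	uint16_t low = (uint16_t)(pos & 0xffff);
	size_t lo = 0, hi = nr;

	assert(c || !nr);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		if (c[mid].key == key)
			return sorted_u16_contains(c[mid].values, c[mid].nr, low);
		if (c[mid].key < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return 0;
}
```

Both searches are logarithmic, so a single membership test does not depend on how many bits precede the requested position.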

This patch series is currently in RFC state; it aims to let Git use Roaring bitmaps. As it is an RFC (for now), the code is not fully correct (i.e. some tests are failing). But it is backward-compatible (the tests related to EWAH bitmaps pass). Some commit messages might need more explanation and some commits may need to be split (especially the one that implements writing Roaring bitmaps). Overall, the structure and code are close to ready for a formal patch series.

I am submitting it as an RFC (after discussions with mentors) because the GSoC coding period is about to end. I will continue to work on the patch series.

cc: Taylor Blau me@ttaylorr.com
cc: Kaartic Sivaraam kaartic.sivaraam@gmail.com
cc: Junio C Hamano gitster@pobox.com
cc: Derrick Stolee derrickstolee@github.com

According to the Roaring bitmap paper[1], Roaring gives better
performance (in most cases) than EWAH bitmaps. Its compression ratio is
also good. Moreover, unlike EWAH, it doesn't have to parse the whole
bitmap to learn about one object. So it may be good for Git to use
Roaring bitmaps.

CRoaring is a well-tested library for Roaring bitmaps and is mainly
written by the author of the mentioned paper.

Add the CRoaring library so that Git can use Roaring bitmap functions.

[1] https://arxiv.org/pdf/1603.06549.pdf

Mentored-by: Taylor Blau <me@ttaylorr.com>
Co-Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
Though the Roaring library was introduced in the previous commit, it
cannot be used as is. One reason is that the library doesn't support
big-endian machines. Besides, Git's file-writing functions use
`hashwrite()` (or similar), which the library does not. So the library
needs to be modified.

Add new functions and modify existing ones so that Git can actually use
the library.

Mentored-by: Taylor Blau <me@ttaylorr.com>
Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
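
The big-endian concern in the commit message above comes down to serializing multi-byte integers byte by byte in one fixed order, rather than copying them from memory in host order. The following is a minimal sketch of that technique. The helper names `sketch_put_be32()` and `sketch_get_be32()` are hypothetical, and the choice of most-significant-byte-first order is an assumption for illustration; this thread does not say which order the series actually writes.

```c
#include <assert.h>
#include <stdint.h>

/* Store `v` at `p` most-significant byte first, whatever the host order. */
void sketch_put_be32(unsigned char *p, uint32_t v)
{
	assert(p);
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

/* Read back a value written by sketch_put_be32(). */
uint32_t sketch_get_be32(const unsigned char *p)
{
	assert(p);
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}
```

Because the shifts operate on values rather than on memory layout, the bytes produced are identical on little-endian and big-endian hosts, and a buffer built this way can be handed to whatever write routine the caller uses.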
Roaring bitmaps are said to be more efficient (most of the time) than
EWAH bitmaps, so Git might gain some performance if it supports them.
Now that the Roaring library has all the changes needed to implement
Roaring bitmaps in Git, Git can learn to write them. All the changes
are backward-compatible.

Teach Git to write roaring bitmaps.

Mentored-by: Taylor Blau <me@ttaylorr.com>
Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
Though Git can write roaring bitmaps, there is still no way (e.g. a
configuration option) to control the writing of roaring bitmaps.

Introduce `pack.useroaringbitmap` option to control the writing of
roaring bitmaps.

Mentored-by: Taylor Blau <me@ttaylorr.com>
Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
Git knows how to write roaring bitmaps but it still doesn't know
how to read roaring bitmaps. The changes are backward-compatible.

Teach Git to read roaring bitmaps.

Mentored-by: Taylor Blau <me@ttaylorr.com>
Mentored-by: Kaartic Sivaraam <kaartic.sivaraam@gmail.com>
Signed-off-by: Abhradeep Chakraborty <chakrabortyabhradeep79@gmail.com>
@Abhra303
Author

/submit

@gitgitgadget

gitgitgadget bot commented Sep 19, 2022

Submitted as pull.1357.git.1663609659.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git/ pr-1357/Abhra303/roaring-bitmap-exp-v1

To fetch this version to local tag pr-1357/Abhra303/roaring-bitmap-exp-v1:

git fetch --no-tags https://github.com/gitgitgadget/git/ tag pr-1357/Abhra303/roaring-bitmap-exp-v1

@gitgitgadget

gitgitgadget bot commented Sep 19, 2022

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 9/19/2022 1:47 PM, Abhradeep Chakraborty via GitGitGadget wrote:
> Git currently uses ewah bitmaps ( which are based on run-length encoding) to
> compress bitmaps. Ewah bitmaps stores bitmaps in the form of run-length
> words i.e. instead of storing each and every bit, it tries to find
> consecutive bits (having same value) and replace them with the value bit and
> the range upto which the bit is present. It is simple and efficient. But one
> downside of this approach is that we have to decompress the whole bitmap in
> order to find the bit of a certain position.
> 
> For small (or medium sized) bitmaps, this is not an issue. But it can be an
> issue for large (or extra large) bitmaps. In that case roaring bitmaps are
> generally more efficient[1] than ewah itself. Some benchmarks suggests that
> roaring bitmaps give more performance benefits than ewah or any other
> similar compression technique.
> 
> This patch series is currently in RFC state and it aims to let Git use
> roaring bitmaps. As this is an RFC patch series (for now), the code are not
> fully accurate (i.e. some tests are failing). But it is backward-compatible
> (tests related to ewah bitmaps are passing). Some commit messages might need
> more explanation and some commits may need a split (specially the one that
> implement writing roaring bitmaps). Overall, the structure and code are near
> to ready to make the series a formal patch series.
> 
> I am submitting it as an RFC (after discussions with mentors) because the
> GSoC coding period is about to end. I will continue to work on the patch
> series.

I look forward to your next version. I hope to see some information about
the performance characteristics across the two versions. Specifically:

1. How do various tests in t/perf/ change between the two formats?
2. For certain test repos (git/git, torvalds/linux, etc.) how much does
   the .bitmap file change in size across the formats?
 
>  Makefile                   |     3 +
>  bitmap.c                   |   225 +
>  bitmap.h                   |    33 +
...
>  ewah/bitmap.c              |    61 +-
>  ewah/ewok.h                |    37 +-
...
>  roaring/roaring.c          | 20047 +++++++++++++++++++++++++++++++++++
>  roaring/roaring.h          |  1028 ++

I wonder if there is value in modifying the structure of these files
into a bitmap/ directory and then perhaps ewah/ and roaring/ within
each? Just a thought.

Thanks,
-Stolee

@gitgitgadget

gitgitgadget bot commented Sep 20, 2022

On the Git mailing list, Abhradeep Chakraborty wrote (reply to this):

Hi Derrick,

On Mon, Sep 19, 2022 at 11:48 PM Derrick Stolee
<derrickstolee@github.com> wrote:
> I look forward to your next version. I hope to see some information about
> the performance characteristics across the two versions. Specifically:
>
> 1. How do various test in t/perf/ change between the two formats?
> 2. For certain test repos (git/git, torvalds/linux, etc.) how much does
>    the .bitmap file change in size across the formats?

Yeah, sure. I will include the performance test results in the
next version :)

> >  Makefile                   |     3 +
> >  bitmap.c                   |   225 +
> >  bitmap.h                   |    33 +
> ...
> >  ewah/bitmap.c              |    61 +-
> >  ewah/ewok.h                |    37 +-
> ...
> >  roaring/roaring.c          | 20047 +++++++++++++++++++++++++++++++++++
> >  roaring/roaring.h          |  1028 ++
>
> I wonder if there is value in modifying the structure of these files
> into a bitmap/ directory and then perhaps ewah/ and roaring/ within
> each? Just a thought.

Great idea! Thanks! Will change it in the next version.

Thanks :)

@gitgitgadget

gitgitgadget bot commented Sep 20, 2022

On the Git mailing list, Taylor Blau wrote (reply to this):

On Mon, Sep 19, 2022 at 05:47:34PM +0000, Abhradeep Chakraborty via GitGitGadget wrote:
> This patch series is currently in RFC state and it aims to let Git use
> roaring bitmaps. As this is an RFC patch series (for now), the code are not
> fully accurate (i.e. some tests are failing). But it is backward-compatible
> (tests related to ewah bitmaps are passing). Some commit messages might need
> more explanation and some commits may need a split (specially the one that
> implement writing roaring bitmaps). Overall, the structure and code are near
> to ready to make the series a formal patch series.

Extremely exciting. Congratulations on all of your work so far. I'm
hopeful that you'll continue working on this after GSoC is over (for
those playing along at home, Abhradeep's coding period was extended by a
couple of weeks).

But even if you don't, this is a great artifact to leave around on the
list for somebody else who is interested in this area to pick up in the
future, and benefit from all of the work that you've done so far.

I am still working through my post-Git Merge backlog, but I'm looking
forward to reading these patches soon. I'm glad that other reviewers
have already started to dive in :-).

Well done!


Thanks,
Taylor

@gitgitgadget

gitgitgadget bot commented Sep 21, 2022

On the Git mailing list, Abhradeep Chakraborty wrote (reply to this):

On Wed, Sep 21, 2022 at 3:29 AM Taylor Blau <me@ttaylorr.com> wrote:
>
> On Mon, Sep 19, 2022 at 05:47:34PM +0000, Abhradeep Chakraborty via GitGitGadget wrote:
> > This patch series is currently in RFC state and it aims to let Git use
> > roaring bitmaps. As this is an RFC patch series (for now), the code are not
> > fully accurate (i.e. some tests are failing). But it is backward-compatible
> > (tests related to ewah bitmaps are passing). Some commit messages might need
> > more explanation and some commits may need a split (specially the one that
> > implement writing roaring bitmaps). Overall, the structure and code are near
> > to ready to make the series a formal patch series.
>
> Extremely exciting. Congratulations on all of your work so far. I'm
> hopeful that you'll continue working on this after GSoC is over (for
> those playing along at home, Abhradeep's coding period was extended by a
> couple of weeks).

Yeah, I will continue (or, better to say, I am continuing) my work. I
hope that I can submit the next version in the next few days.

Thanks for supporting and guiding me throughout the GSoC period. I
have learned a lot of new things during this period.

> I am still working through my post-Git Merge backlog, but I'm looking
> forward to reading these patches soon. I'm glad that other reviewers
> have already started to dive in :-).

No problem, I am making some improvements in the meantime.
By the way, I am very excited to see the YouTube Git Merge recordings ;)

Thanks :)
