Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changed Paths Bloom Filters #497

Closed

Conversation

garimasi514
Copy link

@garimasi514 garimasi514 commented Dec 19, 2019

Hey!

The commit graph feature brought in a lot of performance improvements across
multiple commands. However, file based history continues to be a performance
pain point, especially in large repositories.

Adopting changed path Bloom filters has been discussed on the list before, and
a prototype version was worked on by SZEDER Gábor, Jonathan Tan and
Dr. Derrick Stolee [1]. This series is based on Dr. Stolee's proof of concept
in [2]

With the changes in this series, git users will be able to choose to write Bloom
filters to the commit-graph using the following command:

'git commit-graph write --changed-paths'

Subsequent 'git log -- path' commands will use these computed Bloom filters to
decided which commits are worth exploring further to produce the history of the
provided path.

Cost of computing and writing Bloom filters

Computing and writing Bloom filters to the commit graph for the first
time implies computing the diffs and the resulting Bloom filters
for all the commits in the repository. This adds a non trivial amount of time
to run time. Every subsequent run is incremental i.e. we reuse the previously
computed Bloom filters. So this is a one time cost.

Time taken by 'git commit-graph write' with and w/o --changed-paths,
speed up in 'git log -- path' with computed Bloom filters (see a):-

-------------------------------------------------------------------------
| Repo        | w/o --changed-paths | with --changed-paths | Speed up   |
-------------------------------------------------------------------------
| git [3]     | 0.9 seconds         | 7 seconds            | 2x to 6x   |
| linux [4]   | 16 seconds          | 1 minute 8 seconds   | 2x to 6x   | 
| android [5] | 9 seconds           | 48 seconds           | 2x to 6x   |
| AzDo(see b) | 1 minute            | 5 minutes 2 seconds  | 10x to 30x |
-------------------------------------------------------------------------

a) We tested the performance of git log -- path with randomly chosen paths of
varying depths in each repo. The speed up depends on how deep the files are
in the hierarchy and how often a file has been touched in the commit history.

b) This internal repository has about 420k commits, 183k files distributed
across 34k folders, the size on disk is about 17 GiB. The most massive gains
on this repository were for files 6-12 levels deep in the tree.

c) These numbers were collected on a Windows machine, except for the linux repo
which was tested on a Linux machine.

Future Work (not included in the scope of this series)

  1. Supporting multiple path based revision walk
  2. Adopting it in git blame logic.
  3. Interactions with line log git log -L

Cheers!
Garima Singh

[1] https://lore.kernel.org/git/20181009193445.21908-1-szeder.dev@gmail.com/

[2] https://lore.kernel.org/git/61559c5b-546e-d61b-d2e1-68de692f5972@gmail.com/

[3] https://github.com/git/git

[4] https://github.com/torvalds/linux

[5] https://android.googlesource.com/platform/frameworks/base/

Cc: stolee@gmail.com, szeder.dev@gmail.com, jonathantanmy@google.com,
jeffhost@microsoft.com, me@ttaylorr.com, peff@peff.net, garimasigit@gmail.com,
jnareb@gmail.com, christian.couder@gmail.com, emilyshaffer@gmail.com,
gitster@pobox.com

commit-graph.c Outdated Show resolved Hide resolved
commit-graph.c Outdated Show resolved Hide resolved
bloom.c Outdated Show resolved Hide resolved
bloom.c Outdated Show resolved Hide resolved
revision.c Outdated Show resolved Hide resolved
revision.c Outdated Show resolved Hide resolved
bloom.c Outdated Show resolved Hide resolved
bloom.c Outdated Show resolved Hide resolved
bloom.c Show resolved Hide resolved
bloom.c Outdated Show resolved Hide resolved
bloom.c Show resolved Hide resolved
bloom.c Outdated Show resolved Hide resolved
bloom.c Show resolved Hide resolved
commit-graph.c Outdated Show resolved Hide resolved
@garimasi514 garimasi514 force-pushed the coreGit-bloomFilters branch 3 times, most recently from 55a84e0 to a522843 Compare December 20, 2019 18:24
builtin/commit-graph.c Outdated Show resolved Hide resolved
Makefile Show resolved Hide resolved
commit-graph.c Outdated Show resolved Hide resolved
Documentation/technical/commit-graph-format.txt Outdated Show resolved Hide resolved
Documentation/technical/commit-graph-format.txt Outdated Show resolved Hide resolved
Documentation/technical/commit-graph-format.txt Outdated Show resolved Hide resolved
commit-graph.c Show resolved Hide resolved
t/helper/test-read-graph.c Show resolved Hide resolved
bloom.c Outdated Show resolved Hide resolved
builtin/commit-graph.c Outdated Show resolved Hide resolved
Copy link

@jeffhostetler jeffhostetler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GH is being weird and routing my commets thru the review mechanism.

commit-graph.c Outdated Show resolved Hide resolved
commit-graph.c Outdated Show resolved Hide resolved
bloom.c Outdated
prev_index = 0;

filter->len = next_index - prev_index;
filter->data = (uint64_t *)(g->chunk_bloom_data + 8 * prev_index + 12);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mark this as a bug/todo for later.
This is going to cause problems for platforms (CPUs) that don't allow non-aligned access.
If we assume that g->chunk_bloom_data is an allocated pointer, malloc will give us a quad-word aligned pointer. Then adding a multiple of 8 for the prev_index, still keeps us quad-aligned. But the +12 won't be.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 10, 2020

This patch series was integrated into pu via git@3902d5a.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 11, 2020

This patch series was integrated into pu via git@4571811.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 12, 2020

On the Git mailing list, Taylor Blau wrote (reply to this):

Hi Stolee,

On Wed, Apr 08, 2020 at 11:51:14AM -0400, Derrick Stolee wrote:
> On 4/6/2020 12:59 PM, Garima Singh via GitGitGadget wrote:
> > Hey!
> >
> > The commit graph feature brought in a lot of performance improvements across
> > multiple commands. However, file based history continues to be a performance
> > pain point, especially in large repositories.
> >
> > Adopting changed path Bloom filters has been discussed on the list before,
> > and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr.
> > Derrick Stolee [1]. This series is based on Dr. Stolee's proof of concept in
> > [2]
> >
> > With the changes in this series, git users will be able to choose to write
> > Bloom filters to the commit-graph using the following command:
> >
> > 'git commit-graph write --changed-paths'
> >
> > Subsequent 'git log -- path' commands will use these computed Bloom filters
> > to decided which commits are worth exploring further to produce the history
> > of the provided path.
>
> I noticed Jakub was not CC'd on this email. Jakub: do you plan to re-review
> the new version? Or are you satisfied with the resolutions to your comments?
>
> Is anyone else planning to review this series?

I feel horribly that I've had this patch series sitting in my review
backlog for months and haven't gotten to it yet, especially because I
have such an interest in these patches and know that much care was taken
to prepare them.

I read through these patches over some coffee today at a cursory level.
The high-level approach makes sense to me, and the implementation looks
solid. I think that anything that does come up (see below) can be
addressed in 'next' rather than waiting longer on this series.

For what it's worth, I'm planning on starting to test this series in
some of our testing repositories at GitHub, and I'll report back on our
experience with some notes (and patches) should anything come up.

> I'm just wondering when we should take this series to cook in 'next' and
> start building things on top of it, such as "git blame" or "git log -L"
> improvements. While it cooks, any bugs or issues could be resolved with
> patches on top of this version. That would be my preference, anyway.

That would be my preference, too.

I noticed a few small things (mostly a couple of typos and other very
minor details). But, I'd much rather build on top of this series once it
has landed in 'next' than go to a fifth re-roll since there are many
patches involved.

I also noticed that you have already sent some patches in a separate
series that are based on this one, which would apply cleanly if this
series is merged into next.

I figure that this will also be helpful as I send some patches about
extra 'commit-graph write' options out of GitHub's fork, since they will
inevitably create merge conflicts if we both are targeting 'next'. So,
I figure that this approach will ease some maintainer burden ;-).

>
> What do you think, Junio?
>
> Thanks,
> -Stolee

Thanks,
Taylor

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 13, 2020

This patch series was integrated into pu via git@f569002.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 14, 2020

This patch series was integrated into pu via git@883792f.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 15, 2020

This patch series was integrated into pu via git@7ff5102.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 15, 2020

This patch series was integrated into pu via git@3734f87.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 20, 2020

This patch series was integrated into pu via git@474e17a.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 21, 2020

This patch series was integrated into pu via git@b0e6895.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 24, 2020

This patch series was integrated into pu via git@3cbecac.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 24, 2020

This patch series was integrated into next via git@34b35f4.

@gitgitgadget gitgitgadget bot added the next label Apr 24, 2020
@gitgitgadget
Copy link

gitgitgadget bot commented Apr 28, 2020

This patch series was integrated into pu via git@f9942b1.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 28, 2020

This patch series was integrated into pu via git@2f0149c.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 29, 2020

This patch series was integrated into pu via git@1a90f7b.

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 30, 2020

This patch series was integrated into pu via git@f386d57.

@gitgitgadget
Copy link

gitgitgadget bot commented May 1, 2020

This patch series was integrated into pu via git@9b6606f.

@gitgitgadget
Copy link

gitgitgadget bot commented May 1, 2020

This patch series was integrated into next via git@9b6606f.

@gitgitgadget
Copy link

gitgitgadget bot commented May 1, 2020

This patch series was integrated into master via git@9b6606f.

@gitgitgadget gitgitgadget bot added the master label May 1, 2020
@gitgitgadget gitgitgadget bot closed this May 1, 2020
@gitgitgadget
Copy link

gitgitgadget bot commented May 1, 2020

Closed via 9b6606f.

@@ -17,6 +17,9 @@ metadata, including:
- The parents of the commit, stored using positional references within
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Mon, Apr 06, 2020 at 04:59:49PM +0000, Garima Singh via GitGitGadget wrote:
> From: Garima Singh <garima.singh@microsoft.com>
> 
> Update the technical documentation for commit-graph-format with
> the formats for the Bloom filter index (BIDX) and Bloom filter
> data (BDAT) chunks. Write the computed Bloom filters information
> to the commit graph file using this format.
> 
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  .../technical/commit-graph-format.txt         |  30 +++++
>  commit-graph.c                                | 113 +++++++++++++++++-
>  commit-graph.h                                |   5 +
>  3 files changed, 147 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index a4f17441aed..de56f9f1efd 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -17,6 +17,9 @@ metadata, including:
>  - The parents of the commit, stored using positional references within
>    the graph file.
>  
> +- The Bloom filter of the commit carrying the paths that were changed between
> +  the commit and its first parent, if requested.
> +
>  These positional references are stored as unsigned 32-bit integers
>  corresponding to the array position within the list of commit OIDs. Due
>  to some special constants we use to track parents, we can store at most
> @@ -93,6 +96,33 @@ CHUNK DATA:
>        positions for the parents until reaching a value with the most-significant
>        bit on. The other bits correspond to the position of the last parent.
>  
> +  Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
> +    * The ith entry, BIDX[i], stores the number of 8-byte word blocks in all

This is inconsistent with the implementation: according to the code in
one of the previous patches these entries are simple byte offsets, not
8-byte word offsets, i.e. the combined size of all modified path
Bloom filters can be at most 2^32 bytes.

The commit-graph file can contain information about at most 2^31-1
commits.  This means that with that many commits each commit can have
a merely 2 byte Bloom filter on average.  When using 7 hashes we'd
need 10 bits per path, so in two bytes we could store only a single
path.

Clearly, using 4 byte index entries significantly lowers the max
number of commits that can be stored with modified path Bloom filters.
IMO every new chunk must support at least 2^31-1 commits.

> +      Bloom filters from commit 0 to commit i (inclusive) in lexicographic
> +      order. The Bloom filter for the i-th commit spans from BIDX[i-1] to
> +      BIDX[i] (plus header length), where BIDX[-1] is 0.
> +    * The BIDX chunk is ignored if the BDAT chunk is not present.
> +
> +  Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
> +    * It starts with header consisting of three unsigned 32-bit integers:
> +      - Version of the hash algorithm being used. We currently only support
> +	value 1 which corresponds to the 32-bit version of the murmur3 hash
> +	implemented exactly as described in
> +	https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
> +	hashing technique using seed values 0x293ae76f and 0x7e646e2 as
> +	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
> +	in Probabilistic Verification"

How should double hashing compute the k hashes, i.e. using 64 bit or
32 bit unsigned integer arithmetic?

I'm puzzled that you link to this paper and still use double hashing.

Two of the contributions of that paper are that it points out some
shortcomings of the double hashing scheme and provides a better
alternative in the form of enhanced double hashing, which can cut the
false positive rate in half.

However, that paper considers the hashing scheme only in the context
of one big Bloom filter.  I've found that when it comes to many small
Bloom filters then the k hashes produced by any double hashing variant
are not independent enough, and "standard" double hashing fares the
worst among them.  There are real repositories out there where double
hashing has over an order of magnitude higher average false positive
rate than enhanced double hashing.  Though that's not to say that
enhanced double hashing is good...

For details on these issues see

  https://public-inbox.org/git/20200529085038.26008-16-szeder.dev@gmail.com

> +      - The number of times a path is hashed and hence the number of bit positions
> +	      that cumulatively determine whether a file is present in the commit.
> +      - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
> +	      contains 'n' entries, then the filter size is the minimum number of 64-bit
> +	      words that contain n*b bits.

Since the ideal number of bits per element depends only on the number
of hashes per path (k / ln(2) ≈ k * 10 / 7), why is this value stored
in the commit-graph?

> +    * The rest of the chunk is the concatenation of all the computed Bloom
> +      filters for the commits in lexicographic order.
> +    * Note: Commits with no changes or more than 512 changes have Bloom filters
> +      of length zero.

What does this "Note:" prefix mean in the file format specification?

Can an implementation use a one byte Bloom filter with no bits set for
a commit with no changes?  Can an implementation still store a Bloom
filter for commits that modify more than 512 paths?

> +    * The BDAT chunk is present if and only if BIDX is present.
> +
>    Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
>        This list of H-byte hashes describe a set of B commit-graph files that
>        form a commit-graph chain. The graph position for the ith commit in this
> diff --git a/commit-graph.c b/commit-graph.c
> index 732c81fa1b2..a8b6b5cca5d 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c

> @@ -1034,6 +1071,59 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
>  	}
>  }
>  
> +static void write_graph_chunk_bloom_indexes(struct hashfile *f,
> +					    struct write_commit_graph_context *ctx)
> +{
> +	struct commit **list = ctx->commits.list;
> +	struct commit **last = ctx->commits.list + ctx->commits.nr;
> +	uint32_t cur_pos = 0;
> +	struct progress *progress = NULL;
> +	int i = 0;
> +
> +	if (ctx->report_progress)
> +		progress = start_delayed_progress(
> +			_("Writing changed paths Bloom filters index"),
> +			ctx->commits.nr);
> +
> +	while (list < last) {
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> +		cur_pos += filter->len;

Given a sufficiently large number of commits with large enough Bloom
filters this will silently overflow.

> +		display_progress(progress, ++i);
> +		hashwrite_be32(f, cur_pos);
> +		list++;
> +	}
> +
> +	stop_progress(&progress);
> +}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 5/29/2020 4:57 AM, SZEDER Gábor wrote:
> On Mon, Apr 06, 2020 at 04:59:49PM +0000, Garima Singh via GitGitGadget wrote:
>> From: Garima Singh <garima.singh@microsoft.com>
>>
>> Update the technical documentation for commit-graph-format with
>> the formats for the Bloom filter index (BIDX) and Bloom filter
>> data (BDAT) chunks. Write the computed Bloom filters information
>> to the commit graph file using this format.
>>
>> Helped-by: Derrick Stolee <dstolee@microsoft.com>
>> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
>> ---
>>  .../technical/commit-graph-format.txt         |  30 +++++
>>  commit-graph.c                                | 113 +++++++++++++++++-
>>  commit-graph.h                                |   5 +
>>  3 files changed, 147 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
>> index a4f17441aed..de56f9f1efd 100644
>> --- a/Documentation/technical/commit-graph-format.txt
>> +++ b/Documentation/technical/commit-graph-format.txt
>> @@ -17,6 +17,9 @@ metadata, including:
>>  - The parents of the commit, stored using positional references within
>>    the graph file.
>>  
>> +- The Bloom filter of the commit carrying the paths that were changed between
>> +  the commit and its first parent, if requested.
>> +
>>  These positional references are stored as unsigned 32-bit integers
>>  corresponding to the array position within the list of commit OIDs. Due
>>  to some special constants we use to track parents, we can store at most
>> @@ -93,6 +96,33 @@ CHUNK DATA:
>>        positions for the parents until reaching a value with the most-significant
>>        bit on. The other bits correspond to the position of the last parent.
>>  
>> +  Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
>> +    * The ith entry, BIDX[i], stores the number of 8-byte word blocks in all
> 
> This is inconsistent with the implementation: according to the code in
> one of the previous patches these entries are simple byte offsets, not
> 8-byte word offsets, i.e. the combined size of all modified path
> Bloom filters can be at most 2^32 bytes.

The documentation was fixed in 88093289cdc (Documentation: changed-path Bloom
filters use byte words, 2020-05-11).

> The commit-graph file can contain information about at most 2^31-1
> commits.  This means that with that many commits each commit can have
> a merely 2 byte Bloom filter on average.  When using 7 hashes we'd
> need 10 bits per path, so in two bytes we could store only a single
> path.
> 
> Clearly, using 4 byte index entries significantly lowers the max
> number of commits that can be stored with modified path Bloom filters.

This is a good point, and certainly the reason for 8-byte multiples.

> IMO every new chunk must support at least 2^31-1 commits.

I'm not sure this is a valid requirement. Even extremely large repositories
(that are created by actual use, not synthetic) are on the scale of 2^24
commits.

You are right that we should make the commit-graph write process more robust
to reaching these limits. You point out that we have a new limit when these
filters are enabled.

For reference, the Windows OS repo has ~4.25 million commits and the
commit-graph file with changed-path Bloom filters is around 520mb. That's
the whole file size, and without the filters it's around 240mb, so the
filters are taking <300mb ~ 2^29 and we would need to grow the repo by 8x
to hit this limit. That's not an unreasonable amount of growth, but is
also far enough away that we can handle it in time.

The incremental commit-graph can actually save us here (and is similar to
how we solved a scale issue in Azure Repos around the multi-pack-index):
we can refuse to merge layers of an incremental commit-graph if the
changed-path filters would exceed the size limit. Of course, the _first_
write of such a commit-graph would need to be aware of this limit and
plan for it in advance, but that's also a theoretical issue.

I'm tracking some follow-up work [1] for the changed-path filters,
including a way to limit the number of filters computed in one
"git commit-graph write" process. I'll make note of your concerns here,
too.

[1] https://github.com/microsoft/git/issues/272

>> +      Bloom filters from commit 0 to commit i (inclusive) in lexicographic
>> +      order. The Bloom filter for the i-th commit spans from BIDX[i-1] to
>> +      BIDX[i] (plus header length), where BIDX[-1] is 0.
>> +    * The BIDX chunk is ignored if the BDAT chunk is not present.
>> +
>> +  Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
>> +    * It starts with header consisting of three unsigned 32-bit integers:
>> +      - Version of the hash algorithm being used. We currently only support
>> +	value 1 which corresponds to the 32-bit version of the murmur3 hash
>> +	implemented exactly as described in
>> +	https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
>> +	hashing technique using seed values 0x293ae76f and 0x7e646e2 as
>> +	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
>> +	in Probabilistic Verification"
> 
> How should double hashing compute the k hashes, i.e. using 64 bit or
> 32 bit unsigned integer arithmetic?
> 
> I'm puzzled that you link to this paper and still use double hashing.
> 
> Two of the contributions of that paper are that it points out some
> shortcomings of the double hashing scheme and provides a better
> alternative in the form of enhanced double hashing, which can cut the
> false positive rate in half.
> 
> However, that paper considers the hashing scheme only in the context
> of one big Bloom filter.  I've found that when it comes to many small
> Bloom filters then the k hashes produced by any double hashing variant
> are not independent enough, and "standard" double hashing fares the
> worst among them.  There are real repositories out there where double
> hashing has over an order of magnitude higher average false positive
> rate than enhanced double hashing.  Though that's not to say that
> enhanced double hashing is good...
> 
> For details on these issues see
> 
>   https://public-inbox.org/git/20200529085038.26008-16-szeder.dev@gmail.com

That message includes very detailed experimental analysis, which is nice.
We will need to do some concrete side-by-side comparisons to see if there
actually is a meaningful difference. (You may have already done this.)

>> +      - The number of times a path is hashed and hence the number of bit positions
>> +	      that cumulatively determine whether a file is present in the commit.
>> +      - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
>> +	      contains 'n' entries, then the filter size is the minimum number of 64-bit
>> +	      words that contain n*b bits.
> 
> Since the ideal number of bits per element depends only on the number
> of hashes per path (k / ln(2) ≈ k * 10 / 7), why is this value stored
> in the commit-graph?

The ideal number depends also on what false-positive rate you want. In a
hypothetical future where we want to allow customization here, we want
the filters to be consistently sized across all filters.

>> +    * The rest of the chunk is the concatenation of all the computed Bloom
>> +      filters for the commits in lexicographic order.
>> +    * Note: Commits with no changes or more than 512 changes have Bloom filters
>> +      of length zero.
> 
> What does this "Note:" prefix mean in the file format specification?
> 
> Can an implementation use a one byte Bloom filter with no bits set for
> a commit with no changes?  Can an implementation still store a Bloom
> filter for commits that modify more than 512 paths?

This is currently due to a hard-coded value in the implementation. It's not a
requirement of the file format.

>> +    * The BDAT chunk is present if and only if BIDX is present.
>> +
>>    Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
>>        This list of H-byte hashes describe a set of B commit-graph files that
>>        form a commit-graph chain. The graph position for the ith commit in this
>> diff --git a/commit-graph.c b/commit-graph.c
>> index 732c81fa1b2..a8b6b5cca5d 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
> 
>> @@ -1034,6 +1071,59 @@ static void write_graph_chunk_extra_edges(struct hashfile *f,
>>  	}
>>  }
>>  
>> +static void write_graph_chunk_bloom_indexes(struct hashfile *f,
>> +					    struct write_commit_graph_context *ctx)
>> +{
>> +	struct commit **list = ctx->commits.list;
>> +	struct commit **last = ctx->commits.list + ctx->commits.nr;
>> +	uint32_t cur_pos = 0;
>> +	struct progress *progress = NULL;
>> +	int i = 0;
>> +
>> +	if (ctx->report_progress)
>> +		progress = start_delayed_progress(
>> +			_("Writing changed paths Bloom filters index"),
>> +			ctx->commits.nr);
>> +
>> +	while (list < last) {
>> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
>> +		cur_pos += filter->len;
> 
> Given a sufficiently large number of commits with large enough Bloom
> filters this will silently overflow.

Worth fixing, but we are not in a rush. I noted it in my GitHub issue.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Fri, May 29, 2020 at 09:35:17AM -0400, Derrick Stolee wrote:
> >> +  Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
> >> +    * The ith entry, BIDX[i], stores the number of 8-byte word blocks in all
> > 
> > This is inconsistent with the implementation: according to the code in
> > one of the previous patches these entries are simple byte offsets, not
> > 8-byte word offsets, i.e. the combined size of all modified path
> > Bloom filters can be at most 2^32 bytes.
> 
> The documentation was fixed in 88093289cdc (Documentation: changed-path Bloom
> filters use byte words, 2020-05-11).

Oh, good.  I'm waaay behind the curve and haven't seen this fix.  Even
better, now I also noticed that two bugs I was about to report have
been fixed already (though both fixes have minor flaws).

Ok, so at least the specs are consistent with the implementation.  I'm
not sure this was done in the right direction, though, because too
small Bloom filters do hurt performance.

> > Clearly, using 4 byte index entries significantly lowers the max
> > number of commits that can be stored with modified path Bloom filters.
> 
> This is a good point, and certainly the reason for 8-byte multiples.

Note that Bloom filters with power-of-two number of bits have higher
false positive probabilities when using some form of double hashing.
When going for 8 byte blocks all commits modifying <= 12 paths
(assuming 7 hashes per path) will have power-of-2 sized Bloom filters
(64 or 128 bits), and that is a lot of commits.

> The incremental commit-graph can actually save us here

Oh, I haven't thought of that

> >> +      Bloom filters from commit 0 to commit i (inclusive) in lexicographic
> >> +      order. The Bloom filter for the i-th commit spans from BIDX[i-1] to
> >> +      BIDX[i] (plus header length), where BIDX[-1] is 0.
> >> +    * The BIDX chunk is ignored if the BDAT chunk is not present.
> >> +
> >> +  Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
> >> +    * It starts with header consisting of three unsigned 32-bit integers:
> >> +      - Version of the hash algorithm being used. We currently only support
> >> +	value 1 which corresponds to the 32-bit version of the murmur3 hash
> >> +	implemented exactly as described in
> >> +	https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
> >> +	hashing technique using seed values 0x293ae76f and 0x7e646e2 as
> >> +	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
> >> +	in Probabilistic Verification"
> > 
> > How should double hashing compute the k hashes, i.e. using 64 bit or
> > 32 bit unsigned integer arithmetic?

Note that this should be clarified in the specs.

> >> +      - The number of times a path is hashed and hence the number of bit positions
> >> +	      that cumulatively determine whether a file is present in the commit.
> >> +      - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
> >> +	      contains 'n' entries, then the filter size is the minimum number of 64-bit
> >> +	      words that contain n*b bits.
> > 
> > Since the ideal number of bits per element depends only on the number
> > of hashes per path (k / ln(2) ≈ k * 10 / 7), why is this value stored
> > in the commit-graph?
> 
> The ideal number depends also on what false-positive rate you want.

Well, yes, but indirectly:  according to Wikipedia :) the optimal
number of hashes per element depends only on the desired false
probability, and the optimal number of bits per element depends only
on the number of hashes per element.

So storing the min number of bits per entry seems to be redundant.

> In a
> hypothetical future where we want to allow customization here, we want
> the filters to be consistently sized across all filters.

Wouldn't customizing through the number of hashes be sufficient?

> >> +    * Note: Commits with no changes or more than 512 changes have Bloom filters
> >> +      of length zero.
> > 
> > What does this "Note:" prefix mean in the file format specification?
> > 
> > Can an implementation use a one byte Bloom filter with no bits set for
> > a commit with no changes?  Can an implementation still store a Bloom
> > filter for commits that modify more than 512 paths?
> 
> This is currently due to a hard-coded value in the implementation. It's not a
> requirement of the file format.

Should an implementation detail like that be part of the specs?  It
sure caused a bit of confusion here.

@@ -57,6 +57,11 @@ or `--stdin-packs`.)
With the `--append` option, include all commits that are present in the
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Mon, Apr 06, 2020 at 04:59:51PM +0000, Garima Singh via GitGitGadget wrote:
> From: Garima Singh <garima.singh@microsoft.com>
> 
> Add --changed-paths option to git commit-graph write. This option will
> allow users to compute information about the paths that have changed
> between a commit and its first parent, and write it into the commit graph
> file. If the option is passed to the write subcommand we set the
> COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the
> commit-graph logic.
> 
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  Documentation/git-commit-graph.txt | 5 +++++
>  builtin/commit-graph.c             | 9 +++++++--
>  2 files changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
> index 28d1fee5053..f4b13c005b8 100644
> --- a/Documentation/git-commit-graph.txt
> +++ b/Documentation/git-commit-graph.txt
> @@ -57,6 +57,11 @@ or `--stdin-packs`.)
>  With the `--append` option, include all commits that are present in the
>  existing commit-graph file.
>  +
> +With the `--changed-paths` option, compute and write information about the
> +paths changed between a commit and it's first parent. This operation can
> +take a while on large repositories. It provides significant performance gains
> +for getting history of a directory or a file with `git log -- <path>`.

So 'git commit-graph write' only computes and writes changed path
Bloom filters if this option is specified.  Though not mentioned in
the documentation or in the commit message, the negated
'--no-changed-paths' is supported as well, and it removes Bloom
filters from the commit-graph file.  All this is quite reasonable.

However, the most important question is what happens when the
commit-graph file already contains Bloom filters and neither of these
options are specified on the command line.  This isn't mentioned in
the docs or in the commit message, either, but as it is implemented in
this patch (i.e. COMMIT_GRAPH_WRITE_BLOOM_FILTERS is not passed from
the builtin to the commit-graph logic) all those existing Bloom
filters are removed from the commit-graph.  Considering how expensive
it was to compute those Bloom filters this might not be the most
desirable behaviour.

This is important, because 'git commit-graph write' is not the only
command that writes the commit-graph file.  'git gc' does that by
default, too, and will wipe out any modified path Bloom filters while
doing so.  Worse, the user doesn't even have to invoke 'git gc'
manually, because a lot of git commands invoke 'git gc --auto'.

  $ git commit-graph write --reachable --changed-paths
  $ ~/src/git/t/helper/test-tool read-graph |grep ^chunks
  chunks: oid_fanout oid_lookup commit_metadata bloom_indexes bloom_data
  $ git gc --quiet 
  $ ~/src/git/t/helper/test-tool read-graph |grep ^chunks
  chunks: oid_fanout oid_lookup commit_metadata

Consequently, if users want to use modified path Bloom filters, then
they should avoid gc, both manual and auto, or they'll have to
re-generate the Bloom filters every once in a while.  That is
definitely not the desired behaviour.


Now compare this e.g. to the behaviour of 'git update-index
--split-index' and '--untracked-cache': both of these options turn on
features that improve performance and write extra stuff to the index,
and after they did so all subsequent git commands updating the index
will keep writing that extra stuff, including 'git update-index'
itself even without those options, until it's finally invoked with the
corresponding '--no-...' option.  I particularly like how
'--[no-]untracked-cache' and 'core.untrackedCache' work together and
warn when the given command line option goes against the configured
value, and I think the command line options and configuration
variables controlling modified path Bloom filters should behave
similarly.

>  With the `--split` option, write the commit-graph as a chain of multiple
>  commit-graph files stored in `<dir>/info/commit-graphs`. The new commits
>  not already in the commit-graph are added in a new "tip" file. This file
> diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
> index d1ab6625f63..cacb5d04a80 100644
> --- a/builtin/commit-graph.c
> +++ b/builtin/commit-graph.c
> @@ -9,7 +9,7 @@
>  
>  static char const * const builtin_commit_graph_usage[] = {
>  	N_("git commit-graph verify [--object-dir <objdir>] [--shallow] [--[no-]progress]"),
> -	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
> +	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
>  	NULL
>  };
>  
> @@ -19,7 +19,7 @@ static const char * const builtin_commit_graph_verify_usage[] = {
>  };
>  
>  static const char * const builtin_commit_graph_write_usage[] = {
> -	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--[no-]progress] <split options>"),
> +	N_("git commit-graph write [--object-dir <objdir>] [--append|--split] [--reachable|--stdin-packs|--stdin-commits] [--changed-paths] [--[no-]progress] <split options>"),
>  	NULL
>  };
>  
> @@ -32,6 +32,7 @@ static struct opts_commit_graph {
>  	int split;
>  	int shallow;
>  	int progress;
> +	int enable_changed_paths;
>  } opts;
>  
>  static struct object_directory *find_odb(struct repository *r,
> @@ -135,6 +136,8 @@ static int graph_write(int argc, const char **argv)
>  			N_("start walk at commits listed by stdin")),
>  		OPT_BOOL(0, "append", &opts.append,
>  			N_("include all commits already in the commit-graph file")),
> +		OPT_BOOL(0, "changed-paths", &opts.enable_changed_paths,
> +			N_("enable computation for changed paths")),
>  		OPT_BOOL(0, "progress", &opts.progress, N_("force progress reporting")),
>  		OPT_BOOL(0, "split", &opts.split,
>  			N_("allow writing an incremental commit-graph file")),
> @@ -168,6 +171,8 @@ static int graph_write(int argc, const char **argv)
>  		flags |= COMMIT_GRAPH_WRITE_SPLIT;
>  	if (opts.progress)
>  		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
> +	if (opts.enable_changed_paths)
> +		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
>  
>  	read_replace_refs = 0;
>  	odb = find_odb(the_repository, opts.obj_dir);
> -- 
> gitgitgadget
> 

@@ -0,0 +1,255 @@
#include "git-compat-util.h"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Mon, Apr 06, 2020 at 04:59:50PM +0000, Garima Singh via GitGitGadget wrote:
> From: Garima Singh <garima.singh@microsoft.com>
> 
> Add logic to
> a) parse Bloom filter information from the commit graph file and,
> b) re-use existing Bloom filters.
> 
> See Documentation/technical/commit-graph-format for the format in which
> the Bloom filter information is written to the commit graph file.
> 
> To read Bloom filter for a given commit with lexicographic position
> 'i' we need to:
> 1. Read BIDX[i] which essentially gives us the starting index in BDAT for
>    filter of commit i+1. It is essentially the index past the end
>    of the filter of commit i. It is called end_index in the code.
> 
> 2. For i>0, read BIDX[i-1] which will give us the starting index in BDAT
>    for filter of commit i. It is called the start_index in the code.
>    For the first commit, where i = 0, Bloom filter data starts at the
>    beginning, just past the header in the BDAT chunk. Hence, start_index
>    will be 0.
> 
> 3. The length of the filter will be end_index - start_index, because
>    BIDX[i] gives the cumulative 8-byte words including the ith
>    commit's filter.
> 
> We toggle whether Bloom filters should be recomputed based on the
> compute_if_not_present flag.

A very important question is not discussed here: when should we
recompute Bloom filters?

> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  bloom.c               | 49 ++++++++++++++++++++++++++++++++++++++++++-
>  bloom.h               |  4 +++-
>  commit-graph.c        |  6 +++---
>  t/helper/test-bloom.c |  2 +-
>  4 files changed, 55 insertions(+), 6 deletions(-)
> 
> diff --git a/bloom.c b/bloom.c
> index a16eee92331..0f714dd76ae 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -4,6 +4,8 @@
>  #include "diffcore.h"
>  #include "revision.h"
>  #include "hashmap.h"
> +#include "commit-graph.h"
> +#include "commit.h"
>  
>  define_commit_slab(bloom_filter_slab, struct bloom_filter);
>  
> @@ -26,6 +28,36 @@ static inline unsigned char get_bitmask(uint32_t pos)
>  	return ((unsigned char)1) << (pos & (BITS_PER_WORD - 1));
>  }
>  
> +static int load_bloom_filter_from_graph(struct commit_graph *g,
> +				   struct bloom_filter *filter,
> +				   struct commit *c)
> +{
> +	uint32_t lex_pos, start_index, end_index;
> +
> +	while (c->graph_pos < g->num_commits_in_base)
> +		g = g->base_graph;
> +
> +	/* The commit graph commit 'c' lives in doesn't carry bloom filters. */
> +	if (!g->chunk_bloom_indexes)
> +		return 0;
> +
> +	lex_pos = c->graph_pos - g->num_commits_in_base;
> +
> +	end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);
> +
> +	if (lex_pos > 0)
> +		start_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1));
> +	else
> +		start_index = 0;
> +
> +	filter->len = end_index - start_index;
> +	filter->data = (unsigned char *)(g->chunk_bloom_data +
> +					sizeof(unsigned char) * start_index +
> +					BLOOMDATA_CHUNK_HEADER_SIZE);
> +
> +	return 1;
> +}
> +
>  /*
>   * Calculate the murmur3 32-bit hash value for the given data
>   * using the given seed.
> @@ -127,7 +159,8 @@ void init_bloom_filters(void)
>  }
>  
>  struct bloom_filter *get_bloom_filter(struct repository *r,
> -				      struct commit *c)
> +				      struct commit *c,
> +					  int compute_if_not_present)
>  {
>  	struct bloom_filter *filter;
>  	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;

This line in the hunk context sets the default parameters with which
this process will compute any new changed path Bloom filters.

Note that this is not the settings instance that eventually gets
written to the header of the Bloom filters chunk:
write_commit_graph_file() has its own 'struct bloom_filter_settings'
instance, and that's the one that goes into the chunk header.

> @@ -140,6 +173,20 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  
>  	filter = bloom_filter_slab_at(&bloom_filters, c);
>  
> +	if (!filter->data) {
> +		load_commit_graph_info(r, c);
> +		if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
> +			r->objects->commit_graph->chunk_bloom_indexes) {
> +			if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
> +				return filter;
> +			else
> +				return NULL;
> +		}
> +	}

And in the above conditions we try to load the existing Bloom filter
for the given commit and return it as-is for reuse if it already
exists, or go on to compute a new Bloom filter with the parameters set
at the beginning of the function.

Unfortunately, the parameters used to compute the now reused Bloom
filters are not checked anywhere.  In fact this writing process
entirely ignores all parameters in the header of the existing Bloom
filters chunk, and simply replaces them with the default parameters
hard-coded in write_commit_graph_file().  Consequently, we can end up
with Bloom filters computed with different parameters in the same
commit-graph file, which, in turn, can result in commits omitted from
the output of pathspec-limited revision walks.

The makeshift (there is no way to override those hard-coded defaults)
tests below demonstrate this issue.

This issue raises a good couple of questions:

  - What should we do when updating a commit-graph that was written
    with different Bloom filter parameters than our hardcoded
    defaults?

    Reusing the exising Bloom filters is clearly wrong.  Throwing away
    all existing Bloom filters and recomputing them with our defaults
    parameters doesn't seem to be good option, because that's a
    considerable amount of work, and the user might have a reason to
    chose those parameters.

  - What should we do when updating a commit-graph that was written
    with different Bloom filter parameters than specified by the user
    on the command line or in the config?

    Wipe out the old Bloom filters and recompute with new parameters,
    spending considerable time in bigger repositories?  Or stop with a
    warning about the different parameters (maybe it's just a typo),
    and require '--force'?
    
    Dunno, and we don't have such options and configuration yet
    anyway.

  - What about split commit-graphs?

    When split commit-graphs were introduced there was not a single
    chunk that had its own header.  Now the Bloom filters chunk does
    have a header, which leads to other questions:

    - Should that Bloom filters header be included in every split
      commit-graph?

      Not sure, but I suppose that having a header in each split
      commit-graph file would make loading and parsing that chunk a
      bit simpler, because all of them should be parsed the same way.
      Anyway, I think the specs should be explicit about it.  But...:

    - Should we allow different parameters in the Bloom filter chunks
      in each split commit-graph?

      The point of split commit-graphs is to avoid the overhead of
      re-writing the whole commit-graph file every time new commits
      are added, and it's crucial that both writing and merging split
      commit-graph files are cheap.  However, split commit-graph files
      using different Bloom filter parameters can't be merged without
      recomputing those Bloom filters, making merging quite expensive.

      So I don't think that it's a good idea to allow different Bloom
      filter parameters in split commit-graphs.  But then perhaps it
      would be better not to have a Bloom filter chunk header in all
      split commit-graph files after all.

    In any case, the last test below shows that the Bloom filter
    parameters are only read from the header of the most recent split
    commit-graph file.


  ---  >8  ---

#!/bin/sh

test_description='test'

. ./test-lib.sh

test_expect_success 'yuckiest setup ever!' '
	(
		cd "$GIT_BUILD_DIR" &&

		# The number of hashes per path cannot be configured
		# at runtime, so build a dedicated git binary that
		# writes Bloom filters using only 6 hashes per path.
		sed -i -e "/DEFAULT_BLOOM_FILTER_SETTINGS/ s/7/6/" bloom.h &&
		make -j4 git &&
		cp git git6 &&

		# Revert, rebuild.
		sed -i -e "/DEFAULT_BLOOM_FILTER_SETTINGS/ s/6/7/" bloom.h &&
		make -j4 git
	) &&
	git6="$GIT_BUILD_DIR"/git6
'

test_expect_success 'setup' '
	# We need a filename whose 7th hash maps to a different bit
	# position than any of its first 6 hashes in a 2-byte Bloom
	# filter.
	file=File &&

	test_tick &&
	git commit --allow-empty -m initial &&
	echo 1 >$file &&
	git add $file &&
	git commit -m one $file &&
	echo 2 >$file &&
	git commit -m two $file &&

	git log --oneline -- $file >expect
'

test_expect_success 'can read Bloom filters with different parameters' '
	test_when_finished "rm -rfv .git/objects/info/commit-graph*" &&

	# Write a commit-graph with Bloom filters using only 6 hashes
	# per path.
	"$git6" commit-graph write --reachable --changed-paths &&

	# Try pathspec-limited revision walk with the git binary writing
	# Bloom filters using 7 hashes: it still works, because no matter
	# how many hashes it would use when writing the commit-graph, the
	# reader part respects the nr of hashes stored in the
	# commit-graph file.  So far so good.
	git log --oneline $file >actual &&
	test_cmp expect actual
'

test_expect_failure 'commit-graph write does not reuse Bloom filters with different parameters' '
	test_when_finished "rm -rfv .git/objects/info/commit-graph*" &&

	# Write a commit-graph with Bloom filters using only 6 hashes
	# per path for a subset of commits.
	git rev-parse HEAD^ |
	"$git6" commit-graph write --stdin-commits --changed-paths &&

	# Add the rest of the commits to the commit-graph containing Bloom
	# filters using 6 hashes with a git version that writes Bloom
	# filters using 7 hashes.
	# Does it reuse the existing Bloom filters with 6 hashes?
	git commit-graph write --reachable --changed-paths &&

	# Yes, it does, because these report different filter data,
	# even though both commits modified the same file.
	test-tool bloom get_filter_for_commit $(git rev-parse HEAD^) &&
	test-tool bloom get_filter_for_commit $(git rev-parse HEAD) &&

	# Furthermore, it updated the Bloom filter chunk header as well,
	# which now stores that all Bloom filters use 7 hashes.
	# Consequently, the first commit whose Bloom filter was written
	# with only 6 hashes falls victim of a false negative, and is
	# omitted from the output.
	git log --oneline $file >actual &&
	test_cmp expect actual
'

test_expect_failure 'split commit-graphs and Bloom filters with different parameters' '
	test_when_finished "rm -rfv .git/objects/info/commit-graph*" &&

	git rev-parse HEAD^ |
	"$git6" commit-graph write --stdin-commits --changed-paths --split &&

	git commit-graph write --reachable --changed-paths --split=no-merge &&

	# To make sure that I test what I want, i.e. two commit-graphs
	# with one commit in each.  (Though "test-tool read-graph" is
	# utterly oblivious to split commit graphs...)
	test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain &&
	verbose test "$(test-tool read-graph |sed -n -e "s/^num_commits: //p")" = 1 &&

	test-tool bloom get_filter_for_commit $(git rev-parse HEAD^) &&
	test-tool bloom get_filter_for_commit $(git rev-parse HEAD) &&

	git log --oneline $file >actual &&
	test_cmp expect actual
'

test_done

  ---  8<  ---

> +	if (filter->data || !compute_if_not_present)
> +		return filter;
> +
>  	repo_diff_setup(r, &diffopt);
>  	diffopt.flags.recursive = 1;
>  	diffopt.max_changes = max_changes;
> diff --git a/bloom.h b/bloom.h
> index 85ab8e9423d..760d7122374 100644
> --- a/bloom.h
> +++ b/bloom.h
> @@ -32,6 +32,7 @@ struct bloom_filter_settings {
>  
>  #define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
>  #define BITS_PER_WORD 8
> +#define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t)
>  
>  /*
>   * A bloom_filter struct represents a data segment to
> @@ -79,6 +80,7 @@ void add_key_to_filter(const struct bloom_key *key,
>  void init_bloom_filters(void);
>  
>  struct bloom_filter *get_bloom_filter(struct repository *r,
> -				      struct commit *c);
> +				      struct commit *c,
> +				      int compute_if_not_present);
>  
>  #endif
> \ No newline at end of file
> diff --git a/commit-graph.c b/commit-graph.c
> index a8b6b5cca5d..77668629e27 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1086,7 +1086,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
>  			ctx->commits.nr);
>  
>  	while (list < last) {
> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>  		cur_pos += filter->len;
>  		display_progress(progress, ++i);
>  		hashwrite_be32(f, cur_pos);
> @@ -1115,7 +1115,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
>  	hashwrite_be32(f, settings->bits_per_entry);
>  
>  	while (list < last) {
> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>  		display_progress(progress, ++i);
>  		hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
>  		list++;
> @@ -1296,7 +1296,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  
>  	for (i = 0; i < ctx->commits.nr; i++) {
>  		struct commit *c = sorted_commits[i];
> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
>  		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
>  		display_progress(progress, i + 1);
>  	}
> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> index f18d1b722e1..ce412664ba9 100644
> --- a/t/helper/test-bloom.c
> +++ b/t/helper/test-bloom.c
> @@ -39,7 +39,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
>  	struct bloom_filter *filter;
>  	setup_git_directory();
>  	c = lookup_commit(the_repository, commit_oid);
> -	filter = get_bloom_filter(the_repository, c);
> +	filter = get_bloom_filter(the_repository, c, 1);
>  	print_bloom_filter(filter);
>  }
>  
> -- 
> gitgitgadget

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

SZEDER Gábor <szeder.dev@gmail.com> writes:

> Note that this is not the settings instance that eventually gets
> written to the header of the Bloom filters chunk:
> write_commit_graph_file() has its own 'struct bloom_filter_settings'
> instance, and that's the one that goes into the chunk header.
> ...
> Unfortunately, the parameters used to compute the now reused Bloom
> filters are not checked anywhere.  In fact this writing process
> entirely ignores all parameters in the header of the existing Bloom
> filters chunk, and simply replaces them with the default parameters
> hard-coded in write_commit_graph_file().  Consequently, we can end up
> with Bloom filters computed with different parameters in the same
> commit-graph file, which, in turn, can result in commits omitted from
> the output of pathspec-limited revision walks.

Yeah, the whole design seems quite broken and as you said later,
mixing other ingredients like split file would only make things
worse X-<.

> The makeshift (there is no way to override those hard-coded defaults)
> tests below demonstrate this issue.
>
> This issue raises a good couple of questions:
>
>   - What should we do when updating a commit-graph that was written
>     with different Bloom filter parameters than our hardcoded
>     defaults?
>
>     Reusing the exising Bloom filters is clearly wrong.  Throwing away
>     all existing Bloom filters and recomputing them with our defaults
>     parameters doesn't seem to be good option, because that's a
>     considerable amount of work, and the user might have a reason to
>     chose those parameters.
>
>   - What should we do when updating a commit-graph that was written
>     with different Bloom filter parameters than specified by the user
>     on the command line or in the config?
>
>     Wipe out the old Bloom filters and recompute with new parameters,
>     spending considerable time in bigger repositories?  Or stop with a
>     warning about the different parameters (maybe it's just a typo),
>     and require '--force'?
>     
>     Dunno, and we don't have such options and configuration yet
>     anyway.
>
>   - What about split commit-graphs?
>
>     When split commit-graphs were introduced there was not a single
>     chunk that had its own header.  Now the Bloom filters chunk does
>     have a header, which leads to other questions:
>
>     - Should that Bloom filters header be included in every split
>       commit-graph?
>
>       Not sure, but I suppose that having a header in each split
>       commit-graph file would make loading and parsing that chunk a
>       bit simpler, because all of them should be parsed the same way.
>       Anyway, I think the specs should be explicit about it.  But...:
>
>     - Should we allow different parameters in the Bloom filter chunks
>       in each split commit-graph?
>
>       The point of split commit-graphs is to avoid the overhead of
>       re-writing the whole commit-graph file every time new commits
>       are added, and it's crucial that both writing and merging split
>       commit-graph files are cheap.  However, split commit-graph files
>       using different Bloom filter parameters can't be merged without
>       recomputing those Bloom filters, making merging quite expensive.
>
>       So I don't think that it's a good idea to allow different Bloom
>       filter parameters in split commit-graphs.  But then perhaps it
>       would be better not to have a Bloom filter chunk header in all
>       split commit-graph files after all.
>
>     In any case, the last test below shows that the Bloom filter
>     parameters are only read from the header of the most recent split
>     commit-graph file.
>
>
>   ---  >8  ---
>
> #!/bin/sh
>
> test_description='test'
>
> . ./test-lib.sh
>
> test_expect_success 'yuckiest setup ever!' '
> 	(
> 		cd "$GIT_BUILD_DIR" &&
>
> 		# The number of hashes per path cannot be configured
> 		# at runtime, so build a dedicated git binary that
> 		# writes Bloom filters using only 6 hashes per path.
> 		sed -i -e "/DEFAULT_BLOOM_FILTER_SETTINGS/ s/7/6/" bloom.h &&
> 		make -j4 git &&
> 		cp git git6 &&
>
> 		# Revert, rebuild.
> 		sed -i -e "/DEFAULT_BLOOM_FILTER_SETTINGS/ s/6/7/" bloom.h &&
> 		make -j4 git
> 	) &&
> 	git6="$GIT_BUILD_DIR"/git6
> '
>
> test_expect_success 'setup' '
> 	# We need a filename whose 7th hash maps to a different bit
> 	# position than any of its first 6 hashes in a 2-byte Bloom
> 	# filter.
> 	file=File &&
>
> 	test_tick &&
> 	git commit --allow-empty -m initial &&
> 	echo 1 >$file &&
> 	git add $file &&
> 	git commit -m one $file &&
> 	echo 2 >$file &&
> 	git commit -m two $file &&
>
> 	git log --oneline -- $file >expect
> '
>
> test_expect_success 'can read Bloom filters with different parameters' '
> 	test_when_finished "rm -rfv .git/objects/info/commit-graph*" &&
>
> 	# Write a commit-graph with Bloom filters using only 6 hashes
> 	# per path.
> 	"$git6" commit-graph write --reachable --changed-paths &&
>
> 	# Try pathspec-limited revision walk with the git binary writing
> 	# Bloom filters using 7 hashes: it still works, because no matter
> 	# how many hashes it would use when writing the commit-graph, the
> 	# reader part respects the nr of hashes stored in the
> 	# commit-graph file.  So far so good.
> 	git log --oneline $file >actual &&
> 	test_cmp expect actual
> '
>
> test_expect_failure 'commit-graph write does not reuse Bloom filters with different parameters' '
> 	test_when_finished "rm -rfv .git/objects/info/commit-graph*" &&
>
> 	# Write a commit-graph with Bloom filters using only 6 hashes
> 	# per path for a subset of commits.
> 	git rev-parse HEAD^ |
> 	"$git6" commit-graph write --stdin-commits --changed-paths &&
>
> 	# Add the rest of the commits to the commit-graph containing Bloom
> 	# filters using 6 hashes with a git version that writes Bloom
> 	# filters using 7 hashes.
> 	# Does it reuse the existing Bloom filters with 6 hashes?
> 	git commit-graph write --reachable --changed-paths &&
>
> 	# Yes, it does, because these report different filter data,
> 	# even though both commits modified the same file.
> 	test-tool bloom get_filter_for_commit $(git rev-parse HEAD^) &&
> 	test-tool bloom get_filter_for_commit $(git rev-parse HEAD) &&
>
> 	# Furthermore, it updated the Bloom filter chunk header as well,
> 	# which now stores that all Bloom filters use 7 hashes.
> 	# Consequently, the first commit whose Bloom filter was written
> 	# with only 6 hashes falls victim of a false negative, and is
> 	# omitted from the output.
> 	git log --oneline $file >actual &&
> 	test_cmp expect actual
> '
>
> test_expect_failure 'split commit-graphs and Bloom filters with different parameters' '
> 	test_when_finished "rm -rfv .git/objects/info/commit-graph*" &&
>
> 	git rev-parse HEAD^ |
> 	"$git6" commit-graph write --stdin-commits --changed-paths --split &&
>
> 	git commit-graph write --reachable --changed-paths --split=no-merge &&
>
> 	# To make sure that I test what I want, i.e. two commit-graphs
> 	# with one commit in each.  (Though "test-tool read-graph" is
> 	# utterly oblivious to split commit graphs...)
> 	test_line_count = 2 .git/objects/info/commit-graphs/commit-graph-chain &&
> 	verbose test "$(test-tool read-graph |sed -n -e "s/^num_commits: //p")" = 1 &&
>
> 	test-tool bloom get_filter_for_commit $(git rev-parse HEAD^) &&
> 	test-tool bloom get_filter_for_commit $(git rev-parse HEAD) &&
>
> 	git log --oneline $file >actual &&
> 	test_cmp expect actual
> '
>
> test_done
>
>   ---  8<  ---
>
>> +	if (filter->data || !compute_if_not_present)
>> +		return filter;
>> +
>>  	repo_diff_setup(r, &diffopt);
>>  	diffopt.flags.recursive = 1;
>>  	diffopt.max_changes = max_changes;
>> diff --git a/bloom.h b/bloom.h
>> index 85ab8e9423d..760d7122374 100644
>> --- a/bloom.h
>> +++ b/bloom.h
>> @@ -32,6 +32,7 @@ struct bloom_filter_settings {
>>  
>>  #define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
>>  #define BITS_PER_WORD 8
>> +#define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t)
>>  
>>  /*
>>   * A bloom_filter struct represents a data segment to
>> @@ -79,6 +80,7 @@ void add_key_to_filter(const struct bloom_key *key,
>>  void init_bloom_filters(void);
>>  
>>  struct bloom_filter *get_bloom_filter(struct repository *r,
>> -				      struct commit *c);
>> +				      struct commit *c,
>> +				      int compute_if_not_present);
>>  
>>  #endif
>> \ No newline at end of file
>> diff --git a/commit-graph.c b/commit-graph.c
>> index a8b6b5cca5d..77668629e27 100644
>> --- a/commit-graph.c
>> +++ b/commit-graph.c
>> @@ -1086,7 +1086,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
>>  			ctx->commits.nr);
>>  
>>  	while (list < last) {
>> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
>> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>>  		cur_pos += filter->len;
>>  		display_progress(progress, ++i);
>>  		hashwrite_be32(f, cur_pos);
>> @@ -1115,7 +1115,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
>>  	hashwrite_be32(f, settings->bits_per_entry);
>>  
>>  	while (list < last) {
>> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
>> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>>  		display_progress(progress, ++i);
>>  		hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
>>  		list++;
>> @@ -1296,7 +1296,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>>  
>>  	for (i = 0; i < ctx->commits.nr; i++) {
>>  		struct commit *c = sorted_commits[i];
>> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
>> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
>>  		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
>>  		display_progress(progress, i + 1);
>>  	}
>> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
>> index f18d1b722e1..ce412664ba9 100644
>> --- a/t/helper/test-bloom.c
>> +++ b/t/helper/test-bloom.c
>> @@ -39,7 +39,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
>>  	struct bloom_filter *filter;
>>  	setup_git_directory();
>>  	c = lookup_commit(the_repository, commit_oid);
>> -	filter = get_bloom_filter(the_repository, c);
>> +	filter = get_bloom_filter(the_repository, c, 1);
>>  	print_bloom_filter(filter);
>>  }
>>  
>> -- 
>> gitgitgadget

@@ -0,0 +1,275 @@
#include "git-compat-util.h"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Mon, Apr 06, 2020 at 04:59:52PM +0000, Garima Singh via GitGitGadget wrote:
> +static void prepare_to_use_bloom_filter(struct rev_info *revs)
> +{
> +	struct pathspec_item *pi;
> +	char *path_alloc = NULL;
> +	const char *path;
> +	int last_index;
> +	int len;
> +
> +	if (!revs->commits)
> +	    return;
> +
> +	repo_parse_commit(revs->repo, revs->commits->item);
> +
> +	if (!revs->repo->objects->commit_graph)
> +		return;
> +
> +	revs->bloom_filter_settings = revs->repo->objects->commit_graph->bloom_filter_settings;
> +	if (!revs->bloom_filter_settings)
> +		return;
> +
> +	pi = &revs->pruning.pathspec.items[0];
> +	last_index = pi->len - 1;
> +
> +	/* remove single trailing slash from path, if needed */
> +	if (pi->match[last_index] == '/') {
> +	    path_alloc = xstrdup(pi->match);
> +	    path_alloc[last_index] = '\0';
> +	    path = path_alloc;

fill_bloom_key() takes a length parameter, so there is no need to
duplicate the path to be able to shorten it by one character to remove
that trailing '/'.

> +	} else
> +	    path = pi->match;
> +
> +	len = strlen(path);

'struct pathspec_item's 'len' field already contains the length of the
path, so there is no need for this strlen().

> +
> +	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
> +	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
> +
> +	free(path_alloc);
> +}

> @@ -3362,6 +3440,8 @@ int prepare_revision_walk(struct rev_info *revs)
>  				       FOR_EACH_OBJECT_PROMISOR_ONLY);
>  	}
>  
> +	if (revs->pruning.pathspec.nr == 1 && !revs->reflog_info)
> +		prepare_to_use_bloom_filter(revs);
>  	if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
>  		commit_list_sort_by_date(&revs->commits);
>  	if (revs->no_walk)
                return 0;
        if (revs->limited) {
                if (limit_list(revs) < 0)
                        return -1;

I extended the hunk context a bit to show that
prepare_to_use_bloom_filter() is called before limit_list().  This is
important, because specifying exclude revs and pathspecs, i.e.  'git
log ^v1.2.3 -- dir/file' does perform a lot of diffs in limit_list(),
and this way we can take advantage of Bloom filters even in this case.

> @@ -3379,6 +3459,7 @@ int prepare_revision_walk(struct rev_info *revs)
>  		simplify_merges(revs);
>  	if (revs->children.name)
>  		set_children(revs);
> +
>  	return 0;
>  }
>  
> diff --git a/revision.h b/revision.h
> index 475f048fb61..7c026fe41fc 100644
> --- a/revision.h
> +++ b/revision.h
> @@ -56,6 +56,8 @@ struct repository;
>  struct rev_info;
>  struct string_list;
>  struct saved_parents;
> +struct bloom_key;
> +struct bloom_filter_settings;
>  define_shared_commit_slab(revision_sources, char *);
>  
>  struct rev_cmdline_info {
> @@ -291,6 +293,15 @@ struct rev_info {
>  	struct revision_sources *sources;
>  
>  	struct topo_walk_info *topo_walk_info;
> +
> +	/* Commit graph bloom filter fields */
> +	/* The bloom filter key for the pathspec */
> +	struct bloom_key *bloom_key;
> +	/*
> +	 * The bloom filter settings used to generate the key.
> +	 * This is loaded from the commit-graph being used.
> +	 */
> +	struct bloom_filter_settings *bloom_filter_settings;
>  };
>  
>  int ref_excluded(struct string_list *, const char *path);
> -- 
> gitgitgadget
> 

@@ -0,0 +1,206 @@
#include "git-compat-util.h"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Mon, Apr 06, 2020 at 04:59:44PM +0000, Garima Singh via GitGitGadget wrote:
> From: Garima Singh <garima.singh@microsoft.com>
> 
> Add the core implementation for computing Bloom filters for
> the paths changed between a commit and it's first parent.
> 
> We fill the Bloom filters as (const char *data, int len) pairs
> as `struct bloom_filters" within a commit slab.
> 
> Filters for commits with no changes and more than 512 changes,
> is represented with a filter of length zero. There is no gain
> in distinguishing between a computed filter of length zero for
> a commit with no changes, and an uncomputed filter for new commits
> or for commits with more than 512 changes. The effect on
> `git log -- path` is the same in both cases. We will fall back to
> the normal diffing algorithm when we can't benefit from the
> existence of Bloom filters.
> 
> Helped-by: Jeff King <peff@peff.net>
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Reviewed-by: Jakub Narębski <jnareb@gmail.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  bloom.c               | 97 +++++++++++++++++++++++++++++++++++++++++++
>  bloom.h               |  8 ++++
>  t/helper/test-bloom.c | 20 +++++++++
>  t/t0095-bloom.sh      | 47 +++++++++++++++++++++
>  4 files changed, 172 insertions(+)
> 
> diff --git a/bloom.c b/bloom.c
> index 888b67f1ea6..881a9841ede 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -1,5 +1,18 @@
>  #include "git-compat-util.h"
>  #include "bloom.h"
> +#include "diff.h"
> +#include "diffcore.h"
> +#include "revision.h"
> +#include "hashmap.h"
> +
> +define_commit_slab(bloom_filter_slab, struct bloom_filter);

So here we define a commit slab for modified path Bloom filters, ...

> +struct bloom_filter_slab bloom_filters;
> +
> +struct pathmap_hash_entry {
> +    struct hashmap_entry entry;
> +    const char path[FLEX_ARRAY];
> +};
>  
>  static uint32_t rotate_left(uint32_t value, int32_t count)
>  {
> @@ -107,3 +120,87 @@ void add_key_to_filter(const struct bloom_key *key,
>  		filter->data[block_pos] |= get_bitmask(hash_mod);
>  	}
>  }
> +
> +void init_bloom_filters(void)
> +{
> +	init_bloom_filter_slab(&bloom_filters);

... here initialize the slab ...

> +}
> +
> +struct bloom_filter *get_bloom_filter(struct repository *r,
> +				      struct commit *c)
> +{
> +	struct bloom_filter *filter;
> +	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> +	int i;
> +	struct diff_options diffopt;
> +
> +	if (bloom_filters.slab_size == 0)
> +		return NULL;
> +
> +	filter = bloom_filter_slab_at(&bloom_filters, c);

... allocate an entry in the slab ...

> +
> +	repo_diff_setup(r, &diffopt);
> +	diffopt.flags.recursive = 1;
> +	diff_setup_done(&diffopt);
> +
> +	if (c->parents)
> +		diff_tree_oid(&c->parents->item->object.oid, &c->object.oid, "", &diffopt);
> +	else
> +		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
> +	diffcore_std(&diffopt);
> +
> +	if (diff_queued_diff.nr <= 512) {
> +		struct hashmap pathmap;
> +		struct pathmap_hash_entry *e;
> +		struct hashmap_iter iter;
> +		hashmap_init(&pathmap, NULL, NULL, 0);
> +
> +		for (i = 0; i < diff_queued_diff.nr; i++) {
> +			const char *path = diff_queued_diff.queue[i]->two->path;
> +
> +			/*
> +			* Add each leading directory of the changed file, i.e. for
> +			* 'dir/subdir/file' add 'dir' and 'dir/subdir' as well, so
> +			* the Bloom filter could be used to speed up commands like
> +			* 'git log dir/subdir', too.
> +			*
> +			* Note that directories are added without the trailing '/'.
> +			*/
> +			do {
> +				char *last_slash = strrchr(path, '/');
> +
> +				FLEX_ALLOC_STR(e, path, path);
> +				hashmap_entry_init(&e->entry, strhash(path));
> +				hashmap_add(&pathmap, &e->entry);
> +
> +				if (!last_slash)
> +					last_slash = (char*)path;
> +				*last_slash = '\0';
> +
> +			} while (*path);
> +
> +			diff_free_filepair(diff_queued_diff.queue[i]);
> +		}
> +
> +		filter->len = (hashmap_get_size(&pathmap) * settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
> +		filter->data = xcalloc(filter->len, sizeof(unsigned char));

... and here we fill the slab with data, including a memory allocation
for each slab entry.

What is missing in this patch or in any of the followup patches is a
place where we clear the slab and the additional memory attached to
it.

> +
> +		hashmap_for_each_entry(&pathmap, &iter, e, entry) {
> +			struct bloom_key key;
> +			fill_bloom_key(e->path, strlen(e->path), &key, &settings);
> +			add_key_to_filter(&key, filter, &settings);
> +		}
> +
> +		hashmap_free_entries(&pathmap, struct pathmap_hash_entry, entry);
> +	} else {
> +		for (i = 0; i < diff_queued_diff.nr; i++)
> +			diff_free_filepair(diff_queued_diff.queue[i]);
> +		filter->data = NULL;
> +		filter->len = 0;
> +	}
> +
> +	free(diff_queued_diff.queue);
> +	DIFF_QUEUE_CLEAR(&diff_queued_diff);
> +
> +	return filter;
> +}
> diff --git a/bloom.h b/bloom.h
> index b9ce422ca2d..85ab8e9423d 100644
> --- a/bloom.h
> +++ b/bloom.h
> @@ -1,6 +1,9 @@
>  #ifndef BLOOM_H
>  #define BLOOM_H
>  
> +struct commit;
> +struct repository;
> +
>  struct bloom_filter_settings {
>  	/*
>  	 * The version of the hashing technique being used.
> @@ -73,4 +76,9 @@ void add_key_to_filter(const struct bloom_key *key,
>  					   struct bloom_filter *filter,
>  					   const struct bloom_filter_settings *settings);
>  
> +void init_bloom_filters(void);
> +
> +struct bloom_filter *get_bloom_filter(struct repository *r,
> +				      struct commit *c);
> +
>  #endif
> \ No newline at end of file
> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> index 20460cde775..f18d1b722e1 100644
> --- a/t/helper/test-bloom.c
> +++ b/t/helper/test-bloom.c
> @@ -1,6 +1,7 @@
>  #include "git-compat-util.h"
>  #include "bloom.h"
>  #include "test-tool.h"
> +#include "commit.h"
>  
>  struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>  
> @@ -32,6 +33,16 @@ static void print_bloom_filter(struct bloom_filter *filter) {
>  	printf("\n");
>  }
>  
> +static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
> +{
> +	struct commit *c;
> +	struct bloom_filter *filter;
> +	setup_git_directory();
> +	c = lookup_commit(the_repository, commit_oid);
> +	filter = get_bloom_filter(the_repository, c);
> +	print_bloom_filter(filter);
> +}
> +
>  int cmd__bloom(int argc, const char **argv)
>  {
>  	if (!strcmp(argv[1], "get_murmur3")) {
> @@ -57,5 +68,14 @@ int cmd__bloom(int argc, const char **argv)
>  		print_bloom_filter(&filter);
>  	}
>  
> +    if (!strcmp(argv[1], "get_filter_for_commit")) {
> +		struct object_id oid;
> +		const char *end;
> +		if (parse_oid_hex(argv[2], &oid, &end))
> +			die("cannot parse oid '%s'", argv[2]);
> +		init_bloom_filters();
> +		get_bloom_filter_for_commit(&oid);
> +	}
> +
>  	return 0;
>  }
> \ No newline at end of file
> diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
> index 36a086c7c60..8f9eef116dc 100755
> --- a/t/t0095-bloom.sh
> +++ b/t/t0095-bloom.sh
> @@ -67,4 +67,51 @@ test_expect_success 'compute bloom key for test string 2' '
>  	test_cmp expect actual
>  '
>  
> +test_expect_success 'get bloom filters for commit with no changes' '
> +	git init &&
> +	git commit --allow-empty -m "c0" &&
> +	cat >expect <<-\EOF &&
> +	Filter_Length:0
> +	Filter_Data:
> +	EOF
> +	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'get bloom filter for commit with 10 changes' '
> +	rm actual &&
> +	rm expect &&
> +	mkdir smallDir &&
> +	for i in $(test_seq 0 9)
> +	do
> +		echo $i >smallDir/$i
> +	done &&
> +	git add smallDir &&
> +	git commit -m "commit with 10 changes" &&
> +	cat >expect <<-\EOF &&
> +	Filter_Length:25
> +	Filter_Data:82|a0|65|47|0c|92|90|c0|a1|40|02|a0|e2|40|e0|04|0a|9a|66|cf|80|19|85|42|23|
> +	EOF
> +	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
> +	rm actual &&
> +	rm expect &&
> +	mkdir bigDir &&
> +	for i in $(test_seq 0 512)
> +	do
> +		echo $i >bigDir/$i
> +	done &&
> +	git add bigDir &&
> +	git commit -m "commit with 513 changes" &&
> +	cat >expect <<-\EOF &&
> +	Filter_Length:0
> +	Filter_Data:
> +	EOF
> +	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
> +	test_cmp expect actual
> +'
> +
>  test_done
> \ No newline at end of file
> -- 
> gitgitgadget
> 

@@ -17,6 +17,9 @@ metadata, including:
- The parents of the commit, stored using positional references within
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

76ffbca71a (commit-graph: write Bloom filters to commit graph file,
2020-04-06) added two delayed progress lines to writing the Bloom
filter index and data chunk.  This is wrong, because a single common
progress is used while writing all chunks, which is not updated while
writing these two new chunks, resulting in incomplete-looking "done"
lines:

  Expanding reachable commits in commit graph: 888679, done.
  Computing commit changed paths Bloom filters: 100% (888678/888678), done.
  Writing out commit graph in 6 passes:  66% (3554712/5332068), done.

Use the common 'struct progress' instance while writing the Bloom
filter chunks as well.

Signed-off-by: SZEDER Gábor <szeder.dev@gmail.com>
---
 commit-graph.c | 22 ++--------------------
 1 file changed, 2 insertions(+), 20 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index aaf3327ede..65cf32637c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1086,23 +1086,14 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
 	uint32_t cur_pos = 0;
-	struct progress *progress = NULL;
-	int i = 0;
-
-	if (ctx->report_progress)
-		progress = start_delayed_progress(
-			_("Writing changed paths Bloom filters index"),
-			ctx->commits.nr);
 
 	while (list < last) {
 		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
 		cur_pos += filter->len;
-		display_progress(progress, ++i);
+		display_progress(ctx->progress, ++ctx->progress_cnt);
 		hashwrite_be32(f, cur_pos);
 		list++;
 	}
-
-	stop_progress(&progress);
 }
 
 static void write_graph_chunk_bloom_data(struct hashfile *f,
@@ -1111,13 +1102,6 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
 {
 	struct commit **list = ctx->commits.list;
 	struct commit **last = ctx->commits.list + ctx->commits.nr;
-	struct progress *progress = NULL;
-	int i = 0;
-
-	if (ctx->report_progress)
-		progress = start_delayed_progress(
-			_("Writing changed paths Bloom filters data"),
-			ctx->commits.nr);
 
 	hashwrite_be32(f, settings->hash_version);
 	hashwrite_be32(f, settings->num_hashes);
@@ -1125,12 +1109,10 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
 
 	while (list < last) {
 		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
-		display_progress(progress, ++i);
+		display_progress(ctx->progress, ++ctx->progress_cnt);
 		hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
 		list++;
 	}
-
-	stop_progress(&progress);
 }
 
 static int oid_compare(const void *_a, const void *_b)
-- 
2.27.0.547.g4ba2d26563

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/9/2020 1:00 PM, SZEDER Gábor wrote:
> 76ffbca71a (commit-graph: write Bloom filters to commit graph file,
> 2020-04-06) added two delayed progress lines to writing the Bloom
> filter index and data chunk.  This is wrong, because a single common
> progress is used while writing all chunks, which is not updated while
> writing these two new chunks, resulting in incomplete-looking "done"
> lines:
> 
>   Expanding reachable commits in commit graph: 888679, done.
>   Computing commit changed paths Bloom filters: 100% (888678/888678), done.
>   Writing out commit graph in 6 passes:  66% (3554712/5332068), done.
> 
> Use the common 'struct progress' instance while writing the Bloom
> filter chunks as well.

Thanks for finding this. It's a clearly correct way to go,
and is one of the things that did not get updated properly
between the old prototype when applying it on the new code
that included this ctx->progress pattern.

Junio: head's up that this will conflict with the final patch
in ds/maintenance. I'll remove my edits to these methods in
my v2 to make that merge a bit easier.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 7/9/2020 2:01 PM, Derrick Stolee wrote:
> Junio: head's up that this will conflict with the final patch
> in ds/maintenance. I'll remove my edits to these methods in
> my v2 to make that merge a bit easier.
Or, I'm getting confused, because I changed start_progress()
calls in midx.c, not commit-graph.c. Please ignore my scattered
brain.

Thanks,
-Stolee

@@ -0,0 +1,255 @@
#include "git-compat-util.h"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Mon, Apr 06, 2020 at 04:59:50PM +0000, Garima Singh via GitGitGadget wrote:
> From: Garima Singh <garima.singh@microsoft.com>
> 
> Add logic to
> a) parse Bloom filter information from the commit graph file and,
> b) re-use existing Bloom filters.
> 
> See Documentation/technical/commit-graph-format for the format in which
> the Bloom filter information is written to the commit graph file.
> 
> To read Bloom filter for a given commit with lexicographic position
> 'i' we need to:
> 1. Read BIDX[i] which essentially gives us the starting index in BDAT for
>    filter of commit i+1. It is essentially the index past the end
>    of the filter of commit i. It is called end_index in the code.
> 
> 2. For i>0, read BIDX[i-1] which will give us the starting index in BDAT
>    for filter of commit i. It is called the start_index in the code.
>    For the first commit, where i = 0, Bloom filter data starts at the
>    beginning, just past the header in the BDAT chunk. Hence, start_index
>    will be 0.
> 
> 3. The length of the filter will be end_index - start_index, because
>    BIDX[i] gives the cumulative 8-byte words including the ith
>    commit's filter.
> 
> We toggle whether Bloom filters should be recomputed based on the
> compute_if_not_present flag.
> 
> Helped-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  bloom.c               | 49 ++++++++++++++++++++++++++++++++++++++++++-
>  bloom.h               |  4 +++-
>  commit-graph.c        |  6 +++---
>  t/helper/test-bloom.c |  2 +-
>  4 files changed, 55 insertions(+), 6 deletions(-)
> 
> diff --git a/bloom.c b/bloom.c
> index a16eee92331..0f714dd76ae 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -4,6 +4,8 @@
>  #include "diffcore.h"
>  #include "revision.h"
>  #include "hashmap.h"
> +#include "commit-graph.h"
> +#include "commit.h"
>  
>  define_commit_slab(bloom_filter_slab, struct bloom_filter);
>  
> @@ -26,6 +28,36 @@ static inline unsigned char get_bitmask(uint32_t pos)
>  	return ((unsigned char)1) << (pos & (BITS_PER_WORD - 1));
>  }
>  
> +static int load_bloom_filter_from_graph(struct commit_graph *g,
> +				   struct bloom_filter *filter,
> +				   struct commit *c)
> +{
> +	uint32_t lex_pos, start_index, end_index;
> +
> +	while (c->graph_pos < g->num_commits_in_base)
> +		g = g->base_graph;
> +
> +	/* The commit graph commit 'c' lives in doesn't carry bloom filters. */
> +	if (!g->chunk_bloom_indexes)
> +		return 0;
> +
> +	lex_pos = c->graph_pos - g->num_commits_in_base;
> +
> +	end_index = get_be32(g->chunk_bloom_indexes + 4 * lex_pos);

Let's suppose that we encounter a bogus commit-graph file.  This would
then segfault if 'lex_pos' were to point past the end of file, i.e.
past the mmap()-ed memory region.

> +
> +	if (lex_pos > 0)
> +		start_index = get_be32(g->chunk_bloom_indexes + 4 * (lex_pos - 1));
> +	else
> +		start_index = 0;
> +
> +	filter->len = end_index - start_index;
> +	filter->data = (unsigned char *)(g->chunk_bloom_data +
> +					sizeof(unsigned char) * start_index +
> +					BLOOMDATA_CHUNK_HEADER_SIZE);

And this could lead to segfault later when accessing the Bloom filter
data if 'start_index' or 'end_index' were to point past EOF or
end_index < start_index.

IMO all indices and offsets read from the commit-graph file must be
checked to ensure that they fit in the corresponding chunk, like I did
in my modified path Bloom filters implementation.  However, I'm not
sure how it's best to handle an out-of-bounds offset...  Simply
erroring out in case of a bogus commit-graph file is the
straightforward possibility, of course, but since the commit-graph is
only an optimization, it would be better user experience to warn and
ignore it and finish the operation without the commit-graph (albeit
slower).  But is it even possible to ignore the commit-graph, say, in
the middle of a 'git rev-list --topo-order HEAD'?

> +	return 1;
> +}
> +
>  /*
>   * Calculate the murmur3 32-bit hash value for the given data
>   * using the given seed.
> @@ -127,7 +159,8 @@ void init_bloom_filters(void)
>  }
>  
>  struct bloom_filter *get_bloom_filter(struct repository *r,
> -				      struct commit *c)
> +				      struct commit *c,
> +					  int compute_if_not_present)
>  {
>  	struct bloom_filter *filter;
>  	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
> @@ -140,6 +173,20 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  
>  	filter = bloom_filter_slab_at(&bloom_filters, c);
>  
> +	if (!filter->data) {
> +		load_commit_graph_info(r, c);
> +		if (c->graph_pos != COMMIT_NOT_FROM_GRAPH &&
> +			r->objects->commit_graph->chunk_bloom_indexes) {
> +			if (load_bloom_filter_from_graph(r->objects->commit_graph, filter, c))
> +				return filter;
> +			else
> +				return NULL;
> +		}
> +	}
> +
> +	if (filter->data || !compute_if_not_present)
> +		return filter;
> +
>  	repo_diff_setup(r, &diffopt);
>  	diffopt.flags.recursive = 1;
>  	diffopt.max_changes = max_changes;
> diff --git a/bloom.h b/bloom.h
> index 85ab8e9423d..760d7122374 100644
> --- a/bloom.h
> +++ b/bloom.h
> @@ -32,6 +32,7 @@ struct bloom_filter_settings {
>  
>  #define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10 }
>  #define BITS_PER_WORD 8
> +#define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t)
>  
>  /*
>   * A bloom_filter struct represents a data segment to
> @@ -79,6 +80,7 @@ void add_key_to_filter(const struct bloom_key *key,
>  void init_bloom_filters(void);
>  
>  struct bloom_filter *get_bloom_filter(struct repository *r,
> -				      struct commit *c);
> +				      struct commit *c,
> +				      int compute_if_not_present);
>  
>  #endif
> \ No newline at end of file
> diff --git a/commit-graph.c b/commit-graph.c
> index a8b6b5cca5d..77668629e27 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -1086,7 +1086,7 @@ static void write_graph_chunk_bloom_indexes(struct hashfile *f,
>  			ctx->commits.nr);
>  
>  	while (list < last) {
> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>  		cur_pos += filter->len;
>  		display_progress(progress, ++i);
>  		hashwrite_be32(f, cur_pos);
> @@ -1115,7 +1115,7 @@ static void write_graph_chunk_bloom_data(struct hashfile *f,
>  	hashwrite_be32(f, settings->bits_per_entry);
>  
>  	while (list < last) {
> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list);
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, *list, 0);
>  		display_progress(progress, ++i);
>  		hashwrite(f, filter->data, filter->len * sizeof(unsigned char));
>  		list++;
> @@ -1296,7 +1296,7 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  
>  	for (i = 0; i < ctx->commits.nr; i++) {
>  		struct commit *c = sorted_commits[i];
> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, c);
> +		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
>  		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
>  		display_progress(progress, i + 1);
>  	}
> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> index f18d1b722e1..ce412664ba9 100644
> --- a/t/helper/test-bloom.c
> +++ b/t/helper/test-bloom.c
> @@ -39,7 +39,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid)
>  	struct bloom_filter *filter;
>  	setup_git_directory();
>  	c = lookup_commit(the_repository, commit_oid);
> -	filter = get_bloom_filter(the_repository, c);
> +	filter = get_bloom_filter(the_repository, c, 1);
>  	print_bloom_filter(filter);
>  }
>  
> -- 
> gitgitgadget
> 

@@ -0,0 +1,208 @@
#include "git-compat-util.h"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Mon, Apr 06, 2020 at 04:59:45PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> When computing the changed-paths bloom filters for the commit-graph,
> we limit the size of the filter by restricting the number of paths
> in the diff. Instead of computing a large diff and then ignoring the
> result, it is better to halt the diff computation early.
> 
> Create a new "max_changes" option in struct diff_options. If non-zero,
> then halt the diff computation after discovering strictly more changed
> paths. This includes paths corresponding to trees that change.
> 
> Use this max_changes option in the bloom filter calculations. This
> reduces the time taken to compute the filters for the Linux kernel
> repo from 2m50s to 2m35s. On a large internal repository with ~500
> commits that perform tree-wide changes, the time reduced from
> 6m15s to 3m48s.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> Signed-off-by: Garima Singh <garima.singh@microsoft.com>
> ---
>  bloom.c     | 4 +++-
>  diff.h      | 5 +++++
>  tree-diff.c | 6 ++++++
>  3 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/bloom.c b/bloom.c
> index 881a9841ede..a16eee92331 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -133,6 +133,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>  	int i;
>  	struct diff_options diffopt;
> +	int max_changes = 512;
>  
>  	if (bloom_filters.slab_size == 0)
>  		return NULL;
> @@ -141,6 +142,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  
>  	repo_diff_setup(r, &diffopt);
>  	diffopt.flags.recursive = 1;
> +	diffopt.max_changes = max_changes;
>  	diff_setup_done(&diffopt);
>  
>  	if (c->parents)
> @@ -149,7 +151,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
>  	diffcore_std(&diffopt);
>  
> -	if (diff_queued_diff.nr <= 512) {
> +	if (diff_queued_diff.nr <= max_changes) {
>  		struct hashmap pathmap;
>  		struct pathmap_hash_entry *e;
>  		struct hashmap_iter iter;
> diff --git a/diff.h b/diff.h
> index 6febe7e3656..9443dc1b003 100644
> --- a/diff.h
> +++ b/diff.h
> @@ -285,6 +285,11 @@ struct diff_options {
>  	/* Number of hexdigits to abbreviate raw format output to. */
>  	int abbrev;
>  
> +	/* If non-zero, then stop computing after this many changes. */
> +	int max_changes;
> +	/* For internal use only. */
> +	int num_changes;

"For internal use only", understood.

> +
>  	int ita_invisible_in_index;
>  /* white-space error highlighting */
>  #define WSEH_NEW (1<<12)
> diff --git a/tree-diff.c b/tree-diff.c
> index 33ded7f8b3e..f3d303c6e54 100644
> --- a/tree-diff.c
> +++ b/tree-diff.c
> @@ -434,6 +434,9 @@ static struct combine_diff_path *ll_diff_tree_paths(
>  		if (diff_can_quit_early(opt))
>  			break;
>  
> +		if (opt->max_changes && opt->num_changes > opt->max_changes)
> +			break;
> +
>  		if (opt->pathspec.nr) {
>  			skip_uninteresting(&t, base, opt);
>  			for (i = 0; i < nparent; i++)
> @@ -518,6 +521,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
>  
>  			/* t↓ */
>  			update_tree_entry(&t);
> +			opt->num_changes++;
>  		}
>  
>  		/* t > p[imin] */
> @@ -535,6 +539,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
>  		skip_emit_tp:
>  			/* ∀ pi=p[imin]  pi↓ */
>  			update_tp_entries(tp, nparent);
> +			opt->num_changes++;
>  		}
>  	}

This counter is basically broken, its value is wrong for over 98% of
commits, and, worse, its value remains 0 for over 85% of commits in
the repositories I usually use to test modified path Bloom filters.
Consequently, a relatively large number of commits modifying more than
512 paths get Bloom filters.

The makeshift tests in the patch below demonstrate these issues as
most of them fail, most notably those two tests that demonstrate that
modifying existing paths are not counted at all.


  ---  >8  ---

diff --git a/bloom.c b/bloom.c
index 9b86aa3f59..3db0fde734 100644
--- a/bloom.c
+++ b/bloom.c
@@ -203,7 +203,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	repo_diff_setup(r, &diffopt);
 	diffopt.flags.recursive = 1;
 	diffopt.detect_rename = 0;
-	diffopt.max_changes = max_changes;
+	diffopt.max_changes = 0;
 	diff_setup_done(&diffopt);
 
 	/* ensure commit is parsed so we have parent information */
@@ -214,6 +214,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	else
 		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
 	diffcore_std(&diffopt);
+	printf("%s  %d\n", oid_to_hex(&c->object.oid), diffopt.num_changes);
 
 	if (diffopt.num_changes <= max_changes) {
 		struct hashmap pathmap;
diff --git a/t/t9999-test.sh b/t/t9999-test.sh
new file mode 100755
index 0000000000..8d2bd9f03f
--- /dev/null
+++ b/t/t9999-test.sh
@@ -0,0 +1,142 @@
+#!/bin/sh
+
+test_description='test'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	test_tick &&
+
+	echo 1 >file &&
+	mkdir -p dir/subdir &&
+	echo 1 >dir/subdir/file1 &&
+	echo 1 >dir/subdir/file2 &&
+	git add file dir &&
+	git commit -m setup &&
+
+	echo 2 >file &&
+	git commit -a -m "modify one path in root" &&
+	mod_one_path=$(git rev-parse HEAD) &&
+
+	echo 2 >dir/subdir/file1 &&
+	echo 2 >dir/subdir/file2 &&
+	git commit -a -m "modify two file two dirs deep" &&
+	mod_four_paths=$(git rev-parse HEAD) &&
+
+	>new-file &&
+	git add new-file &&
+	git commit -m "add new file in root" &&
+	new_file_in_root=$(git rev-parse HEAD) &&
+
+	git rm new-file &&
+	git commit -m "delete file in root" &&
+	delete_file_in_root=$(git rev-parse HEAD) &&
+
+	>dir/new-file &&
+	git add dir/new-file &&
+	git commit -m "add new file in dir" &&
+	new_file_in_dir=$(git rev-parse HEAD) &&
+
+	git rm dir/new-file &&
+	git commit -m "delete file in dir" &&
+	delete_file_in_dir=$(git rev-parse HEAD) &&
+
+	echo 1 >d-f &&
+	git add d-f &&
+	git commit -m foo &&
+	git rm d-f &&
+	mkdir d-f &&
+	echo 2 >d-f/file &&
+	git add d-f &&
+	git commit -m "replace file with dir" &&
+	file_to_dir=$(git rev-parse HEAD) &&
+
+	>d-f.c &&
+	git add d-f.c &&
+	git commit -m "add a file that sorts between d-f and d-f/" &&
+	git rm -r d-f &&
+	echo 3 >d-f &&
+	git add d-f &&
+	git commit -m "replace dir with file" &&
+	dir_to_file=$(git rev-parse HEAD) &&
+
+	bin_sha1=$(git rev-parse HEAD:dir/subdir | hex2oct) &&
+	# leading zero in mode: the content of the tree remains the same,
+	# but its oid does change!
+	printf "040000 subdir\0$bin_sha1" >rawtree &&
+	tree1=$(git hash-object -t tree -w rawtree) &&
+	git cat-file -p HEAD^{tree} >out &&
+	tree2=$(sed -e "s/$(git rev-parse HEAD:dir/)/$tree1/" out |git mktree) &&
+	different_but_same_tree=$(git commit-tree \
+		-m "leading zeros in mode" \
+		-p $(git rev-parse HEAD) $tree2) &&
+	git update-ref HEAD $different_but_same_tree &&
+
+	git commit-graph write --reachable --changed-paths >out &&
+	cat out  # debug
+'
+
+test_expect_success 'modify one path in root' '
+	git diff --name-status $mod_one_path^ $mod_one_path &&
+	echo "$mod_one_path  1" >expect &&
+	grep "$mod_one_path" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'modify two file two dirs deep' '
+	git diff --name-status $mod_four_paths^ $mod_four_paths &&
+	echo "$mod_four_paths  4" >expect &&
+	grep "$mod_four_paths" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'add new file in root' '
+	git diff --name-status $new_file_in_root^ $new_file_in_root &&
+	echo "$new_file_in_root  1" >expect &&
+	grep "$new_file_in_root" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'delete file in root' '
+	git diff --name-status $delete_file_in_root^ $delete_file_in_root &&
+	echo "$delete_file_in_root  1" >expect &&
+	grep "$delete_file_in_root" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'add new file in dir' '
+	git diff --name-status $new_file_in_dir^ $new_file_in_dir &&
+	echo "$new_file_in_dir  2" >expect &&
+	grep "$new_file_in_dir" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'delete file in dir' '
+	git diff --name-status $delete_file_in_dir^ $delete_file_in_dir &&
+	echo "$delete_file_in_dir  2" >expect &&
+	grep "$delete_file_in_dir" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'replace file with dir' '
+	git diff --name-status $file_to_dir^ $file_to_dir &&
+	echo "$file_to_dir  2" >expect &&
+	grep "$file_to_dir" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'replace dir with file' '
+	git diff --name-status $dir_to_file^ $dir_to_file &&
+	echo "$dir_to_file  2" >expect &&
+	grep "$dir_to_file" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'leading zeros in mode' '
+	git diff --name-status $different_but_same_tree^ $different_but_same_tree &&
+	echo "$different_but_same_tree  0" >expect &&
+	grep "$different_but_same_tree" out >actual &&
+	test_cmp expect actual
+'
+
+test_done

  ---  >8  ---


> @@ -552,6 +557,7 @@ struct combine_diff_path *diff_tree_paths(
>  	const struct object_id **parents_oid, int nparent,
>  	struct strbuf *base, struct diff_options *opt)
>  {
> +	opt->num_changes = 0;
>  	p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt);
>  
>  	/*
> -- 
> gitgitgadget
> 

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 8/4/2020 10:47 AM, SZEDER Gábor wrote:
> On Mon, Apr 06, 2020 at 04:59:45PM +0000, Derrick Stolee via GitGitGadget wrote:
> This counter is basically broken, its value is wrong for over 98% of
> commits, and, worse, its value remains 0 for over 85% of commits in
> the repositories I usually use to test modified path Bloom filters.
> Consequently, a relatively large number of commits modifying more than
> 512 paths get Bloom filters.

Thanks for finding this! The counter is only really tested in one
place, and that test only considers _file adds_, which is a problem.

If I understand this correctly, the bug is a performance-only bug
(since this is a performance-only feature), but it is an important
one to fix.

There is certainly some dark magic happening in this tree-diff logic,
so instead of trying to get an accurate count we should just use the
magic global diff_queued_diff to track the current list of file changes.

Note: diff_queued_diff does not track the directory changes, so it
is an under-count for the total changes to track in the Bloom filter.
This is later corrected by the block that adds these leading directory
changes.

> The makeshift tests in the patch below demonstrate these issues as
> most of them fail, most notably those two tests that demonstrate that
> modifying existing paths are not counted at all.

I adapted your diff along with ripping out 'num_changes' in favor
of diff_queued_diff.nr. This required modifying some of your expected
values in the test script (losing the leading directories in the
count).

I'll work with Taylor to create a fix, and include proper testing
of the logic here. We'll stick it in the v2 of his max-changed-paths
series [1]. He already has some helpful logging that can help create
tests that ensure this logic is performing as expected.

We plan to have that fix available by later today or early tomorrow.
Will you be available to help validate it?

[1] https://lore.kernel.org/git/cover.1596480582.git.me@ttaylorr.com/

Thanks,
-Stolee

  --- >8 ---

diff --git a/bloom.c b/bloom.c
index 1a573226e7..b8d6cb9240 100644
--- a/bloom.c
+++ b/bloom.c
@@ -218,8 +218,9 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
 	else
 		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
 	diffcore_std(&diffopt);
+	printf("%s  %d\n", oid_to_hex(&c->object.oid), diff_queued_diff.nr);
 
-	if (diffopt.num_changes <= max_changes) {
+	if (diff_queued_diff.nr <= max_changes) {
 		struct hashmap pathmap;
 		struct pathmap_hash_entry *e;
 		struct hashmap_iter iter;
diff --git a/diff.h b/diff.h
index e0c0af6286b..1d32b718857 100644
--- a/diff.h
+++ b/diff.h
@@ -287,8 +287,6 @@ struct diff_options {
 
 	/* If non-zero, then stop computing after this many changes. */
 	int max_changes;
-	/* For internal use only. */
-	int num_changes;
 
 	int ita_invisible_in_index;
 /* white-space error highlighting */
diff --git a/t/t9999-test.sh b/t/t9999-test.sh
new file mode 100755
index 00000000000..1f35aa8e2c5
--- /dev/null
+++ b/t/t9999-test.sh
@@ -0,0 +1,142 @@
+#!/bin/sh
+
+test_description='test'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	test_tick &&
+
+	echo 1 >file &&
+	mkdir -p dir/subdir &&
+	echo 1 >dir/subdir/file1 &&
+	echo 1 >dir/subdir/file2 &&
+	git add file dir &&
+	git commit -m setup &&
+
+	echo 2 >file &&
+	git commit -a -m "modify one path in root" &&
+	mod_one_path=$(git rev-parse HEAD) &&
+
+	echo 2 >dir/subdir/file1 &&
+	echo 2 >dir/subdir/file2 &&
+	git commit -a -m "modify two file two dirs deep" &&
+	mod_four_paths=$(git rev-parse HEAD) &&
+
+	>new-file &&
+	git add new-file &&
+	git commit -m "add new file in root" &&
+	new_file_in_root=$(git rev-parse HEAD) &&
+
+	git rm new-file &&
+	git commit -m "delete file in root" &&
+	delete_file_in_root=$(git rev-parse HEAD) &&
+
+	>dir/new-file &&
+	git add dir/new-file &&
+	git commit -m "add new file in dir" &&
+	new_file_in_dir=$(git rev-parse HEAD) &&
+
+	git rm dir/new-file &&
+	git commit -m "delete file in dir" &&
+	delete_file_in_dir=$(git rev-parse HEAD) &&
+
+	echo 1 >d-f &&
+	git add d-f &&
+	git commit -m foo &&
+	git rm d-f &&
+	mkdir d-f &&
+	echo 2 >d-f/file &&
+	git add d-f &&
+	git commit -m "replace file with dir" &&
+	file_to_dir=$(git rev-parse HEAD) &&
+
+	>d-f.c &&
+	git add d-f.c &&
+	git commit -m "add a file that sorts between d-f and d-f/" &&
+	git rm -r d-f &&
+	echo 3 >d-f &&
+	git add d-f &&
+	git commit -m "replace dir with file" &&
+	dir_to_file=$(git rev-parse HEAD) &&
+
+	bin_sha1=$(git rev-parse HEAD:dir/subdir | hex2oct) &&
+	# leading zero in mode: the content of the tree remains the same,
+	# but its oid does change!
+	printf "040000 subdir\0$bin_sha1" >rawtree &&
+	tree1=$(git hash-object -t tree -w rawtree) &&
+	git cat-file -p HEAD^{tree} >out &&
+	tree2=$(sed -e "s/$(git rev-parse HEAD:dir/)/$tree1/" out |git mktree) &&
+	different_but_same_tree=$(git commit-tree \
+		-m "leading zeros in mode" \
+		-p $(git rev-parse HEAD) $tree2) &&
+	git update-ref HEAD $different_but_same_tree &&
+
+	git commit-graph write --reachable --changed-paths >out &&
+	cat out  # debug
+'
+
+test_expect_success 'modify one path in root' '
+	git diff --name-status $mod_one_path^ $mod_one_path &&
+	echo "$mod_one_path  1" >expect &&
+	grep "$mod_one_path" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'modify two file two dirs deep' '
+	git diff --name-status $mod_four_paths^ $mod_four_paths &&
+	echo "$mod_four_paths  2" >expect &&
+	grep "$mod_four_paths" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'add new file in root' '
+	git diff --name-status $new_file_in_root^ $new_file_in_root &&
+	echo "$new_file_in_root  1" >expect &&
+	grep "$new_file_in_root" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'delete file in root' '
+	git diff --name-status $delete_file_in_root^ $delete_file_in_root &&
+	echo "$delete_file_in_root  1" >expect &&
+	grep "$delete_file_in_root" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'add new file in dir' '
+	git diff --name-status $new_file_in_dir^ $new_file_in_dir &&
+	echo "$new_file_in_dir  1" >expect &&
+	grep "$new_file_in_dir" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'delete file in dir' '
+	git diff --name-status $delete_file_in_dir^ $delete_file_in_dir &&
+	echo "$delete_file_in_dir  1" >expect &&
+	grep "$delete_file_in_dir" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'replace file with dir' '
+	git diff --name-status $file_to_dir^ $file_to_dir &&
+	echo "$file_to_dir  2" >expect &&
+	grep "$file_to_dir" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'replace dir with file' '
+	git diff --name-status $dir_to_file^ $dir_to_file &&
+	echo "$dir_to_file  2" >expect &&
+	grep "$dir_to_file" out >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'leading zeros in mode' '
+	git diff --name-status $different_but_same_tree^ $different_but_same_tree &&
+	echo "$different_but_same_tree  0" >expect &&
+	grep "$different_but_same_tree" out >actual &&
+	test_cmp expect actual
+'
+
+test_done
diff --git a/tree-diff.c b/tree-diff.c
index 6ebad1a46f3..7cebbb327e2 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -434,7 +434,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		if (diff_can_quit_early(opt))
 			break;
 
-		if (opt->max_changes && opt->num_changes > opt->max_changes)
+		if (opt->max_changes && diff_queued_diff.nr > opt->max_changes)
 			break;
 
 		if (opt->pathspec.nr) {
@@ -521,7 +521,6 @@ static struct combine_diff_path *ll_diff_tree_paths(
 
 			/* t↓ */
 			update_tree_entry(&t);
-			opt->num_changes++;
 		}
 
 		/* t > p[imin] */
@@ -539,7 +538,6 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		skip_emit_tp:
 			/* ∀ pi=p[imin]  pi↓ */
 			update_tp_entries(tp, nparent);
-			opt->num_changes++;
 		}
 	}
 
@@ -557,7 +555,6 @@ struct combine_diff_path *diff_tree_paths(
 	const struct object_id **parents_oid, int nparent,
 	struct strbuf *base, struct diff_options *opt)
 {
-	opt->num_changes = 0;
 	p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt);
 
 	/*

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Tue, Aug 04, 2020 at 12:25:45PM -0400, Derrick Stolee wrote:
> On 8/4/2020 10:47 AM, SZEDER Gábor wrote:
> > On Mon, Apr 06, 2020 at 04:59:45PM +0000, Derrick Stolee via GitGitGadget wrote:
> > This counter is basically broken, its value is wrong for over 98% of
> > commits, and, worse, its value remains 0 for over 85% of commits in
> > the repositories I usually use to test modified path Bloom filters.
> > Consequently, a relatively large number of commits modifying more than
> > 512 paths get Bloom filters.
> 
> Thanks for finding this! The counter is only really tested in one
> place, and that test only considers _file adds_, which is a problem.
> 
> If I understand this correctly, the bug is a performance-only bug
> (since this is a performance-only feature), but it is an important
> one to fix.

Or a performance-only feature in a performance-only feature, because
those additional modified path Bloom filters can improve the runtime
of pathspec-limited revision walks (assuming that the false positive
rate is low enough).

> There is certainly some dark magic happening in this tree-diff logic,
> so instead of trying to get an accurate count we should just use the
> magic global diff_queued_diff to track the current list of file changes.
> 
> Note: diff_queued_diff does not track the directory changes, so it
> is an under-count for the total changes to track in the Bloom filter.
> This is later corrected by the block that adds these leading directory
> changes.
> 
> > The makeshift tests in the patch below demonstrate these issues as
> > most of them fail, most notably those two tests that demonstrate that
> > modifying existing paths are not counted at all.
> 
> I adapted your diff along with ripping out 'num_changes' in favor
> of diff_queued_diff.nr. This required modifying some of your expected
> values in the test script (losing the leading directories in the
> count).
> 
> I'll work with Taylor to create a fix, and include proper testing
> of the logic here. We'll stick it in the v2 of his max-changed-paths
> series [1]. He already has some helpful logging that can help create
> tests that ensure this logic is performing as expected.

Don't forget to include a check of the hashmap's size, to make sure.

FWIW, the patch below does result in the correct count (read: the same
as in my implemenation) for all but 4 commits in those repositories I
use for testing, without adding any memory allocations and extra
strcmp() calls.

  ---  >8  ---

diff --git a/cache.h b/cache.h
index 0f0485ecfe..3fc7e1b427 100644
--- a/cache.h
+++ b/cache.h
@@ -1574,6 +1574,7 @@ int repo_interpret_branch_name(struct repository *r,
 int validate_headref(const char *ref);
 
 int base_name_compare(const char *name1, int len1, int mode1, const char *name2, int len2, int mode2);
+int base_name_compare_df(const char *name1, int len1, int mode1, const char *name2, int len2, int mode2, int *df);
 int df_name_compare(const char *name1, int len1, int mode1, const char *name2, int len2, int mode2);
 int name_compare(const char *name1, size_t len1, const char *name2, size_t len2);
 int cache_name_stage_compare(const char *name1, int len1, int stage1, const char *name2, int len2, int stage2);
diff --git a/read-cache.c b/read-cache.c
index aa427c5c17..041af19e60 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -460,13 +460,16 @@ int ie_modified(struct index_state *istate,
 	return 0;
 }
 
-int base_name_compare(const char *name1, int len1, int mode1,
-		      const char *name2, int len2, int mode2)
+int base_name_compare_df(const char *name1, int len1, int mode1,
+			 const char *name2, int len2, int mode2,
+			 int *df)
 {
 	unsigned char c1, c2;
 	int len = len1 < len2 ? len1 : len2;
 	int cmp;
 
+	*df = 0;
+
 	cmp = memcmp(name1, name2, len);
 	if (cmp)
 		return cmp;
@@ -476,7 +479,21 @@ int base_name_compare(const char *name1, int len1, int mode1,
 		c1 = '/';
 	if (!c2 && S_ISDIR(mode2))
 		c2 = '/';
-	return (c1 < c2) ? -1 : (c1 > c2) ? 1 : 0;
+	if (c1 == c2)
+		return 0;	/* TODO: is this even possible? */
+	if ((c1 == '/' && !c2) ||
+	    (!c1 && c2 == '/'))
+		*df = 1;
+	return (c1 < c2) ? -1 : 1;
+}
+
+int base_name_compare(const char *name1, int len1, int mode1,
+		      const char *name2, int len2, int mode2)
+{
+	int unused;
+	return base_name_compare_df(name1, len1, mode1,
+				    name2, len2, mode2,
+				    &unused);
 }
 
 /*
diff --git a/t/t9999-test.sh b/t/t9999-test.sh
index 8d2bd9f03f..4f08590b45 100755
--- a/t/t9999-test.sh
+++ b/t/t9999-test.sh
@@ -125,7 +125,7 @@ test_expect_success 'replace file with dir' '
 	test_cmp expect actual
 '
 
-test_expect_success 'replace dir with file' '
+test_expect_failure 'replace dir with file' '
 	git diff --name-status $dir_to_file^ $dir_to_file &&
 	echo "$dir_to_file  2" >expect &&
 	grep "$dir_to_file" out >actual &&
diff --git a/tree-diff.c b/tree-diff.c
index f3d303c6e5..e27f9c805e 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -46,11 +46,14 @@ static int ll_diff_tree_oid(const struct object_id *old_oid,
  *      Due to this convention, if trees are scanned in sorted order, all
  *      non-empty descriptors will be processed first.
  */
-static int tree_entry_pathcmp(struct tree_desc *t1, struct tree_desc *t2)
+static int tree_entry_pathcmp(struct tree_desc *t1, struct tree_desc *t2,
+			      int *df)
 {
 	struct name_entry *e1, *e2;
 	int cmp;
 
+	*df = 0;
+
 	/* empty descriptors sort after valid tree entries */
 	if (!t1->size)
 		return t2->size ? 1 : 0;
@@ -59,8 +62,9 @@ static int tree_entry_pathcmp(struct tree_desc *t1, struct tree_desc *t2)
 
 	e1 = &t1->entry;
 	e2 = &t2->entry;
-	cmp = base_name_compare(e1->path, tree_entry_len(e1), e1->mode,
-				e2->path, tree_entry_len(e2), e2->mode);
+	cmp = base_name_compare_df(e1->path, tree_entry_len(e1), e1->mode,
+				   e2->path, tree_entry_len(e2), e2->mode,
+				   df);
 	return cmp;
 }
 
@@ -410,7 +414,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 {
 	struct tree_desc t, *tp;
 	void *ttree, **tptree;
-	int i;
+	int i, df;
 
 	FAST_ARRAY_ALLOC(tp, nparent);
 	FAST_ARRAY_ALLOC(tptree, nparent);
@@ -463,7 +467,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
 		tp[0].entry.mode &= ~S_IFXMIN_NEQ;
 
 		for (i = 1; i < nparent; ++i) {
-			cmp = tree_entry_pathcmp(&tp[i], &tp[imin]);
+			cmp = tree_entry_pathcmp(&tp[i], &tp[imin], &df);
 			if (cmp < 0) {
 				imin = i;
 				tp[i].entry.mode &= ~S_IFXMIN_NEQ;
@@ -483,10 +487,12 @@ static struct combine_diff_path *ll_diff_tree_paths(
 
 
 		/* compare t vs p[imin] */
-		cmp = tree_entry_pathcmp(&t, &tp[imin]);
+		cmp = tree_entry_pathcmp(&t, &tp[imin], &df);
 
 		/* t = p[imin] */
 		if (cmp == 0) {
+			int prev_num_changes = opt->num_changes;
+
 			/* are either pi > p[imin] or diff(t,pi) != ø ? */
 			if (!opt->flags.find_copies_harder) {
 				for (i = 0; i < nparent; ++i) {
@@ -506,6 +512,9 @@ static struct combine_diff_path *ll_diff_tree_paths(
 			/* D += {δ(t,pi) if pi=p[imin];  "+a" if pi > p[imin]} */
 			p = emit_path(p, base, opt, nparent,
 					&t, tp, imin);
+			if (!(opt->num_changes == prev_num_changes &&
+			      S_ISDIR(t.entry.mode)))
+				opt->num_changes++;
 
 		skip_emit_t_tp:
 			/* t↓,  ∀ pi=p[imin]  pi↓ */
@@ -518,10 +527,11 @@ static struct combine_diff_path *ll_diff_tree_paths(
 			/* D += "+t" */
 			p = emit_path(p, base, opt, nparent,
 					&t, /*tp=*/NULL, -1);
+			if (!df)
+				opt->num_changes++;
 
 			/* t↓ */
 			update_tree_entry(&t);
-			opt->num_changes++;
 		}
 
 		/* t > p[imin] */
@@ -535,11 +545,12 @@ static struct combine_diff_path *ll_diff_tree_paths(
 
 			p = emit_path(p, base, opt, nparent,
 					/*t=*/NULL, tp, imin);
+			if (!df)
+				opt->num_changes++;
 
 		skip_emit_tp:
 			/* ∀ pi=p[imin]  pi↓ */
 			update_tp_entries(tp, nparent);
-			opt->num_changes++;
 		}
 	}
 
  ---  >8  ---


Having said that, the best (i.e faster and accurate) solution to this
issue is probably:

  - Update the callchain between diff_tree_oid() and the diff callback
    functions to allow the callbacks to break diffing with a non-zero
    error code.

  - Fill Bloom filters using the approach presented in:

      https://public-inbox.org/git/20200529085038.26008-21-szeder.dev@gmail.com/

    but modify the callbacks to return non-zero when too many paths
    have been processed.

  - Drop this counter entirely, as there are no other users.

> We plan to have that fix available by later today or early tomorrow.
> Will you be available to help validate it?
> 
> [1] https://lore.kernel.org/git/cover.1596480582.git.me@ttaylorr.com/
> 
> Thanks,
> -Stolee
> 
>   --- >8 ---
> 
> diff --git a/bloom.c b/bloom.c
> index 1a573226e7..b8d6cb9240 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -218,8 +218,9 @@ struct bloom_filter *get_bloom_filter(struct repository *r,
>  	else
>  		diff_tree_oid(NULL, &c->object.oid, "", &diffopt);
>  	diffcore_std(&diffopt);
> +	printf("%s  %d\n", oid_to_hex(&c->object.oid), diff_queued_diff.nr);
>  
> -	if (diffopt.num_changes <= max_changes) {
> +	if (diff_queued_diff.nr <= max_changes) {
>  		struct hashmap pathmap;
>  		struct pathmap_hash_entry *e;
>  		struct hashmap_iter iter;
> diff --git a/diff.h b/diff.h
> index e0c0af6286b..1d32b718857 100644
> --- a/diff.h
> +++ b/diff.h
> @@ -287,8 +287,6 @@ struct diff_options {
>  
>  	/* If non-zero, then stop computing after this many changes. */
>  	int max_changes;
> -	/* For internal use only. */
> -	int num_changes;
>  
>  	int ita_invisible_in_index;
>  /* white-space error highlighting */
> diff --git a/t/t9999-test.sh b/t/t9999-test.sh
> new file mode 100755
> index 00000000000..1f35aa8e2c5
> --- /dev/null
> +++ b/t/t9999-test.sh
> @@ -0,0 +1,142 @@
> +#!/bin/sh
> +
> +test_description='test'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'setup' '
> +	test_tick &&
> +
> +	echo 1 >file &&
> +	mkdir -p dir/subdir &&
> +	echo 1 >dir/subdir/file1 &&
> +	echo 1 >dir/subdir/file2 &&
> +	git add file dir &&
> +	git commit -m setup &&
> +
> +	echo 2 >file &&
> +	git commit -a -m "modify one path in root" &&
> +	mod_one_path=$(git rev-parse HEAD) &&
> +
> +	echo 2 >dir/subdir/file1 &&
> +	echo 2 >dir/subdir/file2 &&
> +	git commit -a -m "modify two file two dirs deep" &&
> +	mod_four_paths=$(git rev-parse HEAD) &&
> +
> +	>new-file &&
> +	git add new-file &&
> +	git commit -m "add new file in root" &&
> +	new_file_in_root=$(git rev-parse HEAD) &&
> +
> +	git rm new-file &&
> +	git commit -m "delete file in root" &&
> +	delete_file_in_root=$(git rev-parse HEAD) &&
> +
> +	>dir/new-file &&
> +	git add dir/new-file &&
> +	git commit -m "add new file in dir" &&
> +	new_file_in_dir=$(git rev-parse HEAD) &&
> +
> +	git rm dir/new-file &&
> +	git commit -m "delete file in dir" &&
> +	delete_file_in_dir=$(git rev-parse HEAD) &&
> +
> +	echo 1 >d-f &&
> +	git add d-f &&
> +	git commit -m foo &&
> +	git rm d-f &&
> +	mkdir d-f &&
> +	echo 2 >d-f/file &&
> +	git add d-f &&
> +	git commit -m "replace file with dir" &&
> +	file_to_dir=$(git rev-parse HEAD) &&
> +
> +	>d-f.c &&
> +	git add d-f.c &&
> +	git commit -m "add a file that sorts between d-f and d-f/" &&
> +	git rm -r d-f &&
> +	echo 3 >d-f &&
> +	git add d-f &&
> +	git commit -m "replace dir with file" &&
> +	dir_to_file=$(git rev-parse HEAD) &&
> +
> +	bin_sha1=$(git rev-parse HEAD:dir/subdir | hex2oct) &&
> +	# leading zero in mode: the content of the tree remains the same,
> +	# but its oid does change!
> +	printf "040000 subdir\0$bin_sha1" >rawtree &&
> +	tree1=$(git hash-object -t tree -w rawtree) &&
> +	git cat-file -p HEAD^{tree} >out &&
> +	tree2=$(sed -e "s/$(git rev-parse HEAD:dir/)/$tree1/" out |git mktree) &&
> +	different_but_same_tree=$(git commit-tree \
> +		-m "leading zeros in mode" \
> +		-p $(git rev-parse HEAD) $tree2) &&
> +	git update-ref HEAD $different_but_same_tree &&
> +
> +	git commit-graph write --reachable --changed-paths >out &&
> +	cat out  # debug
> +'
> +
> +test_expect_success 'modify one path in root' '
> +	git diff --name-status $mod_one_path^ $mod_one_path &&
> +	echo "$mod_one_path  1" >expect &&
> +	grep "$mod_one_path" out >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'modify two file two dirs deep' '
> +	git diff --name-status $mod_four_paths^ $mod_four_paths &&
> +	echo "$mod_four_paths  2" >expect &&
> +	grep "$mod_four_paths" out >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'add new file in root' '
> +	git diff --name-status $new_file_in_root^ $new_file_in_root &&
> +	echo "$new_file_in_root  1" >expect &&
> +	grep "$new_file_in_root" out >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'delete file in root' '
> +	git diff --name-status $delete_file_in_root^ $delete_file_in_root &&
> +	echo "$delete_file_in_root  1" >expect &&
> +	grep "$delete_file_in_root" out >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'add new file in dir' '
> +	git diff --name-status $new_file_in_dir^ $new_file_in_dir &&
> +	echo "$new_file_in_dir  1" >expect &&
> +	grep "$new_file_in_dir" out >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'delete file in dir' '
> +	git diff --name-status $delete_file_in_dir^ $delete_file_in_dir &&
> +	echo "$delete_file_in_dir  1" >expect &&
> +	grep "$delete_file_in_dir" out >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'replace file with dir' '
> +	git diff --name-status $file_to_dir^ $file_to_dir &&
> +	echo "$file_to_dir  2" >expect &&
> +	grep "$file_to_dir" out >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'replace dir with file' '
> +	git diff --name-status $dir_to_file^ $dir_to_file &&
> +	echo "$dir_to_file  2" >expect &&
> +	grep "$dir_to_file" out >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'leading zeros in mode' '
> +	git diff --name-status $different_but_same_tree^ $different_but_same_tree &&
> +	echo "$different_but_same_tree  0" >expect &&
> +	grep "$different_but_same_tree" out >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_done
> diff --git a/tree-diff.c b/tree-diff.c
> index 6ebad1a46f3..7cebbb327e2 100644
> --- a/tree-diff.c
> +++ b/tree-diff.c
> @@ -434,7 +434,7 @@ static struct combine_diff_path *ll_diff_tree_paths(
>  		if (diff_can_quit_early(opt))
>  			break;
>  
> -		if (opt->max_changes && opt->num_changes > opt->max_changes)
> +		if (opt->max_changes && diff_queued_diff.nr > opt->max_changes)
>  			break;
>  
>  		if (opt->pathspec.nr) {
> @@ -521,7 +521,6 @@ static struct combine_diff_path *ll_diff_tree_paths(
>  
>  			/* t↓ */
>  			update_tree_entry(&t);
> -			opt->num_changes++;
>  		}
>  
>  		/* t > p[imin] */
> @@ -539,7 +538,6 @@ static struct combine_diff_path *ll_diff_tree_paths(
>  		skip_emit_tp:
>  			/* ∀ pi=p[imin]  pi↓ */
>  			update_tp_entries(tp, nparent);
> -			opt->num_changes++;
>  		}
>  	}
>  
> @@ -557,7 +555,6 @@ struct combine_diff_path *diff_tree_paths(
>  	const struct object_id **parents_oid, int nparent,
>  	struct strbuf *base, struct diff_options *opt)
>  {
> -	opt->num_changes = 0;
>  	p = ll_diff_tree_paths(p, oid, parents_oid, nparent, base, opt);
>  
>  	/*

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 8/4/2020 1:00 PM, SZEDER Gábor wrote:
> On Tue, Aug 04, 2020 at 12:25:45PM -0400, Derrick Stolee wrote:
>> On 8/4/2020 10:47 AM, SZEDER Gábor wrote:
>>> On Mon, Apr 06, 2020 at 04:59:45PM +0000, Derrick Stolee via GitGitGadget wrote:
>>> This counter is basically broken, its value is wrong for over 98% of
>>> commits, and, worse, its value remains 0 for over 85% of commits in
>>> the repositories I usually use to test modified path Bloom filters.
>>> Consequently, a relatively large number of commits modifying more than
>>> 512 paths get Bloom filters.
>>
>> Thanks for finding this! The counter is only really tested in one
>> place, and that test only considers _file adds_, which is a problem.
>>
>> If I understand this correctly, the bug is a performance-only bug
>> (since this is a performance-only feature), but it is an important
>> one to fix.
> 
> Or a performance-only feature in a performance-only feature, because
> those additional modified path Bloom filters can improve the runtime
> of pathspec-limited revision walks (assuming that the false positive
> rate is low enough).
> 
>> There is certainly some dark magic happening in this tree-diff logic,
>> so instead of trying to get an accurate count we should just use the
>> magic global diff_queued_diff to track the current list of file changes.
>>
>> Note: diff_queued_diff does not track the directory changes, so it
>> is an under-count for the total changes to track in the Bloom filter.
>> This is later corrected by the block that adds these leading directory
>> changes.
>>
>>> The makeshift tests in the patch below demonstrate these issues as
>>> most of them fail, most notably those two tests that demonstrate that
>>> modifying existing paths are not counted at all.
>>
>> I adapted your diff along with ripping out 'num_changes' in favor
>> of diff_queued_diff.nr. This required modifying some of your expected
>> values in the test script (losing the leading directories in the
>> count).
>>
>> I'll work with Taylor to create a fix, and include proper testing
>> of the logic here. We'll stick it in the v2 of his max-changed-paths
>> series [1]. He already has some helpful logging that can help create
>> tests that ensure this logic is performing as expected.
> 
> Don't forget to include a check of the hashmap's size, to make sure.

Yes, thanks for the pointer. That check is currently not in there,
since the code assumes the hashmap's size will match num_changes.
Hopefully, the tests I intend to write around this would have caught
such an omission.
 
> FWIW, the patch below does result in the correct count (read: the same
> as in my implemenation) for all but 4 commits in those repositories I
> use for testing, without adding any memory allocations and extra
> strcmp() calls.

...

> Having said that, the best (i.e faster and accurate) solution to this
> issue is probably:
> 
>   - Update the callchain between diff_tree_oid() and the diff callback
>     functions to allow the callbacks to break diffing with a non-zero
>     error code.

It looks like this part would not be too difficult. The pathchange
callback is called by emit_path() which returns a struct combine_diff_path
pointer. This could return NULL to signal an early termination, but
we need to update all callers of the following methods to handle NULL
responses:

 * emit_path()
 * ll_diff_tree_paths()
 * diff_tree_paths()

Of some interest: diff_tree_paths() returns a struct combine_diff_path
pointer, but no callers seem to consume it.

>   - Fill Bloom filters using the approach presented in:
> 
>       https://public-inbox.org/git/20200529085038.26008-21-szeder.dev@gmail.com/
> 
>     but modify the callbacks to return non-zero when too many paths
>     have been processed.

Thanks for the pointer to that specific patch. You do a good job of
describing your thought process, including why you used the callback
approach instead of the diff queue approach. The main reason seemed to
be memory overhead from populating the entire diff queue before
checking the limit.

However, if we are using the diff queue as the short-circuit, then
perhaps that memory overhead isn't as much of a problem?

You admit yourself, that

  This patch implements a more efficient, but more complex, approach:

The logic around matching prefixes definitely seems complex and
hard to test, especially around the file/directory changes with the
sort order problems that have plagued similar prefix checks recently.
I'm not doubting your implementation, just saying that the complexity
is worth considering before jumping to that solution too quickly.

To sum up, I intend to start with a fix that uses the diff queue
count as a limit, then try the callback approach to see if there are
measurable improvements in performance.

>   - Drop this counter entirely, as there are no other users.

With the callback approach, "this counter" is both num_changes and
max_changes, since the callback would perform all of the short-circuit
logic.

Thanks,
-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 8/4/2020 1:31 PM, Derrick Stolee wrote:
> On 8/4/2020 1:00 PM, SZEDER Gábor wrote:

>> Having said that, the best (i.e faster and accurate) solution to this
>> issue is probably:
>>
>>   - Update the callchain between diff_tree_oid() and the diff callback
>>     functions to allow the callbacks to break diffing with a non-zero
>>     error code.
> 
> It looks like this part would not be too difficult. 

Oh, my hubris! I gave this a shot for some time this morning. This
will definitely take some work to do right. Just changing the callbacks
to return 'int' is a wide-sweeping change, but the place where they are
called already has an 'int' return that means something different.

I'm not saying this is impossible. It just takes more attention and care
than I can currently devote, given my other works in progress right now.

>>   - Fill Bloom filters using the approach presented in:
>>
>>       https://public-inbox.org/git/20200529085038.26008-21-szeder.dev@gmail.com/
>>
>>     but modify the callbacks to return non-zero when too many paths
>>     have been processed.
> 
> Thanks for the pointer to that specific patch. You do a good job of
> describing your thought process, including why you used the callback
> approach instead of the diff queue approach. The main reason seemed to
> be memory overhead from populating the entire diff queue before
> checking the limit.
> 
> However, if we are using the diff queue as the short-circuit, then
> perhaps that memory overhead isn't as much of a problem?
> 
> You admit yourself, that
> 
>   This patch implements a more efficient, but more complex, approach:
> 
> The logic around matching prefixes definitely seems complex and
> hard to test, especially around the file/directory changes with the
> sort order problems that have plagued similar prefix checks recently.
> I'm not doubting your implementation, just saying that the complexity
> is worth considering before jumping to that solution too quickly.
> 
> To sum up, I intend to start with a fix that uses the diff queue
> count as a limit, then try the callback approach to see if there are
> measurable improvements in performance.

That fix is now available [1].

[1] https://lore.kernel.org/git/d1c4bbcaa9627068d5d9fbd0e4a2e8c8834a4bd3.1596646576.git.me@ttaylorr.com/

Again, the callback approach seems promising. The complexity is
stopping me from trying to apply it on top of the current
implementation, while I should be focusing on other things. I completely
believe that that approach is faster and more memory-efficient. I would
love to test and review a patch that takes that approach here.

Thanks,
-Stolee

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants