
multi-pack-index: fix verify on large repos #166

Closed

Conversation


@jeffhostetler jeffhostetler commented Mar 15, 2019

Version 3 drops the packed-git commit I mentioned earlier, simplifies the
finish_if_sparse API as suggested by Eric, and splits the object sort commit
into 2 commits: one to add the additional progress indicators and one to
sort the objects by packfile. This makes it easier to follow.


Version 2 addresses progress-related concerns raised in the previous version
of the midx verify code.

This version also extends the existing progress.[ch] code and adds a "sparse"
mode that automatically ensures the 100% message is issued.


Teach "multi-pack-index verify" to handle cases where the number of
packfiles exceeds the open file handle limit.

The first commit fixes a problem that prevented the LRU-style close_one_pack()
mechanism from working which caused midx verify to run out of file descriptors.

The second commit teaches midx verify to sort the set of objects to verify
by packfile rather than verifying them in OID order. This eliminates the
need to have more than one packfile/idx open at the same time.

With the second commit, runtime on 3600 packfiles went from 12 minutes
to 25 seconds.

Thanks,
Jeff

Cc: dstolee@microsoft.com
Cc: avarab@gmail.com
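
For readers skimming this cover letter, a rough sketch of the grouping approach described above (not the verbatim patch; the struct, the ALLOC_ARRAY/QSORT macros, and nth_midxed_pack_int_id() follow the hunks quoted later in this thread, while close_previous_pack() and verify_object_offset() are hypothetical stand-ins for the loop body):

    /* Sketch only; assumes Git's internal headers (git-compat-util.h, midx.h). */
    struct pair_pos_vs_id {
        uint32_t pos;          /* object position in the midx */
        uint32_t pack_int_id;  /* pack that contains the object */
    };

    ALLOC_ARRAY(pairs, m->num_objects);
    for (i = 0; i < m->num_objects; i++) {
        pairs[i].pos = i;
        pairs[i].pack_int_id = nth_midxed_pack_int_id(m, i);
    }
    QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);

    for (i = 0; i < m->num_objects; i++) {
        /*
         * Crossing into a new pack: the previous pack can be closed,
         * so at most one packfile/idx pair stays open at a time.
         */
        if (i && pairs[i - 1].pack_int_id != pairs[i].pack_int_id)
            close_previous_pack(m, pairs[i - 1].pack_int_id); /* hypothetical helper */

        verify_object_offset(m, pairs[i].pos); /* hypothetical per-object check */
    }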

@jeffhostetler jeffhostetler changed the title from "Improve performance of multi-pack-index verify on large repos" to "multi-pack-index: fix verify on large repos" on Mar 18, 2019
@jeffhostetler
Author

/submit

@gitgitgadget

gitgitgadget bot commented Mar 18, 2019

Submitted as pull.166.git.gitgitgadget@gmail.com

@@ -958,9 +958,29 @@ static void midx_report(const char *fmt, ...)
va_end(ap);

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Mon, Mar 18 2019, Jeff Hostetler via GitGitGadget wrote:

> +static int compare_pair_pos_vs_id(const void *_a, const void *_b)
> +{
> +	struct pair_pos_vs_id *a = (struct pair_pos_vs_id *)_a;
> +	struct pair_pos_vs_id *b = (struct pair_pos_vs_id *)_b;
> +
> +	if (a->pack_int_id < b->pack_int_id)
> +		return -1;
> +	if (a->pack_int_id > b->pack_int_id)
> +		return 1;
> +
> +	return 0;
> +}

Not a suggestion for a change, just a note that this sent me down the
rabbit hole of looking at the different idioms we use for QSORT() in
different places. Some use this form, some a ternary nest, and some the
succinct subtraction idiom of e.g. (in this case):

    return b->pack_int_id - a->pack_int_id;

> +
>  int verify_midx_file(const char *object_dir)
>  {
> -	uint32_t i;
> +	struct pair_pos_vs_id *pairs = NULL;
> +	uint32_t i, k;
>  	struct progress *progress;
>  	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
>  	verify_midx_error = 0;
> @@ -997,15 +1017,36 @@ int verify_midx_file(const char *object_dir)
>  	}
>
>  	progress = start_progress(_("Verifying object offsets"), m->num_objects);
> +
> +	/*
> +	 * Create an array mapping each object to its packfile id.  Sort it
> +	 * to group the objects by packfile.  Use this permutation to visit
> +	 * each of the objects and only require 1 packfile to be open at a
> +	 * time.
> +	 */
> +	ALLOC_ARRAY(pairs, m->num_objects);
>  	for (i = 0; i < m->num_objects; i++) {
> +		pairs[i].pos = i;
> +		pairs[i].pack_int_id = nth_midxed_pack_int_id(m, i);
> +	}
> +	QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
> +
> +	for (k = 0; k < m->num_objects; k++) {
> [...]

I have not tested this (or midx in general), but isn't this new QSORT()
introducing the same sort of progress stalling that I fixed for
commit-graph in 890226ccb57 ("commit-graph write: add itermediate
progress", 2019-01-19)? I.e. something you can work around with a
"display_progress(progress, 0)" before the QSORT().


On the Git mailing list, Jeff Hostetler wrote (reply to this):



On 3/18/2019 11:53 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Mon, Mar 18 2019, Jeff Hostetler via GitGitGadget wrote:
> 
>> +static int compare_pair_pos_vs_id(const void *_a, const void *_b)
>> +{
>> +	struct pair_pos_vs_id *a = (struct pair_pos_vs_id *)_a;
>> +	struct pair_pos_vs_id *b = (struct pair_pos_vs_id *)_b;
>> +
>> +	if (a->pack_int_id < b->pack_int_id)
>> +		return -1;
>> +	if (a->pack_int_id > b->pack_int_id)
>> +		return 1;
>> +
>> +	return 0;
>> +}
> 
> Not a suggestion for a change, just a note that this sent me down the
> rabbit hole of looking at the different idioms we use for QSORT() in
> different places. Some use this form, some a ternary nest, and some the
> succinct subtraction idiom of e.g. (in this case):
> 
>      return b->pack_int_id - a->pack_int_id;

Yeah, I'm not sure which way is better or worse here.
An earlier draft of this function sorted by packfile id
and then by OID (thinking we might benefit from some
locality later when we do the verify), hence the independent
if statements.  But it didn't help, so I removed the other
lines.

On 43+M objects, your version is a hair faster, so I might
as well take it instead.

> 
>> +
>>   int verify_midx_file(const char *object_dir)
>>   {
>> -	uint32_t i;
>> +	struct pair_pos_vs_id *pairs = NULL;
>> +	uint32_t i, k;
>>   	struct progress *progress;
>>   	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
>>   	verify_midx_error = 0;
>> @@ -997,15 +1017,36 @@ int verify_midx_file(const char *object_dir)
>>   	}
>>
>>   	progress = start_progress(_("Verifying object offsets"), m->num_objects);
>> +
>> +	/*
>> +	 * Create an array mapping each object to its packfile id.  Sort it
>> +	 * to group the objects by packfile.  Use this permutation to visit
>> +	 * each of the objects and only require 1 packfile to be open at a
>> +	 * time.
>> +	 */
>> +	ALLOC_ARRAY(pairs, m->num_objects);
>>   	for (i = 0; i < m->num_objects; i++) {
>> +		pairs[i].pos = i;
>> +		pairs[i].pack_int_id = nth_midxed_pack_int_id(m, i);
>> +	}
>> +	QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
>> +
>> +	for (k = 0; k < m->num_objects; k++) {
>> [...]
> 
> I have not tested this (or midx in general), but isn't this new QSORT()
> introducing the same sort of progress stalling that I fixed for
> commit-graph in 890226ccb57 ("commit-graph write: add itermediate
> progress", 2019-01-19)? I.e. something you can work around with a
> "display_progress(progress, 0)" before the QSORT().
> 

I wasn't tracking your commit-graph changes, but yes, I think it is.

Tinkering with how to display progress, I found a couple of problems.
On my 3599 packfile, 43M object example, QSORT() takes about 5 seconds.
But there's about 2 seconds of setup before the sort starts.  The final
verify loop takes about 17 seconds.

Here I put trace2 regions around the main loops and used the
GIT_TR2_PERF stream.

> | cmd_name     |     |           |           |            | multi-pack-index (multi-pack-index)
> | cmd_mode     |     |           |           |            | verify
> | data         | r0  |  0.031295 |  0.031295 | midx       | load/num_packs:3599
> | data         | r0  |  0.031330 |  0.031330 | midx       | load/num_objects:42704807
> | region_enter | r0  |  0.031352 |           | midx       | label:verify/prepare 
> | region_leave | r0  |  0.626547 |  0.595195 | midx       | label:verify/prepare 
> | region_enter | r0  |  0.626602 |           | midx       | label:verify/oid_order 
> | region_leave | r0  |  1.570195 |  0.943593 | midx       | label:verify/oid_order 
> | region_enter | r0  |  1.570253 |           | midx       | label:verify/sort_setup 
> | region_leave | r0  |  1.809723 |  0.239470 | midx       | label:verify/sort_setup 
> | region_enter | r0  |  1.809803 |           | midx       | label:verify/sort 
> | region_leave | r0  |  6.950595 |  5.140792 | midx       | label:verify/sort 
> | region_enter | r0  |  6.950651 |           | midx       | label:verify/offsets 
> | region_leave | r0  | 24.059736 | 17.109085 | midx       | label:verify/offsets 
> | exit         |     | 24.101434 |           |            | code:0

So just adding a delay progress block by itself around the sort doesn't
help much.  It just sits there for 7 seconds before the actual progress
starts.

If I add a non-delay progress block around the "verify/prepare",
"verify/oid_order" and the "verify/offsets" loops, we get a pretty good
experience.

There is the dead time while the sort() itself is running, but at least
there isn't a 5+ second frozen at 0% message on screen.

I'll re-roll shortly.

Thanks,
Jeff
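
For readers unfamiliar with the trace2 regions mentioned above, the instrumentation is roughly the following (a sketch using Git's trace2.h API; the "midx" category and "verify/*" labels follow the log quoted above, and the "data" rows come from trace2_data_intmax() calls made when the midx is loaded):

    /* bracket a phase so its enter/leave timestamps land in the GIT_TR2_PERF stream */
    trace2_region_enter("midx", "verify/sort", the_repository);
    QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
    trace2_region_leave("midx", "verify/sort", the_repository);

    /* the load/num_packs and load/num_objects rows are emitted the same way */
    trace2_data_intmax("midx", the_repository, "load/num_packs", m->num_packs);
    trace2_data_intmax("midx", the_repository, "load/num_objects", m->num_objects);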


On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Mon, Mar 18 2019, Jeff Hostetler wrote:

>> On 3/18/2019 11:53 AM, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Mon, Mar 18 2019, Jeff Hostetler via GitGitGadget wrote:
>>
>>> +static int compare_pair_pos_vs_id(const void *_a, const void *_b)
>>> +{
>>> +	struct pair_pos_vs_id *a = (struct pair_pos_vs_id *)_a;
>>> +	struct pair_pos_vs_id *b = (struct pair_pos_vs_id *)_b;
>>> +
>>> +	if (a->pack_int_id < b->pack_int_id)
>>> +		return -1;
>>> +	if (a->pack_int_id > b->pack_int_id)
>>> +		return 1;
>>> +
>>> +	return 0;
>>> +}
>>
>> Not a suggestion for a change, just a note that this sent me down the
>> rabbit hole of looking at the different idioms we use for QSORT() in
>> different places. Some use this form, some a ternary nest, and some the
>> succinct subtraction idiom of e.g. (in this case):
>>
>>      return b->pack_int_id - a->pack_int_id;
>
> Yeah, I'm not sure which way is better or worse here.
> An earlier draft of this function sorted by packfile id
> and then by OID (thinking we might benefit from some
> locality later when we do the verify), hence the independent
> if statements.  But it didn't help, so I removed the other
> lines.
>
> On 43+M objects, your version is a hair faster, so I might
> as well take it instead.

Cool!

>>
>>> +
>>>   int verify_midx_file(const char *object_dir)
>>>   {
>>> -	uint32_t i;
>>> +	struct pair_pos_vs_id *pairs = NULL;
>>> +	uint32_t i, k;
>>>   	struct progress *progress;
>>>   	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
>>>   	verify_midx_error = 0;
>>> @@ -997,15 +1017,36 @@ int verify_midx_file(const char *object_dir)
>>>   	}
>>>
>>>   	progress = start_progress(_("Verifying object offsets"), m->num_objects);
>>> +
>>> +	/*
>>> +	 * Create an array mapping each object to its packfile id.  Sort it
>>> +	 * to group the objects by packfile.  Use this permutation to visit
>>> +	 * each of the objects and only require 1 packfile to be open at a
>>> +	 * time.
>>> +	 */
>>> +	ALLOC_ARRAY(pairs, m->num_objects);
>>>   	for (i = 0; i < m->num_objects; i++) {
>>> +		pairs[i].pos = i;
>>> +		pairs[i].pack_int_id = nth_midxed_pack_int_id(m, i);
>>> +	}
>>> +	QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
>>> +
>>> +	for (k = 0; k < m->num_objects; k++) {
>>> [...]
>>
>> I have not tested this (or midx in general), but isn't this new QSORT()
>> introducing the same sort of progress stalling that I fixed for
>> commit-graph in 890226ccb57 ("commit-graph write: add itermediate
>> progress", 2019-01-19)? I.e. something you can work around with a
>> "display_progress(progress, 0)" before the QSORT().
>>
>
> I wasn't tracking your commit-graph changes, but yes, I think it is.
>
> Tinkering with how to display progress, I found a couple of problems.
> On my 3599 packfile, 43M object example, QSORT() takes about 5 seconds.
> But there's about 2 seconds of setup before the sort starts.  The final
> verify loop takes about 17 seconds.
>
> Here I put trace2 regions around the main loops and used the
> GIT_TR2_PERF stream.
>
>> | cmd_name     |     |           |           |            | multi-pack-index (multi-pack-index)
>> | cmd_mode     |     |           |           |            | verify
>> | data         | r0  |  0.031295 |  0.031295 | midx       | load/num_packs:3599
>> | data         | r0  |  0.031330 |  0.031330 | midx       | load/num_objects:42704807
>> | region_enter | r0  |  0.031352 |           | midx       | label:verify/prepare
>> | region_leave | r0  |  0.626547 |  0.595195 | midx       | label:verify/prepare
>> | region_enter | r0  |  0.626602 |           | midx       | label:verify/oid_order
>> | region_leave | r0  |  1.570195 |  0.943593 | midx       | label:verify/oid_order
>> | region_enter | r0  |  1.570253 |           | midx       | label:verify/sort_setup
>> | region_leave | r0  |  1.809723 |  0.239470 | midx       | label:verify/sort_setup
>> | region_enter | r0  |  1.809803 |           | midx       | label:verify/sort
>> | region_leave | r0  |  6.950595 |  5.140792 | midx       | label:verify/sort
>> | region_enter | r0  |  6.950651 |           | midx       | label:verify/offsets
>> | region_leave | r0  | 24.059736 | 17.109085 | midx       | label:verify/offsets
>> | exit         |     | 24.101434 |           |            | code:0
>
> So just adding a delay progress block by itself around the sort doesn't
> help much.  It just sits there for 7 seconds before the actual progress
> starts.
>
> If I add a non-delay progress block around the "verify/prepare",
> "verify/oid_order" and the "verify/offsets" loops, we get a pretty good
> experience.
>
> There is the dead time while the sort() itself is running, but at least
> there isn't a 5+ second frozen at 0% message on screen.

Yeah, the same with the commit-graph with my hack. I.e. it'll sit there,
but at least it sits like this:

    What I was doing before 100% (X/Y)
    What I'm about to start doing 0% (0/Z) [hanging]

Instead of:

    What I was doing before 100% (X/Y)
    [hanging]

So that's an improvement, i.e. you know it's started that next phase at
least instead of just having a non-descriptive hang.

Ideally there would be some way to reach into the QSORT() and display
progress there, but that's all sorts of nasty, so as the TODO comment in
commit-graph.c notes I punted it.
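
Concretely, the workaround being discussed amounts to the following (a sketch; the title string is illustrative, and as the next messages note, the 0% line only appears when the non-delayed start_progress() constructor is used):

    progress = start_progress(_("Sorting objects by packfile"), m->num_objects);
    display_progress(progress, 0);   /* show "... 0% (0/N)" before the long sort */
    QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
    display_progress(progress, m->num_objects);
    stop_progress(&progress);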


On the Git mailing list, Junio C Hamano wrote (reply to this):

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>>>> +	if (a->pack_int_id < b->pack_int_id)
>>>> +		return -1;
>>>> +	if (a->pack_int_id > b->pack_int_id)
>>>> +		return 1;
>>>> +
>>>> +	return 0;
>>>> +}
>>> ...
>>> succinct subtraction idiom of e.g. (in this case):
>>>
>>>      return b->pack_int_id - a->pack_int_id;
>> ...
>> On 43+M objects, your version is a hair faster, so I might
>> as well take it instead.
>
> Cool!

Yup, following a well-known short idiom lets readers' eyes coast
over without forcing them to make sure the two if statements have
comparison going in the right direction and return the constant with
the right sign, etc.  Even if the idiomatic version weren't faster,
it is worth using it.

Thanks.


On the Git mailing list, Jeff Hostetler wrote (reply to this):



On 3/18/2019 6:02 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Mon, Mar 18 2019, Jeff Hostetler wrote:
> 
>> On 3/18/2019 11:53 AM, Ævar Arnfjörð Bjarmason wrote:
>>>
>>> On Mon, Mar 18 2019, Jeff Hostetler via GitGitGadget wrote:
>>>
[...]
>>>> +	QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
>>>> +
>>>> +	for (k = 0; k < m->num_objects; k++) {
>>>> [...]
>>>
>>> I have not tested this (or midx in general), but isn't this new QSORT()
>>> introducing the same sort of progress stalling that I fixed for
>>> commit-graph in 890226ccb57 ("commit-graph write: add itermediate
>>> progress", 2019-01-19)? I.e. something you can work around with a
>>> "display_progress(progress, 0)" before the QSORT().
>>>
>>
>> I wasn't tracking your commit-graph changes, but yes, I think it is.
>>
[...]
>>
>> There is the dead time while the sort() itself is running, but at least
>> there isn't a 5+ second frozen at 0% message on screen.
> 
> Yeah, the same with the commit-graph with my hack. I.e. it'll sit there,
> but at least it sits like this:
> 
>      What I was doing before 100% (X/Y)
>      What I'm about to start doing 0% (0/Z) [hanging]
> 
> Instead of:
> 
>      What I was doing before 100% (X/Y)
>      [hanging]
> 
> So that's an improvement, i.e. you know it's started that next phase at
> least instead of just having a non-descriptive hang.
> 
> Ideally there would be some way to reach into the QSORT() and display
> progress there, but that's all sorts of nasty, so as the TODO comment in
> commit-graph.c notes I punted it.

Perhaps I'm confused or this is a Windows issue, but when I do:

	progress = start_delayed_progress("sorting", n);
	display_progress(progress, 0);
	QSORT(...);
	stop_progress(&progress);

I never see the 0% message.  It always does the hang with the cursor in
column 0 on a blank line.  If I make this a regular start_progress(),
I do see the 0% message for the duration of the qsort hang.

I did some similar testing around your QSORT() in commit-graph.c
and got the same result.  It looks like start_delayed_progress()
wants to wait at least 2 seconds before displaying anything and has
an interval timer to notify it that another message should be printed,
but the display_progress(0) prior to the QSORT() arrives before the 2
seconds are up and so nothing is printed.  It's not until we get into
the loop below the QSORT that one of the display_progress(i+1) calls
could cause a message to appear.

Right?
Jeff







On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Tue, Mar 19 2019, Jeff Hostetler wrote:

>> On 3/18/2019 6:02 PM, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Mon, Mar 18 2019, Jeff Hostetler wrote:
>>
>>> On 3/18/2019 11:53 AM, Ævar Arnfjörð Bjarmason wrote:
>>>>
>>>> On Mon, Mar 18 2019, Jeff Hostetler via GitGitGadget wrote:
>>>>
> [...]
>>>>> +	QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
>>>>> +
>>>>> +	for (k = 0; k < m->num_objects; k++) {
>>>>> [...]
>>>>
>>>> I have not tested this (or midx in general), but isn't this new QSORT()
>>>> introducing the same sort of progress stalling that I fixed for
>>>> commit-graph in 890226ccb57 ("commit-graph write: add itermediate
>>>> progress", 2019-01-19)? I.e. something you can work around with a
>>>> "display_progress(progress, 0)" before the QSORT().
>>>>
>>>
>>> I wasn't tracking your commit-graph changes, but yes, I think it is.
>>>
> [...]
>>>
>>> There is the dead time while the sort() itself is running, but at least
>>> there isn't a 5+ second frozen at 0% message on screen.
>>
>> Yeah, the same with the commit-graph with my hack. I.e. it'll sit there,
>> but at least it sits like this:
>>
>>      What I was doing before 100% (X/Y)
>>      What I'm about to start doing 0% (0/Z) [hanging]
>>
>> Instead of:
>>
>>      What I was doing before 100% (X/Y)
>>      [hanging]
>>
>> So that's an improvement, i.e. you know it's started that next phase at
>> least instead of just having a non-descriptive hang.
>>
>> Ideally there would be some way to reach into the QSORT() and display
>> progress there, but that's all sorts of nasty, so as the TODO comment in
>> commit-graph.c notes I punted it.
>
> Perhaps I'm confused or this is a Windows issue, but when I do:
>
> 	progress = start_delayed_progress("sorting", n);
> 	display_progress(progress, 0);
> 	QSORT(...);
> 	stop_progress(&progress);
>
> I never see the 0% message.  It always does the hang with the cursor in
> column 0 on a blank line.  If I make this a regular start_progress(),
> I do see the 0% message for the duration of the qsort hang.
>
> I did some similar testing around your QSORT() in commit-graph.c
> and got the same result.  It looks like start_delayed_progress()
> wants to wait at least 2 seconds before displaying anything and has
> an interval timer to notify it that another message should be printed,
> but the display_progress(0) prior to the QSORT() arrives before the 2
> seconds are up and so nothing is printed.  It's not until we get into
> the loop below the QSORT that one of the display_progress(i+1) calls
> could cause a message to appear.

Correct, it'll still hang the N seconds that start_delayed_progress()
imposes.

In the commit-graph code this would sometimes take longer than that, so
I'd see the 0% earlier.

But maybe in this case even on humongous repositories it's faster than
that.

@jeffhostetler jeffhostetler force-pushed the upstream-midx-verify branch 2 times, most recently from cbe5fe6 to e2dd999 on March 19, 2019 11:48
@jeffhostetler
Author

/submit

@gitgitgadget

gitgitgadget bot commented Mar 19, 2019

Submitted as pull.166.v2.git.gitgitgadget@gmail.com

@@ -34,6 +34,7 @@ struct progress {
uint64_t total;

On the Git mailing list, Eric Sunshine wrote (reply to this):

On Tue, Mar 19, 2019 at 10:37 AM Jeff Hostetler via GitGitGadget
<gitgitgadget@gmail.com> wrote:
> Add new start_sparse_progress() and start_delayed_sparse_progress()
> constructors and "sparse" flag to struct progress.
>
> Teach stop_progress() to force a 100% complete progress message before
> printing the final "done" message when "sparse" is set.
>
> Calling display_progress() for every item in a large set can
> be expensive.  If callers try to filter this for performance
> reasons, such as emitting every k-th item, progress would
> not reach 100% unless they made a final call to display_progress()
> with the item count before calling stop_progress().
>
> Now this is automatic when "sparse" is set.
>
> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
> ---
> diff --git a/progress.c b/progress.c
> +/*
> + * Here "sparse" means that the caller might use some sampling criteria to
> + * decide when to call display_progress() rather than calling it for every
> + * integer value in[0 .. total).  In particular, the caller might not call
> + * display_progress() for the last value in the range.
> + *
> + * When "sparse" is set, stop_progress() will automatically force the done
> + * message to show 100%.
> + */
> +static void finish_if_sparse(struct progress **p_progress)
> +{
> +       struct progress *progress = *p_progress;
> +
> +       if (progress &&
> +           progress->sparse &&
> +           progress->last_value != progress->total)
> +               display_progress(progress, progress->total);
>  }

There's no reason for this function to take a 'struct progress **'
rather than the simpler 'struct progress *', and doing so just
confuses the reader.

>  void stop_progress(struct progress **p_progress)
>  {
> +       finish_if_sparse(p_progress);
> +
>         stop_progress_msg(p_progress, _("done"));
>  }

Rather than adding a new "sparse" mode, I wonder if this entire
concept can be boiled down to a single new function:

    void finish_progress(struct progress **p_progress, const char *msg)
    {
        struct progress *progress = *p_progress;
        if (progress && progress->last_value != progress->total)
            display_progress(progress, progress->total);
        if (msg)
            stop_progress_msg(p_progress, msg);
        else
            stop_progress(p_progress);
    }

without having to make any other changes to the implementation or add
a new field to the structure. It would mean, though, that the caller
would have to remember to invoke finish_progress() rather than
stop_progress(). Which leads one to wonder why this functionality is
needed at all since it's easy enough for a caller to make the one
extra call to simulate this:

    /* client code */
    if (progress)
        display_progress(progress, progress->total);
    stop_progress(&progress);


On the Git mailing list, Jeff Hostetler wrote (reply to this):



On 3/19/2019 12:42 PM, Eric Sunshine wrote:
> On Tue, Mar 19, 2019 at 10:37 AM Jeff Hostetler via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> Add new start_sparse_progress() and start_delayed_sparse_progress()
>> constructors and "sparse" flag to struct progress.
>>
>> Teach stop_progress() to force a 100% complete progress message before
>> printing the final "done" message when "sparse" is set.
>>
>> Calling display_progress() for every item in a large set can
>> be expensive.  If callers try to filter this for performance
>> reasons, such as emitting every k-th item, progress would
>> not reach 100% unless they made a final call to display_progress()
>> with the item count before calling stop_progress().
>>
>> Now this is automatic when "sparse" is set.
>>
>> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
>> ---
>> diff --git a/progress.c b/progress.c
>> +/*
>> + * Here "sparse" means that the caller might use some sampling criteria to
>> + * decide when to call display_progress() rather than calling it for every
>> + * integer value in[0 .. total).  In particular, the caller might not call
>> + * display_progress() for the last value in the range.
>> + *
>> + * When "sparse" is set, stop_progress() will automatically force the done
>> + * message to show 100%.
>> + */
>> +static void finish_if_sparse(struct progress **p_progress)
>> +{
>> +       struct progress *progress = *p_progress;
>> +
>> +       if (progress &&
>> +           progress->sparse &&
>> +           progress->last_value != progress->total)
>> +               display_progress(progress, progress->total);
>>   }
> 
> There's no reason for this function to take a 'struct progress **'
> rather than the simpler 'struct progress *', and doing so just
> confuses the reader.

I was just trying to match the existing API in the stop_progress()
and stop_progress_msg() routines.  But yes, I can simplify this.

> 
>>   void stop_progress(struct progress **p_progress)
>>   {
>> +       finish_if_sparse(p_progress);
>> +
>>          stop_progress_msg(p_progress, _("done"));
>>   }
> 
> Rather than adding a new "sparse" mode, I wonder if this entire
> concept can be boiled down to a single new function:
> 
>      void finish_progress(struct progress **p_progress, const char *msg)
>      {
>          struct progress *progress = *p_progress;
>          if (progress && progress->last_value != progress->total)
>              display_progress(progress, progress->total);
>          if (msg)
>              stop_progress_msg(p_progress, msg);
>          else
>              stop_progress(p_progress);
>      }
> 
> without having to make any other changes to the implementation or add
> a new field to the structure.

The existing model has start_progress() and start_delayed_progress().
I was following that model and added the new option at the start.
I'm not planning on adding any additional flags, but if we had some
we'd want them available on the startup next to this one.


>                                It would mean, though, that the caller
> would have to remember to invoke finish_progress() rather than
> stop_progress().

Right, I was trying to isolate the change to the setup, so that loop
bottoms and any in-loop return points don't all have to worry about
whether to call stop_ or finish_.


>                   Which leads one to wonder why this functionality is
> needed at all since it's easy enough for a caller to make the one
> extra call to simulate this:
> 
>      /* client code */
>      if (progress)
>          display_progress(progress, progress->total);
>      stop_progress(&progress);
> 

"struct progress" is private and I didn't want to publish it just for
this.  And again as with the finish_ remarks, that leads to more
places that would need to be updated or maintained.

And you're right, callers could always just call:
	x = ...whatever...
	progress = start_progress(..., x);
	...loop or whatever...
	display_progress(progress, x);
	stop_progress(...);

but that puts the burden on the caller to keep a copy of "x"
that matches the original value so that we get the 100% message.
I was doing that for a while, but after 3 or 4 progress loops,
I found myself wanting the progress routines to handle that
bookkeeping for me.

Jeff


On the Git mailing list, Eric Sunshine wrote (reply to this):

On Tue, Mar 19, 2019 at 2:33 PM Jeff Hostetler <git@jeffhostetler.com> wrote:
> On 3/19/2019 12:42 PM, Eric Sunshine wrote:
> > Rather than adding a new "sparse" mode, I wonder if this entire
> > concept can be boiled down to a single new function:
> >
> >      void finish_progress(struct progress **p_progress, const char *msg)
> >      {
> >          [...]
> >      }
> >
> > without having to make any other changes to the implementation or add
> > a new field to the structure.
>
> The existing model has start_progress() and start_delayed_progress().
> I was following that model and added the new option at the start.
> I'm not planning on adding any additional flags, but if we had some
> we'd want them available on the startup next to this one.

Perhaps it makes sense to just take a 'flags' argument instead of
'sparse' so we don't have to worry about this going forward. In other
words:

    #define PROGRESS_DELAYED (1 << 0)
    #define PROGRESS_SPARSE (1 << 1)

    struct progress *start_progress_x(const char *title, uint64_t total,
        unsigned flags);

which covers "delayed" start and "sparse". ("_x" is a placeholder
since I can't think of a good name).

> >                                It would mean, though, that the caller
> > would have to remember to invoke finish_progress() rather than
> > stop_progress().
>
> Right, I was trying to isolate the change to the setup, so that loop
> bottoms and any in-loop return points don't all have to worry about
> whether to call stop_ or finish_.

Yes, I understand the benefit.

Anyhow, my comments are akin to bikeshedding, thus not necessarily actionable.

@@ -8,6 +8,7 @@
#include "sha1-lookup.h"

On the Git mailing list, Jeff Hostetler wrote (reply to this):



On 3/19/2019 10:37 AM, Jeff Hostetler via GitGitGadget wrote:
> From: Jeff Hostetler <jeffhost@microsoft.com>
> 
> Fix "git multi-pack-index verify" to handle repos with thousands
> of packfiles.
> 
> Midx verify adds the individual "packed_git" structures to the
> multi_pack_index.packs array, but it does not add them to the
> "repository.objects.packed_git" list.  During the verification
> code, each packfile is opened and scanned.  And "pack_open_fds"
> is incremented.  If "pack_open_fds" equals the "pack_max_fds"
> open_packed_git_1() calls close_one_pack() to LRU-style close
> an already open packfile.  But because the packfiles were never
> added to the "packed_git" list, close_one_pack() does nothing.
> If there are very many packfiles, Git runs out of file descriptors
> and fails.
> 
> Note that this was observed on Windows when built with GCC and
> in a repository with more than (2048-25) packfiles.
> 
> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
> ---
>   midx.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/midx.c b/midx.c
> index 24141a7c62..b2f33bcd90 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -975,6 +975,9 @@ int verify_midx_file(const char *object_dir)
>   	for (i = 0; i < m->num_packs; i++) {
>   		if (prepare_midx_pack(m, i))
>   			midx_report("failed to load pack in position %d", i);
> +
> +		if (m->packs[i])
> +			install_packed_git(the_repository, m->packs[i]);
>   	}
>   
>   	for (i = 0; i < 255; i++) {
> 

I'd like to drop this commit from this series.

I talked with Stolee offline about this.  It does
address the problem in this one instance, but it leads
us to think about other places where there may be a
similar problem.

Also, the sort by packfile in the next commit in this
series means we'll only have 1 packfile open at a time
and so this commit isn't needed in this particular case.

Thanks,
Jeff

Add new start_sparse_progress() and start_delayed_sparse_progress()
constructors and "sparse" flag to struct progress.

Teach stop_progress() to force a 100% complete progress message before
printing the final "done" message when "sparse" is set.

Calling display_progress() for every item in a large set can
be expensive.  If callers try to filter this for performance
reasons, such as emitting every k-th item, progress would
not reach 100% unless they made a final call to display_progress()
with the item count before calling stop_progress().

Now this is automatic when "sparse" is set.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
Log multi-pack-index command mode.
Log number of objects and packfiles in the midx.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
Add progress indicators to more parts of midx verify.
Use sparse progress indicator for object iteration.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
Teach `multi-pack-index verify` to sort the set of objects by
packfile so that only one packfile needs to be open at a time.

This is a performance improvement.  Previously, objects were
verified in OID order.  That essentially required all packfiles
to be open at the same time.  If the number of packfiles exceeded
the open file limit, packfiles would be LRU-closed and re-opened
many times.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
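
A minimal caller sketch of the sparse-progress API described in the first commit message above (assuming start_sparse_progress() takes the same title/total arguments as start_progress(); the title string, sampling mask, and per-object helper are illustrative):

    struct progress *progress;
    uint32_t i;

    progress = start_sparse_progress(_("Verifying object offsets"), m->num_objects);
    for (i = 0; i < m->num_objects; i++) {
        /* sample: only update the display every 64K objects */
        if (!(i & 0xFFFF))
            display_progress(progress, i);
        verify_one_object(m, i); /* hypothetical per-object check */
    }
    /* because the progress is "sparse", stop_progress() forces a final
     * 100% line even though the loop never reported the last value */
    stop_progress(&progress);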
@jeffhostetler
Author

/submit

@gitgitgadget

gitgitgadget bot commented Mar 21, 2019

Submitted as pull.166.v3.git.gitgitgadget@gmail.com

@gitgitgadget

gitgitgadget bot commented Mar 22, 2019

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Jeff Hostetler via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Teach "multi-pack-index verify" to handle cases where the number of
> packfiles exceeds the open file handle limit.
>
> The first commit fixes a problem that prevented the LRU-style
> close_one_pack() mechanism from working which caused midx verify to run out
> of file descriptors.
>
> The second commit teaches midx verify to sort the set of objects to verify
> by packfile rather than verifying them in OID order. This eliminates the
> need to have more than one packfile/idx open at the same time.
>
> With the second commit, runtime on 3600 packfiles went from 12 minutes to 25
> seconds.

These references to the first and second commit might have become
stale across iterations, but logically it makes sense---the first
point is about correctness (i.e. do not die by running out of fds)
and the second one is about usable performance.

But in this round (possibly in the previous one, too?) the "group
objects by packfile" one addresses both points?

@gitgitgadget

gitgitgadget bot commented Mar 24, 2019

This branch is now known as jh/midx-verify-too-many-packs.

@gitgitgadget

gitgitgadget bot commented Mar 24, 2019

This patch series was integrated into pu via git@0b60bdc.

@gitgitgadget gitgitgadget bot added the pu label Mar 24, 2019
@gitgitgadget

gitgitgadget bot commented Mar 25, 2019

On the Git mailing list, Jeff Hostetler wrote (reply to this):



On 3/22/2019 1:37 AM, Junio C Hamano wrote:
> "Jeff Hostetler via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> Teach "multi-pack-index verify" to handle cases where the number of
>> packfiles exceeds the open file handle limit.
>>
>> The first commit fixes a problem that prevented the LRU-style
>> close_one_pack() mechanism from working which caused midx verify to run out
>> of file descriptors.
>>
>> The second commit teaches midx verify to sort the set of objects to verify
>> by packfile rather than verifying them in OID order. This eliminates the
>> need to have more than one packfile/idx open at the same time.
>>
>> With the second commit, runtime on 3600 packfiles went from 12 minutes to 25
>> seconds.
> 
> These references to the first and second commit might have become
> stale across iterations, but logically it makes sense---the first
> point is about correctness (i.e. do not die by running out of fds)
> and the second one is about usable performance.
> 
> But in this round (possibly in the previous one, too?) the "group
> objects by packfile" one addresses both points?
> 


Sorry, I forgot to remove the stale content in the cover letter for
the V3 version.

This version has just the sort-by-packfile commit.  Because it only
keeps 1 packfile open at a time, it does not need the change that adds
packfiles to the packed-git list, since it never triggers the
close_one_pack() problem.

We suspect there are other places (not-yet-observed) where the design of
the all-packs and packed-git lists will lead to similar fd exhaustion
errors and want to fix it properly in the packfile and/or midx code.
We'll address this potential problem in a future patch series.

Jeff

@gitgitgadget

gitgitgadget bot commented Apr 1, 2019

This patch series was integrated into pu via git@69f7050.

@gitgitgadget

gitgitgadget bot commented Apr 2, 2019

This patch series was integrated into pu via git@6181f76.

@gitgitgadget

gitgitgadget bot commented Apr 3, 2019

This patch series was integrated into pu via git@490375d.

@gitgitgadget

gitgitgadget bot commented Apr 4, 2019

This patch series was integrated into pu via git@821b5ff.

@gitgitgadget

gitgitgadget bot commented Apr 10, 2019

This patch series was integrated into pu via git@48ea219.

@gitgitgadget

gitgitgadget bot commented Apr 10, 2019

This patch series was integrated into next via git@2ac6c93.

@gitgitgadget gitgitgadget bot added the next label Apr 10, 2019
@gitgitgadget

gitgitgadget bot commented Apr 16, 2019

This patch series was integrated into pu via git@ec08c4f.

@gitgitgadget

gitgitgadget bot commented Apr 16, 2019

This patch series was integrated into next via git@ec08c4f.

@gitgitgadget

gitgitgadget bot commented Apr 16, 2019

This patch series was integrated into master via git@ec08c4f.

@gitgitgadget gitgitgadget bot added the master label Apr 16, 2019
@gitgitgadget gitgitgadget bot closed this Apr 16, 2019
@gitgitgadget

gitgitgadget bot commented Apr 16, 2019

Closed via ec08c4f.

@jeffhostetler jeffhostetler deleted the upstream-midx-verify branch July 1, 2019 18:33