
multi-pack-index: fix verify on large repos #166

Closed

Conversation


@jeffhostetler jeffhostetler commented Mar 15, 2019

Version 3 drops the packed-git commit I mentioned earlier, simplifies the
finish_if_sparse API as suggested by Eric, and splits the object sort commit
into 2 commits: one to add the additional progress indicators and one to
sort the objects by packfile. This makes it easier to follow.


Version 2 addresses progress-related concerns raised in the previous version
of the midx verify code.

This version also extends the existing progress.[ch] code and adds a "sparse"
mode that automatically ensures the 100% message is issued.


Teach "multi-pack-index verify" to handle cases where the number of
packfiles exceeds the open file handle limit.

The first commit fixes a problem that prevented the LRU-style close_one_pack()
mechanism from working which caused midx verify to run out of file descriptors.

The second commit teaches midx verify to sort the set of objects to verify
by packfile rather than verifying them in OID order. This eliminates the
need to have more than one packfile/idx open at the same time.

With the second commit, runtime on 3600 packfiles went from 12 minutes
to 25 seconds.

Thanks,
Jeff

Cc: dstolee@microsoft.com
Cc: avarab@gmail.com
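
For readers skimming this cover letter, a rough sketch of the grouping approach described above (not the verbatim patch; the struct, the ALLOC_ARRAY/QSORT macros, and nth_midxed_pack_int_id() follow the hunks quoted later in this thread, while close_previous_pack() and verify_object_offset() are hypothetical stand-ins for the loop body):

    /* Sketch only; assumes Git's internal headers (git-compat-util.h, midx.h). */
    struct pair_pos_vs_id {
        uint32_t pos;          /* object position in the midx */
        uint32_t pack_int_id;  /* pack that contains the object */
    };

    ALLOC_ARRAY(pairs, m->num_objects);
    for (i = 0; i < m->num_objects; i++) {
        pairs[i].pos = i;
        pairs[i].pack_int_id = nth_midxed_pack_int_id(m, i);
    }
    QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);

    for (i = 0; i < m->num_objects; i++) {
        /*
         * Crossing into a new pack: the previous pack can be closed,
         * so at most one packfile/idx pair stays open at a time.
         */
        if (i && pairs[i - 1].pack_int_id != pairs[i].pack_int_id)
            close_previous_pack(m, pairs[i - 1].pack_int_id); /* hypothetical helper */

        verify_object_offset(m, pairs[i].pos); /* hypothetical per-object check */
    }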

@jeffhostetler jeffhostetler changed the title from "Improve performance of multi-pack-index verify on large repos" to "multi-pack-index: fix verify on large repos" on Mar 18, 2019
@jeffhostetler
Author

/submit

@gitgitgadget

gitgitgadget bot commented Mar 18, 2019

Submitted as pull.166.git.gitgitgadget@gmail.com

@@ -958,9 +958,29 @@ static void midx_report(const char *fmt, ...)
va_end(ap);

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Mon, Mar 18 2019, Jeff Hostetler via GitGitGadget wrote:

> +static int compare_pair_pos_vs_id(const void *_a, const void *_b)
> +{
> +	struct pair_pos_vs_id *a = (struct pair_pos_vs_id *)_a;
> +	struct pair_pos_vs_id *b = (struct pair_pos_vs_id *)_b;
> +
> +	if (a->pack_int_id < b->pack_int_id)
> +		return -1;
> +	if (a->pack_int_id > b->pack_int_id)
> +		return 1;
> +
> +	return 0;
> +}

Not a suggestion for a change, just a note that this sent me down the
rabbit hole of looking at the different idioms we use for QSORT() in
different places. Some use this form, some a ternary nest, and some the
succinct subtraction idiom of e.g. (in this case):

    return b->pack_int_id - a->pack_int_id;

> +
>  int verify_midx_file(const char *object_dir)
>  {
> -	uint32_t i;
> +	struct pair_pos_vs_id *pairs = NULL;
> +	uint32_t i, k;
>  	struct progress *progress;
>  	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
>  	verify_midx_error = 0;
> @@ -997,15 +1017,36 @@ int verify_midx_file(const char *object_dir)
>  	}
>
>  	progress = start_progress(_("Verifying object offsets"), m->num_objects);
> +
> +	/*
> +	 * Create an array mapping each object to its packfile id.  Sort it
> +	 * to group the objects by packfile.  Use this permutation to visit
> +	 * each of the objects and only require 1 packfile to be open at a
> +	 * time.
> +	 */
> +	ALLOC_ARRAY(pairs, m->num_objects);
>  	for (i = 0; i < m->num_objects; i++) {
> +		pairs[i].pos = i;
> +		pairs[i].pack_int_id = nth_midxed_pack_int_id(m, i);
> +	}
> +	QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
> +
> +	for (k = 0; k < m->num_objects; k++) {
> [...]

I have not tested this (or midx in general), but isn't this new QSORT()
introducing the same sort of progress stalling that I fixed for
commit-graph in 890226ccb57 ("commit-graph write: add itermediate
progress", 2019-01-19)? I.e. something you can work around with a
"display_progress(progress, 0)" before the QSORT().


On the Git mailing list, Jeff Hostetler wrote (reply to this):



On 3/18/2019 11:53 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Mon, Mar 18 2019, Jeff Hostetler via GitGitGadget wrote:
> 
>> +static int compare_pair_pos_vs_id(const void *_a, const void *_b)
>> +{
>> +	struct pair_pos_vs_id *a = (struct pair_pos_vs_id *)_a;
>> +	struct pair_pos_vs_id *b = (struct pair_pos_vs_id *)_b;
>> +
>> +	if (a->pack_int_id < b->pack_int_id)
>> +		return -1;
>> +	if (a->pack_int_id > b->pack_int_id)
>> +		return 1;
>> +
>> +	return 0;
>> +}
> 
> Not a suggestion for a change, just a note that this sent me down the
> rabbit hole of looking at the different idioms we use for QSORT() in
> different places. Some use this form, some a ternary nest, and some the
> succinct subtraction idiom of e.g. (in this case):
> 
>      return b->pack_int_id - a->pack_int_id;

Yeah, I'm not sure which way is better or worse here.
An earlier draft of this function sorted by packfile id
and then by OID (thinking we might benefit from some
locality later when we do the verify), hence the independent
if statements.  But it didn't help, so I removed the other
lines.

On 43+M objects, your version is a hair faster, so I might
as well take it instead.

> 
>> +
>>   int verify_midx_file(const char *object_dir)
>>   {
>> -	uint32_t i;
>> +	struct pair_pos_vs_id *pairs = NULL;
>> +	uint32_t i, k;
>>   	struct progress *progress;
>>   	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
>>   	verify_midx_error = 0;
>> @@ -997,15 +1017,36 @@ int verify_midx_file(const char *object_dir)
>>   	}
>>
>>   	progress = start_progress(_("Verifying object offsets"), m->num_objects);
>> +
>> +	/*
>> +	 * Create an array mapping each object to its packfile id.  Sort it
>> +	 * to group the objects by packfile.  Use this permutation to visit
>> +	 * each of the objects and only require 1 packfile to be open at a
>> +	 * time.
>> +	 */
>> +	ALLOC_ARRAY(pairs, m->num_objects);
>>   	for (i = 0; i < m->num_objects; i++) {
>> +		pairs[i].pos = i;
>> +		pairs[i].pack_int_id = nth_midxed_pack_int_id(m, i);
>> +	}
>> +	QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
>> +
>> +	for (k = 0; k < m->num_objects; k++) {
>> [...]
> 
> I have not tested this (or midx in general), but isn't this new QSORT()
> introducing the same sort of progress stalling that I fixed for
> commit-graph in 890226ccb57 ("commit-graph write: add itermediate
> progress", 2019-01-19)? I.e. something you can work around with a
> "display_progress(progress, 0)" before the QSORT().
> 

I wasn't tracking your commit-graph changes, but yes, I think it is.

Tinkering with how to display progress, I found a couple of problems.
On my 3599 packfile, 43M object example, QSORT() takes about 5 seconds.
But there's about 2 seconds of setup before the sort starts.  The final
verify loop takes about 17 seconds.

Here I put trace2 regions around the main loops and used the
GIT_TR2_PERF stream.

> | cmd_name     |     |           |           |            | multi-pack-index (multi-pack-index)
> | cmd_mode     |     |           |           |            | verify
> | data         | r0  |  0.031295 |  0.031295 | midx       | load/num_packs:3599
> | data         | r0  |  0.031330 |  0.031330 | midx       | load/num_objects:42704807
> | region_enter | r0  |  0.031352 |           | midx       | label:verify/prepare 
> | region_leave | r0  |  0.626547 |  0.595195 | midx       | label:verify/prepare 
> | region_enter | r0  |  0.626602 |           | midx       | label:verify/oid_order 
> | region_leave | r0  |  1.570195 |  0.943593 | midx       | label:verify/oid_order 
> | region_enter | r0  |  1.570253 |           | midx       | label:verify/sort_setup 
> | region_leave | r0  |  1.809723 |  0.239470 | midx       | label:verify/sort_setup 
> | region_enter | r0  |  1.809803 |           | midx       | label:verify/sort 
> | region_leave | r0  |  6.950595 |  5.140792 | midx       | label:verify/sort 
> | region_enter | r0  |  6.950651 |           | midx       | label:verify/offsets 
> | region_leave | r0  | 24.059736 | 17.109085 | midx       | label:verify/offsets 
> | exit         |     | 24.101434 |           |            | code:0

So just adding a delay progress block by itself around the sort doesn't
help much.  It just sits there for 7 seconds before the actual progress
starts.

If I add a non-delay progress block around the "verify/prepare",
"verify/oid_order" and the "verify/offsets" loops, we get a pretty good
experience.

There is the dead time while the sort() itself is running, but at least
there isn't a 5+ second frozen at 0% message on screen.

I'll re-roll shortly.

Thanks,
Jeff
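
For readers unfamiliar with the trace2 regions mentioned above, the instrumentation is roughly the following (a sketch using Git's trace2.h API; the "midx" category and "verify/*" labels follow the log quoted above, and the "data" rows come from trace2_data_intmax() calls made when the midx is loaded):

    /* bracket a phase so its enter/leave timestamps land in the GIT_TR2_PERF stream */
    trace2_region_enter("midx", "verify/sort", the_repository);
    QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
    trace2_region_leave("midx", "verify/sort", the_repository);

    /* the load/num_packs and load/num_objects rows are emitted the same way */
    trace2_data_intmax("midx", the_repository, "load/num_packs", m->num_packs);
    trace2_data_intmax("midx", the_repository, "load/num_objects", m->num_objects);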


On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Mon, Mar 18 2019, Jeff Hostetler wrote:

>> On 3/18/2019 11:53 AM, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Mon, Mar 18 2019, Jeff Hostetler via GitGitGadget wrote:
>>
>>> +static int compare_pair_pos_vs_id(const void *_a, const void *_b)
>>> +{
>>> +	struct pair_pos_vs_id *a = (struct pair_pos_vs_id *)_a;
>>> +	struct pair_pos_vs_id *b = (struct pair_pos_vs_id *)_b;
>>> +
>>> +	if (a->pack_int_id < b->pack_int_id)
>>> +		return -1;
>>> +	if (a->pack_int_id > b->pack_int_id)
>>> +		return 1;
>>> +
>>> +	return 0;
>>> +}
>>
>> Not a suggestion for a change, just a note that this sent me down the
>> rabbit hole of looking at the different idioms we use for QSORT() in
>> different places. Some use this form, some a ternary nest, and some the
>> succinct subtraction idiom of e.g. (in this case):
>>
>>      return b->pack_int_id - a->pack_int_id;
>
> Yeah, I'm not sure which way is better or worse here.
> An earlier draft of this function sorted by packfile id
> and then by OID (thinking we might benefit from some
> locality later when we do the verify), hence the independent
> if statements.  But it didn't help, so I removed the other
> lines.
>
> On 43+M objects, your version is a hair faster, so I might
> as well take it instead.

Cool!

>>
>>> +
>>>   int verify_midx_file(const char *object_dir)
>>>   {
>>> -	uint32_t i;
>>> +	struct pair_pos_vs_id *pairs = NULL;
>>> +	uint32_t i, k;
>>>   	struct progress *progress;
>>>   	struct multi_pack_index *m = load_multi_pack_index(object_dir, 1);
>>>   	verify_midx_error = 0;
>>> @@ -997,15 +1017,36 @@ int verify_midx_file(const char *object_dir)
>>>   	}
>>>
>>>   	progress = start_progress(_("Verifying object offsets"), m->num_objects);
>>> +
>>> +	/*
>>> +	 * Create an array mapping each object to its packfile id.  Sort it
>>> +	 * to group the objects by packfile.  Use this permutation to visit
>>> +	 * each of the objects and only require 1 packfile to be open at a
>>> +	 * time.
>>> +	 */
>>> +	ALLOC_ARRAY(pairs, m->num_objects);
>>>   	for (i = 0; i < m->num_objects; i++) {
>>> +		pairs[i].pos = i;
>>> +		pairs[i].pack_int_id = nth_midxed_pack_int_id(m, i);
>>> +	}
>>> +	QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
>>> +
>>> +	for (k = 0; k < m->num_objects; k++) {
>>> [...]
>>
>> I have not tested this (or midx in general), but isn't this new QSORT()
>> introducing the same sort of progress stalling that I fixed for
>> commit-graph in 890226ccb57 ("commit-graph write: add itermediate
>> progress", 2019-01-19)? I.e. something you can work around with a
>> "display_progress(progress, 0)" before the QSORT().
>>
>
> I wasn't tracking your commit-graph changes, but yes, I think it is.
>
> Tinkering with how to display progress, I found a couple of problems.
> On my 3599 packfile, 43M object example, QSORT() takes about 5 seconds.
> But there's about 2 seconds of setup before the sort starts.  The final
> verify loop takes about 17 seconds.
>
> Here I put trace2 regions around the main loops and used the
> GIT_TR2_PERF stream.
>
>> | cmd_name     |     |           |           |            | multi-pack-index (multi-pack-index)
>> | cmd_mode     |     |           |           |            | verify
>> | data         | r0  |  0.031295 |  0.031295 | midx       | load/num_packs:3599
>> | data         | r0  |  0.031330 |  0.031330 | midx       | load/num_objects:42704807
>> | region_enter | r0  |  0.031352 |           | midx       | label:verify/prepare
>> | region_leave | r0  |  0.626547 |  0.595195 | midx       | label:verify/prepare
>> | region_enter | r0  |  0.626602 |           | midx       | label:verify/oid_order
>> | region_leave | r0  |  1.570195 |  0.943593 | midx       | label:verify/oid_order
>> | region_enter | r0  |  1.570253 |           | midx       | label:verify/sort_setup
>> | region_leave | r0  |  1.809723 |  0.239470 | midx       | label:verify/sort_setup
>> | region_enter | r0  |  1.809803 |           | midx       | label:verify/sort
>> | region_leave | r0  |  6.950595 |  5.140792 | midx       | label:verify/sort
>> | region_enter | r0  |  6.950651 |           | midx       | label:verify/offsets
>> | region_leave | r0  | 24.059736 | 17.109085 | midx       | label:verify/offsets
>> | exit         |     | 24.101434 |           |            | code:0
>
> So just adding a delay progress block by itself around the sort doesn't
> help much.  It just sits there for 7 seconds before the actual progress
> starts.
>
> If I add a non-delay progress block around the "verify/prepare",
> "verify/oid_order" and the "verify/offsets" loops, we get a pretty good
> experience.
>
> There is the dead time while the sort() itself is running, but at least
> there isn't a 5+ second frozen at 0% message on screen.

Yeah, the same with the commit-graph with my hack. I.e. it'll sit there,
but at least it sits like this:

    What I was doing before 100% (X/Y)
    What I'm about to start doing 0% (0/Z) [hanging]

Instead of:

    What I was doing before 100% (X/Y)
    [hanging]

So that's an improvement, i.e. you know it's started that next phase at
least instead of just having a non-descriptive hang.

Ideally there would be some way to reach into the QSORT() and display
progress there, but that's all sorts of nasty, so as the TODO comment in
commit-graph.c notes I punted it.
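
Concretely, the workaround being discussed amounts to the following (a sketch; the title string is illustrative, and as the next messages note, the 0% line only appears when the non-delayed start_progress() constructor is used):

    progress = start_progress(_("Sorting objects by packfile"), m->num_objects);
    display_progress(progress, 0);   /* show "... 0% (0/N)" before the long sort */
    QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
    display_progress(progress, m->num_objects);
    stop_progress(&progress);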


On the Git mailing list, Junio C Hamano wrote (reply to this):

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>>>> +	if (a->pack_int_id < b->pack_int_id)
>>>> +		return -1;
>>>> +	if (a->pack_int_id > b->pack_int_id)
>>>> +		return 1;
>>>> +
>>>> +	return 0;
>>>> +}
>>> ...
>>> succinct subtraction idiom of e.g. (in this case):
>>>
>>>      return b->pack_int_id - a->pack_int_id;
>> ...
>> On 43+M objects, your version is a hair faster, so I might
>> as well take it instead.
>
> Cool!

Yup, following a well-known short idiom lets readers' eyes coast
over without forcing them to make sure the two if statements have
comparison going in the right direction and return the constant with
the right sign, etc.  Even if the idiomatic version weren't faster,
it is worth using it.

Thanks.


On the Git mailing list, Jeff Hostetler wrote (reply to this):



On 3/18/2019 6:02 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Mon, Mar 18 2019, Jeff Hostetler wrote:
> 
>> On 3/18/2019 11:53 AM, Ævar Arnfjörð Bjarmason wrote:
>>>
>>> On Mon, Mar 18 2019, Jeff Hostetler via GitGitGadget wrote:
>>>
[...]
>>>> +	QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
>>>> +
>>>> +	for (k = 0; k < m->num_objects; k++) {
>>>> [...]
>>>
>>> I have not tested this (or midx in general), but isn't this new QSORT()
>>> introducing the same sort of progress stalling that I fixed for
>>> commit-graph in 890226ccb57 ("commit-graph write: add itermediate
>>> progress", 2019-01-19)? I.e. something you can work around with a
>>> "display_progress(progress, 0)" before the QSORT().
>>>
>>
>> I wasn't tracking your commit-graph changes, but yes, I think it is.
>>
[...]
>>
>> There is the dead time while the sort() itself is running, but at least
>> there isn't a 5+ second frozen at 0% message on screen.
> 
> Yeah, the same with the commit-graph with my hack. I.e. it'll sit there,
> but at least it sits like this:
> 
>      What I was doing before 100% (X/Y)
>      What I'm about to start doing 0% (0/Z) [hanging]
> 
> Instead of:
> 
>      What I was doing before 100% (X/Y)
>      [hanging]
> 
> So that's an improvement, i.e. you know it's started that next phase at
> least instead of just having a non-descriptive hang.
> 
> Ideally there would be some way to reach into the QSORT() and display
> progress there, but that's all sorts of nasty, so as the TODO comment in
> commit-graph.c notes I punted it.

Perhaps I'm confused or this is a Windows issue, but when I do:

	progress = start_delayed_progress("sorting", n);
	display_progress(progress, 0);
	QSORT(...);
	stop_progress(&progress);

I never see the 0% message.  It always does the hang with the cursor in
column 0 on a blank line.  If I make this a regular start_progress(),
I do see the 0% message for the duration of the qsort hang.

I did some similar testing around your QSORT() in commit-graph.c
and got the same result.  It looks like start_delayed_progress()
wants to wait at least 2 seconds before displaying anything and has
an interval timer to notify it that another message should be printed,
but the display_progress(0) prior to the QSORT() arrives before the 2
seconds are up and so nothing is printed.  It's not until we get into
the loop below the QSORT that one of the display_progress(i+1) calls
could cause a message to appear.

Right?
Jeff







On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):


On Tue, Mar 19 2019, Jeff Hostetler wrote:

>> On 3/18/2019 6:02 PM, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Mon, Mar 18 2019, Jeff Hostetler wrote:
>>
>>> On 3/18/2019 11:53 AM, Ævar Arnfjörð Bjarmason wrote:
>>>>
>>>> On Mon, Mar 18 2019, Jeff Hostetler via GitGitGadget wrote:
>>>>
> [...]
>>>>> +	QSORT(pairs, m->num_objects, compare_pair_pos_vs_id);
>>>>> +
>>>>> +	for (k = 0; k < m->num_objects; k++) {
>>>>> [...]
>>>>
>>>> I have not tested this (or midx in general), but isn't this new QSORT()
>>>> introducing the same sort of progress stalling that I fixed for
>>>> commit-graph in 890226ccb57 ("commit-graph write: add itermediate
>>>> progress", 2019-01-19)? I.e. something you can work around with a
>>>> "display_progress(progress, 0)" before the QSORT().
>>>>
>>>
>>> I wasn't tracking your commit-graph changes, but yes, I think it is.
>>>
> [...]
>>>
>>> There is the dead time while the sort() itself is running, but at least
>>> there isn't a 5+ second frozen at 0% message on screen.
>>
>> Yeah, the same with the commit-graph with my hack. I.e. it'll sit there,
>> but at least it sits like this:
>>
>>      What I was doing before 100% (X/Y)
>>      What I'm about to start doing 0% (0/Z) [hanging]
>>
>> Instead of:
>>
>>      What I was doing before 100% (X/Y)
>>      [hanging]
>>
>> So that's an improvement, i.e. you know it's started that next phase at
>> least instead of just having a non-descriptive hang.
>>
>> Ideally there would be some way to reach into the QSORT() and display
>> progress there, but that's all sorts of nasty, so as the TODO comment in
>> commit-graph.c notes I punted it.
>
> Perhaps I'm confused or this is a Windows issue, but when I do:
>
> 	progress = start_delayed_progress("sorting", n);
> 	display_progress(progress, 0);
> 	QSORT(...);
> 	stop_progress(&progress);
>
> I never see the 0% message.  It always does the hang with the cursor in
> column 0 on a blank line.  If I make this a regular start_progress(),
> I do see the 0% message for the duration of the qsort hang.
>
> I did some similar testing around your QSORT() in commit-graph.c
> and got the same result.  It looks like start_delayed_progress()
> wants to wait at least 2 seconds before displaying anything and has
> an interval timer to notify it that another message should be printed,
> but the display_progress(0) prior to the QSORT() arrives before the 2
> seconds are up and so nothing is printed.  It's not until we get into
> the loop below the QSORT that one of the display_progress(i+1) calls
> could cause a message to appear.

Correct, it'll still hang the N seconds that start_delayed_progress()
imposes.

In the commit-graph code this would sometimes take longer than that, so
I'd see the 0% earlier.

But maybe in this case even on humongous repositories it's faster than
that.

@jeffhostetler jeffhostetler force-pushed the upstream-midx-verify branch 2 times, most recently from cbe5fe6 to e2dd999 on March 19, 2019 11:48
@jeffhostetler
Author

/submit

@gitgitgadget

gitgitgadget bot commented Mar 19, 2019

Submitted as pull.166.v2.git.gitgitgadget@gmail.com

@@ -34,6 +34,7 @@ struct progress {
uint64_t total;

On the Git mailing list, Eric Sunshine wrote (reply to this):

On Tue, Mar 19, 2019 at 10:37 AM Jeff Hostetler via GitGitGadget
<gitgitgadget@gmail.com> wrote:
> Add new start_sparse_progress() and start_delayed_sparse_progress()
> constructors and "sparse" flag to struct progress.
>
> Teach stop_progress() to force a 100% complete progress message before
> printing the final "done" message when "sparse" is set.
>
> Calling display_progress() for every item in a large set can
> be expensive.  If callers try to filter this for performance
> reasons, such as emitting every k-th item, progress would
> not reach 100% unless they made a final call to display_progress()
> with the item count before calling stop_progress().
>
> Now this is automatic when "sparse" is set.
>
> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
> ---
> diff --git a/progress.c b/progress.c
> +/*
> + * Here "sparse" means that the caller might use some sampling criteria to
> + * decide when to call display_progress() rather than calling it for every
> + * integer value in[0 .. total).  In particular, the caller might not call
> + * display_progress() for the last value in the range.
> + *
> + * When "sparse" is set, stop_progress() will automatically force the done
> + * message to show 100%.
> + */
> +static void finish_if_sparse(struct progress **p_progress)
> +{
> +       struct progress *progress = *p_progress;
> +
> +       if (progress &&
> +           progress->sparse &&
> +           progress->last_value != progress->total)
> +               display_progress(progress, progress->total);
>  }

There's no reason for this function to take a 'struct progress **'
rather than the simpler 'struct progress *', and doing so just
confuses the reader.

>  void stop_progress(struct progress **p_progress)
>  {
> +       finish_if_sparse(p_progress);
> +
>         stop_progress_msg(p_progress, _("done"));
>  }

Rather than adding a new "sparse" mode, I wonder if this entire
concept can be boiled down to a single new function:

    void finish_progress(struct progress **p_progress, const char *msg)
    {
        struct progress *progress = *p_progress;
        if (progress && progress->last_value != progress->total)
            display_progress(progress, progress->total);
        if (msg)
            stop_progress_msg(p_progress, msg);
        else
            stop_progress(p_progress);
    }

without having to make any other changes to the implementation or add
a new field to the structure. It would mean, though, that the caller
would have to remember to invoke finish_progress() rather than
stop_progress(). Which leads one to wonder why this functionality is
needed at all since it's easy enough for a caller to make the one
extra call to simulate this:

    /* client code */
    if (progress)
        display_progress(progress, progress->total);
    stop_progress(&progress);


On the Git mailing list, Jeff Hostetler wrote (reply to this):



On 3/19/2019 12:42 PM, Eric Sunshine wrote:
> On Tue, Mar 19, 2019 at 10:37 AM Jeff Hostetler via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> Add new start_sparse_progress() and start_delayed_sparse_progress()
>> constructors and "sparse" flag to struct progress.
>>
>> Teach stop_progress() to force a 100% complete progress message before
>> printing the final "done" message when "sparse" is set.
>>
>> Calling display_progress() for every item in a large set can
>> be expensive.  If callers try to filter this for performance
>> reasons, such as emitting every k-th item, progress would
>> not reach 100% unless they made a final call to display_progress()
>> with the item count before calling stop_progress().
>>
>> Now this is automatic when "sparse" is set.
>>
>> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
>> ---
>> diff --git a/progress.c b/progress.c
>> +/*
>> + * Here "sparse" means that the caller might use some sampling criteria to
>> + * decide when to call display_progress() rather than calling it for every
>> + * integer value in[0 .. total).  In particular, the caller might not call
>> + * display_progress() for the last value in the range.
>> + *
>> + * When "sparse" is set, stop_progress() will automatically force the done
>> + * message to show 100%.
>> + */
>> +static void finish_if_sparse(struct progress **p_progress)
>> +{
>> +       struct progress *progress = *p_progress;
>> +
>> +       if (progress &&
>> +           progress->sparse &&
>> +           progress->last_value != progress->total)
>> +               display_progress(progress, progress->total);
>>   }
> 
> There's no reason for this function to take a 'struct progress **'
> rather than the simpler 'struct progress *', and doing so just
> confuses the reader.

I was just trying to match the existing API in the stop_progress()
and stop_progress_msg() routines.  But yes, I can simplify this.

> 
>>   void stop_progress(struct progress **p_progress)
>>   {
>> +       finish_if_sparse(p_progress);
>> +
>>          stop_progress_msg(p_progress, _("done"));
>>   }
> 
> Rather than adding a new "sparse" mode, I wonder if this entire
> concept can be boiled down to a single new function:
> 
>      void finish_progress(struct progress **p_progress, const char *msg)
>      {
>          struct progress *progress = *p_progress;
>          if (progress && progress->last_value != progress->total)
>              display_progress(progress, progress->total);
>          if (msg)
>              stop_progress_msg(p_progress, msg);
>          else
>              stop_progress(p_progress);
>      }
> 
> without having to make any other changes to the implementation or add
> a new field to the structure.

The existing model has start_progress() and start_delayed_progress().
I was following that model and added the new option at the start.
I'm not planning on adding any additional flags, but if we had some
we'd want them available on the startup next to this one.


>                                It would mean, though, that the caller
> would have to remember to invoke finish_progress() rather than
> stop_progress().

Right, I was trying to isolate the change to the setup, so that loop
bottoms and any in-loop return points don't all have to worry about
whether to call stop_ or finish_.


>                   Which leads one to wonder why this functionality is
> needed at all since it's easy enough for a caller to make the one
> extra call to simulate this:
> 
>      /* client code */
>      if (progress)
>          display_progress(progress, progress->total);
>      stop_progress(&progress);
> 

"struct progress" is private and I didn't want to publish it just for
this.  And again as with the finish_ remarks, that leads to more
places that would need to be updated or maintained.

And you're right, callers could always just call:
	x = ...whatever...
	progress = start_progress(..., x);
	...loop or whatever...
	display_progress(progress, x);
	stop_progress(...);

but that puts the burden on the caller to keep a copy of "x"
that matches the original value so that we get the 100% message.
I was doing that for a while, but after 3 or 4 progress loops,
I found myself wanting the progress routines to handle that
bookkeeping for me.

Jeff


On the Git mailing list, Eric Sunshine wrote (reply to this):

On Tue, Mar 19, 2019 at 2:33 PM Jeff Hostetler <git@jeffhostetler.com> wrote:
> On 3/19/2019 12:42 PM, Eric Sunshine wrote:
> > Rather than adding a new "sparse" mode, I wonder if this entire
> > concept can be boiled down to a single new function:
> >
> >      void finish_progress(struct progress **p_progress, const char *msg)
> >      {
> >          [...]
> >      }
> >
> > without having to make any other changes to the implementation or add
> > a new field to the structure.
>
> The existing model has start_progress() and start_delayed_progress().
> I was following that model and added the new option at the start.
> I'm not planning on adding any additional flags, but if we had some
> we'd want them available on the startup next to this one.

Perhaps it makes sense to just take a 'flags' argument instead of
'sparse' so we don't have to worry about this going forward. In other
words:

    #define PROGRESS_DELAYED (1 << 0)
    #define PROGRESS_SPARSE (1 << 1)

    struct progress *start_progress_x(const char *title, uint64_t total,
        unsigned flags);

which covers "delayed" start and "sparse". ("_x" is a placeholder
since I can't think of a good name).

> >                                It would mean, though, that the caller
> > would have to remember to invoke finish_progress() rather than
> > stop_progress().
>
> Right, I was trying to isolate the change to the setup, so that loop
> bottoms and any in-loop return points don't all have to worry about
> whether to call stop_ or finish_.

Yes, I understand the benefit.

Anyhow, my comments are akin to bikeshedding, thus not necessarily actionable.

@@ -8,6 +8,7 @@
#include "sha1-lookup.h"

On the Git mailing list, Jeff Hostetler wrote (reply to this):



On 3/19/2019 10:37 AM, Jeff Hostetler via GitGitGadget wrote:
> From: Jeff Hostetler <jeffhost@microsoft.com>
> 
> Fix "git multi-pack-index verify" to handle repos with thousands
> of packfiles.
> 
> Midx verify adds the individual "packed_git" structures to the
> multi_pack_index.packs array, but it does not add them to the
> "repository.objects.packed_git" list.  During the verification
> code, each packfile is opened and scanned.  And "pack_open_fds"
> is incremented.  If "pack_open_fds" equals the "pack_max_fds"
> open_packed_git_1() calls close_one_pack() to LRU-style close
> an already open packfile.  But because the packfiles were never
> added to the "packed_git" list, close_one_pack() does nothing.
> If there are very many packfiles, Git runs out of file descriptors
> and fails.
> 
> Note that this was observed on Windows when built with GCC and
> in a repository with more than (2048-25) packfiles.
> 
> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
> ---
>   midx.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/midx.c b/midx.c
> index 24141a7c62..b2f33bcd90 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -975,6 +975,9 @@ int verify_midx_file(const char *object_dir)
>   	for (i = 0; i < m->num_packs; i++) {
>   		if (prepare_midx_pack(m, i))
>   			midx_report("failed to load pack in position %d", i);
> +
> +		if (m->packs[i])
> +			install_packed_git(the_repository, m->packs[i]);
>   	}
>   
>   	for (i = 0; i < 255; i++) {
> 

I'd like to drop this commit from this series.

I talked with Stolee offline about this.  It does
address the problem in this one instance, but it leads
us to think about other places where there may be a
similar problem.

Also, the sort by packfile in the next commit in this
series means we'll only have 1 packfile open at a time
and so this commit isn't needed in this particular case.

Thanks,
Jeff

Add new start_sparse_progress() and start_delayed_sparse_progress()
constructors and "sparse" flag to struct progress.

Teach stop_progress() to force a 100% complete progress message before
printing the final "done" message when "sparse" is set.

Calling display_progress() for every item in a large set can
be expensive.  If callers try to filter this for performance
reasons, such as emitting every k-th item, progress would
not reach 100% unless they made a final call to display_progress()
with the item count before calling stop_progress().

Now this is automatic when "sparse" is set.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
Log multi-pack-index command mode.
Log number of objects and packfiles in the midx.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
Add progress indicators to more parts of midx verify.
Use sparse progress indicator for object iteration.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
Teach `multi-pack-index verify` to sort the set of objects by
packfile so that only one packfile needs to be open at a time.

This is a performance improvement.  Previously, objects were
verified in OID order.  That essentially required all packfiles
to be open at the same time.  If the number of packfiles exceeded
the open file limit, packfiles would be LRU-closed and re-opened
many times.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
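
A minimal caller sketch of the sparse-progress API described in the first commit message above (assuming start_sparse_progress() takes the same title/total arguments as start_progress(); the title string, sampling mask, and per-object helper are illustrative):

    struct progress *progress;
    uint32_t i;

    progress = start_sparse_progress(_("Verifying object offsets"), m->num_objects);
    for (i = 0; i < m->num_objects; i++) {
        /* sample: only update the display every 64K objects */
        if (!(i & 0xFFFF))
            display_progress(progress, i);
        verify_one_object(m, i); /* hypothetical per-object check */
    }
    /* because the progress is "sparse", stop_progress() forces a final
     * 100% line even though the loop never reported the last value */
    stop_progress(&progress);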
@jeffhostetler
Author

/submit

@gitgitgadget

gitgitgadget bot commented Mar 21, 2019

Submitted as pull.166.v3.git.gitgitgadget@gmail.com

@gitgitgadget

gitgitgadget bot commented Mar 22, 2019

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Jeff Hostetler via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Teach "multi-pack-index verify" to handle cases where the number of
> packfiles exceeds the open file handle limit.
>
> The first commit fixes a problem that prevented the LRU-style
> close_one_pack() mechanism from working which caused midx verify to run out
> of file descriptors.
>
> The second commit teaches midx verify to sort the set of objects to verify
> by packfile rather than verifying them in OID order. This eliminates the
> need to have more than one packfile/idx open at the same time.
>
> With the second commit, runtime on 3600 packfiles went from 12 minutes to 25
> seconds.

These references to the first and second commit might have become
stale across iterations, but logically it makes sense---the first
point is about correctness (i.e. do not die by running out of fds)
and the second one is about usable performance.

But in this round (possibly in the previous one, too?) the "group
objects by packfile" one addresses both points?

@gitgitgadget

gitgitgadget bot commented Mar 24, 2019

This branch is now known as jh/midx-verify-too-many-packs.

@gitgitgadget

gitgitgadget bot commented Mar 24, 2019

This patch series was integrated into pu via git@0b60bdc.

@gitgitgadget gitgitgadget bot added the pu label Mar 24, 2019
@gitgitgadget

gitgitgadget bot commented Mar 25, 2019

On the Git mailing list, Jeff Hostetler wrote (reply to this):



On 3/22/2019 1:37 AM, Junio C Hamano wrote:
> "Jeff Hostetler via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> Teach "multi-pack-index verify" to handle cases where the number of
>> packfiles exceeds the open file handle limit.
>>
>> The first commit fixes a problem that prevented the LRU-style
>> close_one_pack() mechanism from working which caused midx verify to run out
>> of file descriptors.
>>
>> The second commit teaches midx verify to sort the set of objects to verify
>> by packfile rather than verifying them in OID order. This eliminates the
>> need to have more than one packfile/idx open at the same time.
>>
>> With the second commit, runtime on 3600 packfiles went from 12 minutes to 25
>> seconds.
> 
> These references to the first and second commit might have become
> stale across iterations, but logically it makes sense---the first
> point is about correctness (i.e. do not die by running out of fds)
> and the second one is about usable performance.
> 
> But in this round (possibly in the previous one, too?) the "group
> objects by packfile" one addresses both points?
> 


Sorry, I forgot to remove the stale content in the cover letter for
the V3 version.

This version has just the sort-by-packfile commit.  Because it only
keeps 1 packfile open at a time, it does not need the change that adds
packfiles to the packed-git list, since it never triggers the
close_one_pack() problem.

We suspect there are other places (not-yet-observed) where the design of
the all-packs and packed-git lists will lead to similar fd exhaustion
errors and want to fix it properly in the packfile and/or midx code.
We'll address this potential problem in a future patch series.

Jeff

@gitgitgadget

gitgitgadget bot commented Apr 1, 2019

This patch series was integrated into pu via git@69f7050.

@gitgitgadget

gitgitgadget bot commented Apr 2, 2019

This patch series was integrated into pu via git@6181f76.

@gitgitgadget

gitgitgadget bot commented Apr 3, 2019

This patch series was integrated into pu via git@490375d.

@gitgitgadget

gitgitgadget bot commented Apr 4, 2019

This patch series was integrated into pu via git@821b5ff.

@gitgitgadget

gitgitgadget bot commented Apr 10, 2019

This patch series was integrated into pu via git@48ea219.

@gitgitgadget

gitgitgadget bot commented Apr 10, 2019

This patch series was integrated into next via git@2ac6c93.

@gitgitgadget gitgitgadget bot added the next label Apr 10, 2019
@gitgitgadget

gitgitgadget bot commented Apr 16, 2019

This patch series was integrated into pu via git@ec08c4f.

@gitgitgadget

gitgitgadget bot commented Apr 16, 2019

This patch series was integrated into next via git@ec08c4f.

@gitgitgadget

gitgitgadget bot commented Apr 16, 2019

This patch series was integrated into master via git@ec08c4f.

@gitgitgadget gitgitgadget bot added the master label Apr 16, 2019
@gitgitgadget gitgitgadget bot closed this Apr 16, 2019
@gitgitgadget

gitgitgadget bot commented Apr 16, 2019

Closed via ec08c4f.

@jeffhostetler jeffhostetler deleted the upstream-midx-verify branch July 1, 2019 18:33