Refactor chunk-format into an API #848

derrickstolee · 2021-01-26T12:33:04Z

This is a restart on the topic previously submitted [1] but dropped because ak/corrected-commit-date was still in progress. This version is based on that branch.

[1] https://lore.kernel.org/git/pull.804.git.1607012215.gitgitgadget@gmail.com/

This version also changes the approach to use a more dynamic interaction with a struct chunkfile pointer. This idea is credited to Taylor Blau [2], but I started again from scratch. I also go further to make struct chunkfile anonymous to API consumers. It is defined only in chunk-format.c, which should hopefully deter future users from interacting with that data directly.

[2] https://lore.kernel.org/git/X8%2FI%2FRzXZksio+ri@nand.local/

This combined API is beneficial to reduce duplicated logic. Or rather, to ensure that similar file formats have similar protections against bad data. The multi-pack-index code did not have as many guards as the commit-graph code did, but now they both share a common base that checks for things like duplicate chunks or offsets outside the size of the file.

Here are some stats for the end-to-end change:

570 insertions(+), 456 deletions(-).
commit-graph.c: 107 insertions(+), 192 deletions(-)
midx.c: 164 insertions(+), 260 deletions(-)

While there is an overall increase to the code size, the consumers do get smaller. Boilerplate such as abstracting methods to match chunk_write_fn and chunk_read_fn makes up a lot of these insertions. The "interesting" code gets a lot smaller and cleaner.

Updates in V4

Out-of-date macros in commit-graph.c and midx.c are removed in their appropriate patches.
Documentation around the read API is improved.

Updates in V3

API methods use better types and changed their order to match internal data more closely.
Use hashfile_total() instead of internal data values.
The implementation of pair_chunk() uses read_chunk().
init_chunkfile() has an in-code doc comment warning against using the same struct chunkfile for reads and writes.
More multiplications are correctly cast in midx.c.
The chunk-format technical docs are expanded.

Updates in V2

The method pair_chunk() now automatically sets a pointer while read_chunk() uses the callback. This greatly reduces the code size.
Pointer casts are now implicit instead of explicit.
Extra care is taken to not overflow when verifying chunk sizes on write.

Thanks,
-Stolee

Cc: me@ttaylorr.com
Cc: gitster@pobox.com
Cc: l.s.r@web.de
Cc: szeder.dev@gmail.com
cc: Chris Torek chris.torek@gmail.com
cc: Derrick Stolee stolee@gmail.com

derrickstolee · 2021-01-26T16:00:09Z

/submit

gitgitgadget · 2021-01-26T16:02:34Z

Submitted as pull.848.git.1611676886.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-848/derrickstolee/chunk-format/refactor-v1

To fetch this version to local tag pr-848/derrickstolee/chunk-format/refactor-v1:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-848/derrickstolee/chunk-format/refactor-v1

gitgitgadget · 2021-01-27T02:10:42Z

This branch is now known as ds/chunked-file-api.

gitgitgadget · 2021-01-27T02:10:43Z

This patch series was integrated into seen via git@f8b81ab.

gitgitgadget · 2021-01-27T04:00:11Z

commit-graph.c

@@ -1040,8 +1040,10 @@ struct write_commit_graph_context {
 };


On the Git mailing list, Taylor Blau wrote (reply to this):

On Tue, Jan 26, 2021 at 05:53:39PM -0800, Chris Torek wrote:
> Note: this is purely style, and minor, but I'll ask...
>
> On Tue, Jan 26, 2021 at 8:08 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> >  static int write_graph_chunk_fanout(struct hashfile *f,
> > -                                    struct write_commit_graph_context *ctx)
> > +                                    void *data)
> >  {
> > +        struct write_commit_graph_context *ctx =
> > +                (struct write_commit_graph_context *)data;
>
> Why bother with the cast on the last line here? In C,
> conversion from `void *` to `struct whatever *` is fine.
>
> (the change itself looks fine, btw)

Agreed. It's not a correctness issue, but I find these unnecessary casts
to detract from readability. If you do end up rerolling this series, I'd
rather see

    struct write_commit_graph_context *ctx = data;

...but I don't think that this (non-)issue alone is worth a reroll.

Thanks,
Taylor
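
As a small illustration of the style point in this reply, the sketch below shows a callback that receives a generic `void *data` and assigns it to a typed pointer without a cast. The names `demo_context` and `demo_write_chunk` are hypothetical stand-ins and are not taken from commit-graph.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical context type, standing in for write_commit_graph_context. */
struct demo_context {
	int chunks_written;
};

/* Callback with a generic "void *data" parameter, in the spirit of chunk_write_fn. */
static int demo_write_chunk(void *data)
{
	/* No cast needed: C converts void * to any object pointer type implicitly. */
	struct demo_context *ctx = data;

	assert(ctx != NULL);
	ctx->chunks_written++;
	return 0;
}
```

The explicit cast would compile to the same code; leaving it out only removes noise, which is the readability argument made in the reply.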

gitgitgadget · 2021-01-27T04:00:14Z

midx.c

@@ -451,7 +451,7 @@ static int pack_info_compare(const void *_a, const void *_b)
 	return strcmp(a->pack_name, b->pack_name);


On the Git mailing list, Taylor Blau wrote (reply to this):

On Tue, Jan 26, 2021 at 04:01:13PM +0000, Derrick Stolee via GitGitGadget wrote:
> This change only renames "struct pack_info" to "struct
> write_midx_context" and the names of instances from "packs" to "ctx". In
> future changes, we will expand the data inside "struct
> write_midx_context" and align our chunk-writing method with the
> chunk-format API.

Thanks for saying that; that makes clear what is (and isn't) going on
here.

> @@ -463,37 +463,37 @@ struct pack_list {
>  static void add_pack_to_midx(const char *full_path, size_t full_path_len,
>                               const char *file_name, void *data)
>  {
> -        struct pack_list *packs = (struct pack_list *)data;
> +        struct write_midx_context *ctx = (struct write_midx_context *)data;

Same comments as earlier about the unnecessary cast on the right-hand
side of this (and the below) assignment.

Otherwise this patch looks obviously fine to me.

Thanks,
Taylor

gitgitgadget · 2021-01-27T04:00:16Z

chunk-format.c

@@ -0,0 +1,155 @@
+#include "cache.h"


On the Git mailing list, Taylor Blau wrote (reply to this):

On Tue, Jan 26, 2021 at 04:01:21PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/chunk-format.c b/chunk-format.c
> index 2ce37ecc6bb..674d31d5e58 100644
> --- a/chunk-format.c
> +++ b/chunk-format.c
> @@ -12,6 +12,8 @@ struct chunk_info {
>          uint32_t id;
>          uint64_t size;
>          chunk_write_fn write_fn;
> +
> +        const void *start;

It may be clearer to fold both of these into an anonymous union along
with an enum to indicate which mode we're in. But, I could also buy that
that is more error prone, so perhaps just a comment along the lines of
"exactly one of these is NULL" would suffice, too.

>  };
>
>  struct chunkfile {
> @@ -89,3 +91,65 @@ int write_chunkfile(struct chunkfile *cf, void *data)
>
>          return 0;
>  }
> +
> +int read_table_of_contents(struct chunkfile *cf,
> +                           const unsigned char *mfile,
> +                           size_t mfile_size,

Assuming that mfile and mfile_size are a pointer to a memory mapped
region and its size? If so, a nit is that I'd expect "data" and "size"
instead of "mfile".

I think that it's probably going too far to have the chunkfile API
handle mapping its own memory, so in that way I don't think it's wrong
for the callers to be handling that.

OTOH, it does seem a little weird to temporarily hand off ownership like
this. I don't think I have a better suggestion, though.

The implementation of this function looks good to me.

> +int pair_chunk(struct chunkfile *cf,
> +               uint32_t chunk_id,
> +               chunk_read_fn fn,
> +               void *data)
> +{
> +        int i;
> +
> +        for (i = 0; i < cf->chunks_nr; i++) {
> +                if (cf->chunks[i].id == chunk_id)
> +                        return fn(cf->chunks[i].start, cf->chunks[i].size, data);
> +        }
> +
> +        return CHUNK_NOT_FOUND;
> +}
> diff --git a/chunk-format.h b/chunk-format.h
> index bfaed672813..250e08b8e6a 100644
> --- a/chunk-format.h
> +++ b/chunk-format.h
> @@ -17,4 +17,25 @@ void add_chunk(struct chunkfile *cf,
>                 size_t size);
>  int write_chunkfile(struct chunkfile *cf, void *data);
>
> +int read_table_of_contents(struct chunkfile *cf,
> +                           const unsigned char *mfile,
> +                           size_t mfile_size,
> +                           uint64_t toc_offset,
> +                           int toc_length);
> +
> +/*
> + * When reading a table of contents, we find the chunk with matching 'id'
> + * then call its read_fn to populate the necessary 'data' based on the
> + * chunk start and size.
> + */
> +typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
> +                             size_t chunk_size, void *data);
> +
> +
> +#define CHUNK_NOT_FOUND (-2)
> +int pair_chunk(struct chunkfile *cf,
> +               uint32_t chunk_id,
> +               chunk_read_fn fn,
> +               void *data);

From reading the implementation, I take it that this function calls fn
with the location and size of the requested chunk, along with the user
supplied data. I'm not sure that "pair" gives me that same sense. Maybe
"read" or "lookup" would be better? Dunno.

Thanks,
Taylor

gitgitgadget · 2021-01-27T04:17:43Z

On the Git mailing list, Taylor Blau wrote (reply to this):

On Tue, Jan 26, 2021 at 04:01:09PM +0000, Derrick Stolee via GitGitGadget wrote:
> This version also changes the approach to use a more dynamic interaction
> with a struct chunkfile pointer. This idea is credited to Taylor Blau [2],
> but I started again from scratch. I also go further to make struct chunkfile
> anonymous to API consumers. It is defined only in chunk-format.c, which
> should hopefully deter future users from interacting with that data
> directly.
>
> [2] https://lore.kernel.org/git/X8%2FI%2FRzXZksio+ri@nand.local/

Great; I am very happy that you found my patch to be useful. I'm glad
that you decided to start from scratch, too, since as I recall there
were some unresolved test issues that I punted on in case you decided to
abandon the topic altogether.

> This combined API is beneficial to reduce duplicated logic. Or rather, to
> ensure that similar file formats have similar protections against bad data.
> The multi-pack-index code did not have as many guards as the commit-graph
> code did, but now they both share a common base that checks for things like
> duplicate chunks or offsets outside the size of the file.

Definitely good.

> Here are some stats for the end-to-end change:
>
>  * 638 insertions(+), 456 deletions(-).
>  * commit-graph.c: 171 insertions(+), 192 deletions(-)
>  * midx.c: 196 insertions(+), 260 deletions(-)
>
> While there is an overall increase to the code size, the consumers do get a
> bit smaller. Boilerplate things like abstracting method to match
> chunk_write_fn and chunk_read_fn make up a lot of these insertions. The
> "interesting" code gets a lot smaller and cleaner.

Like I said in [1], I don't think a net +182 line diff is reason alone
not to pursue this topic. I don't think that a chunked index v3 will
come as part of my work on the on-disk revindex format, but I do think
that it's something brian may be interested in. So, I'm feeling rather
certain that we'll eventually have new callers, at which point this will
reduce duplication overall.

[1]: https://lore.kernel.org/git/X8%2FK1dUgUmwp8ZOv@nand.local/

Thanks,
Taylor

gitgitgadget · 2021-01-27T04:17:44Z

Makefile

@@ -854,6 +854,7 @@ LIB_OBJS += bundle.o
 LIB_OBJS += cache-tree.o


On the Git mailing list, Taylor Blau wrote (reply to this):

On Tue, Jan 26, 2021 at 04:01:11PM +0000, Derrick Stolee via GitGitGadget wrote:
> +/*
> + * When writing a chunk-based file format, collect the chunks in
> + * an array of chunk_info structs. The size stores the _expected_
> + * amount of data that will be written by write_fn.
> + */
> +struct chunk_info {
> +        uint32_t id;
> +        uint64_t size;

Hmm. Would we not want an off_t to indicate the size here?

I wondered briefly if we even needed a size field at all, since calling
write_fn would tell us the number of bytes written. But I suppose you
want to know ahead of time so that you can write the file in one pass
(beginning with the table of contents, which certainly needs to know the
size).

> +        /* Trailing entry marks the end of the chunks */
> +        hashwrite_be32(cf->f, 0);
> +        hashwrite_be64(cf->f, cur_offset);
> +
> +        for (i = 0; i < cf->chunks_nr; i++) {
> +                uint64_t start_offset = cf->f->total + cf->f->offset;
> +                int result = cf->chunks[i].write_fn(cf->f, data);
> +
> +                if (result)
> +                        return result;
> +
> +                if (cf->f->total + cf->f->offset != start_offset + cf->chunks[i].size)

I don't think this is a practical concern, but a malicious caller could
overflow this by passing a bogus "size" parameter. Maybe:

    uint64_t end_offset = ...;

    if (end_offset - start_offset != cf->chunks[i].size)
        BUG(...)

?

> diff --git a/chunk-format.h b/chunk-format.h
> new file mode 100644
> index 00000000000..bfaed672813
> --- /dev/null
> +++ b/chunk-format.h
> @@ -0,0 +1,20 @@
> +#ifndef CHUNK_FORMAT_H
> +#define CHUNK_FORMAT_H
> +
> +#include "git-compat-util.h"
> +
> +struct hashfile;
> +struct chunkfile;
> +
> +struct chunkfile *init_chunkfile(struct hashfile *f);
> +void free_chunkfile(struct chunkfile *cf);
> +int get_num_chunks(struct chunkfile *cf);
> +typedef int (*chunk_write_fn)(struct hashfile *f,
> +                              void *data);
> +void add_chunk(struct chunkfile *cf,
> +               uint64_t id,
> +               chunk_write_fn fn,
> +               size_t size);
> +int write_chunkfile(struct chunkfile *cf, void *data);

Very clean API.

Thanks,
Taylor

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 1/26/2021 9:42 PM, Taylor Blau wrote:
> On Tue, Jan 26, 2021 at 04:01:11PM +0000, Derrick Stolee via GitGitGadget wrote:
>> +/*
>> + * When writing a chunk-based file format, collect the chunks in
>> + * an array of chunk_info structs. The size stores the _expected_
>> + * amount of data that will be written by write_fn.
>> + */
>> +struct chunk_info {
>> +        uint32_t id;
>> +        uint64_t size;
>
> Hmm. Would we not want an off_t to indicate the size here?
>
> I wondered briefly if we even needed a size field at all, since calling
> write_fn would tell us the number of bytes written. But I suppose you
> want to know ahead of time so that you can write the file in one pass
> (beginning with the table of contents, which certainly needs to know the
> size).

Is off_t 64-bits on a 32-bit machine? This is intentionally typed to
be "64 bits no matter what" because it correlates with the file format's
size for the chunk offsets.

>> +                if (cf->f->total + cf->f->offset != start_offset + cf->chunks[i].size)
>
> I don't think this is a practical concern, but a malicious caller could
> overflow this by passing a bogus "size" parameter. Maybe:
>
>     uint64_t end_offset = ...;
>
>     if (end_offset - start_offset != cf->chunks[i].size)
>         BUG(...)

Sure.

Thanks,
-Stolee

gitgitgadget · 2021-01-27T04:20:45Z

commit-graph.c

@@ -1040,8 +1040,10 @@ struct write_commit_graph_context {
 };


On the Git mailing list, Chris Torek wrote (reply to this):

Note: this is purely style, and minor, but I'll ask...

On Tue, Jan 26, 2021 at 8:08 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>  static int write_graph_chunk_fanout(struct hashfile *f,
> -                                    struct write_commit_graph_context *ctx)
> +                                    void *data)
>  {
> +        struct write_commit_graph_context *ctx =
> +                (struct write_commit_graph_context *)data;

Why bother with the cast on the last line here? In C,
conversion from `void *` to `struct whatever *` is fine.

(the change itself looks fine, btw)

Chris

gitgitgadget · 2021-01-27T04:20:47Z

User Chris Torek <chris.torek@gmail.com> has been added to the cc: list.

gitgitgadget · 2021-01-27T04:20:48Z

commit-graph.c

@@ -19,6 +19,7 @@
 #include "shallow.h"


On the Git mailing list, Taylor Blau wrote (reply to this):

On Tue, Jan 26, 2021 at 04:01:12PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The commit-graph write logic is ready to make use of the chunk-format
> write API. Each chunk write method is already in the correct prototype.
> We only need to use the 'struct chunkfile' pointer and the correct API
> calls.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Nicely done. The majority of this patch was remarkably easy to read,
which I attribute to you doing the necessary prep work to make the
callbacks usable by the new API. Thank you.

> @@ -1941,6 +1896,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>
>          close_commit_graph(ctx->r->objects);
>          finalize_hashfile(f, file_hash.hash, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
> +        free_chunkfile(cf);

Since chunkfiles are so tightly coupled to hashfiles (i.e., you can only
"construct" a chunkfile given a 'struct hashfile*'), I wonder whether
this should be:

    finalize_chunkfile(cf, ...)

instead. It seems kind of weird to give up ownership of 'f' down to the
chunkfile API only to reach down into it again.

I could even buy that you'd always want to finalize and free a chunkfile
at the same time, and so perhaps the calls could be combined, but that
may be a step too far.

Thanks,
Taylor

gitgitgadget · 2021-01-27T06:54:14Z

midx.c

@@ -11,6 +11,7 @@
 #include "trace2.h"


On the Git mailing list, Taylor Blau wrote (reply to this):

On Tue, Jan 26, 2021 at 04:01:23PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Instead of parsing the table of contents directly, use the chunk-format
> API methods read_table_of_contents() and pair_chunk(). In particular, we
> can use the return value of pair_chunk() to generate an error when a
> required chunk is missing.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  midx.c                      | 103 ++++++++++++++++++++----------------
>  t/t5319-multi-pack-index.sh |   6 +--
>  2 files changed, 60 insertions(+), 49 deletions(-)
>
> diff --git a/midx.c b/midx.c
> index 0bfd2d802b6..dd019c00795 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -54,6 +54,51 @@ static char *get_midx_filename(const char *object_dir)
>          return xstrfmt("%s/pack/multi-pack-index", object_dir);
>  }
>
> +static int midx_read_pack_names(const unsigned char *chunk_start,
> +                                size_t chunk_size, void *data)
> +{
> +        struct multi_pack_index *m = (struct multi_pack_index *)data;
> +        m->chunk_pack_names = chunk_start;
> +        return 0;
> +}

There are a lot of these callbacks that just assign some 'void **' to
point at chunk_start.

Maybe a good use of the "pair_chunk" name would be something like:

    int pair_chunk(struct chunkfile *cf, uint32_t id, const unsigned char **p);

which does the same as what you wrote here. So instead of what you
wrote, you could instead:

    pair_chunk(cf, MIDX_CHUNKID_PACKNAMES, &m->chunk_pack_names);

This would be in addition to the richer callback-style function which
allows the caller greater flexibility (e.g., for the Bloom filter
related readers in the commit-graph code).

Thanks,
Taylor

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 1/26/2021 10:06 PM, Taylor Blau wrote:
> On Tue, Jan 26, 2021 at 04:01:23PM +0000, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> Instead of parsing the table of contents directly, use the chunk-format
>> API methods read_table_of_contents() and pair_chunk(). In particular, we
>> can use the return value of pair_chunk() to generate an error when a
>> required chunk is missing.
>>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>  midx.c                      | 103 ++++++++++++++++++++----------------
>>  t/t5319-multi-pack-index.sh |   6 +--
>>  2 files changed, 60 insertions(+), 49 deletions(-)
>>
>> diff --git a/midx.c b/midx.c
>> index 0bfd2d802b6..dd019c00795 100644
>> --- a/midx.c
>> +++ b/midx.c
>> @@ -54,6 +54,51 @@ static char *get_midx_filename(const char *object_dir)
>>          return xstrfmt("%s/pack/multi-pack-index", object_dir);
>>  }
>>
>> +static int midx_read_pack_names(const unsigned char *chunk_start,
>> +                                size_t chunk_size, void *data)
>> +{
>> +        struct multi_pack_index *m = (struct multi_pack_index *)data;
>> +        m->chunk_pack_names = chunk_start;
>> +        return 0;
>> +}
>
> There are a lot of these callbacks that just assign some 'void **' to
> point at chunk_start.
>
> Maybe a good use of the "pair_chunk" name would be something like:
>
>     int pair_chunk(struct chunkfile *cf, uint32_t id, const unsigned char **p);
>
> which does the same as what you wrote here. So instead of what you
> wrote, you could instead:
>
>     pair_chunk(cf, MIDX_CHUNKID_PACKNAMES, &m->chunk_pack_names);
>
> This would be in addition to the richer callback-style function which
> allows the caller greater flexibility (e.g., for the Bloom filter
> related readers in the commit-graph code).

You're right that _most_ callers just want to assign a pointer, so
this mechanism would be better. I'll make a different function,
read_chunk() perhaps, that relies on a callback for advanced users.

Thanks,
-Stolee
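
The split agreed on in this exchange (a pointer-assigning `pair_chunk()` plus a callback-based `read_chunk()`) can be sketched as follows. This is a toy under stated assumptions, not the code in chunk-format.c: every `demo_`/`DEMO_` name is hypothetical, and the chunk list is a plain array supplied by the caller rather than something parsed from a file. It follows the V3 note that `pair_chunk()` is implemented on top of `read_chunk()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CHUNK_NOT_FOUND (-2)

/* Hypothetical, simplified view of one parsed table-of-contents entry. */
struct demo_chunk {
	uint32_t id;
	const unsigned char *start;
	size_t size;
};

struct demo_chunkfile {
	const struct demo_chunk *chunks;
	size_t nr;
};

typedef int (*demo_read_fn)(const unsigned char *chunk_start,
			    size_t chunk_size, void *data);

/* Callback form: run fn on the start and size of the chunk with this id. */
static int demo_read_chunk(const struct demo_chunkfile *cf, uint32_t id,
			   demo_read_fn fn, void *data)
{
	size_t i;

	assert(cf && fn);
	for (i = 0; i < cf->nr; i++)
		if (cf->chunks[i].id == id)
			return fn(cf->chunks[i].start, cf->chunks[i].size, data);
	return DEMO_CHUNK_NOT_FOUND;
}

/* Helper callback that only records where the chunk starts. */
static int demo_pair_chunk_fn(const unsigned char *chunk_start,
			      size_t chunk_size, void *data)
{
	const unsigned char **p = data;

	(void)chunk_size;	/* the pointer-only form ignores the size */
	*p = chunk_start;
	return 0;
}

/* Pointer form: store the chunk start in *p; *p is untouched if absent. */
static int demo_pair_chunk(const struct demo_chunkfile *cf, uint32_t id,
			   const unsigned char **p)
{
	return demo_read_chunk(cf, id, demo_pair_chunk_fn, p);
}
```

With this shape, a caller that only needs a pointer writes one line per chunk, while a caller that must inspect the chunk size passes its own callback to the `read` variant.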

gitgitgadget · 2021-01-27T13:24:41Z

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This is a restart on the topic previously submitted [1] but dropped because
> ak/corrected-commit-date was still in progress. This version is based on
> that branch.

Nice to see that we have an endorsement on ak/corrected-commit-date
topic ;-)

I've scanned this round of the topic and the patches were a pleasant read.

I may have other comments after a more careful reading, but so far,
I am happy with what I see here.

Thanks.

gitgitgadget · 2021-01-27T13:51:41Z

User Derrick Stolee <stolee@gmail.com> has been added to the cc: list.

In preparation for creating an API around file formats using chunks and tables of contents, prepare the commit-graph write code to use prototypes that will match this new API. Specifically, convert chunk_write_fn to take a "void *data" parameter instead of the commit-graph-specific "struct write_commit_graph_context" pointer. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

derrickstolee · 2021-01-27T15:00:51Z

/submit

gitgitgadget · 2021-01-27T15:02:39Z

Submitted as pull.848.v2.git.1611759716.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-848/derrickstolee/chunk-format/refactor-v2

To fetch this version to local tag pr-848/derrickstolee/chunk-format/refactor-v2:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-848/derrickstolee/chunk-format/refactor-v2

gitgitgadget · 2021-01-27T16:08:03Z

On the Git mailing list, Taylor Blau wrote (reply to this):

On Wed, Jan 27, 2021 at 03:01:39PM +0000, Derrick Stolee via GitGitGadget wrote:
> Updates in V2
> =============
>
>  * The method pair_chunk() now automatically sets a pointer while
>    read_chunk() uses the callback. This greatly reduces the code size.
>
>  * Pointer casts are now implicit instead of explicit.
>
>  * Extra care is taken to not overflow when verifying chunk sizes on write.

Thanks, I read the range-diff between this version and the last and
appreciate you taking the time to address all of my concerns.

I think that this is ready to go, so please have my:

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

Thanks,
Taylor

gitgitgadget · 2021-01-28T06:48:26Z

This patch series was integrated into seen via git@867650a.

gitgitgadget · 2021-01-28T23:20:33Z

This patch series was integrated into seen via git@489277d.

gitgitgadget · 2021-02-11T23:02:29Z

This patch series was integrated into seen via git@2fec6f9.

gitgitgadget · 2021-02-12T22:57:36Z

This patch series was integrated into seen via git@d6bd16c.

gitgitgadget · 2021-02-17T00:15:48Z

This patch series was integrated into seen via git@a4a8003.

In anticipation of combining the logic from the commit-graph and multi-pack-index file formats, create a new chunk-format API. Use a 'struct chunkfile' pointer to keep track of data that has been registered for writes. This struct is anonymous outside of chunk-format.c to ensure no user attempts to interfere with the data.

The next change will use this API in commit-graph.c, but the general approach is:

1. initialize the chunkfile with init_chunkfile(f).
2. add chunks in the intended writing order with add_chunk().
3. write any header information to the hashfile f.
4. write the chunkfile data using write_chunkfile().
5. free the chunkfile struct using free_chunkfile().

Helped-by: Taylor Blau <me@ttaylorr.com>
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

The commit-graph write logic is ready to make use of the chunk-format write API. Each chunk write method is already in the correct prototype. We only need to use the 'struct chunkfile' pointer and the correct API calls. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

In an effort to streamline our chunk-based file formats, align some of the code structure in write_midx_internal() to be similar to the patterns in write_commit_graph_file(). Specifically, let's create a "struct write_midx_context" that can be used as a data parameter to abstract function types. This change only renames "struct pack_info" to "struct write_midx_context" and the names of instances from "packs" to "ctx". In future changes, we will expand the data inside "struct write_midx_context" and align our chunk-writing method with the chunk-format API. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

In an effort to align the write_midx_internal() to use the chunk-format API, start converting chunk writing methods to match chunk_write_fn. The first case is to convert write_midx_pack_names() to take "void *data". We already have the necessary data in "struct write_midx_context", so this conversion is rather mechanical. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

In an effort to align write_midx_internal() with the chunk-format API, continue to group necessary data into "struct write_midx_context". This change collects the "struct pack_midx_entry *entries" list and its count into the context. Update write_midx_oid_fanout() and write_midx_oid_lookup() to take the context directly, as these are easy conversions with this new data. Only the callers of write_midx_object_offsets() and write_midx_large_offsets() are updated here, since additional data is needed in the context before those methods can match chunk_write_fn. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

In an effort to align write_midx_internal() with the chunk-format API, continue to group necessary data into "struct write_midx_context". This change collects the "uint32_t *pack_perm" and large_offsets_needed bit into the context. Update write_midx_object_offsets() to match chunk_write_fn. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

In an effort to align write_midx_internal() with the chunk-format API, continue to group necessary data into "struct write_midx_context". This change collects the "uint32_t num_large_offsets" into the context. With this new data, write_midx_large_offsets() now matches the chunk_write_fn type. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Historically, the chunk-writing methods in midx.c have returned the amount of data written so the writer method could compare this with the table of contents. This presents some interesting issues:

1. If a chunk writing method has a bug that miscalculates the written bytes, then we can satisfy the table of contents without actually writing the right amount of data to the hashfile. The commit-graph writing code checks the hashfile struct directly for a more robust verification.
2. There is no way for a chunk writing method to gracefully fail. Returning an int presents an opportunity to fail without a die().
3. The current pattern doesn't match chunk_write_fn type exactly, so we cannot share code with commit-graph.c

For these reasons, convert the midx chunk writer methods to return an 'int'. Since none of them fail at the moment, they all return 0.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Most expensive operations in write_midx_internal() use the context struct's progress member, and these indicate the process of the expensive operations within the chunk writing methods. However, there is a competing progress struct that counts the progress over all chunks. This is not very helpful compared to the others, so drop it. This also reduces our barriers to combining the chunk writing code with chunk-format.c. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

The chunk-format API allows writing the table of contents and all chunks using the anonymous 'struct chunkfile' type. We only need to convert our local chunk logic to this API for the multi-pack-index writes to share that logic with the commit-graph file writes. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Add the capability to read the table of contents, then pair the chunks with necessary logic using chunk_read_fn pointers. Callers will be added in future changes, but the typical outline will be:

1. initialize a 'struct chunkfile' with init_chunkfile(NULL).
2. call read_table_of_contents().
3. for each chunk to parse,
   a. call pair_chunk() to assign a pointer with the chunk position, or
   b. call read_chunk() to run a callback on the chunk start and size.
4. call free_chunkfile() to clear the 'struct chunkfile' data.

We are re-using the anonymous 'struct chunkfile' data, as it is internal to the chunk-format API. This gives it essentially two modes: write and read. If the same struct instance was used for both reads and writes, then there would be failures.

Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Instead of parsing the table of contents directly, use the chunk-format API methods read_table_of_contents() and pair_chunk(). While the current implementation loses the duplicate-chunk detection, that will be added in a future change. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Instead of parsing the table of contents directly, use the chunk-format API methods read_table_of_contents() and pair_chunk(). In particular, we can use the return value of pair_chunk() to generate an error when a required chunk is missing. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

When calculating the sizes of certain chunks, we should use 64-bit multiplication always. This allows us to properly predict the chunk sizes without risk of overflow. Other possible overflows were discovered by evaluating each multiplication in midx.c and ensuring that at least one side of the operator was of type size_t or off_t. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
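
To make the overflow concern in this commit message concrete, here is a minimal sketch contrasting a 32-bit product with one where an operand is cast to a 64-bit type first. The entry width and function names are hypothetical and are not taken from midx.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical entry width in bytes; not a constant from midx.c. */
#define DEMO_ENTRY_WIDTH 8

/* 32-bit multiplication: the product is reduced modulo 2^32. */
static uint64_t chunk_size_32bit(uint32_t nr_entries)
{
	uint32_t product = nr_entries * (uint32_t)DEMO_ENTRY_WIDTH;

	return product;
}

/* Casting one operand first makes the whole multiplication 64-bit. */
static uint64_t chunk_size_64bit(uint32_t nr_entries)
{
	uint64_t size = (uint64_t)nr_entries * DEMO_ENTRY_WIDTH;

	/* Sanity check: dividing back recovers the count, so nothing wrapped. */
	assert(size / DEMO_ENTRY_WIDTH == nr_entries);
	return size;
}
```

For 2^29 entries of 8 bytes the true size is 2^32 bytes; the 32-bit version yields 0 while the 64-bit version yields 4294967296, which is the kind of mismatch the casts guard against.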

Before refactoring into the chunk-format API, the commit-graph parsing logic included checks for duplicate chunks. It is unlikely that we would desire a chunk-based file format that allows duplicate chunk IDs in the table of contents, so add duplicate checks into read_table_of_contents(). Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
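
A rough sketch of the kind of validation this commit message describes is shown below: a table-of-contents parser that rejects out-of-range offsets and duplicate chunk IDs. It is a simplified toy with hypothetical `demo_`/`DEMO_` names and a fixed-size chunk array, not the body of read_table_of_contents(); only the 12-byte entry layout (4-byte id, 8-byte offset, terminating id 0) is taken from the code quoted earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_TOC_ENTRY_SIZE 12	/* 4-byte id + 8-byte offset */
#define DEMO_MAX_CHUNKS 8

struct demo_chunk {
	uint32_t id;
	const unsigned char *start;
	uint64_t size;
};

struct demo_toc {
	struct demo_chunk chunks[DEMO_MAX_CHUNKS];
	size_t nr;
};

/* Reads a big-endian integer of the given width. */
static uint64_t demo_get_be(const unsigned char *p, int bytes)
{
	uint64_t v = 0;

	while (bytes--)
		v = (v << 8) | *p++;
	return v;
}

/*
 * Parses toc_length entries plus the terminating entry, starting at
 * toc_offset. Returns 0 on success and -1 on malformed input.
 */
static int demo_read_toc(struct demo_toc *toc, const unsigned char *data,
			 size_t size, uint64_t toc_offset, size_t toc_length)
{
	const unsigned char *entry;
	size_t i, j;

	assert(toc && data);
	toc->nr = 0;
	if (toc_length > DEMO_MAX_CHUNKS || toc_offset > size ||
	    (toc_length + 1) * DEMO_TOC_ENTRY_SIZE > size - toc_offset)
		return -1;	/* table of contents does not fit in the file */

	entry = data + toc_offset;
	for (i = 0; i < toc_length; i++, entry += DEMO_TOC_ENTRY_SIZE) {
		uint32_t id = (uint32_t)demo_get_be(entry, 4);
		uint64_t start = demo_get_be(entry + 4, 8);
		/* The next entry's offset marks where this chunk ends. */
		uint64_t end = demo_get_be(entry + DEMO_TOC_ENTRY_SIZE + 4, 8);

		if (!id)
			return -1;	/* terminating id appeared too early */
		if (start > end || end > size)
			return -1;	/* offsets out of order or past the end */
		for (j = 0; j < toc->nr; j++)
			if (toc->chunks[j].id == id)
				return -1;	/* duplicate chunk id */

		toc->chunks[toc->nr].id = id;
		toc->chunks[toc->nr].start = data + start;
		toc->chunks[toc->nr].size = end - start;
		toc->nr++;
	}
	return demo_get_be(entry, 4) ? -1 : 0;	/* last id must be zero */
}
```

The duplicate scan here is quadratic in the number of chunks, which seems acceptable for a sketch because these files carry only a handful of chunks; whether the real code makes the same trade-off is not stated in this thread.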

The chunk-based file format is now an API in the code, but we should also take time to document it as a file format. Specifically, it matches the CHUNK LOOKUP sections of the commit-graph and multi-pack-index files, but there are some commonalities that should be grouped in this document. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

derrickstolee · 2021-02-18T14:06:24Z

/submit

gitgitgadget · 2021-02-18T14:08:23Z

Submitted as pull.848.v4.git.1613657259.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-848/derrickstolee/chunk-format/refactor-v4

To fetch this version to local tag pr-848/derrickstolee/chunk-format/refactor-v4:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-848/derrickstolee/chunk-format/refactor-v4

gitgitgadget · 2021-02-18T21:51:57Z

Documentation/technical/chunk-format.txt

@@ -0,0 +1,116 @@
+Chunk-based file formats


On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +Chunk-based file formats
> +========================
> +
> +Some file formats in Git use a common concept of "chunks" to describe
> +sections of the file. This allows structured access to a large file by
> +scanning a small "table of contents" for the remaining data. This common
> +format is used by the `commit-graph` and `multi-pack-index` files. See
> +link:technical/pack-format.html[the `multi-pack-index` format] and
> +link:technical/commit-graph-format.html[the `commit-graph` format] for
> +how they use the chunks to describe structured data.

I've read the doc added here to the end; well written and easy to
understand.

I wonder how/if well reftable files fit in the scheme, or if it
doesn't, should the chunk file format API be updated to accommodate
it (or the other way around)?

> +Extract the data information for each chunk using `pair_chunk()` or
> +`read_chunk()`:
> +
> +* `pair_chunk()` assigns a given pointer with the location inside the
> +  memory-mapped file corresponding to that chunk's offset. If the chunk
> +  does not exist, then the pointer is not modified.

I think it is worth adding:

    The caller is expected to know where the returned chunk ends by
    some out-of-band means, as this function only gives the offset
    but not the size, unlike the read_chunk() function.

> +* `read_chunk()` takes a `chunk_read_fn` function pointer and calls it
> +  with the appropriate initial pointer and size information. The function
> +  is not called if the chunk does not exist. Use this method to read chunks
> +  if you need to perform immediate parsing or if you need to execute logic
> +  based on the size of the chunk.
> +
> +After calling these methods, call `free_chunkfile()` to clear the
> +`struct chunkfile` data. This will not close the memory-mapped region.
> +Callers are expected to own that data for the timeframe the pointers into
> +the region are needed.

Thanks.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 2/18/2021 4:47 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> +Chunk-based file formats
>> +========================
>> +
>> +Some file formats in Git use a common concept of "chunks" to describe
>> +sections of the file. This allows structured access to a large file by
>> +scanning a small "table of contents" for the remaining data. This common
>> +format is used by the `commit-graph` and `multi-pack-index` files. See
>> +link:technical/pack-format.html[the `multi-pack-index` format] and
>> +link:technical/commit-graph-format.html[the `commit-graph` format] for
>> +how they use the chunks to describe structured data.
>
> I've read the doc added here to the end; well written and easy to
> understand.
>
> I wonder how/if well reftable files fit in the scheme, or if it
> doesn't, should the chunk file format API be updated to accommodate
> it (or the other way around)?

I'm not sure that reftable can work with this format, especially
with its design to do most updates as append-only (IIUC). And to
change the format to work with the chunk format would violate the
compatibility with the JGit version.

I would be interested if something like the packed-refs file could
use a minor update, but only if there is a realistic benefit to
using chunks over the current format.

The files that are on my radar for adopting a new file format using
the chunk-format API are:

* reachability bitmaps: using a similar approach to the commit-graph,
  we could avoid parsing the entire file before checking if a specific
  commit has a bitmap. (Requires a commit lookup chunk, a bitmap data
  chunk, and an offset chunk to connect them.)

* index v5: I'm trying to collect a bunch of information about how
  to update the index for better compression, and the chunk-based
  approach can provide some fixed-width columns that can vary in
  length depending on the required data (presenting the interesting
  behavior from v2 and v3, along with possible approaches previously
  presented as a potential v5). The paths could be presented as a
  chunk, giving the interesting options between v2/3 and v4 (prefix
  compression).

I haven't even started the actual work here, but I've been thinking
about it a lot. I'll have time next month to start prototyping.

Are there other interesting files that could use a new version here?
What other pain points are known to experts in the area?

>> +Extract the data information for each chunk using `pair_chunk()` or
>> +`read_chunk()`:
>> +
>> +* `pair_chunk()` assigns a given pointer with the location inside the
>> +  memory-mapped file corresponding to that chunk's offset. If the chunk
>> +  does not exist, then the pointer is not modified.
>
> I think it is worth adding:
>
>     The caller is expected to know where the returned chunk ends by
>     some out-of-band means, as this function only gives the offset
>     but not the size, unlike the read_chunk() function.

True. I suppose that could be more explicit, although it can be
gleaned from the omission of any size information.

Thanks,
-Stolee
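The three-chunk idea for reachability bitmaps (a commit lookup chunk, an offset chunk, and a bitmap data chunk) might work roughly as sketched below. This is purely illustrative: no such format exists in this series, and the `toy_` names, the 20-byte hash width, and the 4-byte big-endian offsets are all assumptions. The point is only that a binary search over the lookup chunk can locate one commit's bitmap without parsing the rest of the file.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_HASH_LEN 20 /* assumed hash width for this sketch */

/* Hypothetical view of three chunks inside a memory-mapped file. */
struct toy_bitmap_index {
	const unsigned char *lookup;  /* nr sorted hashes, TOY_HASH_LEN bytes each */
	const unsigned char *offsets; /* nr big-endian 32-bit offsets into data */
	const unsigned char *data;    /* concatenated bitmap payloads */
	size_t nr;
};

/* Decode a 32-bit big-endian integer. */
static uint32_t toy_get_be32(const unsigned char *p)
{
	return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
	       (uint32_t)p[2] << 8 | (uint32_t)p[3];
}

/*
 * Binary-search the lookup chunk for 'hash'. On a hit, use the matching
 * entry of the offset chunk to find the start of that commit's bitmap
 * in the data chunk. Returns NULL when the commit has no bitmap.
 */
const unsigned char *toy_find_bitmap(const struct toy_bitmap_index *idx,
				     const unsigned char *hash)
{
	size_t lo = 0, hi = idx->nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(hash, idx->lookup + mid * TOY_HASH_LEN,
				 TOY_HASH_LEN);
		if (!cmp)
			return idx->data + toy_get_be32(idx->offsets + mid * 4);
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return NULL;
}
```

Only the entries touched by the search are read, which is the "avoid parsing the entire file" property described in the email.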

gitgitgadget · 2021-02-19T00:12:46Z

This patch series was integrated into seen via git@8a7cd50.

gitgitgadget · 2021-02-23T01:58:38Z

This patch series was integrated into seen via git@3d78343.

gitgitgadget · 2021-02-23T01:58:39Z

This patch series was integrated into next via git@5f0e28c.

gitgitgadget · 2021-02-24T16:56:30Z

commit-graph.c

@@ -19,6 +19,7 @@
 #include "shallow.h"


On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Thu, Feb 18, 2021 at 02:07:25PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The commit-graph write logic is ready to make use of the chunk-format
> write API. Each chunk write method is already in the correct prototype.
> We only need to use the 'struct chunkfile' pointer and the correct API
> calls.

This patch series messes up the "Writing out commit graph" progress
display, and starting at this commit I get:

  $ git commit-graph write --reachable
  Expanding reachable commits in commit graph: 837569, done.
  Writing out commit graph in 3 passes: 166% (4187845/2512707), done.

Note that 166%. Before this commit I got:

  Expanding reachable commits in commit graph: 837569, done.
  Writing out commit graph in 5 passes: 100% (4187845/4187845), done.

Note the different number of passes.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Wed, Feb 24, 2021 at 05:52:42PM +0100, SZEDER Gábor wrote:
> On Thu, Feb 18, 2021 at 02:07:25PM +0000, Derrick Stolee via GitGitGadget wrote:
> > From: Derrick Stolee <dstolee@microsoft.com>
> >
> > The commit-graph write logic is ready to make use of the chunk-format
> > write API. Each chunk write method is already in the correct prototype.
> > We only need to use the 'struct chunkfile' pointer and the correct API
> > calls.
>
> This patch series messes up the "Writing out commit graph" progress
> display, and starting at this commit I get:

I can confirm. It looks like we never dropped the 'num_chunks' variable,
which should have happened in this patch.

Here's something to apply on top which fixes the issue. Thanks for
reporting.

--- >8 ---

Subject: [PATCH] commit-graph.c: display correct number of chunks when writing

When writing a commit-graph, a progress meter is shown which indicates
the number of pieces of data to write (one per commit in each chunk).

In 47410aa837 (commit-graph: use chunk-format write API, 2021-02-18),
the number of chunks became tracked by the new chunk-format API. But a
stray local variable was left behind from when write_commit_graph_file()
used to keep track of the same.

Since this was no longer updated after 47410aa837, the progress meter
appeared broken:

    $ git commit-graph write --reachable
    Expanding reachable commits in commit graph: 837569, done.
    Writing out commit graph in 3 passes: 166% (4187845/2512707), done.

Drop the local variable and rely instead on the chunk-format API to tell
us the correct number of chunks.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 78b993c367..6aa0c488f5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1791,7 +1791,6 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	struct lock_file lk = LOCK_INIT;
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
-	int num_chunks = 3;
 	struct object_id file_hash;
 	struct chunkfile *cf;

@@ -1887,11 +1886,11 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		strbuf_addf(&progress_title,
 			    Q_("Writing out commit graph in %d pass",
 			       "Writing out commit graph in %d passes",
-			       num_chunks),
-			    num_chunks);
+			       get_num_chunks(cf)),
+			    get_num_chunks(cf));
 		ctx->progress = start_delayed_progress(
 			progress_title.buf,
-			num_chunks * ctx->commits.nr);
+			get_num_chunks(cf) * ctx->commits.nr);
 	}

 	write_chunkfile(cf, ctx);
--
2.30.0.667.g81c0cbc6fd
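The 166% figure follows directly from the numbers in the report. The short C sketch below reproduces the arithmetic, assuming the meter shows a truncated integer percentage of `done * 100 / total`. The helper names are hypothetical and are not taken from Git's progress code.

```c
#include <stdint.h>

/* Total work the meter is told to expect: one item per commit per chunk. */
uint64_t toy_expected_total(unsigned nr_chunks, uint64_t nr_commits)
{
	return (uint64_t)nr_chunks * nr_commits;
}

/* Truncated integer percentage (an assumption about how the meter rounds). */
unsigned toy_progress_percent(uint64_t done, uint64_t total)
{
	return total ? (unsigned)(done * 100 / total) : 0;
}
```

With 837569 commits, the stale `num_chunks = 3` gives an expected total of 2512707, while the five chunks actually written produce 4187845 items, so `toy_progress_percent(4187845, 2512707)` yields 166. Using the real chunk count gives a total of 4187845 and the meter ends at 100.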

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 2/24/2021 12:12 PM, Taylor Blau wrote:
> On Wed, Feb 24, 2021 at 05:52:42PM +0100, SZEDER Gábor wrote:
>> On Thu, Feb 18, 2021 at 02:07:25PM +0000, Derrick Stolee via GitGitGadget wrote:
>>> From: Derrick Stolee <dstolee@microsoft.com>
>>>
>>> The commit-graph write logic is ready to make use of the chunk-format
>>> write API. Each chunk write method is already in the correct prototype.
>>> We only need to use the 'struct chunkfile' pointer and the correct API
>>> calls.
>>
>> This patch series messes up the "Writing out commit graph" progress
>> display, and starting at this commit I get:

Thanks for the report and identifying the exact place that caused the
mistake.

> I can confirm. It looks like we never dropped the 'num_chunks' variable,
> which should have happened in this patch.

Yes, makes sense. Hard to see that 'num_chunks' wasn't used because it
_was_ being used, just not as intended.

> @@ -1887,11 +1886,11 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  		strbuf_addf(&progress_title,
>  			    Q_("Writing out commit graph in %d pass",
>  			       "Writing out commit graph in %d passes",
> -			       num_chunks),
> -			    num_chunks);
> +			       get_num_chunks(cf)),
> +			    get_num_chunks(cf));
>  		ctx->progress = start_delayed_progress(
>  			progress_title.buf,
> -			num_chunks * ctx->commits.nr);
> +			get_num_chunks(cf) * ctx->commits.nr);

This is obviously correct. Thanks for the quick patch!

-Stolee

On the Git mailing list, Junio C Hamano wrote (reply to this):

Derrick Stolee <stolee@gmail.com> writes:

>>> This patch series messes up the "Writing out commit graph" progress
>>> display, and starting at this commit I get:
>
> Thanks for the report and identifying the exact place that caused the
> mistake.
>
>> I can confirm. It looks like we never dropped the 'num_chunks' variable,
>> which should have happened in this patch.
>
> Yes, makes sense. Hard to see that 'num_chunks' wasn't used because it
> _was_ being used, just not as intended.
> ...
>
> This is obviously correct. Thanks for the quick patch!

Thanks all for noticing and fixing before the series hit the master
branch.

derrickstolee force-pushed the chunk-format/refactor branch 5 times, most recently from 6bb8725 to 05cbd0a Compare January 26, 2021 15:23

gitgitgadget bot added the seen label Jan 27, 2021

gitgitgadget bot reviewed Jan 27, 2021

derrickstolee force-pushed the chunk-format/refactor branch from 05cbd0a to 8f3985a Compare January 27, 2021 14:43

derrickstolee changed the base branch from ak/corrected-commit-date to next January 27, 2021 14:55

derrickstolee self-assigned this Jan 28, 2021

derrickstolee added 10 commits February 18, 2021 08:17

derrickstolee force-pushed the chunk-format/refactor branch from a68f2c5 to 62a9565 Compare February 18, 2021 13:26

derrickstolee added 6 commits February 18, 2021 08:32

derrickstolee force-pushed the chunk-format/refactor branch from 62a9565 to 84bf650 Compare February 18, 2021 13:32

gitgitgadget bot reviewed Feb 18, 2021

gitgitgadget bot added the next label Feb 23, 2021

derrickstolee closed this Feb 23, 2021

gitgitgadget bot reviewed Feb 24, 2021
