Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor chunk-format into an API #848

Closed

Conversation

derrickstolee
Copy link

@derrickstolee derrickstolee commented Jan 26, 2021

This is a restart on the topic previously submitted [1] but dropped because ak/corrected-commit-date was still in progress. This version is based on that branch.

[1] https://lore.kernel.org/git/pull.804.git.1607012215.gitgitgadget@gmail.com/

This version also changes the approach to use a more dynamic interaction with a struct chunkfile pointer. This idea is credited to Taylor Blau [2], but I started again from scratch. I also go further to make struct chunkfile anonymous to API consumers. It is defined only in chunk-format.c, which should hopefully deter future users from interacting with that data directly.

[2] https://lore.kernel.org/git/X8%2FI%2FRzXZksio+ri@nand.local/

This combined API is beneficial to reduce duplicated logic. Or rather, to ensure that similar file formats have similar protections against bad data. The multi-pack-index code did not have as many guards as the commit-graph code did, but now they both share a common base that checks for things like duplicate chunks or offsets outside the size of the file.

Here are some stats for the end-to-end change:

  • 570 insertions(+), 456 deletions(-).
  • commit-graph.c: 107 insertions(+), 192 deletions(-)
  • midx.c: 164 insertions(+), 260 deletions(-)

While there is an overall increase to the code size, the consumers do get smaller. Boilerplate things like abstracting method to match chunk_write_fn and chunk_read_fn make up a lot of these insertions. The "interesting" code gets a lot smaller and cleaner.

Updates in V4

  • Out-of-date macros in commit-graph.c and midx.c are removed in their appropriate patches.
  • Documentation around the read API is improved.

Updates in V3

  • API methods use better types and changed their order to match internal data more closely.

  • Use hashfile_total() instead of internal data values.

  • The implementation of pair_chunk() uses read_chunk().

  • init_chunkfile() has an in-code doc comment warning against using the same struct chunkfile for reads and writes.

  • More multiplications are correctly cast in midx.c.

  • The chunk-format technical docs are expanded.

Updates in V2

  • The method pair_chunk() now automatically sets a pointer while read_chunk() uses the callback. This greatly reduces the code size.

  • Pointer casts are now implicit instead of explicit.

  • Extra care is taken to not overflow when verifying chunk sizes on write.

Thanks,
-Stolee

Cc: me@ttaylorr.com
Cc: gitster@pobox.com
Cc: l.s.r@web.de
Cc: szeder.dev@gmail.com
cc: Chris Torek chris.torek@gmail.com
cc: Derrick Stolee stolee@gmail.com

@derrickstolee derrickstolee force-pushed the chunk-format/refactor branch 5 times, most recently from 6bb8725 to 05cbd0a Compare January 26, 2021 15:23
@derrickstolee
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Jan 26, 2021

Submitted as pull.848.git.1611676886.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-848/derrickstolee/chunk-format/refactor-v1

To fetch this version to local tag pr-848/derrickstolee/chunk-format/refactor-v1:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-848/derrickstolee/chunk-format/refactor-v1

@gitgitgadget
Copy link

gitgitgadget bot commented Jan 27, 2021

This branch is now known as ds/chunked-file-api.

@gitgitgadget
Copy link

gitgitgadget bot commented Jan 27, 2021

This patch series was integrated into seen via git@f8b81ab.

@gitgitgadget gitgitgadget bot added the seen label Jan 27, 2021
@@ -1040,8 +1040,10 @@ struct write_commit_graph_context {
};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Tue, Jan 26, 2021 at 05:53:39PM -0800, Chris Torek wrote:
> Note: this is purely style, and minor, but I'll ask...
>
> On Tue, Jan 26, 2021 at 8:08 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> >  static int write_graph_chunk_fanout(struct hashfile *f,
> > -                                   struct write_commit_graph_context *ctx)
> > +                                   void *data)
> >  {
> > +       struct write_commit_graph_context *ctx =
> > +               (struct write_commit_graph_context *)data;
>
> Why bother with the cast on the last line here?  In C,
> conversion from `void *` to `struct whatever *` is fine.
>
> (the change itself looks fine, btw)

Agreed. It's not a correctness issue, but I find these unnecessary casts
to detract from readability. If you do end up rerolling this series,
I'd rather see

    struct write_commit_graph_context *ctx = data;

...but I don't think that this (non-)issue alone is worth a reroll.

Thanks,
Taylor

@@ -451,7 +451,7 @@ static int pack_info_compare(const void *_a, const void *_b)
return strcmp(a->pack_name, b->pack_name);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Tue, Jan 26, 2021 at 04:01:13PM +0000, Derrick Stolee via GitGitGadget wrote:
> This change only renames "struct pack_info" to "struct
> write_midx_context" and the names of instances from "packs" to "ctx". In
> future changes, we will expand the data inside "struct
> write_midx_context" and align our chunk-writing method with the
> chunk-format API.

Thanks for saying that; that makes clear what is (and isn't) going on
here.

> @@ -463,37 +463,37 @@ struct pack_list {
>  static void add_pack_to_midx(const char *full_path, size_t full_path_len,
>  			     const char *file_name, void *data)
>  {
> -	struct pack_list *packs = (struct pack_list *)data;
> +	struct write_midx_context *ctx = (struct write_midx_context *)data;

Same comments as earlier about the unnecessary cast on the right-hand
side of this (and the below) assignment.

Otherwise this patch looks obviously fine to me.


Thanks,
Taylor

@@ -0,0 +1,155 @@
#include "cache.h"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Tue, Jan 26, 2021 at 04:01:21PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/chunk-format.c b/chunk-format.c
> index 2ce37ecc6bb..674d31d5e58 100644
> --- a/chunk-format.c
> +++ b/chunk-format.c
> @@ -12,6 +12,8 @@ struct chunk_info {
>  	uint32_t id;
>  	uint64_t size;
>  	chunk_write_fn write_fn;
> +
> +	const void *start;

It may be clearer to fold both of these into an anonymous union along
with an enum to indicate which mode we're in. But, I could also buy that
that is more error prone, so perhaps just a comment along the lines of
"exactly one of these is NULL" would suffice, too.

>  };
>
>  struct chunkfile {
> @@ -89,3 +91,65 @@ int write_chunkfile(struct chunkfile *cf, void *data)
>
>  	return 0;
>  }
> +
> +int read_table_of_contents(struct chunkfile *cf,
> +			   const unsigned char *mfile,
> +			   size_t mfile_size,

Assuming that mfile and mfile_size are a pointer to a memory mapped
region and its size? If so, a nit is that I'd expect "data" and "size"
instead of "mfile".

I think that it's probably going too far to have the chunkfile API
handle mapping its own memory, so in that way I don't think it's wrong
for the callers to be handling that.

OTOH, it does seem a little weird to temporarily hand off ownership like
this. I don't think I have a better suggestion, though.

The implementation of this function looks good to me.

> +int pair_chunk(struct chunkfile *cf,
> +	       uint32_t chunk_id,
> +	       chunk_read_fn fn,
> +	       void *data)
> +{
> +	int i;
> +
> +	for (i = 0; i < cf->chunks_nr; i++) {
> +		if (cf->chunks[i].id == chunk_id)
> +			return fn(cf->chunks[i].start, cf->chunks[i].size, data);
> +	}
> +
> +	return CHUNK_NOT_FOUND;
> +}
> diff --git a/chunk-format.h b/chunk-format.h
> index bfaed672813..250e08b8e6a 100644
> --- a/chunk-format.h
> +++ b/chunk-format.h
> @@ -17,4 +17,25 @@ void add_chunk(struct chunkfile *cf,
>  	       size_t size);
>  int write_chunkfile(struct chunkfile *cf, void *data);
>
> +int read_table_of_contents(struct chunkfile *cf,
> +			   const unsigned char *mfile,
> +			   size_t mfile_size,
> +			   uint64_t toc_offset,
> +			   int toc_length);
> +
> +/*
> + * When reading a table of contents, we find the chunk with matching 'id'
> + * then call its read_fn to populate the necessary 'data' based on the
> + * chunk start and size.
> + */
> +typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
> +			     size_t chunk_size, void *data);
> +
> +
> +#define CHUNK_NOT_FOUND (-2)
> +int pair_chunk(struct chunkfile *cf,
> +	       uint32_t chunk_id,
> +	       chunk_read_fn fn,
> +	       void *data);

From reading the implementation, I take it that this function calls fn
with the location and size of the requested chunk, along with the user
supplied data.

I'm not sure that "pair" gives me that same sense. Maybe "read" or
"lookup" would be better?

Dunno.

Thanks,
Taylor

@gitgitgadget
Copy link

gitgitgadget bot commented Jan 27, 2021

On the Git mailing list, Taylor Blau wrote (reply to this):

On Tue, Jan 26, 2021 at 04:01:09PM +0000, Derrick Stolee via GitGitGadget wrote:
> This version also changes the approach to use a more dynamic interaction
> with a struct chunkfile pointer. This idea is credited to Taylor Blau [2],
> but I started again from scratch. I also go further to make struct chunkfile
> anonymous to API consumers. It is defined only in chunk-format.c, which
> should hopefully deter future users from interacting with that data
> directly.
>
> [2] https://lore.kernel.org/git/X8%2FI%2FRzXZksio+ri@nand.local/

Great; I am very happy that you found my patch to be useful. I'm glad
that you decided to start from scratch, too, since as I recall there
were some unresolved test issues that I punted on in case you decided to
abandon the topic altogether.

> This combined API is beneficial to reduce duplicated logic. Or rather, to
> ensure that similar file formats have similar protections against bad data.
> The multi-pack-index code did not have as many guards as the commit-graph
> code did, but now they both share a common base that checks for things like
> duplicate chunks or offsets outside the size of the file.

Definitely good.

> Here are some stats for the end-to-end change:
>
>  * 638 insertions(+), 456 deletions(-).
>  * commit-graph.c: 171 insertions(+), 192 deletions(-)
>  * midx.c: 196 insertions(+), 260 deletions(-)
>
> While there is an overall increase to the code size, the consumers do get a
> bit smaller. Boilerplate things like abstracting method to match
> chunk_write_fn and chunk_read_fn make up a lot of these insertions. The
> "interesting" code gets a lot smaller and cleaner.

Like I said in [1], I don't think a net +182 line diff is reason alone
not to pursue this topic. I don't think that an chunked index v3 will
come as part of my work on the on-disk revindex format, but I do think
that it's something brian may be interested in. So, I'm feeling rather
certain that we'll eventually have new callers, at which point this will
reduce duplication overall.

[1]: https://lore.kernel.org/git/X8%2FK1dUgUmwp8ZOv@nand.local/

Thanks,
Taylor

@@ -854,6 +854,7 @@ LIB_OBJS += bundle.o
LIB_OBJS += cache-tree.o
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Tue, Jan 26, 2021 at 04:01:11PM +0000, Derrick Stolee via GitGitGadget wrote:
> +/*
> + * When writing a chunk-based file format, collect the chunks in
> + * an array of chunk_info structs. The size stores the _expected_
> + * amount of data that will be written by write_fn.
> + */
> +struct chunk_info {
> +	uint32_t id;
> +	uint64_t size;

Hmm. Would we not want an off_t to indicate the size here?

I wondered briefly if we even needed a size field at all, since calling
write_fn would tell us the number of bytes written. But I suppose you
want to know ahead of time so that you can write the file in one pass
(beginning with the table of contents, which certainly needs to know the
size).

> +	/* Trailing entry marks the end of the chunks */
> +	hashwrite_be32(cf->f, 0);
> +	hashwrite_be64(cf->f, cur_offset);
> +
> +	for (i = 0; i < cf->chunks_nr; i++) {
> +		uint64_t start_offset = cf->f->total + cf->f->offset;
> +		int result = cf->chunks[i].write_fn(cf->f, data);
> +
> +		if (result)
> +			return result;
> +
> +		if (cf->f->total + cf->f->offset != start_offset + cf->chunks[i].size)

I don't think this is a practical concern, but a malicious caller could
overflow this by passing a bogus "size" parameter. Maybe:

    uint64_t end_offset = ...;

    if (end_offset - start_offset != cf->chunks[i].size)
      BUG(...)

?

> diff --git a/chunk-format.h b/chunk-format.h
> new file mode 100644
> index 00000000000..bfaed672813
> --- /dev/null
> +++ b/chunk-format.h
> @@ -0,0 +1,20 @@
> +#ifndef CHUNK_FORMAT_H
> +#define CHUNK_FORMAT_H
> +
> +#include "git-compat-util.h"
> +
> +struct hashfile;
> +struct chunkfile;
> +
> +struct chunkfile *init_chunkfile(struct hashfile *f);
> +void free_chunkfile(struct chunkfile *cf);
> +int get_num_chunks(struct chunkfile *cf);
> +typedef int (*chunk_write_fn)(struct hashfile *f,
> +			      void *data);
> +void add_chunk(struct chunkfile *cf,
> +	       uint64_t id,
> +	       chunk_write_fn fn,
> +	       size_t size);
> +int write_chunkfile(struct chunkfile *cf, void *data);

Very clean API.

Thanks,
Taylor

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 1/26/2021 9:42 PM, Taylor Blau wrote:
> On Tue, Jan 26, 2021 at 04:01:11PM +0000, Derrick Stolee via GitGitGadget wrote:
>> +/*
>> + * When writing a chunk-based file format, collect the chunks in
>> + * an array of chunk_info structs. The size stores the _expected_
>> + * amount of data that will be written by write_fn.
>> + */
>> +struct chunk_info {
>> +	uint32_t id;
>> +	uint64_t size;
> 
> Hmm. Would we not want an off_t to indicate the size here?
> 
> I wondered briefly if we even needed a size field at all, since calling
> write_fn would tell us the number of bytes written. But I suppose you
> want to know ahead of time so that you can write the file in one pass
> (beginning with the table of contents, which certainly needs to know the
> size).

Is off_t 64-bits on a 32-bit machine? This is intentionally typed
to be "64 bits no matter what" because it correlates with the file
format's size for the chunk offsets.

>> +		if (cf->f->total + cf->f->offset != start_offset + cf->chunks[i].size)
> 
> I don't think this is a practical concern, but a malicious caller could
> overflow this by passing a bogus "size" parameter. Maybe:
> 
>     uint64_t end_offset = ...;
> 
>     if (end_offset - start_offset != cf->chunks[i].size)
>       BUG(...)

Sure.

Thanks,
-Stolee

@@ -1040,8 +1040,10 @@ struct write_commit_graph_context {
};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Chris Torek wrote (reply to this):

Note: this is purely style, and minor, but I'll ask...

On Tue, Jan 26, 2021 at 8:08 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>  static int write_graph_chunk_fanout(struct hashfile *f,
> -                                   struct write_commit_graph_context *ctx)
> +                                   void *data)
>  {
> +       struct write_commit_graph_context *ctx =
> +               (struct write_commit_graph_context *)data;

Why bother with the cast on the last line here?  In C,
conversion from `void *` to `struct whatever *` is fine.

(the change itself looks fine, btw)

Chris

@gitgitgadget
Copy link

gitgitgadget bot commented Jan 27, 2021

User Chris Torek <chris.torek@gmail.com> has been added to the cc: list.

@@ -19,6 +19,7 @@
#include "shallow.h"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Tue, Jan 26, 2021 at 04:01:12PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The commit-graph write logic is ready to make use of the chunk-format
> write API. Each chunk write method is already in the correct prototype.
> We only need to use the 'struct chunkfile' pointer and the correct API
> calls.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>

Nicely done. The majority of this patch was remarkably easy to read,
which I attribute to you doing the necessary prep work to make the
callbacks usable by the new API. Thank you.

> @@ -1941,6 +1896,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>
>  	close_commit_graph(ctx->r->objects);
>  	finalize_hashfile(f, file_hash.hash, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
> +	free_chunkfile(cf);

Since chunkfiles are so tightly coupled to hashfiles (i.e., you can only
"construct" a chunkfile given a 'struct hashfile*'), I wonder whether
this should be:

    finalize_chunkfile(cf, ...)

instead. It seems kind of weird to give up ownership of 'f' down to the
chunkfile API only to reach down into it again.

I could even buy that you'd always want to finalize and free a chunkfile
at the same time, and so perhaps the calls could be combined, but that
may be a step too far.

Thanks,
Taylor

@@ -11,6 +11,7 @@
#include "trace2.h"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Tue, Jan 26, 2021 at 04:01:23PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Instead of parsing the table of contents directly, use the chunk-format
> API methods read_table_of_contents() and pair_chunk(). In particular, we
> can use the return value of pair_chunk() to generate an error when a
> required chunk is missing.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  midx.c                      | 103 ++++++++++++++++++++----------------
>  t/t5319-multi-pack-index.sh |   6 +--
>  2 files changed, 60 insertions(+), 49 deletions(-)
>
> diff --git a/midx.c b/midx.c
> index 0bfd2d802b6..dd019c00795 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -54,6 +54,51 @@ static char *get_midx_filename(const char *object_dir)
>  	return xstrfmt("%s/pack/multi-pack-index", object_dir);
>  }
>
> +static int midx_read_pack_names(const unsigned char *chunk_start,
> +				size_t chunk_size, void *data)
> +{
> +	struct multi_pack_index *m = (struct multi_pack_index *)data;
> +	m->chunk_pack_names = chunk_start;
> +	return 0;
> +}

There are a lot of these callbacks that just assign some 'void **' to
point at chunk_start.

Maybe a good use of the "pair_chunk" name would be something like:

    int pair_chunk(struct chunkfile *cf, uint32_t id, const unsigned char **p);

which does the same as what you wrote here. So instead of what you
wrote, you could instead:

    pair_chunk(cf, MIDX_CHUNKID_PACKNAMES, &m->chunk_pack_names);

This would be in addition to the richer callback-style function which
allows the caller greater flexibility (e.g., for the Bloom filter
related readers in the commit-graph code).

Thanks,
Taylor

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 1/26/2021 10:06 PM, Taylor Blau wrote:
> On Tue, Jan 26, 2021 at 04:01:23PM +0000, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> Instead of parsing the table of contents directly, use the chunk-format
>> API methods read_table_of_contents() and pair_chunk(). In particular, we
>> can use the return value of pair_chunk() to generate an error when a
>> required chunk is missing.
>>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>  midx.c                      | 103 ++++++++++++++++++++----------------
>>  t/t5319-multi-pack-index.sh |   6 +--
>>  2 files changed, 60 insertions(+), 49 deletions(-)
>>
>> diff --git a/midx.c b/midx.c
>> index 0bfd2d802b6..dd019c00795 100644
>> --- a/midx.c
>> +++ b/midx.c
>> @@ -54,6 +54,51 @@ static char *get_midx_filename(const char *object_dir)
>>  	return xstrfmt("%s/pack/multi-pack-index", object_dir);
>>  }
>>
>> +static int midx_read_pack_names(const unsigned char *chunk_start,
>> +				size_t chunk_size, void *data)
>> +{
>> +	struct multi_pack_index *m = (struct multi_pack_index *)data;
>> +	m->chunk_pack_names = chunk_start;
>> +	return 0;
>> +}
> 
> There are a lot of these callbacks that just assign some 'void **' to
> point at chunk_start.
> 
> Maybe a good use of the "pair_chunk" name would be something like:
> 
>     int pair_chunk(struct chunkfile *cf, uint32_t id, const unsigned char **p);
> 
> which does the same as what you wrote here. So instead of what you
> wrote, you could instead:
> 
>     pair_chunk(cf, MIDX_CHUNKID_PACKNAMES, &m->chunk_pack_names);
> 
> This would be in addition to the richer callback-style function which
> allows the caller greater flexibility (e.g., for the Bloom filter
> related readers in the commit-graph code).

You're right that _most_ callers just want to assign a pointer,
so this mechanism would be better. I'll make a different function,
read_chunk() perhaps, that relies on a callback for advanced users.

Thanks,
-Stolee

@gitgitgadget
Copy link

gitgitgadget bot commented Jan 27, 2021

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This is a restart on the topic previously submitted [1] but dropped because
> ak/corrected-commit-date was still in progress. This version is based on
> that branch.

Nice to see that we have an endorsement on ak/corrected-commit-date
topic ;-)

I've scanned this round of the topic and they were pleasant read.

I may have other comments after a more careful reading, but so far,
I am happy with what I see here.

Thanks.

@gitgitgadget
Copy link

gitgitgadget bot commented Jan 27, 2021

User Derrick Stolee <stolee@gmail.com> has been added to the cc: list.

In preparation for creating an API around file formats using chunks and
tables of contents, prepare the commit-graph write code to use
prototypes that will match this new API.

Specifically, convert chunk_write_fn to take a "void *data" parameter
instead of the commit-graph-specific "struct write_commit_graph_context"
pointer.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
@derrickstolee derrickstolee changed the base branch from ak/corrected-commit-date to next January 27, 2021 14:55
@derrickstolee
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Jan 27, 2021

Submitted as pull.848.v2.git.1611759716.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-848/derrickstolee/chunk-format/refactor-v2

To fetch this version to local tag pr-848/derrickstolee/chunk-format/refactor-v2:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-848/derrickstolee/chunk-format/refactor-v2

@gitgitgadget
Copy link

gitgitgadget bot commented Jan 27, 2021

On the Git mailing list, Taylor Blau wrote (reply to this):

On Wed, Jan 27, 2021 at 03:01:39PM +0000, Derrick Stolee via GitGitGadget wrote:
> Updates in V2
> =============
>
>  * The method pair_chunk() now automatically sets a pointer while
>    read_chunk() uses the callback. This greatly reduces the code size.
>
>  * Pointer casts are now implicit instead of explicit.
>
>  * Extra care is taken to not overflow when verifying chunk sizes on write.

Thanks, I read the range-diff between this version and the last and
appreciate you taking the time to address all of my concerns.

I think that this is ready to go, so please have my:

  Reviewed-by: Taylor Blau <me@ttaylorr.com>

Thanks,
Taylor

@derrickstolee derrickstolee self-assigned this Jan 28, 2021
@gitgitgadget
Copy link

gitgitgadget bot commented Jan 28, 2021

This patch series was integrated into seen via git@867650a.

@gitgitgadget
Copy link

gitgitgadget bot commented Jan 28, 2021

This patch series was integrated into seen via git@489277d.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 11, 2021

This patch series was integrated into seen via git@2fec6f9.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 12, 2021

This patch series was integrated into seen via git@d6bd16c.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 17, 2021

This patch series was integrated into seen via git@a4a8003.

In anticipation of combining the logic from the commit-graph and
multi-pack-index file formats, create a new chunk-format API. Use a
'struct chunkfile' pointer to keep track of data that has been
registered for writes. This struct is anonymous outside of
chunk-format.c to ensure no user attempts to interfere with the data.

The next change will use this API in commit-graph.c, but the general
approach is:

 1. initialize the chunkfile with init_chunkfile(f).
 2. add chunks in the intended writing order with add_chunk().
 3. write any header information to the hashfile f.
 4. write the chunkfile data using write_chunkfile().
 5. free the chunkfile struct using free_chunkfile().

Helped-by: Taylor Blau <me@ttaylorr.com>
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The commit-graph write logic is ready to make use of the chunk-format
write API. Each chunk write method is already in the correct prototype.
We only need to use the 'struct chunkfile' pointer and the correct API
calls.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
In an effort to streamline our chunk-based file formats, align some of
the code structure in write_midx_internal() to be similar to the
patterns in write_commit_graph_file().

Specifically, let's create a "struct write_midx_context" that can be
used as a data parameter to abstract function types.

This change only renames "struct pack_info" to "struct
write_midx_context" and the names of instances from "packs" to "ctx". In
future changes, we will expand the data inside "struct
write_midx_context" and align our chunk-writing method with the
chunk-format API.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
In an effort to align the write_midx_internal() to use the chunk-format
API, start converting chunk writing methods to match chunk_write_fn. The
first case is to convert write_midx_pack_names() to take "void *data".
We already have the necessary data in "struct write_midx_context", so
this conversion is rather mechanical.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
In an effort to align write_midx_internal() with the chunk-format API,
continue to group necessary data into "struct write_midx_context". This
change collects the "struct pack_midx_entry *entries" list and its count
into the context.

Update write_midx_oid_fanout() and write_midx_oid_lookup() to take the
context directly, as these are easy conversions with this new data.

Only the callers of write_midx_object_offsets() and
write_midx_large_offsets() are updated here, since additional data in
the context before those methods can match chunk_write_fn.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
In an effort to align write_midx_internal() with the chunk-format API,
continue to group necessary data into "struct write_midx_context". This
change collects the "uint32_t *pack_perm" and large_offsets_needed bit
into the context.

Update write_midx_object_offsets() to match chunk_write_fn.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
In an effort to align write_midx_internal() with the chunk-format API,
continue to group necessary data into "struct write_midx_context". This
change collects the "uint32_t num_large_offsets" into the context. With
this new data, write_midx_large_offsets() now matches the
chunk_write_fn type.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Historically, the chunk-writing methods in midx.c have returned the
amount of data written so the writer method could compare this with the
table of contents. This presents with some interesting issues:

1. If a chunk writing method has a bug that miscalculates the written
   bytes, then we can satisfy the table of contents without actually
   writing the right amount of data to the hashfile. The commit-graph
   writing code checks the hashfile struct directly for a more robust
   verification.

2. There is no way for a chunk writing method to gracefully fail.
   Returning an int presents an opportunity to fail without a die().

3. The current pattern doesn't match chunk_write_fn type exactly, so we
   cannot share code with commit-graph.c

For these reasons, convert the midx chunk writer methods to return an
'int'. Since none of them fail at the moment, they all return 0.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Most expensive operations in write_midx_internal() use the context
struct's progress member, and these indicate the process of the
expensive operations within the chunk writing methods. However, there is
a competing progress struct that counts the progress over all chunks.
This is not very helpful compared to the others, so drop it.

This also reduces our barriers to combining the chunk writing code with
chunk-format.c.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The chunk-format API allows writing the table of contents and all chunks
using the anonymous 'struct chunkfile' type. We only need to convert our
local chunk logic to this API for the multi-pack-index writes to share
that logic with the commit-graph file writes.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Add the capability to read the table of contents, then pair the chunks
with necessary logic using read_chunk_fn pointers. Callers will be added
in future changes, but the typical outline will be:

 1. initialize a 'struct chunkfile' with init_chunkfile(NULL).
 2. call read_table_of_contents().
 3. for each chunk to parse,
    a. call pair_chunk() to assign a pointer with the chunk position, or
    b. call read_chunk() to run a callback on the chunk start and size.
 4. call free_chunkfile() to clear the 'struct chunkfile' data.

We are re-using the anonymous 'struct chunkfile' data, as it is internal
to the chunk-format API. This gives it essentially two modes: write and
read. If the same struct instance was used for both reads and writes,
then there would be failures.

Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Instead of parsing the table of contents directly, use the chunk-format
API methods read_table_of_contents() and pair_chunk(). While the current
implementation loses the duplicate-chunk detection, that will be added
in a future change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Instead of parsing the table of contents directly, use the chunk-format
API methods read_table_of_contents() and pair_chunk(). In particular, we
can use the return value of pair_chunk() to generate an error when a
required chunk is missing.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
When calculating the sizes of certain chunks, we should use 64-bit
multiplication always. This allows us to properly predict the chunk
sizes without risk of overflow.

Other possible overflows were discovered by evaluating each
multiplication in midx.c and ensuring that at least one side of the
operator was of type size_t or off_t.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Before refactoring into the chunk-format API, the commit-graph parsing
logic included checks for duplicate chunks. It is unlikely that we would
desire a chunk-based file format that allows duplicate chunk IDs in the
table of contents, so add duplicate checks into
read_table_of_contents().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The chunk-based file format is now an API in the code, but we should
also take time to document it as a file format. Specifically, it matches
the CHUNK LOOKUP sections of the commit-graph and multi-pack-index
files, but there are some commonalities that should be grouped in this
document.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
@derrickstolee
Copy link
Author

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 18, 2021

Submitted as pull.848.v4.git.1613657259.gitgitgadget@gmail.com

To fetch this version into FETCH_HEAD:

git fetch https://github.com/gitgitgadget/git pr-848/derrickstolee/chunk-format/refactor-v4

To fetch this version to local tag pr-848/derrickstolee/chunk-format/refactor-v4:

git fetch --no-tags https://github.com/gitgitgadget/git tag pr-848/derrickstolee/chunk-format/refactor-v4

@@ -0,0 +1,116 @@
Chunk-based file formats
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +Chunk-based file formats
> +========================
> +
> +Some file formats in Git use a common concept of "chunks" to describe
> +sections of the file. This allows structured access to a large file by
> +scanning a small "table of contents" for the remaining data. This common
> +format is used by the `commit-graph` and `multi-pack-index` files. See
> +link:technical/pack-format.html[the `multi-pack-index` format] and
> +link:technical/commit-graph-format.html[the `commit-graph` format] for
> +how they use the chunks to describe structured data.

I've read the doc added here to the end; well written and easy to
understand.

I wonder how/if well reftable files fit in the scheme, or if it
doesn't, should the chunk file format API be updated to accomodate
it (or the other way around)?

> +Extract the data information for each chunk using `pair_chunk()` or
> +`read_chunk()`:
> +
> +* `pair_chunk()` assigns a given pointer with the location inside the
> +  memory-mapped file corresponding to that chunk's offset. If the chunk
> +  does not exist, then the pointer is not modified.

I think it is worth adding:

    The caller is expected to know where the returned chunk ends by
    some out-of-band means, as this function only gives the offset
    but not the size, unlike the read_chunk() function.

> +* `read_chunk()` takes a `chunk_read_fn` function pointer and calls it
> +  with the appropriate initial pointer and size information. The function
> +  is not called if the chunk does not exist. Use this method to read chunks
> +  if you need to perform immediate parsing or if you need to execute logic
> +  based on the size of the chunk.
> +
> +After calling these methods, call `free_chunkfile()` to clear the
> +`struct chunkfile` data. This will not close the memory-mapped region.
> +Callers are expected to own that data for the timeframe the pointers into
> +the region are needed.

Thanks.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 2/18/2021 4:47 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> +Chunk-based file formats
>> +========================
>> +
>> +Some file formats in Git use a common concept of "chunks" to describe
>> +sections of the file. This allows structured access to a large file by
>> +scanning a small "table of contents" for the remaining data. This common
>> +format is used by the `commit-graph` and `multi-pack-index` files. See
>> +link:technical/pack-format.html[the `multi-pack-index` format] and
>> +link:technical/commit-graph-format.html[the `commit-graph` format] for
>> +how they use the chunks to describe structured data.
> 
> I've read the doc added here to the end; well written and easy to
> understand.
> 
> I wonder how/if well reftable files fit in the scheme, or if it
> doesn't, should the chunk file format API be updated to accomodate
> it (or the other way around)?

I'm not sure that reftable can work with this format, especially with
its design to do most updates as append-only (IIUC). And to change the
format to work with the chunk format would violate the compatibility
with the JGit version. I would be interested if something like the
packed-refs file could use a minor update, but only if there is a
realistic benefit to using chunks over the current format.

The files that are on my radar for adopting a new file format using the
chunk-format API are:

* reachability bitmaps: using a similar approach to the commit-graph,
  we could avoid parsing the entire file before checking if a specific
  commit has a bitmap. (Requires a commit lookup chunk, a bitmap data
  chunk, and an offset chunk to connect them.)

* index v5: I'm trying to collect a bunch of information about how to
  update the index for better compression, and the chunk-based approach
  can provide some fixed-width columns that can vary in length depending
  on the required data (presenting the interesting behavior from v2 and v3,
  along with possible approaches previously presented as a potential v5).
  The paths could be presented as a chunk, giving the interesting options
  between v2/3 and v4 (prefix compression). I haven't even started the
  actual work here, but I've been thinking about it a lot. I'll have time
  next month to start prototyping.

Are there other interesting files that could use a new version here?
What other pain points are known to experts in the area?

>> +Extract the data information for each chunk using `pair_chunk()` or
>> +`read_chunk()`:
>> +
>> +* `pair_chunk()` assigns a given pointer with the location inside the
>> +  memory-mapped file corresponding to that chunk's offset. If the chunk
>> +  does not exist, then the pointer is not modified.
> 
> I think it is worth adding:
> 
>     The caller is expected to know where the returned chunk ends by
>     some out-of-band means, as this function only gives the offset
>     but not the size, unlike the read_chunk() function.

True. I suppose that could be more explicit, although it can be gleaned
from the omission of any size information.

Thanks,
-Stolee

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 19, 2021

This patch series was integrated into seen via git@8a7cd50.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 23, 2021

This patch series was integrated into seen via git@3d78343.

@gitgitgadget
Copy link

gitgitgadget bot commented Feb 23, 2021

This patch series was integrated into next via git@5f0e28c.

@@ -19,6 +19,7 @@
#include "shallow.h"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, SZEDER Gábor wrote (reply to this):

On Thu, Feb 18, 2021 at 02:07:25PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The commit-graph write logic is ready to make use of the chunk-format
> write API. Each chunk write method is already in the correct prototype.
> We only need to use the 'struct chunkfile' pointer and the correct API
> calls.

This patch series messes up the "Writing out commit graph" progress
display, and starting at this commit I get:

  $ git commit-graph write --reachable
  Expanding reachable commits in commit graph: 837569, done.
  Writing out commit graph in 3 passes: 166% (4187845/2512707), done.

Note that 166%.

Before this commit I got:

  Expanding reachable commits in commit graph: 837569, done.
  Writing out commit graph in 5 passes: 100% (4187845/4187845), done.

Note the different number of passes.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Wed, Feb 24, 2021 at 05:52:42PM +0100, SZEDER Gábor wrote:
> On Thu, Feb 18, 2021 at 02:07:25PM +0000, Derrick Stolee via GitGitGadget wrote:
> > From: Derrick Stolee <dstolee@microsoft.com>
> >
> > The commit-graph write logic is ready to make use of the chunk-format
> > write API. Each chunk write method is already in the correct prototype.
> > We only need to use the 'struct chunkfile' pointer and the correct API
> > calls.
>
> This patch series messes up the "Writing out commit graph" progress
> display, and starting at this commit I get:

I can confirm. It looks like we never dropped the 'num_chunks' variable,
which should have happened in this patch.

Here's something to apply on top which fixes the issue. Thanks for
reporting.

--- >8 ---

Subject: [PATCH] commit-graph.c: display correct number of chunks when writing

When writing a commit-graph, a progress meter is shown which indicates
the number of pieces of data to write (one per commit in each chunk).

In 47410aa837 (commit-graph: use chunk-format write API, 2021-02-18),
the number of chunks became tracked by the new chunk-format API. But a
stray local variable was left behind from when write_commit_graph_file()
used to keep track of the same.

Since this was no longer updated after 47410aa837, the progress meter
appeared broken:

    $ git commit-graph write --reachable
    Expanding reachable commits in commit graph: 837569, done.
    Writing out commit graph in 3 passes: 166% (4187845/2512707), done.

Drop the local variable and rely instead on the chunk-format API to tell
us the correct number of chunks.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
 commit-graph.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 78b993c367..6aa0c488f5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1791,7 +1791,6 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 	struct lock_file lk = LOCK_INIT;
 	const unsigned hashsz = the_hash_algo->rawsz;
 	struct strbuf progress_title = STRBUF_INIT;
-	int num_chunks = 3;
 	struct object_id file_hash;
 	struct chunkfile *cf;

@@ -1887,11 +1886,11 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
 		strbuf_addf(&progress_title,
 			    Q_("Writing out commit graph in %d pass",
 			       "Writing out commit graph in %d passes",
-			       num_chunks),
-			    num_chunks);
+			       get_num_chunks(cf)),
+			    get_num_chunks(cf));
 		ctx->progress = start_delayed_progress(
 			progress_title.buf,
-			num_chunks * ctx->commits.nr);
+			get_num_chunks(cf) * ctx->commits.nr);
 	}

 	write_chunkfile(cf, ctx);
--
2.30.0.667.g81c0cbc6fd

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 2/24/2021 12:12 PM, Taylor Blau wrote:
> On Wed, Feb 24, 2021 at 05:52:42PM +0100, SZEDER Gábor wrote:
>> On Thu, Feb 18, 2021 at 02:07:25PM +0000, Derrick Stolee via GitGitGadget wrote:
>>> From: Derrick Stolee <dstolee@microsoft.com>
>>>
>>> The commit-graph write logic is ready to make use of the chunk-format
>>> write API. Each chunk write method is already in the correct prototype.
>>> We only need to use the 'struct chunkfile' pointer and the correct API
>>> calls.
>>
>> This patch series messes up the "Writing out commit graph" progress
>> display, and starting at this commit I get:

Thanks for the report and identifying the exact place that caused the
mistake.
 
> I can confirm. It looks like we never dropped the 'num_chunks' variable,
> which should have happened in this patch.

Yes, makes sense. Hard to see that 'num_chunks' wasn't used because it
_was_ being used, just not as intended.

> @@ -1887,11 +1886,11 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>  		strbuf_addf(&progress_title,
>  			    Q_("Writing out commit graph in %d pass",
>  			       "Writing out commit graph in %d passes",
> -			       num_chunks),
> -			    num_chunks);
> +			       get_num_chunks(cf)),
> +			    get_num_chunks(cf));
>  		ctx->progress = start_delayed_progress(
>  			progress_title.buf,
> -			num_chunks * ctx->commits.nr);
> +			get_num_chunks(cf) * ctx->commits.nr);

This is obviously correct. Thanks for the quick patch!

-Stolee

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Derrick Stolee <stolee@gmail.com> writes:

>>> This patch series messes up the "Writing out commit graph" progress
>>> display, and starting at this commit I get:
>
> Thanks for the report and identifying the exact place that caused the
> mistake.
>  
>> I can confirm. It looks like we never dropped the 'num_chunks' variable,
>> which should have happened in this patch.
>
> Yes, makes sense. Hard to see that 'num_chunks' wasn't used because it
> _was_ being used, just not as intended.
> ...
>
> This is obviously correct. Thanks for the quick patch!

Thanks all for noticing and fixing before the series hit the master
branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant