Refactor chunk-format into an API #848
6bb8725
to
05cbd0a
Compare
/submit |
Submitted as pull.848.git.1611676886.gitgitgadget@gmail.com To fetch this version into
To fetch this version to local tag
|
This branch is now known as |
This patch series was integrated into seen via git@f8b81ab. |
@@ -1040,8 +1040,10 @@ struct write_commit_graph_context { | |||
}; |
On the Git mailing list, Taylor Blau wrote (reply to this):
On Tue, Jan 26, 2021 at 05:53:39PM -0800, Chris Torek wrote:
> Note: this is purely style, and minor, but I'll ask...
>
> On Tue, Jan 26, 2021 at 8:08 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> > static int write_graph_chunk_fanout(struct hashfile *f,
> > - struct write_commit_graph_context *ctx)
> > + void *data)
> > {
> > + struct write_commit_graph_context *ctx =
> > + (struct write_commit_graph_context *)data;
>
> Why bother with the cast on the last line here? In C,
> conversion from `void *` to `struct whatever *` is fine.
>
> (the change itself looks fine, btw)
Agreed. It's not a correctness issue, but I find these unnecessary casts
to detract from readability. If you do end up rerolling this series,
I'd rather see
struct write_commit_graph_context *ctx = data;
...but I don't think that this (non-)issue alone is worth a reroll.
Thanks,
Taylor
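The point about the cast can be illustrated with a small standalone sketch. The names below (`demo_context`, `sum_callback`, `run_callback`) are hypothetical and do not come from the series; the sketch only shows that C converts `void *` to an object pointer type implicitly, so the callback body needs no cast.

```c
#include <stddef.h>

/* Hypothetical context type, standing in for write_commit_graph_context. */
struct demo_context {
	int values[3];
	size_t nr;
};

/* Callback taking an opaque pointer, in the style of chunk_write_fn. */
typedef int (*demo_fn)(void *data);

static int sum_callback(void *data)
{
	/* No cast needed: void * converts implicitly to an object pointer in C. */
	struct demo_context *ctx = data;
	int sum = 0;
	size_t i;

	for (i = 0; i < ctx->nr; i++)
		sum += ctx->values[i];
	return sum;
}

/* Invoke a callback with caller-owned data. */
static int run_callback(demo_fn fn, void *data)
{
	return fn(data);
}
```

Note that this holds for C only; C++ would require the explicit cast.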
@@ -451,7 +451,7 @@ static int pack_info_compare(const void *_a, const void *_b) | |||
return strcmp(a->pack_name, b->pack_name); |
On the Git mailing list, Taylor Blau wrote (reply to this):
On Tue, Jan 26, 2021 at 04:01:13PM +0000, Derrick Stolee via GitGitGadget wrote:
> This change only renames "struct pack_info" to "struct
> write_midx_context" and the names of instances from "packs" to "ctx". In
> future changes, we will expand the data inside "struct
> write_midx_context" and align our chunk-writing method with the
> chunk-format API.
Thanks for saying that; that makes clear what is (and isn't) going on
here.
> @@ -463,37 +463,37 @@ struct pack_list {
> static void add_pack_to_midx(const char *full_path, size_t full_path_len,
> const char *file_name, void *data)
> {
> - struct pack_list *packs = (struct pack_list *)data;
> + struct write_midx_context *ctx = (struct write_midx_context *)data;
Same comments as earlier about the unnecessary cast on the right-hand
side of this (and the below) assignment.
Otherwise this patch looks obviously fine to me.
Thanks,
Taylor
@@ -0,0 +1,155 @@ | |||
#include "cache.h" |
On the Git mailing list, Taylor Blau wrote (reply to this):
On Tue, Jan 26, 2021 at 04:01:21PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/chunk-format.c b/chunk-format.c
> index 2ce37ecc6bb..674d31d5e58 100644
> --- a/chunk-format.c
> +++ b/chunk-format.c
> @@ -12,6 +12,8 @@ struct chunk_info {
> uint32_t id;
> uint64_t size;
> chunk_write_fn write_fn;
> +
> + const void *start;
It may be clearer to fold both of these into an anonymous union along
with an enum to indicate which mode we're in. But, I could also buy that
that is more error prone, so perhaps just a comment along the lines of
"exactly one of these is NULL" would suffice, too.
> };
>
> struct chunkfile {
> @@ -89,3 +91,65 @@ int write_chunkfile(struct chunkfile *cf, void *data)
>
> return 0;
> }
> +
> +int read_table_of_contents(struct chunkfile *cf,
> + const unsigned char *mfile,
> + size_t mfile_size,
Assuming that mfile and mfile_size are a pointer to a memory mapped
region and its size? If so, a nit is that I'd expect "data" and "size"
instead of "mfile".
I think that it's probably going too far to have the chunkfile API
handle mapping its own memory, so in that way I don't think it's wrong
for the callers to be handling that.
OTOH, it does seem a little weird to temporarily hand off ownership like
this. I don't think I have a better suggestion, though.
The implementation of this function looks good to me.
> +int pair_chunk(struct chunkfile *cf,
> + uint32_t chunk_id,
> + chunk_read_fn fn,
> + void *data)
> +{
> + int i;
> +
> + for (i = 0; i < cf->chunks_nr; i++) {
> + if (cf->chunks[i].id == chunk_id)
> + return fn(cf->chunks[i].start, cf->chunks[i].size, data);
> + }
> +
> + return CHUNK_NOT_FOUND;
> +}
> diff --git a/chunk-format.h b/chunk-format.h
> index bfaed672813..250e08b8e6a 100644
> --- a/chunk-format.h
> +++ b/chunk-format.h
> @@ -17,4 +17,25 @@ void add_chunk(struct chunkfile *cf,
> size_t size);
> int write_chunkfile(struct chunkfile *cf, void *data);
>
> +int read_table_of_contents(struct chunkfile *cf,
> + const unsigned char *mfile,
> + size_t mfile_size,
> + uint64_t toc_offset,
> + int toc_length);
> +
> +/*
> + * When reading a table of contents, we find the chunk with matching 'id'
> + * then call its read_fn to populate the necessary 'data' based on the
> + * chunk start and size.
> + */
> +typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
> + size_t chunk_size, void *data);
> +
> +
> +#define CHUNK_NOT_FOUND (-2)
> +int pair_chunk(struct chunkfile *cf,
> + uint32_t chunk_id,
> + chunk_read_fn fn,
> + void *data);
From reading the implementation, I take it that this function calls fn
with the location and size of the requested chunk, along with the user
supplied data.
I'm not sure that "pair" gives me that same sense. Maybe "read" or
"lookup" would be better?
Dunno.
Thanks,
Taylor
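The suggestion earlier in this reply, folding `write_fn` and `start` into an anonymous union guarded by an enum, could look roughly like the sketch below. Everything here is an assumption for illustration: `demo_chunk_info`, `chunk_mode`, and `demo_write_fn` are hypothetical names, the write callback takes a plain `void *` instead of a `struct hashfile *`, and anonymous unions need C11 or a compiler extension.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for chunk_write_fn, which really takes a hashfile. */
typedef int (*demo_write_fn)(void *out, void *data);

enum chunk_mode {
	CHUNK_MODE_WRITE,
	CHUNK_MODE_READ
};

struct demo_chunk_info {
	uint32_t id;
	uint64_t size;
	enum chunk_mode mode;
	/* Anonymous union (C11): only the member matching 'mode' is valid. */
	union {
		demo_write_fn write_fn; /* valid when mode == CHUNK_MODE_WRITE */
		const void *start;      /* valid when mode == CHUNK_MODE_READ */
	};
};

/* Return the chunk's start pointer, or NULL if the entry is in write mode. */
static const void *chunk_start_or_null(const struct demo_chunk_info *ci)
{
	return ci->mode == CHUNK_MODE_READ ? ci->start : NULL;
}
```

The trade-off raised in the reply still applies: the enum makes the mode explicit, but every access has to check it, which is where the "more error prone" concern comes from.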
On the Git mailing list, Taylor Blau wrote (reply to this):
|
@@ -854,6 +854,7 @@ LIB_OBJS += bundle.o | |||
LIB_OBJS += cache-tree.o |
On the Git mailing list, Taylor Blau wrote (reply to this):
On Tue, Jan 26, 2021 at 04:01:11PM +0000, Derrick Stolee via GitGitGadget wrote:
> +/*
> + * When writing a chunk-based file format, collect the chunks in
> + * an array of chunk_info structs. The size stores the _expected_
> + * amount of data that will be written by write_fn.
> + */
> +struct chunk_info {
> + uint32_t id;
> + uint64_t size;
Hmm. Would we not want an off_t to indicate the size here?
I wondered briefly if we even needed a size field at all, since calling
write_fn would tell us the number of bytes written. But I suppose you
want to know ahead of time so that you can write the file in one pass
(beginning with the table of contents, which certainly needs to know the
size).
> + /* Trailing entry marks the end of the chunks */
> + hashwrite_be32(cf->f, 0);
> + hashwrite_be64(cf->f, cur_offset);
> +
> + for (i = 0; i < cf->chunks_nr; i++) {
> + uint64_t start_offset = cf->f->total + cf->f->offset;
> + int result = cf->chunks[i].write_fn(cf->f, data);
> +
> + if (result)
> + return result;
> +
> + if (cf->f->total + cf->f->offset != start_offset + cf->chunks[i].size)
I don't think this is a practical concern, but a malicious caller could
overflow this by passing a bogus "size" parameter. Maybe:
uint64_t end_offset = ...;
if (end_offset - start_offset != cf->chunks[i].size)
BUG(...)
?
> diff --git a/chunk-format.h b/chunk-format.h
> new file mode 100644
> index 00000000000..bfaed672813
> --- /dev/null
> +++ b/chunk-format.h
> @@ -0,0 +1,20 @@
> +#ifndef CHUNK_FORMAT_H
> +#define CHUNK_FORMAT_H
> +
> +#include "git-compat-util.h"
> +
> +struct hashfile;
> +struct chunkfile;
> +
> +struct chunkfile *init_chunkfile(struct hashfile *f);
> +void free_chunkfile(struct chunkfile *cf);
> +int get_num_chunks(struct chunkfile *cf);
> +typedef int (*chunk_write_fn)(struct hashfile *f,
> + void *data);
> +void add_chunk(struct chunkfile *cf,
> + uint64_t id,
> + chunk_write_fn fn,
> + size_t size);
> +int write_chunkfile(struct chunkfile *cf, void *data);
Very clean API.
Thanks,
Taylor
On the Git mailing list, Derrick Stolee wrote (reply to this):
On 1/26/2021 9:42 PM, Taylor Blau wrote:
> On Tue, Jan 26, 2021 at 04:01:11PM +0000, Derrick Stolee via GitGitGadget wrote:
>> +/*
>> + * When writing a chunk-based file format, collect the chunks in
>> + * an array of chunk_info structs. The size stores the _expected_
>> + * amount of data that will be written by write_fn.
>> + */
>> +struct chunk_info {
>> + uint32_t id;
>> + uint64_t size;
>
> Hmm. Would we not want an off_t to indicate the size here?
>
> I wondered briefly if we even needed a size field at all, since calling
> write_fn would tell us the number of bytes written. But I suppose you
> want to know ahead of time so that you can write the file in one pass
> (beginning with the table of contents, which certainly needs to know the
> size).
Is off_t 64-bits on a 32-bit machine? This is intentionally typed
to be "64 bits no matter what" because it correlates with the file
format's size for the chunk offsets.
>> + if (cf->f->total + cf->f->offset != start_offset + cf->chunks[i].size)
>
> I don't think this is a practical concern, but a malicious caller could
> overflow this by passing a bogus "size" parameter. Maybe:
>
> uint64_t end_offset = ...;
>
> if (end_offset - start_offset != cf->chunks[i].size)
> BUG(...)
Sure.
Thanks,
-Stolee
@@ -1040,8 +1040,10 @@ struct write_commit_graph_context { | |||
}; |
On the Git mailing list, Chris Torek wrote (reply to this):
Note: this is purely style, and minor, but I'll ask...
On Tue, Jan 26, 2021 at 8:08 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
> static int write_graph_chunk_fanout(struct hashfile *f,
> - struct write_commit_graph_context *ctx)
> + void *data)
> {
> + struct write_commit_graph_context *ctx =
> + (struct write_commit_graph_context *)data;
Why bother with the cast on the last line here? In C,
conversion from `void *` to `struct whatever *` is fine.
(the change itself looks fine, btw)
Chris
@@ -19,6 +19,7 @@ | |||
#include "shallow.h" |
On the Git mailing list, Taylor Blau wrote (reply to this):
On Tue, Jan 26, 2021 at 04:01:12PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The commit-graph write logic is ready to make use of the chunk-format
> write API. Each chunk write method is already in the correct prototype.
> We only need to use the 'struct chunkfile' pointer and the correct API
> calls.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Nicely done. The majority of this patch was remarkably easy to read,
which I attribute to you doing the necessary prep work to make the
callbacks usable by the new API. Thank you.
> @@ -1941,6 +1896,7 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
>
> close_commit_graph(ctx->r->objects);
> finalize_hashfile(f, file_hash.hash, CSUM_HASH_IN_STREAM | CSUM_FSYNC);
> + free_chunkfile(cf);
Since chunkfiles are so tightly coupled to hashfiles (i.e., you can only
"construct" a chunkfile given a 'struct hashfile*'), I wonder whether
this should be:
finalize_chunkfile(cf, ...)
instead. It seems kind of weird to give up ownership of 'f' down to the
chunkfile API only to reach down into it again.
I could even buy that you'd always want to finalize and free a chunkfile
at the same time, and so perhaps the calls could be combined, but that
may be a step too far.
Thanks,
Taylor
@@ -11,6 +11,7 @@ | |||
#include "trace2.h" |
On the Git mailing list, Taylor Blau wrote (reply to this):
On Tue, Jan 26, 2021 at 04:01:23PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Instead of parsing the table of contents directly, use the chunk-format
> API methods read_table_of_contents() and pair_chunk(). In particular, we
> can use the return value of pair_chunk() to generate an error when a
> required chunk is missing.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> midx.c | 103 ++++++++++++++++++++----------------
> t/t5319-multi-pack-index.sh | 6 +--
> 2 files changed, 60 insertions(+), 49 deletions(-)
>
> diff --git a/midx.c b/midx.c
> index 0bfd2d802b6..dd019c00795 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -54,6 +54,51 @@ static char *get_midx_filename(const char *object_dir)
> return xstrfmt("%s/pack/multi-pack-index", object_dir);
> }
>
> +static int midx_read_pack_names(const unsigned char *chunk_start,
> + size_t chunk_size, void *data)
> +{
> + struct multi_pack_index *m = (struct multi_pack_index *)data;
> + m->chunk_pack_names = chunk_start;
> + return 0;
> +}
There are a lot of these callbacks that just assign some 'void **' to
point at chunk_start.
Maybe a good use of the "pair_chunk" name would be something like:
int pair_chunk(struct chunkfile *cf, uint32_t id, const unsigned char **p);
which does the same as what you wrote here. So instead of what you
wrote, you could instead:
pair_chunk(cf, MIDX_CHUNKID_PACKNAMES, &m->chunk_pack_names);
This would be in addition to the richer callback-style function which
allows the caller greater flexibility (e.g., for the Bloom filter
related readers in the commit-graph code).
Thanks,
Taylor
On the Git mailing list, Derrick Stolee wrote (reply to this):
On 1/26/2021 10:06 PM, Taylor Blau wrote:
> On Tue, Jan 26, 2021 at 04:01:23PM +0000, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> Instead of parsing the table of contents directly, use the chunk-format
>> API methods read_table_of_contents() and pair_chunk(). In particular, we
>> can use the return value of pair_chunk() to generate an error when a
>> required chunk is missing.
>>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>> midx.c | 103 ++++++++++++++++++++----------------
>> t/t5319-multi-pack-index.sh | 6 +--
>> 2 files changed, 60 insertions(+), 49 deletions(-)
>>
>> diff --git a/midx.c b/midx.c
>> index 0bfd2d802b6..dd019c00795 100644
>> --- a/midx.c
>> +++ b/midx.c
>> @@ -54,6 +54,51 @@ static char *get_midx_filename(const char *object_dir)
>> return xstrfmt("%s/pack/multi-pack-index", object_dir);
>> }
>>
>> +static int midx_read_pack_names(const unsigned char *chunk_start,
>> + size_t chunk_size, void *data)
>> +{
>> + struct multi_pack_index *m = (struct multi_pack_index *)data;
>> + m->chunk_pack_names = chunk_start;
>> + return 0;
>> +}
>
> There are a lot of these callbacks that just assign some 'void **' to
> point at chunk_start.
>
> Maybe a good use of the "pair_chunk" name would be something like:
>
> int pair_chunk(struct chunkfile *cf, uint32_t id, const unsigned char **p);
>
> which does the same as what you wrote here. So instead of what you
> wrote, you could instead:
>
> pair_chunk(cf, MIDX_CHUNKID_PACKNAMES, &m->chunk_pack_names);
>
> This would be in addition to the richer callback-style function which
> allows the caller greater flexibility (e.g., for the Bloom filter
> related readers in the commit-graph code).
You're right that _most_ callers just want to assign a pointer,
so this mechanism would be better. I'll make a different function,
read_chunk() perhaps, that relies on a callback for advanced users.
Thanks,
-Stolee
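The split discussed in this exchange, a pointer-assigning `pair_chunk()` next to a callback-based `read_chunk()`, can be sketched as follows. This is a minimal illustration under assumptions: the `demo_` structs are simplified stand-ins for the private types in chunk-format.c, and building the pointer variant on top of the callback variant is a choice made here for brevity, not necessarily how the series implements it.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_CHUNK_NOT_FOUND (-2)

/* Simplified, hypothetical stand-ins for the structs in chunk-format.c. */
struct demo_chunk {
	uint32_t id;
	const unsigned char *start;
	size_t size;
};

struct demo_chunkfile {
	const struct demo_chunk *chunks;
	size_t chunks_nr;
};

typedef int (*demo_read_fn)(const unsigned char *chunk_start,
			    size_t chunk_size, void *data);

/* Callback style: hand the chunk's start and size to 'fn'. */
static int demo_read_chunk(const struct demo_chunkfile *cf, uint32_t id,
			   demo_read_fn fn, void *data)
{
	size_t i;

	for (i = 0; i < cf->chunks_nr; i++) {
		if (cf->chunks[i].id == id)
			return fn(cf->chunks[i].start, cf->chunks[i].size, data);
	}
	return DEMO_CHUNK_NOT_FOUND;
}

/* Helper callback that stores the chunk start into the pointer in 'data'. */
static int demo_pair_fn(const unsigned char *chunk_start,
			size_t chunk_size, void *data)
{
	const unsigned char **p = data;

	(void)chunk_size; /* the size is intentionally dropped */
	*p = chunk_start;
	return 0;
}

/* Pointer style: '*p' is left untouched when the chunk is missing. */
static int demo_pair_chunk(const struct demo_chunkfile *cf, uint32_t id,
			   const unsigned char **p)
{
	return demo_read_chunk(cf, id, demo_pair_fn, p);
}
```

A caller that only needs the location would call `demo_pair_chunk(cf, id, &ptr)` and check the return value against `DEMO_CHUNK_NOT_FOUND`; a caller that needs the size uses the callback form.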
On the Git mailing list, Junio C Hamano wrote (reply to this):
|
In preparation for creating an API around file formats using chunks and tables of contents, prepare the commit-graph write code to use prototypes that will match this new API. Specifically, convert chunk_write_fn to take a "void *data" parameter instead of the commit-graph-specific "struct write_commit_graph_context" pointer. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
05cbd0a
to
8f3985a
Compare
/submit |
Submitted as pull.848.v2.git.1611759716.gitgitgadget@gmail.com To fetch this version into
To fetch this version to local tag
|
On the Git mailing list, Taylor Blau wrote (reply to this):
|
This patch series was integrated into seen via git@867650a. |
This patch series was integrated into seen via git@489277d. |
This patch series was integrated into seen via git@2fec6f9. |
This patch series was integrated into seen via git@d6bd16c. |
This patch series was integrated into seen via git@a4a8003. |
In anticipation of combining the logic from the commit-graph and multi-pack-index file formats, create a new chunk-format API. Use a 'struct chunkfile' pointer to keep track of data that has been registered for writes. This struct is anonymous outside of chunk-format.c to ensure no user attempts to interfere with the data.

The next change will use this API in commit-graph.c, but the general approach is:

1. initialize the chunkfile with init_chunkfile(f).
2. add chunks in the intended writing order with add_chunk().
3. write any header information to the hashfile f.
4. write the chunkfile data using write_chunkfile().
5. free the chunkfile struct using free_chunkfile().

Helped-by: Taylor Blau <me@ttaylorr.com>
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
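The five write steps in this commit message can be sketched with a toy in-memory model. This is not the code from chunk-format.c: `demo_out` replaces `struct hashfile` with a fixed buffer that has no bounds checks, all `demo_` names are hypothetical, and the layout assumed here (12-byte table-of-contents entries holding a 4-byte ID and an 8-byte file offset, closed by a zero-ID entry) follows the trailing-entry code quoted in the review above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_CHUNKS 8
#define DEMO_TOC_ENTRY_SIZE 12 /* 4-byte id + 8-byte offset */

/* Hypothetical in-memory output standing in for 'struct hashfile'. */
struct demo_out {
	unsigned char buf[256]; /* no bounds checks: keep inputs small */
	size_t len;
};

typedef int (*demo_write_fn)(struct demo_out *out, void *data);

struct demo_chunkfile {
	struct demo_out *out;
	struct {
		uint32_t id;
		uint64_t size;
		demo_write_fn fn;
	} chunks[DEMO_MAX_CHUNKS];
	size_t chunks_nr;
};

static void demo_write(struct demo_out *out, const void *p, size_t n)
{
	memcpy(out->buf + out->len, p, n);
	out->len += n;
}

/* Write the low 'bytes' bytes of 'v' in big-endian order. */
static void demo_write_be(struct demo_out *out, uint64_t v, int bytes)
{
	while (bytes--) {
		unsigned char c = (unsigned char)(v >> (8 * bytes));
		demo_write(out, &c, 1);
	}
}

/* Step 1: start with no registered chunks. */
static void demo_init_chunkfile(struct demo_chunkfile *cf, struct demo_out *out)
{
	memset(cf, 0, sizeof(*cf));
	cf->out = out;
}

/* Step 2: register a chunk and the size its writer is expected to emit. */
static void demo_add_chunk(struct demo_chunkfile *cf, uint32_t id,
			   demo_write_fn fn, uint64_t size)
{
	cf->chunks[cf->chunks_nr].id = id;
	cf->chunks[cf->chunks_nr].size = size;
	cf->chunks[cf->chunks_nr].fn = fn;
	cf->chunks_nr++;
}

/* Step 4: write the table of contents, then each chunk in order. */
static int demo_write_chunkfile(struct demo_chunkfile *cf, void *data)
{
	size_t i;
	uint64_t cur = cf->out->len +
		       (cf->chunks_nr + 1) * DEMO_TOC_ENTRY_SIZE;

	for (i = 0; i < cf->chunks_nr; i++) {
		demo_write_be(cf->out, cf->chunks[i].id, 4);
		demo_write_be(cf->out, cur, 8);
		cur += cf->chunks[i].size;
	}
	/* Trailing entry marks the end of the chunks. */
	demo_write_be(cf->out, 0, 4);
	demo_write_be(cf->out, cur, 8);

	for (i = 0; i < cf->chunks_nr; i++) {
		size_t start = cf->out->len;
		int result = cf->chunks[i].fn(cf->out, data);

		if (result)
			return result;
		if (cf->out->len - start != cf->chunks[i].size)
			return -1; /* writer emitted an unexpected size */
	}
	return 0;
}

/* Example chunk writer: emits the string passed as 'data'. */
static int demo_write_string(struct demo_out *out, void *data)
{
	const char *s = data;

	demo_write(out, s, strlen(s));
	return 0;
}
```

Step 3 (the header) is a plain `demo_write()` call before `demo_write_chunkfile()`, and step 5 has no counterpart because the sketch keeps the struct in caller-owned storage.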
The commit-graph write logic is ready to make use of the chunk-format write API. Each chunk write method is already in the correct prototype. We only need to use the 'struct chunkfile' pointer and the correct API calls. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
In an effort to streamline our chunk-based file formats, align some of the code structure in write_midx_internal() to be similar to the patterns in write_commit_graph_file(). Specifically, let's create a "struct write_midx_context" that can be used as a data parameter to abstract function types. This change only renames "struct pack_info" to "struct write_midx_context" and the names of instances from "packs" to "ctx". In future changes, we will expand the data inside "struct write_midx_context" and align our chunk-writing method with the chunk-format API. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
In an effort to align the write_midx_internal() to use the chunk-format API, start converting chunk writing methods to match chunk_write_fn. The first case is to convert write_midx_pack_names() to take "void *data". We already have the necessary data in "struct write_midx_context", so this conversion is rather mechanical. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
In an effort to align write_midx_internal() with the chunk-format API, continue to group necessary data into "struct write_midx_context". This change collects the "struct pack_midx_entry *entries" list and its count into the context. Update write_midx_oid_fanout() and write_midx_oid_lookup() to take the context directly, as these are easy conversions with this new data. Only the callers of write_midx_object_offsets() and write_midx_large_offsets() are updated here, since additional data in the context before those methods can match chunk_write_fn. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
In an effort to align write_midx_internal() with the chunk-format API, continue to group necessary data into "struct write_midx_context". This change collects the "uint32_t *pack_perm" and large_offsets_needed bit into the context. Update write_midx_object_offsets() to match chunk_write_fn. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
In an effort to align write_midx_internal() with the chunk-format API, continue to group necessary data into "struct write_midx_context". This change collects the "uint32_t num_large_offsets" into the context. With this new data, write_midx_large_offsets() now matches the chunk_write_fn type. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Historically, the chunk-writing methods in midx.c have returned the amount of data written so the writer method could compare this with the table of contents. This presents with some interesting issues:

1. If a chunk writing method has a bug that miscalculates the written bytes, then we can satisfy the table of contents without actually writing the right amount of data to the hashfile. The commit-graph writing code checks the hashfile struct directly for a more robust verification.
2. There is no way for a chunk writing method to gracefully fail. Returning an int presents an opportunity to fail without a die().
3. The current pattern doesn't match chunk_write_fn type exactly, so we cannot share code with commit-graph.c

For these reasons, convert the midx chunk writer methods to return an 'int'. Since none of them fail at the moment, they all return 0.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Most expensive operations in write_midx_internal() use the context struct's progress member, and these indicate the process of the expensive operations within the chunk writing methods. However, there is a competing progress struct that counts the progress over all chunks. This is not very helpful compared to the others, so drop it. This also reduces our barriers to combining the chunk writing code with chunk-format.c. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The chunk-format API allows writing the table of contents and all chunks using the anonymous 'struct chunkfile' type. We only need to convert our local chunk logic to this API for the multi-pack-index writes to share that logic with the commit-graph file writes. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
a68f2c5
to
62a9565
Compare
Add the capability to read the table of contents, then pair the chunks with necessary logic using read_chunk_fn pointers. Callers will be added in future changes, but the typical outline will be:

1. initialize a 'struct chunkfile' with init_chunkfile(NULL).
2. call read_table_of_contents().
3. for each chunk to parse,
   a. call pair_chunk() to assign a pointer with the chunk position, or
   b. call read_chunk() to run a callback on the chunk start and size.
4. call free_chunkfile() to clear the 'struct chunkfile' data.

We are re-using the anonymous 'struct chunkfile' data, as it is internal to the chunk-format API. This gives it essentially two modes: write and read. If the same struct instance was used for both reads and writes, then there would be failures.

Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
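Step 2 of the read outline, parsing the table of contents out of a memory-mapped buffer, might look like the following sketch. It is not the implementation from the series: the `demo_` names, the fixed-size chunk array, and the specific validation checks are assumptions, and the entry layout (4-byte ID plus 8-byte file offset, with a zero-ID terminating entry whose offset marks the end of the last chunk) is inferred from the write side quoted in the review.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_CHUNKS 8
#define DEMO_TOC_ENTRY_SIZE 12 /* 4-byte id + 8-byte offset */

struct demo_chunk {
	uint32_t id;
	const unsigned char *start;
	size_t size;
};

struct demo_chunkfile {
	struct demo_chunk chunks[DEMO_MAX_CHUNKS];
	size_t chunks_nr;
};

/* Decode a big-endian integer of 'bytes' bytes. */
static uint64_t demo_get_be(const unsigned char *p, int bytes)
{
	uint64_t v = 0;

	while (bytes--)
		v = (v << 8) | *p++;
	return v;
}

/*
 * Parse 'toc_length' entries plus the terminating entry starting at
 * 'toc_offset'. Returns 0 on success, -1 on malformed input.
 */
static int demo_read_toc(struct demo_chunkfile *cf,
			 const unsigned char *data, size_t size,
			 uint64_t toc_offset, int toc_length)
{
	const unsigned char *p;
	int i;

	if (toc_length < 0 || toc_length > DEMO_MAX_CHUNKS)
		return -1;
	if (toc_offset > size ||
	    (size - toc_offset) / DEMO_TOC_ENTRY_SIZE < (uint64_t)toc_length + 1)
		return -1;

	p = data + toc_offset;
	cf->chunks_nr = 0;
	for (i = 0; i < toc_length; i++, p += DEMO_TOC_ENTRY_SIZE) {
		uint32_t id = (uint32_t)demo_get_be(p, 4);
		uint64_t start = demo_get_be(p + 4, 8);
		/* A chunk ends where the next entry's offset begins. */
		uint64_t end = demo_get_be(p + DEMO_TOC_ENTRY_SIZE + 4, 8);

		if (!id || start > end || end > size)
			return -1;
		cf->chunks[cf->chunks_nr].id = id;
		cf->chunks[cf->chunks_nr].start = data + start;
		cf->chunks[cf->chunks_nr].size = (size_t)(end - start);
		cf->chunks_nr++;
	}
	/* The terminating entry must carry a zero id. */
	return demo_get_be(p, 4) ? -1 : 0;
}
```

The pointers stored in `cf->chunks` point into the caller's buffer, which matches the ownership model described in the thread: the caller maps the file and keeps it mapped for as long as the chunk pointers are in use.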
Instead of parsing the table of contents directly, use the chunk-format API methods read_table_of_contents() and pair_chunk(). While the current implementation loses the duplicate-chunk detection, that will be added in a future change. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Instead of parsing the table of contents directly, use the chunk-format API methods read_table_of_contents() and pair_chunk(). In particular, we can use the return value of pair_chunk() to generate an error when a required chunk is missing. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
When calculating the sizes of certain chunks, we should use 64-bit multiplication always. This allows us to properly predict the chunk sizes without risk of overflow. Other possible overflows were discovered by evaluating each multiplication in midx.c and ensuring that at least one side of the operator was of type size_t or off_t. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
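The overflow risk this commit guards against can be shown in isolation. The sketch below is illustrative only: `DEMO_ENTRY_WIDTH` and both function names are hypothetical, and the real multiplications in midx.c involve different operands.

```c
#include <stddef.h>
#include <stdint.h>

/* Width of one entry in bytes; a hypothetical constant for this sketch. */
#define DEMO_ENTRY_WIDTH 8

/* Risky: the product is reduced to 32 bits before it is widened. */
static uint64_t chunk_size_32bit(uint32_t nr_entries)
{
	uint32_t product = nr_entries * (uint32_t)DEMO_ENTRY_WIDTH;

	return product;
}

/* Safer: widen one operand first so the multiplication is 64-bit. */
static uint64_t chunk_size_64bit(uint32_t nr_entries)
{
	return (uint64_t)nr_entries * DEMO_ENTRY_WIDTH;
}
```

With `0x20000000` entries the 32-bit version yields 0 while the 64-bit version yields 4294967296, which is why at least one operand needs a 64-bit type before the multiplication happens.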
Before refactoring into the chunk-format API, the commit-graph parsing logic included checks for duplicate chunks. It is unlikely that we would desire a chunk-based file format that allows duplicate chunk IDs in the table of contents, so add duplicate checks into read_table_of_contents(). Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
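A duplicate check of the kind described here can be as simple as a linear scan over the IDs seen so far, since a table of contents holds only a handful of entries. The helper names below are hypothetical and the sketch works on a bare array of IDs, not on the private `struct chunkfile`.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if 'id' already appears among the first 'nr' recorded ids. */
static int demo_has_chunk_id(const uint32_t *ids, size_t nr, uint32_t id)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (ids[i] == id)
			return 1;
	}
	return 0;
}

/*
 * Copy ids from the table of contents into 'seen' in order.
 * Returns -1 as soon as an id repeats, 0 otherwise.
 */
static int demo_collect_ids(const uint32_t *toc_ids, size_t toc_nr,
			    uint32_t *seen)
{
	size_t i;

	for (i = 0; i < toc_nr; i++) {
		if (demo_has_chunk_id(seen, i, toc_ids[i]))
			return -1;
		seen[i] = toc_ids[i];
	}
	return 0;
}
```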
The chunk-based file format is now an API in the code, but we should also take time to document it as a file format. Specifically, it matches the CHUNK LOOKUP sections of the commit-graph and multi-pack-index files, but there are some commonalities that should be grouped in this document. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
62a9565
to
84bf650
Compare
/submit |
Submitted as pull.848.v4.git.1613657259.gitgitgadget@gmail.com To fetch this version into
To fetch this version to local tag
|
@@ -0,0 +1,116 @@ | |||
Chunk-based file formats |
On the Git mailing list, Junio C Hamano wrote (reply to this):
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> +Chunk-based file formats
> +========================
> +
> +Some file formats in Git use a common concept of "chunks" to describe
> +sections of the file. This allows structured access to a large file by
> +scanning a small "table of contents" for the remaining data. This common
> +format is used by the `commit-graph` and `multi-pack-index` files. See
> +link:technical/pack-format.html[the `multi-pack-index` format] and
> +link:technical/commit-graph-format.html[the `commit-graph` format] for
> +how they use the chunks to describe structured data.
I've read the doc added here to the end; well written and easy to
understand.
I wonder how/if well reftable files fit in the scheme, or if it
doesn't, should the chunk file format API be updated to accommodate
it (or the other way around)?
> +Extract the data information for each chunk using `pair_chunk()` or
> +`read_chunk()`:
> +
> +* `pair_chunk()` assigns a given pointer with the location inside the
> + memory-mapped file corresponding to that chunk's offset. If the chunk
> + does not exist, then the pointer is not modified.
I think it is worth adding:
The caller is expected to know where the returned chunk ends by
some out-of-band means, as this function only gives the offset
but not the size, unlike the read_chunk() function.
> +* `read_chunk()` takes a `chunk_read_fn` function pointer and calls it
> + with the appropriate initial pointer and size information. The function
> + is not called if the chunk does not exist. Use this method to read chunks
> + if you need to perform immediate parsing or if you need to execute logic
> + based on the size of the chunk.
> +
> +After calling these methods, call `free_chunkfile()` to clear the
> +`struct chunkfile` data. This will not close the memory-mapped region.
> +Callers are expected to own that data for the timeframe the pointers into
> +the region are needed.
Thanks.
On the Git mailing list, Derrick Stolee wrote (reply to this):
On 2/18/2021 4:47 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> +Chunk-based file formats
>> +========================
>> +
>> +Some file formats in Git use a common concept of "chunks" to describe
>> +sections of the file. This allows structured access to a large file by
>> +scanning a small "table of contents" for the remaining data. This common
>> +format is used by the `commit-graph` and `multi-pack-index` files. See
>> +link:technical/pack-format.html[the `multi-pack-index` format] and
>> +link:technical/commit-graph-format.html[the `commit-graph` format] for
>> +how they use the chunks to describe structured data.
>
> I've read the doc added here to the end; well written and easy to
> understand.
>
> I wonder how/if well reftable files fit in the scheme, or if it
> doesn't, should the chunk file format API be updated to accommodate
> it (or the other way around)?
I'm not sure that reftable can work with this format, especially with
its design to do most updates as append-only (IIUC). And to change the
format to work with the chunk format would violate the compatibility
with the JGit version. I would be interested if something like the
packed-refs file could use a minor update, but only if there is a
realistic benefit to using chunks over the current format.
The files that are on my radar for adopting a new file format using the
chunk-format API are:
* reachability bitmaps: using a similar approach to the commit-graph,
we could avoid parsing the entire file before checking if a specific
commit has a bitmap. (Requires a commit lookup chunk, a bitmap data
chunk, and an offset chunk to connect them.)
* index v5: I'm trying to collect a bunch of information about how to
update the index for better compression, and the chunk-based approach
can provide some fixed-width columns that can vary in length depending
on the required data (presenting the interesting behavior from v2 and v3,
along with possible approaches previously presented as a potential v5).
The paths could be presented as a chunk, giving the interesting options
between v2/3 and v4 (prefix compression). I haven't even started the
actual work here, but I've been thinking about it a lot. I'll have time
next month to start prototyping.
Are there other interesting files that could use a new version here?
What other pain points are known to experts in the area?
>> +Extract the data information for each chunk using `pair_chunk()` or
>> +`read_chunk()`:
>> +
>> +* `pair_chunk()` assigns a given pointer with the location inside the
>> + memory-mapped file corresponding to that chunk's offset. If the chunk
>> + does not exist, then the pointer is not modified.
>
> I think it is worth adding:
>
> The caller is expected to know where the returned chunk ends by
> some out-of-band means, as this function only gives the offset
> but not the size, unlike the read_chunk() function.
True. I suppose that could be more explicit, although it can be gleaned
from the omission of any size information.
Thanks,
-Stolee
This patch series was integrated into seen via git@8a7cd50. |
This patch series was integrated into seen via git@3d78343. |
This patch series was integrated into next via git@5f0e28c. |
@@ -19,6 +19,7 @@ | |||
#include "shallow.h" |
On the Git mailing list, SZEDER Gábor wrote (reply to this):
On Thu, Feb 18, 2021 at 02:07:25PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The commit-graph write logic is ready to make use of the chunk-format
> write API. Each chunk write method is already in the correct prototype.
> We only need to use the 'struct chunkfile' pointer and the correct API
> calls.
This patch series messes up the "Writing out commit graph" progress
display, and starting at this commit I get:
$ git commit-graph write --reachable
Expanding reachable commits in commit graph: 837569, done.
Writing out commit graph in 3 passes: 166% (4187845/2512707), done.
Note that 166%.
Before this commit I got:
Expanding reachable commits in commit graph: 837569, done.
Writing out commit graph in 5 passes: 100% (4187845/4187845), done.
Note the different number of passes.
On the Git mailing list, Taylor Blau wrote (reply to this):
On Wed, Feb 24, 2021 at 05:52:42PM +0100, SZEDER Gábor wrote:
> On Thu, Feb 18, 2021 at 02:07:25PM +0000, Derrick Stolee via GitGitGadget wrote:
> > From: Derrick Stolee <dstolee@microsoft.com>
> >
> > The commit-graph write logic is ready to make use of the chunk-format
> > write API. Each chunk write method is already in the correct prototype.
> > We only need to use the 'struct chunkfile' pointer and the correct API
> > calls.
>
> This patch series messes up the "Writing out commit graph" progress
> display, and starting at this commit I get:
I can confirm. It looks like we never dropped the 'num_chunks' variable,
which should have happened in this patch.
Here's something to apply on top which fixes the issue. Thanks for
reporting.
--- >8 ---
Subject: [PATCH] commit-graph.c: display correct number of chunks when writing
When writing a commit-graph, a progress meter is shown which indicates
the number of pieces of data to write (one per commit in each chunk).
In 47410aa837 (commit-graph: use chunk-format write API, 2021-02-18),
the number of chunks became tracked by the new chunk-format API. But a
stray local variable was left behind from when write_commit_graph_file()
used to keep track of the same.
Since this was no longer updated after 47410aa837, the progress meter
appeared broken:
$ git commit-graph write --reachable
Expanding reachable commits in commit graph: 837569, done.
Writing out commit graph in 3 passes: 166% (4187845/2512707), done.
Drop the local variable and rely instead on the chunk-format API to tell
us the correct number of chunks.
Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
commit-graph.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/commit-graph.c b/commit-graph.c
index 78b993c367..6aa0c488f5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1791,7 +1791,6 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
struct lock_file lk = LOCK_INIT;
const unsigned hashsz = the_hash_algo->rawsz;
struct strbuf progress_title = STRBUF_INIT;
- int num_chunks = 3;
struct object_id file_hash;
struct chunkfile *cf;
@@ -1887,11 +1886,11 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
strbuf_addf(&progress_title,
Q_("Writing out commit graph in %d pass",
"Writing out commit graph in %d passes",
- num_chunks),
- num_chunks);
+ get_num_chunks(cf)),
+ get_num_chunks(cf));
ctx->progress = start_delayed_progress(
progress_title.buf,
- num_chunks * ctx->commits.nr);
+ get_num_chunks(cf) * ctx->commits.nr);
}
write_chunkfile(cf, ctx);
--
2.30.0.667.g81c0cbc6fd
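The 166% in the report follows directly from the numbers quoted above: the meter's total was computed from the stale count of 3 chunks, while 5 chunks' worth of per-commit work was actually reported. A small sketch of that arithmetic, using hypothetical helper names and assuming the percentage is truncated to an integer as the displayed value suggests:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: total work units the progress meter expects. */
static uint64_t toy_progress_total(int num_chunks, uint64_t commits_nr)
{
	assert(num_chunks >= 0);
	return (uint64_t)num_chunks * commits_nr;
}

/* Integer percentage, truncated toward zero (an assumption of this sketch). */
static unsigned toy_progress_percent(uint64_t done, uint64_t total)
{
	assert(total > 0);
	return (unsigned)(done * 100 / total);
}
```

With 837569 commits, `toy_progress_total(3, 837569)` is 2512707 and `toy_progress_total(5, 837569)` is 4187845. Dividing the work actually done (4187845) by the stale total gives 166, and by the correct total gives 100.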
On the Git mailing list, Derrick Stolee wrote (reply to this):
On 2/24/2021 12:12 PM, Taylor Blau wrote:
> On Wed, Feb 24, 2021 at 05:52:42PM +0100, SZEDER Gábor wrote:
>> On Thu, Feb 18, 2021 at 02:07:25PM +0000, Derrick Stolee via GitGitGadget wrote:
>>> From: Derrick Stolee <dstolee@microsoft.com>
>>>
>>> The commit-graph write logic is ready to make use of the chunk-format
>>> write API. Each chunk write method is already in the correct prototype.
>>> We only need to use the 'struct chunkfile' pointer and the correct API
>>> calls.
>>
>> This patch series messes up the "Writing out commit graph" progress
>> display, and starting at this commit I get:
Thanks for the report and identifying the exact place that caused the
mistake.
> I can confirm. It looks like we never dropped the 'num_chunks' variable,
> which should have happened in this patch.
Yes, makes sense. Hard to see that 'num_chunks' wasn't used because it
_was_ being used, just not as intended.
> @@ -1887,11 +1886,11 @@ static int write_commit_graph_file(struct write_commit_graph_context *ctx)
> strbuf_addf(&progress_title,
> Q_("Writing out commit graph in %d pass",
> "Writing out commit graph in %d passes",
> - num_chunks),
> - num_chunks);
> + get_num_chunks(cf)),
> + get_num_chunks(cf));
> ctx->progress = start_delayed_progress(
> progress_title.buf,
> - num_chunks * ctx->commits.nr);
> + get_num_chunks(cf) * ctx->commits.nr);
This is obviously correct. Thanks for the quick patch!
-Stolee
On the Git mailing list, Junio C Hamano wrote (reply to this):
Derrick Stolee <stolee@gmail.com> writes:
>>> This patch series messes up the "Writing out commit graph" progress
>>> display, and starting at this commit I get:
>
> Thanks for the report and identifying the exact place that caused the
> mistake.
>
>> I can confirm. It looks like we never dropped the 'num_chunks' variable,
>> which should have happened in this patch.
>
> Yes, makes sense. Hard to see that 'num_chunks' wasn't used because it
> _was_ being used, just not as intended.
> ...
>
> This is obviously correct. Thanks for the quick patch!
Thanks all for noticing and fixing before the series hit the master
branch.
This is a restart on the topic previously submitted [1] but dropped because `ak/corrected-commit-date` was still in progress. This version is based on that branch.

[1] https://lore.kernel.org/git/pull.804.git.1607012215.gitgitgadget@gmail.com/

This version also changes the approach to use a more dynamic interaction with a `struct chunkfile` pointer. This idea is credited to Taylor Blau [2], but I started again from scratch. I also go further to make `struct chunkfile` anonymous to API consumers. It is defined only in `chunk-format.c`, which should hopefully deter future users from interacting with that data directly.

[2] https://lore.kernel.org/git/X8%2FI%2FRzXZksio+ri@nand.local/
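Keeping a struct "anonymous to API consumers" is the usual C opaque-type pattern: the header only forward-declares the struct, and the fields live in one `.c` file. A minimal single-file sketch, with hypothetical `toy_*` names that are not taken from the series:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* ---- What a header would expose (hypothetical toy_* names) ---- */
struct toy_chunkfile; /* incomplete type: callers cannot see any fields */
struct toy_chunkfile *toy_init(void);
void toy_add_chunk(struct toy_chunkfile *cf, uint32_t id, size_t size);
int toy_num_chunks(const struct toy_chunkfile *cf);
void toy_free(struct toy_chunkfile *cf);

/* ---- What only the implementation file would contain ---- */
struct toy_chunk_info {
	uint32_t id;
	size_t size;
};

struct toy_chunkfile {
	struct toy_chunk_info *chunks;
	size_t nr, alloc;
};

struct toy_chunkfile *toy_init(void)
{
	struct toy_chunkfile *cf = calloc(1, sizeof(*cf));

	if (!cf)
		abort();
	return cf;
}

void toy_add_chunk(struct toy_chunkfile *cf, uint32_t id, size_t size)
{
	assert(cf);
	if (cf->nr == cf->alloc) {
		size_t n = cf->alloc ? 2 * cf->alloc : 4;
		struct toy_chunk_info *grown = realloc(cf->chunks, n * sizeof(*grown));

		if (!grown)
			abort();
		cf->chunks = grown;
		cf->alloc = n;
	}
	cf->chunks[cf->nr].id = id;
	cf->chunks[cf->nr].size = size;
	cf->nr++;
}

/* Accessor: the only way for a caller to learn the chunk count. */
int toy_num_chunks(const struct toy_chunkfile *cf)
{
	assert(cf);
	return (int)cf->nr;
}

void toy_free(struct toy_chunkfile *cf)
{
	if (!cf)
		return;
	free(cf->chunks);
	free(cf);
}
```

Because callers can only go through accessors, there is a single source of truth for values like the chunk count, which is the role `get_num_chunks(cf)` plays in the progress-meter fix earlier in this thread.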
This combined API is beneficial to reduce duplicated logic. Or rather, to ensure that similar file formats have similar protections against bad data. The multi-pack-index code did not have as many guards as the commit-graph code did, but now they both share a common base that checks for things like duplicate chunks or offsets outside the size of the file.
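As an illustration of the kind of shared guard mentioned here, the sketch below validates a toy table of contents for duplicate chunk IDs and for offsets that fall outside the file. The struct layout, names, and return codes are assumptions made for the example, not the checks as implemented in the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table-of-contents entry: a chunk ID and its file offset. */
struct toy_toc_entry {
	uint32_t id;
	uint64_t offset;
};

enum toy_toc_status {
	TOY_TOC_OK = 0,
	TOY_TOC_OUT_OF_RANGE, /* offset lies beyond the end of the file */
	TOY_TOC_OUT_OF_ORDER, /* offsets must not decrease */
	TOY_TOC_DUPLICATE     /* the same chunk ID appears twice */
};

/* Validate entries given in file order against the total file size. */
static enum toy_toc_status toy_validate_toc(const struct toy_toc_entry *toc,
					    size_t nr, uint64_t file_size)
{
	size_t i, j;

	assert(toc || !nr);
	for (i = 0; i < nr; i++) {
		if (toc[i].offset > file_size)
			return TOY_TOC_OUT_OF_RANGE;
		if (i && toc[i].offset < toc[i - 1].offset)
			return TOY_TOC_OUT_OF_ORDER;
		for (j = 0; j < i; j++)
			if (toc[j].id == toc[i].id)
				return TOY_TOC_DUPLICATE;
	}
	return TOY_TOC_OK;
}
```

The quadratic duplicate scan is acceptable for a sketch because such files hold only a handful of chunks. Putting the checks in one place means every format built on the shared reader gets them without repeating the logic.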
Here are some stats for the end-to-end change:
While there is an overall increase to the code size, the consumers do get smaller. Boilerplate things like adapting methods to match `chunk_write_fn` and `chunk_read_fn` make up a lot of these insertions. The "interesting" code gets a lot smaller and cleaner.

Updates in V4
Updates in V3

* API methods use better types and changed their order to match internal data more closely.
* Use `hashfile_total()` instead of internal data values.
* The implementation of `pair_chunk()` uses `read_chunk()`.
* `init_chunkfile()` has an in-code doc comment warning against using the same `struct chunkfile` for reads and writes.
* More multiplications are correctly cast in midx.c.
* The chunk-format technical docs are expanded.

Updates in V2

* The method `pair_chunk()` now automatically sets a pointer while `read_chunk()` uses the callback. This greatly reduces the code size.
* Pointer casts are now implicit instead of explicit.
* Extra care is taken to not overflow when verifying chunk sizes on write.
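Two of the items above concern integer overflow: multiplications that need a cast, and size checks on write. The sketch below shows the general hazard and one common guard. The `toy_*` helpers are hypothetical and are not the code from midx.c or chunk-format.c.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Both operands are 32-bit, so the product is computed in 32 bits and
 * wraps before it is widened (assuming unsigned int is 32 bits wide).
 */
static uint64_t toy_narrow_product(uint32_t nr, uint32_t width)
{
	return nr * width;
}

/* Casting one operand first makes the multiplication happen in 64 bits. */
static uint64_t toy_wide_product(uint32_t nr, uint32_t width)
{
	return (uint64_t)nr * width;
}

/* Returns nonzero if a * b would not fit in 64 bits. */
static int toy_mult_overflows(uint64_t a, uint64_t b)
{
	return a && b > UINT64_MAX / a;
}

/* Checked multiply: stores the product and returns 0, or returns -1. */
static int toy_checked_mult(uint64_t a, uint64_t b, uint64_t *out)
{
	assert(out);
	if (toy_mult_overflows(a, b))
		return -1;
	*out = a * b;
	return 0;
}
```

For `nr = width = 65536`, the narrow version wraps to 0 on typical platforms while the wide version yields 4294967296. The division-based test avoids performing the overflowing multiplication at all.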
Thanks,
-Stolee
Cc: me@ttaylorr.com
Cc: gitster@pobox.com
Cc: l.s.r@web.de
Cc: szeder.dev@gmail.com
cc: Chris Torek chris.torek@gmail.com
cc: Derrick Stolee stolee@gmail.com