Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid spawning gzip in git archive #145

Closed
wants to merge 6 commits into from

Conversation

dscho
Copy link
Member

@dscho dscho commented Feb 25, 2019

When creating .tar.gz archives with git archive, we let gzip handle the compression part. But that is not even necessary, as we already require zlib (to compress our loose objects). It is also unfortunate, as it requires gzip to be in the PATH (which is not the case e.g. with MinGit for Windows, which tries to bundle just the bare minimum of files to make Git work non-interactively, for use with 3rd-party applications requiring Git).

This patch series resolves this conundrum by teaching git archive the trick to gzip-compress in-process, and it is based on git-for-windows#2077

Changes since v2:

  • Added a perf test.
  • The change to make BLOCKSIZE unsigned was dropped.
  • The pseudo command is now called :internal-gzip: instead of :zlib.
  • Integrated a patch using an extra thread for gzip compression, based on one of René Scharfe's patches.
  • Performed some benchmarking with zlib 1.2.11 and with a zlib version with speed improvements proposed by Intel, and included the results in the added commit's message.

Changes since v1:

  • The commit messages were touched up quite a bit.
  • There is now a new commit that marks the BLOCKSIZE constant as unsigned.
  • The first commit used to introduce the gzip variable prematurely. This part has been moved into the appropriate commit (which used to be 2/2 and now is 3/4).
  • The second commit was split into two: 3/4 which introduces the new :zlib pseudo command, and 4/4 that switches the default.
  • The GZIP prerequisite was dropped from a couple of tests in t5000 that no longer need it.
  • The magic "wb\0" code was replaced by a strbuf-based one.
  • The error message was improved when gzclose() failed.

Cc: Jeff King peff@peff.net, René Scharfe l.s.r@web.de, brian m. carlson sandals@crustytoothpaste.net
cc: Ævar Arnfjörð Bjarmason avarab@gmail.com

@dscho dscho added the ready to submit Has commits that have not been submitted yet label Feb 25, 2019
@dscho
Copy link
Member Author

dscho commented Apr 12, 2019

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 12, 2019

Submitted as pull.145.git.gitgitgadget@gmail.com

@dscho dscho removed the ready to submit Has commits that have not been submitted yet label Apr 12, 2019
archive-tar.c Show resolved Hide resolved
archive-tar.c Show resolved Hide resolved
@dscho dscho added the needs-work These patches have pending issues that need to be resolved before submitting label Apr 17, 2019
@r1walz

This comment has been minimized.

@dscho

This comment has been minimized.

@dscho

This comment has been minimized.

@r1walz

This comment has been minimized.

@dscho dscho force-pushed the dont-spawn-gzip-in-archive branch 2 times, most recently from 5b33b20 to 0e5826e Compare April 26, 2019 15:12
@dscho
Copy link
Member Author

dscho commented Apr 26, 2019

/submit

@gitgitgadget
Copy link

gitgitgadget bot commented Apr 26, 2019

Submitted as pull.145.v2.git.gitgitgadget@gmail.com

@dscho dscho removed the needs-work These patches have pending issues that need to be resolved before submitting label Apr 29, 2019
archive-tar.c Outdated Show resolved Hide resolved
archive-tar.c Outdated Show resolved Hide resolved
@r1walz
Copy link

r1walz commented Jul 13, 2019

Hi @dscho

What's the status of the patch? I read the mails but it seems to have stopped receiving the updates. Hope this does not die out. Thanks for working on this and taking this forward.

@dscho
Copy link
Member Author

dscho commented Jul 16, 2019

What's the status of the patch?

I still need to integrate René's patch.

@dscho dscho added the needs-work These patches have pending issues that need to be resolved before submitting label Jul 16, 2019
@dscho dscho removed the needs-work These patches have pending issues that need to be resolved before submitting label Feb 17, 2020
@dscho dscho force-pushed the dont-spawn-gzip-in-archive branch 3 times, most recently from 82e4900 to a4729f4 Compare February 17, 2020 01:05
@gitgitgadget
Copy link

gitgitgadget bot commented Jun 12, 2022

On the Git mailing list, René Scharfe wrote (reply to this):

Drop the dependency on gzip(1) and use our internal implementation to
create tar.gz and tgz files.

Signed-off-by: René Scharfe <l.s.r@web.de>
---
 Documentation/git-archive.txt |  4 ++--
 archive-tar.c                 |  4 ++--
 t/t5000-tar-tree.sh           | 32 ++++++++++++++++----------------
 3 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/Documentation/git-archive.txt b/Documentation/git-archive.txt
index 5b017c2bdc..9de12896fc 100644
--- a/Documentation/git-archive.txt
+++ b/Documentation/git-archive.txt
@@ -148,8 +148,8 @@ tar.<format>.command::
 	format is given.
 +
 The "tar.gz" and "tgz" formats are defined automatically and default to
-`gzip -cn`. You may override them with custom commands. An internal gzip
-implementation can be used by specifying the value `git archive gzip`.
+the magic value `git archive gzip`, which invokes an internal
+implementation of gzip. You may override them with custom commands.

 tar.<format>.remote::
 	If true, enable `<format>` for use by remote clients via
diff --git a/archive-tar.c b/archive-tar.c
index bf7e321e0e..60669eb7b9 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -528,9 +528,9 @@ void init_tar_archiver(void)
 	int i;
 	register_archiver(&tar_archiver);

-	tar_filter_config("tar.tgz.command", "gzip -cn", NULL);
+	tar_filter_config("tar.tgz.command", internal_gzip_command, NULL);
 	tar_filter_config("tar.tgz.remote", "true", NULL);
-	tar_filter_config("tar.tar.gz.command", "gzip -cn", NULL);
+	tar_filter_config("tar.tar.gz.command", internal_gzip_command, NULL);
 	tar_filter_config("tar.tar.gz.remote", "true", NULL);
 	git_config(git_tar_config, NULL);
 	for (i = 0; i < nr_tar_filters; i++) {
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 9ac0ec67fe..1a68e89a55 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -339,21 +339,21 @@ test_expect_success 'only enabled filters are available remotely' '
 	test_cmp_bin remote.bar config.bar
 '

-test_expect_success GZIP 'git archive --format=tgz' '
+test_expect_success 'git archive --format=tgz' '
 	git archive --format=tgz HEAD >j.tgz
 '

-test_expect_success GZIP 'git archive --format=tar.gz' '
+test_expect_success 'git archive --format=tar.gz' '
 	git archive --format=tar.gz HEAD >j1.tar.gz &&
 	test_cmp_bin j.tgz j1.tar.gz
 '

-test_expect_success GZIP 'infer tgz from .tgz filename' '
+test_expect_success 'infer tgz from .tgz filename' '
 	git archive --output=j2.tgz HEAD &&
 	test_cmp_bin j.tgz j2.tgz
 '

-test_expect_success GZIP 'infer tgz from .tar.gz filename' '
+test_expect_success 'infer tgz from .tar.gz filename' '
 	git archive --output=j3.tar.gz HEAD &&
 	test_cmp_bin j.tgz j3.tar.gz
 '
@@ -363,31 +363,31 @@ test_expect_success GZIP 'extract tgz file' '
 	test_cmp_bin b.tar j.tar
 '

-test_expect_success GZIP 'remote tar.gz is allowed by default' '
+test_expect_success 'remote tar.gz is allowed by default' '
 	git archive --remote=. --format=tar.gz HEAD >remote.tar.gz &&
 	test_cmp_bin j.tgz remote.tar.gz
 '

-test_expect_success GZIP 'remote tar.gz can be disabled' '
+test_expect_success 'remote tar.gz can be disabled' '
 	git config tar.tar.gz.remote false &&
 	test_must_fail git archive --remote=. --format=tar.gz HEAD \
 		>remote.tar.gz
 '

-test_expect_success 'git archive --format=tgz (internal gzip)' '
-	test_config tar.tgz.command "git archive gzip" &&
-	git archive --format=tgz HEAD >internal_gzip.tgz
+test_expect_success GZIP 'git archive --format=tgz (external gzip)' '
+	test_config tar.tgz.command "gzip -cn" &&
+	git archive --format=tgz HEAD >external_gzip.tgz
 '

-test_expect_success 'git archive --format=tar.gz (internal gzip)' '
-	test_config tar.tar.gz.command "git archive gzip" &&
-	git archive --format=tar.gz HEAD >internal_gzip.tar.gz &&
-	test_cmp_bin internal_gzip.tgz internal_gzip.tar.gz
+test_expect_success GZIP 'git archive --format=tar.gz (external gzip)' '
+	test_config tar.tar.gz.command "gzip -cn" &&
+	git archive --format=tar.gz HEAD >external_gzip.tar.gz &&
+	test_cmp_bin external_gzip.tgz external_gzip.tar.gz
 '

-test_expect_success GZIP 'extract tgz file (internal gzip)' '
-	gzip -d -c <internal_gzip.tgz >internal_gzip.tar &&
-	test_cmp_bin b.tar internal_gzip.tar
+test_expect_success GZIP 'extract tgz file (external gzip)' '
+	gzip -d -c <external_gzip.tgz >external_gzip.tar &&
+	test_cmp_bin b.tar external_gzip.tar
 '

 test_expect_success 'archive and :(glob)' '
--
2.36.1

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 13, 2022

On the Git mailing list, Junio C Hamano wrote (reply to this):

René Scharfe <l.s.r@web.de> writes:

> @@ -148,7 +148,8 @@ tar.<format>.command::
>  	format is given.
>  +
>  The "tar.gz" and "tgz" formats are defined automatically and default to
> -`gzip -cn`. You may override them with custom commands.
> +`gzip -cn`. You may override them with custom commands. An internal gzip
> +implementation can be used by specifying the value `git archive gzip`.
>

The new sentence didn't click and I was lost in figuring out what is
set to which value and where, before looking at a test the patch
adds.

I think it is not entirely a fault of this patch, but the badness is
already in the original.  I wouldn't have been as confused if it
were like so:

    The "tar.gz" and "tgz" formats are defined automatically to use
    `gzip -cn` as the command by default.  An internal gzip
    implementation can be used by specifying the value `git archive
    gzip` for these two formats.

for example.

Thanks.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 13, 2022

On the Git mailing list, Junio C Hamano wrote (reply to this):

René Scharfe <l.s.r@web.de> writes:

> -test_expect_success GZIP 'git archive --format=tar.gz' '
> +test_expect_success 'git archive --format=tar.gz' '
>  	git archive --format=tar.gz HEAD >j1.tar.gz &&
>  	test_cmp_bin j.tgz j1.tar.gz
>  '

Curiously, this breaks for me.  It is understandable if we are not
producing byte-for-byte identical output with internal gzip.

With the following hack I can force the step pass, so it seems that
the two invocations of internal gzip are not emitting identical
result for the tar stream taken out of HEAD^{tree} object?

diff --git c/t/t5000-tar-tree.sh w/t/t5000-tar-tree.sh
index 1a68e89a55..c0a2cb92d4 100755
--- c/t/t5000-tar-tree.sh
+++ w/t/t5000-tar-tree.sh
@@ -340,14 +340,16 @@ test_expect_success 'only enabled filters are available remotely' '
 '
 
 test_expect_success 'git archive --format=tgz' '
-	git archive --format=tgz HEAD >j.tgz
+	git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD >j.tgz
 '
 
 test_expect_success 'git archive --format=tar.gz' '
-	git archive --format=tar.gz HEAD >j1.tar.gz &&
+	git -c tar.tar.gz.command="gzip -cn" archive --format=tar.gz HEAD >j1.tar.gz &&
 	test_cmp_bin j.tgz j1.tar.gz
 '
 
+exit
+
 test_expect_success 'infer tgz from .tgz filename' '
 	git archive --output=j2.tgz HEAD &&
 	test_cmp_bin j.tgz j2.tgz

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 14, 2022

On the Git mailing list, Johannes Schindelin wrote (reply to this):

Hi Junio,

On Mon, 13 Jun 2022, Junio C Hamano wrote:

> René Scharfe <l.s.r@web.de> writes:
>
> > -test_expect_success GZIP 'git archive --format=tar.gz' '
> > +test_expect_success 'git archive --format=tar.gz' '
> >  	git archive --format=tar.gz HEAD >j1.tar.gz &&
> >  	test_cmp_bin j.tgz j1.tar.gz
> >  '
>
> Curiously, this breaks for me.  It is understandable if we are not
> producing byte-for-byte identical output with internal gzip.

Indeed, I can reproduce this, too. In particular, `j.tgz` and `j1.tar.gz`
differ like this in my test run:

-00000000  1f 8b 08 1a 00 2e ca 09  00 03 04 00 89 45 fc 83 |.............E..|
+00000000  1f 8b 08 1a 00 35 2a 10  00 03 04 00 89 45 fc 83 |.....5*......E..|

and

-00000010  7d fc 00 f1 d0 ec b7 63  8c 30 cc 9b e6 db b6 6d |}......c.0.....m|
+00000010  7d fc 00 54 ff ec b7 63  8c 30 cc 9b e6 db b6 6d |}..T...c.0.....m|

According to https://datatracker.ietf.org/doc/html/rfc1952#page-5, the
difference in the first line is the mtime. For reference, this is the
version with `git -c tar.tgz.command="gzip -cn" archive --format=tgz
HEAD`:

00000000  1f 8b 08 00 00 00 00 00  00 03 ec b7 63 8c 30 cc |............c.0.|

In other words, `gzip` forces the `mtim` member to all zeros, which makes
sense.

The recorded mtimes are a bit funny, according to
https://wolf-tungsten.github.io/gzip-analyzer/, they are 1975-03-17
00:36:32 and 1978-08-05 22:45:36, respectively...

And the mtime actually changes all the time.

What's even more funny: if I comment out the `deflateSetHeader()`, the
mtime header field is left at all-zeros. This is on Ubuntu 18.04 with
zlib1g 1:1.2.11.dfsg-0ubuntu2.

So I dug in a bit deeper and what do you know, the `deflateHeader()`
function is implemented like this
(https://github.com/madler/zlib/blob/21767c654d31/deflate.c#L557-L565):

	int ZEXPORT deflateSetHeader (strm, head)
	    z_streamp strm;
	    gz_headerp head;
	{
	    if (deflateStateCheck(strm) || strm->state->wrap != 2)
		return Z_STREAM_ERROR;
	    strm->state->gzhead = head;
	    return Z_OK;
	}

Now, the caller is implemented like this:

	static void tgz_set_os(git_zstream *strm, int os)
	{
	#if ZLIB_VERNUM >= 0x1221
		struct gz_header_s gzhead = { .os = os };
		deflateSetHeader(&strm->z, &gzhead);
	#endif
	}

The biggest problem is not that the return value of `deflateSetHeader()`
is ignored. The biggest problem is that it passes the address of a heap
variable to the `deflateSetHeader()` function, which then stores it away
in another struct that lives beyond the point when we return from
`tgz_set_os()`.

In other words, this is the very issue I pointed out as GCC not catching:
https://lore.kernel.org/git/nycvar.QRO.7.76.6.2205272235220.349@tvgsbejvaqbjf.bet/

The solution is to move the heap variable back into a scope that matches
the lifetime of the compression:

-- snip --
diff --git a/archive-tar.c b/archive-tar.c
index 60669eb7b9c..3d77e0f7509 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -460,17 +460,12 @@ static void tgz_write_block(const void *data)

 static const char internal_gzip_command[] = "git archive gzip";

-static void tgz_set_os(git_zstream *strm, int os)
-{
-#if ZLIB_VERNUM >= 0x1221
-	struct gz_header_s gzhead = { .os = os };
-	deflateSetHeader(&strm->z, &gzhead);
-#endif
-}
-
 static int write_tar_filter_archive(const struct archiver *ar,
 				    struct archiver_args *args)
 {
+#if ZLIB_VERNUM >= 0x1221
+	struct gz_header_s gzhead = { .os = 3 }; /* Unix, for reproducibility */
+#endif
 	struct strbuf cmd = STRBUF_INIT;
 	struct child_process filter = CHILD_PROCESS_INIT;
 	int r;
@@ -481,7 +476,10 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	if (!strcmp(ar->filter_command, internal_gzip_command)) {
 		write_block = tgz_write_block;
 		git_deflate_init_gzip(&gzstream, args->compression_level);
-		tgz_set_os(&gzstream, 3); /* Unix, for reproducibility */
+#if ZLIB_VERNUM >= 0x1221
+		if (deflateSetHeader(&gzstream.z, &gzhead) != Z_OK)
+			BUG("deflateSetHeader() called too late");
+#endif
 		gzstream.next_out = outbuf;
 		gzstream.avail_out = sizeof(outbuf);

-- snap --

With this, the test passes for me.

René, would you mind squashing this into your patch series?

Thank you,
Dscho

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 14, 2022

On the Git mailing list, Johannes Schindelin wrote (reply to this):

Hi René,

On Sun, 12 Jun 2022, René Scharfe wrote:

> It's been a while, let's try again.

Thank you for picking this up again!

> Changes:
> - Use our own zlib helpers instead of the gz* functions of zlib,
> - ... which allows us to set the OS_CODE header consistently.
> - Pseudo-command "git archive gzip" to select the internal
>   implementation in config.
> - Use a function pointer to plug in the internal gzip.
> - Tests.
> - Discuss performance in commit message.

Makes sense. Here is the range-diff:

-- snip --
-:  ----------- > 1:  9847267888e archive: rename archiver data field to filter_command
1:  7d50f52e490 ! 2:  a98ef655af9 archive: factor out writing blocks into a separate function
    @@
      ## Metadata ##
    -Author: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
    +Author: René Scharfe <l.s.r@web.de>

      ## Commit message ##
    -    archive: factor out writing blocks into a separate function
    +    archive-tar: factor out write_block()

    -    The `git archive --format=tgz` command spawns `gzip` to perform the
    -    actual compression. However, the MinGit flavor of Git for Windows
    -    comes without `gzip` bundled inside.
    +    All tar archive writes have the same size and are done to the same file
    +    descriptor.  Move them to a common function, write_block(), to reduce
    +    code duplication and make it easy to change the destination.

    -    To help with that, we will teach `git archive` to let zlib perform the
    -    actual compression.
    -
    -    In preparation for this, we consolidate all the block writes into the
    -    function `write_block_or_die()`.
    -
    -    Note: .tar files have a well-defined, fixed block size. For that reason,
    -    it does not make any sense to pass anything but a fully-populated,
    -    full-length block to the `write_block_or_die()` function, and we can
    -    save ourselves some future trouble by not even allowing to pass an
    -    incorrect `size` parameter to it.
    -
    -    Signed-off-by: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
    -    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
    +    Original-patch-by: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
    +    Signed-off-by: René Scharfe <l.s.r@web.de>
    +    Signed-off-by: Junio C Hamano <gitster@pobox.com>

      ## archive-tar.c ##
     @@ archive-tar.c: static int write_tar_filter_archive(const struct archiver *ar,
      #define USTAR_MAX_MTIME 077777777777ULL
      #endif

    -+/* writes out the whole block, or dies if fails */
    -+static void write_block_or_die(const char *block) {
    -+	write_or_die(1, block, BLOCKSIZE);
    ++static void write_block(const void *buf)
    ++{
    ++	write_or_die(1, buf, BLOCKSIZE);
     +}
     +
      /* writes out the whole block, but only if it is full */
    @@ archive-tar.c: static int write_tar_filter_archive(const struct archiver *ar,
      {
      	if (offset == BLOCKSIZE) {
     -		write_or_die(1, block, BLOCKSIZE);
    -+		write_block_or_die(block);
    ++		write_block(block);
      		offset = 0;
      	}
      }
    @@ archive-tar.c: static void do_write_blocked(const void *data, unsigned long size
      	}
      	while (size >= BLOCKSIZE) {
     -		write_or_die(1, buf, BLOCKSIZE);
    -+		write_block_or_die(buf);
    ++		write_block(buf);
      		size -= BLOCKSIZE;
      		buf += BLOCKSIZE;
      	}
    @@ archive-tar.c: static void write_trailer(void)
      	int tail = BLOCKSIZE - offset;
      	memset(block + offset, 0, tail);
     -	write_or_die(1, block, BLOCKSIZE);
    -+	write_block_or_die(block);
    ++	write_block(block);
      	if (tail < 2 * RECORDSIZE) {
      		memset(block, 0, offset);
     -		write_or_die(1, block, BLOCKSIZE);
    -+		write_block_or_die(block);
    ++		write_block(block);
      	}
      }

2:  ac2b2488a1b < -:  ----------- archive-tar: mark RECORDSIZE/BLOCKSIZE as unsigned
3:  4ea94a87848 ! 3:  5e3c0d79589 archive: optionally use zlib directly for gzip compression
    @@
      ## Metadata ##
    -Author: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
    +Author: René Scharfe <l.s.r@web.de>

      ## Commit message ##
    -    archive: optionally use zlib directly for gzip compression
    +    archive-tar: add internal gzip implementation

    -    As we already link to the zlib library, we can perform the compression
    -    without even requiring gzip on the host machine.
    +    Git uses zlib for its own object store, but calls gzip when creating tgz
    +    archives.  Add an option to perform the gzip compression for the latter
    +    using zlib, without depending on the external gzip binary.

    -    Note: the `-n` flag that `git archive` passed to `gzip` wants to ensure
    -    that a reproducible file is written, i.e. no filename or mtime will be
    -    recorded in the compressed output. This is already the default for
    -    zlib's `gzopen()` function (if the file name or mtime should be
    -    recorded, the `deflateSetHeader()` function would have to be called
    -    instead).
    +    Plug it in by making write_block a function pointer and switching to a
    +    compressing variant if the filter command has the magic value "git
    +    archive gzip".  Does that indirection slow down tar creation?  Not
    +    really, at least not in this test:

    -    Note also that the `gzFile` datatype is defined as a pointer in
    -    `zlib.h`, i.e. we can rely on the fact that it can be `NULL`.
    +    $ hyperfine -w3 -L rev HEAD,origin/main -p 'git checkout {rev} && make' \
    +    './git -C ../linux archive --format=tar HEAD # {rev}'
    +    Benchmark #1: ./git -C ../linux archive --format=tar HEAD # HEAD
    +      Time (mean ± σ):      4.044 s ±  0.007 s    [User: 3.901 s, System: 0.137 s]
    +      Range (min … max):    4.038 s …  4.059 s    10 runs

    -    At this point, this new mode is hidden behind the pseudo command
    -    `:zlib`: assign this magic string to the `archive.tgz.command` config
    -    setting to enable it.
    +    Benchmark #2: ./git -C ../linux archive --format=tar HEAD # origin/main
    +      Time (mean ± σ):      4.047 s ±  0.009 s    [User: 3.903 s, System: 0.138 s]
    +      Range (min … max):    4.038 s …  4.066 s    10 runs

    -    Signed-off-by: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
    -    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
    +    How does tgz creation perform?

    - ## archive-tar.c ##
    -@@ archive-tar.c: static unsigned long offset;
    -
    - static int tar_umask = 002;
    -
    -+static gzFile gzip;
    -+
    - static int write_tar_filter_archive(const struct archiver *ar,
    - 				    struct archiver_args *args);
    +    $ hyperfine -w3 -L command 'gzip -cn','git archive gzip' \
    +    './git -c tar.tgz.command="{command}" -C ../linux archive --format=tgz HEAD'
    +    Benchmark #1: ./git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD
    +      Time (mean ± σ):     20.404 s ±  0.006 s    [User: 23.943 s, System: 0.401 s]
    +      Range (min … max):   20.395 s … 20.414 s    10 runs
    +
    +    Benchmark #2: ./git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD
    +      Time (mean ± σ):     23.807 s ±  0.023 s    [User: 23.655 s, System: 0.145 s]
    +      Range (min … max):   23.782 s … 23.857 s    10 runs
    +
    +    Summary
    +      './git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD' ran
    +        1.17 ± 0.00 times faster than './git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD'
    +
    +    So the internal implementation takes 17% longer on the Linux repo, but
    +    uses 2% less CPU time.  That's because the external gzip can run in
    +    parallel on its own processor, while the internal one works sequentially
    +    and avoids the inter-process communication overhead.
    +
    +    What are the benefits?  Only an internal sequential implementation can
    +    offer this eco mode, and it allows avoiding the gzip(1) requirement.
    +
    +    This implementation uses the helper functions from our zlib.c instead of
    +    the convenient gz* functions from zlib, because the latter doesn't give
    +    the control over the generated gzip header that the next patch requires.
    +
    +    Original-patch-by: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
    +    Signed-off-by: René Scharfe <l.s.r@web.de>
    +    Signed-off-by: Junio C Hamano <gitster@pobox.com>
    +
    + ## Documentation/git-archive.txt ##
    +@@ Documentation/git-archive.txt: tar.<format>.command::
    + 	format is given.
    + +
    + The "tar.gz" and "tgz" formats are defined automatically and default to
    +-`gzip -cn`. You may override them with custom commands.
    ++`gzip -cn`. You may override them with custom commands. An internal gzip
    ++implementation can be used by specifying the value `git archive gzip`.

    + tar.<format>.remote::
    + 	If true, enable `<format>` for use by remote clients via
    +
    + ## archive-tar.c ##
     @@ archive-tar.c: static int write_tar_filter_archive(const struct archiver *ar,
    + #define USTAR_MAX_MTIME 077777777777ULL
    + #endif

    - /* writes out the whole block, or dies if fails */
    - static void write_block_or_die(const char *block) {
    --	write_or_die(1, block, BLOCKSIZE);
    -+	if (!gzip)
    -+		write_or_die(1, block, BLOCKSIZE);
    -+	else if (gzwrite(gzip, block, (unsigned) BLOCKSIZE) != BLOCKSIZE)
    -+		die(_("gzwrite failed"));
    +-static void write_block(const void *buf)
    ++static void tar_write_block(const void *buf)
    + {
    + 	write_or_die(1, buf, BLOCKSIZE);
      }

    ++static void (*write_block)(const void *) = tar_write_block;
    ++
      /* writes out the whole block, but only if it is full */
    -@@ archive-tar.c: static int write_tar_filter_archive(const struct archiver *ar,
    - 	filter.use_shell = 1;
    - 	filter.in = -1;
    + static void write_if_needed(void)
    + {
    +@@ archive-tar.c: static int write_tar_archive(const struct archiver *ar,
    + 	return err;
    + }

    --	if (start_command(&filter) < 0)
    --		die_errno(_("unable to start '%s' filter"), argv[0]);
    --	close(1);
    --	if (dup2(filter.in, 1) < 0)
    --		die_errno(_("unable to redirect descriptor"));
    --	close(filter.in);
    -+	if (!strcmp(":zlib", ar->data)) {
    -+		struct strbuf mode = STRBUF_INIT;
    ++static git_zstream gzstream;
    ++static unsigned char outbuf[16384];
     +
    -+		strbuf_addstr(&mode, "wb");
    ++static void tgz_deflate(int flush)
    ++{
    ++	while (gzstream.avail_in || flush == Z_FINISH) {
    ++		int status = git_deflate(&gzstream, flush);
    ++		if (!gzstream.avail_out || status == Z_STREAM_END) {
    ++			write_or_die(1, outbuf, gzstream.next_out - outbuf);
    ++			gzstream.next_out = outbuf;
    ++			gzstream.avail_out = sizeof(outbuf);
    ++			if (status == Z_STREAM_END)
    ++				break;
    ++		}
    ++		if (status != Z_OK && status != Z_BUF_ERROR)
    ++			die(_("deflate error (%d)"), status);
    ++	}
    ++}
     +
    -+		if (args->compression_level >= 0 && args->compression_level <= 9)
    -+			strbuf_addf(&mode, "%d", args->compression_level);
    ++static void tgz_write_block(const void *data)
    ++{
    ++	gzstream.next_in = (void *)data;
    ++	gzstream.avail_in = BLOCKSIZE;
    ++	tgz_deflate(Z_NO_FLUSH);
    ++}
     +
    -+		gzip = gzdopen(fileno(stdout), mode.buf);
    -+		if (!gzip)
    -+			die(_("Could not gzdopen stdout"));
    -+		strbuf_release(&mode);
    -+	} else {
    -+		if (start_command(&filter) < 0)
    -+			die_errno(_("unable to start '%s' filter"), argv[0]);
    -+		close(1);
    -+		if (dup2(filter.in, 1) < 0)
    -+			die_errno(_("unable to redirect descriptor"));
    -+		close(filter.in);
    -+	}
    -
    - 	r = write_tar_archive(ar, args);
    ++static const char internal_gzip_command[] = "git archive gzip";
    ++
    + static int write_tar_filter_archive(const struct archiver *ar,
    + 				    struct archiver_args *args)
    + {
    +@@ archive-tar.c: static int write_tar_filter_archive(const struct archiver *ar,
    + 	if (!ar->filter_command)
    + 		BUG("tar-filter archiver called with no filter defined");

    --	close(1);
    --	if (finish_command(&filter) != 0)
    --		die(_("'%s' filter reported error"), argv[0]);
    -+	if (gzip) {
    -+		int ret = gzclose(gzip);
    -+		if (ret == Z_ERRNO)
    -+			die_errno(_("gzclose failed"));
    -+		else if (ret != Z_OK)
    -+			die(_("gzclose failed (%d)"), ret);
    -+	} else {
    -+		close(1);
    -+		if (finish_command(&filter) != 0)
    -+			die(_("'%s' filter reported error"), argv[0]);
    ++	if (!strcmp(ar->filter_command, internal_gzip_command)) {
    ++		write_block = tgz_write_block;
    ++		git_deflate_init_gzip(&gzstream, args->compression_level);
    ++		gzstream.next_out = outbuf;
    ++		gzstream.avail_out = sizeof(outbuf);
    ++
    ++		r = write_tar_archive(ar, args);
    ++
    ++		tgz_deflate(Z_FINISH);
    ++		git_deflate_end(&gzstream);
    ++		return r;
     +	}
    ++
    + 	strbuf_addstr(&cmd, ar->filter_command);
    + 	if (args->compression_level >= 0)
    + 		strbuf_addf(&cmd, " -%d", args->compression_level);
    +
    + ## t/t5000-tar-tree.sh ##
    +@@ t/t5000-tar-tree.sh: test_expect_success GZIP 'remote tar.gz can be disabled' '
    + 		>remote.tar.gz
    + '

    - 	strbuf_release(&cmd);
    - 	return r;
    ++test_expect_success 'git archive --format=tgz (internal gzip)' '
    ++	test_config tar.tgz.command "git archive gzip" &&
    ++	git archive --format=tgz HEAD >internal_gzip.tgz
    ++'
    ++
    ++test_expect_success 'git archive --format=tar.gz (internal gzip)' '
    ++	test_config tar.tar.gz.command "git archive gzip" &&
    ++	git archive --format=tar.gz HEAD >internal_gzip.tar.gz &&
    ++	test_cmp_bin internal_gzip.tgz internal_gzip.tar.gz
    ++'
    ++
    ++test_expect_success GZIP 'extract tgz file (internal gzip)' '
    ++	gzip -d -c <internal_gzip.tgz >internal_gzip.tar &&
    ++	test_cmp_bin b.tar internal_gzip.tar
    ++'
    ++
    + test_expect_success 'archive and :(glob)' '
    + 	git archive -v HEAD -- ":(glob)**/sh" >/dev/null 2>actual &&
    + 	cat >expect <<EOF &&
-:  ----------- > 4:  af27bea4fc3 archive-tar: use OS_CODE 3 (Unix) for internal gzip
4:  0e5826e3f25 ! 5:  62038b8e911 archive: use the internal zlib-based gzip compression by default
    @@
      ## Metadata ##
    -Author: Johannes Schindelin <Johannes.Schindelin@gmx.de>
    +Author: René Scharfe <l.s.r@web.de>

      ## Commit message ##
    -    archive: use the internal zlib-based gzip compression by default
    +    archive-tar: use internal gzip by default

    -    We just introduced support for compressing `.tar.gz` archives in the
    -    `git archive` process itself, using zlib directly instead of spawning
    -    `gzip`.
    +    Drop the dependency on gzip(1) and use our internal implementation to
    +    create tar.gz and tgz files.

    -    While this takes less CPU time overall, on multi-core machines, this is
    -    slightly slower in terms of wall clock time (it seems to be in the
    -    ballpark of 15%).
    -
    -    It does reduce the number of dependencies by one, though, which makes it
    -    desirable to turn that mode on by default.
    -
    -    Changing the default benefits most notably the MinGit flavor of Git for
    -    Windows (which intends to support 3rd-party applications that want to
    -    use Git and want to bundle a minimal set of files for that purpose, i.e.
    -    stripping out all non-essential files such as interactive commands,
    -    Perl, and yes, also `gzip`).
    -
    -    We also can now remove the `GZIP` prerequisite from quite a number of
    -    test cases in `t/t5000-tar-tree.sh`.
    -
    -    This closes https://github.com/git-for-windows/git/issues/1970
    -
    -    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
    +    Signed-off-by: René Scharfe <l.s.r@web.de>
    +    Signed-off-by: Junio C Hamano <gitster@pobox.com>

      ## Documentation/git-archive.txt ##
     @@ Documentation/git-archive.txt: tar.<format>.command::
      	format is given.
      +
      The "tar.gz" and "tgz" formats are defined automatically and default to
    --`gzip -cn`. You may override them with custom commands.
    -+`:zlib`, triggering an in-process gzip compression. You may override
    -+them with custom commands, e.g. `gzip -cn` or `pigz -cn`.
    +-`gzip -cn`. You may override them with custom commands. An internal gzip
    +-implementation can be used by specifying the value `git archive gzip`.
    ++the magic value `git archive gzip`, which invokes an internal
    ++implementation of gzip. You may override them with custom commands.

      tar.<format>.remote::
      	If true, enable `<format>` for use by remote clients via
    @@ archive-tar.c: void init_tar_archiver(void)
      	register_archiver(&tar_archiver);

     -	tar_filter_config("tar.tgz.command", "gzip -cn", NULL);
    -+	tar_filter_config("tar.tgz.command", ":zlib", NULL);
    ++	tar_filter_config("tar.tgz.command", internal_gzip_command, NULL);
      	tar_filter_config("tar.tgz.remote", "true", NULL);
     -	tar_filter_config("tar.tar.gz.command", "gzip -cn", NULL);
    -+	tar_filter_config("tar.tar.gz.command", ":zlib", NULL);
    ++	tar_filter_config("tar.tar.gz.command", internal_gzip_command, NULL);
      	tar_filter_config("tar.tar.gz.remote", "true", NULL);
      	git_config(git_tar_config, NULL);
      	for (i = 0; i < nr_tar_filters; i++) {
    @@ t/t5000-tar-tree.sh: test_expect_success 'only enabled filters are available rem
      	git archive --output=j3.tar.gz HEAD &&
      	test_cmp_bin j.tgz j3.tar.gz
      '
    -
    -+test_expect_success 'use `archive.tgz.command=:zlib` explicitly' '
    -+	git -c archive.tgz.command=:zlib archive --output=j4.tgz HEAD &&
    -+	test_cmp_bin j.tgz j4.tgz
    -+'
    -+
    - test_expect_success GZIP 'extract tgz file' '
    - 	gzip -d -c <j.tgz >j.tar &&
    +@@ t/t5000-tar-tree.sh: test_expect_success GZIP 'extract tgz file' '
      	test_cmp_bin b.tar j.tar
      '

    @@ t/t5000-tar-tree.sh: test_expect_success 'only enabled filters are available rem
      	git config tar.tar.gz.remote false &&
      	test_must_fail git archive --remote=. --format=tar.gz HEAD \
      		>remote.tar.gz
    + '
    +
    +-test_expect_success 'git archive --format=tgz (internal gzip)' '
    +-	test_config tar.tgz.command "git archive gzip" &&
    +-	git archive --format=tgz HEAD >internal_gzip.tgz
    ++test_expect_success GZIP 'git archive --format=tgz (external gzip)' '
    ++	test_config tar.tgz.command "gzip -cn" &&
    ++	git archive --format=tgz HEAD >external_gzip.tgz
    + '
    +
    +-test_expect_success 'git archive --format=tar.gz (internal gzip)' '
    +-	test_config tar.tar.gz.command "git archive gzip" &&
    +-	git archive --format=tar.gz HEAD >internal_gzip.tar.gz &&
    +-	test_cmp_bin internal_gzip.tgz internal_gzip.tar.gz
    ++test_expect_success GZIP 'git archive --format=tar.gz (external gzip)' '
    ++	test_config tar.tar.gz.command "gzip -cn" &&
    ++	git archive --format=tar.gz HEAD >external_gzip.tar.gz &&
    ++	test_cmp_bin external_gzip.tgz external_gzip.tar.gz
    + '
    +
    +-test_expect_success GZIP 'extract tgz file (internal gzip)' '
    +-	gzip -d -c <internal_gzip.tgz >internal_gzip.tar &&
    +-	test_cmp_bin b.tar internal_gzip.tar
    ++test_expect_success GZIP 'extract tgz file (external gzip)' '
    ++	gzip -d -c <external_gzip.tgz >external_gzip.tar &&
    ++	test_cmp_bin b.tar external_gzip.tar
    + '
    +
    + test_expect_success 'archive and :(glob)' '
-- snap --

All of these changes look sensible to me, and the performance
implications, while at first glance unfavorable because wallclock time
increases even as CPU time decreases, are actually quite good. As Peff
said in
https://lore.kernel.org/git/20190501181807.GC4109@sigill.intra.peff.net/t/#u

> [...] whatever has the lowest overall CPU time is generally preferable
> [...]

By the way, the main reason why I did not work more is that in
http://madler.net/pipermail/zlib-devel_madler.net/2019-December/003308.html,
Mark Adler (the zlib maintainer) announced that...

> [...] There are many well-tested performance improvements in zlib
> waiting in the wings that will be incorporated over the next several
> months. [...]

This was in December 2019. And now it's June 2022 and I kind of wonder
whether those promised improvements will still come.

In the meantime, however, a viable alternative seems to have cropped up:
https://github.com/zlib-ng/zlib-ng. Essentially, it looks as if it is what
zlib should have become after above-quoted announcement.

In particular the CPU intrinsics support (think MMX, SSE2/3, etc) seem to
be very interesting and I would not be completely surprised if building
Git with your patches and linking against zlib-ng would paint a very
favorable picture not only in terms of CPU time but also in terms of
wallclock time. Sadly, I have not been able to set aside time to look into
that angle, but maybe I can peak your interest?

Thanks,
Dscho

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 14, 2022

On the Git mailing list, René Scharfe wrote (reply to this):

Am 14.06.22 um 13:27 schrieb Johannes Schindelin:
> Hi Junio,
>
> On Mon, 13 Jun 2022, Junio C Hamano wrote:
>
>> René Scharfe <l.s.r@web.de> writes:
>>
>>> -test_expect_success GZIP 'git archive --format=tar.gz' '
>>> +test_expect_success 'git archive --format=tar.gz' '
>>>  	git archive --format=tar.gz HEAD >j1.tar.gz &&
>>>  	test_cmp_bin j.tgz j1.tar.gz
>>>  '
>>
>> Curiously, this breaks for me.  It is understandable if we are not
>> producing byte-for-byte identical output with internal gzip.

Makes sense in retrospect, there's no reason the output of gzip(1) and
zlib would have to be the same exactly.  It just happened to be so on my
platform, so the tests deceptively passed for me.  I think we simply
have to drop those that try to compare compressed files made by
different tools -- we can still check if their content can be extracted
and matches.

> Indeed, I can reproduce this, too. In particular, `j.tgz` and `j1.tar.gz`
> differ like this in my test run:
>
> -00000000  1f 8b 08 1a 00 2e ca 09  00 03 04 00 89 45 fc 83 |.............E..|
> +00000000  1f 8b 08 1a 00 35 2a 10  00 03 04 00 89 45 fc 83 |.....5*......E..|
>
> and
>
> -00000010  7d fc 00 f1 d0 ec b7 63  8c 30 cc 9b e6 db b6 6d |}......c.0.....m|
> +00000010  7d fc 00 54 ff ec b7 63  8c 30 cc 9b e6 db b6 6d |}..T...c.0.....m|
>
> According to https://datatracker.ietf.org/doc/html/rfc1952#page-5, the
> difference in the first line is the mtime. For reference, this is the
> version with `git -c tar.tgz.command="gzip -cn" archive --format=tgz
> HEAD`:
>
> 00000000  1f 8b 08 00 00 00 00 00  00 03 ec b7 63 8c 30 cc |............c.0.|
>
> In other words, `gzip` forces the `mtim` member to all zeros, which makes
> sense.

And that's what zlib does as well for me:

   $ ./git-archive --format=tgz HEAD | hexdump -C | head -1
   00000000  1f 8b 08 00 00 00 00 00  00 03 ec bd 59 73 1c 49  |..........?Ys.I|

>
> The recorded mtimes are a bit funny, according to
> https://wolf-tungsten.github.io/gzip-analyzer/, they are 1975-03-17
> 00:36:32 and 1978-08-05 22:45:36, respectively...
>
> And the mtime actually changes all the time.
>
> What's even more funny: if I comment out the `deflateSetHeader()`, the
> mtime header field is left at all-zeros. This is on Ubuntu 18.04 with
> zlib1g 1:1.2.11.dfsg-0ubuntu2.
>
> So I dug in a bit deeper and what do you know, the `deflateHeader()`
> function is implemented like this
> (https://github.com/madler/zlib/blob/21767c654d31/deflate.c#L557-L565):
>
> 	int ZEXPORT deflateSetHeader (strm, head)
> 	    z_streamp strm;
> 	    gz_headerp head;
> 	{
> 	    if (deflateStateCheck(strm) || strm->state->wrap != 2)
> 		return Z_STREAM_ERROR;
> 	    strm->state->gzhead = head;
> 	    return Z_OK;
> 	}
>
> Now, the caller is implemented like this:
>
> 	static void tgz_set_os(git_zstream *strm, int os)
> 	{
> 	#if ZLIB_VERNUM >= 0x1221
> 		struct gz_header_s gzhead = { .os = os };
> 		deflateSetHeader(&strm->z, &gzhead);
> 	#endif
> 	}
>
> The biggest problem is not that the return value of `deflateSetHeader()`
> is ignored. The biggest problem is that it passes the address of a heap
> variable to the `deflateSetHeader()` function, which then stores it away
> in another struct that lives beyond the point when we return from
> `tgz_set_os()`.

Ah, you mean the address of an automatic variable on the stack, but I
get it.  D'oh!

>
> In other words, this is the very issue I pointed out as GCC not catching:
> https://lore.kernel.org/git/nycvar.QRO.7.76.6.2205272235220.349@tvgsbejvaqbjf.bet/
>
> The solution is to move the heap variable back into a scope that matches
> the lifetime of the compression:
>
> -- snip --
> diff --git a/archive-tar.c b/archive-tar.c
> index 60669eb7b9c..3d77e0f7509 100644
> --- a/archive-tar.c
> +++ b/archive-tar.c
> @@ -460,17 +460,12 @@ static void tgz_write_block(const void *data)
>
>  static const char internal_gzip_command[] = "git archive gzip";
>
> -static void tgz_set_os(git_zstream *strm, int os)
> -{
> -#if ZLIB_VERNUM >= 0x1221
> -	struct gz_header_s gzhead = { .os = os };
> -	deflateSetHeader(&strm->z, &gzhead);
> -#endif
> -}
> -
>  static int write_tar_filter_archive(const struct archiver *ar,
>  				    struct archiver_args *args)
>  {
> +#if ZLIB_VERNUM >= 0x1221
> +	struct gz_header_s gzhead = { .os = 3 }; /* Unix, for reproducibility */
> +#endif
>  	struct strbuf cmd = STRBUF_INIT;
>  	struct child_process filter = CHILD_PROCESS_INIT;
>  	int r;
> @@ -481,7 +476,10 @@ static int write_tar_filter_archive(const struct archiver *ar,
>  	if (!strcmp(ar->filter_command, internal_gzip_command)) {
>  		write_block = tgz_write_block;
>  		git_deflate_init_gzip(&gzstream, args->compression_level);
> -		tgz_set_os(&gzstream, 3); /* Unix, for reproducibility */
> +#if ZLIB_VERNUM >= 0x1221
> +		if (deflateSetHeader(&gzstream.z, &gzhead) != Z_OK)
> +			BUG("deflateSetHeader() called too late");
> +#endif
>  		gzstream.next_out = outbuf;
>  		gzstream.avail_out = sizeof(outbuf);
>
> -- snap --

Good find, thank you!  A shorter solution would be to make gzhead static.

>
> With this, the test passes for me.
>
> René, would you mind squashing this into your patch series?
>
> Thank you,
> Dscho

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 14, 2022

On the Git mailing list, René Scharfe wrote (reply to this):

Am 14.06.22 um 17:47 schrieb René Scharfe:
> Am 14.06.22 um 13:27 schrieb Johannes Schindelin:
>> Hi Junio,
>>
>> On Mon, 13 Jun 2022, Junio C Hamano wrote:
>>
>>> René Scharfe <l.s.r@web.de> writes:
>>>
>>>> -test_expect_success GZIP 'git archive --format=tar.gz' '
>>>> +test_expect_success 'git archive --format=tar.gz' '
>>>>  	git archive --format=tar.gz HEAD >j1.tar.gz &&
>>>>  	test_cmp_bin j.tgz j1.tar.gz
>>>>  '
>>>
>>> Curiously, this breaks for me.  It is understandable if we are not
>>> producing byte-for-byte identical output with internal gzip.
>
> Makes sense in retrospect, there's no reason the output of gzip(1) and
> zlib would have to be the same exactly.  It just happened to be so on my
> platform, so the tests deceptively passed for me.  I think we simply
> have to drop those that try to compare compressed files made by
> different tools -- we can still check if their content can be extracted
> and matches.

I have to take that back, I was confused -- the tests are fine.  There are no
comparisons between gzip-generated and zlib-generated files.  There's just
the gzhead use-after-return error that Dscho discovered.

René

@dscho
Copy link
Member Author

dscho commented Jun 14, 2022

@dscho dscho closed this Jun 14, 2022
@dscho dscho deleted the dont-spawn-gzip-in-archive branch June 14, 2022 16:31
@gitgitgadget
Copy link

gitgitgadget bot commented Jun 14, 2022

On the Git mailing list, Johannes Schindelin wrote (reply to this):

Hi René,

On Tue, 14 Jun 2022, René Scharfe wrote:

> Am 14.06.22 um 13:27 schrieb Johannes Schindelin:
>
> > The solution is to move the heap variable back into a scope that matches
> > the lifetime of the compression:
> >
> > -- snip --
> > diff --git a/archive-tar.c b/archive-tar.c
> > index 60669eb7b9c..3d77e0f7509 100644
> > --- a/archive-tar.c
> > +++ b/archive-tar.c
> > @@ -460,17 +460,12 @@ static void tgz_write_block(const void *data)
> >
> >  static const char internal_gzip_command[] = "git archive gzip";
> >
> > -static void tgz_set_os(git_zstream *strm, int os)
> > -{
> > -#if ZLIB_VERNUM >= 0x1221
> > -	struct gz_header_s gzhead = { .os = os };
> > -	deflateSetHeader(&strm->z, &gzhead);
> > -#endif
> > -}
> > -
> >  static int write_tar_filter_archive(const struct archiver *ar,
> >  				    struct archiver_args *args)
> >  {
> > +#if ZLIB_VERNUM >= 0x1221
> > +	struct gz_header_s gzhead = { .os = 3 }; /* Unix, for reproducibility */
> > +#endif
> >  	struct strbuf cmd = STRBUF_INIT;
> >  	struct child_process filter = CHILD_PROCESS_INIT;
> >  	int r;
> > @@ -481,7 +476,10 @@ static int write_tar_filter_archive(const struct archiver *ar,
> >  	if (!strcmp(ar->filter_command, internal_gzip_command)) {
> >  		write_block = tgz_write_block;
> >  		git_deflate_init_gzip(&gzstream, args->compression_level);
> > -		tgz_set_os(&gzstream, 3); /* Unix, for reproducibility */
> > +#if ZLIB_VERNUM >= 0x1221
> > +		if (deflateSetHeader(&gzstream.z, &gzhead) != Z_OK)
> > +			BUG("deflateSetHeader() called too late");
> > +#endif
> >  		gzstream.next_out = outbuf;
> >  		gzstream.avail_out = sizeof(outbuf);
> >
> > -- snap --
>
> Good find, thank you!  A shorter solution would be to make gzhead static.

I should have said that I had considered this, but decided against it
because it would introduce yet another issue: it would render the code
needlessly un-multi-threadable. And that can be avoided _really_ easily.

Ciao,
Dscho

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 14, 2022

On the Git mailing list, René Scharfe wrote (reply to this):

Am 14.06.22 um 18:29 schrieb Johannes Schindelin:
> Hi René,
>
> On Tue, 14 Jun 2022, René Scharfe wrote:
>
>> Am 14.06.22 um 13:27 schrieb Johannes Schindelin:
>>
>>> The solution is to move the heap variable back into a scope that matches
>>> the lifetime of the compression:
>>>
>>> -- snip --
>>> diff --git a/archive-tar.c b/archive-tar.c
>>> index 60669eb7b9c..3d77e0f7509 100644
>>> --- a/archive-tar.c
>>> +++ b/archive-tar.c
>>> @@ -460,17 +460,12 @@ static void tgz_write_block(const void *data)
>>>
>>>  static const char internal_gzip_command[] = "git archive gzip";
>>>
>>> -static void tgz_set_os(git_zstream *strm, int os)
>>> -{
>>> -#if ZLIB_VERNUM >= 0x1221
>>> -	struct gz_header_s gzhead = { .os = os };
>>> -	deflateSetHeader(&strm->z, &gzhead);
>>> -#endif
>>> -}
>>> -
>>>  static int write_tar_filter_archive(const struct archiver *ar,
>>>  				    struct archiver_args *args)
>>>  {
>>> +#if ZLIB_VERNUM >= 0x1221
>>> +	struct gz_header_s gzhead = { .os = 3 }; /* Unix, for reproducibility */
>>> +#endif
>>>  	struct strbuf cmd = STRBUF_INIT;
>>>  	struct child_process filter = CHILD_PROCESS_INIT;
>>>  	int r;
>>> @@ -481,7 +476,10 @@ static int write_tar_filter_archive(const struct archiver *ar,
>>>  	if (!strcmp(ar->filter_command, internal_gzip_command)) {
>>>  		write_block = tgz_write_block;
>>>  		git_deflate_init_gzip(&gzstream, args->compression_level);
>>> -		tgz_set_os(&gzstream, 3); /* Unix, for reproducibility */
>>> +#if ZLIB_VERNUM >= 0x1221
>>> +		if (deflateSetHeader(&gzstream.z, &gzhead) != Z_OK)
>>> +			BUG("deflateSetHeader() called too late");
>>> +#endif
>>>  		gzstream.next_out = outbuf;
>>>  		gzstream.avail_out = sizeof(outbuf);
>>>
>>> -- snap --
>>
>> Good find, thank you!  A shorter solution would be to make gzhead static.
>
> I should have said that I had considered this, but decided against it
> because it would introduce yet another issue: it would render the code
> needlessly un-multi-threadable. And that can be avoided _really_ easily.

archive-tar.c (and archive-zip.c) use other static variables, so a
static gzhead won't break or block anything in this regard.  There was
no interest in running it in parallel threads so far AFAIK, and it's
hard for me to imagine the usefulness of creating multiple .tgz files at
the same time.

The doubled ZLIB_VERNUM is unsightly and I'm not sure the BUG check is
useful -- I omitted error checking because there is no recurse for us if
deflateSetHeader() doesn't work, and on ancient zlib versions we
silently continue anyway.

But that's all just minor quibbling -- I'll include your changes in the
next version, they look fine overall.

René

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 14, 2022

On the Git mailing list, René Scharfe wrote (reply to this):

Am 14.06.22 um 13:28 schrieb Johannes Schindelin:
>
> By the way, the main reason why I did not work more is that in
> http://madler.net/pipermail/zlib-devel_madler.net/2019-December/003308.html,
> Mark Adler (the zlib maintainer) announced that...
>
>> [...] There are many well-tested performance improvements in zlib
>> waiting in the wings that will be incorporated over the next several
>> months. [...]
>
> This was in December 2019. And now it's June 2022 and I kind of wonder
> whether those promised improvements will still come.
>
> In the meantime, however, a viable alternative seems to have cropped up:
> https://github.com/zlib-ng/zlib-ng. Essentially, it looks as if it is what
> zlib should have become after above-quoted announcement.
>
> In particular the CPU intrinsics support (think MMX, SSE2/3, etc) seem to
> be very interesting and I would not be completely surprised if building
> Git with your patches and linking against zlib-ng would paint a very
> favorable picture not only in terms of CPU time but also in terms of
> wallclock time. Sadly, I have not been able to set aside time to look into
> that angle, but maybe I can peak your interest?
I was unable to preload zlib-ng using DYLD_INSERT_LIBRARIES on macOS
12.4 so far.  The included demo proggy looks impressive, though:

$ hyperfine -w3 -L gzip gzip,../zlib-ng/minigzip "git -C ../linux archive --format=tar HEAD | {gzip} -c"
Benchmark #1: git -C ../linux archive --format=tar HEAD | gzip -c
  Time (mean ± σ):     20.424 s ±  0.006 s    [User: 23.964 s, System: 0.432 s]
  Range (min … max):   20.414 s … 20.434 s    10 runs

Benchmark #2: git -C ../linux archive --format=tar HEAD | ../zlib-ng/minigzip -c
  Time (mean ± σ):     12.158 s ±  0.006 s    [User: 13.908 s, System: 0.376 s]
  Range (min … max):   12.145 s … 12.166 s    10 runs

Summary
  'git -C ../linux archive --format=tar HEAD | ../zlib-ng/minigzip -c' ran
    1.68 ± 0.00 times faster than 'git -C ../linux archive --format=tar HEAD | gzip -c'

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 15, 2022

On the Git mailing list, Junio C Hamano wrote (reply to this):

René Scharfe <l.s.r@web.de> writes:

> archive-tar.c (and archive-zip.c) use other static variables, so a
> static gzhead won't break or block anything in this regard.  There was
> no interest in running it in parallel threads so far AFAIK, and it's
> hard for me to imagine the usefulness of creating multiple .tgz files at
> the same time.

;-)  FWIW I had exactly the same reaction.  

If this were a lot isolated piece of helper function, perhaps, but
no, reentrancy for this helper function specifically written for
this code path is not a very good argument.

The code structure the (not very good) argument tried to suport,
however, is a good thing to have regardless, and if the code is
written in such a way from the beginning, there is no reason to
reject it.  If it opens the door to a unified way to deal with all
the other static global variables (e.g. have a "archiver_state"
structure that collects all of them, with this one included, and
pass it around, or something), that would be great.

Short of that, I do not care too much either way.  As this topic is
not even in 'next', "once written in one way, it is not worth the
code churn to rewrite it in the other way, when these two ways are
both reasonable" does not yet apply here, so ...

> But that's all just minor quibbling -- I'll include your changes in the
> next version, they look fine overall.

... that's fine, too.

Thanks, both.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 15, 2022

On the Git mailing list, René Scharfe wrote (reply to this):

Changes since v3:
- Use deflateSetHeader() correctly, thanks to Dscho.
- New patch to update the format-related documentation.

  archive: update format documentation
  archive: rename archiver data field to filter_command
  archive-tar: factor out write_block()
  archive-tar: add internal gzip implementation
  archive-tar: use OS_CODE 3 (Unix) for internal gzip
  archive-tar: use internal gzip by default

 Documentation/git-archive.txt | 21 +++++-----
 archive-tar.c                 | 77 ++++++++++++++++++++++++++++++-----
 archive.h                     |  2 +-
 t/t5000-tar-tree.sh           | 28 ++++++++++---
 4 files changed, 100 insertions(+), 28 deletions(-)

Range-Diff vs. v3:
-:  ---------- > 1:  67369ed452 archive: update format documentation
1:  73ccd190bd = 2:  6a7cce50ef archive: rename archiver data field to filter_command
2:  352cff7163 = 3:  c86e82bee8 archive-tar: factor out write_block()
3:  4e7cf97631 ! 4:  6196b0e39d archive-tar: add internal gzip implementation
    @@ Commit message

      ## Documentation/git-archive.txt ##
     @@ Documentation/git-archive.txt: tar.<format>.command::
    - 	format is given.
    + 	to the command (e.g., `-9`).
      +
    - The "tar.gz" and "tgz" formats are defined automatically and default to
    --`gzip -cn`. You may override them with custom commands.
    -+`gzip -cn`. You may override them with custom commands. An internal gzip
    -+implementation can be used by specifying the value `git archive gzip`.
    + The `tar.gz` and `tgz` formats are defined automatically and use the
    +-command `gzip -cn` by default.
    ++command `gzip -cn` by default. An internal gzip implementation can be
    ++used by specifying the value `git archive gzip`.

      tar.<format>.remote::
    - 	If true, enable `<format>` for use by remote clients via
    + 	If true, enable the format for use by remote clients via

      ## archive-tar.c ##
     @@ archive-tar.c: static int write_tar_filter_archive(const struct archiver *ar,
4:  cb2bbe9f6d < -:  ---------- archive-tar: use OS_CODE 3 (Unix) for internal gzip
-:  ---------- > 5:  19d286af6a archive-tar: use OS_CODE 3 (Unix) for internal gzip
5:  5dd968ced1 ! 6:  74683137af archive-tar: use internal gzip by default
    @@ Commit message

      ## Documentation/git-archive.txt ##
     @@ Documentation/git-archive.txt: tar.<format>.command::
    - 	format is given.
    + 	to the command (e.g., `-9`).
      +
    - The "tar.gz" and "tgz" formats are defined automatically and default to
    --`gzip -cn`. You may override them with custom commands. An internal gzip
    --implementation can be used by specifying the value `git archive gzip`.
    -+the magic value `git archive gzip`, which invokes an internal
    -+implementation of gzip. You may override them with custom commands.
    + The `tar.gz` and `tgz` formats are defined automatically and use the
    +-command `gzip -cn` by default. An internal gzip implementation can be
    +-used by specifying the value `git archive gzip`.
    ++magic command `git archive gzip` by default, which invokes an internal
    ++implementation of gzip.

      tar.<format>.remote::
    - 	If true, enable `<format>` for use by remote clients via
    + 	If true, enable the format for use by remote clients via

      ## archive-tar.c ##
     @@ archive-tar.c: void init_tar_archiver(void)
--
2.36.1

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 15, 2022

On the Git mailing list, René Scharfe wrote (reply to this):

Mention all formats in the --format section, use backtick quoting for
literal values throughout, clarify the description of the configuration
option.

Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
---
 Documentation/git-archive.txt | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/Documentation/git-archive.txt b/Documentation/git-archive.txt
index 56989a2f34..ff3f7b0344 100644
--- a/Documentation/git-archive.txt
+++ b/Documentation/git-archive.txt
@@ -34,10 +34,12 @@ OPTIONS
 -------

 --format=<fmt>::
-	Format of the resulting archive: 'tar' or 'zip'. If this option
+	Format of the resulting archive. Possible values are `tar`,
+	`zip`, `tar.gz`, `tgz`, and any format defined using the
+	configuration option `tar.<format>.command`. If `--format`
 	is not given, and the output file is specified, the format is
-	inferred from the filename if possible (e.g. writing to "foo.zip"
-	makes the output to be in the zip format). Otherwise the output
+	inferred from the filename if possible (e.g. writing to `foo.zip`
+	makes the output to be in the `zip` format). Otherwise the output
 	format is `tar`.

 -l::
@@ -143,17 +145,15 @@ tar.<format>.command::
 	is executed using the shell with the generated tar file on its
 	standard input, and should produce the final output on its
 	standard output. Any compression-level options will be passed
-	to the command (e.g., "-9"). An output file with the same
-	extension as `<format>` will be use this format if no other
-	format is given.
+	to the command (e.g., `-9`).
 +
-The "tar.gz" and "tgz" formats are defined automatically and default to
-`gzip -cn`. You may override them with custom commands.
+The `tar.gz` and `tgz` formats are defined automatically and use the
+command `gzip -cn` by default.

 tar.<format>.remote::
-	If true, enable `<format>` for use by remote clients via
+	If true, enable the format for use by remote clients via
 	linkgit:git-upload-archive[1]. Defaults to false for
-	user-defined formats, but true for the "tar.gz" and "tgz"
+	user-defined formats, but true for the `tar.gz` and `tgz`
 	formats.

 [[ATTRIBUTES]]
--
2.36.1

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 15, 2022

On the Git mailing list, René Scharfe wrote (reply to this):

The void pointer "data" in struct archiver is only used to store filter
commands to pass tar archives to, like gzip.  Rename it accordingly and
also turn it into a char pointer to document the fact that it's a string
reference.

Signed-off-by: René Scharfe <l.s.r@web.de>
---
 archive-tar.c | 10 +++++-----
 archive.h     |  2 +-
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index 042feb66d2..2717e34a1d 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -383,8 +383,8 @@ static int tar_filter_config(const char *var, const char *value, void *data)
 	if (!strcmp(type, "command")) {
 		if (!value)
 			return config_error_nonbool(var);
-		free(ar->data);
-		ar->data = xstrdup(value);
+		free(ar->filter_command);
+		ar->filter_command = xstrdup(value);
 		return 0;
 	}
 	if (!strcmp(type, "remote")) {
@@ -432,10 +432,10 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	struct child_process filter = CHILD_PROCESS_INIT;
 	int r;

-	if (!ar->data)
+	if (!ar->filter_command)
 		BUG("tar-filter archiver called with no filter defined");

-	strbuf_addstr(&cmd, ar->data);
+	strbuf_addstr(&cmd, ar->filter_command);
 	if (args->compression_level >= 0)
 		strbuf_addf(&cmd, " -%d", args->compression_level);

@@ -478,7 +478,7 @@ void init_tar_archiver(void)
 	git_config(git_tar_config, NULL);
 	for (i = 0; i < nr_tar_filters; i++) {
 		/* omit any filters that never had a command configured */
-		if (tar_filters[i]->data)
+		if (tar_filters[i]->filter_command)
 			register_archiver(tar_filters[i]);
 	}
 }
diff --git a/archive.h b/archive.h
index 49fab71aaf..08bed3ed3a 100644
--- a/archive.h
+++ b/archive.h
@@ -43,7 +43,7 @@ struct archiver {
 	const char *name;
 	int (*write_archive)(const struct archiver *, struct archiver_args *);
 	unsigned flags;
-	void *data;
+	char *filter_command;
 };
 void register_archiver(struct archiver *);

--
2.36.1

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 15, 2022

On the Git mailing list, René Scharfe wrote (reply to this):

All tar archive writes have the same size and are done to the same file
descriptor.  Move them to a common function, write_block(), to reduce
code duplication and make it easy to change the destination.

Original-patch-by: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
---
 archive-tar.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/archive-tar.c b/archive-tar.c
index 2717e34a1d..4e6a3deb80 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -38,11 +38,16 @@ static int write_tar_filter_archive(const struct archiver *ar,
 #define USTAR_MAX_MTIME 077777777777ULL
 #endif

+static void write_block(const void *buf)
+{
+	write_or_die(1, buf, BLOCKSIZE);
+}
+
 /* writes out the whole block, but only if it is full */
 static void write_if_needed(void)
 {
 	if (offset == BLOCKSIZE) {
-		write_or_die(1, block, BLOCKSIZE);
+		write_block(block);
 		offset = 0;
 	}
 }
@@ -66,7 +71,7 @@ static void do_write_blocked(const void *data, unsigned long size)
 		write_if_needed();
 	}
 	while (size >= BLOCKSIZE) {
-		write_or_die(1, buf, BLOCKSIZE);
+		write_block(buf);
 		size -= BLOCKSIZE;
 		buf += BLOCKSIZE;
 	}
@@ -101,10 +106,10 @@ static void write_trailer(void)
 {
 	int tail = BLOCKSIZE - offset;
 	memset(block + offset, 0, tail);
-	write_or_die(1, block, BLOCKSIZE);
+	write_block(block);
 	if (tail < 2 * RECORDSIZE) {
 		memset(block, 0, offset);
-		write_or_die(1, block, BLOCKSIZE);
+		write_block(block);
 	}
 }

--
2.36.1

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 15, 2022

On the Git mailing list, René Scharfe wrote (reply to this):

Git uses zlib for its own object store, but calls gzip when creating tgz
archives.  Add an option to perform the gzip compression for the latter
using zlib, without depending on the external gzip binary.

Plug it in by making write_block a function pointer and switching to a
compressing variant if the filter command has the magic value "git
archive gzip".  Does that indirection slow down tar creation?  Not
really, at least not in this test:

$ hyperfine -w3 -L rev HEAD,origin/main -p 'git checkout {rev} && make' \
'./git -C ../linux archive --format=tar HEAD # {rev}'
Benchmark #1: ./git -C ../linux archive --format=tar HEAD # HEAD
  Time (mean ± σ):      4.044 s ±  0.007 s    [User: 3.901 s, System: 0.137 s]
  Range (min … max):    4.038 s …  4.059 s    10 runs

Benchmark #2: ./git -C ../linux archive --format=tar HEAD # origin/main
  Time (mean ± σ):      4.047 s ±  0.009 s    [User: 3.903 s, System: 0.138 s]
  Range (min … max):    4.038 s …  4.066 s    10 runs

How does tgz creation perform?

$ hyperfine -w3 -L command 'gzip -cn','git archive gzip' \
'./git -c tar.tgz.command="{command}" -C ../linux archive --format=tgz HEAD'
Benchmark #1: ./git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD
  Time (mean ± σ):     20.404 s ±  0.006 s    [User: 23.943 s, System: 0.401 s]
  Range (min … max):   20.395 s … 20.414 s    10 runs

Benchmark #2: ./git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD
  Time (mean ± σ):     23.807 s ±  0.023 s    [User: 23.655 s, System: 0.145 s]
  Range (min … max):   23.782 s … 23.857 s    10 runs

Summary
  './git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD' ran
    1.17 ± 0.00 times faster than './git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD'

So the internal implementation takes 17% longer on the Linux repo, but
uses 2% less CPU time.  That's because the external gzip can run in
parallel on its own processor, while the internal one works sequentially
and avoids the inter-process communication overhead.

What are the benefits?  Only an internal sequential implementation can
offer this eco mode, and it allows avoiding the gzip(1) requirement.

This implementation uses the helper functions from our zlib.c instead of
the convenient gz* functions from zlib, because the latter doesn't give
the control over the generated gzip header that the next patch requires.

Original-patch-by: Rohit Ashiwal <rohit.ashiwal265@gmail.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
---
 Documentation/git-archive.txt |  3 ++-
 archive-tar.c                 | 45 ++++++++++++++++++++++++++++++++++-
 t/t5000-tar-tree.sh           | 16 +++++++++++++
 3 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-archive.txt b/Documentation/git-archive.txt
index ff3f7b0344..b2d1b63d31 100644
--- a/Documentation/git-archive.txt
+++ b/Documentation/git-archive.txt
@@ -148,7 +148,8 @@ tar.<format>.command::
 	to the command (e.g., `-9`).
 +
 The `tar.gz` and `tgz` formats are defined automatically and use the
-command `gzip -cn` by default.
+command `gzip -cn` by default. An internal gzip implementation can be
+used by specifying the value `git archive gzip`.

 tar.<format>.remote::
 	If true, enable the format for use by remote clients via
diff --git a/archive-tar.c b/archive-tar.c
index 4e6a3deb80..53d0ef685c 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -38,11 +38,13 @@ static int write_tar_filter_archive(const struct archiver *ar,
 #define USTAR_MAX_MTIME 077777777777ULL
 #endif

-static void write_block(const void *buf)
+static void tar_write_block(const void *buf)
 {
 	write_or_die(1, buf, BLOCKSIZE);
 }

+static void (*write_block)(const void *) = tar_write_block;
+
 /* writes out the whole block, but only if it is full */
 static void write_if_needed(void)
 {
@@ -430,6 +432,34 @@ static int write_tar_archive(const struct archiver *ar,
 	return err;
 }

+static git_zstream gzstream;
+static unsigned char outbuf[16384];
+
+static void tgz_deflate(int flush)
+{
+	while (gzstream.avail_in || flush == Z_FINISH) {
+		int status = git_deflate(&gzstream, flush);
+		if (!gzstream.avail_out || status == Z_STREAM_END) {
+			write_or_die(1, outbuf, gzstream.next_out - outbuf);
+			gzstream.next_out = outbuf;
+			gzstream.avail_out = sizeof(outbuf);
+			if (status == Z_STREAM_END)
+				break;
+		}
+		if (status != Z_OK && status != Z_BUF_ERROR)
+			die(_("deflate error (%d)"), status);
+	}
+}
+
+static void tgz_write_block(const void *data)
+{
+	gzstream.next_in = (void *)data;
+	gzstream.avail_in = BLOCKSIZE;
+	tgz_deflate(Z_NO_FLUSH);
+}
+
+static const char internal_gzip_command[] = "git archive gzip";
+
 static int write_tar_filter_archive(const struct archiver *ar,
 				    struct archiver_args *args)
 {
@@ -440,6 +470,19 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	if (!ar->filter_command)
 		BUG("tar-filter archiver called with no filter defined");

+	if (!strcmp(ar->filter_command, internal_gzip_command)) {
+		write_block = tgz_write_block;
+		git_deflate_init_gzip(&gzstream, args->compression_level);
+		gzstream.next_out = outbuf;
+		gzstream.avail_out = sizeof(outbuf);
+
+		r = write_tar_archive(ar, args);
+
+		tgz_deflate(Z_FINISH);
+		git_deflate_end(&gzstream);
+		return r;
+	}
+
 	strbuf_addstr(&cmd, ar->filter_command);
 	if (args->compression_level >= 0)
 		strbuf_addf(&cmd, " -%d", args->compression_level);
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 7f8d2ab0a7..9ac0ec67fe 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -374,6 +374,22 @@ test_expect_success GZIP 'remote tar.gz can be disabled' '
 		>remote.tar.gz
 '

+test_expect_success 'git archive --format=tgz (internal gzip)' '
+	test_config tar.tgz.command "git archive gzip" &&
+	git archive --format=tgz HEAD >internal_gzip.tgz
+'
+
+test_expect_success 'git archive --format=tar.gz (internal gzip)' '
+	test_config tar.tar.gz.command "git archive gzip" &&
+	git archive --format=tar.gz HEAD >internal_gzip.tar.gz &&
+	test_cmp_bin internal_gzip.tgz internal_gzip.tar.gz
+'
+
+test_expect_success GZIP 'extract tgz file (internal gzip)' '
+	gzip -d -c <internal_gzip.tgz >internal_gzip.tar &&
+	test_cmp_bin b.tar internal_gzip.tar
+'
+
 test_expect_success 'archive and :(glob)' '
 	git archive -v HEAD -- ":(glob)**/sh" >/dev/null 2>actual &&
 	cat >expect <<EOF &&
--
2.36.1

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 15, 2022

On the Git mailing list, René Scharfe wrote (reply to this):

gzip(1) encodes the OS it runs on in the 10th byte of its output. It
uses the following OS_CODE values according to its tailor.h [1]:

        0 - MS-DOS
        3 - UNIX
        5 - Atari ST
        6 - OS/2
       10 - TOPS-20
       11 - Windows NT

The gzip.exe that comes with Git for Windows uses OS_CODE 3 for some
reason, so this value is used on practically all supported platforms
when generating tgz archives using gzip(1).

Zlib uses a bigger set of values according to its zutil.h [2], aligned
with section 4.4.2 of the ZIP specification, APPNOTE.txt [3]:

         0 - MS-DOS
         1 - Amiga
         3 - UNIX
         4 - VM/CMS
         5 - Atari ST
         6 - OS/2
         7 - Macintosh
         8 - Z-System
        10 - Windows NT
        11 - MVS (OS/390 - Z/OS)
        13 - Acorn Risc
        16 - BeOS
        18 - OS/400
        19 - OS X (Darwin)

Thus the internal gzip implementation in archive-tar.c sets different
OS_CODE header values on major platforms Windows and macOS.  Git for
Windows uses its own zlib-based variant since v2.20.1 by default and
thus embeds OS_CODE 10 in tgz archives.

The tar archive for a commit is generated consistently on all systems
(by the same Git version).  The OS_CODE in the gzip header does not
influence extraction.  Avoid leaking OS information and make tgz
archives constistent and reproducable (with the same Git and libz
versions) by using OS_CODE 3 everywhere.

At least on macOS 12.4 this produces the same output as gzip(1) for the
examples I tried:

   # before
   $ git -c tar.tgz.command='git archive gzip' archive --format=tgz v2.36.0 | shasum
   3abbffb40b7c63cf9b7d91afc682f11682f80759  -

   # with this patch
   $ git -c tar.tgz.command='git archive gzip' archive --format=tgz v2.36.0 | shasum
   dc6dc6ba9636d522799085d0d77ab6a110bcc141  -

   $ git archive --format=tar v2.36.0 | gzip -cn | shasum
   dc6dc6ba9636d522799085d0d77ab6a110bcc141  -

[1] https://git.savannah.gnu.org/cgit/gzip.git/tree/tailor.h
[2] https://github.com/madler/zlib/blob/master/zutil.h
[3] https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT

Helped-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: René Scharfe <l.s.r@web.de>
---
 archive-tar.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/archive-tar.c b/archive-tar.c
index 53d0ef685c..efba78118b 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -463,6 +463,9 @@ static const char internal_gzip_command[] = "git archive gzip";
 static int write_tar_filter_archive(const struct archiver *ar,
 				    struct archiver_args *args)
 {
+#if ZLIB_VERNUM >= 0x1221
+	struct gz_header_s gzhead = { .os = 3 }; /* Unix, for reproducibility */
+#endif
 	struct strbuf cmd = STRBUF_INIT;
 	struct child_process filter = CHILD_PROCESS_INIT;
 	int r;
@@ -473,6 +476,10 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	if (!strcmp(ar->filter_command, internal_gzip_command)) {
 		write_block = tgz_write_block;
 		git_deflate_init_gzip(&gzstream, args->compression_level);
+#if ZLIB_VERNUM >= 0x1221
+		if (deflateSetHeader(&gzstream.z, &gzhead) != Z_OK)
+			BUG("deflateSetHeader() called too late");
+#endif
 		gzstream.next_out = outbuf;
 		gzstream.avail_out = sizeof(outbuf);

--
2.36.1

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 15, 2022

On the Git mailing list, René Scharfe wrote (reply to this):

Drop the dependency on gzip(1) and use our internal implementation to
create tar.gz and tgz files.

Signed-off-by: René Scharfe <l.s.r@web.de>
---
 Documentation/git-archive.txt |  4 ++--
 archive-tar.c                 |  4 ++--
 t/t5000-tar-tree.sh           | 32 ++++++++++++++++----------------
 3 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/Documentation/git-archive.txt b/Documentation/git-archive.txt
index b2d1b63d31..60c040988b 100644
--- a/Documentation/git-archive.txt
+++ b/Documentation/git-archive.txt
@@ -148,8 +148,8 @@ tar.<format>.command::
 	to the command (e.g., `-9`).
 +
 The `tar.gz` and `tgz` formats are defined automatically and use the
-command `gzip -cn` by default. An internal gzip implementation can be
-used by specifying the value `git archive gzip`.
+magic command `git archive gzip` by default, which invokes an internal
+implementation of gzip.

 tar.<format>.remote::
 	If true, enable the format for use by remote clients via
diff --git a/archive-tar.c b/archive-tar.c
index efba78118b..3d77e0f750 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -526,9 +526,9 @@ void init_tar_archiver(void)
 	int i;
 	register_archiver(&tar_archiver);

-	tar_filter_config("tar.tgz.command", "gzip -cn", NULL);
+	tar_filter_config("tar.tgz.command", internal_gzip_command, NULL);
 	tar_filter_config("tar.tgz.remote", "true", NULL);
-	tar_filter_config("tar.tar.gz.command", "gzip -cn", NULL);
+	tar_filter_config("tar.tar.gz.command", internal_gzip_command, NULL);
 	tar_filter_config("tar.tar.gz.remote", "true", NULL);
 	git_config(git_tar_config, NULL);
 	for (i = 0; i < nr_tar_filters; i++) {
diff --git a/t/t5000-tar-tree.sh b/t/t5000-tar-tree.sh
index 9ac0ec67fe..1a68e89a55 100755
--- a/t/t5000-tar-tree.sh
+++ b/t/t5000-tar-tree.sh
@@ -339,21 +339,21 @@ test_expect_success 'only enabled filters are available remotely' '
 	test_cmp_bin remote.bar config.bar
 '

-test_expect_success GZIP 'git archive --format=tgz' '
+test_expect_success 'git archive --format=tgz' '
 	git archive --format=tgz HEAD >j.tgz
 '

-test_expect_success GZIP 'git archive --format=tar.gz' '
+test_expect_success 'git archive --format=tar.gz' '
 	git archive --format=tar.gz HEAD >j1.tar.gz &&
 	test_cmp_bin j.tgz j1.tar.gz
 '

-test_expect_success GZIP 'infer tgz from .tgz filename' '
+test_expect_success 'infer tgz from .tgz filename' '
 	git archive --output=j2.tgz HEAD &&
 	test_cmp_bin j.tgz j2.tgz
 '

-test_expect_success GZIP 'infer tgz from .tar.gz filename' '
+test_expect_success 'infer tgz from .tar.gz filename' '
 	git archive --output=j3.tar.gz HEAD &&
 	test_cmp_bin j.tgz j3.tar.gz
 '
@@ -363,31 +363,31 @@ test_expect_success GZIP 'extract tgz file' '
 	test_cmp_bin b.tar j.tar
 '

-test_expect_success GZIP 'remote tar.gz is allowed by default' '
+test_expect_success 'remote tar.gz is allowed by default' '
 	git archive --remote=. --format=tar.gz HEAD >remote.tar.gz &&
 	test_cmp_bin j.tgz remote.tar.gz
 '

-test_expect_success GZIP 'remote tar.gz can be disabled' '
+test_expect_success 'remote tar.gz can be disabled' '
 	git config tar.tar.gz.remote false &&
 	test_must_fail git archive --remote=. --format=tar.gz HEAD \
 		>remote.tar.gz
 '

-test_expect_success 'git archive --format=tgz (internal gzip)' '
-	test_config tar.tgz.command "git archive gzip" &&
-	git archive --format=tgz HEAD >internal_gzip.tgz
+test_expect_success GZIP 'git archive --format=tgz (external gzip)' '
+	test_config tar.tgz.command "gzip -cn" &&
+	git archive --format=tgz HEAD >external_gzip.tgz
 '

-test_expect_success 'git archive --format=tar.gz (internal gzip)' '
-	test_config tar.tar.gz.command "git archive gzip" &&
-	git archive --format=tar.gz HEAD >internal_gzip.tar.gz &&
-	test_cmp_bin internal_gzip.tgz internal_gzip.tar.gz
+test_expect_success GZIP 'git archive --format=tar.gz (external gzip)' '
+	test_config tar.tar.gz.command "gzip -cn" &&
+	git archive --format=tar.gz HEAD >external_gzip.tar.gz &&
+	test_cmp_bin external_gzip.tgz external_gzip.tar.gz
 '

-test_expect_success GZIP 'extract tgz file (internal gzip)' '
-	gzip -d -c <internal_gzip.tgz >internal_gzip.tar &&
-	test_cmp_bin b.tar internal_gzip.tar
+test_expect_success GZIP 'extract tgz file (external gzip)' '
+	gzip -d -c <external_gzip.tgz >external_gzip.tar &&
+	test_cmp_bin b.tar external_gzip.tar
 '

 test_expect_success 'archive and :(glob)' '
--
2.36.1

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 15, 2022

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):

On Wed, Jun 15 2022, René Scharfe wrote:

> Git uses zlib for its own object store, but calls gzip when creating tgz
> archives.  Add an option to perform the gzip compression for the latter
> using zlib, without depending on the external gzip binary.
>
> Plug it in by making write_block a function pointer and switching to a
> compressing variant if the filter command has the magic value "git
> archive gzip".  Does that indirection slow down tar creation?  Not
> really, at least not in this test:
>
> $ hyperfine -w3 -L rev HEAD,origin/main -p 'git checkout {rev} && make' \
> './git -C ../linux archive --format=tar HEAD # {rev}'

Shameless plug: https://lore.kernel.org/git/211201.86r1aw9gbd.gmgdl@evledraar.gmail.com/

I.e. a "hyperfine" wrapper I wrote to make exactly this sort of thing
easier.

You'll find that you need less or no --warmup with it, since the
checkout flip-flopping and re-making (and resulting FS and other cache
eviction) will go away, as we'll use different "git worktree"'s for the
two "rev".

(Also, putting those on a ramdisk really helps)

> Benchmark #1: ./git -C ../linux archive --format=tar HEAD # HEAD
>   Time (mean ± σ):      4.044 s ±  0.007 s    [User: 3.901 s, System: 0.137 s]
>   Range (min … max):    4.038 s …  4.059 s    10 runs
>
> Benchmark #2: ./git -C ../linux archive --format=tar HEAD # origin/main
>   Time (mean ± σ):      4.047 s ±  0.009 s    [User: 3.903 s, System: 0.138 s]
>   Range (min … max):    4.038 s …  4.066 s    10 runs
>
> How does tgz creation perform?
>
> $ hyperfine -w3 -L command 'gzip -cn','git archive gzip' \
> './git -c tar.tgz.command="{command}" -C ../linux archive --format=tgz HEAD'
> Benchmark #1: ./git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD
>   Time (mean ± σ):     20.404 s ±  0.006 s    [User: 23.943 s, System: 0.401 s]
>   Range (min … max):   20.395 s … 20.414 s    10 runs
>
> Benchmark #2: ./git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD
>   Time (mean ± σ):     23.807 s ±  0.023 s    [User: 23.655 s, System: 0.145 s]
>   Range (min … max):   23.782 s … 23.857 s    10 runs
>
> Summary
>   './git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD' ran
>     1.17 ± 0.00 times faster than './git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD'
>
> So the internal implementation takes 17% longer on the Linux repo, but
> uses 2% less CPU time.  That's because the external gzip can run in
> parallel on its own processor, while the internal one works sequentially
> and avoids the inter-process communication overhead.
>
> What are the benefits?  Only an internal sequential implementation can
> offer this eco mode, and it allows avoiding the gzip(1) requirement.

I had been keeping one eye on this series, but didn't look at it in any
detail.

I found this after reading 6/6, which I think in any case could really
use some "why" summary, which seems to mostly be covered here.

I.e. it's unclear if the "drop the dependency on gzip(1)" in 6/6 is a
reference to the GZIP test dependency, or that our users are unlikely to
have "gzip(1)" on their systems.

If it's the latter I'd much rather (as a user) take a 17% wallclock
improvement over a 2% cost of CPU. I mostly care about my own time, not
that of the CPU.

Can't we have our 6/6 cake much easier and eat it too by learning a
"fallback" mode, i.e. we try to invoke gzip, and if that doesn't work
use the "internal" one?

Re the "eco mode": I also wonder how much of the overhead you're seeing
for both that 17% and 2% would go away if you pin both processes to the
same CPU, I can't recall the command offhand, but IIRC taskset or
numactl can do that. I.e. is this really measuring IPC overhead, or
I-CPU overhead on your system?

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 15, 2022

User Ævar Arnfjörð Bjarmason <avarab@gmail.com> has been added to the cc: list.

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 16, 2022

On the Git mailing list, René Scharfe wrote (reply to this):

Am 15.06.22 um 22:32 schrieb Ævar Arnfjörð Bjarmason:
>
> On Wed, Jun 15 2022, René Scharfe wrote:
>
>> Git uses zlib for its own object store, but calls gzip when creating tgz
>> archives.  Add an option to perform the gzip compression for the latter
>> using zlib, without depending on the external gzip binary.
>>
>> Plug it in by making write_block a function pointer and switching to a
>> compressing variant if the filter command has the magic value "git
>> archive gzip".  Does that indirection slow down tar creation?  Not
>> really, at least not in this test:
>>
>> $ hyperfine -w3 -L rev HEAD,origin/main -p 'git checkout {rev} && make' \
>> './git -C ../linux archive --format=tar HEAD # {rev}'
>
> Shameless plug: https://lore.kernel.org/git/211201.86r1aw9gbd.gmgdl@evledraar.gmail.com/
>
> I.e. a "hyperfine" wrapper I wrote to make exactly this sort of thing
> easier.
>
> You'll find that you need less or no --warmup with it, since the
> checkout flip-flopping and re-making (and resulting FS and other cache
> eviction) will go away, as we'll use different "git worktree"'s for the
> two "rev".

OK, but requiring hyperfine alone is burden enough for reviewers.

I had a try anyway and it took me a while to realize that git-hyperfine
requires setting the Git config option hyperfine.run-dir band that it
ignores it on my system.  Had to hard-code it in the script.

> (Also, putting those on a ramdisk really helps)
>
>> Benchmark #1: ./git -C ../linux archive --format=tar HEAD # HEAD
>>   Time (mean ± σ):      4.044 s ±  0.007 s    [User: 3.901 s, System: 0.137 s]
>>   Range (min … max):    4.038 s …  4.059 s    10 runs
>>
>> Benchmark #2: ./git -C ../linux archive --format=tar HEAD # origin/main
>>   Time (mean ± σ):      4.047 s ±  0.009 s    [User: 3.903 s, System: 0.138 s]
>>   Range (min … max):    4.038 s …  4.066 s    10 runs
>>
>> How does tgz creation perform?
>>
>> $ hyperfine -w3 -L command 'gzip -cn','git archive gzip' \
>> './git -c tar.tgz.command="{command}" -C ../linux archive --format=tgz HEAD'
>> Benchmark #1: ./git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD
>>   Time (mean ± σ):     20.404 s ±  0.006 s    [User: 23.943 s, System: 0.401 s]
>>   Range (min … max):   20.395 s … 20.414 s    10 runs
>>
>> Benchmark #2: ./git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD
>>   Time (mean ± σ):     23.807 s ±  0.023 s    [User: 23.655 s, System: 0.145 s]
>>   Range (min … max):   23.782 s … 23.857 s    10 runs
>>
>> Summary
>>   './git -c tar.tgz.command="gzip -cn" -C ../linux archive --format=tgz HEAD' ran
>>     1.17 ± 0.00 times faster than './git -c tar.tgz.command="git archive gzip" -C ../linux archive --format=tgz HEAD'
>>
>> So the internal implementation takes 17% longer on the Linux repo, but
>> uses 2% less CPU time.  That's because the external gzip can run in
>> parallel on its own processor, while the internal one works sequentially
>> and avoids the inter-process communication overhead.
>>
>> What are the benefits?  Only an internal sequential implementation can
>> offer this eco mode, and it allows avoiding the gzip(1) requirement.
>
> I had been keeping one eye on this series, but didn't look at it in any
> detail.
>
> I found this after reading 6/6, which I think in any case could really
> use some "why" summary, which seems to mostly be covered here.
>
> I.e. it's unclear if the "drop the dependency on gzip(1)" in 6/6 is a
> reference to the GZIP test dependency, or that our users are unlikely to
> have "gzip(1)" on their systems.

It's to avoid a run dependency; the build/test dependency remains.

> If it's the latter I'd much rather (as a user) take a 17% wallclock
> improvement over a 2% cost of CPU. I mostly care about my own time, not
> that of the CPU.

Understandable, and you can set tar.tgz.command='gzip -cn' to get the
old behavior.  Saving energy is a better default, though.

The runtime in the real world probably includes lots more I/O time.  The
tests above are repeated and warmed up to get consistent measurements,
but big repos are probably not fully kept in memory like that.

> Can't we have our 6/6 cake much easier and eat it too by learning a
> "fallback" mode, i.e. we try to invoke gzip, and if that doesn't work
> use the "internal" one?

Interesting idea, but I think the existing config option suffices.  E.g.
a distro could set it in the system-wide config file if/when gzip is
installed.

> Re the "eco mode": I also wonder how much of the overhead you're seeing
> for both that 17% and 2% would go away if you pin both processes to the
> same CPU, I can't recall the command offhand, but IIRC taskset or
> numactl can do that. I.e. is this really measuring IPC overhead, or
> I-CPU overhead on your system?

I'd expect that running git archive and gzip at the same CPU core takes
more wall-clock time than using zlib because inflating the object files
and deflating the archive are done sequentially in both scenarios.
Can't test it on macOS because it doesn't offer a way to pin programs to
a certain core, but e.g. someone with access to a Linux system can check
that using taskset(1).

René

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 24, 2022

On the Git mailing list, Ævar Arnfjörð Bjarmason wrote (reply to this):

On Thu, Jun 16 2022, René Scharfe wrote:

> Am 15.06.22 um 22:32 schrieb Ævar Arnfjörð Bjarmason:
>> [...]
> Understandable, and you can set tar.tgz.command='gzip -cn' to get the
> old behavior.  Saving energy is a better default, though.

I disagree with that in general, a big reason for why git won out over
other VCS's is that it wasn't as slow. I think we should primarily be
interested in the time a user might end up staring at the screen.

I understand the concern to have "git archive" just work, e.g. if you
uninstall gzip(1) (although that seems rather obscure, but perhaps this
is for more minimal setups).

I don't think saving energy is a virtue, *maybe* it is, but maybe your
computer is powered by hydro, solar or nuclear instead of coal, so even
if we're taking global energy policy into account for changes to git
it's highly context dependant.

In any case, this is also true for pretty much any other git command
that might spawn processes or threads, e.g. "git grep":

	$ hyperfine -w3 -L cpus 0,0-7 'taskset --cpu-list {cpus} ./git grep foo.*bar' -r 10
	Benchmark 1: taskset --cpu-list 0 ./git grep foo.*bar
	  Time (mean ± σ):      39.3 ms ±   1.2 ms    [User: 20.0 ms, System: 18.6 ms]
	  Range (min … max):    38.2 ms …  41.8 ms    10 runs
	
	Benchmark 2: taskset --cpu-list 0-7 ./git grep foo.*bar
	  Time (mean ± σ):      28.1 ms ±   1.3 ms    [User: 43.5 ms, System: 51.0 ms]
	  Range (min … max):    26.6 ms …  31.2 ms    10 runs
	
	Summary
	  'taskset --cpu-list 0-7 ./git grep foo.*bar' ran
	    1.40 ± 0.08 times faster than 'taskset --cpu-list 0 ./git grep foo.*bar'

Here we use less than 1/2 the user/system time when I pin it to 1 cpu,
but we're 40% slower.

So this is a bit of a digression, but this particular thing seems much
better left to the OS or your hardware's CPU throttling policy. To the
extent that we care perhaps more fitting would be to have a global
core.wrapper-cmd option or something, so you could pass all git commands
through "taskset" (or your local equivalent), or just use shell aliases.

> The runtime in the real world probably includes lots more I/O time.  The
> tests above are repeated and warmed up to get consistent measurements,
> but big repos are probably not fully kept in memory like that.
>
>> Can't we have our 6/6 cake much easier and eat it too by learning a
>> "fallback" mode, i.e. we try to invoke gzip, and if that doesn't work
>> use the "internal" one?
>
> Interesting idea, but I think the existing config option suffices.  E.g.
> a distro could set it in the system-wide config file if/when gzip is
> installed.

I think in practice distros are unlikely to have such triggers for
"package X is installed, let's set config Y". I mean, e.g. Debian can do
that with its packaging system, but it's expecting a lot. Why not flip
the default depending on if start_command() fails?

>> Re the "eco mode": I also wonder how much of the overhead you're seeing
>> for both that 17% and 2% would go away if you pin both processes to the
>> same CPU, I can't recall the command offhand, but IIRC taskset or
>> numactl can do that. I.e. is this really measuring IPC overhead, or
>> I-CPU overhead on your system?
>
> I'd expect that running git archive and gzip at the same CPU core takes
> more wall-clock time than using zlib because inflating the object files
> and deflating the archive are done sequentially in both scenarios.
> Can't test it on macOS because it doesn't offer a way to pin programs to
> a certain core, but e.g. someone with access to a Linux system can check
> that using taskset(1).

Here's a benchmark, this is your hyperfine command, just with taskset
added. It's an 8-core box, so 0-7 is "all CPUs" (I think...):

	hyperfine -w3 \
		-L cpus 0,0-7 \
		-L command 'gzip -cn','git archive gzip' \
		'taskset --cpu-list {cpus} ./git -c tar.tgz.command="{command}" archive --format=tgz HEAD'

Which gives me:

	Benchmark 1: taskset --cpu-list 0 ./git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD
	  Time (mean ± σ):      1.561 s ±  0.029 s    [User: 1.503 s, System: 0.058 s]
	  Range (min … max):    1.522 s …  1.622 s    10 runs
	 
	Benchmark 2: taskset --cpu-list 0-7 ./git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD
	  Time (mean ± σ):      1.337 s ±  0.029 s    [User: 1.535 s, System: 0.075 s]
	  Range (min … max):    1.298 s …  1.388 s    10 runs
	 
	Benchmark 3: taskset --cpu-list 0 ./git -c tar.tgz.command="git archive gzip" archive --format=tgz HEAD
	  Time (mean ± σ):      1.493 s ±  0.032 s    [User: 1.453 s, System: 0.040 s]
	  Range (min … max):    1.462 s …  1.572 s    10 runs
	 
	Benchmark 4: taskset --cpu-list 0-7 ./git -c tar.tgz.command="git archive gzip" archive --format=tgz HEAD
	  Time (mean ± σ):      1.503 s ±  0.026 s    [User: 1.466 s, System: 0.036 s]
	  Range (min … max):    1.469 s …  1.542 s    10 runs
	 
	Summary
	  'taskset --cpu-list 0-7 ./git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD' ran
	    1.12 ± 0.03 times faster than 'taskset --cpu-list 0 ./git -c tar.tgz.command="git archive gzip" archive --format=tgz HEAD'
	    1.12 ± 0.03 times faster than 'taskset --cpu-list 0-7 ./git -c tar.tgz.command="git archive gzip" archive --format=tgz HEAD'
	    1.17 ± 0.03 times faster than 'taskset --cpu-list 0 ./git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD'

Whic I think should control for the IPC overhead v.s. the advantage of
multicore. I.e. we're faster with "gzip -cn" on multicore, but the
internal implementation has an advantage when it comes to 

I tried out the fallback method, memory leaks aside (needs to do a
proper cleanup) this seems to work. Most of the diff is moving the
existing code into a function:

diff --git a/archive-tar.c b/archive-tar.c
index 3d77e0f7509..a1b08812ee3 100644
--- a/archive-tar.c
+++ b/archive-tar.c
@@ -458,14 +458,36 @@ static void tgz_write_block(const void *data)
 	tgz_deflate(Z_NO_FLUSH);
 }
 
+static const char default_gzip_command[] = "gzip -cn";
 static const char internal_gzip_command[] = "git archive gzip";
 
-static int write_tar_filter_archive(const struct archiver *ar,
-				    struct archiver_args *args)
+static int do_internal_gzip(const struct archiver *ar,
+			    struct archiver_args *args)
 {
 #if ZLIB_VERNUM >= 0x1221
 	struct gz_header_s gzhead = { .os = 3 }; /* Unix, for reproducibility */
 #endif
+	int r;
+
+	write_block = tgz_write_block;
+	git_deflate_init_gzip(&gzstream, args->compression_level);
+#if ZLIB_VERNUM >= 0x1221
+	if (deflateSetHeader(&gzstream.z, &gzhead) != Z_OK)
+		BUG("deflateSetHeader() called too late");
+#endif
+	gzstream.next_out = outbuf;
+	gzstream.avail_out = sizeof(outbuf);
+
+	r = write_tar_archive(ar, args);
+
+	tgz_deflate(Z_FINISH);
+	git_deflate_end(&gzstream);
+	return r;
+}
+
+static int write_tar_filter_archive(const struct archiver *ar,
+				    struct archiver_args *args)
+{
 	struct strbuf cmd = STRBUF_INIT;
 	struct child_process filter = CHILD_PROCESS_INIT;
 	int r;
@@ -473,33 +495,24 @@ static int write_tar_filter_archive(const struct archiver *ar,
 	if (!ar->filter_command)
 		BUG("tar-filter archiver called with no filter defined");
 
-	if (!strcmp(ar->filter_command, internal_gzip_command)) {
-		write_block = tgz_write_block;
-		git_deflate_init_gzip(&gzstream, args->compression_level);
-#if ZLIB_VERNUM >= 0x1221
-		if (deflateSetHeader(&gzstream.z, &gzhead) != Z_OK)
-			BUG("deflateSetHeader() called too late");
-#endif
-		gzstream.next_out = outbuf;
-		gzstream.avail_out = sizeof(outbuf);
-
-		r = write_tar_archive(ar, args);
-
-		tgz_deflate(Z_FINISH);
-		git_deflate_end(&gzstream);
-		return r;
-	}
+	if (!strcmp(ar->filter_command, internal_gzip_command))
+		return do_internal_gzip(ar, args);
 
 	strbuf_addstr(&cmd, ar->filter_command);
 	if (args->compression_level >= 0)
 		strbuf_addf(&cmd, " -%d", args->compression_level);
 
-	strvec_push(&filter.args, cmd.buf);
-	filter.use_shell = 1;
+	strvec_split(&filter.args, cmd.buf);
 	filter.in = -1;
 
-	if (start_command(&filter) < 0)
+	if (start_command(&filter) < 0) {
+		if (!strcmp(ar->filter_command, default_gzip_command)) {
+			warning_errno(_("could not start '%s' filter, falling back to '%s'"),
+				      ar->filter_command, internal_gzip_command);
+			return do_internal_gzip(ar, args);
+		}
 		die_errno(_("unable to start '%s' filter"), cmd.buf);
+	}
 	close(1);
 	if (dup2(filter.in, 1) < 0)
 		die_errno(_("unable to redirect descriptor"));

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 24, 2022

On the Git mailing list, René Scharfe wrote (reply to this):

Am 24.06.22 um 13:13 schrieb Ævar Arnfjörð Bjarmason:
>
> On Thu, Jun 16 2022, René Scharfe wrote:
>
>> Am 15.06.22 um 22:32 schrieb Ævar Arnfjörð Bjarmason:
>>> [...]
>> Understandable, and you can set tar.tgz.command='gzip -cn' to get the
>> old behavior.  Saving energy is a better default, though.
>
> I disagree with that in general, a big reason for why git won out over
> other VCS's is that it wasn't as slow. I think we should primarily be
> interested in the time a user might end up staring at the screen.
>
> I understand the concern to have "git archive" just work, e.g. if you
> uninstall gzip(1) (although that seems rather obscure, but perhaps this
> is for more minimal setups).

The previous attempt came from/via Git on Windows.

> I don't think saving energy is a virtue, *maybe* it is, but maybe your
> computer is powered by hydro, solar or nuclear instead of coal, so even
> if we're taking global energy policy into account for changes to git
> it's highly context dependant.

Or a device runs on battery power and saving energy keeps it running a
bit longer.  Or it's housed in a data center and saving energy helps
reduce cooling requirements.

> In any case, this is also true for pretty much any other git command
> that might spawn processes or threads, e.g. "git grep":
>
> 	$ hyperfine -w3 -L cpus 0,0-7 'taskset --cpu-list {cpus} ./git grep foo.*bar' -r 10
> 	Benchmark 1: taskset --cpu-list 0 ./git grep foo.*bar
> 	  Time (mean ± σ):      39.3 ms ±   1.2 ms    [User: 20.0 ms, System: 18.6 ms]
> 	  Range (min … max):    38.2 ms …  41.8 ms    10 runs
>
> 	Benchmark 2: taskset --cpu-list 0-7 ./git grep foo.*bar
> 	  Time (mean ± σ):      28.1 ms ±   1.3 ms    [User: 43.5 ms, System: 51.0 ms]
> 	  Range (min … max):    26.6 ms …  31.2 ms    10 runs
>
> 	Summary
> 	  'taskset --cpu-list 0-7 ./git grep foo.*bar' ran
> 	    1.40 ± 0.08 times faster than 'taskset --cpu-list 0 ./git grep foo.*bar'
>
> Here we use less than 1/2 the user/system time when I pin it to 1 cpu,
> but we're 40% slower.
>
> So this is a bit of a digression, but this particular thing seems much
> better left to the OS or your hardware's CPU throttling policy. To the
> extent that we care perhaps more fitting would be to have a global
> core.wrapper-cmd option or something, so you could pass all git commands
> through "taskset" (or your local equivalent), or just use shell aliases.

Not sure what conclusion to draw from these numbers.  Perhaps that
computation is not the bottleneck here (increasing the number of cores by
700% increases speed only by 40%)?  That coordination overhead makes up a
big percentage and there might be room for improvement/tuning?

In any case, I agree we should leave scheduling decisions at runtime to
the OS.

>> The runtime in the real world probably includes lots more I/O time.  The
>> tests above are repeated and warmed up to get consistent measurements,
>> but big repos are probably not fully kept in memory like that.

On top of that I guess only few people create tgz files at all.  Most of
them I would expect to be created automatically (and cached) by sites
like kernel.org.  So I imagine people rather create tar.xz, tar.zst or
zip archives these days.  Or use git at both ends (push/pull), as they
should. ;-)  I have no data to support this guess, though.

But yeah, the tradeoff sounds a bit weird: Give 17% duration, get 2% CPU
time back -- sounds like a ripoff.  In your example below it's 12%
longer duration for 5% saved CPU time, which sounds a bit better, but
still not terribly attractive.

Look at it from a different angle: This basic sequential implementation
is better for non-interactive tgz creation due to its slightly lower
CPU usage, which we cannot achieve with any parallel process setup.
It's easier to deploy because it doesn't need gzip.  Its runtime hit
isn't *that* hard, and people interested primarily in speed should
parallelize the expensive part, deflate, not run the cheap tar creation
parallel to a single-threaded deflate.  I.e. they should already run
pigz (https://zlib.net/pigz/).

$ hyperfine -L gz gzip,pigz -w3 'git -C ../linux archive --format=tar HEAD | {gz} -cn'
Benchmark 1: git -C ../linux archive --format=tar HEAD | gzip -cn
  Time (mean ± σ):     20.764 s ±  0.007 s    [User: 24.119 s, System: 0.606 s]
  Range (min … max):   20.758 s … 20.781 s    10 runs

Benchmark 2: git -C ../linux archive --format=tar HEAD | pigz -cn
  Time (mean ± σ):      6.077 s ±  0.023 s    [User: 29.708 s, System: 1.599 s]
  Range (min … max):    6.037 s …  6.125 s    10 runs

Summary
  'git -C ../linux archive --format=tar HEAD | pigz -cn' ran
    3.42 ± 0.01 times faster than 'git -C ../linux archive --format=tar HEAD | gzip -cn'

>>> Can't we have our 6/6 cake much easier and eat it too by learning a
>>> "fallback" mode, i.e. we try to invoke gzip, and if that doesn't work
>>> use the "internal" one?
>>
>> Interesting idea, but I think the existing config option suffices.  E.g.
>> a distro could set it in the system-wide config file if/when gzip is
>> installed.
>
> I think in practice distros are unlikely to have such triggers for
> "package X is installed, let's set config Y". I mean, e.g. Debian can do
> that with its packaging system, but it's expecting a lot.

I don't *expect* any reaction either way, but packagers *can* go with a
custom config if they see the need.

> Why not flip
> the default depending on if start_command() fails?

Because it's harder to test and support due to its more complicated
behavior, and I don't see why it would be needed.

>>> Re the "eco mode": I also wonder how much of the overhead you're seeing
>>> for both that 17% and 2% would go away if you pin both processes to the
>>> same CPU, I can't recall the command offhand, but IIRC taskset or
>>> numactl can do that. I.e. is this really measuring IPC overhead, or
>>> I-CPU overhead on your system?
>>
>> I'd expect that running git archive and gzip at the same CPU core takes
>> more wall-clock time than using zlib because inflating the object files
>> and deflating the archive are done sequentially in both scenarios.
>> Can't test it on macOS because it doesn't offer a way to pin programs to
>> a certain core, but e.g. someone with access to a Linux system can check
>> that using taskset(1).
>
> Here's a benchmark, this is your hyperfine command, just with taskset
> added. It's an 8-core box, so 0-7 is "all CPUs" (I think...):
>
> 	hyperfine -w3 \
> 		-L cpus 0,0-7 \
> 		-L command 'gzip -cn','git archive gzip' \
> 		'taskset --cpu-list {cpus} ./git -c tar.tgz.command="{command}" archive --format=tgz HEAD'
>
> Which gives me:
>
> 	Benchmark 1: taskset --cpu-list 0 ./git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD
> 	  Time (mean ± σ):      1.561 s ±  0.029 s    [User: 1.503 s, System: 0.058 s]
> 	  Range (min … max):    1.522 s …  1.622 s    10 runs
>
> 	Benchmark 2: taskset --cpu-list 0-7 ./git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD
> 	  Time (mean ± σ):      1.337 s ±  0.029 s    [User: 1.535 s, System: 0.075 s]
> 	  Range (min … max):    1.298 s …  1.388 s    10 runs
>
> 	Benchmark 3: taskset --cpu-list 0 ./git -c tar.tgz.command="git archive gzip" archive --format=tgz HEAD
> 	  Time (mean ± σ):      1.493 s ±  0.032 s    [User: 1.453 s, System: 0.040 s]
> 	  Range (min … max):    1.462 s …  1.572 s    10 runs
>
> 	Benchmark 4: taskset --cpu-list 0-7 ./git -c tar.tgz.command="git archive gzip" archive --format=tgz HEAD
> 	  Time (mean ± σ):      1.503 s ±  0.026 s    [User: 1.466 s, System: 0.036 s]
> 	  Range (min … max):    1.469 s …  1.542 s    10 runs
>
> 	Summary
> 	  'taskset --cpu-list 0-7 ./git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD' ran
> 	    1.12 ± 0.03 times faster than 'taskset --cpu-list 0 ./git -c tar.tgz.command="git archive gzip" archive --format=tgz HEAD'
> 	    1.12 ± 0.03 times faster than 'taskset --cpu-list 0-7 ./git -c tar.tgz.command="git archive gzip" archive --format=tgz HEAD'
> 	    1.17 ± 0.03 times faster than 'taskset --cpu-list 0 ./git -c tar.tgz.command="gzip -cn" archive --format=tgz HEAD'
>
> Whic I think should control for the IPC overhead v.s. the advantage of
> multicore. I.e. we're faster with "gzip -cn" on multicore, but the
> internal implementation has an advantage when it comes to

Right, #1, #3 and #4 all run sequentially, but #1 has the pipe overhead
to deal with as well, which adds 5 percentage points to its runtime.

René

@gitgitgadget
Copy link

gitgitgadget bot commented Jun 30, 2022

On the Git mailing list, Johannes Schindelin wrote (reply to this):

Hi René,

On Tue, 14 Jun 2022, René Scharfe wrote:

> Am 14.06.22 um 13:28 schrieb Johannes Schindelin:
> >
> > By the way, the main reason why I did not work more is that in
> > http://madler.net/pipermail/zlib-devel_madler.net/2019-December/003308.html,
> > Mark Adler (the zlib maintainer) announced that...
> >
> >> [...] There are many well-tested performance improvements in zlib
> >> waiting in the wings that will be incorporated over the next several
> >> months. [...]
> >
> > This was in December 2019. And now it's June 2022 and I kind of wonder
> > whether those promised improvements will still come.
> >
> > In the meantime, however, a viable alternative seems to have cropped up:
> > https://github.com/zlib-ng/zlib-ng. Essentially, it looks as if it is what
> > zlib should have become after above-quoted announcement.
> >
> > In particular the CPU intrinsics support (think MMX, SSE2/3, etc) seem to
> > be very interesting and I would not be completely surprised if building
> > Git with your patches and linking against zlib-ng would paint a very
> > favorable picture not only in terms of CPU time but also in terms of
> > wallclock time. Sadly, I have not been able to set aside time to look into
> > that angle, but maybe I can peak your interest?
> I was unable to preload zlib-ng using DYLD_INSERT_LIBRARIES on macOS
> 12.4 so far.  The included demo proggy looks impressive, though:
>
> $ hyperfine -w3 -L gzip gzip,../zlib-ng/minigzip "git -C ../linux archive --format=tar HEAD | {gzip} -c"
> Benchmark #1: git -C ../linux archive --format=tar HEAD | gzip -c
>   Time (mean ± σ):     20.424 s ±  0.006 s    [User: 23.964 s, System: 0.432 s]
>   Range (min … max):   20.414 s … 20.434 s    10 runs
>
> Benchmark #2: git -C ../linux archive --format=tar HEAD | ../zlib-ng/minigzip -c
>   Time (mean ± σ):     12.158 s ±  0.006 s    [User: 13.908 s, System: 0.376 s]
>   Range (min … max):   12.145 s … 12.166 s    10 runs
>
> Summary
>   'git -C ../linux archive --format=tar HEAD | ../zlib-ng/minigzip -c' ran
>     1.68 ± 0.00 times faster than 'git -C ../linux archive --format=tar HEAD | gzip -c'

Intriguing.

I finally managed to play around with building and packaging zlib-ng [*1*]
(since I want to use it as a drop-in replacement for zlib, I think it is
best to configure it with `--zlib-compat`, that way I do not have to
fiddle with any equivalent of `LD_PRELOAD`). Here are my numbers:

	zlib-ng: 14.409 s ± 0.209 s
	zlib:    26.843 s ± 0.636 s

These are pretty good, which made me think that they might actually even
help regular Git operations (because we zlib every loose object).

So I tried to `fast-import` some 2500 commits from linux.git into a fresh
repository, and the zlib-ng version takes ~51s and the zlib version takes
~58s. At first I thought that it might be noise, but the trend seems to be
steady. It's not a huge improvement, of course, but I think that might be
because most of the time is spent parsing.

I then tried to test the performance focusing on writing loose object, by
using p0008 (increasing the number of files from 50 to 1500 and
restricting it to fsyncMethod=none).

Unfortunately, the numbers are not really conclusive. I do see minor
speed-ups with zlib-ng, mostly, in the single digit percentages, though
occasionally in the other direction. In other words, there is no clear-cut
change, just a vague tendency. My guess: Git writes too small files (their
contents are of the form "$basedir$test_tick.$counter") and zlib-ng's
superior performance does not come to bear.

Still, for larger workloads, zlib-ng seems to offer a quite nice and
substantial performance improvement over zlib.

Ciao,
Dscho

Footnote *1*: https://github.com/msys2/MINGW-packages/compare/master...dscho:zlib-ng

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 1, 2022

On the Git mailing list, Johannes Schindelin wrote (reply to this):

Me again,

On Thu, 30 Jun 2022, Johannes Schindelin wrote:

> I finally managed to play around with building and packaging zlib-ng
> [*1*] (since I want to use it as a drop-in replacement for zlib, I think
> it is best to configure it with `--zlib-compat`, that way I do not have
> to fiddle with any equivalent of `LD_PRELOAD`). Here are my numbers:
>
> 	zlib-ng: 14.409 s ± 0.209 s
> 	zlib:    26.843 s ± 0.636 s
>
> These are pretty good, which made me think that they might actually even
> help regular Git operations (because we zlib every loose object).
>
> So I tried to `fast-import` some 2500 commits from linux.git into a fresh
> repository, and the zlib-ng version takes ~51s and the zlib version takes
> ~58s. At first I thought that it might be noise, but the trend seems to be
> steady. It's not a huge improvement, of course, but I think that might be
> because most of the time is spent parsing.
>
> I then tried to test the performance focusing on writing loose object, by
> using p0008 (increasing the number of files from 50 to 1500 and
> restricting it to fsyncMethod=none).
>
> Unfortunately, the numbers are not really conclusive. I do see minor
> speed-ups with zlib-ng, mostly, in the single digit percentages, though
> occasionally in the other direction. In other words, there is no clear-cut
> change, just a vague tendency. My guess: Git writes too small files (their
> contents are of the form "$basedir$test_tick.$counter") and zlib-ng's
> superior performance does not come to bear.
>
> Still, for larger workloads, zlib-ng seems to offer a quite nice and
> substantial performance improvement over zlib.

Stolee pointed out to me that objects inside pack files are also
zlib-compressed, and that measuring the speed of `git rev-list --objects
--all --count` might therefore be a better test.

And this is where things get a little messy: in the context of Git for
Windows, my local measurements indicate that zlib is better, with ~41
seconds using zlib vs ~52 seconds using zlib-ng (but the latter has a
rather large variance).

These measurements were done with a relatively straight-forward build of
zlib-ng v2.0.6, and on a hunch I then tried to build the tip of zlib-ng's
`develop` branch (which was much less straight-forward) and now get
virtually the same speed with that `rev-list` command.

But then I repeated the `archive` measurement with the `develop` version
of zlib-ng, and while it was still substantially faster than zlib, it was
slightly slower than zlib-ng v2.0.6 (zlib: ~26 seconds, zlib-ng v2.0.6:
~14 seconds, zlib-ng develop: ~16 seconds). Still, much, much faster than
using `-c tar.tgz.command="gzip -cn"` at ~24 seconds.

So: the picture is messy. The latest official release of zlib-ng seems to
offer performance wins using `archive` but slight losses using `rev-list.
Upgrading to the latest revision of zlib-ng offers slightly smaller
performance wins using `archive` and equivalent performance using
`rev-list`. Both blow `gzip -cn` out of the water, thanks to using MMX or
whatever my laptop's CPU offers.

The take-away as far as Git for Windows is concerned: It seems not _quite_
the time yet to switch from zlib to zlib-ng, I want to wait until there is
an official zlib-ng release with favorable speed.

Ciao,
Dscho

P.S.: I pushed a WIP update to this branch:

> Footnote *1*: https://github.com/msys2/MINGW-packages/compare/master...dscho:zlib-ng

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 1, 2022

On the Git mailing list, Jeff King wrote (reply to this):

On Fri, Jul 01, 2022 at 06:05:59PM +0200, Johannes Schindelin wrote:

> Stolee pointed out to me that objects inside pack files are also
> zlib-compressed, and that measuring the speed of `git rev-list --objects
> --all --count` might therefore be a better test.

That will spend quite a lot of time doing hash-lookups for each tree
entry. A better raw zlib test might be:

  git cat-file --batch --batch-all-objects --unordered >/dev/null

which will just dump each object, and should mostly be zlib and delta
reconstruction (the --unordered is important to hit the deltas in the
right order).

> And this is where things get a little messy: in the context of Git for
> Windows, my local measurements indicate that zlib is better, with ~41
> seconds using zlib vs ~52 seconds using zlib-ng (but the latter has a
> rather large variance).

That is a surprising slow-down between the two. I'd expect the command
above to show even more pronounced results, though, as it's spending
less time doing non-zlib things. But it's still just inflating (as
opposed to git-archive, which is both inflating and deflating).

-Peff

@gitgitgadget
Copy link

gitgitgadget bot commented Jul 1, 2022

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jeff King <peff@peff.net> writes:

> That will spend quite a lot of time doing hash-lookups for each tree
> entry. A better raw zlib test might be:
>
>   git cat-file --batch --batch-all-objects --unordered >/dev/null
>
> which will just dump each object, and should mostly be zlib and delta
> reconstruction (the --unordered is important to hit the deltas in the
> right order).

;-)

I like --unordered has the meaning "use the order Git likes" (which
is probably the packfile offset order, which we optimize for
minimizing seek during delta reconstruction).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants