archive-tar: use internal gzip by default
Drop the dependency on gzip(1) and use our internal implementation to
create tar.gz and tgz files.

Signed-off-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
rscharfe authored and gitster committed Jun 15, 2022
1 parent 23fcf8b commit 4f4be00
Showing 3 changed files with 20 additions and 20 deletions.
4 changes: 2 additions & 2 deletions Documentation/git-archive.txt
@@ -148,8 +148,8 @@ tar.<format>.command::
to the command (e.g., `-9`).
+
The `tar.gz` and `tgz` formats are defined automatically and use the
-command `gzip -cn` by default. An internal gzip implementation can be
-used by specifying the value `git archive gzip`.
+magic command `git archive gzip` by default, which invokes an internal
+implementation of gzip.

tar.<format>.remote::
If true, enable the format for use by remote clients via
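Anyone who wants the previous archives back can point the same `tar.<format>.command` mechanism at an external gzip again. A minimal sketch, per repository, using the command string that used to be the default:

    # Restore the pre-change behaviour: compress with gzip(1) instead of the internal gzip.
    git config tar.tgz.command "gzip -cn"
    git config tar.tar.gz.command "gzip -cn"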
4 changes: 2 additions & 2 deletions archive-tar.c
@@ -526,9 +526,9 @@ void init_tar_archiver(void)
int i;
register_archiver(&tar_archiver);

tar_filter_config("tar.tgz.command", "gzip -cn", NULL);
tar_filter_config("tar.tgz.command", internal_gzip_command, NULL);
tar_filter_config("tar.tgz.remote", "true", NULL);
tar_filter_config("tar.tar.gz.command", "gzip -cn", NULL);
tar_filter_config("tar.tar.gz.command", internal_gzip_command, NULL);
tar_filter_config("tar.tar.gz.remote", "true", NULL);
git_config(git_tar_config, NULL);
for (i = 0; i < nr_tar_filters; i++) {
32 changes: 16 additions & 16 deletions t/t5000-tar-tree.sh
@@ -339,21 +339,21 @@ test_expect_success 'only enabled filters are available remotely' '
test_cmp_bin remote.bar config.bar
'

-test_expect_success GZIP 'git archive --format=tgz' '
+test_expect_success 'git archive --format=tgz' '
git archive --format=tgz HEAD >j.tgz
'

-test_expect_success GZIP 'git archive --format=tar.gz' '
+test_expect_success 'git archive --format=tar.gz' '
git archive --format=tar.gz HEAD >j1.tar.gz &&
test_cmp_bin j.tgz j1.tar.gz
'

-test_expect_success GZIP 'infer tgz from .tgz filename' '
+test_expect_success 'infer tgz from .tgz filename' '
git archive --output=j2.tgz HEAD &&
test_cmp_bin j.tgz j2.tgz
'

-test_expect_success GZIP 'infer tgz from .tar.gz filename' '
+test_expect_success 'infer tgz from .tar.gz filename' '
git archive --output=j3.tar.gz HEAD &&
test_cmp_bin j.tgz j3.tar.gz
'
@@ -363,31 +363,31 @@ test_expect_success GZIP 'extract tgz file' '
test_cmp_bin b.tar j.tar
'

-test_expect_success GZIP 'remote tar.gz is allowed by default' '
+test_expect_success 'remote tar.gz is allowed by default' '
git archive --remote=. --format=tar.gz HEAD >remote.tar.gz &&
test_cmp_bin j.tgz remote.tar.gz
'

-test_expect_success GZIP 'remote tar.gz can be disabled' '
+test_expect_success 'remote tar.gz can be disabled' '
git config tar.tar.gz.remote false &&
test_must_fail git archive --remote=. --format=tar.gz HEAD \
>remote.tar.gz
'

-test_expect_success 'git archive --format=tgz (internal gzip)' '
-test_config tar.tgz.command "git archive gzip" &&
-git archive --format=tgz HEAD >internal_gzip.tgz
+test_expect_success GZIP 'git archive --format=tgz (external gzip)' '
+test_config tar.tgz.command "gzip -cn" &&
+git archive --format=tgz HEAD >external_gzip.tgz
'

-test_expect_success 'git archive --format=tar.gz (internal gzip)' '
-test_config tar.tar.gz.command "git archive gzip" &&
-git archive --format=tar.gz HEAD >internal_gzip.tar.gz &&
-test_cmp_bin internal_gzip.tgz internal_gzip.tar.gz
+test_expect_success GZIP 'git archive --format=tar.gz (external gzip)' '
+test_config tar.tar.gz.command "gzip -cn" &&
+git archive --format=tar.gz HEAD >external_gzip.tar.gz &&
+test_cmp_bin external_gzip.tgz external_gzip.tar.gz
'

-test_expect_success GZIP 'extract tgz file (internal gzip)' '
-gzip -d -c <internal_gzip.tgz >internal_gzip.tar &&
-test_cmp_bin b.tar internal_gzip.tar
+test_expect_success GZIP 'extract tgz file (external gzip)' '
+gzip -d -c <external_gzip.tgz >external_gzip.tar &&
+test_cmp_bin b.tar external_gzip.tar
'

test_expect_success 'archive and :(glob)' '

16 comments on commit 4f4be00

@72squared

This broke the checksums of the generated tars, and lots of things relied on those checksums staying consistent, including bazelbuild:
https://twitter.com/shs96c/status/1620201523211894784
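The effect is easy to reproduce: the uncompressed tar stream is byte-for-byte stable, but the gzip layer differs between the external and the internal implementation, so any hash taken over the compressed archive changes. A quick sketch (in any repository, with gzip(1) installed):

    git archive --format=tar HEAD | sha256sum                                        # unchanged by this commit
    git -c tar.tgz.command='gzip -cn' archive --format=tgz HEAD | sha256sum          # external gzip
    git -c tar.tgz.command='git archive gzip' archive --format=tgz HEAD | sha256sum  # internal gzip; typically differs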

@chenrui333

This also broke a bunch of builds on the Homebrew side: https://github.com/Homebrew/homebrew-core/labels/git-archive-checksum-incident

@Salamandar

This will break Arch Linux packages, and most probably deb, rpm, npm, PyPI, etc. as well.

@hacker1024

Nix's ecosystem will be highly affected too.

@opticron

This broke RTEMS tools builds; we adjusted to the new hashes, and then it got reverted, so we have to revert our hash updates 🤦

@tomoveu

CI on open-source projects like Buildroot reported possible man-in-the-middle attacks because of this change 🥶 Remember the old-school engineering rule: if it works well, do not touch it.

@bsdimp commented on Jan 31, 2023

This broke all the FreeBSD ports that downloaded tar.gz files from GitHub. That is approximately 5,800 packages out of a total of 31,000 affected. Thankfully, it was reverted... regenerating all these hashes would be quite the chore.

The right way to cope with this would be to add newer, better compression methods, give people a chance to migrate, and then, if you really want to get rid of this dependency, just delete the .gz files. They are so early '90s anyway :)

@adam-azarchs

I'd be all for adding archives in more modern formats [1]. However, that's not really the point: the point is that there are a lot of tools which download an archive and then verify its checksum. It doesn't really matter what compression scheme those archives use, but stability is important. Even if documentation says the checksums shouldn't be relied upon to remain stable, they have remained stable for long enough that people have begun relying on them anyway. There are also a lot of use cases where release artifacts aren't an acceptable substitute, either [2].

Footnotes

  1. brotli, zstd, and xz are all great formats with different speed/compression-ratio trade-offs. The first two are both faster and achieve higher compression ratios than gzip. xz gets higher compression ratios than any of the others, but is slightly slower. I don't think there are any supported operating systems which still ship without xz in their default set of installed packages, while brotli and zstd support is less common. But none of those formats are available now.

  2. Sometimes you need to get a commit that isn't a tagged release, for a repository you don't have control over. Also, for private repos, you can fetch the default archives by supplying an Authorization header, but fetching other kinds of release artifacts is more complicated and, depending on the tool that's doing the fetching, it might be impossible, if for example there isn't an option for setting the Accept header.
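A sketch of both patterns, with OWNER, REPO, REF, and the pinned hash as placeholders (the tarball route is GitHub's documented REST API endpoint, which accepts an Authorization header for private repositories):

    # Pin-and-verify: fail if the downloaded archive's hash ever drifts.
    curl -fsSL https://github.com/OWNER/REPO/archive/REF.tar.gz -o src.tar.gz
    echo "EXPECTED_SHA256  src.tar.gz" | sha256sum -c -

    # Private repository: fetch the same archive through the REST API with a token.
    curl -fsSL -H "Authorization: Bearer $GITHUB_TOKEN" \
      https://api.github.com/repos/OWNER/REPO/tarball/REF -o src.tar.gz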

@risa2000

> Even if documentation says the checksums shouldn't be relied upon to remain stable, they have remained stable for long enough that people have begun relying on them anyway. There are also a lot of use cases where release artifacts aren't an acceptable substitute, either [2].

I guess the whole affair just proved that the only artifacts that are useful are those which are immutable, because only then can their integrity be verified the easiest (and only) way: by using a cryptographic hash. If GitHub wants to insist on not guaranteeing that, then the response should be: why even bother?

Using a brand-new compression algorithm/format for any future artifacts should not be a problem as long as they stay that way (and assuming the brand-new and shiny stuff is reasonably available). Defining the "future" might be the only tricky part.

@xfix (Contributor) commented on Feb 1, 2023

> Nix's ecosystem will be highly affected too.

Nixpkgs doesn't calculate the checksum of the archive, but rather the checksum of the files within it, specifically to avoid this issue (see https://github.com/NixOS/nixpkgs/blob/master/pkgs/build-support/fetchzip/default.nix). If you use fetchFromGitHub, the checksums will be stable, and they have been since 2014.

# This function downloads and unpacks an archive file, such as a zip
# or tar file. This is primarily useful for dynamically generated
# archives, such as GitHub's /archive URLs, where the unpacked content
# of the zip file doesn't change, but the zip file itself may
# (e.g. due to minor changes in the compression algorithm, or changes
# in timestamps).
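In other words, the hash covers the extracted tree rather than the compressed bytes, so recompression cannot change it. Roughly the same idea sketched in shell (the placeholders OWNER/REPO/REF and the deterministic re-pack flags are illustrative GNU tar usage, not fetchzip's actual implementation):

    mkdir -p src
    curl -fsSL https://github.com/OWNER/REPO/archive/REF.tar.gz | tar -xzf - --strip-components=1 -C src
    # Re-pack deterministically and hash that, instead of hashing the download itself.
    tar -cf - --sort=name --owner=0 --group=0 --numeric-owner --mtime='1970-01-01' -C src . | sha256sum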

@hacker1024

> Nix's ecosystem will be highly affected too.
>
> Nixpkgs doesn't calculate the checksum of the archive, but rather the checksum of the files within it, specifically to avoid this issue (see https://github.com/NixOS/nixpkgs/blob/master/pkgs/build-support/fetchzip/default.nix). If you use fetchFromGitHub, the checksums will be stable, and they have been since 2014.
>
> # This function downloads and unpacks an archive file, such as a zip
> # or tar file. This is primarily useful for dynamically generated
> # archives, such as GitHub's /archive URLs, where the unpacked content
> # of the zip file doesn't change, but the zip file itself may
> # (e.g. due to minor changes in the compression algorithm, or changes
> # in timestamps).

This is true, but I've still seen fetchurl used where it shouldn't be, much too frequently (e.g. lopsided98/nix-ros-overlay#241).

@cielavenir

An attempt to fix this matter: #1454

@Salamandar

> Nix's ecosystem will be highly affected too.
>
> Nixpkgs doesn't calculate the checksum of the archive, but rather the checksum of the files within it, specifically to avoid this issue (see https://github.com/NixOS/nixpkgs/blob/master/pkgs/build-support/fetchzip/default.nix). If you use fetchFromGitHub, the checksums will be stable, and they have been since 2014.
>
> # This function downloads and unpacks an archive file, such as a zip
> # or tar file. This is primarily useful for dynamically generated
> # archives, such as GitHub's /archive URLs, where the unpacked content
> # of the zip file doesn't change, but the zip file itself may
> # (e.g. due to minor changes in the compression algorithm, or changes
> # in timestamps).

How do you prevent tarbombs? The whole idea of checking the tarball itself is to do the check before any extraction.

@Salamandar

https://sysdfree.wordpress.com/2020/01/04/293/

Arch already moved to Facebook's Zstandard.

… For packages, not source downloads. That's not today's subject.

@xfix (Contributor) commented on Feb 14, 2023

> How do you prevent tarbombs? The whole idea of checking the tarball itself is to do the check before any extraction.

Quite simply, Nix doesn't, although I wonder how many package managers deal with Content-Encoding: gzip bombs.

However, there are some measures against it. First of all, for packages inside Nixpkgs, the source files aren't usually downloaded directly from upstream, but from the cache. The cache is designed to ensure that packages will still work even if something happens to upstream, but it also helps here. Also, the default build directory is inside /tmp, which is usually a tmpfs with its own size limit (by default half of physical RAM, not counting swap), and that avoids most problems.
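A comparable guard outside Nix, as a sketch under stated assumptions (root access, an arbitrary 2 GiB cap, an archive named src.tar.gz): unpack into a size-limited tmpfs so a decompression bomb fails with ENOSPC instead of filling the disk.

    mkdir -p /tmp/build
    sudo mount -t tmpfs -o size=2g tmpfs /tmp/build   # cap the space available to the unpack
    tar -xzf src.tar.gz -C /tmp/build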

@Salamandar

> How do you prevent tarbombs? The whole idea of checking the tarball itself is to do the check before any extraction.
>
> Quite simply, we don't, although I wonder how many package managers deal with Content-Encoding: gzip bombs.
>
> However, there are some measures against it. First of all, for packages inside Nixpkgs, the source files aren't usually downloaded directly from upstream, but from the cache. The cache is designed to ensure that packages will still work even if something happens to upstream, but it also helps here. Also, the default build directory is inside /tmp, which is usually a tmpfs with its own size limit (by default half of physical RAM, not counting swap), and that avoids most problems.

Thanks, that's actually a nice answer.
