Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Future CI: F37 fails 'podman load compressed tar file' #15944

Closed
cevich opened this issue Sep 26, 2022 · 30 comments · Fixed by klauspost/pgzip#50 or #16015
Closed

Future CI: F37 fails 'podman load compressed tar file' #15944

cevich opened this issue Sep 26, 2022 · 30 comments · Fixed by klauspost/pgzip#50 or #16015
Labels
kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@cevich
Copy link
Member

cevich commented Sep 26, 2022

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

While attempting to update the CI VM images to F37 (beta), found this test is failing on most test-configurations.

Steps to reproduce the issue:

  1. Run CI in Cirrus: Update CI VM images to F37beta #15760 (or make similar changes in separate PR)

Describe the results you received:

...cut...
# podman [options] load -i /tmp/podman_test1780044227/alpine.tar.gz
         Error: payload does not match any of the supported image formats:
          * oci: initializing source oci:/tmp/podman_test1780044227/alpine.tar.gz:: open /tmp/podman_test1780044227/alpine.tar.gz/index.json: not a directory
          * oci-archive: loading index: open /var/tmp/oci1331170819/index.json: no such file or directory
          * docker-archive: loading tar component manifest.json: archive/tar: invalid tar header
          * dir: open /tmp/podman_test1780044227/alpine.tar.gz/manifest.json: not a directory
         [AfterEach] Podman load
           /var/tmp/go/src/github.com[/containers/podman/test/e2e/load_test.go:31](https://github.com/containers/podman/blob/aa8a2964392040ee85652fcff59be73b80462bab/test/e2e/load_test.go#L31)
         # podman [options] pod rm -fa -t 0
         # podman [options] rm -fa -t 0
         
         • Failure [2.285 seconds]
         Podman load
         /var/tmp/go/src/github.com[/containers/podman/test/e2e/load_test.go:14](https://github.com/containers/podman/blob/aa8a2964392040ee85652fcff59be73b80462bab/test/e2e/load_test.go#L14)
           podman load compressed tar file [It]
           /var/tmp/go/src/github.com[/containers/podman/test/e2e/load_test.go:58](https://github.com/containers/podman/blob/aa8a2964392040ee85652fcff59be73b80462bab/test/e2e/load_test.go#L58)
         
           Expected
               <int>: 125
           to match exit code:
               <int>: 0
...cut...

Describe the results you expected:

Test should pass

Additional information you deem important (e.g. issue happens only occasionally):

Reproduces on:

  • int remote fedora-37 root host
  • int podman fedora-37 rootless host
  • int podman fedora-37 root host
  • int podman fedora-37 root container

Output of podman version:

See below

Output of podman info:

+  ./bin/podman system info
------------------------------------------------------------
host:
  arch: amd64
  buildahVersion: 1.28.0-dev
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - misc
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.3-1.fc37.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.3, commit: '
  cpuUtilization:
    idlePercent: 17.89
    systemPercent: 35.02
    userPercent: 47.09
  cpus: 2
  distribution:
    distribution: fedora
    variant: cloud
    version: "37"
  eventLogger: journald
  hostname: cirrus-task-5213415214940160
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.19.10-300.fc37.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 902230016
  memTotal: 4109639680
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.6-2.fc37.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.6
      commit: 18cf2efbb8feb2b2f20e316520e0fd0b6c41ef4d
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +YAJL
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-5.fc37.x86_64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 4107792384
  swapTotal: 4109365248
  uptime: 0h 29m 48.00s
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  docker.io:
    Blocked: false
    Insecure: false
    Location: mirror.gcr.io
    MirrorByDigestOnly: false
    Mirrors: null
    Prefix: docker.io
    PullFromMirror: ""
  docker.io/library:
    Blocked: false
    Insecure: false
    Location: quay.io/libpod
    MirrorByDigestOnly: false
    Mirrors: null
    Prefix: docker.io/library
    PullFromMirror: ""
  localhost:5000:
    Blocked: false
    Insecure: true
    Location: localhost:5000
    MirrorByDigestOnly: false
    Mirrors: null
    Prefix: localhost:5000
    PullFromMirror: ""
  search:
  - docker.io
  - quay.io
  - registry.fedoraproject.org
store:
  configFile: /usr/share/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 213588619264
  graphRootUsed: 3577217024
  graphStatus:
    Backing Filesystem: btrfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 0
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.3.0-dev
  Built: 1663942581
  BuiltTime: Fri Sep 23 09:16:21 2022
  GitCommit: aa8a2964392040ee85652fcff59be73b80462bab
  GoVersion: go1.19.1
  Os: linux
  OsArch: linux/amd64
  Version: 4.3.0-dev
------------------------------------------------------------

Package info (e.g. output of rpm -q podman or apt list podman):

Fedora release 37 (Thirty Seven)
Kernel:  5.19.10-300.fc37.x86_64
Cgroups:  cgroup2fs
conmon-2.1.3-1.fc37-x86_64
containernetworking-plugins-1.1.1-8.fc37-x86_64
containers-common-1-66.fc37-noarch
container-selinux-2.190.0-1.fc37-noarch
criu-3.17.1-3.fc37-x86_64
crun-1.6-2.fc37-x86_64
golang-1.19.1-1.fc37-x86_64
libseccomp-2.5.3-3.fc37-x86_64
netavark-1.1.0-1.fc37-x86_64
package aardvark is not installed
podman-4.2.1-2.fc37-x86_64
runc-1.1.3-1.fc37-x86_64
skopeo-1.9.2-1.fc37-x86_64
slirp4netns-1.2.0-5.fc37-x86_64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

Annotated log

@openshift-ci openshift-ci bot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 26, 2022
cevich added a commit to cevich/podman that referenced this issue Sep 26, 2022
Ref: containers#15944

Signed-off-by: Chris Evich <cevich@redhat.com>
cevich added a commit to cevich/podman that referenced this issue Sep 29, 2022
Ref: containers#15944

Signed-off-by: Chris Evich <cevich@redhat.com>
@edsantiago
Copy link
Collaborator

A lot of historical issues suggest that this is how podman says "out of disk space". Is it possible to check df on the new f37 VM?

@cevich
Copy link
Member Author

cevich commented Sep 29, 2022

CI captures df output at the end of the integration (and system?) task, but doesn't log it during. That said, the VMs should start with 80+ GB free space, but it's possible the disk expansion mechanism broke...

@cevich
Copy link
Member Author

cevich commented Sep 29, 2022

...heh, no I don't think this is the problem (df output), but it was wise to suggest checking, thanks!

+  df -lhTx tmpfs
------------------------------------------------------------
Filesystem     Type      Size  Used Avail Use% Mounted on
devtmpfs       devtmpfs  4.0M     0  4.0M   0% /dev
/dev/sda5      btrfs     199G  3.4G  195G   2% /
/dev/sda5      btrfs     199G  3.4G  195G   2% /home
/dev/sda2      ext4      966M   99M  802M  11% /boot
/dev/sda3      vfat      100M  9.9M   90M  10% /boot/efi
overlay        overlay   199G  3.4G  195G   2% /tmp/podman_test1390407731/server_root/overlay/12c842d64ba1a9b9d48f2d7f63656001f20429cdafea877049bb714d93d9cd7b/merged

@cevich
Copy link
Member Author

cevich commented Sep 29, 2022

This part of the error is curious:

          * docker-archive: loading tar component manifest.json: archive/tar: invalid tar header

Maybe something changed in go-land (archive/tar presumably) that removed some former "un-gzip it first" behavior?

@mtrmac knows about image stuff...does this ring any bells for you?

@mtrmac
Copy link
Collaborator

mtrmac commented Sep 29, 2022

It doesn’t really ring any bells, and I don’t much understand how changing the build environment (without changing the Podman or c/image codebase at all, AFAICS) could make a difference here.

If I had to guess, missing gzip command in the image?

@cevich
Copy link
Member Author

cevich commented Sep 29, 2022

how changing the build environment could make a difference here.

Fair point. Hmmm. It's possible gzip is missing, I can check that. Thanks for the suggestion.

@edsantiago
Copy link
Collaborator

No, it's not missing gzip. I would go as far as to say that a UNIX system without gzip is impossible in 2022.

This reproduces super-trivially:

# podman pull quay.io/libpod/alpine:latest
...
# podman save -o /tmp/foo.tar alpine
...
# gzip /tmp/foo.tar
# podman load -i /tmp/foo.tar.gz
Error: payload does not match any of the supported image formats:
 * oci: initializing source oci:/tmp/foo.tar.gz:: open /tmp/foo.tar.gz/index.json: not a directory
 * oci-archive: loading index: open /var/tmp/oci3715535303/index.json: no such file or directory
 * docker-archive: loading tar component manifest.json: archive/tar: invalid tar header
 * dir: open /tmp/foo.tar.gz/manifest.json: not a directory

podman-4.3.0~rc1-1.fc37.x86_64
gzip-1.12-2.fc37.x86_64

@cevich
Copy link
Member Author

cevich commented Sep 29, 2022

Damn...well that probably rules out easy fixes I can do.

@edsantiago
Copy link
Collaborator

Fails even with gzip-1.11-1.fc36.x86_64. Fails with podman-4.2.1-2.fc37.x86_64. Fails with podman built from source.

@cevich
Copy link
Member Author

cevich commented Sep 29, 2022

Should we be asking: How did this ever pass? Maybe the test was disabled and recently re-enabled by accident or fluke?

@mtrmac
Copy link
Collaborator

mtrmac commented Sep 29, 2022

This reproduces super-trivially:

You’re right, this is clearly broken in c/image. Looking…

@edsantiago
Copy link
Collaborator

It passed because we don't do CI on f37, and e2e tests don't run as part of Fedora gating, only system tests. (Oh hi. This is me with a friendly reminder that test/system exists and feels neglected). So, something changed in f37, and the question is, what?

@cevich
Copy link
Member Author

cevich commented Sep 29, 2022

For context, I caught this in the e2e tests. What I meant in my comment above is...if it's a problem in the podman code (or a dependency), it seems strange it's been passing all along in F36. Since it has been, the problem must somehow be influenced by both the code AND the environment...which is hard to put my head around (in the context of the error).

@edsantiago
Copy link
Collaborator

@mtrmac I wouldn't be so sure it's c/image. podman @ main on my f36 laptop passes happily. Only f37 fails. This strongly smells like a problem in f37.

@cevich
Copy link
Member Author

cevich commented Sep 29, 2022

Does c/image interact with the system's gzip/tar or kernel in some way? Oof. I'm glad Dr. Miloslav is on the case 😀

@mtrmac
Copy link
Collaborator

mtrmac commented Sep 29, 2022

Looking at the generated file, it seems almost exactly the first 1 MB of the decompressed data is just missing.

Replacing the use of github.com/klauspost/pgzip with standard library’s compress/gzip makes things work.

@edsantiago
Copy link
Collaborator

No difference if I downgrade to glibc-2.35.9000-32.fc37.x86_64, make clean, and make.

Here's the last bit of strace:

openat(AT_FDCWD, "/tmp/foo.tar.gz/manifest.json", O_RDONLY|O_CLOEXEC) = -1 ENOTDIR (Not a directory)
write(2, "Error: payload does not match an"..., 410) = 410

@mtrmac
Copy link
Collaborator

mtrmac commented Sep 29, 2022

@edsantiago There’s a bit of confusing Podman behavior: “payload does not match any of the supported image formats:” i.e. it blindly tries four different formats. The last one is definitely going to fail like that; we are interested in the docker-archive: failure.

@edsantiago
Copy link
Collaborator

Huh... okay, trimmed a little, istm that the untar is failing:

newfstatat(AT_FDCWD, "/tmp/foo.tar.gz", {st_mode=S_IFREG|0644, st_size=2715866, ...}, 0) = 0
openat(AT_FDCWD, "/tmp/foo.tar.gz", O_RDONLY|O_CLOEXEC) = 3
epoll_ctl(4, EPOLL_CTL_ADD, 3, {events=EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, data={u32=3004675888, u64=140156377478960}}) = -1 EPERM (Operation not permitted)
read(3, "\37\213\10\10\3\3635c", 8)     = 8
mmap(0xc000c00000, 4194304, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xc000c00000
epoll_pwait(4, [], 128, 0, NULL, 0)     = 0
read(3, "\0\3foo.tar\0\354\234\17|T\325\225\307/\210\20\4\333\200(hU\302Z-\376\1"..., 4096) = 4096
openat(AT_FDCWD, "/var/tmp/docker-tar3612857519", O_RDWR|O_CREAT|O_EXCL|O_CLOEXEC, 0600) = 7
.....
read(3, "\234m!`201\367U\24\230\344\342\233\r\354\364h\347\272_\vL\1_1\357;v\334Q\345"..., 4096) = 4096
write(7, "\2\0\2\0\2\0\2\0\2\0\2\0\2\0\2\0\2\0\2\0\v\0\2\0\2\0\2\0\v\0\2\0"..., 1048576) = 1048576
futex(0xc00043c548, FUTEX_WAIT_PRIVATE, 0, NULL) = ?
+++ exited with 125 +++

@mtrmac
Copy link
Collaborator

mtrmac commented Sep 29, 2022

My current hypothesis is that this is Go 1.19 and

NopCloser's result now implements WriterTo whenever its input does.

triggering a bug in pgzip’s WriteTo.

Would it be trivially easy to try downgrading to Go 1.18 to see if that makes a difference? (If not, don’t do anything complicated, I can continue working from my end.)

@edsantiago
Copy link
Collaborator

Oh, good thinking. With golang-1.18.4-1.fc37.x86_64 it works again.

@cevich
Copy link
Member Author

cevich commented Sep 29, 2022

Downgrading in CI is a "hard sell" for me with the F37 images. I feel it's important that we're running the newer/native distro toolchain to catch things like this (since it will be used at rpmbuild time). I'm okay with leaving the test skip() in place for now, though the underlying issue will certainly affect end-users of F37 in a few weeks time if it's not fixed 😢

@mtrmac is this going to break your back? Should we start a conversation with upstream Fedora about "Don't use golang 1.19" in 37?

@edsantiago
Copy link
Collaborator

triggering a bug in pgzip’s WriteTo.

Seems like the obvious solution is to fix pgzip. I trust that is what @mtrmac's first approach will be.

@mtrmac
Copy link
Collaborator

mtrmac commented Sep 29, 2022

I think the ultimately correct fix will be some variant of klauspost/pgzip#50 . There might be smaller workarounds we can do with more confidence in the meantime (e.g. wrap the code to make the WriteTo method unreachable, falling back to the pre-Go 1.19 code path).

FYI @mheon — current Podman is succeeding in tests, but podman load is broken when compiled with Go 1.19 (which we don’t yet test). I have no idea what that should mean for Podman 4.3, I’ll leave that to you and others.

@mtrmac
Copy link
Collaborator

mtrmac commented Sep 29, 2022

(A hypothetical bigger worry is that the pgzip package is used in other decompression paths, notably the primary pull code. I don’t immediately know if it’s possible for the problematic call combination to happen there — I suppose if it happened on the primary code paths, we would have seen much more breakage, so we are probably fine.)

@mtrmac
Copy link
Collaborator

mtrmac commented Sep 30, 2022

FWIW klauspost/pgzip#50 was merged; so we need it in all the top-level commands.

@cevich
Copy link
Member Author

cevich commented Oct 3, 2022

Okay, so sounds like we're likely to hit the same/similar/related issues bringing F37 into other repo's CI. Dang 😞

@cevich
Copy link
Member Author

cevich commented Oct 3, 2022

Oops, didn't refresh before commenting. I'll rebase my PR w/o the test-skip and cross my fingers 🤞

@cevich
Copy link
Member Author

cevich commented Oct 3, 2022

Rebased my PR with these changes and force-pushed #15760. CI is running now. Thanks a bunch @mtrmac

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 13, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 13, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
3 participants