
buildah pull sometimes hangs forever (v1.23.0) #3662

Closed

dawagner opened this issue Dec 3, 2021 · 17 comments


dawagner commented Dec 3, 2021

Description

buildah pull sometimes hangs forever. Running it with --log-level debug shows this error, then hangs immediately:

Failed to retrieve partial blob: blob type not supported for partial retrieval

I'm pulling images that are constructed in an iterative fashion (each image is constructed from the previous one) and the reproducibility seems to vary depending on the layer. For instance, I have an image that only adds an environment variable (the filesystem diff is empty) and that one seems to reproduce the issue more than the others.

Additionally, the images are built with buildah but then pushed to the local docker daemon, which in turn pushes them to an AWS ECR registry.

Steps to reproduce the issue:

Just run buildah pull <some image on AWS ECR>. It sometimes works; in that case, delete the image and try again.

Output of rpm -q buildah or apt list buildah:

buildah-1.23.0-1.fc33.x86_64

Output of buildah version:

Version:         1.23.0
Go Version:      go1.15.14
Image Spec:      1.0.1-dev
Runtime Spec:    1.0.2-dev
CNI Spec:        0.4.0
libcni Version:  
image Version:   5.16.0
Git Commit:      
Built:           Thu Jan  1 01:00:00 1970
OS/Arch:         linux/amd64
BuildPlatform:   linux/amd64

Output of cat /etc/*release:

Fedora release 33 (Thirty Three)
NAME=Fedora
VERSION="33 (Workstation Edition)"
ID=fedora
VERSION_ID=33
VERSION_CODENAME=""
PLATFORM_ID="platform:f33"
PRETTY_NAME="Fedora 33 (Workstation Edition)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:33"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f33/system-administrators-guide/"
SUPPORT_URL="https://fedoraproject.org/wiki/Communicating_and_getting_help"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=33
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=33
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="Workstation Edition"
VARIANT_ID=workstation
Fedora release 33 (Thirty Three)
Fedora release 33 (Thirty Three)

Output of uname -a:

Linux seldon 5.14.17-101.fc33.x86_64 #1 SMP Mon Nov 8 21:25:05 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Output of cat /etc/containers/storage.conf:

# some comments removed to reduce the noise
[storage]

# Default Storage Driver, Must be set for proper operation.
driver = "overlay"

# Temporary storage location
runroot = "/run/containers/storage"

# Primary Read/Write location of container storage
graphroot = "/var/lib/containers/storage"

# Storage path for rootless users
#
# rootless_storage_path = "$HOME/.local/share/containers/storage"

[storage.options]
# Storage options to be passed to underlying storage drivers

# AdditionalImageStores is used to pass paths to additional Read/Only image stores
# Must be comma separated list.
additionalimagestores = [
]


# remap-uids = 0:1668442479:65536
# remap-gids = 0:1668442479:65536


# remap-user = "containers"
# remap-group = "containers"


# root-auto-userns-user = "storage"
#
# Auto-userns-min-size is the minimum size for a user namespace created automatically.
# auto-userns-min-size=1024
#
# Auto-userns-max-size is the maximum size for a user namespace created automatically.
# auto-userns-max-size=65536

[storage.options.overlay]

#ignore_chown_errors = "false"


#mount_program = "/usr/bin/fuse-overlayfs"

# mountopt specifies comma separated list of extra mount options
mountopt = "nodev"

# Set to skip a PRIVATE bind mount on the storage home directory.
# skip_mount_home = "false"

# Size is used to set a maximum size of the container image.
# size = ""


# force_mask = ""

[storage.options.thinpool]
# Storage Options for thinpool

# autoextend_percent = "20"
# autoextend_threshold = "80"
# basesize = "10G"
# blocksize="64k"
# directlvm_device = ""
# directlvm_device_force wipes device even if device already has a filesystem.
# directlvm_device_force = "True"

# fs specifies the filesystem type to use for the base device.
# fs="xfs"

# log_level = "7"


# min_free_space = "10%"

# mkfsarg specifies extra mkfs arguments to be used when creating the base
# device.
# mkfsarg = ""

# metadata_size is used to set the `pvcreate --metadatasize` options when
# creating thin devices. Default is 128k
# metadata_size = ""

# Size is used to set a maximum size of the container image.
# size = ""

# use_deferred_removal = "True"
# use_deferred_deletion = "True"
# xfs_nospace_max_retries = "0"


dawagner commented Dec 3, 2021

Using nix, I pulled this version of buildah:

$ buildah version
Version:         1.21.0
Go Version:      go1.16.8
Image Spec:      1.0.1-dev
Runtime Spec:    1.0.2-dev
CNI Spec:        0.4.0
libcni Version:  v0.8.1
image Version:   5.12.0
Git Commit:      unknown
Built:           Tue Jan  1 01:00:00 1980
OS/Arch:         linux/amd64

and I couldn't reproduce the issue after trying ~10 times. After switching back to 1.23.0 (using nix as well, for consistency), I reproduced on the first try. For reference, that was with:

$ buildah version
Version:         1.23.0
Go Version:      go1.16.8
Image Spec:      1.0.1-dev
Runtime Spec:    1.0.2-dev
CNI Spec:        0.4.0
libcni Version:  v0.8.1
image Version:   5.16.0
Git Commit:      unknown
Built:           Tue Jan  1 01:00:00 1980
OS/Arch:         linux/amd64
BuildPlatform:   linux/amd64

TomSweeneyRedHat (Member) commented:

WDYT @vrothberg ?

vrothberg (Member) commented:

Failed to retrieve partial blob: blob type not supported for partial retrieval

@giuseppe PTAL


giuseppe commented Dec 6, 2021

Failed to retrieve partial blob: blob type not supported for partial retrieval

podman/buildah fall back to pulling the image without the "partial pull" feature when it is not supported (in fact, the message above is just a debug log), so the hang is probably happening later on.

Could it be registry-related, with the registry blocking two quick successive requests for the same image?

Do you have an image that we can use to reproduce the issue?


dawagner commented Dec 8, 2021

Thanks for the feedback. I'll see if I can provide a reproducer.


dawagner commented Dec 14, 2021

I could reproduce with this image: https://hub.docker.com/r/deubeuliou/buildah-issue-3662, but only when it is hosted on an AWS ECR registry (I used a private registry provided by my employer). Also, I tried pulling the nginx image from the AWS ECR Public Gallery (https://gallery.ecr.aws/nginx/nginx) and couldn't reproduce.

Here's how the image was built:

cont=$(buildah from fedora:33)
buildah run $cont -- dd if=/dev/random of=/random_file bs=1M count=10
buildah commit --rm $cont buildah-issue-3662:1
cont=$(buildah from localhost/buildah-issue-3662:1)
buildah config --env BUILDAH_ISSUE_3662=1 $cont 
buildah commit --rm $cont localhost/buildah-issue-3662:2
cont=$(buildah from localhost/buildah-issue-3662:2)
buildah run $cont -- dd if=/dev/random of=/random_file_2 bs=1M count=10
buildah commit --rm $cont localhost/buildah-issue-3662:3

I then pushed the 3 tags to my AWS ECR registry, removed the local images, downloaded the 1st tag, and then repeatedly attempted to buildah pull --log-level debug the 3rd tag. It hung about 5-10% of the time.
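For reference, the push-and-retry procedure above can be sketched as a shell loop. This is a rough sketch, not the reporter's exact commands: the registry URL is a placeholder, and `timeout` is only there so a hang shows up as a failed attempt instead of blocking forever.

```shell
# Placeholder registry: substitute your own ECR account/region.
REGISTRY="123456789012.dkr.ecr.eu-west-1.amazonaws.com"

reproduce() {
  # Push the three locally built tags.
  for tag in 1 2 3; do
    buildah push "localhost/buildah-issue-3662:$tag" \
      "docker://$REGISTRY/buildah-issue-3662:$tag"
  done

  # Drop the local copies and re-pull the base tag once...
  buildah rmi localhost/buildah-issue-3662:1 \
              localhost/buildah-issue-3662:2 \
              localhost/buildah-issue-3662:3
  buildah pull "$REGISTRY/buildah-issue-3662:1"

  # ...then pull tag 3 in a loop; `timeout` turns a hang into a non-zero
  # exit after 120s, which ends the loop.
  attempt=0
  while timeout 120 buildah pull --log-level debug \
          "$REGISTRY/buildah-issue-3662:3"; do
    attempt=$((attempt + 1))
    echo "attempt $attempt succeeded; removing and retrying"
    buildah rmi "$REGISTRY/buildah-issue-3662:3"
  done
  echo "pull hung or failed after $attempt successful attempts"
}

# Only meaningful on a machine with buildah and push access to the registry.
if command -v buildah >/dev/null; then reproduce; fi
```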

giuseppe (Member) commented:

I still think the issue is caused by the remote registry throttling your requests. The debug message above is harmless (in fact, no additional request is even made when the annotation is not present).

Could you try debugging the network connection with wireshark?

dawagner (Author) commented:

I can try that. Are you suggesting that the bandwidth will be limited but non-zero?

giuseppe (Member) commented:

Yes, or that it hangs for a while.

dawagner (Author) commented:

Using ss, I can see HTTPS connections being established, but when the hang occurs, all the connections are closed. I had a buildah pull command in a tmux session that sat there for 2 days (I believe the layer it was supposed to pull is about 1 MB). I'd like to stress that I have not reproduced this with buildah 1.21.0, even after about 50 attempts.

I can confirm, however, that the hang is not always associated with the "blob type not supported for partial retrieval" log; I'll remove it from the title.

@dawagner dawagner changed the title buildah pull sometimes hangs forever with "blob type not supported for partial retrieval" buildah pull sometimes hangs forever (v1.23.0) Dec 15, 2021
dawagner (Author) commented:

I suppose I can try and bisect the issue and/or I can try to reproduce with a public registry on the same AWS account where I'm currently having the issue. If that works, I could send you the URL privately. I'll probably do that after the holiday season.

dawagner (Author) commented:

Good news: I have bisected it down to 980d352. The bad news is, this commit is:

bump github.com/containers/image/v5 from 5.15.2 to 5.16.0

I'll try and bisect that as well.

dawagner (Author) commented:

Ok... you're not going to believe this...

I checked out the last good commit and then checked out the various components that were upgraded by the "bad" commit. image v5.15.0 is ok, storage v1.35.0 is ok as well, but it's vendor/github.com/vbauerster/mpb (bumped from v7.0.3 to v7.1.3) that triggers the hang...

I have no idea why it only happens under the specific circumstances I'm experiencing it, though.

Using the --quiet option unfortunately doesn't help.
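For anyone repeating this kind of component-level bisection: one way to do it (a sketch, assuming a buildah source checkout at the last known-good commit; the module path and version are the ones named above) is to bump a single dependency, re-vendor, and rebuild.

```shell
# Run from a buildah source checkout at the last known-good commit.
if [ -f go.mod ]; then
  # Bump only the suspect module, leaving everything else at the good versions.
  go mod edit -require=github.com/vbauerster/mpb/v7@v7.1.3
  go mod vendor   # refresh vendor/ to match go.mod
  make            # rebuild, then retest the pull with the new binary
else
  echo "not in a Go module checkout; nothing to do"
fi
```

Repeating this per dependency isolates which module bump in the vendoring commit introduced the regression.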


rhatdan commented Dec 18, 2021

@mtrmac PTAL


mtrmac commented Dec 23, 2021

That’s almost certainly vbauerster/mpb#100. (To confirm, it would help to capture a full Go backtrace.)

It seems that even the newest Buildah v1.23.1 still depends on that version; it was fixed in #3526 but there hasn’t been a Buildah release since.
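If anyone needs to capture the backtrace mentioned above: the Go runtime dumps every goroutine's stack when a program receives SIGQUIT, so sending that signal to the hung process (or pressing Ctrl-\ in its terminal) is usually enough. A sketch, assuming the hung pull is the oldest running buildah process:

```shell
# Oldest matching process; adjust the pattern if several pulls are running.
pid=$(pgrep -o buildah || true)
if [ -n "$pid" ]; then
  # SIGQUIT makes the Go runtime print all goroutine stacks to stderr and exit.
  kill -QUIT "$pid"
else
  echo "no buildah process found"
fi
# To capture the dump to a file instead, start the pull with stderr redirected:
#   buildah pull --log-level debug IMAGE 2>pull-backtrace.log
```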


dawagner commented Jan 5, 2022

@mtrmac: indeed, I don't reproduce with that commit. Thanks.

I'm closing this; until there's a new release and it hits my distro, I'll recompile it locally.

@dawagner dawagner closed this as completed Jan 5, 2022

rptaylor commented Aug 9, 2022

To fix this I had to update buildah, and also delete and prune the buildah containers and images stored on disk with buildah rm -a and buildah rmi.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 31, 2023