Zarf fails when pulling from Nvidia's container registry #2408
Comments
@ercanserteli I am not able to reproduce the error you're seeing. Using this:

```yaml
kind: ZarfPackageConfig
metadata:
  name: test-package
  version: 1.0.0
components:
  - name: gpu-operator
    required: true
    charts:
      - name: gpu-operator
        namespace: gpu-operator
        url: https://helm.ngc.nvidia.com/nvidia
        version: v23.9.2
        valuesFiles:
          - ./values.yaml
    images:
      - registry.k8s.io/nfd/node-feature-discovery:v0.14.2
      - nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04
      - nvcr.io/nvidia/gpu-feature-discovery:v0.8.2-ubi8
      - nvcr.io/nvidia/k8s-device-plugin:v0.14.5-ubi8
      - nvcr.io/nvidia/k8s/container-toolkit:v1.14.6-ubuntu20.04
      - nvcr.io/nvidia/gpu-operator:v23.9.2
      - nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
      - nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.5
```
Note that I had to add https://catalog.ngc.nvidia.com/orgs/nvidia/teams/k8s/containers/dcgm-exporter/tags. In the output you provided, it says …
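For reference, the reproduction run with that config is just the standard create command executed from the directory holding the `zarf.yaml` and `values.yaml` (a minimal sketch; the values file contents are not shown here):

```sh
# Run from the directory containing zarf.yaml and values.yaml
zarf package create --confirm
```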
After waiting a few hours and trying again, I am seeing the error now. I suspect this is an issue related to NVIDIA's registry somehow, as we have not seen this problem occur with other registries that I'm aware of.
You are right, the sample outputs I added were from running with the whole production zarf.yaml. In any case, I believe that Zarf should handle failed image layer downloads more gracefully so that they don't get cached in a corrupted state. If that were fixed, Zarf's retry mechanism could work successfully, and the sporadic registry errors would no longer block package creation.
Is there any possible workaround for this problem? For example, doing `docker pull` on the images manually works, but I do not know if there is a way to make Zarf use the local docker cache. I also tried setting up a pull-through cache on AWS ECR, but it seems they don't support Nvidia's registry. Any ideas on a workaround would be great so that we can create packages in the meantime.
Yes, if Zarf does not find an image, it will pull from the local docker image store. I'm not sure if Zarf will still fall back to the local docker store if it sees an image in a remote and then fails to pull it. You may have to rename / retag the images.
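If Zarf keeps trying the remote because the image name still resolves there, retagging is one way to force it onto the local store, as suggested above; a rough sketch (the `local/` prefix is just an illustrative name that no remote registry serves):

```sh
# Pull once with docker, which works reliably against nvcr.io
docker pull nvcr.io/nvidia/gpu-operator:v23.9.2

# Retag under a name that only exists in the local docker image store,
# then reference that name in zarf.yaml instead of the nvcr.io one
docker tag nvcr.io/nvidia/gpu-operator:v23.9.2 local/nvidia/gpu-operator:v23.9.2
```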
Thank you, this worked as a workaround! For anyone with the same problem: I first modified the hosts file to make nvcr.io unreachable and it used the local docker images, but it was extremely slow. Instead, setting up a local registry, pushing all the images to it, and pointing the zarf.yaml at those copies worked well.
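A rough sketch of that local-registry workaround, assuming the stock `registry:2` image on `localhost:5000` (names and ports are illustrative, and depending on your Zarf version you may need to allow plain-HTTP/insecure registries):

```sh
# Start a throwaway local registry
docker run -d --name local-registry -p 5000:5000 registry:2

# For each image: pull with docker, retag for the local registry, and push
docker pull nvcr.io/nvidia/gpu-operator:v23.9.2
docker tag nvcr.io/nvidia/gpu-operator:v23.9.2 localhost:5000/nvidia/gpu-operator:v23.9.2
docker push localhost:5000/nvidia/gpu-operator:v23.9.2

# Then change the images in zarf.yaml from nvcr.io/... to localhost:5000/...
```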
@ercanserteli This issue should be fixed as of v0.34.0. If you are still having issues, feel free to reopen.
Environment
Device and OS: Tested with Ubuntu 22.04 and Windows 11, AMD64
App version: v0.32.6
Steps to reproduce
`zarf package create --confirm`
This is the simplest zarf.yaml that I can get the error with, but it is not 100% consistent:
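Roughly, the minimal configs that trigger it look like the maintainer's reproduction config quoted above, trimmed down to a handful of nvcr.io images; an illustrative sketch (the package name is hypothetical):

```yaml
kind: ZarfPackageConfig
metadata:
  name: gpu-operator-test
components:
  - name: gpu-operator
    required: true
    images:
      - nvcr.io/nvidia/gpu-operator:v23.9.2
      - nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2
```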
My usual yaml has the following component, which fails 100% of the time. It seems like the more images there are, the higher the chance it will fail:
For completeness' sake, the contents of `../k8s/base/gpu-operator-values.yaml` are:
Expected result
It should pull the images normally and continue with the package creation.
Actual Result
It fails at the `Loading metadata for n images` stage. It retries 2 more times, but that always results in `expected blob size x, but only wrote 23`. This also saves a corrupted cache, so once it fails this way, it will always fail on new runs until you clean the cache.
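Until that is fixed, clearing Zarf's cache between attempts resets the corrupted layers; a minimal sketch (the manual `rm` assumes the default cache location):

```sh
# Drop Zarf's local cache so a fresh run does not reuse the corrupted blobs
zarf tools clear-cache

# Or, equivalently, remove the default cache directory by hand
rm -rf ~/.zarf-cache
```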
Visual Proof (screenshots, videos, text, etc)
It writes one of two error messages randomly, but each is about an `INTERNAL_ERROR`:
1: (first error output, attached in the original issue)
2: (second error output, attached in the original issue)
Severity/Priority
Very severe, because it completely blocks us from being able to create (and deploy) our package, which needs Nvidia GPU functionality to work.
Additional Context
This has been tested on multiple PCs/servers on different networks and across multiple days, so I don't think it could be a temporary hiccup or rate limiting by nvcr.io. Running `docker pull` on the images manually works just fine. Even if a layer download fails, it should not corrupt the layer cache; Zarf should instead be able to recover on the retry.
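For reference, the manual check is just a pull loop over the same images listed in the config above (a sketch using a subset of the list):

```sh
# Each of these pulls succeeds with docker even when Zarf's own pull fails
for img in \
  nvcr.io/nvidia/gpu-operator:v23.9.2 \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04 \
  nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.2; do
  docker pull "$img"
done
```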