
Intermittent Hangs at crane.Push() on Registry Push #2104

Open
ranimbal opened this issue Oct 26, 2023 · 19 comments

@ranimbal

ranimbal commented Oct 26, 2023

Environment

Device and OS: Rocky 8 EC2
App version: 0.29.2
Kubernetes distro being used: RKE2 v1.26.9+rke2r1
Other: Bigbang v2.11.1

Steps to reproduce

  1. zarf package deploy zarf-package-mvp-cluster-amd64-v5.0.0-alpha.7.tar.zst --confirm -l=debug
  2. About 80% of the time or so, the above command gets stuck at crane.Push(). A retry usually works.

Expected result

That the zarf package deploy... command wouldn't hang, and would continue along.

Actual Result

The zarf package deploy... command gets hung up

Visual Proof (screenshots, videos, text, etc)

DEBUG  2023-10-23T18:37:19Z  -  Pushing ...1.dso.mil/ironbank/neuvector/neuvector/manager:5.1.3
DEBUG  2023-10-23T18:37:19Z  -  crane.Push() /tmp/zarf-3272389118/images:registry1.dso.mil/ironbank/neuvector/neuvector/manager:5.1.3 -> 127.0.0.1:39357/ironbank/neuvector/neuvector/manager:5.1.3-zarf-487612511)
section_end:1698087620:step_script
ERROR: Job failed: execution took longer than 35m0s seconds

Severity/Priority

There is a workaround: keep retrying until the process succeeds.

Additional Context

This looks exactly like #1568, which was closed.

We have a multi-node cluster on AWS EC2, our package size is about 2.9G. Here are a few things that we noticed after some extensive testing:

  • This issue is not seen on a single-node EC2 RKE2 cluster; it seems to occur only on multi-node clusters.
  • Our zarf docker registry is backed by S3. The issue is always seen in this case, but only on a multi-node cluster.
  • If we back the registry with the default PVC (instead of S3), the issue is not seen at all. Since data transfer to S3 is slower than to the EBS-backed PVC, maybe this extra time causes the problem to appear?
  • Disabling or enabling the zarf docker registry HPA doesn't seem to matter either way.
@Racer159 Racer159 added bug 🐞 Something isn't working and removed possible-bug 🐛 labels Nov 16, 2023
@Racer159 Racer159 added this to the (2023.12.05) milestone Nov 16, 2023
@AbrohamLincoln
Contributor

AbrohamLincoln commented Nov 16, 2023

I did some testing on this and here's what I found:

  • I cannot reproduce this with a single node RKE2 cluster
  • I cannot reproduce this with an EKS cluster
  • I can reproduce this fairly consistently with RKE2 on a 2+ node cluster (rough math says ~80% of the time)
  • I changed the CNI from Canal to Calico. While I did still encounter this issue, my rough math says the failure rate dropped down to less than 20%.

While I have not found a smoking gun for this, the testing I've done seems to indicate it might be related to the default RKE2 CNI.

@Racer159
Contributor

Racer159 commented Nov 17, 2023

Yeah, that is what we are leaning toward after some internal testing as well. A potentially interesting data point: do you ever see this issue with zarf package mirror-resources?

https://docs.zarf.dev/docs/the-zarf-cli/cli-commands/zarf_package_mirror-resources#examples

(for the internal registry you can take the first example and swap the passwords and the package - if you don't have git configured just omit that set of flags)
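
For reference, one way that might look when targeting the internal registry (this is just a sketch -- the package name, tunnel port, and password are placeholders, and the credentials come from zarf tools get-creds):

# open a tunnel to the internal registry and note the 127.0.0.1:<port> it prints
zarf connect registry

# in another shell, mirror the package's images through the tunnel
# (omit the --git-* flags if git isn't configured)
zarf package mirror-resources <your-package>.tar.zst \
  --registry-url 127.0.0.1:<tunnel-port> \
  --registry-push-username zarf-push \
  --registry-push-password <zarf-push-password>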

@Racer159
Contributor

(a potential addition to the theory is that other things in the cluster may be stressing it as well)

@Racer159
Contributor

Also, what is the node role layout for your clusters? I have heard reports that if all nodes are control plane nodes, the issue is also not seen.

@ranimbal
Author

Also, what is the node role layout for your clusters? I have heard reports that if all nodes are control plane nodes, the issue is also not seen.

We've always had agent nodes when we saw this issue, whether with 1 or 3 control plane nodes. We've never seen this issue on single node clusters. Haven't tried a cluster with only 3 control plane nodes and no agent nodes.

@docandrew
Contributor

Just to add another data point from what we've seen - we can deploy OK with multi-node clusters but only if the nodes are all RKE2 servers. As soon as we make one an agent, the Zarf registry runs there and we see this behavior as well.

@docandrew
Contributor

Additional agent nodes are OK but we've tainted those so the Zarf registry doesn't run there.
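
For illustration, a sketch of how that taint might be applied (the key/value here are made up for this example -- any taint works as long as the registry pod carries no matching toleration):

# hypothetical taint to keep the zarf registry off a given agent node
kubectl taint nodes <agent-node-name> zarf-registry=exclude:NoSchedule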

@AbrohamLincoln
Contributor

AbrohamLincoln commented Dec 1, 2023

I can confirm that adding a nodeSelector and taint/toleration to schedule the zarf registry pod(s) on the RKE2 control plane node(s) does work around this issue:

kubectl patch deployment -n zarf zarf-docker-registry --patch-file=/dev/stdin <<-EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zarf-docker-registry
  namespace: zarf
spec:
  template:
    spec:
      nodeSelector:
        node-role.kubernetes.io/master: "true"
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
EOF
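
One way to confirm the patch took effect is to check which node the rescheduled registry pod landed on:

kubectl get pods -n zarf -o wide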

@Racer159 Racer159 modified the milestones: (2023.12.05), The Bucket Dec 12, 2023
Racer159 added a commit that referenced this issue Dec 16, 2023
…2190)

## Description

This PR fixes error channel handling for Zarf tunnels so lost pod
connections don't result in infinite spins. This should mostly resolve
#2104, though it is not marked as "Fixes" because, depending on how many
pod connection errors occur, a deployment could still run out of retries.

## Related Issue

Relates to #2104 

## Type of change

- [X] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Other (security config, docs update, etc)

## Checklist before merging

- [ ] Test, docs, adr added or updated as needed
- [X] [Contributor Guide
Steps](https://github.com/defenseunicorns/zarf/blob/main/CONTRIBUTING.md#developer-workflow)
followed
@AbrohamLincoln
Contributor

Just wanted to chime in and say that this problem is still reproducible with the changes in #2190.
It appears that there isn't an error so the retry does not occur.

@mjnagel
Contributor

mjnagel commented Feb 6, 2024

Just noting we are still encountering this on RKE2 with EBS-backed PVCs. We don't really have any additional details on how/why we encountered this, but we were able to work around it by pushing the image that was hanging "manually", via a small zarf package.

EDIT: To clarify, this was a zarf package that we built with a single component containing the single image that commonly stalled on deploy. We then ran zarf package create/deploy for it and, once that finished, deployed our "real" zarf package, which sped past the image push. Not sure why this worked better, but it seemed to consistently help when we hit stalling images.
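
For anyone wanting to try the same workaround, a rough sketch of such a single-image package (the package/component names are arbitrary, the image is just the one from the log earlier in this issue, and the output tarball name assumes Zarf's usual naming convention):

cat > zarf.yaml <<EOF
kind: ZarfPackageConfig
metadata:
  name: single-image-prepush
components:
  - name: stalled-image
    required: true
    images:
      - registry1.dso.mil/ironbank/neuvector/neuvector/manager:5.1.3
EOF

zarf package create . --confirm
zarf package deploy zarf-package-single-image-prepush-amd64.tar.zst --confirm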

@eddiezane
Member

This is a super longstanding issue upstream that we've been band-aiding for a few years (in Kubernetes land). The root of the issue is that SPDY is long dead but is still used for all streaming functionality in Kubernetes. The current port forward logic depends on SPDY and an implementation that is overdue for a rewrite.

KEP 4006 should be an actual fix as we replace SPDY.

We are currently building mitigations into Zarf to try and address this.

What we really need is an environment where we can replicate the issue and test different fixes. If anyone has any ideas... Historically we've been unable to reproduce this.

@Racer159
Contributor

This should be mitigated now in https://github.com/defenseunicorns/zarf/releases/tag/v0.32.4 - leaving this open until we get more community feedback though (and again, this is a mitigation, not a true fix; that will have to happen upstream).

@Racer159
Contributor

(also thanks to @benmountjoy111 and @docandrew for the .pcap files!)

Noxsios pushed a commit that referenced this issue Mar 8, 2024
## Description

This adds `--backoff` and `--retries` to package operations to allow
those to be configured.

## Related Issue

Relates to #2104

## Type of change

- [ ] Bug fix (non-breaking change which fixes an issue)
- [X] New feature (non-breaking change which adds functionality)
- [ ] Other (security config, docs update, etc)

## Checklist before merging

- [x] Test, docs, adr added or updated as needed
- [X] [Contributor Guide
Steps](https://github.com/defenseunicorns/zarf/blob/main/CONTRIBUTING.md#developer-workflow)
followed

Signed-off-by: Eddie Zaneski <eddiezane@gmail.com>
Co-authored-by: Eddie Zaneski <eddiezane@gmail.com>
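
For reference, using the new flags from that change on a deploy that keeps hitting transient push failures might look like the following (the retry count is arbitrary; see zarf package deploy --help for the --backoff value format):

zarf package deploy <your-package>.tar.zst --confirm --retries 10
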
@YrrepNoj
Member

This should be mitigated now in https://github.com/defenseunicorns/zarf/releases/tag/v0.32.4 - leaving this open until we get more community feedback though (and again, this is a mitigation, not a true fix; that will have to happen upstream).

Sadly, I do not think this solves the issue. I am still experiencing timeouts when publishing images. I am noticing that Zarf is now explicitly timing out instead of just hanging forever though.

Screenshot 2024-03-22 at 4 33 58 PM

@eddiezane
Member

kubernetes/kubernetes#117493 should fix this upstream. Hopefully we can get it merged and backported.

@mjnagel
Contributor

mjnagel commented May 1, 2024

Following up here to see if there's any more clarity on the exact issue we're facing... based on the above comments, it seems like the current suspicion is that the issue originates from the kubectl port-forward/tunneling? Is that accurate @eddiezane?

In some testing on our environment we've consistently had failures with large image pushes. This is happening in the context of a UDS bundle, so not directly zarf, but it's effectively just looping through each package to deploy. Our common error currently looks like the one above, with timeouts.

We have however had success pushing the images two different ways:

  • A single-component zarf package just containing the image (no manifests/charts), created/deployed on the cluster: this succeeds at pushing the image pretty consistently (most recently we did this with 2-12 GB images and all pushed).
  • A kubectl port-forward to the zarf registry plus docker push commands: this also seems to work consistently (see the sketch at the end of this comment).

I think where I'm confused in all this is that I'd assume either of these workarounds would hit the same limitations with port-forwarding/tunneling. Is there anything to glean from this experience that might help explain the issue better or why these methods seem to work far more consistently? As @YrrepNoj mentioned above, we're able to hit this pretty much 100% consistently with our bundle deploy containing the Leapfrog images and haven't found any success outside of these workarounds.
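
For completeness, a sketch of that second workaround (the service name, the internal port 5000, and the zarf-push credentials are assumptions pieced together from other comments in this thread):

# forward a local port to the in-cluster registry service
kubectl port-forward -n zarf svc/zarf-docker-registry 5000:5000 &

# log in, retag, and push the stalled image through the forwarded port
docker login 127.0.0.1:5000 -u zarf-push -p <zarf-push-password>
docker tag <image> 127.0.0.1:5000/<image-path>
docker push 127.0.0.1:5000/<image-path>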

@RyanTepera1

RyanTepera1 commented May 2, 2024

A workaround that has seemed to work for me consistently to get past this particular issue is to use zarf package mirror-resources concurrently with a zarf connect git tunnel open, and mirror the package's internal resources to the specified image registry and git repository. I use the IP address of the node that the zarf-docker-registry is running on and the NodePort the zarf-docker-registry service is using for the --registry-url. Authentication is also required with --registry-push-username/password and --git-push-username/password, obtained by running zarf tools get-creds. For example:

zarf package mirror-resources zarf-package-structsure-enterprise-amd64-v5.9.0.tar.zst --registry-url <IP address of node zarf-docker-registry is running on>:31999 --registry-push-username zarf-push --registry-push-password <zarf-push-password> --git-url http://127.0.0.1:<tunnel port from zarf connect git> --git-push-username zarf-git-user --git-push-password <git-user-password>

Proof:
Running zarf package deploy zarf-package-structsure-enterprise-amd64-v5.0.0.tar.zst --confirm has errored every time when it gets stuck on a specific blob, unable to push the image:
Screenshot 2024-05-02 at 4 16 35 PM

The zarf package mirror-resources command, with a zarf connect git tunnel open, successfully pushing the same image to the zarf-docker-registry that previously always got stuck during a zarf package deploy command:
Screenshot 2024-05-02 at 4 27 45 PM

After the zarf package’s internal resources are mirrored to the specified registry and git repository, a zarf package deploy zarf-package-structsure-enterprise-amd64-v5.0.0.tar.zst --confirm is successful.

@philiversen

I am seeing the same behavior as @RyanTepera1 using RKE2 v1.28.3+rke2r2 on EC2 instances with an NFS-based storage class. I have not seen this on EKS clusters using an EBS-based storage class. I also haven't tried the zarf connect git trick yet, but I'll be trying that soon!

One additional thing I've noticed is that using zarf package mirror-resources --zarf-url <ip>:31999 ... doesn't seem to completely hang, but it slows to a crawl, taking hours to make a small amount of progress. However, if I kill the zarf-docker-registry-* pod, progress seems to resume at normal speed. I was able to get through a large package with multiple 2+ GB images in a single run by monitoring and occasionally killing zarf-docker-registry pods to get things moving again.

For example, pushing this sonarqube image took nearly 5 minutes to get from 39% to 41%, but after killing the zarf-docker-registry pod, it pushed the image in less than a minute.

[image attached]
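
A sketch of that pod-kill step (the Deployment recreates the registry pod automatically; grab the current pod name from kubectl get pods first):

kubectl get pods -n zarf
kubectl delete pod -n zarf <zarf-docker-registry-pod-name>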

@philiversen

Moving the zarf-docker-registry pods to one of the RKE2 master nodes as suggested here did not improve performance for my deployment. I tried this with zarf package deploy... and zarf package mirror-resources.... In both cases, when image pushes slowed way down, killing zarf-docker-registry pods would get things moving again. This was much easier using the zarf package mirror-resources... approach.

Labels: bug 🐞 Something isn't working
Projects: Status: In progress; Status: No status
Development: No branches or pull requests

10 participants