
Intermittent Hangs at crane.Push() on Registry Push #2104

Open
ranimbal opened this issue Oct 26, 2023 · 19 comments

@ranimbal

ranimbal commented Oct 26, 2023

Environment

Device and OS: Rocky 8 EC2
App version: 0.29.2
Kubernetes distro being used: RKE2 v1.26.9+rke2r1
Other: Bigbang v2.11.1

Steps to reproduce

  1. zarf package deploy zarf-package-mvp-cluster-amd64-v5.0.0-alpha.7.tar.zst --confirm -l=debug
  2. About 80% of the time or so, the above command gets stuck at crane.Push(). A retry usually works.

Expected result

That the zarf package deploy... command wouldn't hang, and would continue along.

Actual Result

The zarf package deploy... command gets hung up

Visual Proof (screenshots, videos, text, etc)

DEBUG  2023-10-23T18:37:19Z  -  Pushing ...1.dso.mil/ironbank/neuvector/neuvector/manager:5.1.3
DEBUG  2023-10-23T18:37:19Z  -  crane.Push() /tmp/zarf-3272389118/images:registry1.dso.mil/ironbank/neuvector/neuvector/manager:5.1.3 -> 127.0.0.1:39357/ironbank/neuvector/neuvector/manager:5.1.3-zarf-487612511)
section_end:1698087620:step_script
ERROR: Job failed: execution took longer than 35m0s seconds

Severity/Priority

There is a workaround: keep retrying until the process succeeds.

Additional Context

This looks exactly like #1568, which was closed.

We have a multi-node cluster on AWS EC2, our package size is about 2.9G. Here are a few things that we noticed after some extensive testing:

  • This issue is not seen on a single-node EC2 RKE2 cluster; it seems to occur only on multi-node clusters.
  • Our zarf docker registry is backed by S3. The issue is always seen in this case, but only on a multi-node cluster.
  • If we back the registry with the default PVC (instead of S3), the issue is not seen at all. Since data transfer to S3 is slower than to the EBS-backed PVC, maybe this extra time causes the problem to appear?
  • Disabling or enabling the zarf docker registry HPA doesn't seem to matter either way.
@Racer159 Racer159 added bug 🐞 Something isn't working and removed possible-bug 🐛 labels Nov 16, 2023
@Racer159 Racer159 added this to the (2023.12.05) milestone Nov 16, 2023
@AbrohamLincoln
Contributor

AbrohamLincoln commented Nov 16, 2023

I did some testing on this and here's what I found:

  • I cannot reproduce this with a single node RKE2 cluster
  • I cannot reproduce this with an EKS cluster
  • I can reproduce this fairly consistently with RKE2 on a 2+ node cluster (rough math says ~80% of the time)
  • I changed the CNI from Canal to Calico. While I did still encounter this issue, my rough math says the failure rate dropped down to less than 20%.

While I have not found a smoking gun for this, the testing I've done seems to indicate it might be related to the default RKE2 CNI.

@Racer159
Contributor

Racer159 commented Nov 17, 2023

Yeah, that is what we are leaning toward after some internal testing as well. A potentially interesting data point: do you ever see this issue with zarf package mirror-resources?

https://docs.zarf.dev/docs/the-zarf-cli/cli-commands/zarf_package_mirror-resources#examples

(for the internal registry you can take the first example and swap the passwords and the package - if you don't have git configured just omit that set of flags)
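
For reference, one way that might look when targeting the internal registry (this is just a sketch -- the package name, tunnel port, and password are placeholders, and the credentials come from zarf tools get-creds):

# open a tunnel to the internal registry and note the 127.0.0.1:<port> it prints
zarf connect registry

# in another shell, mirror the package's images through the tunnel
# (omit the --git-* flags if git isn't configured)
zarf package mirror-resources <your-package>.tar.zst \
  --registry-url 127.0.0.1:<tunnel-port> \
  --registry-push-username zarf-push \
  --registry-push-password <zarf-push-password>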

@Racer159
Contributor

(a potential addition to the theory is that other things in the cluster may be stressing it as well)

@Racer159
Contributor

Also, what is the node role layout for your clusters? I have heard reports that if all nodes are control plane nodes, the issue is also not seen.

@ranimbal
Author

Also, what is the node role layout for your clusters? I have heard reports that if all nodes are control plane nodes, the issue is also not seen.

We've always had agent nodes when we saw this issue, whether with 1 or 3 control plane nodes. We've never seen this issue on single node clusters. Haven't tried a cluster with only 3 control plane nodes and no agent nodes.

@docandrew
Contributor

Just to add another data point from what we've seen - we can deploy OK with multi-node clusters but only if the nodes are all RKE2 servers. As soon as we make one an agent, the Zarf registry runs there and we see this behavior as well.

@docandrew
Contributor

Additional agent nodes are OK but we've tainted those so the Zarf registry doesn't run there.
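
For illustration, a sketch of how that taint might be applied (the key/value here are made up for this example -- any taint works as long as the registry pod carries no matching toleration):

# hypothetical taint to keep the zarf registry off a given agent node
kubectl taint nodes <agent-node-name> zarf-registry=exclude:NoSchedule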

@AbrohamLincoln
Contributor

AbrohamLincoln commented Dec 1, 2023

I can confirm that adding a nodeSelector and taint/toleration to schedule the zarf registry pod(s) on the RKE2 control plane node(s) does work around this issue:

kubectl patch deployment -n zarf zarf-docker-registry --patch-file=/dev/stdin <<-EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: zarf-docker-registry
  namespace: zarf
spec:
  template:
    spec:
      nodeSelector:
        node-role.kubernetes.io/master: "true"
      tolerations:
        - key: node-role.kubernetes.io/master
          operator: Exists
          effect: NoSchedule
EOF
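
One way to confirm the patch took effect is to check which node the rescheduled registry pod landed on:

kubectl get pods -n zarf -o wide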

@Racer159 Racer159 modified the milestones: (2023.12.05), The Bucket Dec 12, 2023
Racer159 added a commit that referenced this issue Dec 16, 2023
…2190)

## Description

This PR fixes error channel handling for Zarf tunnels so lost pod
connections don't result in infinite spins. This should mostly resolve
#2104, though it is not marked as "Fixes" because, depending on how many
pod connection errors occur, a deployment could still run out of retries.

## Related Issue

Relates to #2104 

## Type of change

- [X] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Other (security config, docs update, etc)

## Checklist before merging

- [ ] Test, docs, adr added or updated as needed
- [X] [Contributor Guide
Steps](https://github.com/defenseunicorns/zarf/blob/main/CONTRIBUTING.md#developer-workflow)
followed
@AbrohamLincoln
Contributor

Just wanted to chime in and say that this problem is still reproducible with the changes in #2190.
It appears that there isn't an error so the retry does not occur.

@mjnagel
Contributor

mjnagel commented Feb 6, 2024

Just noting we are still encountering this on RKE2 with EBS-backed PVCs. We don't really have any additional details on how/why we encountered this, but we were able to work around it by pushing the image that was hanging "manually", via a small zarf package.

EDIT: To clarify, this was a zarf package that we built with a single component containing the single image that commonly stalled on deploy. We then ran zarf package create/deploy for it and, once that finished, deployed our "real" zarf package, which sped past the image push. Not sure why this worked better, but it seemed to consistently help when we hit stalling images.
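
For anyone wanting to try the same workaround, a rough sketch of such a single-image package (the package/component names are arbitrary, the image is just the one from the log earlier in this issue, and the output tarball name assumes Zarf's usual naming convention):

cat > zarf.yaml <<EOF
kind: ZarfPackageConfig
metadata:
  name: single-image-prepush
components:
  - name: stalled-image
    required: true
    images:
      - registry1.dso.mil/ironbank/neuvector/neuvector/manager:5.1.3
EOF

zarf package create . --confirm
zarf package deploy zarf-package-single-image-prepush-amd64.tar.zst --confirm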

@eddiezane
Member

This is a super longstanding issue upstream that we've been band-aiding for a few years (in Kubernetes land). The root of the issue is that SPDY is long dead but is still used for all streaming functionality in Kubernetes. The current port forward logic depends on SPDY and an implementation that is overdue for a rewrite.

KEP 4006 should be an actual fix as we replace SPDY.

We are currently building mitigations into Zarf to try and address this.

What we really need is an environment where we can replicate the issue and test different fixes. If anyone has any ideas... Historically we've been unable to reproduce this.

@Racer159
Contributor

This should be mitigated now in https://github.com/defenseunicorns/zarf/releases/tag/v0.32.4 - leaving this open until we get more community feedback though (and again, this is a mitigation, not a true fix; that will have to happen upstream).

@Racer159
Contributor

(also thanks to @benmountjoy111 and @docandrew for the .pcap files!)

Noxsios pushed a commit that referenced this issue Mar 8, 2024
## Description

This adds `--backoff` and `--retries` to package operations to allow
those to be configured.

## Related Issue

Relates to #2104

## Type of change

- [ ] Bug fix (non-breaking change which fixes an issue)
- [X] New feature (non-breaking change which adds functionality)
- [ ] Other (security config, docs update, etc)

## Checklist before merging

- [x] Test, docs, adr added or updated as needed
- [X] [Contributor Guide
Steps](https://github.com/defenseunicorns/zarf/blob/main/CONTRIBUTING.md#developer-workflow)
followed

Signed-off-by: Eddie Zaneski <eddiezane@gmail.com>
Co-authored-by: Eddie Zaneski <eddiezane@gmail.com>
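
For reference, using the new flags from that change on a deploy that keeps hitting transient push failures might look like the following (the retry count is arbitrary; see zarf package deploy --help for the --backoff value format):

zarf package deploy <your-package>.tar.zst --confirm --retries 10
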
@YrrepNoj
Member

This should be mitigated now in https://github.com/defenseunicorns/zarf/releases/tag/v0.32.4 - leaving this open until we get more community feedback though (and again, this is a mitigation, not a true fix; that will have to happen upstream).

Sadly, I do not think this solves the issue. I am still experiencing timeouts when publishing images. I am noticing that Zarf is now explicitly timing out instead of just hanging forever though.

Screenshot 2024-03-22 at 4 33 58 PM

@eddiezane
Member

kubernetes/kubernetes#117493 should fix this upstream. Hopefully we can get it merged and backported.

@mjnagel
Contributor

mjnagel commented May 1, 2024

Following up here to see if there's any more clarity on the exact issue we're facing... based on the above comments, it seems like the current suspicion is that the issue originates from the kubectl port-forward/tunneling? Is that accurate @eddiezane?

In some testing on our environment we've consistently had failures with large image pushes. This is happening in the context of a UDS bundle, so not directly zarf, but it's effectively just looping through each package to deploy. Our common error currently looks like the one above, with timeouts.

We have however had success pushing the images two different ways:

  • A single-component zarf package just containing the image (no manifests/charts), created/deployed on the cluster: this succeeds at pushing the image pretty consistently (most recently we did this with 2-12 GB images and all pushed).
  • A kubectl port-forward to the zarf registry plus docker push commands: this also seems to work consistently (see the sketch at the end of this comment).

I think where I'm confused in all this is that I'd assume either of these workarounds would hit the same limitations with port-forwarding/tunneling. Is there anything to glean from this experience that might help explain the issue better or why these methods seem to work far more consistently? As @YrrepNoj mentioned above, we're able to hit this pretty much 100% consistently with our bundle deploy containing the Leapfrog images and haven't found any success outside of these workarounds.
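
For completeness, a sketch of that second workaround (the service name, the internal port 5000, and the zarf-push credentials are assumptions pieced together from other comments in this thread):

# forward a local port to the in-cluster registry service
kubectl port-forward -n zarf svc/zarf-docker-registry 5000:5000 &

# log in, retag, and push the stalled image through the forwarded port
docker login 127.0.0.1:5000 -u zarf-push -p <zarf-push-password>
docker tag <image> 127.0.0.1:5000/<image-path>
docker push 127.0.0.1:5000/<image-path>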

@RyanTepera1

RyanTepera1 commented May 2, 2024

A workaround that has seemed to work for me consistently to get past this particular issue is to use zarf package mirror-resources concurrently with a zarf connect git tunnel open, and mirror the package's internal resources to the specified image registry and git repository. I use the IP address of the node that the zarf-docker-registry is running on and the NodePort the zarf-docker-registry service is using for the --registry-url. Authentication is also required with --registry-push-username/password and --git-push-username/password, obtained by running zarf tools get-creds. For example:

zarf package mirror-resources zarf-package-structsure-enterprise-amd64-v5.9.0.tar.zst --registry-url <IP address of node zarf-docker-registry is running on>:31999 --registry-push-username zarf-push --registry-push-password <zarf-push-password> --git-url http://127.0.0.1:<tunnel port from zarf connect git> --git-push-username zarf-git-user --git-push-password <git-user-password>

Proof:
Running zarf package deploy zarf-package-structsure-enterprise-amd64-v5.0.0.tar.zst --confirm has errored every time when it gets stuck on a specific blob, unable to push the image:
Screenshot 2024-05-02 at 4 16 35 PM

The zarf package mirror-resources command, with a zarf connect git tunnel open, successfully pushing the same image to the zarf-docker-registry that previously always got stuck during a zarf package deploy command:
Screenshot 2024-05-02 at 4 27 45 PM

After the zarf package’s internal resources are mirrored to the specified registry and git repository, a zarf package deploy zarf-package-structsure-enterprise-amd64-v5.0.0.tar.zst --confirm is successful.

@philiversen

I am seeing the same behavior as @RyanTepera1 using RKE2 v1.28.3+rke2r2 on EC2 instances with an NFS-based storage class. I have not seen this on EKS clusters using an EBS-based storage class. I also haven't tried the zarf connect git trick yet, but I'll be trying that soon!

One additional thing I've noticed is that using zarf package mirror-resources --zarf-url <ip>:31999 ... doesn't seem to completely hang, but it slows to a crawl, taking hours to make a small amount of progress. However, if I kill the zarf-docker-registry-* pod, progress seems to resume at normal speed. I was able to get through a large package with multiple 2+ GB images in a single run by monitoring and occasionally killing zarf-docker-registry pods to get things moving again.

For example, pushing this sonarqube image took nearly 5 minutes to get from 39% to 41%, but after killing the zarf-docker-registry pod, it pushed the image in less than a minute.

[image attached]
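
A sketch of that pod-kill step (the Deployment recreates the registry pod automatically; grab the current pod name from kubectl get pods first):

kubectl get pods -n zarf
kubectl delete pod -n zarf <zarf-docker-registry-pod-name>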

@philiversen

Moving the zarf-docker-registry pods to one of the RKE2 master nodes as suggested here did not improve performance for my deployment. I tried this with zarf package deploy... and zarf package mirror-resources.... In both cases, when image pushes slowed way down, killing zarf-docker-registry pods would get things moving again. This was much easier using the zarf package mirror-resources... approach.

Labels: bug 🐞 Something isn't working
Projects: Status: In progress; Status: No status
Development: No branches or pull requests

10 participants