Release stops retrying to connect #76
Comments
Further details about verifying that the connection secret to the GKE cluster seems OK: see the secret referenced by the ProviderConfig. Extracting the kubeconfig field from that secret and using it manually to connect to the cluster worked fine.
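For reference, a minimal sketch of pulling the kubeconfig out of the connection secret and testing it by hand; the secret name and namespace below are placeholders, the real ones are whatever the ProviderConfig's credentials reference points at:

# Placeholder secret name/namespace; use the secret referenced by the ProviderConfig.
kubectl -n crossplane-system get secret gke-cluster-conn \
  -o jsonpath='{.data.kubeconfig}' | base64 -d > /tmp/gke-kubeconfig

# If this lists nodes, the credentials themselves are fine and the problem is elsewhere.
kubectl --kubeconfig /tmp/gke-kubeconfig get nodes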
It should keep retrying with the default shortWait, which is 30 seconds. Every time reconcile fails (at the connect step here), it is re-added to the queue. Do you see anything in the provider-helm controller logs? Just wondering if the controller stopped working somehow.
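As a rough sketch of pulling those logs (the crossplane-system namespace is an assumption and may differ in a hosted control plane; the pod name comes from the listing later in this thread):

# Find the provider-helm pod.
kubectl -n crossplane-system get pods | grep provider-helm

# Substitute the pod name from the listing above.
kubectl -n crossplane-system logs <provider-helm-pod-name> --since=1h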
I reproduced this issue again just now and captured the full pod logs in the gist below. There are a lot of messages like this, but I'm not sure if that's an issue:
Note that the provider-helm pod has no restarts and appears healthy (full pod details in gist):

> kcs get pod crossplane-provider-helm-4b3c12d3669a-7c46b69d8-mzdww
NAME                                                    READY   STATUS    RESTARTS   AGE
crossplane-provider-helm-4b3c12d3669a-7c46b69d8-mzdww   1/1     Running   0          16m

Full gist: https://gist.github.com/jbw976/2d13e79147c2aae564c61a827389de57

Anything else I can look into? Or get you access to?
Could this be a network policy or something else that is blocking provider-helm, running in the hosted crossplane instance in Upbound Cloud, from reaching the GKE cluster? When I successfully used the kubeconfig from the connection secret, I was doing that from my own laptop.
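One hedged way to rule that out is to list any NetworkPolicies in the control plane; an empty result rules out that layer, though it says nothing about firewalls outside the cluster:

# "No resources found" means no NetworkPolicy objects exist to block the traffic.
kubectl get networkpolicy --all-namespaces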
We don't have an egress policy in Upbound Cloud that would block that kind of traffic, so I don't think so. A couple of things I am wondering:
Otherwise, it would be helpful if I could get access to the environment next time to debug further.
Hmm, there is an interesting point in the logs you shared, which indicates that installation of the helm release did start.
Dear all, I am experiencing exactly the same problem described by @jbw976.
Super cool that you got it working, @idallaserra! Can you explain in a little more detail what "using Cloud NAT" means? Any docs you could point me to that would help me understand the details?

It seemed to me like two different problems though: my provider-helm couldn't even connect to the newly provisioned GKE cluster at all, even with my changes to enable basic auth in the GKE cluster in upbound/platform-ref-multi-k8s#6 that are now published in v0.0.4 at https://cloud.upbound.io/registry/upbound/platform-ref-multi-k8s. For you, though, provider-helm seemed to be able to connect to the GKE cluster, but container pulls from inside the GKE cluster were not working. That felt like a different issue to me, and I'd love to hear more about the Cloud NAT you used to get that working.

@turkenh I'll try to get more debugging details (and get you access to the repro instance) the next time I repro this. Thanks for your help so far! 🙇♂️

Edit: ah, now I'm seeing that @turkenh mentioned the evidence for my provider-helm connecting to the cluster OK but just not finishing the job. Perhaps @idallaserra has it figured out with the Cloud NAT 💪
For the Cloud NAT, the problem is well explained here:
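Separately from that reference, a minimal sketch of what enabling Cloud NAT with gcloud typically looks like; the router and NAT names, network, and region below are placeholders and would need to match the private cluster's VPC and region:

# Create a Cloud Router in the private cluster's VPC network and region (placeholder values).
gcloud compute routers create crossplane-nat-router \
  --network=default --region=us-central1

# Attach a NAT config so private nodes get outbound internet access (e.g. for image pulls).
gcloud compute routers nats create crossplane-nat \
  --router=crossplane-nat-router --region=us-central1 \
  --auto-allocate-nat-external-ips --nat-all-subnet-ip-ranges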
Just reproduced the issue and here are my findings:
I think this confirms @idallaserra's observations. This StackOverflow question describes the same issue and has a well-written answer. @jbw976, I'm wondering if there is any special reason to enable private cluster nodes in the GKE resource spec here.
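A rough way to confirm whether private nodes are enabled on the provisioned cluster resource; the resource kind and field path below are assumptions and depend on the provider-gcp version in use:

# List all Crossplane managed resources to find the GKE cluster object.
kubectl get managed

# Substitute the kind/name from the listing above; the field mirrors the GKE API's
# privateClusterConfig.enablePrivateNodes, but the exact spec path may differ by provider version.
kubectl get <gke-cluster-kind>/<name> -o yaml | grep -iA3 privateCluster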
And, after waiting long enough, the helm release reports the following status back :)
No, I don't think there is, @turkenh. Most likely I was following another example and didn't realize the impact of that setting 😨 I'll get this patched in the reference platforms that use GKE (GCP and multi-k8s).

Is there anything to fix here in terms of provider-helm reporting status back earlier, so that people in similar scenarios in the future don't get tricked by the red herring of "connection refused"? That was showing to me as the main/obvious thing that was wrong on the Release object. Can that status behavior be improved? 🙏
Yes, once we have implemented this, we will be able to report status earlier and better, instead of being blocked until the helm client returns.
Perfect, @turkenh! Let's close this issue then in favor of #63. Thank you for your support, and thank you to @idallaserra for keeping up the troubleshooting as well! 💪
What happened?
A Release object seems to stop trying to connect if it is unable to do so in the first ~3 minutes. Should it not continue retrying to connect after that? Note in the following status and events, we haven't retried to connect in 51 minutes:
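The status and events showing this appear in the full describe output in the gist below; as a hedged alternative, the condition timestamps can be pulled straight off the Release object (the release name is a placeholder):

# Prints type, status, lastTransitionTime, and reason for each condition on the Release.
kubectl get release <release-name> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.lastTransitionTime}{"\t"}{.reason}{"\n"}{end}'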
The full output of kubectl describe releases can be found in the following gist: https://gist.github.com/jbw976/f3507bfeb78ecceb4cd9c2b58cce6e94

Other observations to note:
- The ProviderConfig and its referenced secret seem OK to me; I was able to use the kubeconfig field of the secret to manually connect to the cluster OK.
- The creation time of the Release object is 2020-12-29T22:04:14Z and the last transition time for the "connect failed" status is 2020-12-29T22:07:14Z, so about 3 minutes after creation.

What environment did it happen in?
crossplane:v1.0.0, provider-helm:v0.5.0
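For reference, a hedged way to confirm those versions from the control plane, assuming the provider was installed through the Crossplane package manager and Crossplane runs in the usual namespace:

# Shows installed provider packages and their health/revision.
kubectl get providers.pkg.crossplane.io

# Core Crossplane version is visible on its deployment image tag (namespace is an assumption).
kubectl -n crossplane-system get deployment crossplane -o jsonpath='{.spec.template.spec.containers[0].image}'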