
Release stops retrying to connect #76

Closed
jbw976 opened this issue Dec 29, 2020 · 14 comments
Labels
bug Something isn't working

Comments

jbw976 (Member) commented Dec 29, 2020

What happened?

A Release object seems to stop trying to connect if it is unable to do so within the first ~3 minutes. Shouldn't it keep retrying to connect indefinitely after that?

Note that in the following status and events, the Release hasn't retried to connect in 51 minutes:

Status:
  At Provider:
  Conditions:
    Last Transition Time:  2020-12-29T22:07:14Z
    Message:               connect failed: cannot create new Kubernetes client: cannot create Kubernetes client: Get "https://35.236.80.135/api?timeout=32s": dial tcp 35.236.80.135:443: connect: connection refused
    Reason:                ReconcileError
    Status:                False
    Type:                  Synced
Events:
  Type     Reason                   Age                From                                Message
  ----     ------                   ----               ----                                -------
  Warning  CannotConnectToProvider  54m (x2 over 54m)  managed/release.helm.crossplane.io  provider could not be retrieved: ProviderConfig.helm.crossplane.io "multik8s-cluster-gcp" not found
  Warning  CannotConnectToProvider  52m (x5 over 54m)  managed/release.helm.crossplane.io  cannot create new Kubernetes client: cannot create Kubernetes client: host must be a URL or a host:port pair: "https://"
  Warning  CannotConnectToProvider  51m                managed/release.helm.crossplane.io  cannot create new Kubernetes client: cannot create Kubernetes client: Get "https://35.236.80.135/api?timeout=32s": dial tcp 35.236.80.135:443: i/o timeout
  Warning  CannotConnectToProvider  51m (x2 over 51m)  managed/release.helm.crossplane.io  cannot create new Kubernetes client: cannot create Kubernetes client: Get "https://35.236.80.135/api?timeout=32s": dial tcp 35.236.80.135:443: connect: connection refused

The full output of kubectl describe releases can be found in the following gist: https://gist.github.com/jbw976/f3507bfeb78ecceb4cd9c2b58cce6e94

Other observations to note:

  • The ProviderConfig and its referenced secret seem OK to me; I was able to use the kubeconfig field of the secret to connect to the cluster manually.
  • The Release object was created at 2020-12-29T22:04:14Z and the last transition time for the "connect failed" status is 2020-12-29T22:07:14Z, about 3 minutes after creation (a quick way to pull both timestamps is sketched below).
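
A rough sketch of pulling both of those timestamps directly from the Release (the release name is a placeholder):

# Print the creation timestamp and the Synced condition's last transition time
kubectl get release.helm.crossplane.io <release-name> \
  -o jsonpath='{.metadata.creationTimestamp}{"\n"}{.status.conditions[?(@.type=="Synced")].lastTransitionTime}{"\n"}'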

What environment did it happen in?

crossplane:v1.0.0, provider-helm:v0.5.0

jbw976 added the bug label Dec 29, 2020
jbw976 (Member, Author) commented Dec 29, 2020

Some further details on verifying that the connection secret for the GKE cluster seems OK. See the secret referenced by the ProviderConfig:

> k get providerconfig.helm.crossplane.io multik8s-cluster-gcp -o yaml
...
spec:
  credentials:
    secretRef:
      key: kubeconfig
      name: 2b80fbbb-076b-4908-9034-71db73c3cc86-gkecluster
      namespace: upbound-system
    source: Secret

Extracting the data.kubeconfig field from that secret into a file and then using that file to run kubectl get node worked OK:

> kubectl -n upbound-system get secret 2b80fbbb-076b-4908-9034-71db73c3cc86-gkecluster -o jsonpath='{.data.kubeconfig}' | base64 -d > kubeconfig-gke
> k --kubeconfig=kubeconfig-gke get node
NAME                                                  STATUS   ROLES    AGE   VERSION
gke-cluster-gcp-tkbx-cluster-gcp-s59h-366d1ae5-9nxr   Ready    <none>   64m   v1.16.15-gke.7300
gke-cluster-gcp-tkbx-cluster-gcp-s59h-894e22ae-p5c9   Ready    <none>   65m   v1.16.15-gke.7300
gke-cluster-gcp-tkbx-cluster-gcp-s59h-f5400ebc-zp8n   Ready    <none>   65m   v1.16.15-gke.7300

turkenh (Collaborator) commented Dec 30, 2020

It should keep retrying with the default shortWait, which is 30 seconds. Every time a reconcile fails (at the connect step here), the object is re-added to the queue.
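
One rough way to verify whether connect attempts keep happening is to watch the Release events (a sketch; the reason string is the one from the status above):

# Watch for repeated connect failures on Release objects across all namespaces
kubectl get events -A --watch \
  --field-selector involvedObject.kind=Release,reason=CannotConnectToProvider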

Do you see anything in the provider-helm controller logs? Just wondering if the controller somehow stopped working.

jbw976 (Member, Author) commented Dec 31, 2020

I reproduced this issue again just now and captured the full pod logs in the gist below. There are a lot of messages like this one, but I'm not sure if that's an issue:

E1231 19:10:49.562709       1 memcache.go:206] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request

Note that the provider-helm pod has no restarts and appears healthy (full pod details in the gist):

> kcs get pod crossplane-provider-helm-4b3c12d3669a-7c46b69d8-mzdww
NAME                                                    READY   STATUS    RESTARTS   AGE
crossplane-provider-helm-4b3c12d3669a-7c46b69d8-mzdww   1/1     Running   0          16m

full gist: https://gist.github.com/jbw976/2d13e79147c2aae564c61a827389de57

Anything else I can look into, or get you access to?

jbw976 (Member, Author) commented Jan 1, 2021

Could this be a network policy or something that is blocking provider-helm in the hosted crossplane instance in Upbound Cloud from being able to reach the GKE cluster? When I successfully used the kubeconfig from the connection secret, I was doing that from my own laptop.

turkenh (Collaborator) commented Jan 1, 2021

Could this be a network policy or something that is blocking provider-helm in the hosted crossplane instance in Upbound Cloud from being able to reach the GKE cluster? When I successfully used the kubeconfig from the connection secret, I was doing that from my own laptop.

We don't have an egress policy in Upbound Cloud that would block that kind of traffic, so I don't think so.

A couple of things I am wondering:

  • Is it 100% reproducible?
  • Does the controller stop reconciling the CR, or does it keep retrying but hit some sort of error that somehow isn't reflected in the CR status? Controller logs could help here (thanks for the logs above @jbw976, but I think we would need the --debug flag on the controller to better understand what is going on; a log-collection command is sketched at the end of this comment).
  • What happens if we restart the controller?

Otherwise, it would be helpful if I could get access to the environment next time to debug further.
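
In the meantime, capturing the full controller logs would help (a rough sketch; the namespace and pod name are the ones from this thread, adjust to your environment):

# Dump the last hour of provider-helm controller logs to a file
kubectl -n upbound-system logs \
  crossplane-provider-helm-4b3c12d3669a-7c46b69d8-mzdww \
  --since=1h > provider-helm.log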

turkenh (Collaborator) commented Jan 1, 2021

Hmm, interesting point in the logs you shared:

2020/12/31 19:10:47 info: skipping unknown hook: "crd-install"

which indicates that the helm installation started at 2020/12/31 19:10:47. So the helm controller does seem to connect to the remote cluster, but the installation probably didn't complete (yet?) for some reason. (AFAIR, that chart contains a pre-install helm hook, which might be related.) So, I would definitely check what is going on on the cluster side, for example:

helm ls --all --all-namespaces
kubectl get jobs -n operators
kubectl get pods -n operators
kubectl get all -n operators
...

idallaserra commented:

I am experiencing exactly the same problem described by @jbw976.
But in my opinion the problem is not related to the helm or GCP providers.
In my case the issue is that I am testing the GCP part of platform-ref-multi-k8s, and the GKE cluster created by that Crossplane repo is private!
So only images already in the cache work. In fact, trying to install some example charts with plain helm (no provider-helm) gave me pull errors. I fixed this using Cloud NAT and now everything works.

jbw976 (Member, Author) commented Jan 5, 2021

Super cool that you got it working, @idallaserra! Can you explain in a little more detail what "using cloud nat" means? Any docs you could point me to that would help me understand the details?

It seemed to me like 2 different problems though: my provider-helm couldn't connect to the newly provisioned GKE cluster at all, even with my changes to enable basic auth in the GKE cluster in upbound/platform-ref-multi-k8s#6, which are now published in v0.0.4 at https://cloud.upbound.io/registry/upbound/platform-ref-multi-k8s.

For you though, your provider-helm seemed to be able to connect to the GKE cluster, but then container pulls from inside the GKE cluster were not working. That felt like a different issue to me, and I'd love to hear more about the "cloud nat" you used to get that working.

@turkenh I'll try to get more debugging details (and get you access to the repro instance) the next time I repro this. Thanks for your help so far! 🙇‍♂️

edit: ah, now I'm seeing that @turkenh mentioned evidence of my provider-helm connecting to the cluster OK but just not finishing the job. Perhaps @idallaserra has it figured out with the "cloud nat" 💪

idallaserra commented Jan 5, 2021

The Cloud NAT problem is well explained here:
https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters#docker_hub in "Can't pull image from public Docker Hub".
In short, the created cluster is private, so the nodes have no internet connectivity, meaning no docker pulls.
Here is a document, old but still valid, explaining how to set this up.
The best option, in my opinion, would be Cloud NAT support in provider-gcp (still lacking, it seems to me) to permit creating it directly in Crossplane.
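
For reference, a minimal sketch of what "using Cloud NAT" looks like with gcloud (the router/gateway names, network, and region below are placeholders, not my actual values):

# Create a Cloud Router in the cluster's VPC and region
gcloud compute routers create nat-router \
  --network=default --region=us-west2

# Attach a NAT gateway so the private GKE nodes get outbound internet access (e.g. Docker Hub pulls)
gcloud compute routers nats create nat-gateway \
  --router=nat-router --region=us-west2 \
  --auto-allocate-nat-external-ips \
  --nat-all-subnet-ip-ranges
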
Hope this is useful :-)
Ivan

turkenh (Collaborator) commented Jan 19, 2021

Just reproduced the issue and here are my findings:

  • Helm Release status is the same as the one given at the beginning of this issue.
  • On the created GKECluster, I see:
    1. operators namespace created by provider-helm
    2. helm release is stuck as pending-install:
    helm ls -a
    NAME                      	NAMESPACE	REVISION	UPDATED                              	STATUS         	CHART                       	APP VERSION
    vault1-cluster-p7pfn-mkbc6	operators	1       	2021-01-19 14:08:47.1636178 +0000 UTC	pending-install	kube-prometheus-stack-10.1.0	0.42.1
    
    3. pre-install job in the prometheus chart created and its pod stuck as ImagePullBackOff with the following events:
     Events:
       Type     Reason     Age                   From               Message
       ----     ------     ----                  ----               -------
       Normal   Scheduled  11m                   default-scheduler  Successfully assigned operators/vault1-cluster-p7pfn-mkbc6-admission-create-8m2qr to gke-vault1-cluster-p-vault1-cluster-p-d7ca7c52-3465
       Normal   Pulling    8m36s (x4 over 11m)   kubelet            Pulling image "jettech/kube-webhook-certgen:v1.2.1"
       Warning  Failed     8m21s (x4 over 10m)   kubelet            Failed to pull image "jettech/kube-webhook-certgen:v1.2.1": rpc error: code = Unknown desc = Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
       Warning  Failed     8m21s (x4 over 10m)   kubelet            Error: ErrImagePull
       Normal   BackOff    5m57s (x14 over 10m)  kubelet            Back-off pulling image "jettech/kube-webhook-certgen:v1.2.1"
       Warning  Failed     63s (x34 over 10m)    kubelet            Error: ImagePullBackOff
    
    4. I can pull that docker image from my local machine
    5. Tried to run two pods with public docker images; one started running, the other failed with ImagePullBackOff

I think this confirms @idallaserra's observations. This Stack Overflow question describes the same issue and has a well-written answer.

@jbw976, wondering if there is any special reason to enable private cluster nodes in the GKE resource spec here.
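
For reference, a quick way to confirm whether private nodes are enabled on the created cluster (cluster name and region are placeholders):

gcloud container clusters describe <cluster-name> --region=<region> \
  --format='value(privateClusterConfig.enablePrivateNodes)'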

turkenh (Collaborator) commented Jan 19, 2021

And, after waiting long enough, the helm release reports the following status back :)

status:
  atProvider:
    releaseDescription: 'Release "vault1-cluster-p7pfn-mkbc6" failed: failed pre-install:
      failed to deploy vault1-cluster-p7pfn-mkbc6-admission-create'
    revision: 1
    state: failed
  conditions:
  - lastTransitionTime: "2021-01-19T14:08:32Z"
    reason: ReconcileSuccess
    status: "True"
    type: Synced
  - lastTransitionTime: "2021-01-19T14:08:17Z"
    reason: Unavailable
    status: "False"
    type: Ready
  failed: 1
  synced: true

jbw976 (Member, Author) commented Jan 25, 2021

@jbw976, wondering if there is any special reason to enable private cluster nodes in the GKE resource spec here.

No, I don't think there is, @turkenh. Most likely I was following another example and didn't realize the impact of that setting 😨

I'll get this patched in the reference platforms that use GKE (GCP and multi-k8s).

Is there anything to fix here in terms of provider-helm reporting status back earlier, so that people in similar scenarios in the future don't get tricked by the red herring of "connection refused"? That was showing to me as the main/obvious thing that was wrong on the Release object when I first investigated, and it led me down the wrong trail.

Can that status behavior be improved? 🙏

turkenh (Collaborator) commented Jan 25, 2021

Is there anything to fix here in terms of provider-helm reporting status back earlier, so that people in similar scenarios in the future don't get tricked by the red herring of "connection refused"? That was showing to me as the main/obvious thing that was wrong on the Release object when I first investigated, and it led me down the wrong trail.

Can that status behavior be improved? 🙏

Yes, once we have implemented this, we will be able to report status earlier and better, instead of being blocked until the helm client returns.

jbw976 (Member, Author) commented Jan 25, 2021

Perfect @turkenh! Let's close down this issue then in favor of #63. Thank you for your support, and thank you to @idallaserra for his troubleshooting as well! 💪
