This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

flux-helm-operator status.go/annotator doesn't recover from transient TLS error to Kubernetes API #2178

Closed
ellieayla opened this issue Jun 21, 2019 · 2 comments
Labels
blocked-needs-validation Issue is waiting to be validated before we can proceed bug

Comments

@ellieayla
Contributor

ellieayla commented Jun 21, 2019

Describe the bug

The call to list namespaces at the line linked below failed with a transient TLS error, and the annotator's for loop exited (the log line reports loop=stopping).

https://github.com/weaveworks/flux/blob/af9d3be07a240970cf1a935e4e88e0728052aad4/integrations/helm/status/status.go#L63

ts=2019-06-21T11:13:42.171831046Z caller=status.go:101 component=annotator loop=stopping err="Get https://example.hcp.westus2.azmk8s.io:443/api/v1/namespaces: net/http: TLS handshake timeout"

No retry was attempted. The flux-helm-operator process did not exit, and http://127.0.0.1:3030/healthz continued to return 200 OK.

No further Helm chart upgrades were started, despite updates to HelmChart resources.
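
For illustration only, here is a minimal Go sketch of the kind of retry that was missing. It is not the code in status.go: retryWithBackoff and errTransient are hypothetical names, and fn stands in for whatever call lists the namespaces. The idea is that a transient failure leads to another attempt after a growing delay, rather than ending the loop.

```go
package main

import (
	"errors"
	"time"
)

// errTransient is a hypothetical stand-in for an error such as
// "net/http: TLS handshake timeout".
var errTransient = errors.New("transient error")

// retryWithBackoff calls fn until it succeeds or maxAttempts calls have
// been made. The delay between attempts starts at initial and doubles
// after every failure, capped at maxDelay. The last error is returned
// if every attempt fails.
func retryWithBackoff(maxAttempts int, initial, maxDelay time.Duration, fn func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1 // always try at least once
	}
	var err error
	delay := initial
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		if attempt == maxAttempts {
			break // no point sleeping after the final attempt
		}
		time.Sleep(delay)
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return err
}
```

A caller inside a periodic loop could wrap the list call in retryWithBackoff and, if it still fails, log the error and wait for the next tick instead of returning from the loop.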

To Reproduce
Steps to reproduce the behaviour:

  1. Install flux-helm-operator container on kubernetes
  2. Have a flaky network between flux-helm-operator pods and the api master
  3. Wait a bit, observe TLS error
  4. See that flux-helm-operator does not retry and stays stuck indefinitely
  5. Run kubectl -n flux port-forward pod/flux-helm-operator-588cb46cb8-sk9fw 3030, then curl -v http://127.0.0.1:3030/healthz; the response is still 200 OK
  6. Delete the flux-helm-operator pod, let it restart.

Expected behavior
I expected flux-helm-operator to retry the connection to the Kubernetes API endpoint, or to exit, or to report unhealthy so that a livenessProbe restarts the pod.
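
As a rough sketch of the "report unhealthy" option, assuming nothing about how the operator's /healthz handler is actually implemented: a background loop could record that it has stopped, and the health handler could answer 503 from then on so that a livenessProbe fails. The names loopHealth, MarkStopped and statusCode are hypothetical.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// loopHealth tracks whether a background loop is still running.
// The zero value means "running".
type loopHealth struct {
	stopped int32 // accessed atomically; 1 once the loop has exited
}

// MarkStopped records that the loop has exited. It would be called
// (for example via defer) at the point where the loop returns.
func (h *loopHealth) MarkStopped() {
	atomic.StoreInt32(&h.stopped, 1)
}

// statusCode returns the HTTP status the health endpoint should report.
func (h *loopHealth) statusCode() int {
	if atomic.LoadInt32(&h.stopped) == 1 {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

// ServeHTTP lets loopHealth be registered directly as a handler,
// e.g. http.Handle("/healthz", h).
func (h *loopHealth) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(h.statusCode())
}
```

With a handler like this, the curl check in the reproduction steps would have returned 503 after the loop stopped, and a livenessProbe pointed at /healthz would eventually restart the pod.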

Logs

$ kubectl -n flux logs deploy/flux-helm-operator --tail=90 -f
W0620 07:14:56.890850       7 client_config.go:549] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
ts=2019-06-20T07:15:02.019263142Z caller=helm.go:84 component=helm warning="unable to connect to Tiller" err="error getting tiller version: context deadline exceeded" host=tiller-deploy.kube-system:44134 options="{Host: Port: Namespace:kube-system TLSVerify:true TLSEnable:true TLSKey:/etc/fluxd/helm/tls.key TLSCert:/etc/fluxd/helm/tls.crt TLSCACert:/etc/fluxd/helm-ca/ca.crt TLSHostname:tiller-server}"
ts=2019-06-20T07:15:22.105590267Z caller=helm.go:88 component=helm info="connected to Tiller" version="sem_ver:\"v2.10.0\" git_commit:\"9ad53aac42165a5fadc6c87be0dea6b115f93090\" git_tree_state:\"clean\" " host=tiller-deploy.kube-system:44134 options="{Host: Port: Namespace:kube-system TLSVerify:true TLSEnable:true TLSKey:/etc/fluxd/helm/tls.key TLSCert:/etc/fluxd/helm/tls.crt TLSCACert:/etc/fluxd/helm-ca/ca.crt TLSHostname:tiller-server}"
ts=2019-06-20T07:15:22.105914964Z caller=chartsync.go:152 component=chartsync info="starting git chart sync loop"
ts=2019-06-20T07:15:22.106196363Z caller=operator.go:95 component=operator info="setting up event handlers"
ts=2019-06-20T07:15:22.106301262Z caller=operator.go:115 component=operator info="event handlers set up"
ts=2019-06-20T07:15:22.106411861Z caller=operator.go:128 component=operator info="starting operator"
ts=2019-06-20T07:15:22.106475261Z caller=operator.go:130 component=operator info="waiting for informer caches to sync"
ts=2019-06-20T07:15:22.107728953Z caller=server.go:41 component=daemonhttp info="starting HTTP server on :3030"
ts=2019-06-20T07:15:22.797249327Z caller=checkpoint.go:21 component=checkpoint msg="update available" latest=0.9.2 URL=https://github.com/weaveworks/flux/releases/tag/helm-0.9.2
ts=2019-06-20T14:24:44.63291148Z caller=checkpoint.go:21 component=checkpoint msg="update available" latest=0.9.2 URL=https://github.com/weaveworks/flux/releases/tag/helm-0.9.2
ts=2019-06-20T20:48:04.612503758Z caller=checkpoint.go:21 component=checkpoint msg="update available" latest=0.9.2 URL=https://github.com/weaveworks/flux/releases/tag/helm-0.9.2
ts=2019-06-21T01:35:21.114551968Z caller=checkpoint.go:21 component=checkpoint msg="update available" latest=0.9.2 URL=https://github.com/weaveworks/flux/releases/tag/helm-0.9.2
ts=2019-06-21T09:03:50.508971132Z caller=checkpoint.go:21 component=checkpoint msg="update available" latest=0.9.2 URL=https://github.com/weaveworks/flux/releases/tag/helm-0.9.2
ts=2019-06-21T11:13:42.171831046Z caller=status.go:101 component=annotator loop=stopping err="Get https://example.hcp.westus2.azmk8s.io:443/api/v1/namespaces: net/http: TLS handshake timeout"
ts=2019-06-21T14:18:18.163120018Z caller=checkpoint.go:21 component=checkpoint msg="update available" latest=0.9.2 URL=https://github.com/weaveworks/flux/releases/tag/helm-0.9.2

Additional context

  • Flux version: docker.io/weaveworks/flux:1.12.3
  • Helm Operator version: docker.io/weaveworks/helm-operator:0.9.1
  • Kubernetes version: v1.12.8 in AKS
  • Git provider: Bitbucket
  • Container registry provider: Azure Container Registry
@ellieayla ellieayla added blocked-needs-validation Issue is waiting to be validated before we can proceed bug labels Jun 21, 2019
@hiddeco
Member

hiddeco commented Jun 24, 2019

I suspect the refactoring work in #2006 will resolve this, since the code will no longer depend on the core API and will instead reuse the shared informer.

@hiddeco
Member

hiddeco commented Jul 29, 2019

Since the PR linked in my previous comment has landed in version 0.10.0 of the Helm operator, kubectl is no longer used in the Helm operator to request or list any resources (but it is still used to annotate resources from a release), and I expect this issue to be resolved.

If you do, however, run into the same (or new) symptoms, feel free to open a new issue.

@hiddeco hiddeco closed this as completed Jul 29, 2019