
Observing "no healthy upstream" for new deployments until ambassador pods restarted #3324

Closed
coala-svn opened this issue Apr 2, 2021 · 15 comments
Labels: t:bug Something isn't working · w:5 Targeted for fifth week of development cycle
Milestone: 2021 Cycle 3

@coala-svn commented Apr 2, 2021

Description of the problem
I am facing a very strange problem. Our IT wants us to migrate our application testing pipeline to a new cluster. After deploying Ambassador with Helm (originally 1.12.0), I tested the deployments of our applications: all deployments succeeded, but on access to an application I consistently got the error "no healthy upstream" (the same deployments work in the old cluster).

At some point I learned about the 1.12.1 release and ran "helm upgrade" to 1.12.1. After that, all the previously broken application deployments started working without any additional changes, but every new deployment hit the same "no healthy upstream" error. Eventually Ambassador was upgraded to 1.12.2 with the same effect: the old broken deployments started working without changes, and every new deployment returned "no healthy upstream".

Investigating connectivity confirmed that the application is reachable with curl from the Ambassador pod, both via the app's Service and directly via the pod IP. However, external requests to the application always ended in "no healthy upstream".

Now, if the Ambassador pod is killed (replica count was reduced to 1 to simplify log analysis) and the Deployment/ReplicaSet replaces it with a new pod, the issue is resolved: all broken deployments start working (tested 3 times).
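A minimal sketch of the checks and the restart workaround described above; pod and service names are placeholders, not the actual names from our cluster:

$ # from inside the Ambassador pod, both of these succeed:
$ kubectl -n ambassador exec -it <ambassador-pod> -- curl -sv http://<app-service>.<app-namespace>:8080/
$ kubectl -n ambassador exec -it <ambassador-pod> -- curl -sv http://<app-pod-ip>:8080/
$ # workaround: delete the Ambassador pod and let the ReplicaSet replace it
$ kubectl -n ambassador delete pod <ambassador-pod>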

Details on the current deployment:

$ helm -n ambassador list
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
ambassador      ambassador      14              2021-03-31 10:20:18.8370383 -0400 EDT   deployed        ambassador-6.6.2        1.12.2

Is there something I might be missing during the deployment of Ambassador?

Expected behavior
All new application deployments start working without needing to restart the Ambassador pods.

Versions:

  • Ambassador: 1.12.2 (1.12.0, 1.12.1)
  • Kubernetes environment: Azure Kubernetes Service (AKS), private-link cluster (i.e. no access from the public internet and only internal LBs; the annotation "service.beta.kubernetes.io/azure-load-balancer-internal" is set to "true")
  • Kubernetes version: v1.18.14

Additional context
None. I am not sure whether this is a bug. I would appreciate any workaround for our environment.

@coala-svn (Author) commented Apr 5, 2021

After downgrading Ambassador to 1.11.2 the issue is no longer reproducible, so it looks like an issue in 1.12.x.

@rdmoore commented Apr 5, 2021

We are also seeing a similar (same?) problem with Ambassador 1.12.1 deployed into our nonprod environment.
A fresh Ambassador pod works like a champ, but subsequent changes don't seem to be reflected in the Envoy configuration.
Example events that don't seem to trigger a reconfiguration:

  • an HPA scale-down (removal of a pod)
  • a new deployment
  • a rollout restart

In all cases, a rollout restart of Ambassador resolves the issue.

In researching the issue, it looks like 1.12.1 switched to using EDS. Is it possible that the EDS service is not reflecting cluster changes?
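One way to check this, assuming the Envoy admin interface is reachable inside the pod on its default port (8001 in a stock Ambassador install; treat the port as an assumption), is to compare Envoy's view of the clusters before and after a change:

$ # per-cluster endpoint status as Envoy sees it:
$ kubectl -n ambassador exec <ambassador-pod> -- curl -s localhost:8001/clusters | grep <app-service>
$ # full configuration, including EDS-provided endpoints:
$ kubectl -n ambassador exec <ambassador-pod> -- curl -s localhost:8001/config_dump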

@rhysm commented Apr 6, 2021

Also facing a similar issue. 1.12.0 was my first use of Ambassador, and I thought I had misconfigured something. Rolling back to 1.11.2 resolved the issue.

@rdmoore commented Apr 7, 2021

I started monitoring the snapshots/snapshot.yaml file while performing a rollout restart of a deployment. Ambassador writes the correct information to this file at startup, but the file does not get updated with new endpoint IPs when I roll a deployment (see the sketch after this list for how I checked). The Ambassador documentation indicates that this is likely a configuration issue of some sort.

  • Is this file still expected to be updated (with the latest EDS changes)?
  • What kinds of issues might break merging in new k8s information but not break creating the original file?
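A minimal sketch of the check, assuming the default snapshot location inside the Ambassador pod (/ambassador/snapshots/snapshot.yaml); pod and deployment names are placeholders:

$ kubectl -n ambassador exec <ambassador-pod> -- cat /ambassador/snapshots/snapshot.yaml > snap-before.yaml
$ kubectl -n <app-namespace> rollout restart deployment/<app-deployment>
$ kubectl -n ambassador exec <ambassador-pod> -- cat /ambassador/snapshots/snapshot.yaml > snap-after.yaml
$ diff snap-before.yaml snap-after.yaml   # expected: new endpoint IPs; observed: no change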

@rhs (Contributor) commented Apr 8, 2021

Can you post the Mapping resources for which you are experiencing this behavior?

@khussey khussey added the t:bug Something isn't working label Apr 8, 2021
@khussey khussey added this to the 2021 Cycle 3 milestone Apr 8, 2021
@coala-svn (Author)

Here is an example of such a mapping:

$ kubectl -n kangaroo277id100006 describe mapping ambassador-ms-service-0

Name:         ambassador-ms-service-0
Namespace:    kangaroo277id100006
Labels:       app.kubernetes.io/managed-by=Helm
Annotations:  getambassador.io/resource-changed: true
              meta.helm.sh/release-name: amb-rules
              meta.helm.sh/release-namespace: kangaroo277id100006
API Version:  getambassador.io/v2
Kind:         Mapping
Metadata:
  Creation Timestamp:  2021-04-08T19:53:49Z
  Generation:          1
  Managed Fields:
    API Version:  getambassador.io/v2
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:getambassador.io/resource-changed:
          f:meta.helm.sh/release-name:
          f:meta.helm.sh/release-namespace:
        f:labels:
          .:
          f:app.kubernetes.io/managed-by:
      f:spec:
        .:
        f:connect_timeout_ms:
        f:host:
        f:idle_timeout_ms:
        f:load_balancer:
          .:
          f:cookie:
            .:
            f:name:
            f:path:
            f:ttl:
          f:policy:
        f:prefix:
        f:resolver:
        f:rewrite:
        f:service:
        f:timeout_ms:
    Manager:         Go-http-client
    Operation:       Update
    Time:            2021-04-08T19:53:49Z
  Resource Version:  7903038
  Self Link:         /apis/getambassador.io/v2/namespaces/kangaroo277id100006/mappings/ambassador-ms-service-0
  UID:               6653b0e3-3d3c-4260-a945-284e807a66f7
Spec:
  connect_timeout_ms:  6000
  Host:                wcdp-windchill-kangaroo277id100006.rd-plm-devops.bdns.ptc.com
  idle_timeout_ms:     5000000
  load_balancer:
    Cookie:
      Name:    sticky-cookie-0
      Path:    /Windchill
      Ttl:     600s
    Policy:    ring_hash
  Prefix:      /Windchill
  Resolver:    endpoint
  Rewrite:     /Windchill
  Service:     ms-service-kangaroo-0.kangaroo277id100006.svc.cluster.local:8080
  timeout_ms:  0
Events:        <none>

@rhs (Contributor) commented Apr 9, 2021

I believe that if you drop the .svc.cluster.local suffix from the service name, it should fix the problem. When you are using the Kubernetes endpoint routing resolver, the service field refers directly to a Kubernetes resource, not to a DNS name. The .svc.cluster.local suffix is added by the Kubernetes DNS server, so it is a bit odd to use it when you aren't doing a DNS lookup.

That said, this is a bug, because we used to allow that, and a) we shouldn't disallow it without a deprecation period, and b) we should also be logging it as an error.
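For the Mapping shown above, a sketch of the corrected resource, reconstructed from the describe output; only the service line changes:

apiVersion: getambassador.io/v2
kind: Mapping
metadata:
  name: ambassador-ms-service-0
  namespace: kangaroo277id100006
spec:
  prefix: /Windchill
  rewrite: /Windchill
  host: wcdp-windchill-kangaroo277id100006.rd-plm-devops.bdns.ptc.com
  resolver: endpoint
  load_balancer:
    policy: ring_hash
    cookie:
      name: sticky-cookie-0
      path: /Windchill
      ttl: 600s
  connect_timeout_ms: 6000
  idle_timeout_ms: 5000000
  timeout_ms: 0
  # was: ms-service-kangaroo-0.kangaroo277id100006.svc.cluster.local:8080
  service: ms-service-kangaroo-0.kangaroo277id100006:8080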

@rdmoore commented Apr 10, 2021

Thanks! That appears to be exactly my issue. Re-reading the documentation, I see that the DNS name is not recommended; there is only a claim that it might work. Why did I not notice this previously? I have tested this change successfully with a few Mapping files.

@coala-svn (Author)

@rhs - thanks for letting us know about a workaround. However, the problem here is that we were specifically told by someone from the Dataware team to use the .svc.cluster.local suffix for the service when they helped us update our application deployment for the Ambassador integration. (I was not part of that discussion and learned about the recommendation only today, when we discussed testing the potential workaround internally.)

@illinar commented Apr 13, 2021

We observed very similar behavior, except that the "no healthy upstream" error went away once the mapping was re-loaded. We tried removing the "http://" prefix from the service name, as was suggested in Slack for a similar situation, and it seemed to do the trick. But it is unclear what the underlying cause is, and what the correct way of specifying mappings is to avoid this sort of scenario.
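For example, a sketch of the change, with placeholder service names rather than our real configuration:

# before - the form that triggered the issue for us:
service: http://my-service.my-namespace:8080
# after - the form suggested in Slack:
service: my-service.my-namespace:8080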

@khussey khussey added the w:5 Targeted for fifth week of development cycle label Apr 14, 2021
@khussey (Contributor) commented Apr 20, 2021

This is fixed in Ambassador 1.13.0, which is now available.

@khussey khussey closed this as completed Apr 20, 2021
@coala-svn (Author) commented Apr 21, 2021

Confirmed.

Thank you guys for the prompt fix!

@wissam-launchtrip

We are noticing this behavior on 1.13.5 still. Please advise.

@esmet (Contributor) commented May 24, 2021

@wissam-launchtrip can you go into a bit more detail? Are you seeing this exact issue or something similar? Anything that helps us verify the report and reproduce the issue would be appreciated as we work toward a possible fix 👍

@wissam-launchtrip

No, actually it's a different issue. Upstream services get disconnected for no clear reason, and we get the "no healthy upstream" error. This happens a few hours after the last deployment in the cluster; if we make a new deployment in the cluster, the error disappears.
