Scaling any cloud native application is inherently domain specific; however, the content here reflects common issues, tips, and tricks that come up frequently.
The performance of $productName$ depends on several dimensions:
- The number of `TLSContext` resources.
- The number of `Host` resources.
- The number of `Mapping` resources per `Host` resource.
- The number of `Mapping` resources that will span all `Host` resources (either because they're using `host_regex`, or because they're using `hostname: "*"`).
If your application involves a larger than average number of any of the above resources, you may find yourself in need of some of the content in this section.
Whether your application is growing organically or whether you are deliberately scale testing, it's
helpful to recognize how
As
`OOMKilled` state. The only way to actually observe this is if you are lucky enough to be running the following command (or have similar monitoring configured) when $productName$ is `OOMKilled`:
kubectl get pods -n ambassador -w
To take the luck out of the equation, $productName$ periodically logs its memory usage:
2020/11/26 22:35:20 Memory Usage 0.56Gi (28%)
PID 1, 0.22Gi: busyambassador entrypoint
PID 14, 0.04Gi: /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/notices.json --port 8004 --kick kill -HUP 1
PID 16, 0.12Gi: /ambassador/sidecars/amb-sidecar
PID 37, 0.07Gi: /usr/bin/python /usr/bin/diagd /ambassador/snapshots /ambassador/bootstrap-ads.json /ambassador/envoy/envoy.json --notices /ambassador/notices.json --port 8004 --kick kill -HUP 1
PID 48, 0.08Gi: envoy -c /ambassador/bootstrap-ads.json --base-id 0 --drain-time-s 600 -l error
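As a rough illustration of how these log lines could be monitored, the sketch below parses one memory report into structured values. The regular expressions are inferred only from the sample output above, and `WARN_PERCENT` is a hypothetical threshold chosen for the example, not a documented setting.

```python
import re

# Summary line, e.g. "2020/11/26 22:35:20 Memory Usage 0.56Gi (28%)".
SUMMARY_RE = re.compile(r"Memory Usage (?P<gi>\d+(?:\.\d+)?)Gi \((?P<pct>\d+)%\)")
# Per-process line, e.g. "PID 48, 0.08Gi: envoy -c ...".
PROCESS_RE = re.compile(r"^\s*PID (?P<pid>\d+), (?P<gi>\d+(?:\.\d+)?)Gi: (?P<cmd>\S.*)$")

WARN_PERCENT = 50  # hypothetical alert threshold, not a documented value


def parse_memory_report(lines):
    """Return (total_gi, percent, processes) for one logged memory report.

    processes is a list of (pid, gi, executable) tuples.
    """
    total_gi = percent = None
    processes = []
    for line in lines:
        summary = SUMMARY_RE.search(line)
        if summary:
            total_gi = float(summary["gi"])
            percent = int(summary["pct"])
            continue
        proc = PROCESS_RE.match(line)
        if proc:
            executable = proc["cmd"].split()[0]
            processes.append((int(proc["pid"]), float(proc["gi"]), executable))
    return total_gi, percent, processes


SAMPLE = """\
2020/11/26 22:35:20 Memory Usage 0.56Gi (28%)
PID 1, 0.22Gi: busyambassador entrypoint
PID 48, 0.08Gi: envoy -c /ambassador/bootstrap-ads.json --base-id 0 --drain-time-s 600 -l error
"""

total_gi, percent, processes = parse_memory_report(SAMPLE.splitlines())
print(f"total={total_gi}Gi ({percent}%), processes={processes}")
if percent is not None and percent >= WARN_PERCENT:
    print("memory usage is above the chosen threshold")
```

Feeding it the output of `kubectl logs` line by line would give a simple way to alert before the limit is reached, rather than after a restart.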
In general you should try to keep
`Host` and `Mapping` resources are defined in your cluster. If this number has grown over time, you may need to increase the memory limit defined in your deployment.
$productName$ exposes the `/ambassador/v0/check_alive` endpoint on port `8877` for use with Kubernetes liveness probes. See the Kubernetes documentation for more details on HTTP liveness probes.
Kubernetes will restart the $productName$ pod if its liveness probe fails. You can look for evidence of such restarts with `kubectl describe pod -n ambassador` or `kubectl get events -n ambassador` or equivalent.
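The purpose of liveness probes is to rescue an unresponsive $productName$ instance. How readily Kubernetes decides that a probe has failed is governed by the `timeoutSeconds` and `failureThreshold` fields of the probe. If you see `Unhealthy` events, try tuning these fields upwards from their default values. See the Kubernetes documentation for more details on tuning probes.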
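To get a feel for what tuning these fields does, the following sketch models a probe with the Kubernetes default values (`periodSeconds: 10`, `timeoutSeconds: 1`, `failureThreshold: 3`) and estimates how long an unresponsive instance survives before a restart is triggered. The `ProbeTuning` class and its timing formula are illustrative assumptions: the formula is a simplified model of consecutive timed-out probes, not the kubelet's exact behavior.

```python
from dataclasses import dataclass


@dataclass
class ProbeTuning:
    """Hypothetical helper holding the tunable fields of a Kubernetes probe."""

    period_seconds: int = 10     # Kubernetes default
    timeout_seconds: int = 1     # Kubernetes default
    failure_threshold: int = 3   # Kubernetes default

    def to_liveness_probe(self):
        """Render the dict form of a container's livenessProbe field."""
        return {
            "httpGet": {"path": "/ambassador/v0/check_alive", "port": 8877},
            "periodSeconds": self.period_seconds,
            "timeoutSeconds": self.timeout_seconds,
            "failureThreshold": self.failure_threshold,
        }

    def approx_seconds_until_restart(self):
        """Rough time from the first timed-out probe to the last one.

        Simplified model: probes start every period_seconds, each one fails
        after timeout_seconds, and the restart is triggered when
        failure_threshold consecutive probes have failed.
        """
        return (self.failure_threshold - 1) * self.period_seconds + self.timeout_seconds


defaults = ProbeTuning()
relaxed = ProbeTuning(timeout_seconds=5, failure_threshold=6)
print(defaults.approx_seconds_until_restart())
print(relaxed.approx_seconds_until_restart())
print(relaxed.to_liveness_probe())
```

Raising either field widens the window in which a slow but healthy instance can still answer before Kubernetes gives up on it.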
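Note that whatever changes you make to the liveness probe should most likely be made to the readiness probe as well.
$productName$ exposes the `/ambassador/v0/check_ready` endpoint on port `8877` for use with Kubernetes readiness probes. See the Kubernetes documentation for more details on readiness probes.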
Kubernetes uses readiness checks to prevent traffic from going to pods that are not ready to handle requests. The only time $productName$ is not ready to handle requests is during its initial bootstrap period, when the `check_ready` endpoint will only return 200 once all routing information has been loaded. After the initial bootstrap period it behaves identically to the `check_alive` endpoint.
Generally
`AMBASSADOR_FAST_RECONFIGURE` is a feature flag that enables a higher-performance implementation of the code $productName$ uses to reconfigure itself. If performance is a concern, try setting `AMBASSADOR_FAST_RECONFIGURE` to `true` to see if this helps.
`AMBASSADOR_LEGACY_MODE` is not recommended when performance is critical.
The `AMBASSADOR_DRAIN_TIME` variable controls how much of a grace period
When working with a large number of `Host` resources, it's important to understand the impact of unconstrained `Mapping`s. An unconstrained `Mapping` is one that is not restricted to a specific `Host`. Such a `Mapping` will create a route for all of your `Host`s. If this is what you want, then it is the appropriate thing to do; however, if you do not intend this, you can end up with many more routes than you had intended, and this can adversely impact performance.
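The effect on route count can be sketched with a small model. The code below treats each `Mapping` as a plain dict with optional `hostname` and `host_regex` keys and counts one route per `Host` it applies to. This is a deliberately simplified assumption for illustration: the function names are hypothetical, partial globs such as `*.example.com` are not modelled, and it does not reproduce how routes are actually generated.

```python
def spans_all_hosts(mapping):
    """True when a Mapping is not tied to one specific Host (simplified)."""
    if mapping.get("host_regex"):
        return True
    hostname = mapping.get("hostname")
    return hostname is None or hostname == "*"


def estimate_route_count(mappings, hostnames):
    """Estimate routes as one per (Mapping, matching Host) pair."""
    total = 0
    for mapping in mappings:
        if spans_all_hosts(mapping):
            total += len(hostnames)   # one route for every Host
        elif mapping["hostname"] in hostnames:
            total += 1                # constrained to a single Host
    return total


hosts = {f"tenant-{i}.example.com" for i in range(100)}
constrained = [{"prefix": "/api/", "hostname": "tenant-1.example.com"}] * 10
unconstrained = [{"prefix": "/api/"}] * 10

print(estimate_route_count(constrained, hosts))
print(estimate_route_count(unconstrained, hosts))
```

In this model, ten constrained `Mapping`s yield ten routes, while ten unconstrained ones yield a route for every `Host` each, which is where the multiplication comes from.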
$productName$ exposes debugging information at `localhost:8877/debug`. Note that the `AMBASSADOR_FAST_RECONFIGURE` flag needs to be set to `"true"` for this endpoint to be present:
$ kubectl exec -n ambassador -it ${POD} -- curl localhost:8877/debug
{
"timers": {
# These two timers track how long it takes to respond to liveness and readiness probes.
"check_alive": "7, 45.411495ms/61.85999ms/81.358927ms",
"check_ready": "7, 49.951304ms/61.976205ms/86.279038ms",
# These two timers track how long we spend updating our in-memory snapshot when our Kubernetes
# watches tell us something has changed.
"consulUpdate": "0, 0s/0s/0s",
"katesUpdate": "3382, 28.662µs/102.784µs/95.220222ms",
# These timers tell us how long we spend notifying the sidecars of changed input. This
# includes how long the sidecars take to process that input.
"notifyWebhook:diagd": "2, 1.206967947s/1.3298432s/1.452718454s",
"notifyWebhooks": "2, 1.207007216s/1.329901037s/1.452794859s",
# This timer tells us how long we spend parsing annotations.
"parseAnnotations": "2, 21.944µs/22.541µs/23.138µs",
# This timer tells us how long we spend reconciling changes to consul inputs.
"reconcileConsul": "2, 50.104µs/55.499µs/60.894µs",
# This timer tells us how long we spend reconciling secrets-related changes to $productName$
# inputs.
"reconcileSecrets": "2, 18.704µs/20.786µs/22.868µs"
},
"values": {
"envoyReconfigs": {
"times": [
"2020-11-06T13:13:24.218707995-05:00",
"2020-11-06T13:13:27.185754494-05:00",
"2020-11-06T13:13:28.612279777-05:00"
],
"staleCount": 2,
"staleMax": 0,
"synced": true
},
"memory": "39.73Gi of Unlimited (0%)"
}
}
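Each timer value in the output above is a string made of a count followed by three durations. The sketch below converts such a string into numbers so timers can be compared or graphed. It assumes the three durations are ordered low/middle/high (they look like min/avg/max in the sample, but that reading is an assumption), and it only handles single-unit durations in `ns`, `µs`, `ms`, or `s`.

```python
import re

_UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}
_DURATION_RE = re.compile(r"^(\d+(?:\.\d+)?)(ns|us|ms|s)$")


def parse_duration(text):
    """Convert a single-unit duration such as '45.4ms' to seconds."""
    # Accept both the micro sign and the Greek letter mu for microseconds.
    normalized = text.strip().replace("\u00b5", "u").replace("\u03bc", "u")
    match = _DURATION_RE.match(normalized)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return float(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def parse_timer(value):
    """Split '<count>, <a>/<b>/<c>' into (count, a, b, c) with seconds."""
    count_text, durations_text = value.split(",", 1)
    low, mid, high = (parse_duration(part) for part in durations_text.split("/"))
    return int(count_text), low, mid, high


timers = {
    "check_alive": "7, 45.411495ms/61.85999ms/81.358927ms",
    "notifyWebhooks": "2, 1.207007216s/1.329901037s/1.452794859s",
}
for name, raw in timers.items():
    count, low, mid, high = parse_timer(raw)
    print(f"{name}: count={count} high={high:.3f}s")
```

Compound durations such as `1m2s` are rejected with a `ValueError` rather than guessed at.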
$productName$ exposes endpoints at `localhost:8877/debug/pprof` to run Golang profiles to aid in live debugging. The endpoints are equivalent to those found in the `net/http/pprof` package. `/debug/pprof/` returns an HTML page listing the available profiles.
The following are the different types of profiles you can run:
Profile | Function |
---|---|
/debug/pprof/allocs | Returns a sampling of all past memory allocations. |
/debug/pprof/block | Returns stack traces of goroutines that led to blocking on synchronization primitives. |
/debug/pprof/goroutine | Returns stack traces of all current goroutines. |
/debug/pprof/heap | Returns a sampling of memory allocations of live objects. |
/debug/pprof/mutex | Returns stack traces of goroutines holding contended mutexes. |
/debug/pprof/symbol | Looks up the program counters listed in the request and returns a table mapping them to function names. |
/debug/pprof/threadcreate | Returns stack traces that led to creation of new OS threads. |
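One possible way to collect a profile from a script is sketched below. It assumes port 8877 of the pod is reachable on `localhost` (for example through `kubectl port-forward`, or by running inside the pod); the helper names are hypothetical. The `debug` query parameter is the one supported by Go's `net/http/pprof` handlers, where a value above zero requests plain text instead of the binary format.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8877/debug/pprof"  # assumes the port is reachable locally
PROFILES = {"allocs", "block", "goroutine", "heap", "mutex", "threadcreate"}


def profile_url(name, debug=None, base_url=BASE_URL):
    """Build the URL for one of the named profiles in the table above."""
    if name not in PROFILES:
        raise ValueError(f"unknown profile: {name!r}")
    url = f"{base_url}/{name}"
    if debug is not None:
        url += "?" + urlencode({"debug": debug})
    return url


def save_profile(name, path, timeout=30.0):
    """Download a profile in its default binary format and write it to path."""
    with urlopen(profile_url(name), timeout=timeout) as response:
        data = response.read()
    with open(path, "wb") as out:
        out.write(data)
    return len(data)


print(profile_url("heap"))
print(profile_url("goroutine", debug=2))
```

A file saved with `save_profile("heap", "heap.pb.gz")` can then be inspected with `go tool pprof heap.pb.gz`.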