Description
As old pods spin down during a rolling update, some requests return 503. Istio hides the more detailed error message; when bypassing Istio with a service of `type: LoadBalancer`, the error is:

```
Post http://a2b34b2aa4ec411eaab8a0a465d27bdf-e1fddee29350e253.elb.us-west-2.amazonaws.com/predict: read tcp 172.31.1.222:42654->54.71.144.207:80: read: connection reset by peer
```
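For reference, here is a minimal Go sketch of how a client can tell the two failure modes apart: a 503 returned through the mesh versus a raw transport-level reset when hitting the pods directly. The endpoint URL and payload are placeholder assumptions, not values from this repo:

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"net/http"
	"syscall"
)

// classify sends one request and reports whether it failed at the HTTP layer
// (a 503 from the mesh) or at the transport layer (a raw connection reset,
// as seen when bypassing Istio with a LoadBalancer service).
func classify(url string, payload []byte) {
	resp, err := http.Post(url, "application/json", bytes.NewReader(payload))
	if err != nil {
		if errors.Is(err, syscall.ECONNRESET) {
			fmt.Println("transport: connection reset by peer")
		} else {
			fmt.Println("transport:", err)
		}
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusServiceUnavailable {
		fmt.Println("HTTP 503")
	} else {
		fmt.Println("status:", resp.StatusCode)
	}
}

func main() {
	// Placeholder endpoint and payload; substitute the API's actual values.
	classify("http://<api-endpoint>/predict", []byte(`{"samples": [[5.1, 3.5, 1.4, 0.2]]}`))
}
```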
Reproduction
Create an iris deployment with min and max replicas set to e.g. 2. Run `dev/load.go` with at least 100 concurrent threads and no delays (a self-contained sketch of that kind of load run follows below). Perform a rolling update and watch for 503 errors (or the `connection reset by peer` error if using the load balancer service) as the old pods terminate.
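In lieu of `dev/load.go`, this is a minimal Go sketch of the same kind of load: many goroutines issuing back-to-back POST requests and counting 503s and transport errors. The endpoint URL, payload shape, and request counts are placeholder assumptions:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
)

func main() {
	// Placeholder endpoint and payload; dev/load.go is the real driver.
	url := "http://<api-endpoint>/predict"
	payload := []byte(`{"samples": [[5.1, 3.5, 1.4, 0.2]]}`)

	const workers = 100 // at least 100 concurrent threads
	const requestsPerWorker = 1000
	var total, errs503, transport uint64

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < requestsPerWorker; j++ { // no delay between requests
				atomic.AddUint64(&total, 1)
				resp, err := http.Post(url, "application/json", bytes.NewReader(payload))
				if err != nil {
					atomic.AddUint64(&transport, 1) // e.g. connection reset by peer
					continue
				}
				if resp.StatusCode == http.StatusServiceUnavailable {
					atomic.AddUint64(&errs503, 1)
				}
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()

	fmt.Printf("requests=%d 503s=%d transport errors=%d\n", total, errs503, transport)
}
```

During a rolling update, a burst of nonzero `503s` (or `transport errors` against the load balancer service) reproduces the issue.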
Relevant info
Possibly related issue
Also, @vishalbollu reported that during large scale-ups that require many new nodes (e.g. 1 -> 200 nodes), some 503 errors are also seen. This may or may not have the same root cause. If a fix for this issue doesn't resolve it, a separate ticket should be created to track it.
Possibly related issue 2
It seems that deleting an API, deploying it again, waiting for the previous pod to terminate, and then sending a high volume of parallel requests also results in 503 errors. See the `networking-debugging` branch.