
Intermittent failed requests during rolling updates #814

@deliahu

Description


As old pods spin down during a rolling update, some requests return 503. Istio hides the more detailed error message; when bypassing Istio with a Service of type LoadBalancer, the error is:

Post http://a2b34b2aa4ec411eaab8a0a465d27bdf-e1fddee29350e253.elb.us-west-2.amazonaws.com/predict: read tcp 172.31.1.222:42654->54.71.144.207:80: read: connection reset by peer

Reproduction

Create an iris deployment with min and max replicas set to e.g. 2. Run dev/load.go with at least 100 concurrent threads and no delays (a minimal sketch is shown below). Perform a rolling update and watch for 503 errors (or the connection reset by peer error if using the LoadBalancer service) while the old pods are terminating.
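For reference, a minimal load generator along the lines of dev/load.go might look like the sketch below. The endpoint URL, request payload, and request counts are placeholders, not the actual values used by dev/load.go:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
)

func main() {
	// Placeholder endpoint and payload; substitute the iris API's actual values.
	url := "http://<load-balancer-hostname>/predict"
	payload := []byte(`{"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2}`)

	const (
		workers           = 100  // concurrent threads, no delay between requests
		requestsPerWorker = 1000 // keep traffic flowing while the rolling update runs
	)

	var total, failed uint64
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < requestsPerWorker; j++ {
				atomic.AddUint64(&total, 1)
				resp, err := http.Post(url, "application/json", bytes.NewReader(payload))
				if err != nil {
					// e.g. "read: connection reset by peer" when hitting the LoadBalancer service directly
					atomic.AddUint64(&failed, 1)
					fmt.Println("request error:", err)
					continue
				}
				if resp.StatusCode != http.StatusOK {
					// 503s show up here while old pods are terminating
					atomic.AddUint64(&failed, 1)
					fmt.Println("unexpected status:", resp.Status)
				}
				resp.Body.Close()
			}
		}()
	}
	wg.Wait()
	fmt.Printf("%d/%d requests failed\n", atomic.LoadUint64(&failed), atomic.LoadUint64(&total))
}
```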

Relevant info

Possibly related issue

Also, @vishalbollu reported that during large scale-ups that require many new nodes (e.g. 1 -> 200 nodes), some 503 errors are also seen. This may or may not have the same root cause; if a fix for this issue doesn't resolve it, a separate ticket should be created to track it.

Possibly related issue 2

It seems that deleting an API, deploying it again, waiting for the previous pod to terminate, and then sending a high volume of parallel requests also results in 503 errors. See the networking-debugging branch.
