Issues with Nest Connection Pool retry mechanism #1080

@satishmallik

Description

Currently NEST cannot differentiate between a Service Unavailable response and a timeout exception.

Problem description

  1. Consider a cluster with 3 query nodes (q1, q2, q3). One query node, q1, is dropped from the cluster for some reason, such as an OOM.
  2. All query requests to q1 now fail with status code 503 (Service Unavailable).
     A typical response in this situation looks like:
             ConnectionStatus = {StatusCode: 503, 
              Method: POST, 
               Url: http://localhost:9200/codeindex/sourceNoDedupeFileContract/_search?routing=0&pretty=true, 
  3. Only in this case should the query be routed to another query node, q2.

Currently NEST retries every retryable exception, including timeouts, on all nodes of the connection pool.
Say a wildcard query is sent to query node q1. If it times out, the query as a whole should time out.
Instead, NEST sends the query to another node, q2. It times out again and is then sent to q3. With a 1 minute request timeout and 3 nodes, the query fails after 3 minutes instead of 1 minute.
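The arithmetic behind the 3 minute figure can be sketched as follows. This is only an illustration in Python, not NEST code; the function name and parameters are hypothetical, and it assumes every attempt runs for the full request timeout before failing.

```python
def worst_case_latency_seconds(
    node_count: int,
    request_timeout_seconds: float,
    retry_on_timeout: bool,
) -> float:
    """Upper bound on the caller's wait when every attempt times out.

    Hypothetical model: if timeouts are retried, the request is tried once
    on each node in the pool; otherwise it is tried exactly once.
    """
    attempts = node_count if retry_on_timeout else 1
    return attempts * request_timeout_seconds


# Behavior described in this issue: 3 nodes, 60 s timeout, timeouts retried.
current = worst_case_latency_seconds(3, 60.0, retry_on_timeout=True)

# Behavior requested in this issue: a timeout is final.
requested = worst_case_latency_seconds(3, 60.0, retry_on_timeout=False)

print(current, requested)
```

The worst case grows linearly with the number of nodes in the pool, so larger clusters make the delay worse.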

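This also makes a denial-of-service (DoS) attack possible, because a single expensive query ends up running on every node in the pool in turn.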

The proper fix is to differentiate between a timeout and a Service Unavailable response. A query should be sent to another node from the connection pool only if the service is unavailable on the first node.
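A minimal sketch of the requested policy, written in Python for illustration only. It is not NEST's implementation or API: `Response`, `RequestTimeout`, `send_with_failover`, and the `send` callback are all hypothetical names. The idea is to fail over to the next node on a 503 and to let a timeout propagate immediately.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


class RequestTimeout(Exception):
    """Raised by the (hypothetical) transport when a node does not answer in time."""


@dataclass
class Response:
    status_code: int
    body: str = ""


def send_with_failover(
    nodes: Sequence[str],
    send: Callable[[str], Response],
) -> Response:
    """Try nodes in order, moving on only when a node answers 503.

    A RequestTimeout raised by `send` is deliberately not caught, so the
    caller sees the timeout after a single attempt.
    """
    if not nodes:
        raise ValueError("connection pool is empty")

    last_response = None
    for node in nodes:
        response = send(node)  # may raise RequestTimeout: no retry
        if response.status_code != 503:
            return response
        last_response = response  # node unavailable, try the next one

    # Every node answered 503: report the last failure to the caller.
    return last_response
```

With this policy, the scenario above (q1 returns 503) is served by q2, while a slow wildcard query on q1 fails after one timeout instead of being replayed on q2 and q3.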
