This repository has been archived by the owner on Jan 27, 2023. It is now read-only.

Extend internal clients to allow for configurable timeouts #210

Closed
nurmi opened this issue Jun 5, 2019 · 2 comments

@nurmi
Member

nurmi commented Jun 5, 2019

The internal anchore HTTP client handler supports configuring timeouts on HTTP connections, but this is currently used only in select, targeted locations in the logic (for example, in the policy engine -> catalog upcall, added as part of issue #154).

Under certain network conditions, where an internal host/port starts holding connections open indefinitely, other internal clients can block, and the blocking clears only when the services are restarted (and the network condition is resolved).

As an operator of anchore engine, it would be useful to be able to configure internal clients to time out in order to avoid indefinite blocking, even if the timeout value is set very high (as some internal anchore connections can be long-lived).

@armstrongli

We have encountered a problem where the analyzer stops scanning images after running for around 2 hours.

All the workers stop working, which seemed weird at first. We took some time to dig deeper and found that it is an infrastructure-level problem with network connections.

We checked the connections on all the analyzers and found that:
• all the workers were stuck loading the analysis result into the policy engine
• there were the same number of connections to the policy engine in the ESTABLISHED state

We checked all the policy engines and found that:
• the policy engines had finished the image load work
• there were no connections from any client

This means the connections had been closed on the policy engine side, but the analyzers never received the FIN when the TCP connection was closed.
So the workers were stuck waiting for the connection to finish.

Then I checked the anchore source code in http.py and noticed that the connection timeout is None. This means the connection never times out if packets are dropped (or something else goes wrong) at the infrastructure level, and the connection gets stuck.
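To illustrate the failure mode described above, here is a minimal standard-library sketch. It is not anchore code: `start_silent_server` and `read_with_timeout` are hypothetical helpers, and the "silent" local server only approximates a peer whose FIN never arrives. With a socket timeout set, the blocked read gives up; with `timeout=None` the same `recv` call would wait indefinitely.

```python
import socket
import threading


def start_silent_server():
    """Listen on an ephemeral local port, accept one connection, never reply."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    release = threading.Event()

    def serve():
        conn, _ = srv.accept()
        release.wait(5)  # hold the connection open without sending anything
        conn.close()
        srv.close()

    threading.Thread(target=serve, daemon=True).start()
    return srv.getsockname()[1], release


def read_with_timeout(host, port, timeout):
    """Return data from the peer, or None if nothing arrives within `timeout` seconds."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        try:
            return conn.recv(1024)
        except socket.timeout:
            # The peer stayed silent; without a timeout this call would block forever.
            return None
```

Calling `read_with_timeout("127.0.0.1", port, 0.2)` against the silent server returns `None` after roughly 0.2 seconds instead of hanging.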

So I changed the http client to add a default timeout to all anchore requests (any post, update, get).

Here is the change:

diff --git a/anchore_engine/clients/services/http.py b/anchore_engine/clients/services/http.py
index d91e42e..c1d0c20 100644
--- a/anchore_engine/clients/services/http.py
+++ b/anchore_engine/clients/services/http.py
@@ -55,7 +55,7 @@ def fpost_req(url, **kwargs):
     rawdata = b''
     jsondata = {}
     try:
-        r = requests.post(url, stream=True, **kwargs)
+        r = requests.post(url, stream=True, **dict(kwargs, timeout=1800))
         httpcode = r.status_code
         rawdata = b''
         for rchunk in r.iter_content(8192*100):
@@ -106,7 +106,7 @@ def fput_req(url, **kwargs):
     rawdata = b''
     jsondata = {}
     try:
-        r = requests.put(url, stream=True, **kwargs)
+        r = requests.put(url, stream=True, **dict(kwargs, timeout=1800))
         httpcode = r.status_code
         rawdata = b''
         for rchunk in r.iter_content(8192*100):
@@ -158,7 +158,7 @@ def fget_req(url, **kwargs):
     rawdata = b''
     jsondata = {}
     try:
-        r = requests.get(url, stream=True, **kwargs)
+        r = requests.get(url, stream=True, **dict(kwargs, timeout=1800))
         httpcode = r.status_code
         rawdata = b''
         for rchunk in r.iter_content(8192*100):
@@ -211,7 +211,7 @@ def fdelete_req(url, **kwargs):
     rawdata = b''
     jsondata = {}
     try:
-        r = requests.delete(url, stream=True, **kwargs)
+        r = requests.delete(url, stream=True, **dict(kwargs, timeout=1800))
         httpcode = r.status_code
         rawdata = b''
         for rchunk in r.iter_content(8192*100):
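One detail worth noting about the diff above: `dict(kwargs, timeout=1800)` replaces any `timeout` the caller already passed, so existing call sites that set a shorter timeout would silently get 1800 seconds. A possible variant, sketched here with a hypothetical helper name (`with_default_timeout` is not part of anchore), only fills in the timeout when the caller has not supplied one.

```python
DEFAULT_TIMEOUT_SECONDS = 1800  # same value as in the diff above


def with_default_timeout(kwargs, default=DEFAULT_TIMEOUT_SECONDS):
    """Return a copy of kwargs with `timeout` set only if the caller omitted it."""
    merged = dict(kwargs)  # copy so the caller's dict is not mutated
    merged.setdefault("timeout", default)
    return merged
```

Under this assumption, the call in each function would become something like `requests.post(url, stream=True, **with_default_timeout(kwargs))`. Also keep in mind that a `requests` timeout is not a limit on total transfer time: it bounds the connection attempt and each wait for data from the server, so a long but steadily progressing streamed download is not cut off by it.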

@zhill zhill self-assigned this Jun 12, 2019
@zhill
Member

zhill commented Jun 13, 2019

The fix adds two new global-level config options in config.yaml:

global_client_connect_timeout:
global_client_read_timeout:

The internal client code uses these as defaults if set. Both default to 0.0, which disables the timeout; any value greater than 0 enables it.
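As a rough illustration of the semantics described above (0.0 disables, a positive value enables), the two settings could be mapped onto the `timeout` argument that `requests` accepts, which may be `None` or a `(connect, read)` tuple. This is only a sketch under that assumption: `build_timeout` is a hypothetical helper, and the actual code in the fix may be structured differently.

```python
def build_timeout(connect_timeout=0.0, read_timeout=0.0):
    """Map the two config values to a requests-style timeout.

    A value of 0 (or less) means "disabled" and becomes None, i.e. wait forever.
    """
    connect = connect_timeout if connect_timeout and connect_timeout > 0 else None
    read = read_timeout if read_timeout and read_timeout > 0 else None
    if connect is None and read is None:
        return None  # both disabled: no timeout at all
    return (connect, read)
```

For example, with `global_client_connect_timeout: 5` and `global_client_read_timeout: 0` this yields `(5, None)`: connection attempts give up after 5 seconds, while reads can still wait indefinitely.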

@zhill zhill closed this as completed in 9ab3a2a Jun 13, 2019