Investigate and document timeouts across components #525

eloquence · 2020-04-02T19:08:49Z

The client performs many network operations with associated timeouts. If timeouts are too short, operations may fail; if they are too long, user feedback may be delayed. Timeouts are negotiated at different levels of the stack, e.g.:

client connection to proxy
proxy connection to API
Apache<->wsgi connection

When shorter timeouts at one level override longer ones at another, this can lead to unexpected results, as we saw during investigation of freedomofpress/securedrop-client#1007. There also different types of timeouts (e.g., ConnectTimeout vs. ReadTimeout) which may need to be set to different values.

We should more document the different timeouts, and the overall connection architecture, to ensure all developers can consistently reason about the expected behavior of the whole application. This can be done in the workstation wiki for now.

The text was updated successfully, but these errors were encountered:

sssoleileraaa · 2020-04-07T20:31:01Z

adding notes as i go...

/etc/apache2/apache2.conf sets the TimeOut directive to 60 seconds, which is the amount of time the apache httpd will wait for i/o, e.g. time it will wait for a tcp packet to arrive (for more info, read https://httpd.apache.org/docs/2.4/mod/core.html)
we use mod_wsgi and you can see how we configure it in /etc/apache2/sites-enabled/journalist.conf, see https://modwsgi.readthedocs.io/en/develop/configuration-directives/WSGIDaemonProcess.html
since we don't specify socket-timeout, we use the 60 second timeout specified by the apache TimeOut directive
the proxy uses python's Requests library to make requests to the server and defaults to 120 seconds before it stops waiting for a response, meaning it stops waiting if no bytes have been received on the underlying socket after 120 seconds, so the proxy would wait 60 more seconds but apache httpd times out (this probably should be updated to 60 seconds to match apache config, unless i'm missing something)
see how the proxy handles timeout and connection errors here: https://github.com/freedomofpress/securedrop-proxy/blob/d78144a95d872a57fb92c3e9034e9beeb830b097/securedrop_proxy/proxy.py#L242
if the proxy is able to make a connection, or if the server sends data, in less than 120 seconds then the proxy will continue to wait for however long it takes to get the full response
the sdk sends a qrexec command in a subprocess to the proxy vm and will kill the process if the specified timeout expires and raise a RequestTimeoutError
see how the sdk handles timeouts here: https://github.com/freedomofpress/securedrop-sdk/blob/80946593952574cfa1c0718e6a18436209fb1ac5/sdclientapi/__init__.py#L63-L73
the sdk's subprocess timeout is set here: https://github.com/freedomofpress/securedrop-sdk/blob/80946593952574cfa1c0718e6a18436209fb1ac5/sdclientapi/__init__.py#L114-L115, which defaults to 20 seconds, unless the request is download_submission which defaults to 60 seconds
the client sets the timeout used by the sdk before making an api call, with the following timeouts:
- get_sources: 40 seconds
- get_all_submissions: 40 seconds
- get_all_replies: 40 seconds
- reply_source: 5 seconds
- add_star: 5 seconds
- remove_star: 5 seconds
- authenticate: 60 seconds
- logout: 20 seconds
- download_reply: 20 seconds
- download_submission: uses a method to get a realistic timeout
- delete_source 5 seconds

sssoleileraaa · 2020-04-07T20:51:09Z

as of now, my understanding is that we have two main issues, the rest is just cleanup/refactoring:

we are timing out when there are above a hundred sources or so (I need to get back into testing today to determine how frequent these timeouts become - I'm pretty sure it still times out every single time at 1000 sources)
[needs investigation] we might be taking too long before we tell the user that there was an error with the login session when trying to authenticate

@rmol is speeding up the wsgi process by using a key cache (getting source keys takes a long time) and word is he will be adding usage of nginx, see freedomofpress/securedrop#5184, but we will still need to confirm whether or not this fixes the frequent timeout issue

rmol · 2020-04-07T21:28:51Z

if the proxy is able to make a connection, or if the server sends data, in less than 120 seconds then the proxy will continue to wait for however long it takes to get the full response

To be annoyingly pedantic, once the connection is established, if the server were ever to go 120 seconds without sending anything, the proxy would time out the request. For our purposes, it's almost always the same thing, as we're generally shipping entire responses and not trickling data. And right now we'll hit the 60-second Apache TimeOut first.

I'm pretty sure it still times out every single time at 1000 sources

I'm not seeing this with my staging environment. With just the fingerprint caching, responses take less than 30 seconds. With the key caching added in freedomofpress/securedrop#5184 they're under 20 seconds. Could be differences in our hardware or VM performance. (And yes, this is still pretty ridiculous for the amount of work happening and the size of the responses. A bunch of this is Tor, which can be helped by compressing the responses, but we're still taking several seconds to produce a relatively small amount of JSON, and I think it could still use more scrutiny.)

Just for the benefit of those following along at home, I don't think we're planning on nginx any time soon? Certainly not as part of freedomofpress/securedrop#5184. There is an issue for that (freedomofpress/securedrop#2414) but I haven't heard anything about it recently.

sssoleileraaa · 2020-04-07T22:02:34Z

To be annoyingly pedantic, once the connection is established, if the server were ever to go 120 seconds without sending anything, the proxy would time out the request. For our purposes, it's almost always the same thing, as we're generally shipping entire responses and not trickling data. And right now we'll hit the 60-second Apache TimeOut first.

Totally, that's my understanding too, which is why i mentioned how this should probably be updated to 60 seconds to match the apache config. However, my statement about the proxy waiting 120 seconds is true about the way it's configured as of now. If the apache config timeout was updated to say 180 seconds, the proxy would still timeout after 120 seconds, since that is what is specified.

I'm not seeing this with my staging environment.

I will test again. It could be a hardware issue, but it sounds like the 1000 source timeout issue has been fixed for you ever since you updated the proxy timeout to 120 seconds, which allowed us to wait until 60 seconds for a connection? Although, it seems like the 40 second sdk timeout at the subprocess layer would timeout before we would hit that 60 second timeout, correct? Something is not lining up here. I'm going to run some new tests and will report back.

sssoleileraaa · 2020-04-08T01:46:26Z

I'm not seeing this with my staging environment.

I will test again. It could be a hardware issue, but it sounds like the 1000 source timeout issue has been fixed for you ever since you updated the proxy timeout to 120 seconds, which allowed us to wait until 60 seconds for a connection? Although, it seems like the 40 second sdk timeout at the subprocess layer would timeout before we would hit that 60 second timeout, correct? Something is not lining up here. I'm going to run some new tests and will report back.

I just tested the latest client on Qubes against my staging server with 1000 sources and still see no sources populated in the source list after 10 minutes due to RequestTimeoutErrors, for more info about what I saw and which version of the client I ran, see:

freedomofpress/securedrop-client#1025 (comment)

I'm concerned we are still continuing to see timeouts as much as one third of the time when there are 200 sources, see freedomofpress/securedrop-client#1007 (comment), so I will test a server with 200 sources next.

redshiftzero · 2020-04-08T22:08:58Z

I don't think we're planning on nginx any time soon?

Nope, this is not in the near or medium term - to expand a bit, we don't have a good way of performing the migration that doesn't add a lot of burden for administrators. The bionic upgrade might be the next window to do it.

Although, it seems like the 40 second sdk timeout at the subprocess layer would timeout before we would hit that 60 second timeout, correct?

👍

Backing up for a second, before we did the final release before the first pilot provisioning, I tested master of securedrop-client at that time with 500 sources using freedomofpress/securedrop-proxy#70 as a check before we did the pilot release to ensure that I did not see timeouts. I should note I did not delete my existing database before doing this, which I should have done in hindsight. This was on a version of SecureDrop server that did have the fingerprint cache.

I just added another 500 sources (1000 total) on the server, deleted my client database, and ran with master (56e0981936838c8e24c55a6fdb3b66ced227b22b) of the client. Here's the timings logged from the client-side:

17:29:19,996 - Login success
17:30:40,138 - Sync failure
17:32:15,163 - Sync failure
17:33:50.161 - Sync failure
17:35:43,642 - First log line for get_remote_data - Fetched 1000 remote sources, 1998 remote submissions, 2046 remote replies. update_sources took 0.2768s, update_files took 0.0186s, update_messages took 1.5341s, update_replies took 4.9750s.

Sync does continue to show intermittent failures - they are intermittent.

Otherwise it sounds like everyone is using latest master in the client directory, latest securedrop-proxy with the 120s timeout, and a version of SecureDrop server that has the fingerprint cache. One possible source of divergence is that someone has a particularly slow tor circuit (which would be a useful observation since this can also happen in production).

I think next steps are: we compare directly the response time of our get_sources endpoints through tor as a sanity check. While the desired goal of all this is to determine "for normal conditions, what are sufficiently long timeout values for sync operations to set such that errors won't be needlessly raised to the user", for freedomofpress/securedrop-client#1025 we're just trying to set the timeout value for these metadata sync API calls correctly. Since there's some divergence, we could see what the response time of the direct API call is to ensure that something else isn't going on - i.e., testing with cron, and removing the client entirely from the equation. The distribution of that response time (for a representative sample of tor circuits) tells us what the timeout value should be set out on our side.

eloquence · 2020-04-22T22:39:56Z

For the 4/22-5/6 sprint, @creviera has agreed to clean up these notes a bit and copy them to the wiki, with review/input from @rmol. Then we can close out this issue.

sssoleileraaa · 2020-05-06T02:35:51Z

Wiki updates can be found here: https://github.com/freedomofpress/securedrop-workstation/wiki/Timeouts

eloquence · 2020-05-06T19:10:49Z

Thank you for the write-up Allie! :) @rmol will take a spin sometime during this sprint (5/6-5/20), then we can close out this issue.

rmol · 2020-05-12T18:53:42Z

Wiki summary looks good. Closing.

eloquence added the docs label Apr 2, 2020

eloquence added this to Sprint #48 - 4/2-4/15 in SecureDrop Team Board Apr 2, 2020

sssoleileraaa moved this from Sprint #48 - 4/2-4/15 to In Development in SecureDrop Team Board Apr 3, 2020

sssoleileraaa mentioned this issue Apr 8, 2020

Add key cache freedomofpress/securedrop#5184

Merged

2 tasks

eloquence moved this from In Development to Sprint #49 - 4/22-5/6 in SecureDrop Team Board Apr 22, 2020

rmol closed this as completed May 12, 2020

SecureDrop Team Board automation moved this from Sprint #50 - 5/6-5/20 to Done May 12, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate and document timeouts across components #525

Investigate and document timeouts across components #525

eloquence commented Apr 2, 2020

sssoleileraaa commented Apr 7, 2020 •

edited

sssoleileraaa commented Apr 7, 2020

rmol commented Apr 7, 2020

sssoleileraaa commented Apr 7, 2020

sssoleileraaa commented Apr 8, 2020

redshiftzero commented Apr 8, 2020

eloquence commented Apr 22, 2020

sssoleileraaa commented May 6, 2020

eloquence commented May 6, 2020

rmol commented May 12, 2020

Investigate and document timeouts across components #525

Investigate and document timeouts across components #525

Comments

eloquence commented Apr 2, 2020

sssoleileraaa commented Apr 7, 2020 • edited

sssoleileraaa commented Apr 7, 2020

rmol commented Apr 7, 2020

sssoleileraaa commented Apr 7, 2020

sssoleileraaa commented Apr 8, 2020

redshiftzero commented Apr 8, 2020

eloquence commented Apr 22, 2020

sssoleileraaa commented May 6, 2020

eloquence commented May 6, 2020

rmol commented May 12, 2020

sssoleileraaa commented Apr 7, 2020 •

edited