
Application totally frozen with large number of sources #716

Closed
redshiftzero opened this issue Jan 21, 2020 · 9 comments · Fixed by #733

Comments

@redshiftzero
Contributor

I just noticed this while testing the get_sources endpoint speed server-side: after successfully fetching data from that endpoint for a large number of sources, the application is totally frozen.

STR

  1. Server side: NUM_SOURCES=200 make dev
  2. Start the client (you might need to update the timeout https://github.com/freedomofpress/securedrop-client/blob/master/securedrop_client/api_jobs/downloads.py#L56 if get_sources is timing out for you)

Expected behavior

  1. Application state updates with 200 sources

Actual behavior

  1. Application enters a non-responsive state
@redshiftzero redshiftzero added this to the 0.2.0beta milestone Jan 21, 2020
@eloquence eloquence added this to Nominated for next sprint in SecureDrop Team Board Jan 21, 2020
@eloquence eloquence changed the title application totally frozen with large number of sources Application totally frozen with large number of sources Jan 22, 2020
@eloquence eloquence moved this from Nominated for next sprint to Current Sprint (1/22-2/5) in SecureDrop Team Board Jan 22, 2020
@rmol
Contributor

rmol commented Jan 23, 2020

@redshiftzero @kushaldas I saw this this evening. The UI was completely unresponsive, but in the background the client was busy downloading and decrypting submissions until, after a few hundred downloads, a request timed out, it lost the connection to the proxy, and couldn't reconnect. In /var/log/kern.log on sd-proxy there were a ton of messages like xenbus: xen store gave: unknown error E2BIG. That message led me to Qubes issue #5343.

In sd-app and sd-proxy at this point I had around 30 qrexec processes/connections. I was testing with a persistent connection between sd-app and sd-proxy, so I imagine we're hitting this a lot sooner with the current API client, which is making a qrexec call per request.

I'm still looking into why I couldn't reconnect immediately after the timeout, when I should not have had more than one connection opened up to that point, and why I had these processes hanging around. My working theory is that they weren't getting cleaned up because when a request timed out, I was not calling communicate after terminate, and/or wasn't invoking the Popen constructor with start_new_session=True. I'll see if I can confirm this. If that's it, the current SDK has similar problems.

It looks like the only way to recover if we get into this situation is to reboot sd-proxy.
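For reference, a minimal sketch of the cleanup pattern described above, assuming the request is a qrexec call wrapped in subprocess.Popen; the function name and arguments are illustrative, not the actual SDK code:

```python
import subprocess


def run_qrexec(args, timeout):
    # start_new_session=True puts the child in its own session/process
    # group, so it is not left behind as an orphan when we give up on it.
    proc = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        start_new_session=True,
    )
    try:
        return proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.terminate()
        # Calling communicate() again after terminate() reaps the child and
        # closes its pipes, so we do not accumulate lingering qrexec
        # processes or leaked file descriptors after a timeout.
        proc.communicate()
        raise
```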

@redshiftzero
Contributor Author

Interesting... so it seems like there are multiple issues, since the UI should not be non-responsive if the client is merrily downloading/decrypting (we must be running some blocking actions in the main event loop). I'll also note for posterity that I saw this non-responsive behavior / hanging forever without qrexec, i.e. when making direct HTTP requests to the server container.

We might need to work around QubesOS/qubes-issues#5343 with a sleep after each network job.
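A minimal sketch of that kind of workaround, assuming the network jobs run sequentially in a worker; the runner and the delay value here are illustrative, not the client's actual queue:

```python
import time

# Illustrative cooldown; the right value would need to be measured against
# the xenstore quota behavior described in QubesOS/qubes-issues#5343.
QREXEC_COOLDOWN_SECONDS = 0.5


def run_jobs(jobs):
    for job in jobs:
        job.run()
        # Pause between qrexec-backed jobs so we do not exhaust the
        # xenstore quota on sd-proxy.
        time.sleep(QREXEC_COOLDOWN_SECONDS)
```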

@kushaldas
Contributor

I was just testing with 200 sources in the local Docker container. Even though the network calls finished many minutes ago, the UI is still frozen, CPU is at 100%, and memory usage is at 3.5GB and increasing.

I will go down to 20-50 sources and try to profile first. Otherwise, in this state we cannot even profile it properly.

@kushaldas
Contributor

The UI is becoming unresponsive even when it is downloading submissions in a different thread.

@kushaldas
Contributor

kushaldas commented Jan 23, 2020

[profiling call graph: master_calls_sdclient]

SourceWidget.__init__ is getting called far too many times. We could also cache the image loading (load_svg) calls.

This is for only 90 sources. The system was also running out of open file descriptors. The major thing to note is the memory usage: for those 90 sources, total memory usage reached 1.5GB+.
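One way to cache the image loading mentioned above would be to memoize load_svg by filename. A sketch, assuming load_svg returns a shareable object such as a QPixmap or QIcon (if it returns a QWidget, which can only have one parent, a different approach would be needed); the wrapper name and import path are assumptions:

```python
from functools import lru_cache

from securedrop_client.resources import load_svg  # assumed import path


@lru_cache(maxsize=None)
def load_svg_cached(filename: str):
    # Reuse the loaded asset for repeated filenames so that rebuilding
    # hundreds of SourceWidgets does not reload the same SVG each time.
    return load_svg(filename)
```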

@redshiftzero
Contributor Author

redshiftzero commented Jan 23, 2020

Theory: it looks like update_sources in the controller eventually calls SourceList.update. That method clears the SourceList and then calls SourceWidget.__init__ for each source. We call update_sources whenever we successfully download a message or reply, or do a sync. When we are downloading 2 messages and 2 replies for each of 200 sources and redraw the entire source list each time, that means 200 * 2 * 2 * 200 = 160,000 calls to SourceWidget.__init__, which is the order of magnitude of what is being observed (we might not reach that exact number if the app freezes beforehand).
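To make the arithmetic concrete (a back-of-the-envelope sketch, not actual client code):

```python
num_sources = 200
messages_per_source = 2
replies_per_source = 2

# Each successful message/reply download triggers update_sources(), which
# clears the source list and rebuilds a SourceWidget for every source.
download_events = num_sources * (messages_per_source + replies_per_source)
widget_inits = download_events * num_sources

print(download_events)  # 800 full redraws of the source list
print(widget_inits)     # 160000 SourceWidget.__init__ calls
```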

@ntoll
Contributor

ntoll commented Jan 23, 2020

@redshiftzero you win the internet for today.

I've been trying out, fighting with and harrumphing at Python profiling tools that don't core dump when used with PyQt. ;-)

This looks like a good theory to start with. I've just commented out the update_sources call in the controller's on_message_download_success, and the UI is responsive for me with 50 sources (when it wasn't before).

One solution would be to do inline updates of the SourceList in the UI (much like we've done for the conversation view).
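A rough sketch of what such inline updates could look like, assuming the SourceList keeps a mapping from source UUID to its widget; the helper names here are illustrative, not the existing API:

```python
def update(self, sources):
    # Incremental update: only touch widgets that changed, instead of
    # clearing the list and re-running SourceWidget.__init__ for every
    # source on each download/sync event.
    incoming = {source.uuid: source for source in sources}

    # Drop widgets for sources that no longer exist on the server.
    for uuid in set(self.source_widgets) - set(incoming):
        self.remove_source_widget(uuid)

    # Refresh existing widgets in place; create widgets only for new sources.
    for uuid, source in incoming.items():
        if uuid in self.source_widgets:
            self.source_widgets[uuid].refresh(source)
        else:
            self.add_source_widget(uuid, SourceWidget(source))
```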

@kushaldas
Contributor

kushaldas commented Jan 23, 2020

@ntoll yes, just commenting out that update_sources call makes the UI responsive.

[profiling call graph: without_update-sources_sdclient]

@kushaldas
Contributor

I can confirm that even with 1000 sources the client is working fine, though I did not look at the client for some time after logging in.
