Separate MetadataSyncJob from main queue and increase frequency of background syncs #652

sssoleileraaa · 2019-12-10T00:01:09Z

Description

Separate metadata syncs from the main queue.

Follow-up for #491

Why do this?

Currently, whenever a queue job other than MetadataSyncJob sees an ApiInaccessibleError or RequestTimeoutError, we pause the queue. The queue can be unpaused by the follow actions:

Clicking on the "Retry" link from the client. This unpauses the queue and retries the last failed job. If it fails again, the queue will pause once again.
Initiating a sync either by our 5-minute background process or clicking on the refresh icon. A sync adds a MetadataSyncJob to the queue before unpausing it. Since MetadataSyncJob is the highest priority job it cuts to the front of the line and keeps retrying until it's successful.

The downside to this approach is that potentially many MetadataSyncJobs can be added to the queue, and they will always cut in line, making it so other jobs have to wait longer and longer before getting processed. For example, if a user clicks refresh 5 times and the background process kicks off a sync, there will be 6 MetadataSyncJobs at the front of the queue. And the longer the client fails to connect to the server, the more MetadataSyncJobs pile up. Even when there are no network issues and the queue is not paused, other tasks kick off syncs. For instance, the reply job triggers a sync so that there isn't a long period of displaying a conversation out of order (this can happen if a message came in during the time period between syncs). Another time a sync is triggered is when a user tries to open a file that no longer exists on the file system.

Also, if we want to sync with the server more often, to decrease the likelihood of journalists viewing out-of-date information when they send a reply, this means more MetadataSyncJobs cutting in line.

What's not in scope?

Before tackling this issue, we should do the following (when all these Issues are closed, our background sync will be the only place MetadataSyncJobs are created):

Remove the creation of MetadataSyncJobs from other areas of the code (these tasks are being tracked here: Sync icon should be disabled during active sync #388, No longer sync on successful delete or star operation #658, No longer sync on reply success #660, No longer trigger sync when files are missing #670, Trigger sync without calling sync_api after login #671).
Create a UserMetadataSyncJob for when the user manually makes a sync request by clicking the refresh icon (this task is being tracked here: [refactor] Separate MetadataSync and UserMetadataSync #655, also Sync icon should be disabled during active sync #388 should be done ASAP). This is to ensure that the client can be responsive to user actions.

What's in scope for this issue?

So the scope of this issue is to run MetadataSyncJob outside of the main queue (until we add async job support) and to sync more frequently to make up for removing syncs from other areas of the code. What still needs to be decided by is:

How often to sync (idea: every 15 seconds or until a MetadataSyncJob is successful, whichever one is longer)
Whether or not to create another queue just for MetadataSyncJob processing (using a queue fits within our current architecture of using queues for processing jobs, however, it would also make a lot of sense to not use a queue since we don't ever need to line up more than one MetadataSyncJob at a time)

Criteria

No longer use the main queue to process MetadataSyncJobs
Never add more than one MetadataSyncJob at a time
MetadataSyncJob should continue to be derived from ApiJob (the only types of jobs with api access)
Decrease the time between syncs so that we're never out of sync for too long
If the queues are paused when a MetadataSyncJob succeeds unpause the queues

Regression Criteria

Given that some operations have repeatedly failed due to network issues
When I do nothing
Then the client should periodically test connectivity
And the error message should disappear when connectivity is restored
And the queue should be resumed when connectivity is restored

sssoleileraaa · 2019-12-10T00:02:38Z

Also, I forgot to mention that this would allow us to prevent new MetadataSyncJobs getting added every 5 minutes by the timer if the queue is already paused because we would only create a new MetadataSyncJob when the health check succeeds.

eloquence · 2019-12-11T18:02:13Z

We discussed this briefly today. @redshiftzero proposed to have this queue handle the background-run MetadataSyncJobs for now, instead of adding a new ping action. User-initiated actions (and download jobs triggered by background syncs) would be handled by the main queue. Since we're also aiming to make MetadataSyncs faster, it's still an open question whether we need a separate network ping action.

This is under consideration for the end-of-year kanban period or the next sprint.

eloquence · 2019-12-11T18:19:24Z

As I understand this issue's original scope:

Create a network health check, e.g. a ping action of some kind
Add this to the main queue every 5 minutes
On success, it adds a MetaDataSync job to the main queue
On failure, it pauses the queue (but continues to run, and unpauses on success)

As I understand Jen's proposal:

Create a new queue
Add MetadataSyncJobs to that queue in frequent intervals (e.g., whenever the last one completes or fails)
On success, it adds related jobs (e.g., message/reply downloads) to the main queue
On failure, it pauses the main queue (but continues to run, and unpauses on success)
User-initiated metadata syncs would continue to be handled through the main queue

Please let me know if I captured that accurately. I would still nominate this to its own issue, because the scope, behavior, acceptance criteria, and possibly dependencies are substantively different.

sssoleileraaa · 2019-12-11T18:46:59Z

As I understand this issue's original scope:

Create a network health check, e.g. a ping action of some kind

Add this to the main queue every 5 minutes

On success, it adds a MetaDataSync job to the main queue

On failure, it pauses the queue (but continues to run, and unpauses on success)

As I understand Jen's proposal:

Create a new queue

Add MetadataSyncJobs to that queue in frequent intervals (e.g., whenever the last one completes or fails)

On success, it adds related jobs (e.g., message/reply downloads) to the main queue

On failure, it pauses the main queue (but continues to run, and unpauses on success)

User-initiated metadata syncs would continue to be handled through the main queue

Please let me know if I captured that accurately. I would still nominate this to its own issue, because the scope, behavior, acceptance criteria, and possibly dependencies are substantively different.

Yes, the first proposal assumes the health checker is in its own thread rather than queue (because there's nothing to queue up) and it doesn't have to be every 5 minutes, we could make it every 15 seconds or whatever we think makes sense for checking the network and adding MetadataSyncs.

Jen brought up a good reason to avoid a separate ping: it costs two round trips to do a sync and could take longer than 15 seconds to do in total. Adding a separate queue for MetadataSyncJobs means we will have a queue that'll only ever contain one job at a time, but the reason for this is because we don't have support of async jobs in the main queue yet and it keeps our code uniform and easy to read since we're used to seeing api tasks be queue jobs (it fits within our architecture design).

I think it would make the most sense to focus on tasks in this order:

removing code that unnecessarily enqueues MetadataSyncJobs (No longer sync on successful delete or star operation #658 and Sync icon should be disabled during active sync #388 (comment))
make sure we never add more than one MetadataSyncJob at a time (this we do by only enqueuing a MetadataSyncJob when one succeeds and creating a separate UserMetadataSyncJob for when the user manually makes a sync request by clicking the refresh icon [refactor] Separate MetadataSync and UserMetadataSync #655)
decrease the time between syncs to address the issue with showing outdated conversations (at this point we can remove calling sync_api when a reply successfully sends), which means we will need to move MetadataSyncJob to its own queue so we don't starve out other jobs.

sssoleileraaa · 2019-12-11T19:09:10Z

decrease the time between syncs to address the issue with showing outdated conversations (at this point we can remove calling sync_api when a reply successfully sends), which means we will need to move MetadataSyncJob to its own queue so we don't starve out other jobs.

Created a new issue for removing the sync during a reply send: #660

The work that doesn't have its own issue and could be captured by this ticket are:

make sure we never add more than one MetadataSyncJob at a time
decrease the time between syncs and move MetadataSyncJob to its own queue (this we may be able to split up further)

eloquence · 2019-12-17T20:28:03Z

The work that doesn't have its own issue and could be captured by this ticket are:

make sure we never add more than one MetadataSyncJob at a time

decrease the time between syncs and move MetadataSyncJob to its own queue (this we may be able to split up further)

That SGTM @creviera. Would you mind editing the issue description to that effect, to avoid any ambiguity? Also happy to take a stab at that, if you prefer.

sssoleileraaa · 2019-12-17T23:51:39Z

Updated this issue: #652 (comment)

redshiftzero · 2019-12-18T23:51:10Z

great writeup, thank you!

every 15 seconds or until a MetadataSyncJob is successful, whichever one is longer

👍

Whether or not to create another queue just for MetadataSyncJob processing

some ideas/thoughts. We could:

do nothing special and use existing logic. Kind of nice because it uses existing logic, but not the greatest because we can get an unnecessary pileup of jobs (unless we are really judicious about removing syncs from elsewhere). However, it's still an improvement because the job pileup doesn't impact other network actions.
do 0 but add the following logic in MetadataSyncJob: check the time of last sync, if it's been more than 15 seconds (or whatever we decide), run the sync. Else return.
create a class that runs single API calls (which we kind of already have in logic.py). We end up back with the "what should be able to make an API call" issue (What can make an API call [Discussion] #400), but we could do something like rename ApiJobQueue -> NetworkJobRunner and keep a reference to the class that runs single API calls on NetworkJobRunner. We should chat more about the implications of running async jobs (as they'd each run in their own threads, so we'd want to be careful about starting them) - if we had this, would we want these jobs to not retry? My thought right now is that we would want to retry these network actions given how flaky tor can be (e.g. if we do 15 seconds and a metadata sync fails 2 times in a row, that's a long time during a chat). Combined with the timestamp behavior in 1 we could handle that.
create the network health checker as a queue with maxsize=1, such that only one job can be added at a time. we add a add_job_if_not_empty method on ApiJob which will only add a new job if the queue in question is not empty (which we can check via handling queue.Full). This way we don't get jobs piling up, we get the retries, we still get a lot of code reuse, we use a nice interface in Python for sharing data between threads, and we know we have just a single thread running at any time attempting metadata syncs.
Something else

tbh I think an investigatory spike could be worthwhile here

ntoll · 2020-01-20T16:39:01Z

Hi, so I've been looking into this issue.

I'm painfully aware that I'm likely holding the wrong end of the stick, misunderstood something or don't have all the context needed, so please shoot me down.

As I understand it the problem is that the message queue gets clogged up with MetadataSyncJob instances which cut to the front of the line thus holding up other jobs.

Why not just limit only one MetadataSyncJob to ever be in flight at a time..? Any number of aspects of the application could ask for such a job, but if there's one already underway the queue ensures no new job is created. This feels like a far simpler solution than a completely new queue: it "only" involves a flag set when the job starts and unset when it completes or fails (notice the air quotes, I'm painfully aware that I'm probably missing something obvious here).

What am I missing here, and/or is this other approach enough to mitigate the clogged queue problem? Like I said, I'm painfully aware of my lack of context and this is probably a good learning exercise for me too.

redshiftzero · 2020-01-21T17:49:05Z

Yeah, there's not much point in adding more than one MetadataSyncJob to run at a time. One idea I had to go about this based on your current diff in #715 could be to pass an optional queue size to RunnableQueue.__init__ for use by the metadata queue. We'd use it when creating RunnableQueue.queue such that the queue object is instantiated with queue.PriorityQueue(maxsize=1) for the metadata queue. Then we'd handle the queue.Full exception: i.e. we'd just pass if there's already a MetadataSyncJob in the queue (btw this is just an idea, feel free to do something different if you see a better path).

eloquence · 2020-01-28T01:44:57Z

With #715 merged, we're partway to the goals spec'd out in this issue. Some work described here has been split off into separate issues:

Ensure that the sync is continuous (metadata syncs are not continuous #672)
Optimize the sync frequency (Optimize metadata sync frequency #738)

Additionally, we are already tracking needed follow-up to remove unnecessary sync calls, to fix sync-related UI glitchiness, and to revise the user experience of the "sync" area in the client: #687, #726, #671, #670, #352

Some proposed refactoring is tracked here: #647

Since this issue is not really organized as a tracking epic and we've covered follow-up work in separate issues, closing.

redshiftzero mentioned this issue Dec 10, 2019

Only call sync_api when necessary #462

Closed

sssoleileraaa changed the title ~~[refactor] Create NetworkHealthChecker class~~ Move MetadataSyncJob to its own queue and increase frequency of syncs Dec 11, 2019

This was referenced Dec 17, 2019

[refactor] Separate MetadataSync and UserMetadataSync #655

Closed

Deprioritize background jobs over user-initiated actions #657

Closed

eloquence added this to the 0.2.0beta milestone Dec 17, 2019

sssoleileraaa changed the title ~~Move MetadataSyncJob to its own queue and increase frequency of syncs~~ Separate MetadataSyncJob from main queue and increase frequency of background syncs Dec 17, 2019

eloquence added this to Near Term - SD Workstation in SecureDrop Team Board Dec 19, 2019

This was referenced Dec 19, 2019

Trigger sync without calling sync_api after login #671

Closed

"Last refresh:" activity status message is not treated as default status #414

Closed

eloquence moved this from Near Term - SD Workstation to Nominated for next sprint in SecureDrop Team Board Jan 7, 2020

eloquence moved this from Nominated for next sprint to Current Sprint (1/8-1/22) in SecureDrop Team Board Jan 8, 2020

This was referenced Jan 8, 2020

Replace refresh icon with non-interactive sync activity indicator #687

Closed

Implement network connectivity error handling #391

Closed

redshiftzero mentioned this issue Jan 18, 2020

app/queue: prioritize user-triggered state changes #708

Merged

6 tasks

sssoleileraaa moved this from Current Sprint (1/8-1/22) to In Development in SecureDrop Team Board Jan 21, 2020

sssoleileraaa moved this from In Development to Current Sprint (1/8-1/22) in SecureDrop Team Board Jan 21, 2020

ntoll mentioned this issue Jan 21, 2020

Add new metadata queue. #715

Merged

6 tasks

sssoleileraaa moved this from Current Sprint (1/8-1/22) to In Development in SecureDrop Team Board Jan 22, 2020

eloquence added the release blocker label Jan 27, 2020

eloquence closed this as completed Jan 28, 2020

SecureDrop Team Board automation moved this from In Development to Done Jan 28, 2020

sssoleileraaa mentioned this issue Jan 28, 2020

Make sync continuous #739

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Separate MetadataSyncJob from main queue and increase frequency of background syncs #652

Separate MetadataSyncJob from main queue and increase frequency of background syncs #652

sssoleileraaa commented Dec 10, 2019 •

edited

sssoleileraaa commented Dec 10, 2019

eloquence commented Dec 11, 2019 •

edited

eloquence commented Dec 11, 2019 •

edited

sssoleileraaa commented Dec 11, 2019

sssoleileraaa commented Dec 11, 2019

eloquence commented Dec 17, 2019

sssoleileraaa commented Dec 17, 2019

redshiftzero commented Dec 18, 2019

ntoll commented Jan 20, 2020

redshiftzero commented Jan 21, 2020

eloquence commented Jan 28, 2020 •

edited

Separate MetadataSyncJob from main queue and increase frequency of background syncs #652

Separate MetadataSyncJob from main queue and increase frequency of background syncs #652

Comments

sssoleileraaa commented Dec 10, 2019 • edited

Description

Why do this?

What's not in scope?

What's in scope for this issue?

Criteria

Regression Criteria

sssoleileraaa commented Dec 10, 2019

eloquence commented Dec 11, 2019 • edited

eloquence commented Dec 11, 2019 • edited

sssoleileraaa commented Dec 11, 2019

sssoleileraaa commented Dec 11, 2019

eloquence commented Dec 17, 2019

sssoleileraaa commented Dec 17, 2019

redshiftzero commented Dec 18, 2019

ntoll commented Jan 20, 2020

redshiftzero commented Jan 21, 2020

eloquence commented Jan 28, 2020 • edited

sssoleileraaa commented Dec 10, 2019 •

edited

eloquence commented Dec 11, 2019 •

edited

eloquence commented Dec 11, 2019 •

edited

eloquence commented Jan 28, 2020 •

edited