Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Separate MetadataSyncJob from main queue and increase frequency of background syncs #652

Closed
sssoleileraaa opened this issue Dec 10, 2019 · 11 comments

Comments

@sssoleileraaa
Copy link
Contributor

sssoleileraaa commented Dec 10, 2019

Description

Separate metadata syncs from the main queue.

Follow-up for #491

Why do this?

Currently, whenever a queue job other than MetadataSyncJob sees an ApiInaccessibleError or RequestTimeoutError, we pause the queue. The queue can be unpaused by the follow actions:

  1. Clicking on the "Retry" link from the client. This unpauses the queue and retries the last failed job. If it fails again, the queue will pause once again.

  2. Initiating a sync either by our 5-minute background process or clicking on the refresh icon. A sync adds a MetadataSyncJob to the queue before unpausing it. Since MetadataSyncJob is the highest priority job it cuts to the front of the line and keeps retrying until it's successful.

The downside to this approach is that potentially many MetadataSyncJobs can be added to the queue, and they will always cut in line, making it so other jobs have to wait longer and longer before getting processed. For example, if a user clicks refresh 5 times and the background process kicks off a sync, there will be 6 MetadataSyncJobs at the front of the queue. And the longer the client fails to connect to the server, the more MetadataSyncJobs pile up. Even when there are no network issues and the queue is not paused, other tasks kick off syncs. For instance, the reply job triggers a sync so that there isn't a long period of displaying a conversation out of order (this can happen if a message came in during the time period between syncs). Another time a sync is triggered is when a user tries to open a file that no longer exists on the file system.

Also, if we want to sync with the server more often, to decrease the likelihood of journalists viewing out-of-date information when they send a reply, this means more MetadataSyncJobs cutting in line.

What's not in scope?

Before tackling this issue, we should do the following (when all these Issues are closed, our background sync will be the only place MetadataSyncJobs are created):

What's in scope for this issue?

So the scope of this issue is to run MetadataSyncJob outside of the main queue (until we add async job support) and to sync more frequently to make up for removing syncs from other areas of the code. What still needs to be decided by is:

  • How often to sync (idea: every 15 seconds or until a MetadataSyncJob is successful, whichever one is longer)
  • Whether or not to create another queue just for MetadataSyncJob processing (using a queue fits within our current architecture of using queues for processing jobs, however, it would also make a lot of sense to not use a queue since we don't ever need to line up more than one MetadataSyncJob at a time)

Criteria

  • No longer use the main queue to process MetadataSyncJobs
  • Never add more than one MetadataSyncJob at a time
  • MetadataSyncJob should continue to be derived from ApiJob (the only types of jobs with api access)
  • Decrease the time between syncs so that we're never out of sync for too long
  • If the queues are paused when a MetadataSyncJob succeeds unpause the queues

Regression Criteria

Given that some operations have repeatedly failed due to network issues
When I do nothing
Then the client should periodically test connectivity
And the error message should disappear when connectivity is restored
And the queue should be resumed when connectivity is restored

@sssoleileraaa
Copy link
Contributor Author

Also, I forgot to mention that this would allow us to prevent new MetadataSyncJobs getting added every 5 minutes by the timer if the queue is already paused because we would only create a new MetadataSyncJob when the health check succeeds.

@eloquence
Copy link
Member

eloquence commented Dec 11, 2019

We discussed this briefly today. @redshiftzero proposed to have this queue handle the background-run MetadataSyncJobs for now, instead of adding a new ping action. User-initiated actions (and download jobs triggered by background syncs) would be handled by the main queue. Since we're also aiming to make MetadataSyncs faster, it's still an open question whether we need a separate network ping action.

This is under consideration for the end-of-year kanban period or the next sprint.

@eloquence
Copy link
Member

eloquence commented Dec 11, 2019

As I understand this issue's original scope:

  1. Create a network health check, e.g. a ping action of some kind
  2. Add this to the main queue every 5 minutes
  3. On success, it adds a MetaDataSync job to the main queue
  4. On failure, it pauses the queue (but continues to run, and unpauses on success)

As I understand Jen's proposal:

  1. Create a new queue
  2. Add MetadataSyncJobs to that queue in frequent intervals (e.g., whenever the last one completes or fails)
  3. On success, it adds related jobs (e.g., message/reply downloads) to the main queue
  4. On failure, it pauses the main queue (but continues to run, and unpauses on success)
  5. User-initiated metadata syncs would continue to be handled through the main queue

Please let me know if I captured that accurately. I would still nominate this to its own issue, because the scope, behavior, acceptance criteria, and possibly dependencies are substantively different.

@sssoleileraaa
Copy link
Contributor Author

As I understand this issue's original scope:

  1. Create a network health check, e.g. a ping action of some kind
  2. Add this to the main queue every 5 minutes
  3. On success, it adds a MetaDataSync job to the main queue
  4. On failure, it pauses the queue (but continues to run, and unpauses on success)

As I understand Jen's proposal:

  1. Create a new queue
  2. Add MetadataSyncJobs to that queue in frequent intervals (e.g., whenever the last one completes or fails)
  3. On success, it adds related jobs (e.g., message/reply downloads) to the main queue
  4. On failure, it pauses the main queue (but continues to run, and unpauses on success)
  5. User-initiated metadata syncs would continue to be handled through the main queue

Please let me know if I captured that accurately. I would still nominate this to its own issue, because the scope, behavior, acceptance criteria, and possibly dependencies are substantively different.

Yes, the first proposal assumes the health checker is in its own thread rather than queue (because there's nothing to queue up) and it doesn't have to be every 5 minutes, we could make it every 15 seconds or whatever we think makes sense for checking the network and adding MetadataSyncs.

Jen brought up a good reason to avoid a separate ping: it costs two round trips to do a sync and could take longer than 15 seconds to do in total. Adding a separate queue for MetadataSyncJobs means we will have a queue that'll only ever contain one job at a time, but the reason for this is because we don't have support of async jobs in the main queue yet and it keeps our code uniform and easy to read since we're used to seeing api tasks be queue jobs (it fits within our architecture design).

I think it would make the most sense to focus on tasks in this order:

  1. removing code that unnecessarily enqueues MetadataSyncJobs (No longer sync on successful delete or star operation #658 and Sync icon should be disabled during active sync #388 (comment))
  2. make sure we never add more than one MetadataSyncJob at a time (this we do by only enqueuing a MetadataSyncJob when one succeeds and creating a separate UserMetadataSyncJob for when the user manually makes a sync request by clicking the refresh icon [refactor] Separate MetadataSync and UserMetadataSync #655)
  3. decrease the time between syncs to address the issue with showing outdated conversations (at this point we can remove calling sync_api when a reply successfully sends), which means we will need to move MetadataSyncJob to its own queue so we don't starve out other jobs.

@sssoleileraaa
Copy link
Contributor Author

  1. decrease the time between syncs to address the issue with showing outdated conversations (at this point we can remove calling sync_api when a reply successfully sends), which means we will need to move MetadataSyncJob to its own queue so we don't starve out other jobs.

Created a new issue for removing the sync during a reply send: #660

The work that doesn't have its own issue and could be captured by this ticket are:

  • make sure we never add more than one MetadataSyncJob at a time
  • decrease the time between syncs and move MetadataSyncJob to its own queue (this we may be able to split up further)

@sssoleileraaa sssoleileraaa changed the title [refactor] Create NetworkHealthChecker class Move MetadataSyncJob to its own queue and increase frequency of syncs Dec 11, 2019
@eloquence eloquence added this to the 0.2.0beta milestone Dec 17, 2019
@eloquence
Copy link
Member

The work that doesn't have its own issue and could be captured by this ticket are:

  • make sure we never add more than one MetadataSyncJob at a time
  • decrease the time between syncs and move MetadataSyncJob to its own queue (this we may be able to split up further)

That SGTM @creviera. Would you mind editing the issue description to that effect, to avoid any ambiguity? Also happy to take a stab at that, if you prefer.

@sssoleileraaa
Copy link
Contributor Author

Updated this issue: #652 (comment)

@sssoleileraaa sssoleileraaa changed the title Move MetadataSyncJob to its own queue and increase frequency of syncs Separate MetadataSyncJob from main queue and increase frequency of background syncs Dec 17, 2019
@redshiftzero
Copy link
Contributor

great writeup, thank you!

every 15 seconds or until a MetadataSyncJob is successful, whichever one is longer

👍

Whether or not to create another queue just for MetadataSyncJob processing

some ideas/thoughts. We could:

  1. do nothing special and use existing logic. Kind of nice because it uses existing logic, but not the greatest because we can get an unnecessary pileup of jobs (unless we are really judicious about removing syncs from elsewhere). However, it's still an improvement because the job pileup doesn't impact other network actions.

  2. do 0 but add the following logic in MetadataSyncJob: check the time of last sync, if it's been more than 15 seconds (or whatever we decide), run the sync. Else return.

  3. create a class that runs single API calls (which we kind of already have in logic.py). We end up back with the "what should be able to make an API call" issue (What can make an API call [Discussion] #400), but we could do something like rename ApiJobQueue -> NetworkJobRunner and keep a reference to the class that runs single API calls on NetworkJobRunner. We should chat more about the implications of running async jobs (as they'd each run in their own threads, so we'd want to be careful about starting them) - if we had this, would we want these jobs to not retry? My thought right now is that we would want to retry these network actions given how flaky tor can be (e.g. if we do 15 seconds and a metadata sync fails 2 times in a row, that's a long time during a chat). Combined with the timestamp behavior in 1 we could handle that.

  4. create the network health checker as a queue with maxsize=1, such that only one job can be added at a time. we add a add_job_if_not_empty method on ApiJob which will only add a new job if the queue in question is not empty (which we can check via handling queue.Full). This way we don't get jobs piling up, we get the retries, we still get a lot of code reuse, we use a nice interface in Python for sharing data between threads, and we know we have just a single thread running at any time attempting metadata syncs.

  5. Something else

tbh I think an investigatory spike could be worthwhile here

@eloquence eloquence added this to Near Term - SD Workstation in SecureDrop Team Board Dec 19, 2019
@eloquence eloquence moved this from Near Term - SD Workstation to Nominated for next sprint in SecureDrop Team Board Jan 7, 2020
@eloquence eloquence moved this from Nominated for next sprint to Current Sprint (1/8-1/22) in SecureDrop Team Board Jan 8, 2020
@ntoll
Copy link
Contributor

ntoll commented Jan 20, 2020

Hi, so I've been looking into this issue.

I'm painfully aware that I'm likely holding the wrong end of the stick, misunderstood something or don't have all the context needed, so please shoot me down.

As I understand it the problem is that the message queue gets clogged up with MetadataSyncJob instances which cut to the front of the line thus holding up other jobs.

Why not just limit only one MetadataSyncJob to ever be in flight at a time..? Any number of aspects of the application could ask for such a job, but if there's one already underway the queue ensures no new job is created. This feels like a far simpler solution than a completely new queue: it "only" involves a flag set when the job starts and unset when it completes or fails (notice the air quotes, I'm painfully aware that I'm probably missing something obvious here).

What am I missing here, and/or is this other approach enough to mitigate the clogged queue problem? Like I said, I'm painfully aware of my lack of context and this is probably a good learning exercise for me too.

@sssoleileraaa sssoleileraaa moved this from Current Sprint (1/8-1/22) to In Development in SecureDrop Team Board Jan 21, 2020
@sssoleileraaa sssoleileraaa moved this from In Development to Current Sprint (1/8-1/22) in SecureDrop Team Board Jan 21, 2020
@ntoll ntoll mentioned this issue Jan 21, 2020
6 tasks
@redshiftzero
Copy link
Contributor

Yeah, there's not much point in adding more than one MetadataSyncJob to run at a time. One idea I had to go about this based on your current diff in #715 could be to pass an optional queue size to RunnableQueue.__init__ for use by the metadata queue. We'd use it when creating RunnableQueue.queue such that the queue object is instantiated with queue.PriorityQueue(maxsize=1) for the metadata queue. Then we'd handle the queue.Full exception: i.e. we'd just pass if there's already a MetadataSyncJob in the queue (btw this is just an idea, feel free to do something different if you see a better path).

@sssoleileraaa sssoleileraaa moved this from Current Sprint (1/8-1/22) to In Development in SecureDrop Team Board Jan 22, 2020
@eloquence
Copy link
Member

eloquence commented Jan 28, 2020

With #715 merged, we're partway to the goals spec'd out in this issue. Some work described here has been split off into separate issues:

Additionally, we are already tracking needed follow-up to remove unnecessary sync calls, to fix sync-related UI glitchiness, and to revise the user experience of the "sync" area in the client: #687, #726, #671, #670, #352

Some proposed refactoring is tracked here: #647

Since this issue is not really organized as a tracking epic and we've covered follow-up work in separate issues, closing.

SecureDrop Team Board automation moved this from In Development to Done Jan 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Development

No branches or pull requests

4 participants