Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement network connectivity error handling #391

Closed
2 of 4 tasks
eloquence opened this issue May 29, 2019 · 12 comments
Closed
2 of 4 tasks

Implement network connectivity error handling #391

eloquence opened this issue May 29, 2019 · 12 comments

Comments

@eloquence
Copy link
Member

eloquence commented May 29, 2019

(This is closely related to #291, but not dependent on it.)

If the client repeatedly fails to perform the next operation in the queue after a fixed number of retries and this is at least in part due to network conditions (e.g., network down, Whonix VM down, Tor network issues), we want to report this to the user as an error.

The purpose of this error message is to let the user know that operations are pending, and to give them an opportunity to correct the problem.

A "Retry" action should cause the error message to disappear, and queue operation to be resumed immediately.

Design

Behavior via Invision · Zeplin

Related issuess

User Story

As a user, I want to know when the client is experiencing connectivity isses, so that I can address them or try again later.

Acceptance Criteria

Given that I have performed some operations on the client
And I have temporarily lost (Tor) network connectivity
When the client has performed a set number of automatic retries that have failed (#384)
Then it should alert me to the error in the manner specified here.

Given that I have performed some operations on the client
And some of those operations fail for reasons unrelated to Tor network connectivity (e.g., attempting to star a source that has since been deleted)
When the client recognizes the failure
Then it should not treat the error in the manner specified here.

Given that some operations have repeatedly failed due to network issues
When I click "Retry"
Then the error message should disappear, and the client should attempt to resume processing the queue

Given that some operations have repeatedly failed due to network issues
When I do nothing
Then the client should periodically test connectivity
And the error message should disappear when connectivity is restored
And the queue should be resumed when connectivity is restored

Given that I have retried an operation that has previously failed
And the network is still down
When the operation fails again after a limited number of retries
Then it the client should again alert me to the error in the manner specified here.

Given that the client is in the error state specified here
When I perform an operation that requires connectivity (e.g., reply)
Then that operation should be processed normally (e.g., added to queue)
And the queue should continue to be paused until network connectivity is restored

Given that the client is in the error state specified here
When I add a reply
Then the reply should be displayed in "sending" state

Given that the client is in the error state specified here
When I star a source
Then the source should be displayed in "starred" state

Given that the client is in the error state specified here
When I delete a source
Then the source should be displayed in "deletion pending" state

@eloquence eloquence added this to the 0.2.0alpha milestone Jun 25, 2019
@eloquence eloquence changed the title Implement network connectivity error Implement network connectivity error handling Jun 26, 2019
@sssoleileraaa
Copy link
Contributor

sssoleileraaa commented Jul 22, 2019

In order to implement this, we should first implement pausing the queue, otherwise we won't be able to test the "Retry" action that should cause the queue operation to be resumed immediately. Also it would be helpful implement realistic timeouts before the "Retry" action so that we can test and get a good feel for how long a retry might take for reoccurring timeouts. But in reality, we might end up tweaking timeout values a few times until we have a smarter system that has some sort of backoff strategy for request timeouts so that we slow down our retries until we pause the queue and require user intervention.

@eloquence
Copy link
Member Author

Per the above comment, bumping this issue officially off the current sprint to focus remaining time on #430 and #443 as dependencies.

@sssoleileraaa
Copy link
Contributor

Breaking this down:

  1. Pause/Resume all queue: Pausing the queue #443
  2. Configurable number of retries: Enable configurable number of retries for each API job #384
  3. Realistic timeouts: Timeouts too long or too short #430
  4. Periodically test connectivity and resume queue automatically if network issue resolved: health checker - auto resume #491

@eloquence
Copy link
Member Author

eloquence commented Jul 24, 2019

For the 7/24-8/7 sprint, we'll aim to resolve #443 and #430 above, and add a first iteration of the UI layer (error message and functioning "Retry" button). If it makes sense to split the UI layer into its own issue/PR, that may be done mid-sprint.

#491 is out of scope for the current sprint, so the epic remains in the near-term backlog; once #491 is resolved, we should be able to fully satisfy the acceptance criteria.

@sssoleileraaa
Copy link
Contributor

Why is this still open again?

@redshiftzero
Copy link
Contributor

ah because we still need to implement #491

@eloquence
Copy link
Member Author

If I understand current state correctly, right now we are only offering "Retry" on metadata syncs, but on no other actions. This issue, as currently scoped, suggests a single global "Retry" action on all network errors, after automatic retries have failed. Is that still the approach we want to choose for beta? I suggest we discuss a bit more what's realistically achievable while alleviating likely pain points for our users.

Very high on the priority list is IMO a way for journalists to re-send failed replies without having to resort to copy and paste, whether that's accomplished through a global "retry" action (as proposed in this issue) or otherwise.

@redshiftzero
Copy link
Contributor

The retry is for any network action that is done via the queue. To improve reliability we should complete moving all network actions to the queue (that means they'll get automatic retries and then network actions will pause if a failure keeps occurring). Right now we have an incomplete transition: e.g. deletion is not done via the queue, if it fails once, it doesn't get retried at all.

In addition to that I think we need to do two things:

  • Delete individual failed replies upon request by the user: this is a smaller task, since failed replies are local only by construction
  • Retry individual failed replies upon request by the user: this will resubmit the reply - and update its position to be the most recent in the conversation, i.e. it will go to the bottom. I recognize that this latter point is one that there might be some disagreement on so we should discuss

@ninavizz
Copy link
Member

Discuss in Monday's Client meeting?

@eloquence
Copy link
Member Author

eloquence commented Dec 2, 2019

Recap of today's client sync discussion as I understand it, with proposed must/should/could priorities for the beta:

  1. Must: [securedrop-proxy] gracefully handle connection failures  securedrop-proxy#128 (proxy connection failure handling). Rationale: causes total connection failures to not be handled as expected.

  2. Must: Moving remaining actions (metadata syncs, deletion) to the queue. Rationale: allows those actions to be retried, and errors to be handled using a single global network error message, as described in this issue.

  3. Must: Adding a "Failed" indicator next to individual replies. Rationale: makes it clearer to the user (including subsequent users of the client on the same workstation) that certain replies have failed to send.

  4. Should: Adding a "Troubleshoot" pop-up to the error message, with simple help instructions. Rationale: resetting Tor connection can be a bit finicky, and something users may encounter under a number of different circumstances.

  5. Could: Allowing the user to delete individual failed replies. Rationale: This could give the user more control about which replies are re-sent as part of a batch.

  6. Could: Allowing the user to individually re-send replies. Rationale: This could be useful if the user notices a failed reply after a client re-start.

@ninavizz
Copy link
Member

ninavizz commented Dec 9, 2019

Cross-referencing #650

@eloquence eloquence modified the milestones: 0.2.0alpha, 0.2.0beta Dec 12, 2019
@eloquence
Copy link
Member Author

After discussion with @creviera, we agreed that this issue can be closed. Most of the original work has been done, and remaining work has clear follow-up issues (e.g. #534, #359).

That said, given the heavy work on the network stack of the client, it's important that we verify and re-verify that the client behaves as intended. I've updated the acceptance criteria and moved them to the client test plan wiki page:

https://github.com/freedomofpress/securedrop-client/wiki/Test-plan

We may narrow the number of tests for the final QA plan, but we should IMO verify each of the criteria listed there before then, and file new issues as needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants