Implement network connectivity error handling #391

eloquence · 2019-05-29T00:43:01Z

(This is closely related to #291, but not dependent on it.)

If the client repeatedly fails to perform the next operation in the queue after a fixed number of retries and this is at least in part due to network conditions (e.g., network down, Whonix VM down, Tor network issues), we want to report this to the user as an error.

The purpose of this error message is to let the user know that operations are pending, and to give them an opportunity to correct the problem.

A "Retry" action should cause the error message to disappear, and queue operation to be resumed immediately.

Design

Behavior via Invision · Zeplin

Related issuess

Reply pending state #350: Reply sending state in ConversationView
Clearly indicate that a reply failed to send #365: Clearly indicate that a reply failed to send
Need a pending source deletion state #534: Pending source deletion state
Add spinner for replies in process of being sent #359: Add spinner for replies in process of being sent

User Story

As a user, I want to know when the client is experiencing connectivity isses, so that I can address them or try again later.

Acceptance Criteria

Given that I have performed some operations on the client
And I have temporarily lost (Tor) network connectivity
When the client has performed a set number of automatic retries that have failed (#384)
Then it should alert me to the error in the manner specified here.

Given that I have performed some operations on the client
And some of those operations fail for reasons unrelated to Tor network connectivity (e.g., attempting to star a source that has since been deleted)
When the client recognizes the failure
Then it should not treat the error in the manner specified here.

Given that some operations have repeatedly failed due to network issues
When I click "Retry"
Then the error message should disappear, and the client should attempt to resume processing the queue

Given that some operations have repeatedly failed due to network issues
When I do nothing
Then the client should periodically test connectivity
And the error message should disappear when connectivity is restored
And the queue should be resumed when connectivity is restored

Given that I have retried an operation that has previously failed
And the network is still down
When the operation fails again after a limited number of retries
Then it the client should again alert me to the error in the manner specified here.

Given that the client is in the error state specified here
When I perform an operation that requires connectivity (e.g., reply)
Then that operation should be processed normally (e.g., added to queue)
And the queue should continue to be paused until network connectivity is restored

Given that the client is in the error state specified here
When I add a reply
Then the reply should be displayed in "sending" state

Given that the client is in the error state specified here
When I star a source
Then the source should be displayed in "starred" state

Given that the client is in the error state specified here
When I delete a source
Then the source should be displayed in "deletion pending" state

sssoleileraaa · 2019-07-22T18:53:06Z

In order to implement this, we should first implement pausing the queue, otherwise we won't be able to test the "Retry" action that should cause the queue operation to be resumed immediately. Also it would be helpful implement realistic timeouts before the "Retry" action so that we can test and get a good feel for how long a retry might take for reoccurring timeouts. But in reality, we might end up tweaking timeout values a few times until we have a smarter system that has some sort of backoff strategy for request timeouts so that we slow down our retries until we pause the queue and require user intervention.

eloquence · 2019-07-22T21:19:09Z

Per the above comment, bumping this issue officially off the current sprint to focus remaining time on #430 and #443 as dependencies.

sssoleileraaa · 2019-07-24T16:38:53Z

Breaking this down:

Pause/Resume all queue: Pausing the queue #443
Configurable number of retries: Enable configurable number of retries for each API job #384
Realistic timeouts: Timeouts too long or too short #430
Periodically test connectivity and resume queue automatically if network issue resolved: health checker - auto resume #491

eloquence · 2019-07-24T20:53:38Z

For the 7/24-8/7 sprint, we'll aim to resolve #443 and #430 above, and add a first iteration of the UI layer (error message and functioning "Retry" button). If it makes sense to split the UI layer into its own issue/PR, that may be done mid-sprint.

#491 is out of scope for the current sprint, so the epic remains in the near-term backlog; once #491 is resolved, we should be able to fully satisfy the acceptance criteria.

sssoleileraaa · 2019-10-31T21:29:35Z

Why is this still open again?

redshiftzero · 2019-10-31T21:30:12Z

ah because we still need to implement #491

eloquence · 2019-11-27T21:39:00Z

If I understand current state correctly, right now we are only offering "Retry" on metadata syncs, but on no other actions. This issue, as currently scoped, suggests a single global "Retry" action on all network errors, after automatic retries have failed. Is that still the approach we want to choose for beta? I suggest we discuss a bit more what's realistically achievable while alleviating likely pain points for our users.

Very high on the priority list is IMO a way for journalists to re-send failed replies without having to resort to copy and paste, whether that's accomplished through a global "retry" action (as proposed in this issue) or otherwise.

redshiftzero · 2019-11-27T21:46:57Z

The retry is for any network action that is done via the queue. To improve reliability we should complete moving all network actions to the queue (that means they'll get automatic retries and then network actions will pause if a failure keeps occurring). Right now we have an incomplete transition: e.g. deletion is not done via the queue, if it fails once, it doesn't get retried at all.

In addition to that I think we need to do two things:

Delete individual failed replies upon request by the user: this is a smaller task, since failed replies are local only by construction
Retry individual failed replies upon request by the user: this will resubmit the reply - and update its position to be the most recent in the conversation, i.e. it will go to the bottom. I recognize that this latter point is one that there might be some disagreement on so we should discuss

ninavizz · 2019-11-28T00:29:22Z

Discuss in Monday's Client meeting?

eloquence · 2019-12-02T19:00:27Z

Recap of today's client sync discussion as I understand it, with proposed must/should/could priorities for the beta:

Must: [securedrop-proxy] gracefully handle connection failures securedrop-proxy#128 (proxy connection failure handling). Rationale: causes total connection failures to not be handled as expected.
Must: Moving remaining actions (metadata syncs, deletion) to the queue. Rationale: allows those actions to be retried, and errors to be handled using a single global network error message, as described in this issue.
Must: Adding a "Failed" indicator next to individual replies. Rationale: makes it clearer to the user (including subsequent users of the client on the same workstation) that certain replies have failed to send.
Should: Adding a "Troubleshoot" pop-up to the error message, with simple help instructions. Rationale: resetting Tor connection can be a bit finicky, and something users may encounter under a number of different circumstances.
Could: Allowing the user to delete individual failed replies. Rationale: This could give the user more control about which replies are re-sent as part of a batch.
Could: Allowing the user to individually re-send replies. Rationale: This could be useful if the user notices a failed reply after a client re-start.

ninavizz · 2019-12-09T21:32:00Z

Cross-referencing #650

eloquence · 2020-01-29T23:25:43Z

After discussion with @creviera, we agreed that this issue can be closed. Most of the original work has been done, and remaining work has clear follow-up issues (e.g. #534, #359).

That said, given the heavy work on the network stack of the client, it's important that we verify and re-verify that the client behaves as intended. I've updated the acceptance criteria and moved them to the client test plan wiki page:

https://github.com/freedomofpress/securedrop-client/wiki/Test-plan

We may narrow the number of tests for the final QA plan, but we should IMO verify each of the criteria listed there before then, and file new issues as needed.

This was referenced May 29, 2019

Implement network activity indicator #291

Closed

Workstation Network Failure freedomofpress/securedrop-ux#32

Closed

eloquence mentioned this issue Jun 11, 2019

Clearly indicate that a reply failed to send #365

Closed

eloquence mentioned this issue Jun 12, 2019

Timeout / No Auth #379

Closed

eloquence added this to the 0.2.0alpha milestone Jun 25, 2019

eloquence changed the title ~~Implement network connectivity error~~ Implement network connectivity error handling Jun 26, 2019

eloquence added the epic label Jul 24, 2019

eloquence mentioned this issue Aug 2, 2019

Apply gradient that spans the footprint behind the Source List, Conversation Pane, and Reply pane #299

Closed

sssoleileraaa mentioned this issue Aug 12, 2019

Pause queue on auth errors, connection failures and timeouts #531

Merged

3 tasks

eloquence modified the milestones: 0.2.0alpha, 0.2.0beta Dec 12, 2019

eloquence closed this as completed Jan 29, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement network connectivity error handling #391

Implement network connectivity error handling #391

eloquence commented May 29, 2019 •

edited

Loading

sssoleileraaa commented Jul 22, 2019 •

edited

Loading

eloquence commented Jul 22, 2019

sssoleileraaa commented Jul 24, 2019

eloquence commented Jul 24, 2019 •

edited

Loading

sssoleileraaa commented Oct 31, 2019

redshiftzero commented Oct 31, 2019

eloquence commented Nov 27, 2019

redshiftzero commented Nov 27, 2019

ninavizz commented Nov 28, 2019

eloquence commented Dec 2, 2019 •

edited

Loading

ninavizz commented Dec 9, 2019

eloquence commented Jan 29, 2020

Implement network connectivity error handling #391

Implement network connectivity error handling #391

Comments

eloquence commented May 29, 2019 • edited Loading

Design

Related issuess

User Story

Acceptance Criteria

sssoleileraaa commented Jul 22, 2019 • edited Loading

eloquence commented Jul 22, 2019

sssoleileraaa commented Jul 24, 2019

eloquence commented Jul 24, 2019 • edited Loading

sssoleileraaa commented Oct 31, 2019

redshiftzero commented Oct 31, 2019

eloquence commented Nov 27, 2019

redshiftzero commented Nov 27, 2019

ninavizz commented Nov 28, 2019

eloquence commented Dec 2, 2019 • edited Loading

ninavizz commented Dec 9, 2019

eloquence commented Jan 29, 2020

eloquence commented May 29, 2019 •

edited

Loading

sssoleileraaa commented Jul 22, 2019 •

edited

Loading

eloquence commented Jul 24, 2019 •

edited

Loading

eloquence commented Dec 2, 2019 •

edited

Loading