Have a "fix everything" button for E2EE / Dealing with E2EE breakage #20685

Closed
ShadowJonathan opened this issue Jan 22, 2022 · 11 comments
Labels
A-E2EE O-Occasional Affects or can be seen by some users regularly or most users rarely S-Minor Impairs non-critical functionality or suitable workarounds exist T-Enhancement Team: Crypto X-Needs-Product More input needed from the Product team

Comments

@ShadowJonathan
Contributor

ShadowJonathan commented Jan 22, 2022

Your use case

What would you like to do?

Have a "Fix Everything" button under E2EE settings, that will analyse, detect issues, and repair problems with E2EE as best as it can, and list the remaining unfixable problems to the user.

Why would you like to do it?

E2EE breakage on Matrix is currently fairly common, and it is severe when it happens. Because E2EE largely relies on happy-path behaviour, and its security sensitivity leaves little flexibility for fixing issues, problems that start in E2EE often persist until the affected users perform "magical incantations" to force it to fix itself.

However, normal users are very quickly turned off from Element/Matrix once they encounter these issues. Because the issues are hard to debug and often derive from many pieces of state (on the servers involved and on the users' devices), they don't magically fix themselves, even if the original cause is long resolved. They are also very likely to crop up in "sensitive" scenarios: E2EE is on by default in many DMs, so these one-to-one conversations get interrupted and disrupted, heightening any negative emotions the user might have towards the platform.

How would you like to achieve it?

So, as a universal band-aid, a button like this would probably flush a lot of E2EE state, try to re-establish sessions, signal to other devices and users to re-synchronise themselves, and generally try to identify and fix as many issues as possible. All of this should happen in a "soft" way: it should not clear the user's login or other data.
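
As a rough illustration of the "detect, repair, then list what is left" flow described above, here is a minimal Python sketch. Everything in it is hypothetical: `Check`, `Report`, and `run_soft_repair` are names invented for this example and do not correspond to any Element or matrix-js-sdk API. It only shows the control flow, assuming each problem can be expressed as a detector plus an optional repair step.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Check:
    """One hypothetical diagnostic: detects a problem and may know how to repair it."""
    name: str
    detect: Callable[[], bool]                   # returns True while the problem is present
    repair: Optional[Callable[[], bool]] = None  # returns True if the repair step ran successfully


@dataclass
class Report:
    fixed: List[str]      # problems that were detected and are gone after repair
    unfixable: List[str]  # problems that remain and must be shown to the user


def run_soft_repair(checks: List[Check]) -> Report:
    """Run every check, attempt a repair where one exists, and collect the leftovers."""
    report = Report(fixed=[], unfixable=[])
    for check in checks:
        if not check.detect():
            continue  # nothing wrong for this check
        # A repair only counts as a fix if the problem is no longer detected afterwards.
        if check.repair is not None and check.repair() and not check.detect():
            report.fixed.append(check.name)
        else:
            report.unfixable.append(check.name)
    return report
```

Re-running the detector after each repair is a deliberate choice in this sketch: it keeps the final list of unfixable problems honest, rather than trusting that a repair step did what it claimed.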

This would very likely require spec changes and/or a rethinking of some E2EE interactions. It would likely introduce security risks as trade-offs, so it has to be approached with caution.

Have you considered any alternatives?

This is only an alternative to the specific mechanism here proposed, but some further context;

E2EE in Matrix is currently not self-healing in certain conditions: there is only a happy path, and no real mechanism to reconcile devices/sessions that have strayed from it. This is likely due to the security implications that such recovery mechanisms would have.

A "fix everything" button will help the immediate problem of users having no option to recover their E2EE-powered rooms sanely, but this doesn't fix the main problem;

Even if for the sake of security, (un)intentional disruptions can happen along many points of the pipeline that E2EE relies on, and recovering from this should both be a UX and security priority. E2EE cannot (again, even for the sake of security) model itself primarily along the lines of a "100% spec-compliant" server.

There are two kinds of E2EE breakage;

  1. Intentional disruption and hijacking, the malicious kind.
  2. Desynchronisation, the unintentional accidental kind.

The latter is the problem here, and a solution could be to have matrix E2EE differentiate the two.

Additional context

Related issues

element-hq/element-meta#310, #2996, #18881, #12250, #11049, element-hq/element-meta#1563, #18639, #16163, #17578, element-hq/element-meta#1930, #14820, #12851, #20670, #15388, #17500, #17622, #14921, #16086, #20247, #18541, #9219, #7312, #16614, #16613, #16458, #13744, #12250, and element-hq/element-meta#1859 are related to this.

element-hq/element-meta#1565, #16184, element-hq/element-meta#1894, #5675, and #15416 are tangentially related to this.

#20005, #18505, #18443, #13575, #11094, #6879, and #9434 are related to this insofar that this "fix everything" process should detect and address them.

#13582 (comment) is an example of a server issue causing an inconsistent state.

#3868 is also an example of potential introductions of inconsistent state.


TL;DR: Windows Troubleshooter for E2EE, but actually reliable and helpful.

Also, E2EE should possibly become more reliable and robust than matrix can be at its worst times.

@ShadowJonathan
Contributor Author

Adding on from some discussion in #-dev;

The majority of E2EE self-healing is reactive to the situation at hand; it does not try to proactively analyse the latent state of the participants involved to discover the error and then fix it.

This issue essentially calls for the latter to be a manual process, but the alternatives section recommends it being an automatic one.

Furthermore, in situations where these desynchronisations/errors have caused irreparable damage, a fallback option (even if it is through human confirmation) should be given to one-time recover E2EE history for all participants involved.

@novocaine
Contributor

If we know which things need to be flushed... we can just fix the bugs that have led to corrupt state, rather than create an embarrassing band-aid solution. This is what the Element cryptography team is currently doing, using analytics to measure the rate of encryption errors and drive them down. It is a long and complex process that requires a lot of attention to detail, but we are making concrete progress.

If you have suggestions for exactly which data stores or user properties you would like to flush, that might be a more tractable request, but otherwise, I'm inclined to close this as the team will not action it in its current form.

@ShadowJonathan
Contributor Author

ShadowJonathan commented Jan 22, 2022

An incomplete list could be;

  • user device lists, have the local server send out the same list to all servers to re-sync
  • discard local E2EE sessions
  • discard remote E2EE sessions (send to_device requests to remote users to discard them)
  • enumerate all messages that cannot be decrypted, send requests to local user's devices for keys
    • if no device has the key, re-request key from remote user(s)
  • test if user E2EE backup is accessible or non-corrupt
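
The key-recovery step in the list above (ask the local user's own devices first, then fall back to remote users) could be sketched roughly as follows. This is a simplified, hypothetical model: `plan_key_requests` is an invented name, and the in-memory dictionaries stand in for what would really be asynchronous key requests whose replies may never arrive.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical model: holder name -> {session_id: key material}.
KeyStores = Dict[str, Dict[str, str]]


def _first_holder(session_id: str, stores: KeyStores) -> Optional[str]:
    """Return the first holder whose store contains the session key, if any."""
    for holder, store in stores.items():
        if session_id in store:
            return holder
    return None


def plan_key_requests(
    undecryptable_sessions: Iterable[str],
    own_devices: KeyStores,
    remote_users: KeyStores,
) -> Tuple[Dict[str, str], List[str]]:
    """Decide whom to ask for each missing session key.

    Own devices are tried first; remote users are only consulted when no
    own device has the key. Sessions nobody holds are returned as lost.
    """
    plan: Dict[str, str] = {}
    lost: List[str] = []
    for session_id in undecryptable_sessions:
        holder = _first_holder(session_id, own_devices)
        if holder is None:
            holder = _first_holder(session_id, remote_users)
        if holder is None:
            lost.append(session_id)
        else:
            plan[session_id] = holder
    return plan, lost
```

The `lost` list is what a "fix everything" flow would surface to the user as unrecoverable, instead of retrying silently forever.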

The above are giant reset buttons, but possibly the following could be proactive/selective;

  • request "show me your perspective" messages to all E2EE devices, compare and send resync request if it doesn't match
  • possibly detect when to_devices have gotten lost at any stage, and have fallbacks in case it happens, though this could be a two-generals' problem
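
The "show me your perspective" idea could, under some assumptions, look like the sketch below: each device reports a digest of its view of the shared state, and devices whose digest differs from the most common one are asked to resync. The function names, the shape of a "view" (user ID mapped to a list of device IDs), and the majority rule are all assumptions made for illustration. In particular, treating the majority as correct is not a security argument, since it cannot tell accidental desynchronisation from a malicious majority.

```python
import hashlib
import json
from collections import Counter
from typing import Dict, List


def perspective_digest(view: dict) -> str:
    """Hash a device's view of the state in a key-order-independent way."""
    canonical = json.dumps(view, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def devices_needing_resync(views: Dict[str, dict]) -> List[str]:
    """Return the devices whose digest differs from the most common digest."""
    digests = {device: perspective_digest(view) for device, view in views.items()}
    if not digests:
        return []
    # Assumption: the most common perspective is used as the reference.
    reference, _count = Counter(digests.values()).most_common(1)[0]
    return sorted(device for device, digest in digests.items() if digest != reference)
```

Comparing digests rather than full state keeps the exchanged messages small, at the cost of only learning that two views differ and not where.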

Additionally, a bunch of assertions and assumptions that E2EE has about the current state of the world must be somehow exposed so that code can look at it, compare it (either with this button or another "weird state" trigger), and then take action to correct it, if possible.


@novocaine if this issue in the state of "provide a giant button" isn't feasible, could a variant which says "be more proactive with E2EE problems" work better? (with the core request to probe and fix E2EE inconsistency before it can lead to unrecoverable errors)

My main tenet with this issue is that past users do not have any possible way to fix their E2EE, maybe booting up in an element version that automatically detects, analyses, and fixes that could help their problem, but I know it's further off than just providing them a button. What I assume the cryptography teams are doing is providing better happy paths to E2EE, but it does not help with users who have already entered a FUBAR state.

@novocaine
Contributor

> An incomplete list could be;
>
>   • user device lists, have the local server send out the same list to all servers to re-sync
>   • discard local E2EE sessions
>   • discard remote E2EE sessions (send to_device requests to remote users to discard them)
>   • enumerate all messages that cannot be decrypted, send requests to local user's devices for keys
>     • if no device has the key, re-request key from remote user(s)
>   • test if user E2EE backup is accessible or non-corrupt
>
> The above are giant reset buttons, but possibly the following could be proactive/selective;
>
>   • request "show me your perspective" messages to all E2EE devices, compare and send resync request if it doesn't match
>   • possibly detect when to_devices have gotten lost at any stage, and have fallbacks in case it happens, though this could be a two-generals' problem
>
> Additionally, a bunch of assertions and assumptions that E2EE has about the current state of the world must be somehow exposed so that code can look at it, compare it (either with this button or another "weird state" trigger), and then take action to correct it, if possible.
>
> @novocaine if this issue in the state of "provide a giant button" isn't feasible, could a variant which says "be more proactive with E2EE problems" work better? (with the core request to probe and fix E2EE inconsistency before it can lead to unrecoverable errors)
>
> My main tenet with this issue is that past users do not have any possible way to fix their E2EE, maybe booting up in an element version that automatically detects, analyses, and fixes that could help their problem, but I know it's further off than just providing them a button. What I assume the cryptography teams are doing is providing better happy paths to E2EE, but it does not help with users who have already entered a FUBAR state.

I think some of these state clears are interesting options but only if the work to implement them is cheap; otherwise we're better off spending the time to try to fix the root causes. I've raised this internally with the team.

@BillCarsonFr
Member

BillCarsonFr commented Jan 24, 2022

Quick Answer:
Hello, yes, we are currently working on such features; the idea is to help users facing E2E issues to fix them.

Our ultimate goal is to fix the root cause, but meanwhile we can't let down users.
It's still a bit technical for now, but we will improve it so that it's more user-friendly. Please see how it looks on Android (coming soon on web): element-hq/element-android#5006

I have a live demo of it to showcase usage; I will see how to share it.

Thanks for the issue, I'll review it in more detail and see how to use it to improve our ideas. thx
Opened issues:
element-hq/element-meta#76
element-hq/element-meta#74

@ShadowJonathan
Contributor Author

@BillCarsonFr thanks for the response, I think this is a good step forward, but one thing I'd like to remind you of is that, while these issues and associated features will increase user visibility to how the innards of E2EE work, these are only useful for users which have intricate knowledge of the innards of E2EE. My point still stands about self-healing E2EE.

If this is what you're addressing, then all is good, I'm just following an impression given by those issues. (Which btw, looks amazing for E2EE debug)

@BillCarsonFr
Member

> @BillCarsonFr thanks for the response, I think this is a good step forward, but one thing I'd like to remind you of is that, while these issues and associated features will increase user visibility to how the innards of E2EE work, these are only useful for users which have intricate knowledge of the innards of E2EE. My point still stands about self-healing E2EE.
>
> If this is what you're addressing, then all is good, I'm just following an impression given by those issues. (Which btw, looks amazing for E2EE debug)

Yes, that's what we want to address ultimately. These issues do not bring us there yet; we are iterating on them, seeing what is really useful, and will then better package it as a self-healing feature.

@ShadowJonathan
Contributor Author

(Interestingly, I missed #3553 when looking for issues, though it's related to this.)

@turt2live turt2live added X-Needs-Product More input needed from the Product team Team: Crypto S-Minor Impairs non-critical functionality or suitable workarounds exist O-Occasional Affects or can be seen by some users regularly or most users rarely labels Jun 14, 2022
@jittygitty

@novocaine There's a ton of these issues, and the problem is that some of these E2EE errors might be due to servers (Synapse vs Dendrite vs etc.) or even various clients. So it seems worthwhile for whatever the Matrix team controls to have the best diagnostics possible, ideally diagnostics that can self-heal or guide the user through some simple steps that will heal/fix issues.

I commented on how the Element Desktop app gave very little useful info in the Ctrl+Shift+I console log for an "Unable to decrypt" error; see my post at:
#19748

It's very difficult to attract people to use Matrix system instead of other chat systems when they make an effort to try it and then "immediately" we have problems with chat not working at all.

Because we do have an "in-between" server, we should be able to have clients plus server self-triangulate such issues and ideally self-heal with some minor notifications or prompting to users.

@yamrzou

yamrzou commented Sep 28, 2022

> It's very difficult to attract people to use Matrix system instead of other chat systems when they make an effort to try it and then "immediately" we have problems with chat not working at all.

This.

@richvdh
Member

richvdh commented Nov 18, 2022

I'm very much in agreement with @novocaine: a "fix everything button" is not a good solution here. Why even make the user press the button if we can magically fix everything?

We know how frustrating encryption problems are and we are working on fixing them, and improving visibility of why they happen. Putting on band-aids which will sometimes fix up problems after they happen is a distraction from actually making the system reliable in the first place.

@richvdh richvdh closed this as completed Nov 18, 2022

9 participants