We abandon claiming OTKs from remote servers after 10s, and never retry, permanently breaking E2EE for users on those servers. #24138

Closed
ara4n opened this issue Jan 2, 2023 · 4 comments
Labels
A-E2EE · O-Frequent (Affects or can be seen by most users regularly or impacts most users' first experience) · S-Critical (Prevents work, causes data loss and/or has no workaround) · T-Defect · Team: Crypto · Z-Chronic

Comments

@ara4n
Member

ara4n commented Jan 2, 2023

Steps to reproduce

  1. @pmaier1 joins the company and gets an account on the element.io HS
  2. he sends "hello world" into the main internal room
  3. his client tries to claim OTKs for the 1204 devices in the room.
  4. 720 of these keys are on matrix.org, but the server happens to take more than 10s to respond to the /keys/claim req over federation:
2023-01-02 13:37:20,154 - synapse.http.matrixfederationclient - 672 - INFO - POST-380412- {POST-O-2364518} [matrix.org] Request failed: POST matrix://matrix.org/_matrix/federation/v1/user/keys/claim: ResponseNeverReceived:[CancelledError()]
2023-01-02 13:37:20,192 - synapse.access.http.8008 - 460 - INFO - POST-380412- x.x.x.x - 8008 - {@pmaier:element.io} Processed request: 11.724sec/0.024sec (0.408sec, 0.084sec) (0.230sec/0.922sec/534) 59473B 200 "POST /_matrix/client/r0/keys/claim HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36" [0 dbevts]
  5. As a result, his client thinks all users on matrix.org have run out of OTKs, and fails to establish Olm with them:
2023-01-02T13:37:08.300Z I [!kxwQeJPhRigXSZrHqf:matrix.org] Making new Olm session for KkN6tLocYJnNksFwsoxhmNWWBQhMPpw0VtEaoELoACs (@matthew:matrix.org:SDLQVFUQYD)
...
2023-01-02T13:37:08.396Z I [!kxwQeJPhRigXSZrHqf:matrix.org] Claiming one-time keys for 1204 devices
2023-01-02T13:37:20.202Z I [!kxwQeJPhRigXSZrHqf:matrix.org] Claimed one-time keys for 1204 devices
...
2023-01-02T13:37:20.254Z W [!kxwQeJPhRigXSZrHqf:matrix.org] No one-time keys (alg=signed_curve25519) for device @matthew:matrix.org:SDLQVFUQYD"
...
2023-01-02T13:37:22.089Z I Notifying 720 devices we failed to create Olm sessions in !kxw...:matrix.org
  6. The client never retries; all the users on matrix.org see UTDs.
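
The failure mode in the steps above comes down to treating "the remote server did not answer" as "the device has no keys". As a rough sketch of how the two cases could be told apart: the `/keys/claim` response carries a `one_time_keys` map plus a `failures` map keyed by server name. The type and function names below (`ClaimResponse`, `ClaimOutcome`, `serverOf`, `classifyClaim`) are hypothetical and are not part of matrix-js-sdk.

```typescript
// Reduced shape of a /keys/claim response: only the fields used here.
interface ClaimResponse {
    // userId -> deviceId -> "<algorithm>:<key id>" -> key object
    one_time_keys: Record<string, Record<string, Record<string, unknown>>>;
    // server name -> error details, for servers that could not be reached
    failures?: Record<string, unknown>;
}

type ClaimOutcome = "claimed" | "no_otk" | "server_unreachable";

// Hypothetical helper: "@alice:example.org" -> "example.org".
// The localpart cannot contain ":", so the first colon ends it.
function serverOf(userId: string): string {
    return userId.slice(userId.indexOf(":") + 1);
}

// Hypothetical helper: decide, per requested device, why a key is or is not present.
function classifyClaim(
    requested: Array<[string, string]>, // [userId, deviceId]
    res: ClaimResponse,
): Map<string, ClaimOutcome> {
    const outcomes = new Map<string, ClaimOutcome>();
    const failures = res.failures ?? {};
    for (const [userId, deviceId] of requested) {
        const id = `${userId}|${deviceId}`;
        const keys = res.one_time_keys[userId]?.[deviceId];
        if (keys && Object.keys(keys).length > 0) {
            outcomes.set(id, "claimed");
        } else if (Object.prototype.hasOwnProperty.call(failures, serverOf(userId))) {
            // The device's server is listed in `failures`: worth retrying later.
            outcomes.set(id, "server_unreachable");
        } else {
            // The server answered but returned no key for this device.
            outcomes.set(id, "no_otk");
        }
    }
    return outcomes;
}
```

Only devices classified as `no_otk` would then be reported as out of keys; `server_unreachable` devices would be queued for another attempt.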

Outcome

What did you expect?

  • We should distinguish between running out of OTKs, versus a remote OTK claim timing out.
  • Apparently we do, via the failures field of the response, but we're not using it to retry.
  • If the remote OTK claim times out, the client should retry claiming the OTKs with a longer timeout, then retry setting up Olm and sending the megolm keys in the background.
  • At the least, we should try setting up Olm again when the user next sends a message, but in practice that isn't happening either.
  • We could also split up the key claim request into smaller batches, as trying to claim 720 keys could easily take more than 10s on a busy server.
  • The m.room_key.withheld to-device messages informing the target devices that the reason for the UTD was "no OTKs" got discarded by the receiving clients; otherwise debugging would have been way easier.
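
A minimal sketch of the batching and retry ideas from the list above, assuming a hypothetical `ClaimFn` transport that claims keys for one batch and resolves to the devices whose server failed to answer. None of these names come from matrix-js-sdk, and the batch size, starting timeout and attempt count are illustrative values only.

```typescript
type DeviceRef = [string, string]; // [userId, deviceId]

// Hypothetical transport: claims OTKs for one batch with the given timeout and
// resolves to the devices that could not be claimed because their server failed.
type ClaimFn = (batch: DeviceRef[], timeoutMs: number) => Promise<DeviceRef[]>;

// Split a list into consecutive chunks of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
    const out: T[][] = [];
    for (let i = 0; i < items.length; i += size) {
        out.push(items.slice(i, i + size));
    }
    return out;
}

// Claim in small batches; re-claim only the failed devices, doubling the
// timeout on each pass. Returns the devices that are still unreachable.
async function claimWithRetry(
    devices: DeviceRef[],
    claim: ClaimFn,
    opts = { batchSize: 100, timeoutMs: 10000, maxAttempts: 3 },
): Promise<DeviceRef[]> {
    let pending = devices;
    let timeoutMs = opts.timeoutMs;
    for (let attempt = 0; attempt < opts.maxAttempts && pending.length > 0; attempt++) {
        const stillFailing: DeviceRef[] = [];
        for (const batch of chunk(pending, opts.batchSize)) {
            try {
                stillFailing.push(...(await claim(batch, timeoutMs)));
            } catch {
                // The whole request failed: keep every device in it for the next pass.
                stillFailing.push(...batch);
            }
        }
        pending = stillFailing;
        timeoutMs *= 2;
    }
    return pending;
}
```

A real client would presumably also wait between passes and run this in the background so that sending is not blocked. The sketch omits both to stay short.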

I suspect the same misbehaviour could be happening with /keys/query?

What happened instead?

A really nasty avoidable class of UISI, and yet another instance where we don't retry reqs atomically.

Operating system

Linux

Browser information

Chrome 107.0

URL for webapp

No response

Application version

1.11.17

Homeserver

element.io

Will you send logs?

Yes

@ara4n ara4n added the T-Defect label Jan 2, 2023
@andybalaam andybalaam added S-Critical Prevents work, causes data loss and/or has no workaround O-Frequent Affects or can be seen by most users regularly or impacts most users' first experience labels Jan 3, 2023
@andybalaam
Contributor

I struggled with the right O- tag here - it seems common for people starting out new, but then not seen by most others.

@turt2live
Member

> I struggled with the right O- tag here - it seems common for people starting out new, but then not seen by most others.

it would also be seen by anyone trying to talk to slow homeservers, which are very plentiful.

@pmaier1

pmaier1 commented Jan 20, 2023

Work has started in the Rust SDK: matrix-org/matrix-rust-sdk#1315. We're going to schedule the completion (if more work is necessary) after the introduction of Element R.

@BillCarsonFr
Member

Fixed by rust crypto.

@pmaier1 pmaier1 closed this as completed Mar 13, 2023