
If you restart (e.g. upgrade) Element while it's waiting to process a m.room_key, it'll drop it and you'll get UISIs #23113

Closed
ara4n opened this issue Aug 19, 2022 · 13 comments
Labels
A-E2EE, O-Uncommon (Most users are unlikely to come across this or unexpected workflow), S-Major (Severely degrades major functionality or product features, with no satisfactory workaround), T-Defect, Z-UISI (Unable to decrypt errors)

Comments

@ara4n
Member

ara4n commented Aug 19, 2022

Steps to reproduce

I sent Rick some messages in an E2EE room for the first time in over a week, so it created a new megolm session. He couldn't decrypt them. It looks like it correctly sent it to him on vector.modular.im via to-device (although tracing to-device messages client->server->server->client is still incredibly hard; filed here).

Rick then turned on Element Desktop in the morning, which received a huge batch of 1515 to-device messages queued up overnight: https://rageshakes.element.io/api/listing/2022-08-19/081925-YR54YZWK/logs-0001.log.gz. The vast majority were keyshare requests from himself (presumably his Element Android trying to fix UISIs?)

matthew@shadowfax rick uisis % less logs-0001.log.gz.txt | egrep -o 'received to_device (.*?) ' | sort | uniq -c | sort -rn
1291 received to_device m.room_key_request 
 204 received to_device m.room.encrypted 
  19 received to_device m.room_key 
   1 received to_device im.vector.auto_rs_request 

Amongst these, there are a bunch of to-device messages from me:

% cat logs-0001.log.gz.txt | grep 'received to_device .* from: @matthew:matrix.org'    
2022-08-19T08:11:27.271Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
2022-08-19T08:11:34.840Z I received to_device m.room_key from: @matthew:matrix.org id: undefined
2022-08-19T08:11:35.777Z I received to_device m.room_key_request from: @matthew:matrix.org id: undefined
2022-08-19T08:11:35.778Z I received to_device im.vector.auto_rs_request from: @matthew:matrix.org id: undefined
2022-08-19T08:11:35.979Z I received to_device m.room_key_request from: @matthew:matrix.org id: undefined
2022-08-19T08:11:36.581Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
2022-08-19T08:11:36.581Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
2022-08-19T08:11:36.581Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
2022-08-19T08:11:36.581Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
2022-08-19T08:11:36.582Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
2022-08-19T08:11:36.583Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined

of which only one m.room.encrypted is actually decrypted into an m.room_key before Rick (quite reasonably) clicked 'upgrade' on Nightly at 2022-08-19T08:11:41.387Z, thus knifing it in the middle of working through the to-device backlog.

The m.room_key which got decrypted looked like this, and was from a different session to the one with the UISIs:

2022-08-19T08:11:34.840Z I Decrypted Olm message from Nhu0+B3JJNHr3abToRm0rYjXne20De67kH1RaNZBEAM with session T//5eGDj9bH7ATYfSiG+Odj70yCNI1nWBXjqe9qZtgk
2022-08-19T08:11:34.840Z I received to_device m.room_key from: @matthew:matrix.org id: undefined

Meanwhile, on the next launch, there were neither m.room.encrypted nor m.room_key events from me - i.e. the to-device messages with the keys got lost forever:

% cat logs-0000.log.gz.txt | grep 'received to_device .* from: @matthew:matrix.org' 
<nothing>

Looking at the code, onToDeviceEvent() calls event.attemptDecryption(), which then asynchronously calls decryptionLoop() to work through decrypting the message... but at no point does the to-device message actually get persisted anywhere.

To fix this, we should transactionally persist to-device messages before we consider the /sync request processed, and then reliably process them, much like Hydrogen does: it's a classic 'what goes wrong' reliable-transactions problem. Switching to matrix-rust-sdk-crypto would presumably not fix this failure mode, given it's a matter of reliably processing to-device messages in general, rather than anything crypto-specific.

Therefore, we need to switch the js-sdk to a more Hydrogen- or rust-sdk-style approach of persisting stuff received in /sync locally in order to reliably receive it from the server, and then processing based on that local store, so that if the client explodes we don't drop data on the floor.
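
A minimal sketch of what that could look like, assuming IndexedDB with a hypothetical "to_device_queue" object store (created with { autoIncrement: true }); the function names are illustrative and none of this is existing js-sdk code:

    // Hypothetical sketch only: persist raw to-device events from /sync before
    // the sync token is advanced, then drain and process them from the local
    // queue. "to_device_queue" and these function names are not real js-sdk APIs.

    interface ToDeviceEvent {
        type: string;
        sender: string;
        content: Record<string, unknown>;
    }

    function idbRequest<T>(req: IDBRequest<T>): Promise<T> {
        return new Promise((resolve, reject) => {
            req.onsuccess = () => resolve(req.result);
            req.onerror = () => reject(req.error);
        });
    }

    // Step 1: called while handling the /sync response, *before* next_batch is
    // persisted, so a crash cannot lose the events.
    async function queueToDeviceEvents(db: IDBDatabase, events: ToDeviceEvent[]): Promise<void> {
        const txn = db.transaction("to_device_queue", "readwrite");
        const store = txn.objectStore("to_device_queue");
        events.forEach((event) => store.add(event)); // auto-increment key preserves arrival order
        await new Promise<void>((resolve, reject) => {
            txn.oncomplete = () => resolve();
            txn.onerror = () => reject(txn.error);
        });
    }

    // Step 2: drain the queue independently of the sync loop. An event is only
    // deleted after it has been processed, so a restart re-processes it rather
    // than dropping it on the floor.
    async function drainToDeviceQueue(
        db: IDBDatabase,
        process: (event: ToDeviceEvent) => Promise<void>,
    ): Promise<void> {
        const store = db.transaction("to_device_queue", "readonly").objectStore("to_device_queue");
        const keysReq = store.getAllKeys();
        const eventsReq = store.getAll();
        const keys = await idbRequest(keysReq);
        const events = (await idbRequest(eventsReq)) as ToDeviceEvent[];

        for (let i = 0; i < events.length; i++) {
            await process(events[i]);
            const del = db.transaction("to_device_queue", "readwrite").objectStore("to_device_queue");
            await idbRequest(del.delete(keys[i]));
        }
    }

The trade-off is an extra disk write per /sync batch, in exchange for never acking a batch whose keys haven't been safely stored.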

Outcome

What did you expect?

No UISIs

What happened instead?

UISIs when the recipient restarts while their app is decrypting messages.

Operating system

No response

Application version

Nightly

How did you install the app?

Nightly

Homeserver

matrix.org

Will you send logs?

Yes

@ara4n ara4n added the T-Defect label Aug 19, 2022
@robintown robintown added the S-Major, A-E2EE, Z-UISI and O-Uncommon labels Aug 19, 2022
@ara4n
Member Author

ara4n commented Aug 19, 2022

it's worth noting that in a Sliding Sync world, we don't need to worry about reliably persisting so much stuff (other than to-device messages!) given we can just re-request it from the server at launch.

@kegsay
Contributor

kegsay commented Aug 19, 2022

So in a sliding sync world you have an alternative solution here to guarantee at-least-once processing without having to persist to-device messages (which incurs a perf hit, as you have to hit the disk). The flow would look like this (a rough sketch in code follows the list):

  • Assume we have a single persisted since token for to-device messages already as this is the common case. Call it S.
  • Get N to-device messages with since=S returning next_batch: S'.
  • Asynchronously begin processing these to-device messages in memory. During this time, disable the to-device extension (enabled: false). Continue doing sync loops to get normal events, etc. Meanwhile...
  • Slowly process the to-device messages in a web worker or something. If you die here, you'll restart with the persisted since value S and re-fetch these messages. This may result in more-than-once processing.
  • When finished processing, re-enable the extension and update the persisted token to S'.
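
A rough sketch of that flow, with a hypothetical sliding-sync client interface (the real extension/API shape may differ; the point is the ordering of the steps):

    // Illustrative only: hypothetical types, not a real sliding sync API.
    interface SlidingSyncClient {
        // One sliding sync request; the to-device extension is only included
        // when `toDevice` is supplied.
        sync(opts: { toDevice?: { enabled: boolean; since: string } }): Promise<{
            toDeviceEvents: unknown[];
            toDeviceNextBatch?: string;
        }>;
    }

    interface TokenStore {
        getSince(): string; // persisted to-device token S
        setSince(token: string): void;
    }

    async function toDeviceCycle(
        client: SlidingSyncClient,
        store: TokenStore,
        processAllToDevice: (events: unknown[]) => Promise<void>, // hypothetical processor
    ): Promise<void> {
        const since = store.getSince(); // 1. start from the persisted token S

        // 2. fetch the pending to-device messages, getting back S'
        const { toDeviceEvents, toDeviceNextBatch } =
            await client.sync({ toDevice: { enabled: true, since } });

        // 3. process asynchronously; keep syncing with the extension disabled so
        //    nothing advances the to-device token underneath us.
        const processing = processAllToDevice(toDeviceEvents);
        await client.sync({ toDevice: { enabled: false, since } });

        // 4. if we die before this resolves, we restart from S and re-fetch
        //    (which may mean more-than-once processing).
        await processing;

        // 5. only now ack the batch by persisting S'.
        if (toDeviceNextBatch) store.setSince(toDeviceNextBatch);
    }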

This isn't possible in sync v2 because:

  • ACKing to-device messages and getting new sync data is tightly coupled: you cannot do one without the other. SS has a since token just for to-device messages, and a pos for other normal sync data.
  • Even if you did have separate tokens, sync v2 would parse the dedicated to-device token if you kept it at S (so each /sync would return N to-device messages repeating forever until you increment to S'), or would return an initial sync's worth of to-device messages if you omitted the token. SS allows you to disable the extension entirely to say "I don't want you to process this right now" and then enable it later on when your client has gone through all the to-device messages.

@ara4n
Member Author

ara4n commented Aug 19, 2022

This may result in more than once processing.

this is the kicker, though - for e2ee, processing more than once is a disaster, given it's a ratchet and the ratchet will get out of sync. my hunch is that we should grab the to-device msgs as rapidly as possible and chuck them reliably to disk, and then consume from there, rather than relying on the (admittedly cool) feature that sliding sync lets you roll back.

@kegsay
Contributor

kegsay commented Aug 20, 2022

If that is such a problem, then persisting to-device messages isn't enough: you need a two-phase or atomic commit process to avoid reprocessing. E.g. write all the messages before you ack the token, then for each one start a transaction, process the message, delete it, and commit. The processing part cannot store anything outside the transaction. It becomes a bit harder to manage. I'm surprised the protocol is so fragile that duplicate messages break it, though.
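
A rough sketch of that per-message transaction, assuming the events were already written to a hypothetical "to_device_queue" IndexedDB store before the token was acked, and that the crypto state lives in a "crypto_state" store in the same database (all names illustrative, not js-sdk code):

    // Process one queued event and delete it in a single transaction, writing
    // any resulting state into that same transaction, so the event is either
    // fully applied or still queued, never half-done.
    function processQueuedEvent(
        db: IDBDatabase,
        key: IDBValidKey,
        // Must be synchronous: an `await` here would let the transaction
        // auto-commit before the delete below is issued.
        applyToDeviceEvent: (event: unknown, cryptoStore: IDBObjectStore) => void,
    ): Promise<void> {
        return new Promise((resolve, reject) => {
            const txn = db.transaction(["to_device_queue", "crypto_state"], "readwrite");
            const queue = txn.objectStore("to_device_queue");
            const cryptoStore = txn.objectStore("crypto_state");

            const getReq = queue.get(key);
            getReq.onsuccess = () => {
                applyToDeviceEvent(getReq.result, cryptoStore); // e.g. store the imported room key
                queue.delete(key); // same transaction, so atomic with the write above
            };

            txn.oncomplete = () => resolve();
            txn.onerror = () => reject(txn.error);
            txn.onabort = () => reject(txn.error ?? new Error("transaction aborted"));
        });
    }

The awkward part is exactly what makes it "a bit harder to manage": IndexedDB transactions auto-commit as soon as you await anything, so the actual processing has to happen synchronously inside the transaction (or be staged into it).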

@ara4n
Member Author

ara4n commented Aug 22, 2022

thinking more, yeah: i think Olm should handle duplicate messages okay, so in a v3 world we could indeed rerequest them from the server.

@rxl881
Contributor

rxl881 commented Sep 6, 2022

Annoyingly, I got bitten by this one a second time (this time a lot worse, losing the majority of encrypted messages while I was away on holiday: ~1.5k megolm sessions). Linking here for reference -- https://github.com/matrix-org/element-web-rageshakes/issues/15185

@richvdh
Member

richvdh commented Nov 14, 2022

It's worth noting that this will have been significantly exacerbated by some recent changes that add an extra check to the processing of incoming encrypted to-device messages (https://github.com/matrix-org/matrix-js-sdk/blame/v21.0.0/src/crypto/algorithms/olm.ts#L225-L243)

@richvdh
Member

richvdh commented Nov 14, 2022

Switching to matrix-rust-sdk-crypto would presumably not fix this failure mode, given it's a matter of reliably processing to-device messages in general, rather than anything crypto specific.

It'll fix the crypto-related failure modes though, which is what this issue seems to be about (and I'm told that the rust matrix-sdk-crypto makes sure to persist the to-device messages it cares about, so IMHO this issue will be fixed by #21972).

If to-device messages are being used for other things, I'd suggest opening separate issues for them specifically.

@richvdh
Member

richvdh commented Apr 20, 2023

Switching to matrix-rust-sdk-crypto would presumably not fix this failure mode, given it's a matter of reliably processing to-device messages in general, rather than anything crypto specific.

It'll fix the crypto-related failure modes though, which is what this issue seems to be about (and I'm told that the rust matrix-sdk-crypto makes sure to persist the to-device messages it cares about, so IMHO this issue will be fixed by #21972).

Looking at this in more detail: the difference with the rust crypto implementation is not that it persists to-device messages per se, but that it does all processing of inbound to-device messages before returning control to the application - which means that the application will not make another /sync request until the to-device messages are processed.

(The main reason it is able to do this, whereas it is more difficult in the js-sdk, is that the js-sdk makes an outbound network request to fetch the sender's device keys before decryption, whereas the rust sdk delays this process until the keys are used.)
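
A simplified sketch of that ordering difference (function names are hypothetical; the real call chains live in matrix-js-sdk and matrix-sdk-crypto):

    // Hypothetical shapes, for illustration only.
    interface SyncResponse {
        next_batch: string;
        to_device: { events: unknown[] };
    }

    // Legacy shape (simplified): decryption is kicked off but not awaited, so
    // the sync token can advance while to-device events are still being handled.
    async function legacyIteration(
        fetchSync: (since: string) => Promise<SyncResponse>,
        handleToDevice: (ev: unknown) => void, // fire-and-forget decryption
        since: string,
    ): Promise<string> {
        const sync = await fetchSync(since);
        sync.to_device.events.forEach(handleToDevice);
        return sync.next_batch; // decryption may still be in flight here
    }

    // Rust-crypto shape (simplified): the whole to-device batch is processed
    // before control returns, so the next /sync cannot start until it is done.
    async function rustCryptoIteration(
        fetchSync: (since: string) => Promise<SyncResponse>,
        receiveSyncChanges: (events: unknown[]) => Promise<void>,
        since: string,
    ): Promise<string> {
        const sync = await fetchSync(since);
        await receiveSyncChanges(sync.to_device.events);
        return sync.next_batch;
    }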

So, the upshot remains: Element-R will fix this particular usage of to-device messages. Other applications which also use to-device messages may still be subject to the problem and will need separate fixes.

@richvdh richvdh changed the title If you restart (e.g. upgrade) Element while it's waiting to decrypt a m.room_key, it'll drop it and you'll get UISIs If you restart (e.g. upgrade) Element while it's waiting to process a m.room_key, it'll drop it and you'll get UISIs Mar 1, 2024
@richvdh
Member

richvdh commented Mar 1, 2024

@BillCarsonFr points out that we persist the sync token very early in /sync processing, so it doesn't really matter if the rest of the process is properly atomic: if the client is restarted between persisting that token and processing the keys, the keys will be dropped.

@kegsay
Contributor

kegsay commented Mar 1, 2024

tl;dr: JS SDK with Rust crypto does not have this problem because we only write AFTER processing /sync responses, once every 5 minutes.


Okay, this is confusing so bear with my explanation.

At first glance, it looks bad because this happens as soon as there's a response:

            // set the sync token NOW *before* processing the events. We do this so
            // if something barfs on an event we can skip it rather than constantly
            // polling with the same token.
            this.client.store.setSyncToken(data.next_batch);

However, IndexedDBStore does not implement setSyncToken itself; it falls back to the parent-class implementation, which is MemoryStore's. As a result, this DOES NOT PERSIST the sync token.

Later in the loop, we call await this.processSyncResponse(syncEventData, data); which contains this:

        if (data.to_device && Array.isArray(data.to_device.events) && data.to_device.events.length > 0) {
            let toDeviceMessages: IToDeviceEvent[] = data.to_device.events.filter(noUnsafeEventProps);

            if (this.syncOpts.cryptoCallbacks) {
                toDeviceMessages = await this.syncOpts.cryptoCallbacks.preprocessToDeviceMessages(toDeviceMessages);
            }

Specifically, we await preprocessToDeviceMessages, which then does const processed = await this.receiveSyncChanges({ events }). This ultimately awaits this.olmMachine.receiveSyncChanges(...), which hits https://github.com/matrix-org/matrix-rust-sdk-crypto-wasm/blob/f16e8063b0a30e87c1847f033af79ec47c1026ac/src/machine.rs#L317 and processes room keys and secrets, returning decrypted to-device events for things like room key requests or verification requests.

Only AFTER all this is await'd do we do this:

                // tell databases that everything is now in a consistent state and can be saved.
                await this.client.store.save();

This will NOT write to indexeddb until WRITE_DELAY_MS has elapsed, which defaults to 5 minutes. Only after that time will the token from setSyncToken be fetched and written (syncData.nextBatch):

    private async doSyncToDatabase(userTuples: UserTuple[]): Promise<void> {
        try {
            const syncData = this.syncAccumulator.getJSON(true);
            await Promise.all([
                this.persistUserPresenceEvents(userTuples),
                this.persistAccountData(syncData.accountData),
                this.persistSyncData(syncData.nextBatch, syncData.roomsData),
            ]);
        } finally {
            this.syncToDatabasePromise = undefined;
        }
    }

This is what happens when the client is restarted at various times in this process:

  • Immediately after getting a sync response, before writing in memory => uses previously stored nextBatch
  • After getting a sync response, after writing in memory => uses previously stored nextBatch
  • In the middle of receiveSyncChanges => uses previously stored nextBatch => may get duplicate to-device events
  • After receiveSyncChanges but before save => uses previously stored nextBatch => WILL get duplicate to-device events
  • In subsequent syncs, after save(), before WRITE_DELAY_MS => uses very old stored nextBatch => may get duplicate to-device events if the penultimate nextBatch contained to-device msgs, even if the nextBatch token was stored in-memory
  • After save() persists => uses that nextBatch token.

There's a complement-crypto test for this called TestUnprocessedToDeviceMessagesArentLostOnRestart, and in the logs you can clearly see that the JS SDK uses the last stored since token upon restart.

@kegsay kegsay closed this as completed Mar 1, 2024
@kegsay
Contributor

kegsay commented Mar 1, 2024

Follow-up Q: surely libolm legacy crypto is fine too then, right? And if so, why is this issue even a thing? Well, libolm is not fine, because the legacy to-device path (events being pushed through the event mapper, which kicks off decryption without anything awaiting it) introduces a source of concurrency which could mean that (a sketch of the race follows the list):

  • lots of to-device events arrive, they all get shoved through the event mapper and decryptEventIfNeeded is called many times.
  • Next event loop tick: sync processing code continues and calls .save()
  • Event mapper starts churning through the to-device events, decrypting them.
  • WRITE_DELAY_MS is met so .save() does a real write to indexeddb, setting the nextBatch.
  • Client restarts, with events still churning in the event mapper.
  • Unprocessed to-device events are now lost, as upon restart it uses the latest nextBatch token.
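
A minimal illustration of that race shape (hypothetical names; not the actual js-sdk code paths):

    // The decryption work is fired off without being awaited, so the debounced
    // save of nextBatch can overtake it. All names here are illustrative.
    async function handleSyncResponseLegacy(
        events: unknown[],
        nextBatch: string,
        decryptLater: (ev: unknown) => Promise<void>, // e.g. the event-mapper path
        persistNextBatchDebounced: (token: string) => Promise<void>, // e.g. the WRITE_DELAY_MS save()
    ): Promise<void> {
        // Fire-and-forget: nothing waits for these promises to settle.
        for (const ev of events) {
            void decryptLater(ev);
        }

        // Sync processing carries on and eventually persists nextBatch, even
        // though the decryptions above may still be running. A restart at that
        // point resumes from nextBatch and the unprocessed events are gone.
        await persistNextBatchDebounced(nextBatch);
    }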

@richvdh
Member

richvdh commented Apr 10, 2024

TL;DR: this remains an issue in legacy crypto
