
If you restart (e.g. upgrade) Element while it's waiting to process a m.room_key, it'll drop it and you'll get UISIs #23113

Closed
ara4n opened this issue Aug 19, 2022 · 13 comments
Labels
A-E2EE, O-Uncommon (Most users are unlikely to come across this or unexpected workflow), S-Major (Severely degrades major functionality or product features, with no satisfactory workaround), T-Defect, Z-UISI (Unable to decrypt errors)

Comments

@ara4n
Member

ara4n commented Aug 19, 2022

Steps to reproduce

I sent Rick some messages in an E2EE room for the first time in over a week, so it created a new megolm session. He couldn't decrypt them. It looks like it correctly sent it to him on vector.modular.im via to-device (although tracing to-device messages client->server->server->client is still incredibly hard; filed here).

Rick then turned on Element Desktop in the morning, which received a huge batch of 1515 to-device messages queued up overnight: https://rageshakes.element.io/api/listing/2022-08-19/081925-YR54YZWK/logs-0001.log.gz. The vast majority were keyshare requests from himself (presumably his Element Android trying to fix UISIs?)

matthew@shadowfax rick uisis % less logs-0001.log.gz.txt | egrep -o 'received to_device (.*?) ' | sort | uniq -c | sort -rn
1291 received to_device m.room_key_request 
 204 received to_device m.room.encrypted 
  19 received to_device m.room_key 
   1 received to_device im.vector.auto_rs_request 

Amongst these, there are a bunch of to-device messages from me:

% cat logs-0001.log.gz.txt | grep 'received to_device .* from: @matthew:matrix.org'    
2022-08-19T08:11:27.271Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
2022-08-19T08:11:34.840Z I received to_device m.room_key from: @matthew:matrix.org id: undefined
2022-08-19T08:11:35.777Z I received to_device m.room_key_request from: @matthew:matrix.org id: undefined
2022-08-19T08:11:35.778Z I received to_device im.vector.auto_rs_request from: @matthew:matrix.org id: undefined
2022-08-19T08:11:35.979Z I received to_device m.room_key_request from: @matthew:matrix.org id: undefined
2022-08-19T08:11:36.581Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
2022-08-19T08:11:36.581Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
2022-08-19T08:11:36.581Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
2022-08-19T08:11:36.581Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
2022-08-19T08:11:36.582Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined
2022-08-19T08:11:36.583Z I received to_device m.room.encrypted from: @matthew:matrix.org id: undefined

of which only one m.room.encrypted is actually decrypted into an m.room_key before Rick (quite reasonably) clicked 'upgrade' on Nightly at 2022-08-19T08:11:41.387Z, thus knifing it in the middle of working through the to-device backlog.

The m.room_key which got decrypted looked like this, and was from a different session to the one with the UISIs:

2022-08-19T08:11:34.840Z I Decrypted Olm message from Nhu0+B3JJNHr3abToRm0rYjXne20De67kH1RaNZBEAM with session T//5eGDj9bH7ATYfSiG+Odj70yCNI1nWBXjqe9qZtgk
2022-08-19T08:11:34.840Z I received to_device m.room_key from: @matthew:matrix.org id: undefined

Meanwhile, on the next launch, there were neither m.room.encrypted nor m.room_key events from me - i.e. the to-device messages with the keys got lost forever:

% cat logs-0000.log.gz.txt | grep 'received to_device .* from: @matthew:matrix.org' 
<nothing>

Looking at the code, onToDeviceEvent() calls event.attemptDecryption(), which then asynchronously calls decryptionLoop() to work through decrypting the message... but at no point does the to-device message actually get persisted anywhere.

To fix this, we should transactionally persist to-device messages before we consider the /sync request processed, and then reliably process them, much like Hydrogen does: it's a classic 'what goes wrong' reliable-transactions problem. Switching to matrix-rust-sdk-crypto would presumably not fix this failure mode, given it's a matter of reliably processing to-device messages in general, rather than anything crypto-specific.

Therefore, we need to switch the js-sdk to a more Hydrogen- or rust-sdk-style approach of persisting stuff received in /sync locally in order to reliably receive it from the server, and then processing based on that local store, so that if the client explodes we don't drop data on the floor.
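
A minimal sketch of what that could look like, assuming IndexedDB with a hypothetical "to_device_queue" object store (created with { autoIncrement: true }); the function names are illustrative and none of this is existing js-sdk code:

    // Hypothetical sketch only: persist raw to-device events from /sync before
    // the sync token is advanced, then drain and process them from the local
    // queue. "to_device_queue" and these function names are not real js-sdk APIs.

    interface ToDeviceEvent {
        type: string;
        sender: string;
        content: Record<string, unknown>;
    }

    function idbRequest<T>(req: IDBRequest<T>): Promise<T> {
        return new Promise((resolve, reject) => {
            req.onsuccess = () => resolve(req.result);
            req.onerror = () => reject(req.error);
        });
    }

    // Step 1: called while handling the /sync response, *before* next_batch is
    // persisted, so a crash cannot lose the events.
    async function queueToDeviceEvents(db: IDBDatabase, events: ToDeviceEvent[]): Promise<void> {
        const txn = db.transaction("to_device_queue", "readwrite");
        const store = txn.objectStore("to_device_queue");
        events.forEach((event) => store.add(event)); // auto-increment key preserves arrival order
        await new Promise<void>((resolve, reject) => {
            txn.oncomplete = () => resolve();
            txn.onerror = () => reject(txn.error);
        });
    }

    // Step 2: drain the queue independently of the sync loop. An event is only
    // deleted after it has been processed, so a restart re-processes it rather
    // than dropping it on the floor.
    async function drainToDeviceQueue(
        db: IDBDatabase,
        process: (event: ToDeviceEvent) => Promise<void>,
    ): Promise<void> {
        const store = db.transaction("to_device_queue", "readonly").objectStore("to_device_queue");
        const keysReq = store.getAllKeys();
        const eventsReq = store.getAll();
        const keys = await idbRequest(keysReq);
        const events = (await idbRequest(eventsReq)) as ToDeviceEvent[];

        for (let i = 0; i < events.length; i++) {
            await process(events[i]);
            const del = db.transaction("to_device_queue", "readwrite").objectStore("to_device_queue");
            await idbRequest(del.delete(keys[i]));
        }
    }

The trade-off is an extra disk write per /sync batch, in exchange for never acking a batch whose keys haven't been safely stored.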

Outcome

What did you expect?

No UISIs

What happened instead?

UISIs when the recipient restarts while their app is decrypting messages.

Operating system

No response

Application version

Nightly

How did you install the app?

Nightly

Homeserver

matrix.org

Will you send logs?

Yes

@ara4n ara4n added the T-Defect label Aug 19, 2022
@robintown robintown added the S-Major, A-E2EE, Z-UISI and O-Uncommon labels Aug 19, 2022
@ara4n
Member Author

ara4n commented Aug 19, 2022

it's worth noting that in a Sliding Sync world, we don't need to worry about reliably persisting so much stuff (other than to-device messages!) given we can just re-request it from the server at launch.

@kegsay
Contributor

kegsay commented Aug 19, 2022

So in a sliding sync world you have an alternative solution here to guarantee at-least-once processing without having to persist to-device messages (which incurs a perf hit, as you have to hit the disk). The flow would look like this (a rough sketch in code follows the list):

  • Assume we have a single persisted since token for to-device messages already as this is the common case. Call it S.
  • Get N to-device messages with since=S returning next_batch: S'.
  • Asynchronously begin processing these to-device messages in memory. During this time, disable the to-device extension (enabled: false). Continue doing sync loops to get normal events, etc. Meanwhile...
  • Slowly process the to-device messages in a web worker or something. If you die here, you'll restart with the persisted since value S and re-fetch these messages. This may result in more-than-once processing.
  • When finished processing, re-enable the extension and update the persisted token to S'.
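
A rough sketch of that flow, with a hypothetical sliding-sync client interface (the real extension/API shape may differ; the point is the ordering of the steps):

    // Illustrative only: hypothetical types, not a real sliding sync API.
    interface SlidingSyncClient {
        // One sliding sync request; the to-device extension is only included
        // when `toDevice` is supplied.
        sync(opts: { toDevice?: { enabled: boolean; since: string } }): Promise<{
            toDeviceEvents: unknown[];
            toDeviceNextBatch?: string;
        }>;
    }

    interface TokenStore {
        getSince(): string; // persisted to-device token S
        setSince(token: string): void;
    }

    async function toDeviceCycle(
        client: SlidingSyncClient,
        store: TokenStore,
        processAllToDevice: (events: unknown[]) => Promise<void>, // hypothetical processor
    ): Promise<void> {
        const since = store.getSince(); // 1. start from the persisted token S

        // 2. fetch the pending to-device messages, getting back S'
        const { toDeviceEvents, toDeviceNextBatch } =
            await client.sync({ toDevice: { enabled: true, since } });

        // 3. process asynchronously; keep syncing with the extension disabled so
        //    nothing advances the to-device token underneath us.
        const processing = processAllToDevice(toDeviceEvents);
        await client.sync({ toDevice: { enabled: false, since } });

        // 4. if we die before this resolves, we restart from S and re-fetch
        //    (which may mean more-than-once processing).
        await processing;

        // 5. only now ack the batch by persisting S'.
        if (toDeviceNextBatch) store.setSince(toDeviceNextBatch);
    }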

This isn't possible in sync v2 because:

  • ACKing to-device messages and getting new sync data is tightly coupled: you cannot do one without the other. SS has a since token just for to-device messages, and a pos for other normal sync data.
  • Even if you did have separate tokens, sync v2 would parse the dedicated to-device token if you kept it at S (so each /sync would return N to-device messages repeating forever until you increment to S'), or would return an initial sync's worth of to-device messages if you omitted the token. SS allows you to disable the extension entirely to say "I don't want you to process this right now" and then enable it later on when your client has gone through all the to-device messages.

@ara4n
Member Author

ara4n commented Aug 19, 2022

This may result in more than once processing.

this is the kicker, though - for e2ee, processing more than once is a disaster, given it's a ratchet and the ratchet will get out of sync. my hunch is that we should grab the to-device msgs as rapidly as possible and chuck them reliably to disk, and then consume from there, rather than relying on the (admittedly cool) feature that sliding sync lets you roll back.

@kegsay
Contributor

kegsay commented Aug 20, 2022

If that is such a problem, then persisting to-device messages isn't enough: you need a two-phase or atomic commit process to avoid reprocessing. E.g. write all the messages before you ack the token, then for each one start a transaction, process the message, delete it, and commit. The processing part cannot store anything outside the transaction. It becomes a bit harder to manage. I'm surprised the protocol is so fragile that duplicate messages break it, though.
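
A rough sketch of that per-message transaction, assuming the events were already written to a hypothetical "to_device_queue" IndexedDB store before the token was acked, and that the crypto state lives in a "crypto_state" store in the same database (all names illustrative, not js-sdk code):

    // Process one queued event and delete it in a single transaction, writing
    // any resulting state into that same transaction, so the event is either
    // fully applied or still queued, never half-done.
    function processQueuedEvent(
        db: IDBDatabase,
        key: IDBValidKey,
        // Must be synchronous: an `await` here would let the transaction
        // auto-commit before the delete below is issued.
        applyToDeviceEvent: (event: unknown, cryptoStore: IDBObjectStore) => void,
    ): Promise<void> {
        return new Promise((resolve, reject) => {
            const txn = db.transaction(["to_device_queue", "crypto_state"], "readwrite");
            const queue = txn.objectStore("to_device_queue");
            const cryptoStore = txn.objectStore("crypto_state");

            const getReq = queue.get(key);
            getReq.onsuccess = () => {
                applyToDeviceEvent(getReq.result, cryptoStore); // e.g. store the imported room key
                queue.delete(key); // same transaction, so atomic with the write above
            };

            txn.oncomplete = () => resolve();
            txn.onerror = () => reject(txn.error);
            txn.onabort = () => reject(txn.error ?? new Error("transaction aborted"));
        });
    }

The awkward part is exactly what makes it "a bit harder to manage": IndexedDB transactions auto-commit as soon as you await anything, so the actual processing has to happen synchronously inside the transaction (or be staged into it).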

@ara4n
Member Author

ara4n commented Aug 22, 2022

thinking more, yeah: i think Olm should handle duplicate messages okay, so in a v3 world we could indeed rerequest them from the server.

@rxl881
Contributor

rxl881 commented Sep 6, 2022

Annoyingly, I got bitten by this one a second time (this time a lot worse, losing the majority of encrypted messages while I was away on holiday: ~1.5k megolm sessions). Linking here for reference -- https://github.com/matrix-org/element-web-rageshakes/issues/15185

@richvdh
Member

richvdh commented Nov 14, 2022

It's worth noting that this will have been significantly exacerbated by some recent changes that add an extra check to the processing of incoming encrypted to-device messages (https://github.com/matrix-org/matrix-js-sdk/blame/v21.0.0/src/crypto/algorithms/olm.ts#L225-L243)

@richvdh
Member

richvdh commented Nov 14, 2022

Switching to matrix-rust-sdk-crypto would presumably not fix this failure mode, given it's a matter of reliably processing to-device messages in general, rather than anything crypto specific.

It'll fix the crypto-related failure modes though, which is what this issue seems to be about (and I'm told that the rust matrix-sdk-crypto makes sure to persist the to-device messages it cares about, so IMHO this issue will be fixed by #21972).

If to-device messages are being used for other things, I'd suggest opening separate issues for them specifically.

@richvdh
Member

richvdh commented Apr 20, 2023

Switching to matrix-rust-sdk-crypto would presumably not fix this failure mode, given it's a matter of reliably processing to-device messages in general, rather than anything crypto specific.

It'll fix the crypto-related failure modes though, which is what this issue seems to be about (and I'm told that the rust matrix-sdk-crypto makes sure to persist the to-device messages it cares about, so IMHO this issue will be fixed by #21972).

Looking at this in more detail: the difference with the rust crypto implementation is not that it persists to-device messages per se, but that it does all processing of inbound to-device messages before returning control to the application - which means that the application will not make another /sync request until the to-device messages are processed.

(The main reason it is able to do this, whereas it is more difficult in the js-sdk, is that the js-sdk makes an outbound network request to fetch the sender's device keys before decryption, whereas the rust sdk delays this process until the keys are used.)
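
A simplified sketch of that ordering difference (function names are hypothetical; the real call chains live in matrix-js-sdk and matrix-sdk-crypto):

    // Hypothetical shapes, for illustration only.
    interface SyncResponse {
        next_batch: string;
        to_device: { events: unknown[] };
    }

    // Legacy shape (simplified): decryption is kicked off but not awaited, so
    // the sync token can advance while to-device events are still being handled.
    async function legacyIteration(
        fetchSync: (since: string) => Promise<SyncResponse>,
        handleToDevice: (ev: unknown) => void, // fire-and-forget decryption
        since: string,
    ): Promise<string> {
        const sync = await fetchSync(since);
        sync.to_device.events.forEach(handleToDevice);
        return sync.next_batch; // decryption may still be in flight here
    }

    // Rust-crypto shape (simplified): the whole to-device batch is processed
    // before control returns, so the next /sync cannot start until it is done.
    async function rustCryptoIteration(
        fetchSync: (since: string) => Promise<SyncResponse>,
        receiveSyncChanges: (events: unknown[]) => Promise<void>,
        since: string,
    ): Promise<string> {
        const sync = await fetchSync(since);
        await receiveSyncChanges(sync.to_device.events);
        return sync.next_batch;
    }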

So, the upshot remains: Element-R will fix this particular usage of to-device messages. Other applications which also use to-device messages may still be subject to the problem and will need separate fixes.

@richvdh richvdh changed the title If you restart (e.g. upgrade) Element while it's waiting to decrypt a m.room_key, it'll drop it and you'll get UISIs If you restart (e.g. upgrade) Element while it's waiting to process a m.room_key, it'll drop it and you'll get UISIs Mar 1, 2024
@richvdh
Member

richvdh commented Mar 1, 2024

@BillCarsonFr points out that we persist the sync token very early in /sync processing, so it doesn't really matter if the rest of the process is properly atomic: if the client is restarted between persisting that token and processing the keys, the keys will be dropped.

@kegsay
Contributor

kegsay commented Mar 1, 2024

tl;dr: JS SDK with Rust crypto does not have this problem because we only write AFTER processing /sync responses, once every 5 minutes.


Okay, this is confusing so bear with my explanation.

At first glance, it looks bad because this happens as soon as there's a response:

            // set the sync token NOW *before* processing the events. We do this so
            // if something barfs on an event we can skip it rather than constantly
            // polling with the same token.
            this.client.store.setSyncToken(data.next_batch);

However, IndexedDBStore does not implement setSyncToken itself; it falls back to the parent-class implementation, which is MemoryStore's. As a result, this DOES NOT PERSIST the sync token.

Later in the loop, we call await this.processSyncResponse(syncEventData, data); which contains this:

        if (data.to_device && Array.isArray(data.to_device.events) && data.to_device.events.length > 0) {
            let toDeviceMessages: IToDeviceEvent[] = data.to_device.events.filter(noUnsafeEventProps);

            if (this.syncOpts.cryptoCallbacks) {
                toDeviceMessages = await this.syncOpts.cryptoCallbacks.preprocessToDeviceMessages(toDeviceMessages);
            }

Specifically, we await preprocessToDeviceMessages, which then does const processed = await this.receiveSyncChanges({ events }). This ultimately awaits this.olmMachine.receiveSyncChanges(...), which hits https://github.com/matrix-org/matrix-rust-sdk-crypto-wasm/blob/f16e8063b0a30e87c1847f033af79ec47c1026ac/src/machine.rs#L317 and processes room keys and secrets, returning decrypted to-device events for things like room key requests or verification requests.

Only AFTER all this is await'd do we do this:

                // tell databases that everything is now in a consistent state and can be saved.
                await this.client.store.save();

This will NOT write to indexeddb until WRITE_DELAY_MS has elapsed, which defaults to 5 minutes. Only after that time will the token from setSyncToken be fetched and written (syncData.nextBatch):

    private async doSyncToDatabase(userTuples: UserTuple[]): Promise<void> {
        try {
            const syncData = this.syncAccumulator.getJSON(true);
            await Promise.all([
                this.persistUserPresenceEvents(userTuples),
                this.persistAccountData(syncData.accountData),
                this.persistSyncData(syncData.nextBatch, syncData.roomsData),
            ]);
        } finally {
            this.syncToDatabasePromise = undefined;
        }
    }

This is what happens when the client is restarted at various times in this process:

  • Immediately after getting a sync response, before writing in memory => uses previously stored nextBatch
  • After getting a sync response, after writing in memory => uses previously stored nextBatch
  • In the middle of receiveSyncChanges => uses previously stored nextBatch => may get duplicate to-device events
  • After receiveSyncChanges but before save => uses previously stored nextBatch => WILL get duplicate to-device events
  • In subsequent syncs, after save(), before WRITE_DELAY_MS => uses very old stored nextBatch => may get duplicate to-device events if the penultimate nextBatch contained to-device msgs, even if the nextBatch token was stored in-memory
  • After save() persists => uses that nextBatch token.

There's a complement-crypto test for this called TestUnprocessedToDeviceMessagesArentLostOnRestart, and in the logs you can clearly see that the JS SDK uses the last stored since token upon restart.

@kegsay kegsay closed this as completed Mar 1, 2024
@kegsay
Contributor

kegsay commented Mar 1, 2024

Follow-up Q: surely libolm legacy crypto is fine too then, right? And if so, why is this issue even a thing? Well, libolm is not fine, because the legacy to-device path (events being pushed through the event mapper, which kicks off decryption without anything awaiting it) introduces a source of concurrency which could mean that (a sketch of the race follows the list):

  • lots of to-device events arrive, they all get shoved through the event mapper and decryptEventIfNeeded is called many times.
  • Next event loop tick: sync processing code continues and calls .save()
  • Event mapper starts churning through the to-device events, decrypting them.
  • WRITE_DELAY_MS is met so .save() does a real write to indexeddb, setting the nextBatch.
  • Client restarts, with events still churning in the event mapper.
  • Unprocessed to-device events are now lost, as upon restart it uses the latest nextBatch token.
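
A minimal illustration of that race shape (hypothetical names; not the actual js-sdk code paths):

    // The decryption work is fired off without being awaited, so the debounced
    // save of nextBatch can overtake it. All names here are illustrative.
    async function handleSyncResponseLegacy(
        events: unknown[],
        nextBatch: string,
        decryptLater: (ev: unknown) => Promise<void>, // e.g. the event-mapper path
        persistNextBatchDebounced: (token: string) => Promise<void>, // e.g. the WRITE_DELAY_MS save()
    ): Promise<void> {
        // Fire-and-forget: nothing waits for these promises to settle.
        for (const ev of events) {
            void decryptLater(ev);
        }

        // Sync processing carries on and eventually persists nextBatch, even
        // though the decryptions above may still be running. A restart at that
        // point resumes from nextBatch and the unprocessed events are gone.
        await persistNextBatchDebounced(nextBatch);
    }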

@richvdh
Member

richvdh commented Apr 10, 2024

TL;DR: this remains an issue in legacy crypto
