Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

/keys/query loop in unverified sessions #27165

Closed
fnwbr opened this issue Mar 12, 2024 · 22 comments · Fixed by matrix-org/matrix-rust-sdk#3442 or matrix-org/matrix-js-sdk#4216
Closed

/keys/query loop in unverified sessions #27165

fnwbr opened this issue Mar 12, 2024 · 22 comments · Fixed by matrix-org/matrix-rust-sdk#3442 or matrix-org/matrix-js-sdk#4216
Assignees
Labels
A-Element-R Issues affecting the port of Element's crypto layer to Rust O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Major Severely degrades major functionality or product features, with no satisfactory workaround T-Defect

Comments

@fnwbr
Copy link

fnwbr commented Mar 12, 2024

I've outlined this in #element-web:matrix.org already; but to no luck. We've started seeing this as of 1 or 2 weeks ago:

Steps to reproduce

I tried this on a fresh Synapse instance with an empty database.

  1. In a fresh browser session, open Element and login
  2. Create a new room with encryption enabled
  3. Do not backup your keys or anything
  4. Close the window
  5. Now start a new session again (I just started a fresh Private Window)
  6. Login to the same account
  7. 💥

Outcome

What did you expect?

Element to not go crazy.

What happened instead?

Watch the console output repeating the following block about 60 times per second:
image

21:23:04.702 [Perf]: Make outgoing request 1 took 43ms [rageshake.ts:77:16](webpack://element-web/node_modules/matrix-react-sdk/src/rageshake/rageshake.ts)
21:23:04.708 FetchHttpApi: --> POST http://matrix.localhost/_matrix/client/v3/keys/query [rageshake.ts:77:16](webpack://element-web/node_modules/matrix-react-sdk/src/rageshake/rageshake.ts)
21:23:04.710 FetchHttpApi: --> POST http://matrix.localhost/_matrix/client/v3/keys/query [rageshake.ts:77:16](webpack://element-web/node_modules/matrix-react-sdk/src/rageshake/rageshake.ts)
21:23:04.725 FetchHttpApi: <-- POST http://matrix.localhost/_matrix/client/v3/keys/query [18ms 200] [rageshake.ts:77:16](webpack://element-web/node_modules/matrix-react-sdk/src/rageshake/rageshake.ts)
21:23:04.725 FetchHttpApi: <-- POST http://matrix.localhost/_matrix/client/v3/keys/query [16ms 200] [rageshake.ts:77:16](webpack://element-web/node_modules/matrix-react-sdk/src/rageshake/rageshake.ts)
21:23:04.726 DEBUG matrix_sdk_crypto::identities::manager: Handling a `/keys/query` response
    request_id="e8becdd764274754b6031ca441212b8c" users={"@florian:matrix.localhost"} failures={}
    at /home/runner/.cargo/git/checkouts/matrix-rust-sdk-1f4927f82a3d27bb/dcf0069/crates/matrix-sdk-crypto/src/identities/manager.rs:124 [rageshake.ts:77:16](webpack://element-web/node_modules/matrix-react-sdk/src/rageshake/rageshake.ts)
21:23:04.726 DEBUG matrix_sdk_crypto::identities::manager: Handling a `/keys/query` response
    request_id="98286678d9b54861b5a3c54dda2c703e" users={"@florian:matrix.localhost"} failures={}
    at /home/runner/.cargo/git/checkouts/matrix-rust-sdk-1f4927f82a3d27bb/dcf0069/crates/matrix-sdk-crypto/src/identities/manager.rs:124

Application version

Element version: 1.11.60 Crypto version: Rust SDK 0.7.0 (b1918e9), Vodozemac 0.5.1

Homeserver

Synapse v1.102.0

@fnwbr fnwbr added the T-Defect label Mar 12, 2024
@fnwbr
Copy link
Author

fnwbr commented Mar 12, 2024

I'm suspecting it's happening since the switch to the Rust Crypto SDK, but I will have to confirm that tomorrow.

@t3chguy
Copy link
Member

t3chguy commented Mar 12, 2024

This sounds like #26194 which @richvdh closed in matrix-org/matrix-js-sdk#3757

@richvdh
Copy link
Member

richvdh commented Mar 13, 2024

Please send debug logs from within the application.

@richvdh richvdh added the A-Element-R Issues affecting the port of Element's crypto layer to Rust label Mar 13, 2024
@fnwbr
Copy link
Author

fnwbr commented Mar 13, 2024

I have just re-produced the issue using the build on https://develop.element.io and uploaded logs from there.

@fnwbr
Copy link
Author

fnwbr commented Mar 13, 2024

Because I said I'd do so last night: I can also confirm that this behavior does not occur with v1.11.57 -- which was the last release before defaulting to the Rust crypto SDK, right?

EDIT: I sadly cannot provide logs from this test because I did so running the v1.11.57 Docker image locally without configuring Rageshake and whatnot. Let me know if I can provide you with more useful info on this.

fnwbr added a commit to medienhaus/medienhaus-docker-dev that referenced this issue Mar 13, 2024
fnwbr added a commit to medienhaus/medienhaus-docker that referenced this issue Mar 13, 2024
@florianduros florianduros added S-Major Severely degrades major functionality or product features, with no satisfactory workaround O-Uncommon Most users are unlikely to come across this or unexpected workflow labels Mar 25, 2024
@richvdh
Copy link
Member

richvdh commented Apr 3, 2024

The debug logs show lots of calls to /keys/query, but no corresponding Created a /keys/query request lines. That implies that the requests in question must be created via IdentityManager::build_key_query_for_users (which does not log the created requests). That method is only used by OlmMachine::query_keys_for_users. That in turn is called by RustCrypto.userHasCrossSigningKeys. That is called in a few places in the codebase.

@richvdh
Copy link
Member

richvdh commented Apr 3, 2024

I was unable to reproduce this following the reproduction steps.

I tried this on a fresh Synapse instance with an empty database.

In a fresh browser session, open Element and login

Surely if it is an empty database, you cannot login? Did you register the user somehow? Or are you using an unusual authentication system?

Assuming you're using an existing account: when I try to log in again, I see this screen:

image

@fnwbr can you confirm if you see that?

@fnwbr
Copy link
Author

fnwbr commented Apr 3, 2024

Thanks for giving this a look, I can check back in tomorrow and provide more details.

But to answer one question:

Surely if it is an empty database, you cannot login? Did you register the user somehow? Or are you using an unusual authentication system?

Yep, sorry for not mentioning. We're using the LDAP provider module as outlined in:
https://github.com/medienhaus/medienhaus-docker-dev/blob/main/config/matrix-synapse.yaml#L39-L52

@SPiRiT369
Copy link
Contributor

I just want to say that this issue happened to me for a long time. These endless loops triggered bans so I decided to stop my service (unfederated server) after not getting help in the chat. Today I've re-installed the server with a hope it's solved, but it's not.
It also happens to me with the Android app. And I think that renaming session(s) somehow stops the loop.
:/

@dbkr
Copy link
Member

dbkr commented Apr 22, 2024

I have my local copy doing this currently - here's a stack trace of what's calling userHasCrossSigningKeys which is what's causing the /keys/query

Screenshot 2024-04-22 at 12 09 27

UI looks like this (because I did not verify when Iogged in, so expected):

Screenshot 2024-04-22 at 12 34 30

@richvdh
Copy link
Member

richvdh commented Apr 22, 2024

I was able to reproduce this too.

I think you need an account which:

  • At least one of cross-signing or secret storage is not fully set up (particularly: private cross-signing keys are not stored in 4S). Could be a historical account, or a new one which for one of many reasons didn't get fully provisioned on registration.
  • Joined to at least one encrypted room.

Note that being verified (and actually having access to the secrets in secret storage) doesn't seem to be a factor here. Contrary to what you might expect, isSecretStorageReady() && isCrossSigningReady() is true even if we don't have the private cross-signing keys and can't get them because we don't have the 4S keys.

@SPiRiT369
Copy link
Contributor

@richvdh, to me it happens in a completely new server, so I wasn't in any room (encrypted or not) and I was the only user on that server.
Also it happens to me when I open Element Web and Android (one of them).

@richvdh
Copy link
Member

richvdh commented Apr 23, 2024

@richvdh, to me it happens in a completely new server, so I wasn't in any room (encrypted or not) and I was the only user on that server.

Ok please can you file a separate issue with as much detail as you can, and send a bug report from within the application

@richvdh
Copy link
Member

richvdh commented May 21, 2024

@BillCarsonFr looked at this and observed that we call onDevicesUpdated even when there is no change. A simple fix therefore might be to not do that.

However, it looks like we do that quite deliberately: https://github.com/matrix-org/matrix-rust-sdk/blob/5df53d7338a381fd596c580f36cbe69fdcfceed8/crates/matrix-sdk-crypto/src/store/mod.rs#L648. I think we should understand why that line is there before we rip it out.

@richvdh
Copy link
Member

richvdh commented May 21, 2024

Actually, no, that's not what that code is doing. Ignore my last comment; I still don't really understand why this bug is happening.

@richvdh
Copy link
Member

richvdh commented May 22, 2024

So the loop goes:

  • DeviceListener.doRecheck calls CryptoApi.userHasCrossSigningKeys
  • CryptoApi.userHasCrossSigningKeys calls OlmMachine::query_keys_for_users, which sends a /keys/query request
  • The /keys/query request returns a list of devices for the user
  • When processing the response to /keys/query, OlmMachine::receive_keys_query_response calls IdentityManager::receive_keys_query_response.
  • IdentityManager::receive_keys_query_response calls IdentityManager::handle_devices_from_key_query, which calls IdentityManager:update_user_devices
    • IdentityManager:update_user_devices calls IdentityManager::update_or_create_device for each device
    • IdentityManager::update_or_create_device does not check if the device matches that already stored; it just returns the device on the "changed" list.
  • Back in IdentityManager::receive_keys_query_response, it calls matrix-sdk-crypto::store::Store::save_changes
  • Store::save_changes calls CryptoStoreWrapper::save_changes
  • CryptoStoreWrapper::save_changes calls CryptoStore::save_changes to actually save the changes, and then sends out updates over CryptoStoreWrapper::identities_broadcaster
  • Store::identities_stream_raw exposes the identities stream
  • In the wasm bindings OlmMachine.registerDevicesUpdatedCallback subscribes to the raw identities stream and calls the callback with the list of affected users
  • The callback is RustCrypto:onDevicesUpdated, which emits a CryptoEvent.DevicesUpdated
  • DeviceListener has registered a listener for CryptoEvent.DevicesUpdated, which calls DeviceListener.doRecheck when our own userId is updated.

The obvious solution is to have IdentityManager::update_or_create_device check if the received device matches the details we already have. If so, it can return DeviceChange::None instead of DeviceChange::Updated.

@Hywan Hywan reopened this May 23, 2024
@Hywan
Copy link
Member

Hywan commented May 23, 2024

I'm not sure if this issue should be closed despite matrix-org/matrix-rust-sdk#3442 has been merged. I let @richvdh decides 🙂.

@richvdh
Copy link
Member

richvdh commented May 23, 2024

Thanks @Hywan. Yes, we'll keep this open until the fix actually lands in element-web (which means a new release of the wasm bindings, and a fix to matrix-org/matrix-rust-sdk#3447)

@SPiRiT369
Copy link
Contributor

SPiRiT369 commented May 23, 2024

@richvdh, thanks a lot for looking on this! I completely shut down 2 of my servers months ago because of this bug.
Question: This loop bug also happened to me when I opened Element Android (when the web client was not active).
Is this fix supposed to affect Android clients as well somehow?

@richvdh
Copy link
Member

richvdh commented May 23, 2024

I'm afraid I can't speak to Element Android. It may be affected by the same problem, or it may be something else. I'd suggest opening an issue on the element-android repository.

richvdh added a commit to matrix-org/matrix-js-sdk that referenced this issue May 24, 2024
Slightly more involved than normal because it requires us to pass a backup version into OlmMachine.importBackedUpRoomKeys.

On the other hand we can now re-enable the test that was disabled in #4214 due to matrix-org/matrix-rust-sdk#3447

Fixes: element-hq/element-web#27165
@MatMaul
Copy link

MatMaul commented May 24, 2024

FYI we have a server with a main at 100% and we can pinpoint that to the deploy of v1.11.67, which seems to be because this version migrates 30% of the users to the Rust crypto layer which triggers this bug.

We only have 10-ish people with this new version but it's enough to peg the main of this server, probably also because it was quite loaded before that.

Thanks for fixing it.

@amandahla
Copy link

amandahla commented Jun 4, 2024

Hey while this issue is being fixed (thanks!), would be possible to prevent that by using some of the ratelimiting configurations?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Element-R Issues affecting the port of Element's crypto layer to Rust O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Major Severely degrades major functionality or product features, with no satisfactory workaround T-Defect
Projects
None yet
9 participants