
Element never flushes accumulated non-LL room members, and so slowly uses more ambient RAM. #25264

Open
ara4n opened this issue May 2, 2023 · 1 comment
Labels
  • A-Performance
  • O-Uncommon (Most users are unlikely to come across this or unexpected workflow)
  • S-Major (Severely degrades major functionality or product features, with no satisfactory workaround)
  • T-Defect

Comments

ara4n commented May 2, 2023

Steps to reproduce

  1. My Element Desktop kept whitescreening overnight, as per element-desktop#680 (On un-sleeping laptop, element desktop sometimes goes entirely white because EW has OOMed during the night).
  2. It turns out my v8 heap was idling around 1.3GB (1.1GB after GC), perilously close to the 1.5GB limit - meaning that if sync-accumulator persists to sqlite overlap while the laptop is asleep, we can OOM.
  3. 1.3GB seems high; I'd expect more like 500MB.
  4. It turns out I had 418K RoomMember objects on the heap, which is massive.
  5. On a fresh launch, I have 370K, which is still massive.
  6. It turns out there are 158K room members pulled in from the accumulated sync response, counted in the devtools console:

```js
let count = 0;
for (const r of mxMatrixClientPeg.get().getRooms()) {
    count += Object.keys(r.currentState.members).length;
}
console.log(count);
// 15:46:12.123 rageshake.ts:74 158074
```

Presumably we get ~2x of these on the heap - perhaps one for prevState and currentState per room, or similar.
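
A rough way to sanity-check that "~2x" guess from the same devtools console as the snippet above (assuming each room's oldState exposes its members the same way currentState does):

```js
// Count RoomMember objects reachable via both currentState and oldState for every room.
let current = 0;
let old = 0;
for (const room of mxMatrixClientPeg.get().getRooms()) {
    current += Object.keys(room.currentState.members).length;
    old += Object.keys(room.oldState.members).length;
}
console.log(`currentState: ${current}, oldState: ${old}, total: ${current + old}`);
```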

So the problem seems to be that we never ever flush non-LL members from the sync accumulator. So if you idle in Matrix HQ in a login for a year, then the persisted room state gradually grows as it sees members appear in the timeline. (Lazily loaded members don't get persisted). So on my account it had gone from 7 members to 11,000 members - and similarly in other busy rooms, hence increasing the 'actual' LL subset of 17K members by ~10x up to 158K members.
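
To see which rooms the accumulated state has bloated in (e.g. to spot the growth from 7 to 11,000 persisted members described above), a console one-off along these lines should do, under the same assumptions as the snippet above:

```js
// List the ten rooms with the most members in the accumulated current state.
mxMatrixClientPeg.get().getRooms()
    .map((room) => ({ name: room.name, members: Object.keys(room.currentState.members).length }))
    .sort((a, b) => b.members - a.members)
    .slice(0, 10)
    .forEach((r) => console.log(`${r.members}\t${r.name}`));
```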

While looking into this I saw some other memory leaks too:

  • lots of detached HTML DOM nodes (e.g. 2152x detached HTMLDivElements)
  • 2533x logger instances
  • 16118x SummarizedNotificationState_SummarizedNotificationState instances
  • window.mxSettingsStore is somehow retaining 140MB of native context (v8 JITed code?)
  • client.reEmitters ends up with 480K-odd keys in its map, grows forever, and is so unwieldy that trying to inspect it in devtools causes an OOM (see the sketch after this list for checking just its size)
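
As a way to look at the reEmitters growth without expanding the whole map in the inspector (which can itself OOM the tab), something like this should work, assuming the map exposes a standard .size property:

```js
// Hypothetical console check: print only the number of reEmitter registrations,
// rather than expanding the map in devtools. Assumes client.reEmitters is Map-like
// with a .size property, per the description above.
console.log(mxMatrixClientPeg.get().reEmitters?.size);
```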

For the main bug (slowly leaking members), about the only actual solution that comes to mind is that we could occasionally flush the accumulated m.room.member events, and toggle the include_redundant_members field on the /sync filter to persuade the server to start resending non-LL members to us (assuming the server flushes the LL LRU cache on seeing include_redundant_members).
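
For illustration, the /sync filter change being proposed would look something like this (field names are from the Matrix spec's RoomEventFilter; when and how Element would actually flip the flag is left open):

```js
// Sketch of a /sync filter with include_redundant_members toggled on after flushing
// the accumulated m.room.member events, so the server resends member events it would
// otherwise treat as already known to this client.
const syncFilter = {
    room: {
        state: {
            lazy_load_members: true,
            include_redundant_members: true,
        },
    },
};
```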

Meanwhile, sliding sync (or a Hydrogen or rust-sdk style architecture where we don't store members in RAM) is clearly the better solution - i.e. killing the sync accumulator entirely.

I propose not fixing this, in favour of landing SS instead, and meanwhile the workaround is for powerusers to logout and login again once a year or so.

Outcome

What did you expect?

App to not OOM overnight

What happened instead?

Many many OOMs.

Operating system

macOS 13.3.1

Browser information

Chrome

URL for webapp

element.io

Application version

Element Nightly version: 2023041901 Olm version: 3.2.12

Homeserver

matrix.org

Will you send logs?

No

ara4n added the T-Defect label on May 2, 2023
germain-gg added the A-Performance, S-Major and O-Uncommon labels on May 2, 2023

t3chguy commented May 5, 2023

Worth keeping in mind that even though the current state only has n members, each "oldState" holds the same again, and each partial timeline (/context API) and each membership change creates a new Member object due to historical membership. That gives a minimum bound of 2n (for n members), with the maximal bound being around 2n + e (for e events), so you can end up with quite a lot of RoomMember objects before any leaking even happens. (For the numbers above, the lower bound alone is roughly 2 × 158K ≈ 316K RoomMember objects, which is in the same ballpark as the ~370K seen on a fresh launch.)
