
Daphne is slowly leaking memory via channels-redis #7720

Closed
ryanpetrello opened this issue Jul 24, 2020 · 5 comments

Comments

@ryanpetrello
Contributor

see: django/channels#1181 (comment)


@ryanpetrello
Contributor Author

ryanpetrello commented Aug 12, 2020

In addition to the general leak described in Daphne, I've found another way to get Daphne's memory to grow in an unbounded way.

Internally, as channels_redis consumes messages from Redis, it stores per-local-channel copies in an in-memory receive buffer for the other consumers on the same topic. In this way, if multiple consumers are subscribed to the same topic (e.g., the stdout for a job, or the global websocket broadcast topic), they can each receive their own copy.

Unfortunately, Daphne will continually grow this per-channel buffer in an unbounded way even when nothing is reading from the other end (e.g., if the connection is closed, or if the reader simply can't keep up).
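To make the failure mode concrete, here is a minimal, hypothetical sketch of the pattern described above: an unbounded per-channel queue that only grows once its reader stalls. The ReceiveBuffer class and the fan_out/receive names are illustrative and are not the actual channels_redis implementation.

```python
import asyncio

# Hypothetical illustration only (not channels_redis code): one unbounded
# queue per local channel, appended to on every broadcast regardless of
# whether the consumer on the other end is still reading.
class ReceiveBuffer:
    def __init__(self):
        # channel name -> unbounded asyncio.Queue; nothing ever evicts
        # old messages or applies backpressure to the sender
        self.buffers: dict[str, asyncio.Queue] = {}

    def fan_out(self, channels: list[str], message: dict) -> None:
        """Copy the message into every subscriber's per-channel buffer."""
        for channel in channels:
            queue = self.buffers.setdefault(channel, asyncio.Queue())
            queue.put_nowait(message)  # always succeeds; the buffer only grows

    async def receive(self, channel: str) -> dict:
        """A healthy consumer drains its queue here; a stalled one never does."""
        return await self.buffers.setdefault(channel, asyncio.Queue()).get()
```

If one consumer stops calling receive() (closed connection, saturated link), fan_out() keeps appending copies for it and the process's RSS climbs until it is restarted.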

To illustrate why this can become a problem:

  1. Install a clustered AWX with 5 or more nodes.
  2. Intentionally disrupt eth0 on Node B so that broadcasts from the other nodes aren't read quickly, causing the buffer on Node A to fill (and never empty, because the read side on Node B can't keep up):
     ~ tc qdisc add dev eth0 root netem delay 500ms loss 50%
  3. Run a playbook that generates a high-volume, constant stream of event data on Node A. Use instance groups to ensure that the playbook runs on Node A.
  4. Note that Daphne's RSS on Node A will slowly grow in an unbounded way because the receive buffer for the broadcast channel is filling and never being emptied (a small monitoring sketch follows this list).
  5. Go eat a sandwich and come back 30 minutes later.
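For step 4, a quick way to watch the growth is to sample the Daphne process's RSS over time. This is a rough, hypothetical helper (psutil and the substring match on the process command line are my assumptions, and it is not part of AWX):

```python
import time

import psutil  # third-party: pip install psutil

def watch_daphne_rss(interval_seconds: int = 60) -> None:
    """Print the RSS of every running daphne process once per interval."""
    while True:
        for proc in psutil.process_iter(["cmdline", "memory_info"]):
            cmdline = " ".join(proc.info["cmdline"] or [])
            mem = proc.info["memory_info"]
            if "daphne" in cmdline and mem is not None:
                rss_mb = mem.rss / (1024 * 1024)
                print(f"{time.strftime('%H:%M:%S')} pid={proc.pid} rss={rss_mb:.1f} MiB")
        time.sleep(interval_seconds)

if __name__ == "__main__":
    watch_daphne_rss()
```

If the leak is present, the reported RSS keeps trending upward while the playbook runs and never comes back down after it finishes.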

This is a fairly close approximation of the bug outlined at django/channels_redis#384 as it might affect AWX's busiest channel, the websocket backplane we use for broadcasting events to peers in a cluster. The messages in my testing are fairly small, so it takes a while for memory to grow, but you could imagine that large messages (like lots of fact collection) would cause much quicker memory growth.
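For reference, the general shape of the mitigation is to cap how much a stalled reader's buffer can hold, so a slow or dead consumer costs a bounded amount of memory. This is only a generic sketch with an assumed capacity value, not the actual upstream patch:

```python
from collections import deque

# Generic illustration of a capacity-bounded per-channel buffer; the class
# name and the capacity of 100 are assumptions, not the channels_redis API.
class BoundedReceiveBuffer:
    def __init__(self, capacity: int = 100):
        self.capacity = capacity
        self.buffers: dict[str, deque] = {}

    def fan_out(self, channels: list[str], message: dict) -> None:
        for channel in channels:
            buf = self.buffers.setdefault(channel, deque(maxlen=self.capacity))
            buf.append(message)  # deque(maxlen=...) silently drops the oldest entry
```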

related: django/channels_redis#384

@kdelee
Member

kdelee commented Sep 16, 2020

@ryanpetrello is this resolved via #8094?

@ryanpetrello
Contributor Author

Yes. Thanks, @kdelee.

@kdelee
Member

kdelee commented Sep 16, 2020

The upstream patch is merged and released, and the fix has been verified in production by users who were experiencing the bug. We've bumped the versions we depend on so that we pick up the fix; closing.
