48% of WebSocket messages aren't being delivered #763

Closed · lsapan opened this issue Oct 12, 2017 · 16 comments

lsapan commented Oct 12, 2017

First off, great work with Channels, this is very exciting stuff!

I've just updated one of my sites that required browsers to poll the site every 10 seconds for the latest data. There are around 800-1200 users on the site at any given time. Instead of polling, I'm now using a single Group to send updates every time there is new data available (every 10 seconds).
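For reference, the group wiring is the standard Channels 1.x pattern, roughly like this (a minimal sketch; the group, module, and handler names are illustrative, not the actual project code):

```python
# consumers.py -- minimal Channels 1.x group wiring (illustrative names, not the actual project code)
from channels import Group


def ws_connect(message):
    # Accept the handshake and subscribe this client's reply channel to the broadcast group.
    message.reply_channel.send({"accept": True})
    Group("updates").add(message.reply_channel)


def ws_disconnect(message):
    # Drop the reply channel so the group doesn't keep dead channels around.
    Group("updates").discard(message.reply_channel)


# routing.py
from channels.routing import route

channel_routing = [
    route("websocket.connect", ws_connect),
    route("websocket.disconnect", ws_disconnect),
]
```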

It works, but I've noticed that many messages aren't actually making it down to the browser. I watched the Python process send messages for 500 seconds, and only 26 of the 50 were received by my browser. The socket did not have to reconnect at all during that time; it simply never received the other 24 messages. It's worth noting that when messages did arrive, they arrived instantly.

Here's what my setup looks like:

  • Everything is running inside of Docker on a single Ubuntu 16.04 LTS server.
  • I'm testing with Chrome 61.
  • I'm running Django 1.11.6, channels 1.1.8, asgi_redis 1.4.3, and celery 4.1.0 (more on that in a moment).
  • Here's a list of all the containers in the stack:
    • Nginx (proxies daphne and serves static files directly)
    • Daphne
    • Workers (3 of them, chosen admittedly arbitrarily)
    • Celery
    • Celerybeat
    • Redis
    • PostgreSQL

Nginx is configured to use 4 worker_processes with 1024 worker_connections each. Coupled with the fact that the WebSocket isn't actually being disconnected, I don't think that's the issue.

There are only two things that immediately come to mind:

  1. I'm actually sending the Group messages from within Celery. Every 10 seconds, celerybeat sends a task to Celery to fetch the latest data from an API and broadcast it to the Group. The only reason this is still on Celery is that I haven't changed it yet, but it should still work in theory, right? The process runs without errors every time. (A sketch of the task follows this list.)
  2. I'm using Redis as the channel layer, but it's also being used as Celery's broker and as Django's cache backend.
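A rough sketch of that task (the module, task, and schedule names are illustrative, and the API call is stubbed out):

```python
# tasks.py -- sketch of the every-10-seconds broadcast from point 1 (illustrative names)
import json

from celery import shared_task
from channels import Group


def fetch_latest_data_from_api():
    # Placeholder for the real API call described above.
    return {"updated": True}


@shared_task
def broadcast_latest_data():
    # celerybeat triggers this every 10 seconds; one Group.send fans it out to every client.
    data = fetch_latest_data_from_api()
    Group("updates").send({"text": json.dumps(data)})


# settings.py -- the matching beat entry (assuming the usual namespace="CELERY" Django setup)
CELERY_BEAT_SCHEDULE = {
    "broadcast-latest-data": {
        "task": "myapp.tasks.broadcast_latest_data",
        "schedule": 10.0,
    },
}
```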

Is there anything obvious here? Thanks in advance!

Update: I just performed the test on two machines at the same time. Some messages were delivered to both, some to just one, and some to neither. I think it's safe to say that Celery is always getting the Group message out, so I'm not sure where the disconnect is. One other thought that comes to mind: does the Group create a separate Redis entry for each user in the group per message? If so, I'm wondering if adding 1000 entries all at once is causing some to be expunged?
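For reference, the asgi_redis knobs that control how long undelivered messages and group memberships stick around look roughly like this (a settings sketch; the host and routing path are placeholders, and the values are the library defaults as far as I can tell):

```python
# settings.py -- channel layer sketch (placeholder host/routing; values are the asgi_redis defaults)
CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "asgi_redis.RedisChannelLayer",
        "ROUTING": "myproject.routing.channel_routing",
        "CONFIG": {
            "hosts": [("redis", 6379)],
            "expiry": 60,           # seconds an undelivered message survives before being dropped
            "capacity": 100,        # per-channel message capacity before sends start failing
            "group_expiry": 86400,  # seconds before an idle group membership expires
        },
    },
}
```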

andrewgodwin (Member) commented

Yes, that should all work - unfortunately, there's not exactly much I can do to help you without knowing exact steps to replicate (I just tried it here with two machines and it worked fine).

Have you tried connecting directly to Daphne and seeing if that works? I've heard of problems with nginx dropping parts of the WebSocket traffic if it's configured wrong.
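For reference, a WebSocket-friendly proxy block usually looks something like this (a sketch with placeholder upstream and host names; missing Upgrade/Connection headers or a short proxy_read_timeout are the usual culprits):

```nginx
# sketch of an nginx proxy block for Daphne (placeholder names)
upstream daphne {
    server daphne:8000;
}

server {
    listen 80;

    location / {
        proxy_pass http://daphne;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_read_timeout 86400;  # stop idle WebSockets being closed at the 60s default
    }
}
```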

lsapan (Author) commented Oct 12, 2017

Thanks for getting back so fast. Unfortunately I can't test against Daphne directly because it's a production server. I haven't replicated the issue locally, which leads me to believe it's related to the volume of connections.

I'll see if I can get a local test going with a high volume of WebSockets. In the meantime, is there any limit to how many connections a single Daphne process can handle? I found a (very old) comment somewhere where you said there should be one Daphne process per 100 connected clients. I'm just running a single Daphne container right now.

andrewgodwin (Member) commented

I don't have hard data on the local limits, unfortunately - I would maybe try increasing the number of Daphnes and seeing if that has any effect on the numbers you're seeing.

The Channels 2 architecture is entirely different in this regard, so if there is some silent failure going on it would likely surface as errors instead, but it's nowhere near ready for production yet.

lsapan (Author) commented Oct 12, 2017

Okay cool, good to know! I’ll try increasing it. For reference, does a daphne process use all of the available resources on the server? In other words, will adding more processes to the same server help, or do they need to be on additional servers?

andrewgodwin (Member) commented

They're single-core only (thanks to Python's GIL), so you should run at least one for each CPU core the machine has.

lsapan (Author) commented Oct 12, 2017

Sounds good, thanks! I'll try to create a replicable case locally if that doesn't fix it.

lsapan (Author) commented Oct 13, 2017

Thanks again for your help earlier! Unfortunately, scaling up the number of Daphne containers did not help. I did some testing locally and was able to replicate it even when connecting to Daphne directly.

I've created a sample project that demonstrates the issue so you can replicate it and check it out for yourself, it's over here: https://github.com/lsapan/channels-mass-broadcast

The README explains everything, but TL;DR:
Celery sends a random message every 10 seconds. The browser opens up 200 connections to Daphne, and console.logs every time it gets a message. This allows you to see how many are actually received.

It's literally two commands to get it running if you have Docker installed. Let me know if you need anything!
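If it helps, the same counting can be done from a script instead of the browser console. A rough harness along these lines works (a sketch using the third-party websockets package; the URI, connection count, and duration are placeholders):

```python
# count_ws.py -- sketch of a script-side counter (assumes the third-party "websockets" package;
# the URI, connection count, and duration below are placeholders, not the repo's actual values)
import asyncio

import websockets

URI = "ws://localhost:8000/"
CONNECTIONS = 200
RUN_SECONDS = 300


async def count_messages(uri, run_seconds):
    # Open one WebSocket and count the messages received within run_seconds.
    received = 0
    loop = asyncio.get_event_loop()
    deadline = loop.time() + run_seconds
    async with websockets.connect(uri) as ws:
        while True:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                await asyncio.wait_for(ws.recv(), timeout=remaining)
                received += 1
            except asyncio.TimeoutError:
                break
    return received


async def main():
    counts = await asyncio.gather(
        *(count_messages(URI, RUN_SECONDS) for _ in range(CONNECTIONS))
    )
    for i, count in enumerate(counts):
        print("connection %3d received %d messages" % (i, count))


if __name__ == "__main__":
    asyncio.get_event_loop().run_until_complete(main())
```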

Update: Also, just to prove that it's not an issue on the browser side of things (even though that many connections in one browser is an awful lot!), I set up this scenario:
Chrome: 250 connections
Chrome (Private Browsing): 250 connections
Safari: 1 connection

Here are the results:
[Screenshot: "screen shot 2017-10-12 at 10 21 32 pm" - received-message counts for the three browser windows]

Needless to say, Safari is the one in the middle, and it only received half the messages.

andrewgodwin (Member) commented

OK. I won't have time to sit down and repro this for a while, I'm afraid, but hopefully I'll get around to it eventually.

lsapan (Author) commented Oct 14, 2017

Totally understandable. Any chance I can pay you to take a look sooner?

andrewgodwin (Member) commented

Not sure given my current schedule, but shoot me an email at andrew@aeracode.org to discuss.

agronick commented

Just a guess, but maybe it's something to do with your message queue. You could edit the Daphne and Channels code to print a counter right before they send and receive each message and see whether the two counts are equal. If they aren't, then Redis (or whatever layer you're using) isn't passing them through.
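A less invasive version of that idea is to wrap the layer's send() at startup and log a running count. A sketch, assuming asgi_redis and the ASGI v1 send(channel, message) signature; you'd import this module from code that runs in both the Daphne and worker processes:

```python
# layer_counter.py -- sketch: count channel-layer sends without editing Daphne/Channels themselves.
# Assumes asgi_redis.RedisChannelLayer and the ASGI v1 send(channel, message) signature; if group
# fan-out doesn't pass through send() in your asgi_redis version, wrap send_group() the same way.
import itertools
import logging

from asgi_redis import RedisChannelLayer

logger = logging.getLogger("layer_counter")

_send_counter = itertools.count(1)
_original_send = RedisChannelLayer.send


def _counting_send(self, channel, message):
    logger.info("channel layer send #%d -> %s", next(_send_counter), channel)
    return _original_send(self, channel, message)


RedisChannelLayer.send = _counting_send
```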

lsapan (Author) commented Oct 17, 2017

@agronick yeah, that's what I'm thinking too. I actually haven't tried simply switching it to RabbitMQ yet; it's worth a shot.

lsapan (Author) commented Oct 17, 2017

@agronick @andrewgodwin well I'm beating myself up now! I meant to test with RabbitMQ and completely forgot, but it fixes it! 100% of messages are consistently being delivered now in my channels-mass-broadcast test repo!

The documentation should probably mention that asgi_redis will not work for higher volumes of channels in a single group. Super relieved to see this working with RabbitMQ, though!
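For anyone wanting to try the same swap, the change is roughly this in settings.py (a sketch; the AMQP URL and routing path are placeholders):

```python
# settings.py -- switching the channel layer to asgi_rabbitmq (sketch; URL and routing are placeholders)
CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "asgi_rabbitmq.RabbitmqChannelLayer",
        "ROUTING": "myproject.routing.channel_routing",
        "CONFIG": {
            "url": "amqp://guest:guest@rabbitmq:5672/%2F",
        },
    },
}
```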

lsapan closed this as completed Oct 17, 2017
andrewgodwin (Member) commented

Huh, weird - the Redis one is meant to be the one we recommend and maintain! Glad the RabbitMQ one works (and I'm sure Artem will be as well); this is why we provide options :)

andrewgodwin (Member) commented

I'll look into the Redis performance and fix it up another time - we don't use the Group stuff at work, only normal send/receive, which is probably why we didn't run into this.

lsapan (Author) commented Oct 17, 2017

@andrewgodwin well, I see why you recommend Redis - I rolled RabbitMQ into production and my site immediately fell apart with backpressure errors. I tried increasing the capacity, and while that worked for a minute or two, it was very slow and eventually still failed. I'm guessing this is a latency issue with asgi_rabbitmq?

In any case, I'll use something else for now until either asgi_redis is fixed or channels 2.0 is out. Thanks again!
