48% of WebSocket messages aren't being delivered #763

Closed · lsapan opened this issue Oct 12, 2017 · 16 comments

lsapan commented Oct 12, 2017

First off, great work with Channels, this is very exciting stuff!

I've just updated one of my sites that required browsers to poll the site every 10 seconds for the latest data. There are around 800-1200 users on the site at any given time. Instead of polling, I'm now using a single Group to send updates every time there is new data available (every 10 seconds).
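For reference, the group wiring is the standard Channels 1.x pattern, roughly like this (a minimal sketch; the group, module, and handler names are illustrative, not the actual project code):

```python
# consumers.py -- minimal Channels 1.x group wiring (illustrative names, not the actual project code)
from channels import Group


def ws_connect(message):
    # Accept the handshake and subscribe this client's reply channel to the broadcast group.
    message.reply_channel.send({"accept": True})
    Group("updates").add(message.reply_channel)


def ws_disconnect(message):
    # Drop the reply channel so the group doesn't keep dead channels around.
    Group("updates").discard(message.reply_channel)


# routing.py
from channels.routing import route

channel_routing = [
    route("websocket.connect", ws_connect),
    route("websocket.disconnect", ws_disconnect),
]
```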

It works, but I've noticed that many messages aren't actually making it down to the browser. I watched the Python process send messages for 500 seconds, and only 26 of the 50 were received by my browser. The socket did not have to reconnect at all during that time; it simply never received the other 24 messages. It's worth noting that when messages did arrive, they arrived instantly.

Here's what my setup looks like:

  • Everything is running inside of Docker on a single Ubuntu 16.04 LTS server.
  • I'm testing with Chrome 61.
  • I'm running Django 1.11.6, channels 1.1.8, asgi_redis 1.4.3, and celery 4.1.0 (more on that in a moment).
  • Here's a list of all the containers in the stack:
    • Nginx (proxies daphne and serves static files directly)
    • Daphne
    • Workers (3 of them, chosen admittedly arbitrarily)
    • Celery
    • Celerybeat
    • Redis
    • PostgreSQL

Nginx is configured to use 4 worker_processes with 1024 worker_connections each. Coupled with the fact that the WebSocket isn't actually being disconnected, I don't think that's the issue.

There are only two things that immediately come to mind:

  1. I'm actually sending the Group messages from within Celery. Every 10 seconds, celerybeat sends a task to Celery to fetch the latest data from an API and broadcast it to the Group. The only reason this is still on Celery is that I haven't changed it yet, but it should still work in theory, right? The process runs without errors every time. (A sketch of the task follows this list.)
  2. I'm using Redis as the channel layer, but it's also being used as Celery's broker and as Django's cache backend.
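A rough sketch of that task (the module, task, and schedule names are illustrative, and the API call is stubbed out):

```python
# tasks.py -- sketch of the every-10-seconds broadcast from point 1 (illustrative names)
import json

from celery import shared_task
from channels import Group


def fetch_latest_data_from_api():
    # Placeholder for the real API call described above.
    return {"updated": True}


@shared_task
def broadcast_latest_data():
    # celerybeat triggers this every 10 seconds; one Group.send fans it out to every client.
    data = fetch_latest_data_from_api()
    Group("updates").send({"text": json.dumps(data)})


# settings.py -- the matching beat entry (assuming the usual namespace="CELERY" Django setup)
CELERY_BEAT_SCHEDULE = {
    "broadcast-latest-data": {
        "task": "myapp.tasks.broadcast_latest_data",
        "schedule": 10.0,
    },
}
```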

Is there anything obvious here? Thanks in advance!

Update: I just performed the test on two machines at the same time. Some messages were delivered to both, some to just one, and some to neither. I think it's safe to say that Celery is always getting the Group message out, so I'm not sure where the disconnect is. One other thought that comes to mind: does the Group create a separate Redis entry for each user in the group per message? If so, I'm wondering if adding 1000 entries all at once is causing some to be expunged?
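For reference, the asgi_redis knobs that control how long undelivered messages and group memberships stick around look roughly like this (a settings sketch; the host and routing path are placeholders, and the values are the library defaults as far as I can tell):

```python
# settings.py -- channel layer sketch (placeholder host/routing; values are the asgi_redis defaults)
CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "asgi_redis.RedisChannelLayer",
        "ROUTING": "myproject.routing.channel_routing",
        "CONFIG": {
            "hosts": [("redis", 6379)],
            "expiry": 60,           # seconds an undelivered message survives before being dropped
            "capacity": 100,        # per-channel message capacity before sends start failing
            "group_expiry": 86400,  # seconds before an idle group membership expires
        },
    },
}
```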

andrewgodwin (Member) commented

Yes, that should all work - unfortunately, there's not exactly much I can do to help you without knowing exact steps to replicate (I just tried it here with two machines and it worked fine).

Have you tried connecting directly to Daphne and seeing if that works? I've heard of problems with nginx dropping parts of the WebSocket traffic if it's configured wrong.
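For reference, a WebSocket-friendly proxy block usually looks something like this (a sketch with placeholder upstream and host names; missing Upgrade/Connection headers or a short proxy_read_timeout are the usual culprits):

```nginx
# sketch of an nginx proxy block for Daphne (placeholder names)
upstream daphne {
    server daphne:8000;
}

server {
    listen 80;

    location / {
        proxy_pass http://daphne;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
        proxy_read_timeout 86400;  # stop idle WebSockets being closed at the 60s default
    }
}
```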

lsapan (Author) commented Oct 12, 2017

Thanks for getting back so fast. Unfortunately I can't test against Daphne directly because it's a production server. I haven't replicated the issue locally, which leads me to believe it's related to the volume of connections.

I'll see if I can get a local test going with a high volume of WebSockets. In the meantime, is there any limit to how many connections a single Daphne process can handle? I found a (very old) comment somewhere where you said there should be one Daphne process per 100 connected clients. I'm just running a single Daphne container right now.

andrewgodwin (Member) commented

I don't have hard data on the local limits, unfortunately - I would maybe try increasing the number of Daphnes and seeing if that has any effect on the numbers you're seeing.

The Channels 2 architecture is entirely different in this regard, so if there is some silent failure going on it would likely surface as errors instead, but it's nowhere near ready for production yet.

lsapan (Author) commented Oct 12, 2017

Okay cool, good to know! I’ll try increasing it. For reference, does a daphne process use all of the available resources on the server? In other words, will adding more processes to the same server help, or do they need to be on additional servers?

andrewgodwin (Member) commented

They're single-core only (thanks to Python's GIL), so you should run at least one for each CPU core the machine has.

lsapan (Author) commented Oct 12, 2017

Sounds good, thanks! I'll try to create a replicable case locally if that doesn't fix it.

lsapan (Author) commented Oct 13, 2017

Thanks again for your help earlier! Unfortunately, scaling up the number of Daphne containers did not help. I did some testing locally and was able to replicate it even when connecting to Daphne directly.

I've created a sample project that demonstrates the issue so you can replicate it and check it out for yourself, it's over here: https://github.com/lsapan/channels-mass-broadcast

The README explains everything, but TL;DR:
Celery sends a random message every 10 seconds. The browser opens up 200 connections to Daphne, and console.logs every time it gets a message. This allows you to see how many are actually received.

It's literally two commands to get it running if you have Docker installed. Let me know if you need anything!
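If it helps, the same counting can be done from a script instead of the browser console. A rough harness along these lines works (a sketch using the third-party websockets package; the URI, connection count, and duration are placeholders):

```python
# count_ws.py -- sketch of a script-side counter (assumes the third-party "websockets" package;
# the URI, connection count, and duration below are placeholders, not the repo's actual values)
import asyncio

import websockets

URI = "ws://localhost:8000/"
CONNECTIONS = 200
RUN_SECONDS = 300


async def count_messages(uri, run_seconds):
    # Open one WebSocket and count the messages received within run_seconds.
    received = 0
    loop = asyncio.get_event_loop()
    deadline = loop.time() + run_seconds
    async with websockets.connect(uri) as ws:
        while True:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                await asyncio.wait_for(ws.recv(), timeout=remaining)
                received += 1
            except asyncio.TimeoutError:
                break
    return received


async def main():
    counts = await asyncio.gather(
        *(count_messages(URI, RUN_SECONDS) for _ in range(CONNECTIONS))
    )
    for i, count in enumerate(counts):
        print("connection %3d received %d messages" % (i, count))


if __name__ == "__main__":
    asyncio.get_event_loop().run_until_complete(main())
```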

Update: Also, just to prove that it's not an issue on the browser side of things (even though that many connections in one browser is an awful lot!), I set up this scenario:
Chrome: 250 connections
Chrome (Private Browsing): 250 connections
Safari: 1 connection

Here are the results:
[Screenshot: "screen shot 2017-10-12 at 10 21 32 pm" - received-message counts for the three browser windows]

Needless to say, Safari is the one in the middle, and it only received half the messages.

andrewgodwin (Member) commented

OK. I won't have time to sit down and repro this for a while, I'm afraid, but hopefully I'll get around to it eventually.

lsapan (Author) commented Oct 14, 2017

Totally understandable. Any chance I can pay you to take a look sooner?

andrewgodwin (Member) commented

Not sure given my current schedule, but shoot me an email at andrew@aeracode.org to discuss.

agronick commented

Just a guess, but maybe it's something to do with your message queue. You could edit the Daphne and Channels code to print a counter right before they send and receive each message and see whether the two counts are equal. If they aren't, then Redis (or whatever layer you're using) isn't passing them through.
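A less invasive version of that idea is to wrap the layer's send() at startup and log a running count. A sketch, assuming asgi_redis and the ASGI v1 send(channel, message) signature; you'd import this module from code that runs in both the Daphne and worker processes:

```python
# layer_counter.py -- sketch: count channel-layer sends without editing Daphne/Channels themselves.
# Assumes asgi_redis.RedisChannelLayer and the ASGI v1 send(channel, message) signature; if group
# fan-out doesn't pass through send() in your asgi_redis version, wrap send_group() the same way.
import itertools
import logging

from asgi_redis import RedisChannelLayer

logger = logging.getLogger("layer_counter")

_send_counter = itertools.count(1)
_original_send = RedisChannelLayer.send


def _counting_send(self, channel, message):
    logger.info("channel layer send #%d -> %s", next(_send_counter), channel)
    return _original_send(self, channel, message)


RedisChannelLayer.send = _counting_send
```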

lsapan (Author) commented Oct 17, 2017

@agronick yeah, that's what I'm thinking too. I actually haven't tried simply switching it to RabbitMQ yet; it's worth a shot.

lsapan (Author) commented Oct 17, 2017

@agronick @andrewgodwin well I'm beating myself up now! I meant to test with RabbitMQ and completely forgot, but it fixes it! 100% of messages are consistently being delivered now in my channels-mass-broadcast test repo!

The documentation should probably mention that asgi_redis will not work for higher volumes of channels in a single group. Super relieved to see this working with RabbitMQ, though!
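For anyone wanting to try the same swap, the change is roughly this in settings.py (a sketch; the AMQP URL and routing path are placeholders):

```python
# settings.py -- switching the channel layer to asgi_rabbitmq (sketch; URL and routing are placeholders)
CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "asgi_rabbitmq.RabbitmqChannelLayer",
        "ROUTING": "myproject.routing.channel_routing",
        "CONFIG": {
            "url": "amqp://guest:guest@rabbitmq:5672/%2F",
        },
    },
}
```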

lsapan closed this as completed Oct 17, 2017
andrewgodwin (Member) commented

Huh, weird - the Redis one is meant to be the one we recommend and maintain! Glad the RabbitMQ one works (and I'm sure Artem will be as well); this is why we provide options :)

andrewgodwin (Member) commented

I'll look into the Redis performance and fix it up another time - we don't use the Group stuff at work, only normal send/receive, which is probably why we didn't run into this.

lsapan (Author) commented Oct 17, 2017

@andrewgodwin well, I see why you recommend Redis - I rolled RabbitMQ into production and my site immediately fell apart with backpressure errors. I tried increasing the capacity, and while that worked for a minute or two, it was very slow and eventually still failed. I'm guessing this is a latency issue with asgi_rabbitmq?

In any case, I'll use something else for now until either asgi_redis is fixed or channels 2.0 is out. Thanks again!
