48% of WebSocket messages aren't being delivered #763
Yes, that should all work - unfortunately, there's not exactly much I can do to help you without knowing exact steps to replicate (I just tried it here with two machines and it worked fine). Have you tried connecting directly to Daphne and seeing if that works? I've heard of problems with nginx dropping parts of the socket packets if it's configured wrong.
Thanks for getting back so fast. Unfortunately I can’t test with Daphne directly because it’s a production server. I haven’t replicated the issue locally, which leads me to believe it’s the volume of the connections. I’ll see if I can get a local test going with a high volume of WebSockets. In the meantime, is there any limit with one Daphne process? I found a (very old) comment somewhere where you said there should be one Daphne process per 100 connected clients. I’m just running a single Daphne container right now.
I don't have hard data on the local limits, unfortunately - I would maybe try increasing the number of Daphnes and seeing if that has any effect on the number you're seeing. The Channels 2 architecture is entirely changed in this regard, so if there is some silent failure going on it would likely be raising errors instead, but it's nowhere near ready for prod yet.
Okay cool, good to know! I’ll try increasing it. For reference, does a Daphne process use all of the available resources on the server? In other words, will adding more processes to the same server help, or do they need to be on additional servers?
They're only single-core because Python, so you should run at least one for each CPU core the machine has.
Sounds good, thanks! I'll try to create a replicable case locally if that doesn't fix it.
Thanks again for your help earlier! Unfortunately, scaling up the number of Daphne containers did not help. I did some testing locally and was able to replicate it even when connecting to Daphne directly. I've created a sample project that demonstrates the issue so you can replicate it and check it out for yourself, it's over here: https://github.com/lsapan/channels-mass-broadcast

The README explains everything, but TL;DR: it's literally two commands to get it running if you have Docker installed. Let me know if you need anything!

Update: Also, just to prove that it's not an issue on the browser side of things (because it is an awful lot!), I set up this scenario:

Needless to say, Safari is the one in the middle, and it only received half the messages.
OK. I won't have time to sit down and repro this for a while, I'm afraid, but hopefully I'll get around to it eventually.
Totally understandable. Any chance I can pay you to take a look sooner?
Not sure given my current schedule, but shoot me an email at andrew@aeracode.org to discuss.
Just a guess, but maybe it is something to do with your message queue. You could edit the Daphne and Channels code to print a counter right before they send and receive a message and see if both are equal. If they aren't, then Redis or whatever isn't passing them through.
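Something like this would be a rough illustration of the counting idea - it wraps the layer at the application level rather than editing the Daphne/Channels source, and the module name is made up:

```python
# count_sends.py - a rough sketch of the counter idea (module name made up).
# It only instruments the sending side; the receiving side lives inside
# Daphne, which would need a similar print added around its channel-layer
# receive call. Import this once at worker startup.
import itertools

from channels import Group

_sent = itertools.count(1)
_original_send = Group.send


def _counted_send(self, content, *args, **kwargs):
    # Print a running total so it can be compared against what Daphne
    # actually picks up and pushes down to the browsers.
    print("Group.send #%d to group %r" % (next(_sent), self.name))
    return _original_send(self, content, *args, **kwargs)


Group.send = _counted_send
```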
@agronick yeah, that's what I'm thinking too. I actually haven't tried simply switching it to RabbitMQ yet - it's worth a shot.
@agronick @andrewgodwin well, I'm beating myself up now! I meant to test with RabbitMQ and completely forgot, but it fixes it! 100% of messages are consistently being delivered now in my channels-mass-broadcast test repo! The documentation should probably mention that asgi_redis will not work for higher volumes of channels in a single group. Super relieved to see this working with rabbit though!
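For anyone else who wants to try the same swap, the change was just the channel-layer backend in settings - roughly this, with the AMQP URL and routing path as placeholders rather than my real values:

```python
# settings.py - channel layer swapped from asgi_redis to asgi_rabbitmq.
# The AMQP URL and routing path below are placeholders, not real project values.
CHANNEL_LAYERS = {
    "default": {
        "BACKEND": "asgi_rabbitmq.RabbitmqChannelLayer",
        "CONFIG": {
            "url": "amqp://guest:guest@rabbitmq:5672/%2F",
        },
        "ROUTING": "myproject.routing.channel_routing",
    },
}
```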
Huh, weird, the redis one is meant to be the one we recommend and maintain! Glad the rabbit one works (and I'm sure Artem will be as well) - this is why we provide options :)
I'll look into the Redis performance and fix it up another time - we don't use the Group stuff at work, only normal send/receive, which is probably why we didn't run into this.
@andrewgodwin well, I see why you recommend redis - I rolled rabbit into production and my site immediately fell apart with backpressure errors. I tried increasing the capacity, and while that worked for a minute or two, it was very slow and eventually still failed. I'm guessing this is a latency issue with asgi_rabbitmq? In any case, I'll use something else for now until either asgi_redis is fixed or Channels 2.0 is out. Thanks again!
First off, great work with Channels, this is very exciting stuff!
I've just updated one of my sites that required browsers to poll the site every 10 seconds for the latest data. There are around 800-1200 users on the site at any given time. Instead of polling, I'm now using a single `Group` to send updates every time there is new data available (every 10 seconds). It works, but I've noticed that many messages aren't actually making it down to the browser. I watched the Python process send messages for 500 seconds, and only 26/50 were received by my browser. The socket did not have to reconnect at all during that time, it simply didn't receive the other 24 messages. It's worth noting that when the messages did deliver, it was instantaneous.
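The group membership side is essentially the stock Channels 1.x pattern - roughly the sketch below, where the "updates" group name and module layout are illustrative rather than my exact code (the actual sending happens from celery, shown further down):

```python
# consumers.py - simplified sketch of the single-Group setup; the "updates"
# group name is illustrative only.
from channels import Group


def ws_connect(message):
    # Accept the WebSocket handshake and subscribe the client to the group.
    message.reply_channel.send({"accept": True})
    Group("updates").add(message.reply_channel)


def ws_disconnect(message):
    # Drop the reply channel from the group when the socket closes.
    Group("updates").discard(message.reply_channel)


# routing.py - wires the handlers up (module path is a placeholder).
# from channels.routing import route
# from myapp import consumers
#
# channel_routing = [
#     route("websocket.connect", consumers.ws_connect),
#     route("websocket.disconnect", consumers.ws_disconnect),
# ]
```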
Here's what my setup looks like:
Nginx is configured to use 4 worker_processes with 1024 worker_connections each. Coupled with the fact the websocket isn't actually getting disconnected, I don't think that's the issue.
There are only two things that immediately come to mind: `Group` messages from within celery. Every 10 seconds, celerybeat sends a task to celery to get the latest data from an API and broadcast it to the `Group`. The only reason this is still on celery is just because I haven't changed it yet, but it should still theoretically work, right? The process runs without errors every time.

Is there anything obvious here? Thanks in advance!
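For concreteness, the celery side has roughly this shape - a simplified sketch, where the task name, API URL, and group name are placeholders rather than the real project code:

```python
# tasks.py - sketch of the periodic broadcast described above; the task name,
# API URL, and "updates" group name are placeholders, not the real code.
import json

import requests
from celery import shared_task
from channels import Group


@shared_task
def broadcast_latest_data():
    # celerybeat schedules this every 10 seconds.
    data = requests.get("https://example.com/api/latest/").json()
    # A single group send; Channels should fan it out to every connected client.
    Group("updates").send({"text": json.dumps(data)})
```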
Update: I just performed the test on two machines at the same time. Some messages delivered to both, some delivered to just one, and some weren't delivered to either. I think it's safe to say that celery is always getting the `Group` message out, so I'm not sure where the disconnect is. One other thought that comes to mind: is the `Group` creating a separate redis entry for each user in the group per message? If so, I'm wondering if adding 1000 entries all at once is causing some to be expunged?