
new connected channels may fail to receive message from group_send() #1244

Open · DrChai opened this issue Feb 8, 2019 · 35 comments


DrChai commented Feb 8, 2019

It looks like some channels occasionally fail to receive messages sent with group_send(), even though they all connected to our server successfully. Disconnecting and reconnecting may resolve the issue.
Here is a sample:

Two clients connected via WebSocket and joined the same group, partner_orders_1531:

127.0.0.1:6379[1]> zrange asgi::group:partner_orders_1531 0 -1 withscores
1) "specific.mrOZIufu!GAaMMeISqdry"
2) "1549585254.4294083"
3) "specific.qUSiCEQD!TGlphabkczyd"
4) "1549585307.3211055"
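The consumers would have joined that group in connect(), roughly along these lines; this is only a sketch, since the original connect() code isn't shown here, and the hard-coded group name is illustrative:

from asgiref.sync import async_to_sync
from channels.generic.websocket import WebsocketConsumer

class PartnerOrderConsumer(WebsocketConsumer):
    def connect(self):
        self.group_name = "partner_orders_1531"  # illustrative; derived per partner in practice
        async_to_sync(self.channel_layer.group_add)(self.group_name, self.channel_name)
        self.accept()

    def disconnect(self, code):
        async_to_sync(self.channel_layer.group_discard)(self.group_name, self.channel_name)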

Then we call channel_layer.group_send() via async_to_sync():

from asgiref.sync import async_to_sync
from channels.layers import get_channel_layer

channel_layer = get_channel_layer()
async_to_sync(channel_layer.group_send)(partner_order_group_name, order_data)

but only the channel with channel name qUSiCEQD!TGlphabkczyd received the message and triggered:

class PartnerOrderConsumer(WebsocketConsumer):
    ...
    def order_receive(self, event):
        self.send(text_data=event["text"])

Related Redis log: (screenshot attached to the original issue)

Pip & Environment:
Python 3.6.6
channels==2.1.5
channels-redis==2.3.1
daphne==2.2.3
Django==1.11.10
The server is hosted on Ubuntu 16.04.5 LTS (2 vCPUs / 4 GB memory) on DigitalOcean. We run two Daphne instances behind nginx for load balancing.


xuehaoze commented Feb 8, 2019

same issue here, please help, thanks!

bastbnl (Contributor) commented Feb 9, 2019

I noticed the same behaviour a couple of weeks ago. My setup was python manage.py runserver behind nginx for SSL offloading. It did not appear to happen when the websocket connection was established to the runserver process without nginx in front.

It did not seem to happen when the software was running via daphne, but I'll have to check in more detail to be sure.

@qeternity

We experience this issue as well. At first we thought it was an issue with the channels-redis backend, so we moved over to a rabbitmq backend, but this has not solved the problem. It seems that Daphne workers time out to their backend and stop receiving subscribed group_send messages.

@carltongibson (Member)

It seems that Daphne workers time out to their backend and stop receiving subscribed group_send messages

So, can we pin that down to a specific issue with Daphne? (If so a ticket there with a reproduce would be wonderful.) (If not some kind of reproduce would be helpful...)

@carltongibson (Member)

...behind nginx for SSL offloading.

So is it an nginx config issue?

@carltongibson (Member)

This isn't really actionable as is: there's nothing pointing to channels. Likely the issue is elsewhere in the stack.

I'm going to close for now. If someone can put together a reproduce showing channels (or Daphne) is at fault, I'm happy to re-open.

@qeternity

@carltongibson I'm not sure why you'd close it. Surely the fact that there's nothing actionable means that it warrants further investigation. It's not an NGINX issue, because simply issuing a reload (kill -HUP) to daphne immediately fixes the problem. Also, anything that isn't group_send based (i.e. application-level ping-pong) still works just fine.

@carltongibson (Member)

It's not an addressable issue as it stands. Is it DigitalOcean? Is it Nginx? Is it Daphne? Is it Channels? If you read the comments here, there is nothing in common... so pending a reproduce there's nothing to do. We don't have capacity to chase down phantom issues with no details. needsinfo is a legitimate closing reason.

If someone can put together a reproduce showing channels (or Daphne) is at fault, I'm happy to re-open.


qeternity commented Mar 26, 2019

If you read the comments here, there is nothing in common

Channels/Daphne/group_send are the common denominator.

@carltongibson (Member)

It did not seem to happen when the software was running via daphne

As I said, I'm very happy to investigate if anyone can provide a reproduce.


qeternity commented Mar 30, 2019

So it would seem I owe @carltongibson an apology. After further debugging, this appears to be an odd behavior with nginx buffering.

@DrChai @xuehaoze try adding "proxy_buffering off;" to the nginx config block that handles WebSocket proxying?

EDIT: see below, this does not fix the problem.

@carltongibson (Member)

@qeternity Thanks for the follow-up and the pointer. Glad you made progress!

@xuehaoze

@qeternity @carltongibson Thanks for your reply, we will give it a try!

@qeternity

After running this for a few days in production, this does NOT resolve the issue. group_send is still broken, but everything else works. Restarting daphne is the only solution. I have tried both redis and rabbitmq backends, so this suggests the problem is internal to daphne/channels.


mkoponen commented Apr 3, 2019

@qeternity I have to upgrade an old project to this right now, and this issue is going to be a problem for me, as I'm using this same setup. Now, here's something from layers.py that concerns me, in async def group_send(self, group, message):

for channel in self.groups.get(group, set()):
    try:
        await self.send(channel, message)
    except ChannelFull:
        pass

There is no logging in self.send if the channel is full, either. It just raises the exception, which this code then ignores. If this condition happens, you'll never know it.

So, my first thought is that the group messages are never completely cleaned up from the channel; possibly only in some specific use case, or possibly always. They'd get handled by the worker, but not removed from the channel. This would explain why it takes time for the scenario to happen, because it takes time for the channel to fill up.

So, if you already have a setup where you can make this bug happen, could you add a debug line instead of "pass" on the final line, or perhaps just re-raise the exception? If we know the problem is this, then the search for the bug should be much easier. Also, if this is the reason, I'd expect raising the frequency of group_send calls in your test code to make it happen faster, unless the problem is caused by a group_send happening simultaneously with some other method call, and only then leaving the channel with one slot less space in it.

This is all pure guesswork. This is the first time I have ever even looked at this project's code.
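As a minimal sketch of the change suggested above, one could subclass the in-memory layer and log a dropped message instead of swallowing ChannelFull (the class and logger names are arbitrary, purely for illustration):

import logging

from channels.exceptions import ChannelFull
from channels.layers import InMemoryChannelLayer

logger = logging.getLogger(__name__)

class LoggingInMemoryChannelLayer(InMemoryChannelLayer):
    # Same loop as quoted above, but a dropped message is reported.
    async def group_send(self, group, message):
        for channel in self.groups.get(group, set()):
            try:
                await self.send(channel, message)
            except ChannelFull:
                logger.warning(
                    "group_send(%r): dropping message, channel %r is at capacity",
                    group,
                    channel,
                )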

@carltongibson (Member)

Hi @mkoponen. That's interesting. (Gives an avenue to work on at least...) I have some time to work on Channels at DjangoCon Europe next week, so I'll re-open to investigate. (If anyone wants to jump in before then, please do! 🙂)

carltongibson reopened this Apr 4, 2019
@qeternity

@mkoponen @carltongibson I'll debug this week, but we've used two separate backends (redis and rabbitmq) and neither has an implementation like that. That's for the InMemory layer, which isn't what we're using.


mkoponen commented Apr 8, 2019

Ah, so it is. My next question, then, is: does this condition ever return true in the problem case?

if redis.call('LLEN', KEYS[i]) < tonumber(ARGV[i + #KEYS]) then

I don't understand all the concepts well enough yet to see what this line means, but I do notice that again the method would just fail silently if it didn't.

The first goal should be to figure out whether, when the problem starts, the failure is in pushing the information into Redis or in receiving it back.

EDIT: Another suggestion: could there be a scenario in which the two awaits at the top, which seek to remove expired channels, simply await forever?

EDIT2: I see. This is from the "simple" send() method:

# Check the length of the list before send
# This can allow the list to leak slightly over capacity, but that's fine.
if await connection.llen(channel_key) >= self.get_capacity(channel):
    raise ChannelFull()

Capacity is also given as an argument to the Lua code, so it is clearly trying to do the same check for exceeded capacity, only it does it in Lua, whereas send() does it in Python. But I'm back to my original hypothesis: the channel gets full, and send starts failing silently.

EDIT3: Further hypothesis: group messages get cleaned up only through expiry, and never through successful reception and handling of the message. Users who send them very infrequently never notice the problem, because the slow expiry is enough to leave room for the new message whenever they need it. Those who send them frequently hit capacity.
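A rough, external way to test this capacity hypothesis is to ask Redis how full each channel list is. This sketch assumes the default channel capacity of 100 and the "asgi:" key prefix visible in the zrange output earlier; host, db, prefix and capacity are assumptions to adjust for your deployment:

import redis

client = redis.Redis(host="127.0.0.1", port=6379, db=1)

# Walk every channel-layer key and report how many messages each list holds.
for key in client.scan_iter(match="asgi:*"):
    if client.type(key) == b"list":
        length = client.llen(key)
        marker = "  <-- at/over the assumed capacity" if length >= 100 else ""
        print(f"{key.decode()}: {length} pending message(s){marker}")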


xiaoqiao99 commented Apr 22, 2019

@qeternity
Excuse me, has it been solved now?

@qeternity

This problem is still not resolved. We are using the rabbitmq backend. Our server locked up again, and when I went into the rabbitmq admin panel I saw that the queues all had 100 messages waiting. This is definitely the problem. It would appear that the asgi websocket workers disconnect from the queue and stop receiving messages, which then pile up to the channel limit.

@carltongibson (Member)

We are using the rabbitmq backend

So it's not redis... — still looking for a decent reproduce.

I need to get DRF v3.10 out and then I'll have some time to dig into this, and other channels issues.
Following up @mkoponen's lead seems most promising currently. If anyone has time to look, that'd be great.

@qeternity

Indeed, as I wrote above, we switched backends attempting to mitigate this but it had no impact.

@maurice-g

Was anyone able to make any progress on this? Or find a way to reproduce it reliably? Having the same issue.

@Kavarakis

I am having the same issue. I've tried creating a different test app with the same environment specs and everything works, but my current app fails to send the correct messages to Redis via channel layers.

Any update?

@carltongibson (Member)

This still needs digging into, I'm afraid.

I'm currently looking into issues around AsyncHttpConsumer (in the time I have 😝). Maybe this will turn out to be related to that. If not this is on my list for afterwards.

If someone could pin it down to an example project that reliably reproduced the issue, that would be more than half way to a solution.

@Kavarakis

Hi guys, I've managed to fix my issue. It was rather simple: I needed to open my WebSocket connection from the frontend app. If I don't open the WS connection, so Django never receives the connection and completes the handshake, then Channels and Redis cannot send any messages via channel layers (tested and confirmed in my case). I don't quite see the logic behind it, but this removed the problem in my case.

I hope this will help someone.

Cheers!

@maurice-g

Seems like I've been looking for the culprit in the wrong place as well, at least for some cases.

For the Googlers coming here: the issue was that there wasn't actually a WebSocket connection anymore. We did listen to close and error events on the WebSocket connection in the frontend, and reacted by automatically trying to reconnect. Thus, I assumed, there would always be an active WS connection to Django Channels. It turns out that in some cases (such as just disconnecting your wifi) no close or error event will be thrown (tested in Chrome 77) and the browser will happily report the connection as active forever. The backend will close the connection from its side after ~50s (no idea where that number comes from: Channels / Daphne / load balancer?) and therefore also send no further notifications to that channel.
The solution for us was implementing manual heartbeat messages from the frontend to the backend every few seconds. Contrary to my understanding of WebSockets, this is not handled by the browser (whatever use the protocol-level Ping/Pong has, then...).
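A minimal sketch of the backend half of such a heartbeat, assuming the frontend sends a small JSON message like {"type": "ping"} every few seconds (e.g. from a setInterval) and reconnects if the pongs stop coming back; the consumer and the message shape are illustrative, not any particular Channels API:

import json

from channels.generic.websocket import WebsocketConsumer

class HeartbeatConsumer(WebsocketConsumer):
    def receive(self, text_data=None, bytes_data=None):
        data = json.loads(text_data)
        if data.get("type") == "ping":
            # Answer the application-level heartbeat; if the client stops
            # receiving pongs, it should close the socket and reconnect.
            self.send(text_data=json.dumps({"type": "pong"}))
            return
        # ... handle normal application messages here ...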

@simkimsia

@maurice-g Sorry, I'm a bit new to WebSockets. How do you implement a manual heartbeat from the frontend to the backend?

I am using the following:

celery[redis]==4.3.0
channels==2.2.0  
channels-redis==2.4.0


tarikki commented Nov 9, 2019

I faced this issue and found two culprits so far:

  1. Client silently disconnects
    If you're using django-channels to communicate with a websocket running in a browser, then this is the most likely cause. I noticed that if my users lock their device and come back to my web app after a certain amount of time has passed, then the connection will not be there anymore, but it will not have thrown any errors either. The solution is to do what @maurice-g suggested and have a manual heartbeat between the browser and the backend.

  2. group_send in channels_redis silently failing if channel over capacity
    Following the lead from @mkoponen, I noticed that when you use group_send() with channels_redis as the backend, there is the potential for group_send() to silently fail. When using group_send, before sending the current message the Lua script will check the number of messages in the channel. If the channel is not yet at full capacity, it will send the message. However, if the channel has hit full capacity then this channel will be skipped silently. I wrote a PR to fix this so the number of channels that are at full capacity will be logged, which will help with debugging.


sabuhish commented Feb 8, 2020

Any solution? I am still facing the same issue; I've been stuck for almost three days. I think async_to_sync creates a new connection each time it is called... is there any alternative?


kmiku7 commented Apr 10, 2020

I once encountered a similar problem: the client side received nothing from the WebSocket after a period of time. After debugging, my problem turned out to be caused by a Python 3 bug (https://github.com/python/cpython/blob/v3.6.1/Lib/asyncio/locks.py#L163): asyncio did not handle exceptions properly when acquiring a lock, which left other waiters deadlocked. Upgrading my local Python fixed it.
Hope it is helpful for you.

@hamedkalantari

We encountered this problem while developing a real-time game. As far as I can remember, a client can sometimes disconnect from the WebSocket without the disconnect event ever being called. I don't think the problem is about Django Channels, because we were using Flask; it's more about routers and switches, since intermediate routers and switches may drop a connection for being idle, and the WebSocket will not call its disconnect event. So we deployed a PING/PONG messaging system, which solved the problem for us. It's worth saying that we used a random data generator, because plain PING or PONG messages were too light.

That said, the problem @mkoponen mentioned is a serious one.


acu192 commented Apr 17, 2021

I'm seeing this issue in my app. I can confirm in my case:

  1. It is not a silent disconnect of the front-end.
  2. It happens only on group_send ... but just sometimes. The same code-path succeeds most of the time but fails occasionally (fail case = messages sent with group_send are not actually sent).
  3. It never fails when sending directly through the channel layer (i.e. self.channel_layer.send(...)).

There's a common situation in my app where a client will send two messages to the server. One message is a broadcast (thus the server does self.channel_layer.group_send) and the other is intended for a specific client (thus the server does self.channel_layer.send). Most of the time this works. But occasionally the broadcast message is silently dropped. This is the scenario that leads to conclusions 1-3 in the list above. The broadcast message is sent first, which is how I know it's not a silent disconnect on the front-end (because the front-end that gets the second message always gets it, but it doesn't always get the broadcast message that was sent first).

I'm using redis. Although, others here have stated they see this problem no matter what channel layer they use. I haven't tried other channel layers myself. I'm just digging into this right now...

Channels 3.0.1
Python 3.7
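For context, a minimal sketch of the two code paths described above, in Channels 3.x style; the group name, the event type and the "target_channel" field are hypothetical:

import json

from channels.generic.websocket import AsyncWebsocketConsumer

class DemoConsumer(AsyncWebsocketConsumer):
    group_name = "demo_room"  # hypothetical group

    async def connect(self):
        await self.channel_layer.group_add(self.group_name, self.channel_name)
        await self.accept()

    async def disconnect(self, code):
        await self.channel_layer.group_discard(self.group_name, self.channel_name)

    async def receive(self, text_data=None, bytes_data=None):
        payload = json.loads(text_data)
        if payload.get("broadcast"):
            # Point 2 above: broadcast via group_send, the path that
            # occasionally drops messages in this report.
            await self.channel_layer.group_send(
                self.group_name, {"type": "demo.message", "text": text_data}
            )
        else:
            # Point 3 above: direct send to one channel, reported as reliable.
            await self.channel_layer.send(
                payload["target_channel"], {"type": "demo.message", "text": text_data}
            )

    async def demo_message(self, event):
        await self.send(text_data=event["text"])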

@itsmehdi97

I'm using uvicorn to serve my app and still having this issue.
In my case the problem is totally gone when I use only one uvicorn worker, but the more workers I use, the more messages are dropped.
