Managing "presence" with channel expirations / disconnects that don't fire "websocket.disconnect" #293

Closed
yourcelf opened this issue Aug 10, 2016 · 5 comments


@yourcelf

yourcelf commented Aug 10, 2016

We're using django-channels with an application that needs to maintain a list of users that are currently "present" in a chat room. To do this, we create a Connection model that ties a websocket channel to a Room:

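from django.contrib.auth.models import User
from django.db import models

class Room(models.Model):
    # Name of the channel group that this room's connections belong to.
    channel_name = models.CharField(max_length=255, unique=True)

class Connection(models.Model):
    # One row per open websocket; deleting the row removes the user from the room.
    room = models.ForeignKey('Room', on_delete=models.CASCADE)
    user = models.ForeignKey(User, null=True, on_delete=models.CASCADE)
    # Name of the websocket channel for this connection.
    channel_name = models.CharField(max_length=255, unique=True)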

Essentially, the Room is acting as a denormalization of connection status, one that keeps each channel's association with the logged-in User that we get when websocket.connect fires. When we get websocket.disconnect, we just remove the Connection associated with the channel, and the user disappears from the Room.

This works great as long as disconnect fires -- but it has two obvious failure modes:

  1. A user might disconnect without properly closing the websocket connection, so websocket.disconnect doesn't get fired, leaving an orphaned entry.
  2. The server/service might get restarted, closing all websocket connections, but without running disconnect handlers. From the database's perspective, everyone's still connected.

To deal with these, we've set up a periodic task using celerybeat to prune room entries that no longer have an associated channel, using the new group_channels method:

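from channels import channel_layers

def prune_channels(room):
    # Channel names the layer still lists as members of the room's group.
    user_channel_names = channel_layers['default'].group_channels(room.channel_name)
    # Delete Connection rows whose channel is no longer in that group.
    Connection.objects.filter(room=room).exclude(
        channel_name__in=user_channel_names
    ).delete()
    # ... broadcast new presence info to room.channel_name ...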

We're trying to run this pruning function every 30-60 seconds to keep room presence relatively fresh. However, when we run this with the redis backend and the Django devserver, we're seeing that group_channels is returning a long list of no-longer-connected channels, long after we'd have expected them to expire.

Questions I have:

  1. Is this general approach the best way to achieve this sort of thing with the current API, or is there a more elegant/recommended way to maintain a list of currently connected channels and their associated users? (Reading through issues, it seems that "presence" is a big deal for lots of folks; some sort of canonical docs might be awesome. Would be happy to contribute a PR with a writeup of our approach if it'd help as a start.)
  2. Might our problem with long-dead channels showing up in a call to group_channels(...) have something to do with the value of group_expiry, which defaults to 24 hours, in contrast to expiry, which defaults to 60 seconds? Any other suggestions of where to look to debug this?
  3. To speed up responsiveness of approaches like this, do you think django-channels could implement some sort of signal or hook that fires when a channel expires due to client dropouts? We have routing hooks to listen for websocket.disconnect, but I can't find any way to listen for connection.expire, and we're left with polling group_channels(...) to learn about it. This isn't a huge deal; it just adds the polling delay to any existing expiry timeout before status can be updated.

@andrewgodwin
Member

It's for reasons like this that I didn't want to add group_channels in the first place; it's not as clever as you think :)

group_channels is basically just doing the exact same thing as you - adding things on connect, and removing them on disconnect. The reason you're seeing expired channels in it is because you're not getting disconnect fired on those channels for the reasons you described.
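To make that concrete, here is a toy model of the bookkeeping described above. It is only an illustration, not the real backend code: the ToyGroupLayer class and its injectable clock are hypothetical, and the only detail borrowed from this thread is that group membership is governed by a long group_expiry (24 hours by default) rather than by the 60-second message expiry.

```python
import time


class ToyGroupLayer:
    """Toy stand-in for a channel layer's group bookkeeping (hypothetical)."""

    def __init__(self, group_expiry=86400, clock=time.time):
        self.group_expiry = group_expiry
        self.clock = clock
        self._groups = {}  # group name -> {channel name: time added}

    def group_add(self, group, channel):
        # Called from the connect handler.
        self._groups.setdefault(group, {})[channel] = self.clock()

    def group_discard(self, group, channel):
        # Called from the disconnect handler, if it ever fires.
        self._groups.get(group, {}).pop(channel, None)

    def group_channels(self, group):
        # Only memberships older than group_expiry are dropped here;
        # nothing checks whether the socket behind a channel is still open.
        cutoff = self.clock() - self.group_expiry
        members = self._groups.get(group, {})
        for channel, added in list(members.items()):
            if added < cutoff:
                del members[channel]
        return sorted(members)


now = [0.0]  # fake clock so the example is deterministic
layer = ToyGroupLayer(group_expiry=86400, clock=lambda: now[0])
layer.group_add("room-1", "chan.a")
layer.group_add("room-1", "chan.b")
layer.group_discard("room-1", "chan.a")  # chan.a disconnected cleanly
now[0] = 3600  # an hour later; chan.b's client vanished without a disconnect
print(layer.group_channels("room-1"))  # ['chan.b']
```

In this model the dead channel stays listed until the full group_expiry has elapsed, which matches the symptom reported above.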

There are two different approaches to solving this problem:

  • Using websocket auto-ping to periodically assert clients are still connected to sockets, and cutting them loose if not. This is a feature that's already in daphne master if you want to try it, and which is coming in the next release; the ping interval and time-till-considered-dead are both configurable. This solves the (more common) clients-disconnecting-uncleanly issue.
  • Storing the timestamp of the last seen action from a client in a database, and then pruning ones you haven't heard of in a while. This is the only approach that will also get round disconnect not being sent because a Daphne server was killed, but it's more specialised, so Channels can't implement it directly.

If you truly want fully accurate presence, you should go with the timestamp method, but be aware it scales a bit differently (you'll need to consider how to mark channels as expired at scale - I suggest use of transactions and LIMIT 100 or so on queries to get channels to expire). I'd like to have this method available as a separate library/writeup for those who need it, but there's a lot of other stuff I also want to do first - if you're interested in taking on some of this, I will lend all the support I can.
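A minimal sketch of the timestamp method, using the standard-library sqlite3 module in place of the Django ORM so that it runs on its own. The presence table, its columns, and the helper names (make_db, touch, prune_stale) are all assumptions made for illustration; the batch size follows the "LIMIT 100 or so" suggestion, with one transaction per batch.

```python
import sqlite3
import time

BATCH_SIZE = 100  # rows expired per transaction


def make_db():
    # Hypothetical schema: one row per channel with its last-seen time.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE presence ("
        " channel_name TEXT PRIMARY KEY,"
        " room TEXT NOT NULL,"
        " last_seen REAL NOT NULL)"
    )
    return conn


def touch(conn, room, channel_name, now=None):
    """Record that we just heard from this channel (insert or update)."""
    now = time.time() if now is None else now
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO presence (channel_name, room, last_seen)"
            " VALUES (?, ?, ?)",
            (channel_name, room, now),
        )


def prune_stale(conn, max_age, now=None):
    """Delete rows not seen for max_age seconds, one small batch at a time."""
    now = time.time() if now is None else now
    cutoff = now - max_age
    removed = 0
    while True:
        with conn:  # each batch is its own transaction
            rows = conn.execute(
                "SELECT channel_name FROM presence"
                " WHERE last_seen < ? LIMIT ?",
                (cutoff, BATCH_SIZE),
            ).fetchall()
            if not rows:
                return removed
            conn.executemany(
                "DELETE FROM presence WHERE channel_name = ?", rows
            )
        removed += len(rows)


conn = make_db()
for i in range(250):
    touch(conn, "room-1", "chan.%d" % i, now=0)
touch(conn, "room-1", "chan.fresh", now=100)
print(prune_stale(conn, max_age=60, now=120))  # 250
```

Because pruning depends only on the stored timestamps, it keeps working when no disconnect was ever delivered, including after a server was killed.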

In general, don't use group_channels, it's not accurate and is only intended for bulk operations like merging groups together. I'll try and add a few more warnings around it in the docs/code mentioning the caveats.

@yourcelf
Author

Alright, thanks. I've published https://github.com/unhangout/django-channels-presence, which is a reusable Django app that implements the database-backed Room/Presence model strategy with timestamps.

@oTree-org

@andrewgodwin, I read your above post and looked in daphne and found ping_interval and ping_timeout, but how can I actually take advantage of these to prune stale connections? Is this to be used in combination with group_channels?

Like some other people, I also need to display a list of which users are present in which rooms, and need to solve the problem of pruning connections when disconnect is not fired (for example, with Chrome, if I open 30 tabs and then close them all at once, usually about 10 tabs fail to fire the disconnect). I have a "presence" DB table whose records get created on connect and deleted on disconnect (source), so I end up with stale records in this table that are never deleted.

I upgraded daphne, but still have the same issue with disconnect not being fired. I was hoping maybe daphne would cause my disconnect consumer to be called after ping_timeout, but that seems not to be the case.

Thus far, my workaround has been to use an AJAX heartbeat every 20 seconds, but that seems to have its own problems and complexities, so I am looking for an alternative.

@andrewgodwin
Member

The ping options are just to make Daphne clean up stale connections faster than before; there are still situations where you won't get disconnects (for example, if you SIGKILL Daphne and thus don't give it time to close cleanly). If you want truly accurate presence, you'll have to implement your own heartbeat layer on top of websockets with the relevant time tracking and timeout logic; this is not something that channels has the maintenance team to include right now.
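As a rough sketch of what such a heartbeat layer might track, assuming clients periodically send a small JSON frame such as {"type": "heartbeat"} over the websocket. The HeartbeatTracker class, the message shape, and the 60-second timeout are all hypothetical. State is kept in memory here only to keep the example short; in a real deployment it would need to live in shared storage such as the database so that it survives a killed server process.

```python
import json
import time


class HeartbeatTracker:
    """Tracks when each channel was last heard from (hypothetical helper)."""

    def __init__(self, timeout=60.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self._last_seen = {}  # channel name -> time of last frame

    def connect(self, channel):
        self._last_seen[channel] = self.clock()

    def disconnect(self, channel):
        self._last_seen.pop(channel, None)

    def receive(self, channel, text):
        """Handle one text frame; return True if it was a heartbeat."""
        # Any traffic counts as a sign of life, heartbeat or not.
        self._last_seen[channel] = self.clock()
        try:
            message = json.loads(text)
        except ValueError:
            return False
        return isinstance(message, dict) and message.get("type") == "heartbeat"

    def expire(self):
        """Drop and return channels silent for longer than the timeout."""
        cutoff = self.clock() - self.timeout
        dead = sorted(c for c, seen in self._last_seen.items() if seen < cutoff)
        for channel in dead:
            del self._last_seen[channel]
        return dead

    def present(self):
        return sorted(self._last_seen)


now = [0.0]  # fake clock so the example is deterministic
tracker = HeartbeatTracker(timeout=60, clock=lambda: now[0])
tracker.connect("chan.a")
tracker.connect("chan.b")
now[0] = 45
tracker.receive("chan.a", '{"type": "heartbeat"}')
now[0] = 70
print(tracker.expire())   # ['chan.b']
print(tracker.present())  # ['chan.a']
```

The expire step would be run periodically (for example from the same kind of scheduled task used for pruning earlier in this thread), and its result is the set of users to drop from the room.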

@pycharmer1221

Hello, I am facing the same problem. One solution is to implement our own heartbeat layer on top of WebSocket, but that adds unnecessary complication to our system. Are there any other solutions?

4 participants