xmlstream socket CLOSE_WAIT problem #151
Comments
I can't reproduce this issue on my local machine; it only happens when clients connect from remote machines. I updated SleekXMPP to the 1.0.1-dev version and now I get a different error, something about SSL.
I wonder whether, in such a situation, it makes sense to give up and shut down the socket. Currently clients are stuck trying to send the closing /stream. The XMPP server could have been restarted at this point and is not waiting for /stream; the whole session no longer exists and should be initialized again. I would not count on /stream or stream-error messages from the server either: something may be wrong with the connection and you may never get any data from it.
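The "give up and shut down the socket" idea above can be sketched as follows. This is a generic illustration, not SleekXMPP code; `force_close` is a hypothetical helper that tears the connection down immediately instead of waiting for a closing stream footer that may never arrive:

```python
import socket

def force_close(sock):
    """Tear down a socket whose session is already dead, without
    waiting for a closing </stream:stream> from the server."""
    try:
        # Stop both directions immediately; don't linger for peer data.
        sock.shutdown(socket.SHUT_RDWR)
    except OSError:
        pass  # the peer may already be gone; nothing left to shut down
    finally:
        sock.close()  # release the file descriptor unconditionally
```

After this, any further send on the socket fails fast instead of blocking, which is exactly what you want when the session no longer exists server-side.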
You're right about not needing to send the closing /stream in that case.
It looks like there are some changes in the develop branch related to this issue, but it still doesn't work quite right:
And it keeps going; I have to exit the process and start it again. It looks like there is some state that is not cleaned up on disconnect.
Are you specifying a specific SASL mech?
No, I don't touch anything. It connects fine, but when I restart ejabberd it cannot connect again because of this.
Interesting. Can you run this again at the debug log level and post that?
Well, you asked for it. As you can see there are 2 DNS records; I shut down one ejabberd, but it can't connect to the other one. The ejabberds run in a cluster... I should have mentioned that earlier.

DEBUG:sleekxmpp.xmlstream.xmlstream:RECV: stream:error/stream:error
DEBUG:sleekxmpp.xmlstream.xmlstream:SEND (IMMED): <stream:stream to='X' xmlns:stream='http://etherx.jabber.org/streams' xmlns='jabber:client' version='1.0'>
DEBUG:sleekxmpp.xmlstream.xmlstream:SEND (IMMED): <stream:stream to='X' xmlns:stream='http://etherx.jabber.org/streams' xmlns='jabber:client' version='1.0'>
Haha, yeah, I did ask for it, but that was very useful. You were right that state was not being cleared properly: the set of attempted SASL mechs was not being reset if authentication didn't succeed at all. The develop branch has been updated to fix that. However, that still leaves the question of why you were not able to auth at all with your second machine; I'm guessing that may be an issue with ejabberd clustering.
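The state bug described above can be illustrated with a minimal sketch. This is not SleekXMPP's actual SASL code; `SaslState`, `next_mech`, and `reset` are assumed names that just show why a never-cleared "attempted mechanisms" set breaks every reconnect after a failed auth:

```python
class SaslState:
    """Minimal sketch of the fix described above: the set of
    already-attempted SASL mechanisms must be cleared when auth
    fails, or the next connection finds nothing left to try."""

    def __init__(self, mechs):
        self.mechs = set(mechs)
        self.attempted = set()

    def next_mech(self):
        """Return an untried mechanism, or None if all were attempted."""
        remaining = self.mechs - self.attempted
        if not remaining:
            return None
        mech = sorted(remaining)[0]
        self.attempted.add(mech)
        return mech

    def reset(self):
        # Without this call after a failed auth, a reconnect sees
        # `attempted` already full and gives up immediately.
        self.attempted.clear()
```

The fix in develop amounts to making sure the equivalent of `reset()` runs whenever authentication fails outright.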
Great. It will take some time for me to update the clients; I'll let you know the result :-)
Looks like the auth issue is fixed, at least for the single-ejabberd scenario. As for the ejabberd cluster, a disconnected node cannot authenticate with another ejabberd, but that is another story. As for this issue, it looks like there is some threading problem: send_raw() keeps running even after "stop" was set.
Here you can see the line "!!!!!! stop set", but after that send_raw() continues to work... crazy. I wonder if there is more than one thread running send_raw()? I am still trying to figure out what is wrong. At least it is not hard to reproduce: all you need is a steady stream of stanzas so something is in the queue when the disconnect occurs. I use tcpkill to simulate connectivity problems.
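The race being described (a queued stanza sent after "stop" was set) usually means the send loop only checks the flag once per iteration. A sketch of a send loop that re-checks the stop event before every write — generic code, not SleekXMPP's actual send thread; `send_loop` and its parameters are assumed names:

```python
import queue
import threading

def send_loop(send_queue, stop, send_raw):
    """Drain the send queue, re-checking the stop event both before
    blocking on the queue and again before each send, so a disconnect
    can't race with a stanza that was already dequeued."""
    while not stop.is_set():
        try:
            data = send_queue.get(timeout=0.1)
        except queue.Empty:
            continue
        if stop.is_set():  # stop may have been set while we waited
            break
        send_raw(data)
```

If a second check like this is missing, a stanza pulled off the queue just before the disconnect will still be written to the dead socket.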
Maybe it is related to my new approach to handling connection problems; my main loop now looks like this:
I use the main thread as a supervisor for the Sleek thread. If I see the "session end" event I set node_state to offline, wait a while to give it a chance to disconnect on its own, and if that doesn't happen I invoke disconnect() manually. Maybe disconnect doesn't work as expected and process() starts another thread while the old one is still stuck somewhere.
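The supervisor pattern described above might look like the following sketch. All names here (`supervise`, `session_end`, `node_state`, the `connected`/`disconnect` attributes on the client) are assumptions for illustration, not the commenter's actual code or SleekXMPP API:

```python
import threading
import time

def supervise(client, session_end, node_state, grace=5.0, poll=0.5):
    """Wait for the session-end event, mark the node offline, give the
    client `grace` seconds to disconnect on its own, then force it."""
    session_end.wait()
    node_state["online"] = False
    deadline = time.time() + grace
    while client.connected and time.time() < deadline:
        time.sleep(poll)  # give the client a chance to tear down itself
    if client.connected:
        client.disconnect()  # it didn't disconnect on its own; force it
```

One design caveat with this pattern: if `disconnect()` itself can block on an internal lock (as the later comments in this thread suggest), the supervisor thread gets stuck too, so a timeout inside `disconnect()` matters.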
And here are a couple of new lines from the log, something about a lock:
I guess transition() should return False if it fails to acquire the lock. Maybe this should be handled somehow in disconnect(): shut down the thread unconditionally, or raise an exception so that user code can handle it (though there is not much to handle, but at least I would know that I have to restart my process).
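The suggestion above (report lock-acquisition failure instead of blocking forever) can be sketched like this. This is not SleekXMPP's actual state machine; `StateMachine` and its method signatures are assumed names illustrating the idea:

```python
import threading

class StateMachine:
    """Sketch of the behaviour suggested above: a transition that
    cannot get the lock within a timeout returns False, so the
    caller (e.g. disconnect()) can escalate instead of hanging."""

    def __init__(self, state="connected"):
        self.lock = threading.Lock()
        self.state = state

    def transition(self, from_state, to_state, timeout=2.0):
        if not self.lock.acquire(timeout=timeout):
            return False  # caller can now force a shutdown or raise
        try:
            if self.state != from_state:
                return False  # someone else changed state first
            self.state = to_state
            return True
        finally:
            self.lock.release()
```

With this shape, a `disconnect()` that sees `transition(...) == False` knows the machine is wedged and can kill the thread or surface an error to the application rather than silently blocking.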
Right, there is that lock issue. I've just updated develop to be more aggressive in checking for the stop flag. Let me know if the problems persist, and thanks once again for the reports.
Results are mixed. It worked fine for 2 days but then started to fail; so far there is not much data on what causes this, and testing is still in progress. So far I have seen this kind of error:
And it keeps going. The ejabberd log doesn't reveal any details.
It looks like I really need to add the original stanza to the timeout logs instead of just the id.
Looks like it works fine now. Some failures still happen from time to time, but I can't spend more time on that right now; if a client can't reconnect I shut down its process and start it again. The only problem I have is that not all threads exit... maybe it is a good idea to add daemon=True to the Sleek threads? If something goes wrong it is still better to exit somehow, even uncleanly, than to let threads run and hold the process, especially when some of them are still trying to communicate with a server.
We used to set the threads to daemon mode, but that makes the interpreter barf exceptions on shutdown because daemon threads are not supposed to access global objects (so no queue module references, etc.). Of course, that generates a lot of extra bug reports :) But yes, if you're OK with noisy exceptions on shutdown, we know from experience that daemon=True works great. In XMLStream.process there is a start_thread function where you can add the line to set daemon status for the main Sleek threads. Let me know if you do eventually pin down where the failures are coming from.
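The change being described is a one-liner in a thread-starting helper. A generic sketch (not SleekXMPP's actual `start_thread`, whose signature may differ) of what setting daemon status there looks like:

```python
import threading

def start_thread(name, target, daemon=True):
    """Sketch of a start_thread helper like the one mentioned above.
    Daemon threads don't keep the interpreter alive, so a wedged
    worker can't hold the process open -- at the cost of possible
    noisy exceptions during interpreter shutdown."""
    t = threading.Thread(name=name, target=target)
    t.daemon = daemon  # must be set before start()
    t.start()
    return t
```

The trade-off mentioned above is real: daemon threads still running at interpreter shutdown can hit half-torn-down module globals, which is where the "barf exceptions" reports come from.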
OK. FYI: start_thread() is not enough; the scheduler also doesn't exit: https://github.com/fritzy/SleekXMPP/blob/master/sleekxmpp/xmlstream/scheduler.py#L130 Any chance that can eventually become an optional parameter?
Right, of course, that was set too in the scheduler code before. I've added a _use_daemons flag to XMLStream, with the underscore since I'm not sure yet whether it should be official API once the hanging-thread issues are resolved.
I believe this issue has struck us. If the network goes down, the XMPP ping (XEP-0199) fails with a timeout and SleekXMPP attempts to reconnect, but it never succeeds (even after the network is fixed). Is there any workaround for this (or a chance of a fix)?
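Until the root cause is fixed, the blunt workaround mentioned earlier in the thread (kill the stuck process and let a supervisor restart it) can be wrapped in a watchdog. This is a generic sketch, not SleekXMPP API; all names here are assumptions:

```python
import os
import time

class ReconnectWatchdog:
    """Call alive() whenever the connection is known-good (e.g. on a
    successful ping).  If no heartbeat arrives within `limit` seconds,
    expired() turns True and the owner can terminate the process so an
    external supervisor (systemd, supervisord) restarts it cleanly."""

    def __init__(self, limit=60.0, kill=lambda: os._exit(1)):
        self.limit = limit
        self.kill = kill  # os._exit skips cleanup that might hang
        self.last = time.time()

    def alive(self):
        self.last = time.time()

    def expired(self, now=None):
        if now is None:
            now = time.time()
        return (now - self.last) > self.limit

    def check(self):
        if self.expired():
            self.kill()
```

`os._exit` (rather than `sys.exit`) is deliberate here: it bypasses thread joins and atexit handlers, which is the point when the hung threads are exactly what you are trying to escape.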
I have a peculiar problem with client connections: after some time (about 12 hours or so) they disconnect and are not able to connect again; they get stuck in reconnect attempts. Here are some lines from the sleekxmpp log file:
netstat reveals that the socket still exists, but it is in CLOSE_WAIT state:
After "Socket Error #-5" I don't see any new lines in the log, though I still call connect every 15 seconds (I don't use reattempt=True)