xmlstream socket CLOSE_WAIT problem #151

Open
coffeeowl opened this Issue Mar 20, 2012 · 21 comments

Projects

None yet

3 participants

@coffeeowl

I have some peculiar problem with client connections, after some time (about 12 hours or so) they disconnect and not able to connect again, they stuck in reconnect attempts, here is some lines from sleekxmpp log file:

WARNING:sleekxmpp.xmlstream.handler.waiter:Timed out waiting for IqWait_19E
ERROR:sleekxmpp.xmlstream.xmlstream:Error reading from XML stream.
ERROR:sleekxmpp.basexmpp:no element found: line 1, column 45319
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/sleekxmpp/xmlstream/xmlstream.py", line 1150, in _process
if not self.__read_xml():
File "/usr/local/lib/python2.6/dist-packages/sleekxmpp/xmlstream/xmlstream.py", line 1186, in __read_xml
for event, xml in ET.iterparse(self.filesocket, (b'end', b'start')):
File "", line 68, in iter
SyntaxError: no element found: line 1, column 45319
ERROR:sleekxmpp.xmlstream.xmlstream:Could not connect to :5222. Socket Error #-5: No address associated with hostname

netstat reveals that socket still exists, but it is in CLOSE_WAIT state:

tcp 1 0 :48050 :5222 CLOSE_WAIT 31226/python

After "Socket Error #-5" I don't see any new lines in the log, though I still call connect every 15 seconds (I don't use reattempt=True)

@coffeeowl

On my local machine I can't reproduce this issue, it only happens when clients connect from remote machines, I have updated SleekXMPP to 1.0.1-dev version and now I get some other error, something about SSL

SSLError: [Errno 1] _ssl.c:1217: error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
WARNING:sleekxmpp.xmlstream.xmlstream:Failed to send /stream:stream
WARNING:sleekxmpp.xmlstream.xmlstream:SSL write error - reattempting
WARNING:sleekxmpp.xmlstream.xmlstream:SSL write error - reattempting
ERROR:sleekxmpp.basexmpp:[Errno 1] _ssl.c:1217: error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/sleekxmpp/xmlstream/xmlstream.py", line 1536, in _send_thread
sent += self.socket.send(enc_data[sent:])
File "/usr/lib/python2.6/ssl.py", line 174, in send
v = self._sslobj.write(data)
SSLError: [Errno 1] _ssl.c:1217: error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
WARNING:sleekxmpp.xmlstream.xmlstream:Failed to send from="zzz@yyy">action="execute" />
ERROR:sleekxmpp.basexmpp:[Errno 1] _ssl.c:1217: error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/sleekxmpp/xmlstream/xmlstream.py", line 1185, in send_raw
sent += self.socket.send(data[sent:])
File "/usr/lib/python2.6/ssl.py", line 174, in send
v = self._sslobj.write(data)
SSLError: [Errno 1] _ssl.c:1217: error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
WARNING:sleekxmpp.xmlstream.xmlstream:Failed to send /stream:stream

I wonder, maybe in such situation it makes sense to give up and shutdown socket? Currently clients are stuck in attempts to send /stream. Anyway, XMPP server could be restarted at this point and is not waiting for /stream, actually the whole session doesn't exist anymore and should be initialized again, I would not count on /stream or stream-error messages from a server as well, it can be something wrong with the connection and you will never get any data from it.

@legastero
Collaborator

You're right about not needing to send the </stream> at that point, and I'm going through and fixing things to make that work properly now. This has actually been an issue for some other things I'm working on, like properly handling failed cert validation.

@coffeeowl

It looks like there are some changes in develop branch related to this issue, but it still doesn't work quite well:

INFO:sleekxmpp.xmlstream.xmlstream:Negotiating TLS
INFO:sleekxmpp.xmlstream.xmlstream:Using SSL version: 3
ERROR:sleekxmpp.features.feature_mechanisms.mechanisms:No appropriate login method.
INFO:sleekxmpp.xmlstream.xmlstream:Waiting for /stream:stream from server

And it keeps going, I have to exit process and start process again, looks like there is some state information which is not cleaned on disconnect.

@legastero
Collaborator

Are you specifying a specific SASL mech?

@coffeeowl

No, I don't touch anything. It connects fine, but when I restart ejabberd it is not able to connect again because of this.

@legastero
Collaborator

Interesting. Can you run this again with the debug log level and post that?

@coffeeowl

Well, you asked for it. As you can see there are 2 DNS records, I shutdown one ejabberd, but it can't connect to another one. ejabberds run in a cluster.... I should mention it earlier.

DEBUG:sleekxmpp.xmlstream.xmlstream:RECV: stream:error/stream:error
DEBUG:sleekxmpp.xmlstream.xmlstream:End of stream recieved
DEBUG:sleekxmpp.xmlstream.xmlstream:SEND (IMMED): /stream:stream
INFO:sleekxmpp.xmlstream.xmlstream:Waiting for /stream:stream from server
DEBUG:sleekxmpp.thirdparty.statemachine: ==== TRANSITION connected -> disconnected
DEBUG:sleekxmpp.xmlstream.scheduler:Quitting Scheduler thread
DEBUG:sleekxmpp.thirdparty.statemachine:State was not ready
DEBUG:sleekxmpp.xmlstream.xmlstream:No remaining DNS records to try.
DEBUG:sleekxmpp.xmlstream.resolver:Querying SRV records for X
DEBUG:sleekxmpp.xmlstream.resolver:DNS: Querying X. for AAAA records.
DEBUG:sleekxmpp.xmlstream.resolver:DNS: No AAAA records for X..
DEBUG:sleekxmpp.xmlstream.resolver:DNS: Querying X. for A records.
DEBUG:sleekxmpp.xmlstream.xmlstream:Connecting to X:5222
DEBUG:sleekxmpp.thirdparty.statemachine: ==== TRANSITION disconnected -> connected
DEBUG:sleekxmpp.xmlstream.xmlstream:Starting HANDLER THREAD
DEBUG:sleekxmpp.xmlstream.xmlstream:Loading event runner
DEBUG:sleekxmpp.xmlstream.xmlstream:SEND (IMMED): <stream:stream to='X' xmlns:stream='http://etherx.jabber.org/streams' xmlns='jabber:client' version='1.0'>
DEBUG:sleekxmpp.xmlstream.xmlstream:RECV: stream:featuresSCRAM-SHA-1DIGEST-MD5PLAIN/stream:features
DEBUG:sleekxmpp.xmlstream.xmlstream:SEND (IMMED):
DEBUG:sleekxmpp.xmlstream.xmlstream:RECV:
DEBUG:sleekxmpp.features.feature_starttls.starttls:Starting TLS
INFO:sleekxmpp.xmlstream.xmlstream:Negotiating TLS
INFO:sleekxmpp.xmlstream.xmlstream:Using SSL version: 3
DEBUG:sleekxmpp.xmlstream.xmlstream:CERT: -----BEGIN CERTIFICATE-----
-----END CERTIFICATE-----

DEBUG:sleekxmpp.xmlstream.xmlstream:SEND (IMMED): <stream:stream to='X' xmlns:stream='http://etherx.jabber.org/streams' xmlns='jabber:client' version='1.0'>
DEBUG:sleekxmpp.xmlstream.xmlstream:RECV: stream:featuresSCRAM-SHA-1DIGEST-MD5PLAIN/stream:features
DEBUG:sleekxmpp.xmlstream.xmlstream:SEND (IMMED): biwsbj1ubGFkZnNseDYxc3RnLHI9MzQ0NTk4NjU0MjEx
DEBUG:sleekxmpp.xmlstream.xmlstream:RECV:
INFO:sleekxmpp.features.feature_mechanisms.mechanisms:Authentication failed: not-authorized
DEBUG:sleekxmpp.xmlstream.xmlstream:SEND (IMMED):
DEBUG:sleekxmpp.xmlstream.xmlstream:RECV: bm9uY2U9IjM3Mzc5MTE2NzUiLHFvcD0iYXV0aCIsY2hhcnNldD11dGYtOCxhbGdvcml0aG09bWQ1LXNlc3M=
DEBUG:sleekxmpp.xmlstream.xmlstream:SEND (IMMED): dXNlcm5hbWU9Im5sYWRmc2x4NjFzdGciLG5vbmNlPSIzNzM3OTExNjc1IixyZWFsbT0ic3RhZ2UuYXBpY2FuZXQuY29tIixxb3A9YXV0aCxjbm9uY2U9IjU0MDc1MzEwMzU0MiIsbmM9MDAwMDAwMDEsZGlnZXN0LXVyaT0ieG1wcC9zdGFnZS5hcGljYW5ldC5jb20iLHJlc3BvbnNlPTFiMDg3ZjIzODVkYzk5ZGU3OTMxYmY0NzVhZmYyOWYy
DEBUG:sleekxmpp.xmlstream.xmlstream:RECV:
INFO:sleekxmpp.features.feature_mechanisms.mechanisms:Authentication failed: not-authorized
DEBUG:sleekxmpp.xmlstream.xmlstream:SEND (IMMED): AG5sYWRmc2x4NjFzdGcAWmxhdGFuMjAxMA==
DEBUG:sleekxmpp.xmlstream.xmlstream:RECV:
INFO:sleekxmpp.features.feature_mechanisms.mechanisms:Authentication failed: not-authorized
ERROR:sleekxmpp.features.feature_mechanisms.mechanisms:No appropriate login method.
DEBUG:sleekxmpp.xmlstream.xmlstream:SEND (IMMED): /stream:stream
INFO:sleekxmpp.xmlstream.xmlstream:Waiting for /stream:stream from server
DEBUG:sleekxmpp.thirdparty.statemachine: ==== TRANSITION connected -> disconnected
ERROR:sleekxmpp.xmlstream.xmlstream:Error reading from XML stream.
DEBUG:sleekxmpp.xmlstream.scheduler:Quitting Scheduler thread
DEBUG:sleekxmpp.thirdparty.statemachine:State was not ready
DEBUG:sleekxmpp.thirdparty.statemachine:State was not ready
DEBUG:sleekxmpp.xmlstream.resolver:DNS: Querying X. for AAAA records.
DEBUG:sleekxmpp.xmlstream.resolver:DNS: No AAAA records for X..
DEBUG:sleekxmpp.xmlstream.resolver:DNS: Querying X. for A records.
DEBUG:sleekxmpp.xmlstream.xmlstream:Connecting to X:5222
ERROR:sleekxmpp.xmlstream.xmlstream:Could not connect to X:5222. Socket Error #111: Connection refused
DEBUG:sleekxmpp.xmlstream.xmlstream:No remaining DNS records to try.
DEBUG:sleekxmpp.xmlstream.resolver:Querying SRV records for X
DEBUG:sleekxmpp.xmlstream.resolver:DNS: Querying X. for AAAA records.
DEBUG:sleekxmpp.xmlstream.resolver:DNS: No AAAA records for X..
DEBUG:sleekxmpp.xmlstream.resolver:DNS: Querying X. for A records.
DEBUG:sleekxmpp.xmlstream.xmlstream:Connecting to X:5222
ERROR:sleekxmpp.xmlstream.xmlstream:Could not connect to X:5222. Socket Error #111: Connection refused
DEBUG:sleekxmpp.xmlstream.resolver:DNS: Querying Y. for AAAA records.
DEBUG:sleekxmpp.xmlstream.resolver:DNS: No AAAA records for Y..
DEBUG:sleekxmpp.xmlstream.resolver:DNS: Querying Y. for A records.
DEBUG:sleekxmpp.xmlstream.xmlstream:Connecting to Y:5222
DEBUG:sleekxmpp.thirdparty.statemachine: ==== TRANSITION disconnected -> connected
DEBUG:sleekxmpp.xmlstream.xmlstream:Starting HANDLER THREAD
DEBUG:sleekxmpp.xmlstream.xmlstream:Loading event runner
DEBUG:sleekxmpp.xmlstream.xmlstream:SEND (IMMED): <stream:stream to='X' xmlns:stream='http://etherx.jabber.org/streams' xmlns='jabber:client' version='1.0'>
DEBUG:sleekxmpp.xmlstream.xmlstream:RECV: stream:featuresSCRAM-SHA-1DIGEST-MD5PLAIN/stream:features
DEBUG:sleekxmpp.xmlstream.xmlstream:SEND (IMMED):
DEBUG:sleekxmpp.xmlstream.xmlstream:RECV:
DEBUG:sleekxmpp.features.feature_starttls.starttls:Starting TLS
INFO:sleekxmpp.xmlstream.xmlstream:Negotiating TLS
INFO:sleekxmpp.xmlstream.xmlstream:Using SSL version: 3
DEBUG:sleekxmpp.xmlstream.xmlstream:CERT: -----BEGIN CERTIFICATE-----
-----END CERTIFICATE-----

DEBUG:sleekxmpp.xmlstream.xmlstream:SEND (IMMED): <stream:stream to='X' xmlns:stream='http://etherx.jabber.org/streams' xmlns='jabber:client' version='1.0'>
DEBUG:sleekxmpp.xmlstream.xmlstream:RECV: stream:featuresSCRAM-SHA-1DIGEST-MD5PLAIN/stream:features
ERROR:sleekxmpp.features.feature_mechanisms.mechanisms:No appropriate login method.
DEBUG:sleekxmpp.xmlstream.xmlstream:SEND (IMMED): /stream:stream
INFO:sleekxmpp.xmlstream.xmlstream:Waiting for /stream:stream from server
DEBUG:sleekxmpp.xmlstream.xmlstream:End of stream recieved
DEBUG:sleekxmpp.thirdparty.statemachine: ==== TRANSITION connected -> disconnected
DEBUG:sleekxmpp.xmlstream.scheduler:Quitting Scheduler thread
DEBUG:sleekxmpp.thirdparty.statemachine:State was not ready
DEBUG:sleekxmpp.thirdparty.statemachine:State was not ready
DEBUG:sleekxmpp.xmlstream.xmlstream:No remaining DNS records to try.

@legastero
Collaborator

Haha, yeah I did ask for it, but that was very useful.

You were right that state was not being cleared properly: the set of attempted SASL mechs was not being reset if authentication didn't succeed at all. Develop branch has been updated to fix that.

However, that still leaves an issue of why you were not able to auth at all with your second machine, but I'm guessing that may be an issue with ejabberd clustering.

@coffeeowl

Great, it will take some time for me to update clients, I'll let you know the result :-)

@coffeeowl

Looks like auth issue is fixed, at least for scenario with one ejabberd, as for ejabberd cluster, disconnected node cannot authenticate on another ejabberd, but that is another story. As for this issue, looks like there is some threading problem. send_raw() continues to work even after "stop" was set.

WARNING:sleekxmpp.xmlstream.xmlstream:SSL write error - reattempting
ERROR:sleekxmpp.basexmpp:[Errno 1] _ssl.c:1217: error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/sleekxmpp/xmlstream/xmlstream.py", line 1497, in _send_thread
with self.send_lock:
File "/usr/lib/python2.6/ssl.py", line 174, in send
v = self._sslobj.write(data)
SSLError: [Errno 1] _ssl.c:1217: error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
WARNING:sleekxmpp.xmlstream.xmlstream:Failed to send from="esmaanlx61stg@stage.apicanet.com/0">
WARNING:sleekxmpp.xmlstream.xmlstream:SSL write error - reattempting
INFO:sleekxmpp.xmlstream.xmlstream:Waiting for /stream:stream from server
ERROR:sleekxmpp.xmlstream.xmlstream:!!!!!!! stop set
ERROR:sleekxmpp.basexmpp:[Errno 1] _ssl.c:1217: error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/sleekxmpp/xmlstream/xmlstream.py", line 1497, in _send_thread
with self.send_lock:
File "/usr/lib/python2.6/ssl.py", line 174, in send
v = self._sslobj.write(data)
SSLError: [Errno 1] _ssl.c:1217: error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
WARNING:sleekxmpp.xmlstream.xmlstream:Failed to send from="esmaanlx61stg@stage.apicanet.com/0">
WARNING:sleekxmpp.xmlstream.xmlstream:SSL write error - reattempting

Here you can see line "!!!!!! stop set", but after that send_raw() continues to work... crazy, I wonder maybe there is more than one thread running send_raw()? I am still trying to figure out what is wrong. At least it is not hard to reproduce, all you need is a steady stream of stanzas to keep something in the queue at the time disconnect occurs, I use tcpkill to simulate connectivity problems.

@coffeeowl

Maybe it is related to my new approach of handling connection problems, my main loop now looks like this:

while node.run:
    try:
        syslog(LOG_NOTICE, "connecting...")
        if node.connect(reattempt=False):
            syslog(LOG_NOTICE, "connected to %s" % str(node.address))
            # process() will work in separate thread
            node.process()
            offline_time = 0
            while node.run:
                if node.node_state == "offline":
                    if offline_time == 0:
                        syslog(LOG_NOTICE, "node is offline")
                    offline_time += 1
                    if offline_time >= 15:
                        syslog(LOG_NOTICE, "connection is dead, reconnecting...")
                        # disconnect and exit, outer loop will try
                        # to connect again
                        node.disconnect()
                        break
                else:
                    offline_time = 0

                time.sleep(1)
        else:
            raise Exception("connect() failed")
    except Exception, e:
        syslog(LOG_ERR, "cannot connect, %s, will retry in 15 seconds..." % e)
        time.sleep(15)

I use main thread as a supervisor for sleek thread, if I see "session end" event I set node_state to offline, wait some time to give it a chance to disconnect on itself, if it doesn't happen I invoke disconnect() manually. Maybe disconnect doesn't work as expected and process() starts another thread, while old one is still stuck somewhere.

@coffeeowl

And here is a couple of new lines from the log, something about lock:

SSLError: [Errno 1] _ssl.c:1217: error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
WARNING:sleekxmpp.xmlstream.xmlstream:Failed to send <iq to="X@Y" type="set" id="1045bdf2-9260-4ae9-b292-4a08d5a400e4-D8" from="Z@Y"><command xmlns="http://jabber.org/protocol/commands" node="savejob" action="execute" /></iq>
DEBUG:sleekxmpp.thirdparty.statemachine:Could not acquire lock
DEBUG:sleekxmpp.thirdparty.statemachine:Could not acquire lock
WARNING:sleekxmpp.xmlstream.xmlstream:SSL write error - reattempting
DEBUG:sleekxmpp.xmlstream.xmlstream:SSL error - max retries reached
ERROR:sleekxmpp.basexmpp:[Errno 1] _ssl.c:1217: error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/sleekxmpp/xmlstream/xmlstream.py", line 1497, in _send_thread
sent += self.socket.send(enc_data[sent:])
File "/usr/lib/python2.6/ssl.py", line 174, in send
v = self._sslobj.write(data)
SSLError: [Errno 1] _ssl.c:1217: error:1409F07F:SSL routines:SSL3_WRITE_PENDING:bad write retry
WARNING:sleekxmpp.xmlstream.xmlstream:Failed to send <iq to="X@Y" type="set" id="1045bdf2-9260-4ae9-b292-4a08d5a400e4-D8" from="Z@Y"><command xmlns="http://jabber.org/protocol/commands" node="savejob" action="execute" /></iq>

@coffeeowl

I guess transition() should return False if it fails to acquire lock, maybe this should be handled somehow in disconnect(), like shutdown thread unconditionally, raise exception, so that user's code can handle this (though there is not much to handle, but at least I would know that I have to restart my process)

@legastero
Collaborator

Right, there is the _send_thread() thread that handles sending most stanzas (send_raw() delegates to this unless explicitly given the now=True parameter).

I've just updated develop to be more aggressive in checking for the self.stop threading event, so that should help some. I also noticed that the last failed stanza was not being cleared on disconnect, which could also have been part of the problem so that is fixed too.

Let me know if the problems persist, and thanks once again for the reports.

@coffeeowl

Results are mixed. It was working fine for 2 days, but then has started to fail, so far there is not much data what causes this, the testing is still in progress.

So far I have seen this kind of errors:

INFO:sleekxmpp.features.feature_bind.bind:Node set to: X
WARNING:sleekxmpp.xmlstream.handler.waiter:Timed out waiting for IqWait_d2c99649-a69b-49a6-831b-96db2b1f9089-79430
INFO:sleekxmpp.xmlstream.xmlstream:Waiting for /stream:stream from server
INFO:sleekxmpp.xmlstream.xmlstream:Waiting for /stream:stream from server
INFO:sleekxmpp.xmlstream.xmlstream:Negotiating TLS
INFO:sleekxmpp.xmlstream.xmlstream:Using SSL version: 3
INFO:sleekxmpp.features.feature_bind.bind:Node set to: X
WARNING:sleekxmpp.xmlstream.handler.waiter:Timed out waiting for IqWait_d2c99649-a69b-49a6-831b-96db2b1f9089-7943C
INFO:sleekxmpp.xmlstream.xmlstream:Waiting for /stream:stream from server
INFO:sleekxmpp.xmlstream.xmlstream:Waiting for /stream:stream from server
INFO:sleekxmpp.xmlstream.xmlstream:Negotiating TLS
INFO:sleekxmpp.xmlstream.xmlstream:Using SSL version: 3
INFO:sleekxmpp.features.feature_bind.bind:Node set to: X

And it keeps going. ejabberd log doesn't reveal any details.

@legastero
Collaborator

It looks like I really need to add the original stanza to the timeout logs instead of just the id.

@coffeeowl

Looks like it works fine now, some failures still happen from time to time, but I can't spend more time on that right now, if a client can't reconnect I shutdown its process and start it again. The only problem I have is that not all threads exit... maybe it is a good idea to add daemon=True to Sleek threads? If something goes wrong it is still better to exit somehow even not cleanly, than let threads run and hold process, especially when some of them are still trying to communicate with a server.

@legastero
Collaborator

We used to set the threads to daemon mode, but that makes the interpreter barf exceptions on shutdown because daemon threads are not supposed to access global objects (so no queue module references, etc). Of course, that generates a lot of extra bug reports :)

But yes, if you're ok with noisy exceptions on shutdown, we know from experience that daemon=True works great. In XMLStream.process there is a start_thread function where you can add the line to set daemon status for the main Sleek threads.

Let me know if you do eventually get a fix on where the failures are coming from.

@coffeeowl

OK. FYI - start_thread() is not enough, scheduler also doesn't exit - https://github.com/fritzy/SleekXMPP/blob/master/sleekxmpp/xmlstream/scheduler.py#L130

Any chance that eventually can become an optional parameter?

@legastero
Collaborator

Right, of course, that was set too before in the scheduler code.

I've added a _use_daemons flag to XMLStream, with the _ since I'm not sure yet if that should be official API once the hanging thread issues are resolved.

@kowal256

I believe this issue has struck us,

If network goes down xmpp ping (xep 0199) fails with timeout and SleekXMPP attempts to reconnect - it never succeeds (After fixing network)

Is there any workaround for this (or chance to fix)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment