
fix for malformed xml from Google and other sinners #198

Open
wants to merge 8 commits into from

2 participants

@jfarrimo

I've run into this bug:

https://bugs.freedesktop.org/show_bug.cgi?id=33857

Basically, the iTeleportService client is sending malformed XML through gchat, and gchat is merrily sending it along. Since this is a Google service, there is almost no hope of getting it fixed, so I'm forced to hack deep into the bowels of SleekXMPP to take care of it. It makes me feel dirty to do this. There is no shower that will make me feel clean again.

I don't have the luxury of de-friending this user, and I don't see a good way to catch the malformed XML other than to sit in the socket and watch the stream as it comes by. Ugh.

Please tell me there is a better way to do this. Please.

BTW, the worst part is that SleekXMPP emits no error message that gives even the foggiest clue that this is the problem. It took me a good long while of drilling down into the code to figure out what was going on. This is only partly SleekXMPP's fault; it's also the fault of the XML parser.

@jfarrimo

Hrm. Ok, looks like the pull request covers three commits. It's only meant to cover the last one that mentions iTeleportService in the subject.

@legastero
Collaborator

Sadly, that looks to be the best solution for now (for Python 2, this won't work for 3 because filesocket isn't used there). The only other way to do it with the current architecture would be to make another wrapper for the socket object to do the filtering, but that's pretty much what this is.
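That wrapper approach might look something like the sketch below. To be clear, this is illustrative, not the patch itself: the class name and the byte pattern are made up, and a real filter would also have to handle a pattern that straddles a read() boundary, which this naive version does not.

```python
import io
import re

class FilteringSocketFile(io.RawIOBase):
    """Hypothetical file-like wrapper that strips a known-bad byte
    pattern from the incoming stream before the XML parser sees it."""

    # Illustrative pattern standing in for whatever the offending
    # client actually emits.
    BAD = re.compile(br'<iTeleport[^>]*/>')

    def __init__(self, raw):
        self.raw = raw

    def read(self, size=-1):
        data = self.raw.read(size)
        # Naive: assumes the bad pattern never spans two reads.
        return self.BAD.sub(b'', data) if data else data
```

In effect this is exactly the "sit in the socket and watch the stream" approach; the wrapper just gives the filtering a seam to live in.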

@jfarrimo

I've done more digging, and the fault isn't so much SleekXMPP's as it is that the expat parser doesn't provide any way to handle errors. That is probably fine most of the time, and it's how the XML specification says things should work, but sometimes we have to handle errors anyway. When Google passes us bad XML in iq messages, we have to roll with it, since they're an 800-pound gorilla. It would be just the same if Facebook were the culprit.

It looks like the lxml parser has a mode where it will try to correct errors, but I'm not entirely comfortable with that either. In a perfect world, the expat parser would hand you the offending XML, let you deal with it as best you can, then take it back for a retry.

The one thing that really irks me is that the expat parser won't even give you back the offending piece of XML. I had to put logging in the socket function just to get it to tell me what it was blowing up on. All that they give you is a useless line number and offset. Ugh!
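For what it's worth, the line number and offset can at least be turned back into the offending text if you still have the buffer around, which is essentially what the logging in the socket function does by hand. A small stdlib-only illustration (the sample data here is made up):

```python
import xml.parsers.expat

# Unquoted attribute value: malformed XML, like the kind gchat relays.
data = '<iq><bad attr=foo></bad></iq>'

parser = xml.parsers.expat.ParserCreate()
try:
    parser.Parse(data, True)
except xml.parsers.expat.ExpatError as err:
    # expat reports only a position (err.lineno, err.offset); slicing
    # the buffer around it recovers the text it was blowing up on.
    line = data.splitlines()[err.lineno - 1]
    snippet = line[max(0, err.offset - 20):err.offset + 20]
```

That still requires keeping the raw buffer yourself; expat never hands it back.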

@legastero
Collaborator

I'm working on moving away from iterparse() and working directly with the expat parser (see https://gist.github.com/3716705). If you see a way to get the error correction hooks you need in there, we can add it in.
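For reference, driving expat directly is just a matter of installing handlers and feeding it bytes as they arrive; a toy version (not the gist's actual code) looks like this, and the natural place for an error-correction hook would be a try/except around each Parse() call:

```python
import xml.parsers.expat

parser = xml.parsers.expat.ParserCreate()
events = []

# Record start/end element events as expat sees them.
parser.StartElementHandler = lambda name, attrs: events.append(('start', name))
parser.EndElementHandler = lambda name: events.append(('end', name))

# Feed data incrementally, as it would arrive from the socket.
parser.Parse('<presence from="a@b.com">', False)
parser.Parse('</presence>', False)
```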

@jfarrimo

FYI, this has brought to light a bigger bug that I haven't found a workaround for yet. I'm counting up exceptions caught during session_start() and trying to exit the program after a certain number in a row (I'm taking this to mean something's badly broken). This, however, doesn't work. Here's what's going on:

  1. in the session_start() event, call self.send_presence() and self.get_roster(), as usual, catching exceptions they throw
  2. immediately get the xml stream error from the bad presence data received from iTeleportService
  3. call self.disconnect() to end it, since things are unrecoverably hosed at this point
  4. infinite loop encountered, program never stops, threads never die, etc...

I'm pretty sure this is because XMLStream._process() automatically calls self.reconnect() on getting this error (as it should in case of network errors, etc.). That somehow overrides my self.disconnect() call. Basically, I can't make it stop.

I'm doing all of this so that, in the future, when more problems with malformed XML in presence stanzas crop up, I'll notice it and shut down. Otherwise the program will continually try to reconnect and it'll be difficult to detect that things have gone wrong.

@jfarrimo

Here's the offending code:

# assumes: from sleekxmpp.exceptions import IqError, IqTimeout
def session_start(self, blank_event):
    self.logger.debug("starting session")
    try:
        self.send_presence()
        self.get_roster()
    except IqError as err:
        self.start_errors += 1
        self.logger.error(
            'There was an error getting the roster (%r): error count = %s',
            err.iq['error']['condition'], self.start_errors)
        if self.start_errors >= self.error_limit:
            self.logger.error("Too many startup errors encountered, giving up")
            self.disconnect()
    except IqTimeout:
        self.start_errors += 1
        self.logger.error(
            'Server is taking too long to respond: error count = %s',
            self.start_errors)
        if self.start_errors >= self.error_limit:
            self.logger.error("Too many startup errors encountered, giving up")
            self.disconnect()
    else:
        self.start_errors = 0
        self.online = True
        self.logger.info("session started")

@legastero
Collaborator

Interesting, I'll look into this. I know there are several issues related to state changes like this that need addressing.

For now, this looks like a good candidate for the new abort() method, which just shuts everything down without going through the full stream closing sequence.

@jfarrimo

Oh, sweetness! I hadn't noticed that. I'll plug it in now and give it a shot.

Thanks!

Jay Farrimond: Merge remote-tracking branch 'upstream/develop' into develop
Conflicts:
	sleekxmpp/xmlstream/xmlstream.py
2ea67e2
@jfarrimo

So, it turns out that abort() doesn't work out of the box in this situation. I had to comment out the self.stop.clear() call at the beginning of XMLStream._connect(). Apparently there is a race condition between the event thread and the main thread: the event thread calls abort() in the window after the main thread calls reconnect() but before reconnect() transitions from the disconnected to the connected state (there looks to be a 2 second wait in there). Do you know of any dangers from removing this self.stop.clear() call? The other option would be to add a similar self.aborting member and check it before calling self.stop.clear().
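The race boils down to a lost update on a threading.Event. A stripped-down illustration, with a bare Event standing in for XMLStream's stop flag (the function names here only mirror the real methods, they aren't the actual implementation):

```python
import threading

stop = threading.Event()

def abort():
    # Event thread: request a full shutdown.
    stop.set()

def _connect():
    # Main thread, mid-reconnect: the call in question, which
    # erases any abort request set since reconnect() started.
    stop.clear()

# The event thread aborts while the main thread is between
# reconnect() and reaching the connected state:
abort()
_connect()

# The abort request is lost, so the processing loop never sees it
# and keeps running.
assert not stop.is_set()
```

Moving the clear() out of the per-attempt path, or guarding it with a separate aborting flag, closes the window.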

@legastero
Collaborator

Good catch. The self.stop.clear() call really ought to be in connect() instead of _connect(). That should eliminate the race condition.

@jfarrimo

Ok, cool.

Jay Farrimond added some commits Dec 31, 2013