Out of order ack-ing? #33
Poking around in the transport:
I'm not sure /what's/ going on. It looks like everything is getting sent correctly. The data going out the socket is correctly ordered.
Hmm, that is odd. What version of RabbitMQ is this?
RabbitMQ 3.6.5, Erlang 19.1
Traceback of the error, with transmitted and received messages:
How about on the RabbitMQ side? I found some similar errors on Google, but they are pretty old (2-3 years ago at least).
The only thing in the RabbitMQ logs is:
One thing of note is that I'm on a link that is.... not great. RTT is ~170 ms, with periodic spikes (I'm on the west coast of the US; the server is in France, I think). I'm just confused because TCP should guarantee in-order delivery, so we should be able to assume the underlying transport is reliable, if not timely, and it seems like that's not true here. Let me see if I can convince RabbitMQ to output more diagnostics.
Are you using multiple threads per channel? What I've read online suggests a threading issue, but based on the logs I can't see how that could possibly be.
Hmmmm, I have 3 threads per channel: one for rx, one for tx, and one for management/monitoring. The acks should be strictly ordered, though. Actually, let me idiot-check. Ok, yeah, the sequential writes for the two acks leading up to the exception are from the same thread. Unfortunately, there isn't (or at least wasn't) a way to do ping-pong transmit/receive, so I think you always need two threads (one for tx, one for rx)?
The only scenario that I can visualize in my head would be two threads competing for the write lock. Unlike rx, there is no outgoing queue, so unless the lock is already owned, a write should always go directly to the socket. Where are you outputting the write logs? Make sure that you log after the lock is acquired. If you are not using SSL, you could even try to remove the thread lock in write_to_socket.
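For illustration, a minimal sketch of the pattern being described, logging only after the lock is acquired (class and attribute names are hypothetical, not AMQPStorm's actual internals):

```python
import threading

class Transport(object):
    """Hypothetical transport wrapper; illustrative only."""

    def __init__(self, sock):
        self._socket = sock
        self._write_lock = threading.Lock()

    def write_to_socket(self, frame_data):
        with self._write_lock:
            # Log inside the lock, so the log order matches the order
            # bytes actually hit the socket.
            print('writing %d bytes' % len(frame_data))
            self._socket.sendall(frame_data)
```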
Already there - In
I am using SSL.
Yea - this is making absolutely no sense. Could you capture some packets with Wireshark or tcpdump and verify the order?
Working on something like that. Let me try to produce a sane minimal example before I go any further. It could be something really silly in my environment.
Sounds good - let me know what you find!
Will do.
Aaand of course, my minimal example doesn't crash.
I wonder if this is related to a single channel consuming two queues? I can reproduce it, I think, with two channel.basic.consume calls defined on the same channel, along the lines of the sketch below.
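For reference, a hedged sketch of that reproduction (queue names and connection details are made up):

```python
import amqpstorm

connection = amqpstorm.Connection('localhost', 'guest', 'guest')
channel = connection.channel()
channel.queue.declare('queue_a')
channel.queue.declare('queue_b')

def on_message(message):
    message.ack()

# Two consumers defined on the same channel.
channel.basic.consume(on_message, queue='queue_a', no_ack=False)
channel.basic.consume(on_message, queue='queue_b', no_ack=False)
channel.start_consuming()
```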
That sounds likely, but I was testing it myself a bit, without seeing the crash. OTOH:
My test harness is here, if it's worth anything: https://github.com/fake-name/ReadableWebProxy/blob/everything-is-fire/amqpstorm_test.py I think I'm going to rip up and rewrite most of my amqp stuff tomorrow. It's been getting more and more creaky and brittle as I add new things and (don't) refactor, and it's /probably/ at least in part responsible for this weirdness.
Actually, that was a false alarm from my side. I was deferring the acks, but forgot to pop them from the list. So it was trying to ack the same messages over and over again.
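In other words, something like this (a hedged sketch of the bookkeeping bug, not the actual code; message objects are assumed to expose ack(), as AMQPStorm's do):

```python
deferred_acks = []

def defer_ack(message):
    deferred_acks.append(message)

def flush_acks():
    # The bug: iterating without popping, so every flush re-acked the
    # same delivery tags. Popping as you go fixes it.
    while deferred_acks:
        deferred_acks.pop(0).ack()
```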
Feel free to take a look at some of the rpc examples I have available - assuming that any one of them fits your use case. If not, feel free to provide me with specs and I'll add it as an example.
OTOH, now stuck in:
With RabbitMQ reporting 0 unacked messages.
Actually, use the final patch:
Derp. Will do. |
If it's still an issue, we can add a simple timeout to that function. There is no way that, under normal circumstances, you should be stuck there longer than a second or two. Gonna write some unit-tests around this.
Yeah, I just patched a timeout into it. Working on merging from master now.
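Something along these lines (a hedged sketch; the condition polled here is a stand-in for whatever the function actually waits on):

```python
import time

def wait_with_timeout(check, timeout=2.0, poll_interval=0.01):
    # Bound a previously unbounded wait loop: poll until check() passes
    # or the timeout expires.
    start = time.time()
    while not check():
        if time.time() - start > timeout:
            raise RuntimeError('still waiting after %.1fs' % timeout)
        time.sleep(poll_interval)
```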
A lot of changes, but mostly in the unit-tests, so hopefully it won't be too tricky to merge. I did make a change to the io.close call that should speed things up a bit though.
I'm basically just clobbering everything but my
For that matter, if I put a unicode string into a message and send it, it gets auto-encoded? If the transport is binary, it might make more sense to require inputs to be bytestrings. That way, you avoid encoding ambiguities. I'm pushing largely binary strings everywhere, and having a heuristic somewhere trying to potentially decode binary messages feels weird.
Yes, if you don't need that, it might be good to set it to False. It might add some extra CPU cycles to have it enabled.
This was actually implemented to make code for Python 2 work better with Python 3 code. As an example, this would work with Python 2.7, but would require a decode in Python 3, as the message will always be received as a bytestring by default in Python 3.
Python 2
Python 3
Python 3 with auto-decode
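A hedged sketch of the difference being described (the payload and literals are made up):

```python
import sys

# Raw AMQP payloads arrive as bytes on the wire.
payload = b'hello world'

if sys.version_info[0] == 2:
    # Python 2: bytes and native str are the same type, so this works.
    assert payload == 'hello world'
else:
    # Python 3: bytes != str, so an explicit decode is required...
    assert payload.decode('utf-8') == 'hello world'
    # ...unless the library auto-decodes the body on receipt, which is
    # what the auto-decode behaviour discussed above provides.
```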
Yeah, it gets fiddly when you're trying to bridge the two versions, definitely. When I updated my fork of amqpstorm, I defaulted to not decoding in all circumstances (by changing the function definitions). OTOH, for py3k only, I'd say enforce that all correlation IDs must be bytestrings. That way, you avoid any possible ambiguity. Right now, you basically have the situation that py3k was designed to prevent, namely usually-functional bytestring/string interaction.
Makes sense. Something we can address in the next major patch.
Yeah, I'm just not sure if that's something you'd want to do. I'm a big fan of py3k's enforcement of separation, but if you're trying to support both 2 and 3, it gets really hard to do without having a different API for each version. Personally, I've just decided to go 3k only, but then the only thing that's affected for me is my personal silliness.
I am actually looking at making a Python 3.5+ only version. Making use of some of the new threading features would be useful, but impossible to keep backwards compatible.
Heh. Coroutines ftw.
Ok, MOAR STUFF. I think the issue /may/ be an interaction with recover(). I wrote a minimal test-harness which simply doesn't ack() every 3rd message. I then periodically call recover(). Is it possible there's something odd going on with the recovered messages? Could the delivery tags from the ostensibly original delivery be being used, rather than the redelivered tags?
The good news here is I can reliably reproduce this issue, and my old setup that was having issues also sometimes used recover(). Additionally, the exception appears to happen some time after recover() is called, which could be why the message logs /looked/ fine. Anyways, https://github.com/fake-name/ReadableWebProxy/blob/master/amqpstorm_test.py reproduces this exact issue repeatably. My initial assumption here is that I'm using recover() wrong, but I can't see how it should work if just calling it and expecting redelivery isn't correct.
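The core of that harness looks roughly like this (a hedged sketch; connection details and the queue name are made up):

```python
import amqpstorm

connection = amqpstorm.Connection('localhost', 'guest', 'guest')
channel = connection.channel()
channel.queue.declare('test.recover')

state = {'seen': 0}

def on_message(message):
    state['seen'] += 1
    if state['seen'] % 3 != 0:
        message.ack()  # every 3rd message is deliberately left unacked
    if state['seen'] % 25 == 0:
        channel.basic.recover(requeue=True)  # ask for redelivery

channel.basic.consume(on_message, queue='test.recover', no_ack=False)
channel.start_consuming()
```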
Yea, honestly never really used it. A possibility is that I need to clear the inbound queue whenever recover is called. Since all messages will be re-delivered, the old ones may need to be cleared first.
Testing the following at the moment and it is looking good.
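A hedged sketch of that kind of fix (the internal attribute names are approximations of AMQPStorm's channel internals, not the actual patch):

```python
from pamqp import specification

def recover(self, requeue=False):
    """Sketch of a Channel.recover that clears the inbound buffer first."""
    with self.lock:
        # Drop messages the client has already buffered; the server is
        # about to redeliver everything unacked anyway.
        self._inbound = []
        return self.rpc_request(specification.Basic.Recover(requeue=requeue))
```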
Good find!
One thing that does stick out for recover is that it's meant to be synchronous.
I'm not sure if the eventual crash I saw was due to the threading context of my test or not, though. My actual application project serializes all RabbitMQ stuff (the MQ interface is owned by a separate thread, with queues for all operations), so I'm testing it there now.
Yea, it will still die eventually. The documentation on recover is unfortunately quite vague. My understanding of the problem is as follows.
With my fix the major issue gets resolved, but as it's threaded, it's possible a message was already being processed before the queue was successfully cleared, and it still gets processed twice.
I haven't found a single AMQP library that actually clears the inbound queue, so I do wonder how exactly this command is supposed to be handled.
Anyways, assuming I can guarantee serialization (which I can in my case), it seems like it at least mostly solves the issue. Looking around for other people using basic.recover, it seems like there are a few questions about using it from various languages, with little follow-up. I think the usual case involves just restarting the connection, rather than trying to recover anything while keeping the connection alive.
Yea, I think I'll ask in the RabbitMQ forums - not found anything useful. I never implemented support for basic.recover-async as it was deprecated, but it would be easy for you to try it out. I don't think it's simply a proxy call.
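Trying it out would presumably look something like this (hedged; assumes pamqp's specification classes and the channel's write_frame method):

```python
import amqpstorm
from pamqp import specification

connection = amqpstorm.Connection('localhost', 'guest', 'guest')
channel = connection.channel()

# basic.recover-async is fire-and-forget: no RPC round trip, just write
# the frame on an open channel.
channel.write_frame(specification.Basic.RecoverAsync(requeue=True))
```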
I'm kind of confused about what the difference between the two is. They sound like they basically do the same thing. It's worth noting that there is a flag on redelivered messages. Possibly a good workaround to the multiple-acking-a-message issue would be to process redelivered messages first, and maintain a log of the delivery tags. Then, when processing normal messages, if you reach a message that's been seen as a redelivered message, just discard the original version. I can't see a way to do it without maintaining some local state, though.
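A hedged sketch of that workaround (process() is a hypothetical application handler; assumes the redelivered flag and delivery tag are exposed on the message object, as they are on AMQPStorm's Message):

```python
seen_redelivered_tags = set()

def process(message):  # hypothetical application handler
    print(message.body)

def handle(message):
    if message.redelivered:
        seen_redelivered_tags.add(message.delivery_tag)
    elif message.delivery_tag in seen_redelivered_tags:
        # Already handled as a redelivery; discard the original version.
        return
    process(message)
    message.ack()
```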
Yea, feels like it would add a lot of complexity, especially when trying to keep this thread-safe. I'll look into it though. I asked the following in the RabbitMQ forum. Feel free to fill in any missing information.
Hahahahaha, so the deprecated version seems more fully functional. Sigh.
So both recover and recover-async are apparently deprecated, based on the response from that thread. I think you've basically got two options here.
Yeah, I guess I'm stuck working around the issue. How annoying. They could at least update their documentation!
Whoooo, more weirdness!
Context:
I have a connection with a thread processing the received messages. I have instrumented Message.ack() with a print statement that prints the delivery tag it's acking. It appears I'm calling [delivery tag 6].ack(), then [delivery tag 7].ack(), and somehow the ack for delivery tag 7 is getting received by the RabbitMQ server /first/, resulting in a PRECONDITION_FAILED error, because acking 7 implicitly acks previous tags, and therefore 6 is not a valid delivery tag anymore.
I'm working on pulling out a testable example, but it's certainly odd.
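The instrumentation was roughly this (a hedged sketch of wrapping Message.ack, not the exact patch):

```python
import amqpstorm

_original_ack = amqpstorm.Message.ack

def traced_ack(self):
    # Print the delivery tag just before the ack is written out.
    print('acking delivery tag %s' % self.delivery_tag)
    return _original_ack(self)

amqpstorm.Message.ack = traced_ack
```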
Incidentally, the new docs pages are fancypants!