
Add the ability to forcefully close a connection. #29

Closed
wants to merge 6 commits into from

Conversation

@fake-name commented Aug 30, 2016

Basically, I'm dealing with a high-volume AMQP connection across a high-latency, unreliable (intercontinental) link.

If there is a way for a connection to possibly wedge, I'll encounter it.

Anyways, the issue I ran into here is that it's possible for the "close" operation on a connection to wedge indefinitely, preventing the connection from actually closing (I /think/ this happens when the shutdown RPC request gets garbled/eaten somewhere).

The question of /how/ this happens aside, I therefore need to be able to kill an open connection in a dirty manner. This adds the ability to kill() a connection, which forces the worker thread to exit immediately without bothering to do any proper cleanup.

Because this is an operation that's generally done from an additional watcher process, outside of the normal program flow, I used multiprocessing primitives to manage the _die flag (at one point, the watcher was a separate /process/, not a thread).

Anyways, I'm not sure if this is in line with the ideas of the library, but I haven't been able to wedge the AMQP connection with this patch set, so that's something.
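
Roughly, the shape of the idea (a trimmed-down sketch, not the actual patch; the worker internals here are made up for illustration):

    import multiprocessing
    import threading
    import time

    class KillableWorker(object):
        def __init__(self):
            # multiprocessing primitive so a separate watcher process/thread can set it
            self._die = multiprocessing.Event()
            self._thread = threading.Thread(target=self._work)
            self._thread.daemon = True

        def start(self):
            self._thread.start()

        def kill(self):
            """Non-blocking: flag the worker loop to exit without any cleanup."""
            self._die.set()

        def _work(self):
            while not self._die.is_set():
                # the real I/O loop would go here; poll the flag every iteration
                time.sleep(0.1)
            # flag set: return immediately, skipping the close handshake entirely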


Only tested on Py3.5 x64, Ubuntu 14.04.

Things I haven't done: Additional unit tests.

@codecov-io commented Aug 30, 2016

Current coverage is 99.74% (diff: 59.37%)

Merging #29 into master will decrease coverage by 0.16%

@@             master        #29   diff @@
==========================================
  Files            64         64          
  Lines          5668       5557   -111   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits           5663       5543   -120   
- Misses            5         14     +9   
  Partials          0          0          

Powered by Codecov. Last update 2c37a45...dc9b4e5

@eandersson (Owner) commented Aug 30, 2016

Hey,

Thanks for the pull request. I actually saw your changes a few weeks ago in your project. I was quite amused by some of your commit messages. =]

I have actually been considering implementing a flag on connection.close, something along the lines of connection.close(forcefully=True). I'll try to review your changes in the upcoming weeks, after my vacation.

@fake-name (Author) commented Aug 30, 2016

I saw you had watched my project, which is actually why I figured I'd make the pull request.

And yes, I tend to be an angry, sweary coder, particularly on personal projects I'm working on purely for my own interest.

Anyways, I don't pretend that the above pull request is necessarily /good/ code; it's just functional enough to work in an environment with an extremely unreliable network.

FWIW, this is the only AMQP library that I've been able to convince to be genuinely durable.

AMQPStorm would wedge on close sometimes, but the library is actually structured well enough that I could hack the killer calls in this pull request into it.


Anyways, practically speaking, I think having a "force" flag on the close() method would be less immediately clear. IIRC, connection.close() blocks, so the assumption is that you have a thread blocked on close() when you call kill(), and kill() should be explicitly documented as non-blocking. That way, you can be confident that the management/watchdog process doing the killing will not block.
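
Something like this usage pattern (just a sketch; kill() here is the method this PR proposes, and the 30-second timeout is arbitrary):

    import threading

    def close_or_kill(connection, timeout=30.0):
        # arm a watchdog that fires kill() if the graceful close doesn't return in time
        watchdog = threading.Timer(timeout, connection.kill)
        watchdog.start()
        try:
            connection.close()   # may block waiting on the close handshake
        finally:
            watchdog.cancel()    # graceful close finished (or kill already fired)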

No hurry, have a fun vacation!

@eandersson (Owner) commented Sep 9, 2016

I am glad you found it durable. My whole design is about keeping things simple and straightforward, and making sure that we fail fast and reliably.

Reviewing this I actually noticed that I wasn't always closing the connection cleanly. It is generally considered good behavior to wait for confirmation from RabbitMQ that the connection was closed successfully, so that the connection and channels get cleaned up immediately on the server side. Previously it would only wait for the response when closing a channel. This fix will actually make things worse for you, though.

Anyway, I added a flag called "wait_for_rpc". I am not going to pretend that it will fix your problem, since it does lack some components, and I am still looking into adopting your kill concept, but if you have time, feel free to test this as an alternative.

You can enable this new behavior like this.

connection.close(wait_for_rpc=False)

@fake-name (Author)
I tend to be more of a "hacker" than an actual proper "programmer", so I wouldn't pretend my modifications are elegant or cleanly implemented in any way.

In any event, I'm apparently working across an unreliable link (the RabbitMQ server is in Europe, the host is on the west coast of the US), and it will periodically drop or garble (not sure which) packets. The problem I encountered is that if there is any requirement for a disconnect handshake/RPC, it will eventually fail at some point, leaving the connection wedged.

I think the "proper" solution here is to have disconnection also respect timeouts: if a close request doesn't receive a response within some function of the socket timeout (3x the socket timeout or similar), the connection is considered "closed", even without confirmation. Roughly the sketch below.
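
(A sketch only; request_close/close_confirmed/mark_closed are made-up stand-ins, not real library calls.)

    import time

    def close_with_deadline(connection, socket_timeout=10.0, multiplier=3):
        connection.request_close()                # hypothetical: send the close RPC
        deadline = time.time() + socket_timeout * multiplier
        while not connection.close_confirmed():   # hypothetical: close-ok received?
            if time.time() >= deadline:
                break                             # no confirmation; stop waiting
            time.sleep(0.1)
        connection.mark_closed()                  # hypothetical: tear down regardless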

@eandersson (Owner) commented Sep 10, 2016

That might actually be the issue, as with the current implementation the RPC call for closing channels does not respect the TCP timeout. Instead it uses the individual Channel RPC timeout, which defaults to 60s.

I think that sending an RPC for each channel when closing the connection is a bad idea. Instead I think we should leave it up to the client to properly close the channel.

e.g.

channel.close()
connection.close()

I removed the wait_for_rpc boolean, and connection.close should now respect the socket timeout (well, socket timeout + 0.25 is the current value). I also applied this to thread.join(timeout).

With this, the connection should always be properly closed within the socket timeout (+ ~0.25s).
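
Roughly this shape (illustrative only, not the actual internals):

    GRACE = 0.25  # matches the "+ ~0.25s" mentioned above

    def bounded_join(io_thread, socket_timeout):
        io_thread.join(timeout=socket_timeout + GRACE)
        return not io_thread.is_alive()  # False means the worker is still wedged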

@fake-name (Author)
I'm pretty sure the problem I was encountering was channels wedging on close when the close request had been lost without a corresponding closure of the socket. The worker thread would then get stuck indefinitely.

It's also possible it wasn't stuck indefinitely, but just for longer than I was willing to wait while debugging, though I did reduce the timeouts to (IIRC) 30 seconds without any change.

@eandersson (Owner) commented Sep 10, 2016

Yep, that should hopefully not be an issue any longer. You may however want to skip channel.close() and go straight to connection.close() when trying to kill the connection fast.

@fake-name (Author)
It should probably be noted that I've found a pile of issues in this PR, so don't merge it, whatever you do.

Other comments: _build_message_body() in channel.py can (I think) wedge indefinitely if the connection dies or is closed while a message is being constructed and the entire message body has not been received yet. This is exacerbated by large message sizes. My system fragments messages into 100 KB segments, and allowing larger messages is basically the easiest way to get deadlocks ever.

I've run 100+ MB messages through AMQP, but its durability is heavily, HEAVILY dependent on the connection load percentage. If you have one HUGE message every 5 minutes, it'll be fine. If you have continuous huge messages, the connection (I think) starves for control-channel bandwidth, and everything is on fire.
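
Something along these lines would avoid it (a sketch under the assumption that the real loop reads until the full body size has arrived; the channel attributes and helpers here are made up, not the library's API):

    import time

    def build_message_body(channel, body_size, timeout=60.0):
        body = bytearray()
        deadline = time.time() + timeout
        while len(body) < body_size:
            if channel.is_closed or time.time() >= deadline:
                raise RuntimeError('connection lost or timed out mid-message')
            chunk = channel.read_body_chunk()  # hypothetical helper
            if chunk:
                body.extend(chunk)
            else:
                time.sleep(0.01)
        return bytes(body)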

@fake-name closed this Oct 22, 2016

@fake-name (Author)
It might be a good idea to go through and audit every while loop in the codebase. I'm still consistently getting things wedged in various places, for reasons I haven't tracked down yet (I can't replicate the issue on command).

@eandersson (Owner)
Interesting. I'll take a look at that build message.

Even though the code never got merged, I learnt a lot from your pull request.

I have a lot more ideas based on this that I want to implement, but I am in the US for a few months and my laptop broke, so at the moment I don't have a workstation to work from at home.

@eandersson (Owner)
Might be good to add some sort of timeout or something similar on the while loops.
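
Something like this pattern, for example (illustrative only): replace unbounded spins with a deadline check.

    import time

    def wait_for(predicate, timeout=30.0, interval=0.1):
        deadline = time.time() + timeout
        while not predicate():
            if time.time() >= deadline:
                raise TimeoutError('gave up waiting after %.1f seconds' % timeout)
            time.sleep(interval)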

@fake-name (Author) commented Oct 22, 2016

Yeah, it looks like my major source of wedging was indeed _build_message_body().

No worries about turnaround time or "WHY U NO FIX" or whatever. I'm mostly trying to get this noted down just in case you'd be interested, or for when I look at it in a few months and wonder why the heck I did what I did.

I'm "fixing" this (read: hacking around it horribly) in my vendored version of the library, so this is more of an "FYI" than anything.

@eandersson (Owner)
Yep, and I appreciate the feedback. I am here to write a kickass library and to try to level up my coding skills. It's always helpful to have someone else review the code and report flaws.

@eandersson (Owner)
Pushed a patch using my phone, lol. That might fix that issue.

@fake-name (Author)
Wow, you're way more hardcore than I am. You put me on a box without Sublime Text and two monitors and I'm all "Ugh... I can't work like this."

@eandersson (Owner)
Hah, I normally have three monitors and complain when I have to use my laptop. ;) I guess that's what a month without either does to you.
