
Add the ability to forcefully close a connection. #29

Closed
wants to merge 6 commits into from

Conversation

@fake-name commented Aug 30, 2016

Basically, I'm dealing with a high-volume AMQP connection across a high-latency, unreliable (intercontinental) link.

If there is a way for a connection to possibly wedge, I'll encounter it.

Anyways, the issue I ran into here is that it's possible for the "close" operation on a connection to wedge indefinitely, preventing the connection from actually closing (I /think/ this happens when the shutdown RPC request gets garbled/eaten somewhere).

The question of /how/ this happens aside, I therefore need to be able to kill an open connection in a dirty manner. This adds the ability to kill() a connection, which forces the worker thread to exit immediately without bothering to do any proper cleanup.

Because this is an operation that's generally done from an additional watcher process, outside of the normal program flow, I used multiprocessing primitives to manage the _die flag (at one point, the watcher was a separate /process/, not a thread).

Anyways, I'm not sure if this is in line with the ideas of the library, but I haven't been able to wedge the AMQP connection with this patch set, so that's something.
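
Roughly, the shape of the idea (a trimmed-down sketch, not the actual patch; the worker internals here are made up for illustration):

    import multiprocessing
    import threading
    import time

    class KillableWorker(object):
        def __init__(self):
            # multiprocessing primitive so a separate watcher process/thread can set it
            self._die = multiprocessing.Event()
            self._thread = threading.Thread(target=self._work)
            self._thread.daemon = True

        def start(self):
            self._thread.start()

        def kill(self):
            """Non-blocking: flag the worker loop to exit without any cleanup."""
            self._die.set()

        def _work(self):
            while not self._die.is_set():
                # the real I/O loop would go here; poll the flag every iteration
                time.sleep(0.1)
            # flag set: return immediately, skipping the close handshake entirely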


Only tested on Py3.5 x64, Ubuntu 14.04.

Things I haven't done: Additional unit tests.

@codecov-io commented Aug 30, 2016

Current coverage is 99.74% (diff: 59.37%)

Merging #29 into master will decrease coverage by 0.16%

@@             master        #29   diff @@
==========================================
  Files            64         64          
  Lines          5668       5557   -111   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits           5663       5543   -120   
- Misses            5         14     +9   
  Partials          0          0          

Powered by Codecov. Last update 2c37a45...dc9b4e5

@eandersson (Owner) commented Aug 30, 2016

Hey,

Thanks for the pull request. I actually saw your changes a few weeks ago in your project. I was quite amused by some of your commit messages. =]

I have actually been considering implementing a flag on connection.close, something along the lines of connection.close(forcefully=True). I'll try to review your changes in the upcoming weeks, after my vacation.

@fake-name (Author) commented Aug 30, 2016

I saw you had watched my project, which is actually why I figured I'd make the pull request.

And yes, I tend to be an angry, sweary coder, particularly on personal projects I'm working on purely for my own interest.

Anyways, I don't pretend that the above pull request is necessarily /good/ code; it's just functional enough to work in an environment with an extremely unreliable network.

FWIW, this is the only AMQP library that I've been able to convince to be genuinely durable.

AMQPStorm would wedge on close sometimes, but the library is actually structured well enough that I could hack the killer calls in this pull request into it.


Anyways, practically speaking, I think having a "force" flag on the close() method would be less immediately clear. IIRC, connection.close() blocks, so the assumption is that you have a thread blocked on close() when you call kill(), and kill() should be explicitly documented as non-blocking. That way, you can be confident that the management/watchdog process doing the killing will not block.
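
Something like this usage pattern (just a sketch; kill() here is the method this PR proposes, and the 30-second timeout is arbitrary):

    import threading

    def close_or_kill(connection, timeout=30.0):
        # arm a watchdog that fires kill() if the graceful close doesn't return in time
        watchdog = threading.Timer(timeout, connection.kill)
        watchdog.start()
        try:
            connection.close()   # may block waiting on the close handshake
        finally:
            watchdog.cancel()    # graceful close finished (or kill already fired)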

No hurry, have a fun vacation!

@eandersson (Owner) commented Sep 9, 2016

I am glad you found it durable. My whole design is about keeping things simple and straightforward, and making sure that we fail fast and reliably.

Reviewing this I actually noticed that I wasn't always closing the connection cleanly. It is generally considered good behavior to wait for confirmation from RabbitMQ that the connection was closed successfully, so that the connection and channels get cleaned up immediately on the server side. Previously it would only wait for the response when closing a channel. This fix will actually make things worse for you, though.

Anyway, I added a flag called "wait_for_rpc". I am not going to pretend that it will fix your problem, since it does lack some components, and I am still looking into adopting your kill concept, but if you have time, feel free to test this as an alternative.

You can enable this new behavior like this.

connection.close(wait_for_rpc=False)

@fake-name (Author)
I tend to be more of a "hacker" than an actual proper "programmer", so I wouldn't pretend my modifications are elegant or cleanly implemented in any way.

In any event, I'm apparently working across an unreliable link (the RabbitMQ server is in Europe, the host is on the west coast of the US), and it will periodically drop or garble (not sure which) packets. The problem I encountered is that if there is any requirement for a disconnect handshake/RPC, it will eventually fail at some point, leaving the connection wedged.

I think the "proper" solution here is to have disconnection also respect timeouts: if a close request doesn't receive a response within some function of the socket timeout (3x the socket timeout or similar), the connection is considered "closed", even without confirmation. Roughly the sketch below.
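
(A sketch only; request_close/close_confirmed/mark_closed are made-up stand-ins, not real library calls.)

    import time

    def close_with_deadline(connection, socket_timeout=10.0, multiplier=3):
        connection.request_close()                # hypothetical: send the close RPC
        deadline = time.time() + socket_timeout * multiplier
        while not connection.close_confirmed():   # hypothetical: close-ok received?
            if time.time() >= deadline:
                break                             # no confirmation; stop waiting
            time.sleep(0.1)
        connection.mark_closed()                  # hypothetical: tear down regardless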

@eandersson (Owner) commented Sep 10, 2016

That might actually be the issue, as with the current implementation the RPC call for closing channels does not respect the TCP timeout. Instead it uses the individual Channel RPC timeout, which defaults to 60s.

I think that sending an RPC for each channel when closing the connection is a bad idea. Instead I think we should leave it up to the client to properly close the channel.

e.g.

channel.close()
connection.close()

I removed the wait_for_rpc boolean, and connection.close should now respect the socket timeout (well, socket timeout + 0.25 is the current value). I also applied this to thread.join(timeout).

With this, the connection should always be properly closed within the socket timeout (+ ~0.25s).
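
Roughly this shape (illustrative only, not the actual internals):

    GRACE = 0.25  # matches the "+ ~0.25s" mentioned above

    def bounded_join(io_thread, socket_timeout):
        io_thread.join(timeout=socket_timeout + GRACE)
        return not io_thread.is_alive()  # False means the worker is still wedged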

@fake-name (Author)
I'm pretty sure the problem I was encountering was channels wedging on close when the close request had been lost without a corresponding closure of the socket. The worker thread would then get stuck indefinitely.

It's also possible it wasn't stuck indefinitely, but just for longer than I was willing to wait while debugging, though I did reduce the timeouts to (IIRC) 30 seconds without any change.

@eandersson (Owner) commented Sep 10, 2016

Yep, that should hopefully not be an issue any longer. You may however want to skip channel.close() and go straight to connection.close() when trying to kill the connection fast.

@fake-name (Author)
It should probably be noted that I've found a pile of issues in this PR, so don't merge it, whatever you do.

Other comments: _build_message_body() in channel.py can (I think) wedge indefinitely if the connection dies or is closed while a message is being constructed and the entire message body has not been received yet. This is exacerbated by large message sizes. My system fragments messages into 100 KB segments, and allowing larger messages is basically the easiest way to get deadlocks ever.

I've run 100+ MB messages through AMQP, but its durability is heavily, HEAVILY dependent on the connection load percentage. If you have one HUGE message every 5 minutes, it'll be fine. If you have continuous huge messages, the connection (I think) starves for control-channel bandwidth, and everything is on fire.
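
Something along these lines would avoid it (a sketch under the assumption that the real loop reads until the full body size has arrived; the channel attributes and helpers here are made up, not the library's API):

    import time

    def build_message_body(channel, body_size, timeout=60.0):
        body = bytearray()
        deadline = time.time() + timeout
        while len(body) < body_size:
            if channel.is_closed or time.time() >= deadline:
                raise RuntimeError('connection lost or timed out mid-message')
            chunk = channel.read_body_chunk()  # hypothetical helper
            if chunk:
                body.extend(chunk)
            else:
                time.sleep(0.01)
        return bytes(body)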

@fake-name closed this Oct 22, 2016

@fake-name (Author)
It might be a good idea to go through and audit every while loop in the codebase. I'm still consistently getting things wedged in various places, for reasons I haven't tracked down yet (I can't replicate the issue on command).

@eandersson (Owner)
Interesting. I'll take a look at that build message.

Even though the code never got merged, I learnt a lot from your pull request.

I have a lot more ideas based on this that I want to implement, but I am in the US for a few months and my laptop broke, so at the moment I don't have a workstation to work from at home.

@eandersson (Owner)
Might be good to add some sort of timeout or something similar on the while loops.
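
Something like this pattern, for example (illustrative only): replace unbounded spins with a deadline check.

    import time

    def wait_for(predicate, timeout=30.0, interval=0.1):
        deadline = time.time() + timeout
        while not predicate():
            if time.time() >= deadline:
                raise TimeoutError('gave up waiting after %.1f seconds' % timeout)
            time.sleep(interval)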

@fake-name (Author) commented Oct 22, 2016

Yeah, it looks like my major source of wedging was indeed _build_message_body().

No worries about turnaround time or "WHY U NO FIX" or whatever. I'm mostly trying to get this noted down just in case you'd be interested, or for when I look at it in a few months and wonder why the heck I did what I did.

I'm "fixing" this (read: hacking around it horribly) in my vendored version of the library, so this is more of an "FYI" than anything.

@eandersson (Owner)
Yep, and I appreciate the feedback. I am here to write a kickass library and to try to level up my coding skills. It's always helpful to have someone else review the code and report flaws.

@eandersson (Owner)
Pushed a patch using my phone, lol. That might fix that issue.

@fake-name (Author)
Wow, you're way more hardcore than I am. You put me on a box without Sublime Text and two monitors and I'm all "Ugh... I can't work like this."

@eandersson (Owner)
Hah, I normally have three monitors and complain when I have to use my laptop. ;) I guess that's what a month without either does to you.
