Fix to stuck at shutdown #257

Closed

@sonots
Member

commented Jan 31, 2014

I ran into trouble where fluentd sometimes does not terminate.
Using sigdump, I found that fluentd was stuck at TCPSocket.new on shutdown. Please see https://gist.github.com/sonots/e05d8fbcd7ab39e4a5f8 for details.

So I implemented SocketUtil.create_tcp_socket and SocketUtil.open_tcp_socket, which take a connect_timeout option, as suggested in https://bugs.ruby-lang.org/issues/5101.

PS. Please wait before merging. I will test this for a week in my production environment.

@sonots


test/plugin/socket_util.rb Outdated
Fluent::Test.setup
end

BAD_HOST = '192.0.2.1' # TEST-NET http://www.faqs.org/rfcs/rfc3330.html

@sonots

sonots Jan 31, 2014

Author Member

Could anyone check this line?
I found that TCPSocket.new can easily get stuck on a non-existent host, so I used a TEST-NET address to reproduce the problem in the test. Do you think this is okay?

@sonots

sonots Jan 31, 2014

Author Member

I no longer use the TEST-NET address.

#
# cf. https://bugs.ruby-lang.org/issues/5101
def create_tcp_socket(host, port, opts={})
connect_timeout = opts[:connect_timeout] || 5.0

@sonots

sonots Jan 31, 2014

Author Member

Could anyone check this line? The default timeout is 5.0 sec. Do you think this is fine?

@repeatedly

Member

commented Jan 31, 2014

TCPSocket.open(listen_address, @port) {|sock| } is there to stop the event loop.
If that connection times out, does it still affect the event loop?

@sonots

Member Author

commented Feb 4, 2014

I again hit a case where fluentd does not die. The sigdump is here: https://gist.github.com/sonots/8797574.

It shows that the process was stuck at the Socket.pack_sockaddr_in(port, host) line of SocketUtil#create_tcp_socket:

def create_tcp_socket(host, port, opts={})
  connect_timeout = opts[:connect_timeout] || 5.0
  # The process was stuck on the next line.
  addr = Socket.pack_sockaddr_in(port, host)
  s = Socket.new(:AF_INET, :SOCK_STREAM, 0)
  begin
    s.connect_nonblock(addr)
  rescue Errno::EINPROGRESS
    # Wait until the socket is writable, or give up after connect_timeout seconds.
    IO.select(nil, [s], nil, connect_timeout) or raise Timeout::Error
  end
  s
end
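
For readers following along, here is a minimal sketch of the same non-blocking connect pattern, written as a standalone helper. The name connect_with_timeout and its argument order are hypothetical and are not the code in this pull request. Compared with the snippet above, it also re-checks the connect result after IO.select returns, since a writable socket can mean either success or failure. Note that the timeout only covers the connect itself: Socket.pack_sockaddr_in may perform name resolution, which this sketch does not bound.

```ruby
require 'socket'
require 'timeout'

# Hypothetical helper for illustration only (not fluentd's SocketUtil).
def connect_with_timeout(host, port, connect_timeout = 5.0)
  # May block on name resolution when host is not a numeric address;
  # connect_timeout does not apply to this call.
  addr = Socket.pack_sockaddr_in(port, host)
  sock = Socket.new(:AF_INET, :SOCK_STREAM, 0)
  begin
    sock.connect_nonblock(addr)
  rescue IO::WaitWritable
    # Connection is in progress: wait until writable or until the timeout.
    unless IO.select(nil, [sock], nil, connect_timeout)
      sock.close
      raise Timeout::Error, "connect to #{host}:#{port} timed out"
    end
    begin
      # Calling connect again reports the outcome of the pending connect.
      sock.connect_nonblock(addr)
    rescue Errno::EISCONN
      # Already connected: success.
    rescue SystemCallError
      sock.close
      raise
    end
  end
  sock
end
```

A caller would use it like `sock = connect_with_timeout('127.0.0.1', 24224, 1.0)` and rescue Timeout::Error; the port number here is only an example.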
@repeatedly

Member

commented Feb 18, 2014

Ruby bug related to this issue:

https://bugs.ruby-lang.org/issues/9525

@sonots

Member Author

commented Mar 6, 2014

Let me summarize here:

  1. SocketUtil.create_tcp_socket resolved some of the cases where fluentd does not terminate. Before applying it, fluentd got stuck very often, about once in every 3 restarts of the fluentd cluster. After applying it, the frequency dropped to about once in 10.
  2. Fluentd still got stuck at Socket.pack_sockaddr_in, but this was resolved by applying the patches from https://bugs.ruby-lang.org/issues/9525.
  3. (New!) Fluentd still gets stuck at @lsock.close. See https://gist.github.com/sonots/9392668

It looks like the 3rd, new issue is caused by what @repeatedly pointed out:

TCPSocket.open(listen_address, @port) {|sock| } is there to stop the event loop.
If that connection times out, does it still affect the event loop?

I will implement a timeout option for Coolio::Loop#run_once, as @frsyuki suggested at https://twitter.com/frsyuki/status/428645916333965312
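
To illustrate why a timeout on the loop iteration helps here, the following is a rough pure-Ruby sketch. It is not cool.io's code, and the class name TinyLoop and its methods are hypothetical. The idea it assumes: if each iteration waits on IO.select with a timeout, the loop re-checks its stop flag periodically, so shutdown no longer depends on a dummy connection arriving to wake it up.

```ruby
require 'socket'

# Hypothetical stand-in for an event loop (illustration only, not cool.io).
class TinyLoop
  attr_reader :accepted

  def initialize(server)
    @server = server
    @running = true
    @accepted = 0
  end

  # Ask the loop to exit; it notices within one timeout period.
  def stop
    @running = false
  end

  def run(timeout = 0.5)
    while @running
      # Returns nil when nothing is readable within `timeout` seconds,
      # which gives the loop a chance to re-check @running.
      ready, = IO.select([@server], nil, nil, timeout)
      next unless ready
      begin
        @server.accept_nonblock.close
        @accepted += 1
      rescue IO::WaitReadable
        # The pending connection went away; try again on the next pass.
      end
    end
  end
end
```

With this shape, calling stop from another thread is enough to end run, at the cost of waking up once per timeout period even when idle.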

@sonots changed the title from "[WIP] connect_timeout option for TCPSocket.new" to "[WIP] Fix that fluentd stuck at shutdown" on Apr 14, 2014

@sonots changed the title from "[WIP] Fix that fluentd stuck at shutdown" to "[WIP] Fix to stuck at shutdown" on Apr 14, 2014

@sonots

Member Author

commented Apr 17, 2014

Implemented Coolio::Loop#run(timeout) in tarcieri/cool.io#29 and #297.
I restarted my fluentd cluster 50 times, and it got stuck 0 times 💯

@sonots sonots closed this Apr 17, 2014

@sonots sonots deleted the sonots:tcp_socket_open_timeout branch Apr 17, 2014

@sonots changed the title from "[WIP] Fix to stuck at shutdown" to "Fix to stuck at shutdown" on Sep 5, 2014
