Failed TLS handshake #91

Closed
lmorfitt opened this issue Sep 19, 2013 · 34 comments

Comments

@lmorfitt

Hello,

I suspect this is something dumb, but I can't for the life of me figure out why lumberjack has started to throw these errors.

2013/09/19 12:36:37.093618 Failed to tls handshake with 192.168.51.61:5043 read tcp 192.168.51.61:5043: i/o timeout
2013/09/19 12:36:38.093839 Connecting to 192.168.51.61:5043
2013/09/19 12:36:53.094597 Failed to tls handshake with 192.168.51.61:5043 read tcp 192.168.51.61:5043: i/o timeout
2013/09/19 12:36:54.094838 Connecting to 192.168.51.61:5043

We had been using it for a while with no issues; then Elasticsearch ran out of disk space. We cleaned up the disk and now we are seeing these errors.

Restarting logstash, Elasticsearch, the indexer, and lumberjack made no difference.

thanks
Luke

@lmorfitt
Author

Never mind; I think this was a data issue, as we had some date mismatch problems.

thanks
Luke

@Moqume

Moqume commented Nov 8, 2013

FWIW, I'm getting this for random log shippers as well. If I restart Logstash (with a Lumberjack input), they'll be able to re-connect again.

@jesusaurus

I'm also seeing lumberjack throw these errors intermittently. When this happens, sometimes (but not always) I see these errors in the logstash log once every couple of seconds:

{:timestamp=>"2013-11-12T19:10:54.440000+0000", :message=>"Input thread exception", :plugin=><LogStash::Inputs::Lumberjack ssl_certificate=>"/etc/ssl/certs/logstash.crt", ssl_key=>"/etc/ssl/private/logstash.key", type=>"msgaas", charset=>"UTF-8", host=>"0.0.0.0">, :exception=>#<SocketError: problem when accepting>, :backtrace=>["org/jruby/ext/socket/RubyTCPServer.java:174:in `accept'", "jar:file:/mnt/logstash/logstash.jar!/META-INF/jruby.home/lib/ruby/1.9/openssl/ssl-internal.rb:160:in `accept'", "file:/mnt/logstash/logstash.jar!/lumberjack/server.rb:49:in `run'", "file:/mnt/logstash/logstash.jar!/logstash/inputs/lumberjack.rb:42:in `run'", "file:/mnt/logstash/logstash.jar!/logstash/agent.rb:761:in `run_input'", "file:/mnt/logstash/logstash.jar!/logstash/agent.rb:407:in `start_input'"], :level=>:warn}
{:timestamp=>"2013-11-12T19:10:54.460000+0000", :message=>"Restarting input due to exception", :plugin=><LogStash::Inputs::Lumberjack ssl_certificate=>"/etc/ssl/certs/logstash.crt", ssl_key=>"/etc/ssl/private/logstash.key", type=>"msgaas", charset=>"UTF-8", host=>"0.0.0.0">, :level=>:error}

But these errors don't always occur -- and when they do, the lumberjack errors continue long after these errors stop. After restarting logstash, the lumberjack agents will reconnect... for a while.

I'm running lumberjack 0.3.1 and logstash 1.1.12

@phildougherty

I'm also running into this issue. lumberjack 0.3.1 and logstash 1.2.2.

@lmorfitt
Author

lmorfitt commented Dec 9, 2013

Hi,
@phildougherty, @jesusaurus

This is normally a time/date-related issue; running ntpdate usually fixes it for me. The TLS handshake seems to be very sensitive to clock changes.

2013/12/09 17:47:54.260621 Read error looking for ack: read tcp 192.168.51.61:5043: i/o timeout
2013/12/09 17:47:54.260768 Loading client ssl certificate: /etc/ssl/lumberjack.crt and /etc/ssl/lumberjack.key
2013/12/09 17:47:54.435974 Setting trusted CA from file: /etc/ssl/lumberjack.pub
2013/12/09 17:47:54.436143 Connecting to 192.168.51.61:5043
2013/12/09 17:47:54.490049 Connected to 192.168.51.61:5043
2013/12/09 17:47:54.510958 Registrar received 1000 events
2013/12/09 17:47:54.511119 Saving registrar state.
2013/12/09 17:48:09.510983 Read error looking for ack: read tcp 192.168.51.61:5043: i/o timeout
2013/12/09 17:48:09.511099 Loading client ssl certificate: /etc/ssl/lumberjack.crt and /etc/ssl/lumberjack.key
2013/12/09 17:48:09.686265 Setting trusted CA from file: /etc/ssl/lumberjack.pub
2013/12/09 17:48:09.686432 Connecting to 192.168.51.61:5043
2013/12/09 17:48:09.740112 Connected to 192.168.51.61:5043

service ntp stop
ntpdate <NTP server IP>
service ntp start

2013/12/09 17:48:37.053861 Read error looking for ack: read tcp 192.168.51.61:5043: i/o timeout
2013/12/09 17:48:37.053980 Loading client ssl certificate: /etc/ssl/lumberjack.crt and /etc/ssl/lumberjack.key
2013/12/09 17:48:37.233354 Setting trusted CA from file: /etc/ssl/lumberjack.pub
2013/12/09 17:48:37.233532 Connecting to 192.168.51.61:5043
2013/12/09 17:48:37.287108 Connected to 192.168.51.61:5043
2013/12/09 17:48:37.441958 Registrar received 1000 events
2013/12/09 17:48:37.442119 Saving registrar state.
2013/12/09 17:48:49.850381 Registrar received 1000 events
2013/12/09 17:48:49.852707 Saving registrar state.
2013/12/09 17:48:57.735521 Registrar received 1000 events
2013/12/09 17:48:57.735709 Saving registrar state.
2013/12/09 17:49:05.115405 Registrar received 1000 events
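
To illustrate why clock skew can break the handshake: TLS certificate validation compares the verifier's local clock against the certificate's validity window, so a host whose clock has drifted outside [not_before, not_after] rejects a perfectly good certificate. A minimal Ruby sketch with a throwaway self-signed cert (illustrative only; lumberjack's actual client is Go):

```ruby
require "openssl"

# A cert is only acceptable while "now" falls inside its validity window;
# a drifted clock makes a valid cert look not-yet-valid or expired.
def cert_valid_at?(cert, now)
  now >= cert.not_before && now <= cert.not_after
end

# Build a throwaway self-signed certificate valid from now for one year.
key  = OpenSSL::PKey::RSA.new(2048)
cert = OpenSSL::X509::Certificate.new
cert.version     = 2
cert.serial      = 1
cert.subject     = cert.issuer = OpenSSL::X509::Name.parse("/CN=example")
cert.public_key  = key.public_key
cert.not_before  = Time.now
cert.not_after   = Time.now + 365 * 24 * 3600
cert.sign(key, OpenSSL::Digest.new("SHA256"))

puts cert_valid_at?(cert, Time.now)         # true
puts cert_valid_at?(cert, Time.now - 3600)  # false: clock an hour behind
```

This is why syncing the clock with ntpdate, as above, can make the handshake start working again.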

@lmorfitt lmorfitt reopened this Dec 9, 2013
@phildougherty

Has anyone managed to find a workaround for this issue? It seems to strike randomly, and fiddling about with ntp is not resolving the problem for me when it happens.

2013/12/17 05:23:43.492445 Read error looking for ack: read tcp x.x.x.x:xxxx: i/o timeout
2013/12/17 05:23:43.492531 Loading client ssl certificate: /opt/logstash-forwarder/ssl/logstash-forwarder.crt and /opt/logstash-forwarder/ssl/logstash-forwarder.key
2013/12/17 05:23:43.704927 Setting trusted CA from file: /opt/logstash-forwarder/ssl/ca.crt
2013/12/17 05:23:43.705219 Connecting to x.x.x.x:xxxx (x.x.x.x)
2013/12/17 05:23:43.764138 Connected to x.x.x.x
2013/12/17 05:23:58.764314 Read error looking for ack: read tcp x.x.x.x:xxxx: i/o timeout
2013/12/17 05:23:58.764395 Loading client ssl certificate: /opt/logstash-forwarder/ssl/logstash-forwarder.crt and /opt/logstash-forwarder/ssl/logstash-forwarder.key
2013/12/17 05:23:58.965437 Setting trusted CA from file: /opt/logstash-forwarder/ssl/ca.crt
2013/12/17 05:23:58.965735 Connecting to x.x.x.x:xxxx (x.x.x.x)
2013/12/17 05:23:59.024270 Connected to x.x.x.x

@Moqume

Moqume commented Dec 17, 2013

Unfortunately, for me the "fix" is to restart Logstash with a cron job every so often. The logs get queued with Lumberjack anyway, so no logs get lost. I haven't had the time to do any further investigating on my end.

@choffee

choffee commented Apr 7, 2014

I am getting loads of these from lots of servers, and lots of logs are not getting through.
ntp is up and running, and it is hit and miss whether restarting logstash helps or not.

I don't think this is time-related, as the error is coming from the lumberjack code, not the SSL connection, as far as I can tell.

The best "fix" I have so far is changing the end of the server.rb file in the library that logstash uses so that it sends an ack for every line (I think that is what this does), but even then I am still getting loads of timeouts.

def data(sequence, map, &block)
  puts sequence  # debug: show the sequence number received
  puts map       # debug: show the decoded event
  block.call(map)
  # Ack every event instead of waiting for a full window; the original
  # window check is commented out below.
  #if (sequence - @last_ack) >= @window_size
    @fd.syswrite(["1A", sequence].pack("A*N"))  # protocol version "1", frame type "A" (ack), big-endian sequence
    @last_ack = sequence
  #end
end

@driskell
Contributor

driskell commented Apr 7, 2014

Hi @choffee

Is your pipeline in logstash very busy? That might be the cause.

I implemented partial ACKs and separated communication from the pipeline in #180 to fix this. You could try that. It requires copying server.rb into logstash and also patching the forwarder.

I have that and other major statefile/prospector issues fixed in my repo too at:
https://github.com/driskell/logstash-forwarder

Jason
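
The partial-ACK idea can be sketched as follows: acknowledge every few events rather than only at the window boundary, so a slow pipeline still shows visible progress before the client's read timeout fires. The names and the ack interval here are illustrative, not the actual protocol from #180:

```ruby
# Hypothetical receiver that sends partial ACKs every `ack_every` events
# instead of a single ACK at the end of the window.
class PartialAckReceiver
  def initialize(window_size:, ack_every:, &on_ack)
    @window_size = window_size
    @ack_every = ack_every
    @last_ack = 0
    @on_ack = on_ack
  end

  # Called once per received event; sequence numbers increase by one.
  def data(sequence)
    if (sequence - @last_ack) >= @ack_every || sequence == @window_size
      @on_ack.call(sequence)  # the real server writes ["1A", seq].pack("A*N") here
      @last_ack = sequence
    end
  end
end

acks = []
rx = PartialAckReceiver.new(window_size: 10, ack_every: 3) { |seq| acks << seq }
(1..10).each { |seq| rx.data(seq) }
puts acks.inspect  # prints [3, 6, 9, 10]
```

With only the final window ACK (ack_every = window_size), a pipeline slower than window_size/timeout never acknowledges in time; partial ACKs keep resetting the client's timer.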

@choffee

choffee commented Apr 8, 2014

Hi @driskell

Thanks. I am trying that out now and it seems to be working great.

I don't think I am sending a massive amount of logs through (/me notes: should graph that), but when the main logstash process has been down for a while I think there is a bit of a flood, which seemed to break things.

Is there any chance of this making it into master soon? It would be great to have this working while we wait for zeromq, and to keep an option for those who still want encryption in the future.

john

@driskell
Contributor

driskell commented Apr 8, 2014

Good to hear @choffee

Unfortunately I'm not the project maintainer, so I wouldn't have a say. I hope it gets in. They might not like the new data frame version I added, though, but that's the only way to fix what was a broken implementation.

@mrteera

mrteera commented Jan 14, 2015

I have this issue too, and I still can't figure out how to fix it.

Jan 14 06:14:41 localhost :41Z - logstash-forwarder[14906]: 2015/01/14 06:14:41.392072 Failed to tls handshake with - read tcp -:5000: i/o timeout
Jan 14 06:14:42 localhost :42Z - logstash-forwarder[14906]: 2015/01/14 06:14:42.392444 Connecting to -:5000 (-)

@jordansissel
Contributor

@driskell partial acks don't solve stalls.

@jordansissel
Contributor

I'm not sure what this ticket is about.

The original filing was about TLS handshake timeouts. It was later reopened by @YummyLogs reporting a different problem (read ack timeout).

@mrteera Your problem reports read tcp -:5000 and - is not a valid hostname, so I don't know what's going on with your system.

Can we close this and open new tickets focused on each kind of error?

@driskell
Contributor

Partial acks solve read ack timeouts and solve event duplication (the forwarder resends ALL logs on timeout). If logstash does not push all 1024 spooled events (a full spool) through its pipeline within the network timeout of 15 seconds, it will always time out and resend all 1024. I've seen logstash continually duplicate logs for hours just because a 100 MB log file was added and sent all at once. Brokers alleviate this, but the broker can get full too...
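
That arithmetic can be checked with a quick back-of-the-envelope calculation. The spool size and timeout are the defaults cited in this thread; the pipeline rate is a made-up number for illustration:

```ruby
# If the pipeline cannot drain a full spool within the network timeout,
# the forwarder times out, reconnects, and resends the whole spool, forever.
spool_size      = 1024    # events the forwarder sends as one batch
network_timeout = 15.0    # seconds before "Read error looking for ack"

required_rate = spool_size / network_timeout
puts format("pipeline must sustain >= %.1f events/sec to ack in time", required_rate)

pipeline_rate = 50.0      # hypothetical throughput of a congested pipeline
time_to_drain = spool_size / pipeline_rate
verdict = time_to_drain > network_timeout ? "timeout, full resend" : "acked in time"
puts format("at %.0f ev/s the spool takes %.2f s -> %s", pipeline_rate, time_to_drain, verdict)
```

At the default settings, anything under roughly 68 events/second guarantees the resend loop.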

I admit I've confused the read ack timeout and TLS handshake timeout in other places though. TLS handshake timeout is unrelated to ack.

I did just recall a stall issue in Log Courier, the same as @myatu's. I'll raise a ticket for you to look into. It causes a failed handshake after a time, but restarting logstash fixes it.

@karthik87

@jordansissel

I have a strange issue wherein my logstash-forwarder (in a Docker container) connects to the logstash server (also in a Docker container) only when I reboot the instance running the forwarder. Both are running on Rackspace Cloud.

When I run the container for the first time, I just get the following:
Failure connecting to x.x.x.x dial tcp x.x.x.x:5043: i/o timeout

When I reboot and run the same container again, it connects and can ship logs smoothly.
But I can't restart my production servers this easily, so I am stuck here.

Could this be related to the firewall or iptables?

@juise

juise commented Feb 13, 2015

Please, please, please fix it. This problem has been around for about two years. It's disgusting!

@driskell
Contributor

@juise Such comments are unhelpful and unwelcome

@juise

juise commented Feb 13, 2015

Just to note, the following problem:

... Read error looking for ack: read tcp x.x.x.x:xxxx: i/o timeout
... Loading client ssl certificate: /opt/logstash-forwarder/ssl/logstash-forwarder.crt and /opt/logstash-forwarder/ssl/logstash-forwarder.key
... Setting trusted CA from file: /opt/logstash-forwarder/ssl/ca.crt
... Connecting to x.x.x.x:xxxx (x.x.x.x)
... Connected to x.x.x.x

may appear when you are using a Ruby module, for example one without try/catch blocks, and an exception occurs. No message about it appears in the log, which is very, very bad.

@jordansissel
Contributor

... Read error looking for ack: read tcp x.x.x.x:xxxx: i/o timeout

This occurs when the downstream server doesn't acknowledge within the defined timeout period.
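
For the read-ack timeout specifically, the forwarder's network timeout is configurable. A sketch of a logstash-forwarder config raising it (hostname and paths are placeholders; verify the option names against your version's documentation):

```json
{
  "network": {
    "servers": [ "logstash.example.com:5043" ],
    "ssl ca": "/opt/logstash-forwarder/ssl/ca.crt",
    "timeout": 60
  },
  "files": [
    { "paths": [ "/var/log/syslog" ], "fields": { "type": "syslog" } }
  ]
}
```

A longer timeout only papers over a slow pipeline; if logstash cannot drain the spool at all, the resend loop described elsewhere in this thread still occurs.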

... Loading client ssl certificate: /opt/logstash-forwarder/ssl/logstash-forwarder.crt and /opt/logstash-forwarder/ssl/logstash-forwarder.key

This is not an error.

... Setting trusted CA from file: /opt/logstash-forwarder/ssl/ca.crt

This is not an error

... Connecting to x.x.x.x:xxxx (x.x.x.x)

This is not an error

... Connected to x.x.x.x

This is not an error.

may appear when you are using a Ruby module

I'm not sure what this means.

@juise

juise commented Feb 13, 2015

If it's not an error, why is the only way to make it work again after this "not an error" a restart? I can reproduce it without any problems.

The problem may not be on the logstash-forwarder side; it may be logstash that dies.

@jordansissel
Contributor

is restart

A restart of what? logstash or logstash-forwarder?

@juise

juise commented Feb 13, 2015

Both: restart logstash and logstash-forwarder. The order doesn't matter.

@jordansissel
Contributor

The problem usually occurs when logstash is overwhelmed by the load logstash-forwarder is giving it, causing the forwarder to give up waiting for a response (i/o timeout) and try connecting to another logstash instance, if one is available.

In almost all circumstances, the problem is due to flow problems in logstash or something downstream of logstash itself (for example, an output's dependency being down, offline, or unavailable).

This ticket is about "Failed TLS handshake", which doesn't seem related to the 'Read error looking for ack: read tcp x.x.x.x:xxxx: i/o timeout' message you report.

@sledorze

sledorze commented Apr 1, 2015

Got this issue without load... (1 req / 10 mins)

@cameronevans

@sledorze could it possibly be issues with logstash itself? I have had the issue without load myself and it is usually a result of logstash's state.

@sledorze

sledorze commented Apr 1, 2015

Maybe, yes. And what was the fix? (It could be useful for logstash-forwarder users to know about it.)


@cameronevans

It was a number of reasons, usually plugin errors (in my case tcp or elasticsearch) that would cause logstash to halt, with the service just spinning it up again. Also, if the forwarder was inhaling a file with a great number of lines, it would sometimes fill up the queue and throw this error even though there were not many events at the time. The --debug option is useful in these cases, though it produces a lot of information to pore through.

@e8-preet

Thanks @choffee, modifying server.rb fixed my issue :)

@alexanderilyin

I've faced this twice in the last month. Only restarting logstash helps.

    2015/06/03 11:02:04.669083 Failed to tls handshake with X.X.X.X read tcp X.X.X.X:5000: connection reset by peer
    2015/06/03 11:02:05.669712 Connecting to [X.X.X.X]:5000 (example.com) 
    2015/06/03 11:02:21.670019 Failed to tls handshake with X.X.X.X read tcp X.X.X.X:5000: i/o timeout
    2015/06/03 11:02:22.670621 Connecting to [X.X.X.X]:5000 (example.com) 

@juise

juise commented Jun 3, 2015

Try modifying server.rb as @choffee suggested. This works in all versions, including the latest 1.5. It looks like the Elastic folks aren't interested in this problem.

@alexanderilyin

@juise I'm still sometimes facing the same problem and have to restart logstash to fix it.

@jordansissel
Contributor

We are working on solutions that include better messaging on downstream failures, and other things.


@jordansissel
Contributor

There's so much discussion of unrelated problems in this ticket that I'm going to close it, because I cannot follow it easily. If you are still affected by something in this ticket, please open a new ticket and provide whatever details you can. Thanks!
