
Lumberjack server resets all tcp connections when client fails SSL negotiation #160

Closed
nathanlburns opened this Issue Feb 6, 2014 · 21 comments


nathanlburns commented Feb 6, 2014

This bug was originally reported in logstash, but it belongs here.

https://logstash.jira.com/browse/LOGSTASH-1598

I've scanned the issues list here (not in depth) and believe this may explain some of the other problems reported here.

Our setup:

  • Many servers (Windows based) running logstash-forwarder, configured to ship to a round-robin DNS array of servers running logstash with the lumberjack input.

Our issue:

  • Yesterday morning, due to a boneheaded mistake by me, the certificate I had configured lumberjack with expired, causing all logs to stop being forwarded.
  • About an hour later, I had a new certificate generated and deployed on both shipper and receiver, logs started pouring in at this point.
  • Expecting an initial flood of messages, I closely monitored the system, but after several hours the flood did not abate.
  • Jumping into elasticsearch, I realized that we were getting a ton of duplicate events, sometimes up to 100 copies.
  • I added uuid to the receiver configuration and verified I was getting duplicate uuids on those messages. The problem was lumberjack related...
  • logstash-forwarder logs:
2014/02/06 17:52:06.538400 Connected to <redacted>
2014/02/06 17:52:06.961400 Read error looking for ack: EOF
2014/02/06 17:52:06.961400 Setting trusted CA from file: <redacted>l.crt
2014/02/06 17:52:06.962400 Connecting to <redacted>
2014/02/06 17:52:07.031400 Connected to <redacted>
2014/02/06 17:52:07.033400 Socket error, will reconnect: WSASend tcp 10.0.95.65:49219: An existing connection was forcibly closed by the remote host.
  • I ran across LOGSTASH-1598 researching these errors.
  • Confirmed we still have some windows boxes trying to use old cert.
  • Identified hosts that were using bad cert using tcpdump (tcp analysis in wireshark demonstrated the rejection pattern) and blocked them with iptables.
  • Duplicate messages go away.

Now, I know the root problem here is a configuration issue, but existing sessions should not be closed when another client fails to connect.

I've been looking into the code a bit and am currently reading the OpenSSL implementation in Ruby... At this point I'm not sure how to fix it.
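The failure mode described above can be modeled without real sockets or TLS. This is a hypothetical, simplified sketch (FakeConn and the loop shape are illustrative stand-ins, not the actual server.rb code) of the suspected accept-loop pattern: one client variable is shared across iterations, so the error handler ends up closing the previous, healthy connection when a new client's handshake fails before the assignment happens.

```ruby
# Stand-in for an accepted connection; handshake_ok models whether the
# client's certificate would pass SSL negotiation.
class FakeConn
  attr_reader :closed

  def initialize(handshake_ok)
    @handshake_ok = handshake_ok
    @closed = false
  end

  def handshake!
    raise "SSL handshake failed" unless @handshake_ok
  end

  def close
    @closed = true
  end
end

# Mirrors the shape of the suspected buggy server loop.
def run_accept_loop(incoming)
  client = nil
  accepted = []
  incoming.each do |conn|
    begin
      conn.handshake!          # raises for a client with a bad cert...
      client = conn            # ...so this assignment is skipped
      accepted << client
    rescue RuntimeError
      client.close if client   # BUG: `client` still references the
                               # last *successful* connection
    end
  end
  accepted
end

good = FakeConn.new(true)
bad  = FakeConn.new(false)     # e.g. a client with an expired cert
run_accept_loop([good, bad])
puts good.closed               # => true: the healthy session was killed
```

The model matches the symptom in the logs above: the established forwarder sees its socket forcibly closed the moment an unrelated client fails negotiation.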

jordansissel (Contributor) commented Feb 6, 2014

I did some digging and couldn't reproduce this. Still sounds like there's a bug somewhere, though.

rachin-fab commented Feb 21, 2014

The other thing we are seeing is the following error. It occurs on the logstash server at almost the same time as the "Read error looking for ack: EOF" at the forwarder, and it causes the logstash server to crash. A fix would be great, but at minimum can we handle this error more gracefully?

Exception in thread "LogStash::Runner" org.jruby.exceptions.RaiseException: (ConcurrencyError) Detected invalid array contents due to unsynchronized modifications with concurrent users
at org.jruby.RubyArray.<<(org/jruby/RubyArray.java:1147)
at LogStash::Codecs::Multiline.buffer(file:/opt/logstash/logstash.jar!/logstash/codecs/multiline.rb:159)
at LogStash::Codecs::Multiline.do_previous(file:/opt/logstash/logstash.jar!/logstash/codecs/multiline.rb:180)
at org.jruby.RubyMethod.call(org/jruby/RubyMethod.java:132)
at LogStash::Codecs::Multiline.decode(file:/opt/logstash/logstash.jar!/logstash/codecs/multiline.rb:154)
at RUBY.run(file:/opt/logstash/logstash.jar!/logstash/inputs/lumberjack.rb:46)
at org.jruby.RubyProc.call(org/jruby/RubyProc.java:271)
at Lumberjack::Connection.data(file:/opt/logstash/logstash.jar!/lumberjack/server.rb:241)
at RUBY.run(file:/opt/logstash/logstash.jar!/lumberjack/server.rb:223)
at Lumberjack::Parser.data_field_value(file:/opt/logstash/logstash.jar!/lumberjack/server.rb:182)
at Lumberjack::Parser.feed(file:/opt/logstash/logstash.jar!/lumberjack/server.rb:93)
at Lumberjack::Parser.compressed_payload(file:/opt/logstash/logstash.jar!/lumberjack/server.rb:198)
at Lumberjack::Parser.feed(file:/opt/logstash/logstash.jar!/lumberjack/server.rb:93)
at RUBY.run(file:/opt/logstash/logstash.jar!/lumberjack/server.rb:220)
at RUBY.run(file:/opt/logstash/logstash.jar!/lumberjack/server.rb:59)
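The ConcurrencyError in the trace above is JRuby detecting two threads appending to the same Ruby Array (the multiline codec's buffer) at once. A hedged sketch of the general mitigation, assuming nothing about the real codec's API (SynchronizedBuffer is a made-up name, not the logstash class): guard the shared buffer with a Mutex so concurrent connection threads cannot modify it simultaneously.

```ruby
# Minimal thread-safe line buffer: every read/modify of the underlying
# Array happens while holding the lock, which is what JRuby's
# ConcurrencyError check is asking for.
class SynchronizedBuffer
  def initialize
    @lock = Mutex.new
    @lines = []
  end

  def <<(line)
    @lock.synchronize { @lines << line }
  end

  # Atomically drain the buffer and return its contents as one string.
  def flush
    @lock.synchronize do
      joined = @lines.join("\n")
      @lines.clear
      joined
    end
  end
end

buf = SynchronizedBuffer.new
threads = 4.times.map do |i|
  Thread.new { 100.times { buf << "line from thread #{i}" } }
end
threads.each(&:join)
puts buf.flush.lines.count   # => 400, with no lost or torn appends
```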


nicholasf commented Feb 26, 2014

This needs to be fixed. I've just run into it.

jordansissel (Contributor) commented Feb 26, 2014

It'll be fixed soon. Sorry, as always, for the bugs.

nicholasf commented Feb 26, 2014

Thanks for the quick response. Looking forward to the fix.

DanielRedOak commented Mar 5, 2014

I added some findings from my setup to https://logstash.jira.com/browse/LOGSTASH-1598 but didn't know if I should post them here. They might be useful in reproducing the issue.

Basically, have a client connected to logstash via logstash-forwarder. Then open a telnet connection to logstash on the logstash-forwarder port, type in some garbage, and hit enter. Watch all legitimate connections that are sending events drop with "Read error looking for ack: EOF".
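For anyone scripting this reproduction, the same garbage-bytes trigger can be sent from Ruby instead of telnet. This is a hypothetical helper (the host, port, and surrounding setup are placeholders, not from this thread); the only requirement is that the payload is not a valid lumberjack frame, since lumberjack v1 frames begin with an ASCII "1" version byte followed by a frame-type byte.

```ruby
require "socket"

# Any bytes that don't form a lumberjack frame will do; real frames
# start with the version byte "1" (e.g. "1W" for a window-size frame).
GARBAGE = "hello this is not lumberjack\r\n"

# Hypothetical helper: point it at a live lumberjack input to trigger
# the handshake/parse failure described above.
def send_garbage(host, port)
  TCPSocket.open(host, port) { |sock| sock.write(GARBAGE) }
end

puts GARBAGE.start_with?("1")  # => false, so the server's parser rejects it
```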

nicolas-g commented Mar 12, 2014

I've just run into this as well after my SSL certs expired. I have replaced the certs on both the server and the agents with no luck. How can I fix this? Do I need to uninstall and reinstall everything from scratch?

driskell (Contributor) commented Mar 16, 2014

Hi guys,

I think the cause is that the last successful connection is closed if a new connection fails to "initialise" - a bit of a logic error. I can reproduce this by doing as @DanielRedOak stated: opening a telnet session and sending plain text.

However, it only closes the last successful connection - not all of them. Would this sound right? Or is it in fact the case that a single failure causes ALL to fail?

Regardless, I've fixed my particular observation in my fork and will get a PR set up once @jordansissel and co get some eyes onto the project. Because I've made so many changes I've had to start basing branches on my own changes (it's easier for me), so I can't PR at the moment until I know what will be merged and what will not, and where I need to rebase from.

Jason

driskell added a commit to driskell/logstash-forwarder that referenced this issue Mar 16, 2014

Fix one possible cause of #160; also add logging for failed connections along with the peer address (requires a patch to LogStash input to pass through the @logger)
driskell (Contributor) commented Mar 19, 2014

The changes are now in my repository:
https://github.com/driskell/logstash-forwarder

jmreicha commented May 21, 2014

Has there been any update or fix for this yet? I just ran into the "Read error looking for ack: EOF" error yesterday.

mipearson (Contributor) commented Aug 18, 2014

Confirmed here. Multiple servers with 'stale' bad certificates prevented our server with a good certificate from connecting. Only resolution was to shut down LSF on all servers, restart logstash, then ramp up servers with good certificates one by one.

That's two hours of openssl "what the hell, this certificate is fine" fuckery I won't get back :(

jordansissel (Contributor) commented Aug 18, 2014

> That's two hours of openssl "what the hell, this certificate is fine" fuckery I won't get back :(

For what it's worth, I'm sorry you wasted so much time fighting some bullshit computer problems on what is probably your weekend.

Computers are barely-functional pieces of shit. Software, specifically, is a miracle. I'm surprised my computer boots and gets online. For myself, despite trying to write good software, I often make things that work poorly or generally fuck up someone else's day, and it makes me sad. It's times like this I think about changing industries, because I'm not here to make things that ruin other people's days. :(

mipearson (Contributor) commented Aug 18, 2014

Australian, remember? It's how I spent the time after my Monday morning coffee :)

All good - was mentioning the lost time in the hopes of highlighting the severity of the issue. High impact combined with very difficult cause analysis.

Regards,

Michael Pearson
(sent from my phone)

driskell (Contributor) commented Aug 18, 2014

#180 fixes this as well as the "too many open files" problem, and adds logging of failing clients (it requires a logstash plugin change too, to pass through the logger).

I think this specific problem is:
https://github.com/elasticsearch/logstash-forwarder/blob/master/lib/lumberjack/server.rb#L54

If the SSL accept fails, client is not changed. It still holds the reference to the last successful accept, and that last connection gets killed.

The duplication is likely the partial-ack problem, though. If logstash can't keep up, working lumberjack connections hit the network timeout, reset, and resend logs. #180 adds partial acks.

I've stopped maintaining my LSF repo (with #180 in it), though, and maintain Log Courier instead (my LSF rewrite). That just means I'm not actively using #180 anymore myself, so it needs some testing - but I'm around to help fix bits or explain bits.

Jason
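That diagnosis suggests the shape of the fix: scope the error handling to the connection currently being accepted, so a failed handshake closes only that socket. A sketch using a simplified stand-in connection object (FakeConn is illustrative only, not a real server.rb type):

```ruby
# Stand-in for an accepted connection; handshake_ok models whether the
# client's certificate would pass SSL negotiation.
class FakeConn
  attr_reader :closed

  def initialize(handshake_ok)
    @handshake_ok = handshake_ok
    @closed = false
  end

  def handshake!
    raise "SSL handshake failed" unless @handshake_ok
  end

  def close
    @closed = true
  end
end

# Fixed loop shape: the rescue only ever touches the connection whose
# handshake raised, so prior healthy sessions are untouched.
def run_accept_loop_fixed(incoming)
  accepted = []
  incoming.each do |conn|
    begin
      conn.handshake!
      accepted << conn
    rescue RuntimeError
      conn.close             # close only the failing socket
    end
  end
  accepted
end

good = FakeConn.new(true)
bad  = FakeConn.new(false)   # e.g. a client with an expired cert
run_accept_loop_fixed([good, bad])
puts good.closed             # => false: the healthy session survives
```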

jordansissel (Contributor) commented Oct 8, 2014

Targeting a fix for logstash 1.5.

@suyograo suyograo assigned suyograo and jordansissel and unassigned suyograo Oct 21, 2014

@suyograo suyograo assigned jsvd and unassigned jordansissel Nov 11, 2014

maxhyjal commented Jan 7, 2015

Is this issue going to be fixed in the logstash 1.5 release? Or is log-courier a better solution at this point at least?

@jordansissel jordansissel assigned jordansissel and unassigned jsvd Jan 7, 2015

jordansissel (Contributor) commented Jan 7, 2015

I can't reproduce this on logstash master, but I can reproduce it on logstash 1.5.0 beta1. I'll try to figure out what the problem is.

jordansissel (Contributor) commented Jan 7, 2015

Strangely, I cannot reproduce it from git on the v1.5.0.beta1 tag either, but I can reproduce it from the v1.5.0.beta1 release tarball.

jordansissel (Contributor) commented Jan 7, 2015

On v1.5.0.beta1 I cannot reproduce the "resets all tcp connections" bug. The behavior I observe is that it closes the previously-accepted socket due to the line of code @driskell pointed out (client is the previous connection at exception time, because the assignment never occurred).

I can fix this "close the previous connection on handshake failure" behavior.

I can't yet reproduce the "closes ALL connections" problem this ticket mentions in the subject.

jordansissel (Contributor) commented Jan 7, 2015

Ahh, lsf master has this already fixed, and it works, but the code is not right. I'll patch that.

jordansissel added a commit that referenced this issue Jan 7, 2015

jordansissel (Contributor) commented Jan 7, 2015

I believe this is fixed and will appear in logstash 1.5.0.
