
Better document LSF backpressure behavior (reports of stalls, connection timeouts) #323

Closed
pacolotero opened this issue Dec 4, 2014 · 26 comments


@pacolotero

Recently my logstash-forwarder has stopped working.

It has begun to show these errors:

2014/12/04 11:27:59.183151 Failure connecting to 176.31.255.50: dial tcp 176.31.255.50:5042: i/o timeout
2014/12/04 11:28:00.200147 Connecting to [176.31.255.50]:5042 (log.domain.com)

but sometimes it gets a connection and sends events:

2014/12/04 11:30:54.481805 Failure connecting to 176.31.255.50: dial tcp 176.31.255.50:5042: i/o timeout
2014/12/04 11:30:55.550163 Connecting to [176.31.255.50]:5042 (log.domain.com) 
2014/12/04 11:30:55.630855 Connected to 176.31.255.50
2014/12/04 11:30:55.658211 Registrar: precessing 34 events
2014/12/04 11:30:55.666604 Registrar: precessing 2 events
2014/12/04 11:30:56.402660 Registrar: precessing 13 events
2014/12/04 12:05:21.389334 Read error looking for ack: read tcp 176.31.255.50:5042: i/o timeout

I currently run 3 logstash-forwarders and none of the others gives me this error, only this one.

@bensquire

I am having a similar problem:

2015/01/04 18:27:57.928754 Failure connecting to 178.62.19.121: dial tcp 178.62.19.121:5000: connection refused
2015/01/04 18:27:58.929155 Connecting to 178.62.19.121:5000 (178.62.19.121) 

It was working, then just stopped 😕

@loosi

loosi commented Jan 29, 2015

My forwarders were working for a long time; now they've just stopped. Has anyone found a working solution for that i/o timeout?

@eorochena

Same issue over here; I had to switch to RabbitMQ because of this.

@driskell
Contributor

@bensquire That usually means Logstash has crashed or stopped listening, so the server is rejecting the connection. Can you restart it? If this is the cause, check the Logstash logs for more information on what might have happened.

@loosi An i/o timeout is usually caused by a slow Logstash server. You need faster servers, or a lower spool size. My #180 was meant to fix this so it works for any amount of logs - I eventually forked to Log Courier to rewrite the protocol and fix this once and for all.

@eorochena See above, depending on what your issue is.
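
For anyone tuning these knobs: the spool size is a logstash-forwarder command-line flag, and the ack timeout lives in the "network" section of the forwarder's JSON config. A rough sketch with placeholder hostnames and paths - double-check the option names against the README for your build:

# Smaller spool = fewer events per ack cycle (the default is 1024)
logstash-forwarder -config /etc/logstash-forwarder.conf -spool-size 256

# /etc/logstash-forwarder.conf
{
  "network": {
    "servers": [ "log.example.com:5042" ],
    "ssl ca": "/etc/pki/tls/certs/logstash-forwarder.crt",
    "timeout": 60
  },
  "files": [
    { "paths": [ "/var/log/*.log" ] }
  ]
}

"timeout" (in seconds, default 15) is how long the forwarder waits for an ack before it gives up and reconnects; raising it buys a busy Logstash more time, at the cost of slower failover.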

@loosi

loosi commented Feb 2, 2015

@driskell
Thanks for your answer, but I'm not yet convinced that it's really a problem with the Logstash server. I've just upgraded my Logstash server to a pretty powerful machine that has nothing else to do right now. Logstash has a bunch of workers and increased memory, Elasticsearch is well tuned, and I still get i/o timeouts.
Strangely, it works when I restart the server that runs logstash-forwarder.

As far as I can see, your PR is not yet merged, so your changes are not included in the latest master (I build my RPM from master, so I am really up to date there). Logstash in debug mode just cuts off the logs; there is no error and no load, so I cannot see that the Logstash agent is doing anything wrong.

@bensquire

@driskell Thanks for this. As it happens, this turned out to be a misconfiguration of one of my patterns, apologies. Actually finding this out, however, was a bit of a pain in the harris...

@driskell
Contributor

driskell commented Feb 2, 2015

@loosi You may be right. I can see yours is "Failure connecting".
I've seen "reading ACK" and "Failure TLS handshake", but this one actually looks different. My suggestions above were for those.

Can you confirm what exactly the error is?

@loosi

loosi commented Feb 2, 2015

Connecting to [**]:6679 (__)
2015/02/02 20:41:49.709818 Connected to *

2015/02/02 20:42:14.729562 Read error looking for ack: read tcp **:6679: i/o timeout
I don't have any TLS problems at all, just these errors after connecting. Changing the spool size does not make any difference at all.

I'm missing a real debug mode. What's the correct way to debug this these days?

@jordansissel
Contributor

Read error looking for ack: read tcp **:6679: i/o timeout

This means there was a timeout waiting for an acknowledgement from Logstash. This occurs when Logstash is stuck (limited resources, blocked on something in an output, etc.) and lsf gives up waiting. The problem is almost always external to logstash-forwarder.

@driskell
Contributor

driskell commented Feb 2, 2015

@loosi @jordansissel It occurs more often when Logstash is simply busy processing a backlog. If there are 50 clients that all send 1024 events, it's going to take far more than 15 seconds to run through them. Worse, while Logstash is chugging through those 1024 events the timeout is hit and the forwarder resends them, which duplicates the events. That's why I created #180 to fix it (though I've learned a lot about writing better PRs since then :-) ).

Increasing the network timeout in the forwarder quite a bit does tend to get rid of the "Read error looking for ack" issue eventually, but that's not ideal IMHO. Another solution is to add a broker such as Redis: run a Logstash instance that pushes events straight into Redis as quickly as possible to avoid the timeout (Redis will need to hold the full backlog, though), and the indexing instance of Logstash can then take its time - a sketch of that layout is below.

There's always https://github.com/driskell/log-courier though - it contains the fix in #180
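
For anyone wanting to try the broker approach above, a minimal sketch of the shipping instance (ports, paths, and the Redis key are placeholders; check the plugin options for your Logstash version):

input {
  lumberjack {
    port => 5042
    ssl_certificate => "/etc/pki/tls/certs/logstash-forwarder.crt"
    ssl_key => "/etc/pki/tls/private/logstash-forwarder.key"
  }
}
# No filters here - this instance only moves events into the broker as fast as possible.
output {
  redis {
    host => "127.0.0.1"
    data_type => "list"
    key => "logstash"
  }
}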

@jordansissel
Contributor

@driskell Changing buffer strategies between lsf and Logstash doesn't solve the "Logstash, or something downstream, is overloaded" problem. Partial acks don't fix an overloaded Logstash; they would reduce the retransmission feedback cycle that can make things worse, but they don't address the original overload/stall cause.

@driskell
Contributor

driskell commented Feb 2, 2015

@jordansissel I thought overloading was the entire point. Why have a beefy CPU if everything dies the moment Logstash tries to use it? With partial ACK we can point 50 forwarders directly at Logstash and it will burn away at 100% while there's a backlog; once it has ploughed through it like a hungry hippo, it idles down and trickles along. Then a newsletter causes a spike, it jumps to 100% again for an hour, and you're happy. It might mean data isn't realtime for an hour or so, but at least you don't have to spec Logstash for spikes that might happen only once.

The broker buffer helps because the redis input is far more robust than the lumberjack input. With the redis input you can run Logstash at full capacity with no issues: because it is a pull mechanism, it is flow controlled. A Logstash instance with zero filters and a redis output is also extremely fast and can cope with most backlogs without timing out forwarders. So I would say it is a viable solution for some, or a workaround at worst.

Partial ack resolves the entire situation. It allows flow control: Logstash can indicate to forwarders that they should not disconnect and should not resend; they should just wait.

Stalling, now, is a different problem from overloading. Overloading is a normal, everyday occurrence: YouTube overloads my internet connection every day, and I expect it to, yet it doesn't cause an issue because it has flow control. But even for stalls, partial ACK benefits us. Without it, clients time out and we see errors all over the place, distracting us from the real issue. With partial ACK, clients stay silent, we see no logs flowing, and we know immediately that we have a stall somewhere. That makes debugging much easier.

@driskell
Contributor

driskell commented Feb 2, 2015

When I say overload above, I mean "Logstash is receiving logs faster than it can process them", and only temporarily.

Of course, permanent overload is another thing. Partial ack still helps there, though, since you can see your actual capacity and the 100% CPU; otherwise you've got false figures due to the timeouts and resends.

@jordansissel
Contributor

Timeouts were implemented specifically to switch servers when Logstash is overloaded; not switching servers is counter to that goal. I admit lsf doesn't implement this perfectly, because frankly, with only one server in the list, it should probably skip the ack-timeout-retry - it will just retry against the same server anyway.
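
To illustrate: with more than one entry in the forwarder's "servers" list (placeholder hostnames below), an ack timeout triggers a reconnect that can land on a different server, which is where the rough load balancing comes from:

  "servers": [
    "logstash-a.example.com:5042",
    "logstash-b.example.com:5042",
    "logstash-c.example.com:5042"
  ]

With a single entry, the same timeout just reconnects to the same (still overloaded) server.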

@driskell
Contributor

driskell commented Feb 2, 2015

Ah. That's very interesting.
Though in my experience, rather than one or two timing out, it's usually nearly all of them at once, at least at the spike levels I was seeing. It might be worth bearing in mind that I'm not sure such a system would even work reliably with 2 servers, since they would all flip over and just kill the second server too.
Though I guess the randomisation would kick in if there were 3 servers, and it would then balance.

@jordansissel
Contributor

@driskell Yeah, the idea was that randomization and ack-timeouts would lead to rough load balancing. Future work will involve actual load sharing (multiple connections, like @avleen's fork) and smarter handling of slow/stalled downstreams.

@loosi

loosi commented Feb 2, 2015

I just verified that when I disable all the other forwarders (6 forwarders, about 1k events per second in total) it works for a few seconds, then the timeouts come back. Why do the 6 other forwarders run flawlessly with far more logs than this single machine? It's a weird thing: some time ago we indexed 10k per second without a problem, and now, with a bigger machine, we don't get anywhere close to that number.

@driskell
Contributor

driskell commented Feb 2, 2015

@loosi Just a guess, but maybe the flawless ones are roughly realtime and OK, while the other one, although it has fewer logs overall, is a fair bit behind and is trying to catch up.

@jordansissel That sounds like the best plan. @avleen's fork looks interesting for sure - though at a glance I don't know how it keeps the registrar safe, since publishers could finish out of order.

@loosi

loosi commented Feb 3, 2015

I actually fixed it for now, and it's probably the best way to go for me: a single input/output-only Logstash instance piping from the lumberjack input to a redis output, with multiple Logstash processes pulling the events back out of Redis. Works like a charm at around 10k logs per second at the moment.
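
For reference, a minimal sketch of the indexing side of that layout (host, key, and filters are placeholders, and the elasticsearch output options differ between Logstash versions):

input {
  redis {
    host => "127.0.0.1"
    data_type => "list"
    key => "logstash"
  }
}
filter {
  # grok, date, etc. go here - the indexer can take its time because the redis input is pull-based
}
output {
  elasticsearch {
    host => "localhost"
  }
}

Because the indexers pull from Redis only as fast as they can process, the backlog accumulates in Redis instead of timing out the forwarders.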

@rabidscorpio

I keep getting the "Read error looking for ack" error, but I'm only running this on a single server, reporting to another server that receives events only from that single server and then puts them into a separate Elasticsearch cluster that only gets updates from that Logstash server.

@jordansissel
Contributor

I think all evidence so far in this ticket is similar to other tickets with these reports. The flow through logstash can apply backpressure such that lsf will get connection timeouts, ack timeouts, etc. lsf is programmed to retry in these situations.

The symptom is lsf getting these connection timeouts. The cause is almost always something downstream (Logstash, Elasticsearch, network problems, etc.) that causes log flow from lsf's file inputs to stall. I am sure we can improve how lsf handles these, but fundamentally I believe these problems are simply an expectation mismatch between users and reality; that is my fault for not documenting the behavior properly. Sorry for the confusion, folks. I hope this helps explain it a little.

@rabidscorpio

@jordansissel I couldn't disagree with you more. In my case, the stack I was testing had only a few servers sending data to a Logstash server with a 5-node Elasticsearch cluster behind it. There was no backpressure whatsoever, and I think you're completely discounting evidence that there might be other problems going on here.

@jordansissel
Contributor

@rabidscorpio Apologies. I did not intend to be dismissive. We get lots of tickets like this one where it's hard for me to follow because so many folks are trying to explain their problem. I want to help, but I don't currently have a good idea of what is causing these problems for users if the problem is not caused by a clog/stall downstream. I appreciate and acknowledge the troubles folks are reporting here, and I'm open to discussing your issue, but please help me help you: file a new ticket specific to your situation so we can debug that in isolation.

@rabidscorpio

@jordansissel Thanks for responding, Jordan. I can totally understand why it would be frustrating trying to diagnose these types of problems and I wish I could give you more details. Maybe that's part of the problem because there's just not enough detail about what's going on when this problem happens. I had the most detailed logging turned on for logstash (there was no intermediary Redis or Kafka) and there was nothing in the logs at all related to the disconnects. I'm no longer working on that particular project so I can't really file a new ticket, though, and everything I'm relating is from memory from April.

@jordansissel
Contributor

@rabidscorpio You won't be alone in debugging. It's usually simple for us to guide folks through getting the details we need, but it's made more difficult when we get 5+ people on a ticket saying they have the same problem, and I lose track of which specific situation I am debugging and what its cause is. Nobody wins :\

Sorry again that you ran into trouble. We're working on curating our tickets more actively so that we don't end up with a mix of situations and issues piling into a single ticket. Thank you for telling me about your situation :)

@ruflin added the libbeat label Sep 16, 2015
@jordansissel
Contributor

Thanks for helping make logstash-forwarder better!

Logstash-forwarder is going away and is replaced by filebeat and its friend, libbeat. If this is still an issue, would you mind opening a ticket there?
