
Better document LSF backpressure behavior (reports of stalls, connection timeouts) #323

Closed
pacolotero opened this issue Dec 4, 2014 · 26 comments


@pacolotero

Recently my logstash-forwarder has stopped working.

It has begun to show these errors:

2014/12/04 11:27:59.183151 Failure connecting to 176.31.255.50: dial tcp 176.31.255.50:5042: i/o timeout
2014/12/04 11:28:00.200147 Connecting to [176.31.255.50]:5042 (log.domain.com)

but sometimes it gets a connection and sends events:

2014/12/04 11:30:54.481805 Failure connecting to 176.31.255.50: dial tcp 176.31.255.50:5042: i/o timeout
2014/12/04 11:30:55.550163 Connecting to [176.31.255.50]:5042 (log.domain.com) 
2014/12/04 11:30:55.630855 Connected to 176.31.255.50
2014/12/04 11:30:55.658211 Registrar: precessing 34 events
2014/12/04 11:30:55.666604 Registrar: precessing 2 events
2014/12/04 11:30:56.402660 Registrar: precessing 13 events
2014/12/04 12:05:21.389334 Read error looking for ack: read tcp 176.31.255.50:5042: i/o timeout

I currently run 3 logstash-forwarders and none of the others gives me this error, only this one.

@bensquire

I am having a similar problem:

2015/01/04 18:27:57.928754 Failure connecting to 178.62.19.121: dial tcp 178.62.19.121:5000: connection refused
2015/01/04 18:27:58.929155 Connecting to 178.62.19.121:5000 (178.62.19.121) 

It was working, then just stopped 😕

@loosi

loosi commented Jan 29, 2015

My forwarders were working for a long time; now they've just stopped. Has anyone found a working solution for that i/o timeout?

@eorochena

Same issue over here; I had to switch to RabbitMQ because of this.

@driskell
Contributor

@bensquire That usually means Logstash has crashed or stopped listening, so the server is rejecting the connection. Can you restart it? If this is the cause, check the Logstash logs for more information on what might have happened.

@loosi An i/o timeout is usually caused by a slow Logstash server. You need faster servers, or a lower spool size. My #180 was meant to fix this so it works for any amount of logs - I eventually forked to Log Courier to rewrite the protocol and fix this once and for all.

@eorochena See above, depending on what your issue is.
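
For anyone tuning these knobs: the spool size is a logstash-forwarder command-line flag, and the ack timeout lives in the "network" section of the forwarder's JSON config. A rough sketch with placeholder hostnames and paths - double-check the option names against the README for your build:

# Smaller spool = fewer events per ack cycle (the default is 1024)
logstash-forwarder -config /etc/logstash-forwarder.conf -spool-size 256

# /etc/logstash-forwarder.conf
{
  "network": {
    "servers": [ "log.example.com:5042" ],
    "ssl ca": "/etc/pki/tls/certs/logstash-forwarder.crt",
    "timeout": 60
  },
  "files": [
    { "paths": [ "/var/log/*.log" ] }
  ]
}

"timeout" (in seconds, default 15) is how long the forwarder waits for an ack before it gives up and reconnects; raising it buys a busy Logstash more time, at the cost of slower failover.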

@loosi

loosi commented Feb 2, 2015

@driskell
Thanks for your answer, but I'm not yet convinced that it's really a problem with the Logstash server. I've just upgraded my Logstash server to a pretty powerful machine that has nothing else to do right now. Logstash has a bunch of workers and increased memory, Elasticsearch is well tuned, and I still get i/o timeouts.
Strangely, it works when I restart the server that runs logstash-forwarder.

As far as I can see, your PR is not yet merged, so your changes are not included in the latest master (I build my RPM from master, so I am really up to date there). Logstash in debug mode just cuts off the logs; there is no error and no load, so I cannot see that the Logstash agent is doing anything wrong.

@bensquire

@driskell Thanks for this. As it happens, this turned out to be a misconfiguration of one of my patterns, apologies. Actually finding this out, however, was a bit of a pain in the harris...

@driskell
Contributor

driskell commented Feb 2, 2015

@loosi You may be right. I can see yours is "Failure connecting".
I've seen "reading ACK" and "Failure TLS handshake", but this one actually looks different. My suggestions above were for those.

Can you confirm what exactly the error is?

@loosi

loosi commented Feb 2, 2015

Connecting to [**]:6679 (__)
2015/02/02 20:41:49.709818 Connected to *

2015/02/02 20:42:14.729562 Read error looking for ack: read tcp **:6679: i/o timeout
I don't have any TLS problems at all, just these errors after connecting. Changing the spool size does not make any difference at all.

I'm missing a real debug mode. What's the correct way to debug this these days?

@jordansissel
Contributor

Read error looking for ack: read tcp **:6679: i/o timeout

This means there was a timeout waiting for an acknowledgement from Logstash. This occurs when Logstash is stuck (limited resources, blocked on something in an output, etc.) and lsf gives up waiting. The problem is almost always external to logstash-forwarder.

@driskell
Contributor

driskell commented Feb 2, 2015

@loosi @jordansissel It occurs more often when Logstash is simply busy processing a backlog. If there are 50 clients that all send 1024 events, it's going to take far more than 15 seconds to run through them. Worse, while Logstash is chugging through those 1024 events the timeout is hit and the forwarder resends them, which duplicates the events. That's why I created #180 to fix it (though I've learned a lot about writing better PRs since then :-) ).

Increasing the network timeout in the forwarder quite a bit does tend to get rid of the "Read error looking for ack" issue eventually, but that's not ideal IMHO. Another solution is to add a broker such as Redis: run a Logstash instance that pushes events straight into Redis as quickly as possible to avoid the timeout (Redis will need to hold the full backlog, though), and the indexing instance of Logstash can then take its time - a sketch of that layout is below.

There's always https://github.com/driskell/log-courier though - it contains the fix in #180
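
For anyone wanting to try the broker approach above, a minimal sketch of the shipping instance (ports, paths, and the Redis key are placeholders; check the plugin options for your Logstash version):

input {
  lumberjack {
    port => 5042
    ssl_certificate => "/etc/pki/tls/certs/logstash-forwarder.crt"
    ssl_key => "/etc/pki/tls/private/logstash-forwarder.key"
  }
}
# No filters here - this instance only moves events into the broker as fast as possible.
output {
  redis {
    host => "127.0.0.1"
    data_type => "list"
    key => "logstash"
  }
}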

@jordansissel
Contributor

@driskell Changing buffer strategies between lsf and Logstash doesn't solve the "Logstash, or something downstream, is overloaded" problem. Partial acks don't fix an overloaded Logstash; they would reduce the retransmission feedback cycle that can make things worse, but they don't address the original overload/stall cause.

@driskell
Contributor

driskell commented Feb 2, 2015

@jordansissel I thought overloading was the entire point. Why have a beefy CPU if everything dies the moment Logstash tries to use it? With partial ACK we can point 50 forwarders directly at Logstash and it will burn away at 100% while there's a backlog; once it has ploughed through it like a hungry hippo, it idles down and trickles along. Then a newsletter causes a spike, it jumps to 100% again for an hour, and you're happy. It might mean data isn't realtime for an hour or so, but at least you don't have to spec Logstash for spikes that might happen only once.

The broker buffer helps because the redis input is far more robust than the lumberjack input. With the redis input you can run Logstash at full capacity with no issues: because it is a pull mechanism, it is flow controlled. A Logstash instance with zero filters and a redis output is also extremely fast and can cope with most backlogs without timing out forwarders. So I would say it is a viable solution for some, or a workaround at worst.

Partial ack resolves the entire situation. It allows flow control: Logstash can indicate to forwarders that they should not disconnect and should not resend; they should just wait.

Stalling, now, is a different problem from overloading. Overloading is a normal, everyday occurrence: YouTube overloads my internet connection every day, and I expect it to, yet it doesn't cause an issue because it has flow control. But even for stalls, partial ACK benefits us. Without it, clients time out and we see errors all over the place, distracting us from the real issue. With partial ACK, clients stay silent, we see no logs flowing, and we know immediately that we have a stall somewhere. That makes debugging much easier.

@driskell
Contributor

driskell commented Feb 2, 2015

When I say overload above, I mean "Logstash is receiving logs faster than it can process them", and only temporarily.

Of course, permanent overload is another thing. Partial ack still helps there, though, since you can see your actual capacity and the 100% CPU; otherwise you've got false figures due to the timeouts and resends.

@jordansissel
Contributor

Timeouts were implemented specifically to switch servers when Logstash is overloaded; not switching servers is counter to that goal. I admit lsf doesn't implement this perfectly, because frankly, with only one server in the list, it should probably skip the ack-timeout-retry - it will just retry against the same server anyway.
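
To illustrate: with more than one entry in the forwarder's "servers" list (placeholder hostnames below), an ack timeout triggers a reconnect that can land on a different server, which is where the rough load balancing comes from:

  "servers": [
    "logstash-a.example.com:5042",
    "logstash-b.example.com:5042",
    "logstash-c.example.com:5042"
  ]

With a single entry, the same timeout just reconnects to the same (still overloaded) server.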

@driskell
Contributor

driskell commented Feb 2, 2015

Ah. That's very interesting.
Though in my experience, rather than one or two timing out, it's usually nearly all of them at once, at least at the spike levels I was seeing. It might be worth bearing in mind that I'm not sure such a system would even work reliably with 2 servers, since they would all flip over and just kill the second server too.
Though I guess the randomisation would kick in if there were 3 servers, and it would then balance.

@jordansissel
Contributor

@driskell Yeah, the idea was that randomization and ack-timeouts would lead to rough load balancing. Future work will involve actual load sharing (multiple connections, like @avleen's fork) and smarter handling of slow/stalled downstreams.

@loosi

loosi commented Feb 2, 2015

I just verified that when I disable all the other forwarders (6 forwarders, about 1k events per second in total) it works for a few seconds, then the timeouts come back. Why do the 6 other forwarders run flawlessly with far more logs than this single machine? It's a weird thing: some time ago we indexed 10k per second without a problem, and now, with a bigger machine, we don't get anywhere close to that number.

@driskell
Contributor

driskell commented Feb 2, 2015

@loosi Just a guess, but maybe the flawless ones are roughly realtime and OK, while the other one, although it has fewer logs overall, is a fair bit behind and is trying to catch up.

@jordansissel That sounds like the best plan. @avleen's fork looks interesting for sure - though at a glance I don't know how it keeps the registrar safe, since publishers could finish out of order.

@loosi

loosi commented Feb 3, 2015

I actually fixed it for now, and it's probably the best way to go for me: a single input/output-only Logstash instance piping from the lumberjack input to a redis output, with multiple Logstash processes pulling the events back out of Redis. Works like a charm at around 10k logs per second at the moment.
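
For reference, a minimal sketch of the indexing side of that layout (host, key, and filters are placeholders, and the elasticsearch output options differ between Logstash versions):

input {
  redis {
    host => "127.0.0.1"
    data_type => "list"
    key => "logstash"
  }
}
filter {
  # grok, date, etc. go here - the indexer can take its time because the redis input is pull-based
}
output {
  elasticsearch {
    host => "localhost"
  }
}

Because the indexers pull from Redis only as fast as they can process, the backlog accumulates in Redis instead of timing out the forwarders.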

@rabidscorpio

I keep getting the "Read error looking for ack" error, but I'm only running this on a single server, reporting to another server that receives events only from that single server and then puts them into a separate Elasticsearch cluster that only gets updates from that Logstash server.

@jordansissel
Contributor

I think all evidence so far in this ticket is similar to other tickets with these reports. The flow through logstash can apply backpressure such that lsf will get connection timeouts, ack timeouts, etc. lsf is programmed to retry in these situations.

The symptom is lsf getting these connection timeouts. The cause is almost always something downstream (Logstash, Elasticsearch, network problems, etc.) that causes log flow from lsf's file inputs to stall. I am sure we can improve how lsf handles these, but fundamentally I believe these problems are simply an expectation mismatch between users and reality; that is my fault for not documenting the behavior properly. Sorry for the confusion, folks. I hope this helps explain it a little.

@rabidscorpio

@jordansissel I couldn't disagree with you more. In my case, the stack I was testing had only a few servers sending data to a Logstash server with a 5-node Elasticsearch cluster behind it. There was no backpressure whatsoever, and I think you're completely discounting evidence that there might be other problems going on here.

@jordansissel
Contributor

@rabidscorpio Apologies. I did not intend to be dismissive. We get lots of tickets like this one where it's hard for me to follow because so many folks are trying to explain their problem. I want to help, but I don't currently have a good idea of what is causing these problems for users if the problem is not caused by a clog/stall downstream. I appreciate and acknowledge the troubles folks are reporting here, and I'm open to discussing your issue, but please help me help you: file a new ticket specific to your situation so we can debug that in isolation.

@rabidscorpio

@jordansissel Thanks for responding, Jordan. I can totally understand why it would be frustrating trying to diagnose these types of problems and I wish I could give you more details. Maybe that's part of the problem because there's just not enough detail about what's going on when this problem happens. I had the most detailed logging turned on for logstash (there was no intermediary Redis or Kafka) and there was nothing in the logs at all related to the disconnects. I'm no longer working on that particular project so I can't really file a new ticket, though, and everything I'm relating is from memory from April.

@jordansissel
Contributor

@rabidscorpio You won't be alone in debugging. It's usually simple for us to guide folks through getting the details we need, but it's made more difficult when we get 5+ people on a ticket saying they have the same problem, and I lose track of which specific situation I am debugging and what its cause is. Nobody wins :\

Sorry again that you ran into trouble. We're working on curating our tickets more actively so that we don't end up with a mix of situations and issues piling into a single ticket. Thank you for telling me about your situation :)

@ruflin added the libbeat label Sep 16, 2015
@jordansissel
Contributor

Thanks for helping make logstash-forwarder better!

Logstash-forwarder is going away and is replaced by filebeat and its friend, libbeat. If this is still an issue, would you mind opening a ticket there?
