
DOS caused by packet too large client #219

Closed
packplusplus opened this issue Aug 7, 2015 · 14 comments

@packplusplus

I saw lots of these in my logs: :message=>"Protocol error, connection aborted", :error=>"packet too large (67121475 > 10485760)". It's a client misconfiguration, but it ended up DoSing the ingestors, and that's pretty scary.

I'm wondering if there's anything that can be done to keep this from happening again in the future. To be clear, I don't know if this is a log-courier-specific issue, or if any input (or filter) can have this happen. Off the top of my head, blacklisting a client that keeps causing errors might help, but I have no clue if that's feasible.

(this is definitely a "let's talk" GitHub issue, not a bug report)

@driskell
Owner

driskell commented Aug 7, 2015

The client should only reconnect after a second or so, so it shouldn't really DoS, though I guess it will be sending 67 MB over and over.

The v2 courier plugin I'm working on checks the packet size before receiving it (it reads only the header, then rejects) and also has network-level backoff to prevent memory problems. So hopefully that will mitigate any high network throughput you might see now, since it will only receive into the small TCP buffers before discarding. (It's worth noting that in v2 the partial ACK starts as soon as the header is received, so there are even fewer timeouts.)

If you can give more details on the DoS - what resource it denies - I can check that the v2 design accounts for it.

Thanks for the feedback too - discussions like this help to improve and inform :-)

@packplusplus
Author

In my setup, each internal group gets a client certificate for auth. I make an input file for each one in conf.d, so there are 10+ inputs in that directory, each with a different port and CA associated with it. One client was shipping log messages that were too big (stack traces they weren't reading right). While that one client (which may have been 5 or 6 machines) was hammering the two Logstash ingestors with large log messages, no other inputs were able to push through Logstash to Elasticsearch. I blocked that port, and everything began to flow normally.
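For context, each of those input files looks roughly like this. The port, paths, and group name are made up for illustration, and the option names are what I remember from the courier input docs, so check them against the plugin version you're running:

    # /etc/logstash/conf.d/courier-group-a.conf  (hypothetical group)
    input {
      courier {
        port            => 5141                                 # one port per group
        ssl_certificate => "/etc/logstash/ssl/logstash.crt"
        ssl_key         => "/etc/logstash/ssl/logstash.key"
        ssl_verify      => true                                 # require a client certificate
        ssl_verify_ca   => "/etc/logstash/ssl/group-a-ca.crt"   # signed by this group's CA
      }
    }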

I have an ES support contract, so I asked about inputs blowing up other inputs and linked them to this issue. Maybe they can shed some light on how this sort of thing can happen.

@packplusplus
Author

@driskell I'm going to keep pumping this thread, because I think there is a locking/DoS/something problem in the log-courier ingest process that the large packets were triggering. I think the crazy disconnects/reconnects make it trigger faster, so now I'm looking for help debugging this that doesn't involve turning on --debug server-side on something taking 400-500 events per minute but that takes several hours/days to show the problem.

I had an ingestor lock up. It's still spitting back ACKs, but it is not pushing to Elasticsearch. It's still throwing errors to logstash.log about clients disconnecting. I turned debug on for the client side, but I don't think I can on the server without HUPing the thing.

  • Client
    • logstash 1.5.3
    • courier output plugin 1.8
    • openjdk 1.8

  • Server
    • logstash 1.5.3
    • courier input plugin 1.8
    • openjdk 1.8 (seemed to happen with 1.7 too)

    An example run from the client side: input is stdin, I hit "test" a couple of times then hit enter. Looks like lots of ACKN and timeouts: https://gist.github.com/packplusplus/041387c126d4a2a0161d

I'll try to leave the broken server up as long as I can, but I don't know how useful it is going to be.

@driskell
Owner

The logs you gave show an output receiving backoff requests. The server side sends an ACK with sequence zero, meaning "nothing done yet". This is a symptom of a blocked pipeline.

Can you send the QUIT signal to Logstash? This triggers the JVM to dump a thread trace to stdout, which will be huge. Can you gist that? We should be able to verify that all courier threads are active and all filter and output threads are active, and maybe even see where the block is.

@driskell
Owner

(Send QUIT to the server side logstash, that is.)

@driskell
Owner

(FYI send it like this: kill -QUIT 12345 where 12345 is the process ID)

@packplusplus
Author

doop doop...I'm an idiot...took me like 20 minutes to realize it still goes to logstash.stdout.

Off to learn about Java stack traces... TIMED_WAITING bad, RUNNABLE good?

https://gist.githubusercontent.com/packplusplus/220b70220b3cff70f1c4/raw/bfb5a86b492c8dd8a6bc2ed585ea064575c45cb2/gistfile1.txt

@driskell
Owner

TIMED_WAITING is fine; it just means a thread is waiting for something.

You'll notice the courier threads trying to push to the filter queue and waiting because it's full. Then you'll notice the >output thread trying to pop from the output queue but waiting because it's empty.

There are no filterworkers! Nothing to take events from the filter queue and put them onto the output queue. It's a classic filterworker crash 😩 Can you double-check the Logstash logs on the server side for exceptions? My fear, though, is that it's an old issue in Logstash that they never fixed, where it doesn't log the exception because it didn't know it crashed.

The last option is to try shutting down Logstash and see if it logs it then. Send a kill -TERM 12345 where 12345 is the Logstash process ID. If it doesn't log any exception and hangs, you can kill -KILL 12345 to get rid of it as you would normally.

@packplusplus
Author

Man, you're good. I looked through the logs, and connection errors from hosts were cluttering things up.

So now I'm down to "TypeError: can't convert Fixnum into String"; why doesn't the filter crash trigger a restart?

Also, I put each of my log filter types in a different file, each with its own filter {} statement around it. Do you think that could have something to do with this?
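For reference, the layout is roughly like this (file names, types, and filters are made up for illustration). As I understand it, Logstash concatenates everything in conf.d into a single pipeline, so every event passes through every filter {} block unless it's wrapped in a conditional:

    # /etc/logstash/conf.d/20-filter-app-a.conf  (hypothetical)
    filter {
      if [type] == "app-a" {
        grok {
          match => [ "message", "%{COMBINEDAPACHELOG}" ]
        }
      }
    }

    # /etc/logstash/conf.d/21-filter-app-b.conf  (hypothetical)
    filter {
      if [type] == "app-b" {
        date {
          match => [ "timestamp", "ISO8601" ]
        }
      }
    }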

@driskell
Owner

What's your config for filters? Specifically, the mutate?

@packplusplus
Author

I didn't think any of these were exciting: https://gist.github.com/packplusplus/9b842a1b66c4f0f42fad

Don't forget, there are multiple filter threads.

@packplusplus
Author

ES support indicated that one of my mutate filters (the one doing the convert) was likely causing it by trying to convert a field that might not exist. I've removed it, and hopefully that was the reason.
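If it ever needs to come back, the obvious guard is to only run the convert when the field actually exists on the event — a sketch, with a made-up field name:

    filter {
      # Skip the mutate entirely when the field is missing,
      # so the convert never sees a field that isn't there.
      if [response_time] {
        mutate {
          convert => [ "response_time", "integer" ]
        }
      }
    }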

@driskell
Owner

I had a quick look at mutate but couldn't see what it was doing. That might be a possibility, but I didn't have time to test anything. It might be worth reproducing it if you can and reporting it in the mutate plugin issues.

It should be fairly easy to make filter workers recover after a crash too; I'm shocked no one has done it yet!

@packplusplus
Author

I have not seen this issue recur since I got the mutates straightened out. I'm okay with closing this issue, but I'm pretty convinced it was the filterworker crashing, not the client being an ass and sending huge messages.
