
DOS caused by packet too large client #219

Closed
packplusplus opened this issue Aug 7, 2015 · 14 comments

@packplusplus

I saw lots of these in my logs: :message=>"Protocol error, connection aborted", :error=>"packet too large (67121475 > 10485760)". It's a client misconfiguration, but it ended up DoSing the ingestors, and that's pretty scary.

I'm wondering if there's anything that can be done to keep this from happening again in the future. To be clear, I don't know if this is a log-courier-specific issue, or if any input (or filter) can have this happen. Off the top of my head, blacklisting a client that keeps causing errors might help, but I have no clue if that's feasible.

(this is definitely a "let's talk" GitHub issue, not a bug report)

@driskell
Owner

driskell commented Aug 7, 2015

The client should only reconnect after a second or so, so it shouldn't really DoS, though I guess it will be sending 67 MB over and over.

The v2 courier plugin I'm working on checks the packet size before receiving it (it reads only the header, then rejects) and also has network-level backoff to prevent memory problems. So hopefully that will mitigate any high network throughput you might see now, since it will only receive into the small TCP buffers before discarding. (It's worth noting that in v2 the partial ACK starts as soon as the header is received, so there are even fewer timeouts.)

If you can give more details on the DoS - what resource it denies - I can check that the v2 design accounts for it.

Thanks for the feedback too - discussions like this help to improve and inform :-)

@packplusplus
Author

In my setup, each internal group gets a client certificate for auth. I make an input file for each one in conf.d, so there are 10+ inputs in that directory, each with a different port and CA associated with it. One client was shipping log messages that were too big (stack traces they weren't reading right). While that one client (which may have been 5 or 6 machines) was hammering the two Logstash ingestors with large log messages, no other inputs were able to push through Logstash to Elasticsearch. I blocked that port, and everything began to flow normally.
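For context, each of those input files looks roughly like this. The port, paths, and group name are made up for illustration, and the option names are what I remember from the courier input docs, so check them against the plugin version you're running:

    # /etc/logstash/conf.d/courier-group-a.conf  (hypothetical group)
    input {
      courier {
        port            => 5141                                 # one port per group
        ssl_certificate => "/etc/logstash/ssl/logstash.crt"
        ssl_key         => "/etc/logstash/ssl/logstash.key"
        ssl_verify      => true                                 # require a client certificate
        ssl_verify_ca   => "/etc/logstash/ssl/group-a-ca.crt"   # signed by this group's CA
      }
    }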

I have an ES support contract, so I asked about inputs blowing up other inputs and linked them to this issue. Maybe they can shed some light on how this sort of thing can happen.

@packplusplus
Author

@driskell I'm going to keep pumping this thread, because I think there is a locking/DoS/something problem in the log-courier ingest process that the large packets were triggering. I think the crazy disconnects/reconnects make it trigger faster, so now I'm looking for help debugging this that doesn't involve turning on --debug server-side on something taking 400-500 events per minute but that takes several hours/days to show the problem.

I had an ingestor lock up. It's still spitting back ACKs, but it is not pushing to Elasticsearch. It's still throwing errors to logstash.log about clients disconnecting. I turned debug on for the client side, but I don't think I can on the server without HUPing the thing.

  • Client
    • logstash 1.5.3
    • courier output plugin 1.8
    • openjdk 1.8

  • Server
    • logstash 1.5.3
    • courier input plugin 1.8
    • openjdk 1.8 (seemed to happen with 1.7 too)

    An example run from the client side: input is stdin, I hit "test" a couple of times then hit enter. Looks like lots of ACKN and timeouts: https://gist.github.com/packplusplus/041387c126d4a2a0161d

I'll try to leave the broken server up as long as I can, but I don't know how useful it is going to be.

@driskell
Owner

The logs you gave show an output receiving backoff requests. The server side sends an ACK with sequence zero, meaning "nothing done yet". This is a symptom of a blocked pipeline.

Can you send the QUIT signal to Logstash? This triggers the JVM to dump a thread trace to stdout, which will be huge. Can you gist that? We should be able to verify that all courier threads are active and all filter and output threads are active, and maybe even see where the block is.

@driskell
Owner

(Send QUIT to the server side logstash, that is.)

@driskell
Owner

(FYI send it like this: kill -QUIT 12345 where 12345 is the process ID)

@packplusplus
Author

doop doop...I'm an idiot...took me like 20 minutes to realize it still goes to logstash.stdout.

Off to learn about Java stack traces... TIMED_WAITING bad, RUNNABLE good?

https://gist.githubusercontent.com/packplusplus/220b70220b3cff70f1c4/raw/bfb5a86b492c8dd8a6bc2ed585ea064575c45cb2/gistfile1.txt

@driskell
Owner

TIMED_WAITING is fine; it just means a thread is waiting for something.

You'll notice the courier threads trying to push to the filter queue and waiting because it's full. Then you'll notice the >output thread trying to pop from the output queue but waiting because it's empty.

There are no filterworkers! Nothing to take events from the filter queue and put them onto the output queue. It's a classic filterworker crash 😩 Can you double-check the Logstash logs on the server side for exceptions? My fear, though, is that it's an old issue in Logstash that they never fixed, where it doesn't log the exception because it didn't know it crashed.

The last option is to try shutting down Logstash and see if it logs it then. Send a kill -TERM 12345 where 12345 is the Logstash process ID. If it doesn't log any exception and hangs, you can kill -KILL 12345 to get rid of it as you would normally.

@packplusplus
Author

Man, you're good. I looked through the logs, and connection errors from hosts were cluttering things up.

So now I'm down to "TypeError: can't convert Fixnum into String"; why doesn't the filter crash trigger a restart?

Also, I put each of my log filter types in a different file, each with its own filter {} statement around it. Do you think that could have something to do with this?
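For reference, the layout is roughly like this (file names, types, and filters are made up for illustration). As I understand it, Logstash concatenates everything in conf.d into a single pipeline, so every event passes through every filter {} block unless it's wrapped in a conditional:

    # /etc/logstash/conf.d/20-filter-app-a.conf  (hypothetical)
    filter {
      if [type] == "app-a" {
        grok {
          match => [ "message", "%{COMBINEDAPACHELOG}" ]
        }
      }
    }

    # /etc/logstash/conf.d/21-filter-app-b.conf  (hypothetical)
    filter {
      if [type] == "app-b" {
        date {
          match => [ "timestamp", "ISO8601" ]
        }
      }
    }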

@driskell
Owner

What's your config for filters? Specifically, the mutate?

@packplusplus
Author

I didn't think any of these were exciting: https://gist.github.com/packplusplus/9b842a1b66c4f0f42fad

Don't forget, there are multiple filter threads.

@packplusplus
Author

ES support indicated that one of my mutate filters (the one doing the convert) was likely causing it by trying to convert a field that might not exist. I've removed it, and hopefully that was the reason.
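If it ever needs to come back, the obvious guard is to only run the convert when the field actually exists on the event — a sketch, with a made-up field name:

    filter {
      # Skip the mutate entirely when the field is missing,
      # so the convert never sees a field that isn't there.
      if [response_time] {
        mutate {
          convert => [ "response_time", "integer" ]
        }
      }
    }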

@driskell
Owner

I had a quick look at mutate but couldn't see what it was doing. That might be a possibility, but I didn't have time to test anything. It might be worth reproducing it if you can and reporting it in the mutate plugin issues.

It should be fairly easy to make filter workers recover after a crash too; I'm shocked no one has done it yet!

@packplusplus
Author

I have not seen this issue recur since I got the mutates straightened out. I'm okay with closing this issue, but I'm pretty convinced it was the filterworker crashing, not the client being an ass and sending huge messages.
