DOS caused by packet too large client #219
The client should reconnect only after a second or so, so it shouldn't really DoS, though I guess it will be sending 67 MB over and over. The v2 courier plugin I'm working on considers the packet size before receiving it (it reads the header only, then rejects) and also has network-level backoff to stop memory problems. So hopefully that will mitigate any high network throughput you might see now, as it'll only receive into the tiny TCP buffers before discarding. (It's worth noting partial ACK starts as soon as the header is received in v2 too, so even fewer timeouts.) If you can give more details on the DoS - what resource it denies - I can check that the v2 design accounts for it. Thanks for the feedback too - discussions like this help to improve and inform :-)
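The header-first rejection described here can be sketched roughly as follows. This is a minimal Python sketch, not the actual Go plugin code; the 4-byte type + 4-byte big-endian length framing and the 10 MiB limit are assumptions inferred from the "packet too large (67121475 > 10485760)" error quoted later in this thread.

```python
import struct

MAX_PACKET = 10 * 1024 * 1024  # assumed 10 MiB limit, matching the error log

def check_frame_header(header: bytes) -> int:
    """Validate an 8-byte frame header (assumed layout: 4-byte type +
    4-byte big-endian payload length) and return the payload length.

    Rejecting at this point means only the 8 header bytes were ever
    buffered; an oversized 67 MB payload is never read into memory.
    """
    if len(header) != 8:
        raise ValueError("short header")
    frame_type, length = struct.unpack(">4sI", header)
    if length > MAX_PACKET:
        raise ValueError(f"packet too large ({length} > {MAX_PACKET})")
    return length
```

The key design point is the same one described above: the size decision is made before any payload is received, so a misbehaving client costs header-sized reads instead of multi-megabyte allocations.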
In my setup, each internal group gets a client certificate for auth. I make an input file for each one in conf.d, so there are 10+ inputs in that directory, each with a different port and CA associated with it. One client was shipping log messages that were too big (stack traces they weren't reading right). While that one client (which may have been 5 or 6 machines) was hammering the two logstash ingestors with large log messages, no other inputs were able to push through logstash to elasticsearch. I blocked that port, and everything began to flow normally. I have an ES support contract, and asked about inputs blowing up other inputs and linked them to this issue. Maybe they can shed some light on how this sort of thing can happen.
@driskell I'm going to keep pumping this thread, because I think there is a locking/DoS/something problem in the log-courier ingest process that the large packets were triggering. I think the crazy disconnects/reconnects make it trigger faster, so now I'm looking for help debugging this that doesn't involve turning on --debug server side on something taking 400-500 events per minute but taking several hours/days to show the problem. I had an ingestor lock up. It's still spitting back ACKs, but is not pushing to elasticsearch. It's still throwing errors to logstash.log about clients disconnecting. I turned debug on on the client side, but don't think I can on the server without HUPping the thing.
I'll try to leave the broken server up as long as I can, but I don't know how useful it is going to be.
The logs you gave show an output receiving backoff requests. The server side sends an ACK with a zero sequence saying "nothing done yet." This is a symptom of a blocked pipeline. Can you send the QUIT signal to Logstash? This triggers the Java JVM to dump a thread trace to stdout, which will be huge. Can you gist that? We should be able to verify all courier threads are active, and all filter and output threads are active, and maybe even see where the block is.
(Send QUIT to the server side logstash, that is.)
(FYI send it like this:
doop doop...I'm an idiot...took me like 20 minutes to realize it still goes to logstash.stdout. Off to learn about java stack traces...TIMED_WAITING bad, runnable good? |
Timed waiting is fine; it just means a thread is waiting for something. You'll notice courier threads trying to push to the filter queue and waiting because it's full. Then you'll notice the >output thread trying to pop from the output queue but waiting because it's empty. There are no filterworkers! Nothing to take events from the filter queue and put them onto the output queue. It's a classic filterworker crash 😩 Can you double-check output logs on the logstash server side for exceptions? My fear though is it's an earlier issue in Logstash that they never fixed, where it doesn't log the exception because it didn't know it crashed. Last option is to try a shutdown of logstash and see if it then logs it. Send a
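A crude way to confirm a diagnosis like this from a thread dump is to count thread names and states mechanically. The sketch below is Python under two assumptions not stated in the thread: the dump follows the standard JVM jstack layout (a quoted thread name line followed by a `java.lang.Thread.State:` line), and Logstash 1.x-style thread names such as `<courier` for inputs and `>output` for outputs.

```python
import re
from collections import Counter

def summarize_thread_dump(dump: str) -> Counter:
    """Count (thread-name, state) pairs in a JVM thread dump.

    In a healthy Logstash 1.x pipeline you'd expect filterworker
    threads to appear; if they are absent while input threads sit in
    WAITING (full filter queue) and the output sits in WAITING (empty
    output queue), the filterworkers have crashed.
    """
    counts = Counter()
    name = None
    for line in dump.splitlines():
        m = re.match(r'^"([^"]+)"', line)
        if m:
            name = m.group(1)
            continue
        m = re.search(r'java\.lang\.Thread\.State: (\w+)', line)
        if m and name:
            counts[(name, m.group(1))] += 1
            name = None
    return counts
```

Running this over a real dump and grepping the result for "filterworker" is a faster check than reading a multi-thousand-line trace by eye.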
Man, you're good. I looked through the logs, and connection errors from hosts were cluttering things.
So now I'm down to "TypeError: can't convert Fixnum into String"; why doesn't the filter crash trigger a restart? Also, I put each of my log filter types in different files, each with their own filter {} statement around them. Do you think this could have something to do with it?
What's your config for filters? Specifically, the
I didn't think any of these were exciting: https://gist.github.com/packplusplus/9b842a1b66c4f0f42fad Don't forget: multiple filter threads.
ES support indicated one of my mutate filters (doing the convert) was likely causing it by trying to convert a field that might not exist. I've removed it and hopefully that was the reason. |
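The failure mode ES support described can be shown in miniature: a convert step that assumes the field is always present blows up on events that lack it, while a guarded version skips them. This is a Python analogy only; the real mutate filter is Ruby (hence the Fixnum/String TypeError above), and the field name `bytes` here is made up.

```python
def convert_unsafe(event: dict, field: str) -> dict:
    # Mirrors a convert that assumes the field always exists:
    # raises on events where the field is missing or non-numeric,
    # which is the kind of error that killed the filterworkers.
    event[field] = int(event[field])
    return event

def convert_safe(event: dict, field: str) -> dict:
    # Guarded version: only convert when the field exists and is numeric.
    value = event.get(field)
    if isinstance(value, str) and value.isdigit():
        event[field] = int(value)
    return event
```

The practical takeaway matches the fix applied here: either guard the conversion, or ensure every event that reaches the convert step actually carries the field.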
I had a quick look at mutate but couldn't see what it was doing. That might be a possibility, but I didn't have time to test anything. Might be worth you reproducing it if you can and reporting it in the mutate plugin issues. Should be fairly easy to make filter workers recover after a crash too; I'm shocked no one has yet!
I have not seen this issue re-occur since I got the mutates straightened out. I'm okay with closing this issue, but I'm pretty convinced it was the filterworker crashing, not the client being an ass and sending huge messages.
I saw lots of these guys in my log.
:message=>"Protocol error, connection aborted", :error=>"packet too large (67121475 > 10485760)"
It's a client misconfiguration, but it ended up DoSing the ingestors, and that's pretty scary. I'm wondering if there's anything that can be done to keep this from happening again in the future. To be clear, I don't think this is a log-courier-specific issue; it could be any input (or filter) that can have this happen. Off the top of my head, blacklisting a client that keeps causing errors, but I have no clue if that's feasible.
(this is def a "let's talk" github issue, not a bug)
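The blacklisting idea floated above could look something like this on the input side. This is purely illustrative; none of these names exist in any real plugin, and the thresholds are arbitrary.

```python
import time
from collections import defaultdict

class ErrorBackoff:
    """Track protocol errors per client address and refuse connections
    from clients that keep sending bad frames.

    After `threshold` errors within `window` seconds, the client is
    blocked for `block_for` seconds, so repeated oversized packets cost
    only a rejected connection instead of ingest capacity.
    """
    def __init__(self, threshold=5, window=60.0, block_for=300.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.window = window
        self.block_for = block_for
        self.clock = clock
        self.errors = defaultdict(list)   # addr -> recent error timestamps
        self.blocked_until = {}           # addr -> unblock time

    def record_error(self, addr):
        now = self.clock()
        hits = [t for t in self.errors[addr] if now - t < self.window]
        hits.append(now)
        self.errors[addr] = hits
        if len(hits) >= self.threshold:
            self.blocked_until[addr] = now + self.block_for

    def allowed(self, addr):
        return self.clock() >= self.blocked_until.get(addr, 0.0)
```

One design caveat worth noting: blocking by source address can punish a whole NAT'd group for one bad host, which may be exactly why no input does this by default.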