tcp input connection threads tracking leak #1509

Closed
dougmcclure opened this Issue Jul 7, 2014 · 46 comments


Suspected TCP Input leak and/or overload condition.

This error message appears occasionally at first, increasing over time to a constant stream:

"A plugin had an unrecoverable error. Will restart this plugin.\n Plugin: <LogStash::Inputs::Tcp type=>"raw-syslog", host=>"0.0.0.0", mode=>"server">\n Error: closed stream", :level=>:error}

rsyslog --> haproxy --> four logstash instances (each with two TCP inputs) --> TCP output to an index/search product

jstack, netstat -na, lsof -i, and lsof output (as requested) are available here: https://gist.github.com/dougmcclure/c5badc31dae31a50d647

jordansissel (Contributor) commented Jul 7, 2014

Can you attach your config as well?

jordansissel (Contributor) commented Jul 8, 2014

Specifically, I don't recognize the "raw-syslog" input plugin. Is this something you wrote?

dougmcclure commented Jul 8, 2014

That's just from the type => statement:

input {
  tcp {
    port => 10520
    type => "raw-syslog"
  } # end tcp input

  tcp {
    port => 10521
    type => "raw-syslog"
  } # end tcp input
} # end inputs


dougmcclure commented Jul 10, 2014

Need anything else collected? I've been trying to tweak /etc/sysctl.conf and tune things but still see errors under load.

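(For context, the thread does not say which settings were tuned. Purely as an illustration, and assuming a Linux host, these are the kinds of TCP-related /etc/sysctl.conf knobs typically adjusted for high connection churn; the values below are placeholders, not recommendations from this issue.)

# /etc/sysctl.conf - illustrative example only, not from this thread
net.core.somaxconn = 4096                    # larger listen backlog
net.ipv4.tcp_max_syn_backlog = 8192          # allow more half-open connections
net.ipv4.tcp_fin_timeout = 15                # shorter FIN-WAIT-2 timeout
net.ipv4.ip_local_port_range = 10240 65535   # more ephemeral ports
# apply without reboot: sysctl -p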

@colinsurprenant colinsurprenant added this to the 1.4.3 milestone Jul 10, 2014

@colinsurprenant colinsurprenant self-assigned this Jul 10, 2014

colinsurprenant (Contributor) commented Jul 10, 2014

I have identified a connection leak in the tcp input plugin, and it is probably what is causing the problem here.

It was also discussed in JIRA: https://logstash.jira.com/browse/LOGSTASH-2168

dougmcclure commented Jul 11, 2014

Awesome! How might I get a build with this incorporated so I can lay it down in my environment and test?

Thanks!

Doug


colinsurprenant (Contributor) commented Jul 11, 2014

Once it is merged, we'll send you a link to a nightly build so you can test it.

dougmcclure commented Jul 11, 2014

Thanks Colin! Is there anything system-resource-wise I can monitor for this issue? I'm just tailing the logstash logfile at this point, but could I watch memory or something else increase/decrease elsewhere?

Tks!

Doug


colinsurprenant (Contributor) commented Jul 11, 2014

The leak happens on new TCP connections: the per-connection threads are never cleaned up when the connection is closed. So in a situation where every event sent creates a new TCP connection, memory usage of the logstash JVM will increase until OOM. Basically the only thing you can monitor for this is the process memory usage.
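
(As an aside, a minimal way to watch that from the shell - a sketch only, assuming a Linux host and that pgrep can find the logstash JVM process:)

# poll resident/virtual size and thread count of the logstash JVM every 5 seconds
PID=$(pgrep -f logstash | head -n 1)   # adjust if you run several instances
while sleep 5; do
  ps -o pid,rss,vsz,nlwp -p "$PID"
done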

dougmcclure commented Jul 11, 2014

I see it now - using htop, the virt mem goes from 5G to 13G on each logstash instance as the load increases, and I then see the 'closed stream' errors start to show up. A kill/restart is then required (repeatedly) to process the backlog and try to minimize message loss.

Related question - any idea how I could set up a 'health check' to probe the tcp input for a 'normal' state? (something I could send, or a response I could expect, if I probed the input) I'm looking to set something up on haproxy to recognize this condition and then route to a backup set of logstash servers.


colinsurprenant (Contributor) commented Jul 11, 2014

For a health check, we do not have anything specifically for that (yet), but one idea would be to send a "ping" message on that tcp input; in your logstash config you could look for that ping message and send it to a different output, like email, and have something in place to verify that those pings arrive. This is not ideal - having a signal on error detection would be better.

If you are using logstash 1.4.x you could actually patch it with the fix now and let me know how it works. You just have to copy the file https://raw.githubusercontent.com/colinsurprenant/logstash/fix/tcp_input_leak/lib/logstash/inputs/tcp.rb into your logstash directory at lib/logstash/inputs/tcp.rb.
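
(For example - a sketch only; the install path below is an assumption, not taken from this thread:)

# back up the shipped file, then drop in the patched tcp.rb and restart logstash
cd /opt/logstash-1.4.1                     # adjust to your actual install path
cp lib/logstash/inputs/tcp.rb lib/logstash/inputs/tcp.rb.orig
curl -o lib/logstash/inputs/tcp.rb \
  https://raw.githubusercontent.com/colinsurprenant/logstash/fix/tcp_input_leak/lib/logstash/inputs/tcp.rb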

dougmcclure commented Jul 11, 2014

Got this error upon starting up with the new tcp.rb file:

The error reported is:
undefined local variable or method `fix_streaming_codecs' for #LogStash::Inputs::Tcp:0x8d40863


colinsurprenant (Contributor) commented Jul 11, 2014

Ah yes, that was introduced in 1.4.2. You can upgrade to 1.4.2 and add the tcp.rb file to it, or you can just remove the line with fix_streaming_codecs in tcp.rb: https://github.com/colinsurprenant/logstash/blob/fix/tcp_input_leak/lib/logstash/inputs/tcp.rb#L70.

dougmcclure commented Jul 11, 2014

OK, I commented it out and it's running now on 1.4.1. Waiting for the next burst/spike to come in to see how it goes. Thanks!


dougmcclure commented Jul 11, 2014

On that keepalive comment - if I added that simple input and sent in a 'ping', could I send an 'ack' response back on the port somehow? In server mode for the TCP input, there must be some handshaking happening there?


colinsurprenant (Contributor) commented Jul 11, 2014

There's no handshaking at the application level. Without writing your own custom plugin, the way to output something from logstash is through the output plugins, so upon receiving a special "ping" event, using filters and conditionals, you could use one of the output plugins to get a notification - over tcp, a message queue, or email, for example.

But really, before putting this in place, I'd first check whether this problem is resolved.
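
(A rough sketch of that idea in config form - the ping text, tag name, and file path below are placeholders, not from this thread:)

filter {
  if [message] == "HEALTHCHECK-PING" {
    mutate { add_tag => [ "healthcheck" ] }
  }
}

output {
  if "healthcheck" in [tags] {
    # route pings somewhere you can watch, e.g. a file (or an email output)
    file { path => "/var/log/logstash/healthcheck.log" }
  } else {
    # normal event flow to your existing output(s)
    stdout { }
  }
}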

colinsurprenant added a commit that referenced this issue Jul 11, 2014

fix connection threads tracking leak
better mutex name, loop with while true

event per connection & thread cleanups specs

remove leftover

closes #1509

colinsurprenant added a commit that referenced this issue Jul 11, 2014

fix connection threads tracking leak
better mutex name, loop with while true

event per connection & thread cleanups specs

remove leftover

closes #1509

Conflicts:
	spec/inputs/tcp.rb

colinsurprenant (Contributor) commented Jul 11, 2014

I merged #1522, which closes this issue. Feel free to reopen if you are still having problems, and please let us know if it works better for you :)

@colinsurprenant colinsurprenant changed the title from TCP Input Leak / Overload to tcp input connection threads tracking leak Jul 11, 2014

dougmcclure commented Jul 14, 2014

I'm not sure where to reopen this issue or if you prefer a new one.

I don't see the original errors occur as quickly now, for sure, but I do see that under load (processing a multi-MB burst of data) each logstash instance eventually grows in resource usage and will either freeze/stall altogether or die. I captured a new jstack from a 'frozen/stalled' logstash instance and added it to this gist for review.

https://gist.github.com/dougmcclure/55bcc5f82a6afa4ec2ce

Let me know if you want a new issue or not or other data.

Tks!

Doug


jordansissel (Contributor) commented Jul 14, 2014

many tcp threads. 8264 Ruby-X-Thread-XXXXX: /opt/logstash-X.X.X/lib/logstash/inputs/tcp.rb:XXX

Are you positive you applied this patch?

dougmcclure commented Jul 14, 2014

Yes, on 7/11 I manually copied over Colin's tcp.rb file to my /lib/logstash/inputs directory.

I've even turned off one of the two tcp inputs I originally had on each logstash instance, so haproxy is now only spraying across four logstash inputs on the box (one input per instance).


jordansissel (Contributor) commented Jul 14, 2014

Can you try this without haproxy?

colinsurprenant (Contributor) commented Jul 14, 2014

It looks like the tcp connections are somehow never closed, and it might in fact be related to haproxy. I also think testing without haproxy would provide good insight.

dougmcclure commented Jul 14, 2014

OK, I did a test with rsyslog streaming directly to a single logstash instance with one tcp input. I ran a syslog generator and sent in ~260K messages at a 500 messages/sec rate into my main rsyslog server, which then streamed them to the logstash tcp input.

Logstash's resource utilization as seen in htop (VIRT) increased from around 4GB to upwards of 13GB or so during the test run. I have a script to watch TIME_WAIT and CLOSE_WAIT connections, and CLOSE_WAIT rose to somewhere around 12,000 at peak during the test (in the gist).

I grabbed a new jstack during the test when resources were up in the 10GB range and there were a few thousand CLOSE_WAIT connections.

https://gist.github.com/dougmcclure/1d00d1147a077f7e5c99

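(For reference, a watcher script along those lines might look like this - a sketch only, assuming netstat is available and the input listens on port 10520:)

#!/bin/sh
# count sockets per TCP state for the logstash tcp input port, every 5 seconds
PORT=10520
while sleep 5; do
  date
  netstat -an | awk -v p=":$PORT" '$4 ~ p || $5 ~ p { count[$6]++ } END { for (s in count) print s, count[s] }'
done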

jordansissel (Contributor) commented Jul 14, 2014

Some thoughts -

First, leaking connections is a bug. If we're doing it, we should fix it.

About your test: why is rsyslog opening 12,000 tcp connections? What is causing so much tcp churn in your environment?

jordansissel (Contributor) commented Jul 14, 2014

% grep -c '^"Ruby-0-Thread' jstack\ during\ test
2946
% grep -c 'SizedQueue.push' jstack\ during\ test
2952
% sgrep 'SizedQueue.push' jstack\ during\ test | grep '^"' | fex '"1' | tr 0-9 X |sort | uniq -c | sort -n
      9 |worker
    218 Ruby-X-Thread-XXXXX: /opt/logstash-X.X.X/lib/logstash/inputs/tcp.rb:XXX
   2725 Ruby-X-Thread-XXXXXX: /opt/logstash-X.X.X/lib/logstash/inputs/tcp.rb:XXX

Something is blocking your output. This isn't a tcp bug. Look at the threads above! Both tcp input and filter workers (the |worker threads) are blocked.

jordansissel (Contributor) commented Jul 14, 2014

If I had to guess what is going on with your rsyslog test:

  • rsyslog writes to Logstash
  • Logstash output is blocking/slow
  • rsyslog times out a write and reconnects
  • repeat rsyslog timeout+reconnect+write+block
  • Logstash still waiting to finish reading from these new socket connections, which are blocked writing to filters

Repeat until you're out of sockets or memory.

The observed problem (OOM or out of sockets) doesn't appear, at least with the data presented, to be the actual issue. It seems to be a symptom of your output(s) not keeping up with your ingestion rate.

colinsurprenant (Contributor) commented Jul 14, 2014

But having 12,000 sockets in the CLOSE_WAIT state may be indicative of a socket leak, or of close never being called on the socket...

CLOSE_WAIT indicates that the server has received the first FIN from the client and the connection is in the process of being closed. This essentially means the socket is in a state where it is waiting for the application to execute close.

jordansissel (Contributor) commented Jul 14, 2014

@colinsurprenant nah, look at the thread model.

  • accept -> start new thread
  • new thread: read from socket, write to Queue (blocks, in this case)

Once blocked, we have no opportunity to time out or close the socket. The socket stays in CLOSE_WAIT because rsyslog has closed it, but logstash isn't done with it (it's blocked writing to the queue, so it can't read the rest of the socket data in order to close it).
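
(In pseudocode terms, the model being described looks roughly like this - an illustrative Ruby sketch, not the actual tcp.rb source:)

# illustrative sketch of the accept-per-connection model (not the real plugin code)
require "socket"
require "thread"

queue  = SizedQueue.new(20)              # the pipeline's bounded internal queue
server = TCPServer.new("0.0.0.0", 10520)

loop do
  client = server.accept                 # keeps accepting regardless of queue state
  Thread.new(client) do |socket|
    while line = socket.gets             # read events from this connection
      queue.push(line)                   # blocks when the queue is full...
    end
    socket.close                         # ...so this close is never reached and the
  end                                    # half-closed socket sits in CLOSE_WAIT
end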

dougmcclure commented Jul 14, 2014

It very well could be my custom TCP output. I'll test w/o our plugin and just push it out to file or TCP to another rsyslog spool to see how it performs.


colinsurprenant (Contributor) commented Jul 14, 2014

much true :P that would explain it. @dougmcclure could you test writing to file?

jordansissel (Contributor) commented Jul 14, 2014

If you test with output { null { } }, I imagine you will see that it does not show this behavior.
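
(i.e. a minimal isolation test like this, keeping the tcp input from earlier in the thread and discarding all events:)

input {
  tcp {
    port => 10520
    type => "raw-syslog"
  }
}

output {
  null { }
}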

dougmcclure commented Jul 14, 2014

I am able to complete the same test (syslog load generator --> rsyslog --> logstash tcp input --> file output OR syslog output) with no TCP connection issues - not even a noticeable increase in system resource consumption.

I will dig into our output plugin at this point. At least we got one bug squashed earlier!

Thanks for the guidance in troubleshooting!

Doug


jordansissel (Contributor) commented Jul 14, 2014

Whew! @dougmcclure - huge thanks again for helping test this stuff! <3

colinsurprenant (Contributor) commented Jul 14, 2014

woot! nasty one!

I wonder how we could mitigate this problem. Maybe put an upper limit on the number of open tcp connections? That way the problem would manifest as the tcp producer failing to connect rather than logstash crashing.

jordansissel (Contributor) commented Jul 14, 2014

@colin +1 - right now we don't, inside logstash, know whether additional load should be blocked. The lumberjack input has this same exact problem (keeps accepting + starts reading new events, then blocks writing to the queue, repeat).

dougmcclure commented Jul 15, 2014

My desire was to use rsyslog+haproxy to intelligently handle this in front of logstash. If I had some way for the TCP input to signal "throttle / I'm full / STOP" via a health check response upstream, I could then activate standby instances within haproxy to start taking on the excess/unexpected load.

haproxy's health check for tcp looks very flexible (send/expect/binary/commands) if I had some way to interact back and forth with logstash. Maybe we need a 'health service' within logstash itself to report on how it sees inputs, filters, outputs, resource utilization, queues, workers, etc., which other intelligent systems could interrogate to make decisions?


colinsurprenant (Contributor) commented Jul 15, 2014

@jordansissel yes, and this defeats the purpose of the sized queues: the connection pool becomes an unbounded input queue. I think this could be as simple as having a (configurable?) maximum number of pending connections and just not accepting connections past that. Another, probably better, approach would be a fixed/configurable connection thread pool, which would have the added benefit of reducing object churn in high-frequency connection situations.

@dougmcclure it's in our plans to expose monitoring/health/internal metrics in logstash. In the short term, would you be able to deal correctly with a connection-refused scenario? If we bounded the number of connections and started refusing connections past that limit, would haproxy be able to deal with it correctly?

jordansissel (Contributor) commented Jul 15, 2014

Once we start measuring time spent waiting to write to the queue, we could use that data to apply back pressure on the tcp accept thread? Too much time spent writing to the queue? Block new connections?


colinsurprenant (Contributor) commented Jul 15, 2014

Isn't the best indication of downstream congestion simply, on one side, a full write queue, or on the other side (as a result), the accumulation of blocked connection threads? We could have the accept thread only accept when the write queue is not full; this would throttle the inbound connection rate. Otherwise, as I said, we could have a thread pool and have the accept thread only accept if there are free threads in the pool.
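
(A rough Ruby sketch of that thread-pool idea - illustrative only, not an actual patch; the limit of 50 is an arbitrary placeholder:)

# illustrative sketch: only accept a new connection when a worker slot is free
require "socket"
require "thread"

MAX_CONNECTIONS = 50
slots  = SizedQueue.new(MAX_CONNECTIONS)  # counting semaphore built from a sized queue
queue  = SizedQueue.new(20)               # the pipeline's bounded internal queue
server = TCPServer.new("0.0.0.0", 10520)

loop do
  slots.push(:slot)                # blocks when all slots are taken, so new
  client = server.accept           # connections back up in the OS listen backlog
  Thread.new(client) do |socket|
    begin
      while line = socket.gets
        queue.push(line)
      end
    ensure
      socket.close
      slots.pop                    # free the slot for the next connection
    end
  end
end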

@jordansissel jordansissel modified the milestones: v1.5.0, 1.4.3 Nov 11, 2014

@ph ph referenced this issue Dec 23, 2014

Closed

Memory leak #2284

mrmanc commented Jan 6, 2015

Hello. Awesome work here. I understand from reading this that the conclusion was not that there was a memory leak, but rather a build-up of blocked filter threads caused by slow outputs (and a lack of back pressure to the input source). Is back pressure being implemented in 1.5.0?

alcanzar pushed a commit to alcanzar/logstash that referenced this issue Jun 4, 2015

fix connection threads tracking leak
better mutex name, loop with while true

event per connection & thread cleanups specs

remove leftover

closes #1509

Conflicts:
	spec/inputs/tcp.rb

@tbragin tbragin added the v1.5.0 label Jun 18, 2015

bbergstrom commented Jul 15, 2015

This is tagged with the 1.4 milestone, but I don't see the patch for this in the 1.4 branch: https://github.com/elastic/logstash/commits/1.4

Is this planned for 1.4.x?


danielsand commented Aug 3, 2015

@bbergstrom:
I hate to say this, but just copy the tcp.rb from the commit into your current logstash 1.4.x installation && restart.
It will do the trick.


purbon (Contributor) commented Aug 3, 2015

@danielsand @bbergstrom you can also see it another way: do a PR with the changes to backport it to 1.4 - makes sense?

danielsand commented Aug 3, 2015

@purbon here you go

purbon (Contributor) commented Aug 3, 2015

Thanks, much appreciated. We'll review and merge as soon as possible!

danielsand commented

@purbon awesome :)

colinsurprenant added a commit that referenced this issue Aug 4, 2015

fix connection threads tracking leak
better mutex name, loop with while true

event per connection & thread cleanups specs

remove leftover

closes #1509

(This is a backport of #1522 to 1.4 branch)

colinsurprenant added a commit that referenced this issue Aug 5, 2015

fix connection threads tracking leak
better mutex name, loop with while true

event per connection & thread cleanups specs

remove leftover

closes #1509

(This is a backport of #1522 to 1.4 branch)

Fixes #3696