Libbeat output reloading cause a goroutine leak and a possibly a fd leak #10491

Closed
ph opened this issue Feb 1, 2019 · 3 comments · Fixed by #17381
Labels
bug libbeat Management Team:Integrations Label for the Integrations team Team:Services (Deprecated) Label for the former Integrations-Services team

Comments

@ph
Contributor

ph commented Feb 1, 2019

Central Management (CM) relies on the reloading implementation inside libbeat. When an input, module, or output is found in the incoming configuration from CM, CM takes care of reloading the appropriate part of the code.

Let's take the following scenario:

Filebeat is configured with a simple input that reads events from the /var/log/syslog path and sends them to Elasticsearch.

Now assume that the configuration changes every second. Each time we reload the output, we have to migrate the in-flight events in the queue to the new output. When a new Elasticsearch output is created, the first thing it does is ping the Elasticsearch host.

The current problem is that when we swap the output, the goroutine of the previous output keeps running and we keep a reference to the TCP transport pool created for that output. In short, we leak a goroutine and an fd on every reload of the output.

@ph ph self-assigned this Feb 1, 2019
@ph ph changed the title Libbeat eutput reloading cause a goroutine leak and a possibly a fd leak Libbeat output reloading cause a goroutine leak and a possibly a fd leak Feb 1, 2019
@ph
Contributor Author

ph commented Feb 1, 2019

Adding a pprof goroutine trace to help identify the leak.

(pprof) list netClientWorker
Total: 3089
ROUTINE ======================== github.com/elastic/beats/libbeat/publisher/pipeline.(*netClientWorker).run in /Users/ph/go/src/github.com/elastic/beats/libbeat/publisher/pipeline/output.go
         0       2178 (flat, cum) 70.51% of Total
         .          .     79:   for !w.closed.Load() {
         .          .     80:           reconnectAttempts := 0
         .          .     81:
         .          .     82:           // start initial connect loop from first batch, but return
         .          .     83:           // batch to pipeline for other outputs to catch up while we're trying to connect
         .       1738     84:           for batch := range w.qu {
         .          .     85:                   batch.Cancelled()
         .          .     86:
         .          .     87:                   if w.closed.Load() {
         .          .     88:                           logp.Info("Closed connection to %v", w.client)
         .          .     89:                           return
         .          .     90:                   }
         .          .     91:
         .          .     92:                   if reconnectAttempts > 0 {
         .          .     93:                           logp.Info("Attempting to reconnect to %v with %d reconnect attempt(s)", w.client, reconnectAttempts)
         .          .     94:                   } else {
         .          .     95:                           logp.Info("Connecting to %v", w.client)
         .          .     96:                   }
         .          .     97:
         .          .     98:                   err := w.client.Connect()
         .          .     99:                   if err != nil {
         .          .    100:                           logp.Err("Failed to connect to %v: %v", w.client, err)
         .          .    101:                           reconnectAttempts++
         .          .    102:                           continue
         .          .    103:                   }
         .          .    104:
         .          .    105:                   logp.Info("Connection to %v established", w.client)
         .          .    106:                   reconnectAttempts = 0
         .          .    107:                   break
         .          .    108:           }
         .          .    109:
         .          .    110:           // send loop
         .        440    111:           for batch := range w.qu {
         .          .    112:                   if w.closed.Load() {
         .          .    113:                           if batch != nil {
         .          .    114:                                   batch.Cancelled()
         .          .    115:                           }
         .          .    116:                           return
(pprof)

@ph
Contributor Author

ph commented Feb 1, 2019

Looking at the reloadable part, this doesn't look too hard to fix.

@urso

urso commented Jan 17, 2020

An initial attempt to fix this issue was made in #10599. We ran into problems with that fix and ultimately created #11068 as a prerequisite.

@andresrc andresrc added Team:Services (Deprecated) Label for the former Integrations-Services team [zube]: Inbox [zube]: Ready and removed [zube]: In Progress [zube]: Inbox labels Jan 27, 2020
@andresrc andresrc added Team:Integrations Label for the Integrations team and removed Team:Beats labels Mar 6, 2020
@ycombinator ycombinator self-assigned this Mar 21, 2020
4 participants