Add test for elasticsearch re-connection after network error & allow graceful shutdown #40794

belimawr · 2024-09-12T17:07:58Z

Proposed commit message

When the Elasticsearch client fails to publish events, it ends up calling Close on the connection (which is reused). To cancel the in-flight requests, the context is cancelled and a new one is created to be used in future requests.

The callback that checks the version holds a reference to the connection via a closure. Now the Elasticsearch client holds a pointer to that same connection, so whenever Close is called, the callback can create a request with the new, not cancelled, context.

An integration test is added to ensure the ES output can always recover from network errors.
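
For readers following along, here is a minimal Go sketch of how two holders of "the same" connection can diverge once Close swaps the context. The `conn` type, `newConn`, `usable` and `staleAfterClose` are hypothetical stand-ins, not the real eslegclient code; the sketch only assumes that one holder works on a value copy while the other shares the pointer.

```go
package main

import (
	"context"
	"fmt"
)

// conn is a hypothetical, stripped-down stand-in for the reusable connection.
type conn struct {
	reqsContext context.Context
	cancelReqs  context.CancelFunc
}

func newConn() *conn {
	c := &conn{}
	c.reqsContext, c.cancelReqs = context.WithCancel(context.Background())
	return c
}

// Close cancels in-flight requests, then installs a fresh context for future ones.
func (c *conn) Close() {
	c.cancelReqs()
	c.reqsContext, c.cancelReqs = context.WithCancel(context.Background())
}

// usable reports whether a new request could be issued with the current context.
func (c *conn) usable() bool { return c.reqsContext.Err() == nil }

// staleAfterClose compares a callback working on a value copy of the
// connection with a callback sharing the pointer, after Close has run.
func staleAfterClose() (copyUsable, pointerUsable bool) {
	c := newConn()
	snapshot := *c // value copy: it keeps the old context forever

	byCopy := func() bool { return snapshot.usable() }
	byPointer := func() bool { return c.usable() }

	c.Close() // cancels the old context and replaces it on c only
	return byCopy(), byPointer()
}

func main() {
	copyUsable, pointerUsable := staleAfterClose()
	fmt.Println("callback holding a copy can send requests:", copyUsable)
	fmt.Println("callback holding the pointer can send requests:", pointerUsable)
}
```

The copy is stuck with the cancelled context, which is the shape of failure described above; sharing one pointer keeps both holders in sync.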

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
~~[ ] I have made corresponding changes to the documentation~~
~~[ ] I have made corresponding change to the default configuration files~~
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

It's a bug fix, there is no disruptive user impact

~~## Author's Checklist~~

How to test this PR locally

Related issues

Closes Elasticsearch output does not recover after connection failure #40705

~~## Use cases~~
~~## Screenshots~~
~~## Logs~~

mergify · 2024-09-12T17:08:36Z

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix-es-connection-issue upstream/fix-es-connection-issue
git merge upstream/main
git push upstream fix-es-connection-issue

mergify · 2024-09-12T17:08:36Z

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @belimawr? 🙏.
For such, you'll need to label your PR with:

The upcoming major version of the Elastic Stack
The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fix up this pull request, you need to add the backport labels for the needed
branches, such as:

backport-8.\d is the label to automatically backport to the 8.\d branch, where \d is the digit

mergify · 2024-09-12T17:08:37Z

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

elasticmachine · 2024-09-12T17:11:15Z

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

libbeat/esleg/eslegclient/connection.go

AndersonQ

LGTM, but I have a question. I'll approve once it's answered

libbeat/tests/integration/elasticsearch_test.go

cmacknz · 2024-09-16T14:31:15Z

libbeat/esleg/eslegclient/connection.go

+	// There are some cases where the connection is created but Connect
+	// is not called before it's used, so we populate reqsContext and cancelReqs
+	// here.


Can we just find the places where we don't call connect and fix them? This code is only called within Beats and we can search for uses of esleg.

Also, using context.Background() is a smell. The parent context should be an argument, which will have the compiler find all NewConnection uses for you so you can audit them to see if they use Close inappropriately.

The connection here also doesn't really represent a connection at the network level; it looks like it is a convenience wrapper around an HTTP client. From that perspective, closing or having a connection-level context doesn't make much sense; it is just a wrapper for closing idle connections.

Arguably the contexts should be set on a per request basis in all the places that call execRequest and then the cancellation should propagate through those individual calls.

> Can we just find the places where we don't call connect and fix them? This code is only called within Beats and we can search for uses of esleg.

I went with this approach to be on the safe side and not introduce the possibility of a panic from calling a nil function. The panic I got when running the tests seems to come from a Connection that is never used but is still closed. It is triggered by a test for a failure scenario where the ES host is not reachable.

Anyways, I've been looking into that.

> Arguably the contexts should be set on a per request basis in all the places that call execRequest and then the cancellation should propagate through those individual calls.

I'm not sure that would achieve the same effect we currently have. Currently reqsContext is used to cancel in-flight requests when the Connection needs to be shut down. That is done by the Close method, which is called by a different goroutine than the one waiting for the in-flight request(s) to finish.

The issues that are fixed by this new behaviour:

Windows service for Beat does not stop when output is unreachable #40518

Windows service for Beat does not stop gracefully #38666

And the PR fixing them:

[libbeat] Stop publisher properly #40572

Honestly, I'd rather have this PR merged as is, so the issues listed above and #40705 are correctly fixed on main. Then we can create a follow-up issue to refactor the code, ensuring that:

Methods creating requests accept a context

In-flight requests can be cancelled when the Connection is closed by a different goroutine.

I finally figured out the reason why Filebeat would panic when the connection was closed. It's an interesting corner case.
The root cause: if the publishing pipeline never tries to publish an event and Filebeat is shut down, the connection to ES is never used. However, it is still closed during the shutdown process, leading to a panic if cancelReqs is nil.

I could add some checks to ensure cancelReqs and reqsContext are not used if nil, however it feels cleaner to keep the previous behaviour of NewConnection returning a Connection that is safe to use without any change of behaviour on its methods.
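
The corner case can be reproduced in isolation with a minimal Go sketch. The `connection` type and the `closePanics` helper are hypothetical; the sketch only assumes that Close calls `cancelReqs` unconditionally.

```go
package main

import (
	"context"
	"fmt"
)

// connection is a hypothetical simplification of the real Connection.
type connection struct {
	reqsContext context.Context
	cancelReqs  context.CancelFunc
}

// newConnection populates the context and cancel function up front, so the
// value is safe to Close even if it is never connected or used.
func newConnection() *connection {
	c := &connection{}
	c.reqsContext, c.cancelReqs = context.WithCancel(context.Background())
	return c
}

// Close calls cancelReqs unconditionally, as assumed for this sketch.
func (c *connection) Close() {
	c.cancelReqs()
}

// closePanics reports whether closing c panics.
func closePanics(c *connection) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	c.Close()
	return false
}

func main() {
	// Zero value: cancelReqs is nil, so calling it panics.
	fmt.Println("zero-value connection panics on Close:", closePanics(&connection{}))
	// Constructor-initialised value: Close is safe without Connect.
	fmt.Println("initialised connection panics on Close:", closePanics(newConnection()))
}
```

Initialising in the constructor keeps every method free of nil checks, at the cost of creating a context that may never be used.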

libbeat/esleg/eslegclient/connection.go

cmacknz · 2024-09-16T21:03:49Z

libbeat/outputs/elasticsearch/client.go

+	// that is passed to this client is also used in a closure, we need
+	// to ensure both hold a reference to the same instance of the connection.


That you have to account for future usage at all here feels wrong too. We need to fix this in a way that does not require knowing future uses of the code. Make it impossible to have this bug again.

Why do we even need a closure? Was it just there because it was convenient? Can we just not have a closure anymore? It looks like if we saved the captured onConnect, the callback could be a method on the client, which would also avoid capturing the connection.

beats/libbeat/outputs/elasticsearch/client.go

Lines 136 to 160 in 3c03c74

    conn.OnConnectCallback = func() error {
    	globalCallbackRegistry.mutex.Lock()
    	defer globalCallbackRegistry.mutex.Unlock()

    	for _, callback := range globalCallbackRegistry.callbacks {
    		err := callback(conn)
    		if err != nil {
    			return err
    		}
    	}

    	if onConnect != nil {
    		onConnect.mutex.Lock()
    		defer onConnect.mutex.Unlock()

    		for _, callback := range onConnect.callbacks {
    			err := callback(conn)
    			if err != nil {
    				return err
    			}
    		}
    	}
    	return nil
    }
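
One possible shape for the "method on the client" idea, as a minimal Go sketch. All types here (`connection`, `callbackRegistry`, `client`) and the `runOnConnect` name are hypothetical simplifications of the snippet above, and the global registry is left out for brevity.

```go
package main

import (
	"fmt"
	"sync"
)

// connection is a hypothetical stand-in; id only lets us tell instances apart.
type connection struct{ id int }

type connectCallback func(*connection) error

type callbackRegistry struct {
	mutex     sync.Mutex
	callbacks []connectCallback
}

// client stores the registry instead of capturing it in a closure.
type client struct {
	conn      *connection
	onConnect *callbackRegistry
}

// runOnConnect is a method, so it always reads the client's current
// connection instead of whatever a closure captured at construction time.
func (c *client) runOnConnect() error {
	if c.onConnect == nil {
		return nil
	}
	c.onConnect.mutex.Lock()
	defer c.onConnect.mutex.Unlock()

	for _, callback := range c.onConnect.callbacks {
		if err := callback(c.conn); err != nil {
			return err
		}
	}
	return nil
}

// idSeenAfterSwap replaces the client's connection and returns the id that
// the registered callback observes afterwards.
func idSeenAfterSwap() int {
	seen := 0
	registry := &callbackRegistry{callbacks: []connectCallback{
		func(conn *connection) error { seen = conn.id; return nil },
	}}

	c := &client{conn: &connection{id: 1}, onConnect: registry}
	c.conn = &connection{id: 2} // e.g. the connection was re-created

	if err := c.runOnConnect(); err != nil {
		return -1
	}
	return seen
}

func main() {
	fmt.Println("callback saw connection", idSeenAfterSwap())
}
```

Because the method dereferences `c.conn` at call time, there is no second reference that can go stale.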

cmacknz · 2024-09-16T21:08:30Z

I want to make sure I follow the lifetime of the connection properly. The Connect and Publish calls come from here:

beats/libbeat/publisher/pipeline/client_worker.go

Lines 131 to 158 in 3c03c74

    
    	// Try to (re)connect so we can publish batch
    	if !connected {
    		// Return batch to other output workers while we try to (re)connect
    		batch.Cancelled()
    		if reconnectAttempts == 0 {
    			w.logger.Infof("Connecting to %v", w.client)
    		} else {
    			w.logger.Infof("Attempting to reconnect to %v with %d reconnect attempt(s)", w.client, reconnectAttempts)
    		}
    		err := w.client.Connect()
    		connected = err == nil
    		if connected {
    			w.logger.Infof("Connection to %v established", w.client)
    			reconnectAttempts = 0
    		} else {
    			w.logger.Errorf("Failed to connect to %v: %v", w.client, err)
    			reconnectAttempts++
    		}
    		continue
    	}
    	if err := w.publishBatch(batch); err != nil {
    		connected = false
    	}
    }

The close comes from here:

beats/libbeat/outputs/backoff.go

Lines 60 to 64 in 3c03c74

    
    func (b *backoffClient) Publish(ctx context.Context, batch publisher.Batch) error {
    	err := b.client.Publish(ctx, batch)
    	if err != nil {
    		b.client.Close()
    	}

Can you make Connect accept a context.Context so that the context is actually tied to the lifetime of the connection the way it is supposed to be? Then the client worker run() function would be responsible for creating+cancelling the context. That you can't see the close actually happen in that loop since it is dependent on the client type implementation is also annoying but one thing at a time.

This would require touching the interface of every output but it looks like the correct place for the lifetime of the context to be managed.
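
A minimal Go sketch of that ownership model, where the worker creates the context and cancels it when it stops. The `connectable` interface, `fakeClient`, and the shape of `worker.run` are assumptions for illustration; they are not the real outputs.Connectable or client worker code.

```go
package main

import (
	"context"
	"fmt"
)

// connectable is a hypothetical version of the output interface in which
// Connect accepts a context.
type connectable interface {
	Connect(ctx context.Context) error
}

// fakeClient records the context it was connected with.
type fakeClient struct {
	ctx       context.Context
	connected chan struct{}
}

func (f *fakeClient) Connect(ctx context.Context) error {
	f.ctx = ctx
	close(f.connected)
	return nil
}

// worker is a hypothetical, heavily reduced client worker.
type worker struct {
	client connectable
	done   chan struct{}
}

// run owns the context: it creates it, hands it to Connect, and cancels it
// on the way out, so the connection cannot outlive the worker.
func (w *worker) run() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	if err := w.client.Connect(ctx); err != nil {
		return
	}
	<-w.done // the publish loop would live here
}

// contextLifetime checks the connection context while the worker runs and
// again after it has stopped.
func contextLifetime() (aliveWhileRunning, cancelledAfterStop bool) {
	fc := &fakeClient{connected: make(chan struct{})}
	w := &worker{client: fc, done: make(chan struct{})}

	finished := make(chan struct{})
	go func() {
		defer close(finished)
		w.run()
	}()

	<-fc.connected
	aliveWhileRunning = fc.ctx.Err() == nil

	close(w.done)
	<-finished
	cancelledAfterStop = fc.ctx.Err() != nil
	return aliveWhileRunning, cancelledAfterStop
}

func main() {
	alive, cancelled := contextLifetime()
	fmt.Println("context alive while worker runs:", alive)
	fmt.Println("context cancelled after worker stops:", cancelled)
}
```

Every output implementing the interface would then have to honour the context it receives, which is the cost noted above.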

It looks like we only have one other use of eslegclient for monitoring that is not a test.

beats/libbeat/monitoring/report/elasticsearch/elasticsearch.go

Lines 215 to 218 in 3c03c74

    
    for {
    	// Select one configured endpoint by random and check if xpack is available
    	client := r.out[rand.Intn(len(r.out))]
    	err := client.Connect()

marc-gr · 2024-09-17T08:24:56Z

> Can you make Connect accept a context.Context so that the context is actually tied to the lifetime of the connection the way it is supposed to be? Then the client worker run() function would be responsible for creating+cancelling the context. That you can't see the close actually happen in that loop since it is dependent on the client type implementation is also annoying but one thing at a time.
>
> This would require touching the interface of every output but it looks like the correct place for the lifetime of the context to be managed.

👍 , and in addition if this is done it would require the rest of outputs to honor that new context, too. IIRC each uses different cancellation mechanisms atm.

mergify · 2024-09-23T23:10:34Z

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix-es-connection-issue upstream/fix-es-connection-issue
git merge upstream/main
git push upstream fix-es-connection-issue

belimawr added the skip-ci Skip the build in the CI but linting label Sep 12, 2024

belimawr self-assigned this Sep 12, 2024

belimawr requested review from a team as code owners September 12, 2024 17:07

belimawr requested review from AndersonQ and faec September 12, 2024 17:07

botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Sep 12, 2024

mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Sep 12, 2024

belimawr force-pushed the fix-es-connection-issue branch from 5e4d4de to 877dc31 Compare September 12, 2024 17:11

belimawr added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Sep 12, 2024

botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Sep 12, 2024

belimawr added needs_team Indicates that the issue/PR needs a Team:* label and removed skip-ci Skip the build in the CI but linting labels Sep 12, 2024

botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Sep 12, 2024

belimawr mentioned this pull request Sep 12, 2024

Elasticsearch output does not recover after connection failure #40705

Closed

belimawr added the backport-8.15 Automated backport to the 8.15 branch with mergify label Sep 12, 2024

cmacknz reviewed Sep 12, 2024

libbeat/esleg/eslegclient/connection.go (outdated)

cmacknz requested a review from marc-gr September 12, 2024 19:58

belimawr requested a review from cmacknz September 13, 2024 16:37

AndersonQ reviewed Sep 16, 2024

libbeat/tests/integration/elasticsearch_test.go

libbeat/tests/integration/elasticsearch_test.go (outdated)

libbeat/tests/integration/elasticsearch_test.go

belimawr force-pushed the fix-es-connection-issue branch from 1e9dcf4 to 33bcac0 Compare September 16, 2024 13:34

belimawr requested a review from AndersonQ September 16, 2024 13:34

cmacknz reviewed Sep 16, 2024

libbeat/esleg/eslegclient/connection.go (outdated)

belimawr force-pushed the fix-es-connection-issue branch from 33bcac0 to f2718c3 Compare September 16, 2024 20:59

cmacknz reviewed Sep 16, 2024

belimawr marked this pull request as draft September 18, 2024 16:37

belimawr added 15 commits September 20, 2024 10:30

Move creating the request context to Connect

ad2ea4e

This commit moves the creation of the request context to the connect method.

Fix tests and lint warnings

91f2eb0

Initialise reqsContext in NewConnections

4974e62

There are some cases where the Connection will be used without calling Connect, so we initialise reqsContext and cancelReqs in the NewConnection function to avoid panics.

Fix lint warnings

d4673cd

Fix lint warnings

483de0c

Fix error messages and improve documentation

82e1f65

Cancel context before replacing it

b4f133f

Improve comments

62aad7b

Accept a context on Connect

45875ee

Connection.Connect now accepts a context to control the life cycle of its requests.

Fix python dependencies

b4bd47b

Add context to outputs.Connectable interface

17b3113

Add a context to outputs.Connectable.Connect to correctly manage the life cycle of the connection and its requests.

Add contexts when Beats create connections to ES

e02b43f

update tests

036502b

Revert PyYAML changes to fix x-pack/auditbeat tests

2dca761

belimawr force-pushed the fix-es-connection-issue branch from bfa6d6c to 2dca761 Compare September 20, 2024 14:39

cmacknz mentioned this pull request Sep 20, 2024

Regression test for recovery after Elasticsearch output connection failure #40928

Open

This was linked to issues Sep 20, 2024

Regression test for recovery after Elasticsearch output connection failure #40928

Open

Windows service for Beat does not stop gracefully #38666

Open

Windows service for Beat does not stop when output is unreachable #40518

Open

cmacknz changed the title ~~Fix elasticsearch re-connection after network error~~ Add test for elasticsearch re-connection after network error & allow graceful shutdown Sep 20, 2024
