-
Notifications
You must be signed in to change notification settings - Fork 24.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Watcher webhooks fail with invalidated SSL sessions #52997
Comments
Pinging @elastic/es-core-features (:Core/Features/Watcher) |
I ran a test of over a weekend from cloud for a HTTPS endpoint with 30m delay between Watches, and experienced no issues. However, I do believe that there is an issue that should be addressed. I think what is happening is that some firewall configurations are force-able closing the connection it see's as idle (Watcher uses HTTP client with a connection pool) and for some reason, the HTTP connection in the pool does not get the memo. Anecdotally it is most prevalent talking to Microsoft teams and https://discuss.elastic.co/t/watcher-and-microsoft-teams-webhook/203189 even mentions a |
I think it would help to expose I also don't see anything that enables TCP keepalives on these connections. I think we should do that by default on any connection that we expect to remain open for an extended period (relates #65213). |
Watcher uses a connection pool for outgoing HTTP traffic, which means that some HTTP connections may live for a long time, possibly in an idle state. Such connections may be silently torn down by a remote device, so that when we re-use them we encounter a `Connection reset` or similar error. This commit introduces a setting allowing users to set a finite expiry time on these connections, and also enables TCP keepalives on them by default so that a remote teardown will be actively detected sooner. Closes elastic#52997
Just curious - what's the downside of having default one? I'd say today there is more and more chances that some grumpy RST'ing loadbalancer gets in the way between ES and webhook endpoint. |
Watcher uses a connection pool for outgoing HTTP traffic, which means that some HTTP connections may live for a long time, possibly in an idle state. Such connections may be silently torn down by a remote device, so that when we re-use them we encounter a `Connection reset` or similar error. This commit introduces a setting allowing users to set a finite expiry time on these connections, and also enables TCP keepalives on them by default so that a remote teardown will be actively detected sooner. Closes #52997
Watcher uses a connection pool for outgoing HTTP traffic, which means that some HTTP connections may live for a long time, possibly in an idle state. Such connections may be silently torn down by a remote device, so that when we re-use them we encounter a `Connection reset` or similar error. This commit introduces a setting allowing users to set a finite expiry time on these connections, and also enables TCP keepalives on them by default so that a remote teardown will be actively detected sooner. Closes #52997
I'd rather not set a default TTL because that would result in tearing down and re-establishing connections unnecessarily in environments that don't need it. I also don't see a sensible choice for the default: something like 60s seems short for networks that don't need it, but is already long enough that it won't fix the problem on other networks. With properly-configured TCP keepalives the TTL should not be necessary anyway, so this gives us the best of both worlds: it should let us keep connections open forever even on networks that impose idle-connection timeouts. |
Elasticsearch version : 7.5.2
Description of the problem including expected versus actual behavior:
Webhook actions to call remote services which are configured to fire infrequently (say every 30 mins) will alternate between success and failure. This was reported by a user when watcher was hosted in a cloud environment. A successful call is followed up by a failure which logs the fact that the session is invalidated . This fails the user's action which is obviously less than great. Presumably this failure is also the cue for the Watcher client to drop the broken connection meaning the next triggered action will be forced to recreate the connection to the remote service. This would explain the flip-flop behaviour between successes and failures when triggering these actions.
Steps to reproduce:
A reproduction is possible using a self-managed Elasticsearch, running ES 7.5.2 on a macbook and configured with the following alert:
Next step, wait for the watcher to fire. And once it has fired, suspend laptop. After a while, unsuspend the laptop, and wait for the watcher to fire, and the first execution will fail with a reset connection.
Resolution
Ideally an invalidated session would be something a webhook action should attempt to recover from automatically with failing the user action.
The text was updated successfully, but these errors were encountered: