Watcher webhooks fail with invalidated SSL sessions #52997

Closed
markharwood opened this issue Mar 2, 2020 · 5 comments · Fixed by #72736

@markharwood
Contributor

Elasticsearch version : 7.5.2

Description of the problem including expected versus actual behavior:

Webhook actions that call remote services and are configured to fire infrequently (say, every 30 minutes) alternate between success and failure. This was reported by a user whose Watcher was hosted in a cloud environment. A successful call is followed by a failure which logs that the SSL session was invalidated, failing the user's action, which is obviously less than great. Presumably this failure is also the cue for Watcher's HTTP client to drop the broken connection, so the next triggered action is forced to recreate the connection to the remote service. That would explain the flip-flop between successes and failures when these actions are triggered.

Steps to reproduce:

A reproduction is possible with a self-managed Elasticsearch 7.5.2 running on a MacBook, configured with the following watch:

```json
{
  "trigger": {
    "schedule": {
      "interval": "30m"
    }
  },
  "input": {
    "simple": {}
  },
  "condition": {
    "always": {}
  },
  "actions": {
    "telegram": {
      "webhook": {
        "scheme": "https",
        "host": "api.telegram.org",
        "port": 443,
        "method": "post",
        "path": "/MYUUID/sendMessage",
        "params": {
          "text": "hello world",
          "chat_id": "-1"
        },
        "headers": {}
      }
    }
  }
}
```

Next, wait for the watch to fire. Once it has fired, suspend the laptop. After a while, resume the laptop and wait for the watch to fire again; the first execution will fail with a reset connection.

Resolution
Ideally an invalidated session is something a webhook action should attempt to recover from automatically, without failing the user's action.

@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Watcher)

@rjernst rjernst added the Team:Data Management Meta label for data/management team label May 4, 2020
@jakelandis
Contributor

I ran a test over a weekend, from Cloud against an HTTPS endpoint with a 30m delay between watches, and experienced no issues. However, I do believe there is an issue here that should be addressed. I think what is happening is that some firewall configurations forcibly close connections they see as idle (Watcher uses an HTTP client with a connection pool) and, for some reason, the pooled HTTP connection does not get the memo.

Anecdotally it is most prevalent when talking to Microsoft Teams, and https://discuss.elastic.co/t/watcher-and-microsoft-teams-webhook/203189 even mentions a tcp-rst-from-server. A TCP trace might be in order once we can reproduce this, to see who/what is closing the connection. I don't think this is a Watcher-specific error, but maybe there is some HTTP client configuration (or version) that would help, or possibly we need to handle this kind of error with a retry at the application level.
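A trace along the lines suggested above could be captured with tcpdump; the interface name and endpoint host below are placeholders for whatever applies in the reproduction environment:

```shell
# Capture only TCP segments with the RST flag set, to/from the webhook
# endpoint, to see which side tears down the idle connection.
# "en0" and the hostname are placeholders; requires root privileges.
tcpdump -i en0 -nn 'tcp[tcpflags] & tcp-rst != 0 and host api.telegram.org'
```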

@DaveCTurner
Contributor

> maybe there is some HTTP client configuration (version) that would help

I think it would help to expose HttpClientBuilder#setConnectionTimeToLive() when constructing the HTTP client in org.elasticsearch.xpack.watcher.common.http.HttpClient#createHttpClient. Without that I believe we are assuming that a pooled connection remains valid forever, but that appears not to be the case in some environments. I don't think we should change the default from today's behaviour of an infinite TTL.

I also don't see anything that enables TCP keepalives on these connections. I think we should do that by default on any connection that we expect to remain open for an extended period (relates #65213).
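The keepalive half of this can be illustrated with a plain JDK socket; this is a minimal sketch of the general mechanism, not Watcher's actual client code (which goes through Apache HttpClient, where the equivalent TTL knob is `HttpClientBuilder#setConnectionTimeToLive`):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class KeepAliveDemo {
    public static void main(String[] args) throws IOException {
        // A throwaway local server so the connect() below succeeds.
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket()) {
            // Enable TCP keepalives before connecting: the OS then
            // periodically probes an idle connection and detects a silent
            // remote teardown, instead of failing on the next re-use.
            client.setKeepAlive(true);
            client.connect(new InetSocketAddress("127.0.0.1", server.getLocalPort()));
            System.out.println("keepAlive=" + client.getKeepAlive());
        }
    }
}
```

Note that the keepalive interval itself is an OS-level tunable (e.g. `net.ipv4.tcp_keepalive_time` on Linux), which is why keepalives alone may need system configuration to detect teardowns quickly enough.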

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue May 5, 2021
Watcher uses a connection pool for outgoing HTTP traffic, which means
that some HTTP connections may live for a long time, possibly in an idle
state. Such connections may be silently torn down by a remote device, so
that when we re-use them we encounter a `Connection reset` or similar
error.

This commit introduces a setting allowing users to set a finite expiry
time on these connections, and also enables TCP keepalives on them by
default so that a remote teardown will be actively detected sooner.

Closes elastic#52997
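For reference, the setting introduced by this change can be applied in `elasticsearch.yml`. The setting name below is taken from the Watcher HTTP settings added by the linked PR; verify it against the documentation for your version:

```yaml
# Expire pooled Watcher HTTP connections after 5 minutes, so a
# middlebox-imposed idle timeout cannot leave a dead connection in the pool.
xpack.http.connection_pool_ttl: 5m
```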
@chingis-elastic

chingis-elastic commented May 6, 2021

> I don't think we should change the default from today's behaviour of an infinite TTL.

Just curious - what's the downside of having a finite default? I'd say that these days there is more and more chance that some grumpy RST'ing load balancer gets in the way between ES and the webhook endpoint.

DaveCTurner added a commit that referenced this issue May 6, 2021
DaveCTurner added a commit that referenced this issue May 6, 2021
@DaveCTurner
Contributor

I'd rather not set a default TTL because that would result in tearing down and re-establishing connections unnecessarily in environments that don't need it. I also don't see a sensible choice for the default: something like 60s seems short for networks that don't need it, but is already long enough that it won't fix the problem on other networks.

With properly-configured TCP keepalives the TTL should not be necessary anyway, so this gives us the best of both worlds: it should let us keep connections open forever even on networks that impose idle-connection timeouts.

6 participants