Watcher webhooks fail with invalidated SSL sessions #52997

Closed
markharwood opened this issue Mar 2, 2020 · 5 comments · Fixed by #72736

@markharwood
Contributor

Elasticsearch version : 7.5.2

Description of the problem including expected versus actual behavior:

Webhook actions that call remote services and are configured to fire infrequently (say, every 30 minutes) alternate between success and failure. This was reported by a user whose Watcher was hosted in a cloud environment. A successful call is followed by a failure which logs that the SSL session was invalidated, failing the user's action, which is obviously less than great. Presumably this failure is also the cue for Watcher's HTTP client to drop the broken connection, so the next triggered action is forced to recreate the connection to the remote service. That would explain the flip-flop between successes and failures when these actions are triggered.

Steps to reproduce:

A reproduction is possible with a self-managed Elasticsearch 7.5.2 running on a MacBook, configured with the following watch:

```json
{
  "trigger": {
    "schedule": {
      "interval": "30m"
    }
  },
  "input": {
    "simple": {}
  },
  "condition": {
    "always": {}
  },
  "actions": {
    "telegram": {
      "webhook": {
        "scheme": "https",
        "host": "api.telegram.org",
        "port": 443,
        "method": "post",
        "path": "/MYUUID/sendMessage",
        "params": {
          "text": "hello world",
          "chat_id": "-1"
        },
        "headers": {}
      }
    }
  }
}
```

Next, wait for the watch to fire. Once it has fired, suspend the laptop. After a while, resume the laptop and wait for the watch to fire again; the first execution will fail with a reset connection.

Resolution
Ideally an invalidated session is something a webhook action should attempt to recover from automatically, without failing the user's action.

@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/Watcher)

@rjernst rjernst added the Team:Data Management Meta label for data/management team label May 4, 2020
@jakelandis
Contributor

I ran a test over a weekend, from Cloud against an HTTPS endpoint with a 30m delay between watches, and experienced no issues. However, I do believe there is an issue here that should be addressed. I think what is happening is that some firewall configurations forcibly close connections they see as idle (Watcher uses an HTTP client with a connection pool) and, for some reason, the pooled HTTP connection does not get the memo.

Anecdotally it is most prevalent when talking to Microsoft Teams, and https://discuss.elastic.co/t/watcher-and-microsoft-teams-webhook/203189 even mentions a tcp-rst-from-server. A TCP trace might be in order once we can reproduce this, to see who/what is closing the connection. I don't think this is a Watcher-specific error, but maybe there is some HTTP client configuration (or version) that would help, or possibly we need to handle this kind of error with a retry at the application level.
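A trace along the lines suggested above could be captured with tcpdump; the interface name and endpoint host below are placeholders for whatever applies in the reproduction environment:

```shell
# Capture only TCP segments with the RST flag set, to/from the webhook
# endpoint, to see which side tears down the idle connection.
# "en0" and the hostname are placeholders; requires root privileges.
tcpdump -i en0 -nn 'tcp[tcpflags] & tcp-rst != 0 and host api.telegram.org'
```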

@DaveCTurner
Contributor

> maybe there is some HTTP client configuration (version) that would help

I think it would help to expose HttpClientBuilder#setConnectionTimeToLive() when constructing the HTTP client in org.elasticsearch.xpack.watcher.common.http.HttpClient#createHttpClient. Without that I believe we are assuming that a pooled connection remains valid forever, but that appears not to be the case in some environments. I don't think we should change the default from today's behaviour of an infinite TTL.

I also don't see anything that enables TCP keepalives on these connections. I think we should do that by default on any connection that we expect to remain open for an extended period (relates #65213).
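The keepalive half of this can be illustrated with a plain JDK socket; this is a minimal sketch of the general mechanism, not Watcher's actual client code (which goes through Apache HttpClient, where the equivalent TTL knob is `HttpClientBuilder#setConnectionTimeToLive`):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class KeepAliveDemo {
    public static void main(String[] args) throws IOException {
        // A throwaway local server so the connect() below succeeds.
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket()) {
            // Enable TCP keepalives before connecting: the OS then
            // periodically probes an idle connection and detects a silent
            // remote teardown, instead of failing on the next re-use.
            client.setKeepAlive(true);
            client.connect(new InetSocketAddress("127.0.0.1", server.getLocalPort()));
            System.out.println("keepAlive=" + client.getKeepAlive());
        }
    }
}
```

Note that the keepalive interval itself is an OS-level tunable (e.g. `net.ipv4.tcp_keepalive_time` on Linux), which is why keepalives alone may need system configuration to detect teardowns quickly enough.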

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue May 5, 2021
Watcher uses a connection pool for outgoing HTTP traffic, which means
that some HTTP connections may live for a long time, possibly in an idle
state. Such connections may be silently torn down by a remote device, so
that when we re-use them we encounter a `Connection reset` or similar
error.

This commit introduces a setting allowing users to set a finite expiry
time on these connections, and also enables TCP keepalives on them by
default so that a remote teardown will be actively detected sooner.

Closes elastic#52997
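For reference, the setting introduced by this change can be applied in `elasticsearch.yml`. The setting name below is taken from the Watcher HTTP settings added by the linked PR; verify it against the documentation for your version:

```yaml
# Expire pooled Watcher HTTP connections after 5 minutes, so a
# middlebox-imposed idle timeout cannot leave a dead connection in the pool.
xpack.http.connection_pool_ttl: 5m
```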
@chingis-elastic

chingis-elastic commented May 6, 2021

> I don't think we should change the default from today's behaviour of an infinite TTL.

Just curious - what's the downside of having a finite default? I'd say that these days there is more and more chance that some grumpy RST'ing load balancer gets in the way between ES and the webhook endpoint.

DaveCTurner added a commit that referenced this issue May 6, 2021
DaveCTurner added a commit that referenced this issue May 6, 2021
@DaveCTurner
Contributor

I'd rather not set a default TTL because that would result in tearing down and re-establishing connections unnecessarily in environments that don't need it. I also don't see a sensible choice for the default: something like 60s seems short for networks that don't need it, but is already long enough that it won't fix the problem on other networks.

With properly-configured TCP keepalives the TTL should not be necessary anyway, so this gives us the best of both worlds: it should let us keep connections open forever even on networks that impose idle-connection timeouts.

6 participants