Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support a globally shared CA #5539

Merged
merged 21 commits into from Apr 21, 2022
Merged

Support a globally shared CA #5539

merged 21 commits into from Apr 21, 2022

Conversation

pebrc
Copy link
Collaborator

@pebrc pebrc commented Mar 30, 2022

This adds support for a new operator parameter that allows users to specify a shared CA that will be used for all managed resources. If users opted to specify separate CAs per resource those configurations will be left untouched, however any resources that are using the default per-resource self-signed CA will use the shared CA.

The CA is specified as a directory via an operator param, currently --ca. The operator expects two files to be present in this directory tls.crt and tls.key. I am thinking of supporting ca.crt and ca.key as well to be consistent with the existing per-resource custom CA feature.

The operator watches the directory and restarts if it detects changes. This is following the behaviour we have for general operator configuration changes and simplifies the implementation IMO considerably.

I am still trying to figure out a testing strategy (unit/integaition tests but maybe also an e2e test) but wanted to raise the PR already in case there is early feedback.

@botelastic botelastic bot added the triage label Mar 30, 2022
@pebrc pebrc added the >feature Adds or discusses adding a feature to the product label Mar 30, 2022
@botelastic botelastic bot removed the triage label Mar 30, 2022
@pebrc
Copy link
Collaborator Author

pebrc commented Mar 31, 2022

I am running into trouble in the e2e tests with the fsnotify change detection in the e2e tests. I know we observed this before. I might need to switch to simple polling instead for both the CA and the operator configuration.

@pebrc pebrc force-pushed the shared-ca branch 2 times, most recently from b7d3852 to b505883 Compare April 4, 2022 12:14
@pebrc pebrc marked this pull request as ready for review April 4, 2022 12:14
@pebrc
Copy link
Collaborator Author

pebrc commented Apr 4, 2022

I am not 💯 sure about supporting the two naming conventions ca.* and tls.* (the latter being the standard in K8s TLS secrets) and therefore happy to back out that last commit 5da3b3e

Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only did a first review, but looks good so far 👍
Thanks for the FileWatcher, I'm pretty sure it'll make config file detection more reliable.

pkg/controller/common/operator/parameters.go Outdated Show resolved Hide resolved
pkg/controller/common/operator/parameters.go Outdated Show resolved Hide resolved
pkg/utils/fs/watcher.go Outdated Show resolved Hide resolved
requireEventEquals := func(c chan []string, expected []string, timeout time.Duration) {
select {
case event := <-c:
require.Equal(t, expected, event)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need a strict equality here, my gut feeling is that it can be a cause of flaky tests

Suggested change
require.Equal(t, expected, event)
require.ElementsMatch(t, expected, event)

pkg/utils/fs/watcher_integration_test.go Show resolved Hide resolved
pkg/utils/fs/watcher.go Show resolved Hide resolved
pkg/controller/common/operator/flags.go Outdated Show resolved Hide resolved
test/e2e/ca_test.go Outdated Show resolved Hide resolved
cmd/manager/main.go Outdated Show resolved Hide resolved
pkg/controller/common/certificates/ca_reconcile.go Outdated Show resolved Hide resolved
Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have run a few other tests, including running the E2E test, or enabling/disabling the feature on existing deployments. It seems to work as expected 👍 During these tests I only noticed that the *-ca-internal Secrets are not removed, while my understanding is that they are not required anymore. However it's definitely not a big deal since it's completely harmless.

I'll take the time to do another review later.

test/e2e/ca_test.go Outdated Show resolved Hide resolved
test/e2e/ca_test.go Outdated Show resolved Hide resolved
pkg/controller/common/certificates/ca_reconcile.go Outdated Show resolved Hide resolved
pkg/controller/common/certificates/ca_reconcile.go Outdated Show resolved Hide resolved
pkg/controller/elasticsearch/certificates/transport/ca.go Outdated Show resolved Hide resolved
pkg/utils/fs/watcher.go Outdated Show resolved Hide resolved
test/e2e/test/elasticsearch/builder.go Outdated Show resolved Hide resolved
@pebrc
Copy link
Collaborator Author

pebrc commented Apr 19, 2022

Jenkins test this please

Makefile:63: recipe for target 'ci-build-image' failed

@pebrc pebrc requested review from barkbay and naemono April 19, 2022 15:10
cmd/manager/main.go Outdated Show resolved Hide resolved
test/e2e/global_ca_test.go Outdated Show resolved Hide resolved
@pebrc
Copy link
Collaborator Author

pebrc commented Apr 19, 2022

Unfortunately it looks like the file watcher integration test I added is flaky. Trying to improve on the current implementation.

@pebrc
Copy link
Collaborator Author

pebrc commented Apr 19, 2022

Sometimes mtime changes more than once on writes. I am not entirely sure why given the test case uses probably a single write syscall. But then I again I am not an expert on how these things work on unix'ish OSs.

/var/folders/1x/dzn9_byd2sn3cpn48j3d14m40000gn/T/file-watcher-test105416277/file2 prev 0001-01-01 00:00:00 +0000 UTC now 2022-04-19 20:52:24.299815768 +0200 CEST
/var/folders/1x/dzn9_byd2sn3cpn48j3d14m40000gn/T/file-watcher-test105416277/file1 prev 0001-01-01 00:00:00 +0000 UTC now 2022-04-19 20:52:24.300063142 +0200 CEST
/var/folders/1x/dzn9_byd2sn3cpn48j3d14m40000gn/T/file-watcher-test105416277/file3 prev 0001-01-01 00:00:00 +0000 UTC now 2022-04-19 20:52:24.300206579 +0200 CEST
/var/folders/1x/dzn9_byd2sn3cpn48j3d14m40000gn/T/file-watcher-test105416277/file4 prev 0001-01-01 00:00:00 +0000 UTC now 2022-04-19 20:52:24.302395466 +0200 CEST
/var/folders/1x/dzn9_byd2sn3cpn48j3d14m40000gn/T/file-watcher-test105416277/file4 prev 2022-04-19 20:52:24.302395466 +0200 CEST now 2022-04-19 20:52:24.302612857 +0200 CEST
/var/folders/1x/dzn9_byd2sn3cpn48j3d14m40000gn/T/file-watcher-test105416277/file2 prev 2022-04-19 20:52:24.299815768 +0200 CEST now 2022-04-19 20:52:24.302811553 +0200 CEST

See how file4 changes twice in the output above.

I modified the test's assumptions to allow for additional events on the simulated writes.

@barkbay
Copy link
Contributor

barkbay commented Apr 20, 2022

I am not entirely sure why given the test case uses probably a single write syscall.

I think that there are always at least 2 underlying syscall (using os.WriteFile):

  1. A first one to create an empty file, or truncate the content if it already exists: mtime is changed and file size is 0
  2. A second one to write the data to the empty file , mtime is changed once again

Running the WriteFile in debug mode, step by step, and using stat at the same time we can see that mtime is changing at least twice:

func WriteFile(name string, data []byte, perm FileMode) error {
	f, err := OpenFile(name, O_WRONLY|O_CREATE|O_TRUNC, perm) // -- File is created or truncated, `stat -f "%Sm" file1 = Apr 20 08:56:10 2022`
	if err != nil {...}
	_, err = f.Write(data) // -- data is written (not sure if it is synced though) -- `stat -f "%Sm" file1 = Apr 20 08:56:16 2022`
	if err1 := f.Close(); err1 != nil && err == nil { // -- data is flushed on disk?
		err = err1
	}
	return err
}

Also if the file is mapped to several pages in memory, I'm not sure there is a guarantee that all the pages are synced at the same time and mtime is changed only once. I don't know if this article is still relevant, but it is still interesting as it describes a behavior that has existed:

So a quick operation to make a page writable turns into a heavyweight filesystem operation, and it happens every time the application attempts to write to a clean page. If the application writes large numbers of pages that have been mapped into memory, the result will be a painful slowdown. And most of that effort is wasted; the timestamp updates overwrite each other, so only the last one will persist for any useful period of time.

@pebrc
Copy link
Collaborator Author

pebrc commented Apr 20, 2022

A first one to create an empty file, or truncate the content if it already exists: mtime is changed and file size is 0

Of course! I clearly didn't think this through. Thanks for the great explanation!

Also if the file is mapped to several pages in memory, I'm not sure there is a guarantee that all the pages are synced at the same time and mtime is changed only once.

Do you mean this in the sense that the test might still be flaky and we could see yet another event coming from another mtime update?

Copy link
Contributor

@barkbay barkbay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, I just left some nitpicks 👍 🚀

pkg/controller/common/operator/flags.go Show resolved Hide resolved
cmd/manager/main.go Outdated Show resolved Hide resolved
pkg/controller/common/certificates/ca_reconcile.go Outdated Show resolved Hide resolved
pkg/utils/fs/watcher_integration_test.go Outdated Show resolved Hide resolved
@barkbay
Copy link
Contributor

barkbay commented Apr 20, 2022

Do you mean this in the sense that the test might still be flaky and we could see yet another event coming from another mtime update?

No, such small files are very unlikely to be spread across multiple pages (min. page size is 4KB on Linux IIUC).

@pebrc pebrc merged commit 6e93392 into elastic:main Apr 21, 2022
@ejsmith
Copy link

ejsmith commented Jun 24, 2022

Is it possible to use this feature to get ES, Kibana and APM to use certs that are signed by a trusted authority like Let's Encrypt so that my app trusts the cert without having to configure my app to add a trusted self-signed CA?

@jmp601
Copy link

jmp601 commented Jul 8, 2022

Is it possible to use this feature to get ES, Kibana and APM to use certs that are signed by a trusted authority like Let's Encrypt so that my app trusts the cert without having to configure my app to add a trusted self-signed CA?

We are an elastic enterprise customer and have the same ask. I have a hard time configuring ECK with e2e with Let's encrypt certs.

@naemono
Copy link
Contributor

naemono commented Jul 9, 2022

We have some documentation around this feature:

https://www.elastic.co/guide/en/cloud-on-k8s/master/k8s-tls-certificates.html#k8s-setting-up-your-own-certificate

and here an example for cert manager (you can replace the self-signed issue of course with a publicly trusted one like Let's Encrypt)
https://www.elastic.co/guide/en/cloud-on-k8s/master/k8s-custom-http-certificate.html#k8s_custom_self_signed_certificate_using_cert_manager

We also have an issue tracking updating our documentation to be more clear #4812

@ejsmith
Copy link

ejsmith commented Jul 9, 2022

Would love to see a tutorial that sets up cert-manager with let's encrypt and then creates an ES cluster with a trusted SSL feet that apps can use without having to do anything as well as exposing a public Kibana and APM instance that an app can use to send front end APM data to.

fantapsody pushed a commit to fantapsody/cloud-on-k8s that referenced this pull request Feb 7, 2023
This adds support for a new operator parameter that allows users to specify a shared CA that will be used for all managed resources. If users opted to specify separate CAs per resource those configurations will be left untouched, however any resources that are using the default per-resource self-signed CA will use the shared CA.

The CA is specified as a directory via an operator param, --ca-dir. The operator expects two files to be present in this directory tls.crt and tls.key -- ca.crt and ca.key are supported as well to be consistent with the existing per-resource custom CA feature.

The operator watches the directory and restarts if it detects changes.
@jbouse
Copy link
Contributor

jbouse commented Oct 16, 2023

Is this still available in v2.9.0? I'm unable to find mention of it in the documentation nor in the Helm chart for the operator deployment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature Adds or discusses adding a feature to the product v2.3.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants