-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
curve.cert generation within the pod using zeromq #150
Comments
In kubeflow mpi-operator, it is the operator itself that generates keys: https://github.com/kubeflow/mpi-operator/blob/c77dfcf42fb8a39b1824c6fe540a929b84a36844/pkg/controller/mpi_job_controller.go#L1274 Would this be equivalent? |
This would require building zeromq in the operator image and then using the bindings - I was using them through Flux but (in retrospect) I could try just using them on their own. I'll give this a try - I think I had attempted it last year but wasn't able to get it working (mostly just new to cgo) but given I have it working through Flux bindings, it's probably worth another try! That would totally solve the issue for good, so it's a great suggestion 👍 |
An update on the work from today! I was able to dig into zeromq bindings with @grondo and @garlick to figure out how to not just generate the certificate to a path (how Flux does it) but to save it to a string (that we can do in Go). We figured that out toward the end of the day, and the rough draft of it looks like this: https://github.com/converged-computing/flux-go/blob/main/pkg/flux/keygen.go Then I moved into the Flux Operator to add similar logic to the operator itself - but not requiring an entire flux install, just the zeromq bindings. There were a few "gotchas" in terms of getting this to work with the operator-sdk and the original distroless image (both do not like linked libraries!) but I was able to get it working. I then did some experiments comparing
And here are the final results, just a local mini run of LAMMPS 20x each. # A. These are times with the operator generating the curve certificate
[11.522273302078247, 11.835262298583984, 11.50618028640747, 11.119516849517822, 11.024176120758057, 11.401328563690186, 11.369296312332153, 11.040543556213379, 11.348724126815796, 11.640495300292969, 11.585636854171753, 11.250822067260742, 11.169968128204346, 11.857943773269653, 11.732101202011108, 11.886579751968384, 11.454119682312012, 10.956253290176392, 11.347179651260376, 11.463804721832275]
Mean: 11.425610291957856
Std: 0.2796793262007871 And for the original (current) design: # B. These are times with a one-off pod doing it
[14.729787349700928, 13.066138982772827, 14.67087459564209, 14.394914865493774, 14.968234062194824, 14.576695919036865, 14.472598314285278, 14.54642391204834, 14.63283109664917, 14.919774770736694, 14.643915176391602, 14.472495317459106, 14.504558563232422, 15.089675903320312, 12.9026780128479, 14.505521535873413, 14.302513122558594, 15.221238851547241, 14.384584665298462, 14.278271198272705]
Mean: 14.464186310768127
Std: 0.5660379429411524 We can see when the operator generates it's own certificate using zeromq bindings it is ~3 seconds faster (mean) than the original design with a one-off pod, and of course the design itself is improved to not require the pod. Those are the improvements!! The cons to this approach are that if a local developer wants to build on their machine, they do need to install the development libraries for zeromq. I tried to offset this by adding a new TLDR: I think I'm happy with this change - will likely do one more look over tomorrow (later today) and then merge, there were some changes needed for the tests and I want to make sure everything looks OK. |
Done with #152. The CSI will still be fun, but not required at this point. |
We currently require an extra "one off" pod to generate the curve certificate, and using the same flux container. This could be simplified if we had a CSI driver that would automatically be used to generate the certificate once, and then mount as a read only volume to all the pods instead. The original discussion around this issue can be found here: kubernetes-sigs/kueue#716. For this issue we need to:
The above is done, see below for experiment results! I'll likely merge the work tomorrow after one final look over.
The text was updated successfully, but these errors were encountered: