Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

curve.cert generation within the pod using zeromq #150

Closed
3 tasks done
vsoch opened this issue Apr 26, 2023 · 4 comments
Closed
3 tasks done

curve.cert generation within the pod using zeromq #150

vsoch opened this issue Apr 26, 2023 · 4 comments

Comments

@vsoch
Copy link
Member

vsoch commented Apr 26, 2023

We currently require an extra "one off" pod to generate the curve certificate, and using the same flux container. This could be simplified if we had a CSI driver that would automatically be used to generate the certificate once, and then mount as a read only volume to all the pods instead. The original discussion around this issue can be found here: kubernetes-sigs/kueue#716. For this issue we need to:

  • prototype building zeromq (with flux) likely separately
  • do the same, but for the operator container / controller
  • do basic experiments to test improvement in minicluster generation times

The above is done, see below for experiment results! I'll likely merge the work tomorrow after one final look over.

@alculquicondor
Copy link

In kubeflow mpi-operator, it is the operator itself that generates keys: https://github.com/kubeflow/mpi-operator/blob/c77dfcf42fb8a39b1824c6fe540a929b84a36844/pkg/controller/mpi_job_controller.go#L1274

Would this be equivalent?

@vsoch
Copy link
Member Author

vsoch commented Apr 26, 2023

This would require building zeromq in the operator image and then using the bindings - I was using them through Flux but (in retrospect) I could try just using them on their own. I'll give this a try - I think I had attempted it last year but wasn't able to get it working (mostly just new to cgo) but given I have it working through Flux bindings, it's probably worth another try! That would totally solve the issue for good, so it's a great suggestion 👍

@vsoch vsoch changed the title Container Storage Interface CSI to generate certificate (or other means) curve.cert generation within the pod using zermq Apr 27, 2023
@vsoch
Copy link
Member Author

vsoch commented Apr 27, 2023

An update on the work from today!

I was able to dig into zeromq bindings with @grondo and @garlick to figure out how to not just generate the certificate to a path (how Flux does it) but to save it to a string (that we can do in Go). We figured that out toward the end of the day, and the rough draft of it looks like this: https://github.com/converged-computing/flux-go/blob/main/pkg/flux/keygen.go

Then I moved into the Flux Operator to add similar logic to the operator itself - but not requiring an entire flux install, just the zeromq bindings. There were a few "gotchas" in terms of getting this to work with the operator-sdk and the original distroless image (both do not like linked libraries!) but I was able to get it working. I then did some experiments comparing

  • A. the operator generating it's own curve cert internally via zeromq bindings, vs.
  • B. the operator spinning up a "one off" pod with flux to run flux keygen to do the same

And here are the final results, just a local mini run of LAMMPS 20x each.

# A. These are times with the operator generating the curve certificate
[11.522273302078247, 11.835262298583984, 11.50618028640747, 11.119516849517822, 11.024176120758057, 11.401328563690186, 11.369296312332153, 11.040543556213379, 11.348724126815796, 11.640495300292969, 11.585636854171753, 11.250822067260742, 11.169968128204346, 11.857943773269653, 11.732101202011108, 11.886579751968384, 11.454119682312012, 10.956253290176392, 11.347179651260376, 11.463804721832275]
Mean: 11.425610291957856
Std: 0.2796793262007871

And for the original (current) design:

# B. These are times with a one-off pod doing it
[14.729787349700928, 13.066138982772827, 14.67087459564209, 14.394914865493774, 14.968234062194824, 14.576695919036865, 14.472598314285278, 14.54642391204834, 14.63283109664917, 14.919774770736694, 14.643915176391602, 14.472495317459106, 14.504558563232422, 15.089675903320312, 12.9026780128479, 14.505521535873413, 14.302513122558594, 15.221238851547241, 14.384584665298462, 14.278271198272705]
Mean: 14.464186310768127
Std: 0.5660379429411524

We can see when the operator generates it's own certificate using zeromq bindings it is ~3 seconds faster (mean) than the original design with a one-off pod, and of course the design itself is improved to not require the pod. Those are the improvements!!

The cons to this approach are that if a local developer wants to build on their machine, they do need to install the development libraries for zeromq. I tried to offset this by adding a new make deploy-local that will build a container locally, and allow them to load into MiniKube with imagePullPolicy set to "Never" so it uses as is. This also means the operator image itself is a bit bigger and will take a little longer, but I think this is an OK tradeoff. That largely just happens once!

TLDR: I think I'm happy with this change - will likely do one more look over tomorrow (later today) and then merge, there were some changes needed for the tests and I want to make sure everything looks OK.

@vsoch vsoch changed the title curve.cert generation within the pod using zermq curve.cert generation within the pod using zeromq Apr 27, 2023
@vsoch
Copy link
Member Author

vsoch commented Apr 27, 2023

Done with #152. The CSI will still be fun, but not required at this point.

@vsoch vsoch closed this as completed Apr 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants