Next round of updates: operator details #22

Closed
5 of 8 tasks
vsoch opened this issue Sep 14, 2022 · 14 comments

@vsoch
Member

vsoch commented Sep 14, 2022

  • try testing the "worker" pods without sleep infinity, so they just start (see Updated entrypoints so pods run and complete #23)
  • create delete callbacks, e.g., if we do sleep infinity and then need to clean up
  • Slurm feature (salloc option) to just start a job (locate the resource and keep it going?) that you need to explicitly stop with --no-shell (we probably don't need this anymore given the change in entrypoint)
  • set a maximum time for the Job from the CRD (see Updated entrypoints so pods run and complete #23)
  • We might want a check that doesn't allow updates while it's running, or allows them and prints lots of warning emojis
  • Is there a way to scale "workers" without borking the main rank 0 running?
  • Debug cases of size 0, and ensure that size 1 works.
  • Add support for an image pull secret
@vsoch vsoch changed the title TODOs Next round of updates: operator details Sep 14, 2022
@vsoch
Member Author

vsoch commented Sep 14, 2022

ping @zekemorton - let's chat when you have some time about what you'd like to jump on (or if you don't have bandwidth, no issue either)

@zekemorton
Collaborator

As an additional todo item, what do you think about adding the option for an imagePullSecret to the MiniCluster API? Essentially this could default to not having one, and if specified it would be the name of an already-created secret to use as the imagePullSecret for non-public images.
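
Just as a sketch of what I have in mind (field and type names here are placeholders, not a final API):

```go
// Hypothetical sketch of the proposed field; names are placeholders.
package v1alpha1

type MiniClusterSpec struct {
	// Size is the number of pods in the MiniCluster.
	Size int32 `json:"size"`

	// ImagePullSecret names an already-created Secret used to pull
	// non-public images; empty (the default) means no pull secret.
	// +optional
	ImagePullSecret string `json:"imagePullSecret,omitempty"`
}
```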

@zekemorton
Collaborator

zekemorton commented Sep 15, 2022

@vsoch The delete callback sounds like an important item on the list; maybe I'll start looking into that and let you know if I run into any issues?

@vsoch
Member Author

vsoch commented Sep 15, 2022

> As an additional todo item, what do you think about adding the option for an imagePullSecret to the MiniCluster API? Essentially this could default to not having one, and if specified it would be the name of an already-created secret to use as the imagePullSecret for non-public images.

Yep if we want support for private pulls, that's definitely something to add! It would be a simple check for the attribute, and if it's defined, add it to the jobspec pod. It's not a huge priority right now so I wouldn't do it first (unless you have a private image you want to test!) but let's add to the TODO.
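
Roughly, the check could look something like this (a hedged sketch; it assumes a string imagePullSecret attribute on the spec, which doesn't exist yet):

```go
package controllers

import corev1 "k8s.io/api/core/v1"

// addPullSecret is a hypothetical helper: only if the CRD attribute is
// defined do we reference the existing Secret from the pod spec.
func addPullSecret(podSpec *corev1.PodSpec, name string) {
	if name != "" {
		podSpec.ImagePullSecrets = append(podSpec.ImagePullSecrets,
			corev1.LocalObjectReference{Name: name})
	}
}
```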

> The delete callback sounds like an important item on the list; maybe I'll start looking into that and let you know if I run into any issues?

Absolutely, it's yours! Take a look in events.go - that should have examples for different events (that aren't currently used). We would basically want to clean everything up for a CRD given that request.
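
If it helps, the usual controller-runtime shape for those callbacks is a predicate set; a minimal sketch (not our actual events.go):

```go
package controllers

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// miniClusterEvents sketches a predicate set with a delete callback;
// cleanup for the CRD would hook into DeleteFunc.
func miniClusterEvents() predicate.Funcs {
	return predicate.Funcs{
		DeleteFunc: func(e event.DeleteEvent) bool {
			// clean everything up for the deleted MiniCluster here
			return true // still let the reconciler see the deletion
		},
	}
}
```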

@zekemorton
Collaborator

@vsoch I may have misunderstood what the goal was for the delete callback; it seems to already do what I thought needed implementing. My thoughts were that we needed to implement a way to delete a MiniCluster and have that clean up all the pods, jobs, configmaps, etc. It looks to already do this?

$ kubectl apply -f flux-framework.org_v1alpha1_minicluster.yaml 
minicluster.flux-framework.org/flux-sample created
$ kubectl get all
NAME                      READY   STATUS    RESTARTS   AGE
pod/flux-sample-0-dz2wp   1/1     Running   0          6s
pod/flux-sample-1-f6vwz   1/1     Running   0          6s
pod/flux-sample-2-wbfnq   1/1     Running   0          6s
pod/flux-sample-3-mp888   1/1     Running   0          6s
pod/flux-sample-4-shrdj   1/1     Running   0          6s
pod/flux-sample-5-xbg72   1/1     Running   0          6s
$ kubectl delete -f flux-framework.org_v1alpha1_minicluster.yaml 
minicluster.flux-framework.org "flux-sample" deleted
$ kubectl get all
NAME                      READY   STATUS        RESTARTS   AGE
pod/flux-sample-0-dz2wp   1/1     Terminating   0          26s
pod/flux-sample-1-f6vwz   1/1     Terminating   0          26s
pod/flux-sample-2-wbfnq   1/1     Terminating   0          26s
pod/flux-sample-3-mp888   1/1     Terminating   0          26s
pod/flux-sample-4-shrdj   1/1     Terminating   0          26s
pod/flux-sample-5-xbg72   1/1     Terminating   0          26s

Was there something else that you had in mind for this todo item?

@vsoch
Member Author

vsoch commented Sep 15, 2022

I think the original need was based on the "we need to clean up" case, because the jobs didn't complete; but if they complete (and that works), then we are good! I'll check it off.

@zekemorton
Collaborator

I am pretty sure that even after a job is completed, all the pods stick around in the completed state and nothing is actually deleted. This is good because you can still check the logs of the completed pods.

Using kubectl delete on the MiniCluster will clean everything up. I believe the default is a cascading delete based on owner references, so everything that the MiniCluster creates will also be deleted by this action. This is useful both for jobs that complete and for ones that might get stuck or are just running sleep.
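
(I think the cascade works because the operator sets itself as owner of what it creates; a sketch of that pattern in the style of the kubebuilder book, with made-up variable names:)

```go
package controllers

import (
	batchv1 "k8s.io/api/batch/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
)

// setOwner marks the MiniCluster as owner of a created Job, which is
// what makes `kubectl delete minicluster` cascade to the Job and pods.
func setOwner(cluster metav1.Object, job *batchv1.Job, scheme *runtime.Scheme) error {
	return ctrl.SetControllerReference(cluster, job, scheme)
}
```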

But yes, I think we can check this one off!

@vsoch
Member Author

vsoch commented Sep 15, 2022

Are there any others in the TODOs that you might want to try?

@zekemorton
Collaborator

zekemorton commented Sep 16, 2022

I took a look at "Debug cases of size 0, and ensure that size 1 works," and it looks like for size 0, none of the other resources are created, and the logs do show that it caught it: "A MiniCluster without nodes? Is this a cluster for invisible ants? Canceling!" For size 1, it seems to have worked just fine!
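
(A guess at the shape of that early exit in the reconciler; variable and function names are made up, and the log line is the one quoted above:)

```go
package controllers

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log"
)

// reconcileSize is a guess at the shape of that check: size 0 logs the
// message above and ends the reconcile before creating sub-resources.
func reconcileSize(size int32) (ctrl.Result, error) {
	if size == 0 {
		log.Log.Info("A MiniCluster without nodes? Is this a cluster for invisible ants? Canceling!")
		return ctrl.Result{}, nil // stop: create nothing
	}
	// ... otherwise continue creating the cluster's sub-resources ...
	return ctrl.Result{}, nil
}
```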

For size 0 then, would we have some kind of message or error occur? Like if you were to use a string instead of an integer?

I'd also be happy to try the ImagePullSecret one as well, looks like it shouldn't be too involved!

@vsoch
Member Author

vsoch commented Sep 16, 2022

> For size 0 then, would we have some kind of message or error occur?

For size 0 we have the message that you found - I would assume the person submitting the yaml would see that? If you have a better message / better way to indicate that 0 isn't allowed, I'd be OK with changing that!

> Like if you were to use a string instead of an integer?

I haven't tested this, but I'd guess if the wrong data type is provided it's either going to show an error or convert it automatically.

> I'd also be happy to try the ImagePullSecret one as well, looks like it shouldn't be too involved!

Sounds good! Do we have any private containers? If you have an example, we can add it to the folder here: https://github.com/flux-framework/flux-operator/tree/main/docker. If you use uptodate to generate the matrix (https://vsoch.github.io/uptodate/) it will be discovered and built automatically. Either we could reproduce that same example (with some private tag in the name) or make a new one, and then make the GitHub package private to test it out. Let me know if you have any questions about that - I developed uptodate for RADIUSS builds last year.

@zekemorton
Collaborator

w.r.t. the ImagePullSecrets, I was able to implement that functionality and test it against a private copy of the same image/command we have been testing with. I'll open a PR now for some feedback and to address some specific questions I have on it!

@zekemorton
Collaborator

> For size 0 we have the message that you found - I would assume the person submitting the yaml would see that?

I don't think that they would unless they look at the logs for the operator. It would be nice to have an error message pop up right after the kubectl apply that says it's an invalid argument and fails the creation. Right now it will create the MiniCluster object, but it won't create any of the sub-resources. It's probably best not to create anything at all, right? I can look into it a bit more to see how best to achieve this!
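
One way to get that (an assumption on my part, not something we do yet) is OpenAPI validation on the CRD itself, so the API server rejects the object before the operator ever sees it; with kubebuilder that would be a marker on the spec field:

```go
// Hypothetical sketch: a validation marker makes the API server
// reject size: 0 at `kubectl apply` time, before the MiniCluster
// object is ever created.
package v1alpha1

type MiniClusterSpec struct {
	// +kubebuilder:validation:Minimum=1
	Size int32 `json:"size"`
}
```

The generated schema would also reject a string where an integer is expected, which covers the other case above.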

@vsoch
Member Author

vsoch commented Sep 16, 2022

> I don't think that they would unless they look at the logs for the operator.

Is there somewhere else to look? Oh, you mean an error directly in the console after running kubectl? Yeah that's a great idea - I'm not sure how to do that!

> Right now it will create the MiniCluster object, but it won't create any of the sub-resources. It's probably best not to create anything at all, right? I can look into it a bit more to see how best to achieve this!

Sounds good!

@vsoch
Member Author

vsoch commented Sep 28, 2022

I think we are good to close here - we can't really support any kind of live update if Flux doesn't support that, so we can come back to it when the time comes.

@vsoch vsoch closed this as completed Sep 28, 2022