Revisit GPUMemoryMB, CUDARuntime and CUDACapabilities at workflow and job matchmaking #11595
Comments
Hi Alan, for the "resource provisioning step", aka first matchmaking, we are only asking WMAgent to put

For the job (second) matchmaking we can use any of the attributes here: https://monit-grafana.cern.ch/d/2qoPfS0Mz/cms-submission-infrastructure-gpus-monitor?orgId=11

My preference would be to add a
Thank you for the prompt feedback, Marco! For the job matchmaking, my concerns in terms of job classads are:
Maybe we can use
Good question. Condor supports the
And also
Hi Alan, when you describe this:
are you implying that more than one of the steps will be ready to use the GPUs? Is this the case we are trying to cover here?
That's correct, Antonio. It's common to have workflows using different CMSSW releases and those can be compiled with different CUDA support, hence we need to ensure that the job description and worker node can satisfy all of the different cmsRun requirements/capabilities.
Please let CRAB people (esp. @novicecpp) know in case this ends up in some non-backward compatible format specification in the current classAds (from string to list, e.g.).
From further discussion in mattermost:
@mmascher @mambelli Hi Marco and Marco, I wonder if you can suggest how we can support more complex job matchmaking in glideinWMS? A summary of what has been discussed above is:
Or please let us know if you have any other suggestions that were not yet mentioned here.
@mmascher Marco, now that the requirements are clearer (see comment above), I wonder if you have any recommendations on how to define these in terms of job classads?
Updating this issue with discussions that happened mostly in mattermost (SI and GPU Developments). Regarding the 2 glideinWMS questions raised above, here is further information on how that can be accomplished:
We can split the version number into its parts and compare them one by one (e.g. below for a version with major.medium.minor values):
and an example to test such expression can be seen in [1]. For the second question:
Marco suggests adapting [1] from Marco M.
[2]
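The version-split comparison described above can be illustrated in plain Python (an illustration only: the real check is a ClassAd/GlideinWMS expression, and the helper names below are made up for this sketch):

```python
# Illustration only: compare dotted version strings component by component,
# the way the expression referenced above splits the version number.

def version_tuple(version):
    """Split a "major.medium.minor" version string into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))

def same_or_newer(node_version, job_version):
    """True if the node's version is the same as or newer than the job's."""
    return version_tuple(node_version) >= version_tuple(job_version)
```

Comparing tuples of integers avoids the lexicographic trap where the plain string "11.10.0" sorts before "11.2.0".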
With this information, I think the development of this feature can be resumed, and in parallel we can discuss possible upgrades of the SI infrastructure.
The capabilities are always of the format
If CUDA capability is always in the format of
All NVIDIA documentation about CUDA describes it as

It might reasonably become

If it ever does change in a non-backward compatible way, we can always update the comparison code, no?
@amaltaro Should this really be marked as "Waiting" if a PR was linked to this issue this past week? Seems more "in progress" to me.
@klannon development of this feature should be complete, apart from further testing. Now we have a dependency on the Submission Infrastructure team to upgrade HTCondor. I have not yet contacted them because, from Mattermost, I understand that most of the SI team is on vacation.
Now that people are back from vacation season, here is a ticket to address the required developments at the glideinWMS layer: |
Just a note: we decided not to pull this ticket into 2023 Q4, as there are dependencies on the SI infrastructure and it's not clear when those will be implemented.
Impact of the new feature
WMAgent (but perhaps ReqMgr2 for multi-step workflows, aka StepChain)
Is your feature request related to a problem? Please describe.
This is related to adding GPU support to StepChain workflows, tracked in this ticket:
#10401
After discussing scenarios and use cases of multi-step GPU workflows, in this thread and a couple of messages after it, it was pointed out that the GPU job description and matchmaking should evolve with the latest GPU developments.
For the record, here is an example of how it can be currently configured:
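(The configuration snippet itself did not survive this copy of the issue. Purely as an illustration, using the attribute names discussed in this thread, a single-step GPU requirement looks roughly like the following; treat the exact keys and values as assumptions, since the authoritative schema is the gpu-parameter-specification documentation referenced below.)

```python
# Illustrative only: GPU requirements of the kind discussed in this issue,
# using the attribute names mentioned in the thread. The exact schema is
# defined by the gpu-parameter-specification documentation, not by this sketch.
gpu_params = {
    "GPUMemoryMB": 8000,                  # memory the job needs on the GPU
    "CUDARuntime": "11.2",                # currently a single version string
    "CUDACapabilities": ["7.5", "8.0"],   # list of acceptable compute capabilities
}
```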
Describe the solution you'd like
This ticket requires two types of solutions: one at the workflow/job description level, the second at the job matchmaking level.
Starting with the resource provisioning and job matchmaking, here are the required changes:
1) CUDARuntime: the pilot needs to support all of the CUDARuntimes requested by the job (i.e. job CUDARuntime is a subset of pilot CUDARuntime)
2) CUDACapabilities: the pilot needs to support the same or newer CUDACapabilities as requested by the job (i.e. min(job CUDACapabilities) <= max(pilot CUDACapabilities))
3) GPUMemoryMB: does not change; job GPUMemoryMB needs to be smaller than or equal to pilot GPUMemoryMB.

For multi-step job workflows (StepChain and maybe PromptReco(?)), here is how a job needs to be described by the agent:

4) CUDARuntime: currently a Python string with the version. It looks like it needs to become a comma-separated string. TODO: are we able to make a comma-separated or list comparison in GlideinWMS and ensure that each element in one list is present in the other?
5) CUDACapabilities: currently a list of Python strings. TODO: given that the pilot/node only needs to have an equal or newer version, should we revisit it and make it a plain string with the version? Are we able to have such a comparison in GlideinWMS?
6) GPUMemoryMB: has to be the max of all the steps.

Describe alternatives you've considered
Further discussion with Core and GlideinWMS still required to finalize these requirements.
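To make that discussion concrete, the three matchmaking rules proposed above can be sketched in plain Python (a sketch only: the production checks would have to be GlideinWMS/ClassAd expressions, and every name below is made up for illustration):

```python
# Sketch of the three matchmaking rules proposed in this ticket, in plain
# Python. The real checks must be GlideinWMS/ClassAd expressions.

def runtime_ok(job_runtimes, pilot_runtimes):
    """Rule 1: the job's CUDARuntime set must be a subset of the pilot's."""
    return set(job_runtimes) <= set(pilot_runtimes)

def capability_ok(job_capabilities, pilot_capabilities):
    """Rule 2: min(job CUDACapabilities) <= max(pilot CUDACapabilities)."""
    as_tuple = lambda ver: tuple(int(x) for x in ver.split("."))
    return min(as_tuple(c) for c in job_capabilities) <= \
           max(as_tuple(c) for c in pilot_capabilities)

def memory_ok(job_memory_mb, pilot_memory_mb):
    """Rule 3: job GPUMemoryMB must be smaller than or equal to the pilot's."""
    return job_memory_mb <= pilot_memory_mb
```

For a multi-step job, the agent would feed these checks the union of the steps' runtimes, the minimum capability, and the maximum GPUMemoryMB over all steps, per items 4-6 above.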
Additional context
Documentation: gpu-parameter-specification
and initial development was done in PR #10388