
Revisit GPUMemoryMB, CUDARuntime and CUDACapabilities at workflow and job matchmaking #11595

Open
amaltaro opened this issue May 17, 2023 · 16 comments · May be fixed by #11689

Comments

@amaltaro
Contributor

amaltaro commented May 17, 2023

Impact of the new feature
WMAgent (but perhaps ReqMgr2 for multi-step workflows, aka StepChain)

Is your feature request related to a problem? Please describe.
This is related to adding GPU support to StepChain workflows, tracked in this ticket:
#10401

After discussing scenarios and use cases for multi-step GPU workflows (in this thread, and in a couple of messages that follow it), it was pointed out that the GPU job description and matchmaking should evolve along with the latest GPU developments.

For the record, here is an example of how it can be currently configured:

{"GPUMemoryMB": 123, "CUDARuntime": "11.2", "CUDACapabilities": ["11.2", "11.4"]}

Describe the solution you'd like
This ticket requires two types of solutions: one at the workflow/job description level, and a second at the job matchmaking level.

Starting with the resource provisioning and job matchmaking, here are the required changes:

  1. CUDARuntime: the pilot needs to support all of the CUDA runtimes requested by the job (i.e. the job CUDARuntime is a subset of the pilot CUDARuntime).
  2. CUDACapabilities: the pilot needs to support the same or newer CUDA capabilities as requested by the job (i.e. min(job CUDACapabilities) <= max(pilot CUDACapabilities)).
  3. GPUMemoryMB: does not change; the job GPUMemoryMB needs to be smaller than or equal to the pilot GPUMemoryMB.
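The three matchmaking rules above can be sketched as follows (illustrative Python with hypothetical helper names; the real matching happens via GlideinWMS/HTCondor ClassAds, not Python):

```python
def runtime_ok(job_runtimes, pilot_runtimes):
    # Rule 1: every CUDARuntime requested by the job must be supported
    # by the pilot (the job set is a subset of the pilot set).
    return set(job_runtimes) <= set(pilot_runtimes)

def capability_ok(job_capabilities, pilot_capabilities):
    # Rule 2: the pilot must offer the same or a newer capability than
    # the smallest one requested by the job. Compare versions as tuples
    # of integers so that e.g. "8.13" sorts above "8.2".
    as_tuple = lambda v: tuple(int(x) for x in v.split("."))
    return min(map(as_tuple, job_capabilities)) <= max(map(as_tuple, pilot_capabilities))

def memory_ok(job_gpu_memory_mb, pilot_gpu_memory_mb):
    # Rule 3: the job GPUMemoryMB must fit in the pilot's GPU memory.
    return job_gpu_memory_mb <= pilot_gpu_memory_mb
```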

For multi-step job workflows (StepChain and maybe PromptReco(?)), here is how a job needs to be described by the agent:
4) CUDARuntime: currently a Python string with the version. It looks like it needs to become a comma-separated string. TODO: are we able to do a comma-separated or list comparison in GlideinWMS and ensure that each element in one list is present in the other?
5) CUDACapabilities: currently a list of Python strings. TODO: given that the pilot/node only needs to have an equal or newer version, should we revisit it and make it a plain string with the version? Are we able to do such a comparison in GlideinWMS?
6) GPUMemoryMB: has to be the maximum over all the steps.
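A hypothetical sketch (not actual WMAgent code) of how the agent could combine per-step GPU requirements into one job-level description, following items 4-6 above; for item 5 it assumes the plain-string form discussed later in this thread:

```python
def aggregate_gpu_requirements(steps):
    """`steps` is a list of dicts shaped like the example above, e.g.
    {"GPUMemoryMB": 123, "CUDARuntime": "11.2", "CUDACapabilities": ["7.5"]}."""
    as_tuple = lambda v: tuple(int(x) for x in v.split("."))
    # Item 4: every runtime requested by any step, as a comma-separated string.
    runtimes = sorted({step["CUDARuntime"] for step in steps}, key=as_tuple)
    # Item 5: max(min(step1), min(step2), ...) over the capability versions.
    capability = max(
        (min(step["CUDACapabilities"], key=as_tuple) for step in steps),
        key=as_tuple,
    )
    # Item 6: the maximum GPU memory over all the steps.
    memory = max(step["GPUMemoryMB"] for step in steps)
    return {
        "GPUMemoryMB": memory,
        "CUDARuntime": ",".join(runtimes),
        "CUDACapabilities": capability,
    }
```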

Describe alternatives you've considered
Further discussion with Core and GlideinWMS still required to finalize these requirements.

Additional context
Documentation: gpu-parameter-specification

and initial development made in this PR #10388

@mmascher
Member

Hi Alan, for the "resource provisioning" step, aka the first matchmaking, we are only asking WMAgent to set requiresGPUs. Providing all the attributes is not feasible without significant effort, and it brings little benefit IMHO.

For the job (second) matchmaking we can use any of the attributes here: https://monit-grafana.cern.ch/d/2qoPfS0Mz/cms-submission-infrastructure-gpus-monitor?orgId=11

My preference would be to add a Requirement in WMAgent. Of course we can support you in building that, IIRC there should be a way to test an expression locally. Let me do some tests.

@amaltaro
Contributor Author

Thank you for the prompt feedback, Marco!
For the resource provisioning, I think we can keep it as is and gather further experience with this setup.

For the job matchmaking, my concerns in terms of job classad are:
i) can we have set-based tests (apparently needed for CUDARuntime)?
ii) how do we implement "greater than or equal to" with version-like values? Their expected regex is linked under "Additional context"

@mmascher
Member

mmascher commented May 17, 2023

Thank you for the prompt feedback, Marco! For the resource provisioning, I think we can keep it as is and gather further experience with this setup.

For the job matchmaking, my concerns in terms of job classad are:
i) can we have set-based tests (apparently needed for CUDARuntime)?

Maybe we can use stringListMember? It works like stringListMember("A", "A,B,C") and checks whether "A" is in the comma-separated list "A,B,C". The arguments can be classad expressions.
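For illustration only, the stringListMember semantics described above can be mimicked in Python (a sketch, not the actual ClassAd implementation):

```python
def string_list_member(item, str_list, delim=","):
    # Pure-Python illustration of HTCondor's stringListMember semantics:
    # true if `item` equals one of the delimiter-separated members.
    return item in [member.strip() for member in str_list.split(delim)]
```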

ii) how do we implement "greater than or equal to" with version-like values? Their expected regex is linked under "Additional context"

Good question. Condor supports the split() function:

[mmascher@vocms0802 cmsgwms-frontend-configurations]$ condor_status -limit 1 -af GLIDEIN_SINGULARITY_BINDPATH
/storage,/lfs_roots,/cms,/etc/cvmfs/SITECONF,/ceph,/cvmfs/grid.cern.ch/etc/grid-security:/etc/grid-security
[mmascher@vocms0802 cmsgwms-frontend-configurations]$ condor_status -limit 1 -af 'split(GLIDEIN_SINGULARITY_BINDPATH)[2]'
/cms

And also int(). I fear we'll need to write an ad-hoc expression.

@aperezca

Hi Alan, when you describe this:

For multi-step job workflows (StepChain and maybe PromptReco(?)), here is how a job needs to be described by the agent:
4) CUDARuntime: currently a Python string with the version. It looks like it needs to become a comma-separated string. TODO: are we able to do a comma-separated or list comparison in GlideinWMS and ensure that each element in one list is present in the other?
5) CUDACapabilities: currently a list of Python strings. TODO: given that the pilot/node only needs to have an equal or newer version, should we revisit it and make it a plain string with the version? Are we able to do such a comparison in GlideinWMS?
6) GPUMemoryMB: has to be the maximum over all the steps.

are you implying that more than one of the steps will be ready to use the GPUs? Is this the case we are trying to cover here?

@amaltaro
Contributor Author

That's correct, Antonio. It's common to have workflows using different CMSSW releases, and those can be compiled with different CUDA support; hence we need to ensure that the job description and the worker node can satisfy all of the different cmsRun requirements/capabilities.

@belforte
Member

Please let the CRAB people (esp. @novicecpp) know in case this ends up in some non-backward-compatible format specification in the current classAds (e.g. from string to list).

@amaltaro
Contributor Author

From further discussion in mattermost:

  • Looking into CUDACapabilities: IF glideinWMS is able to do the matchmaking based on "greater than or equal to", then my understanding is that it does not really need to be a list of capabilities. We simply need to use the smallest capability (version) and do the matchmaking based on that.
    • pseudo-code for a multi-step job would be: max(min(from step1), min(from step2), ...)
  • Andrea B. confirms that supporting multiple CUDARuntime within the same worker node is realistic. So, multi step jobs can indeed request multiple CUDARuntime during matchmaking.

@mmascher @mambelli Hi Marco and Marco, I wonder if you can suggest how we can support a more complex job matchmaking in glideinWMS? A summary from what has been discussed above is:

  1. CUDACapabilities: we need a ">=" comparison of the requested vs. provided resource.
  2. CUDARuntime: for multi-step case, we need to ensure that every CUDARuntime version requested by the job is also supported by the resource/pilot.

Or please let us know if you would have any other suggestions that were not yet mentioned here.

@amaltaro
Contributor Author

@mmascher Marco, now that the requirements are clearer (see the comment above), I wonder if you have any recommendations on how to define these in terms of job classads?

@amaltaro
Contributor Author

amaltaro commented Aug 3, 2023

Updating this issue with discussions that happened mostly in mattermost (SI and GPU Developments).

Regarding the two glideinWMS questions raised above, here is further information on how that can be accomplished:

  1. CUDACapabilities: we need a ">=" comparison of the requested vs. provided resource.

We can split each part of the version number and compare each of them (e.g. below for a version with major.medium.minor values):

int(split(v3,".")[0])<=int(split(v4,".")[0]) && int(split(v3,".")[1])<=int(split(v4,".")[1]) && int(split(v3,".")[2])<=int(split(v4,".")[2])' # for v3<=v4

and an example to test such expression can be seen in [1].
In addition, I was also asking about the format of the CUDA capabilities, which so far has been two decimal digits separated by a dot (\d\.\d). Andrea B. confirms that this is what we foresee as well (if it ever changes, the classad expression would have to change accordingly).

For the second question:

  1. CUDARuntime: for multi-step case, we need to ensure that every CUDARuntime version requested by the job is also supported by the resource/pilot.

Marco suggests adopting the stringListSubsetMatch classad function; an example can be seen in [2]. The only problem with this approach is that - as confirmed with Jaime from HTCondor - this feature was added in HTCondor 10.0.6, while the condor version on the CERN and FNAL schedds is 10.0.1. It's not clear to me which parts of the SI layer would have to be upgraded to support it (negotiator plus schedds?).
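For illustration, the stringListSubsetMatch behavior (available from HTCondor 10.0.6 per the above) can be emulated in Python as a sketch:

```python
def string_list_subset_match(subset_list, superset_list, delim=","):
    # Pure-Python illustration of stringListSubsetMatch semantics:
    # true if every member of `subset_list` appears in `superset_list`.
    to_set = lambda s: {member.strip() for member in s.split(delim)}
    return to_set(subset_list) <= to_set(superset_list)
```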

[1] from Marco M.

[mmascher@lxplus7103 ~]$ cat classad_file 
v1 = "1.1.1";
v2 = "2.1.1";
v3 = "1.1.2";
v4 = "1.1.12";
v5 = "1.12.1";
v6 = "1.1.0";
v7 = "1";
v8 = "2";
v9 = "1.32.7";
v10 = "2.0";
[mmascher@lxplus7103 ~]$ classad_eval -file classad_file 'int(split(v3,".")[0])<=int(split(v4,".")[0]) && int(split(v3,".")[1])<=int(split(v4,".")[1]) && int(split(v3,".")[2])<=int(split(v4,".")[2])' # for v3<=v4
[ v1 = "1.1.1"; v2 = "2.1.1"; v3 = "1.1.2"; v4 = "1.1.12"; v5 = "1.12.1"; v6 = "1.1.0"; v7 = "1"; v8 = "2"; v9 = "1.32.7"; v10 = "2.0" ]
true

[2]

$ classad_eval -file alan 'stringListSubsetMatch(list3,list4)'
[ list1 = "1.2.3,2.3.4"; list2 = "1.2.3,2.3.4,3.4.5"; list3 = { "1.2.3","2.3.4" }; list4 = { "1.2.3","2.3.4","3.4.5" } ]
error

With this information, I think the development of this feature can be resumed and in parallel we discuss possible upgrades of the SI infrastructure.

@fwyzard

fwyzard commented Aug 3, 2023

We can split each part of the version number and compare each of them (e.g. below for a version with major.medium.minor values):

The capabilities are always of the format a.b.
Can you just compute a * 10 + b and compare based on that?

@amaltaro
Contributor Author

amaltaro commented Aug 3, 2023

Can you just compute a * 10 + b and compare based on that?

If the CUDA capability is always in the format \d.\d, then yes, that would work.
But I feel we should be prepared for it to be in the format \d+\.\d+, in which case it could fail with an example like 9.2 vs 8.13.
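A quick illustrative check of why the a * 10 + b encoding breaks down if the minor field ever grows beyond one digit:

```python
def encode(version):
    # Encode an "a.b" capability as a * 10 + b, as proposed above.
    a, b = (int(x) for x in version.split("."))
    return a * 10 + b

# Works for single-digit fields: 8.6 < 9.0 (encoded 86 < 90).
ok = encode("8.6") < encode("9.0")
# Breaks for a two-digit minor: 8.13 encodes to 93, which wrongly
# compares as newer than 9.2 (encoded 92).
broken = encode("8.13") > encode("9.2")
```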

@fwyzard

fwyzard commented Aug 3, 2023

All NVIDIA documentation about CUDA describes it as x.y.

It might reasonably become xx.y in the future, but I doubt the second field will ever get a second digit, because many macros are already defined as x * 10 + y.

If it ever does change in a non-backward-compatible way, we can always update the comparison code, no?

@klannon

klannon commented Aug 21, 2023

@amaltaro Should this really be marked as "Waiting" if a PR was linked to this issue this past week? Seems more "in progress" to me.

@amaltaro
Contributor Author

@klannon development of this feature should be complete, apart from further testing. We now have a dependency on the Submission Infrastructure team to upgrade HTCondor. I have not yet contacted them because, from Mattermost, I understand that most of the SI team is on vacation.

@amaltaro
Contributor Author

Now that people are back from vacation season, here is a ticket to address the required developments at the glideinWMS layer:
https://its.cern.ch/jira/browse/CMSSI-79

@amaltaro
Contributor Author

Just a note: We decided not to pull this ticket into 2023 Q4, as there are dependencies on the SI infrastructure and it's not clear when those will be implemented.

Projects
Status: Waiting

6 participants