Add GPU support to the StepChain spec #11588

Merged: 2 commits, May 17, 2023

Contributor

@amaltaro amaltaro commented May 9, 2023

Fixes #10401

Status

ready

Description

With this PR we will be able to execute StepChain GPU workflows. A few important remarks about this PR are:

  • configuration parameters are supported at Step level (similar to TaskChain).
  • jobs will request GPUs based on this logic: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/WMTask.py#L1500 (hence, if any step requires GPU, the whole job will require it!)
  • we assume that the GPU parameters will be the same if multiple steps can use - or need to - GPUs. In other words, saying that Step1 requires CUDARuntime A while Step2 requires CUDARuntime B is not supported.
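The last two remarks can be illustrated with a minimal sketch. This is not the code behind the WMTask.py link above: the function names `jobRequiresGPU` and `jobGPUParams`, the step dictionaries, and the assumption that `RequiresGPU` takes one of the values "forbidden", "optional" or "required" are all illustrative.

```python
# Hypothetical sketch: aggregate per-step GPU settings into one job-level setting.
# Assumption: RequiresGPU is one of "forbidden", "optional", "required".
PRECEDENCE = ("required", "optional", "forbidden")


def jobRequiresGPU(steps):
    """Return the strongest RequiresGPU value found among the steps."""
    values = {step.get("RequiresGPU", "forbidden") for step in steps}
    for level in PRECEDENCE:
        if level in values:
            return level
    return "forbidden"


def jobGPUParams(steps):
    """Return the single GPUParams dict shared by all GPU steps, or None."""
    found = [step["GPUParams"] for step in steps
             if step.get("RequiresGPU", "forbidden") != "forbidden"
             and step.get("GPUParams")]
    for params in found[1:]:
        if params != found[0]:
            # Mirrors the third remark: mixed GPU parameters are not supported.
            raise ValueError("steps with different GPUParams are not supported")
    return found[0] if found else None


steps = [
    {"StepName": "Step1", "RequiresGPU": "forbidden"},
    {"StepName": "Step2", "RequiresGPU": "required",
     "GPUParams": {"CUDARuntime": "11.2"}},
]
print(jobRequiresGPU(steps))
print(jobGPUParams(steps))
```

With these inputs a single GPU step is enough to make the whole job request a GPU, while two steps with different `CUDARuntime` values would be rejected.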

Is it backward compatible (if not, which system it affects?)

NO (new feature)

Related PRs

The parameter definitions and supported values are provided in the original GitHub issue: #10388

External dependencies / deployment changes

Even though it is a Request Manager change, I suspect that WMAgent will need to have the relevant WMTask/WMStep/CMSSW changes in.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 31 warnings and errors that must be fixed
    • 6 warnings
    • 125 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 52 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14253/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

Even though it's well covered by unit tests, I still want to run some real tests with this patch in.
Meanwhile, I would appreciate any feedback on it. @vkuznet @todor-ivanov

Contributor

@todor-ivanov todor-ivanov left a comment

thanks @amaltaro
I left only one comment inline. Please take a look.

currentCmsswStepHelper.setNumberOfCores(multicore, eventStreams)

# GPU settings
gpuRequired = self.requiresGPU
gpuParams = json.loads(taskConf.get('GPUParams', None))
Contributor

This is dangerous. It may raise the following exceptions:

  • TypeError: the JSON object must be str, bytes or bytearray, not NoneType - if the GPUParams key is missing
  • JSONDecodeError: Expecting value: line 1 column 1 (char 0) - if the GPUParams key is present but its value is an empty or non-JSON-formatted string

If we have those cases covered in some earlier checks that's ok, but I'd stay on the safe side and would do something like:

if taskConf.get('GPUParams', None):
    gpuParams = json.loads(taskConf['GPUParams'])
else:
    gpuParams = json.loads(self.gPUParams)

which would also cover the check from two lines below.
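
Both failure modes are easy to reproduce in isolation, and a small helper can keep the parsing in one place. The sketch below is only illustrative: `parseGPUParams` and `taskDefault` are hypothetical names, not part of WMCore, and the fallback to a task-level JSON string stands in for the `self.gPUParams` idea above.

```python
import json


def parseGPUParams(taskConf, taskDefault="null"):
    """Decode GPUParams from taskConf, falling back to taskDefault (a JSON string)."""
    # A missing key, None, or an empty string all fall back to the default.
    raw = taskConf.get("GPUParams") or taskDefault
    return json.loads(raw)


# Missing key: json.loads(None) raises TypeError.
try:
    json.loads({}.get("GPUParams", None))
except TypeError as exc:
    print("TypeError:", exc)

# Empty string: json.loads("") raises JSONDecodeError.
try:
    json.loads({"GPUParams": ""}.get("GPUParams", None))
except json.JSONDecodeError as exc:
    print("JSONDecodeError:", exc)

print(parseGPUParams({}))
print(parseGPUParams({"GPUParams": ""}))
print(parseGPUParams({"GPUParams": '{"CUDARuntime": "11.2"}'}))
```

Note that a non-empty but malformed string would still raise `JSONDecodeError` in this helper, so it complements earlier validation rather than replacing it.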

Contributor Author

Good catch Todor! Let me set it to the json encoded representation of None, which is 'null'.
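
For reference, this is how the 'null' default behaves with the standard `json` module. The `missing` and `present` dictionaries are made-up stand-ins for `taskConf`.

```python
import json

# None encodes to the JSON literal null, and null decodes back to None.
assert json.dumps(None) == "null"
assert json.loads("null") is None

missing = {}                                        # no GPUParams key at all
present = {"GPUParams": '{"CUDARuntime": "11.2"}'}  # JSON-encoded string

gpuMissing = json.loads(missing.get("GPUParams", "null"))
gpuPresent = json.loads(present.get("GPUParams", "null"))
print(gpuMissing)   # None
print(gpuPresent)
```

The default only applies when the key is absent; a key that is present with an empty value is passed to `json.loads` unchanged.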

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 tests no longer failing
    • 2 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 31 warnings and errors that must be fixed
    • 6 warnings
    • 125 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 52 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14255/artifact/artifacts/PullRequestReport.html

gpuParams = json.loads(taskConf.get('GPUParams', 'null'))
if taskConf.get('RequiresGPU', None):
    gpuRequired = taskConf['RequiresGPU']
if "GPUParams" not in taskConf:
Contributor

Thanks @amaltaro,
But in this line here, you may still hit exactly the same set of exceptions, because the 'GPUParams' key may be present in taskConf but have an empty string as a value. Maybe we rely on some previous validation for not having empty or misconfigured values for those parameters... I think the safer way would be to have all checks related to the gpuParams enclosed in a single if/then/else block. But anyway, up to you.

Contributor Author

This should be covered by this attribute definition:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/StdSpecs/StdBase.py#L1228

which already does the data type validation as well.

Contributor

@vkuznet vkuznet left a comment

Everything seems fine to me

fix json.loads

Get helper function for CMSSW/step gpu parameters
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 tests added
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 31 warnings and errors that must be fixed
    • 6 warnings
    • 125 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 52 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/14262/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

This has been tested and is functional. Meanwhile, new discussions revealed additional requirements for job matchmaking and confirmed how to handle different GPU requirements in different steps of the same job (StepChain).
That discussion is being followed up here and further WM developments should be tracked in the new issue:
#11595

Linked issue: Add support for GPU parameters at StepChain spec level