-
Notifications
You must be signed in to change notification settings - Fork 1k
[GAT] Fix slurm and tpv trainings #6502
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GAT] Fix slurm and tpv trainings #6502
Conversation
| > params: | ||
| > - native_specification: --nodes=1 --ntasks=1 --cpus-per-task={cores} | ||
| > + native_specification: --nodes=1 --ntasks=1 --cpus-per-task={cores} --time={params['walltime']}:00:00 | ||
| > + native_specification: --nodes=1 --ntasks=1 --cpus-per-task={cores} --time={entity.params.get('walltime')}:00:00 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems odd, if you anticipate walltime isn't set you would need to set a default, otherwise indexing via [] is what I'd use ... there's an obvious error if the value isn't set.
... also if {cores} works, i would keep it consistent and use {params['walltime']} ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I was testing the training on one of the EU training vms and was getting an error saying params was undefined. I'll post the logs in a couple of hours.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm rerunning the tutorials and will make sure it works (or not 😆 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I think I get it now, it does need to be entity, because params is evaled as an f-string and is not a top level attribute of the EntityModel. I've used [ though since I don't think we'd ever want None:00:00 as walltime.
|
While you are working on this. I was surprised that the native spec do not use the mem (for example |
6979743 to
cf7a436
Compare
I know now why: I'll check how slurm is configured, maybe we'll just have to tune this down a little |
Else the `--mem` value will exceed the config in the pulsar tutorial.
| > + max_accepted_cores: 16 | ||
| > + params: | ||
| > + native_specification: --nodes=1 --ntasks=1 --cpus-per-task={cores} | ||
| > + native_specification: --nodes=1 --ntasks=1 --cpus-per-task={cores} --mem={round(mem*1024)} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😍
There are some legacy configuration options for slurm and TPV in the trainings.
For the connecting to a cluster training:
training-material/topics/admin/tutorials/connect-to-compute-cluster/tutorial.md
Line 165 in e822f5f
cons_resis legacy and is replaced bycons_tresFor the TPV training:
training-material/topics/admin/tutorials/job-destinations/tutorial.md
Line 732 in e822f5f
params['walltime']should beentity.params.get('walltime')