Conversation

@pauldg (Collaborator) commented Nov 21, 2025

There are some legacy configuration options for Slurm and TPV in the trainings.

For the connecting to a cluster training:


`cons_res` is legacy and has been replaced by `cons_tres`.
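For illustration, a minimal sketch of the updated selector setting, assuming Slurm is managed through the galaxyproject.slurm Ansible role and its slurm_config variable (layout assumed, not taken from the training):

```yaml
# group_vars sketch for the galaxyproject.slurm Ansible role (variable layout
# assumed, not copied from the training): switch the node-selection plugin
# from the legacy cons_res to cons_tres.
slurm_config:
  SelectType: select/cons_tres        # replaces the legacy select/cons_res
  SelectTypeParameters: CR_CPU_Memory # schedule on both CPUs and memory
```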
For the TPV training:
> + native_specification: --nodes=1 --ntasks=1 --cpus-per-task={cores} --time={params['walltime']}:00:00

`params['walltime']` should be `entity.params.get('walltime')`

> params:
> - native_specification: --nodes=1 --ntasks=1 --cpus-per-task={cores}
> + native_specification: --nodes=1 --ntasks=1 --cpus-per-task={cores} --time={params['walltime']}:00:00
> + native_specification: --nodes=1 --ntasks=1 --cpus-per-task={cores} --time={entity.params.get('walltime')}:00:00
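For context, a hedged sketch of where this line would sit in a TPV destination; the destination name and surrounding keys are assumptions, and walltime is expected to be defined under params on the matching tool or default entity elsewhere in the config:

```yaml
# Illustrative TPV fragment; only the native_specification line reflects the
# suggested change, everything else is assumed context.
destinations:
  slurm:
    runner: slurm
    params:
      native_specification: --nodes=1 --ntasks=1 --cpus-per-task={cores} --time={entity.params.get('walltime')}:00:00
```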
@mvdbeek (Member) commented Nov 22, 2025

This seems odd. If you anticipate walltime isn't set you would need to set a default; otherwise indexing via `[]` is what I'd use ... there's an obvious error if the value isn't set.

... also, if `{cores}` works, I would keep it consistent and use `{params['walltime']}`?
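To make the tradeoff concrete, a small sketch of the two options, spelled with the entity. prefix that the thread settles on below; the 24-hour fallback is purely illustrative:

```yaml
params:
  # Option 1: index directly and fail loudly if walltime was never set.
  native_specification: --nodes=1 --ntasks=1 --cpus-per-task={cores} --time={entity.params['walltime']}:00:00
  # Option 2: tolerate a missing walltime by falling back to 24 hours.
  # native_specification: --nodes=1 --ntasks=1 --cpus-per-task={cores} --time={entity.params.get('walltime', 24)}:00:00
```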

@pauldg (Collaborator, Author) replied:

Good point. I was testing the training on one of the EU training VMs and was getting an error saying `params` was undefined. I'll post the logs in a couple of hours.

Member replied:

I'm rerunning the tutorials and will make sure it works (or not 😆 )

@mvdbeek (Member) commented Nov 23, 2025

OK, I think I get it now: it does need to be `entity`, because `params` is evaluated as an f-string and is not a top-level attribute of the EntityModel. I've used `[]` though, since I don't think we'd ever want `None:00:00` as a walltime.
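In other words, names inside the braces resolve in the context the f-string is evaluated in: entity attributes like cores are available directly, while params has to be reached through entity. A short sketch of the two spellings:

```yaml
params:
  # Fails: params is not in scope when the f-string is evaluated,
  # which matches the "params was undefined" error seen on the training VM.
  # native_specification: --nodes=1 --ntasks=1 --cpus-per-task={cores} --time={params['walltime']}:00:00

  # Works: reach the parameter through the entity being scheduled.
  native_specification: --nodes=1 --ntasks=1 --cpus-per-task={cores} --time={entity.params['walltime']}:00:00
```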

@lldelisle (Collaborator) commented:

While you are working on this: I was surprised that the native spec does not use the mem (for example `--mem={round(mem*1024)}`, while `mem` is defined in the TPV example).
Also, maybe this is helpful for you, here is how they deal with runtime in the shared TPV database:
https://github.com/galaxyproject/tpv-shared-database/blob/9e497f53c2e72bf0467952f0d11bde80ae5e3fc3/tools.yml#L2941
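For illustration only, a sketch of what folding the memory request into the same native specification could look like; the combination with --time and the surrounding keys are assumptions, and the linked shared database may handle runtime differently:

```yaml
destinations:
  slurm:
    runner: slurm
    params:
      # TPV's mem is in GB, so convert to MB for Slurm's --mem flag.
      native_specification: --nodes=1 --ntasks=1 --cpus-per-task={cores} --mem={round(mem*1024)} --time={entity.params['walltime']}:00:00
```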

@martenson changed the title from "Fix slurm and tpv trainings" to "[GAT] Fix slurm and tpv trainings" on Nov 22, 2025
@mvdbeek force-pushed the fix-slurm-and-tpv-trainings branch from 6979743 to cf7a436 on November 23, 2025, 10:48
@mvdbeek (Member) commented Nov 23, 2025

> While you are working on this: I was surprised that the native spec does not use the mem (for example `--mem={round(mem*1024)}`, while `mem` is defined in the TPV example).

I know now why:

:04:32,995 [pN:handler_0,p:684732,tN:SlurmRunner.work_thread-3] (14) native specification is: --nodes=1 --ntasks=1 --cpus-per-task=2 --mem=8192
Nov 23 12:04:33 gat-1.eu.training.galaxyproject.eu galaxyctl[684732]: python: error: Memory specification can not be satisfied

I'll check how Slurm is configured; maybe we'll just have to tune this down a little.

Otherwise the `--mem` value will exceed the config in the Pulsar tutorial.
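One thing that Slurm configuration check might turn up: if a node's RealMemory is not declared, Slurm assumes a tiny default and can reject realistic --mem requests with this kind of error. A hedged sketch of declaring it through the galaxyproject.slurm role (variable names and the memory figure are assumptions, not taken from the training):

```yaml
# group_vars sketch for the galaxyproject.slurm role (layout assumed):
# tell Slurm how much memory the node really has so --mem requests can fit.
slurm_nodes:
  - name: localhost
    CPUs: 2
    RealMemory: 7900   # MB; illustrative value for an 8 GB training VM
```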
@mvdbeek enabled auto-merge November 23, 2025 13:56
@mvdbeek merged commit 8f58aa9 into galaxyproject:main Nov 23, 2025
4 checks passed
> + max_accepted_cores: 16
> + params:
> + native_specification: --nodes=1 --ntasks=1 --cpus-per-task={cores}
> + native_specification: --nodes=1 --ntasks=1 --cpus-per-task={cores} --mem={round(mem*1024)}
Collaborator replied:

😍
