Refactoring, bug fixes and adding tests #18

unkcpz · 2024-06-05T08:24:04Z

This PR is open because I use the branch to test the lightweight scheduler integration on the demo server. It bundles a number of changes:

Correctly support memory setup for resources.
Support turning on hyperthreading with the latest version of hyperqueue.
Use ruff for linting.
Fix the submit bug for hq > 0.12 where resources were configured twice, once in the job script and once in the submit command.
Support installing hq on a remote computer via the CLI.
WIP: add unit tests that submit to a real hq instance, using the fixture from the hyperqueue repo.

The major change I made to the resource settings is that I dropped num_mpiprocs and renamed num_cores -> num_cpus and memory_Mb -> memory_mb.
The reason is that I think this kind of "meta-scheduler" for task farming should inherit neither from ParEnvJobResource (as SGE-type schedulers do) nor from NodeNumberJobResource. When we use hyperqueue for task farming, or as a lightweight scheduler on a local machine, we only set the number of CPUs and the amount of memory to allocate for each job. Multi-node support in hyperqueue is experimental and, as far as I can tell, will not cover our use case. But this point is worth discussing; I look forward to your opinions @giovannipizzi @mbercx
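To make the proposal concrete, here is a rough sketch of a resource object that carries only the two settings discussed above. The class name `HqResourceSketch` and its validation rules are hypothetical and written only for illustration; this is not the plugin's actual code, and it deliberately does not subclass any AiiDA resource class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HqResourceSketch:
    """Hypothetical per-job resource request: CPUs and memory only."""

    num_cpus: int
    # None means "no explicit memory request" (an assumption of this sketch).
    memory_mb: Optional[int] = None

    def __post_init__(self) -> None:
        # Reject non-integers and values below 1 for the CPU count.
        if not isinstance(self.num_cpus, int) or self.num_cpus < 1:
            raise ValueError("num_cpus must be a positive integer")
        # Memory is optional, but if given it must be a positive integer (MB).
        if self.memory_mb is not None:
            if not isinstance(self.memory_mb, int) or self.memory_mb < 1:
                raise ValueError("memory_mb must be a positive integer")


# Example: request 4 CPUs and 2048 MB for one job.
resources = HqResourceSketch(num_cpus=4, memory_mb=2048)
```

Note that there is no num_mpiprocs or node count here, which is the point under discussion: whether two fields are enough for the task-farming use case.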

Issues:

If the remote binary already exists, the install cannot overwrite it; it hits an sftp error (OSError: Failure).
Check whether it is a new problem since the eiger update that the server can only be accessed from the same login node.
Specify HQ_SERVER_DIR explicitly to distinguish multiple servers (see "Distinguish hq-server folder to have multiple server for different machines share the same home", It4innovations/hyperqueue#719).
Should allow pausing and resuming the alloc.
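For the HQ_SERVER_DIR item, one possible approach is to give each computer its own server directory and pass it through the environment. The sketch below only builds that environment; the helper name `hq_env_for_computer`, the one-subfolder-per-computer layout, and the example paths are assumptions made for illustration, not what the plugin does today.

```python
import os
from pathlib import PurePosixPath
from typing import Dict, Mapping, Optional


def hq_env_for_computer(
    label: str,
    base_dir: str,
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with HQ_SERVER_DIR set per computer.

    `base_dir` should be an absolute path on the machine where hq runs;
    the subfolder-per-label layout is a hypothetical convention.
    """
    new_env = dict(os.environ if env is None else env)
    new_env["HQ_SERVER_DIR"] = str(PurePosixPath(base_dir) / label)
    return new_env


# Example: two computers sharing one home directory get separate folders.
env_a = hq_env_for_computer("machine-a", "/home/user/.hq-server", env={})
env_b = hq_env_for_computer("machine-b", "/home/user/.hq-server", env={})
```

With this layout, servers started for different machines that share the same home would no longer write into the same folder.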

Must-have features:

Use NodeNumberJobResource as the parent and provide an option for the use case on LUMI, which will require the multi-node functionality of HQ.
- How do we tell the alloc to start workers in the same group? Every new multi-node run is assigned to a certain group, which means that if -N is passed to alloc, the group name should always be exclusive. We don't want HQ to scatter many unbalanced jobs across different compute nodes.

Resources can be set with #HQ directives or through the CLI, but not both. The CLI options have been removed from the submit command.
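As an illustration of setting resources in only one place, the sketch below renders them as #HQ lines for the job script and builds a submit command that carries no resource flags. The helper names are hypothetical, and the exact directive spelling (`--cpus`, and memory requested as a generic resource named `mem`) is an assumption that should be checked against the installed hq version.

```python
from typing import List, Optional


def render_hq_directives(num_cpus: int, memory_mb: Optional[int] = None) -> List[str]:
    """Return '#HQ' header lines for a job script (hypothetical helper)."""
    lines = [f"#HQ --cpus={num_cpus}"]
    if memory_mb is not None:
        # Assumption: memory is requested as a generic resource called "mem".
        lines.append(f"#HQ --resource mem={memory_mb}")
    return lines


def build_submit_command(script_path: str) -> List[str]:
    """Build the submit command without resource flags.

    Resources live only in the script header, so they are not set twice.
    """
    return ["hq", "submit", script_path]


header = render_hq_directives(num_cpus=4, memory_mb=2048)
command = build_submit_command("job.sh")
```

Keeping the resources in the script header means the generated script is self-describing, while the submit command stays the same for every job.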

- Change command to aiida-hq
- add aiida-hq install <computer>
- fix start server timeout problem
- pre-commit lint

unkcpz added 7 commits April 30, 2024 16:40

Use same option for switch hyper threading for worker as hq

8c425dc

Make memory configurable

405ccb3

Fix problem that resources setting twice

cf2dbb3

Update pyproject.toml and use ruff

ec6aa22

Improve base resource test

e409292

Add test for job script generate and test with real hq

950d8d1

Add test for parsing submit output to jobid

fd11895

unkcpz marked this pull request as draft June 5, 2024 08:24

unkcpz force-pushed the fea/mem-allocation branch 5 times, most recently from 5b56629 to 56f8345 Compare June 6, 2024 09:51

unkcpz added 4 commits June 7, 2024 21:47

Add more CLI to control the remote

92fad45

Add test for parse job list

3505b46

Add test for CLI server

40bb573

Add test for CLI install

e8f2da7

unkcpz force-pushed the fea/mem-allocation branch 3 times, most recently from 58e77e3 to d525ff1 Compare June 7, 2024 20:06

pre-commit fixes and trigger from pre-commit.ci action

7981ac9

unkcpz force-pushed the fea/mem-allocation branch 2 times, most recently from fef7c55 to 8bffb31 Compare June 7, 2024 20:23

unkcpz added 2 commits June 7, 2024 22:27

Fix doc build

de5023c

Only test local transport since GH not by default support ssh

181a29b

unkcpz force-pushed the fea/mem-allocation branch from 8db9734 to 181a29b Compare June 7, 2024 20:27

unkcpz mentioned this pull request Jul 3, 2024

Multiple workers-per-allocation not working It4innovations/hyperqueue#443

Closed
