flux-uri slurm:jobid does not work for slurm batch jobs #5482
Comments
A simple test worked for me, but this is the simplest case. Any hints on what you might have been doing differently?
$ squeue -u grondo
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1521171 pdebug interact grondo R 0:44 1 quartz3
$ flux uri slurm:1521171
ssh://quartz3/var/tmp/grondo/flux-ptC8HX/local-0
$ flux proxy slurm:1521171
f(s=1,d=0) $ flux resource list
STATE NNODES NCORES NGPUS NODELIST
free 1 36 0 quartz3
allocated 0 0 0
down 0 0 0
Also, this reminds me that this would be a good test case to add to our extra tests for the GitLab CI. (cc @wihobbs)
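For readers following along: the `ssh://` URI printed by `flux uri` above looks like the remote form of a broker's `local://` URI, with the node's hostname filled in. The sketch below illustrates that mapping only. It assumes the broker's local URI was `local:///var/tmp/grondo/flux-ptC8HX/local-0` (the thread does not show it), and `to_remote_uri` is a hypothetical helper, not part of the Flux Python API.

```python
from urllib.parse import urlparse


def to_remote_uri(local_uri: str, hostname: str) -> str:
    """Rewrite a local:// broker URI as an ssh:// URI on `hostname`.

    Hypothetical helper for illustration; not a flux-core function.
    """
    parsed = urlparse(local_uri)
    if parsed.scheme != "local":
        raise ValueError(f"expected a local:// URI, got {local_uri!r}")
    # The socket path is unchanged; only the scheme and host differ.
    return f"ssh://{hostname}{parsed.path}"


if __name__ == "__main__":
    # Assumed local URI, chosen to match the ssh:// URI in the output above.
    print(to_remote_uri("local:///var/tmp/grondo/flux-ptC8HX/local-0", "quartz3"))
```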
Good idea @grondo. @garlick I tried the same thing as Mark but varied the number of nodes, and put some nested instances in there, and tried using |
It worked for me just now. I was probably doing something dumb before! Sorry for the noise.
Ah this is what I was doing. But perhaps this isn't intended to work:
Reopening since it would be nice if this worked for slurm batch jobs. I think the only problem is that the batch script is the first child of the slurmstepd and we need to look one level deeper if the first LOCALID=0 process does not work out. Perhaps we could just try the pids in sorted order? On the first node of a job submitted like above:
and those pids are:
Confirmed that |
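The "try the pids in sorted order" idea above can be sketched as follows. This is only an illustration under stated assumptions: the input is a tabular PID listing whose first column is the PID, similar in shape to what `scontrol listpids` prints, and the sample text and the name `candidate_pids` are invented here since the thread does not show the actual listing.

```python
from typing import List

# Invented sample listing; the job ID and PIDs are placeholders.
SAMPLE_LISTING = """\
PID      JOBID    STEPID   LOCALID GLOBALID
2001     1234     batch    0       0
2010     1234     batch    -       -
2005     1234     batch    -       -
"""


def candidate_pids(listing: str) -> List[int]:
    """Return every PID in the listing, lowest first.

    Instead of stopping at the first LOCALID=0 entry, a caller could
    try each PID in this order until one resolves to a Flux instance.
    """
    pids = []
    for line in listing.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # skip the header row and blank lines
        pids.append(int(fields[0]))
    return sorted(pids)


if __name__ == "__main__":
    print(candidate_pids(SAMPLE_LISTING))
```

Sorting by PID is a heuristic: it tends to visit the batch script before the processes it spawned, but PID order is not guaranteed to match start order.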
The slurm resolver doesn't walk the process tree of slurmstepd, but uses
I think it would work if you did
Not to say we couldn't fix this particular case, but searching for the first flux-broker that happens to be running under a Slurm batch job might give surprising results. For example, I could get a random test instance returned if running
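To make the ambiguity concrete: a process started inside a Flux instance normally has `FLUX_URI` set in its environment, so one way to see which instance a candidate process belongs to is to read that variable from a NUL-separated environment block such as the contents of `/proc/<pid>/environ` on Linux. The sketch below is an assumption about one possible check, not a description of how the slurm resolver is implemented, and `flux_uri_from_environ` is a hypothetical name.

```python
from typing import Optional


def flux_uri_from_environ(environ: bytes) -> Optional[str]:
    """Extract FLUX_URI from a NUL-separated environment block.

    `environ` has the same layout as /proc/<pid>/environ on Linux.
    Returns None when the variable is not present.
    """
    for entry in environ.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep and key == b"FLUX_URI":
            return value.decode()
    return None


if __name__ == "__main__":
    # Made-up environment block for illustration.
    block = b"HOME=/home/user\0FLUX_URI=local:///tmp/flux-abc/local-0\0PATH=/usr/bin\0"
    print(flux_uri_from_environ(block))
```

If several instances are running under the same batch job, different candidate processes can report different URIs, which is exactly the "surprising results" concern raised above.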
Should test this one though...
Just another thought, we'd have a similar issue with flux if you run |
Oh duh! My example was not doing what I thought it was - I was just starting a size=1 flux instance on the first node of the batch allocation, wasn't I? Yeah, this works
Sorry for the noise!
Note to self: while working on https://flux-framework.readthedocs.io/projects/flux-core/en/latest/guide/start.html#starting-with-slurm and running flux instances in the LLNL quartz debug queue, I was unable to get
flux uri slurm:jobid
to work. I didn't run it down. This needs to be revisited to see if it really works and I'm just doing something dumb or if something's gone sour in that code.