Cannot run the redis servers and the simulation on different resources #6
Comments
Hi Mat, thanks for the issue! What I don't get is that it seems, from the very first line of your first example log, that you are logged in on miriel056 (the central Redis instance is launched locally on the node the user is logged in on). From this node you must then be running sbatch (I guess), but then everything is scheduled on this same node miriel056... that's weird sbatch behaviour... but I have probably missed something.
No, no. The login node is "devel02" and the mirielXXX nodes are compute nodes. This run was allocated miriel056 and miriel057. So the central server is launched on miriel056, as are the simulation and the post-processing. And that's one of the questions...
Ok thanks, I think I'm starting to understand a bit better. The central Redis instance is not the instance where data are stored. It is a sort of manager instance used in the process of spawning the cluster of Redis instances that will actually stage the data. This central Redis instance is launched directly by executing the Redis binary, not through srun, while the other Redis instances are launched through srun. The consequence is that the central Redis instance runs on the login node if one uses salloc to run the job (what I usually do for debugging), or on one of the allocated nodes in the case of sbatch (what you are doing). And by the way, I just realized salloc and sbatch behave differently in this respect, which is why I got confused initially... So the question now is: on which node was the Redis instance used for staging data run? Is it miriel056 or miriel057? As for your second issue, I will look into it.
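To make that placement difference concrete, a rough sketch (the port numbers are placeholders and this is not literally what pdwfs-slurm does internally):

```bash
# Whichever shell runs these commands determines where the central instance lives:
# under salloc it is the login node, under sbatch it is the first allocated node.
hostname

# Central "manager" Redis instance: started by a plain exec, so it stays on the node above.
redis-server --port 6379 --daemonize yes

# Staging Redis instances: started through srun, so Slurm places them on allocated compute nodes.
srun -N1 -n1 redis-server --port 6380 &
```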
mhaefele@devel03:C $ sacct -j 3781 -o JobName,NodeList
...
I'll contact my admins and come back to you when I have the required inputs.
sacct -j 3781 -o JobName,NodeList
pdwfs_hel+ miriel[056-057]
I am not sure I understand: I see neither my simu nor my post-processing... But they print that they are running on miriel056. So everything seems to run on miriel056...
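One thing that might help reading this output: asking sacct to also print the step IDs (standard sacct fields, 3781 being the job id above) should give one line per srun-launched step, each with its own node list:

```bash
sacct -j 3781 --format=JobID,JobName%25,NodeList,State
```

If the simu and the post-processing are started through srun, they should show up there as steps 3781.0, 3781.1, and so on.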
Ok thanks, there must be some Slurm configuration magic I am not aware of... Could you try launching your applications using the -r option?
And regarding your simu and post-processing in sacct: since they are wrapped by the ...
I made some tests with the -r option and indeed the processes are executed on different nodes. But I get non-reproducible behaviour: the same script executed on the same nodes sometimes gives the correct result and sometimes breaks with an error very similar to the one mentioned above:
And I tried several times this afternoon, and with the -r option it was always broken... I am groping in the dark...
After several trials and errors, I managed to make it work: the several Redis instances and the post-processing on one node, and the simulation on another! I am closing this issue as it is not an issue any more. I'll come back to you with a more precise issue on this next time, hopefully.
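For the record, a minimal sketch of the kind of placement that srun's -r (--relative) option allows on a two-node allocation (the executable names are placeholders and the pdwfs wrapping is omitted, so this is an illustration rather than the exact script that ended up working):

```bash
#!/bin/bash
#SBATCH --nodes=2

# Redis instances and the post-processing pinned to the first allocated node (relative node 0)
srun -N1 -n1 -r 0 redis-server --port 6380 &
srun -N1 -n1 -r 0 ./post_processing &

# Simulation pinned to the second allocated node (relative node 1)
srun -N1 -n1 -r 1 ./simulation

wait
```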
Describe the bug
The title is the first issue. Up to two Redis servers it works: the data are correct in the result file and I get the following std output:
However, the Redis servers, the simulation and the post-processing are all running on node miriel056. I tried several options but did not manage to get anything else.
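A simple way to double-check the placement from inside such a job script (standard Slurm commands, nothing pdwfs-specific, assuming a two-node allocation) would be something like:

```bash
# Expand the node list of the current allocation (should list miriel056 and miriel057 here)
scontrol show hostnames "$SLURM_JOB_NODELIST"

# One task per allocated node, each printing its hostname, to confirm both nodes are usable
srun -N 2 --ntasks-per-node=1 --label hostname
```

Having the simulation and the post-processing print their own hostname as well (which they already do, hence the miriel056 messages) then shows which node each process ended up on.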
To Reproduce
The job script that uses my C hello worlds from #2:
I tried to fill the first 16 cores of the first node with Redis instances: it works with 2 Redis instances but not more. With 4, I get the following error message:
Expected behavior
I would like to have a way of telling pdwfs to run on a different node than the simulation. There seem to be ways to do that with Slurm but, as everything is embedded in pdwfs-slurm, I do not know to what extent this has to be put back in the job script. Thanks for your help.
Mat