Cannot run the redis servers and the simulation on different resources #6

Closed
mathaefele opened this issue Dec 6, 2019 · 8 comments

@mathaefele

Describe the bug

The title describes the first issue. With up to two Redis servers it works: the data in the result file are correct and I get the following standard output:

[PDWFS][init] Start central Redis instance on miriel056.plafrim.cluster:34000
waitkey 1
[PDWFS][59705][TRACE][C] intercepting fopen(path=staged/Cpok, mode=w)
[PDWFS][59705][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0x1659f00)
[PDWFS][59705][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0x1659f00)
[PDWFS][59705][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0x1659f00)
[PDWFS][59705][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0x1659f00)
[PDWFS][59705][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0x1659f00)
[PDWFS][59705][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0x1659f00)
[PDWFS][59705][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0x1659f00)
[PDWFS][59705][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0x1659f00)
[PDWFS][59705][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0x1659f00)
[PDWFS][59705][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0x1659f00)
[PDWFS][59705][TRACE][C] intercepting fclose(stream=0x1659f00)
[PDWFS][59705][TRACE][C] intercepting close(fd=5)
[PDWFS][59705][TRACE][C] intercepting close(fd=5)
[PDWFS][59705][TRACE][C] calling libc close
simu: running on host miriel056
redis-cli -h miriel056.plafrim.cluster -p 34000 --scan
addr
PONG
[PDWFS][59742][TRACE][C] intercepting fopen(path=staged/Cpok, mode=r)
[PDWFS][59742][TRACE][C] intercepting fread(ptr=0x7fff707ac8c0, size=1, nmemb=2560, stream=0xbbef00)
[PDWFS][59742][TRACE][C] intercepting fclose(stream=0xbbef00)
[PDWFS][59742][TRACE][C] intercepting close(fd=5)
[PDWFS][59742][TRACE][C] intercepting close(fd=5)
[PDWFS][59742][TRACE][C] calling libc close
[PDWFS][59742][TRACE][C] intercepting fopen(path=resC, mode=w)
[PDWFS][59742][TRACE][C] calling libc fopen
[PDWFS][59742][TRACE][C] intercepting fprintf(stream=0xbbf140, ...)
[PDWFS][59742][TRACE][C] intercepting fputs(s=Hello444
, stream=0xbbf140)
[PDWFS][59742][TRACE][C] calling libc fputs
[PDWFS][59742][TRACE][C] intercepting fclose(stream=0xbbf140)
[PDWFS][59742][TRACE][C] calling libc fclose
post-process: running on host miriel056
post-process: Hello444
waitkey 1

However, the Redis servers, the simulation, and the post-processing are all running on node miriel056. I tried several options but did not manage to get anything else.

To Reproduce

The job script, which uses my C hello-world programs from #2:

#!/bin/bash
#SBATCH --job-name=pdwfs_hello
#SBATCH --time=0:02:00
#SBATCH --nodes=2

work_directory="${SLURM_JOB_NAME}_${SLURM_JOB_ID}"
mkdir -p "${work_directory}/staged"
cd "${work_directory}"
ln ../simu .
ln ../post-process .

echo $SLURM_JOB_NODELIST > node_list

# Initialize the Redis instances:
pdwfs-slurm init -N 1 -n 1 -i ib0

# pdwfs-slurm produces a session file with some environment variables to source
source pdwfs.session

# The pdwfs command will forward all I/O under the "staged" directory to the Redis instances
WITH_PDWFS="pdwfs -t -p staged"

# Execute the simulation on 1 task
srun --mpi=none -N 1 -n 1 $WITH_PDWFS ./simu 
host=$(echo "$PDWFS_CENTRAL_REDIS" | cut -d':' -f 1)
port=$(echo "$PDWFS_CENTRAL_REDIS" | cut -d':' -f 2)
echo "redis-cli -h $host -p $port --scan"
redis-cli -h $host -p $port --scan
redis-cli -h $host -p $port ping

srun --mpi=none -N 1 -n 1 $WITH_PDWFS ./post-process 

# gracefully shuts down Redis instances
pdwfs-slurm finalize

# pdwfs-slurm uses srun in background to execute Redis instances
# wait for background srun to complete
wait
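A side note on the host/port extraction above: the empty-host failure later in the log (`redis-cli -h  -p  --scan`) suggests `$PDWFS_CENTRAL_REDIS` was empty after the init step failed. A minimal sketch of a guard that fails fast in that case, using bash parameter expansion instead of `cut` (`parse_addr` is a hypothetical helper, not part of pdwfs; only the variable name comes from the script above):

```shell
#!/bin/bash
# Split a host:port address such as the value of $PDWFS_CENTRAL_REDIS,
# refusing to proceed when the address is empty, instead of letting
# redis-cli run with blank -h/-p arguments.
parse_addr() {
    local addr="$1"
    if [ -z "$addr" ]; then
        echo "error: empty Redis address; did pdwfs-slurm init fail?" >&2
        return 1
    fi
    host="${addr%:*}"   # everything before the last ':'
    port="${addr##*:}"  # everything after the last ':'
}

# Example with a literal address of the form seen in the log:
parse_addr "miriel056.plafrim.cluster:34000" && echo "host=$host port=$port"
```

In the job script this would be called as `parse_addr "$PDWFS_CENTRAL_REDIS" || exit 1` before invoking redis-cli.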

I tried to fill the first 16 cores of the first node with Redis instances: it works with 2 Redis instances but not with more. I get the following error message with 4:

PDWFS][init] Start central Redis instance on miriel018.plafrim.cluster:34000
Could not connect to Redis at miriel018.plafrim.cluster:34000: Connection refused
[PDWFS][init] Error: the central Redis instance is not responding
panic: dial tcp :6379: connect: connection refused

goroutine 17 [running, locked to thread]:
github.com/cea-hpc/pdwfs/redisfs.Try(...)
	/home/capulj/sources/cea-hpc/pdwfs/src/go/redisfs/redis.go:38
github.com/cea-hpc/pdwfs/redisfs.Pipe.Do(0x2ba6bc3e7880, 0xc0000109d0, 0x2ba6bc0fb6e6, 0x4, 0xc00000c340, 0x2, 0x2)
	/home/capulj/sources/cea-hpc/pdwfs/src/go/redisfs/redis.go:207 +0x99
github.com/cea-hpc/pdwfs/redisfs.(*Inode).initMeta(0xc0000ce0f0, 0x180bc0fb401)
	/home/capulj/sources/cea-hpc/pdwfs/src/go/redisfs/inodes.go:61 +0x357
github.com/cea-hpc/pdwfs/redisfs.NewRedisFS(0xc00008a680, 0xc00000c280, 0xc000076d68)
	/home/capulj/sources/cea-hpc/pdwfs/src/go/redisfs/fs.go:85 +0x1bb
main.NewPdwFS(0xc000010970, 0xe)
	/home/capulj/sources/cea-hpc/pdwfs/src/go/pdwfs.go:83 +0xf9
main.InitPdwfs(0x7ffccbadb450, 0x0, 0x400)
	/home/capulj/sources/cea-hpc/pdwfs/src/go/pdwfs.go:158 +0x72
main._cgoexpwrap_c1e4f2bfaf13_InitPdwfs(0x7ffccbadb450, 0x0, 0x400)
	_cgo_gotypes.go:281 +0x41
/home/mhaefele/public/opt/pdwfs/bin/pdwfs: line 86: 16433 Aborted                 (core dumped) $*
srun: error: miriel018: task 0: Exited with exit code 134
redis-cli -h  -p  --scan
Could not connect to Redis at -p:6379: Name or service not known
Could not connect to Redis at -p:6379: Name or service not known
panic: dial tcp :6379: connect: connection refused

goroutine 17 [running, locked to thread]:
github.com/cea-hpc/pdwfs/redisfs.Try(...)
	/home/capulj/sources/cea-hpc/pdwfs/src/go/redisfs/redis.go:38
srun: error: miriel018: task 0: Exited with exit code 134
github.com/cea-hpc/pdwfs/redisfs.Pipe.Do(0x2b40aa975880, 0xc0000109d0, 0x2b40aa6896e6, 0x4, 0xc00000c340, 0x2, 0x2)
	/home/capulj/sources/cea-hpc/pdwfs/src/go/redisfs/redis.go:207 +0x99
github.com/cea-hpc/pdwfs/redisfs.(*Inode).initMeta(0xc0000ce0f0, 0x180aa689401)
	/home/capulj/sources/cea-hpc/pdwfs/src/go/redisfs/inodes.go:61 +0x357
github.com/cea-hpc/pdwfs/redisfs.NewRedisFS(0xc00008a680, 0xc00000c280, 0xc000076d68)
	/home/capulj/sources/cea-hpc/pdwfs/src/go/redisfs/fs.go:85 +0x1bb
main.NewPdwFS(0xc000010970, 0xe)
	/home/capulj/sources/cea-hpc/pdwfs/src/go/pdwfs.go:83 +0xf9
main.InitPdwfs(0x7ffe11c712e0, 0x0, 0x400)
	/home/capulj/sources/cea-hpc/pdwfs/src/go/pdwfs.go:158 +0x72
main._cgoexpwrap_c1e4f2bfaf13_InitPdwfs(0x7ffe11c712e0, 0x0, 0x400)
	_cgo_gotypes.go:281 +0x41
/home/mhaefele/public/opt/pdwfs/bin/pdwfs: line 86: 16473 Aborted                 (core dumped) $*
[PDWFS][finalize] Error: pdwfs-slurm init command failed

Expected behavior

I would like a way to tell pdwfs to run on a different node than the simulation. Slurm seems to offer ways to do that, but since everything is embedded in pdwfs-slurm, I do not know to what extent this has to be moved back into the job script.

Thanks for your help.
Mat

@JCapul
Collaborator

JCapul commented Dec 11, 2019

Hi Mat, thanks for the issue!

What I don't get is this: the very first line of your first example log suggests you are logged in on miriel056 (the central Redis instance is launched locally on the node the user is logged in on). From that node you must be running sbatch (I guess), yet everything gets scheduled on that same node, miriel056... that's weird sbatch behaviour, but I have probably missed something.

@mathaefele
Author

No, no. The login node is "devel02" and the mirielXXX machines are compute nodes. This run was allocated miriel056 and miriel057. So the central server is launched on miriel056, as are the simulation and the post-processing. And that's one of the questions...

@JCapul
Copy link
Collaborator

JCapul commented Dec 12, 2019

Ok thanks, I think I start to understand a bit better.

The central Redis instance is not the instance where data are stored. It is a sort of manager instance that is used in the process of spawning the cluster of Redis instances that will actually stage the data. This central Redis instance is launched directly by executing the Redis binary, not through srun, while the other Redis instances are launched through srun.

The consequence is that the central Redis instance runs on the login node if one uses salloc to run the job (what I usually do for debugging), or on one of the allocated nodes in the case of sbatch (what you are doing). And by the way, I just realized salloc and sbatch behave differently in this respect, which is why I got confused initially...

So the question now is: on which node has the Redis instance used for staging data been run? Is it miriel056 or miriel057?
You should be able to check by running sacct -j your_job_id -o JobName,NodeList and looking at which node the "redis.srun" job step ran on.

As for your second issue, I will look into it.

@mathaefele
Author

mhaefele@devel03:C $ sacct -j 3781 -o JobName,NodeList
sacct: error: slurm_persist_conn_open_without_init: failed to open persistent connection to localhost:6819: Connection refused
sacct: error: slurmdbd: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused

... I will contact my admins and come back to you when I have the required information.

@mathaefele
Author

sacct -j 3781 -o JobName,NodeList

   JobName         NodeList
pdwfs_hel+  miriel[056-057]
     batch        miriel056
    extern  miriel[056-057]
redis.srun        miriel056
     pdwfs        miriel056
     pdwfs        miriel056
I am not sure I understand: I see neither my simu nor my post-processing... But they print that they are running on miriel056. So everything seems to run on miriel056...

@JCapul
Collaborator

JCapul commented Dec 13, 2019

Ok thanks, there must be some slurm configuration magic I am not aware of...

Could you try launching your applications with the -r option of srun? It makes explicit which node(s) you want to run your app on, using a relative numbering scheme starting at 0:

srun -r1 --mpi=none -N 1 -n 1 $WITH_PDWFS ./simu
...
srun -r1 --mpi=none -N 1 -n 1 $WITH_PDWFS ./post-process

And regarding your simu and post-processing in sacct: since they are wrapped by the pdwfs command-line script, that is what slurm records. Not very handy, I admit...
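For completeness, an alternative to relative numbering could be naming the target nodes explicitly with srun's --nodelist (-w) option. A job-script fragment only (not runnable outside a Slurm allocation; it controls where the applications run, not where pdwfs-slurm places the Redis instances, and the node names are just the ones from this thread):

```shell
# Pin each step to an explicit allocated node with --nodelist (-w);
# node names below are placeholders from this discussion.
srun -w miriel056 --mpi=none -N 1 -n 1 $WITH_PDWFS ./simu
srun -w miriel057 --mpi=none -N 1 -n 1 $WITH_PDWFS ./post-process
```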

@mathaefele
Author

I ran some tests with the -r option, and indeed the processes are executed on different nodes.

But I get non-reproducible behaviour. The same script executed on the same nodes sometimes gives the correct result and sometimes breaks with an error very similar to the one mentioned above:

PDWFS][init] Start central Redis instance on miriel018.plafrim.cluster:34000
Could not connect to Redis at miriel018.plafrim.cluster:34000: Connection refused
[PDWFS][init] Error: the central Redis instance is not responding
panic: dial tcp :6379: connect: connection refused

goroutine 17 [running, locked to thread]:
github.com/cea-hpc/pdwfs/redisfs.Try(...)
	/home/capulj/sources/cea-hpc/pdwfs/src/go/redisfs/redis.go:38
....

I tried several more times this afternoon, and with the -r option it was always broken... I am groping in the dark...

@mathaefele
Author

After several trials and errors, I managed to make it work: the Redis instances and the post-processing on one node, and the simulation on another!
The easiest setup is to work with an interactive job. Some still-not-understood combination of sbatch + creating and changing working directories + some bash commands failing + running on the same node as a previous failed or successful run still produces the error from the issue text.

I am closing this issue as it is no longer an issue. I will hopefully come back to you with a more precise issue on this next time.
