IO not forwarded to redis when used inside a Slurm job #4

Closed
mathaefele opened this issue Nov 20, 2019 · 3 comments

@mathaefele

mathaefele commented Nov 20, 2019

Describe the bug

Although the applications are run under pdwfs, the file is still created on the file system. The trace shows that pdwfs intercepts the POSIX calls, but then calls the libc symbol instead of redirecting the I/O to Redis.

mhaefele@devel01:C $ more slurm-3781.out
[PDWFS][init] Start central Redis instance on miriel056.plafrim.cluster:34000
waitkey 1
[PDWFS][75327][TRACE][C] intercepting fopen(path=staged/Cpok, mode=w)
[PDWFS][75327][TRACE][C] calling libc fopen
[PDWFS][75327][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0xd04f00)
[PDWFS][75327][TRACE][C] calling libc fwrite
[PDWFS][75327][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0xd04f00)
[PDWFS][75327][TRACE][C] calling libc fwrite
[PDWFS][75327][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0xd04f00)
[PDWFS][75327][TRACE][C] calling libc fwrite
[PDWFS][75327][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0xd04f00)
[PDWFS][75327][TRACE][C] calling libc fwrite
[PDWFS][75327][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0xd04f00)
[PDWFS][75327][TRACE][C] calling libc fwrite
[PDWFS][75327][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0xd04f00)
[PDWFS][75327][TRACE][C] calling libc fwrite
[PDWFS][75327][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0xd04f00)
[PDWFS][75327][TRACE][C] calling libc fwrite
[PDWFS][75327][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0xd04f00)
[PDWFS][75327][TRACE][C] calling libc fwrite
[PDWFS][75327][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0xd04f00)
[PDWFS][75327][TRACE][C] calling libc fwrite
[PDWFS][75327][TRACE][C] intercepting fwrite(ptr=0x400765, size=1, nmemb=9, stream=0xd04f00)
[PDWFS][75327][TRACE][C] calling libc fwrite
[PDWFS][75327][TRACE][C] intercepting fclose(stream=0xd04f00)
[PDWFS][75327][TRACE][C] calling libc fclose
simu: running on host miriel056
redis-cli -h miriel056.plafrim.cluster -p 34000 --scan
addr
PONG
[PDWFS][75365][TRACE][C] intercepting fopen(path=staged/Cpok, mode=r)
[PDWFS][75365][TRACE][C] calling libc fopen
[PDWFS][75365][TRACE][C] intercepting fread(ptr=0x7ffeba935f90, size=1, nmemb=2560, stream=0x7a3f00)
[PDWFS][75365][TRACE][C] calling libc fread
[PDWFS][75365][TRACE][C] intercepting fclose(stream=0x7a3f00)
[PDWFS][75365][TRACE][C] calling libc fclose
[PDWFS][75365][TRACE][C] intercepting fopen(path=resC, mode=w)
[PDWFS][75365][TRACE][C] calling libc fopen
[PDWFS][75365][TRACE][C] intercepting fprintf(stream=0x7a3f00, ...)
[PDWFS][75365][TRACE][C] intercepting fputs(s=Hello444
, stream=0x7a3f00)
[PDWFS][75365][TRACE][C] calling libc fputs
[PDWFS][75365][TRACE][C] intercepting fclose(stream=0x7a3f00)
[PDWFS][75365][TRACE][C] calling libc fclose
post-process: running on host miriel056
post-process: Hello444
waitkey 1
mhaefele@devel01:C $

The result is correct, but the data went through the file system rather than through the Redis instances.

To Reproduce

The code is very similar to the one submitted in #2, using the working version of post-process that relies on fread. The only change is an added call to getenv, used to print the host on which the simu and post-process applications are running. Here is the job script:

#!/bin/bash
#SBATCH --job-name=pdwfs_hello
#SBATCH --time=0:02:00
#SBATCH --nodes=2

work_directory="${SLURM_JOB_NAME}_${SLURM_JOB_ID}"
mkdir -p "${work_directory}/staged"
cd "${work_directory}"
ln ../simu .
ln ../post-process .

echo $SLURM_JOB_NODELIST > node_list

# Initialize the Redis instances:
pdwfs-slurm init -N 1 -n 1 -i ib0

# pdwfs-slurm produces a session file with some environment variables to source
source pdwfs.session

# pdwfs forwards all I/O under the directory given with -p to the Redis instances
WITH_PDWFS="pdwfs -t -p ${work_directory}/staged"

# Run the simu application on a single task
srun --mpi=none -N 1 -n 1 $WITH_PDWFS ./simu
# Extract host and port of the central Redis instance, then list its keys
host=$(echo "$PDWFS_CENTRAL_REDIS" | cut -d':' -f 1)
port=$(echo "$PDWFS_CENTRAL_REDIS" | cut -d':' -f 2)
echo "redis-cli -h $host -p $port --scan"
redis-cli -h "$host" -p "$port" --scan
redis-cli -h "$host" -p "$port" ping

# Run the post-process application on a single task
srun --mpi=none -N 1 -n 1 $WITH_PDWFS ./post-process

# gracefully shuts down Redis instances
pdwfs-slurm finalize

# pdwfs-slurm uses srun in background to execute Redis instances
# wait for background srun to complete
wait

Expected behavior

The same result is expected, but without any file being created in the staged directory.
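
One way to check this expectation from the job script is to test whether the staged file exists on disk once the simu step has finished. This is only a sketch: `check_not_on_disk` is a hypothetical helper, not part of pdwfs, and `staged/Cpok` is the path that appears in the trace above. It has to be called from the batch script itself, not through the pdwfs wrapper.

```shell
#!/bin/bash
# Sketch: report whether a file that should live in Redis ended up on disk.
# check_not_on_disk is a hypothetical helper, not part of pdwfs.
check_not_on_disk() {
    local path="$1"
    if [ -e "$path" ]; then
        echo "on disk: $path (I/O was NOT forwarded)"
        return 1
    fi
    echo "not on disk: $path"
    return 0
}

# Run from the work directory, after the srun that executes ./simu
check_not_on_disk staged/Cpok
```

A non-zero return status means the file was written to the file system, which is the behaviour reported in this issue.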

Additional context

Running on a cluster with Slurm 19.05.2.

@JCapul
Collaborator

JCapul commented Nov 23, 2019

Thanks for the issue!

In your example, you "cd" into ${work_directory} and then set pdwfs to intercept files in "${work_directory}/staged", where ${work_directory} is a relative path. Because that relative path is now interpreted from inside ${work_directory}, pdwfs will actually look for files in the absolute path ".../${work_directory}/${work_directory}/staged".

If I am not mistaken, it should work if you set pdwfs like this:
WITH_PDWFS="pdwfs -t -p staged"
or if you make ${work_directory} an absolute path.
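
A minimal sketch of the second option, assuming the work directory should live under the directory the job was submitted from. The fallback values after `:-` are placeholders so that the snippet also runs outside of Slurm; they are not part of the original script.

```shell
#!/bin/bash
# Placeholder fallbacks (assumptions) so the sketch also runs outside a Slurm job
job_name="${SLURM_JOB_NAME:-pdwfs_hello}"
job_id="${SLURM_JOB_ID:-0}"
submit_dir="${SLURM_SUBMIT_DIR:-$(pwd)}"

# Build the work directory as an absolute path
work_directory="${submit_dir}/${job_name}_${job_id}"
mkdir -p "${work_directory}/staged"
cd "${work_directory}" || exit 1

# The intercepted path no longer depends on the current directory
WITH_PDWFS="pdwfs -t -p ${work_directory}/staged"
echo "$WITH_PDWFS"
```

With an absolute path, the value given to -p means the same thing before and after the "cd", so the rest of the script can stay unchanged.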

@mathaefele
Author

Indeed... Sorry for the stupid directory name. It works when replacing
WITH_PDWFS="pdwfs -t -p ${work_directory}/staged"
with
WITH_PDWFS="pdwfs -t -p staged"

But now, when I try to increase the number of instances, pdwfs segfaults. I will investigate a bit more on my side before sending another stupid input 😅

@JCapul
Collaborator

JCapul commented Nov 25, 2019

No worries! It also means that we should add more debugging output in verbose mode to help users set up pdwfs.

I'll consider this issue closed. Don't hesitate to raise another one if you still can't make it work.

JCapul closed this as completed Nov 25, 2019