
Runtime Hang after finishing all tasks #473

Closed
jiazhihao opened this issue Nov 14, 2022 · 12 comments · Fixed by #465

@jiazhihao (Collaborator)

To reproduce

/home/ubuntu/FlexFlow//python/flexflow_python /home/ubuntu/FlexFlow//examples/python/keras/seq_mnist_mlp.py -ll:py 1 -ll:gpu 1 -ll:fsize 14048 -ll:zsize 12192 -b 64 --only-data-parallel

The execution hangs after "real end top-level task" is printed. The issue seems to be related to some of the recent changes to the Python part.

jiazhihao added the bug (Something isn't working) label on Nov 14, 2022
goliaro (Collaborator) commented Nov 14, 2022

@jiazhihao Do you know what Python version / GPU backend / CUDA version / GPU type was in use?

goliaro (Collaborator) commented Nov 14, 2022

also, was the bug obtained after compiling FlexFlow with the pre-built NCCL/Legion binaries?

@jiazhihao (Collaborator, Author)

@gabrieleoliaro I did a binary search on the git log of the master branch, and it seems #400 is the source of the issue --- the python program hangs using commit 09ee16b, while it exits normally using the previous commit 81304c8.

Do you think this provides you enough information to look into the issue?
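For reference, the manual binary search described above is roughly equivalent to a git bisect session like the one below (a sketch only; the 300-second timeout and the repro arguments are illustrative, taken from the command at the top of this issue, and FlexFlow needs to be rebuilt at each step):

# sketch of an equivalent bisect session; timeout value and paths are illustrative
git bisect start
git bisect bad 09ee16b      # hangs after "real end top-level task"
git bisect good 81304c8     # exits normally
# at each step, rebuild FlexFlow, run the repro under a timeout, and mark the commit:
timeout 300 ./python/flexflow_python ./examples/python/keras/seq_mnist_mlp.py \
    -ll:py 1 -ll:gpu 1 -ll:fsize 14048 -ll:zsize 12192 -b 64 --only-data-parallel \
    && git bisect good || git bisect bad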

@jiazhihao (Collaborator, Author)

@eddy16112 Any idea why the top_level_task has completed but flexflow_python refuses to exit?

goliaro (Collaborator) commented Nov 14, 2022

@jiazhihao let me try reproducing the failure on my instance

@eddy16112 (Collaborator)

I am not able to reproduce it on Sapling; here is my command:
./flexflow_python ../examples/python/keras/seq_mnist_mlp.py -ll:py 1 -ll:gpu 1 -ll:fsize 4096 -ll:zsize 12192 -b 64 --only-data-parallel
Here is the final output:

epochs 1, ELAPSED TIME = 3.5505s, interations 937, samples 60000, THROUGHPUT = 16899.20 samples/s

[0 - 7f6da8e07000]   41.193358 {3}{flexflow_c}: [FFConfig] delete 0x7f692c569f90
[0 - 7f6da8e07000]   41.193419 {3}{flexflow_c}: [FFModel] delete 0x7f692c8c7d50
[0 - 7f6da8e07000]   41.193562 {3}{flexflow_c}: [SingleDataLoader] delete 0x7f692dd02460
[0 - 7f6da8e07000]   41.193580 {3}{flexflow_c}: [SingleDataLoader] delete 0x7f692c9ae6e0
[0 - 7f6da8e07000]   41.193597 {3}{flexflow_c}: [SingleDataLoader] delete 0x7f692ff4b9c0
[0 - 7f6da8e07000]   41.193658 {3}{flexflow_c}: [SGDOptimizer] delete 0x7f692c5241e0
end top-level task
[0 - 7f6da8e07000]   41.213814 {3}{flexflow_c}: [GlorotUniform] delete 0x7f692c7e2620
[0 - 7f6da8e07000]   41.213836 {3}{flexflow_c}: [ZeroInitializer] delete 0x7f692c7f2600
real end top-level task

@gabrieleoliaro Are you able to reproduce it on your side?

@jiazhihao (Collaborator, Author)

@eddy16112 Interesting --- did you see the program exit normally?

@eddy16112 (Collaborator)

Yes, I did not see any crash or hang. Here is my config. BTW, I am using the Makefile instead of CMake, but I do not think it matters.

export CUDNN_HOME=/scratch2/wwu/cudnn
export FF_HOME=/scratch2/wwu/FlexFlow
export LG_RT_DIR=/scratch2/wwu/FlexFlow/deps/legion/runtime
export FF_USE_PYTHON=1
export PYTHONPATH=/scratch2/wwu/FlexFlow/python:$PYTHONPATH
export PYTHONPATH=/scratch2/wwu/FlexFlow/align:$PYTHONPATH
export GPU_ARCH=pascal
export LD_LIBRARY_PATH=/scratch2/wwu/cudnn/lib64:$LD_LIBRARY_PATH
export FF_USE_CFFI=1
export FF_ENABLE_DEBUG=1
export DEBUG=1

goliaro (Collaborator) commented Nov 15, 2022

@jiazhihao It looks like this issue is closely related to #464. Basically, when compiling Legion with the CMake system, the path to the python library gets hard-coded into the compiled Realm library file, as well as in realm_defines.h. If you build FlexFlow using the pre-built libraries, but the Python library file is not present on your machine at the exact same path that was hard-coded when building Legion (/opt/conda/lib/libpython3.9.so), you get the error mentioned by Colin in issue #464.

The issue above (the hang), on the other hand, happens when the /opt/conda/lib/libpython3.9.so file does exist on the machine, but the version of Python in use by flexflow_python is not the one stored at /opt/conda/lib/libpython3.9.so. This happens if you activate a different environment, with its own version of Python.
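A quick way to confirm what got baked in is to inspect the pre-built Legion artifacts directly (a sketch; the install paths below are placeholders and depend on where the pre-built binaries were unpacked):

# look for the hard-coded Python library path in the generated header
grep -i python /path/to/legion/include/realm_defines.h
# look for the same path embedded in the Realm shared library
strings /path/to/legion/lib/librealm.so | grep libpython
# compare against the interpreter of the currently active environment
which python3 && python3 -c "import sys; print(sys.version)"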

Btw @eddy16112, these issues are not present when building with the Makefile, which explains why you are not able to reproduce on Sapling.

goliaro (Collaborator) commented Nov 15, 2022

I'm currently working on fixing #464. PR #465 fixes the hard-coding in realm_defines.h, but an additional patch will be required to fix the hard-coding in the Realm library file (.so). In the meantime, I'd recommend making the following change to the config/config.linux file on master:

from

FF_USE_PREBUILT_LEGION=${FF_USE_PREBUILT_LEGION:-ON}

to

FF_USE_PREBUILT_LEGION=${FF_USE_PREBUILT_LEGION:-OFF}
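Since that line uses the shell's ${VAR:-default} expansion, the same effect can also be obtained without editing the file by overriding the variable when invoking the config script (assuming the usual CMake workflow of running config.linux from a build directory):

# overrides the ON default in config/config.linux without modifying the file
FF_USE_PREBUILT_LEGION=OFF ../config/config.linux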

@eddy16112 (Collaborator)

@gabrieleoliaro Do you know why the wrong version of Python could lead to the hang?

goliaro (Collaborator) commented Nov 17, 2022

@eddy16112 I'm not too sure. I tried debugging with gdb, but the backtrace just showed that the threads were all stuck waiting. Right now, I'm checking whether the same issue occurs when compiling FlexFlow from source (without the pre-compiled binaries) and then running flexflow_python with a different active environment. If we get the same issue, then it is unrelated to the pre-compiled Legion binary (although I expect issue #464 to persist), and we may want to add a check that returns an error if someone tries to run FlexFlow with a different version of Python.
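As a rough illustration of such a check (a shell sketch, not existing FlexFlow code; the baked-in path is the one mentioned above, and the sysconfig query is just one way to locate the active environment's libpython):

# hypothetical sanity check: compare the libpython baked into the pre-built Legion
# against the libpython provided by the currently active environment
BAKED_IN=/opt/conda/lib/libpython3.9.so   # path recorded when Legion was built
ACTIVE=$(python3 -c "import sysconfig, os; print(os.path.join(sysconfig.get_config_var('LIBDIR'), sysconfig.get_config_var('LDLIBRARY')))")
if [ "$BAKED_IN" != "$ACTIVE" ]; then
    echo "error: FlexFlow was built against $BAKED_IN but the active environment provides $ACTIVE" >&2
fi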
