Runtime Hang after finishing all tasks #473
Comments
@jiazhihao Do you know what Python version / GPU backend / CUDA version / GPU type was in use?
Also, was the bug observed after compiling FlexFlow with the pre-built NCCL/Legion binaries?
@gabrieleoliaro I did a binary search on the git log of the master branch, and it seems #400 is the source of the issue --- the Python program hangs at commit 09ee16b, while it exits normally at the previous commit 81304c8. Does this give you enough information to look into the issue?
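For readers who want to repeat this kind of search, here is a minimal sketch of the binary search over an ordered commit list. It is not the procedure actually used in this thread. The function name `first_bad_commit`, the `is_bad` callback, and the placeholder hashes `older-good` and `newer-bad` are hypothetical. Only 81304c8 and 09ee16b come from the comment above. In practice `is_bad` would check out the commit, rebuild, and run the program with a timeout to detect the hang.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, the first one is good,
    the last one is bad, and every commit after the first bad one is bad.
    """
    lo, hi = 0, len(commits) - 1  # commits[lo] is known good, commits[hi] known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return commits[hi]


# Hypothetical history: only the two middle hashes appear in this thread.
history = ["older-good", "81304c8", "09ee16b", "newer-bad"]
hanging = {"09ee16b", "newer-bad"}  # stand-in for "build, run, check for hang"
culprit = first_bad_commit(history, lambda commit: commit in hanging)
```

`git bisect` automates the same search directly on the repository, so a hand-written loop like this is mainly useful when each test step needs custom build logic.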
@eddy16112 Any idea why the top_level_task has completed but flexflow_python refuses to exit?
@jiazhihao Let me try reproducing the failure on my instance.
I am not able to reproduce it on Sapling. Here is my cmd:
@gabrieleoliaro Are you able to reproduce it on your side?
@eddy16112 Interesting --- did you see the program exit normally?
Yes, I did not see any crash or hang. Here is my config. BTW, I am using the Makefile instead of CMake, but I do not think it matters.
@jiazhihao It looks like this issue is closely related to #464. Basically, when compiling Legion with the CMake system, the path to the Python library gets hard-coded into the compiled Realm library file, as well as in The issue above (the hang), on the other hand, happens when the Btw, @eddy16112, these issues are not present when building with the Makefile, so this explains why you are not able to reproduce on Sapling.
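As a rough aid for checking a mismatch like the one described above, the sketch below prints where the currently running interpreter expects its own shared library to live, using only the standard `sys` and `sysconfig` modules. The helper name `describe_libpython` is hypothetical. Comparing its output against whatever path was baked into the Realm library is left as a manual step, since this thread does not show how that path is stored. On some platforms `LIBDIR` or `LDLIBRARY` may be `None`.

```python
import sys
import sysconfig


def describe_libpython() -> dict:
    """Collect what the running interpreter reports about its own library."""
    return {
        # Major.minor version of the interpreter executing this script
        "version": "%d.%d" % sys.version_info[:2],
        # Directory the build system recorded for the Python library (may be None)
        "libdir": sysconfig.get_config_var("LIBDIR"),
        # File name of the Python library, e.g. a libpython shared object (may be None)
        "ldlibrary": sysconfig.get_config_var("LDLIBRARY"),
        # Path of the interpreter binary itself
        "executable": sys.executable,
    }


if __name__ == "__main__":
    for key, value in describe_libpython().items():
        print(f"{key}: {value}")
```

Running this with the interpreter used at build time and again with the one used at run time should show whether the two refer to different Python installations.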
I'm currently working on fixing #464; PR #465 fixes the hard-coding in from
to
@gabrieleoliaro Do you know why the wrong version of Python could lead to the hang?
@eddy16112 I'm not too sure. I tried debugging with gdb, but the backtrace just showed that the threads were all stuck waiting. Right now, I'm checking whether the same issue applies when compiling FlexFlow (without using the pre-compiled binaries) and then using
To reproduce
The execution hangs after
real end top-level task
is printed. The issue seems to be related to some of the recent changes to the Python part.