GASPI state vector doesn't report anything #30
Comments
The test code didn't make it the first time; here it is: test.zip
Hi Andreas (@knuedd), If I understand correctly, a timeout is reported but the error state vector does not report an unhealthy rank. Is this understanding correct? This is a tricky aspect: a timeout is not quite an error, and gaspi_state_vec_get() only reports something if an error was detected. Currently, you can only "force" that by using GASPI_BLOCK as a timeout. In your example, when you see a timeout with gaspi_wait() and try again with GASPI_BLOCK, the error state vector should report an unhealthy rank.
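The flow described in this comment (wait with a finite timeout, retry with GASPI_BLOCK on a timeout, then read the state vector) could be sketched roughly as follows. This is only an illustration and not code from GPI-2 or from the attached test. To compile without GPI-2, it uses hypothetical stand-ins: the `SK_*` constants mirror GASPI_SUCCESS, GASPI_TIMEOUT, GASPI_ERROR, GASPI_BLOCK and GASPI_STATE_CORRUPT, and the two callbacks are assumed to wrap gaspi_wait() and gaspi_state_vec_get(). The mock backend at the end exists only to exercise the control flow.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins so the sketch compiles without GPI-2. With GPI-2 one
   would include <GASPI.h> and use the GASPI_* names instead. */
enum { SK_SUCCESS = 0, SK_TIMEOUT = 1, SK_ERROR = -1 };
#define SK_BLOCK 0xffffffffu
#define SK_STATE_CORRUPT 1

/* Hypothetical callbacks, assumed to wrap gaspi_wait() and
   gaspi_state_vec_get() in a real program. */
typedef int (*sk_wait_fn)(unsigned timeout_ms, void *ctx);
typedef int (*sk_state_fn)(unsigned char *vec, size_t nranks, void *ctx);

/* Wait once with a finite timeout; on a timeout, retry with the blocking
   timeout so that an error can be detected; on an error, read the state
   vector and collect the ranks marked corrupt.
   Returns SK_SUCCESS if the wait completed, SK_ERROR otherwise. */
int sk_wait_and_check(sk_wait_fn wait_fn, sk_state_fn state_fn, void *ctx,
                      unsigned timeout_ms, unsigned char *vec, size_t nranks,
                      size_t *failed, size_t *n_failed)
{
    assert(wait_fn && state_fn && vec && failed && n_failed);
    *n_failed = 0;

    int ret = wait_fn(timeout_ms, ctx);
    if (ret == SK_TIMEOUT) {
        /* A timeout alone is not an error, so try again in blocking mode. */
        ret = wait_fn(SK_BLOCK, ctx);
    }
    if (ret == SK_SUCCESS) {
        return SK_SUCCESS;
    }

    /* An error was reported: look for unhealthy ranks. */
    if (state_fn(vec, nranks, ctx) == SK_SUCCESS) {
        for (size_t i = 0; i < nranks; i++) {
            if (vec[i] == SK_STATE_CORRUPT) {
                failed[(*n_failed)++] = i;
            }
        }
    }
    return SK_ERROR;
}

/* Mock backend for illustration only: a finite timeout always times out, the
   blocking retry reports an error, and one rank is marked corrupt. */
struct sk_mock { size_t dead_rank; unsigned calls; };

int sk_mock_wait(unsigned timeout_ms, void *ctx)
{
    struct sk_mock *m = ctx;
    m->calls++;
    return timeout_ms == SK_BLOCK ? SK_ERROR : SK_TIMEOUT;
}

int sk_mock_state(unsigned char *vec, size_t nranks, void *ctx)
{
    struct sk_mock *m = ctx;
    for (size_t i = 0; i < nranks; i++) {
        vec[i] = (i == m->dead_rank) ? SK_STATE_CORRUPT : 0;
    }
    return SK_SUCCESS;
}
```

Whether the blocking retry actually returns after a rank failure depends on the implementation and setup; the sketch simply assumes that it returns an error.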
Hi Rui, thanks for the clarification. That was not clear to me. I assumed GASPI_BLOCK would block infinitely. When I change my test program to use GASPI_BLOCK everywhere, it actually looks like that: it hangs for close to 5 minutes and then all processes die. The blocking calls seem not to return, so the program gets no chance to check the return values and do anything to recover. Do you happen to have example code where I can see a GASPI_TIMEOUT return code? Thanks a lot, Andreas
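One way a program might keep control in this situation is to avoid the blocking timeout and instead retry a finite timeout a bounded number of times, so that a timeout return code is always visible to the caller. The sketch below is an assumption-laden illustration and not a pattern recommended by the maintainers in this thread: the `BW_*` constants are hypothetical stand-ins for GASPI_SUCCESS, GASPI_TIMEOUT and GASPI_ERROR, the callback is assumed to wrap gaspi_wait(), and the mock only simulates a partner that answers late or never. It says nothing about whether the state vector gets updated in this mode.

```c
#include <assert.h>

/* Hypothetical stand-ins for the GASPI return codes so the sketch compiles
   without GPI-2. */
enum { BW_SUCCESS = 0, BW_TIMEOUT = 1, BW_ERROR = -1 };

/* Hypothetical callback, assumed to wrap gaspi_wait() in a real program. */
typedef int (*bw_wait_fn)(unsigned timeout_ms, void *ctx);

/* Call wait_fn with a finite timeout at most max_attempts times.
   Returns the last return code and stores the number of attempts made, so the
   caller can start recovery when BW_TIMEOUT is still returned at the end. */
int bw_wait_bounded(bw_wait_fn wait_fn, void *ctx, unsigned timeout_ms,
                    unsigned max_attempts, unsigned *attempts)
{
    assert(wait_fn && attempts && max_attempts > 0);

    int ret = BW_TIMEOUT;
    *attempts = 0;
    while (*attempts < max_attempts && ret == BW_TIMEOUT) {
        ret = wait_fn(timeout_ms, ctx);
        (*attempts)++;
    }
    return ret;
}

/* Mock for illustration only: times out until call number succeed_on_call,
   or forever if succeed_on_call is 0 (a killed communication partner). */
struct bw_mock { unsigned succeed_on_call; unsigned calls; };

int bw_mock_wait(unsigned timeout_ms, void *ctx)
{
    struct bw_mock *m = ctx;
    (void)timeout_ms;
    m->calls++;
    if (m->succeed_on_call != 0 && m->calls >= m->succeed_on_call) {
        return BW_SUCCESS;
    }
    return BW_TIMEOUT;
}
```

If the final return code is still a timeout, the caller could treat the partner as suspect and run its own check, similar to the manual identification mentioned in the original report.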
Hi Andreas, I have to look deeper into your example, but what do you mean by "all processes die" and "the blocking calls (gaspi_wait?) seem not to return"? I'll try to modify your example or check if I have one that illustrates this. Cheers
Hi Rui, I found out that the processes die only because of the ulimit on the frontend nodes of the cluster where I tested it. Without that, the GASPI program really blocks forever when you specify "GASPI_BLOCK" as the timeout argument and a communication partner was killed. This is pretty much what I assumed in the first place. That means there are two possibilities:
Thanks, Andreas
Hi Andreas, the behaviour you observe should not happen. Which call hangs forever with GASPI_BLOCK? Is it gaspi_wait? Just to see if I'm on the right track: this is a system with InfiniBand, right? If that's the case, could you possibly give it a try with GPI-2 v1.1.1? Thanks,
I've been debugging this issue with the program provided and found that Therefore I'm considering this as a bug in the GPI2 implementation, see #42 for details. As for
But the directly following example is:
It is possible that this is a bug in the example and |
The duplication was on purpose, because nothing has happened in the last two years and the bug is still present.
Dear GPI maintainers,
I'm playing around with fault tolerance and what GASPI's timeout feature can do to make a program survive a rank failure.
In a test program I can identify which rank died, but I need to do it manually. The GASPI state vector doesn't report anything. I attached a test code (test.zip) which shows this behavior. The test program is not watertight when detecting a failure and it does not produce the correct result yet. But it shows that gaspi_state_vec_get() never reports a failure.
I tested it with gpi2/1.3.0 on the Taurus HPC machine at ZIH, TU Dresden.
Regards, Andreas