GASPI state vector doesn't report anything #30

Open
knuedd opened this issue May 25, 2016 · 8 comments

Comments

@knuedd

knuedd commented May 25, 2016

Dear GPI maintainers,

I'm playing around with fault tolerance and what GASPI's timeout feature can do to make a program survive a rank failure.

In a test program I can identify which rank died, but I have to do it manually: the GASPI state vector doesn't report anything. I attached test code (Uploading test.zip…) that shows this behavior. The test program is not watertight in detecting a failure and does not produce the correct result yet, but it shows that gaspi_state_vec_get() never reports a failure.

I tested it with gpi2/1.3.0 on the Taurus HPC machine at ZIH, TU Dresden.

Regards, Andreas

@knuedd
Author

knuedd commented May 25, 2016

The test code didn't make it the first time; here it is: test.zip

@rumach
Member

rumach commented Jun 3, 2016

Hi Andreas (@knuedd),
I apologize for the delayed response.

If I understand correctly, a timeout is reported but the error state vector does not report an unhealthy rank. Is this understanding correct?

This is a tricky aspect: a timeout is not quite an error. And gaspi_state_vec_get() only reports something if an error was detected. Currently, you can only "force" that by using GASPI_BLOCK as a timeout. In your example, when you see a timeout with gaspi_wait() and try again with GASPI_BLOCK, the error state vector should report an unhealthy rank.
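The retry pattern described above can be sketched as a small self-contained model. This is only an illustration of the control flow, not GPI-2 code: every `sim_` name is a hypothetical stand-in (for gaspi_wait(), gaspi_state_vec_get(), GASPI_BLOCK and the return codes), and the simulated runtime simply assumes the behaviour stated in this comment, namely that a finite timeout yields a timeout without touching the state vector while a blocking retry detects the error.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the GASPI return codes and states named in the thread. */
typedef enum { SIM_SUCCESS, SIM_TIMEOUT, SIM_ERROR } sim_return_t;
typedef enum { SIM_STATE_HEALTHY = 0, SIM_STATE_CORRUPT = 1 } sim_state_t;
#define SIM_BLOCK 0xffffffffu /* stand-in for GASPI_BLOCK */
#define SIM_NPROCS 4

/* Simulated runtime: the dead rank (-1 for none) and the error state vector. */
typedef struct {
  int dead_rank;
  unsigned char state[SIM_NPROCS];
} sim_runtime_t;

/* Stand-in for gaspi_wait(). Assumption taken from the comment above:
   a finite timeout only times out; SIM_BLOCK detects and records the error. */
static sim_return_t sim_wait(sim_runtime_t *rt, unsigned timeout_ms)
{
  assert(rt != NULL && rt->dead_rank < SIM_NPROCS);
  if (rt->dead_rank < 0)
    return SIM_SUCCESS;
  if (timeout_ms != SIM_BLOCK)
    return SIM_TIMEOUT;
  rt->state[rt->dead_rank] = SIM_STATE_CORRUPT;
  return SIM_ERROR;
}

/* Stand-in for gaspi_state_vec_get(): copy the state vector to the caller. */
static void sim_state_vec_get(const sim_runtime_t *rt, unsigned char *out)
{
  for (size_t i = 0; i < SIM_NPROCS; ++i)
    out[i] = rt->state[i];
}

/* Wait with a finite timeout; on timeout retry blocking, then scan the
   state vector. Returns the first unhealthy rank, or -1 if none was found. */
static int find_failed_rank(sim_runtime_t *rt, unsigned timeout_ms)
{
  unsigned char vec[SIM_NPROCS];
  sim_return_t ret = sim_wait(rt, timeout_ms);

  if (ret == SIM_TIMEOUT)
    ret = sim_wait(rt, SIM_BLOCK); /* retry, this time blocking */
  if (ret != SIM_ERROR)
    return -1;

  sim_state_vec_get(rt, vec);
  for (int r = 0; r < SIM_NPROCS; ++r)
    if (vec[r] != SIM_STATE_HEALTHY)
      return r;
  return -1;
}
```

In a real program the two `sim_wait` calls would correspond to gaspi_wait() with a finite timeout followed by gaspi_wait() with GASPI_BLOCK, and the scan would run over the vector filled in by gaspi_state_vec_get(). Whether the blocking call actually returns on a given GPI-2 version is a separate question.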

@knuedd
Author

knuedd commented Jun 6, 2016

Hi Rui,

thanks for the clarification. That was not clear to me. I assumed GASPI_BLOCK would block infinitely. When I change my test program to use GASPI_BLOCK everywhere, that is indeed what it looks like: it hangs for close to 5 minutes and then all processes die. The blocking calls do not seem to return, so the program gets no chance to check the return values and do anything to recover.

Do you happen to have example code where I can see a GASPI_TIMEOUT return code?

Thanks a lot, Andreas

@rumach
Member

rumach commented Jun 8, 2016

Hi Andreas,

I have to look deeper into your example, but what do you mean by "all processes die" and "the blocking calls (gaspi_wait?) seem not to return"?

I'll try to modify your example or check if I have one that illustrates this.

cheers

@knuedd
Author

knuedd commented Jun 15, 2016

Hi Rui,

I found out that the processes die only because of the ulimit on the frontend nodes of the cluster where I tested it. Without that, the GASPI program really blocks forever when you specify "GASPI_BLOCK" as the timeout argument and a communication partner was killed. This is pretty much what I assumed in the first place.

That means there are two possibilities:

  1. With timeout == GASPI_BLOCK the GASPI call never returns, so the program cannot check the return value and then ask gaspi_state_vec_get() to report which process got killed.
  2. With a finite timeout the return value indicates that something was wrong. But gaspi_state_vec_get() does not tell which process disappeared.

Thus, in both cases gaspi_state_vec_get() is useless, isn't it?

Thanks, Andreas

@rumach
Member

rumach commented Jun 16, 2016

Hi Andreas,

the behaviour you observe should not happen. Which call hangs forever with GASPI_BLOCK? Is it gaspi_wait?

Just to see if I'm on the right track: this is a system with InfiniBand, right? If that's the case, could you possibly give it a try with GPI-2 v1.1.1?

Thanks,
Rui

@Flamefire

Flamefire commented May 31, 2018

I've been debugging this issue with the program provided and found that gaspi_barrier does hang forever with GASPI_BLOCK in the loop at https://github.com/cc-hpc-itwm/GPI-2/blob/v1.3.0/src/GPI2_GRP.c#L564

I therefore consider this a bug in the GPI-2 implementation; see #42 for details.

As for gaspi_state_vec_get: it is unclear to me whether gaspi_state_vec_get should report an error if the previous result was GASPI_TIMEOUT. The spec for gaspi_barrier states:

In case of error, the return value is GASPI_ERROR. The error vector should be investigated.

But the directly following example is:

gaspi_return_t err;
do {
  err = gaspi_barrier (g, 100);
  if (err == GASPI_TIMEOUT && error vector indicates error)
    goto ERROR_HANDLING;
}
while (err != GASPI_SUCCESS);

It is possible that this is a bug in the example and that if (err == GASPI_ERROR && error vector indicates error) was meant; I think so. But then it is very hard, if not impossible, to detect an error condition if gaspi_barrier is allowed to return GASPI_TIMEOUT in the case of a dead rank.
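One way to sidestep the ambiguity in a caller is to consult the error vector after a timeout as well as after an error, and to cap the number of retries so the loop cannot spin forever. The following is a minimal self-contained sketch of that loop, not spec text and not GPI-2 code: `sim_barrier` is a hypothetical scripted stand-in for gaspi_barrier(), `vector_flags_error` stands in for "error vector indicates error", and `max_attempts` is an added assumption that the spec example does not have.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the GASPI return codes. */
typedef enum { SIM_SUCCESS, SIM_TIMEOUT, SIM_ERROR } sim_return_t;
typedef enum { LOOP_DONE, LOOP_FAILED, LOOP_GAVE_UP } loop_result_t;

/* Scripted stand-in for gaspi_barrier(): replays a fixed list of return
   codes and repeats the last one once the list is exhausted. */
typedef struct {
  const sim_return_t *codes;
  size_t n_codes;
  size_t next;
  int vector_flags_error; /* what the simulated error vector would report */
} sim_barrier_t;

static sim_return_t sim_barrier(sim_barrier_t *b)
{
  size_t i;
  assert(b != NULL && b->n_codes > 0);
  i = b->next < b->n_codes ? b->next : b->n_codes - 1;
  b->next++;
  return b->codes[i];
}

/* Retry the barrier up to max_attempts times.
   LOOP_DONE    : the barrier succeeded.
   LOOP_FAILED  : the barrier returned an error, or the error vector
                  flagged a rank after a timeout.
   LOOP_GAVE_UP : only timeouts were seen and the vector stayed clean. */
static loop_result_t barrier_with_recovery(sim_barrier_t *b, size_t max_attempts)
{
  for (size_t attempt = 0; attempt < max_attempts; ++attempt) {
    sim_return_t err = sim_barrier(b);
    if (err == SIM_SUCCESS)
      return LOOP_DONE;
    /* Check the vector after a timeout *and* after an error. */
    if (err == SIM_ERROR || b->vector_flags_error)
      return LOOP_FAILED;
  }
  return LOOP_GAVE_UP;
}
```

The `LOOP_GAVE_UP` outcome is exactly the case discussed in this thread: repeated timeouts with a clean error vector, where the caller has to fall back on its own failure detection.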

@dhinf

dhinf commented Nov 13, 2018

The duplication was on purpose, because nothing has happened in the last two years and the bug is still present.
