Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Heavy use of CallWithTimeout makes GAP crash / produce wrong results #695

Closed
Stefan-Kohl opened this issue Mar 29, 2016 · 19 comments
Closed
Labels
kind: bug: crash Issues describing bugs that cause GAP to crash, and PRs fixing them (used for release notes) kind: bug Issues describing general bugs, and PRs fixing them

Comments

@Stefan-Kohl
Copy link
Member

In GAP 4.8.3, in sessions with heavy use of CallWithTimeout I observed the following:

  • crash with segfault when hitting [ctrl][c], instead of entering a break loop
  • unpredictable wrong results, sometimes not obvious

These seem to happen only when making heavy use of CallWithTimeout, but I am
not sure under which precise circumstances they do. My guess is that sometimes
things are left in an inconsistent state when a timeout occurs.

@ChrisJefferson
Copy link
Contributor

Crashes we should fix, if we can figure out how to reproduce them.

Unpredictable wrong results are, unfortunately, always going to happen, and are I think may be an unfixable problem. I realise that's a scary-sounding thing to say!

@markuspf
Copy link
Member

What code are you running to produce this behaviour?

I have observed similar things last night but didn't get to investigate yet. I
think there might be a problem with signal delivery/handling.

@markuspf
Copy link
Member

markuspf commented Mar 29, 2016 via email

@Stefan-Kohl
Copy link
Member Author

For me, this occurred in a longer GAP session which is not easily reproducible.

@markuspf
Copy link
Member

testing: /home/makx/ac/gap/tst/teststandard/small_groups2.tst
^CAssertion failed: (NumAlarmJumpBuffers), function CheckAndRespondToAlarm, file ../../src/stats.c, line 1641.

Looks like SIGALARM is delivered (or some other way CheckAndRespondToAlarm is called). Since this happens with Ctrl-C being pressed, I'll have a look at the signal handling code quickly.

@Stefan-Kohl
Copy link
Member Author

Yes, this is the message I also get when GAP crashes. A way to reproduce the crash is to run the
following code for a while and then press [ctrl][c]:

    G := SymmetricGroup(20); i := 1; aborted := 0;
    repeat
      if i mod 1000 = 0 then Print("i = ",i,", aborted = ",aborted,"\n"); fi;
      a := Random(G); b := Random(G);
      c := Random(Group(a,b));
      if CallWithTimeout(5,\in,c,Group(a,b)) = [false] then aborted := aborted + 1; fi;
      error := not c in Group(a,b);
      i := i + 1;
    until error;

I don't know whether the error is triggered eventually if one waits long enough.

@ChrisJefferson
Copy link
Contributor

I've found one issue -- on stats.c:164, we should set SyAlarmHasGoneOff to 0 straight away, else we can head through this twice -- or (perhaps easier, but might hide other bugs) if we end up here with no alarm buffers just return rather than asserting.

@stevelinton
Copy link
Contributor

The documentation does say

If   the   call  does  not  complete  within  the  timeout,  the  result  of
CallWithTimeout[List]  is  a  list of length 1 containing the value false In
this  case,  just  as  if you had quit from a break loop, there is some risk
that  internal  data structures in GAP may have been left in an inconsistent
state, and you should proceed with caution.

@ChrisJefferson
Copy link
Contributor

The segfaulting is (hopefully) fixed in #698. Is actually a very small fix.

This does introduce a different (tiny) bug -- if you press ctrl+c just as a timer goes off, the ctrl+c will get ignored. I'm just going to ignore that, because fixing it would be (I think) reasonably fiddly, and that is a much less serious issue to fix, particularly in stable.

@fingolfin fingolfin added kind: bug Issues describing general bugs, and PRs fixing them kind: bug: crash Issues describing bugs that cause GAP to crash, and PRs fixing them (used for release notes) labels Jan 9, 2017
@fingolfin
Copy link
Member

See #664 for a suggestion to phase out CallWithTimeout again due to major issues with it.

@stevelinton
Copy link
Contributor

stevelinton commented Mar 17, 2017 via email

@fingolfin
Copy link
Member

@Stefan-Kohl Have you tried using ParTakeFirstResultByFork from the io package for this purpose instead?

@ChrisJefferson
Copy link
Contributor

@Stefan-Kohl : But you realise, any answers you calculate when using CallWithTimeout can simply be wrong? And this is a real problem, I've had it happen. If you are using it, and care about the results, that is very bad.

@markuspf
Copy link
Member

As noted before, this is fundamentally hard to do, and the only way of doing this without completely rewriting GAP, is to use fork to take a snapshot of the state of the GAP process before you start a computation, so that you get to a well-defined state at the end.

@Stefan-Kohl
Copy link
Member Author

@markuspf What does "completely rewrite GAP" mean here? -- I would assume this concerns only the kernel, thus only a tiny part of the GAP system(?)

@ChrisJefferson
Copy link
Contributor

@Stefan-Kohl : Could you try this replacement CallWithTimeout, and see if it works for you? At the moment, timeout must always be given in seconds (this is a bit different to CallWithTimeout, but is just because it was fast to do to start). Also your function must return something (with CallWithTimeout it can not return. These two issues can be fixed if this basic idea is an acceptable substitute.

IO_CallWithTimeout := function ( timeout, func, args... )
    local  ret;
    ret := ParDoByFork( [ func ], [ args ], rec(
          TimeOut := rec(
              tv_sec := timeout,
              tv_usec := 0 ) ) );
    if ret = [  ]  then
        return [ false ];
    else
        return [ true, ret[1] ];
    fi;
    return;
end;

@Stefan-Kohl
Copy link
Member Author

@ChrisJefferson This seems to work -- thank you!

@ChrisJefferson
Copy link
Contributor

@Stefan-Kohl : If you start using that, please rename the function. Just because I plan to add that function to IO, probably with that name, and wouldn't want your code to break when the function is officially added to the IO package!

fingolfin added a commit to fingolfin/gap that referenced this issue May 9, 2017
It had fundamental problems which seem unlikely to be resolved.
See also issues gap-system#664, gap-system#695, gap-system#765.
fingolfin added a commit to fingolfin/gap that referenced this issue May 19, 2017
It had fundamental problems which seem unlikely to be resolved.
See also issues gap-system#664, gap-system#695, gap-system#765.
fingolfin added a commit to fingolfin/gap that referenced this issue Jun 8, 2017
It had fundamental problems which seem unlikely to be resolved.
See also issues gap-system#664, gap-system#695, gap-system#765.
ChrisJefferson pushed a commit that referenced this issue Jun 11, 2017
It had fundamental problems which seem unlikely to be resolved.
See also issues #664, #695, #765.
@fingolfin
Copy link
Member

Resolved by #1324.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind: bug: crash Issues describing bugs that cause GAP to crash, and PRs fixing them (used for release notes) kind: bug Issues describing general bugs, and PRs fixing them
Projects
None yet
Development

No branches or pull requests

5 participants