Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Errors in timeouts #664

Closed
olexandr-konovalov opened this issue Mar 9, 2016 · 14 comments
Closed

Errors in timeouts #664

olexandr-konovalov opened this issue Mar 9, 2016 · 14 comments
Assignees
Milestone

Comments

@olexandr-konovalov
Copy link
Member

We have recently merged #342 and it seems that the stability of timeouts code suffered from this. I am creating a new issue to avoid this being overlooked and setting 4.8.3 milestone - at least to revert #342 in the stable-4.8 branch. Here I am collecting reports on observed problems:

  1. master branch, 32-bit build on Ubuntu 14.04:
########> Diff in /home/hudson/hudson/workspace/GAP-dev-compilers/GAPCOPTS/32b\ 
uild/GAPREADLINE/noreadline/label/clang33/GAP-git-snapshot/tst/testinstall/tim\
eout.tst:25
# Input is:
CallWithTimeout(100000, spin2,2000,50000);
# Expected output:
[ true, [ false ] ]
# But found:
Error, Could not set interval timer
########

2. Hanging builds, for example:
* Cygwin64 on Windows 
* 64-bit build with gcc 4.8.4 on Ubuntu 14.04
@olexandr-konovalov olexandr-konovalov added this to the GAP 4.8.3 milestone Mar 9, 2016
@hulpke
Copy link
Contributor

hulpke commented Mar 9, 2016

This same issue seems to cause the build of #667 to fail. Is there any reason to make the timeout test a part of the standard install tests?

@olexandr-konovalov
Copy link
Member Author

@hulpke timeouts were actually there in standard tests before #342, and caused no problems. It is nested timeouts code which triggered this instability. I suggest to take it out of the stable branch for now.

Also, the timeout.tst indeed increases the duration of testinstall, maybe we should move it into teststandard - becayse of using TestDirectory this is as simple as git mv ....

Another often seen diff - this is nested timeout:

########> Diff in /home/hudson/hudson/workspace/GAP-dev/GAPCOPTS/64build/GAPGM\
P/gmp/GAPTARGET/install/label/graupius/GAP-git-snapshot/tst/testinstall/timeout.tst:33
# Input is:
ct := 0;; CallWithTimeout(1000000, function() while true do spin2(500,500000);\
ct := ct+1; od; end);; 2 <= ct; ct <= 10;
# Expected output:
true
true
# But found:
false
true
########

@olexandr-konovalov
Copy link
Member Author

I've submitted PR #668 to revert nested alarms in the stable-4.8 branch.

@markuspf
Copy link
Member

Mhm. Even after an hour of

while true do
  Test("tst/testinstall/timeout.tst");
od;

On my DragonFlyBSD system I cannot get the timeout test to fail or hang. On yin (x86_64 Linux) the timeout test hangs every so often, and bizarrely does so in a call to getrusage (according to an attached gdb). I wonder whether Linux gets something into a twist there wrt syscalls and signals? I'll try next on the Ubuntu cloud machine...

@markuspf
Copy link
Member

Ok, so DragonFly seems to be using one of the implementations of timeouts, wheras linux uses the other. Now to find out what the differences are.

@markuspf
Copy link
Member

I fixed this in #672, the hangs were due to setting a timer with timeout 0.

@olexandr-konovalov
Copy link
Member Author

#672 is done against the master branch, so I am re-assigning this issue to the GAP 4.9.0 milestone. I think this corresponds to the policy of adding new features in the master branch and bugfixes in the stable (with minor exceptions, if needed).

@fingolfin
Copy link
Member

See also issue #765

So, is this still occuring? And both for Ubuntu and Windows?

@markuspf
Copy link
Member

markuspf commented Jan 9, 2017

No, I fixed the hangs in #672 (at least for any system that uses posix timers)

@fingolfin
Copy link
Member

We (i.e. @alex-konovalov @markuspf and me) discussed this in France, and came to the conclusion that it may be impossible to fix all the issues with CallWithTimeout -- in particular, it can leave you in an inconsistent state, and the only way to fix that in general is GAP -- else you may be doing invalid computations without even realizing.

So perhaps we should really considering to phase this feature out again in 4.9, and tell people to not use it anymore...

@ChrisJefferson
Copy link
Contributor

Just to say, I agree it's unfixable -- and the problem is real, I knocked up a quick "fake experiment" and found I quickly got nonsense results due to groups having broken stabilizer chains.

We can bring back similar functionality in HPC-GAP by just killing a whole thread off (as long as it doesn't have any locks at the time).

Also an "interested party" could add a similar function to the IO package, which used fork -- I would be happy to discuss how to implement such a function but won't have time to do it myself.

@fingolfin
Copy link
Member

Of course io already offers functionality for running background tasks, including e.g. ParTakeFirstResultByFork, and I assume this takes care of many needs -- well, on Unix at least, tough luck for Windows users.

But I am curious how you'd do timeouts using io without incurring similar problems -- at least if by that you mean that in the "main" GAP a callback handler is called when the timeout runs out -- as that callback would also potentially see corrupt data, depending on what the main GAP was doing at the moment (granted, the effect would be much smaller, and a careful callback should be able to avoid issues). But actually I am not sure it would even be worth the bother -- the existing work farm code in io should cover 95% of needs for this anyway, shouldn't it?

Anyway, as to removing this code (or rather: deprecating it): If we really want to do that, somebody should bring this up on the gap@ list ASAP for broader discussions.

@ChrisJefferson
Copy link
Contributor

Sorry, this is offtopic, but I'll quickly answer.

It would be easy(ish) to extend Par*Fork functions take a timeout which applies to each problem individually -- that is (I think) what many people are using CallWithTimeout for -- to run a series of experiments. When the timeout runs out, we can just kill the forked GAP and make a new one if we want to run another test.

fingolfin added a commit to fingolfin/gap that referenced this issue May 9, 2017
It had fundamental problems which seem unlikely to be resolved.
See also issues gap-system#664, gap-system#695, gap-system#765.
fingolfin added a commit to fingolfin/gap that referenced this issue May 19, 2017
It had fundamental problems which seem unlikely to be resolved.
See also issues gap-system#664, gap-system#695, gap-system#765.
fingolfin added a commit to fingolfin/gap that referenced this issue Jun 8, 2017
It had fundamental problems which seem unlikely to be resolved.
See also issues gap-system#664, gap-system#695, gap-system#765.
ChrisJefferson pushed a commit that referenced this issue Jun 11, 2017
It had fundamental problems which seem unlikely to be resolved.
See also issues #664, #695, #765.
@fingolfin
Copy link
Member

Resolved by #1324.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants