Resolve random crashes when loading digraphs or semigroups #273

Closed
fingolfin opened this issue Apr 2, 2022 · 5 comments

@fingolfin
Member

The working theory, proposed by @ChrisJefferson, is that when we build things in one VM and cache them with ccache, and then reuse the cached binaries in another VM running on different hardware, the cached binaries may not actually work and can crash with illegal instructions. (See also the discussion in #217.)

This problem was also previously reported to ccache but has not yet been addressed there; see ccache/ccache#824.

I tried to work around this in 59d58d8 by removing -march=native from the Semigroups and Digraphs build systems, but we are still seeing the random crashes. I have yet to check whether this is because our theory is wrong or because my patch was simply insufficient (for now I am more inclined to believe the latter).

An alternative and more systematic fix would be to adjust the ccache configuration, which allows specifying how changes in the compiler are detected. The default compiler_check setting uses the mtime of the compiler (BTW, I am actually surprised that this works across multiple VMs...?). Anyway, we could change that to always take the architecture selected by -march=native into account...
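
To make the compiler_check idea above concrete, here is a minimal sketch, not something taken from this repository's CI. It assumes GCC is the compiler, that `gcc -march=native -Q --help=target` prints a line of the form `-march=  <arch>`, and that ccache's `string:<value>` form of compiler_check (set here via the CCACHE_COMPILERCHECK environment variable) is an acceptable replacement for the default mtime check. The helper names `parse_march`, `compiler_check_value` and `detect_native_march` are hypothetical.

```python
import os
import subprocess
from typing import Optional


def parse_march(help_target_output: str) -> Optional[str]:
    """Extract the resolved -march value from `gcc -Q --help=target` output.

    Assumption: the line of interest looks like "  -march=      haswell".
    """
    for line in help_target_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "-march=":
            return parts[1]
    return None


def compiler_check_value(arch: str) -> str:
    """Build a ccache compiler_check value that hashes the given string."""
    return f"string:{arch}"


def detect_native_march(compiler: str = "gcc") -> Optional[str]:
    """Ask the compiler which architecture -march=native resolves to."""
    try:
        result = subprocess.run(
            [compiler, "-march=native", "-Q", "--help=target"],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        # Compiler not installed or not executable on this machine.
        return None
    return parse_march(result.stdout)


if __name__ == "__main__":
    arch = detect_native_march()
    if arch is None:
        print("could not determine the -march=native architecture")
    else:
        # Builds started from this process would then key the cache on arch.
        os.environ["CCACHE_COMPILERCHECK"] = compiler_check_value(arch)
        print(os.environ["CCACHE_COMPILERCHECK"])
```

With this setting, a cache entry produced on a haswell host would no longer be reused on a skylake-avx512 host. One trade-off of this sketch: the `string:` form replaces the mtime check entirely, so compiler upgrades would no longer be detected unless the compiler version is added to the string as well.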

@ChrisJefferson
Contributor

Could we try temporarily pulling ccache for a week or so, and see if the crashes clear up? That would at least let us know if ccache is the problem.

@fingolfin
Member Author

fingolfin commented Apr 2, 2022

We could, but I am pretty confident this is the cause: I've now seen the following architectures being set by `-march=native`:

  • broadwell
  • haswell
  • icelake-server
  • skylake-avx512

And this is probably not all.

@fingolfin
Member Author

The "alternative and more systematic fix" cannot work for this repository, because the systems on which we compile are not the same as the ones on which we run the tests. However, it would be fine for GAP, where we don't do this.

@fingolfin
Member Author

I believe I've fixed this in #278.

@fingolfin
Member Author

Seems to be fine now.
