Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.vo files vary from build order #11229

Closed
bmwiedemann opened this issue Dec 2, 2019 · 53 comments
Closed

.vo files vary from build order #11229

bmwiedemann opened this issue Dec 2, 2019 · 53 comments
Labels
Coq Consortium support services Reporter subscribed Coq Consortium support services and requests processing by Consortium engineers. kind: bug An error, flaw, fault or unintended behaviour.

Comments

@bmwiedemann
Copy link
Contributor

Description of the problem

While working on reproducible builds for openSUSE, I found that
In addition to #11227 , there are variations in .vo files
that go away when I build on a filesystem that has deterministic readdir order and build with make instead of make -j4 - so that the builds happen in deterministic order.

/usr/lib64/coq/theories/FSets/FSetList.vo differs at offset '9059' (data)
--- /tmp/tmp.KtRUkpybQT/old.yPM 2019-12-02 10:39:36.148042421 +0000
+++ /tmp/tmp.KtRUkpybQT/new.vUR 2019-12-02 10:39:36.148042421 +0000
@@ -1,11 +1,11 @@
 00002340  8b e6 08 d0 e1 81 57 ec  a0 a0 28 4d 53 65 74 4c  |......W...(MSetL|
 00002350  69 73 74 a0 25 4d 53 65  74 73 a0 23 43 6f 71 40  |ist.%MSets.#Coq@|
-00002360  90 30 35 20 c8 a4 e3 1b  b6 aa ca 98 d2 da c7 7f  |.05 ............|
-00002370  f0 c1 a0 a0 29 4f 72 64  65 72 73 41 6c 74 a0 2a  |....)OrdersAlt.*|
+00002360  90 30 b5 b2 78 25 9a ee  e4 62 59 2b 46 34 9c 6c  |.0..x%...bY+F4.l|
+00002370  48 e4 a0 a0 29 4f 72 64  65 72 73 41 6c 74 a0 2a  |H...)OrdersAlt.*|
 00002380  53 74 72 75 63 74 75 72  65 73 a0 23 43 6f 71 40  |Structures.#Coq@|
 00002390  90 30 5c 91 d4 af b1 94  2f 7a 6f 13 17 a2 75 69  |.0\...../zo...ui|
-000023a0  ed 79 b1 cf 42 6c ca 2c  09 6f 38 78 f9 cc b9 cc  |.y..Bl.,.o8x....|
-000023b0  d4 7a 00 02 af 1c 84 95  a6 be 00 02 8b 52 00 00  |.z...........R..|
+000023a0  ed 79 6e 87 f6 25 58 27  a4 20 92 36 b7 1c 3b 0e  |.yn..%X'. .6..;.|
+000023b0  f3 06 00 02 af 1c 84 95  a6 be 00 02 8b 52 00 00  |.............R..|

https://github.com/bmwiedemann/openSUSE/blob/master/packages/c/coq/coq.spec#L77 has the details of how we build.

Coq Version

8.9.1

@ejgallego
Copy link
Member

Umm, that's bizarre and certainly a bug; while we try to debug, do you have access to the differing files?

You can use votour [a binary in checker/] which is a vo file explorer to see what's exactly the different bit.

@ejgallego ejgallego added the kind: bug An error, flaw, fault or unintended behaviour. label Dec 2, 2019
@ejgallego ejgallego added this to the 8.11.0 milestone Dec 2, 2019
@ejgallego ejgallego self-assigned this Dec 2, 2019
@SkySkimmer
Copy link
Contributor

Since it looks like a Require issue it may be worth trying with 8.11/master.

@ejgallego
Copy link
Member

master sees the same problem, I was able to reproduce.

@ejgallego
Copy link
Member

Looks a bit tricky to analyze, I guess having the ability for votour to produce a diff would help.

I need to take care of other stuff, I reproduced just doing two dune builds, one with -j 8 and the other with -j 3.

@ppedrot
Copy link
Member

ppedrot commented Dec 2, 2019

It might be due to the fact that hashconsing is not fully deterministic. Depending on the way the list of identifiers are processed we might get different in-memory representation.

@ejgallego
Copy link
Member

It might be due to the fact that hashconsing is not fully deterministic. Depending on the way the list of identifiers are processed we might get different in-memory representation.

Indeed I was wondering if the difference would be in the hash-cons, but how can this happen? Not for terms at least; almost surely what's causing the difference here is that we are taking some has of the directory contents.

@SkySkimmer
Copy link
Contributor

In add_vo_path the directory names are put through Id.of_string which hconses in some arbitrary order. Maybe we should order the result of all_subdirs somehow.
(around

coq/vernac/loadpath.ml

Lines 246 to 266 in de91f71

let convert_string d =
try Names.Id.of_string d
with
| CErrors.UserError _ ->
let d = Unicode.escaped_if_non_utf8 d in
warn_cannot_use_directory d;
raise Exit
let add_vo_path ~recursive lp =
let unix_path = lp.unix_path in
let implicit = lp.implicit in
if System.exists_dir unix_path then
let dirs = if recursive then System.all_subdirs ~unix_path else [] in
let prefix = DP.repr lp.coq_path in
let convert_dirs (lp, cp) =
try
let path = List.rev_map convert_string cp @ prefix in
Some (lp, DP.make path)
with Exit -> None
in
let dirs = List.map_filter convert_dirs dirs in
)

@ejgallego
Copy link
Member

Maybe we should order the result of all_subdirs somehow.

I don't understand how this affects the hashconsing, tho even ordering would not yield the same results as for example some directories may not be present between two runs I think.

@bmwiedemann
Copy link
Contributor Author

I could not find votour in openSUSE, so I uploaded 2 versions of one affected .vo file to https://rb.zq1.de/other/coq/ so that you can analyze it more closely.

@ppedrot
Copy link
Member

ppedrot commented Dec 4, 2019

@bmwiedemann the Coq version you used to generate these two files is still 8.9.1, right?

There is more than a change of representation, it seems that the provided files are different just because the vo files they are depending on are already different. Indeed, vo files store a digest of their dependencies and it happens in your example that it's essentially what is changing.

@ppedrot
Copy link
Member

ppedrot commented Dec 4, 2019

Forgot to mention, but the single offending file in your example is Coq.MSets.MSetList. For the file 1, it has digest b5b278259aeee462592b46349c6c48e4 and for the other 3520c8a4e31bb6aaca98d2dac77ff0c1.

@ejgallego
Copy link
Member

@ppedrot indeed a few files are like that, but there is some root cause; I can provide a full master dump with different .vo files, but it is fairly easy to get it yourself:

$ git clean -fxd
$ make -f Makefile.dune DUNEOPT='-j8' # if you have an 8 core machine
$ cp -a _build/default/theories _theories1
$ find _build/default/theories/ -name '*.vo' | xargs rm
$ make -f Makefile.dune DUNEOPT='-j2'
$ cp -a _build/default/theories _theories2
$ find theories1 -name '*.vo' | xargs md5sum | sort -k 2 > v1
$ find theories2 -name '*.vo' | xargs md5sum | sort -k 2 > v2
$ diff -u v1 v2

@ppedrot
Copy link
Member

ppedrot commented Dec 4, 2019

I can provide a script to dump a structured representation of the binary contents of the vo, if that helps.

@ejgallego
Copy link
Member

I can provide a script to dump a structured representation of the binary contents of the vo, if that helps.

If that is diff-friendly would be great, I was thinking of coding a vo diff tool directly in OCaml.

[offtopic note, maybe it's about time we move from .vo, we should have some chat about it.]

@ppedrot
Copy link
Member

ppedrot commented Dec 4, 2019

BTW, I am spotting highly suspicious code in the safe demarshaller that seems to have been introduced by the native integer and float additions. I think we can get interesting segfaults when checking files from another architecture...

@ppedrot
Copy link
Member

ppedrot commented Dec 4, 2019

Here is a diff-friendly vo-dumping program: https://github.com/ppedrot/vodump.

@ppedrot
Copy link
Member

ppedrot commented Dec 9, 2019

I found a minimal counter-example for file Numbers/Integer/Abstract/ZLcm.v using the commands provided by @ejgallego. Here is the memory dump, where the first file was compiled with 4 cores and the second file with 2 cores. The two files differ on their library segment because they do not have the same memory representation even though AFAICT they are structurally equal. In particular, the one difference I could see is the object 0xed33 which corresponds to a KerPair value in turn contained in a Const term node. In the first case, it shares the string "gcd" with another part of the file, while in the second case it gets a fresh instance of the string. All other differences seem to be cause by this offset of size one.

Now, understanding why this happened is another story.

@ppedrot
Copy link
Member

ppedrot commented Dec 9, 2019

Note that this file uses modules, and they famously rely on an imperative delayed substitution mechanism...

@ppedrot
Copy link
Member

ppedrot commented Dec 9, 2019

I do suspect that this is linked to hashconsing, in a way that is unrelated to the directory contents. Rather, this is due to the fact that the GC itself is a source of randomness. My theory is the following:

  1. We hashcons the string "foo" and store it in the global hashcons table.
  2. The pointer to "foo" goes out of scope because the object that contained it is not reachable anymore
  3. The GC may collect "foo" and remove it from the hashcons table, or not (← this is source of non-determinism)
  4. We hashcons the string "foo" again. Now, depending on whether the GC has run, we get a fresh instance of the string or a mere backpointer.

Definitely the memory representation will be different depending on whether 3. occurred.

@SkySkimmer
Copy link
Contributor

How does that break sharing though?

@ppedrot
Copy link
Member

ppedrot commented Dec 9, 2019

The issue appears right in a place where we use mutable state to delay the application of substitutions, so I guess it is an interaction with this imperative use of data.

@ejgallego
Copy link
Member

Sounds quite plausible @ppedrot .

@gares
Copy link
Member

gares commented Dec 10, 2019

What about optionally marshaling with the option No_sharing (to get reproducible builds now) and discuss at the next WG how to get rid of "randomized" hashconsing in full?

@ejgallego
Copy link
Member

What about optionally marshaling with the option No_sharing (to get reproducible builds now) and discuss at the next WG how to get rid of "randomized" hashconsing in full?

That could work, but to make it the default we'd have to bench size and time impact.

@SkySkimmer
Copy link
Contributor

Is it confirmed that No_sharing fixes the issue?

@ppedrot
Copy link
Member

ppedrot commented Dec 12, 2019

Is there an easy way to use your infrastructure if we come up with some patches? Or should we just ask you?

@bmwiedemann
Copy link
Contributor Author

You could use my tools on your hardware, or just use the simpler -j variations that ejgallego used.

I can also test patches for you (with the disadvantage of slower feedback cycles).

@ejgallego
Copy link
Member

I think it is safe to remove this from 8.11 deadline, right @ppedrot ?

@ejgallego ejgallego removed their assignment Dec 13, 2019
@ppedrot ppedrot removed this from the 8.11.0 milestone Dec 13, 2019
@bmwiedemann
Copy link
Contributor Author

coq 8.11.2 is still affected.

@ejgallego
Copy link
Member

coq 8.11.2 is still affected.

Thanks for the report @bmwiedemann ; I'm afraid that 8.12 will still suffer from this; if @ppedrot analysis is correct the fix is not going to be easy due to the way Coq on-disk representation system is designed :S

@ppedrot
Copy link
Member

ppedrot commented Jul 19, 2020

Well, I do suspect that making module substitutions pure would solve the problem. This can be tried easily but since the bench infrastructure is currently down it's hard to check whether this would have negative performance consequences.

@ppedrot
Copy link
Member

ppedrot commented Jun 2, 2021

Seems to be fixed after #14337.

@silene
Copy link
Contributor

silene commented Jun 2, 2021

Seems to be fixed after #14337.

What is the idea? I have a hard time imagining delayed module substitution would be the only reason serialization would be non-deterministic. Was that the only source of unsharing in Coq?

@ppedrot
Copy link
Member

ppedrot commented Jun 2, 2021

@silene there is a tentative explanation a few comments above, although I don't know whether the problem is really solved or if I was not unlucky enough to hit it.

@bmwiedemann
Copy link
Contributor Author

bmwiedemann commented Jun 2, 2021

my test records show 8.12.0 as the first reproducible version around 2020-08-15

@silene
Copy link
Contributor

silene commented Jun 2, 2021

there is a tentative explanation a few comments above

I cannot find it. I understand why eagerly substituting modules is a step toward solving the issue, but I do not understand why it would definitely solve it. Is there no more hashconsing in Coq?

@ppedrot
Copy link
Member

ppedrot commented Jun 2, 2021

@silene the problem is not hashconsing, it hashconsing of mutable data structures. We don't have other instances of this phenomenon in the code. The only places I know where we have mutability is:

  • the hashconsing of term arrays, but it doesn't matter precisely because this is synchronous with hashconsing
  • delayed computation of hashes in names, but it probably doesn't matter because the mutable data is an integer

@Blaisorblade
Copy link
Contributor

Since @bmwiedemann's records confirm Coq is now reproducible, tentatively closing.

@Blaisorblade
Copy link
Contributor

Blaisorblade commented Jan 13, 2023

Best reopened, based on https://coq.zulipchat.com/#narrow/stream/237656-Coq-devs-.26-plugin-devs/topic/Dune.20caching.20.26.20Coq.20non-reproducible.20builds.


Unless I made some very weird mistake, a (very mildly patched) Coqc 8.16.0 is (again?) non-reproducible, which affects our dune-based CI.

Here are two ARM vo files from the same source file — everything is public
https://gist.github.com/Blaisorblade/cd953398a241c17002b1d903806c2bea

I tried https://github.com/ppedrot/vodump from #11229, but it doesn't work as-is (fails on the magic number, might be easy to update?).

Most of the credits go to dune's DUNE_CACHE_CHECK_PROBABILITY, and to the recent fixes that made it usable.

We also have evidence that this happens on our x86 CI — possibly with lower frequency. I should note this file uses

Module IM := FMapAVL.Make OT_bs.

but other files don't use modules so blatantly.


I wiped my dune cache and I'm collecting more data via repeated rebuilds — I've already hit the same vos, and more.

If anybody wants to try, at recent dune versions make it easy (I'm using 3.6.1):

(for i in `seq 1 10`; do rm -rf _build/; time DUNE_CACHE_CHECK_PROBABILITY=1 dune b; done) 2>&1| tee -a output

Combining vodump, votour and Coq sources let me confirm that this is about string hash-consing again, like last time; I'll be uploading analysis results.

The example in the gist involves modules, but I'll upload a new one without where it seems that opaque constant into_sep_at = λ (thread_info : biIndex) ... : ∀ (thread_info : biIndex) with section-bound thread_info, stores thread_info twice, one for the type and one in the cooking_info for the opaque body.

... while other times into_sep_at stores thread_info only once, as expected, which fits #11229 (comment).


Here it is: https://gist.github.com/Blaisorblade/a7274f0eb15a13afbaa083b5b3673697 for the example that does not involve modules. It should be easy to confirm the same string appears once vs twice, and the node paths hint to the involved types and code:

Depth 17 Pos 0.1.2.0.0.1.0.2.0.1.0.1.0.0.1.1.0 Context library/compiled/module_body/module_sign/list/label*sfb/struct_field_body/constant_body/constant_def/opaque/list/cooking_info/abstr_info/list/named_declaration/constr/array
Depth 10 Pos 0.1.2.0.0.1.0.3.0.0 Context library/compiled/module_body/module_sign/list/label*sfb/struct_field_body/constant_body/constr/binder_annot

but the original example (avl.vo) involves different paths wrapping the different strings:

Depth 30 Pos 0.1.2.0.0.1.0.2.0.1.0.1.0.2.0.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.0.1.0.2.0.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.0.1.0.0.0.8.1.0.1.1.1.1.1.0.0.0 
Context 
library/compiled/module_body/module_sign/list/label*sfb/struct_field_body/module_body/module_sign/
list/label*sfb/struct_field_body/module_body/module_sign/list/label*sfb/struct_field_body/module_body/
module_sign/list/label*sfb/struct_field_body/mutual_inductive_body/array/one_inductive_body/array/*/
list/rel_declaration/binder_annot 

I haven't cleaned up the analysis of avl.vo, but I can do it if it helps.

@Blaisorblade Blaisorblade reopened this Jan 13, 2023
@Blaisorblade Blaisorblade added the Coq Consortium support services Reporter subscribed Coq Consortium support services and requests processing by Consortium engineers. label Jan 23, 2023
@Blaisorblade
Copy link
Contributor

Applying for Consortium support on behalf of Bedrock Systems (cc @gmalecha ).

@SkySkimmer
Copy link
Contributor

Here it is: gist.github.com/Blaisorblade/a7274f0eb15a13afbaa083b5b3673697 for the example that does not involve modules.

@Blaisorblade
Could you Print into_sep_at with Set Printing All?

@SkySkimmer
Copy link
Contributor

Going to move this to a new issue as I don't think the old comments are useful.

@ejgallego
Copy link
Member

As this issue is still active, just a small note about No_sharing, using it in Coq is totally unfeasible, results in a crazy .vo size explosion (tried for some other experiments)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Coq Consortium support services Reporter subscribed Coq Consortium support services and requests processing by Consortium engineers. kind: bug An error, flaw, fault or unintended behaviour.
Projects
None yet
Development

No branches or pull requests

7 participants