Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.vo files vary from build order #11229

Open
bmwiedemann opened this issue Dec 2, 2019 · 37 comments
Open

.vo files vary from build order #11229

bmwiedemann opened this issue Dec 2, 2019 · 37 comments
Labels

Comments

@bmwiedemann
Copy link
Contributor

@bmwiedemann bmwiedemann commented Dec 2, 2019

Description of the problem

While working on reproducible builds for openSUSE, I found that
In addition to #11227 , there are variations in .vo files
that go away when I build on a filesystem that has deterministic readdir order and build with make instead of make -j4 - so that the builds happen in deterministic order.

/usr/lib64/coq/theories/FSets/FSetList.vo differs at offset '9059' (data)
--- /tmp/tmp.KtRUkpybQT/old.yPM 2019-12-02 10:39:36.148042421 +0000
+++ /tmp/tmp.KtRUkpybQT/new.vUR 2019-12-02 10:39:36.148042421 +0000
@@ -1,11 +1,11 @@
 00002340  8b e6 08 d0 e1 81 57 ec  a0 a0 28 4d 53 65 74 4c  |......W...(MSetL|
 00002350  69 73 74 a0 25 4d 53 65  74 73 a0 23 43 6f 71 40  |ist.%MSets.#Coq@|
-00002360  90 30 35 20 c8 a4 e3 1b  b6 aa ca 98 d2 da c7 7f  |.05 ............|
-00002370  f0 c1 a0 a0 29 4f 72 64  65 72 73 41 6c 74 a0 2a  |....)OrdersAlt.*|
+00002360  90 30 b5 b2 78 25 9a ee  e4 62 59 2b 46 34 9c 6c  |.0..x%...bY+F4.l|
+00002370  48 e4 a0 a0 29 4f 72 64  65 72 73 41 6c 74 a0 2a  |H...)OrdersAlt.*|
 00002380  53 74 72 75 63 74 75 72  65 73 a0 23 43 6f 71 40  |Structures.#Coq@|
 00002390  90 30 5c 91 d4 af b1 94  2f 7a 6f 13 17 a2 75 69  |.0\...../zo...ui|
-000023a0  ed 79 b1 cf 42 6c ca 2c  09 6f 38 78 f9 cc b9 cc  |.y..Bl.,.o8x....|
-000023b0  d4 7a 00 02 af 1c 84 95  a6 be 00 02 8b 52 00 00  |.z...........R..|
+000023a0  ed 79 6e 87 f6 25 58 27  a4 20 92 36 b7 1c 3b 0e  |.yn..%X'. .6..;.|
+000023b0  f3 06 00 02 af 1c 84 95  a6 be 00 02 8b 52 00 00  |.............R..|

https://github.com/bmwiedemann/openSUSE/blob/master/packages/c/coq/coq.spec#L77 has the details of how we build.

Coq Version

8.9.1

@ejgallego

This comment has been minimized.

Copy link
Member

@ejgallego ejgallego commented Dec 2, 2019

Umm, that's bizarre and certainly a bug; while we try to debug, do you have access to the differing files?

You can use votour [a binary in checker/] which is a vo file explorer to see what's exactly the different bit.

@ejgallego ejgallego added the kind: bug label Dec 2, 2019
@ejgallego ejgallego added this to the 8.11.0 milestone Dec 2, 2019
@ejgallego ejgallego self-assigned this Dec 2, 2019
@SkySkimmer

This comment has been minimized.

Copy link
Contributor

@SkySkimmer SkySkimmer commented Dec 2, 2019

Since it looks like a Require issue it may be worth trying with 8.11/master.

@ejgallego

This comment has been minimized.

Copy link
Member

@ejgallego ejgallego commented Dec 2, 2019

master sees the same problem, I was able to reproduce.

@ejgallego

This comment has been minimized.

Copy link
Member

@ejgallego ejgallego commented Dec 2, 2019

Looks a bit tricky to analyze, I guess having the ability for votour to produce a diff would help.

I need to take care of other stuff, I reproduced just doing two dune builds, one with -j 8 and the other with -j 3.

@ppedrot

This comment has been minimized.

Copy link
Member

@ppedrot ppedrot commented Dec 2, 2019

It might be due to the fact that hashconsing is not fully deterministic. Depending on the way the list of identifiers are processed we might get different in-memory representation.

@ejgallego

This comment has been minimized.

Copy link
Member

@ejgallego ejgallego commented Dec 3, 2019

It might be due to the fact that hashconsing is not fully deterministic. Depending on the way the list of identifiers are processed we might get different in-memory representation.

Indeed I was wondering if the difference would be in the hash-cons, but how can this happen? Not for terms at least; almost surely what's causing the difference here is that we are taking some has of the directory contents.

@SkySkimmer

This comment has been minimized.

Copy link
Contributor

@SkySkimmer SkySkimmer commented Dec 3, 2019

In add_vo_path the directory names are put through Id.of_string which hconses in some arbitrary order. Maybe we should order the result of all_subdirs somehow.
(around

coq/vernac/loadpath.ml

Lines 246 to 266 in de91f71

let convert_string d =
try Names.Id.of_string d
with
| CErrors.UserError _ ->
let d = Unicode.escaped_if_non_utf8 d in
warn_cannot_use_directory d;
raise Exit
let add_vo_path ~recursive lp =
let unix_path = lp.unix_path in
let implicit = lp.implicit in
if System.exists_dir unix_path then
let dirs = if recursive then System.all_subdirs ~unix_path else [] in
let prefix = DP.repr lp.coq_path in
let convert_dirs (lp, cp) =
try
let path = List.rev_map convert_string cp @ prefix in
Some (lp, DP.make path)
with Exit -> None
in
let dirs = List.map_filter convert_dirs dirs in
)

@ejgallego

This comment has been minimized.

Copy link
Member

@ejgallego ejgallego commented Dec 3, 2019

Maybe we should order the result of all_subdirs somehow.

I don't understand how this affects the hashconsing, tho even ordering would not yield the same results as for example some directories may not be present between two runs I think.

@bmwiedemann

This comment has been minimized.

Copy link
Contributor Author

@bmwiedemann bmwiedemann commented Dec 3, 2019

I could not find votour in openSUSE, so I uploaded 2 versions of one affected .vo file to https://rb.zq1.de/other/coq/ so that you can analyze it more closely.

@ppedrot

This comment has been minimized.

Copy link
Member

@ppedrot ppedrot commented Dec 4, 2019

@bmwiedemann the Coq version you used to generate these two files is still 8.9.1, right?

There is more than a change of representation, it seems that the provided files are different just because the vo files they are depending on are already different. Indeed, vo files store a digest of their dependencies and it happens in your example that it's essentially what is changing.

@ppedrot

This comment has been minimized.

Copy link
Member

@ppedrot ppedrot commented Dec 4, 2019

Forgot to mention, but the single offending file in your example is Coq.MSets.MSetList. For the file 1, it has digest b5b278259aeee462592b46349c6c48e4 and for the other 3520c8a4e31bb6aaca98d2dac77ff0c1.

@ejgallego

This comment has been minimized.

Copy link
Member

@ejgallego ejgallego commented Dec 4, 2019

@ppedrot indeed a few files are like that, but there is some root cause; I can provide a full master dump with different .vo files, but it is fairly easy to get it yourself:

$ git clean -fxd
$ make -f Makefile.dune DUNEOPT='-j8' # if you have an 8 core machine
$ cp -a _build/default/theories _theories1
$ find _build/default/theories/ -name '*.vo' | xargs rm
$ make -f Makefile.dune DUNEOPT='-j2'
$ cp -a _build/default/theories _theories2
$ find theories1 -name '*.vo' | xargs md5sum | sort -k 2 > v1
$ find theories2 -name '*.vo' | xargs md5sum | sort -k 2 > v2
$ diff -u v1 v2
@ppedrot

This comment has been minimized.

Copy link
Member

@ppedrot ppedrot commented Dec 4, 2019

I can provide a script to dump a structured representation of the binary contents of the vo, if that helps.

@ejgallego

This comment has been minimized.

Copy link
Member

@ejgallego ejgallego commented Dec 4, 2019

I can provide a script to dump a structured representation of the binary contents of the vo, if that helps.

If that is diff-friendly would be great, I was thinking of coding a vo diff tool directly in OCaml.

[offtopic note, maybe it's about time we move from .vo, we should have some chat about it.]

@ppedrot

This comment has been minimized.

Copy link
Member

@ppedrot ppedrot commented Dec 4, 2019

BTW, I am spotting highly suspicious code in the safe demarshaller that seems to have been introduced by the native integer and float additions. I think we can get interesting segfaults when checking files from another architecture...

@ppedrot

This comment has been minimized.

Copy link
Member

@ppedrot ppedrot commented Dec 4, 2019

Here is a diff-friendly vo-dumping program: https://github.com/ppedrot/vodump.

@ppedrot

This comment has been minimized.

Copy link
Member

@ppedrot ppedrot commented Dec 9, 2019

I found a minimal counter-example for file Numbers/Integer/Abstract/ZLcm.v using the commands provided by @ejgallego. Here is the memory dump, where the first file was compiled with 4 cores and the second file with 2 cores. The two files differ on their library segment because they do not have the same memory representation even though AFAICT they are structurally equal. In particular, the one difference I could see is the object 0xed33 which corresponds to a KerPair value in turn contained in a Const term node. In the first case, it shares the string "gcd" with another part of the file, while in the second case it gets a fresh instance of the string. All other differences seem to be cause by this offset of size one.

Now, understanding why this happened is another story.

@ppedrot

This comment has been minimized.

Copy link
Member

@ppedrot ppedrot commented Dec 9, 2019

Note that this file uses modules, and they famously rely on an imperative delayed substitution mechanism...

@ppedrot

This comment has been minimized.

Copy link
Member

@ppedrot ppedrot commented Dec 9, 2019

I do suspect that this is linked to hashconsing, in a way that is unrelated to the directory contents. Rather, this is due to the fact that the GC itself is a source of randomness. My theory is the following:

  1. We hashcons the string "foo" and store it in the global hashcons table.
  2. The pointer to "foo" goes out of scope because the object that contained it is not reachable anymore
  3. The GC may collect "foo" and remove it from the hashcons table, or not (← this is source of non-determinism)
  4. We hashcons the string "foo" again. Now, depending on whether the GC has run, we get a fresh instance of the string or a mere backpointer.

Definitely the memory representation will be different depending on whether 3. occurred.

@SkySkimmer

This comment has been minimized.

Copy link
Contributor

@SkySkimmer SkySkimmer commented Dec 9, 2019

How does that break sharing though?

@ppedrot

This comment has been minimized.

Copy link
Member

@ppedrot ppedrot commented Dec 9, 2019

The issue appears right in a place where we use mutable state to delay the application of substitutions, so I guess it is an interaction with this imperative use of data.

@ejgallego

This comment has been minimized.

Copy link
Member

@ejgallego ejgallego commented Dec 9, 2019

Sounds quite plausible @ppedrot .

@gares

This comment has been minimized.

Copy link
Member

@gares gares commented Dec 10, 2019

What about optionally marshaling with the option No_sharing (to get reproducible builds now) and discuss at the next WG how to get rid of "randomized" hashconsing in full?

@ejgallego

This comment has been minimized.

Copy link
Member

@ejgallego ejgallego commented Dec 10, 2019

What about optionally marshaling with the option No_sharing (to get reproducible builds now) and discuss at the next WG how to get rid of "randomized" hashconsing in full?

That could work, but to make it the default we'd have to bench size and time impact.

@SkySkimmer

This comment has been minimized.

Copy link
Contributor

@SkySkimmer SkySkimmer commented Dec 10, 2019

Is it confirmed that No_sharing fixes the issue?

@gares

This comment has been minimized.

Copy link
Member

@gares gares commented Dec 10, 2019

Is it confirmed that No_sharing fixes the issue?

I did not try, but if @ppedrot is correct in his analysis then I believe it fixes the issue. Actually it could be a way to corroborate his thesis.

@ppedrot

This comment has been minimized.

Copy link
Member

@ppedrot ppedrot commented Dec 10, 2019

@gares I think we can expect trouble if we try this. This will be not far from not hashconsing at all, thus leading to memory explosion. This is still worth a shot though.

@bmwiedemann

This comment has been minimized.

Copy link
Contributor Author

@bmwiedemann bmwiedemann commented Dec 11, 2019

if you have a patch, I can test it with my https://github.com/bmwiedemann/reproducibleopensuse tools.

@ppedrot

This comment has been minimized.

Copy link
Member

@ppedrot ppedrot commented Dec 11, 2019

You can try this branch.

@bmwiedemann

This comment has been minimized.

Copy link
Contributor Author

@bmwiedemann bmwiedemann commented Dec 11, 2019

I just used this patch ppedrot@9be8ae8 and in the first build in a 8GB VM it ran out of memory in theories/FSets/FMapFullAVL.vo. With 30 GB it gave nice reproducible results.

@gares

This comment has been minimized.

Copy link
Member

@gares gares commented Dec 11, 2019

Nice!

Then the fix for the upcoming release should probably be to read No_sharing from an env variable.

This way distributions that need reproducibility of Coq (and Coq libraries) can just export such variable.
Final users may not care about reproducibility and get better performances on their files by not setting this variable (that would be the default).

@silene

This comment has been minimized.

Copy link
Contributor

@silene silene commented Dec 11, 2019

Then the fix for the upcoming release should probably be to read No_sharing from an env variable.

How can that even be an acceptable fix? Coq and its standard library compile fine with only 2GB of RAM. If the minimal requirement grows to several tens of GB, this is quite a regression. And I am not even talking about the size of .vo files, which are presumably humongous, given the huge amount of RAM needed to (de)marshal them. The latter will impact all the users of distributions that decide to ship reproducible builds of Coq's standard library, even if those users do not especially care about it.

@gares

This comment has been minimized.

Copy link
Member

@gares gares commented Dec 12, 2019

I don't know how suse considers unreproducible packages. If it is just a warning, then they can live with it I guess.

I agree it is a quick fix and that the drawbacks are visible, and that a better solution should be looked for. But in the very short term, I don't have one.

@bmwiedemann

This comment has been minimized.

Copy link
Contributor Author

@bmwiedemann bmwiedemann commented Dec 12, 2019

It is a long-term goal to have reproducible builds. At some point I want to add a notification whenever an openSUSE package gets updated and is unreproducible. No automatic rejects soon.

So since you now have a better understanding of how the nondeterminism gets introduced, I suppose you can come up with a cleaner solution in some months? No need for dirty patches that have severe downsides.

@ppedrot

This comment has been minimized.

Copy link
Member

@ppedrot ppedrot commented Dec 12, 2019

Is there an easy way to use your infrastructure if we come up with some patches? Or should we just ask you?

@bmwiedemann

This comment has been minimized.

Copy link
Contributor Author

@bmwiedemann bmwiedemann commented Dec 12, 2019

You could use my tools on your hardware, or just use the simpler -j variations that ejgallego used.

I can also test patches for you (with the disadvantage of slower feedback cycles).

@ejgallego

This comment has been minimized.

Copy link
Member

@ejgallego ejgallego commented Dec 13, 2019

I think it is safe to remove this from 8.11 deadline, right @ppedrot ?

@ejgallego ejgallego removed their assignment Dec 13, 2019
@ppedrot ppedrot removed this from the 8.11.0 milestone Dec 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.