The current code shape generated for time_zone_data.ml makes the compiler explode #16
Comments
@gasche when you said toplevel definitions, is it per module or the total number of toplevel definitions?
There are 594 time zones, so afaict, either I build a small number of definitions with a long sequence of operations on a hashtable (which seems to blow up), or I build a lot of definitions with short sequences of operations (which seems to blow up as well). EDIT: assuming it's about the total number of definitions. EDIT2: I can switch to using 594 strings and deserialize the items at run time, but I'm not sure that fits within the provided recommendation either.
The code currently looks like this:
let db =
String_map.empty
|> String_map.add "Cuba"
[|
((-62167219200L), { is_dst = false; offset = (-19776) });
((-2524501832L), { is_dst = false; offset = (-19776) });
((-1402813824L), { is_dst = false; offset = (-18000) });
[...]
|]
|> String_map.add "EET"
[| ... |]
|> String_map.add "GMT-0"
[| ... |]
|> [...]
Two things may be problematic here:
Both of these can be limited in the code generator without changing the semantics of the program. For the array literal, you can rewrite
I'm writing this without having seen precise failure logs (just "there is a stack overflow with flambda") and without having tried to reproduce the problem.
My recommendation would be to start by trying to reproduce the issue, by compiling your program with flambda, possibly with
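To make the suggested code shape concrete, here is a minimal sketch of what the generated file could look like once both long sequences are bounded. It assumes the two problematic constructs are the long array literals and the long chain of `String_map.add` calls. The record type `entry`, the names `add_chunk_0` and `add_chunk_1`, and the chunk sizes are hypothetical, chosen only for illustration.

```ocaml
module String_map = Map.Make (String)

(* Hypothetical record type matching the fields used in the thread. *)
type entry = { is_dst : bool; offset : int }

(* One long array literal becomes a concatenation of short literals.
   A real generator would use a much larger bound than 2 per chunk. *)
let cuba : (int64 * entry) array =
  Array.concat
    [
      [|
        ((-62167219200L), { is_dst = false; offset = (-19776) });
        ((-2524501832L), { is_dst = false; offset = (-19776) });
      |];
      [|
        ((-1402813824L), { is_dst = false; offset = (-18000) });
      |];
    ]

(* One long chain of String_map.add calls becomes several short chains,
   each wrapped in its own function. *)
let add_chunk_0 m = m |> String_map.add "Cuba" cuba

let add_chunk_1 m =
  m |> String_map.add "EET" [||] |> String_map.add "GMT-0" [||]

(* The final map is the same as with a single long chain. *)
let db = String_map.empty |> add_chunk_0 |> add_chunk_1
```

The resulting `db` has the same contents as the single-chain version; only the size of each individual expression changes. Whether a given bound is small enough for flambda has to be checked by compiling.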
Sorry, I should have clarified that the modified code is in
I have tried
I'll have a look at the log
EDIT: So setting
Okay, I've tried myriad combinations of things and I'm not getting much of anywhere. Hm...
Okay I've reduced the range of generation, which seems to work on 4.11.1+flambda (not sure where to pass the
https://raw.githubusercontent.com/daypack-dev/timere/main/gen-artifacts/time_zone_data.ml
Now the code generator uses a hashtable, and to avoid a large sequence of
I don't know if the change is necessary (I would have assumed that the previous approach could also work, by splitting the calls to
This is a weird problem because we don't really know what is "good enough", but if your code compiles with flambda -O3 on your system, I guess you could consider the issue solved and submit on opam again? If you get a report of another issue on some system, you can always investigate and tweak the limits again.
The split approach didn't seem to work immediately, but that could also be because the total size of the data was too large (I didn't try that approach with the current reduced range of dates).
Sorry but I still have no clue how to pass EDIT: passing it via
I think that adding the following as a root
(See the dune documentation.)
Note: Ideally I think that you should be able to include all the data you want, as long as it is split in reasonably-sized chunks at all "sequence" levels.
Thanks! I can confirm it can now build on
I don't think this is possible after tuning the parameters a fair bit.
I think the current fix suffices for now. I'll submit another PR to opam repository. Thanks very much for your help and feedback!
[minor note] I don't understand what changed in the generation strategy by looking at the PR #17, simplified explanations of the generated code shape (as I did above) would help. My impression is that the PR both changes the generation strategy and undoes some data reduction that you performed to help, and I don't know how to get readable diffs from github for the data-generation-strategy change only (trying to look at the output rather than generator code).
After several experiments, I concluded that the sheer number of constant sub-expressions is the real problem. Flambda lifts everything to toplevel and has a traversal on toplevel expressions that is not tail-rec, and it crashes. I decided upon a fairly radical solution: we build a store of the data in advance (in sexp), which we distribute with the sources. At build time, we load the data, put it into the OCaml data structure we want in the end, and
It does indeed sound nicer. But how do you locate the data at runtime, is it just in the install data of the library? (This makes the system less relocatable, but I guess that's okay; we don't really support static-linking anyway.) Or are you considering playing with embedding the string in the program?
At build time, we create a module whose only content is
Side benefit, after this whole ordeal (and some elbow grease), the size of the generated file is 2.5 MB instead of 4 MB. I don't think we can make it smaller without bringing out succinct data structures. Unboxed arrays would help a tiny bit though.
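The idea of embedding the data as a string and deserializing it at run time can be sketched as follows. The thread does not show which serialization is used for the embedded value, so `Marshal` here is an assumption, and the names `generate_module`, `load`, and `data`, as well as the record type, are hypothetical.

```ocaml
(* Hypothetical types; the real library's types may differ. *)
type entry = { is_dst : bool; offset : int }
type table = (int64 * entry) array

(* Build-time step (assumption: Marshal is the serializer).
   Produces the source text of a module whose only content is one
   string literal, so the compiler sees a single constant. *)
let generate_module (db : (string * table) list) : string =
  let blob = Marshal.to_string db [] in
  Printf.sprintf "let data = %S\n" blob

(* Run-time step: rebuild the value from the embedded string. *)
let load (data : string) : (string * table) list =
  Marshal.from_string data 0

(* Small demonstration with a single time zone entry. *)
let sample : (string * table) list =
  [ ("Cuba", [| ((-62167219200L), { is_dst = false; offset = (-19776) }) |]) ]

let source = generate_module sample
let roundtrip = load (Marshal.to_string sample [])
```

`Marshal.from_string` is not type-safe, so the type annotation on `load` has to match what was serialized at build time. Wrapping the call in `lazy` would defer the deserialization cost until the data is first needed.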
Hi there,
time_zone_data.ml is a 4 MiB generated source file with veeery long list literals. This pattern is known to make the compiler blow up if certain optimization passes are enabled (typically, when compiling with flambda).
Currently the OCaml compiler tries to accept human-written files, but it does not claim to support program-generated code well: it is known that some parts of the compiler will blow up on unnatural inputs. The recommended practice for authors of OCaml code generators is to avoid extremely long sequences of any language construct (sequences, list literals, array literals, toplevel definitions, etc.), and instead build equivalent programs by concatenating smaller groups of items -- with a reasonable limit baked into the generator.
Could you rewrite the generator for time_zone_data.ml to follow this recommended practice?
(Maybe someday the OCaml compiler will become more robust to unnatural data, but that requires a lot of effort and may in some cases decrease readability and maintainability of the compiler code, so this is not on the radar for now.)
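On the generator side, the recommended practice above could look roughly like this: a helper that splits items into bounded groups, and an emitter that prints a concatenation of short list literals instead of one long literal. The names `chunks` and `emit_list_literal` and the `limit` parameter are hypothetical, and this is a sketch rather than the project's actual generator.

```ocaml
(* Split [items] into consecutive groups of at most [limit] elements,
   preserving order. *)
let chunks ~limit items =
  if limit <= 0 then invalid_arg "chunks: limit must be positive";
  let rec go acc current n = function
    | [] ->
      let acc = if current = [] then acc else List.rev current :: acc in
      List.rev acc
    | x :: rest ->
      if n = limit then go (List.rev current :: acc) [ x ] 1 rest
      else go acc (x :: current) (n + 1) rest
  in
  go [] [] 0 items

(* Emit OCaml source for a list whose elements are the given source
   fragments, as a List.concat of literals of bounded length. *)
let emit_list_literal ~limit (items : string list) : string =
  let group g = "[ " ^ String.concat "; " g ^ " ]" in
  match chunks ~limit items with
  | [] -> "[]"
  | groups ->
    "List.concat [ " ^ String.concat "; " (List.map group groups) ^ " ]"
```

For example, `emit_list_literal ~limit:2 ["1"; "2"; "3"]` yields `List.concat [ [ 1; 2 ]; [ 3 ] ]`. With very large inputs the outer list of groups can itself become long, in which case the same splitting would need to be applied again at that level.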