Please make the parsetab.py files reproducible #79

lamby · 2015-09-19T00:06:16Z

Whilst working on the Debian reproducible builds effort, I noticed that python-ply generates parsetab.py files with non-deterministic contents.

I first had a quick go at fixing this by adding a bunch of sorts inside write_table, but looking deeper into the data structures, it appears that "more" determinism is needed to ensure that the states are consistently numbered across builds. There are a whole bunch of iterations over dicts' items() throughout the table generation which, as you are no doubt aware, are non-deterministic. I'm sure some of these are harmless from a reproducibility point of view, so simply adding sorted() everywhere would be a total mess.
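To illustrate why sorting only at write time may not be enough, here is a minimal sketch. It is not PLY's code: `number_states` and the frozenset representation of a state are hypothetical, chosen only to show the effect. If state numbers are handed out in visit order, that order leaks into every table that mentions a state number; sorting before numbering removes the dependency.

```python
def number_states(item_sets, deterministic=True):
    """Assign an integer to each state (hypothetical helper, not PLY's API).

    item_sets is an iterable of frozensets of production names.
    """
    if deterministic:
        # Order states by their sorted contents, so the numbering depends
        # only on what each state contains, never on visit order.
        ordered = sorted(item_sets, key=lambda s: sorted(s))
    else:
        # Numbering follows whatever order the container yields.
        ordered = list(item_sets)
    return {state: n for n, state in enumerate(ordered)}


states_a = [frozenset({"expr", "term"}), frozenset({"factor"})]
states_b = list(reversed(states_a))  # same states, visited in another order

# Visit order leaks into the numbering when nothing is sorted.
print(number_states(states_a, deterministic=False)
      == number_states(states_b, deterministic=False))

# Sorting before numbering gives the same result for both visit orders.
print(number_states(states_a) == number_states(states_b))
```

Whether a sort key this simple would suit PLY's real item sets is an open question; the point is only that the fix has to happen where the numbers are assigned.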

Of course, one solution would be to wontfix this and simply decree that these files are non-deterministic, but that would mean that Debian etc. could not ship these useful optimisations, as they would render the package unreproducible.


joeedh · 2015-10-08T17:47:10Z

It might not be a bad idea to use an ordered dict class. There was even a version of ply where I had to do this (it was relying on dict keys having a consistent order).

LocutusOfBorg · 2016-05-10T11:33:46Z

Hi, ping? :)

dabeaz · 2016-08-30T19:55:21Z

Dictionaries are used all over the place in yacc. Not sure whether I can easily fix this.

refi64 · 2016-08-30T20:07:00Z

@dabeaz Could you just do sorted(dct.items()) instead of dct.items()?
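A minimal sketch of this suggestion, assuming a heavily simplified table layout. The `write_table` below is a stand-in written for illustration, not PLY's function of the same name, and `_lr_action` is used only as a plausible variable name for the emitted table.

```python
import io


def write_table(action, out):
    """Write a dict as Python source with its keys in sorted order.

    Simplified stand-in for illustration; not PLY's real writer.
    """
    out.write("_lr_action = {\n")
    for key, value in sorted(action.items()):
        out.write(f"    {key!r}: {value!r},\n")
    out.write("}\n")


def render(action):
    """Return the text that write_table would emit for `action`."""
    buf = io.StringIO()
    write_table(action, buf)
    return buf.getvalue()


# The same entries inserted in two different orders.
first = {"PLUS": 4, "NUMBER": 3, "$end": 0}
second = {"$end": 0, "NUMBER": 3, "PLUS": 4}

# Both orders produce the same text because the items are sorted first.
print(render(first) == render(second))
```

This only stabilizes the order in which entries are written. As lamby points out in the opening comment, values computed earlier during table generation, such as state numbers, would need to be deterministic as well.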

johnyf · 2016-08-30T20:13:48Z

Would such a change affect performance though? Should it be for the debugging implementation only, and not any optimized ones?

refi64 · 2016-08-30T20:25:30Z

@johnyf I think that would only be run when parsetab.py is being written, though, so it would be only for the first run.

dabeaz · 2016-08-30T20:35:35Z

Suggest using yacc(write_tables=False) to disable the creation of the parsetab.py file entirely.

Background: The whole reason that the parsetab.py file is there in the first place is that the first version of PLY was written on a 200 MHz PC and the parser table creation was slow. To make startup faster on subsequent runs, parsetab.py was written and used as a kind of cache. I'm not even sure it matters now. For one, machines are a LOT faster. Also, PLY switched over to a different, much faster algorithm ages ago (generating the tables for C with some 353 states takes about 0.3s on my current machine).

Honestly, I've been thinking about ditching all of this parsetab.py/lextab.py business entirely in some future version.

lamby · 2016-08-30T20:43:00Z

(Not all dicts would have to be changed, mind you...)

johnyf · 2016-08-30T20:49:51Z

In my experience with promela, the parsetab.py is useful and does accelerate runs. I would prefer that it remains available as functionality. Quantifying the difference would require collecting measurements from running a representative collection of large parsers.
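A rough sketch of how such a measurement could be taken with the standard library. `build_tables` here is a placeholder workload, not a real parser build; to measure an actual grammar, it would be replaced by a callable that constructs the parser under test, once with the cached tables present and once without.

```python
import timeit


def build_tables(n_states=300):
    """Placeholder workload standing in for parser table construction."""
    return {s: {"tok%d" % t: s + t for t in range(20)} for s in range(n_states)}


def time_build(build, repeat=5):
    """Return the best wall-clock time in seconds over `repeat` single runs."""
    return min(timeit.repeat(build, number=1, repeat=repeat))


print(f"best of 5: {time_build(build_tables):.4f}s")
```

Taking the minimum over several runs reduces noise from other processes; a representative comparison would still need several large grammars, as noted above.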

dabeaz · 2016-08-30T21:15:44Z

Ditching parsetab.py/lextab.py is not something I'm likely to do in the context of PLY. I have a more modern project in the works (a successor to PLY) that will ditch the generated files however.

LocutusOfBorg · 2017-01-23T08:40:32Z

Hello, FWIW I switched the Debian packaging to the new release 1.0.0 from
https://github.com/viraptor/phply, which seems to be the one now published on PyPI:

https://pypi.debian.net/phply
https://pypi.python.org/pypi/phply

So, I consider this issue "fixed" for my packaging needs

`ply`, [by design](dabeaz/ply#79), does not produce reproducible table files; hence bug 1633156. (Note that this was *always* true, but only became a problem once we switched to Python 3, which has more unpredictable dict iteration order than Python 2.7, at least prior to [3.7](https://docs.python.org/3/whatsnew/3.7.html#summary-release-highlights).)

In any other circumstance I would consider submitting a patch to `ply` to fix this, but as of the [in-progress version 4.0 of the library](https://github.com/dabeaz/ply/blob/master/CHANGES), it doesn't even emit this cached data any more, and indeed the [latest version of the code](https://github.com/dabeaz/ply/tree/1fac9fed647909b92f3779dd34beb8564f6f247b/ply) doesn't even call `open()` at all except to do logging or to read the text data to be parsed from `stdin`. So if we were going to pin our future on `ply` and upgrade to later versions of the library in the future, we would have to live in a world where `ply` doesn't generate cached table files for us anyway.

Emitting the cached table files so later build steps can consume them is an "optimization", but it's not clear exactly how much actual value that optimization provides overall. Quoth the `CHANGES` file from that repository:

```
PLY no longer writes cached table files. Honestly, the use of the cached files made more sense when I was developing PLY on my 200Mhz PC in 2001. It's not as much as an issue now. For small to medium sized grammars, PLY should be almost instantaneous.
```

In practice, I have found this to be true; namely, `./mach build pre-export export` takes just about as long on my machine after this patch as it did before, and in a try push I performed, there's no noticeable performance regression from applying this patch. In local testing I also found that generating the LALR tables in calls to `yacc()` takes about 0.01s on my machine generally, and we generate these tables a couple dozen times total over the course of the `export` tier now. This isn't *nothing*, but in my opinion it's also not nearly long enough where it would be a concern given how long `export` already takes.

That `CHANGES` file also stresses that if caching this data is important, we have the option of doing so via `pickle`. If and when we decide that re-enabling this optimization is valuable for us, we should take control of this process and perform the generation in such a way that we can guarantee reproducibility.

Differential Revision: https://phabricator.services.mozilla.com/D73484
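One way the last point could be approached, sketched under stated assumptions: the tables are plain nested dicts with sortable keys, and `canonical` and `dump_tables` are hypothetical helpers, not part of `ply` or of the build system. The idea is to put the data into a canonical order before pickling with a pinned protocol, so that the bytes no longer depend on dict insertion order.

```python
import hashlib
import pickle


def canonical(obj):
    """Recursively replace dicts with key-sorted lists of (key, value) pairs."""
    if isinstance(obj, dict):
        pairs = sorted(obj.items(), key=lambda kv: kv[0])
        return [(k, canonical(v)) for k, v in pairs]
    if isinstance(obj, (list, tuple)):
        return [canonical(v) for v in obj]
    return obj


def dump_tables(tables):
    """Serialize tables with a pinned pickle protocol (hypothetical helper)."""
    return pickle.dumps(canonical(tables), protocol=4)


# Identical tables built in different insertion orders.
a = {0: {"NUMBER": 3, "PLUS": -1}, 1: {"$end": 0}}
b = {1: {"$end": 0}, 0: {"PLUS": -1, "NUMBER": 3}}

# Compare digests of the serialized bytes for the two insertion orders.
print(hashlib.sha256(dump_tables(a)).hexdigest()
      == hashlib.sha256(dump_tables(b)).hexdigest())
```

A loader would have to rebuild the dicts from the stored pairs. This sketch does not establish that pickle output stays byte-identical across Python versions, so a real implementation would want to check that in the build itself.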

josephfrazier mentioned this issue Jan 11, 2017

Warnings when using thefuck nvbn/thefuck#588

Closed

dabeaz closed this as completed Jan 31, 2017

jlev mentioned this issue Sep 22, 2017

parsetab module is not always generated property by ply mailgun/flanker#168

Closed
