Please make the parsetab.py files reproducible #79
Comments
It might not be a bad idea to use an ordered dict class. There was even a version of ply where I had to do this (it was relying on dict keys having a consistent order).
Hi, ping? :)
Dictionaries are used all over the place in yacc. Not sure I can easily fix this or not.
@dabeaz Could you just do
Would such a change affect performance though? Should it be for the debugging implementation only, and not any optimized ones?
@johnyf I think that would only be run when
Suggest using yacc(write_tables=False) to disable the creation of the parsetab.py file entirely. Background: The whole reason that parsetab.py file is there in the first place is that the first version of PLY was written on a 200Mhz PC and the parser table creation was slow. To make startup faster on subsequent runs, parsetab.py was written and used as a kind of cache. I'm not even sure it matters now. For one, machines are a LOT faster. Also, PLY switched over to a different, much faster, algorithm ages ago (generating the tables for C with some 353 states takes about 0.3s on my current machine). Honestly, I've been thinking about ditching all of this parsetab.py/lextab.py business entirely in some future version.
(Not all dicts would have to be changed, mind you...)
In my experience with
Ditching parsetab.py/lextab.py is not something I'm likely to do in the context of PLY. I have a more modern project in the works (a successor to PLY) that will ditch the generated files however.
Hello, FWIW I switched the Debian packaging to the new release 1.0.0
So, I consider this issue "fixed" for my packaging needs
`ply`, [by design](dabeaz/ply#79), does not produce reproducible table files; hence bug 1633156. (Note that this was *always* true, but only became a problem once we switched to Python 3, which has more unpredictable dict iteration order than Python 2.7, at least prior to [3.7](https://docs.python.org/3/whatsnew/3.7.html#summary-release-highlights).)

In any other circumstance I would consider submitting a patch to `ply` to fix this, but as of the [in-progress version 4.0 of the library](https://github.com/dabeaz/ply/blob/master/CHANGES), it doesn't even emit this cached data any more, and indeed the [latest version of the code](https://github.com/dabeaz/ply/tree/1fac9fed647909b92f3779dd34beb8564f6f247b/ply) doesn't even call `open()` at all except to do logging or to read the text data to be parsed from `stdin`. So if we were going to pin our future on `ply` and upgrade to later versions of the library in the future, we would have to live in a world where `ply` doesn't generate cached table files for us anyway.

Emitting the cached table files so later build steps can consume them is an "optimization", but it's not clear exactly how much actual value that optimization provides overall. Quoth the `CHANGES` file from that repository:

```
PLY no longer writes cached table files. Honestly, the use of the cached
files made more sense when I was developing PLY on my 200Mhz PC in 2001.
It's not as much as an issue now. For small to medium sized grammars,
PLY should be almost instantaneous.
```

In practice, I have found this to be true; namely, `./mach build pre-export export` takes just about as long on my machine after this patch as it did before, and in a try push I performed, there's no noticeable performance regression from applying this patch. In local testing I also found that generating the LALR tables in calls to `yacc()` takes about 0.01s on my machine generally, and we generate these tables a couple dozen times total over the course of the `export` tier now. This isn't *nothing*, but in my opinion it's also not nearly long enough where it would be a concern given how long `export` already takes.

That `CHANGES` file also stresses that if caching this data is important, we have the option of doing so via `pickle`. If and when we decide that re-enabling this optimization is valuable for us, we should take control of this process and perform the generation in such a way that we can guarantee reproducibility.

Differential Revision: https://phabricator.services.mozilla.com/D73484
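On the `pickle` option mentioned in the commit message above: one way a caller could "take control" of the cache is to normalise the tables into an order-independent form before serialising them. The following is only a sketch under assumptions, not anything PLY or the Mozilla build actually does. The helper names `canonical` and `dump_tables` and the shape of the sample tables are hypothetical, and dict keys are assumed to be mutually comparable.

```python
import pickle


def canonical(obj):
    """Recursively turn dicts and sets into sorted tuples (hypothetical helper).

    The result no longer depends on insertion or hash order. Assumes that
    the keys of each dict (and the members of each set) can be compared.
    """
    if isinstance(obj, dict):
        return tuple(sorted((key, canonical(value)) for key, value in obj.items()))
    if isinstance(obj, (set, frozenset)):
        return tuple(sorted(canonical(value) for value in obj))
    if isinstance(obj, (list, tuple)):
        return tuple(canonical(value) for value in obj)
    return obj


def dump_tables(tables):
    """Serialise the canonical form with a pinned pickle protocol."""
    return pickle.dumps(canonical(tables), protocol=4)


# Two made-up table dicts with the same content but different insertion order.
a = {"action": {0: {"NUM": 3, "PLUS": -1}}, "goto": {0: {"expr": 1}}}
b = {"goto": {0: {"expr": 1}}, "action": {0: {"PLUS": -1, "NUM": 3}}}
same_bytes = dump_tables(a) == dump_tables(b)
```

Note that `pickle` memoises objects by identity, so byte-for-byte equality is not guaranteed in general even for equal inputs; a build that needs strict reproducibility would probably want to verify the output or use a canonical text format instead.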
Whilst working on the Debian reproducible builds effort, I noticed that python-ply generates `parsetab.py` files with non-deterministic contents.

I first had a quick go at fixing this by adding a bunch of sorts inside `write_table`, but looking deeper into the data structures it appears that "more" determinism is needed to ensure that the states are consistently numbered across builds. There are a whole bunch of iterations over dicts' `items()` throughout the table generation which, as you are no doubt aware, are non-deterministic. I'm sure some of these are harmless from a reproducibility point of view, so simply adding `sorted()` everywhere would be a total mess.

Of course, one solution would be to wontfix this and simply decree that these files are non-deterministic, but that would mean that Debian etc. could not ship these useful optimisations, as they would render the package unreproducible.
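To illustrate the two layers of the problem described above, here is a small stand-alone sketch. It is not PLY's code: `number_states` and `render_table` are hypothetical names, and the grammar symbols are made up. The first function shows why sorting only at write time is not enough (IDs assigned during generation already depend on iteration order); the second shows the write-time sort on its own.

```python
def number_states(items):
    """Assign consecutive integer IDs in sorted order (hypothetical helper).

    Iterating over the set directly would number the items in hash order,
    which for strings can change between interpreter runs unless
    PYTHONHASHSEED is fixed.
    """
    return {item: index for index, item in enumerate(sorted(items))}


def render_table(table):
    """Render a dict as Python source with its keys emitted in sorted order."""
    lines = ["_table = {"]
    for key in sorted(table):
        lines.append(f"    {key!r}: {table[key]!r},")
    lines.append("}")
    return "\n".join(lines) + "\n"


states = number_states({"term", "expr", "factor"})
source_1 = render_table({"b": 1, "a": 2})
source_2 = render_table({"a": 2, "b": 1})
```

Here `source_1` and `source_2` are identical even though the two dicts were built in a different order, and `states` gets the same numbering on every run.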