Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Please make the parsetab.py files reproducible #79

Closed
lamby opened this issue Sep 19, 2015 · 11 comments
Closed

Please make the parsetab.py files reproducible #79

lamby opened this issue Sep 19, 2015 · 11 comments

Comments

@lamby
Copy link

lamby commented Sep 19, 2015

Whilst working on the Debian reproducible builds effort, I noticed that python-ply generates parsetab.py files with non-determinstic contents.

I first had a quick go at fixing this by adding a bunch of sorts inside write_table but looking deeper into the data structures it appears that "more" determinism is needed to ensure that the states are consistently numbered across builds. There are whole bunch of iterations over dict's items() throughout the table generation which—as you are no doubt aware—are non-determinstic. I'm sure some of these are harmless from a reproducibility point of view, so simply adding sorted() everywhere would be a total mess.

Of course, one solution would be to wontfix this and simply decree that these files are non-determistc.. but that would require that Debian etc. would not be able to ship these useful optimisations as they would render the package unreproducible.

@joeedh
Copy link
Contributor

joeedh commented Oct 8, 2015

It might not be a bad idea to use an ordered dict class. There was even a version of ply where I had to do this (it was relying on dict keys having a consistent order).

@LocutusOfBorg
Copy link

Hi, ping? :)

@dabeaz
Copy link
Owner

dabeaz commented Aug 30, 2016

Dictionaries are used all over the place in yacc. Not sure I can easily fix this or not.

@refi64
Copy link

refi64 commented Aug 30, 2016

@dabeaz Could you just do sorted(dct.items()) instead of dct.items()?

@johnyf
Copy link

johnyf commented Aug 30, 2016

Would such a change affect performance though? Should it be for the debugging implementation only, and not any optimized ones?

@refi64
Copy link

refi64 commented Aug 30, 2016

@johnyf I think that would only be run when parsetab.py is being written, though, so it would be only for the first run.

@dabeaz
Copy link
Owner

dabeaz commented Aug 30, 2016

Suggest using yacc(write_tables=False) to disable the creation of the parsetab.py file entirely.

Background: The whole reason that parsetab.py file is there in the first place is that the first version of PLY was written on a 200Mhz PC and the parser table creation was slow. To make startup faster on subsequent runs, parsetab.py was written and used as a kind of cache. I'm not even sure it matters now. For one, machines are a LOT faster. Also, PLY switched over to a different, much faster, algorithm ages ago (generating the tables for C with some 353 states takes about 0.3s on my current machine).

Honestly, I've been thinking about ditching all of this parsetab.py/lextab.py business entirely in some future version.

@lamby
Copy link
Author

lamby commented Aug 30, 2016

(Not all dicts would have to be changed, mind you...)

@johnyf
Copy link

johnyf commented Aug 30, 2016

In my experience with promela, the parsetab.py is useful and does accelerate runs. I would prefer that it remains available as functionality. Quantifying the difference would require collecting measurements from running a representative collection of large parsers.

@dabeaz
Copy link
Owner

dabeaz commented Aug 30, 2016

Ditching parsetab.py/lextab.py is not something I'm likely to do in the context of PLY. I have a more modern project in the works (a successor to PLY) that will ditch the generated files however.

@LocutusOfBorg
Copy link

Hello, FWIW I switched the Debian packaging to the new release 1.0.0
https://github.com/viraptor/phply that seems to be the new one published on pypi

https://pypi.debian.net/phply
https://pypi.python.org/pypi/phply

So, I consider this issue "fixed" for my packaging needs

@dabeaz dabeaz closed this as completed Jan 31, 2017
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this issue May 7, 2020
`ply`, [by design](dabeaz/ply#79), does not produce reproducible table files; hence bug 1633156. (Note that this was *always* true, but only became a problem once we switched to Python 3, which has more unpredictable dict iteration order than Python 2.7, at least prior to [3.7](https://docs.python.org/3/whatsnew/3.7.html#summary-release-highlights).)

In any other circumstance I would consider submitting a patch to `ply` to fix this, but as of the [in-progress version 4.0 of the library](https://github.com/dabeaz/ply/blob/master/CHANGES), it doesn't even emit this cached data any more, and indeed the [latest version of the code](https://github.com/dabeaz/ply/tree/1fac9fed647909b92f3779dd34beb8564f6f247b/ply) doesn't even call `open()` at all except to do logging or to read the text data to be parsed from `stdin`. So if we were going to pin our future on `ply` and upgrade to later versions of the library in the future, we would have to live in a world where `ply` doesn't generate cached table files for us anyway.

Emitting the cached table files so later build steps can consume them is an "optimization", but it's not clear exactly how much actual value that optimization provides overall. Quoth the `CHANGES` file from that repository:

```
PLY no longer writes cached table files.  Honestly, the use of
the cached files made more sense when I was developing PLY on
my 200Mhz PC in 2001. It's not as much as an issue now. For small
to medium sized grammars, PLY should be almost instantaneous.
```

In practice, I have found this to be true; namely, `./mach build pre-export export` takes just about as long on my machine after this patch as it did before, and in a try push I performed, there's no noticeable performance regression from applying this patch. In local testing I also found that generating the LALR tables in calls to `yacc()` takes about 0.01s on my machine generally, and we generate these tables a couple dozen times total over the course of the `export` tier now. This isn't *nothing*, but in my opinion it's also not nearly long enough where it would be a concern given how long `export` already takes.

That `CHANGES` file also stresses that if caching this data is important, we have the option of doing so via `pickle`. If and when we decide that re-enabling this optimization is valuable for us, we should take control of this process and perform the generation in such a way that we can guarantee reproducibility.

Differential Revision: https://phabricator.services.mozilla.com/D73484
gecko-dev-updater pushed a commit to marco-c/gecko-dev-comments-removed that referenced this issue May 7, 2020
`ply`, [by design](dabeaz/ply#79), does not produce reproducible table files; hence bug 1633156. (Note that this was *always* true, but only became a problem once we switched to Python 3, which has more unpredictable dict iteration order than Python 2.7, at least prior to [3.7](https://docs.python.org/3/whatsnew/3.7.html#summary-release-highlights).)

In any other circumstance I would consider submitting a patch to `ply` to fix this, but as of the [in-progress version 4.0 of the library](https://github.com/dabeaz/ply/blob/master/CHANGES), it doesn't even emit this cached data any more, and indeed the [latest version of the code](https://github.com/dabeaz/ply/tree/1fac9fed647909b92f3779dd34beb8564f6f247b/ply) doesn't even call `open()` at all except to do logging or to read the text data to be parsed from `stdin`. So if we were going to pin our future on `ply` and upgrade to later versions of the library in the future, we would have to live in a world where `ply` doesn't generate cached table files for us anyway.

Emitting the cached table files so later build steps can consume them is an "optimization", but it's not clear exactly how much actual value that optimization provides overall. Quoth the `CHANGES` file from that repository:

```
PLY no longer writes cached table files.  Honestly, the use of
the cached files made more sense when I was developing PLY on
my 200Mhz PC in 2001. It's not as much as an issue now. For small
to medium sized grammars, PLY should be almost instantaneous.
```

In practice, I have found this to be true; namely, `./mach build pre-export export` takes just about as long on my machine after this patch as it did before, and in a try push I performed, there's no noticeable performance regression from applying this patch. In local testing I also found that generating the LALR tables in calls to `yacc()` takes about 0.01s on my machine generally, and we generate these tables a couple dozen times total over the course of the `export` tier now. This isn't *nothing*, but in my opinion it's also not nearly long enough where it would be a concern given how long `export` already takes.

That `CHANGES` file also stresses that if caching this data is important, we have the option of doing so via `pickle`. If and when we decide that re-enabling this optimization is valuable for us, we should take control of this process and perform the generation in such a way that we can guarantee reproducibility.

Differential Revision: https://phabricator.services.mozilla.com/D73484

UltraBlame original commit: 9d9bb7c3e80604d8be0d46974cd1a46f1f12fa02
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified that referenced this issue May 7, 2020
`ply`, [by design](dabeaz/ply#79), does not produce reproducible table files; hence bug 1633156. (Note that this was *always* true, but only became a problem once we switched to Python 3, which has more unpredictable dict iteration order than Python 2.7, at least prior to [3.7](https://docs.python.org/3/whatsnew/3.7.html#summary-release-highlights).)

In any other circumstance I would consider submitting a patch to `ply` to fix this, but as of the [in-progress version 4.0 of the library](https://github.com/dabeaz/ply/blob/master/CHANGES), it doesn't even emit this cached data any more, and indeed the [latest version of the code](https://github.com/dabeaz/ply/tree/1fac9fed647909b92f3779dd34beb8564f6f247b/ply) doesn't even call `open()` at all except to do logging or to read the text data to be parsed from `stdin`. So if we were going to pin our future on `ply` and upgrade to later versions of the library in the future, we would have to live in a world where `ply` doesn't generate cached table files for us anyway.

Emitting the cached table files so later build steps can consume them is an "optimization", but it's not clear exactly how much actual value that optimization provides overall. Quoth the `CHANGES` file from that repository:

```
PLY no longer writes cached table files.  Honestly, the use of
the cached files made more sense when I was developing PLY on
my 200Mhz PC in 2001. It's not as much as an issue now. For small
to medium sized grammars, PLY should be almost instantaneous.
```

In practice, I have found this to be true; namely, `./mach build pre-export export` takes just about as long on my machine after this patch as it did before, and in a try push I performed, there's no noticeable performance regression from applying this patch. In local testing I also found that generating the LALR tables in calls to `yacc()` takes about 0.01s on my machine generally, and we generate these tables a couple dozen times total over the course of the `export` tier now. This isn't *nothing*, but in my opinion it's also not nearly long enough where it would be a concern given how long `export` already takes.

That `CHANGES` file also stresses that if caching this data is important, we have the option of doing so via `pickle`. If and when we decide that re-enabling this optimization is valuable for us, we should take control of this process and perform the generation in such a way that we can guarantee reproducibility.

Differential Revision: https://phabricator.services.mozilla.com/D73484

UltraBlame original commit: 9d9bb7c3e80604d8be0d46974cd1a46f1f12fa02
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified-and-comments-removed that referenced this issue May 7, 2020
`ply`, [by design](dabeaz/ply#79), does not produce reproducible table files; hence bug 1633156. (Note that this was *always* true, but only became a problem once we switched to Python 3, which has more unpredictable dict iteration order than Python 2.7, at least prior to [3.7](https://docs.python.org/3/whatsnew/3.7.html#summary-release-highlights).)

In any other circumstance I would consider submitting a patch to `ply` to fix this, but as of the [in-progress version 4.0 of the library](https://github.com/dabeaz/ply/blob/master/CHANGES), it doesn't even emit this cached data any more, and indeed the [latest version of the code](https://github.com/dabeaz/ply/tree/1fac9fed647909b92f3779dd34beb8564f6f247b/ply) doesn't even call `open()` at all except to do logging or to read the text data to be parsed from `stdin`. So if we were going to pin our future on `ply` and upgrade to later versions of the library in the future, we would have to live in a world where `ply` doesn't generate cached table files for us anyway.

Emitting the cached table files so later build steps can consume them is an "optimization", but it's not clear exactly how much actual value that optimization provides overall. Quoth the `CHANGES` file from that repository:

```
PLY no longer writes cached table files.  Honestly, the use of
the cached files made more sense when I was developing PLY on
my 200Mhz PC in 2001. It's not as much as an issue now. For small
to medium sized grammars, PLY should be almost instantaneous.
```

In practice, I have found this to be true; namely, `./mach build pre-export export` takes just about as long on my machine after this patch as it did before, and in a try push I performed, there's no noticeable performance regression from applying this patch. In local testing I also found that generating the LALR tables in calls to `yacc()` takes about 0.01s on my machine generally, and we generate these tables a couple dozen times total over the course of the `export` tier now. This isn't *nothing*, but in my opinion it's also not nearly long enough where it would be a concern given how long `export` already takes.

That `CHANGES` file also stresses that if caching this data is important, we have the option of doing so via `pickle`. If and when we decide that re-enabling this optimization is valuable for us, we should take control of this process and perform the generation in such a way that we can guarantee reproducibility.

Differential Revision: https://phabricator.services.mozilla.com/D73484

UltraBlame original commit: 9d9bb7c3e80604d8be0d46974cd1a46f1f12fa02
xeonchen pushed a commit to xeonchen/gecko that referenced this issue May 7, 2020
`ply`, [by design](dabeaz/ply#79), does not produce reproducible table files; hence bug 1633156. (Note that this was *always* true, but only became a problem once we switched to Python 3, which has more unpredictable dict iteration order than Python 2.7, at least prior to [3.7](https://docs.python.org/3/whatsnew/3.7.html#summary-release-highlights).)

In any other circumstance I would consider submitting a patch to `ply` to fix this, but as of the [in-progress version 4.0 of the library](https://github.com/dabeaz/ply/blob/master/CHANGES), it doesn't even emit this cached data any more, and indeed the [latest version of the code](https://github.com/dabeaz/ply/tree/1fac9fed647909b92f3779dd34beb8564f6f247b/ply) doesn't even call `open()` at all except to do logging or to read the text data to be parsed from `stdin`. So if we were going to pin our future on `ply` and upgrade to later versions of the library in the future, we would have to live in a world where `ply` doesn't generate cached table files for us anyway.

Emitting the cached table files so later build steps can consume them is an "optimization", but it's not clear exactly how much actual value that optimization provides overall. Quoth the `CHANGES` file from that repository:

```
PLY no longer writes cached table files.  Honestly, the use of
the cached files made more sense when I was developing PLY on
my 200Mhz PC in 2001. It's not as much as an issue now. For small
to medium sized grammars, PLY should be almost instantaneous.
```

In practice, I have found this to be true; namely, `./mach build pre-export export` takes just about as long on my machine after this patch as it did before, and in a try push I performed, there's no noticeable performance regression from applying this patch. In local testing I also found that generating the LALR tables in calls to `yacc()` takes about 0.01s on my machine generally, and we generate these tables a couple dozen times total over the course of the `export` tier now. This isn't *nothing*, but in my opinion it's also not nearly long enough where it would be a concern given how long `export` already takes.

That `CHANGES` file also stresses that if caching this data is important, we have the option of doing so via `pickle`. If and when we decide that re-enabling this optimization is valuable for us, we should take control of this process and perform the generation in such a way that we can guarantee reproducibility.

Differential Revision: https://phabricator.services.mozilla.com/D73484
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants