This repository has been archived by the owner on Oct 2, 2023. It is now read-only.

Python 3.9 generating different image digests #1680

Closed
tommyknows opened this issue Nov 20, 2020 · 7 comments
Labels
Can Close? Will close in 30 days unless there is a comment indicating why not

Comments

@tommyknows

We're currently facing a bug with users having newer installations of Python (3.9).

We are depending on the reproducibility of Container Images and Digests, and image builds on different OS & Platforms should always result in the same image digest (note that we're always building images for linux / amd64).

However, we started noticing an issue with people who upgraded their Python 3 version to 3.9.
The digests they produce differ from those produced with older versions.

We noticed this on two different systems:

On macOS, the system installation of Python (/usr/bin) is old enough; for example, on Big Sur, python3 is 3.8.2. However, the user may install a newer Python version with brew, in which case the python3 binary that gets picked up is 3.9 instead of the system's one.
In this case, /usr/bin/python3 is the system installation, but the python3 selected from the user's $PATH may be brew's installation of Python 3.9.

On Linux, where the user had just upgraded their Python version.
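
A minimal sketch for checking which of the two cases applies on a given machine. It only compares what a $PATH lookup resolves against /usr/bin/python3 (the path used in this thread); the function name and dictionary keys are made up for illustration, and this is not how Bazel itself discovers the interpreter.

```python
import os
import shutil

# Path of the system interpreter as mentioned in this thread.
SYSTEM_PYTHON = "/usr/bin/python3"


def interpreter_candidates(name="python3"):
    """Return the resolved interpreter found via $PATH and the system one."""
    on_path = shutil.which(name)  # None if nothing on $PATH matches
    return {
        "path_lookup": os.path.realpath(on_path) if on_path else None,
        "system": (
            os.path.realpath(SYSTEM_PYTHON)
            if os.path.exists(SYSTEM_PYTHON)
            else None
        ),
    }


candidates = interpreter_candidates()
print(candidates)
if candidates["path_lookup"] != candidates["system"]:
    print("python3 on $PATH is not the same binary as", SYSTEM_PYTHON)
```

If the two entries differ, a build that picks the interpreter from $PATH may run a different Python version than one pinned to /usr/bin/python3.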

We mitigated this issue, at least for MacOS, by registering custom toolchains:

# py_runtime_pair is defined in Starlark and has to be loaded explicitly.
load("@bazel_tools//tools/python:toolchain.bzl", "py_runtime_pair")

py_runtime(
    name = "python3",
    interpreter_path = "/usr/bin/python3",
    python_version = "PY3",
)

py_runtime_pair(
    name = "python_pair",
    # We don't register a Python 2 runtime because:
    # - old installations do not have a `python2` binary, and
    # - new installations link `python` to Python 3,
    # so we never know which one works.
    # py2_runtime = "",
    py3_runtime = ":python3",
)

On Linux, we require the user to install a Python version older than 3.9 and link it to /usr/bin/python3.

There is an issue in rules_python about registering the Python interpreter the way go_register_toolchains downloads the correct version of the Go compiler, which makes the build independent of the host. While this would fix the issue at hand, it would still only be a workaround.

Is it a bug in rules_docker that the image digests differ with newer versions of Python, or is this an issue in Python itself?

@thundergolfer
Contributor

thundergolfer commented Apr 19, 2021

If I'm understanding your issue correctly, the digest is different because the Python interpreter is changing, so the Python interpreter must be an input to your image somehow. The digester tool itself does not use the Python interpreter (the digest-producing tool is written in Go: "//container/go/cmd/digester").

If your containers are not layering in py_binary artifacts, it would be interesting to find out how the interpreter is becoming an input to your images, but I think the root issue is that your current Python setup is not hermetic. As you point out, by default rules_python will reach out to the $PATH to discover the interpreter to use. Setting interpreter_path = "/usr/bin/python3" in a py_runtime helps lock things down, but it still reaches outside Bazel onto the user's file system.

> Is this a bug in rules_docker that the image digests differ with newer versions of python

I think we'd have to know why the Python interpreter is becoming an input into the digest. rules_docker produces some config/metadata files during the image build; I'd check those first to find out what's going into your digest calculation.

@tommyknows
Author

Sorry, I should have been clearer in the issue description.

The issue here is not the digester; it's the inputs into the digester, iiuc.

I've generated an image with Python 3.8 and with Python 3.9 and get different <name>-layer.tar files:

> # Python 3.8
> cat myimage-layer.tar | sha256sum
27af804a6570d5710c9d3ab931e97a9ef86462fc66859ad52a0ec7fed5d530c5  -
> # Python 3.9
> cat myimage-layer.tar | sha256sum
deff2ba4a7f9827cf37dce32abb4d38fb2a847fae286a34d8c8dc6aa299c7e0b  -

The image in this case contains a Go binary, which has the same hash in both layers.
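
The observation above (same file contents, different archive digest) can be reproduced in isolation. The following is a minimal sketch, not the rules_docker code: it builds two in-memory tar archives with the same payload and uses a differing mtime as a stand-in for any header-only difference. The helper names make_layer and member_digests are made up for illustration.

```python
import hashlib
import io
import tarfile


def make_layer(payload: bytes, mtime: int) -> bytes:
    """Build a one-file tar archive in memory with the given header mtime."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        info = tarfile.TarInfo(name="app")
        info.size = len(payload)
        info.mtime = mtime  # header-only metadata, not part of the payload
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def member_digests(layer: bytes) -> dict:
    """Map each regular file in the archive to the sha256 of its contents."""
    digests = {}
    with tarfile.open(fileobj=io.BytesIO(layer), mode="r") as tar:
        for member in tar.getmembers():
            if member.isfile():
                data = tar.extractfile(member).read()
                digests[member.name] = hashlib.sha256(data).hexdigest()
    return digests


layer_a = make_layer(b"same binary", mtime=1)
layer_b = make_layer(b"same binary", mtime=2)

# The archives hash differently although the contained file is identical.
print(hashlib.sha256(layer_a).hexdigest() == hashlib.sha256(layer_b).hexdigest())
print(member_digests(layer_a) == member_digests(layer_b))
```

Running member_digests over the two real myimage-layer.tar files (read as bytes) would be one way to confirm that only the tar metadata differs.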

I'm not sure what other information would be helpful, so please let me know what you need.

> the root issue is that your current Python setup is not hermetic.

Yes, absolutely, but having a hermetic Python setup is hard 😄 There's no real option apart from building the interpreter yourself, and depending on the macOS version there's no one-size-fits-all command to do so. For us, that would have been way harder than just telling people to install and use 3.8 :-)
(If you do know an easy way to get a hermetic Python setup that works across different operating systems, please let me know.)

@thundergolfer
Contributor

> The image in this case contains a Go binary, which has the same hash in both layers.

So if you run a command like:

mkdir tar-contents/
tar -xvf bazel-bin/tools/foo/myimage-layer.tar -C tar-contents

You get one file in tar-contents/, which is a Go binary, and that binary has the same sha256sum value between Python 3.8 and Python 3.9, despite the overall tar file having different sha256sum values?

Given my limited knowledge of tar files, that would mean there are differences in the headers contained in the tar file, which would be strange.
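
One way to test the header hypothesis is to compare the raw 512-byte headers field by field. The sketch below assumes the standard ustar header layout and, to stay self-contained, compares two headers generated in memory that differ only in mtime; for the real case you would pass the first 512 bytes of each myimage-layer.tar instead. The names FIELDS, first_header and diff_headers are made up for illustration.

```python
import io
import tarfile
from typing import List

# (field name, start offset, end offset) in the ustar header layout.
FIELDS = [
    ("name", 0, 100), ("mode", 100, 108), ("uid", 108, 116),
    ("gid", 116, 124), ("size", 124, 136), ("mtime", 136, 148),
    ("chksum", 148, 156), ("typeflag", 156, 157), ("linkname", 157, 257),
    ("magic", 257, 263), ("version", 263, 265), ("uname", 265, 297),
    ("gname", 297, 329), ("devmajor", 329, 337), ("devminor", 337, 345),
    ("prefix", 345, 500),
]


def first_header(mtime: int) -> bytes:
    """Return the 512-byte header of a one-file in-memory archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        info = tarfile.TarInfo(name="app")
        info.size = 4
        info.mtime = mtime
        tar.addfile(info, io.BytesIO(b"data"))
    return buf.getvalue()[:512]


def diff_headers(a: bytes, b: bytes) -> List[str]:
    """List the names of header fields whose raw bytes differ."""
    return [name for name, lo, hi in FIELDS if a[lo:hi] != b[lo:hi]]


changed = diff_headers(first_header(mtime=1), first_header(mtime=2))
print(changed)
```

Any field change also changes chksum, because the header checksum is computed over the whole header block.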

@tommyknows
Author

Yes, exactly.

I've created a minimal example of this issue here.
You can run the script ./reproduce.sh; an example output is already in the README.

I think I saw something similar about tar headers in another issue, but I was unable to find it again. It was something that changed between minor Python versions and had to be hardcoded in the rules, which feels similar.

@dataoleg
Contributor

dataoleg commented Oct 5, 2021

This is because python/cpython#18080 [apparently] hasn't been back-ported to Python 3.8, and the tarfile library it changes is used by

self.tar = tarfile.open(

So the tar headers end up slightly different (different devmajor, devminor, and as a consequence different header checksum value), which explains a different sha256sum.
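
The difference can be observed directly on whatever interpreter is at hand. This is a minimal sketch, assuming a ustar-format header written through Python's tarfile module (the format rules_docker actually passes is not shown in this thread): it writes one regular file into an in-memory archive and prints the raw devmajor, devminor and checksum fields. The helper name first_header is made up for illustration.

```python
import io
import sys
import tarfile


def first_header() -> bytes:
    """Return the 512-byte header tarfile writes for one regular file."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        payload = b"hello"
        info = tarfile.TarInfo(name="app")  # regular file, not a device node
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()[:512]


header = first_header()
devmajor = header[329:337]  # ustar devmajor field
devminor = header[337:345]  # ustar devminor field
chksum = header[148:156]    # header checksum, covers the fields above

print("python", sys.version_info[:2])
print("devmajor", devmajor)
print("devminor", devminor)
print("chksum  ", chksum)
```

Running this under 3.8 and under 3.9 and comparing the printed fields shows whether the device fields are written as octal zeros or left as NUL bytes, which in turn changes the checksum and the sha256sum of the archive.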

@github-actions

github-actions bot commented Apr 4, 2022

This issue has been automatically marked as stale because it has not had any activity for 180 days. It will be closed if no further activity occurs in 30 days.
Collaborators can add an assignee to keep this open indefinitely. Thanks for your contributions to rules_docker!

@github-actions github-actions bot added the Can Close? Will close in 30 days unless there is a comment indicating why not label Apr 4, 2022
@tommyknows
Author

Judging from the underlying "issue" as described above, I don't think that this issue, as described, is really fixable in rules_docker.

A built-in Python toolchain would fix this by making it easier to depend on a single Python version even across different dev environments.

Either way, I'll go ahead and close this issue. Thanks to everyone who helped look at this!
