[feature] Obtain timestamp from unzip routines for SOURCE_DATE_EPOCH #14480

Open
1 task done
iskunk opened this issue Aug 14, 2023 · 4 comments

@iskunk
Contributor

iskunk commented Aug 14, 2023

What is your suggestion?

This is related to #5152, but addresses one specific aspect of the issue.

Best practices for reproducible builds related to the use of SOURCE_DATE_EPOCH indicate that this should be set to a well-defined timestamp associated with the source code being compiled. For a source tarball, this could be the latest timestamp among its files; for a Git tree (or other VCS), this could be the commit timestamp. I am particularly concerned with the former case, but the latter might be considered in scope as well.

Since Conan takes care of unpacking source archives itself, rather than calling out to external programs, it is in a good position to grab timestamps of files as it iterates through them. It can then return a useful value from that information, e.g. the most recent timestamp across all of them.
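As a rough illustration of grabbing timestamps while iterating an archive, here is a minimal sketch using only the Python standard library. The function name `latest_tar_mtime` is hypothetical and not part of Conan; it reads member headers and returns the newest mtime among regular files.

```python
import tarfile


def latest_tar_mtime(archive_path):
    """Return the newest mtime (seconds since the epoch) among the regular
    files in a tar archive, or None if it contains no regular files.

    Hypothetical helper for illustration; not a Conan API.
    """
    newest = None
    with tarfile.open(archive_path) as tar:
        # Iterating a TarFile yields one TarInfo per member; nothing is
        # written to disk here.
        for member in tar:
            if not member.isfile():
                continue
            mtime = int(member.mtime)
            if newest is None or mtime > newest:
                newest = mtime
    return newest
```

Directories and links are skipped in this sketch on the assumption that only file contents matter for the build; that choice is open to debate.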

Automatically setting SOURCE_DATE_EPOCH from that result is a potential user-configurable convenience, and perhaps might even be considered as a future default. For now, however, the focus of this issue is to add the logic necessary to obtain that timestamp in the first place, across the different source archive-unpacking implementations in the codebase. Once that information is available, how it is used will be a different story.

Have you read the CONTRIBUTING guide?

  • I've read the CONTRIBUTING guide
@memsharded
Member

Hi @iskunk

Thanks for the suggestion, it is interesting.

It is not so much that Conan extracts zipped files itself; most of the time it just calls the Python stdlib tarredgzippedFile.extractall(destination). There are some other places where we iterate file by file, such as .zip extraction, and that is in general much slower. The tgz format is also preferred; for example, all of ConanCenter uses the .tgz format from GitHub, not .zip files.

> For a source tarball, this could be the latest timestamp among its files;

I have been trying to find some official guidance about this, but I have found nothing. Some places say that all files should use the latest timestamp for this to work properly, but that is up to the creator of the tarball. It would be good to have something a bit clearer.

Taking it all into account, I wouldn't love to make every decompression slower by iterating over all files just to extract this SOURCE_DATE_EPOCH, in case some users eventually need it. It hasn't been a massive use case so far; this is the first time it has been mentioned. (Or could we iterate over all files after unzipping? If that is possible, it would make sense as a separate tool, decoupled from the unzip and more generic to all downloads.) It seems like too much negative impact for a potential value that doesn't have strong evidence yet.

@iskunk
Contributor Author

iskunk commented Aug 15, 2023

The best implementation approach remains to be seen. For example, I see that ZipFile.extractall() doesn't provide a way to get at file timestamps, but that may be better handled by running through the zipfile index. TarFile.extractall() has a fairly new filter parameter that may be useful. The extra processing incurred by the timestamp wrangling can be made optional, so that uninterested users don't pay a price.
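To make those two ideas concrete, here is a hedged sketch. Both function names are hypothetical and nothing here is Conan code. The zip variant reads the archive index via `ZipFile.infolist()`; since `ZipInfo.date_time` carries no time zone, treating it as UTC is an assumption of this sketch. The tar variant relies on the `filter` argument of `TarFile.extractall()` and on `tarfile.data_filter`, which exist only in Python versions that ship extraction filters (3.12, plus some patch releases of older series).

```python
import calendar
import tarfile
import zipfile


def latest_zip_timestamp(archive_path):
    """Newest entry time in a zip index, read without extracting anything."""
    newest = None
    with zipfile.ZipFile(archive_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            # date_time is (year, month, day, hour, minute, second) with no
            # time zone attached. ASSUMPTION: interpret it as UTC.
            stamp = calendar.timegm(info.date_time)
            if newest is None or stamp > newest:
                newest = stamp
    return newest


def extract_tar_tracking_mtime(archive_path, destination):
    """Extract a tar archive and return the newest regular-file mtime seen.

    Requires a Python with tarfile extraction filters.
    """
    newest = None

    def tracking_filter(member, dest_path):
        nonlocal newest
        if member.isfile():
            mtime = int(member.mtime)
            if newest is None or mtime > newest:
                newest = mtime
        # Delegate the actual safety checks to the stock "data" filter.
        return tarfile.data_filter(member, dest_path)

    with tarfile.open(archive_path) as tar:
        tar.extractall(destination, filter=tracking_filter)
    return newest
```

In the tar variant the timestamp is collected during the single extraction pass, so no extra iteration over the archive is needed.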

Iterating through all files after unpacking could work too, but that's a less integrated approach. By the same token, the user could run some external script/utility that returns the appropriate timestamp (though it would have to watch out for patched files with a current timestamp). It all comes down to how close facilitating reproducible builds is to Conan's core mission.

I'm not aware of any formal guidelines for obtaining the SOURCE_DATE_EPOCH timestamp, but the constraints/criteria aren't complicated:

  • You want a date that is fixed for the source code in question;
  • If the source is in an official tarball, then all the timestamps therein are frozen; if it's in an official VCS commit, then that commit timestamp is frozen;
  • There is no common notion of a tarball creator specifying a value of SOURCE_DATE_EPOCH to use when compiling the source;
  • A specific time/date can be inferred from the tarball release date, but there is no standardized way to do so (Tarball timestamp on the official download site? ChangeLog date? What's the time-of-day and time zone for that? Look at the associated Git tag commit timestamp? Etc.);
  • The latest timestamp in a tarball is typically quite close to the release date, given the normal workflow associated with preparing a release. And even if it isn't, the "when was this source last touched" datum remains independently relevant;
  • __DATE__ should still have some useful meaning (Dec. 1969 is obviously bogus, and so is today's date for ten-year-old source if you want to reproduce a ten-year-old build);
  • Future timestamps should never occur, and for the purposes of reproducibility, may be considered a fatal error.
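
The criteria above could be sketched as a small selection function. This is only an illustration: the names `pick_source_date_epoch` and `SourceDateError` are hypothetical, and the 1980-01-01 floor used to discard bogus timestamps is an arbitrary assumption, not something proposed in this thread.

```python
import time

# ASSUMPTION: anything before 1980-01-01 00:00:00 UTC is treated as bogus.
EARLIEST_PLAUSIBLE = 315532800


class SourceDateError(ValueError):
    """Raised when no usable source timestamp can be determined."""


def pick_source_date_epoch(mtimes, now=None, floor=EARLIEST_PLAUSIBLE):
    """Pick the latest plausible timestamp from an iterable of mtimes.

    A timestamp later than `now` is treated as a fatal error; timestamps
    below `floor` are ignored.
    """
    now = int(time.time()) if now is None else int(now)
    candidates = []
    for mtime in mtimes:
        mtime = int(mtime)
        if mtime > now:
            raise SourceDateError(f"timestamp {mtime} is in the future")
        if mtime >= floor:
            candidates.append(mtime)
    if not candidates:
        raise SourceDateError("no plausible timestamp found")
    return max(candidates)
```

Passing `now` explicitly keeps the function deterministic, which is convenient when testing it.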

Not everyone may follow the same exact guidelines for reproducibility, but testing for reproducibility isn't hard. If the guidelines deliver, and don't get in the way of other goals, that's ultimately what matters.

@memsharded
Member

> Iterating through all files after unpacking could work too, but that's a less integrated approach.

If this is possible, I think it would be the best approach:

  • It works equally well for all unzip/untargz and even possibly others (xz, etc)
  • Easier to implement (that means typically more robust too)
  • It works when downloading and merging from more than one tarball (a real use case I have seen)
  • It provides a nicer interface:
from conan.tools.files import get, source_date_epoch
get(self, url, "myfile.zip", ...)
source_date = source_date_epoch(self, self.source_folder)
apply_conandata_patches(self)  # this can be done later

# vs

# Yes, one liner instead of 2 lines, still uglier interface
from conan.tools.files import get
source_date = get(self, url, .... compute_source_date=True)
apply_conandata_patches(self)  # this can be done later

@iskunk
Contributor Author

iskunk commented Aug 17, 2023

Okay, so you'd prefer a new function using os.walk().

I suppose that is reasonable for now, while reproducible builds are still somewhat of a specialized use case. There will probably be pressure to speed things up in the future, especially if Conan Center decides to go that way. (Getting timestamps straight from the archive metadata would mean no stat() calls and no additional tree recursion.) I don't think having get() return the timestamp directly would be the way to go: there is additional metadata that could be returned, and that should be allowed for in the future (perhaps return a dict?). I'm not sure how multiple get() calls should be handled. Maybe a get_multiple() function that takes a list and returns info applicable to all the items?

I would say, give some low-priority thought cycles to how this might be handled via get() in the future, but for now, a separate function that walks through the unpacked tree is a straightforward chunk of work. I'll see what I can put together.
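
A minimal sketch of such a tree-walking function, assuming it runs before any patches are applied (patched files would carry a current timestamp). The names `newest_mtime_in_tree` and `env_with_source_date` are hypothetical and not Conan APIs. The approach also assumes the unpacking step restored the archive's mtimes onto the files: Python's tarfile does so by default, whereas zipfile's extraction, as far as I know, does not.

```python
import os


def newest_mtime_in_tree(root):
    """Walk an unpacked source tree and return the newest file mtime
    (seconds since the epoch), or None if the tree contains no files."""
    newest = None
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # lstat() does not follow symlinks, so a link pointing outside
            # the tree cannot influence the result.
            mtime = int(os.lstat(path).st_mtime)
            if newest is None or mtime > newest:
                newest = mtime
    return newest


def env_with_source_date(epoch, base_env=None):
    """Return a copy of the environment with SOURCE_DATE_EPOCH set."""
    env = dict(os.environ if base_env is None else base_env)
    env["SOURCE_DATE_EPOCH"] = str(int(epoch))
    return env
```

The resulting dictionary could then be handed to whatever launches the build, for example as the `env` argument of `subprocess.run`.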
