Explanation of info- & pkg ordering #182

Closed
wolfv opened this issue Dec 20, 2022 · 4 comments
Labels
type::documentation request for improved documentation

Comments

@wolfv
wolfv commented Dec 20, 2022

I see that the recent release changed the order of info and pkg archives in the .conda format (it's mentioned in the Changelog as well). I tried to go through some PRs but couldn't find the reasoning for the change. Would be curious to hear why this was done :)

@mbargull
Member

Would be good to document this, yes.
I can't say why it was changed, but a good explanation is that the outer archive is a Zip file, so the outer archive's index (the central directory) is at the end of the file. If you put the info-*.tar.zst at the end too, you can fetch the metadata with a single fetch (from disk or an HTTP server).
(With the former .tar.bz2 format you'd want info at the beginning of the index-less tarball, of course.)
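
A minimal sketch of the layout argument above, using only the standard library. The member names, payload sizes, and the 4096-byte tail guess are made up for illustration, and the members are stored uncompressed rather than as real zstd tarballs. It builds a zip with the pkg- member first and the info- member last, then reports how far from the end of the file each member starts.

```python
import io
import zipfile


def build_demo_archive() -> bytes:
    """Build an in-memory zip with a large pkg- member followed by a small info- member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        # Hypothetical stand-ins for the real payloads.
        zf.writestr("pkg-demo-1.0-0.tar.zst", b"P" * 100_000)
        zf.writestr("info-demo-1.0-0.tar.zst", b"I" * 2_000)
    return buf.getvalue()


def distance_from_end(data: bytes) -> dict[str, int]:
    """Map each member name to the distance between its local header and the end of the file."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return {m.filename: len(data) - m.header_offset for m in zf.infolist()}


archive = build_demo_archive()
distances = distance_from_end(archive)
TAIL_GUESS = 4096  # assumed size of a single tail read, not a documented value

for name, dist in distances.items():
    where = "inside" if dist <= TAIL_GUESS else "outside"
    print(f"{name}: starts {dist} bytes from the end ({where} a {TAIL_GUESS}-byte tail)")
```

In this toy archive, one read of the last 4096 bytes covers the central directory and the whole info- member, while the pkg- member starts far outside it. The tail size is still a guess: if the info- member were larger than the guess, a second read would be needed.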

@mbargull mbargull added the type::documentation request for improved documentation label Dec 20, 2022
@wolfv
Author

wolfv commented Dec 22, 2022

Hmm, but you don't know beforehand how large the info-*.tar.zst file is, right? Do you mean one would fetch N bytes and hope that this covers both the zip index and the info-*.tar.zst part?

@baszalmstra

Wouldn't it make much more sense to put it at the start? If I understand zip correctly, every file in the zip is preceded by a zip local file header. If we always put the info archive at the start of the zip, we could stream the contents of the entire file with a regular GET request, since the local file header contains all the information you need. There would be no need to inspect the zip's central directory at all, which would really simplify the handling. It would actually be similar to how the tar.bz2 files are handled currently.

Having the central directory of the zip at the end really makes things hard.

Obviously too late now because .conda files are already widespread. 🤷
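
A rough sketch of the "read the local file header and stream" idea, assuming a hypothetical archive where the info- member comes first and is stored uncompressed. `read_first_member` and the demo member names are illustrative and do not come from any conda library. One caveat: the local header only carries the sizes when bit 3 of the general-purpose flags is clear; writers that stream to a non-seekable output set that bit and put the sizes in a trailing data descriptor instead.

```python
import io
import struct
import zipfile

# Fixed 30-byte part of a zip local file header (little-endian):
# signature, version, flags, method, mod time, mod date,
# crc32, compressed size, uncompressed size, name length, extra length
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def read_first_member(stream) -> tuple[str, bytes]:
    """Read the first member of a zip from a forward-only stream, without the central directory."""
    fields = LOCAL_HEADER.unpack(stream.read(LOCAL_HEADER.size))
    signature, _version, flags, method = fields[:4]
    compressed_size, _uncompressed_size, name_len, extra_len = fields[7:]
    if signature != b"PK\x03\x04":
        raise ValueError("not a zip local file header")
    if flags & 0x08:
        raise ValueError("sizes are stored in a trailing data descriptor, not in the header")
    if method != zipfile.ZIP_STORED:
        raise NotImplementedError("this sketch only handles stored (uncompressed) members")
    name = stream.read(name_len).decode("utf-8")
    stream.read(extra_len)  # skip the extra field
    return name, stream.read(compressed_size)


# Hypothetical archive with the info- member first.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("info-demo-1.0-0.tar.zst", b"I" * 2_000)
    zf.writestr("pkg-demo-1.0-0.tar.zst", b"P" * 100_000)

first_name, first_payload = read_first_member(io.BytesIO(buf.getvalue()))
print(first_name, len(first_payload))
```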

@dholth
Contributor

dholth commented Jan 5, 2023

conda-package-streaming has good support for reading partial remote zip archives, and it uses this to get the info out of a .conda in a maximum of 3 remote requests. For that approach it doesn't matter where the info is inside the zip.

It was done so that this transmute implementation https://github.com/conda/conda-package-streaming/blob/main/conda_package_streaming/transmute.py#L72 could buffer the usually-small info in memory while writing the pkg- directly to the zip archive.
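
A simplified sketch of that buffering pattern, not the actual transmute code. `split_into_conda_like_zip` and the member names are hypothetical, and plain uncompressed tar members stand in for the real `.tar.zst` ones because the sketch sticks to the standard library. The standard library `zipfile` allows only one open write handle at a time, so while the pkg- member is being streamed the info files have to wait in memory, which naturally places the info- member last.

```python
import io
import tarfile
import zipfile


def split_into_conda_like_zip(members, out) -> None:
    """Write (name, data) pairs into a zip: pkg- tar streamed first, buffered info- tar last.

    Names starting with "info/" go to the in-memory info tar; everything else
    is streamed straight into the pkg- member of the zip.
    """
    info_buf = io.BytesIO()
    with zipfile.ZipFile(out, "w", compression=zipfile.ZIP_STORED) as zf:
        with tarfile.open(fileobj=info_buf, mode="w") as info_tar:
            # Only one zip member can be open for writing, so it is the pkg- one.
            with zf.open("pkg-demo.tar", "w") as pkg_stream:
                with tarfile.open(fileobj=pkg_stream, mode="w|") as pkg_tar:
                    for name, data in members:
                        target = info_tar if name.startswith("info/") else pkg_tar
                        entry = tarfile.TarInfo(name)
                        entry.size = len(data)
                        target.addfile(entry, io.BytesIO(data))
        # The pkg- member is closed now, so the buffered info tar can be written.
        zf.writestr("info-demo.tar", info_buf.getvalue())


out = io.BytesIO()
split_into_conda_like_zip(
    [
        ("info/index.json", b"{}"),
        ("lib/libdemo.so", b"x" * 1000),
        ("info/paths.json", b"[]"),
    ],
    out,
)
with zipfile.ZipFile(io.BytesIO(out.getvalue())) as zf:
    print(zf.namelist())
```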

There are streaming zip implementations for Python that ignore the central directory, but the excellent standard library zipfile is not one of them.

The order doesn't matter for conda-package-handling's create because it asks for a complete list of info and pkg members ahead of time. https://github.com/conda/conda-package-handling/blob/main/src/conda_package_handling/conda_fmt.py

@dholth dholth closed this as completed Oct 18, 2023