ENH: Make zstd compressed index files available #648

Closed
2 tasks done
corneliusroemer opened this issue Nov 6, 2022 · 20 comments · Fixed by conda/conda-index#65
Labels
in-progress issue is actively being worked on locked [bot] locked due to inactivity type::feature request for a new feature or capability

Comments

@corneliusroemer

Checklist

  • I added a descriptive title
  • I searched open requests and couldn't find a duplicate

What is the idea?

Right now, conda channel servers provide .bz2 compressed indexes. That's alright, but bz2 isn't really state of the art anymore. Instead, zstd has quickly taken over for fast compression/decompression with good compression ratios.

It would be great if you could offer zstd compressed indexes in addition to bz2 and on the fly gzip compressed ones.

Why is this needed?

Mamba channel index downloads are currently rate limited by gzip server compression (see #637).

While this issue could eventually get fixed, mamba developers have stated that they are not keen to add bz2 support to mamba and instead would prefer zstd compressed indexes.

It would be great if either #637 or this issue could be implemented fairly soon as it does cause mamba to be quite a bit slower than it should be.

What should happen?

No response

Additional Context

See here for @wolfv's proposal to add zstd compressed indexes: mamba-org/mamba#2021 (comment) :

I am not a big fan of the added complexity of using bz2 encoded files and I also don't like bz2 anymore (zstd would be cool!).

@corneliusroemer corneliusroemer added the type::feature request for a new feature or capability label Nov 6, 2022
@jezdez
Member

jezdez commented Nov 7, 2022

Thanks for opening this issue, that seems reasonable to me, but I'm wondering if @barabo has opinions on this?

@dholth
Contributor

dholth commented Nov 8, 2022

How would you feel about supporting on-the-fly content-encoding: zstd?

I wish we could stop generating repodata.json and then one or more compressed copies of the same.

@corneliusroemer
Author

On the fly zstd isn't a bad idea - it should be faster than gz compression - but at which point would you implement this? Who would do the compression, cloudflare?

@dholth
Contributor

dholth commented Nov 8, 2022

Not sure, we're talking about anaconda.org dynamically generated channels and not defaults / conda-forge static hosted channels?

@corneliusroemer
Author

corneliusroemer commented Nov 8, 2022

I'm not quite sure what the difference is. The purpose is to speed up download of things like https://conda.anaconda.org/conda-forge/linux-64/repodata.json which is currently limited to ~30MB/s even if gzipped, because Cloudflare's on-the-fly gzip compression can't go faster than this. See #637 (comment)

I don't know whether conda-forge/linux-64/repodata.json is dynamic or static.

@dholth
Contributor

dholth commented Nov 14, 2022

conda/conda-index#65

@jezdez jezdez linked a pull request Nov 14, 2022 that will close this issue
@corneliusroemer
Author

Closed by conda/conda-index#65

Thank you @dholth 🎉 When will the zst output file be available for download from, say, https://conda.anaconda.org/conda-forge/linux-64/repodata.json.zst?

@dholth
Contributor

dholth commented Nov 14, 2022

It is available for repo.anaconda.com defaults. It will be available on conda-forge after the channel clone system update is deployed.

@dholth dholth added the in-progress issue is actively being worked on label Nov 15, 2022
@corneliusroemer
Author

@dholth thanks a lot for this, great work! Do you have an indication of when the channel clone system update will be deployed? It would be great if you could comment here once it is, so that it can be tested.

@jezdez
Member

jezdez commented Nov 28, 2022

Quick note that Anaconda is on a company holiday today.

@dholth
Contributor

dholth commented Dec 15, 2022

@corneliusroemer repodata.json.zst should be available on conda-forge, and on repo.anaconda.com/main (defaults)!

Please experiment. Is it byte-identical? Are the caches invalidated at the same time? How's the speed?

@corneliusroemer
Author

corneliusroemer commented Dec 15, 2022

@dholth that's fantastic news 🎉

I can confirm that conda-forge has repodata.json.zst available now, and it downloads really fast in ~1s for me:

❯ curl -L https://conda.anaconda.org/conda-forge/linux-64/repodata.json.zst -o test.json.zst     
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 24.0M  100 24.0M    0     0  20.3M      0  0:00:01  0:00:01 --:--:-- 20.2M

There seem to be some differences, most notably the zst repodata has extra lines: license_family in a lot of packages, and also some extra packages.

Here is a diff from zstd to uncompressed:
diff.txt

Calculated as follows:

❯ curl -L https://conda.anaconda.org/conda-forge/linux-64/repodata.json >uncomp.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  225M  100  225M    0     0  37.2M      0  0:00:06  0:00:06 --:--:-- 42.4M

❯ curl -L https://conda.anaconda.org/conda-forge/linux-64/repodata.json.zst | zstdcat >zstd.json
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 24.0M  100 24.0M    0     0  29.9M      0 --:--:-- --:--:-- --:--:-- 29.9M

❯ json-diff zstd.json uncomp.json >diff.txt

So the zstd-compressed file appears to be a superset of the uncompressed one.
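
For anyone who wants to repeat this comparison without json-diff, here is a minimal Python sketch. It assumes both files are already decompressed and loaded as dictionaries, and that package records sit under the top-level "packages" and "packages.conda" keys. The function name compare_repodata and the tiny sample records are hypothetical, not data from the real channel.

```python
import json


def compare_repodata(a: dict, b: dict, sections=("packages", "packages.conda")) -> dict:
    """Summarize how repodata `a` differs from repodata `b`.

    Reports package filenames present on only one side, and for packages
    present on both sides, the record keys that exist only in `a`.
    """
    report = {"only_in_a": [], "only_in_b": [], "extra_keys_in_a": {}}
    for section in sections:
        recs_a = a.get(section, {})
        recs_b = b.get(section, {})
        report["only_in_a"] += sorted(set(recs_a) - set(recs_b))
        report["only_in_b"] += sorted(set(recs_b) - set(recs_a))
        for filename in sorted(set(recs_a) & set(recs_b)):
            extra = sorted(set(recs_a[filename]) - set(recs_b[filename]))
            if extra:
                report["extra_keys_in_a"][filename] = extra
    return report


if __name__ == "__main__":
    # Hypothetical miniature inputs standing in for zstd.json and uncomp.json.
    zstd_side = {"packages": {
        "x-1.0-0.tar.bz2": {"name": "x", "license_family": "MIT"},
        "y-1.0-0.tar.bz2": {"name": "y"},
    }}
    plain_side = {"packages": {"x-1.0-0.tar.bz2": {"name": "x"}}}
    print(json.dumps(compare_repodata(zstd_side, plain_side), indent=2))
```

With the real files, the inputs would come from json.load on zstd.json and uncomp.json. If "only_in_b" is empty and "extra_keys_in_a" is empty for the reverse call, the first file is a superset in the sense described above.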

cc @jonashaag for your PR for mamba support

@dholth
Contributor

dholth commented Dec 15, 2022

@corneliusroemer try a cache-busting technique like curl ?random, or curl -I (what is last-modified). I don't think it is possible for it to be the same as repodata-from-packages.json. https://github.com/conda/conda-index/blob/main/conda_index/index/__init__.py#L817

@corneliusroemer
Author

Ah yes, that worked @dholth - appending a random URL param got me identical files, thanks!

curl -L "https://conda.anaconda.org/conda-forge/linux-64/repodata.json?rand=$(shuf -i 1-10000000 -n 1)" >uncomp.json
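
The same cache-busting trick can be sketched in Python. This is only an illustration: the helper name add_cache_buster and the default parameter name "rand" are hypothetical, and whether a given CDN treats a new query string as a separate cache entry is an assumption that this thread's observation supports but does not guarantee.

```python
import secrets
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_cache_buster(url: str, param: str = "rand") -> str:
    """Return `url` with a random query parameter appended.

    Existing query parameters are preserved; only a new one is added.
    """
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((param, secrets.token_hex(8)))  # 16 random hex characters
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(add_cache_buster(
        "https://conda.anaconda.org/conda-forge/linux-64/repodata.json"
    ))
```

The resulting URL can be passed to curl -L or any HTTP client in place of the plain one.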

@dholth
Contributor

dholth commented Dec 15, 2022

Was the cached repodata.json older than your repodata.json.zst then?

Normally they should all have a HTTP Last-Modified within a few seconds of each other. It's also possible to download both of them at exactly the wrong moment, when one has been updated and the other hasn't.

@corneliusroemer
Author

The differences were reproducible - it wasn't because I downloaded one just before the other.

So it looks like it was a cache thing. I didn't check last-modified headers. I'll play more with it and will see whether it is indeed just a stale cache.

@dholth
Contributor

dholth commented Dec 15, 2022

We will check the cache invalidation logic.

@dholth
Contributor

dholth commented Dec 15, 2022

Better: the last-modified time and the current time printed by % curl -I https://conda.anaconda.org/conda-forge/linux-64/current_repodata.json.zst | grep last-modified ; date -u should be within 10-20 minutes of each other now.

@corneliusroemer
Author

Works for me. What's the difference between repodata and current_repodata?

❯ curl -I https://conda.anaconda.org/conda-forge/linux-64/current_repodata.json.zst | grep last-modified ; date -u
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0 6282k    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
last-modified: Thu, 15 Dec 2022 23:28:47 GMT
Thu Dec 15 23:47:47 UTC 2022

❯ curl -I https://conda.anaconda.org/conda-forge/linux-64/current_repodata.json | grep last-modified ; date -u 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0 54.3M    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
last-modified: Thu, 15 Dec 2022 23:28:47 GMT
Thu Dec 15 23:47:54 UTC 2022
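
The freshness check above can also be scripted. The following is a minimal sketch using only the Python standard library; the function names last_modified_age and headers_in_sync and the 20-minute tolerance are illustrative assumptions, and the header values are the ones printed in the output above.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def last_modified_age(last_modified: str, now: Optional[datetime] = None) -> timedelta:
    """Return how long ago an HTTP Last-Modified value was, relative to `now`."""
    modified = parsedate_to_datetime(last_modified)
    if now is None:
        now = datetime.now(timezone.utc)
    return now - modified


def headers_in_sync(first: str, second: str,
                    tolerance: timedelta = timedelta(minutes=20)) -> bool:
    """Return True if two Last-Modified values are within `tolerance` of each other."""
    delta = parsedate_to_datetime(first) - parsedate_to_datetime(second)
    return abs(delta) <= tolerance


if __name__ == "__main__":
    header = "Thu, 15 Dec 2022 23:28:47 GMT"
    checked_at = datetime(2022, 12, 15, 23, 47, 47, tzinfo=timezone.utc)
    print(last_modified_age(header, checked_at))  # 0:19:00
    print(headers_in_sync(header, header))        # True
```

For the .zst request above this gives an age of 19 minutes, inside the 10-20 minute window, and both files report the same last-modified value.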

@dholth
Contributor

dholth commented Dec 16, 2022

Current is the newest version of everything plus its dependencies. Conda tries that first.
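
The "tries that first" behavior can be pictured as an ordered fallback. The sketch below is not conda's actual code: first_usable and the attempt callback are hypothetical, and what counts as a usable result (a successful download, a successful solve) is left to the caller.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

T = TypeVar("T")


def first_usable(names: Iterable[str],
                 attempt: Callable[[str], Optional[T]]) -> Tuple[str, T]:
    """Try each index file name in order and return the first usable result.

    `attempt` returns None when a file cannot be used, which moves the
    search on to the next name.
    """
    for name in names:
        result = attempt(name)
        if result is not None:
            return name, result
    raise LookupError("no index file produced a usable result")


if __name__ == "__main__":
    # Hypothetical callback: pretend only the full index is usable.
    def attempt(name: str) -> Optional[dict]:
        return {"source": name} if name == "repodata.json" else None

    print(first_usable(["current_repodata.json", "repodata.json"], attempt))
```

Putting the smaller current_repodata.json first keeps the common case cheap, with the full repodata.json as the fallback.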

@github-actions github-actions bot added the locked [bot] locked due to inactivity label Dec 16, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 16, 2023