Conda servers don't seem to cache gzip compressed indexes - limiting download speed to 2-3MB/s (compressed) #637

Closed
corneliusroemer opened this issue Oct 18, 2022 · 34 comments
Labels: type::bug, locked [bot]

Comments

@corneliusroemer commented Oct 18, 2022

Checklist

  • I added a descriptive title
  • I searched open reports and couldn't find a duplicate

What happened?

Using mamba, I noticed that the maximum download speed of the compressed package index JSONs is 2-3 MB/s, well below my connection's download speed.

Notably, downloading the uncompressed file was faster than asking the server to serve gzip.

This suggests that gzip-compressed files are not cached.
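
A quick way to sanity-check this (a sketch, assuming Cloudflare's usual headers) is to compare the response headers with and without asking for gzip:

# identity (uncompressed) request: check cache status and encoding
curl -sI https://conda.anaconda.org/conda-forge/linux-64/repodata.json | grep -iE 'cf-cache-status|content-encoding'
# gzip request: if the CDN compresses on the fly, throughput is capped by the compressor
curl -sI -H 'Accept-Encoding: gzip' https://conda.anaconda.org/conda-forge/linux-64/repodata.json | grep -iE 'cf-cache-status|content-encoding'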

Additional Context

This was discussed and investigated at length in the mamba issue; please check it for further details: mamba-org/mamba#2021

I was asked to open this issue here by @jakirkham: conda-forge/conda-forge.github.io#1835 (comment)

@jonashaag @wolfv

@jakirkham (Member)

cc @jezdez

@corneliusroemer (Author)

Any progress here? It would be amazing if mamba channel downloads could take only 2 s instead of 20 s :)

@jezdez (Member) commented Nov 7, 2022

No progress so far that would be visible here. @barabo, can you look into what's causing this (or redirect to the appropriate person)? This smells like a CDN misconfiguration to me.

@barabo commented Nov 7, 2022

I looked into this a bit last week and couldn't find anything obviously wrong in the CDN configuration for the bz2 compressed repodata files. It was curious, however, that I couldn't find any record of mamba user agents downloading the bz2 repodata files. It seems that mamba user agents exclusively download repodata.json files from channels.

[two screenshots of CDN request analytics by user agent]

In general, though, I think the team that runs the anaconda.org server prefers that users not download the bz2 repodata, because it takes longer to generate server-side. And since those repodata.json files are generated per-request, they specify that Cloudflare not cache them (nor the bz2 files) (cache status = dynamic).

For the cloned channels (conda-forge, bioconda, pytorch, etc.), there's no problem downloading the bz2 repodata; it's just relatively uncommon for anyone to do it.

[two screenshots of CDN analytics for the cloned channels]

@wolfv commented Nov 7, 2022

@barabo yes, we're never using the bz2 files.

This is about the on-the-fly gzip compression. You might be able to cache the gzipped output to serve the static files faster (just a theory), since the files are not that dynamic (they change every 30 minutes or so).

@barabo commented Nov 7, 2022

Well, I don't know what to say. Cloudflare's docs say that they do compress certain content-types by default, and it looks like the repodata.json files are fetched as one of those types. So, I expect Cloudflare is gzip compressing them. I don't think we're providing the headers necessary to disable compression.

(base) canderson@carls-mbp-2 mamba % curl -I https://conda.anaconda.org/conda-forge/linux-64/repodata.json
HTTP/2 200
date: Mon, 07 Nov 2022 22:43:43 GMT
content-type: application/json
content-length: 228290132
cf-ray: 766996e68a9413eb-ORD
accept-ranges: bytes
age: 324
cache-control: public, max-age=1200
etag: "2ee6b095b3f5fb9dda17f1b2741201cc"
expires: Mon, 07 Nov 2022 23:03:43 GMT
last-modified: Mon, 07 Nov 2022 22:37:52 GMT
vary: Accept-Encoding
cf-cache-status: HIT
set-cookie: __cf_bm=ogC55_Mnd.vxNKi64N8y5NMwCkAmbTi54a7gOTljyEg-1667861023-0-AR3kbil5B1sYC6XpDoAiJYLvMxpA2RsIw7EMZylo/0xD9tXjEIT2rT9nzlb6g1WQD/kDEGcWitubLaZ10maRNHh7vBC8K0g7ZaJcueAqCibq; path=/; expires=Mon, 07-Nov-22 23:13:43 GMT; domain=.anaconda.org; HttpOnly; Secure; SameSite=None
server: cloudflare

But on the other hand, this section of the Cloudflare docs suggests that the provided user agent can determine whether gzip or brotli (or both) will be used. I wonder if the mamba user agents aren't recognized by Cloudflare in an optimal way. I don't think I have access to inspect how this is done, though.

However, in my local testing, I don't see any performance penalty for using the mamba user agent strings. I typically get 20-40 MB/s down (when downloading the conda-forge linux-64 repodata.json), regardless of which agent string I provide to curl.
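
For reference, this comparison can be scripted with curl alone (a sketch; the user-agent strings below are illustrative placeholders, not mamba's exact values):

# print the average download speed (bytes/s) for each user-agent string
for ua in 'curl/7.86.0' 'libmamba/1.0'; do
  curl -s -A "$ua" --compressed -o /dev/null \
    -w "$ua: %{speed_download} bytes/s\n" \
    https://conda.anaconda.org/conda-forge/linux-64/repodata.json
done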

@corneliusroemer (Author) commented Nov 8, 2022

Gzip compression clearly does happen, but its on-the-fly nature seems to be the bottleneck.

Maybe it would be possible to explicitly add a repodata.json.gz so that Cloudflare doesn't need to recompress on every request?

Alternatively, one could implement Zstd compressed repodata.json.zst (#648) which @wolfv seems to be happy to accept.

Here's another benchmark showing that --compressed clearly works but has a raw download speed of 3 MB/s, as opposed to the bz2 download, which gets a raw speed of 8 MB/s - and 8 MB/s is the max download speed I have. You (@barabo) seem to get 20-40 MB/s raw download speed for bz2, but should still only get ~3 MB/s if you use --compressed. Is that correct?

$ curl --compressed https://conda.anaconda.org/conda-forge/linux-64/repodata.json > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 26.2M    0 26.2M    0     0  3059k      0 --:--:--  0:00:08 --:--:-- 3119k

$ curl https://conda.anaconda.org/conda-forge/linux-64/repodata.json > /dev/null 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  217M  100  217M    0     0  8288k      0  0:00:26  0:00:26 --:--:--  9.9M

$ curl https://conda.anaconda.org/conda-forge/linux-64/repodata.json.bz2 > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 22.8M  100 22.8M    0     0  8372k      0  0:00:02  0:00:02 --:--:-- 8370k

@dholth (Contributor) commented Nov 8, 2022

This is where the compression happens https://github.com/conda/conda-index/blob/main/conda_index/index/__init__.py#L798

@corneliusroemer (Author) commented Nov 8, 2022

> This is where the compression happens https://github.com/conda/conda-index/blob/main/conda_index/index/__init__.py#L798

Thanks @dholth! That is the explicit bz2 compression, whereas the slow gzip compression is probably Cloudflare's on-the-fly compression, if we're not mistaken.
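
For intuition, that server-side step amounts to pre-compressing the index once (a sketch; conda-index does this in Python, and the gzip line is hypothetical):

bzip2 -9 -k repodata.json   # what conda-index publishes today: repodata.json.bz2
gzip -9 -k repodata.json    # hypothetical repodata.json.gz the CDN could serve as a static cached object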

@dholth (Contributor) commented Nov 8, 2022

I do like zstd and would like to remove bz2 entirely. Conda currently looks for bz2 but doesn't use it, which is weird. If we were to support zstd, that function would need to be updated, we'd need a new flag on conda-index, and we'd have to update a glob pattern on our CDN sync.

@corneliusroemer (Author)

mamba doesn't use bz2 either, and @wolfv has mentioned that he would be happy for mamba to use zstd, whereas he's not keen on bz2 - so that sounds like a good plan!

@dholth (Contributor) commented Nov 8, 2022

Also when's conda-forge going to produce .conda by default?

@corneliusroemer (Author)

I'm not quite sure what .conda means in this context; also, I'm not a conda-forge person 😀 Maybe @jakirkham or @jezdez know?

@jakirkham (Member)

When Anaconda.org and the CDN can support them. IIUC that is not yet the case (happy to learn I'm wrong).

@dholth (Contributor) commented Nov 8, 2022

They should be supported now on anaconda.org and the CDN! We should make one and see how it goes.

@corneliusroemer the .conda format for conda packages uses zstd instead of .tar.bz2

@jakirkham (Member)

Great! 🎉 For a long time that wasn't the case.

There's some other work that would need to be done on the conda-forge side first ( conda-forge/conda-forge.github.io#1586 )

@jakirkham (Member)

A separate question is how we would go about converting the existing packages to .conda

@jakirkham (Member)

cc @beckermr

@dholth (Contributor) commented Nov 8, 2022

I would suggest not converting the existing packages. It would be slow, take a lot of disk space, and double the size of repodata.json.

@beckermr commented Nov 8, 2022

We need to do a bunch of manual testing for .conda packages before we can roll that out. I've put in PRs in many places where the extension is assumed, but for sure we are going to miss things.

I don't have a ton of conda-forge dev time these days, so it has been slow going on my end.

@jakirkham (Member)

Maybe we could come up with a list of things to do so folks like Cornelius could help?

@beckermr commented Nov 8, 2022

Sure. See the attached issue. We need PRs to conda-smithy next. I will add a few other items.

@jonashaag

I would be willing to invest some time into .conda packages if you can guide me to some open problems that I’m able to work on.

That said, this discussion is off topic, shall we split into a separate thread?

Re gzip compression: IIUC, the question of on-the-fly vs. cached gzip responses is still open. I can try to find a Cloudflare CDN setup that uses cached gzip responses in a personal Cloudflare account and report back here.

@beckermr commented Nov 8, 2022

I'd like all conda-forge related things to appear in conda-forge repos, so yes, let's move any discussion over there. The next item, @jonashaag, is a PR to conda-smithy to optionally turn on .conda artifacts.

@jonashaag commented Nov 14, 2022

Result from trying out gzip compression on a personal Cloudflare account: you cannot make Cloudflare use cached/pre-compressed responses; it will always re-compress your files. (Maybe caching works for smaller files, not sure.)

So to improve repodata load times, we need to request a pre-compressed file (e.g. repodata.json.bz2).
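
For example (a sketch; the pre-compressed artifact is served as a static cached object, bypassing the on-the-fly compressor):

curl -s -o repodata.json.bz2 https://conda.anaconda.org/conda-forge/linux-64/repodata.json.bz2
bunzip2 -k repodata.json.bz2   # decompress locally; keeps the .bz2 around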

For the selection of the compression format, here are some non-scientific compression benchmarks (b = bzip2, g = gzip, z = zstd; times in seconds, ratio = compressed size / original size):

method   compression time (s)   compression ratio
orig      0                     1
b9       13.9                   0.105016637432424
b1       11.6                   0.116227201341733
g9        5.8                   0.125971405821175
g7        2.3                   0.128959094531222
g5        1.8                   0.134876966036568
g1        0.9                   0.16432670929425
z15      10.5                   0.121263110685134
z10       2.0                   0.122599202681385
z5        0.8                   0.129368847595369
z1        0.3                   0.115159940890874

So zstd is the clear winner here. It's interesting that -1 compresses smaller than -5; that's not a benchmarking mistake. I'm going to report an issue upstream about this.
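
A rough way to reproduce these numbers (a sketch; measures wall-clock compression time and compressed size on the same repodata.json):

for level in 1 5 7 9; do /usr/bin/time -p gzip -$level -c repodata.json | wc -c; done
for level in 1 5 10 15; do /usr/bin/time -p zstd -$level -c repodata.json | wc -c; done
for level in 1 9; do /usr/bin/time -p bzip2 -$level -c repodata.json | wc -c; done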

@corneliusroemer (Author) commented Nov 14, 2022

@jonashaag Good to have a benchmark on the real repodata.json! zstd is known to be very performant, so not a huge surprise ;)

Interesting that zstd -1 performs so well. I guess that's the zstd mode we should use then!

I think your time is for compression? Would be good to have both - zstd is particularly good for fast decompression as well.

You can get higher compression ratios with zstd --ultra -20 but that takes unreasonably long for minor gains.

I'll add a decompression speed benchmark based on:

curl https://conda.anaconda.org/conda-forge/linux-64/repodata.json > repodata.json
time zstd -1 repodata.json                    # compress; writes repodata.json.zst, keeps the original
time zstd -dc repodata.json.zst > /dev/null   # decompress to stdout and discard (avoids overwriting repodata.json)
...
method   decompression time (s)
orig       -
b9       13.9
b1       11.6
g9        0.33
g1        0.36
z1        0.12

So zstd -1 is the best for compression and decompression speed, and only loses to b9 by a small margin on compression ratio. Overall a very clear result: we should use zstd.

@jonashaag

Try zstd -b1 :)
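
zstd's built-in benchmark mode compresses and decompresses in memory and reports both speeds, e.g.:

zstd -b1 repodata.json        # benchmark level 1: ratio plus compression and decompression speed
zstd -b1 -e19 repodata.json   # benchmark every level from 1 through 19 in one run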

@corneliusroemer (Author)

Oh that's neat! Learned something again!

Looks like we'll soon get zstd-compressed output, @jonashaag - see conda/conda-index#65

Time to add mamba support 🙃

@jonashaag

🤩🤩🤩

@jakirkham (Member)

Are you already aware of the libmamba solver for Conda?

@dholth (Contributor) commented Nov 14, 2022

We decided on zstd -T0 -16 for the repodata.json.zst produced by conda/conda-index. At that level the compression starts to get a bit slow. Decompression is very fast at all levels.

Smaller or private channels hosted on https://anaconda.org/conda may have repodata.json generated on-demand; in that case zstd -1 or -3 might be warranted.
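
For reference, those invocations look roughly like this sketch (-T0 uses all available CPU cores):

zstd -T0 -16 repodata.json -o repodata.json.zst   # CDN channels: slower to build, smaller to ship
zstd -T0 -3 repodata.json -o repodata.json.zst    # on-demand generation: much faster, still compact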

@dholth (Contributor) commented Nov 16, 2022

WOW - on my home internet, repodata.json (uncompressed) curls to /dev/null faster than repodata.json (compressed). Naturally, repodata.json.zst beats them both. Thanks for reporting!

@dholth closed this as completed Nov 16, 2022
@dholth reopened this Nov 16, 2022
@dholth closed this as not planned Nov 16, 2022
@github-actions bot added the locked [bot] label and locked the conversation as resolved Nov 17, 2023