Switch the backend to something other than Git-based solutions #2780

Closed
xtendo-org opened this Issue Nov 15, 2016 · 11 comments

8 participants

xtendo-org commented Nov 15, 2016

I've tried but couldn't find an issue that addresses exactly this problem, so I'm creating a new one. Please let me know if one already exists, or if I've gotten anything wrong.

According to this comment by a GitHub engineer, it seems using GitHub (or any Git-based solution at all) as a package manager's backend causes severe damage to the tool's performance.

I recently attempted a Haskell "boot camp" at the company I work for, and recommended that all participants install Stack. The most frequently raised inquiry/complaint was that it took ages to install. Some people reported "70 minutes and still not complete." It's 2016, and I think we can agree that if a programming language's tooling takes more than an hour to download and install, something's certainly wrong. As @snoyberg pointed out, the time it takes to actually use it is important, so we should consider this not a performance problem but a blocker for anyone who attempts to enter Haskell.

Although less dramatic, the problem with a Git-based backend is not limited to the initial installation; it pervades the whole tooling. For example, suppose I'd like to choose the latest nightly as the project's resolver. A shell script like

sed -i 's/^resolver: .*/resolver: '$(curl -s https://www.stackage.org/snapshots | grep -o -m 1 "nightly-[0-9]\+-[0-9]\+-[0-9]\+")'/' stack.yaml

takes less than 1.5 seconds to run because the heaviest task here is downloading one HTML file. On the other hand, Stack's built-in command for the same task, stack --resolver nightly solver --update-config, may take more than 10 seconds because it has to git-fetch a repository that contains more than ten thousand commits across more than nine thousand files, which is generally expensive according to the aforementioned comment.
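
A slightly more defensive variant of the one-liner above, as a sketch only: it separates parsing from rewriting and refuses to touch stack.yaml when no snapshot name was found (for example, if the download fails). The function names latest_nightly and set_resolver are made up for illustration; the URL is the one from the script above.

```shell
#!/bin/sh
# Print the first nightly snapshot name found in HTML read from stdin.
latest_nightly() {
    grep -o 'nightly-[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}' | head -n 1
}

# Rewrite the "resolver:" line of a stack.yaml file ($1) to name $2.
# Refuses an empty name so a failed download cannot blank the resolver.
set_resolver() {
    file=$1
    name=$2
    [ -n "$name" ] || { echo "no snapshot name found" >&2; return 1; }
    tmp=$(mktemp) || return 1
    if sed "s/^resolver: .*/resolver: $name/" "$file" > "$tmp"; then
        cat "$tmp" > "$file"   # keeps the original file's permissions
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Usage (needs network access):
#   set_resolver stack.yaml "$(curl -s https://www.stackage.org/snapshots | latest_nightly)"
```

Writing through a temporary file also avoids relying on sed -i, whose syntax differs between GNU and BSD sed.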

One solution I can think of is to make the Stack command line tool switch to using an independent server (e.g. the Stackage website) as backend and avoid GitHub. If Git or GitHub is necessary for versioning or something, that's fine; we can still rely on it, just make it cached or mirrored somewhere so the command line tool won't directly depend on them.

xtendo-org commented Nov 15, 2016

The quickest example that comes to my mind is to have a Git mirror repo and provide its contents as a tar file with a web server like Nginx, renewing the repo once a day with a cronjob like git fetch origin && git rebase origin/master master. This should require little coding/engineering, I'm guessing.
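
A minimal sketch of such a cron job, assuming a bare mirror created once with git clone --mirror. The function name publish_snapshot and the paths in the usage comments are hypothetical. It uses git archive rather than a rebase, because in a mirror clone a fetch already updates the branches and there is no working tree to rebase.

```shell
#!/bin/sh
# Refresh a bare mirror and publish its current tree as a tarball.
#   $1: path to a mirror made with `git clone --mirror <upstream-url> <path>`
#   $2: absolute path of the output file served by the web server
publish_snapshot() {
    mirror=$1
    out=$2
    # In a --mirror clone, fetch updates the local branches directly.
    git -C "$mirror" fetch --quiet --prune origin || return 1
    # Write to a temporary name first so clients never see a partial file.
    git -C "$mirror" archive --format=tar.gz -o "$out.tmp" HEAD || return 1
    mv "$out.tmp" "$out"
}

# Hypothetical crontab entry running a script that calls the function daily:
#   0 3 * * * /usr/local/bin/publish-snapshot.sh
# Example call (placeholder paths):
#   publish_snapshot /srv/mirror/index.git /var/www/index.tar.gz
```

The web server (Nginx in the suggestion above) then only has to serve one static file.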

heejongahn commented Nov 15, 2016

I have also experienced this issue once or twice. When I first encountered it, I was pretty sure that something was wrong, and I even thought of giving up on stack after an (impatient) series of Ctrl-C and re-runs. Though this isn't an everyday issue, I assume it will act as a high barrier for stack users, especially those who have never met this issue before.

I strongly agree that we need a fix for this, as soon as possible. Using an independent server seems like a valid solution to me, but there might be some issues I couldn't figure out. Either way, I think this should be treated as a high-priority issue.

hvr commented Nov 15, 2016

@xtendo-org

The quickest example that comes to my mind is to have a Git mirror repo and provide its contents as a tar file

That's indeed a good and proven approach. That's also how Hackage's package index works, i.e.

which is versioned, contains sha256 (for TUF) & md5 (useful to mirror tooling) hashes in TUF-records, and even allows for fast incremental updates (since the index is only appended to, so there's always a common prefix we can resume from). The logic for all this (and more) is implemented in hackage-security.
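
The append-only property can be illustrated with a small sketch. This is not hackage-security's actual logic (which also checks the hashes mentioned above), and the function names are invented. The idea is only that a client holding an older copy checks that it is a byte-for-byte prefix of the newer index and then appends just the missing tail.

```shell
#!/bin/sh
# Size of a file in bytes (arithmetic expansion strips wc's padding).
file_size() {
    echo $(( $(wc -c < "$1") ))
}

# Succeed if file $1 is a byte-for-byte prefix of file $2.
is_prefix_of() {
    head -c "$(file_size "$1")" "$2" | cmp -s - "$1"
}

# Bring local copy $1 up to date from newer copy $2 by appending only
# the bytes it is missing; fail if the local copy is not a prefix.
append_tail() {
    is_prefix_of "$1" "$2" || return 1
    tail -c "+$(( $(file_size "$1") + 1 ))" "$2" >> "$1"
}
```

In a real client the newer copy would not be available locally; only the tail would be downloaded, which is why the common prefix matters.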

Contributor

snoyberg commented Nov 21, 2016

I'm not going to end up making any decisions here (I don't handle day-to-day management of Stack anymore). However, I'll throw in a few thoughts:

  • The surest way to ensure that no one wants to do something is to be lectured to by Herbert (read: why I'm now opposed to PVP instead of simply not following it). Herbert: we get it, you and Duncan won this battle because you control Hackage. I still think you came up with an incredibly backwards solution.
  • I tried using hackage-security, and couldn't make heads or tails of that library. So if someone wants to see hackage-security happen here, perhaps someone who actually understands the library will want to contribute a patch.
  • All that said: given that hackage-security is the new reality foisted upon us, I actually lean towards moving to it in place of Git, even if it's technically inferior. We may as well reduce the number of additional code paths that various tooling needs to follow.
  • I don't think the issues raised here are in any way insurmountable. We can easily go back to the shallow clones that we've done in the past with the new metadata in snapshots, or simply default to not having the Hackage revision detection in place. (Side note: Hackage revisions is another example of a terrible feature.)

Herbert: please don't turn this issue into a discussion of the complaints I'm raising. I'm pointing them out here to try and encourage you to engage more respectfully on issues in the future.

23Skidoo commented Nov 21, 2016

/cc @edsko, who is the main author of the hackage-security library.

dcoutts commented Nov 21, 2016

I tried using hackage-security, and couldn't make heads or tails of that library

For anyone having a go, a good place to start is the example client which is quite compact (it also demos using http-client as the http impl)

https://github.com/well-typed/hackage-security/tree/master/example-client

It may also be useful to look at the use of the interface in cabal-install where it iterates over the index, getting every revision of every .cabal file. You'd probably want something like that, plus converting into the cached formats that stack uses. In principle the interface supports doing index conversions incrementally, by saving an archive directory index and starting from there (though it has to validate the saved info to know that doing an incremental conversion is ok).

Contributor

alexanderkjeldaas commented Nov 26, 2016

According to this comment by a GitHub engineer, it seems using GitHub (or any Git-based solution at all) as a package manager's backend causes severe damage to the tool's performance.

I don't see anything like that in the comment.

Yes, having 10000 commits might be more costly than copying the resulting file. Git has lots of ways of fixing that, such as squashing or shallow checkouts.

In recent git versions we can now write git clone --shallow-since=<date>, and give all clients the same set of objects. With a reasonable caching strategy on the server side, it should be possible to reuse the calculated bundles.
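
One way to read "give all clients the same set of objects" is to have every client derive the same cutoff date, so that the server sees identical shallow requests it can cache. The sketch below assumes a cutoff of the first day of the current month (UTC); that policy, the function name and the placeholder URL are illustrative only, not something Stack does.

```shell
#!/bin/sh
# Print a cutoff date that is identical for every client during a given
# month: the first day of the current month, in UTC.
shallow_cutoff() {
    date -u +%Y-%m-01
}

# Usage (needs network access; replace the placeholder URL):
#   git clone --shallow-since="$(shallow_cutoff)" https://example.com/index.git
```

A cutoff newer than the latest commit selects nothing and the clone may fail, so a real policy would need some slack (for example, the first day of the previous month).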

xtendo-org commented Nov 27, 2016

@alexanderkjeldaas The comment says that shallow checkout is more costly in the long run.

... most of the initial clones are shallow, meaning that not the whole history is fetched, but just the top commit. But then subsequent fetches don't use the --depth=1 option. Ironically, this practice can be much more expensive than full fetches/clones, especially over the long term.

Contributor

alexanderkjeldaas commented Nov 27, 2016

@xtendo-org maybe I'm misunderstanding, but I thought the "subsequent fetches don't use the --depth=1 option" implied that the repo was converted into a full clone.

But to see how --depth=1 doesn't make sense for that repo, look at
https://github.com/CocoaPods/CocoaPods/releases
and then read
https://blogs.gnome.org/simos/2009/04/18/git-clones-vs-shallow-git-clones/

It seems that a shallow clone, by fetching every tag at depth 1, basically picks up everything when there are 153 releases to shallow clone.

If the cloning is done by date, the results could be very different.

Git can do "anything", so if it needs to fetch less, then there is likely a way to achieve that.

Contributor

snoyberg commented Dec 4, 2016

While I still think Git is the better way overall, it seems that enough people are having connectivity issues with Github that it's worth switching the backend. I have a PR at #2827. This does not address switching to hackage-security for downloads... that codebase does still intimidate me, and I'm not sure how I feel about the partial download bit and having to switch to uncompressed streams for it.

dcoutts commented Dec 13, 2016

having to switch to uncompressed streams for it

@snoyberg it's worth noting that the partial/incremental downloading works on the compressed stream. It's an HTTP range GET on the tail of the .tar.gz file.
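
A rough sketch of what such a request looks like from the client side, using curl's --range option. It only shows the HTTP mechanics: whether appending the fetched bytes yields a valid archive depends on how the server builds the file, and the verification that hackage-security performs is not shown. The function names and the URL in the usage comment are placeholders.

```shell
#!/bin/sh
# Build a byte-range spec that starts right after the local file's last
# byte, e.g. "1024-" for a 1024-byte file (offsets are zero-based).
tail_range() {
    echo "$(( $(wc -c < "$1") ))-"
}

# Append the bytes that local copy $2 is missing from URL $1.
# Assumes the server honours Range requests; a server that ignores them
# would send the whole file and corrupt the local copy.
fetch_tail() {
    curl --silent --fail --range "$(tail_range "$2")" "$1" >> "$2"
}

# Usage (needs network access; placeholder URL):
#   fetch_tail https://example.com/index.tar.gz index.tar.gz
```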
