
(Wishlist) Support optimal downloads for non-Github repositories #1126

Closed

totten opened this issue Oct 6, 2020 · 4 comments

@totten

totten commented Oct 6, 2020

Use-Case

  • Create a git repository for a PHP package
  • Publish the repository on any public server besides github.com -- e.g. a plain-old git repo or a self-hosted Gitlab/Gitea/Phabricator/etc.
  • Register the repo on packagist.org

Improvement

The above currently produces a working download system, which is a really good thing -- but it could be better. The issue can be seen in a couple of ways:

  • If you run composer require ..., it performs a git clone of the repository. This brings in the full history, which can be significantly larger than a version-specific download (e.g. ~220 MB vs ~10 MB).
  • If you inspect the package feed (e.g. grepping ~/.composer/cache/repo/https---repo.packagist.org/provider...), you can see that each release:
    • Has a source URL (VCS)
    • Does not have a dist URL

By contrast, if the same repo had been published on github.com, then the feed would have a dist property like:

"dist" : {
    "reference" : "COMMIT",
    "url" : "https://api.github.com/repos/USER/REPO/zipball/COMMIT",
    "type" : "zip",
    "shasum" : ""
},

and the download would be optimal. Of course, there is no standard zipball URL that works for all git repos, so that's a problem. But do we really have to download the whole history?
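(The size difference is easy to spot-check locally by comparing a full clone with a shallow clone; the URL below is just a placeholder.)

# Spot-check: compare a full clone against a shallow, single-revision clone.
# Replace the URL with any repository that has a long history.
URL="https://git.example.com/my-repo.git"
git clone "$URL" full-history
git clone --depth=1 "$URL" latest-only
du -sh full-history/.git latest-only/.git   # roughly the ~220 MB vs ~10 MB contrast described above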

Approaches

Here are a couple of approaches which would address the specific symptom, but they come with trade-offs:

  1. Publish a private feed. That's fine for private projects, but it is a chunk of setup/maintenance for both authors and consumers, and it seems better for the ecosystem if public code is on the public registry.

  2. Let authors control the dist URL directly (e.g. with some extra metadata in composer.json or the packagist web UI). Making dist more configurable would hypothetically address multiple issues (this one, plus #270 "[feature request] Support for packages not in VCS to be handled by packagist.org" and #903 "Packages on GitHub: Point dist URL to release file"), but I sense some hesitation in reading #270/#903. In any event, this still requires docs/coordination for package authors. In terms of "developer experience", I could imagine it goes against the easy-to-use ethos of "post a repo URL to packagist, and it just works".

I'd like to offer/document another option. It's not initially obvious, but it avoids the trade-offs of approaches 1 and 2.

  3. Add support to composer+packagist for a dist that specifically grabs a git commit. Hypothetically, suppose the feed said:
"dist" : {
    "reference" : "COMMIT",
    "url" : "https://git.example.com/my-repo.git",
    "type" : "git-commit",
    "shasum" : ""
},

The download process for a git-commit would be akin to this bash mockup:

# Shallow-fetch a single commit into a throwaway repo, then export it as a zip.
mkdir "$TMP_REPO" && cd "$TMP_REPO"
git init
git remote add origin "$URL"
# Fetching a bare commit SHA requires the server to allow it
# (uploadpack.allowReachableSHA1InWant or similar; common hosts do).
git fetch --depth=1 origin "$REFERENCE"
git archive --format=zip -o "$CACHE_FILE" "$REFERENCE"
cd .. && rm -rf "$TMP_REPO"

Observations about this arrangement:

  • The download size for git fetch --depth=1 is about the same as the zipball download size.
    • (At least, according to my spot-checked example. I checked against this repo, which is moderately large on account of its long history. The zipball is ~10 MB, and the --depth=1 download is also ~10 MB. The full history is ~220 MB.)
  • It does not rely on any proprietary Github APIs. The same script works for me on a Gitlab-based repo.
  • It doesn't require any extra configuration/deployment by the author or consumer. But if you want some basic options, git archive respects the same .gitattributes export rules as Github.
  • git archive generates a zip file, which can be stored in a cache (just like a Github zipball).

(Note that I'm not really certain about the best notation for the feed. I suggested "type": "git-commit" because it was easy to skim, but maybe there's a better way to get a similar download behavior. Perhaps, alternatively, packagist could continue to give a "dist": null, but composer would massage the data. To wit: "if dist is null, and if source is a git VCS, then populate a default dist that references a commit.")
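(To make the "massage the data" variant concrete, here is a rough client-side sketch -- the fallback itself is hypothetical, and the field names simply mirror the feed excerpts shown above:)

# Hypothetical fallback: if a release has no dist but its source is a git repo,
# synthesize a git-commit dist from the source data (requires jq).
release='{"source":{"type":"git","url":"https://git.example.com/my-repo.git","reference":"COMMIT"},"dist":null}'
echo "$release" | jq 'if .dist == null and .source.type == "git"
  then .dist = {"reference": .source.reference, "url": .source.url, "type": "git-commit", "shasum": ""}
  else . end'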

@stof
Contributor

stof commented Oct 7, 2020

Doing limited clones of repositories could be something that Composer itself does when it has no dist URL, instead of cloning the full history (and there would be no need to use git-archive and then unzip the file).
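(A minimal sketch of such a limited clone, assuming the release corresponds to a tag the client already knows; it skips the archive/zip step entirely, and fetching a bare commit SHA would additionally need server-side support:)

# Limited-clone install without git-archive or a zip step; TAG, URL and TARGET_DIR are placeholders.
git clone --depth=1 --branch "$TAG" "$URL" "$TARGET_DIR"
rm -rf "$TARGET_DIR/.git"   # the installed package does not need the history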

But generating these archives on Packagist would require a big change to the infrastructure (and to its cost): currently, packagist.org does not host the source code, only metadata. I don't see that changing anytime soon (especially given that more than 90% of packages registered on packagist.org are hosted either on github.com or gitlab.com, and Composer knows how to use archives provided by these platforms).

> 2. Let authors control the `dist` URL directly (e.g. with some extra metadata in `composer.json` or the packagist web UI)

Technically, Composer already supports defining the dist URL in the composer.json. But it is hard to manage for a VCS-based repository, as the URL has to depend on the commit reference, so this is not really practical (unless your VCS repository is the output of some build process).
A solution here might be to allow a placeholder syntax in dist URLs that would get replaced by the reference, so that the configured URL does not need to change for each commit. This would provide a way to implement a download endpoint for custom hosting not known by Composer.
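(As an illustration of that placeholder idea -- the %reference% token below is made up, not existing Composer syntax:)

# Hypothetical: a templated dist URL stored once, expanded per release by the client.
REFERENCE="abc1234"                                                # commit to install
TEMPLATE="https://git.example.com/my-repo/archive/%reference%.zip"
URL="${TEMPLATE//%reference%/$REFERENCE}"                          # -> .../archive/abc1234.zip
curl -fL -o package.zip "$URL"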

@hopeseekr
Contributor

The biggest problem plaguing the packagist.org database is all the dead Git servers, particularly those on dead IP addresses.

When I’m making bulk archives of every PHP project, connections to those IP addresses just WILL NOT time out, and they tie up a PHP process for a good 30 minutes, an hour, or longer. I can’t for the life of me convince git to time out in a reasonable manner. The other major problem is live domains pointing to dead Git servers (like https://git.xaifiet.com/xavier.dubreuil/doctrine-api-bundle), but at least those I can block via /etc/hosts.

I wish packagist.org would just delete these unreachable-repo projects after, say, a year of no updates. It should just run a simple “git clone” and if it times out or errors out, remove the project.
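(For what it's worth, the hang can be bounded externally; a rough sketch, with arbitrary limits:)

# Cap the wall-clock time of a probe clone with coreutils `timeout`, and abort
# stalled HTTP transfers via git's low-speed settings.
if ! timeout 600 git -c http.lowSpeedLimit=1000 -c http.lowSpeedTime=60 \
     clone --depth=1 "$URL" "$DIR"; then
    echo "unreachable or too slow: $URL"   # candidate for cleanup
fi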

@stof
Contributor

stof commented Oct 15, 2020

@hopeseekr I don't see how this relates to the subject of this issue.

@Seldaek
Member

Seldaek commented Jan 14, 2021

@hopeseekr this is quite a pain to detect, but we do some amount of cleanup. We should definitely do more, I agree; I'm just wary of automatically deleting things without enough safeguards in place, as this can go badly -- if, say, your own network is in trouble, you don't want to wipe out everything.

Anyway, as for the OP here... @stof kinda answered most of it. We do offer support for Bitbucket/GitLab/GitHub. Supporting every other self-hosted thing out there is too much work for what it's worth IMO. And self-hosting OSS just leads to dead servers and hassle for contributors anyway, so it's not a practice I'd personally like to encourage.
