New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Maybe git_repository should fetch tarballs from GitHub #2147

Closed
jart opened this Issue Nov 29, 2016 · 23 comments

Comments

Projects
None yet
10 participants
@jart
Member

jart commented Nov 29, 2016

Services like GitHub that have well-known HTTP interfaces for grabbing snapshots at specific revisions and tags.

For example:

Maybe it would be better if, under the hood, git_repository was able to recognize when remote is pointing to one of those services, and then transparently turn itself into {new_,}http_archive under the hood.

This will make things go significantly faster and more reliable for many users, without requiring any action on their part.

CC: @nitnelave @pcj
See also: #1194

@abergmeier-dsfishlabs

This comment has been minimized.

Contributor

abergmeier-dsfishlabs commented Nov 30, 2016

I would prefer to have some logic, which keeps the git history around, even if a git_repo gets invalidated and afterwards fetches based on that history. THAT would be way faster.
Perhaps have a persist_history switch which would persist the history in ~/.cache/bazel/git_history?
Not to mention LFS support.

Currently we already use what you want by simply using http_archive and it is pretty straight forward.

@trainman419

This comment has been minimized.

trainman419 commented Jan 25, 2017

This seems more surprising than useful, and debugging this feature when the remote tarball isn't available will be more confusing than just having an http_archive directory in your WORKSPACE file.

@jart

This comment has been minimized.

Member

jart commented Jan 25, 2017

Yes but many users aren't aware that http_archive is a much better option than git_repository in 90% of cases. Most users think, "I want something from GitHub, therefore I should use git_repository" and end up with slow builds as a result. We want builds for them to go fast, without asking them to upgrade their code.

@jwnimmer-tri

This comment has been minimized.

Contributor

jwnimmer-tri commented Jan 31, 2017

I like the motivation, but is it possible to do this invisibly while preserving integrity checking? I would expect the commit= from git and sha256= checksum of the archive to be unrelated, so we'd only be relying on https for authenticity?

For me at least, a big fat warning on https://bazel.build/versions/master/docs/be/workspace.html#new_git_repository would have been enough. I was content to write RobotLocomotion/drake#4975 myself having eventually found the advice here.

@jart

This comment has been minimized.

Member

jart commented Jan 31, 2017

That's actually a very wise insight I hadn't considered @jwnimmer-tri. I guess the only solution is to get rid of git_repository.

@ittaiz

This comment has been minimized.

Member

ittaiz commented Apr 11, 2017

@kchodorow the opinions above are unclear if this won't be more of a deficit than a gain given the surprise factor. Do you still think it's better to have this feature as is?

@kchodorow

This comment has been minimized.

Contributor

kchodorow commented Apr 11, 2017

The change should be mostly invisible to users, as it falls back to using jGit if anything goes wrong with downloading the tarball. To the user, this should just make GitHub repos "feel faster" and will avoid jGit bugs. For private repositories, it may feel slightly slower (since there will be an initial 403 before the clone) but compared to the time the "average" clone has, I don't think this will be particularly noticeable. What do you think the surprise factor would be?

I think it's a good idea to add a warning to git_repository/new_git_repository.

@jwnimmer-tri

This comment has been minimized.

Contributor

jwnimmer-tri commented Apr 11, 2017

I glanced at the patch but didn't immediately see anything special for integrity checking -- the approach is to rely solely on https for authenticity, relying on https github trust to turn a commit hash into a tarball? Perhaps the sha256 tarball checksum should be an optional kwargs, as it is with http_archive -- the rule works without it, but careful users should set it.

@ittaiz

This comment has been minimized.

Member

ittaiz commented Apr 11, 2017

Mainly the fact that you get a shallow clone without asking it and only for public github repositories. Without reading the code or this thread it can be a bit daunting.
I'm really not an expert but I think maybe documenting in big capital letters in a few strategic places when to use git_repository and more importantly when not to use it might be the best approach.

@kchodorow

This comment has been minimized.

Contributor

kchodorow commented Apr 12, 2017

Well, I'm not doing all-caps, but I have 2 changes in code review: one adds a big warning and one adds a sha256 attribute.

@steren

This comment has been minimized.

Contributor

steren commented Apr 12, 2017

Adding a sha256 attribute to the signature of git_repository that is supposed to be the sha256 of the downloaded archive does not seem the right API to me, because it relies on knowledge of the implementation, that differs in case of public GitHub repos.

I think that if the performance improvement introduced cannot guarantee integrity, then we should revert it, or at making it opt-in or opt-out.

I agree that we can recommend to users who prefer performances over convenience to use http_archive for GitHub archives.

@pcj

This comment has been minimized.

Member

pcj commented Apr 12, 2017

I'm not in favor of faking git_repository to become http_archive under the hood. I agree it's more an issue with educating users than surprising them.

I'd be OK with popping a warning/suggestion to "hey, use http_archive instead!" if a git_repository could be substituted out, and a flag to silence those warnings, something like --experimental_suppress_http_archive_recommendation.

Even then, not sure it's worth the effort...

@steren

This comment has been minimized.

Contributor

steren commented Apr 12, 2017

I don't think it is worth a warning in Bazel itself. At best, it is a recommendation in the doc for better first build performances.

@ittaiz

This comment has been minimized.

Member

ittaiz commented Apr 12, 2017

@jart

This comment has been minimized.

Member

jart commented Apr 12, 2017

I only opened this issue under the assumption that deprecating and removing git_repository wasn't an option on the table. I think that might benefit Bazel users more.

@steren

This comment has been minimized.

Contributor

steren commented Apr 13, 2017

Depending on public or private git repositories is critical to Bazel use cases for our external users. We should not deprecate the git_repository API. Only offering an archive dependency solution is a big step back.

Instead we should improve its internal performances and stability.

@ittaiz

This comment has been minimized.

Member

ittaiz commented Apr 13, 2017

@steren

This comment has been minimized.

Contributor

steren commented Apr 13, 2017

Projects might organize their codebase in various git repositories or depend on open-source libraries available in public git repositories.

@ittaiz

This comment has been minimized.

Member

ittaiz commented Apr 13, 2017

@steren

This comment has been minimized.

Contributor

steren commented Apr 13, 2017

In my opinion, for API semantic reasons: if the user's goal is to depend on a git repository, the API should be about giving the coordinates to a git repository, not to an archive.

@ittaiz

This comment has been minimized.

Member

ittaiz commented Apr 13, 2017

@steren

This comment has been minimized.

Contributor

steren commented Apr 13, 2017

Yes, the dev mailing list is a better for this. However, I will be unable to answer before Tuesday.

bazel-io pushed a commit that referenced this issue Apr 19, 2017

Add warning to git_repository
#2147

PiperOrigin-RevId: 153494286

bazel-io pushed a commit that referenced this issue May 5, 2017

Add sha256 attribute to git_repository
RELNOTES: Adds a sha256 attribute to git_repository and new_git_repository.
This can only be used if the remote is a public GitHub repository. It forces
Bazel to download the repository as a tarball, which will often be faster and
more robust than cloning it.

#2147.

PiperOrigin-RevId: 155223382
@MarkusTeufelberger

This comment has been minimized.

Contributor

MarkusTeufelberger commented May 11, 2017

https://git-scm.com/docs/git-archive could be a better solution...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment