go_repository: git repositories should be stored in the cache #549

Open
jayconrod opened this issue Jun 16, 2019 · 18 comments

@jayconrod
Contributor

If a go_repository rule is invalidated but @go_repository_cache is not, we shouldn't need to clone a Git repository again. The first time we fetch a repository, we could clone it and store a zip file in the cache. When go_repository is invalidated, we would just need to extract the cached zip.
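The proposal above can be sketched roughly as follows. This is only an illustration of the idea, not Gazelle's implementation: the function names (`cache_key`, `store_checkout`, `restore_checkout`), the key scheme and the cache layout are all assumptions made for the example, and the clone itself is left to the caller.

```python
import hashlib
import os
import zipfile


def cache_key(remote: str, commit: str) -> str:
    """Derive a stable file name from remote URL and commit (hypothetical scheme)."""
    return hashlib.sha256(f"{remote}@{commit}".encode()).hexdigest()


def store_checkout(checkout_dir: str, cache_dir: str, key: str) -> str:
    """Zip an already-cloned working tree (without .git) into the cache."""
    os.makedirs(cache_dir, exist_ok=True)
    zip_path = os.path.join(cache_dir, key + ".zip")
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(checkout_dir):
            # Skip VCS metadata and walk in a fixed order.
            dirs[:] = sorted(d for d in dirs if d != ".git")
            for name in sorted(files):
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, checkout_dir))
    return zip_path


def restore_checkout(cache_dir: str, key: str, dest_dir: str) -> bool:
    """Extract the cached zip into dest_dir. Returns False on a cache miss."""
    zip_path = os.path.join(cache_dir, key + ".zip")
    if not os.path.exists(zip_path):
        return False
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
    return True
```

On invalidation, a rule would call `restore_checkout` first and only fall back to cloning (followed by `store_checkout`) when it returns `False`.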

jayconrod pushed a commit to jayconrod/bazel-gazelle that referenced this issue Jun 16, 2019
'gazelle fix' and 'gazelle update' now accept -repo_config, the path
to a file where information about repositories can be loaded. By
default, this is WORKSPACE in the repository root directory.
'gazelle fix' and 'gazelle update-repos' still update the WORKSPACE
file in the repository root directory when this flag is set.

go_repository passes the path to @//:WORKSPACE to -repo_config.

go_repository resolves @//:WORKSPACE and any files mentioned in
'# gazelle:repository_macro' directives. When these files change, all
go_repository rules will be invalidated. It should not be necessary to
re-download cached repositories (except vcs repositories; see bazelbuild#549).
On a MacBook Pro, it takes about 22.5s to re-evaluate 70 cached,
invalidated go_repository rules for github.com/gohugoio/hugo. If this
becomes a problem for large projects, we can provide a way to disable
or limit this behavior in the future.

go_repository_tools and go_repository_cache are moved to their own
.bzl files. Changes in go_repository.bzl should not invalidate these
in the future.

Fixes bazelbuild#529
jayconrod pushed a commit that referenced this issue Jun 17, 2019
@Globegitter
Contributor

There is also a different approach used here: bazelbuild/bazel#7424 and some relevant discussion here: https://groups.google.com/forum/#!searchin/bazel-dev/buchgr%7Csort:date/bazel-dev/7N_6-RbqBf4/bOggKYkUBgAJ

@jayconrod
Contributor Author

Nice. I hope that gets merged at some point. I filed bazelbuild/bazel#5086, but I've given up hope of it ever being implemented.

@mariusgrigoriu

I'm finding that go_repository rules are being fetched when I don't expect it. For example, I make a trivial change to the WORKSPACE file, such as inserting whitespace at the end of the file, and it fetches the go_repository.

Additionally, I'm finding the same repository being downloaded multiple times in a row, extending fetch times. This is mostly evident when downloading in urls mode without a sha256, although I believe this happens with a sha256 and in git mode just based on the duration of the fetch.

Are these all symptoms of this issue or should I go open another issue?

@jayconrod
Contributor Author

@mariusgrigoriu Sorry for the delay. Was at GopherCon last week.

> For example, I make a trivial change to the WORKSPACE file, such as inserting whitespace at the end of the file, and it fetches the go_repository.

This may be working as intended. Gazelle, as run by go_repository, reads the WORKSPACE file for configuration. So go_repository rules need to be re-evaluated when WORKSPACE changes. However, the module cache should not be cleared, so go_repository in module mode shouldn't need to download anything.

> Additionally, I'm finding the same repository being downloaded multiple times in a row, extending fetch times. This is mostly evident when downloading in urls mode without a sha256, although I believe this happens with a sha256 and in git mode just based on the duration of the fetch.

Use a sha256. In HTTP mode, HTTP downloads are cached by Bazel. The cache key is the expected sha256, which means downloads without sha256 are not cached. It's also important to do this to ensure builds are authentic and reproducible.
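To see why a missing sha256 defeats caching, here is a minimal sketch of a download cache keyed by the expected digest. It is not Bazel's actual cache layout or API; `fetch`, its parameters and the one-file-per-digest layout are assumptions made for illustration.

```python
import hashlib
import os
from typing import Callable, Optional


def fetch(url: str, expected_sha256: Optional[str], cache_dir: str,
          download: Callable[[str], bytes]) -> bytes:
    """Return the bytes for url, consulting a cache keyed by expected_sha256."""
    if expected_sha256 is None:
        # No digest means no cache key, so every call hits the network.
        return download(url)

    path = os.path.join(cache_dir, expected_sha256)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()

    data = download(url)
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256:
        # A mismatch also catches tampered or regenerated archives.
        raise ValueError(f"sha256 mismatch for {url}: got {actual}")

    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return data
```

The same check is what makes the build reproducible: the digest both names the cache entry and verifies what was downloaded.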

VCS downloads are currently not cached, which is this issue. I'd strongly encourage use of module mode instead though.

@mariusgrigoriu

Using modules sounds good when they work. We're importing Terraform and parts of k8s, neither of which seem to play nicely with module mode. Since k8s already uses Bazel, I think switching to http_archive is a good approach. Seems that terraform should use go_repository in http mode until other issues regarding terraform are resolved.

@jayconrod
Contributor Author

Be aware that the main Kubernetes repo, k8s.io/kubernetes (a.k.a. github.com/kubernetes/kubernetes), is not intended to be imported by other repos. Most other repos in that org should work though.

@mariusgrigoriu

Understood. This is all because we're consuming e2e tests. Not sure we can do much until kubernetes/kubernetes#74352 moves the e2e framework into staging. (Apologies for hijacking this thread.)

@evie404
Contributor

evie404 commented Nov 27, 2019

While not a fix, most Go dependencies we use are hosted on GitHub, which supplies tar archives over HTTP given a SHA. We wrote this wrapper to convert existing go_repository rules with git tags into ones using urls, which allows for caching: https://gist.github.com/rickypai/abadfd810ba13ad295f9987348ffc6af

It doesn't work with the more recent go mod stuff though.

@jayconrod
Contributor Author

@rickypai Archives served by GitHub do not have stable SHA-256 sums. They haven't changed in a couple years, but it's broken us in the past. Use at your own risk.

@kalbasit
Contributor

kalbasit commented Dec 4, 2019

> @rickypai Archives served by GitHub do not have stable SHA-256 sums. They haven't changed in a couple years, but it's broken us in the past. Use at your own risk.

On the NixOS side, we have been using GitHub's archives for a few years now with no issues regarding sha256 stability:

$ git grep 'fetchFromGitHub {' | wc -l
5266

@jayconrod
Contributor Author

@kalbasit I don't think they've changed anything since fall of 2017. However, I spoke with GitHub support ~6 months ago. They're aware of the issue, but they confirmed archives returned from those endpoints are not guaranteed to have stable hashes.

These breaks were really painful. If the hashes change, it retroactively breaks deterministic builds that depend on the old hashes. At the time, it broke every version of rules_go, since we were using http_archive with those endpoints.

@kalbasit
Contributor

kalbasit commented Dec 4, 2019

> @kalbasit I don't think they've changed anything since fall of 2017. However, I spoke with GitHub support ~6 months ago. They're aware of the issue, but they confirmed archives returned from those endpoints are not guaranteed to have stable hashes.
>
> These breaks were really painful. If the hashes change, it retroactively breaks deterministic builds that depend on the old hashes. At the time, it broke every version of rules_go, since we were using http_archive with those endpoints.

What do you think about adding a flag to update-repos that makes it use http_archive, for those who are willing to take that risk?

@jayconrod
Contributor Author

@kalbasit Not sure I follow. update-repos doesn't emit http_archive rules at all. It also no longer emits go_repository rules that clone git repositories (making this issue mostly moot). In the common case, go_repository downloads module zip files from https://proxy.golang.org (or whatever GOPROXY is set to).

@mariusgrigoriu

We experienced several changing sha256 hashes over the last few months for a subset of archives. In one case, a hash change was only experienced by some people on the team, and it eventually reverted.

@kalbasit
Contributor

kalbasit commented Dec 5, 2019

> @kalbasit Not sure I follow. update-repos doesn't emit http_archive rules at all. It also no longer emits go_repository rules that clone git repositories (making this issue mostly moot). In the common case, go_repository downloads module zip files from https://proxy.golang.org (or whatever GOPROXY is set to).

What version of gazelle, rules_go and Go should make this work? I'm currently using Bazel 0.28.0, rules_go 0.20.2, gazelle 0.19.0 and Go 1.12.9, and I'm still seeing it clone with Git.

Definitely an archive from GOPROXY would help, does it go in the cache as described in https://bazel.build/designs/2016/09/30/repository-cache.html?

@jayconrod
Contributor Author

> What version of gazelle, rules_go and Go should make this work?

@kalbasit That's a new enough version of Gazelle. Make sure none of your go_repository rules have commit, tag, or remote attributes. Gazelle won't generate go_repository rules with those anymore, but that doesn't change the semantics of old rules. Use version and sum (for module mode) or urls and sha256 (for HTTP mode). See the go_repository documentation.
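As a rough aid for auditing existing rules, the attribute check described above could be expressed like this. It is only a sketch: `fetch_mode` is a hypothetical helper, the attribute dictionary stands in for a parsed go_repository rule, and the precedence between modes is an assumption for illustration rather than the rule's documented behavior.

```python
# Attributes that make go_repository fetch through a VCS, per the comment above.
VCS_ATTRS = ("commit", "tag", "remote")


def fetch_mode(attrs: dict) -> str:
    """Classify a go_repository attribute dict as 'vcs', 'module', 'http' or 'unknown'."""
    if any(attrs.get(name) for name in VCS_ATTRS):
        return "vcs"  # still fetched via VCS: the case this issue is about
    if attrs.get("version"):
        return "module"  # expected together with 'sum'
    if attrs.get("urls"):
        return "http"  # expected together with 'sha256'
    return "unknown"
```

Any rule reported as `"vcs"` is a candidate for conversion to `version` and `sum`, or to `urls` and `sha256`.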

> Definitely an archive from GOPROXY would help, does it go in the cache as described in https://bazel.build/designs/2016/09/30/repository-cache.html?

Module zips don't get stored in Bazel's cache, but they do get stored in a separate cache within an internal repository. It's a bit of a hack, but it means they don't need to be downloaded again whenever WORKSPACE changes. Bazel can't cache them for the same reason as the GitHub archives: module zips don't promise SHA-256 stability. The sums are hashes of the contents, not of the zip files themselves.
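The difference between hashing a zip file and hashing its contents can be demonstrated with a small sketch. The `content_hash` function below is a simplified stand-in written for this example; it is not the exact algorithm Go uses for module sums.

```python
import hashlib
import io
import zipfile


def make_zip(files: dict, date_time: tuple) -> bytes:
    """Build an in-memory zip whose entries all carry the given timestamp."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in sorted(files):
            info = zipfile.ZipInfo(name, date_time=date_time)
            zf.writestr(info, files[name])
    return buf.getvalue()


def content_hash(zip_bytes: bytes) -> str:
    """Hash file names and file contents only, ignoring zip metadata."""
    summary = hashlib.sha256()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in sorted(zf.namelist()):
            file_digest = hashlib.sha256(zf.read(name)).hexdigest()
            summary.update(f"{file_digest}  {name}\n".encode())
    return summary.hexdigest()


files = {"go.mod": b"module example.com/m\n", "m.go": b"package m\n"}
zip_a = make_zip(files, (2019, 1, 1, 0, 0, 0))
zip_b = make_zip(files, (2020, 6, 1, 12, 0, 0))

# Same contents, different archive metadata: the file digests differ ...
print(hashlib.sha256(zip_a).hexdigest() == hashlib.sha256(zip_b).hexdigest())
# ... while the content-based hashes agree.
print(content_hash(zip_a) == content_hash(zip_b))
```

A cache keyed by the SHA-256 of the archive would treat `zip_a` and `zip_b` as different downloads, even though a content-based sum accepts both.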

@kalbasit
Contributor

kalbasit commented Dec 5, 2019

They are using version and sum, so they should be good then. Maybe it's not cloning and I only think it is? Is there a way to enable debug output to see what's going on behind the scenes?

Here's my go.bzl that I load from the workspace: https://gist.github.com/kalbasit/9ad680ece5d6904a4e96635a38cf18d6

@jayconrod
Contributor Author

You can run bazel info output_base, then look at subdirectories of external within that. Those are the directories where repository rules get evaluated. Look for a .git subdirectory in there.

There are some git_repository rules declared in go_rules_dependencies. That may be what you're seeing.
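The manual check described above (looking for a .git subdirectory under external) can be scripted. This is a small sketch, assuming the output base path has already been obtained by running `bazel info output_base`; `find_git_checkouts` is a hypothetical helper name.

```python
import os


def find_git_checkouts(output_base: str) -> list:
    """Return names of external repositories that contain a .git directory."""
    external = os.path.join(output_base, "external")
    found = []
    if not os.path.isdir(external):
        return found
    for name in sorted(os.listdir(external)):
        repo_dir = os.path.join(external, name)
        # Only the top level of each repository is inspected.
        if os.path.isdir(os.path.join(repo_dir, ".git")):
            found.append(name)
    return found
```

Calling `find_git_checkouts(path)` with the printed output base lists the repositories that were fetched by cloning, which helps tell go_repository rules apart from the git_repository rules mentioned above.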
