Support for Treeless clones #1152

Open
tiger12506 opened this issue Feb 1, 2023 · 10 comments
Comments

@tiger12506

There is already fetch-depth: 1 to retrieve only the latest commit and working tree, which is great.
However, for my particular CI project using Actions, we use the git tags to track version information, and the commit messages to generate changelogs. Seems like the feature "treeless clones" would be ideal for our situation.

I couldn't figure out how to make this work with actions/checkout@v3, so I assume that support for it would need to be added.

https://github.blog/2020-12-21-get-up-to-speed-with-partial-clone-and-shallow-clone/
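For readers unfamiliar with the term, here is a minimal sketch of what a treeless clone looks like with plain git, outside of actions/checkout. It builds a throwaway local repository (the `origin` directory, the `v1.0` tag and the commit messages are all made up for illustration) and clones it with `--filter=tree:0`. The two `uploadpack.*` settings are an assumption needed only because the "server" here is a local directory; a hosted service such as GitHub already allows filtered fetches.

```shell
set -e
work=$(mktemp -d)
cd "$work"

# Hypothetical "server" repository with two commits and one annotated tag.
git init -q origin
cd origin
git config user.name demo
git config user.email demo@example.com
git config uploadpack.allowFilter true          # let clients request --filter
git config uploadpack.allowAnySHA1InWant true   # let clients fetch missing objects later
echo one > file.txt
git add file.txt
git commit -q -m "first"
git tag -a v1.0 -m "v1.0"
echo two > file.txt
git commit -q -am "second"
cd ..

# Treeless clone: all commits and tags, but trees and blobs only on demand.
git clone -q --filter=tree:0 "file://$work/origin" treeless
cd treeless

# Both commands only walk commits and tags, so they work on the full history.
described=$(git describe --tags)
echo "$described"          # v1.0-1-g<abbreviated hash>
git log --format=%s        # prints "second" then "first"
```

The point of the sketch is that version detection via tags and changelog generation via commit messages need commits, not file contents, which is exactly what this kind of clone downloads up front.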

@j-liedtke

This feature would be very helpful indeed.

@prein

prein commented Mar 11, 2023

Related #172 and #663 and

@marc-hb

marc-hb commented Mar 23, 2023

Thanks for filing this!

There is already fetch-depth: 1 to retrieve only the latest commit and working tree, which is great.

Actually, shallow clones are "great" until you start trying to use git; then they can turn into disasters. This is the most time-consuming shallow clone disaster I found:

Even the output of git describe changes, which means you can't have reproducible builds (https://reproducible-builds.org/) in some projects.

So the (newer?) --filter=tree:0 option seems much, much better from a CI perspective than shallow clones because it doesn't seem to "break git" so far and the cloning speed-up seems pretty dramatic, 2-3 times faster in my experience. Shallow cloning is not faster, maybe even slower, see some numbers here:

@marc-hb

marc-hb commented Mar 24, 2023

@derrickstolee your (excellent!) blog and work about this is now 1.5 years old. It's a largely wasted effort if the official and universal way to clone on github still can't use these optimizations.

@derrickstolee

One thing to be really careful about is the fact that fetches from treeless clones can be very strange if there is a .gitmodules file in the repo. That will really only affect users that run git fetch within their workflow for some reason. I mentioned this in the article:

⚠️ Warning: While writing this article, we were putting treeless clones to the test beyond the typical limits. We noticed that repositories that contain submodules behave very poorly with treeless clones. Specifically, if you run git fetch in a treeless clone, then the logic in Git that looks for changed submodules will trigger a tree request for every new commit! This behavior can be avoided by running git config fetch.recurseSubmodules false in your treeless clones. We are working on a more robust fix in the Git client.
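As a small sketch of the workaround quoted above: the setting is an ordinary per-repository git config value, so it can be applied right after cloning and before any later `git fetch`. The temporary repository below is only a stand-in for a real treeless clone; in a workflow the command would run inside the checked-out directory.

```shell
set -e
# Stand-in for a treeless clone (hypothetical path created just for this demo).
repo=$(mktemp -d)
git init -q "$repo"

# Stop git fetch from inspecting submodule changes, which would otherwise
# trigger a tree download for every new commit in a treeless clone.
git -C "$repo" config fetch.recurseSubmodules false

# Confirm the setting is recorded in the repository's local config.
git -C "$repo" config --get fetch.recurseSubmodules   # prints: false
```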

The main reason to use a treeless clone over a shallow clone is if you need the commit history for something. For example, Git Credential Manager uses full clones because its build determines the version number from the commit history. This example could use treeless clones instead, saving a lot of effort.

@pleunv

pleunv commented Apr 17, 2023

When looking into migrating a build pipeline to GitHub, I was certain there would be some way to customize the checkout in order to get a treeless clone, since I first read about this magic on the GitHub blog a few years ago. To my surprise, this was not the case. There are barely any checkout options, and efforts to implement improvements such as sparse checkouts have gone completely unanswered despite plenty of demand from the community. Even worse, if I want to use reusable workflows I seem to be forced to opt in to the official checkout action, so I can't even efficiently replace it with my own checkout logic that is optimized for speed.

Treeless clones are a game changer and can drastically improve performance on larger repositories. Please consider supporting this.

@marc-hb

marc-hb commented Apr 17, 2023

One thing to be really careful about is...

No one is asking to change the default behavior. This issue is only about making the new git feature available.

Treeless clones are a game changer and can drastically improve performance on larger repositories.

... and it would also save GitHub a lot of CPU and network cycles - hence $$$.

It's really unexpected to see a Github employee doing all the amazing work in https://github.blog/2020-12-21-get-up-to-speed-with-partial-clone-and-shallow-clone/ and upstreaming it all which makes it available out of the box for every git user... except for Github! :-)

@sliekens

sliekens commented Oct 8, 2023

Commenting to cast my support for this feature request. I have a workflow to build a documentation site where I need to use the git commit log to find the last modified date of a page (git log -- docs/guide/usage.md). With shallow clone, the git log is unreliable.
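A minimal sketch of that "last modified date" lookup, assuming the date wanted is the committer date of the most recent commit touching the page. The repository, the file contents and the fixed timestamps are invented for the demo; only the `git log -- docs/guide/usage.md` idea comes from the comment above.

```shell
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.name demo
git config user.email demo@example.com

# Commit the page with a fixed date (2023-01-01 12:00 UTC, raw git format).
mkdir -p docs/guide
echo "usage" > docs/guide/usage.md
git add docs
GIT_AUTHOR_DATE="1672574400 +0000" GIT_COMMITTER_DATE="1672574400 +0000" \
    git commit -q -m "add usage guide"

# A later, unrelated commit must not affect the page's date.
echo "readme" > README
git add README
GIT_AUTHOR_DATE="1685620800 +0000" GIT_COMMITTER_DATE="1685620800 +0000" \
    git commit -q -m "unrelated change"

# Date of the newest commit that touched the page.
last_modified=$(git log -1 --date=short --format=%cd -- docs/guide/usage.md)
echo "$last_modified"   # 2023-01-01
```

In a shallow clone the log is cut off at the fetch depth, so the commit that last touched a given page may simply not be there, which is why the result becomes unreliable.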

/edit

Looks like this is now possible(!)

Treeless clone

steps:
  - name: Checkout
    uses: actions/checkout@v4
    with:
      filter: tree:0
      fetch-depth: 0 # (no history limit)

Blobless clone (which is what I needed)

steps:
  - name: Checkout
    uses: actions/checkout@v4
    with:
      filter: blob:none
      fetch-depth: 0 # (no history limit)

@marc-hb

marc-hb commented Oct 9, 2023

Indeed #1396 was merged last week. Testing it now.

EDIT: when measuring, pay attention to this: fetch-depth changes how many tags are fetched, which can obviously affect performance. Worse: the plain git fetch command has 3 distinct choices:
https://git-scm.com/docs/git-fetch

This default behavior can be changed by using the --tags or --no-tags options

But the fetch-tags: field in this action is a boolean!
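To make the three choices concrete, here is a rough sketch using plain git against a throwaway local repository. The branch and tag names (`main`, `side`, `on-main`, `on-side`) and the `count_tags` helper are hypothetical; the only real options used are `git fetch` with no tag option, `--tags`, and `--no-tags`.

```shell
set -e
work=$(mktemp -d)

# Hypothetical origin: one tag reachable from main, one only from a side branch.
git init -q "$work/origin"
cd "$work/origin"
git config user.name demo
git config user.email demo@example.com
git commit -q --allow-empty -m "base"
git branch -M main
git tag on-main
git checkout -q -b side
git commit -q --allow-empty -m "side work"
git tag on-side
git checkout -q main

# Fetch only 'main' into a fresh repository with the given tag option
# (or none) and print how many tags arrived.
count_tags() {
    dir=$(mktemp -d)
    git init -q "$dir"
    git -C "$dir" remote add -t main origin "$work/origin"
    git -C "$dir" fetch -q "$@" origin
    git -C "$dir" tag | wc -l | tr -d ' '
}

default_tags=$(count_tags)            # auto-follow: only tags on fetched history
all_tags=$(count_tags --tags)         # every tag, plus the commits they need
no_tags=$(count_tags --no-tags)       # no tags at all
echo "default=$default_tags --tags=$all_tags --no-tags=$no_tags"
# default=1 --tags=2 --no-tags=0
```

The middle ground (the default) is the one a boolean field cannot express: it brings in the tags that git describe needs for the fetched branch without pulling every tag in the repository.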

@marc-hb

marc-hb commented Oct 9, 2023

I got a few numbers from the https://github.com/thesofproject/linux/actions/runs/6451238988/job/17511599536?pr=4622 test run and a couple similar others. This is cloning the Linux kernel repo.

Obviously, end-to-end timings like these depend on a gazillion other parameters, such as the current workload and network traffic, so the numbers below are only orders of magnitude, not accurate measurements. Also, performance is HIGHLY dependent on your particular git repo, and I suspect the size of the Linux kernel is way above average, which also makes its performance extreme and interesting.

=> Do your own testing and measurements.

  • fetch-depth: 1 (the default) usually takes about 35s

  • filter: tree:0 + fetch-depth: 0 took about 1min30s +/-10s in that run.

So a treeless clone is 2x-3x slower than a shallow clone but this is still an amazing and incredibly useful speed-up because:

  • git describe, git log and maybe others are fixed! This was the whole point. Maybe git merge-base is fixed too?
  • A complete, regular clone (with fetch-depth: 0 alone) takes 11 minutes.

Treeless could probably be faster if the action supported the default fetch behavior with respect to tags: "By default, any tag that points into the histories being fetched is also fetched; the effect is to fetch tags that point at branches that you are interested in." (from: https://git-scm.com/docs/git-fetch)

Unfortunately, fetching tags in this action is an "all or nothing" boolean (#579). I'm assuming many people interested in filter: tree:0 want to fix git describe, which requires some (but not all!) tags.

Note: a shallow AND treeless clone took about 1min 10s.
As expected, this was faster than a treeless clone - but barely.
Surprisingly, this is slower than a plain shallow clone?!
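For anyone who wants to follow the "do your own testing" advice with plain git before touching a workflow, a rough harness could look like the sketch below. It is an assumption-laden demo: `repo_url` points at a tiny local repository created on the spot (replace it with your own repository URL), `time_clone` is a hypothetical helper, and whole-second timing via `date +%s` is only good enough for large repositories.

```shell
set -e
work=$(mktemp -d)

# Tiny stand-in repository; for real measurements use your own repo URL.
git init -q "$work/origin"
cd "$work/origin"
git config user.name demo
git config user.email demo@example.com
git config uploadpack.allowFilter true          # needed for a local "server" only
git config uploadpack.allowAnySHA1InWant true
for i in 1 2 3; do
    echo "$i" > file.txt
    git add file.txt
    git commit -q -m "commit $i"
done
repo_url="file://$work/origin"

# usage: time_clone <name> [git clone options...]
time_clone() {
    name=$1
    shift
    start=$(date +%s)
    git clone -q "$@" "$repo_url" "$work/$name"
    end=$(date +%s)
    commits=$(git -C "$work/$name" rev-list --count HEAD)
    echo "$name: $((end - start))s, $commits commits of history"
}

time_clone shallow --depth=1          # like fetch-depth: 1
time_clone treeless --filter=tree:0   # like filter: tree:0 + fetch-depth: 0
time_clone full                       # like fetch-depth: 0 alone
```

Besides the timings, the commit counts show the trade-off discussed above: the shallow clone sees a single commit, while the treeless and full clones both see the whole history.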
