Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How do I fetch all commits only in the PR branch #552

Open
samholmes opened this issue Jul 7, 2021 · 23 comments
Open

How do I fetch all commits only in the PR branch #552

samholmes opened this issue Jul 7, 2021 · 23 comments

Comments

@samholmes
Copy link

How do I tell actions/checkout to fetch all the commits up to the fork point? I want all the commits up to the fork point from the base branch. I don't see any examples on how to do this in the README. I'd assume this is a common need.

@wmertens
Copy link

wmertens commented Jul 9, 2021

workaround: assume that forks are never more than 300 commits and fetch 300 commits

@wmertens
Copy link

wmertens commented Jul 9, 2021

also duplicate of #520

@samholmes
Copy link
Author

If you want your script to only consider the commits in the PR, then by over fetching you'll pollute this list with commits not in the PR.

@jbreckmckye
Copy link

This is a necessity for most monorepo builds

@kettanaito
Copy link

I'd like to run an automated release cron job that'd analyze commits for the past N hours and create a release. Turns out that actions/checkout will only have a single commit—the one that's triggered the corn job.

I know there's a fetch-depth option but it pushes me away with the "all branches and all tags" statement. I really need only all commits for the default branch to be present in Git history. I wonder how actions like semantic-release-action even work with corn releases then. It'd be silly to expect a new release for each commit.

@jbreckmckye
Copy link

jbreckmckye commented May 11, 2022

Without being able to do it via the checkout action, here's a workaround in GHA

- name: Checkout
  uses: actions/checkout@v2
- name: Checkout PR changes
  run: |
    # Un-shallow
    git config remote.origin.fetch "+refs/heads/*:refs/remotes/origin/*"
    # Deepen topic branch; checkout topic branch
    git fetch origin ${{ github.event.pull_request.head.ref }} --depth=$(( ${{ github.event.pull_request.commits }} + 1 ))
    git checkout ${{ github.event.pull_request.head.ref }}
    # Fetch main for common origin
    git fetch origin main:main --depth=50
- name: Get BRANCH_POINT
  id: branch_point
  run: |
    # Find common ancestor between main and topic branch
    BRANCH_POINT=$(git merge-base $(git rev-parse --abbrev-ref HEAD) main)
    [[ -z BRANCH_POINT ]] && echo "No branch point" && exit 1
    echo "::set-output name=REF::$BRANCH_POINT"
- name: List changed files
  run: git diff --name-only ${{ steps.branch_point.outputs.REF }}

(Improvements welcome)

@polarathene
Copy link

polarathene commented Jun 27, 2022

I want all the commits up to the fork point from the base branch.

You can use the github context to get how many commits belong to the PR, then fetch that depth.


Example - Fetch enough commits from PR & base branch (eg: master)

To minimize commits fetched to compare two branches, a git fetch to the other branch (eg: master / main) will identify a common commit between the two branches and retrieve roughly only the commits needed to support that minimal history.

If the local commit history doesn't already have a commit from the 2nd branch being fetched, the full history (or whatever depth is requested) is fetched for that branch instead. This usually requires fetching one additional commit (the one you branched from) for the 1st branch (eg: PR branch).

Example for the PR branch to pull enough commit history to include a commit the other branch also has:

- name: 'PR commits + 1'
  run: echo "PR_FETCH_DEPTH=$(( ${{ github.event.pull_request.commits }} + 1 ))" >> "${GITHUB_ENV}"

- name: 'Checkout PR branch and all PR commits'
  uses: actions/checkout@v3
    with:
      ref: ${{ github.event.pull_request.head.ref }}
      fetch-depth: ${{ env.PR_FETCH_DEPTH }}

- name: 'Fetch the other branch with enough history for a common merge-base commit'
  run: |
    git fetch origin ${{ github.event.pull_request.base.ref }}

More Info

It's important to set the ref to the PR head ref as above, since the default ref of this action is one extra "merge-commit" (the PR into the base branch), which will not only offset your fetch-depth by 1 additional commit needed, but possibly cause other issues (eg: with git fetch on the base branch, and trying to get commit history that can successfully derive a merge-base).

NOTE: Merge-commits (those from the base branch into the PR branch, not the action default test merge-commit ref) contribute to the total commits from github context, and the fetch depth range would not only get N commits from your branch but also N commits associated to the history of merge commits.

You may not need the extra commit - If you have a merge commit from the base branch, then your local history will have a commit prior to the merge that it can probably use instead. Otherwise fetching an extra commit should ensure a common ancestor.

EDIT: Fetching to a common ancestor may fail when you have commits belonging to the base branch in local history which have not been merged into the PR branch. This problem occurs with the default ref (the generated test merge-commit of the PR branch into the base branch) with a fetch-depth of 2 or more.


An earlier attempt for fetch that didn't work out

Originally I was suggesting a way to provide a more specific commit for fetch to negotiate with, but I think it usually won't make much of a difference vs default git fetch? I also ran into scenarios where it broke, keeping for reference here as it may be helpful to others and not common to find online.

# Get the oldest commit in the branch which should have no parents,
# `--first-parent` only follows first parent encountered (should be our PR branch commits, not commits associated to merge commits from base branch):
FIRST_COMMIT_IN_BRANCH=$(git rev-list --first-parent --max-parents=0 --max-count=1 HEAD)

# Use that commit hash to lookup the next parent (that we don't have in history):
PARENT_COMMIT_HASH=$(git cat-file -p "${FIRST_COMMIT_IN_BRANCH}" | sed -n 's/^parent //p')

# Use that parent commit as a hint for fetch auto-depth, otherwise default fetch increments by the 
# fibonacci sequence until a commit from the base branch is found that also exists in our local commit history:
git fetch --negotiation-tip "${PARENT_COMMIT_HASH}" origin ${{ github.event.pull_request.base.ref }}

# Note this hint seems to fail if you already have a newer commit in local history from the base branch, 
# Such as when the default merge-commit ref with fetch depth of 2 or more pulls in base branch commits.
# Thus not always reliable to establish a merge-base...
#
# It will also fail if the root commit (first commit in repo) was the oldest commit found for FIRST_COMMIT_IN_BRANCH,
# Due to the PARENT_COMMIT_HASH not being possible to resolve.
# That can happen with too many merge commits of base branch to PR branch, as it bumps up the github context
# for commit count, fetching more history than needed (those merge commits don't seem to count for history depth)

Otherwise I found with a merge-commit, the fetch depth for +1 commits of your PR commit count could result in history being more than the expected number of commits fetched (the depth is technically correct from the commit chain associated to the merge commit).

With that lengthy command, you shouldn't need to request one more commit. But I've since found it unreliable (explained in comments for snippet).

@polarathene
Copy link

polarathene commented Jun 27, 2022

(Improvements welcome)

@jbreckmckye if you're interested in a file-name diff (specifically files added/changed, ignoring others like renames or deletions), here's a few examples.

This is nice and small, and should be fine AFAIK, see commented version below for more details:

- name: 'Checkout PR branch (with test merge-commit)'
  uses: actions/checkout@v3
    with:
      fetch-depth: 2
- name: 'Get a list of changed files to process'
  run: git diff-tree --no-commit-id --name-only --diff-filter 'AM' -r HEAD^1 HEAD

Variants and details

I've collapsed the original content (still useful maybe if you want to understand why a --merge-base is important), as I have curated a better source of this information here.

Click to view more info
- name: 'Checkout PR branch'
  uses: actions/checkout@v3
    with:
      ref: ${{ github.event.pull_request.head.ref }}

- name: 'Get a list of changed files to process'
  run: |
    # Fetch enough history for a common merge-base commit
    git fetch origin ${{ github.event.pull_request.head.ref }} --depth $(( ${{ github.event.pull_request.commits }} + 1 ))
    git fetch origin ${{ github.event.pull_request.base.ref }}

    # Show only files from the PR with content filtered by specific git status (Added or Modified):
    git diff-tree --name-only --diff-filter 'AM' -r \
      --merge-base origin/${{ github.event.pull_request.base.ref }} ${{ github.event.pull_request.head.ref }}

The --merge-base option will find a common ancestor like git merge-base, and use that commit as the "before" reference to diff against, replacing the base branch ref value (thus: --merge-base <common ancestor commit> <PR head commit>)

If your action(s) involved use actions/checkout prior to this point, careful of the default "test merge-commit" ref being in the local history still. That can cause git fetch origin ${{ github.event.pull_request.base.ref }} to not fetch extra history needed for a merge-base commit, assuming that ref had a fetch-depth of 2 or more (which will fetch commits from both base and PR branches at that depth).

If you do need to avoid that failure scenario, you can lookup the date of the commit you branched from the base branch, and request all commits since then:

# This should get the oldest commit in the local fetched history (which may not be the branched base commit):
BRANCHED_FROM_COMMIT=$( git rev-list --first-parent --max-parents=0 --max-count=1 ${{ github.event.pull_request.head.ref }} )
UNIX_TIMESTAMP=$( git log --format=%ct "${BRANCHED_FROM_COMMIT}" )

# Get all commits since that commit for the base branch (eg: master):
git fetch --shallow-since "${UNIX_TIMESTAMP}" origin ${{ github.event.pull_request.base.ref }}

If you only need the diff from the PR, you may not need the potentially lengthy commit history from the above approach and can instead use either of these:

- name: 'Checkout PR branch (with test merge-commit)'
  uses: actions/checkout@v3
    with:
      # Merge commit + two commits (1 from each branch, the 2nd depth level)
      fetch-depth: 2

# Show only files from the PR with content filtered by specific git status (Added or Modified)
# HEAD^1 is the base branch commit, HEAD is the merge commit, no merge-base needed
- name: 'Get a list of changed files to process'
  run: git diff-tree --no-commit-id --name-only --diff-filter 'AM' -r HEAD^1 HEAD

Instead of using HEAD, you can use the actual default ref, and set a --merge-base, but since it's effectively on the same branch (base) to compare, there isn't much benefit from doing so, this is just more verbose for the sake of it (but useful if you need/prefer to use refs instead of HEAD or commit hashes):

- name: 'Checkout PR branch (with test merge-commit)'
  uses: actions/checkout@v3
    with:
      # Merge commit + two commits (1 from each branch, the 2nd depth level)
      fetch-depth: 2

- name: 'Get a list of changed files to process'
  env:
    # This is the default ref value, not exactly the same value as `github.ref` context:
    PR_REF: refs/remotes/pull/${{ github.event.pull_request.number }}/merge
  run: |
    # No extra commits need to be fetched, the base branch ref will be set to HEAD^1 commit
    git fetch origin ${{ github.event.pull_request.base.ref }}

    # Show only files from the PR with content filtered by specific git status (Added or Modified)
    git diff-tree --name-only --diff-filter 'AM' -r \
      --merge-base origin/${{ github.event.pull_request.base.ref }} ${PR_REF}

If you tried to use --merge-base against the other commit (from the PR) belonging to the merge commit, then it'd fail to derive until both branches fetch enough commits into local history, as explained earlier with --shallow-since approach.

WARNING: If you removed the --merge-base and changed ${PR_REF} to the PR commit, you'd get a diff between those two commits, which isn't that useful (eg: If the base branch has deleted a file since, but the PR hasn't, then that file is shown in part of the diff as "Added", which is wrong).

For those unfamiliar with why --merge-base is important, that's why, but a non-issue AFAIK when comparing with the "test merge commit" that Github provides and fetches by default.

@jbreckmckye
Copy link

jbreckmckye commented Jun 27, 2022

@polarathene What I was actually after was that BRANCH_POINT ref. The reason being, I'm using Yarn workspace tools, which take the reference to figure out which things need to be re-built / re-tested. However, the diff itself is still very relevant for things like linting.

I'm finding this is really helping me optimise my PR flow. For example, I'm working on repo that used to have a twenty-five minute PR flow for lint, unit-tests, static analysis, etc. Now I have differential builds I can run a differential lint and a subset of unit tests, and a PR build can now take as little as two minutes.

Your examples are still helpful though - I didn't realise I could parameterise the checkout action with the fetch depth, and I'd never really looked too hard at diff-tree. Seems like exactly the command I needed.

@polarathene
Copy link

polarathene commented Jun 28, 2022

What I was actually after was that BRANCH_POINT ref.

# Avoid the default ref "test merge commit" generated by Github (`github.sha`),
# Otherwise `git fetch` will fail to retrieve enough commits for a merge-base,
# The head ref (with default `fetch-depth: 1`) references the latest commit from the PR branch:
- name: 'Checkout PR branch'
  uses: actions/checkout@v3
    with:
      ref: ${{ github.event.pull_request.head.ref }}

- name: 'Find the merge-base commit hash'
  env:
    branch_main: ${{ github.event.pull_request.base.ref }}
    branch_topic: ${{ github.event.pull_request.head.ref }}
  run: |
    # Fetch enough history to find a common ancestor commit (aka merge-base):
    git fetch origin ${{ env.branch_topic }} --depth $(( ${{ github.event.pull_request.commits }} + 1 ))
    git fetch origin ${{ env.branch_main }}

    BRANCH_POINT=$( git merge-base origin/${{ env.branch_main }} ${{ env.branch_topic }} )
    echo "::set-output name=REF::$BRANCH_POINT"

Same thing really. Just remember that won't identify the commit that the PR branched from.

If there was a merge commit from base branch into the PR branch, the last commit from the base branch becomes the merge-base target instead (which is probably what you'd want, anyway).

WARNING: While using the branch refs as shown above works most of the time, it's not as deterministic for when you re-run a previous workflow job, as those branches may have new commits since, and especially for the PR branch, with a stale github context in the re-run, you risk an inaccurate --depth to fetch. Prefer referencing the base.sha and head.sha instead, as demonstrated at the end of this response.


Example of when merge-base derived is not the original commit you branched the PR from:

Example graph with merge-base when merge-commits are in history
git init /tmp/example && cd /tmp/example
git remote add origin https://github.com/user-name/repo-name

# PR has 6 commits (1 is a merge-commit from `main` branch), so fetch that + 1:
git fetch origin pr-branch --depth=7
git log --oneline --graph --all

* 264e38d (origin/pr-branch) PR-F
*   3a2b30b Merge branch 'main' into pr-branch <PR-E>
|\  
| * 43653c4 MAIN-E
| * 1f7e88e MAIN-D
| * 50feed1 MAIN-C
* | e210a28 PR-D
* | 60f8089 PR-C
* | 6f71a7f PR-B
* | 194ad14 PR-A
|/  
* d9411d4 MAIN-B <branched from here> <fetch-depth from 264e38d down PR commit chain>
* 6a18b1c (grafted) MAIN-A <fetch-depth from 264e38d down MAIN commit chain>

It's a bit harder to read the graph output without the colours CLI uses, but you can see the two lines on the left, the * is a commit in that line/branch, currently it only knows of the pr-branch and that some other commits were merged into the pr-branch.

The pr-branch chain looks like a depth of 8, but it's due to the merge commit chain also ensuring a depth of 7 commits (*).

Next fetch the main branch (base branch):

git fetch origin main
git log --oneline --graph --all

* 60b5fd6 (origin/main) MAIN-H <test merge-commit target>
* d1eeab7 MAIN-G
* b0d5074 MAIN-F
| * 264e38d (origin/pr-branch) PR-F
| *   3a2b30b Merge branch 'main' into pr-branch <PR-E>
| |\  
| |/  
|/|   
* | 43653c4 MAIN-E <merge-base result> <current commit for ${{ github.event.pull_request.base.sha }}>
* | 1f7e88e MAIN-D
* | 50feed1 MAIN-C
| * e210a28 PR-D
| * 60f8089 PR-C
| * 6f71a7f PR-B
| * 194ad14 PR-A
|/  
* d9411d4 MAIN-B <branched from here>
* 6a18b1c (grafted) MAIN-A

The graph output now shows commit 43653c4 (MAIN-E) connects to 3a2b30b (PR-E) commit, but also continues to the newer b0d5074 (MAIN-F) commit as the main branch has had new commits since the last merge-commit to the PR happened. MAIN-E is used as the merge-base, not MAIN-B as that is where main diverges from pr-branch now.

Additional tips:

  • ${{ github.event.pull_request.base.sha }} could point to the commit hash of MAIN-D, if at the time the PR was opened on the repo when that was the latest commit on the main branch.
  • ${{ github.event.pull_request.base.sha }} value then updates to the commit hash of MAIN-E once the merge commit is made at a later point in time.
  • With the default ref for the action however, the generated test merge commit is ${{ github.sha }}, but the parent commit on main (which pr-branch was merged with to create the test commit) was the latest commit on main when commit history of pr-branch last updated. Do not mistake ${{ github.event.pull_request.base.sha }} for that same commit (which is updated to the latest main commit when main is merged into pr-branch, or when rebasing pr-branch onto main).
  • When there are multiple merge commits from the base branch into the PR branch, they do count towards the fetch-depth progress when encountered and are part of the commits context count that Github provides. The other commit parent of each merge commit however will branch off and continue to retrieve N commits for the remaining depth as well (which can sometimes result in retrieving as far back as the root commit).

@marc-hb
Copy link

marc-hb commented Dec 9, 2022

I want all the commits up to the fork point from the base branch.

git merge-base is the "obvious" solution and it works in most cases but it does NOT work in all cases. So it cannot be used as the solution in the most common action that absolutely everyone uses.

One alternative is to use https://api.github.com/repos/$gh_project/pulls/$pr_number/commits, see example in
https://github.com/thesofproject/sof/blob/73df64444b752/.github/workflows/codestyle.yml#L38

Two problems with git merge-base:

prha added a commit to dagster-io/dagster that referenced this issue Apr 21, 2023
## Summary & Motivation
Based off of actions/checkout#552

We should limit the fetch depth of the PR branch checkout to the bare
minimum of what we need.

## How I Tested These Changes
Applied script to local repository, made sure that the main branch
checkout happened and the PR checkout happened with the calculated fetch
depth.
takeru911 pushed a commit to takeru911/dagster that referenced this issue Apr 29, 2023
…er-io#13796)

## Summary & Motivation
Based off of actions/checkout#552

We should limit the fetch depth of the PR branch checkout to the bare
minimum of what we need.

## How I Tested These Changes
Applied script to local repository, made sure that the main branch
checkout happened and the PR checkout happened with the calculated fetch
depth.
@polarathene
Copy link

polarathene commented Jun 13, 2023

  • git merge-base is not compatible with shallow clones. Relying on it can make you believe the entire git history has been submitted in the Pull Request.

I was curious so gave this a look with your 2020 example script and adapted to advice I mentioned earlier in this thread about doing a shallow clone and using git merge-base.

Your script referenced the 4.20 kernel (which later became 5.0), so I pulled in the history since then (probably could have started from 2020-10-01 (October 1st) commit date):

# Initial setup

# Expect around 3-6GB:
git clone --single-branch --shallow-since 2019-03-03 https://github.com/torvalds/linux
cd linux
git remote add sof_remote_demo https://github.com/thesofproject/linux

# This probably could be reduced in scope? I tried the shallow since but
# this step still took ridiculously long on a 1GB RAM VPS, like hours resolving delta?
# (more RAM would probably have made a big difference)
# git fetch sof_remote_demo
git fetch --shallow-since 2020-10-01 sof_remote_demo

# Tagging commits and adding a change:
# 16th Oct 2020 (authored 9th Sep): https://github.com/thesofproject/linux/commit/b150588d227ac0
git tag _real_PR_base b150588d227ac0
git checkout _real_PR_base

touch dummyfile; git add dummyfile
git commit -m 'pull request' dummyfile
git tag _pull_request

# 30th Oct 2020: https://github.com/thesofproject/linux/commit/70fe32e776dafb
git tag _target_branch 70fe32e776dafb
# 10th Oct 2020: https://github.com/torvalds/linux/commit/86f29c7442ac4b
git tag _spurious_base 86f29c7442ac4b
# Make a local clone to use the first clone as a remote instead:
# (--bare is faster / smaller by excluding the working tree, only cloning `.git/` content)
# (single commit in PR + common branching point commit, github provides this context)
git clone --bare --single-branch --branch _pull_request --depth 2 "file://$(pwd)" /tmp/linux-local

# Go through the commit history available to find a useful commit in common with target branch:
# (b150588d227ac0edc933e8f7f810d4ea9453b32c aka `_real_PR_base`)
git rev-list --first-parent --max-parents=0 --max-count=1 _pull_request

# Get date of commit (2020-10-15 16:52:12 -0500):
git log --max-count=1 --date=iso8601 --format=%cd "b150588d227ac0edc933e8f7f810d4ea9453b32c"

# Fetch the commit history from the `_target_branch` since then:
git fetch origin _target_branch --shallow-since "2020-10-15 16:52:12 -0500"
# `FETCH_HEAD` is pointing to tip of this branch, tag it:
git tag _target_branch FETCH_HEAD

# Now we can do the merge base and get the result we expect:
# (b150588d227ac0edc933e8f7f810d4ea9453b32c aka `_real_PR_base`)
git merge-base -a _target_branch _pull_request

Avoids the concern of fetching over a million commits 👍

Just need to provide context of date or number of commits to get shared history between the two branches. A similar issue (with much smaller repo size) was discussed here recently: #578 (comment) (lack of common history between the two branches caused the problem for the operation)

If you don't have that extra context, you can probably clone/fetch with --shallow-since at a date you'd consider reasonable, or likewise with --depth which should still accomplish getting common history whilst avoiding excessive history.


I didn't quite follow your intention of the example. You chose to clone the _target_branch with N commits, then bring in the _pull_request commits. At least with Github Actions, it would probably make more sense to use the Github context for PR commit count (includes merge commits, so can fetch a bit more but not excessively).

Once you have that, you can derive a date from a commit that should be a common ancestor between the two branches like I had done in my linked example. If you did the clone with --shallow-since instead of --depth, the plain fetch command for the PR history should find a common commit too if you need to approach it that way.

@marc-hb
Copy link

marc-hb commented Jun 13, 2023

you can probably clone/fetch with --shallow-since at a date you'd consider reasonable, or likewise with --depth which should still accomplish getting common history whilst avoiding excessive history.

Yes of course, if you "guess right" and fetch enough then git merge-base will work and will return some merge-base (which may or many not be the merge-base you want).

But that's just guessing; it does not give an automated solution that works every time, it does not give you something that you can script and forget about it.

@polarathene
Copy link

But that's just guessing; it does not give an automated solution that works every time

When does it not work when you fetch commits since 1 year ago? 5 years ago?

Obviously that's not as efficient, but you're not going to need more than that are you? If your PRs are branching off and merging back within a 3 month window, you could use that too, or if you know your PRs aren't likely to exceed 1000 commits that works well.

Depends what you're comfortable with if you for some reason are not able to provide better context.


it does not give you something that you can script and forget about it.

What is wrong with what I've shared? Github provides the commit count of a PR as context, you can use that to get the --shallow-since date for the target branch like I described and it works very well.

With your kernel example, my approach resulted in a little over 100 commits total fetched via --shallow-since, and the merge base was correct.

My approach does ensure you get sufficient commit history pulled in without having to bring in a full clone of a branches history (which would be wasteful with the linux kernel example if you're not truncating it with --shallow-since or similar that you're opposed to?).

So even if you don't use git merge-base it's still getting you all the commits of the PR branch. It can go back a bit further but not excessively compared to other solutions suggested. I'm not sure what you consider an alternative to git merge-base though?


I know you've cited niche scenarios (criss-cross), but it's not clear what your actual goal / expectation is for handling that. What would you do differently? At the criss-cross the two branches effectively effectively have a common branch history?

It's not uncommon in some projects I've contributed to for the main branch to add merge commits into a PR branch, but you're not really going to be merging a PR branch back into main and keep treating it as the same branch/PR are you? So is that really an applicable concern?

I'm open to real-world scenarios where my approach doesn't work well, and what you believe would be better. --unshallow fetches excess history without any benefit being cited where that makes a difference? (_if anything, you were not happy about fetching over a million commits which unless I'm mistaken is what will happen if you're remote/origin is not shallow)

@marc-hb
Copy link

marc-hb commented Jun 14, 2023

When does it not work when you fetch commits since 1 year ago? 5 years ago?
Obviously that's not as efficient, but you're not going to need more than that are you?

It's not just "inefficient", it's extremely slow with a large repo like the Linux kernel. This is going to systematically fetch many thousands of commits when you need only a few most of time (and a few thousands from time to time)

Plus this is still guessing some semi-random number of years.

Github provides the commit count of a PR as context, you can use that to get the --shallow-since date for the target branch like I described and it works very well.

How do you compute a date from a number of commits?!?

I know you've cited niche scenarios (criss-cross), [...] but you're not really going to be merging a PR branch back into main and keep treating it as the same branch/PR are you?

It maybe a niche scenario in Github but not in git:

Now to be honest I've heard many kernel developers (who invented git) say "don't use Github". It's not just about using git merge, it's also that Github's design is mostly incompatible with git range-diff https://gitlab.com/gitlab-org/gitlab/-/issues/24096

Off-topic but we've also experienced many performance issues with a kernel repo while much smaller repos in the same project were experiencing none. I think it got better.

@polarathene
Copy link

It's not just "inefficient", it's extremely slow with a large repo like the Linux kernel. This is going to systematically fetch many thousands of commits when you need only a few most of time (and a few thousands from time to time)

What would you do differently? I've seen you suggest --unshallow in a related issue which would retrieve far more commits than my approach?

When reproducing your 2020 example with the initial linux repo clone, I wasn't entirely sure what I needed history wise beyond 4.20/5.0 kernel release (later inspecting the referenced commits dates, realizing I could have retrieved less). --shallow-since worked great for that.

Here's a revised starting point if you know your actual commit/ref before and after points:

git init example && cd example
git remote add origin https://github.com/torvalds/linux
git remote add sof_remote_demo https://github.com/thesofproject/linux

# Fetch starting commit and assign a ref (branch):
git fetch sof_remote_demo --depth 1 b150588d227ac0edc933e8f7f810d4ea9453b32c:_real_PR_base

# Get commits between these two refs:
# (fetch commits since date of _real_PR_base up to a known commit that we assign ref _target_branch)
git fetch sof_remote_demo --shallow-since "2020-10-15 16:52:12 -0500" 70fe32e776dafb4b03581d62a4569f65c2f13ada:_target_branch

# 111 commits total
git rev-list _target_branch _real_PR_base | wc -l
# 292M size
du -sh
# 1.4G size after checkout:
git switch _real_PR_base

# Now make a change and add your _pull_request ref
# Optionally pull in commits from origin to include the _spurious_base
# Continue with exercise (but note there won't be millions of commits to pull from a subsequent clone due to how shallow this was)

You could have fetched _pull_request earlier instead of _real_PR_base if it were available, with sufficient depth to get the common commit ancestor to provide --shallow-since date for _target_branch. I used the commit explicitly to avoid fetching excess future history since the example is from years back, it could just be a real ref instead when suitable.


Plus this is still guessing some semi-random number of years.

You're setting a ceiling of what should be safe for the bulk of PRs in a project. There could be some outliers but often you're probably not dealing with merging a PR that branched off years ago are you? You'd probably want to rebase or similar at that point.

Many projects I've engaged with don't tend to have PRs active for such long durations. There's plenty that amass open PRs that cease to get updated, but other projects also close/reject PRs that are considered stale/abandoned in favor of opening a new one if activity will continue.

I'm not sure why we're debating this --shallow-since / --depth ceiling? I suggested it as a fallback in the event you don't have the better context. For many projects they are nothing like the linux kernel repo.. regardless what is an appropriate value to choose depends on the project. You're not going to need to rely on this fallback approach with Github Actions AFAIK, but I still consider it viable when you lack context.

What would you do differently?


How do you compute a date from a number of commits?!?

You just need a common ancestor commit between the two branches, and then take the commit date from it.

Did you miss the earlier example I provided you? Or was there a misunderstanding?:

# Go through the commit history available to find a useful commit in common with target branch:
# (b150588d227ac0edc933e8f7f810d4ea9453b32c aka `_real_PR_base`)
git rev-list --first-parent --max-parents=0 --max-count=1 _pull_request

# Get date of commit (2020-10-15 16:52:12 -0500):
git log --max-count=1 --date=iso8601 --format=%cd "b150588d227ac0edc933e8f7f810d4ea9453b32c"

# Fetch the commit history from the `_target_branch` since then:
git fetch origin _target_branch --shallow-since "2020-10-15 16:52:12 -0500"

For a workflow example with context and vars used, I linked you to this previously: #520 (comment)


It maybe a niche scenario in Github but not in git

I'm not disagreeing with you there, but:

  • This is a github issue for a Github CI action actions/checkout
  • The discussion is about PR branches and fetching enough history (likely they wanted only history since branching, not actually the entire history beyond that)

You're referring to I guess long-lived branches that are merging between each other? (the criss-cross)

I've not had any issues with the main/target branch being merged into the PR branch, or rebasing the PR branch. I can't think of a scenario where I've needed a PR branch that also merges into a target branch, and still continues to be iterated on instead of a separate PR branch.

I'm sure that's probably a thing as the LWN article seemed to discuss, and your projects rebase branch, but I'm not really following what you're suggesting should be done differently than the approach I've shared?

Now to be honest I've heard many kernel developers (who invented git) say "don't use Github".

That's fine. That's a very different workflow from what most users need in their projects, and the linux kernel isn't using Github for CI jobs right?


If you've got a better solution that can be used on Github, that'd be great to know. My approach to retrieving both branches relevant commit histories should meet the needs of most projects (even handled your kernel PR example with git merge-base).

I'd also appreciate any reproducible examples of real-world scenarios where it will falter and where a better alternative that can be used instead is provided.

@marc-hb
Copy link

marc-hb commented Jun 14, 2023

Did you miss the earlier example I provided you?

# Go through the commit history available to find a useful commit in common with target branch
git rev-list --first-parent --max-parents=0 --max-count=1 _pull_request

This does not make sense to me sorry. I'm merely saying that git merge-base is not compatible with shallow clones and you keep giving commands that only work in deep or "deep enough" clones for some hardcoded variants of "deep enough". A hypothetical Github Action that solves the problem stated in the description at the top of this issue should work for every PR in every repo.

Also, you keep posting very long comments which makes it very hard to extract the bits that matter.

https://en.wikipedia.org/wiki/Sealioning

@polarathene
Copy link

polarathene commented Jun 14, 2023

you keep posting very long comments which makes it very hard to extract the bits that matter.

I was already making an effort to be terse 😅

This does not make sense to me sorry.

I can't meet your request to be terse and break this stuff down further for you. The command does what is needed to find an appropriate commit.

git merge-base is not compatible with shallow clones

Not compatible when there is insufficient history. My approach gets you sufficient history, maybe try it out?

you keep giving commands that only work in deep or "deep enough" clones for some hardcoded variants of "deep enough".

I hard-coded them for the examples. I mentioned using github context is better and linked to an example, but I wanted to try keep it simple for you.

Here's how you get the "deep enough" commits:

- name: 'PR commits + 1'
  run: echo "PR_FETCH_DEPTH=$(( ${{ github.event.pull_request.commits }} + 1 ))" >> "${GITHUB_ENV}"

- name: 'Checkout PR branch and all PR commits'
  uses: actions/checkout@v3
    with:
      ref: ${{ github.event.pull_request.head.sha }}
      fetch-depth: ${{ env.PR_FETCH_DEPTH }}
Alternative fetch example
- name: 'Checkout the latest commit of the PR branch (at the time this event originally triggered)'
  uses: actions/checkout@v3
    with:
      ref: ${{ github.event.pull_request.head.sha }}

- name:  'Example - Fetch full PR commit history'
  run: |
    git fetch --no-tags --prune --no-recurse-submodules \
      --depth=$(( ${{ github.event.pull_request.commits }} + 1 )) \
      origin +${{ github.event.pull_request.head.sha }}:remotes/origin/${{ github.event.pull_request.head.ref }}

A hypothetical Github Action that solves the problem stated in the description at the top of this issue should work for every PR in every repo.

Cool, well until one actually exists, my Github Actions workflow example can be adapted: #520 (comment)

@marc-hb
Copy link

marc-hb commented Jun 15, 2023

I can't meet your request to be terse and break this stuff down further for you.

I never asked you to "break down" anything, quite the opposite.

It maybe a niche scenario in Github but not in git

I'm not disagreeing with you there, but: - This is a github issue for a Github CI action actions/checkout

Guess what: Github lets you import ANY git repo. Maybe Github itself should say "don't import the Linux kernel" and not just kernel developers?

That's a very different workflow from what most users need in their projects,..

Totally agreed. Yet an official git action should work all the time. Or at the very least be ready for all cases and fail gracefully when it can't do.

... and the linux kernel isn't using Github for CI jobs right?

This entire discussion started with you questioning my data posted in thesofproject/linux#2556 ?!?

I hard-coded them for the examples. I mentioned using github context is better and linked to an example, but I wanted to try keep it simple for you.

(Counter-)examples can only show when something does not work; examples can never show that a generic solution always works. “Program testing can be used to show the presence of bugs, but never to show their absence” (Dijkstra).

Yes, examples are very useful as a "simple" way to introduce concepts new to some people but they're just noise when trying to prove that something always works. An example is never an answer to a counter example. BTW someone showing a counter example is obviously not new to a topic.

Finally, --shallow-since never made any sense; with or without hardcoded examples.

the command does what is needed to find an appropriate commit.

Which command? you posted so many of them.

@marc-hb
Copy link

marc-hb commented Jun 15, 2023

PR_FETCH_DEPTH=$(( ${{ github.event.pull_request.commits }} + 1 ))

fetch-depth: ${{ env.PR_FETCH_DEPTH }}

Ah, this command/approach. Thanks for finally dropping all the noise about hardcoded commits, hardcoded number of years and pointless --shallow-since and for focusing back on some actually generic solution.

So now I remember that I did try this approach and I agree it does work every time to find one merge-base. It's great for the very large majority of Github users - but not for all people / a generic Github Action.

That's because it's massively inefficient for git repos that use git merge, so it's not "shallow" anymore.

it would probably make more sense to use the Github context for PR commit count (includes merge commits, so can fetch a bit more but not excessively).

A LOT more actually for a project that uses git merge a lot. In this example below, git fetch origin --depth 806 downloads 1.1 million commits which is SLOWER than downloading everything.

curl -L   -H "Accept: application/vnd.github+json"    -H "X-GitHub-Api-Version: 2022-11-28" \
    https://api.github.com/repos/thesofproject/linux/pulls/4393

  => 806 commits

cd test-repo
git init
git remote add origin https://github.com/thesofproject/linux/

# 1 commit, 45 seconds
time git fetch origin --depth   1  2f0c3e8b882303de468060e35769c0f79465f848
# (start over)
# Full clone, 7 minutes
time git fetch origin              2f0c3e8b882303de468060e35769c0f79465f848
# (start over)
# 10 minutes, LONGER than a full clone!
time git fetch origin --depth 806  2f0c3e8b882303de468060e35769c0f79465f848

git log --oneline FETCH_HEAD | wc -l
  => 1.1 million commits!

If you've got a better solution that can be used on Github, that'd be great to know.

The more efficient approach that I did implement and already linked to is to iterate and git fetch all the commits in the PR one by one. If you start with the "older" commits then most of the other git fetch should be no-ops.

I've implemented this method and put it in production but only in a small repo. I haven't used it "in anger" on a Linux kernel repo yet. The API used in the current implementation has a 250 commits limit. This may or may not be what you want/need.

EDIT: correction, that implementation isn't finished yet, it's still using an inefficient count and depth for now.

@polarathene
Copy link

polarathene commented Jun 16, 2023

I never asked you to "break down" anything, quite the opposite.

You said you didn't understand a command. I responded that if I elaborated, it wouldn't be terse like you requested.

Guess what: Github lets you import ANY git repo.

There is a lot of things you can do in life, doesn't mean you should. To remain terse (and on topic), this isn't worth debating.

an official git action should work all the time. Or at the very least be ready for all cases and fail gracefully when it can't do.

One would hope for the best yes.

In my experience it doesn't matter how well funded a company is, these issues lasting for years is common to encounter, even if they're open-source and solutions are provided via PR it can take quite some time before action is taken.


This entire discussion started with you questioning my data ?!?

Yes? I was curious, so I tried my approach with git merge-base and it resulted in a shallow clone of a little over 100 commits and working git merge-base 🤷‍♂️

Most of the discussion after my first response was repeating myself because you wouldn't read what was said the first time 🙄

but they're just noise when trying to prove that something always works.

  1. I shared a sequence of commands to reproduce your 2020 script example, but with the approach I shared as a better solution.
  2. Since I wanted to avoid noise in the example, I used hard-coded values but comments explained them and I linked you to the workflow example with values sourced from github context very early on.

Clearly I communicated that poorly and could have done better... or it'd still be glossed over and variables considered noise/overhead to explain. Can't really predict that up front 🤷‍♂️

--shallow-since never made any sense; with or without hardcoded examples.

If you say so

@polarathene
Copy link

Which command? you posted so many of them.

The git rev-list one you quoted and said didn't make sense? 🙄


Ah, this command/approach.
Thanks for finally dropping all the noise about hardcoded commits, hardcoded number of years and pointless --shallow-since and for focusing back on some actually generic solution.

That was only for getting the number of commits.

I shared this via a link in my very first sentence to you: #552 (comment)

The rest was about bringing in the other branch for the git merge-base while avoiding over a million commits fetched.


If you start with the "older" commits then most of the other git fetch should be no-ops.

I'm not sure how you propose doing that with the PR you referenced?

Warning: Verbosity ahead

Here is the end of it with the base commit sha a714334 highlighted (this is available in the JSON you used a curl request from, or via github context):

Screenshot_20230616_175310

That highlighted commit and the one below it are listed in the PR commit history after the merge commits:

Screenshot_20230616_175646

The commits belonging to the merges then follow afterwards and can be seen from the top of this git tree:

Screenshot_20230616_183639

I'm not sure what to make of that 🤷‍♂️ what is your starting point for older commits? The PR base commit a714334 is newer than many of the commits belonging to merge commits, there is no easy way to use fetch depth reliably here.

Due to all those cascading merges, there isn't much actual commits unique to the PR branch itself, I'm really not sure how you'd plan to only fetch the 806 individual commits without bringing in a bunch of excess? (as you can see the original merged branches are truncated, they have more history technically)


With enough history (--shallow-since 2023-02-01) this does provide the expected commit count:

# 806 commits:
git rev-list --count 5224191...a714334

# Wasteful excess though:
git rev-list --count --all
20879

However if we fetch commits since the date of the oldest commit, we should have those commits but something seems off:

# Commits sorted by timestamp, output the oldest commit:
$ git rev-list --timestamp 5224191...a714334 | sort -n | head -n 1
1680775471 430cac487400494c19a8b85299e979bb07b4671f

$ git log --max-count=1 --date=iso8601 --format=%cd 430cac487400494c19a8b85299e979bb07b4671f
2023-04-06 12:04:31 +0200

$ git init
$ git remote add origin https://github.com/thesofproject/linux

# Fetch commits this far back:
$ git fetch origin --no-tags --prune --no-recurse-submodules --shallow-since "2023-04-06 12:04:31 +0200" '+5224191371f701bbf0e10bb49bd5cd3fda9848a5:remotes/origin/merge/sound-upstream-20230530'

# More than 806?
$ git rev-list --count 5224191...a714334
1223

$ git rev-list --count --all
5049

This is probably something you'd know the answer to. Presumably all the commits are there still, but some are already applied / shared further back in history?

The more efficient approach that I did implement and already linked to is to iterate and git fetch all the commits in the PR one by one.

I'm not sure how your staggered fetching would handle the above experience any better? 🤔

You said you've linked to it but all I see is git fetch --depth with an awkward way to retrieve the depth instead of using context? (presumably due to Travis, before GH Actions)

@polarathene
Copy link

polarathene commented Jun 16, 2023

EDIT: correction, that implementation isn't finished yet, it's still using an inefficient count and depth for now.

So you're fetching over a million commits still, while saying my advice is inefficient? (with PR 4393, since you chose that outlier)

Most of the PRs on the thesofproject/linux aren't these massive piles of commits, those are special case and could be handled in a separate manner if preferred.


Fallback (niche scenario) - Only double the time to fetch vs single commit

The fallback advice I gave with --shallow-since and a relative offset is probably applicable to your large PRs (with the cascading merges).

# 1 minute (324 MiB) - Approx 20k commits fetched:
$ time git fetch origin --no-tags --prune --no-recurse-submodules --shallow-since "2023-02-01" \
'+5224191371f701bbf0e10bb49bd5cd3fda9848a5:remotes/origin/merge/sound-upstream-20230530'

real    1m1.267s
user    0m27.861s
sys     0m3.801s


# 40 sec (237 MiB):
$ time git fetch origin --no-tags --prune --no-recurse-submodules --depth 1 \
'+5224191371f701bbf0e10bb49bd5cd3fda9848a5:remotes/origin/merge/sound-upstream-20230530'


# 45 minutes (1GB VPS instance, memory starved by git) - Over a million commits fetched:
$ time git fetch origin --no-tags --prune --no-recurse-submodules --depth 806 \
'+5224191371f701bbf0e10bb49bd5cd3fda9848a5:remotes/origin/merge/sound-upstream-20230530'

As you can see close to 6 months isn't too much overhead, you'd probably end up with only double the time of a single commit fetch, despite fetching tens of thousands.

For kernel cycles, I'd imagine a relative date offset for fetching like this is predictable enough to know if 6 months is a large enough window for you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants