New sparse-checkout builtin and "cone" mode #316
19ad979
to
568fda2
Compare
/submit
Submitted as pull.316.git.gitgitgadget@gmail.com
unpack-trees.c
@@ -1397,15 +1397,23 @@ static int clear_ce_flags(struct index_state *istate,
struct exclude_list *el)
On the Git mailing list, Elijah Newren wrote (reply to this):
On Tue, Aug 20, 2019 at 8:12 AM Jeff Hostetler via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Jeff Hostetler <jeffhost@microsoft.com>
Can the commit summary be turned into English?
> The clear_ce_flags_1 method is used by many types of calls to
> unpack_trees(). Add trace2 regions around the method, including
> some flag information, so we can get granular performance data
> during experiments.
It might be nice to have some words in the cover letter about why this
patch is included in this series instead of being a separate
submission. I'm not familiar with the trace2 stuff yet; this looks
probably useful, but the commit message makes it sound like something
general rather than specific to this series.
> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
<snip>
7cb542c
to
7b2b121
Compare
/submit
Submitted as pull.316.v6.git.1574373891.gitgitgadget@gmail.com
This patch series was integrated into pu via git@3fa2852.
This patch series was integrated into pu via git@4a63815.
This patch series was integrated into pu via git@4126d07.
This patch series was integrated into pu via git@bba10d2.
This patch series was integrated into pu via git@c2cdfc4.
This patch series was integrated into pu via git@c671989.
This patch series was integrated into pu via git@0d091dd.
This patch series was integrated into pu via git@0f159e4.
This patch series was integrated into pu via git@a3525f8.
This patch series was integrated into pu via git@41ac7e4.
This patch series was integrated into pu via git@380642b.
This patch series was integrated into pu via git@0341d5f.
This patch series was integrated into pu via git@25c9911.
This patch series was integrated into pu via git@7bf36df.
This branch is now known as
This patch series was integrated into pu via git@c3fcb82.
This patch series was integrated into pu via git@553c471.
This patch series was integrated into next via git@c840c1d.
This patch series was integrated into pu via git@5f635d6.
This patch series was integrated into pu via git@1362a31.
This patch series was integrated into pu via git@87f7b46.
This patch series was integrated into pu via git@7d920a1.
This patch series was integrated into pu via git@bd72a08.
This patch series was integrated into master via git@bd72a08.
Closed via bd72a08.
This series makes the sparse-checkout feature more user-friendly. While there, I also present a way to use a limited set of patterns to gain a significant performance boost in very large repositories.
Sparse-checkout is only documented as a subsection of the read-tree docs [1], which makes the feature hard to discover. Users have trouble navigating the feature, especially at clone time [2], and have even resorted to creating their own helper tools [3].
This series attempts to solve these problems using a new builtin. Here is a sample workflow to give a feeling for how it can work:
In an existing repo:
At clone time:
Here are some more specific details:
git sparse-checkout init enables core.sparseCheckout and populates the sparse-checkout file with patterns that match only the files at root.
git clone learns the --sparse argument to run git sparse-checkout init before the first checkout.
git sparse-checkout set reads patterns from the arguments, or with --stdin reads patterns from stdin one per line, then writes them to the sparse-checkout file and refreshes the working directory.
git sparse-checkout disable removes the patterns from the sparse-checkout file, disables core.sparseCheckout, and refills the working directory.
git sparse-checkout list lists the contents of the sparse-checkout file.
The documentation for the sparse-checkout feature can now live primarily with the git-sparse-checkout documentation.
Cone Mode
What really got me interested in this area is a performance problem. If we have N patterns in the sparse-checkout file and M entries in the index, then we can perform up to O(N * M) pattern checks in clear_ce_flags(). This quadratic growth is not sustainable in a repo with 1,000+ patterns and 1,000,000+ index entries.
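To make the O(N * M) cost concrete, here is a rough Python sketch of a naive matcher that walks the pattern list for every index entry. It is only an illustration: `naive_included` and `total_checks` are hypothetical names, `fnmatchcase` is a stand-in for Git's real pattern matcher (which has different wildcard and directory semantics), and the "last matching pattern wins" loop is a simplification of what clear_ce_flags() does.

```python
from fnmatch import fnmatchcase


def naive_included(path, patterns):
    """Check one path against every pattern; the last matching pattern wins.

    Returns (included, number_of_pattern_checks).
    """
    checks = 0
    for pat in reversed(patterns):
        checks += 1
        negated = pat.startswith("!")
        glob = pat[1:] if negated else pat
        if fnmatchcase(path, glob):  # stand-in for Git's matcher
            return (not negated), checks
    return False, checks


def total_checks(paths, patterns):
    """Sum the pattern checks over all index entries."""
    return sum(naive_included(p, patterns)[1] for p in paths)


# N patterns and M entries; entries that match nothing hit the worst case.
patterns = ["dir%d/*" % i for i in range(1000)]
paths = ["other/file%d.txt" % j for j in range(50)]
worst_case = total_checks(paths, patterns)
```

With entries that match no pattern, every entry pays for all N checks, so the total is exactly N * M. At 1,000 patterns and 1,000,000 entries that is on the order of a billion pattern checks.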
To solve this problem, I propose a new, more restrictive mode to sparse-checkout: "cone mode". In this mode, all patterns are based on prefix matches at a directory level. This can then use hashsets for fast performance -- O(M) instead of O(N*M). My hashset implementation is based on the virtual filesystem hook in the VFS for Git custom code [4].
In cone mode, a user specifies a list of folders whose contents should be included recursively. In addition, the cone includes all blobs that are siblings of the folders on the directory path leading to each such folder. This makes the directories look "hydrated" as a user drills down to those recursively-closed folders. The directories along the path are called "parent" folders, as a file matches them only if the file's immediate parent is that directory.
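The matching rule for recursively-closed and "parent" folders can be sketched with two sets. This is a minimal Python illustration, not the C hashmap code in this series: `build_cone` and `in_cone` are hypothetical names, paths are assumed to be '/'-separated and relative to the repository root, and the root is modeled as the empty string so that files at root always match.

```python
def build_cone(recursive_dirs):
    """Split the user's folder list into recursive and parent sets."""
    recursive = set()
    parents = {""}  # the repository root is always a parent folder
    for d in recursive_dirs:
        parts = d.strip("/").split("/")
        recursive.add("/".join(parts))
        # Every proper ancestor of a recursive folder is a parent folder.
        for i in range(1, len(parts)):
            parents.add("/".join(parts[:i]))
    return recursive, parents


def in_cone(path, recursive, parents):
    """Return True if the file at `path` belongs to the cone."""
    parts = path.split("/")
    # Ancestor directories of the file: "", "A", "A/B", ...
    ancestors = ["/".join(parts[:i]) for i in range(len(parts))]
    # Parent folders match only files whose immediate parent they are.
    if ancestors[-1] in parents:
        return True
    # Recursive folders match files at any depth below them.
    return any(d in recursive for d in ancestors)


recursive, parents = build_cone(["A/B/C"])
```

Each index entry costs one set lookup per path component, regardless of how many folders are in the cone, which is where the O(M) behavior comes from if path depth is treated as bounded.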
When building a prototype of this feature, I used a separate file to contain the list of recursively-closed folders and built the hashsets dynamically based on that file. In this implementation, I tried to maximize the amount of backwards-compatibility by storing all data in the sparse-checkout file using patterns recognized by earlier Git versions.
For example, if we add A/B/C as a recursive folder, then we add the following patterns to the sparse-checkout file:
The alternating positive/negative patterns say "include everything in this folder, but exclude everything another level deeper". The final pattern has no matching negation, so is a recursively closed pattern.
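As a rough sketch of how such a pattern list could be generated, the Python below expands a folder list into alternating include/exclude lines. The function name `cone_patterns` is hypothetical, and the exact line format is an assumption taken from the cone-mode pattern set described in the git-sparse-checkout documentation and from the v2 note that only the first "/*" pattern ends in an asterisk. It does not remove folders made redundant by a recursive ancestor.

```python
def cone_patterns(recursive_dirs):
    """Return sparse-checkout lines for the given recursive folders."""
    parents = set()
    recursive = set()
    for d in recursive_dirs:
        parts = d.strip("/").split("/")
        recursive.add("/" + "/".join(parts))
        # Proper ancestors become parent folders.
        for i in range(1, len(parts)):
            parents.add("/" + "/".join(parts[:i]))

    # Root: include files at root, exclude every directory below it.
    lines = ["/*", "!/*/"]
    for path in sorted(parents | recursive):
        lines.append(path + "/")  # include everything in this folder...
        if path not in recursive:
            lines.append("!" + path + "/*/")  # ...but nothing a level deeper
    return lines


patterns_for_abc = cone_patterns(["A/B/C"])
```

For A/B/C this yields seven lines: the two root patterns, an include/exclude pair each for /A/ and /A/B/, and a final /A/B/C/ with no negation.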
Note that I have some basic warnings to try and check that the sparse-checkout file doesn't match what would be written by a cone-mode add. In such a case, Git writes a warning to stderr and continues with the old pattern matching algorithm. These checks are currently very barebones, and would need to be updated with more robust checks for things like regex characters in the middle of the pattern. As review moves forward (and if we don't change the data storage) then we could spend more time on this.
Thanks,
-Stolee
Updates in v2, relative to the RFC:
Instead of an 'add' subcommand, use a 'set' subcommand. We can consider adding 'add' and/or 'remove' subcommands later.
'set' reads from the arguments by default. '--stdin' option is available.
A new performance-oriented commit is added at the end.
Patterns no longer end with a trailing asterisk except for the first "/*" pattern.
References to a "bug" (that was really a strange GVFS interaction in microsoft/git) around deleting outside the cone are removed.
Updates in v3:
The bad interaction with "cone mode" and .gitignore files is fixed. A test is added in the last patch.
Several patches are added that make the feature more robust. One sanitizes user input, a few others add progress indicators, and several more prevent users from getting into bad states due to working directory changes or concurrent processes.
Updated several docs and commit messages according to feedback. Thanks, Elijah!
Updates in v4:
Updated hashmap API usage to respond to ew/hashmap.
Responded to detailed review by Elijah. Thanks!
Marked the feature as experimental in git-sparse-checkout.txt the same way that git-switch.txt does.
Updates in v5:
The 'set' subcommand now enables the core.sparseCheckout config setting (unless the checkout fails).
If the in-process unpack_trees() fails with the new patterns, the index.lock file is rolled back before the replay of the old sparse-checkout patterns.
Some documentation fixes, f(d)open->xf(d)open calls, and other nits. Thanks everyone!
Updates in v6:
The init, set, and disable commands now require a clean status.
Git config is now set in-process instead of via a run_command call.
The working directory is no longer updated twice, which led to multiple errors being shown if the working directory would become empty.
Before, only the 'set' command used the in-process workdir update. Now 'init' and 'disable' also use this in-process code, which removes some error cases.
Things to leave for future patches:
Integrate in 'git worktree add' to copy the sparse-checkout file to a worktree-specific file.
More robustness around detecting non-cone patterns with wildcards in the middle of the line.
'git clone --sparse-cone' to clone into "cone mode" sparse-checkouts (i.e. set 'core.sparseCheckoutCone=true'). This may not be super-valuable, as it only starts changing behavior when someone calls 'git sparse-checkout set', but may be interesting.
Make the working-directory update not modify the staging environment. Block only if it would lose work-in-progress.
Some robustness things can be saved for later, such as including pattern arguments next to "--stdin", "set --cone", etc.
[1] https://git-scm.com/docs/git-read-tree#_sparse_checkout
Sparse-checkout documentation in git-read-tree.
[2] https://stackoverflow.com/a/4909267/127088
Is it possible to do a sparse checkout without checking out the whole repository first?
[3] http://www.marcoyuen.com/articles/2016/06/07/git-sparse.html
A blog post of a user's extra "git-sparse" helper.
[4] git/git@fc5fd70...01204f2
The virtual filesystem hook in microsoft/git.
Cc: newren@gmail.com, jon@jonsimons.org, szeder.dev@gmail.com