New sparse-checkout builtin and "cone" mode #316

derrickstolee · 2019-08-20T14:28:34Z

This series makes the sparse-checkout feature more user-friendly. While there, I also present a way to use a limited set of patterns to gain a significant performance boost in very large repositories.

Sparse-checkout is only documented as a subsection of the read-tree docs [1], which makes the feature hard to discover. Users have trouble navigating the feature, especially at clone time [2], and have even resorted to creating their own helper tools [3].

This series attempts to solve these problems using a new builtin. Here is a sample workflow to give a feeling for how it can work:

In an existing repo:

$ git sparse-checkout init
$ ls
myFile1.txt myFile2.txt
$ git sparse-checkout set "/*" "!/*/" /myFolder/
$ ls
myFile1.txt myFile2.txt myFolder
$ ls myFolder
a.c a.h
$ git sparse-checkout disable
$ ls
hiddenFolder myFile1.txt myFile2.txt myFolder

At clone time:

$ git clone --sparse origin repo
$ cd repo
$ ls
myFile1.txt myFile2.txt
$ git sparse-checkout set "/*" "!/*/" /myFolder/
$ ls
myFile1.txt myFile2.txt myFolder

Here are some more specific details:

git sparse-checkout init enables core.sparseCheckout and populates the sparse-checkout file with patterns that match only the files at root.
git clone learns the --sparse argument to run git sparse-checkout init before the first checkout.
git sparse-checkout set reads patterns from the arguments, or with --stdin reads patterns from stdin one per line, then writes them to the sparse-checkout file and refreshes the working directory.
git sparse-checkout disable removes the patterns from the sparse-checkout file, disables core.sparseCheckout, and refills the working directory.
git sparse-checkout list lists the contents of the sparse-checkout file.

The documentation for the sparse-checkout feature can now live primarily with the git-sparse-checkout documentation.

Cone Mode

What really got me interested in this area is a performance problem. If we have N patterns in the sparse-checkout file and M entries in the index, then we can perform up to O(N * M) pattern checks in clear_ce_flags(). This quadratic growth is not sustainable in a repo with 1,000+ patterns and 1,000,000+ index entries.

To solve this problem, I propose a new, more restrictive mode to sparse-checkout: "cone mode". In this mode, all patterns are based on prefix matches at a directory level. This can then use hashsets for fast performance -- O(M) instead of O(N*M). My hashset implementation is based on the virtual filesystem hook in the VFS for Git custom code [4].

In cone mode, a user specifies a list of folders which the user wants every file inside. In addition, the cone adds all blobs that are siblings of the folders in the directory path to that folder. This makes the directories look "hydrated" as a user drills down to those recursively-closed folders. These directories are called "parent" folders, as a file matches them only if the file's immediate parent is that directory.

When building a prototype of this feature, I used a separate file to contain the list of recursively-closed folders and built the hashsets dynamically based on that file. In this implementation, I tried to maximize the amount of backwards-compatibility by storing all data in the sparse-checkout file using patterns recognized by earlier Git versions.

For example, if we add A/B/C as a recursive folder, then we add the following patterns to the sparse-checkout file:

/*
!/*/
/A/
!/A/*/
/A/B/
!/A/B/*/
/A/B/C/

The alternating positive/negative patterns say "include everything in this folder, but exclude everything another level deeper". The final pattern has no matching negation, so is a recursively closed pattern.

Note that I have some basic warnings to try and check that the sparse-checkout file doesn't match what would be written by a cone-mode add. In such a case, Git writes a warning to stderr and continues with the old pattern matching algorithm. These checks are currently very barebones, and would need to be updated with more robust checks for things like regex characters in the middle of the pattern. As review moves forward (and if we don't change the data storage) then we could spend more time on this.

Thanks,
-Stolee

Updates in v2, relative to the RFC:

Instead of an 'add' subcommand, use a 'set' subcommand. We can consider adding 'add' and/or 'remove' subcommands later.
'set' reads from the arguments by default. '--stdin' option is available.
A new performance-oriented commit is added at the end.
Patterns no longer end with a trailing asterisk except for the first "/*" pattern.
References to a "bug" (that was really a strange GVFS interaction in microsoft/git) around deleting outside the cone are removed.

Updates in v3:

The bad interaction with "cone mode" and .gitignore files is fixed. A test is added in the last patch.
Several patches are added that make the feature more robust. One sanitizes user input, another few add progress indicators, and another more prevent users from getting in bad states due to working directory changes or concurrent processes.
Updated several docs and commit messages according to feedback. Thanks, Elijah!

Updates in V4:

Updated hashmap API usage to respond to ew/hashmap
Responded to detailed review by Elijah. Thanks!
Marked the feature as experimental in git-sparse-checkout.txt the same way that git-switch.txt does.

Updates in V5:

The 'set' subcommand now enables the core.sparseCheckout config setting (unless the checkout fails).
If the in-process unpack_trees() fails with the new patterns, the index.lock file is rolled back before the replay of the old sparse-checkout patterns.
Some documentation fixes, f(d)open->xf(d)open calls, and other nits. Thanks everyone!

Updates in V6:

The init, set, and disable commands now require a clean status.
Git config is now set in-process instead of via a run_command call.
The working directory was being updated twice, leading to multiple errors being shown if the working directory would become empty.
Before, only the 'set' command used the in-process workdir update. Now 'init' and 'disable' also use this in-process code, which removes some error cases.

Things to leave for future patches:

Integrate in 'git worktree add' to copy the sparse-checkout file to a worktree-specific file.
More robustness around detecting non-cone patterns with wildcards in the middle of the line.
'git clone --sparse-cone' to clone into "cone mode" sparse-checkouts (i.e. set 'core.sparseCheckoutCone=true'). This may not be super-valuable, as it only starts changing behavior when someone calls 'git sparse-checkout set', but may be interesting.
Make the working-directory update not modify the staging environment. Block only if it would lose work-in-progress.
Some robustness things can be saved for later, such as including pattern arguments next to "--stdin", "set --cone", etc.

[1] https://git-scm.com/docs/git-read-tree#_sparse_checkout
Sparse-checkout documentation in git-read-tree.

[2] https://stackoverflow.com/a/4909267/127088
Is it possible to do a sparse checkout without checking out the whole repository first?

[3] http://www.marcoyuen.com/articles/2016/06/07/git-sparse.html
A blog post of a user's extra "git-sparse" helper.

[4] git/git@fc5fd70...01204f2
The virtual filesystem hook in microsoft/git.

Cc: newren@gmail.com, jon@jonsimons.org, szeder.dev@gmail.com

derrickstolee · 2019-08-20T15:10:24Z

/submit

gitgitgadget · 2019-08-20T15:11:26Z

Submitted as pull.316.git.gitgitgadget@gmail.com

gitgitgadget · 2019-08-21T21:53:11Z

On the Git mailing list, Elijah Newren wrote (reply to this):

On Tue, Aug 20, 2019 at 8:12 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> This RFC includes a potential direction to make the sparse-checkout more
> user-friendly. While there, I also present a way to use a limited set of
> patterns to gain a significant performance boost in very large repositories.
>
> Sparse-checkout is only documented as a subsection of the read-tree docs
> [1], which makes the feature hard to discover. Users have trouble navigating
> the feature, especially at clone time [2], and have even resorted to
> creating their own helper tools [3].

Ooh, intriguing.  Count me as another person who has resorted to
making my own helper tool for others to use (specific to our internal
repository, though, as it also figures out inter-module dependencies
to allow specifying only a few modules of interest while still
checking out everything needed to build those; but it'd be nice to
need less scripting to handle the git-related bits to actually
sparsify or densify).

> This RFC attempts to solve these problems using a new builtin. Here is a
> sample workflow to give a feeling for how it can work:
>
> In an existing repo:
>
> $ git sparse-checkout init
> $ ls
> myFile1.txt myFile2.txt
>
> $ git sparse-checkout add
> /myFolder/*
> ^D
> $ ls
> myFile1.txt myFile2.txt myFolder
> $ ls myFolder
> a.c a.h
> $ git sparse-checkout disable
> $ ls
> hiddenFolder myFile1.txt myFile2.txt myFolder
>
> At clone time:
>
> $ git clone --sparse origin repo
> $ cd repo
> $ ls
> myFile1.txt myFile2.txt
> $ git sparse-checkout add
> /myFolder/*
> ^D
> $ ls
> myFile1.txt myFile2.txt myFolder
>
> Here are some more specific details:
>
>  * git sparse-checkout init enables core.sparseCheckout and populates the
>    sparse-checkout file with patterns that match only the files at root.

Does it enable core.sparseCheckout in the current worktree, or for all
worktrees?  Do we require extensions.worktreeConfig to be set to true
first?  If we don't require extensions.worktreeConfig to be set to
true, and users add worktrees later, do they encounter negative
surprises (immediately or later)?

worktrees in combination with sparseCheckouts were a headache here
until I just forced people to manually first set
extensions.worktreeConfig to true before using my 'sparsify' script,
regardless of whether the user was currently using worktrees.  That
fixed the issues, but having to provide a long error message and
explanation of why I wanted users to set some special config first was
slightly annoying.

I wonder if 'git worktree' and maybe even 'git config' should
themselves have special handling for core.sparseCheckouts, because it
can be a real mess otherwise.

>  * git clone learns the --sparse argument to run git sparse-checkout init
>    before the first checkout.

Nice.

>  * git sparse-checkout add reads patterns from stdin, one per line, then
>    adds them to the sparse-checkout file and refreshes the working
>    directory.

The default of reading from stdin seems a bit unusual to me, and I
worry about having to explain that to users.  I'd rather the add
command took positional parameters (anything that doesn't start with a
hyphen) and added those, e.g.
  $ git sparse-checkout add '/myFolder/*' '
with the option of the user specifying --stdin.

>  * git sparse-checkout disable removes the patterns from the sparse-checkout
>    file, disables core.sparseCheckout, and refills the working directory.

Does it leave an empty sparse-checkout file around?  Also, what if
users have several paths defining their sparse pattern, and want to
temporarily get a full checkout and then come back -- do they need to
re-specify all the paths?  (Maybe this *is* the route we want to go;
I'm just trying to mention any possible negative effects we _might_
run into so we can consider them.  It's not quite as relevant in my
case since people specify a few toplevel modules and sparse-checkout
gets several entries auto-generated for them.)

Also, I'm particularly worried that a user with multiple worktrees,
both sparse, could run 'git sparse-checkout disable' in one and then
find that when they return to the other worktree they get a variety of
nasty surprises (e.g. accidental staging or committing of the deletion
of a huge number of files, random weird errors, or gradual and weird
de-sparsifying as various git commands are run).  This, of course, can
be averted by making sure core.sparseCheckout is set on a per-worktree
basis, but that seems to be something people only do after running
into problems several times unless some kind of tooling enforces it.

>  * git sparse-checkout list lists the contents of the sparse-checkout file.
>
>
>
> The documentation for the sparse-checkout feature can now live primarily
> with the git-sparse-checkout documentation.

Yaay!

> Cone Mode
> =========
>
> What really got me interested in this area is a performance problem. If we
> have N patterns in the sparse-checkout file and M entries in the index, then
> we can perform up to O(N * M) pattern checks in clear_ce_flags(). This
> quadratic growth is not sustainable in a repo with 1,000+ patterns and
> 1,000,000+ index entries.

This has worried me for a while, even if it hasn't yet caused us
issues in practice.

> To solve this problem, I propose a new, more restrictive mode to
> sparse-checkout: "cone mode". In this mode, all patterns are based on prefix
> matches at a directory level. This can then use hashsets for fast
> performance -- O(M) instead of O(N*M). My hashset implementation is based on
> the virtual filesystem hook in the VFS for Git custom code [4].

Sweet!

> In cone mode, a user specifies a list of folders which the user wants every
> file inside. In addition, the cone adds all blobs that are siblings of the
> folders in the directory path to that folder. This makes the directories
> look "hydrated" as a user drills down to those recursively-closed folders.
> These directories are called "parent" folders, as a file matches them only
> if the file's immediate parent is that directory.
>
> When building a prototype of this feature, I used a separate file to contain
> the list of recursively-closed folders and built the hashsets dynamically
> based on that file. In this implementation, I tried to maximize the amount
> of backwards-compatibility by storing all data in the sparse-checkout file
> using patterns recognized by earlier Git versions.
>
> For example, if we add A/B/C as a recursive folder, then we add the
> following patterns to the sparse-checkout file:
>
> /*
> !/*/*
> /A/*
> !/A/*/*
> /A/B/*
> !/A/B/*/*
> /A/B/C/*
>
> The alternating positive/negative patterns say "include everything in this
> folder, but exclude everything another level deeper". The final pattern has
> no matching negation, so is a recursively closed pattern.

Oh, um, would there be any option for fast but without grabbing
sibling and parent files of requested directories?  And could users
still request individual files (not with regex or pathspec, but fully
specifying the path) and still get the fast mode?

Basically, our sparse usage is exclusively specifying leading
directories or full pathnames of individual files, but we really want
the repo to feel smaller and make sure people notice at a glance.  We
have a huge 'modules/' directory, and want people to be able to get
just 15 of the 500 or so subdirectories that would appear in that
directory with a non-sparse checkout.  And similarly we want to be
able to grab just one or two files from a directory of many files.

> Note that I have some basic warnings to try and check that the
> sparse-checkout file doesn't match what would be written by a cone-mode add.
> In such a case, Git writes a warning to stderr and continues with the old
> pattern matching algorithm. These checks are currently very barebones, and
> would need to be updated with more robust checks for things like regex
> characters in the middle of the pattern. As review moves forward (and if we
> don't change the data storage) then we could spend more time on this.

Instead of trying to validate the sparse-checkout file everytime,
perhaps we want to change core.sparseCheckout from a boolean to a
tri-state or something where it specifies how to parse the
sparse-checkout file?  Or maybe when special directive (some form of
comment-looking line) appears at the top of sparse-checkout then we
use the hashsets speedup while disallowing general regexes and
pathspecs other than leading directories and full pathnames?

I'm not sure if that makes sense or not; just throwing it out there as an idea.

> Derrick Stolee (8):
>   sparse-checkout: create builtin with 'list' subcommand
>   sparse-checkout: create 'init' subcommand
>   clone: add --sparse mode
>   sparse-checkout: 'add' subcommand
>   sparse-checkout: create 'disable' subcommand
>   sparse-checkout: add 'cone' mode
>   sparse-checkout: use hashmaps for cone patterns
>   sparse-checkout: init and add in cone mode
>
> Jeff Hostetler (1):
>   trace2:experiment: clear_ce_flags_1

I'll try to get some time to look over these patches in the next few days.

gitgitgadget · 2019-08-22T13:17:40Z

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 8/21/2019 5:52 PM, Elijah Newren wrote:
> On Tue, Aug 20, 2019 at 8:12 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> This RFC includes a potential direction to make the sparse-checkout more
>> user-friendly. While there, I also present a way to use a limited set of
>> patterns to gain a significant performance boost in very large repositories.
>>
>> Sparse-checkout is only documented as a subsection of the read-tree docs
>> [1], which makes the feature hard to discover. Users have trouble navigating
>> the feature, especially at clone time [2], and have even resorted to
>> creating their own helper tools [3].
> 
> Ooh, intriguing.  Count me as another person who has resorted to
> making my own helper tool for others to use (specific to our internal
> repository, though, as it also figures out inter-module dependencies
> to allow specifying only a few modules of interest while still
> checking out everything needed to build those; but it'd be nice to
> need less scripting to handle the git-related bits to actually
> sparsify or densify).
> 
>> This RFC attempts to solve these problems using a new builtin. Here is a
>> sample workflow to give a feeling for how it can work:
>>
>> In an existing repo:
>>
>> $ git sparse-checkout init
>> $ ls
>> myFile1.txt myFile2.txt
>>
>> $ git sparse-checkout add
>> /myFolder/*
>> ^D
>> $ ls
>> myFile1.txt myFile2.txt myFolder
>> $ ls myFolder
>> a.c a.h
>> $ git sparse-checkout disable
>> $ ls
>> hiddenFolder myFile1.txt myFile2.txt myFolder
>>
>> At clone time:
>>
>> $ git clone --sparse origin repo
>> $ cd repo
>> $ ls
>> myFile1.txt myFile2.txt
>> $ git sparse-checkout add
>> /myFolder/*
>> ^D
>> $ ls
>> myFile1.txt myFile2.txt myFolder
>>
>> Here are some more specific details:
>>
>>  * git sparse-checkout init enables core.sparseCheckout and populates the
>>    sparse-checkout file with patterns that match only the files at root.
> 
> Does it enable core.sparseCheckout in the current worktree, or for all
> worktrees?  Do we require extensions.worktreeConfig to be set to true
> first?  If we don't require extensions.worktreeConfig to be set to
> true, and users add worktrees later, do they encounter negative
> surprises (immediately or later)?

This is an interesting scenario that I had not considered. Thanks!

My guess is that we should set `extensions.worktreeConfig=true` to
avoid surprises. I'll need to play with this to discover the answers
to these questions:

1. Where does the worktree look for the sparse-checkout file? Does
   each worktree have its own sparse-checkout file? Should it?

2. If I have `extensions.worktreeConfig=true` and `core.sparseCheckout=true`
   in the current worktree and run `git worktree add`, does the new worktree
   have `core.sparseCheckout=true`? Can we `git clone --sparse` and then
   start building sparse worktrees seamlessly?

> worktrees in combination with sparseCheckouts were a headache here
> until I just forced people to manually first set
> extensions.worktreeConfig to true before using my 'sparsify' script,
> regardless of whether the user was currently using worktrees.  That
> fixed the issues, but having to provide a long error message and
> explanation of why I wanted users to set some special config first was
> slightly annoying.
> 
> I wonder if 'git worktree' and maybe even 'git config' should
> themselves have special handling for core.sparseCheckouts, because it
> can be a real mess otherwise.
> 
>>  * git clone learns the --sparse argument to run git sparse-checkout init
>>    before the first checkout.
> 
> Nice.
> 
>>  * git sparse-checkout add reads patterns from stdin, one per line, then
>>    adds them to the sparse-checkout file and refreshes the working
>>    directory.
> 
> The default of reading from stdin seems a bit unusual to me, and I
> worry about having to explain that to users.  I'd rather the add
> command took positional parameters (anything that doesn't start with a
> hyphen) and added those, e.g.
>   $ git sparse-checkout add '/myFolder/*' '
> with the option of the user specifying --stdin.

I had the same thought, and likely that's where we should go with the
builtin. For our needs, the input over stdin is more important for
testing, so I built it first. I will adjust the CLI here to take a set
of paths over the arguments unless --stdin is given.

>>  * git sparse-checkout disable removes the patterns from the sparse-checkout
>>    file, disables core.sparseCheckout, and refills the working directory.
> 
> Does it leave an empty sparse-checkout file around?  Also, what if
> users have several paths defining their sparse pattern, and want to
> temporarily get a full checkout and then come back -- do they need to
> re-specify all the paths?  (Maybe this *is* the route we want to go;
> I'm just trying to mention any possible negative effects we _might_
> run into so we can consider them.  It's not quite as relevant in my
> case since people specify a few toplevel modules and sparse-checkout
> gets several entries auto-generated for them.)

In this case, there is an intermediate step (that follows the existing
advice) to modify the sparse-checkout file to contain only "/*\n" then
run read-tree to fill the working directory before disabling the config
setting.

Perhaps "disable" is the wrong word to use, as it makes you think that
there should be an "enable" that can quickly toggle between the two
modes. Maybe instead it should be "git sparse-checkout reset [empty|full]"
where you could 'reset' the sparse-checkout to one of two initial
states:

1. empty: only files at root are included.
2. full: all files are included.

In each case, we would obliterate the existing sparse-checkout entries,
but hopefully that behavior is more clear from the command names.

> Also, I'm particularly worried that a user with multiple worktrees,
> both sparse, could run 'git sparse-checkout disable' in one and then
> find that when they return to the other worktree they get a variety of
> nasty surprises (e.g. accidental staging or committing of the deletion
> of a huge number of files, random weird errors, or gradual and weird
> de-sparsifying as various git commands are run).  This, of course, can
> be averted by making sure core.sparseCheckout is set on a per-worktree
> basis, but that seems to be something people only do after running
> into problems several times unless some kind of tooling enforces it.
> 
>>  * git sparse-checkout list lists the contents of the sparse-checkout file.
>>
>>
>>
>> The documentation for the sparse-checkout feature can now live primarily
>> with the git-sparse-checkout documentation.
> 
> Yaay!
> 
>> Cone Mode
>> =========
>>
>> What really got me interested in this area is a performance problem. If we
>> have N patterns in the sparse-checkout file and M entries in the index, then
>> we can perform up to O(N * M) pattern checks in clear_ce_flags(). This
>> quadratic growth is not sustainable in a repo with 1,000+ patterns and
>> 1,000,000+ index entries.
> 
> This has worried me for a while, even if it hasn't yet caused us
> issues in practice.
> 
>> To solve this problem, I propose a new, more restrictive mode to
>> sparse-checkout: "cone mode". In this mode, all patterns are based on prefix
>> matches at a directory level. This can then use hashsets for fast
>> performance -- O(M) instead of O(N*M). My hashset implementation is based on
>> the virtual filesystem hook in the VFS for Git custom code [4].
> 
> Sweet!
> 
>> In cone mode, a user specifies a list of folders which the user wants every
>> file inside. In addition, the cone adds all blobs that are siblings of the
>> folders in the directory path to that folder. This makes the directories
>> look "hydrated" as a user drills down to those recursively-closed folders.
>> These directories are called "parent" folders, as a file matches them only
>> if the file's immediate parent is that directory.
>>
>> When building a prototype of this feature, I used a separate file to contain
>> the list of recursively-closed folders and built the hashsets dynamically
>> based on that file. In this implementation, I tried to maximize the amount
>> of backwards-compatibility by storing all data in the sparse-checkout file
>> using patterns recognized by earlier Git versions.
>>
>> For example, if we add A/B/C as a recursive folder, then we add the
>> following patterns to the sparse-checkout file:
>>
>> /*
>> !/*/*
>> /A/*
>> !/A/*/*
>> /A/B/*
>> !/A/B/*/*
>> /A/B/C/*
>>
>> The alternating positive/negative patterns say "include everything in this
>> folder, but exclude everything another level deeper". The final pattern has
>> no matching negation, so is a recursively closed pattern.
> 
> Oh, um, would there be any option for fast but without grabbing
> sibling and parent files of requested directories?  And could users
> still request individual files (not with regex or pathspec, but fully
> specifying the path) and still get the fast mode?

Exact files could probably be included and still be fast. It requires an
extra hash check per entry, but that's a small price to pay I think.

With the sibling files, this is something I believe to be user-friendly:
as a user drills down into the folder they included recursively, there may
be helpful files along the way, like documentation, project files, etc.

Here is my philosophical position here: a repo can take advantage of the
sparse-checkout feature if it is properly componetized. Those component
boundaries are likely at folder boundaries. Any file that exists in a parent
folder for two components is likely important to _both_ components. If
a file is large and is not needed by both components, it should be placed
deeper in the tree, so it can be avoided.

With that philosophy in mind, I designed this to help users fall into the
"pit of success" when their repo is in a good shape AND to motivate users
with repos in non-optimal shapes to reorganize.

The thought I had about exact file names is similar: if there is a large
list of files in a folder where I only need a subset, then how do I know
if a new file is added that I need? It will not show up in the directory
without updating the sparse-checkout. A user would discover this need by
something going wrong when they are not interacting with version control:
a build.

This is particularly important with the root directory. We need things
like .gitignore, .gitattributes, README, LICENSE, etc. to be populated
by default. If there are too many files at root to reasonably work with
the repo, then the repo should be reorganized using folders.

> Basically, our sparse usage is exclusively specifying leading
> directories or full pathnames of individual files, but we really want
> the repo to feel smaller and make sure people notice at a glance.  We
> have a huge 'modules/' directory, and want people to be able to get
> just 15 of the 500 or so subdirectories that would appear in that
> directory with a non-sparse checkout.  And similarly we want to be
> able to grab just one or two files from a directory of many files.

Your modules/ example seems to work with the feature as designed, as
you want a set of folders one level deeper. Grabbing one or two files
from a directory is a direction we can go with the feature, but I will
continue to believe that should be a rare occurrence compared to including
a folder recursively.

>> Note that I have some basic warnings to try and check that the
>> sparse-checkout file doesn't match what would be written by a cone-mode add.
>> In such a case, Git writes a warning to stderr and continues with the old
>> pattern matching algorithm. These checks are currently very barebones, and
>> would need to be updated with more robust checks for things like regex
>> characters in the middle of the pattern. As review moves forward (and if we
>> don't change the data storage) then we could spend more time on this.
> 
> Instead of trying to validate the sparse-checkout file everytime,
> perhaps we want to change core.sparseCheckout from a boolean to a
> tri-state or something where it specifies how to parse the
> sparse-checkout file?  Or maybe when special directive (some form of
> comment-looking line) appears at the top of sparse-checkout then we
> use the hashsets speedup while disallowing general regexes and
> pathspecs other than leading directories and full pathnames?

In this series, I turn `core.sparseCheckout` into a tri-state, and only
try to validate the sparse-checkout when `core.sparseCheckout=cone`.
This avoids spending time on the validation when someone is content using
the existing feature.

The _intent_ of using the sparse-checkout file and no extra data structure
was to let other clients (or an older client) read the sparse-checkout data
and result in the same working directory. One thing I realized after
submitting is that the tri-state config variable will cause old clients
to error on parsing the non-boolean value. Instead, in v2 I will introduce
a new boolean config variable "core.sparseCheckoutCone" that will do the
same thing as the current series when `core.sparseCheckout=cone` and will
fix this compat scenario.

> I'll try to get some time to look over these patches in the next few days.

I look forward to your feedback! I also have some feedback to respond to
from my team [1], but I'm waiting to make sure the community likes the
overall idea before jumping into code style and method organization
details.

Thanks,
-Stolee

[1] https://github.com/microsoft/git/pull/180

gitgitgadget · 2019-08-22T14:26:51Z

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 8/22/2019 9:10 AM, Derrick Stolee wrote:
> On 8/21/2019 5:52 PM, Elijah Newren wrote:
>> On Tue, Aug 20, 2019 at 8:12 AM Derrick Stolee via GitGitGadget
>>> For example, if we add A/B/C as a recursive folder, then we add the
>>> following patterns to the sparse-checkout file:
>>>
>>> /*
>>> !/*/*
>>> /A/*
>>> !/A/*/*
>>> /A/B/*
>>> !/A/B/*/*
>>> /A/B/C/*
>>>
>>> The alternating positive/negative patterns say "include everything in this
>>> folder, but exclude everything another level deeper". The final pattern has
>>> no matching negation, so is a recursively closed pattern.
>>
>> Oh, um, would there be any option for fast but without grabbing
>> sibling and parent files of requested directories?  And could users
>> still request individual files (not with regex or pathspec, but fully
>> specifying the path) and still get the fast mode?
> 
> Exact files could probably be included and still be fast. It requires an
> extra hash check per entry, but that's a small price to pay I think.

Quick side note on this point about exact file names and the REAL reason
for the parent paths that I had forgotten until just now.

The following comment exists in unpack-trees.c, clear_ce_flags_dir():

	/*
	 * TODO: check el, if there are no patterns that may conflict
	 * with ret (iow, we know in advance the incl/excl
	 * decision for the entire directory), clear flag here without
	 * calling clear_ce_flags_1(). That function will call
	 * the expensive is_excluded_from_list() on every entry.
	 */

While I haven't implemented it yet in this RFC, this TODO can actually
happen with the current set of cone patterns:

1. If we hit a directory that is not in a parent or recursive path,
   then all paths it contains must have their skipworktree bits set.
   We can avoid computing hashes for them.

2. If we hit a directory that is in a recursive path, then all paths
   it contains must have skipworktree bits off. We can avoid computing
   hashes for them.

When we have a million index entries, these hash computations are not
insignificant!

With that in mind, I think there is a _performance_ reason to include
the parent folders in addition to _user experience_ reason. If we
were to add the complexity of exact file matches, then we would also
want to add the parent folders leading to that file so we can still
do the logic above.

Thanks,
-Stolee

.gitignore

Documentation/git-sparse-checkout.txt

Documentation/git-clone.txt

Documentation/git-sparse-checkout.txt

gitgitgadget · 2019-08-24T00:11:39Z

unpack-trees.c

@@ -1397,15 +1397,23 @@ static int clear_ce_flags(struct index_state *istate,
 			  struct exclude_list *el)


On the Git mailing list, Elijah Newren wrote (reply to this):

On Tue, Aug 20, 2019 at 8:12 AM Jeff Hostetler via GitGitGadget <gitgitgadget@gmail.com> wrote: > > From: Jeff Hostetler <jeffhost@microsoft.com> Can the commit summary be turned into English? > The clear_ce_flags_1 method is used by many types of calls to > unpack_trees(). Add trace2 regions around the method, including > some flag information, so we can get granular performance data > during experiments. It might be nice to have some words in the cover letter about why this patch is included in this series instead of being a separate submission. I'm not familiar with the trace2 stuff yet; this looks probably useful, but the commit message makes it sound like something general rather than specific to this series. > Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- <snip>

Documentation/config/core.txt

Documentation/git-sparse-checkout.txt

builtin/sparse-checkout.c

@@ -0,0 +1,389 @@
+#include "builtin.h"


gitgitgadget · 2019-08-24T05:48:37Z

On the Git mailing list, Elijah Newren wrote (reply to this):

On Thu, Aug 22, 2019 at 6:10 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 8/21/2019 5:52 PM, Elijah Newren wrote:
> > On Tue, Aug 20, 2019 at 8:12 AM Derrick Stolee via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:

<snip>
> >> Here are some more specific details:
> >>
> >>  * git sparse-checkout init enables core.sparseCheckout and populates the
> >>    sparse-checkout file with patterns that match only the files at root.
> >
> > Does it enable core.sparseCheckout in the current worktree, or for all
> > worktrees?  Do we require extensions.worktreeConfig to be set to true
> > first?  If we don't require extensions.worktreeConfig to be set to
> > true, and users add worktrees later, do they encounter negative
> > surprises (immediately or later)?
>
> This is an interesting scenario that I had not considered. Thanks!
>
> My guess is that we should set `extensions.worktreeConfig=true` to
> avoid surprises. I'll need to play with this to discover the answers
> to these questions:
>
> 1. Where does the worktree look for the sparse-checkout file? Does
>    each worktree have its own sparse-checkout file? Should it?

For the main/first/primary worktree: .git/info/sparse-checkout
For all other worktrees: .git/worktrees/$WORKTREE/info/sparse-checkout

So, yes, each has its own, and from my viewpoint, absolutely yes that
is what we want.

> 2. If I have `extensions.worktreeConfig=true` and `core.sparseCheckout=true`
>    in the current worktree and run `git worktree add`, does the new worktree
>    have `core.sparseCheckout=true`? Can we `git clone --sparse` and then
>    start building sparse worktrees seamlessly?

My $0.02: I think `git worktree add` should not only adopt the setting
of core.sparseCheckout from the current worktree, but it should also
adopt the $GIT_DIR/info/sparse-checkout file too.  Granted, users can
change it to something else, but much like a new shell starts up with
the same current working directory as its parent shell, I think it'd
be most obvious for people to have a worktree that looked similar to
the one they launched it from.

<snip>
> > The default of reading from stdin seems a bit unusual to me, and I
> > worry about having to explain that to users.  I'd rather the add
> > command took positional parameters (anything that doesn't start with a
> > hyphen) and added those, e.g.
> >   $ git sparse-checkout add '/myFolder/*' '
> > with the option of the user specifying --stdin.
>
> I had the same thought, and likely that's where we should go with the
> builtin. For our needs, the input over stdin is more important for
> testing, so I built it first. I will adjust the CLI here to take a set
> of paths over the arguments unless --stdin is given.
>
> >>  * git sparse-checkout disable removes the patterns from the sparse-checkout
> >>    file, disables core.sparseCheckout, and refills the working directory.
> >
> > Does it leave an empty sparse-checkout file around?  Also, what if
> > users have several paths defining their sparse pattern, and want to
> > temporarily get a full checkout and then come back -- do they need to
> > re-specify all the paths?  (Maybe this *is* the route we want to go;
> > I'm just trying to mention any possible negative effects we _might_
> > run into so we can consider them.  It's not quite as relevant in my
> > case since people specify a few toplevel modules and sparse-checkout
> > gets several entries auto-generated for them.)
>
> In this case, there is an intermediate step (that follows the existing
> advice) to modify the sparse-checkout file to contain only "/*\n" then
> run read-tree to fill the working directory before disabling the config
> setting.
>
> Perhaps "disable" is the wrong word to use, as it makes you think that
> there should be an "enable" that can quickly toggle between the two
> modes. Maybe instead it should be "git sparse-checkout reset [empty|full]"
> where you could 'reset' the sparse-checkout to one of two initial
> states:
>
> 1. empty: only files at root are included.
> 2. full: all files are included.
>
> In each case, we would obliterate the existing sparse-checkout entries,
> but hopefully that behavior is more clear from the command names.

Will "reset" be seen as slightly less obvious wording that needs to be
explained to users?  If so, maybe have "undo" and "empty" verbs?  (Of
course, "init" already empties, both when starting from full or when
we have some kind of sparse checkout.)  I dunno, just some ideas.

> >> In cone mode, a user specifies a list of folders which the user wants every
> >> file inside. In addition, the cone adds all blobs that are siblings of the
> >> folders in the directory path to that folder. This makes the directories
> >> look "hydrated" as a user drills down to those recursively-closed folders.
> >> These directories are called "parent" folders, as a file matches them only
> >> if the file's immediate parent is that directory.
> >>
> >> When building a prototype of this feature, I used a separate file to contain
> >> the list of recursively-closed folders and built the hashsets dynamically
> >> based on that file. In this implementation, I tried to maximize the amount
> >> of backwards-compatibility by storing all data in the sparse-checkout file
> >> using patterns recognized by earlier Git versions.
> >>
> >> For example, if we add A/B/C as a recursive folder, then we add the
> >> following patterns to the sparse-checkout file:
> >>
> >> /*
> >> !/*/*
> >> /A/*
> >> !/A/*/*
> >> /A/B/*
> >> !/A/B/*/*
> >> /A/B/C/*
> >>
> >> The alternating positive/negative patterns say "include everything in this
> >> folder, but exclude everything another level deeper". The final pattern has
> >> no matching negation, so is a recursively closed pattern.
> >
> > Oh, um, would there be any option for fast but without grabbing
> > sibling and parent files of requested directories?  And could users
> > still request individual files (not with regex or pathspec, but fully
> > specifying the path) and still get the fast mode?
>
> Exact files could probably be included and still be fast. It requires an
> extra hash check per entry, but that's a small price to pay I think.
>
> With the sibling files, this is something I believe to be user-friendly:
> as a user drills down into the folder they included recursively, there may
> be helpful files along the way, like documentation, project files, etc.
>
> Here is my philosophical position here: a repo can take advantage of the
> sparse-checkout feature if it is properly componetized. Those component
> boundaries are likely at folder boundaries. Any file that exists in a parent
> folder for two components is likely important to _both_ components. If
> a file is large and is not needed by both components, it should be placed
> deeper in the tree, so it can be avoided.
>
> With that philosophy in mind, I designed this to help users fall into the
> "pit of success" when their repo is in a good shape AND to motivate users
> with repos in non-optimal shapes to reorganize.
>
> The thought I had about exact file names is similar: if there is a large
> list of files in a folder where I only need a subset, then how do I know
> if a new file is added that I need? It will not show up in the directory
> without updating the sparse-checkout. A user would discover this need by
> something going wrong when they are not interacting with version control:
> a build.
>
> This is particularly important with the root directory. We need things
> like .gitignore, .gitattributes, README, LICENSE, etc. to be populated
> by default. If there are too many files at root to reasonably work with
> the repo, then the repo should be reorganized using folders.
>
> > Basically, our sparse usage is exclusively specifying leading
> > directories or full pathnames of individual files, but we really want
> > the repo to feel smaller and make sure people notice at a glance.  We
> > have a huge 'modules/' directory, and want people to be able to get
> > just 15 of the 500 or so subdirectories that would appear in that
> > directory with a non-sparse checkout.  And similarly we want to be
> > able to grab just one or two files from a directory of many files.
>
> Your modules/ example seems to work with the feature as designed, as
> you want a set of folders one level deeper. Grabbing one or two files
> from a directory is a direction we can go with the feature, but I will
> continue to believe that should be a rare occurrence compared to including
> a folder recursively.

Oh, you're right, I was misunderstanding what it'd do.  This does look
like it's really close to what we're using, and most of the
differences are probably worth some slightly reshuffling of paths in
the repo.  Now that I've played with it some, it seems really awesome.

Being able to grab one or two files from a directory without grabbing
an entire directory and its parents I think would probably still be
useful, but I do agree that it'd be a rare occurrence.

> >> Note that I have some basic warnings to try and check that the
> >> sparse-checkout file doesn't match what would be written by a cone-mode add.
> >> In such a case, Git writes a warning to stderr and continues with the old
> >> pattern matching algorithm. These checks are currently very barebones, and
> >> would need to be updated with more robust checks for things like regex
> >> characters in the middle of the pattern. As review moves forward (and if we
> >> don't change the data storage) then we could spend more time on this.
> >
> > Instead of trying to validate the sparse-checkout file everytime,
> > perhaps we want to change core.sparseCheckout from a boolean to a
> > tri-state or something where it specifies how to parse the
> > sparse-checkout file?  Or maybe when special directive (some form of
> > comment-looking line) appears at the top of sparse-checkout then we
> > use the hashsets speedup while disallowing general regexes and
> > pathspecs other than leading directories and full pathnames?
>
> In this series, I turn `core.sparseCheckout` into a tri-state, and only
> try to validate the sparse-checkout when `core.sparseCheckout=cone`.
> This avoids spending time on the validation when someone is content using
> the existing feature.
>
> The _intent_ of using the sparse-checkout file and no extra data structure
> was to let other clients (or an older client) read the sparse-checkout data
> and result in the same working directory. One thing I realized after
> submitting is that the tri-state config variable will cause old clients
> to error on parsing the non-boolean value. Instead, in v2 I will introduce
> a new boolean config variable "core.sparseCheckoutCone" that will do the
> same thing as the current series when `core.sparseCheckout=cone` and will
> fix this compat scenario.

Once we are forced to use yet another config variable, we may as well
use yet another config file ($GITDIR/info/sparse-checkout-cone or
something; or maybe a less specific name with greater future
compatibility via some version marking in it).

One thing I noticed twice while using this series was that when I had
an existing sparse checkout it was easy to get into a weird state
where things were messed up, I think due to the fact that
"sparse-checkout init [--cone]" prefers to honor any pre-existing
$GITDIR/info/sparse-checkout file.  Once my config file was very much
not cone-compatible, and another time it was empty and caused
read-tree to error out with something like "there'd be nothing left!".
I manually twiddled with core.sparseCheckout and the sparse-checkout
file and 'git read-tree -mu HEAD' to get it fixed, but I'd rather
avoid others running into such problems.  Sorry I didn't take good
notes on it; I was just trying to get a good feel for this series.

> > I'll try to get some time to look over these patches in the next few days.
>
> I look forward to your feedback! I also have some feedback to respond to
> from my team [1], but I'm waiting to make sure the community likes the
> overall idea before jumping into code style and method organization
> details.

I think this idea is great; I'm a big fan right now.  I'm excited to
see how this will pan out.

gitgitgadget · 2019-08-26T13:34:29Z

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 8/24/2019 1:40 AM, Elijah Newren wrote:
> On Thu, Aug 22, 2019 at 6:10 AM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 8/21/2019 5:52 PM, Elijah Newren wrote:
>>> On Tue, Aug 20, 2019 at 8:12 AM Derrick Stolee via GitGitGadget
>>> <gitgitgadget@gmail.com> wrote:
> 
> <snip>
>>>> Here are some more specific details:
>>>>
>>>>  * git sparse-checkout init enables core.sparseCheckout and populates the
>>>>    sparse-checkout file with patterns that match only the files at root.
>>>
>>> Does it enable core.sparseCheckout in the current worktree, or for all
>>> worktrees?  Do we require extensions.worktreeConfig to be set to true
>>> first?  If we don't require extensions.worktreeConfig to be set to
>>> true, and users add worktrees later, do they encounter negative
>>> surprises (immediately or later)?
>>
>> This is an interesting scenario that I had not considered. Thanks!
>>
>> My guess is that we should set `extensions.worktreeConfig=true` to
>> avoid surprises. I'll need to play with this to discover the answers
>> to these questions:
>>
>> 1. Where does the worktree look for the sparse-checkout file? Does
>>    each worktree have its own sparse-checkout file? Should it?
> 
> For the main/first/primary worktree: .git/info/sparse-checkout
> For all other worktrees: .git/worktrees/$WORKTREE/info/sparse-checkout
> 
> So, yes, each has its own, and from my viewpoint, absolutely yes that
> is what we want.

Thanks for the info! I will definitely consider this in the next version,
and include a test to verify the interaction.

>> 2. If I have `extensions.worktreeConfig=true` and `core.sparseCheckout=true`
>>    in the current worktree and run `git worktree add`, does the new worktree
>>    have `core.sparseCheckout=true`? Can we `git clone --sparse` and then
>>    start building sparse worktrees seamlessly?
> 
> My $0.02: I think `git worktree add` should not only adopt the setting
> of core.sparseCheckout from the current worktree, but it should also
> adopt the $GIT_DIR/info/sparse-checkout file too.  Granted, users can
> change it to something else, but much like a new shell starts up with
> the same current working directory as its parent shell, I think it'd
> be most obvious for people to have a worktree that looked similar to
> the one they launched it from.

This seems natural to me: I'm adding a new worktree and expect the
settings to match my current worktree. If we later want to extend
`git worktree add` to include an `--empty-cone` option (that creates
the worktree as if it was created by `git clone --sparse-cone`) we
could do that independently.

> <snip>
>>> The default of reading from stdin seems a bit unusual to me, and I
>>> worry about having to explain that to users.  I'd rather the add
>>> command took positional parameters (anything that doesn't start with a
>>> hyphen) and added those, e.g.
>>>   $ git sparse-checkout add '/myFolder/*' '
>>> with the option of the user specifying --stdin.
>>
>> I had the same thought, and likely that's where we should go with the
>> builtin. For our needs, the input over stdin is more important for
>> testing, so I built it first. I will adjust the CLI here to take a set
>> of paths over the arguments unless --stdin is given.
>>
>>>>  * git sparse-checkout disable removes the patterns from the sparse-checkout
>>>>    file, disables core.sparseCheckout, and refills the working directory.
>>>
>>> Does it leave an empty sparse-checkout file around?  Also, what if
>>> users have several paths defining their sparse pattern, and want to
>>> temporarily get a full checkout and then come back -- do they need to
>>> re-specify all the paths?  (Maybe this *is* the route we want to go;
>>> I'm just trying to mention any possible negative effects we _might_
>>> run into so we can consider them.  It's not quite as relevant in my
>>> case since people specify a few toplevel modules and sparse-checkout
>>> gets several entries auto-generated for them.)
>>
>> In this case, there is an intermediate step (that follows the existing
>> advice) to modify the sparse-checkout file to contain only "/*\n" then
>> run read-tree to fill the working directory before disabling the config
>> setting.
>>
>> Perhaps "disable" is the wrong word to use, as it makes you think that
>> there should be an "enable" that can quickly toggle between the two
>> modes. Maybe instead it should be "git sparse-checkout reset [empty|full]"
>> where you could 'reset' the sparse-checkout to one of two initial
>> states:
>>
>> 1. empty: only files at root are included.
>> 2. full: all files are included.
>>
>> In each case, we would obliterate the existing sparse-checkout entries,
>> but hopefully that behavior is more clear from the command names.
> 
> Will "reset" be seen as slightly less obvious wording that needs to be
> explained to users?  If so, maybe have "undo" and "empty" verbs?  (Of
> course, "init" already empties, both when starting from full or when
> we have some kind of sparse checkout.)  I dunno, just some ideas.

Thanks for pointing out that my word choice could be improved. I'll
consider several options.

>>>> In cone mode, a user specifies a list of folders which the user wants every
>>>> file inside. In addition, the cone adds all blobs that are siblings of the
>>>> folders in the directory path to that folder. This makes the directories
>>>> look "hydrated" as a user drills down to those recursively-closed folders.
>>>> These directories are called "parent" folders, as a file matches them only
>>>> if the file's immediate parent is that directory.
>>>>
>>>> When building a prototype of this feature, I used a separate file to contain
>>>> the list of recursively-closed folders and built the hashsets dynamically
>>>> based on that file. In this implementation, I tried to maximize the amount
>>>> of backwards-compatibility by storing all data in the sparse-checkout file
>>>> using patterns recognized by earlier Git versions.
>>>>
>>>> For example, if we add A/B/C as a recursive folder, then we add the
>>>> following patterns to the sparse-checkout file:
>>>>
>>>> /*
>>>> !/*/*
>>>> /A/*
>>>> !/A/*/*
>>>> /A/B/*
>>>> !/A/B/*/*
>>>> /A/B/C/*
>>>>
>>>> The alternating positive/negative patterns say "include everything in this
>>>> folder, but exclude everything another level deeper". The final pattern has
>>>> no matching negation, so is a recursively closed pattern.
>>>
>>> Oh, um, would there be any option for fast but without grabbing
>>> sibling and parent files of requested directories?  And could users
>>> still request individual files (not with regex or pathspec, but fully
>>> specifying the path) and still get the fast mode?
>>
>> Exact files could probably be included and still be fast. It requires an
>> extra hash check per entry, but that's a small price to pay I think.
>>
>> With the sibling files, this is something I believe to be user-friendly:
>> as a user drills down into the folder they included recursively, there may
>> be helpful files along the way, like documentation, project files, etc.
>>
>> Here is my philosophical position here: a repo can take advantage of the
>> sparse-checkout feature if it is properly componetized. Those component
>> boundaries are likely at folder boundaries. Any file that exists in a parent
>> folder for two components is likely important to _both_ components. If
>> a file is large and is not needed by both components, it should be placed
>> deeper in the tree, so it can be avoided.
>>
>> With that philosophy in mind, I designed this to help users fall into the
>> "pit of success" when their repo is in a good shape AND to motivate users
>> with repos in non-optimal shapes to reorganize.
>>
>> The thought I had about exact file names is similar: if there is a large
>> list of files in a folder where I only need a subset, then how do I know
>> if a new file is added that I need? It will not show up in the directory
>> without updating the sparse-checkout. A user would discover this need by
>> something going wrong when they are not interacting with version control:
>> a build.
>>
>> This is particularly important with the root directory. We need things
>> like .gitignore, .gitattributes, README, LICENSE, etc. to be populated
>> by default. If there are too many files at root to reasonably work with
>> the repo, then the repo should be reorganized using folders.
>>
>>> Basically, our sparse usage is exclusively specifying leading
>>> directories or full pathnames of individual files, but we really want
>>> the repo to feel smaller and make sure people notice at a glance.  We
>>> have a huge 'modules/' directory, and want people to be able to get
>>> just 15 of the 500 or so subdirectories that would appear in that
>>> directory with a non-sparse checkout.  And similarly we want to be
>>> able to grab just one or two files from a directory of many files.
>>
>> Your modules/ example seems to work with the feature as designed, as
>> you want a set of folders one level deeper. Grabbing one or two files
>> from a directory is a direction we can go with the feature, but I will
>> continue to believe that should be a rare occurrence compared to including
>> a folder recursively.
> 
> Oh, you're right, I was misunderstanding what it'd do.  This does look
> like it's really close to what we're using, and most of the
> differences are probably worth some slightly reshuffling of paths in
> the repo.  Now that I've played with it some, it seems really awesome.
> 
> Being able to grab one or two files from a directory without grabbing
> an entire directory and its parents I think would probably still be
> useful, but I do agree that it'd be a rare occurrence.

I think we can leave the file-by-file addition for later, but may need
to make certain design decisions in this initial version to avoid issues
with adding that feature in the future. (Perhaps the recursive-folder input
should have "/" at the end, to clearly state these are folders, not files.)

>>>> Note that I have some basic warnings to try and check that the
>>>> sparse-checkout file doesn't match what would be written by a cone-mode add.
>>>> In such a case, Git writes a warning to stderr and continues with the old
>>>> pattern matching algorithm. These checks are currently very barebones, and
>>>> would need to be updated with more robust checks for things like regex
>>>> characters in the middle of the pattern. As review moves forward (and if we
>>>> don't change the data storage) then we could spend more time on this.
>>>
>>> Instead of trying to validate the sparse-checkout file everytime,
>>> perhaps we want to change core.sparseCheckout from a boolean to a
>>> tri-state or something where it specifies how to parse the
>>> sparse-checkout file?  Or maybe when special directive (some form of
>>> comment-looking line) appears at the top of sparse-checkout then we
>>> use the hashsets speedup while disallowing general regexes and
>>> pathspecs other than leading directories and full pathnames?
>>
>> In this series, I turn `core.sparseCheckout` into a tri-state, and only
>> try to validate the sparse-checkout when `core.sparseCheckout=cone`.
>> This avoids spending time on the validation when someone is content using
>> the existing feature.
>>
>> The _intent_ of using the sparse-checkout file and no extra data structure
>> was to let other clients (or an older client) read the sparse-checkout data
>> and result in the same working directory. One thing I realized after
>> submitting is that the tri-state config variable will cause old clients
>> to error on parsing the non-boolean value. Instead, in v2 I will introduce
>> a new boolean config variable "core.sparseCheckoutCone" that will do the
>> same thing as the current series when `core.sparseCheckout=cone` and will
>> fix this compat scenario.
> 
> Once we are forced to use yet another config variable, we may as well
> use yet another config file ($GITDIR/info/sparse-checkout-cone or
> something; or maybe a less specific name with greater future
> compatibility via some version marking in it).

I'm hesitant to include a second "source of truth," as that can cause
issues when users modify the sparse-checkout file directly. Since the
existing way to interact with the sparse-checkout is to directly edit
the file, I want to be as careful as possible around users who modify
that themselves. The caveat is that if they specify "cone" mode then
they will get warnings and worse performance if they modify it outside
the limited patterns we allow.

> One thing I noticed twice while using this series was that when I had
> an existing sparse checkout it was easy to get into a weird state
> where things were messed up, I think due to the fact that
> "sparse-checkout init [--cone]" prefers to honor any pre-existing
> $GITDIR/info/sparse-checkout file.  Once my config file was very much
> not cone-compatible, and another time it was empty and caused
> read-tree to error out with something like "there'd be nothing left!".
> I manually twiddled with core.sparseCheckout and the sparse-checkout
> file and 'git read-tree -mu HEAD' to get it fixed, but I'd rather
> avoid others running into such problems.  Sorry I didn't take good
> notes on it; I was just trying to get a good feel for this series.

Thanks for this interesting use case! I think an empty file should be
updated with the root files, since Git does not think that is a valid
state. The current series must only check for existence, not content.
 
>>> I'll try to get some time to look over these patches in the next few days.
>>
>> I look forward to your feedback! I also have some feedback to respond to
>> from my team [1], but I'm waiting to make sure the community likes the
>> overall idea before jumping into code style and method organization
>> details.
> 
> I think this idea is great; I'm a big fan right now.  I'm excited to
> see how this will pan out.

Thanks! I'll be taking a close look at your patch-by-patch feedback
this week and hope to have a non-RFC v2 soon.

-Stolee

gitgitgadget · 2019-08-26T18:20:02Z

On the Git mailing list, Elijah Newren wrote (reply to this):

On Mon, Aug 26, 2019 at 6:29 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 8/24/2019 1:40 AM, Elijah Newren wrote:
> > On Thu, Aug 22, 2019 at 6:10 AM Derrick Stolee <stolee@gmail.com> wrote:
> >>
> >> On 8/21/2019 5:52 PM, Elijah Newren wrote:
> >>> On Tue, Aug 20, 2019 at 8:12 AM Derrick Stolee via GitGitGadget
> >>> <gitgitgadget@gmail.com> wrote:
> >

> >> In this series, I turn `core.sparseCheckout` into a tri-state, and only
> >> try to validate the sparse-checkout when `core.sparseCheckout=cone`.
> >> This avoids spending time on the validation when someone is content using
> >> the existing feature.
> >>
> >> The _intent_ of using the sparse-checkout file and no extra data structure
> >> was to let other clients (or an older client) read the sparse-checkout data
> >> and result in the same working directory. One thing I realized after
> >> submitting is that the tri-state config variable will cause old clients
> >> to error on parsing the non-boolean value. Instead, in v2 I will introduce
> >> a new boolean config variable "core.sparseCheckoutCone" that will do the
> >> same thing as the current series when `core.sparseCheckout=cone` and will
> >> fix this compat scenario.
> >
> > Once we are forced to use yet another config variable, we may as well
> > use yet another config file ($GITDIR/info/sparse-checkout-cone or
> > something; or maybe a less specific name with greater future
> > compatibility via some version marking in it).
>
> I'm hesitant to include a second "source of truth," as that can cause
> issues when users modify the sparse-checkout file directly. Since the
> existing way to interact with the sparse-checkout is to directly edit
> the file, I want to be as careful as possible around users who modify
> that themselves. The caveat is that if they specify "cone" mode then
> they will get warnings and worse performance if they modify it outside
> the limited patterns we allow.

Wait...does that mean you allow mixing and matching both regular style
sparse-checkout declarations with cone-mode style declarations within
the same file?  Are the non-cone mode entries ignored?  Does it
fallback to non-cone mode for all entries?  Or does that mean you
allow checking out both old and new styles of filesets, where you
optimize the cone-mode style declarations with your hashsets, and have
the remaining ones fall back to the old O(N*M) matching?  (I think it
does the last of those, right?)

If you support both, it sounds like you naturally support doing cone
mode with allowing people to also grab a handful of additional files
without the rest of their directories or parents.  It's just that
folks who want to do that will ask for a way to turn off any warnings
you spew, and if you turn the warnings off, then people who meant to
get cone behavior but mistyped stuff might complain about no warnings.
Hmm....

(Just trying to think things through out loud; I don't necessarily
know what's good or bad here, just voicing what I think might happen.)

gitgitgadget · 2019-08-26T19:20:21Z

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 8/26/2019 2:16 PM, Elijah Newren wrote:
> On Mon, Aug 26, 2019 at 6:29 AM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 8/24/2019 1:40 AM, Elijah Newren wrote:
>>> On Thu, Aug 22, 2019 at 6:10 AM Derrick Stolee <stolee@gmail.com> wrote:
>>>>
>>>> On 8/21/2019 5:52 PM, Elijah Newren wrote:
>>>>> On Tue, Aug 20, 2019 at 8:12 AM Derrick Stolee via GitGitGadget
>>>>> <gitgitgadget@gmail.com> wrote:
>>>
> 
>>>> In this series, I turn `core.sparseCheckout` into a tri-state, and only
>>>> try to validate the sparse-checkout when `core.sparseCheckout=cone`.
>>>> This avoids spending time on the validation when someone is content using
>>>> the existing feature.
>>>>
>>>> The _intent_ of using the sparse-checkout file and no extra data structure
>>>> was to let other clients (or an older client) read the sparse-checkout data
>>>> and result in the same working directory. One thing I realized after
>>>> submitting is that the tri-state config variable will cause old clients
>>>> to error on parsing the non-boolean value. Instead, in v2 I will introduce
>>>> a new boolean config variable "core.sparseCheckoutCone" that will do the
>>>> same thing as the current series when `core.sparseCheckout=cone` and will
>>>> fix this compat scenario.
>>>
>>> Once we are forced to use yet another config variable, we may as well
>>> use yet another config file ($GITDIR/info/sparse-checkout-cone or
>>> something; or maybe a less specific name with greater future
>>> compatibility via some version marking in it).
>>
>> I'm hesitant to include a second "source of truth," as that can cause
>> issues when users modify the sparse-checkout file directly. Since the
>> existing way to interact with the sparse-checkout is to directly edit
>> the file, I want to be as careful as possible around users who modify
>> that themselves. The caveat is that if they specify "cone" mode then
>> they will get warnings and worse performance if they modify it outside
>> the limited patterns we allow.
> 
> Wait...does that mean you allow mixing and matching both regular style
> sparse-checkout declarations with cone-mode style declarations within
> the same file?  Are the non-cone mode entries ignored?  Does it
> fallback to non-cone mode for all entries?  Or does that mean you
> allow checking out both old and new styles of filesets, where you
> optimize the cone-mode style declarations with your hashsets, and have
> the remaining ones fall back to the old O(N*M) matching?  (I think it
> does the last of those, right?)
> 
> If you support both, it sounds like you naturally support doing cone
> mode with allowing people to also grab a handful of additional files
> without the rest of their directories or parents.  It's just that
> folks who want to do that will ask for a way to turn off any warnings
> you spew, and if you turn the warnings off, then people who meant to
> get cone behavior but mistyped stuff might complain about no warnings.
> Hmm....
> 
> (Just trying to think things through out loud; I don't necessarily
> know what's good or bad here, just voicing what I think might happen.)

The way I built the current series is that we honor what is in the
sparse-checkout as historically allowed. Always.

If a user modifies the sparse-checkout to have patterns that don't match
those that are added in cone mode, then Git warns the user this is the
case then reverts to the old pattern-by-pattern logic. This is to have
the Git client always match what another Git client would expect. (This
could be JGit or an older version of Git.) A user could always disable
cone mode to remove the warning and keep their sparse-checkout in its
current state.

Note: I have not made the "non-code-mode pattern" checks very robust.
For instance, I don't check the middle characters for wildcards. This
needs to happen at write time, too. The plan is to make these more
robust in future versions of the series.

Thanks,
-Stolee

gitgitgadget · 2019-09-02T18:01:15Z

On the Git mailing list, Eric Sunshine wrote (reply to this):

On Sat, Aug 24, 2019 at 1:40 AM Elijah Newren <newren@gmail.com> wrote:
> My $0.02: I think `git worktree add` should not only adopt the setting
> of core.sparseCheckout from the current worktree, but it should also
> adopt the $GIT_DIR/info/sparse-checkout file too.

As another example in favor of imbuing "git worktree add" with
first-class support for this feature (via command-line option and/or
inheriting existing settings), the commit message of ef2a0ac9a0
(worktree: add: introduce --checkout option, 2016-03-29) specifically
sites sparse checkout as the motivation for adding --no-checkout to
"git worktree add".

builtin/sparse-checkout.c

derrickstolee · 2019-11-21T22:03:41Z

/submit

gitgitgadget · 2019-11-21T22:05:38Z

Submitted as pull.316.v6.git.1574373891.gitgitgadget@gmail.com

gitgitgadget · 2019-11-22T08:35:50Z

This patch series was integrated into pu via git@3fa2852.

gitgitgadget · 2019-11-23T04:33:25Z

This patch series was integrated into pu via git@4a63815.

gitgitgadget · 2019-11-25T04:51:19Z

This patch series was integrated into pu via git@4126d07.

gitgitgadget · 2019-11-25T08:54:51Z

This patch series was integrated into pu via git@bba10d2.

gitgitgadget · 2019-11-27T10:12:24Z

This patch series was integrated into pu via git@c2cdfc4.

gitgitgadget · 2019-12-02T16:15:26Z

This patch series was integrated into pu via git@c671989.

gitgitgadget · 2019-12-03T01:13:11Z

This patch series was integrated into pu via git@0d091dd.

gitgitgadget · 2019-12-05T22:42:13Z

This patch series was integrated into pu via git@0f159e4.

gitgitgadget · 2019-12-06T19:54:27Z

This patch series was integrated into pu via git@a3525f8.

gitgitgadget · 2019-12-06T22:30:05Z

This patch series was integrated into pu via git@41ac7e4.

gitgitgadget · 2019-12-10T05:55:49Z

This patch series was integrated into pu via git@380642b.

gitgitgadget · 2019-12-10T18:08:15Z

This patch series was integrated into pu via git@0341d5f.

gitgitgadget · 2019-12-10T23:52:40Z

This patch series was integrated into pu via git@25c9911.

gitgitgadget · 2019-12-11T22:54:55Z

This patch series was integrated into pu via git@7bf36df.

gitgitgadget · 2019-12-13T23:08:25Z

This branch is now known as ds/sparse-cone-old.

gitgitgadget · 2019-12-13T23:08:26Z

This patch series was integrated into pu via git@c3fcb82.

gitgitgadget · 2019-12-16T22:27:19Z

This patch series was integrated into pu via git@553c471.

gitgitgadget · 2019-12-16T22:27:21Z

This patch series was integrated into next via git@c840c1d.

gitgitgadget · 2019-12-17T22:08:07Z

This patch series was integrated into pu via git@5f635d6.

gitgitgadget · 2019-12-18T23:19:33Z

This patch series was integrated into pu via git@1362a31.

gitgitgadget · 2019-12-19T22:15:58Z

This patch series was integrated into pu via git@87f7b46.

gitgitgadget · 2019-12-20T21:56:24Z

This patch series was integrated into pu via git@7d920a1.

gitgitgadget · 2019-12-25T21:44:39Z

This patch series was integrated into pu via git@bd72a08.

gitgitgadget · 2019-12-25T21:44:41Z

This patch series was integrated into master via git@bd72a08.

gitgitgadget · 2019-12-25T21:44:44Z

Closed via bd72a08.

derrickstolee force-pushed the sparse-checkout/upstream branch from 19ad979 to 568fda2 Compare August 20, 2019 14:38