pr-316/derrickstolee/sparse-checkout/upstream-v6

This series makes the sparse-checkout feature more user-friendly. Along the
way, I also present a way to use a limited set of patterns to gain a
significant performance boost in very large repositories.

Sparse-checkout is only documented as a subsection of the read-tree docs
[1], which makes the feature hard to discover. Users have trouble navigating
the feature, especially at clone time [2], and have even resorted to
creating their own helper tools [3].

This series attempts to solve these problems using a new builtin. Here is a
sample workflow to give a feeling for how it can work:

In an existing repo:

$ git sparse-checkout init
$ ls
myFile1.txt myFile2.txt
$ git sparse-checkout set "/*" "!/*/" /myFolder/
$ ls
myFile1.txt myFile2.txt myFolder
$ ls myFolder
a.c a.h
$ git sparse-checkout disable
$ ls
hiddenFolder myFile1.txt myFile2.txt myFolder

At clone time:

$ git clone --sparse origin repo
$ cd repo
$ ls
myFile1.txt myFile2.txt
$ git sparse-checkout set "/*" "!/*/" /myFolder/
$ ls
myFile1.txt myFile2.txt myFolder

Here are some more specific details:

 * git sparse-checkout init enables core.sparseCheckout and populates the
   sparse-checkout file with patterns that match only the files at root.

 * git clone learns the --sparse argument to run git sparse-checkout init
   before the first checkout.

 * git sparse-checkout set reads patterns from the arguments, or with
   --stdin reads patterns from stdin one per line, then writes them to the
   sparse-checkout file and refreshes the working directory.

 * git sparse-checkout disable removes the patterns from the sparse-checkout
   file, disables core.sparseCheckout, and refills the working directory.

 * git sparse-checkout list lists the contents of the sparse-checkout file.

The documentation for the sparse-checkout feature can now live primarily
with the git-sparse-checkout documentation.

Cone Mode
=========

What really got me interested in this area is a performance problem. If we
have N patterns in the sparse-checkout file and M entries in the index, then
we can perform up to O(N * M) pattern checks in clear_ce_flags(). This
multiplicative growth is not sustainable in a repo with 1,000+ patterns and
1,000,000+ index entries, where it can mean on the order of a billion checks.

To solve this problem, I propose a new, more restrictive mode to
sparse-checkout: "cone mode". In this mode, all patterns are based on prefix
matches at a directory level. This can then use hashsets for fast
performance -- O(M) instead of O(N*M). My hashset implementation is based on
the virtual filesystem hook in the VFS for Git custom code [4].

In cone mode, a user specifies a list of folders and gets every file inside
them, recursively. In addition, the cone adds all blobs that are siblings of
the folders on the directory path leading to each such folder. This makes the
directories look "hydrated" as a user drills down to those recursively-closed
folders. The directories along that path are called "parent" folders, as a
file matches them only if the file's immediate parent is that directory.
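
To make the two kinds of match concrete, here is a minimal sketch in Python
(not the C code from this series). The names build_cone and in_cone are
hypothetical and chosen only for illustration; the sketch assumes paths use
"/" separators and that the repository root always acts as a parent folder.
Each file costs a handful of hash lookups bounded by its depth, which is
where the O(M) behavior comes from.

```python
def build_cone(recursive_folders):
    """Return (recursive, parents) hashsets for the requested folders."""
    recursive = set()
    parents = {""}  # assumption: the root always behaves like a parent folder
    for folder in recursive_folders:
        folder = folder.strip("/")
        recursive.add(folder)
        # Every proper ancestor of a recursive folder becomes a parent folder.
        while "/" in folder:
            folder = folder.rsplit("/", 1)[0]
            parents.add(folder)
    return recursive, parents


def in_cone(path, recursive, parents):
    """Return True if the file at `path` is inside the cone."""
    parent = path.rsplit("/", 1)[0] if "/" in path else ""
    # Parent match: the file sits directly inside a parent folder.
    if parent in parents:
        return True
    # Recursive match: some ancestor directory is recursively included.
    ancestor = parent
    while ancestor:
        if ancestor in recursive:
            return True
        ancestor = ancestor.rsplit("/", 1)[0] if "/" in ancestor else ""
    return False


recursive, parents = build_cone(["A/B/C"])
```

With this cone, files directly under the root, A, and A/B match as siblings
on the path, everything below A/B/C matches recursively, and a path such as
A/D/w.c is left out.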

When building a prototype of this feature, I used a separate file to contain
the list of recursively-closed folders and built the hashsets dynamically
based on that file. In this implementation, I tried to maximize the amount
of backwards-compatibility by storing all data in the sparse-checkout file
using patterns recognized by earlier Git versions.

For example, if we add A/B/C as a recursive folder, then we add the
following patterns to the sparse-checkout file:

/*
!/*/
/A/
!/A/*/
/A/B/
!/A/B/*/
/A/B/C/

The alternating positive/negative patterns say "include everything in this
folder, but exclude everything another level deeper". The final pattern has
no matching negation, so is a recursively closed pattern.
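
The expansion from a folder list to these patterns can be sketched as
follows. This is an illustrative Python sketch, not the code in
builtin/sparse-checkout.c; the function names are hypothetical, and the
sorted output order and the dropping of folders already covered by a
recursive ancestor are assumptions of the sketch.

```python
def ancestors(folder):
    """Yield each proper ancestor of `folder`, nearest first."""
    while "/" in folder:
        folder = folder.rsplit("/", 1)[0]
        yield folder


def cone_patterns(recursive_folders):
    """Return sparse-checkout patterns for a list of recursive folders."""
    requested = {f.strip("/") for f in recursive_folders}
    # Assumption: a folder nested under another requested folder is redundant.
    recursive = {
        f for f in requested
        if not any(a in requested for a in ancestors(f))
    }
    parents = {a for f in recursive for a in ancestors(f)}

    patterns = ["/*", "!/*/"]  # files at root, but no directories
    # Sorting puts each parent before its children, so the negation for a
    # parent is written before the positive pattern of the next level.
    for folder in sorted(parents | recursive):
        patterns.append(f"/{folder}/")
        if folder in parents:
            patterns.append(f"!/{folder}/*/")
    return patterns


if __name__ == "__main__":
    print("\n".join(cone_patterns(["A/B/C"])))
```

Only parent folders get the trailing negation; a recursive folder is
written as a single positive pattern.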

Note that I have some basic checks to detect when the sparse-checkout file
doesn't match what would be written by a cone-mode 'set'.
In such a case, Git writes a warning to stderr and continues with the old
pattern matching algorithm. These checks are currently very barebones, and
would need to be updated with more robust checks for things like regex
characters in the middle of the pattern. As review moves forward (and if we
don't change the data storage) then we could spend more time on this.
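
As a rough illustration of what a stricter check could look like, here is a
Python sketch. It is not the check implemented in this series; the function
name looks_like_cone and the set of rejected characters are assumptions made
for the example. It accepts only the two leading root patterns followed by
"/dir/" and "!/dir/*/" lines, and rejects wildcard characters anywhere in
the directory part.

```python
GLOB_CHARS = set("*?[\\")  # assumption: characters treated as wildcards


def looks_like_cone(patterns):
    """Return True if `patterns` has the shape a cone-mode writer produces."""
    if patterns[:2] != ["/*", "!/*/"]:
        return False
    for pattern in patterns[2:]:
        negated = pattern.startswith("!")
        body = pattern[1:] if negated else pattern
        if negated:
            # A negation must exclude exactly one level deeper: "!/dir/*/".
            if not body.endswith("/*/"):
                return False
            body = body[:-2]  # "/dir/*/" -> "/dir/"
        # The remaining body must be a rooted, non-empty directory pattern.
        if not (body.startswith("/") and body.endswith("/") and len(body) > 2):
            return False
        # No wildcard characters in the middle of the pattern.
        if GLOB_CHARS & set(body):
            return False
    return True
```

A caller could fall back to the full pattern-matching algorithm whenever
this returns False, mirroring the warn-and-continue behavior described
above.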

Thanks, -Stolee

Updates in v2, relative to the RFC:

 * Instead of an 'add' subcommand, use a 'set' subcommand. We can consider
   adding 'add' and/or 'remove' subcommands later.

 * 'set' reads from the arguments by default. '--stdin' option is available.

 * A new performance-oriented commit is added at the end.

 * Patterns no longer end with a trailing asterisk except for the first "/*"
   pattern.

 * References to a "bug" (that was really a strange GVFS interaction in
   microsoft/git) around deleting outside the cone are removed.

Updates in v3:

 * The bad interaction with "cone mode" and .gitignore files is fixed. A
   test is added in the last patch.

 * Several patches are added that make the feature more robust. One
   sanitizes user input, a few add progress indicators, and several more
   prevent users from getting into bad states due to working directory
   changes or concurrent processes.

 * Updated several docs and commit messages according to feedback. Thanks,
   Elijah!

Updates in v4:

 * Updated hashmap API usage to respond to ew/hashmap

 * Responded to detailed review by Elijah. Thanks!

 * Marked the feature as experimental in git-sparse-checkout.txt the same
   way that git-switch.txt does.

Updates in v5:

 * The 'set' subcommand now enables the core.sparseCheckout config setting
   (unless the checkout fails).

 * If the in-process unpack_trees() fails with the new patterns, the
   index.lock file is rolled back before the replay of the old
   sparse-checkout patterns.

 * Some documentation fixes, f(d)open->xf(d)open calls, and other nits.
   Thanks everyone!

Updates in v6:

 * The init, set, and disable commands now require a clean status.

 * Git config is now set in-process instead of via a run_command call.

 * The working directory was being updated twice, leading to multiple errors
   being shown if the working directory would become empty.

 * Before, only the 'set' command used the in-process workdir update. Now
   'init' and 'disable' also use this in-process code, which removes some
   error cases.

Things to leave for future patches:

 1. Integrate in 'git worktree add' to copy the sparse-checkout file to a
    worktree-specific file.

 2. More robustness around detecting non-cone patterns with wildcards in the
    middle of the line.

 3. 'git clone --sparse-cone' to clone into "cone mode" sparse-checkouts
    (i.e. set 'core.sparseCheckoutCone=true'). This may not be
    super-valuable, as it only starts changing behavior when someone calls
    'git sparse-checkout set', but may be interesting.

 4. Make the working-directory update not modify the staging environment.
    Block only if it would lose work-in-progress.

 5. Some robustness things can be saved for later, such as including pattern
    arguments next to "--stdin", "set --cone", etc.

[1] https://git-scm.com/docs/git-read-tree#_sparse_checkout
    Sparse-checkout documentation in git-read-tree.

[2] https://stackoverflow.com/a/4909267/127088
    Is it possible to do a sparse checkout without checking out the whole
    repository first?

[3] http://www.marcoyuen.com/articles/2016/06/07/git-sparse.html
    A blog post of a user's extra "git-sparse" helper.

[4] https://github.com/git/git/compare/fc5fd706ff733392053e6180086a4d7f96acc2af...01204f24c5349aa2fb0c474546d768946d315dab
    The virtual filesystem hook in microsoft/git.

Derrick Stolee (18):
  sparse-checkout: create builtin with 'list' subcommand
  sparse-checkout: create 'init' subcommand
  clone: add --sparse mode
  sparse-checkout: 'set' subcommand
  sparse-checkout: add '--stdin' option to set subcommand
  sparse-checkout: create 'disable' subcommand
  sparse-checkout: add 'cone' mode
  sparse-checkout: use hashmaps for cone patterns
  sparse-checkout: init and set in cone mode
  unpack-trees: hash less in cone mode
  unpack-trees: add progress to clear_ce_flags()
  sparse-checkout: sanitize for nested folders
  sparse-checkout: update working directory in-process
  sparse-checkout: use in-process update for disable subcommand
  sparse-checkout: write using lockfile
  sparse-checkout: cone mode should not interact with .gitignore
  sparse-checkout: update working directory in-process for 'init'
  sparse-checkout: check for dirty status

Jeff Hostetler (1):
  trace2: add region in clear_ce_flags

 .gitignore                            |   1 +
 Documentation/config/core.txt         |  10 +-
 Documentation/git-clone.txt           |   8 +-
 Documentation/git-read-tree.txt       |   2 +-
 Documentation/git-sparse-checkout.txt | 161 +++++++++
 Makefile                              |   1 +
 builtin.h                             |   1 +
 builtin/clone.c                       |  27 ++
 builtin/read-tree.c                   |   2 +-
 builtin/sparse-checkout.c             | 489 ++++++++++++++++++++++++++
 cache.h                               |   6 +-
 command-list.txt                      |   1 +
 config.c                              |   5 +
 dir.c                                 | 207 ++++++++++-
 dir.h                                 |  36 ++
 environment.c                         |   1 +
 git.c                                 |   1 +
 t/t1091-sparse-checkout-builtin.sh    | 307 ++++++++++++++++
 unpack-trees.c                        | 110 ++++--
 unpack-trees.h                        |   3 +-
 20 files changed, 1331 insertions(+), 48 deletions(-)
 create mode 100644 Documentation/git-sparse-checkout.txt
 create mode 100644 builtin/sparse-checkout.c
 create mode 100755 t/t1091-sparse-checkout-builtin.sh

base-commit: 108b97dc372828f0e72e56bbb40cae8e1e83ece6

Submitted-As: https://public-inbox.org/git/pull.316.v6.git.1574373891.gitgitgadget@gmail.com
In-Reply-To: https://public-inbox.org/git/pull.316.git.gitgitgadget@gmail.com
In-Reply-To: https://public-inbox.org/git/pull.316.v2.git.gitgitgadget@gmail.com
In-Reply-To: https://public-inbox.org/git/pull.316.v3.git.gitgitgadget@gmail.com
In-Reply-To: https://public-inbox.org/git/pull.316.v4.git.1571147764.gitgitgadget@gmail.com
In-Reply-To: https://public-inbox.org/git/pull.316.v5.git.1571666186.gitgitgadget@gmail.com