
[RFC] Maintenance jobs and job runner #597

Closed
wants to merge 15 commits into from

Conversation

@derrickstolee derrickstolee commented Apr 2, 2020

Background maintenance. I've blogged about it [1]. I've talked about it [2]. I brought it up at the contributor summit [3]. I think it's important. It's a critical feature in Scalar and VFS for Git. It's the main reason we created the "scalar register" feature for existing repos (so the Scalar background maintenance would apply to a "vanilla" Git repo).

[1] https://devblogs.microsoft.com/devops/introducing-scalar/
[2] https://stolee.dev/docs/git-merge-2020.pdf
[3] https://lore.kernel.org/git/35FDF767-CE0C-4C51-88A8-12965CD2D4FF@jramsay.com.au/

This RFC does the "maintenance" half of "background maintenance". It creates two new builtins to assist users in scheduling their own background maintenance:

  • git run-job <job-name>: This builtin will run a single instance of a maintenance job.

  • git job-runner [--repo=<path>]: This builtin will run an infinite loop that executes git run-job as a subcommand.

I believe that I could replace most maintenance steps in Scalar with the git run-job command and all that we would lose is some logging. (The only one that would remain is the one that sets recommended config settings, but that could be changed to a one-time operation at service start-up.) I haven't tested this yet, but I plan to before sending a non-RFC version.

Of course, just because we think these maintenance operations work for our scenario does not mean that they immediately work as the "canonical" maintenance activities. I attempt to demonstrate the value of these jobs as they are implemented. My perspective is skewed towards very large repositories (2000+ full-time contributors), but even medium-size repositories (100-200 full-time contributors) can benefit from these jobs.

I've tried to split the series into logical commits. This includes each specific maintenance job being completely described in its own commit (plus maybe an extra one to allow a custom option).

Most commit messages have "RFC QUESTION(S)" for things I was unsure about.

I'm currently testing this locally by setting job.repo in my global config to a few important repos on my Linux VM, then running GIT_TRACE2_PERF=<path> git job-runner --daemonize to launch a background process that logs the subcommands to <path>.

Here I will call out things that could definitely be improved:

  1. The git job-runner process is not tested AT ALL. It's a bit tricky to test it since it shouldn't exit when things work as expected. I expect to create a --no-loop option to stop after one iteration of the job loop. But I would like a bit more feedback on the concept before jumping into those tests. (I do test the git run-job builtin for each job, but not the customization through arguments and config.) Perhaps we could do some funny business about mocking git using --exec-path to check that it is calling git run-job in the correct order (and more importantly, not calling it when certain config settings are present).

  2. The --daemonize option at the end is shamelessly stolen from git gc --daemonize and git daemon, but has limited abilities on some platforms (I've tested on Linux and macOS). I have not done my research on how far this gets us to allowing users to launch this at startup or something.

  3. As I said, this is the "maintenance" "half" of "background maintenance". The "background" part is harder in my opinion because it involves creating platform-specific ways to consistently launch background processes. For example, Unix systems should have one way to service start X while Windows has another. macOS has launchd to launch processes as users log in, which should be a good way forward. Scalar implements a Windows Service that runs as root but impersonates the latest user to log in, and it implements a macOS "service" that is running only with the current user. I expect to need to create these services myself as a follow-up, but I lack the expertise to do it (currently). If someone else has experience creating these things and wants to take over or advise that half then I would appreciate the help!

  4. I noticed late in the RFC process that I'm not clearing my argv_arrays carefully in the job-runner. This will need to be rectified and carefully checked with valgrind before merging this code. While it leaks memory very slowly, it will be important that we don't leak any memory at all since this is a long-lived process. There are also some places where I was careful to not include too much of libgit.a to help keep the memory footprint low.

Thanks,
-Stolee

Cc: peff@peff.net, jrnieder@google.com, stolee@gmail.com

@derrickstolee derrickstolee changed the title [WIP] Jobs and job runner [RFC] Maintenance jobs and job runner Apr 3, 2020
@derrickstolee derrickstolee marked this pull request as ready for review April 3, 2020 03:58
The 'git run-job' command will be used to execute a short-lived set
of maintenance activities by a background job manager. The intention
is to perform small batches of work that reduce the foreground time
taken by repository maintenance such as 'git gc --auto'.

This change does the absolute minimum to create the builtin and show
the usage output.

Provide an explicit warning that this command is experimental. The
set of jobs may change, and each job could alter its behavior in
future versions.

RFC QUESTION: This builtin is based on the background maintenance in
Scalar. Specifically, this builtin is based on the "scalar run <job>"
command [1] [2]. My default thought was to make this a "git run <job>"
command to maximize similarity. However, it seems like "git run" is
too generic. Or, am I being overly verbose for no reason?

[1] https://github.com/microsoft/scalar/blob/master/docs/advanced.md#run-maintenance-in-the-foreground
[2] https://github.com/microsoft/scalar/blob/master/Scalar/CommandLine/RunVerb.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The first job to implement in the 'git run-job' command is the
'commit-graph' job. It is based on the sequence of events in the
'commit-graph' job in Scalar [1]. This sequence is as follows:

1. git commit-graph write --reachable --split
2. git commit-graph verify --shallow
3. If the verify succeeds, stop.
4. Delete the commit-graph-chain file.
5. git commit-graph write --reachable --split
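
The sequence above can be sketched as a small shell function over a repo path (illustrative only; the real job is implemented in C inside the builtin):

```shell
# Sketch of the commit-graph job (requires split commit-graph
# support, Git v2.24+), wrapped in a function over a repo path.
commit_graph_job() {
    repo=$1
    git -C "$repo" commit-graph write --reachable --split

    if ! git -C "$repo" commit-graph verify --shallow; then
        # Newest layer is corrupt: drop the chain and rewrite the
        # full commit-graph from the object database as one layer.
        rm -f "$repo/.git/objects/info/commit-graphs/commit-graph-chain"
        git -C "$repo" commit-graph write --reachable --split
    fi
}
```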

By writing an incremental commit-graph file using the "--split"
option we minimize the disruption from this operation. The default
behavior is to merge layers until the new "top" layer is less than
half the size of the layer below. This provides quick writes "most"
of the time, with the longer writes following a power law
distribution.

Most importantly, concurrent Git processes only look at the
commit-graph-chain file for a very short amount of time, so they
will very likely not be holding a handle to the file when we try
to replace it. (This only matters on Windows.)

If a concurrent process reads the old commit-graph-chain file, but
our job expires some of the .graph files before they can be read,
then those processes will see a warning message (but not fail).
This could be avoided by a future update to use the --expire-time
argument when writing the commit-graph.

By using 'git commit-graph verify --shallow' we can ensure that
the file we just wrote is valid. This is an extra safety precaution
that is faster than our 'write' subcommand. In the rare situation
that the newest layer of the commit-graph is corrupt, we can "fix"
the corruption by deleting the commit-graph-chain file and
rewriting the full commit-graph as a new one-layer commit-graph.
This does
not completely prevent _that_ file from being corrupt, but it does
recompute the commit-graph by parsing commits from the object
database. In our use of this step in Scalar and VFS for Git, we
have only seen this issue arise because our microsoft/git fork
reverted 43d3561 ("commit-graph write: don't die if the existing
graph is corrupt" 2019-03-25) for a while to keep commit-graph
writes very fast. We dropped the revert when updating to v2.23.0.
The verify still has potential for catching corrupt data across
the layer boundary: if the new file has commit X with parent Y
in an old file but the commit ID for Y in the old file had a
bitswap, then we will notice that in the 'verify' command.

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/CommitGraphStep.cs

RFC QUESTIONS:

1. Are links like [1] helpful?

2. Can anyone think of a way to test the rewrite fallback?
   It requires corrupting the latest file between two steps of
   this one command, so that is a tricky spot to inject a fault.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
When working with very large repositories, an incremental 'git fetch'
command can download a large amount of data. If there are many other
users pushing to a common repo, then this data can rival the initial
pack-file size of a 'git clone' of a medium-size repo.

Users may want to keep the data on their local repos as close as
possible to the data on the remote repos by fetching periodically in
the background. This can break up a large daily fetch into several
smaller hourly fetches.

However, if we simply ran 'git fetch <remote>' in the background,
then the user running a foreground 'git fetch <remote>' would lose
some important feedback when a new branch appears or an existing
branch updates. This is especially true if a remote branch is
force-updated and this isn't noticed by the user because it occurred
in the background. Further, the functionality of 'git push
--force-with-lease' becomes suspect.

When running 'git fetch <remote> <options>' in the background, use
the following options for careful updating:

1. --no-tags prevents getting a new tag when a user wants to see
   the new tags appear in their foreground fetches.

2. --refmap= removes the configured refspec which usually updates
   refs/remotes/<remote>/* with the refs advertised by the remote.

3. By adding a new refspec "+refs/heads/*:refs/hidden/<remote>/*"
   we can ensure that we actually load the new values somewhere in
   our refspace while not updating refs/heads or refs/remotes. By
   storing these refs here, the commit-graph job will update the
   commit-graph with the commits from these hidden refs.

4. --prune will delete the refs/hidden/<remote> refs that no
   longer appear on the remote.
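
Putting the four options together, the background fetch for one remote amounts to something like this (function form and names are illustrative):

```shell
# Sketch of the background fetch for one remote; refs land in
# refs/hidden/<remote>/* rather than refs/remotes/<remote>/*.
background_fetch() {
    repo=$1 remote=$2
    git -C "$repo" fetch --quiet --no-tags --prune --refmap= \
        "$remote" "+refs/heads/*:refs/hidden/$remote/*"
}
```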

We've been using this step as a critical background job in Scalar
[1] (and VFS for Git). This solved a pain point that was showing up
in user reports: fetching was a pain! Users do not like waiting to
download the data that was created while they were away from their
machines. After implementing background fetch, the foreground fetch
commands sped up significantly because they mostly just update refs
and download a small amount of new data. The effect is especially
dramatic when paired with --no-show-forced-updates (through
fetch.showForcedUpdates=false).

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/FetchStep.cs

RFC QUESTIONS:

1. One downside of the refs/hidden pattern is that 'git log' will
   decorate commits with twice as many refs if they appear at a
   remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
   there an easy way to exclude a refspace from decorations? Should
   we make refs/hidden/* a "special" refspace that is excluded from
   decorations?

2. This feature is designed for a desktop machine or equivalent
   that has a permanent wired network connection, and the machine
   stays on while the user is not present. For things like laptops
   with inconsistent WiFi connections (that may be metered) the
   feature can use the less stable connection more than the user
   wants. Of course, this feature is opt-in for Git, but in Scalar
   we have a "scalar pause" command [2] that pauses all maintenance
   for some amount of time. We should consider a similar mechanism
   for Git, but for the point of this series the user needs to set
   up the "background" part of these jobs manually.

[2] https://github.com/microsoft/scalar/blob/master/Scalar/CommandLine/PauseVerb.cs
[3] https://github.com/microsoft/scalar/blob/master/docs/advanced.md#controlling-background-maintenance

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
One goal of background maintenance jobs is to allow a user to
disable auto-gc (gc.auto=0) but keep their repository in a clean
state. Without any cleanup, loose objects will clutter the object
database and slow operations. In addition, the loose objects will
take up extra space because they are not stored with deltas against
similar objects.

Create a 'loose-objects' job for the 'git run-job' command. This
helps clean up loose objects without disrupting concurrent Git
commands using the following sequence of events:

1. Run 'git prune-packed' to delete any loose objects that exist
   in a pack-file. Concurrent commands will prefer the packed
   version of the object to the loose version. (Of course, there
   are exceptions for commands that specifically care about the
   location of an object. These are rare for a user to run on
   purpose, and we hope a user that has selected background
   maintenance will not be trying to do foreground maintenance.)

2. Run 'git pack-objects' on a batch of loose objects. These
   objects are grouped by scanning the loose object directories in
   lexicographic order until listing all loose objects -or-
   reaching 50,000 objects. This is more than enough if the loose
   objects are created only by a user doing normal development.
   We noticed users with _millions_ of loose objects because VFS
   for Git downloads blobs on-demand when a file read operation
   requires populating a virtual file. This has potential of
   happening in partial clones if someone runs 'git grep' or
   otherwise evades the batch-download feature for requesting
   promisor objects.
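
A rough shell equivalent of the two steps (the real job enumerates loose objects in C; the pipeline here is illustrative, with the 50,000 cap matching the batch size above):

```shell
# Rough sketch of the loose-objects job over a repo path.
loose_objects_job() {
    repo=$1
    # Step 1: drop loose objects that already exist in a pack.
    git -C "$repo" prune-packed

    # Step 2: pack up to 50,000 remaining loose objects. Loose
    # objects live at objects/<2 hex>/<38 hex>; rebuild each object
    # ID from its path and feed the list to pack-objects on stdin.
    find "$repo/.git/objects" -maxdepth 2 -type f \
            -path '*/[0-9a-f][0-9a-f]/*' |
        sed -e 's|.*objects/||' -e 's|/||' |
        head -n 50000 |
        git -C "$repo" pack-objects --quiet \
            "$repo/.git/objects/pack/loose"
}
```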

This step is based on a similar step in Scalar [1] and VFS for Git.
[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/LooseObjectsStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The previous change cleaned up loose objects using the
'loose-objects' job, which can be run safely in the background.
Add a similar job that performs similar cleanups for pack-files.

One issue with running 'git repack' is that it is designed to
repack all pack-files into a single pack-file. While this is the
most space-efficient way to store object data, it is not time or
memory efficient. This becomes extremely important if the repo is
so large that a user struggles to store two copies of the pack on
their disk.

Instead, perform an "incremental" repack by collecting a few small
pack-files into a new pack-file. The multi-pack-index facilitates
this process ever since 'git multi-pack-index expire' was added in
19575c7 (multi-pack-index: implement 'expire' subcommand,
2019-06-10) and 'git multi-pack-index repack' was added in ce1e4a1
(midx: implement midx_repack(), 2019-06-10).

The 'pack-files' job runs the following steps:

1. 'git multi-pack-index write' creates a multi-pack-index file if
   one did not exist, and otherwise will update the multi-pack-index
   with any new pack-files that appeared since the last write. This
   is particularly relevant with the background fetch job.

   When the multi-pack-index sees two copies of the same object, it
   stores the offset data into the newer pack-file. This means that
   some old pack-files could become "unreferenced" which I will use
   to mean "a pack-file that is in the pack-file list of the
   multi-pack-index but none of the objects in the multi-pack-index
   reference a location inside that pack-file."

2. 'git multi-pack-index expire' deletes any unreferenced pack-files
   and updates the multi-pack-index to drop those pack-files from the
   list. This is safe to do as concurrent Git processes will see the
   multi-pack-index and not open those packs when looking for object
   contents. (Similar to the 'loose-objects' job, there are some Git
   commands that open pack-files regardless of the multi-pack-index,
   but they are rarely used. Further, a user that self-selects to
   use background operations would likely refrain from using those
   commands.)

3. 'git multi-pack-index repack --batch-size=<size>' collects a set
   of pack-files that are listed in the multi-pack-index and creates
   a new pack-file containing the objects whose offsets are listed
   by the multi-pack-index to be in those pack-files. The set of pack-
   files is selected greedily by sorting the pack-files by modified
   time and adding a pack-file to the set if its "expected size" is
   smaller than the batch size until the total expected size of the
   selected pack-files is at least the batch size. The "expected
   size" is calculated by taking the size of the pack-file divided
   by the number of objects in the pack-file and multiplied by the
   number of objects from the multi-pack-index with offset in that
   pack-file. The expected size approximates how much data from that
   pack-file will contribute to the resulting pack-file size. The
   intention is that the resulting pack-file will be close in size
   to the provided batch size.

   The next run of the pack-files job will delete these repacked
   pack-files during the 'expire' step.

   In this version, the batch size is set to "0" which ignores the
   size restrictions when selecting the pack-files. It instead
   selects all pack-files and repacks all packed objects into a
   single pack-file. This will be updated in the next change, but
   it requires doing some calculations that are better isolated to
   a separate change.

Each of the above steps updates the multi-pack-index file. After
each step, we verify the new multi-pack-index. If the new
multi-pack-index is corrupt, then delete the multi-pack-index,
rewrite it from scratch, and stop doing the later steps of the
job. This is intended to be an extra-safe check without leaving
a repo with many pack-files without a multi-pack-index.
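
As a shell sketch, one run of the job looks roughly like this (the verify-and-recover check is shown once at the end; the real job performs it after each step):

```shell
# Sketch of the pack-files job with --batch-size=0 (repack
# everything); a later change computes a real batch size.
pack_files_job() {
    repo=$1
    git -C "$repo" multi-pack-index write
    git -C "$repo" multi-pack-index expire
    git -C "$repo" multi-pack-index repack --batch-size=0

    # Safety check: on corruption, delete the multi-pack-index and
    # rewrite it from scratch instead of continuing.
    if ! git -C "$repo" multi-pack-index verify; then
        rm -f "$repo/.git/objects/pack/multi-pack-index"
        git -C "$repo" multi-pack-index write
    fi
}
```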

These steps are based on a similar background maintenance step in
Scalar (and VFS for Git) [1]. This was incredibly effective for
users of the Windows OS repository. After using the same VFS for Git
repository for over a year, some users had _thousands_ of pack-files
that combined to up to 250 GB of data. We noticed a few users were
running into the open file descriptor limits (due in part to a bug
in the multi-pack-index fixed by af96fe3 (midx: add packs to
packed_git linked list, 2019-04-29)).

These pack-files were mostly small since they contained the commits
and trees that were pushed to the origin in a given hour. The GVFS
protocol includes a "prefetch" step that asks for pre-computed pack-
files containing commits and trees by timestamp. These pack-files
were grouped into "daily" pack-files once a day for up to 30 days.
If a user did not request prefetch packs for over 30 days, then they
would get the entire history of commits and trees in a new, large
pack-file. This led to a large number of pack-files that had poor
delta compression.

By running this pack-file maintenance step once per day, these repos
with thousands of packs spanning 200+ GB dropped to dozens of pack-
files spanning 30-50 GB. This was done all without removing objects
from the system and using a constant batch size of two gigabytes.
Once the work was done to reduce the pack-files to small sizes, the
batch size of two gigabytes means that not every run triggers a
repack operation, so the following run will not expire a pack-file.
This has kept these repos in a "clean" state.

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
When repacking during the 'pack-files' job, we use the --batch-size
option in 'git multi-pack-index repack'. The initial setting used
--batch-size=0 to repack everything into a single pack-file. This
is not sustainable for a large repository. The amount of work
required is also likely to use too many system resources for a
background job.

Update the 'git run-job pack-files' command by allowing a direct
--batch-size option that can change the value provided. Update the
default of "0" to a computed size based on the existing pack-files.
While computing that new size, count the existing pack-files and
skip the repack step if there are at most two pack-files.

The dynamic default size is computed with this idea in mind for
a client repository that was cloned from a very large remote: there
is likely one "big" pack-file that was created at clone time. Thus,
do not try repacking it as it is likely packed efficiently by the
server. Instead, try packing the other pack-files into a single
pack-file.

The size is then computed as follows:

	batch size = total size - max pack size

The batch size is then also limited to be at most two gigabytes.
This serves two purposes. First, having a limit prevents doing too
much work when the repository is extremely large. Pack-files
larger than two gigabytes are likely to either contain large blobs
or have been carefully repacked by a previous repack operation.
Second, two gigabytes is the size limit for a signed 32-bit int.
It's a good limit to consider, and to keep it far away from the
unsigned 32-bit int limit. This limit comes to mind because on
Windows an "unsigned long" is 32 bits and OPT_MAGNITUDE() uses
unsigned longs for its parsing logic.
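
The calculation is small enough to sketch directly (hypothetical helper; the real computation happens in C inside the builtin):

```shell
# Hypothetical helper mirroring the batch-size calculation:
# batch = total pack size - largest pack size, capped at 2 GiB.
compute_batch_size() {
    total_size=$1    # sum of all pack-file sizes, in bytes
    max_size=$2      # size of the largest pack-file, in bytes
    two_gb=$((2 * 1024 * 1024 * 1024))

    batch=$((total_size - max_size))
    if [ "$batch" -gt "$two_gb" ]; then
        batch=$two_gb
    fi
    echo "$batch"
}

compute_batch_size 5000000000 3000000000   # 2000000000
```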

This calculation mimics a similar calculation in Scalar [1],
except for a 3% drop in the calculated batch size due to the
round-off error that can happen with the "expected size"
calculation for a pack-file.

[1] https://github.com/microsoft/scalar/blob/616e9b16dd120b8fdb652d6d5a55618c731a8aea/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs#L141-L143

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The previous change allowed a user to specify a --batch-size=<size>
option to 'git run-job pack-files'. However, when we eventually
launch these jobs on a schedule, we want users to be able to change
this value through config options.

The new "job.pack-files.batchSize" option will override the default
dynamic batch-size calculation, but will be overridden by the
--batch-size=<size> argument.

This is the first config option of the type
"job.<job-name>.<option>" but is not the last.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Given the existing 'git run-job <job-name>' builtin, users _could_
construct their own scheduling mechanism for running jobs in the
background. However, it is much easier to have a dedicated process
that manages all jobs across multiple repos.

The 'git job-runner' builtin is specifically built to handle this
scenario. It will be customized further in later changes, but for
now it does the following:

* Given a list of '--repo=<path>' arguments, construct a list of
  repositories to manage with jobs.

* Every 30 minutes, iterate over all jobs and all repos to run

	git -C <repo> run-job <job-name>
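
In shell terms, the loop is roughly the following (run_job is a stand-in seam, since 'git run-job' is a new builtin introduced by this series; the job list is illustrative):

```shell
# Rough sketch of the job-runner loop. run_job is a seam so the
# 'git run-job' invocation can be stubbed out; the real builtin
# spawns the subcommand directly.
run_job() {
    git -C "$1" run-job "$2"
}

job_loop_once() {
    for repo in "$@"; do
        for job in commit-graph fetch loose-objects pack-files; do
            run_job "$repo" "$job"
        done
    done
}

# The daemon then repeats with a 30-minute pause:
#   while true; do job_loop_once "$repo1" "$repo2"; sleep 1800; done
```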

This builtin needs to be careful about how much of the Git internals
it consumes. The intention is that this is a long-lived process that
could be launched upon login and only closed on logout. For that
reason, we will avoid instantiating any object store or index data.

Run the maintenance jobs by running subcommands. We will update how
we enable or disable these jobs and separate their runs in a later
change.

RFC QUESTIONS:

1. The hardest part of this builtin is "how do we test it?" In
   Scalar, we can unit test the scheduler with mocks. What is the
   equivalent here for "make sure 'git job-runner' runs 'git run-job
   pack-files' on repo X"? I expect to add a "--no-loop" option that
   ensures the logic only runs one iteration of the loop.

2. The difference between 'git run-job' and 'git job-runner' is
   subtle and probably confusing. Are there better names?

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The 'git job-runner' command was introduced with a '--repo=<path>'
option so a user could start a process with an explicit list of
repos. However, it is more likely that we will want 'git
job-runner' to start without any arguments and discover the set of
repos from another source.

A natural source is the Git config. The 'git job-runner' does not
need to run in a Git repository, but the config could be located in
the global or system config.

Create the job.repo config setting as a multi-valued config option.

Read all values for job.repo whenever triggering an iteration of
the job loop. This allows a user to add or remove repos dynamically
without restarting the job-runner.
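
Registering repos then looks like this (paths illustrative):

```shell
# Register two repos for background maintenance; job.repo is
# multi-valued, so use --add for each.
git config --global --add job.repo /path/to/repo-a
git config --global --add job.repo /path/to/repo-b

# The job-runner re-reads the full list on every loop iteration:
git config --global --get-all job.repo
```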

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
We want to run different maintenance tasks at different rates. That
means we cannot rely only on the time between job-runner loop pauses
to reduce the frequency of specific jobs. This means we need to
persist a timestamp for the previous run somewhere. A natural place
is the Git config file for that repo.

Create the job.<job-name>.lastRun config option to store the
timestamp of the previous run of that job. Set this config option
after a successful run of 'git -C <repo> run-job <job-name>'.

To maximize code reuse, we dynamically construct the config key
and parse the config value into a timestamp in a generic way. This
makes the introduction of another config option extremely trivial:

The job.<job-name>.interval option allows a user to specify a
minimum number of seconds between two calls to 'git run-job
<job-name>' on a given repo. This could be stored in the global
or system config to override the default for all repos,
or could be updated on a per-repo basis. This is checked on every
iteration of the job loop, so a user could update this and see the
effect without restarting the job-runner process.
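
The per-job scheduling check can be sketched as follows (hypothetical helper; the line that would invoke 'git run-job' is replaced by an echo, since that builtin is introduced by this series):

```shell
# Hypothetical sketch: run a job only if job.<name>.interval seconds
# have elapsed since job.<name>.lastRun, then record a new timestamp.
maybe_run_job() {
    repo=$1 job=$2
    last=$(git -C "$repo" config "job.$job.lastRun" || echo 0)
    interval=$(git -C "$repo" config "job.$job.interval" || echo 3600)
    now=$(date +%s)

    if [ $((now - last)) -ge "$interval" ]; then
        # 'git -C "$repo" run-job "$job"' would run here
        echo "running $job in $repo" &&
        git -C "$repo" config "job.$job.lastRun" "$now"
    fi
}
```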

RFC QUESTION: I'm using a 'git -C <repo> config <option>' process
to test if enough time has elapsed before running the 'run-job'
process. These 'config' processes are pretty light-weight, so
hopefully they are not too noisy. An alternative would be to
always run 'git -C <repo> run-job <job-name>' and rely on that
process to no-op based on the configured values and how recently
they were run. However, that will likely interrupt users who want
to test the jobs in the foreground. Finally, that user disruption
would be mitigated by adding a "--check-latest" option. A user
running a job manually would not include that argument by default
and would succeed. However, any logging that we might do for the
job-runner would make it look like we are running the run-job
commands all the time even if they are no-ops. Advice welcome!

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The 'git job-runner' process uses a sleep(X) call to pause its
operation between iterations of the job loop. This defaults to 30
minutes, but can now be set to longer or shorter intervals via the
job.loopInterval config option. For
example, a user may want the job loop to run once every five
minutes while another wants the job loop to run once every six
hours.

This config value is checked immediately before the sleep(X) call,
which allows users to see the effect without restarting the
job-runner process. However, the process will be paused until the
previous sleep(X) call returns and the job loop is executed.
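
For example, shortening the pause to five minutes is a single config change, picked up before the next sleep:

```shell
# Pause 5 minutes between job-loop iterations (default: 30 minutes).
git config --global job.loopInterval 300
```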

RFC QUESTION: Is this use of sleep(X) the best way to do this?
Is there a better way to delay the process for a time interval, or
until a specified time? This just seemed like the simplest option.
The job-runner is doing low-priority work on an unpredictable
schedule by design, so sleep(X) seemed appropriate.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The previous change used the job.loopInterval config option to
adjust the default sleep between iterations of the job loop in
'git job-runner'. Now, add a command-line option so a user can
override the configured value with an explicit value.

The value specified by this command-line option cannot be
changed without restarting the job-runner process.

RFC QUESTION: Are things like this useful, or should we only
use config options for this customization?

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Allow a user to specify that a job should not be run on a given
repo by setting "job.<job-name>.enabled = false". The job-runner
will check this config before any other options and stop running
the job on that repo. If the config is disabled in the config
read by the job-runner itself (i.e. in global config) then the
job-runner will not attempt to run the job on any of the repos.

Since this config is checked dynamically, the job-runner will see
config changes without needing to restart the process.
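
The check itself can be sketched as a small helper (hypothetical; only the config key comes from this change):

```shell
# Hypothetical sketch of the enabled check for one job on one repo.
job_enabled() {
    repo=$1 job=$2
    # Unset counts as enabled; only an explicit false disables it.
    enabled=$(git -C "$repo" config --type=bool "job.$job.enabled" \
        || echo true)
    [ "$enabled" != "false" ]
}
```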

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Git has the ability to fork the process and launch a "daemonized"
copy of the current process. This is used to launch a background
'git gc' command in some cases or to launch the 'git daemon'
server process.

Update the 'git job-runner' command with a --daemonize option that
launches this background process.

The implementation of daemonize() in setup.c is very clear that
this may no-op and return an error if NO_POSIX_GOODIES is not
defined. Include a clear error message to point out that this
mechanism may not be available on a platform-by-platform basis.

I have been running the current version of this series on my
Linux VM using --daemonize to keep my copies of torvalds/linux and
git/git maintained. Using GIT_TRACE2_PERF set to a path, I can see
that the 'git run-job' processes are being created on the correct
schedule according to my config for each.

RFC QUESTION: I notice that 'git gc' does not document --daemonize.
Is that intentional? Or is it an oversight?

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Allow a user to override the default number of loose objects to
place into a new pack-file as part of the loose-objects job. This
can be done via the job.loose-objects.batchSize config option or
the --batch-size=<count> option in the 'git run-job' command. The
config value is checked once per run of 'git run-job loose-objects'
so an instance started by 'git job-runner' will use new values
automatically without restarting the 'git job-runner' process.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
@derrickstolee (Author) commented:

/submit

gitgitgadget bot commented Apr 3, 2020

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

>  * git run-job <job-name>: This builtin will run a single instance of a
>    maintenance job.
>    
>  * git job-runner [--repo=<path>]: This builtin will run an infinite loop
>    that executes git run-job as a subcommand.

What does this have to do with "git", though?  IOW, why does this
have to be part of Git, so that those who would benefit from having
a mechanism that makes it easy to run regular maintenance tasks but
are not Git users (or those that want to do such maintenance tasks
that are not necessarily tied to "git") must use "git" to do so?

I'll find out later why it is so after reading thru 15 patches
myself, so no need to give a quick answer to the above; it was just
my knee-jerk reaction.

Thanks.




@@ -0,0 +1,64 @@
git-run-job(1)

On the Git mailing list, Phillip Wood wrote (reply to this):

Hi Stolee

On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> When working with very large repositories, an incremental 'git fetch'
> command can download a large amount of data. If there are many other
> users pushing to a common repo, then this data can rival the initial
> pack-file size of a 'git clone' of a medium-size repo.
> 
> Users may want to keep the data on their local repos as close as
> possible to the data on the remote repos by fetching periodically in
> the background. This can break up a large daily fetch into several
> smaller hourly fetches.
> 
> However, if we simply ran 'git fetch <remote>' in the background,
> then the user running a foreground 'git fetch <remote>' would lose
> some important feedback when a new branch appears or an existing
> branch updates. This is especially true if a remote branch is
> force-updated and this isn't noticed by the user because it occurred
> in the background. Further, the functionality of 'git push
> --force-with-lease' becomes suspect.
> 
> When running 'git fetch <remote> <options>' in the background, use
> the following options for careful updating:
> 
> 1. --no-tags prevents fetching new tags in the background, so a
>     user still sees new tags appear in their foreground fetches.
> 
> 2. --refmap= removes the configured refspec which usually updates
>     refs/remotes/<remote>/* with the refs advertised by the remote.
> 
> 3. By adding a new refspec "+refs/heads/*:refs/hidden/<remote>/*"
>     we can ensure that we actually load the new values somewhere in
>     our refspace while not updating refs/heads or refs/remotes. By
>     storing these refs here, the commit-graph job will update the
>     commit-graph with the commits from these hidden refs.
> 
> 4. --prune will delete the refs/hidden/<remote> refs that no
>     longer appear on the remote.
> 
> We've been using this step as a critical background job in Scalar
> [1] (and VFS for Git). This solved a pain point that was showing up
> in user reports: fetching was a pain! Users do not like waiting to
> download the data that was created while they were away from their
> machines. After implementing background fetch, the foreground fetch
> commands sped up significantly because they mostly just update refs
> and download a small amount of new data. The effect is especially
> dramatic when paired with --no-show-forced-updates (through
> fetch.showForcedUpdates=false).
> 
> [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/FetchStep.cs
> 
> RFC QUESTIONS:
> 
> 1. One downside of the refs/hidden pattern is that 'git log' will
>     decorate commits with twice as many refs if they appear at a
>     remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
>     there an easy way to exclude a refspace from decorations? Should
>     we make refs/hidden/* a "special" refspace that is excluded from
>     decorations?

Having some way to specify which refs outside of 
refs/{heads,remote,tags}/ to show or exclude from decorations would be 
useful I think. Fetching to a hidden ref is a good idea (as are the 
other steps you outline above) but as you say we don't want it to show 
up in the output of 'git log' etc.

> 2. This feature is designed for a desktop machine or equivalent
>     that has a permanent wired network connection, and the machine
>     stays on while the user is not present. For things like laptops
>     with inconsistent WiFi connections (that may be metered) the
>     feature can use the less stable connection more than the user
>     wants. Of course, this feature is opt-in for Git, but in Scalar
>     we have a "scalar pause" command [2] that pauses all maintenance
>     for some amount of time. We should consider a similar mechanism
>     for Git, but for the point of this series the user needs to set
>     up the "background" part of these jobs manually.
> 
> [2] https://github.com/microsoft/scalar/blob/master/Scalar/CommandLine/PauseVerb.cs
> [3] https://github.com/microsoft/scalar/blob/master/docs/advanced.md#controlling-background-maintenance
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>   Documentation/git-run-job.txt | 13 ++++-
>   builtin/run-job.c             | 89 ++++++++++++++++++++++++++++++++++-
>   t/t7900-run-job.sh            | 22 +++++++++
>   3 files changed, 122 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
> index 8bf2762d650..eb92e891915 100644
> --- a/Documentation/git-run-job.txt
> +++ b/Documentation/git-run-job.txt
> @@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
>   SYNOPSIS
>   --------
>   [verse]
> -'git run-job commit-graph'
> +'git run-job (commit-graph|fetch)'
>   
>   
>   DESCRIPTION
> @@ -47,6 +47,17 @@ since it will not expire `.graph` files that were in the previous
>   `commit-graph-chain` file. They will be deleted by a later run based on
>   the expiration delay.
>   
> +'fetch'::
> +
> +The `fetch` job updates the object directory with the latest objects
> +from all registered remotes. For each remote, a `git fetch` command is
> +run. The refmap is custom to avoid updating local or remote branches
> +(those in `refs/heads` or `refs/remotes`). Instead, the remote refs are
> +stored in `refs/hidden/<remote>/`. Also, no tags are updated.
> ++
> +This means that foreground fetches are still required to update the
> +remote refs, but the user is notified when the branches and tags are
> +updated on the remote.
>   
>   GIT
>   ---
> diff --git a/builtin/run-job.c b/builtin/run-job.c
> index dd7709952d3..e59056b2918 100644
> --- a/builtin/run-job.c
> +++ b/builtin/run-job.c
> @@ -7,7 +7,7 @@
>   #include "run-command.h"
>   
>   static char const * const builtin_run_job_usage[] = {
> -	N_("git run-job commit-graph"),
> +	N_("git run-job (commit-graph|fetch)"),
>   	NULL
>   };
>   
> @@ -60,6 +60,91 @@ static int run_commit_graph_job(void)
>   	return 1;
>   }
>   
> +static int fetch_remote(const char *remote)
> +{
> +	int result;
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	struct strbuf refmap = STRBUF_INIT;
> +
> +	argv_array_pushl(&cmd, "fetch", remote, "--quiet", "--prune",
> +			 "--no-tags", "--refmap=", NULL);
> +
> +	strbuf_addf(&refmap, "+refs/heads/*:refs/hidden/%s/*", remote);
> +	argv_array_push(&cmd, refmap.buf);
> +
> +	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +
> +	strbuf_release(&refmap);
> +	return result;
> +}
> +
> +static int fill_remotes(struct string_list *remotes)

Isn't there an easy way to get this using the config API rather than 
forking 'git remote'?

Best Wishes

Phillip

> +{
> +	int result = 0;
> +	FILE *proc_out;
> +	struct strbuf line = STRBUF_INIT;
> +	struct child_process *remote_proc = xmalloc(sizeof(*remote_proc));
> +
> +	child_process_init(remote_proc);
> +
> +	argv_array_pushl(&remote_proc->args, "git", "remote", NULL);
> +
> +	remote_proc->out = -1;
> +
> +	if (start_command(remote_proc)) {
> +		error(_("failed to start 'git remote' process"));
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	proc_out = xfdopen(remote_proc->out, "r");
> +
> +	/* if there is no line, leave the value as given */
> +	while (!strbuf_getline(&line, proc_out))
> +		string_list_append(remotes, line.buf);
> +
> +	strbuf_release(&line);
> +
> +	fclose(proc_out);
> +
> +	if (finish_command(remote_proc)) {
> +		error(_("failed to finish 'git remote' process"));
> +		result = 1;
> +	}
> +
> +cleanup:
> +	free(remote_proc);
> +	return result;
> +}
> +
> +static int run_fetch_job(void)
> +{
> +	int result = 0;
> +	struct string_list_item *item;
> +	struct string_list remotes = STRING_LIST_INIT_DUP;
> +
> +	if (fill_remotes(&remotes)) {
> +		error(_("failed to fill remotes"));
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	/*
> +	 * Do not modify the result based on the success of the 'fetch'
> +	 * operation, as a loss of network could cause 'fetch' to fail
> +	 * quickly. We do not want that to stop the rest of our
> +	 * background operations.
> +	 */
> +	for (item = remotes.items;
> +	     item && item < remotes.items + remotes.nr;
> +	     item++)
> +		fetch_remote(item->string);
> +
> +cleanup:
> +	string_list_clear(&remotes, 0);
> +	return result;
> +}
> +
>   int cmd_run_job(int argc, const char **argv, const char *prefix)
>   {
>   	static struct option builtin_run_job_options[] = {
> @@ -79,6 +164,8 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
>   	if (argc > 0) {
>   		if (!strcmp(argv[0], "commit-graph"))
>   			return run_commit_graph_job();
> +		if (!strcmp(argv[0], "fetch"))
> +			return run_fetch_job();
>   	}
>   
>   	usage_with_options(builtin_run_job_usage,
> diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
> index 18b9bd26b3a..d3faeba135b 100755
> --- a/t/t7900-run-job.sh
> +++ b/t/t7900-run-job.sh
> @@ -44,4 +44,26 @@ test_expect_success 'commit-graph job' '
>   	)
>   '
>   
> +test_expect_success 'fetch job' '
> +	git clone "file://$(pwd)/server" client &&
> +
> +	# Before fetching, build a client commit-graph
> +	git -C client run-job commit-graph &&
> +	chain=client/.git/objects/info/commit-graphs/commit-graph-chain &&
> +	test_line_count = 1 $chain &&
> +
> +	git -C client branch -v --remotes >before-refs &&
> +	test_commit -C server 24 &&
> +
> +	git -C client run-job fetch &&
> +	git -C client branch -v --remotes >after-refs &&
> +	test_cmp before-refs after-refs &&
> +	test_cmp server/.git/refs/heads/master \
> +		 client/.git/refs/hidden/origin/master &&
> +
> +	# the hidden ref should trigger a new layer in the commit-graph
> +	git -C client run-job commit-graph &&
> +	test_line_count = 2 $chain
> +'
> +
>   test_done
> 
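The fetch invocation built by fetch_remote() in the patch can be tried with stock git. A self-contained sketch ('server' and 'client' are hypothetical repo names):

```shell
# build a tiny server repo and clone it
git init --quiet server
git -C server -c user.name=t -c user.email=t@example.com \
    commit --quiet --allow-empty -m init
git clone --quiet server client
# new work appears on the server after the clone
git -C server -c user.name=t -c user.email=t@example.com \
    commit --quiet --allow-empty -m next
# the "careful" background fetch: no tags, no refs/remotes updates;
# advertised heads land under refs/hidden/origin/ instead
git -C client fetch --quiet --prune --no-tags --refmap= origin \
    "+refs/heads/*:refs/hidden/origin/*"
git -C client for-each-ref refs/hidden/
```

The empty --refmap= is what suppresses the opportunistic refs/remotes/origin/* update that normally accompanies an explicit command-line refspec.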


On the Git mailing list, Derrick Stolee wrote (reply to this):

On 4/5/2020 11:14 AM, Phillip Wood wrote:
> Hi Stolee
> 
> On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
>> RFC QUESTIONS:
>>
>> 1. One downside of the refs/hidden pattern is that 'git log' will
>>     decorate commits with twice as many refs if they appear at a
>>     remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
>>     there an easy way to exclude a refspace from decorations? Should
>>     we make refs/hidden/* a "special" refspace that is excluded from
>>     decorations?
> 
> Having some way to specify which refs outside of refs/{heads,remote,tags}/ to show or exclude from decorations would be useful I think. Fetching to a hidden ref is a good idea (as are the other steps you outline above) but as you say we don't want it to show up in the output of 'git log' etc.

I'll work on this first. It seems less controversial.

>> +static int fill_remotes(struct string_list *remotes)
> 
> Isn't there an easy way to get this using the config API rather than forking 'git remote'?

You're right. I should have used the config here. I've been overly
biased to how we do it in the C# layer of Scalar _plus_ how I wanted
the job-runner to run "outside" a Git repository. This could be much
simpler with the config API.

Thanks,
-Stolee

@@ -0,0 +1,13 @@
job.pack-files.batchSize::

On the Git mailing list, Phillip Wood wrote (reply to this):

Hi Stolee

On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The 'git job-runner' command was introduced with a '--repo=<path>'
> option so a user could start a process with an explicit list of
> repos. However, it is more likely that we will want 'git
> job-runner' to start without any arguments and discover the set of
> repos from another source.
> 
> A natural source is the Git config. The 'git job-runner' does not
> need to run in a Git repository, but the config could be located in
> the global or system config.
> 
> Create the job.repo config setting as a multi-valued config option.
> 
> Read all values for job.repo whenever triggering an iteration of
> the job loop. This allows a user to add or remove repos dynamically
> without restarting the job-runner.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>   Documentation/config/job.txt |  7 +++++++
>   builtin/job-runner.c         | 20 ++++++++++++++++++++
>   2 files changed, 27 insertions(+)
> 
> diff --git a/Documentation/config/job.txt b/Documentation/config/job.txt
> index efdb76afad3..2dfed3935fa 100644
> --- a/Documentation/config/job.txt
> +++ b/Documentation/config/job.txt
> @@ -4,3 +4,10 @@ job.pack-files.batchSize::
>   	part of `git run-job pack-files`. If not specified, then a
>   	dynamic size calculation is run. See linkgit:git-run-job[1]
>   	for more details.
> +
> +job.repo::
> +	Store an absolute path to a repository that wants background
> +	jobs run by `git job-runner`. This is a multi-valued config
> +	option, but it must be stored in a config file seen by the
> +	background runner. This may be the global or system config
> +	depending on how `git job-runner` is launched on your system.
> diff --git a/builtin/job-runner.c b/builtin/job-runner.c
> index 135288bcaae..bed401f94bf 100644
> --- a/builtin/job-runner.c
> +++ b/builtin/job-runner.c
> @@ -20,6 +20,9 @@ static int arg_repos_append(const struct option *opt,
>   
>   static int load_active_repos(struct string_list *repos)
>   {
> +	struct string_list_item *item;
> +	const struct string_list *config_repos;
> +
>   	if (arg_repos.nr) {
>   		struct string_list_item *item;
>   		for (item = arg_repos.items;
> @@ -29,6 +32,23 @@ static int load_active_repos(struct string_list *repos)
>   		return 0;
>   	}
>   
> +	config_repos = git_config_get_value_multi("job.repo");

Does this really re-read the config files as the commit message suggests 
or just return the cached values? Perhaps the runner could exec itself 
with a --run-jobs option to actually run the jobs so that it sees any 
updated config values.

Best Wishes

Phillip

> +
> +	for (item = config_repos->items;
> +	     item && item < config_repos->items + config_repos->nr;
> +	     item++) {
> +		DIR *dir = opendir(item->string);
> +
> +		if (!dir)
> +			continue;
> +
> +		closedir(dir);
> +		string_list_append(repos, item->string);
> +	}
> +
> +	string_list_sort(repos);
> +	string_list_remove_duplicates(repos, 0);
> +
>   	return 0;
>   }
>   
> 
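Since job.repo is an ordinary multi-valued key, the registration step the patch expects can be done with stock 'git config'. A sketch that writes to an explicit file standing in for the global config (the file name and repo paths are hypothetical):

```shell
cfg=demo-gitconfig
# each --add appends another job.repo value; none replaces the others
git config --file "$cfg" --add job.repo "$HOME/repos/big-repo"
git config --file "$cfg" --add job.repo "$HOME/repos/other-repo"
# the runner-side read: all values, in file order
git config --file "$cfg" --get-all job.repo
```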


On the Git mailing list, Derrick Stolee wrote (reply to this):

On 4/5/2020 11:18 AM, Phillip Wood wrote:
> Hi Stolee
> 
> On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
>>   +    config_repos = git_config_get_value_multi("job.repo");
> 
> Does this really re-read the config files as the commit message suggests or just return the cached values? Perhaps the runner could exec itself with a --run-jobs option to actually run the jobs so that it sees any updated config values.

You're right, I need to double-check that. I'm guessing that you are
right and it uses the cached values. This should use the run_command()
pattern to ensure the config is as new as possible.

Thanks,
-Stolee

@@ -0,0 +1,23 @@
job.<job-name>.interval::

On the Git mailing list, Phillip Wood wrote (reply to this):

Hi Stolee

On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> We want to run different maintenance tasks at different rates. That
> means we cannot rely only on the time between job-runner loop pauses
> to reduce the frequency of specific jobs. This means we need to
> persist a timestamp for the previous run somewhere. A natural place
> is the Git config file for that repo.
> 
> Create the job.<job-name>.lastRun config option to store the
> timestamp of the previous run of that job. Set this config option
> after a successful run of 'git -C <repo> run-job <job-name>'.
> 
> To maximize code reuse, we dynamically construct the config key
> and parse the config value into a timestamp in a generic way. This
> makes the introduction of another config option extremely trivial:
> 
> The job.<job-name>.interval option allows a user to specify a
> minimum number of seconds between two calls to 'git run-job
> <job-name>' on a given repo. This could be stored in the global
> or system config to change the default for all repos, or it can be
> set on a per-repo basis. This is checked on every
> iteration of the job loop, so a user could update this and see the
> effect without restarting the job-runner process.
> 
> RFC QUESTION: I'm using a 'git -C <repo> config <option>' process
> to test if enough time has elapsed before running the 'run-job'
> process. These 'config' processes are pretty light-weight, so
> hopefully they are not too noisy. An alternative would be to
> always run 'git -C <repo> run-job <job-name>' and rely on that
> process to no-op based on the configured values and how recently
> they were run.

You're still executing another process so it doesn't really save 
anything in the 'noop' case. In the case where something needs to be 
done I think the extra config process is unlikely to be noticed 
(especially as 'git job-runner' forks a lot anyway)

Best Wishes

Phillip

> However, that will likely interrupt users who want
> to test the jobs in the foreground. Finally, that user disruption
> would be mitigated by adding a "--check-latest" option. A user
> running a job manually would not include that argument by default
> and would succeed. However, any logging that we might do for the
> job-runner would make it look like we are running the run-job
> commands all the time even if they are no-ops. Advice welcome!
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>   Documentation/config/job.txt  |  10 ++
>   Documentation/git-run-job.txt |   3 +
>   builtin/job-runner.c          | 177 +++++++++++++++++++++++++++++++++-
>   3 files changed, 189 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/config/job.txt b/Documentation/config/job.txt
> index 2dfed3935fa..7c799d66221 100644
> --- a/Documentation/config/job.txt
> +++ b/Documentation/config/job.txt
> @@ -1,3 +1,13 @@
> +job.<job-name>.interval::
> +	The minimum number of seconds between runs of
> +	`git run-job <job-name>` when running `git job-runner`.
> +
> +job.<job-name>.lastRun::
> +	The Unix epoch for the timestamp of the previous run of
> +	`git run-job <job-name>` when running `git job-runner`. You
> +	can manually update this to a later time to delay a specific
> +	job on this repository.
> +
>   job.pack-files.batchSize::
>   	This string value `<size>` will be passed to the
>   	`git multi-pack-index repack --batch-size=<size>` command as
> diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
> index cdd6417f7c9..c6d5674d699 100644
> --- a/Documentation/git-run-job.txt
> +++ b/Documentation/git-run-job.txt
> @@ -90,6 +90,9 @@ pack-files, the job does not attempt to repack. Otherwise, the batch
>   size is the sum of all pack-file sizes minus the largest pack-file size.
>   The batch size is capped at two gigabytes. This intends to pack all
>   small pack-files into a single pack-file.
> ++
> +The `--batch-size=<size>` option will override the dynamic or configured
> +batch size.
>   
>   
>   GIT
> diff --git a/builtin/job-runner.c b/builtin/job-runner.c
> index bed401f94bf..aee55c106e8 100644
> --- a/builtin/job-runner.c
> +++ b/builtin/job-runner.c
> @@ -4,11 +4,175 @@
>   #include "run-command.h"
>   #include "string-list.h"
>   
> +#define MAX_TIMESTAMP ((timestamp_t)-1)
> +
>   static char const * const builtin_job_runner_usage[] = {
>   	N_("git job-runner [<options>]"),
>   	NULL
>   };
>   
> +static char *config_name(const char *prefix,
> +			 const char *job,
> +			 const char *postfix)
> +{
> +	int postfix_dot = 0;
> +	struct strbuf name = STRBUF_INIT;
> +
> +	if (prefix) {
> +		strbuf_addstr(&name, prefix);
> +		strbuf_addch(&name, '.');
> +	}
> +
> +	if (job) {
> +		strbuf_addstr(&name, job);
> +		postfix_dot = 1;
> +	}
> +
> +	if (postfix) {
> +		if (postfix_dot)
> +			strbuf_addch(&name, '.');
> +		strbuf_addstr(&name, postfix);
> +	}
> +
> +	return strbuf_detach(&name, NULL);
> +}
> +
> +static int try_get_config(const char *job,
> +			  const char *repo,
> +			  const char *postfix,
> +			  char **value)
> +{
> +	int result = 0;
> +	FILE *proc_out;
> +	struct strbuf line = STRBUF_INIT;
> +	char *cname = config_name("job", job, postfix);
> +	struct child_process *config_proc = xmalloc(sizeof(*config_proc));
> +
> +	child_process_init(config_proc);
> +
> +	argv_array_push(&config_proc->args, "git");
> +	argv_array_push(&config_proc->args, "-C");
> +	argv_array_push(&config_proc->args, repo);
> +	argv_array_push(&config_proc->args, "config");
> +	argv_array_push(&config_proc->args, cname);
> +	free(cname);
> +
> +	config_proc->out = -1;
> +
> +	if (start_command(config_proc)) {
> +		warning(_("failed to start process for repo '%s'"),
> +			repo);
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	proc_out = xfdopen(config_proc->out, "r");
> +
> +	/* if there is no line, leave the value as given */
> +	if (!strbuf_getline(&line, proc_out))
> +		*value = strbuf_detach(&line, NULL);
> +	else
> +		*value = NULL;
> +
> +	strbuf_release(&line);
> +
> +	fclose(proc_out);
> +
> +	result = finish_command(config_proc);
> +
> +cleanup:
> +	free(config_proc);
> +	return result;
> +}
> +
> +static int try_get_timestamp(const char *job,
> +			     const char *repo,
> +			     const char *postfix,
> +			     timestamp_t *t)
> +{
> +	char *value;
> +	int result = try_get_config(job, repo, postfix, &value);
> +
> +	if (!result) {
> +		*t = atol(value);
> +		free(value);
> +	}
> +
> +	return result;
> +}
> +
> +static timestamp_t get_last_run(const char *job,
> +				const char *repo)
> +{
> +	timestamp_t last_run = 0;
> +
> +	/* In an error state, do not run the job */
> +	if (try_get_timestamp(job, repo, "lastrun", &last_run))
> +		return MAX_TIMESTAMP;
> +
> +	return last_run;
> +}
> +
> +static timestamp_t get_interval(const char *job,
> +				const char *repo)
> +{
> +	timestamp_t interval = MAX_TIMESTAMP;
> +
> +	/* In an error state, do not run the job */
> +	if (try_get_timestamp(job, repo, "interval", &interval))
> +		return MAX_TIMESTAMP;
> +
> +	if (interval == MAX_TIMESTAMP) {
> +		/* load defaults for each job */
> +		if (!strcmp(job, "commit-graph") || !strcmp(job, "fetch"))
> +			interval = 60 * 60; /* 1 hour */
> +		else
> +			interval = 24 * 60 * 60; /* 1 day */
> +	}
> +
> +	return interval;
> +}
> +
> +static int set_last_run(const char *job,
> +			const char *repo,
> +			timestamp_t last_run)
> +{
> +	int result = 0;
> +	struct strbuf last_run_string = STRBUF_INIT;
> +	struct child_process *config_proc = xmalloc(sizeof(*config_proc));
> +	char *cname = config_name("job", job, "lastrun");
> +
> +	strbuf_addf(&last_run_string, "%"PRItime, last_run);
> +
> +	child_process_init(config_proc);
> +
> +	argv_array_push(&config_proc->args, "git");
> +	argv_array_push(&config_proc->args, "-C");
> +	argv_array_push(&config_proc->args, repo);
> +	argv_array_push(&config_proc->args, "config");
> +	argv_array_push(&config_proc->args, cname);
> +	argv_array_push(&config_proc->args, last_run_string.buf);
> +	free(cname);
> +	strbuf_release(&last_run_string);
> +
> +	if (start_command(config_proc)) {
> +		warning(_("failed to start process for repo '%s'"),
> +			repo);
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	if (finish_command(config_proc)) {
> +		warning(_("failed to finish process for repo '%s'"),
> +			repo);
> +		result = 1;
> +	}
> +
> +cleanup:
> +	free(config_proc);
> +	return result;
> +}
> +
>   static struct string_list arg_repos = STRING_LIST_INIT_DUP;
>   
>   static int arg_repos_append(const struct option *opt,
> @@ -54,9 +218,20 @@ static int load_active_repos(struct string_list *repos)
>   
>   static int run_job(const char *job, const char *repo)
>   {
> +	int result;
>   	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	timestamp_t now = time(NULL);
> +	timestamp_t last_run = get_last_run(job, repo);
> +	timestamp_t interval = get_interval(job, repo);
> +
> +	if (last_run + interval > now)
> +		return 0;
> +
>   	argv_array_pushl(&cmd, "-C", repo, "run-job", job, NULL);
> -	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +
> +	set_last_run(job, repo, now);
> +	return result;
>   }
>   
>   static int run_job_loop_step(struct string_list *list)
> 
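The lastRun/interval bookkeeping in run_job() above boils down to the following shell logic. This is a sketch: 'sched-demo' is a hypothetical repo, and since 'git run-job' only exists in this proposed series, the due branch just echoes what it would execute:

```shell
git init --quiet sched-demo
job=commit-graph
now=$(date +%s)
# mirror get_last_run()/get_interval(): an unset lastrun reads as 0,
# and the commit-graph job defaults to a 3600-second interval
last_run=$(git -C sched-demo config "job.$job.lastrun" || echo 0)
interval=$(git -C sched-demo config "job.$job.interval" || echo 3600)
if [ $((last_run + interval)) -le "$now" ]; then
    echo "due: would exec 'git -C sched-demo run-job $job'"
    git -C sched-demo config "job.$job.lastrun" "$now"
else
    echo "skip: $job ran $((now - last_run))s ago"
fi
```

On a fresh repo the first evaluation is always due (0 + interval <= now); a second evaluation within the hour takes the skip branch.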

@@ -0,0 +1,13 @@
job.pack-files.batchSize::

On the Git mailing list, Phillip Wood wrote (reply to this):

Hi Stolee

On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The 'git job-runner' command was introduced with a '--repo=<path>'
> option so a user could start a process with an explicit list of
> repos. However, it is more likely that we will want 'git
> job-runner' to start without any arguments and discover the set of
> repos from another source.

One thought that occurred to me was that it might be convenient to put 
'git job-runner $HOME &' in my .bashrc to have it start on login and 
find all the repositories under $HOME without me having to list each one 
(and remember to update the list each time I clone a new repository). 
There are a couple of issues with this:
  1 - We only want to run the jobs once per repo, not for each worktree. 
That should be easy enough to implement by checking if we're in the main 
worktree.
  2 - If I start several shells I only want one instance of 'git 
job-runner' to start. One way to handle that would be for the runner to 
hold a lock file in the directory it's given and quit if the lock is 
already taken. It should probably walk up the directory hierarchy to 
check for lock files as well in case I try to start another job-runner 
instance in a subdirectory.

Best Wishes

Phillip


> A natural source is the Git config. The 'git job-runner' does not
> need to run in a Git repository, but the config could be located in
> the global or system config.
> 
> Create the job.repo config setting as a multi-valued config option.
> 
> Read all values for job.repo whenever triggering an iteration of
> the job loop. This allows a user to add or remove repos dynamically
> without restarting the job-runner.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>   Documentation/config/job.txt |  7 +++++++
>   builtin/job-runner.c         | 20 ++++++++++++++++++++
>   2 files changed, 27 insertions(+)
> 
> diff --git a/Documentation/config/job.txt b/Documentation/config/job.txt
> index efdb76afad3..2dfed3935fa 100644
> --- a/Documentation/config/job.txt
> +++ b/Documentation/config/job.txt
> @@ -4,3 +4,10 @@ job.pack-files.batchSize::
>   	part of `git run-job pack-files`. If not specified, then a
>   	dynamic size calculation is run. See linkgit:git-run-job[1]
>   	for more details.
> +
> +job.repo::
> +	Store an absolute path to a repository that wants background
> +	jobs run by `git job-runner`. This is a multi-valued config
> +	option, but it must be stored in a config file seen by the
> +	background runner. This may be the global or system config
> +	depending on how `git job-runner` is launched on your system.
> diff --git a/builtin/job-runner.c b/builtin/job-runner.c
> index 135288bcaae..bed401f94bf 100644
> --- a/builtin/job-runner.c
> +++ b/builtin/job-runner.c
> @@ -20,6 +20,9 @@ static int arg_repos_append(const struct option *opt,
>   
>   static int load_active_repos(struct string_list *repos)
>   {
> +	struct string_list_item *item;
> +	const struct string_list *config_repos;
> +
>   	if (arg_repos.nr) {
>   		struct string_list_item *item;
>   		for (item = arg_repos.items;
> @@ -29,6 +32,23 @@ static int load_active_repos(struct string_list *repos)
>   		return 0;
>   	}
>   
> +	config_repos = git_config_get_value_multi("job.repo");
> +
> +	for (item = config_repos->items;
> +	     item && item < config_repos->items + config_repos->nr;
> +	     item++) {
> +		DIR *dir = opendir(item->string);
> +
> +		if (!dir)
> +			continue;
> +
> +		closedir(dir);
> +		string_list_append(repos, item->string);
> +	}
> +
> +	string_list_sort(repos);
> +	string_list_remove_duplicates(repos, 0);
> +
>   	return 0;
>   }
>   
> 


On the Git mailing list, Derrick Stolee wrote (reply to this):

On 4/5/2020 11:41 AM, Phillip Wood wrote:
> Hi Stolee
> 
> On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> The 'git job-runner' command was introduced with a '--repo=<path>'
>> option so a user could start a process with an explicit list of
>> repos. However, it is more likely that we will want 'git
>> job-runner' to start without any arguments and discover the set of
>> repos from another source.
> 
> One thought that occurred to me was that it might be convenient to put
> 'git job-runner $HOME &' in my .bashrc to have it start on login and
> find all the repositories under $HOME without me having to list each
> one (and remember to update the list each time I clone a new repository).
> There are a couple of issues with this
> 1 - We only want to run the jobs once per repo, not for each worktree.
> That should be easy enough to implement by checking if we're in the
> main worktree.

This idea of "please check all (immediate) directories under <dir> for
repositories to run maintenance" is an interesting one. It certainly
could be added later.

Your concern about worktrees is actually not a big deal. Since we check
the config for the "last run" of a job on a repo, we will not run the
same job immediately on a worktree after running it on the original
(unless the interval times are incredibly short).

> 2 - If I start several shells I only want one instance of 'git
> job-runner' to start. One way to handle that would be for the runner
> to hold a lock file in the directory it's given and quit if the lock
> is already taken. It should probably walk up the directory hierarchy
> to check for lock files as well in case I try to start another
> job-runner instance in a subdirectory.

This is an interesting idea. My gut reaction is that we don't want
to prevent users from running multiple processes if they want to. But
if they run the process twice in the same directory then they are
likely to be running on the same set of repos (barring "-c jobs.repo"
or equivalent).

Again, my hope is that we would solve this problem by having a cross-
platform "job service" that makes the user setup of editing .bashrc
irrelevant. Even as that idea gets pushback, we should allow
expert users the ability to customize these steps to their delight.

Thanks,
-Stolee

@@ -0,0 +1,64 @@
git-run-job(1)

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> 1. One downside of the refs/hidden pattern is that 'git log' will
>    decorate commits with twice as many refs if they appear at a
>    remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
>    there an easy way to exclude a refspace from decorations?

I do not think there is, but it makes sense to teach the decoration
machinery to either use only certain refs hierarchies or use all
hierarchies except for certain ones; if we want to make sure we
won't break existing workflows, we should by default use all the
refs we currently use and nothing else, but over time we probably
would want to migrate the default to cover only the local and
remote-tracking branches and tags (and at that point, refs/hidden
would be outside the decoration source).

By the way, I have a moderately strong opinion against the use of
"refs/hidden" for the purpose of "prefetch in advance, without
disrupting refs/remotes".  There may be more than one reason why
some refs want to be "hidden", and depending on the purpose, the
exact refs that one workflow (e.g. "decorate") wants to hide may be
the ones that want to be exposed.

If we rename it to "refs/prefetch/", would it make the purpose of
the hierarchy clearer without squatting on a vague (because it does
not tell why it is hidden) name "hidden" that other people might
want to use to hide their stuff for different reasons?
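Junio's suggestion amounts to fetching with a refspec that targets the new hierarchy. The exact spelling below (and the remote URL) is an illustrative assumption, not the wording the series ended up with:

```shell
# Illustration of prefetching into a refs/prefetch/ hierarchy so that
# refs/remotes/ is never moved behind the user's back.
git init --quiet prefetch-demo && cd prefetch-demo
git remote add origin https://example.com/repo.git
# A background prefetch could pass an explicit refspec:
#   git fetch --prune --no-tags origin "+refs/heads/*:refs/prefetch/origin/*"
# or the refspec could be configured persistently alongside the default:
git config --add remote.origin.fetch '+refs/heads/*:refs/prefetch/origin/*'
git config --get-all remote.origin.fetch
```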

> Should
>    we make refs/hidden/* a "special" refspace that is excluded from
>    decorations?

See above.

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 4/5/2020 4:28 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> 1. One downside of the refs/hidden pattern is that 'git log' will
>>    decorate commits with twice as many refs if they appear at a
>>    remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
>>    there an easy way to exclude a refspace from decorations?
> 
> I do not think there is, but it makes sense to teach the decoration
> machinery to either use only certain refs hierarchies or use all
> hierarchies except for certain ones; if we want to make sure we
> won't break existing workflows, we should by default use all the
> refs we currently use and nothing else, but over time we probably
> would want to migrate the default to cover only the local and
> remote-tracking branches and tags (and at that point, refs/hidden
> would be outside the decoration source).

I'll see what I can do about adding config to remove some refs from
decorations. That is immediately useful for Scalar users, too.
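Such a config could look something like the following sketch; the key name `log.excludeDecoration` is an illustration of the idea, not a finalized design from this thread:

```shell
# Sketch of a decoration-exclusion config: tell 'git log' not to
# decorate commits with refs under a given hierarchy.
git init --quiet decorate-demo && cd decorate-demo
git config --add log.excludeDecoration 'refs/prefetch/*'
git config --get log.excludeDecoration
```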

> By the way, I have a moderately strong opinion against the use of
> "refs/hidden" for the purpose of "prefetch in advance, without
> disrupting refs/remotes".  There may be more than one reason why
> some refs want to be "hidden", and depending on the purpose, the
> exact refs that one workflow (e.g. "decorate") wants to hide may be
> the ones that want to be exposed.
> 
> If we rename it to "refs/prefetch/", would it make the purpose of
> the hierarchy clearer without squatting on a vague (because it does
> not tell why it is hidden) name "hidden" that other people might
> want to use to hide their stuff for different reasons?

I like "refs/prefetch". That's more descriptive.

>> Should
>>    we make refs/hidden/* a "special" refspace that is excluded from
>>    decorations?
> 
> See above.
Thanks,
-Stolee

@@ -0,0 +1,76 @@
git-run-job(1)

On the Git mailing list, Junio C Hamano wrote (reply to this):

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Create a 'loose-objects' job for the 'git run-job' command. This
> helps clean up loose objects without disrupting concurrent Git
> commands using the following sequence of events:

This is exactly the kind of thing I would have expected you'd be
adding to "git repack".
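The sequence the commit message refers to is not reproduced above; the two-phase shape below is my reading of such a job, labeled as an assumption, not a quoted design:

```shell
# Hedged sketch of a loose-object cleanup that avoids disrupting
# concurrent commands: first drop loose objects that already exist in
# a pack, then batch the remaining loose objects into a new pack
# WITHOUT deleting them (a later run's first phase prunes them once
# they are packed).
git init --quiet loose-demo && cd loose-demo
git -c user.name=demo -c user.email=demo@example.com \
    commit --quiet --allow-empty -m 'seed some loose objects'
# Phase 1: remove loose objects that already exist in packs.
git prune-packed
# Phase 2: pack the loose objects that remain, leaving them in place.
find .git/objects -type f -path '*/??/*' |
    sed 's|.*objects/\(..\)/|\1|' |
    git pack-objects --quiet .git/objects/pack/loose >/dev/null
ls .git/objects/pack/
```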

@gitgitgadget

gitgitgadget bot commented Apr 7, 2020

On the Git mailing list, Danh Doan wrote (reply to this):

On 2020-04-03 20:16:21-0400, Derrick Stolee <stolee@gmail.com> wrote:
> On 4/3/2020 5:40 PM, Junio C Hamano wrote:
> > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > 
> >>  * git run-job <job-name>: This builtin will run a single instance of a
> >>    maintenance job.
> >>    
> >>  * git job-runner [--repo=<path>]: This builtin will run an infinite loop
> >>    that executes git run-job as a subcommand.
> > 
> > What does this have to do with "git", though?  IOW, why does this
> > have to be part of Git, so that those who would benefit from having
> > a mechanism that makes it easy to run regular maintenance tasks but
> > are not Git users (or those that want to do such maintenance tasks
> > that are not necessarily tied to "git") must use "git" to do so?

I also agree with Junio;
I don't think Git should be responsible for being a scheduler.
That's the job of either the traditional crontab or at(1) on *nix, or
the Task Scheduler on Windows.

> That's a reasonable reaction. The short version of my reasoning is that
> many many people _use_ Git but are not Git experts. While a Git expert
> could find the right set of commands to run and at what frequency to
> keep their repo clean, most users do not want to spend time learning
> these commands. It's also worth our time as contributors to select what

And now, people will need to learn _both_ Git's existing maintenance
commands _and_ the new scheduler. (Do I understand that right? I
haven't gone through all the patches.)

Yes, it could be set up once and forgotten, but if there's a problem
with their local repo, they will scratch their heads trying to
understand what went wrong.

It becomes easier to destroy their repo, and harder to know what
operations are running on their computer.

-- 
Danh

@gitgitgadget

gitgitgadget bot commented Apr 7, 2020

On the Git mailing list, "brian m. carlson" wrote (reply to this):



On 2020-04-04 at 00:16:21, Derrick Stolee wrote:
> On 4/3/2020 5:40 PM, Junio C Hamano wrote:
> > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >
> >>  * git run-job <job-name>: This builtin will run a single instance of a
> >>    maintenance job.
> >>
> >>  * git job-runner [--repo=<path>]: This builtin will run an infinite loop
> >>    that executes git run-job as a subcommand.
> >
> > What does this have to do with "git", though?  IOW, why does this
> > have to be part of Git, so that those who would benefit from having
> > a mechanism that makes it easy to run regular maintenance tasks but
> > are not Git users (or those that want to do such maintenance tasks
> > that are not necessarily tied to "git") must use "git" to do so?
> >
> > I'll find out later why it is so after reading thru 15 patches
> > myself, so no need to give a quick answer to the above; it was just
> > my knee-jerk reaction.
>
> That's a reasonable reaction. The short version of my reasoning is that
> many many people _use_ Git but are not Git experts. While a Git expert
> could find the right set of commands to run and at what frequency to
> keep their repo clean, most users do not want to spend time learning
> these commands. It's also worth our time as contributors to select what
> a good set of non-intrusive maintenance tasks could be, and make them
> easily accessible to users.
>
> This series gets us half of the way there: a user interested in doing
> background maintenance could figure out how to launch "git run-job" on
> a schedule for their platform, or to launch "git job-runner" at start-
> up. That's a lot simpler than learning how the commit-graph,
> multi-pack-index, prune-packed, pack-objects, and fetch builtins work
> with the complicated sets of arguments.

If there are periodic tasks that should be done, even if only on large
repos, then let's have a git gc --periodic that does them.  I'm not sure
that fetch should be in that set, but nothing prevents users from doing
"git fetch origin && git gc --periodic".  Let's make it as simple and
straightforward as possible.

As for handling multiple repositories, the tool to do that could be as
simple as a shell script which reads from ~/.config/git/repo-maintenance
(or whatever) and runs the same command on all of the repos it finds
there, possibly with a subcommand to add and remove repos.
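brian's suggestion could indeed be as small as the sketch below; the list-file path and the choice of `git gc --auto` as the maintenance command are assumptions for illustration:

```shell
# Sketch of the suggested wrapper: read repository paths from a config
# file and run the same maintenance command in each one.
run_listed_maintenance() {
    list="${XDG_CONFIG_HOME:-$HOME/.config}/git/repo-maintenance"
    [ -r "$list" ] || return 0
    while IFS= read -r repo; do
        [ -d "$repo" ] || continue
        git -C "$repo" gc --auto --quiet
    done < "$list"
}
run_listed_maintenance
```

The "subcommand to add and remove repos" would then just append to or filter that file.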

> The second half would be to create a command such as
>
> 	git please-run-maintenance-on-this-repo
>
> that initializes the background jobs and enables them on the repo they
> are using. This allows the most casual of Git user to work efficiently
> on very large repositories.

I'm not opposed to seeing a tool that can schedule periodic maintenance
jobs, perhaps in contrib, depending on whether other people think it
should go.  However, I think running periodic jobs is best handled on
Unix with cron or anacron and not a custom tool or a command in Git.

I've dealt with systems that implemented periodic tasks without using
the existing tools for doing that, and I've found that usually that's a
mistake.  Despite seeming straightforward, there are a lot of tricky
edge cases to deal with and it's easy to get wrong.

We also don't have to reimplement all the features in the system
scheduler and can let expert users use a different tool of their choice
instead if cron (or the Windows equivalent) is not to their liking.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204


@gitgitgadget

gitgitgadget bot commented Apr 7, 2020

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 4/6/2020 8:50 PM, Danh Doan wrote:
> On 2020-04-03 20:16:21-0400, Derrick Stolee <stolee@gmail.com> wrote:
>> On 4/3/2020 5:40 PM, Junio C Hamano wrote:
>>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>>
>>>>  * git run-job <job-name>: This builtin will run a single instance of a
>>>>    maintenance job.
>>>>    
>>>>  * git job-runner [--repo=<path>]: This builtin will run an infinite loop
>>>>    that executes git run-job as a subcommand.
>>>
>>> What does this have to do with "git", though?  IOW, why does this
>>> have to be part of Git, so that those who would benefit from having
>>> a mechanism that makes it easy to run regular maintenance tasks but
>>> are not Git users (or those that want to do such maintenance tasks
>>> that are not necessarily tied to "git") must use "git" to do so?
> 
> I also agree with Junio;
> I don't think Git should be responsible for being a scheduler.
> That's the job of either the traditional crontab or at(1) on *nix, or
> the Task Scheduler on Windows.
> 
>> That's a reasonable reaction. The short version of my reasoning is that
>> many many people _use_ Git but are not Git experts. While a Git expert
>> could find the right set of commands to run and at what frequency to
>> keep their repo clean, most users do not want to spend time learning
>> these commands. It's also worth our time as contributors to select what
> 
> And now, people will need to learn _both_ Git's existing maintenance
> commands _and_ the new scheduler. (Do I understand that right? I
> haven't gone through all the patches.)

The point is that they would not need to learn the existing commands.
They could accept the community's "best practices" by running the
simple command to start background maintenance.

In an "enterprise" environment, the users would not even need to learn
the command in the first place, because the engineering tools team
could configure the maintenance tools using the necessary setup scripts
to build the repo.

> Yes, it could be set up once and forgotten, but if there's a problem
> with their local repo, they will scratch their heads trying to
> understand what went wrong.
> 
> It becomes easier to destroy their repo, and harder to know what
> operations are running on their computer.

That's why we need to be careful. Luckily, these steps have been
tested in the wild for over a year with great success (as part of
VFS for Git).

Thanks,
-Stolee

@gitgitgadget

gitgitgadget bot commented Apr 7, 2020

On the Git mailing list, Danh Doan wrote (reply to this):

On 2020-04-07 06:59:08-0400, Derrick Stolee <stolee@gmail.com> wrote:
> On 4/6/2020 8:50 PM, Danh Doan wrote:
> > On 2020-04-03 20:16:21-0400, Derrick Stolee <stolee@gmail.com> wrote:
> >> On 4/3/2020 5:40 PM, Junio C Hamano wrote:
> >>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> >>>
> >>>>  * git run-job <job-name>: This builtin will run a single instance of a
> >>>>    maintenance job.
> >>>>    
> >>>>  * git job-runner [--repo=<path>]: This builtin will run an infinite loop
> >>>>    that executes git run-job as a subcommand.
> >>>
> >>> What does this have to do with "git", though?  IOW, why does this
> >>> have to be part of Git, so that those who would benefit from having
> >>> a mechanism that makes it easy to run regular maintenance tasks but
> >>> are not Git users (or those that want to do such maintenance tasks
> >>> that are not necessarily tied to "git") must use "git" to do so?
> > 
> > I also agree with Junio;
> > I don't think Git should be responsible for being a scheduler.
> > That's the job of either the traditional crontab or at(1) on *nix, or
> > the Task Scheduler on Windows.
> > 
> >> That's a reasonable reaction. The short version of my reasoning is that
> >> many many people _use_ Git but are not Git experts. While a Git expert
> >> could find the right set of commands to run and at what frequency to
> >> keep their repo clean, most users do not want to spend time learning
> >> these commands. It's also worth our time as contributors to select what
> > 
> > And now, people will need to learn _both_ Git's existing maintenance
> > commands _and_ the new scheduler. (Do I understand that right? I
> > haven't gone through all the patches.)
> 
> The point is that they would not need to learn the existing commands.
> They could accept the community's "best practices" by running the
> simple command to start background maintenance.

We could provide some "best practices" in an FAQ.
People could refer to it for those "best practices" and run their
favourite scheduler.

> In an "enterprise" environment, the users would not even need to learn
> the command in the first place, because the engineering tools team
> could configure the maintenance tools using the necessary setup scripts
> to build the repo.

In that "enterprise" environment, if the engineering tools team could
configure the maintenance tools using the commands introduced in this
series, then that same team could configure the scheduler to run the
required Git commands, or create their own wrappers. In such an
"enterprise" environment, most computers have a managed software set;
the engineering tools team knows which software is installed on which
system and which scheduler to use, so it should be simple for them to
configure their systems.

> > Yes, it could be set up once and forgotten, but if there's a problem
> > with their local repo, they will scratch their heads trying to
> > understand what went wrong.
> > 
> > It becomes easier to destroy their repo, and harder to know what
> > operations are running on their computer.
> 
> That's why we need to be careful. Luckily, these steps have been
> tested in the wild for over a year with great success (as part of
> VFS for Git).

No offence, but I find this quote applicable:

There are two ways of constructing a software design: One way is to
make it so simple that there are obviously no deficiencies, and the
other way is to make it so complicated that there are no obvious
deficiencies. - Tony Hoare.

Adding this set of commands to Git is going to make Git
over-complicated, IMHO.

-- 
Danh

@gitgitgadget

gitgitgadget bot commented Apr 7, 2020

On the Git mailing list, Johannes Schindelin wrote (reply to this):

Hi Danh,

On Tue, 7 Apr 2020, Danh Doan wrote:

> Adding this set of commands to Git is going to make Git
> over-complicated, IMHO.

Let me counter that by The Tale of the Green Button: while it is not
completely historically accurate, it is a good illustration of focusing on
What Matters Most:

Once upon a time, Xerox built really nice copy machines that were very
good and did a lot of useful things such as making one or more copies of
one or more pages at a certain set of paper sizes with a certain set of
paper criteria (heavy paper, shiny paper, double-sided copies) etc.

And the tale goes: nobody used it. Then a usability expert came in,
interviewed the users and figured out that it was too complicated to use.
The result was the big green button with which you can make one copy using
the standard paper size and the default paper.

This set of commands that you are complaining about is intended to be that
Big Green Button.

Also, maybe we should not talk too loudly about "making" Git
over-complicated. If you care to have a look at
https://git-man-page-generator.lokaltog.net/ and find yourself being very
much reminded of the current Git User Interface's complexity, you might
agree that we should probably try to make _using_ Git less complicated.

Even if it means adding new commands. Such as `git restore` and `git
switch`. And, yes, like `git job-runner` (or whatever we end up calling
it, I do agree with Junio that `git maintenance` is a nice name for its
intended purpose).

Don't just believe me. I invite you to interview the software engineers
developing the Windows Operating System. Or, for that matter, _any_
software engineers working on projects substantially larger than git.git.

Ciao,
Dscho

@gitgitgadget

gitgitgadget bot commented Apr 7, 2020

On the Git mailing list, Junio C Hamano wrote (reply to this):

"brian m. carlson" <sandals@crustytoothpaste.net> writes:

> If there are periodic tasks that should be done, even if only on large
> repos, then let's have a git gc --periodic that does them.  I'm not sure
> that fetch should be in that set, but nothing prevents users from doing
> "git fetch origin && git gc --periodic".  Let's make it as simple and
> straightforward as possible.

Yeah, thanks for bringing up "git gc" here.  Earlier I mumbled
something about "git repack", which seemed to have needlessly
muddied the discussion, but "gc" is exactly the "we already have a
wrapper the users are familiar with---why not extend it" thing.

> I'm not opposed to seeing a tool that can schedule periodic maintenance
> jobs, perhaps in contrib, depending on whether other people think it
> should go.  However, I think running periodic jobs is best handled on
> Unix with cron or anacron and not a custom tool or a command in Git.
>
> I've dealt with systems that implemented periodic tasks without using
> the existing tools for doing that, and I've found that usually that's a
> mistake.  Despite seeming straightforward, there are a lot of tricky
> edge cases to deal with and it's easy to get wrong.
>
> We also don't have to reimplement all the features in the system
> scheduler and can let expert users use a different tool of their choice
> instead if cron (or the Windows equivalent) is not to their liking.

;-)  You said it much better than I would have myself.  Thanks.

@gitgitgadget

gitgitgadget bot commented Apr 7, 2020

On the Git mailing list, Johannes Schindelin wrote (reply to this):

Hi brian,

On Tue, 7 Apr 2020, brian m. carlson wrote:

> On 2020-04-04 at 00:16:21, Derrick Stolee wrote:
> > On 4/3/2020 5:40 PM, Junio C Hamano wrote:
> > > "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> > >
> > >>  * git run-job <job-name>: This builtin will run a single instance of a
> > >>    maintenance job.
> > >>
> > >>  * git job-runner [--repo=<path>]: This builtin will run an infinite loop
> > >>    that executes git run-job as a subcommand.
> > >
> > > What does this have to do with "git", though?  IOW, why does this
> > > have to be part of Git, so that those who would benefit from having
> > > a mechanism that makes it easy to run regular maintenance tasks but
> > > are not Git users (or those that want to do such maintenance tasks
> > > that are not necessarily tied to "git") must use "git" to do so?
> > >
> > > I'll find out later why it is so after reading thru 15 patches
> > > myself, so no need to give a quick answer to the above; it was just
> > > my knee-jerk reaction.
> >
> > That's a reasonable reaction. The short version of my reasoning is that
> > many many people _use_ Git but are not Git experts. While a Git expert
> > could find the right set of commands to run and at what frequency to
> > keep their repo clean, most users do not want to spend time learning
> > these commands. It's also worth our time as contributors to select what
> > a good set of non-intrusive maintenance tasks could be, and make them
> > easily accessible to users.
> >
> > This series gets us half of the way there: a user interested in doing
> > background maintenance could figure out how to launch "git run-job" on
> > a schedule for their platform, or to launch "git job-runner" at start-
> > up. That's a lot simpler than learning how the commit-graph,
> > multi-pack-index, prune-packed, pack-objects, and fetch builtins work
> > with the complicated sets of arguments.
>
> If there are periodic tasks that should be done, even if only on large
> repos, then let's have a git gc --periodic that does them.  I'm not sure
> that fetch should be in that set, but nothing prevents users from doing
> "git fetch origin && git gc --periodic".

Hmm. Who says that maintenance tasks are essentially only `gc`? With
_maaaaaybe_ a `fetch` thrown in?

And about "nothing prevents users from doing ...": while that is true,
_even less_ is preventing those same users from _forgetting_ to run them,
or _not even knowing_ about the need to run them.

I vividly remember how `git gc --auto` was introduced. A Linux developer
too many was complaining about their Git operations becoming slower and
slower over the course of some months.

That is a very illustrative example of what we're trying to accomplish
here: we are dealing with people whose job it is not to become Git
experts, but whose job is to develop software, and it just so happens that
they use Git as version control system. For the most part, they are not
even all that interested in the internals of Git. Strange, I know.

The solution then, and now, is to teach Git to do these things for the
users, without the need to know a specific command and when to run it.

I am not saying that the current state of the design is perfect. What I am
saying is that the goal is very clear, and important, and we must try to
get there: to teach Git to run maintenance for the user, without much in
the way of the user asking for it, but Git knowing when to do what, and
trying mostly to stay out of the users' way.

> Let's make it as simple and straightforward as possible.

I fear that we're running the danger of confusing _two_ meanings of "as
simple and straightforward as possible" here.

From my perspective, Stolee's patches aim to make using Git as simple and
straightforward as possible, and the historical record shows that Scalar
(after which they are modeled) does a pretty good job at that, or is at
least heading in the right direction.

I get the impression, however, that many reviewers here seem to favor
the goal of making the _patches_ as simple and straightforward as
possible, at the expense of the original goal. Like, totally sacrificing
the ease of use in return for "just use a shell script" advice.

> As for handling multiple repositories, the tool to do that could be as
> simple as a shell script which reads from ~/.config/git/repo-maintenance
> (or whatever) and runs the same command on all of the repos it finds
> there, possibly with a subcommand to add and remove repos.

Sure, that is flexible.

And it requires a ton of Git expertise to know what to put into those
scripts. And Git updates cannot deliver more value to those scripts.

Which taps right into the above-mentioned focus on _the patches'_
simplicity, as opposed to the original goal of making _using Git_ simpler
and less painful with big repositories.

> > The second half would be to create a command such as
> >
> > 	git please-run-maintenance-on-this-repo
> >
> > that initializes the background jobs and enables them on the repo they
> > are using. This allows the most casual of Git user to work efficiently
> > on very large repositories.
>
> I'm not opposed to seeing a tool that can schedule periodic maintenance
> jobs, perhaps in contrib, depending on whether other people think it
> should go.  However, I think running periodic jobs is best handled on
> Unix with cron or anacron and not a custom tool or a command in Git.

Okay, here is a challenge for you: design this such that the Windows
experience does _not_ feel like a 3rd-class citizen. Go ahead. Yes, there
is a scheduler. Yep, it does not do cron-like things. Precisely: you have
to feed it an XML to make use of the "advanced" features. Yeah, I also
cannot remember what the semantics are regarding missed jobs due to
shutdown cycles. Nope, you cannot rely on the XML being an option, that
would require Windows 10. The list goes on.

Do you see what I am getting at?

It's really asking the user to take on _all_ the burden of the complexity.
The complexity has to live _somewhere_, after all, and if you want to
prevent Git from taking care of it, there are only the users left to hold
the candle.

I'd rather Git take up that responsibility, and lift that burden from the
users' shoulders. It's not like we have a stellar record of usability, so
we might just as well welcome Stolee's effort to improve Git in that
respect.

> I've dealt with systems that implemented periodic tasks without using
> the existing tools for doing that, and I've found that usually that's a
> mistake.  Despite seeming straightforward, there are a lot of tricky
> edge cases to deal with and it's easy to get wrong.

That might be true for general-purpose periodic task managers.

But that's not what we need in Git, is it? The scope is a _lot_
smaller.

But maybe you found one of those issues in Stolee's patches? If so, please
do contribute your experience there to point out those issues, so that
they can be addressed.

> We also don't have to reimplement all the features in the system
> scheduler and can let expert users use a different tool of their choice
> instead if cron (or the Windows equivalent) is not to their liking.

Do we really want to start relying on `cron`, when the major platform used
by the target audience (enterprise software engineers who deal with rather
larger repositories than git.git or linux.git) quite obviously _lacks_
support for that?

That would be a decision that would not make sense to me. I mean, why go
to all the lengths of making it easier and less involved to manage large
repositories, and then just slam the door shut into the faces of all of
those developers working on Windows, an operating system with a rather
large market share?

Ciao,
Dscho

@gitgitgadget

gitgitgadget bot commented Apr 8, 2020

On the Git mailing list, "brian m. carlson" wrote (reply to this):



Hey,

On 2020-04-07 at 22:23:43, Johannes Schindelin wrote:
> > If there are periodic tasks that should be done, even if only on large
> > repos, then let's have a git gc --periodic that does them.  I'm not sure
> > that fetch should be in that set, but nothing prevents users from doing
> > "git fetch origin && git gc --periodic".
>
> Hmm. Who says that maintenance tasks are essentially only `gc`? With
> _maaaaaybe_ a `fetch` thrown in?

What I'm saying is that we have a tool to run maintenance tasks on the
repository.  If we need to perform additional maintenance tasks, let's
put them in the same place as the ones we have now.  I realize "gc" may
become a less accurate name, but oh, well.

> > Let's make it as simple and straightforward as possible.
>
> I get the impression, however, that many reviewers here seem to favor
> the goal of making the _patches_ as simple and straightforward as
> possible, at the expense of the original goal. Like, totally sacrificing
> the ease of use in return for "just use a shell script" advice.

I think we can have both.  They are not mutually exclusive, and I've
proposed a suggestion for both.

> > As for handling multiple repositories, the tool to do that could be as
> > simple as a shell script which reads from ~/.config/git/repo-maintenance
> > (or whatever) and runs the same command on all of the repos it finds
> > there, possibly with a subcommand to add and remove repos.
>
> Sure, that is flexible.
>
> And it requires a ton of Git expertise to know what to put into those
> scripts. And Git updates cannot deliver more value to those scripts.

Perhaps I was unclear what I thought could be the design of this.  My
proposal is something like the following:

  git schedule-gc add [--period=TIME] [--fetch=REMOTE | --fetch-all] REPO
  git schedule-gc remove REPO

The actual command invoked by the system scheduler would be something
like the following:

  git schedule-gc run

It would work as I proposed under the hood, but it would be relatively
straightforward to use.

> > I'm not opposed to seeing a tool that can schedule periodic maintenance
> > jobs, perhaps in contrib, depending on whether other people think it
> > should go.  However, I think running periodic jobs is best handled on
> > Unix with cron or anacron and not a custom tool or a command in Git.
>
> Okay, here is a challenge for you: design this such that the Windows
> experience does _not_ feel like a 3rd-class citizen. Go ahead. Yes, there
> is a scheduler. Yep, it does not do cron-like things. Precisely: you have
> to feed it an XML to make use of the "advanced" features. Yeah, I also
> cannot remember what the semantics are regarding missed jobs due to
> shutdown cycles. Nope, you cannot rely on the XML being an option, that
> would require Windows 10. The list goes on.

I will freely admit that I know next to nothing about Windows.  I have
used it only incidentally, if at all, for at least two decades.  It is
not a platform I generally have an interest in developing for, although
I try to make it work as well as possible when I am working on a project
which supports it.

It is, in general, my assumption, based on its wide usage, that it is a
powerful and robust operating system with many features, but I have
little actual knowledge about how it functions or the exact features it
provides.

I want a solution that builds on the existing Unix tools for Unix,
because that is least surprising to users and it is how Unix tools are
supposed to work.  I think we can agree that Git was designed with the
Unix philosophy in mind.

I also want a solution that works on Windows.  Ideally that solution
would build on existing components that are part of Windows, because it
reduces the maintenance burden on all of us.  But unfortunately, I know
next to nothing about how to build such a solution.

> > I've dealt with systems that implemented periodic tasks without using
> > the existing tools for doing that, and I've found that usually that's a
> > mistake.  Despite seeming straightforward, there are a lot of tricky
> > edge cases to deal with and it's easy to get wrong.
> 
> But maybe you found one of those issues in Stolee's patches? If so, please
> do contribute your experience there to point out those issues, so that
> they can be addressed.

One of the benefits of using anacron on Unix is that it can skip running
tasks when the user is on battery.  This is not anything we can portably
do across systems, nor is it something that Git should need to know
about.
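
For illustration only, a user-level anacron-style entry driving the proposed command might look like this (period in days, delay in minutes; the repository path and job tag are placeholders, not anything this series ships):

```
1	10	git-maintenance	cd /path/to/repo && git run-job commit-graph
```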

> > We also don't have to reimplement all the features in the system
> > scheduler and can let expert users use a different tool of their choice
> > instead if cron (or the Windows equivalent) is not to their liking.
> 
> Do we really want to start relying on `cron`, when the major platform used
> by the target audience (enterprise software engineers who deal with rather
> larger repositories than git.git or linux.git) quite obviously _lacks_
> support for that?

Unix users will be unhappy with us if we use our own scheduling system
when cron is available.  They will expect us to reimplement those
features and they will complain if we do not.  While I cannot name
names, there are a nontrivial number of large, enterprise monorepos that
run only on macOS and Linux.

That doesn't prevent us from building tooling that does the scheduling
on Windows if we can't use the system scheduler, but it would be nice to
try to present a relatively unified interface across the two platforms.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204


@sluongng

  1. One downside of the refs/hidden pattern is that 'git log' will
    decorate commits with twice as many refs if they appear at a
    remote ref (<remote>/<ref> and refs/hidden/<remote>/<ref>). Is
    there an easy way to exclude a refspace from decorations? Should
    we make refs/hidden/* a "special" refspace that is excluded from
    decorations?

we can alias git log to git log --decorate-refs-exclude="refs/hidden/*" but it seems like a patch up solution 😆
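
Spelled out, that workaround could be wired up as follows (the alias name `lg` is arbitrary, and note this writes to the user's global git config):

```shell
# Hypothetical alias capturing the suggested workaround; "lg" is an
# arbitrary name, and --global modifies the user's ~/.gitconfig.
git config --global alias.lg 'log --decorate-refs-exclude=refs/hidden/*'

# "git lg" then behaves like "git log" minus refs/hidden decorations.
```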

@dscho
Member

dscho commented Apr 11, 2020

  1. One downside of the refs/hidden pattern is that 'git log' will
    decorate commits with twice as many refs if they appear at a
    remote ref (<remote>/<ref> and refs/hidden/<remote>/<ref>). Is
    there an easy way to exclude a refspace from decorations? Should
    we make refs/hidden/* a "special" refspace that is excluded from
    decorations?

we can alias git log to git log --decorate-refs-exclude="refs/hidden/*" but it seems like a patch up solution 😆

@sluongng if this was intended as a serious suggestion, please follow the advice from this line on how to reply on the Git mailing list:

On the Git mailing list, Phillip Wood wrote (reply to this):
[...]

@sluongng

@@ -0,0 +1,53 @@
git-run-job(1)

On the Git mailing list, Josh Steadmon wrote (reply to this):

Thanks for this series, and sorry for the delayed (and partial) review.
I hope comments here are still useful. I don't have any comments on the
larger design that other people haven't already made, but I do have a
few small questions and suggestions, inline below.


On 2020.04.03 20:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The first job to implement in the 'git run-job' command is the
> 'commit-graph' job. It is based on the sequence of events in the
> 'commit-graph' job in Scalar [1]. This sequence is as follows:
> 
> 1. git commit-graph write --reachable --split
> 2. git commit-graph verify --shallow
> 3. If the verify succeeds, stop.
> 4. Delete the commit-graph-chain file.
> 5. git commit-graph write --reachable --split
> 
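
(That sequence, written out as the plain git commands the job wraps; the chain path assumes a standard `.git` layout:

```shell
# Write incrementally, verify the top layer, and on failure drop the
# chain file and rewrite the whole graph from the object database.
git commit-graph write --reachable --split
if ! git commit-graph verify --shallow
then
	rm -f .git/objects/info/commit-graphs/commit-graph-chain
	git commit-graph write --reachable --split
fi
```
)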
> By writing an incremental commit-graph file using the "--split"
> option we minimize the disruption from this operation. The default
> behavior is to merge layers until the new "top" layer is less than
> half the size of the layer below. This provides quick writes "most"
> of the time, with the longer writes following a power law
> distribution.
> 
> Most importantly, concurrent Git processes only look at the
> commit-graph-chain file for a very short amount of time, so they
> will very likely not be holding a handle to the file when we try
> to replace it. (This only matters on Windows.)
> 
> If a concurrent process reads the old commit-graph-chain file, but
> our job expires some of the .graph files before they can be read,
> then those processes will see a warning message (but not fail).
> This could be avoided by a future update to use the --expire-time
> argument when writing the commit-graph.
> 
> By using 'git commit-graph verify --shallow' we can ensure that
> the file we just wrote is valid. This is an extra safety precaution
> that is faster than our 'write' subcommand. In the rare situation
> that the newest layer of the commit-graph is corrupt, we can "fix"
> the corruption by deleting the commit-graph-chain file and rewriting
> the full commit-graph as a new one-layer commit graph. This does
> not completely prevent _that_ file from being corrupt, but it does
> recompute the commit-graph by parsing commits from the object
> database. In our use of this step in Scalar and VFS for Git, we
> have only seen this issue arise because our microsoft/git fork
> reverted 43d3561 ("commit-graph write: don't die if the existing
> graph is corrupt" 2019-03-25) for a while to keep commit-graph
> writes very fast. We dropped the revert when updating to v2.23.0.
> The verify still has potential for catching corrupt data across
> the layer boundary: if the new file has commit X with parent Y
> in an old file but the commit ID for Y in the old file had a
> bitswap, then we will notice that in the 'verify' command.
> 
> [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/CommitGraphStep.cs
> 
> RFC QUESTIONS:
> 
> 1. Are links like [1] helpful?

Yes, I appreciate being able to reference these. However, given that
these will show up in commit logs for ~forever, and ongoing work might
happen on Scalar, perhaps the links should be pinned to a specific
revision?


> 
> 2. Can anyone think of a way to test the rewrite fallback?
>    It requires corrupting the latest file between two steps of
>    this one command, so that is a tricky spot to inject a fault.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-run-job.txt | 21 ++++++++++--
>  builtin/run-job.c             | 60 ++++++++++++++++++++++++++++++++++-
>  commit-graph.c                |  2 +-
>  commit-graph.h                |  1 +
>  t/t7900-run-job.sh            | 32 +++++++++++++++++++
>  5 files changed, 112 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
> index 0627b3ed259..8bf2762d650 100644
> --- a/Documentation/git-run-job.txt
> +++ b/Documentation/git-run-job.txt
> @@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
>  SYNOPSIS
>  --------
>  [verse]
> -'git run-job <job-name>'
> +'git run-job commit-graph'
>  
>  
>  DESCRIPTION
> @@ -28,7 +28,24 @@ BEHAVIOR MAY BE ALTERED IN THE FUTURE.
>  JOBS
>  ----
>  
> -TBD
> +'commit-graph'::
> +
> +The `commit-graph` job updates the `commit-graph` files incrementally,
> +then verifies that the written data is correct. If the new layer has an
> +issue, then the chain file is removed and the `commit-graph` is
> +rewritten from scratch.
> ++
> +The verification only checks the top layer of the `commit-graph` chain.
> +If the incremental write merged the new commits with at least one
> +existing layer, then there is potential for on-disk corruption being
> +carried forward into the new file. This will be noticed and the new
> +commit-graph file will be clean as Git reparses the commit data from
> +the object database.
> ++
> +The incremental write is safe to run alongside concurrent Git processes
> +since it will not expire `.graph` files that were in the previous
> +`commit-graph-chain` file. They will be deleted by a later run based on
> +the expiration delay.
>  
>  
>  GIT
> diff --git a/builtin/run-job.c b/builtin/run-job.c
> index 2c78d053aa4..dd7709952d3 100644
> --- a/builtin/run-job.c
> +++ b/builtin/run-job.c
> @@ -1,12 +1,65 @@
>  #include "builtin.h"
>  #include "config.h"
> +#include "commit-graph.h"
> +#include "object-store.h"
>  #include "parse-options.h"
> +#include "repository.h"
> +#include "run-command.h"
>  
>  static char const * const builtin_run_job_usage[] = {
> -	N_("git run-job"),
> +	N_("git run-job commit-graph"),
>  	NULL
>  };
>  
> +static int run_write_commit_graph(void)
> +{
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +
> +	argv_array_pushl(&cmd, "commit-graph", "write",
> +			 "--split", "--reachable", "--no-progress",
> +			 NULL);
> +
> +	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int run_verify_commit_graph(void)
> +{
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +
> +	argv_array_pushl(&cmd, "commit-graph", "verify",
> +			 "--shallow", "--no-progress", NULL);
> +
> +	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int run_commit_graph_job(void)
> +{
> +	char *chain_path;
> +
> +	if (run_write_commit_graph()) {
> +		error(_("failed to write commit-graph"));
> +		return 1;
> +	}
> +
> +	if (!run_verify_commit_graph())
> +		return 0;
> +
> +	warning(_("commit-graph verify caught error, rewriting"));
> +
> +	chain_path = get_chain_filename(the_repository->objects->odb);

Should we avoid new uses of `the_repository` and take a repo pointer
argument here instead?


> +	if (unlink(chain_path)) {
> +		UNLEAK(chain_path);
> +		die(_("failed to remove commit-graph at %s"), chain_path);
> +	}
> +	free(chain_path);
> +
> +	if (!run_write_commit_graph())
> +		return 0;

Is there a reason why we don't verify the second write attempt?


> +
> +	error(_("failed to rewrite commit-graph"));
> +	return 1;
> +}
> +
>  int cmd_run_job(int argc, const char **argv, const char *prefix)
>  {
>  	static struct option builtin_run_job_options[] = {
> @@ -23,6 +76,11 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
>  			     builtin_run_job_usage,
>  			     PARSE_OPT_KEEP_UNKNOWN);
>  
> +	if (argc > 0) {
> +		if (!strcmp(argv[0], "commit-graph"))
> +			return run_commit_graph_job();
> +	}
> +
>  	usage_with_options(builtin_run_job_usage,
>  			   builtin_run_job_options);
>  }
> diff --git a/commit-graph.c b/commit-graph.c
> index f013a84e294..6867f92d04a 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -56,7 +56,7 @@ static char *get_split_graph_filename(struct object_directory *odb,
>  		       oid_hex);
>  }
>  
> -static char *get_chain_filename(struct object_directory *odb)
> +char *get_chain_filename(struct object_directory *odb)
>  {
>  	return xstrfmt("%s/info/commit-graphs/commit-graph-chain", odb->path);
>  }
> diff --git a/commit-graph.h b/commit-graph.h
> index e87a6f63600..8b7bd5a6cb1 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -13,6 +13,7 @@
>  struct commit;
>  
>  char *get_commit_graph_filename(struct object_directory *odb);
> +char *get_chain_filename(struct object_directory *odb);
>  int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
>  
>  /*
> diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
> index 1eac80b7ed3..18b9bd26b3a 100755
> --- a/t/t7900-run-job.sh
> +++ b/t/t7900-run-job.sh
> @@ -5,6 +5,8 @@ test_description='git run-job
>  Testing the background jobs, in the foreground
>  '
>  
> +GIT_TEST_COMMIT_GRAPH=0
> +
>  . ./test-lib.sh
>  
>  test_expect_success 'help text' '
> @@ -12,4 +14,34 @@ test_expect_success 'help text' '
>  	test_i18ngrep "usage: git run-job" err
>  '
>  
> +test_expect_success 'commit-graph job' '
> +	git init server &&
> +	(
> +		cd server &&
> +		chain=.git/objects/info/commit-graphs/commit-graph-chain &&
> +		git checkout -b master &&
> +		for i in $(test_seq 1 15)
> +		do
> +			test_commit $i || return 1
> +		done &&
> +		git run-job commit-graph 2>../err &&
> +		test_must_be_empty ../err &&
> +		test_line_count = 1 $chain &&
> +		for i in $(test_seq 16 19)
> +		do
> +			test_commit $i || return 1
> +		done &&
> +		git run-job commit-graph 2>../err &&
> +		test_must_be_empty ../err &&
> +		test_line_count = 2 $chain &&
> +		for i in $(test_seq 20 23)
> +		do
> +			test_commit $i || return 1
> +		done &&
> +		git run-job commit-graph 2>../err &&
> +		test_must_be_empty ../err &&
> +		test_line_count = 1 $chain
> +	)
> +'
> +
>  test_done
> -- 
> gitgitgadget
> 

@@ -0,0 +1,64 @@
git-run-job(1)

On the Git mailing list, Josh Steadmon wrote (reply to this):

On 2020.04.03 20:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> When working with very large repositories, an incremental 'git fetch'
> command can download a large amount of data. If there are many other
> users pushing to a common repo, then this data can rival the initial
> pack-file size of a 'git clone' of a medium-size repo.
> 
> Users may want to keep the data on their local repos as close as
> possible to the data on the remote repos by fetching periodically in
> the background. This can break up a large daily fetch into several
> smaller hourly fetches.
> 
> However, if we simply ran 'git fetch <remote>' in the background,
> then the user running a foreground 'git fetch <remote>' would lose
> some important feedback when a new branch appears or an existing
> branch updates. This is especially true if a remote branch is
> force-updated and this isn't noticed by the user because it occurred
> in the background. Further, the functionality of 'git push
> --force-with-lease' becomes suspect.

It might be useful here to clarify the interaction between background
fetching & the expectations around --force-with-lease.


> When running 'git fetch <remote> <options>' in the background, use
> the following options for careful updating:
> 
> 1. --no-tags prevents getting a new tag when a user wants to see
>    the new tags appear in their foreground fetches.
> 
> 2. --refmap= removes the configured refspec which usually updates
>    refs/remotes/<remote>/* with the refs advertised by the remote.
> 
> 3. By adding a new refspec "+refs/heads/*:refs/hidden/<remote>/*"
>    we can ensure that we actually load the new values somewhere in
>    our refspace while not updating refs/heads or refs/remotes. By
>    storing these refs here, the commit-graph job will update the
>    commit-graph with the commits from these hidden refs.
> 
> 4. --prune will delete the refs/hidden/<remote> refs that no
>    longer appear on the remote.
> 
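
(Put together, the per-remote invocation those options produce looks like this, with "origin" standing in for each configured remote:

```shell
# The fetch the job issues per remote: no tags, configured refspec
# disabled via an empty --refmap, and the heads mirrored into the
# hidden namespace with pruning.
git fetch origin --quiet --prune --no-tags \
	--refmap= "+refs/heads/*:refs/hidden/origin/*"
```
)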
> We've been using this step as a critical background job in Scalar
> [1] (and VFS for Git). This solved a pain point that was showing up
> in user reports: fetching was a pain! Users do not like waiting to
> download the data that was created while they were away from their
> machines. After implementing background fetch, the foreground fetch
> commands sped up significantly because they mostly just update refs
> and download a small amount of new data. The effect is especially
> dramatic when paired with --no-show-forced-updates (through
> fetch.showForcedUpdates=false).
> 
> [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/FetchStep.cs
> 
> RFC QUESTIONS:
> 
> 1. One downside of the refs/hidden pattern is that 'git log' will
>    decorate commits with twice as many refs if they appear at a
>    remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
>    there an easy way to exclude a refspace from decorations? Should
>    we make refs/hidden/* a "special" refspace that is excluded from
>    decorations?
> 
> 2. This feature is designed for a desktop machine or equivalent
>    that has a permanent wired network connection, and the machine
>    stays on while the user is not present. For things like laptops
>    with inconsistent WiFi connections (that may be metered) the
>    feature can use the less stable connection more than the user
>    wants. Of course, this feature is opt-in for Git, but in Scalar
>    we have a "scalar pause" command [2] that pauses all maintenance
>    for some amount of time. We should consider a similar mechanism
>    for Git, but for the point of this series the user needs to set
>    up the "background" part of these jobs manually.
> 
> [2] https://github.com/microsoft/scalar/blob/master/Scalar/CommandLine/PauseVerb.cs
> [3] https://github.com/microsoft/scalar/blob/master/docs/advanced.md#controlling-background-maintenance
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-run-job.txt | 13 ++++-
>  builtin/run-job.c             | 89 ++++++++++++++++++++++++++++++++++-
>  t/t7900-run-job.sh            | 22 +++++++++
>  3 files changed, 122 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
> index 8bf2762d650..eb92e891915 100644
> --- a/Documentation/git-run-job.txt
> +++ b/Documentation/git-run-job.txt
> @@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
>  SYNOPSIS
>  --------
>  [verse]
> -'git run-job commit-graph'
> +'git run-job (commit-graph|fetch)'
>  
>  
>  DESCRIPTION
> @@ -47,6 +47,17 @@ since it will not expire `.graph` files that were in the previous
>  `commit-graph-chain` file. They will be deleted by a later run based on
>  the expiration delay.
>  
> +'fetch'::
> +
> +The `fetch` job updates the object directory with the latest objects
> +from all registered remotes. For each remote, a `git fetch` command is
> +run. The refmap is custom to avoid updating local or remote branches
> +(those in `refs/heads` or `refs/remotes`). Instead, the remote refs are
> +stored in `refs/hidden/<remote>/`. Also, no tags are updated.
> ++
> +This means that foreground fetches are still required to update the
> +remote refs, but the user is notified when the branches and tags are
> +updated on the remote.
>  
>  GIT
>  ---
> diff --git a/builtin/run-job.c b/builtin/run-job.c
> index dd7709952d3..e59056b2918 100644
> --- a/builtin/run-job.c
> +++ b/builtin/run-job.c
> @@ -7,7 +7,7 @@
>  #include "run-command.h"
>  
>  static char const * const builtin_run_job_usage[] = {
> -	N_("git run-job commit-graph"),
> +	N_("git run-job (commit-graph|fetch)"),
>  	NULL
>  };
>  
> @@ -60,6 +60,91 @@ static int run_commit_graph_job(void)
>  	return 1;
>  }
>  
> +static int fetch_remote(const char *remote)
> +{
> +	int result;
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	struct strbuf refmap = STRBUF_INIT;
> +
> +	argv_array_pushl(&cmd, "fetch", remote, "--quiet", "--prune",
> +			 "--no-tags", "--refmap=", NULL);
> +
> +	strbuf_addf(&refmap, "+refs/heads/*:refs/hidden/%s/*", remote);
> +	argv_array_push(&cmd, refmap.buf);
> +
> +	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +
> +	strbuf_release(&refmap);
> +	return result;
> +}
> +
> +static int fill_remotes(struct string_list *remotes)
> +{
> +	int result = 0;
> +	FILE *proc_out;
> +	struct strbuf line = STRBUF_INIT;
> +	struct child_process *remote_proc = xmalloc(sizeof(*remote_proc));

Shouldn't remote_proc just go on the stack here rather than getting
malloced?

> +
> +	child_process_init(remote_proc);
> +
> +	argv_array_pushl(&remote_proc->args, "git", "remote", NULL);
> +
> +	remote_proc->out = -1;
> +
> +	if (start_command(remote_proc)) {
> +		error(_("failed to start 'git remote' process"));
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	proc_out = xfdopen(remote_proc->out, "r");
> +
> +	/* if there is no line, leave the value as given */
> +	while (!strbuf_getline(&line, proc_out))
> +		string_list_append(remotes, line.buf);
> +
> +	strbuf_release(&line);
> +
> +	fclose(proc_out);
> +
> +	if (finish_command(remote_proc)) {
> +		error(_("failed to finish 'git remote' process"));
> +		result = 1;
> +	}
> +
> +cleanup:
> +	free(remote_proc);
> +	return result;
> +}
> +
> +static int run_fetch_job(void)
> +{
> +	int result = 0;
> +	struct string_list_item *item;
> +	struct string_list remotes = STRING_LIST_INIT_DUP;
> +
> +	if (fill_remotes(&remotes)) {
> +		error(_("failed to fill remotes"));
> +		result = 1;
> +		goto cleanup;
> +	}
> +
> +	/*
> +	 * Do not modify the result based on the success of the 'fetch'
> +	 * operation, as a loss of network could cause 'fetch' to fail
> +	 * quickly. We do not want that to stop the rest of our
> +	 * background operations.
> +	 */
> +	for (item = remotes.items;
> +	     item && item < remotes.items + remotes.nr;
> +	     item++)
> +		fetch_remote(item->string);
> +
> +cleanup:
> +	string_list_clear(&remotes, 0);
> +	return result;
> +}
> +
>  int cmd_run_job(int argc, const char **argv, const char *prefix)
>  {
>  	static struct option builtin_run_job_options[] = {
> @@ -79,6 +164,8 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
>  	if (argc > 0) {
>  		if (!strcmp(argv[0], "commit-graph"))
>  			return run_commit_graph_job();
> +		if (!strcmp(argv[0], "fetch"))
> +			return run_fetch_job();
>  	}
>  
>  	usage_with_options(builtin_run_job_usage,
> diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
> index 18b9bd26b3a..d3faeba135b 100755
> --- a/t/t7900-run-job.sh
> +++ b/t/t7900-run-job.sh
> @@ -44,4 +44,26 @@ test_expect_success 'commit-graph job' '
>  	)
>  '
>  
> +test_expect_success 'fetch job' '
> +	git clone "file://$(pwd)/server" client &&
> +
> +	# Before fetching, build a client commit-graph
> +	git -C client run-job commit-graph &&
> +	chain=client/.git/objects/info/commit-graphs/commit-graph-chain &&
> +	test_line_count = 1 $chain &&
> +
> +	git -C client branch -v --remotes >before-refs &&
> +	test_commit -C server 24 &&
> +
> +	git -C client run-job fetch &&
> +	git -C client branch -v --remotes >after-refs &&
> +	test_cmp before-refs after-refs &&
> +	test_cmp server/.git/refs/heads/master \
> +		 client/.git/refs/hidden/origin/master &&
> +
> +	# the hidden ref should trigger a new layer in the commit-graph
> +	git -C client run-job commit-graph &&
> +	test_line_count = 2 $chain
> +'
> +
>  test_done
> -- 
> gitgitgadget
> 

@@ -0,0 +1,92 @@
git-run-job(1)

On the Git mailing list, Josh Steadmon wrote (reply to this):

On 2020.04.03 20:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The previous change cleaned up loose objects using the
> 'loose-objects' that can be run safely in the background. Add a
> similar job that performs similar cleanups for pack-files.
> 
> One issue with running 'git repack' is that it is designed to
> repack all pack-files into a single pack-file. While this is the
> most space-efficient way to store object data, it is not time or
> memory efficient. This becomes extremely important if the repo is
> so large that a user struggles to store two copies of the pack on
> their disk.
> 
> Instead, perform an "incremental" repack by collecting a few small
> pack-files into a new pack-file. The multi-pack-index facilitates
> this process ever since 'git multi-pack-index expire' was added in
> 19575c7 (multi-pack-index: implement 'expire' subcommand,
> 2019-06-10) and 'git multi-pack-index repack' was added in ce1e4a1
> (midx: implement midx_repack(), 2019-06-10).
> 
> The 'pack-files' job runs the following steps:
> 
> 1. 'git multi-pack-index write' creates a multi-pack-index file if
>    one did not exist, and otherwise will update the multi-pack-index
>    with any new pack-files that appeared since the last write. This
>    is particularly relevant with the background fetch job.
> 
>    When the multi-pack-index sees two copies of the same object, it
>    stores the offset data into the newer pack-file. This means that
>    some old pack-files could become "unreferenced" which I will use
>    to mean "a pack-file that is in the pack-file list of the
>    multi-pack-index but none of the objects in the multi-pack-index
>    reference a location inside that pack-file."
> 
> 2. 'git multi-pack-index expire' deletes any unreferenced pack-files
>    and updaes the multi-pack-index to drop those pack-files from the

Typo: updaes -> updates


>    list. This is safe to do as concurrent Git processes will see the
>    multi-pack-index and not open those packs when looking for object
>    contents. (Similar to the 'loose-objects' job, there are some Git

Is it still safe for concurrent processes if the repo did not have a
multi-pack-index when the first process started?


>    commands that open pack-files regardless of the multi-pack-index,
>    but they are rarely used. Further, a user that self-selects to
>    use background operations would likely refrain from using those
>    commands.)
> 
> 3. 'git multi-pack-index repack --bacth-size=<size>' collects a set

Typo: bacth-size -> batch-size


>    of pack-files that are listed in the multi-pack-index and creates
>    a new pack-file containing the objects whose offsets are listed
>    by the multi-pack-index to be in those pack-files. The set of pack-
>    files is selected greedily by sorting the pack-files by modified
>    time and adding a pack-file to the set if its "expected size" is
>    smaller than the batch size until the total expected size of the
>    selected pack-files is at least the batch size. The "expected
>    size" is calculated by taking the size of the pack-file divided
>    by the number of objects in the pack-file and multiplied by the
>    number of objects from the multi-pack-index with offset in that
>    pack-file. The expected size approximats how much data from that

Typo: approximats -> approximates


>    pack-file will contribute to the resulting pack-file size. The
>    intention is that the resulting pack-file will be close in size
>    to the provided batch size.
> 
>    The next run of the pack-files job will delete these repacked
>    pack-files during the 'expire' step.
> 
>    In this version, the batch size is set to "0" which ignores the
>    size restrictions when selecting the pack-files. It instead
>    selects all pack-files and repacks all packed objects into a
>    single pack-file. This will be updated in the next change, but
>    it requires doing some calculations that are better isolated to
>    a separate change.
> 
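
(Written out, the three steps above amount to the following commands, with the batch size of zero used in this version:

```shell
# Step 1: create or update the multi-pack-index.
git multi-pack-index write --no-progress
# Step 2: drop pack-files no longer referenced by the multi-pack-index.
git multi-pack-index expire --no-progress
# Step 3: batch-size zero repacks all indexed objects into one pack.
git multi-pack-index repack --no-progress --batch-size=0
```
)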
> Each of the above steps updates the multi-pack-index file. After
> each step, we verify the new multi-pack-index. If the new
> multi-pack-index is corrupt, then delete the multi-pack-index,
> rewrite it from scratch, and stop doing the later steps of the
> job. This is intended to be an extra-safe check without leaving
> a repo with many pack-files without a multi-pack-index.
> 
> These steps are based on a similar background maintenance step in
> Scalar (and VFS for Git) [1]. This was incredibly effective for
> users of the Windows OS repository. After using the same VFS for Git
> repository for over a year, some users had _thousands_ of pack-files
> that combined to up to 250 GB of data. We noticed a few users were
> running into the open file descriptor limits (due in part to a bug
> in the multi-pack-index fixed by af96fe3392 (midx: add packs to
> packed_git linked list, 2019-04-29)).
> 
> These pack-files were mostly small since they contained the commits
> and trees that were pushed to the origin in a given hour. The GVFS
> protocol includes a "prefetch" step that asks for pre-computed pack-
> files containing commits and trees by timestamp. These pack-files
> were grouped into "daily" pack-files once a day for up to 30 days.
> If a user did not request prefetch packs for over 30 days, then they
> would get the entire history of commits and trees in a new, large
> pack-file. This led to a large number of pack-files that had poor
> delta compression.
> 
> By running this pack-file maintenance step once per day, these repos
> with thousands of packs spanning 200+ GB dropped to dozens of pack-
> files spanning 30-50 GB. This was done all without removing objects
> from the system and using a constant batch size of two gigabytes.
> Once the work was done to reduce the pack-files to small sizes, the
> batch size of two gigabytes means that not every run triggers a
> repack operation, so the following run will not expire a pack-file.
> This has kept these repos in a "clean" state.
> 
> [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-run-job.txt | 18 ++++++-
>  builtin/run-job.c             | 90 ++++++++++++++++++++++++++++++++++-
>  midx.c                        |  2 +-
>  midx.h                        |  1 +
>  t/t7900-run-job.sh            | 39 +++++++++++++++
>  5 files changed, 147 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
> index 43ca1160b5a..108ed25b8bd 100644
> --- a/Documentation/git-run-job.txt
> +++ b/Documentation/git-run-job.txt
> @@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
>  SYNOPSIS
>  --------
>  [verse]
> -'git run-job (commit-graph|fetch|loose-objects)'
> +'git run-job (commit-graph|fetch|loose-objects|pack-files)'
>  
>  
>  DESCRIPTION
> @@ -71,6 +71,22 @@ a batch of loose objects. The batch size is limited to 50 thousand
>  objects to prevent the job from taking too long on a repository with
>  many loose objects.
>  
> +'pack-files'::
> +
> +The `pack-files` job incrementally repacks the object directory using
> +the `multi-pack-index` feature. In order to prevent race conditions with
> +concurrent Git commands, it follows a two-step process. First, it
> +deletes any pack-files included in the `multi-pack-index` where none of
> +the objects in the `multi-pack-index` reference those pack-files; this
> +only happens if all objects in the pack-file are also stored in a newer
> +pack-file. Second, it selects a group of pack-files whose "expected
> +size" is below the batch size until the group has total expected size at
> +least the batch size; see the `--batch-size` option for the `repack`
> +subcommand in linkgit:git-multi-pack-index[1]. The default batch-size is
> +zero, which is a special case that attempts to repack all pack-files
> +into a single pack-file.
> +
> +
>  GIT
>  ---
>  Part of the linkgit:git[1] suite
> diff --git a/builtin/run-job.c b/builtin/run-job.c
> index cecf9058c51..d3543f7ccb9 100644
> --- a/builtin/run-job.c
> +++ b/builtin/run-job.c
> @@ -1,13 +1,14 @@
>  #include "builtin.h"
>  #include "config.h"
>  #include "commit-graph.h"
> +#include "midx.h"
>  #include "object-store.h"
>  #include "parse-options.h"
>  #include "repository.h"
>  #include "run-command.h"
>  
>  static char const * const builtin_run_job_usage[] = {
> -	N_("git run-job (commit-graph|fetch|loose-objects)"),
> +	N_("git run-job (commit-graph|fetch|loose-objects|pack-files)"),
>  	NULL
>  };
>  
> @@ -238,6 +239,91 @@ static int run_loose_objects_job(void)
>  	return prune_packed() || pack_loose();
>  }
>  
> +static int multi_pack_index_write(void)
> +{
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	argv_array_pushl(&cmd, "multi-pack-index", "write",
> +			 "--no-progress", NULL);
> +	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int rewrite_multi_pack_index(void)
> +{
> +	char *midx_name = get_midx_filename(the_repository->objects->odb->path);
> +
> +	unlink(midx_name);
> +	free(midx_name);
> +
> +	if (multi_pack_index_write()) {
> +		error(_("failed to rewrite multi-pack-index"));
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +static int multi_pack_index_verify(void)
> +{
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	argv_array_pushl(&cmd, "multi-pack-index", "verify",
> +			 "--no-progress", NULL);
> +	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int multi_pack_index_expire(void)
> +{
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	argv_array_pushl(&cmd, "multi-pack-index", "expire",
> +			 "--no-progress", NULL);
> +	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int multi_pack_index_repack(void)
> +{
> +	int result;
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	argv_array_pushl(&cmd, "multi-pack-index", "repack",
> +			 "--no-progress", "--batch-size=0", NULL);
> +	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +
> +	if (result && multi_pack_index_verify()) {
> +		warning(_("multi-pack-index verify failed after repack"));
> +		result = rewrite_multi_pack_index();
> +	}
> +
> +	return result;
> +}
> +
> +static int run_pack_files_job(void)
> +{
> +	if (multi_pack_index_write()) {
> +		error(_("failed to write multi-pack-index"));
> +		return 1;
> +	}
> +
> +	if (multi_pack_index_verify()) {
> +		warning(_("multi-pack-index verify failed after initial write"));
> +		return rewrite_multi_pack_index();
> +	}
> +
> +	if (multi_pack_index_expire()) {
> +		error(_("multi-pack-index expire failed"));
> +		return 1;
> +	}
> +
> +	if (multi_pack_index_verify()) {
> +		warning(_("multi-pack-index verify failed after expire"));
> +		return rewrite_multi_pack_index();
> +	}
> +
> +	if (multi_pack_index_repack()) {
> +		error(_("multi-pack-index repack failed"));
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
>  int cmd_run_job(int argc, const char **argv, const char *prefix)
>  {
>  	static struct option builtin_run_job_options[] = {
> @@ -261,6 +347,8 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
>  			return run_fetch_job();
>  		if (!strcmp(argv[0], "loose-objects"))
>  			return run_loose_objects_job();
> +		if (!strcmp(argv[0], "pack-files"))
> +			return run_pack_files_job();
>  	}
>  
>  	usage_with_options(builtin_run_job_usage,
> diff --git a/midx.c b/midx.c
> index 1527e464a7b..0f0d0a38812 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -36,7 +36,7 @@
>  
>  #define PACK_EXPIRED UINT_MAX
>  
> -static char *get_midx_filename(const char *object_dir)
> +char *get_midx_filename(const char *object_dir)
>  {
>  	return xstrfmt("%s/pack/multi-pack-index", object_dir);
>  }
> diff --git a/midx.h b/midx.h
> index e6fa356b5ca..cf2c09dffc2 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -39,6 +39,7 @@ struct multi_pack_index {
>  
>  #define MIDX_PROGRESS     (1 << 0)
>  
> +char *get_midx_filename(const char *object_dir);
>  struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local);
>  int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id);
>  int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
> diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
> index 41da083257b..416ba04989d 100755
> --- a/t/t7900-run-job.sh
> +++ b/t/t7900-run-job.sh
> @@ -6,6 +6,7 @@ Testing the background jobs, in the foreground
>  '
>  
>  GIT_TEST_COMMIT_GRAPH=0
> +GIT_TEST_MULTI_PACK_INDEX=0
>  
>  . ./test-lib.sh
>  
> @@ -93,4 +94,42 @@ test_expect_success 'loose-objects job' '
>  	test_cmp packs-between packs-after
>  '
>  
> +test_expect_success 'pack-files job' '
> +	packDir=.git/objects/pack &&
> +
> +	# Create three disjoint pack-files with size BIG, small, small.
> +
> +	echo HEAD~2 | git -C client pack-objects --revs $packDir/test-1 &&
> +
> +	test_tick &&
> +	git -C client pack-objects --revs $packDir/test-2 <<-\EOF &&
> +	HEAD~1
> +	^HEAD~2
> +	EOF
> +
> +	test_tick &&
> +	git -C client pack-objects --revs $packDir/test-3 <<-\EOF &&
> +	HEAD
> +	^HEAD~1
> +	EOF
> +
> +	rm -f client/$packDir/pack-* &&
> +	rm -f client/$packDir/loose-* &&
> +
> +	ls client/$packDir/*.pack >packs-before &&
> +	test_line_count = 3 packs-before &&
> +
> +	# the job repacks all three into a new pack, but does not
> +	# delete the old ones.
> +	git -C client run-job pack-files &&
> +	ls client/$packDir/*.pack >packs-between &&
> +	test_line_count = 4 packs-between &&
> +
> +	# the job deletes the three old packs, and does not write
> +	# a new one because only one pack remains.
> +	git -C client run-job pack-files &&
> +	ls client/$packDir/*.pack >packs-after &&
> +	test_line_count = 1 packs-after
> +'
> +
>  test_done
> -- 
> gitgitgadget
> 
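The write → verify → expire → repack sequence that `run_pack_files_job()` implements above can be exercised step by step with plain `git multi-pack-index` subcommands. A minimal sketch against a throwaway repository (the `unlink`-and-rewrite recovery path from `rewrite_multi_pack_index()` is omitted for brevity):

```shell
#!/bin/sh
set -e

# Throwaway repository with at least one pack-file to operate on.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" -c user.name=t -c user.email=t@example.com \
	commit -q --allow-empty -m init
git -C "$repo" repack -adq

# Step 1: write (or update) the multi-pack-index over existing packs.
git -C "$repo" multi-pack-index write

# Sanity check, mirroring the verify calls the job runs between steps.
git -C "$repo" multi-pack-index verify

# Step 2: expire pack-files whose objects all live in newer packs.
git -C "$repo" multi-pack-index expire

# Step 3: repack; --batch-size=0 consolidates everything into one pack.
git -C "$repo" multi-pack-index repack --batch-size=0

ls "$repo/.git/objects/pack/multi-pack-index"
```

With only one pack-file present, the expire and repack steps are no-ops, matching the "only one pack remains" case in the test above.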

@gitgitgadget

gitgitgadget bot commented May 27, 2020

On the Git mailing list, Josh Steadmon wrote (reply to this):

I'm late to the discussion, but I'd like to chime in here too.


On 2020.04.08 00:01, brian m. carlson wrote:
> Hey,
> 
> On 2020-04-07 at 22:23:43, Johannes Schindelin wrote:
> > > If there are periodic tasks that should be done, even if only on large
> > > repos, then let's have a git gc --periodic that does them.  I'm not sure
> > > that fetch should be in that set, but nothing prevents users from doing
> > > "git fetch origin && git gc --periodic".
> > 
> > Hmm. Who says that maintenance tasks are essentially only `gc`? With
> > _maaaaaybe_ a `fetch` thrown in?
> 
> What I'm saying is that we have a tool to run maintenance tasks on the
> repository.  If we need to perform additional maintenance tasks, let's
> put them in the same place as the ones we have now.  I realize "gc" may
> become a less accurate name, but oh, well.
> 
> > > Let's make it as simple and straightforward as possible.
> > 
> > I get the impression, however, that many reviewers here seem to favor the
> > goal of making the _patches_ as simple and straightforward as possible,
> > however, at the expense of the original goal. Like, totally sacrificing
> > the ease of use in return for "just use a shell script" advice.
> 
> I think we can have both.  They are not mutually exclusive, and I've
> proposed a suggestion for both.
> 
> > > As for handling multiple repositories, the tool to do that could be as
> > > simple as a shell script which reads from ~/.config/git/repo-maintenance
> > > (or whatever) and runs the same command on all of the repos it finds
> > > there, possibly with a subcommand to add and remove repos.
> > 
> > Sure, that is flexible.
> > 
> > And it requires a ton of Git expertise to know what to put into those
> > scripts. And Git updates cannot deliver more value to those scripts.
> 
> Perhaps I was unclear what I thought could be the design of this.  My
> proposal is something like the following:
> 
>   git schedule-gc add [--period=TIME] [--fetch=REMOTE | --fetch-all] REPO
>   git schedule-gc remove REPO
> 
> The actual command invoked by the system scheduler would be something
> like the following:
> 
>   git schedule-gc run
> 
> It would work as I proposed under the hood, but it would be relatively
> straightforward to use.

Regardless of what happens with the job-runner, I would like to see a
top-level command that performs a single iteration of all the
recommended maintenance steps, with zero configuration required, on a
single repo. This gives an entry point for users who want to manage
their own maintenance schedule without running a background process.


> > > I'm not opposed to seeing a tool that can schedule periodic maintenance
> > > jobs, perhaps in contrib, depending on whether other people think it
> > > should go.  However, I think running periodic jobs is best handled on
> > > Unix with cron or anacron and not a custom tool or a command in Git.
> > 
> > Okay, here is a challenge for you: design this such that the Windows
> > experience does _not_ feel like a 3rd-class citizen. Go ahead. Yes, there
> > is a scheduler. Yep, it does not do cron-like things. Precisely: you have
> > to feed it an XML to make use of the "advanced" features. Yeah, I also
> > cannot remember what the semantics are regarding missed jobs due to
> > shutdown cycles. Nope, you cannot rely on the XML being an option, that
> > would require Windows 10. The list goes on.
> 
> I will freely admit that I know next to nothing about Windows.  I have
> used it only incidentally, if at all, for at least two decades.  It is
> not a platform I generally have an interest in developing for, although
> I try to make it work as well as possible when I am working on a project
> which supports it.
> 
> It is, in general, my assumption, based on its wide usage, that it is a
> powerful and robust operating system with many features, but I have
> little actual knowledge about how it functions or the exact features it
> provides.
> 
> I want a solution that builds on the existing Unix tools for Unix,
> because that is least surprising to users and it is how Unix tools are
> supposed to work.  I think we can agree that Git was designed with the
> Unix philosophy in mind.
> 
> I also want a solution that works on Windows.  Ideally that solution
> would build on existing components that are part of Windows, because it
> reduces the maintenance burden on all of us.  But unfortunately, I know
> next to nothing about how to build such a solution.
> 
> > > I've dealt with systems that implemented periodic tasks without using
> > > the existing tools for doing that, and I've found that usually that's a
> > > mistake.  Despite seeming straightforward, there are a lot of tricky
> > > edge cases to deal with and it's easy to get wrong.
> > 
> > But maybe you found one of those issues in Stolee's patches? If so, please
> > do contribute your experience there to point out those issues, so that
> > they can be addressed.
> 
> One of the benefits of using anacron on Unix is that it can skip running
> tasks when the user is on battery.  This is not anything we can portably
> do across systems, nor is it something that Git should need to know
> about.
> 
> > > We also don't have to reimplement all the features in the system
> > > scheduler and can let expert users use a different tool of their choice
> > > instead if cron (or the Windows equivalent) is not to their liking.
> > 
> > Do we really want to start relying on `cron`, when the major platform used
> > by the target audience (enterprise software engineers who deal with rather
> > larger repositories than git.git or linux.git) quite obviously _lacks_
> > support for that?
> 
> Unix users will be unhappy with us if we use our own scheduling system
> when cron is available.  They will expect us to reimplement those
> features and they will complain if we do not.  While I cannot name
> names, there are a nontrivial number of large, enterprise monorepos that
> run only on macOS and Linux.

Speaking purely as a user, I agree with this point. This is why I want a
single-iteration top-level maintenance command.

Once we have that, we can provide recommended configs for existing
scheduling solutions (cron, launchd, systemd, etc.) in contrib/. If the
Windows scheduler is cumbersome enough that users don't want to use it,
then I think it's perfectly reasonable to provide our own limited
job-runner in contrib/ as well, so long as we don't require people to
use it.

> That doesn't prevent us from building tooling that does the scheduling
> on Windows if we can't use the system scheduler, but it would be nice to
> try to present a relatively unified interface across the two platforms.
> -- 
> brian m. carlson: Houston, Texas, US
> OpenPGP: https://keybase.io/bk2204


@gitgitgadget

gitgitgadget bot commented May 28, 2020

On the Git mailing list, Junio C Hamano wrote (reply to this):

Josh Steadmon <steadmon@google.com> writes:

> Regardless of what happens with the job-runner, I would like to see a
> top-level command that performs a single iteration of all the
> recommended maintenance steps, with zero configuration required, on a
> single repo. This gives an entry point for users who want to manage
> their own maintenance schedule without running a background process.
> ...
>> Unix users will be unhappy with us if we use our own scheduling system
>> when cron is available.  They will expect us to reimplement those
>> features and they will complain if we do not.  While I cannot name
>> names, there are a nontrivial number of large, enterprise monorepos that
>> run only on macOS and Linux.
>
> Speaking purely as a user, I agree with this point. This is why I want a
> single-iteration top-level maintenance command.

Yes, well said.  

It exactly is what "git gc" was meant to be.  To put it differently,
if you asked any non-novice end-user if there is one single command
that s/he would use to keep a repository healthy, it is very likely
that the answer would be "git gc".

And having such a single point of entry would be a good thing.

Thanks.


@gitgitgadget

gitgitgadget bot commented May 28, 2020

On the Git mailing list, Johannes Schindelin wrote (reply to this):

Hi Junio,

On Wed, 27 May 2020, Junio C Hamano wrote:

> Josh Steadmon <steadmon@google.com> writes:
>
> > Regardless of what happens with the job-runner, I would like to see a
> > top-level command that performs a single iteration of all the
> > recommended maintenance steps, with zero configuration required, on a
> > single repo. This gives an entry point for users who want to manage
> > their own maintenance schedule without running a background process.
> > ...
> >> Unix users will be unhappy with us if we use our own scheduling system
> >> when cron is available.  They will expect us to reimplement those
> >> features and they will complain if we do not.  While I cannot name
> >> names, there are a nontrivial number of large, enterprise monorepos that
> >> run only on macOS and Linux.
> >
> > Speaking purely as a user, I agree with this point. This is why I want a
> > single-iteration top-level maintenance command.
>
> Yes, well said.
>
> It exactly is what "git gc" was meant to be.  To put it differently,
> if you asked any non-novice end-user if there is one single command
> that s/he would use to keep a repository healthy, it is very likely
> that the answer would be "git gc".

The biggest problem with bringing up `git gc` in this context (and it is
actually a quite big problem) is that "gc" stands for "garbage
collection", and these maintenance tasks are not even close to being about
collecting garbage.

So while `git gc` looks like a good candidate on its technical merits, the
usability/discoverability point of view paints a very different picture.

What Scalar does is conceptually a _very_ different thing from letting
`git gc --auto` (which I guess is what you _actually_ meant, as I have yet
to meet even a single Git user outside of this mailing list who knows `git
gc`, let alone who runs it manually) determine whether loose objects
should be cleaned up and packs should be consolidated.

Like, pre-fetching a daily pack in preparation for the user fetching
updates. You could argue that this is, in a way, _accumulating_ "garbage".

The entire idea of those maintenance tasks is that they are _not_
triggered by the user, not even as a side effect (`git gc --auto` is very
much a side effect, and on Windows, where it does not `daemonize()`
because that concept does not translate well into the Win32 API, having it
run in the foreground is very much felt by the users).

Those maintenance tasks should stay out of the users' way as much as
possible.

> And having such a single point of entry would be a good thing.

I guess I would argue for the introduction of a new command, like `git
maintenance`, which could potentially trigger a `git gc --auto`, but is
primarily intended to run _outside_ of the users' work hours.

Once we have that, we can always figure out whether there is a convenient
way to register this via `crontab` or Windows Task Scheduler, without
asking the users to do all of these tedious steps manually. I, for one,
have to spend the extra time looking up what those positional numbers in
the `crontab` thing _mean_, _every_ _single_ _time_ I touch the `crontab`.
_Every_ _single_ _time_.

Therefore, I would like very much to have a `git maintenance --schedule`
(or something like that), even if only on Windows, on the grounds alone
that it is even more tedious to work with the Windows Task Scheduler. But
I would prefer to have that also in my Linux setup. The convenience for
the users (myself included) is just too compelling.

Ciao,
Dscho
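For reference, the positional fields Dscho mentions are minute, hour, day-of-month, month, and day-of-week, in that order. A nightly entry driving this RFC's own command might look like this (schedule and path are illustrative):

```
# m   h   dom  mon  dow   command
  0   2   *    *    *     git -C /path/to/repo run-job commit-graph
```

This is exactly the kind of boilerplate a hypothetical `git maintenance --schedule` could generate on the user's behalf.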

@gitgitgadget

gitgitgadget bot commented May 28, 2020

On the Git mailing list, Junio C Hamano wrote (reply to this):

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> It exactly is what "git gc" was meant to be.  To put it differently,
>> if you asked any non-novice end-user if there is one single command
>> that s/he would use to keep a repository healthy, it is very likely
>> that the answer would be "git gc".
>
> The biggest problem with bringing up `git gc` in this context (and it is
> actually a quite big problem) is that "gc" stands for ...

Ah, no, I wasn't saying "the single command has to be called 'gc'".
I do not care too much about the name, as it will be hidden and scripted
away.

I was responding to Josh's "We want a single point of entry for all
the housekeeping tasks" with "yes---it is not any controversial new
idea---we have had such a single entry point for housekeeping tasks
in 'gc' for a long time and we are receptive of having such a single
command".

@gitgitgadget

gitgitgadget bot commented May 28, 2020

On the Git mailing list, Jonathan Nieder wrote (reply to this):

Hi,

Johannes Schindelin wrote:

> So while `git gc` looks like a good candidate on its technical merits, the
> usability/discoverability point of view paints a very different picture.

Here is what I take away from that comment:

    It was a mistake for "git gc" to take responsibility for operations
    like "git pack-refs".  We should migrate to a new "git maintenance
    run" command and make "git gc" print a signpost to help people find
    the new name.

That seems like a reasonable direction to me, but as you're hinting,
from a technical design perspective it doesn't change where e.g. midx
maintenance operations would go.  I would expect the routine
maintenance command, whatever it is called, to perform such
operations.  I would not expect to have to choose between two subtly
different maintenance commands, or in other words to have to pay as a
user for an unrelated naming choice.

I think it's worth discussing naming, but it's kind of a distraction
from what Brian brought up.  The real question is, do we consider the
existing "git gc" infrastructure such a lost cause that we should
touch it as little as possible?

I don't think that was Stolee's intent --- instead, I suspect that it
was just the discoverability issue you mention that led to this series
forgetting about the possibility of incorporating this functionality
into "git gc".

[...]
> I guess I would argue for the introduction of a new command, like `git
> maintenance`, which could potentially trigger a `git gc --auto`, but is
> primarily intended to run _outside_ of the users' work hours.
>
> Once we have that, we can always figure out whether there is a convenient
> way to register this via `crontab` or Windows Task Scheduler, without
> asking the users to do all of these tedious steps manually.

Wonderful, it sounds like we're pretty much in agreement.

[...]
> Therefore, I would like very much to have a `git maintenance --schedule`
> (or something like that), even if only on Windows, on the grounds alone
> that it is even more tedious to work with the Windows Task Scheduler. But
> I would prefer to have that also in my Linux setup. The convenience for
> the users (myself included) is just too compelling.

I think Josh was arguing for deferring that part to a separate series.
Otherwise we could end up spending so much time focusing on scheduling
and service supervision that the series gets delayed and no one
benefits from Stolee's work to improve how routine packing works.

Thanks,
Jonathan

@gitgitgadget

gitgitgadget bot commented May 28, 2020

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jonathan Nieder <jrnieder@gmail.com> writes:

> I think it's worth discussing naming, but it's kind of a distraction
> from what Brian brought up.  The real question is, do we consider the
> existing "git gc" infrastructure such a lost cause that we should
> touch it as little as possible?

I am fine with that, as long as the "new" thing will take over what
"git gc" currently does.  It has always been bothering me that the
maintenance pieces for features added in the past few years like
midx were deliberately left outside the scope of "one entry point for
housekeeping".

@gitgitgadget

gitgitgadget bot commented May 28, 2020

On the Git mailing list, Jonathan Nieder wrote (reply to this):

Junio C Hamano wrote:
> Jonathan Nieder <jrnieder@gmail.com> writes:

>>                              The real question is, do we consider the
>> existing "git gc" infrastructure such a lost cause that we should
>> touch it as little as possible?
>
> I am fine with that, as long as the "new" thing will take over what
> "git gc" currently does.

Good reminder, thank you.

Yes, as long as we end up replacing the old thing, making a parallel
new thing (e.g. with a config option for switching between during a
transition period) can be a fine approach.

Thanks,
Jonathan

@gitgitgadget

gitgitgadget bot commented May 28, 2020

On the Git mailing list, Derrick Stolee wrote (reply to this):

On 5/28/2020 11:03 AM, Jonathan Nieder wrote:
> Junio C Hamano wrote:
>> Jonathan Nieder <jrnieder@gmail.com> writes:
> 
>>>                              The real question is, do we consider the
>>> existing "git gc" infrastructure such a lost cause that we should
>>> touch it as little as possible?
>>
>> I am fine with that, as long as the "new" thing will take over what
>> "git gc" currently does.
> 
> Good reminder, thank you.
> 
> Yes, as long as we end up replacing the old thing, making a parallel
> new thing (e.g. with a config option for switching between during a
> transition period) can be a fine approach.

Thanks for the discussion, everyone. I'm sorry that I'm very late in
providing a new RFC that takes this approach, but yes I intend to create
the "single entry point" for maintenance activities, and incorporate
auto-GC as a default option there.

Something that is a good long-term goal is to have the new maintenance
entry point replace the "git gc --auto" calls, so we have better
customization of post-command "automatic" maintenance. This can be done
without any of the "background" part of my original RFC.

I've just been to busy with other tasks to devote the deep focus that
this feature deserves. Thanks for your patience.

-Stolee

@gitgitgadget

gitgitgadget bot commented May 28, 2020

On the Git mailing list, Johannes Schindelin wrote (reply to this):

Hi Stolee,

On Thu, 28 May 2020, Derrick Stolee wrote:

> On 5/28/2020 11:03 AM, Jonathan Nieder wrote:
> > Junio C Hamano wrote:
> >> Jonathan Nieder <jrnieder@gmail.com> writes:
> >
> >>>                              The real question is, do we consider the
> >>> existing "git gc" infrastructure such a lost cause that we should
> >>> touch it as little as possible?
> >>
> >> I am fine with that, as long as the "new" thing will take over what
> >> "git gc" currently does.
> >
> > Good reminder, thank you.
> >
> > Yes, as long as we end up replacing the old thing, making a parallel
> > new thing (e.g. with a config option for switching between during a
> > transition period) can be a fine approach.
>
> Thanks for the discussion, everyone. I'm sorry that I'm very late in
> providing a new RFC that takes this approach, but yes I intend to create
> the "single entry point" for maintenance activities, and incorporate
> auto-GC as a default option there.

Great! I look forward to reviewing your next patch series iteration,
whenever you're ready.

Ciao,
Dscho

>
> Something that is a good long-term goal is to have the new maintenance
> entry point replace the "git gc --auto" calls, so we have better
> customization of post-command "automatic" maintenance. This can be done
> without any of the "background" part of my original RFC.
>
> I've just been too busy with other tasks to devote the deep focus that
> this feature deserves. Thanks for your patience.
>
> -Stolee
>
