[RFC] Maintenance jobs and job runner #597
Conversation
The 'git run-job' command will be used to execute a short-lived set of maintenance activities by a background job manager. The intention is to perform small batches of work that reduce the foreground time taken by repository maintenance such as 'git gc --auto'.

This change does the absolute minimum to create the builtin and show the usage output. Provide an explicit warning that this command is experimental. The set of jobs may change, and each job could alter its behavior in future versions.

RFC QUESTION: This builtin is based on the background maintenance in Scalar. Specifically, this builtin is based on the "scalar run <job>" command [1] [2]. My default thought was to make this a "git run <job>" command to maximize similarity. However, it seems like "git run" is too generic. Or, am I being overly verbose for no reason?

[1] https://github.com/microsoft/scalar/blob/master/docs/advanced.md#run-maintenance-in-the-foreground
[2] https://github.com/microsoft/scalar/blob/master/Scalar/CommandLine/RunVerb.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
The first job to implement in the 'git run-job' command is the 'commit-graph' job. It is based on the sequence of events in the 'commit-graph' job in Scalar [1]. This sequence is as follows:

1. git commit-graph write --reachable --split
2. git commit-graph verify --shallow
3. If the verify succeeds, stop.
4. Delete the commit-graph-chain file.
5. git commit-graph write --reachable --split

By writing an incremental commit-graph file using the "--split" option we minimize the disruption from this operation. The default behavior is to merge layers until the new "top" layer is less than half the size of the layer below. This provides quick writes "most" of the time, with the longer writes following a power law distribution.

Most importantly, concurrent Git processes only look at the commit-graph-chain file for a very short amount of time, so they will very likely not be holding a handle to the file when we try to replace it. (This only matters on Windows.)

If a concurrent process reads the old commit-graph-chain file, but our job expires some of the .graph files before they can be read, then those processes will see a warning message (but not fail). This could be avoided by a future update to use the --expire-time argument when writing the commit-graph.

By using 'git commit-graph verify --shallow' we can ensure that the file we just wrote is valid. This is an extra safety precaution that is faster than our 'write' subcommand. In the rare situation that the newest layer of the commit-graph is corrupt, we can "fix" the corruption by deleting the commit-graph-chain file and rewriting the full commit-graph as a new one-layer commit-graph. This does not completely prevent _that_ file from being corrupt, but it does recompute the commit-graph by parsing commits from the object database.

In our use of this step in Scalar and VFS for Git, we have only seen this issue arise because our microsoft/git fork reverted 43d3561 ("commit-graph write: don't die if the existing graph is corrupt" 2019-03-25) for a while to keep commit-graph writes very fast. We dropped the revert when updating to v2.23.0. The verify still has potential for catching corrupt data across the layer boundary: if the new file has commit X with parent Y in an old file but the commit ID for Y in the old file had a bitswap, then we will notice that in the 'verify' command.

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/CommitGraphStep.cs

RFC QUESTIONS:

1. Are links like [1] helpful?

2. Can anyone think of a way to test the rewrite fallback? It requires corrupting the latest file between two steps of this one command, so that is a tricky spot to inject a fault.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
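A minimal shell sketch of the sequence above (the builtin drives these steps through run-command rather than a script; the chain path is the standard location under the object directory):

    # 1. Write an incremental commit-graph layer.
    git commit-graph write --reachable --split

    # 2-3. Verify only the newest layer; stop if it is valid.
    if ! git commit-graph verify --shallow
    then
        # 4-5. On corruption, drop the chain and rewrite from scratch.
        rm -f .git/objects/info/commit-graphs/commit-graph-chain
        git commit-graph write --reachable --split
    fi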
When working with very large repositories, an incremental 'git fetch' command can download a large amount of data. If there are many other users pushing to a common repo, then this data can rival the initial pack-file size of a 'git clone' of a medium-size repo.

Users may want to keep the data on their local repos as close as possible to the data on the remote repos by fetching periodically in the background. This can break up a large daily fetch into several smaller hourly fetches.

However, if we simply ran 'git fetch <remote>' in the background, then the user running a foreground 'git fetch <remote>' would lose some important feedback when a new branch appears or an existing branch updates. This is especially true if a remote branch is force-updated and this isn't noticed by the user because it occurred in the background. Further, the functionality of 'git push --force-with-lease' becomes suspect.

When running 'git fetch <remote> <options>' in the background, use the following options for careful updating:

1. --no-tags prevents getting a new tag when a user wants to see the new tags appear in their foreground fetches.

2. --refmap= removes the configured refspec which usually updates refs/remotes/<remote>/* with the refs advertised by the remote.

3. By adding a new refspec "+refs/heads/*:refs/hidden/<remote>/*" we can ensure that we actually load the new values somewhere in our refspace while not updating refs/heads or refs/remotes. By storing these refs here, the commit-graph job will update the commit-graph with the commits from these hidden refs.

4. --prune will delete the refs/hidden/<remote> refs that no longer appear on the remote.

We've been using this step as a critical background job in Scalar [1] (and VFS for Git). This solved a pain point that was showing up in user reports: fetching was a pain! Users do not like waiting to download the data that was created while they were away from their machines. After implementing background fetch, the foreground fetch commands sped up significantly because they mostly just update refs and download a small amount of new data. The effect is especially dramatic when paired with --no-show-forced-updates (through fetch.showForcedUpdates=false).

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/FetchStep.cs

RFC QUESTIONS:

1. One downside of the refs/hidden pattern is that 'git log' will decorate commits with twice as many refs if they appear at a remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is there an easy way to exclude a refspace from decorations? Should we make refs/hidden/* a "special" refspace that is excluded from decorations?

2. This feature is designed for a desktop machine or equivalent that has a permanent wired network connection, and the machine stays on while the user is not present. For things like laptops with inconsistent WiFi connections (that may be metered) the feature can use the less stable connection more than the user wants. Of course, this feature is opt-in for Git, but in Scalar we have a "scalar pause" command [2] that pauses all maintenance for some amount of time. We should consider a similar mechanism for Git, but for the point of this series the user needs to set up the "background" part of these jobs manually.

[2] https://github.com/microsoft/scalar/blob/master/Scalar/CommandLine/PauseVerb.cs
[3] https://github.com/microsoft/scalar/blob/master/docs/advanced.md#controlling-background-maintenance

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
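For a single remote (here assumed to be "origin"), the fetch invoked by this job amounts to the following; the builtin iterates over every configured remote:

    git fetch origin --quiet --prune --no-tags \
        --refmap= "+refs/heads/*:refs/hidden/origin/*"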
One goal of background maintenance jobs is to allow a user to disable auto-gc (gc.auto=0) but keep their repository in a clean state. Without any cleanup, loose objects will clutter the object database and slow operations. In addition, the loose objects will take up extra space because they are not stored with deltas against similar objects.

Create a 'loose-objects' job for the 'git run-job' command. This helps clean up loose objects without disrupting concurrent Git commands using the following sequence of events:

1. Run 'git prune-packed' to delete any loose objects that exist in a pack-file. Concurrent commands will prefer the packed version of the object to the loose version. (Of course, there are exceptions for commands that specifically care about the location of an object. These are rare for a user to run on purpose, and we hope a user that has selected background maintenance will not be trying to do foreground maintenance.)

2. Run 'git pack-objects' on a batch of loose objects. These objects are grouped by scanning the loose object directories in lexicographic order until listing all loose objects -or- reaching 50,000 objects. This is more than enough if the loose objects are created only by a user doing normal development. We noticed users with _millions_ of loose objects because VFS for Git downloads blobs on-demand when a file read operation requires populating a virtual file. This also has the potential of happening in partial clones if someone runs 'git grep' or otherwise evades the batch-download feature for requesting promisor objects.

This step is based on a similar step in Scalar [1] and VFS for Git.

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/LooseObjectsStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
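A rough shell approximation of the two steps, assuming the default object directory; the builtin scans the loose-object directories itself and feeds object IDs to 'git pack-objects' over stdin, so the pipeline below is only illustrative:

    # 1. Drop loose objects that already exist in a pack-file.
    git prune-packed

    # 2. Pack a batch of up to 50,000 of the remaining loose objects.
    find .git/objects/?? -type f |
        sed 's|.*objects/\(..\)/|\1|' |
        head -n 50000 |
        git pack-objects --quiet .git/objects/pack/loose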
The previous change cleaned up loose objects using the 'loose-objects' job that can be run safely in the background. Add a similar job that performs similar cleanups for pack-files.

One issue with running 'git repack' is that it is designed to repack all pack-files into a single pack-file. While this is the most space-efficient way to store object data, it is not time or memory efficient. This becomes extremely important if the repo is so large that a user struggles to store two copies of the pack on their disk.

Instead, perform an "incremental" repack by collecting a few small pack-files into a new pack-file. The multi-pack-index facilitates this process ever since 'git multi-pack-index expire' was added in 19575c7 (multi-pack-index: implement 'expire' subcommand, 2019-06-10) and 'git multi-pack-index repack' was added in ce1e4a1 (midx: implement midx_repack(), 2019-06-10).

The 'pack-files' job runs the following steps:

1. 'git multi-pack-index write' creates a multi-pack-index file if one did not exist, and otherwise will update the multi-pack-index with any new pack-files that appeared since the last write. This is particularly relevant with the background fetch job. When the multi-pack-index sees two copies of the same object, it stores the offset data into the newer pack-file. This means that some old pack-files could become "unreferenced", which I will use to mean "a pack-file that is in the pack-file list of the multi-pack-index but none of the objects in the multi-pack-index reference a location inside that pack-file."

2. 'git multi-pack-index expire' deletes any unreferenced pack-files and updates the multi-pack-index to drop those pack-files from the list. This is safe to do as concurrent Git processes will see the multi-pack-index and not open those packs when looking for object contents. (Similar to the 'loose-objects' job, there are some Git commands that open pack-files regardless of the multi-pack-index, but they are rarely used. Further, a user that self-selects to use background operations would likely refrain from using those commands.)

3. 'git multi-pack-index repack --batch-size=<size>' collects a set of pack-files that are listed in the multi-pack-index and creates a new pack-file containing the objects whose offsets are listed by the multi-pack-index to be in those pack-files. The set of pack-files is selected greedily by sorting the pack-files by modified time and adding a pack-file to the set if its "expected size" is smaller than the batch size until the total expected size of the selected pack-files is at least the batch size. The "expected size" is calculated by taking the size of the pack-file divided by the number of objects in the pack-file and multiplied by the number of objects from the multi-pack-index with offset in that pack-file. The expected size approximates how much data from that pack-file will contribute to the resulting pack-file size. The intention is that the resulting pack-file will be close in size to the provided batch size.

The next run of the pack-files job will delete these repacked pack-files during the 'expire' step.

In this version, the batch size is set to "0" which ignores the size restrictions when selecting the pack-files. It instead selects all pack-files and repacks all packed objects into a single pack-file. This will be updated in the next change, but it requires doing some calculations that are better isolated to a separate change.

Each of the above steps updates the multi-pack-index file.
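In command form, the three steps are roughly the following (batch size shown as 0 to match this version; the next change computes it dynamically):

    # 1. Create or update the multi-pack-index.
    git multi-pack-index write

    # 2. Delete pack-files that no longer back any object in the midx.
    git multi-pack-index expire

    # 3. Repack a batch of small pack-files into one new pack-file.
    git multi-pack-index repack --batch-size=0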
After each step, we verify the new multi-pack-index. If the new multi-pack-index is corrupt, then delete the multi-pack-index, rewrite it from scratch, and stop doing the later steps of the job. This is intended to be an extra-safe check that avoids leaving a repo with many pack-files but no multi-pack-index.

These steps are based on a similar background maintenance step in Scalar (and VFS for Git) [1]. This was incredibly effective for users of the Windows OS repository. After using the same VFS for Git repository for over a year, some users had _thousands_ of pack-files that combined to up to 250 GB of data. We noticed a few users were running into the open file descriptor limits (due in part to a bug in the multi-pack-index fixed by af96fe3 (midx: add packs to packed_git linked list, 2019-04-29)).

These pack-files were mostly small since they contained the commits and trees that were pushed to the origin in a given hour. The GVFS protocol includes a "prefetch" step that asks for pre-computed pack-files containing commits and trees by timestamp. These pack-files were grouped into "daily" pack-files once a day for up to 30 days. If a user did not request prefetch packs for over 30 days, then they would get the entire history of commits and trees in a new, large pack-file. This led to a large number of pack-files that had poor delta compression.

By running this pack-file maintenance step once per day, these repos with thousands of packs spanning 200+ GB dropped to dozens of pack-files spanning 30-50 GB. This was done all without removing objects from the system and using a constant batch size of two gigabytes. Once the work was done to reduce the pack-files to small sizes, the batch size of two gigabytes means that not every run triggers a repack operation, so the following run will not expire a pack-file. This has kept these repos in a "clean" state.

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
When repacking during the 'pack-files' job, we use the --batch-size option in 'git multi-pack-index repack'. The initial setting used --batch-size=0 to repack everything into a single pack-file. This is not sustainable for a large repository. The amount of work required is also likely to use too many system resources for a background job.

Update the 'git run-job pack-files' command by allowing a direct --batch-size option that can change the value provided. Update the default of "0" to a computed size based on the existing pack-files. While computing that new size, count the existing pack-files and skip the repack step if there are at most two pack-files.

The dynamic default size is computed with this idea in mind for a client repository that was cloned from a very large remote: there is likely one "big" pack-file that was created at clone time. Thus, do not try repacking it as it is likely packed efficiently by the server. Instead, try packing the other pack-files into a single pack-file. The size is then computed as follows:

	batch size = total size - max pack size

The batch size is then also limited to be at most two gigabytes. This serves two purposes. First, having a limit prevents doing too much work when the repository is extremely large. Pack-files larger than two gigabytes are likely to either contain large blobs or have been carefully repacked by a previous repack operation. Second, two gigabytes is the size limit for a signed 32-bit int. It is a good limit to consider, and it keeps us far away from the unsigned 32-bit int limit. This limit comes to mind because on Windows an "unsigned long" is 32 bits and OPT_MAGNITUDE() uses unsigned longs for its parsing logic.

This calculation mimics a similar calculation in Scalar [1], except for a 3% drop in the calculated batch size due to the round-off error that can happen with the "expected size" calculation for a pack-file.

[1] https://github.com/microsoft/scalar/blob/616e9b16dd120b8fdb652d6d5a55618c731a8aea/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs#L141-L143

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
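A hypothetical shell sketch of the same computation (the ~3% reduction and the builtin's exact cap handling are omitted; paths assume the default pack directory):

    total=0
    max=0
    count=0
    for pack in .git/objects/pack/*.pack
    do
        test -f "$pack" || continue
        size=$(wc -c <"$pack")
        total=$((total + size))
        test "$size" -gt "$max" && max=$size
        count=$((count + 1))
    done

    if test "$count" -le 2
    then
        echo "at most two pack-files; skipping repack"
    else
        batch=$((total - max))
        # Cap at roughly two gigabytes (signed 32-bit maximum).
        test "$batch" -gt 2147483647 && batch=2147483647
        git multi-pack-index repack --batch-size="$batch"
    fi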
The previous change allowed a user to specify a --batch-size=<size> option to 'git run-job pack-files'. However, when we eventually launch these jobs on a schedule, we want users to be able to change this value through config options. The new "job.pack-files.batchSize" option will override the default dynamic batch-size calculation, but will be overridden by the --batch-size=<size> argument. This is the first config option of the type "job.<job-name>.<option>" but is not the last. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
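For example (the value format is whatever 'git multi-pack-index repack --batch-size' accepts, since the string is passed through verbatim):

    # Pin the batch size for one repository instead of the dynamic default.
    git -C /path/to/repo config job.pack-files.batchSize 500m

    # An explicit command-line value still overrides the config.
    git -C /path/to/repo run-job pack-files --batch-size=1g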
Given the existing 'git run-job <job-name>' builtin, users _could_ construct their own scheduling mechanism for running jobs in the background. However, it is much easier to have a dedicated process that manages all jobs across multiple repos.

The 'git job-runner' builtin is specifically built to handle this scenario. It will be customized further in later changes, but for now it does the following:

* Given a list of '--repo=<path>' arguments, construct a list of repositories to manage with jobs.

* Every 30 minutes, iterate over all jobs and all repos to run

	git -C <repo> run-job <job-name>

This builtin needs to be careful about how much of the Git internals it consumes. The intention is that this is a long-lived process that could be launched upon login and only closed on logout. For that reason, we will avoid instantiating any object store or index data. Run the maintenance jobs by running subcommands.

We will update how we enable or disable these jobs and separate their runs in a later change.

RFC QUESTIONS:

1. The hardest part of this builtin is "how do we test it?" In Scalar, we can unit test the scheduler with mocks. What is the equivalent here for "make sure 'git job-runner' runs 'git run-job pack-files' on repo X"? I expect to add a "--no-loop" option that ensures the logic only runs one iteration of the loop.

2. The difference between 'git run-job' and 'git job-runner' is subtle and probably confusing. Are there better names?

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
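A shell caricature of the loop as it stands in this change, assuming two --repo arguments (later changes add per-job intervals, enable flags, and a configurable pause):

    while true
    do
        for repo in /path/to/repo-a /path/to/repo-b
        do
            for job in commit-graph fetch loose-objects pack-files
            do
                git -C "$repo" run-job "$job"
            done
        done
        sleep 1800    # 30 minutes between iterations
    done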
The 'git job-runner' command was introduced with a '--repo=<path>' option so a user could start a process with an explicit list of repos. However, it is more likely that we will want 'git job-runner' to start without any arguments and discover the set of repos from another source. A natural source is the Git config. The 'git job-runner' does not need to run in a Git repository, but the config could be located in the global or system config. Create the job.repo config setting as a multi-valued config option. Read all values for job.repo whenever triggering an iteration of the job loop. This allows a user to add or remove repos dynamically without restarting the job-runner. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
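For instance (the paths are placeholders):

    # Register two repos for background maintenance.
    git config --global --add job.repo /home/me/src/git
    git config --global --add job.repo /home/me/src/linux

    # Drop one later; the next loop iteration picks up the change.
    git config --global --unset job.repo /home/me/src/linux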
We want to run different maintenance tasks at different rates. That means we cannot rely only on the time between job-runner loop pauses to reduce the frequency of specific jobs. This means we need to persist a timestamp for the previous run somewhere. A natural place is the Git config file for that repo.

Create the job.<job-name>.lastRun config option to store the timestamp of the previous run of that job. Set this config option after a successful run of 'git -C <repo> run-job <job-name>'.

To maximize code reuse, we dynamically construct the config key and parse the config value into a timestamp in a generic way. This makes the introduction of another config option extremely trivial:

The job.<job-name>.interval option allows a user to specify a minimum number of seconds between two calls to 'git run-job <job-name>' on a given repo. This could be stored in the global or system config to update the default for all repos, or could be set on a per-repo basis. This is checked on every iteration of the job loop, so a user could update this and see the effect without restarting the job-runner process.

RFC QUESTION: I'm using a 'git -C <repo> config <option>' process to test if enough time has elapsed before running the 'run-job' process. These 'config' processes are pretty light-weight, so hopefully they are not too noisy. An alternative would be to always run 'git -C <repo> run-job <job-name>' and rely on that process to no-op based on the configured values and how recently they were run. However, that will likely interrupt users who want to test the jobs in the foreground. Finally, that user disruption would be mitigated by adding a "--check-latest" option. A user running a job manually would not include that argument by default and would succeed. However, any logging that we might do for the job-runner would make it look like we are running the run-job commands all the time even if they are no-ops. Advice welcome!

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
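In config terms, the per-repo check looks roughly like this (values are hypothetical; the runner shells out to 'git -C <repo> config' exactly as described above):

    # Run the fetch job at most once per hour on this repo.
    git -C /path/to/repo config job.fetch.interval 3600

    # After a successful run, the runner records the current timestamp.
    git -C /path/to/repo config job.fetch.lastRun 1585946894

    # The runner skips the job while lastRun + interval > now.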
The 'git job-runner' process uses a sleep(X) call to pause its operation between iterations of the job loop. This defaults to 30 minutes, but the interval can now be made longer or shorter according to the job.loopInterval config option. For example, a user may want the job loop to run once every five minutes while another wants the job loop to run once every six hours.

This config value is checked immediately before the sleep(X) call, which allows users to see the effect without restarting the job-runner process. However, the process will be paused until the previous sleep(X) call returns and the job loop is executed.

RFC QUESTION: Is this use of sleep(X) the best way to do this? Is there a better way to delay the process for a time interval, or until a specified time? This just seemed like the simplest option. The job-runner is doing low-priority work on an unpredictable schedule by design, so sleep(X) seemed appropriate.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
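For example, assuming the value is given in seconds like the per-job interval options:

    # Pause five minutes between job-loop iterations.
    git config --global job.loopInterval 300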
The previous change used the job.loopInterval config option to adjust the default sleep between iterations of the job loop in 'git job-runner'. Now, add a command-line option so a user can override the configured value with an explicit value. The value specified by this command-line option cannot be changed without restarting the job-runner process. RFC QUESTION: Are things like this useful, or should we only use config options for this customization? Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
Allow a user to specify that a job should not be run on a given repo by setting "job.<job-name>.enabled = false". The job-runner will check this config before any other options and stop running the job on that repo.

If the config is disabled in the config read by the job-runner itself (i.e. in global config) then the job-runner will not attempt to run the job on any of the repos.

Since this config is checked dynamically, the job-runner will see config changes without needing to restart the process.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
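For example:

    # Stop running the fetch job in one repo only.
    git -C /path/to/repo config job.fetch.enabled false

    # Disable the pack-files job for every repo the runner manages.
    git config --global job.pack-files.enabled false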
Git has the ability to fork the process and launch a "daemonized" copy of the current process. This is used to launch a background 'git gc' command in some cases or to launch the 'git daemon' server process.

Update the 'git job-runner' command with a --daemonize option that launches this background process. The implementation of daemonize() in setup.c is very clear that this may no-op and return an error if NO_POSIX_GOODIES is not defined. Include a clear error message pointing out that daemonize() might fail due to these platform incompatibilities, since the mechanism may not be available on every platform.

I have been running the current version of this series on my Linux VM using --daemonize to keep my copies of torvalds/linux and git/git maintained. Using GIT_TRACE2_PERF set to a path, I can see that the 'git run-job' processes are being created on the correct schedule according to my config for each.

RFC QUESTION: I notice that 'git gc' does not document --daemonize. Is that intentional? Or is it an oversight?

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
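As a usage sketch (the trace path is a placeholder):

    # Launch the runner in the background with trace output for inspection.
    GIT_TRACE2_PERF=/tmp/job-runner-trace.log git job-runner --daemonize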
Allow a user to override the default number of loose objects to place into a new pack-file as part of the loose-objects job. This can be done via the job.loose-objects.batchSize config option or the --batch-size=<count> option in the 'git run-job' command. The config value is checked once per run of 'git run-job loose-objects' so an instance started by 'git job-runner' will use new values automatically without restarting the 'git job-runner' process. Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
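For example:

    # Pack at most 10,000 loose objects per run in this repo.
    git -C /path/to/repo config job.loose-objects.batchSize 10000

    # Or override for a single foreground run.
    git -C /path/to/repo run-job loose-objects --batch-size=10000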
/submit
Submitted as pull.597.git.1585946894.gitgitgadget@gmail.com
On the Git mailing list, Junio C Hamano wrote (reply to this):
@@ -0,0 +1,64 @@
git-run-job(1)
On the Git mailing list, Phillip Wood wrote (reply to this):
Hi Stolee
On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> When working with very large repositories, an incremental 'git fetch'
> command can download a large amount of data. If there are many other
> users pushing to a common repo, then this data can rival the initial
> pack-file size of a 'git clone' of a medium-size repo.
>
> Users may want to keep the data on their local repos as close as
> possible to the data on the remote repos by fetching periodically in
> the background. This can break up a large daily fetch into several
> smaller hourly fetches.
>
> However, if we simply ran 'git fetch <remote>' in the background,
> then the user running a foreground 'git fetch <remote>' would lose
> some important feedback when a new branch appears or an existing
> branch updates. This is especially true if a remote branch is
> force-updated and this isn't noticed by the user because it occurred
> in the background. Further, the functionality of 'git push
> --force-with-lease' becomes suspect.
>
> When running 'git fetch <remote> <options>' in the background, use
> the following options for careful updating:
>
> 1. --no-tags prevents getting a new tag when a user wants to see
> the new tags appear in their foreground fetches.
>
> 2. --refmap= removes the configured refspec which usually updates
> refs/remotes/<remote>/* with the refs advertised by the remote.
>
> 3. By adding a new refspec "+refs/heads/*:refs/hidden/<remote>/*"
> we can ensure that we actually load the new values somewhere in
> our refspace while not updating refs/heads or refs/remotes. By
> storing these refs here, the commit-graph job will update the
> commit-graph with the commits from these hidden refs.
>
> 4. --prune will delete the refs/hidden/<remote> refs that no
> longer appear on the remote.
>
> We've been using this step as a critical background job in Scalar
> [1] (and VFS for Git). This solved a pain point that was showing up
> in user reports: fetching was a pain! Users do not like waiting to
> download the data that was created while they were away from their
> machines. After implementing background fetch, the foreground fetch
> commands sped up significantly because they mostly just update refs
> and download a small amount of new data. The effect is especially
> dramatic when paired with --no-show-forced-updates (through
> fetch.showForcedUpdates=false).
>
> [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/FetchStep.cs
>
> RFC QUESTIONS:
>
> 1. One downside of the refs/hidden pattern is that 'git log' will
> decorate commits with twice as many refs if they appear at a
> remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
> there an easy way to exclude a refspace from decorations? Should
> we make refs/hidden/* a "special" refspace that is excluded from
> decorations?
Having some way to specify which refs outside of
refs/{heads,remote,tags}/ to show or exclude from decorations would be
useful I think. Fetching to a hidden ref is a good idea (as are the
other steps you outline above) but as you say we don't want it to show
up in the output of 'git log' etc.
> 2. This feature is designed for a desktop machine or equivalent
> that has a permanent wired network connection, and the machine
> stays on while the user is not present. For things like laptops
> with inconsistent WiFi connections (that may be metered) the
> feature can use the less stable connection more than the user
> wants. Of course, this feature is opt-in for Git, but in Scalar
> we have a "scalar pause" command [2] that pauses all maintenance
> for some amount of time. We should consider a similar mechanism
> for Git, but for the point of this series the user needs to set
> up the "background" part of these jobs manually.
>
> [2] https://github.com/microsoft/scalar/blob/master/Scalar/CommandLine/PauseVerb.cs
> [3] https://github.com/microsoft/scalar/blob/master/docs/advanced.md#controlling-background-maintenance
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> Documentation/git-run-job.txt | 13 ++++-
> builtin/run-job.c | 89 ++++++++++++++++++++++++++++++++++-
> t/t7900-run-job.sh | 22 +++++++++
> 3 files changed, 122 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
> index 8bf2762d650..eb92e891915 100644
> --- a/Documentation/git-run-job.txt
> +++ b/Documentation/git-run-job.txt
> @@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
> SYNOPSIS
> --------
> [verse]
> -'git run-job commit-graph'
> +'git run-job (commit-graph|fetch)'
>
>
> DESCRIPTION
> @@ -47,6 +47,17 @@ since it will not expire `.graph` files that were in the previous
> `commit-graph-chain` file. They will be deleted by a later run based on
> the expiration delay.
>
> +'fetch'::
> +
> +The `fetch` job updates the object directory with the latest objects
> +from all registered remotes. For each remote, a `git fetch` command is
> +run. The refmap is custom to avoid updating local or remote branches
> +(those in `refs/heads` or `refs/remotes`). Instead, the remote refs are
> +stored in `refs/hidden/<remote>/`. Also, no tags are updated.
> ++
> +This means that foreground fetches are still required to update the
> +remote refs, but the users is notified when the branches and tags are
> +updated on the remote.
>
> GIT
> ---
> diff --git a/builtin/run-job.c b/builtin/run-job.c
> index dd7709952d3..e59056b2918 100644
> --- a/builtin/run-job.c
> +++ b/builtin/run-job.c
> @@ -7,7 +7,7 @@
> #include "run-command.h"
>
> static char const * const builtin_run_job_usage[] = {
> - N_("git run-job commit-graph"),
> + N_("git run-job (commit-graph|fetch)"),
> NULL
> };
>
> @@ -60,6 +60,91 @@ static int run_commit_graph_job(void)
> return 1;
> }
>
> +static int fetch_remote(const char *remote)
> +{
> + int result;
> + struct argv_array cmd = ARGV_ARRAY_INIT;
> + struct strbuf refmap = STRBUF_INIT;
> +
> + argv_array_pushl(&cmd, "fetch", remote, "--quiet", "--prune",
> + "--no-tags", "--refmap=", NULL);
> +
> + strbuf_addf(&refmap, "+refs/heads/*:refs/hidden/%s/*", remote);
> + argv_array_push(&cmd, refmap.buf);
> +
> + result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +
> + strbuf_release(&refmap);
> + return result;
> +}
> +
> +static int fill_remotes(struct string_list *remotes)
Isn't there an easy way to get this using the config api rather than
forking 'git remote'?
Best Wishes
Phillip
> +{
> + int result = 0;
> + FILE *proc_out;
> + struct strbuf line = STRBUF_INIT;
> + struct child_process *remote_proc = xmalloc(sizeof(*remote_proc));
> +
> + child_process_init(remote_proc);
> +
> + argv_array_pushl(&remote_proc->args, "git", "remote", NULL);
> +
> + remote_proc->out = -1;
> +
> + if (start_command(remote_proc)) {
> + error(_("failed to start 'git remote' process"));
> + result = 1;
> + goto cleanup;
> + }
> +
> + proc_out = xfdopen(remote_proc->out, "r");
> +
> + /* if there is no line, leave the value as given */
> + while (!strbuf_getline(&line, proc_out))
> + string_list_append(remotes, line.buf);
> +
> + strbuf_release(&line);
> +
> + fclose(proc_out);
> +
> + if (finish_command(remote_proc)) {
> + error(_("failed to finish 'git remote' process"));
> + result = 1;
> + }
> +
> +cleanup:
> + free(remote_proc);
> + return result;
> +}
> +
> +static int run_fetch_job(void)
> +{
> + int result = 0;
> + struct string_list_item *item;
> + struct string_list remotes = STRING_LIST_INIT_DUP;
> +
> + if (fill_remotes(&remotes)) {
> + error(_("failed to fill remotes"));
> + result = 1;
> + goto cleanup;
> + }
> +
> + /*
> + * Do not modify the result based on the success of the 'fetch'
> + * operation, as a loss of network could cause 'fetch' to fail
> + * quickly. We do not want that to stop the rest of our
> + * background operations.
> + */
> + for (item = remotes.items;
> + item && item < remotes.items + remotes.nr;
> + item++)
> + fetch_remote(item->string);
> +
> +cleanup:
> + string_list_clear(&remotes, 0);
> + return result;
> +}
> +
> int cmd_run_job(int argc, const char **argv, const char *prefix)
> {
> static struct option builtin_run_job_options[] = {
> @@ -79,6 +164,8 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
> if (argc > 0) {
> if (!strcmp(argv[0], "commit-graph"))
> return run_commit_graph_job();
> + if (!strcmp(argv[0], "fetch"))
> + return run_fetch_job();
> }
>
> usage_with_options(builtin_run_job_usage,
> diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
> index 18b9bd26b3a..d3faeba135b 100755
> --- a/t/t7900-run-job.sh
> +++ b/t/t7900-run-job.sh
> @@ -44,4 +44,26 @@ test_expect_success 'commit-graph job' '
> )
> '
>
> +test_expect_success 'fetch job' '
> + git clone "file://$(pwd)/server" client &&
> +
> + # Before fetching, build a client commit-graph
> + git -C client run-job commit-graph &&
> + chain=client/.git/objects/info/commit-graphs/commit-graph-chain &&
> + test_line_count = 1 $chain &&
> +
> + git -C client branch -v --remotes >before-refs &&
> + test_commit -C server 24 &&
> +
> + git -C client run-job fetch &&
> + git -C client branch -v --remotes >after-refs &&
> + test_cmp before-refs after-refs &&
> + test_cmp server/.git/refs/heads/master \
> + client/.git/refs/hidden/origin/master &&
> +
> + # the hidden ref should trigger a new layer in the commit-graph
> + git -C client run-job commit-graph &&
> + test_line_count = 2 $chain
> +'
> +
> test_done
>
On the Git mailing list, Derrick Stolee wrote (reply to this):
On 4/5/2020 11:14 AM, Phillip Wood wrote:
> Hi Stolee
>
> On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
>> RFC QUESTIONS:
>>
>> 1. One downside of the refs/hidden pattern is that 'git log' will
>> decorate commits with twice as many refs if they appear at a
>> remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
>> there an easy way to exclude a refspace from decorations? Should
>> we make refs/hidden/* a "special" refspace that is excluded from
>> decorations?
>
> Having some way to specify which refs outside of refs/{heads,remote,tags}/ to show or exclude from decorations would be useful I think. Fetching to a hidden ref is a good idea (as are the other steps you outline above) but as you say we don't want it to show up in the output of 'git log' etc.
I'll work on this first. It seems less controversial.
>> +static int fill_remotes(struct string_list *remotes)
>
> Isn't there an easy way to get this using the config api rather than forking 'git remote'?
You're right. I should have used the config here. I've been overly
biased to how we do it in the C# layer of Scalar _plus_ how I wanted
the job-runner to run "outside" a Git repository. This could be much
simpler with the config API.
Thanks,
-Stolee
@@ -0,0 +1,13 @@
job.pack-files.batchSize::
On the Git mailing list, Phillip Wood wrote (reply to this):
Hi Stolee
On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The 'git job-runner' command was introduced with a '--repo=<path>'
> option so a user could start a process with an explicit list of
> repos. However, it is more likely that we will want 'git
> job-runner' to start without any arguments and discover the set of
> repos from another source.
>
> A natural source is the Git config. The 'git job-runner' does not
> need to run in a Git repository, but the config could be located in
> the global or system config.
>
> Create the job.repo config setting as a multi-valued config option.
>
> Read all values for job.repo whenever triggering an iteration of
> the job loop. This allows a user to add or remove repos dynamically
> without restarting the job-runner.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> Documentation/config/job.txt | 7 +++++++
> builtin/job-runner.c | 20 ++++++++++++++++++++
> 2 files changed, 27 insertions(+)
>
> diff --git a/Documentation/config/job.txt b/Documentation/config/job.txt
> index efdb76afad3..2dfed3935fa 100644
> --- a/Documentation/config/job.txt
> +++ b/Documentation/config/job.txt
> @@ -4,3 +4,10 @@ job.pack-files.batchSize::
> part of `git run-job pack-files`. If not specified, then a
> dynamic size calculation is run. See linkgit:git-run-job[1]
> for more details.
> +
> +job.repo::
> + Store an absolute path to a repository that wants background
> + jobs run by `git job-runner`. This is a multi-valued config
> + option, but it must be stored in a config file seen by the
> + background runner. This may be the global or system config
> + depending on how `git job-runner` is launched on your system.
> diff --git a/builtin/job-runner.c b/builtin/job-runner.c
> index 135288bcaae..bed401f94bf 100644
> --- a/builtin/job-runner.c
> +++ b/builtin/job-runner.c
> @@ -20,6 +20,9 @@ static int arg_repos_append(const struct option *opt,
>
> static int load_active_repos(struct string_list *repos)
> {
> + struct string_list_item *item;
> + const struct string_list *config_repos;
> +
> if (arg_repos.nr) {
> struct string_list_item *item;
> for (item = arg_repos.items;
> @@ -29,6 +32,23 @@ static int load_active_repos(struct string_list *repos)
> return 0;
> }
>
> + config_repos = git_config_get_value_multi("job.repo");
Does this really re-read the config files as the commit message suggests
or just return the cached values? Perhaps the runner could exec itself
with a --run-jobs option to actually run the jobs so that it sees any
updated config values.
Best Wishes
Phillip
> +
> + for (item = config_repos->items;
> + item && item < config_repos->items + config_repos->nr;
> + item++) {
> + DIR *dir = opendir(item->string);
> +
> + if (!dir)
> + continue;
> +
> + closedir(dir);
> + string_list_append(repos, item->string);
> + }
> +
> + string_list_sort(repos);
> + string_list_remove_duplicates(repos, 0);
> +
> return 0;
> }
>
>
On the Git mailing list, Derrick Stolee wrote (reply to this):
On 4/5/2020 11:18 AM, Phillip Wood wrote:
> Hi Stolee
>
> On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
>> + config_repos = git_config_get_value_multi("job.repo");
>
> Does this really re-read the config files as the commit message suggests or just return the cached values? Perhaps the runner could exec itself with a --run-jobs option to actually run the jobs so that it sees any updated config values.
You're right, I need to double-check that. I'm guessing that you are
right and it uses the cached values. This should use the run_command()
pattern to ensure the config is as new as possible.
Thanks,
-Stolee
@@ -0,0 +1,23 @@
job.<job-name>.interval::
On the Git mailing list, Phillip Wood wrote (reply to this):
Hi Stolee
On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> We want to run different maintenance tasks at different rates. That
> means we cannot rely only on the time between job-runner loop pauses
> to reduce the frequency of specific jobs. This means we need to
> persist a timestamp for the previous run somewhere. A natural place
> is the Git config file for that repo.
>
> Create the job.<job-namme>.lastRun config option to store the
> timestamp of the previous run of that job. Set this config option
> after a successful run of 'git -C <repo> run-job <job-name>'.
>
> To maximize code reuse, we dynamically construct the config key
> and parse the config value into a timestamp in a generic way. This
> makes the introduction of another config option extremely trivial:
>
> The job.<job-name>.interval option allows a user to specify a
> minimum number of seconds between two calls to 'git run-job
> <job-name>' on a given repo. This could be stored in the global
> or system config to provide an update on the default for all repos,
> or could be updated on a per-repo basis. This is checked on every
> iteration of the job loop, so a user could update this and see the
> effect without restarting the job-runner process.
>
> RFC QUESTION: I'm using a 'git -C <repo> config <option>' process
> to test if enough time has elapsed before running the 'run-job'
> process. These 'config' processes are pretty light-weight, so
> hopefully they are not too noisy. An alternative would be to
> always run 'git -C <repo> run-job <job-name>' and rely on that
> process to no-op based on the configured values and how recently
> they were run.
You're still executing another process so it doesn't really save
anything in the 'noop' case. In the case where something needs to be
done I think the extra config process is unlikely to be noticed
(especially as 'git job' forks a lot anyway)
Best Wishes
Phillip
> However, that will likely interrupt users who want
> to test the jobs in the foreground. Finally, that user disruption
> would be mitigated by adding a "--check-latest" option. A user
> running a job manually would not include that argument by default
> and would succeed. However, any logging that we might do for the
> job-runner would make it look like we are running the run-job
> commands all the time even if they are no-ops. Advice welcome!
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> Documentation/config/job.txt | 10 ++
> Documentation/git-run-job.txt | 3 +
> builtin/job-runner.c | 177 +++++++++++++++++++++++++++++++++-
> 3 files changed, 189 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/config/job.txt b/Documentation/config/job.txt
> index 2dfed3935fa..7c799d66221 100644
> --- a/Documentation/config/job.txt
> +++ b/Documentation/config/job.txt
> @@ -1,3 +1,13 @@
> +job.<job-name>.interval::
> + The minimum number of seconds between runs of
> + `git run-job <job-name>` when running `git job-runner`.
> +
> +job.<job-name>.lastRun::
> + The Unix epoch for the timestamp of the previous run of
> + `git run-job <job-name>` when running `git job-runner`. You
> + can manually update this to a later time to delay a specific
> + job on this repository.
> +
> job.pack-files.batchSize::
> This string value `<size>` will be passed to the
> `git multi-pack-index repack --batch-size=<size>` command as
> diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
> index cdd6417f7c9..c6d5674d699 100644
> --- a/Documentation/git-run-job.txt
> +++ b/Documentation/git-run-job.txt
> @@ -90,6 +90,9 @@ pack-files, the job does not attempt to repack. Otherwise, the batch
> size is the sum of all pack-file sizes minus the largest pack-file size.
> The batch size is capped at two gigabytes. This intends to pack all
> small pack-files into a single pack-file.
> ++
> +The `--batch-size=<size>` option will override the dynamic or configured
> +batch size.
>
>
> GIT
> diff --git a/builtin/job-runner.c b/builtin/job-runner.c
> index bed401f94bf..aee55c106e8 100644
> --- a/builtin/job-runner.c
> +++ b/builtin/job-runner.c
> @@ -4,11 +4,175 @@
> #include "run-command.h"
> #include "string-list.h"
>
> +#define MAX_TIMESTAMP ((timestamp_t)-1)
> +
> static char const * const builtin_job_runner_usage[] = {
> N_("git job-runner [<options>]"),
> NULL
> };
>
> +static char *config_name(const char *prefix,
> + const char *job,
> + const char *postfix)
> +{
> + int postfix_dot = 0;
> + struct strbuf name = STRBUF_INIT;
> +
> + if (prefix) {
> + strbuf_addstr(&name, prefix);
> + strbuf_addch(&name, '.');
> + }
> +
> + if (job) {
> + strbuf_addstr(&name, job);
> + postfix_dot = 1;
> + }
> +
> + if (postfix) {
> + if (postfix_dot)
> + strbuf_addch(&name, '.');
> + strbuf_addstr(&name, postfix);
> + }
> +
> + return strbuf_detach(&name, NULL);
> +}
> +
> +static int try_get_config(const char *job,
> + const char *repo,
> + const char *postfix,
> + char **value)
> +{
> + int result = 0;
> + FILE *proc_out;
> + struct strbuf line = STRBUF_INIT;
> + char *cname = config_name("job", job, postfix);
> + struct child_process *config_proc = xmalloc(sizeof(*config_proc));
> +
> + child_process_init(config_proc);
> +
> + argv_array_push(&config_proc->args, "git");
> + argv_array_push(&config_proc->args, "-C");
> + argv_array_push(&config_proc->args, repo);
> + argv_array_push(&config_proc->args, "config");
> + argv_array_push(&config_proc->args, cname);
> + free(cname);
> +
> + config_proc->out = -1;
> +
> + if (start_command(config_proc)) {
> + warning(_("failed to start process for repo '%s'"),
> + repo);
> + result = 1;
> + goto cleanup;
> + }
> +
> + proc_out = xfdopen(config_proc->out, "r");
> +
> + /* if there is no line, leave the value as given */
> + if (!strbuf_getline(&line, proc_out))
> + *value = strbuf_detach(&line, NULL);
> + else
> + *value = NULL;
> +
> + strbuf_release(&line);
> +
> + fclose(proc_out);
> +
> + result = finish_command(config_proc);
> +
> +cleanup:
> + free(config_proc);
> + return result;
> +}
> +
> +static int try_get_timestamp(const char *job,
> + const char *repo,
> + const char *postfix,
> + timestamp_t *t)
> +{
> + char *value;
> + int result = try_get_config(job, repo, postfix, &value);
> +
> + if (!result) {
> + *t = atol(value);
> + free(value);
> + }
> +
> + return result;
> +}
> +
> +static timestamp_t get_last_run(const char *job,
> + const char *repo)
> +{
> + timestamp_t last_run = 0;
> +
> + /* In an error state, do not run the job */
> + if (try_get_timestamp(job, repo, "lastrun", &last_run))
> + return MAX_TIMESTAMP;
> +
> + return last_run;
> +}
> +
> +static timestamp_t get_interval(const char *job,
> + const char *repo)
> +{
> + timestamp_t interval = MAX_TIMESTAMP;
> +
> + /* In an error state, do not run the job */
> + if (try_get_timestamp(job, repo, "interval", &interval))
> + return MAX_TIMESTAMP;
> +
> + if (interval == MAX_TIMESTAMP) {
> + /* load defaults for each job */
> + if (!strcmp(job, "commit-graph") || !strcmp(job, "fetch"))
> + interval = 60 * 60; /* 1 hour */
> + else
> + interval = 24 * 60 * 60; /* 1 day */
> + }
> +
> + return interval;
> +}
> +
> +static int set_last_run(const char *job,
> + const char *repo,
> + timestamp_t last_run)
> +{
> + int result = 0;
> + struct strbuf last_run_string = STRBUF_INIT;
> + struct child_process *config_proc = xmalloc(sizeof(*config_proc));
> + char *cname = config_name("job", job, "lastrun");
> +
> + strbuf_addf(&last_run_string, "%"PRItime, last_run);
> +
> + child_process_init(config_proc);
> +
> + argv_array_push(&config_proc->args, "git");
> + argv_array_push(&config_proc->args, "-C");
> + argv_array_push(&config_proc->args, repo);
> + argv_array_push(&config_proc->args, "config");
> + argv_array_push(&config_proc->args, cname);
> + argv_array_push(&config_proc->args, last_run_string.buf);
> + free(cname);
> + strbuf_release(&last_run_string);
> +
> + if (start_command(config_proc)) {
> + warning(_("failed to start process for repo '%s'"),
> + repo);
> + result = 1;
> + goto cleanup;
> + }
> +
> + if (finish_command(config_proc)) {
> + warning(_("failed to finish process for repo '%s'"),
> + repo);
> + result = 1;
> + }
> +
> +cleanup:
> + free(config_proc);
> + return result;
> +}
> +
> static struct string_list arg_repos = STRING_LIST_INIT_DUP;
>
> static int arg_repos_append(const struct option *opt,
> @@ -54,9 +218,20 @@ static int load_active_repos(struct string_list *repos)
>
> static int run_job(const char *job, const char *repo)
> {
> + int result;
> struct argv_array cmd = ARGV_ARRAY_INIT;
> + timestamp_t now = time(NULL);
> + timestamp_t last_run = get_last_run(job, repo);
> + timestamp_t interval = get_interval(job, repo);
> +
> + if (last_run + interval > now)
> + return 0;
> +
> argv_array_pushl(&cmd, "-C", repo, "run-job", job, NULL);
> - return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> + result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +
> + set_last_run(job, repo, now);
> + return result;
> }
>
> static int run_job_loop_step(struct string_list *list)
>
@@ -0,0 +1,13 @@
job.pack-files.batchSize::
On the Git mailing list, Phillip Wood wrote (reply to this):
Hi Stolee
On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The 'git job-runner' command was introduced with a '--repo=<path>'
> option so a user could start a process with an explicit list of
> repos. However, it is more likely that we will want 'git
> job-runner' to start without any arguments and discover the set of
> repos from another source.
One thought that occurred to me was that it might be convenient to put
'git job-runner $HOME &' in my .bashrc to have it start on login and
find all the repositories under $HOME without me having to list each one
(and remember to update the list each time I clone a new repository).
There are a couple of issues with this
1 - We only want to run the jobs once per repo, not for each worktree.
That should be easy enough to implement by checking if we're in the main
worktree.
2 - If I start several shells I only want one instance of 'git
job-runner' to start. One way to handle that would be for the runner to
hold a lock file in the directory it's given and quit if the lock is
already taken. It should probably walk up the directory hierarchy to
check for lock files as well in case I try to start another job-runner
instance in a subdirectory.
Best Wishes
Phillip
> A natural source is the Git config. The 'git job-runner' does not
> need to run in a Git repository, but the config could be located in
> the global or system config.
>
> Create the job.repo config setting as a multi-valued config option.
>
> Read all values for job.repo whenever triggering an iteration of
> the job loop. This allows a user to add or remove repos dynamically
> without restarting the job-runner.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> Documentation/config/job.txt | 7 +++++++
> builtin/job-runner.c | 20 ++++++++++++++++++++
> 2 files changed, 27 insertions(+)
>
> diff --git a/Documentation/config/job.txt b/Documentation/config/job.txt
> index efdb76afad3..2dfed3935fa 100644
> --- a/Documentation/config/job.txt
> +++ b/Documentation/config/job.txt
> @@ -4,3 +4,10 @@ job.pack-files.batchSize::
> part of `git run-job pack-files`. If not specified, then a
> dynamic size calculation is run. See linkgit:git-run-job[1]
> for more details.
> +
> +job.repo::
> + Store an absolute path to a repository that wants background
> + jobs run by `git job-runner`. This is a multi-valued config
> + option, but it must be stored in a config file seen by the
> + background runner. This may be the global or system config
> + depending on how `git job-runner` is launched on your system.
> diff --git a/builtin/job-runner.c b/builtin/job-runner.c
> index 135288bcaae..bed401f94bf 100644
> --- a/builtin/job-runner.c
> +++ b/builtin/job-runner.c
> @@ -20,6 +20,9 @@ static int arg_repos_append(const struct option *opt,
>
> static int load_active_repos(struct string_list *repos)
> {
> + struct string_list_item *item;
> + const struct string_list *config_repos;
> +
> if (arg_repos.nr) {
> struct string_list_item *item;
> for (item = arg_repos.items;
> @@ -29,6 +32,23 @@ static int load_active_repos(struct string_list *repos)
> return 0;
> }
>
> + config_repos = git_config_get_value_multi("job.repo");
> +
> + for (item = config_repos->items;
> + item && item < config_repos->items + config_repos->nr;
> + item++) {
> + DIR *dir = opendir(item->string);
> +
> + if (!dir)
> + continue;
> +
> + closedir(dir);
> + string_list_append(repos, item->string);
> + }
> +
> + string_list_sort(repos);
> + string_list_remove_duplicates(repos, 0);
> +
> return 0;
> }
>
>
On the Git mailing list, Derrick Stolee wrote (reply to this):
On 4/5/2020 11:41 AM, Phillip Wood wrote:
> Hi Stolee
>
> On 03/04/2020 21:48, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> The 'git job-runner' command was introduced with a '--repo=<path>'
>> option so a user could start a process with an explicit list of
>> repos. However, it is more likely that we will want 'git
>> job-runner' to start without any arguments and discover the set of
>> repos from another source.
>
> One thought that occurred to me was that it might be convenient to put
> 'git job-runner $HOME &' in my .bashrc to have it start on login and
> find all the repositories under $HOME without me having to list each
> one (and remember to update the list each time I clone a new repository).
> There are a couple of issues with this
> 1 - We only want to run the jobs once per repo, not for each worktree.
> That should be easy enough to implement by checking if we're in the
> main worktree.
This idea of "please check all (immediate) directories under <dir> for
repositories to run maintenance" is an interesting one. It certainly
could be added later.
Your concern about worktrees is actually not a big deal. Since we check
the config for the "last run" of a job on a repo, we will not run the
same job immediately on a worktree after running it on the original
(unless the interval times are incredibly short).
> 2 - If I start several shells I only want one instance of 'git
> job-runner' to start. One way to handle that would be for the runner
> to hold a lock file in the directory it's given and quit if the lock
> is already taken. It should probably walk up the directory hierarchy
> to check for lock files as well in case I try to start another
> job-runner instance in a subdirectory.
This is an interesting idea. My gut reaction is that we don't want
to prevent users from running multiple processes if they want to. But
if they run the process twice in the same directory then they are
likely to be running on the same set of repos (barring "-c jobs.repo"
or equivalent).
Again, my hope is that we would solve this problem by having a cross-
platform "job service" that makes the user setup of editing .bashrc
irrelevant. Not only is that idea getting pushback, but we should also
allow expert users the ability to customize these steps to their liking.
Thanks,
-Stolee
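For illustration, the config-driven setup discussed above might look like this on a developer machine (a sketch using the multi-valued job.repo setting and the --daemonize option from this series; the paths and log location are hypothetical):

  # register two repositories for background maintenance
  git config --global --add job.repo /home/me/src/project-a
  git config --global --add job.repo /home/me/src/project-b

  # start a single runner that loops over the registered repos
  GIT_TRACE2_PERF=/tmp/job-runner.log git job-runner --daemonize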
@@ -0,0 +1,64 @@
git-run-job(1)
On the Git mailing list, Junio C Hamano wrote (reply to this):
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 1. One downside of the refs/hidden pattern is that 'git log' will
> decorate commits with twice as many refs if they appear at a
> remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
> there an easy way to exclude a refspace from decorations?
I do not think there is, but it makes sense to teach the decoration
machinery to either use only certain refs hierarchies or use all
hierarchies except for certain ones; if we want to make sure we
won't break existing workflows, we should by default use all the
refs we currently use and nothing else, but over time we probably
would want to migrate the default to cover only the local and
remote-tracking branches and tags (and at that point, refs/hidden
would be outside the decoration source).
By the way, I have a moderately strong opinion against the use of
"refs/hidden" for the purpose of "prefetch in advance, without
disrupting refs/remotes". There may be more than one reason why
some refs want to be "hidden", and depending on the purpose, the
exact refs that one workflow (e.g. "decorate") wants to hide may be
the ones that want to be exposed.
If we rename it to "refs/prefetch/", would it make the purpose of
the hierarchy clearer without squatting on a vague (because it does
not tell why it is hidden) name "hidden" that other people might
want to use to hide their stuff for different reasons?
> Should
> we make refs/hidden/* a "special" refspace that is excluded from
> decorations?
See above.
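As a side note, decorations can already be filtered per invocation, which is one way to keep such a hierarchy out of 'git log' output today (a sketch assuming the prefetched refs live under refs/prefetch/ as suggested above):

  # decorate with everything except the prefetch hierarchy
  git log --oneline --decorate --decorate-refs-exclude='refs/prefetch/*'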
On the Git mailing list, Derrick Stolee wrote (reply to this):
On 4/5/2020 4:28 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> 1. One downside of the refs/hidden pattern is that 'git log' will
>> decorate commits with twice as many refs if they appear at a
>> remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
>> there an easy way to exclude a refspace from decorations?
>
> I do not think there is, but it makes sense to teach the decoration
> machinery to either use only certain refs hierarchies or use all
> hierarchies except for certain ones; if we want to make sure we
> won't break existing workflows, we should by default use all the
> refs we currently use and nothing else, but over time we probably
> would want to migrate the default to cover only the local and
> remote-tracking branches and tags (and at that point, refs/hidden
> would be outside the decoration source).
I'll see what I can do about adding config to remove some refs from
decorations. That is immediately useful for Scalar users, too.
> By the way, I have a moderately strong opinion against the use of
> "refs/hidden" for the purpose of "prefetch in advance, without
> disrupting refs/remotes". There may be more than one reason why
> some refs want to be "hidden", and depending on the purpose, the
> exact refs that one workflow (e.g. "decorate") wants to hide may be
> the ones that want to be exposed.
>
> If we rename it to "refs/prefetch/", would it make the purpose of
> the hierarchy clearer without squatting on a vague (because it does
> not tell why it is hidden) name "hidden" that other people might
> want to use to hide their stuff for different reasons?
I like "refs/prefetch". That's more descriptive.
>> Should
>> we make refs/hidden/* a "special" refspace that is excluded from
>> decorations?
>
> See above.
Thanks,
-Stolee
@@ -0,0 +1,76 @@
git-run-job(1)
On the Git mailing list, Junio C Hamano wrote (reply to this):
"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> Create a 'loose-objects' job for the 'git run-job' command. This
> helps clean up loose objects without disrupting concurrent Git
> commands using the following sequence of events:
This is exactly the kind of thing I would have expected you'd be
adding to "git repack".
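For context, the sequence that the 'loose-objects' job runs is described later in this series (prune already-packed loose objects, then pack a batch of up to 50,000 remaining loose objects). A rough shell equivalent, assuming it is run from the repository root, would be:

  # step 1: delete loose objects that are already present in pack-files
  git prune-packed --quiet

  # step 2: pack a batch of the remaining loose objects into one new pack-file
  find .git/objects/?? -type f |
  sed -e 's|.*objects/||' -e 's|/||' |
  head -n 50000 |
  git pack-objects --quiet .git/objects/pack/loose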
we can alias …
@sluongng if this was intended as a serious suggestion, please follow the advice from this comment on how to reply on the Git mailing list.
@dscho thanks, I wrote a quick reply on this: https://public-inbox.org/git/CAL3xRKew_RHbPbp0qSa7WcDbaMmMWWmBi_nvPbmKaSpVDJM08g@mail.gmail.com/T/#u
@@ -0,0 +1,53 @@
git-run-job(1)
On the Git mailing list, Josh Steadmon wrote (reply to this):
Thanks for this series, and sorry for the delayed (and partial) review.
I hope comments here are still useful. I don't have any comments on the
larger design that other people haven't already made, but I do have a
few small questions and suggestions, inline below.
On 2020.04.03 20:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The first job to implement in the 'git run-job' command is the
> 'commit-graph' job. It is based on the sequence of events in the
> 'commit-graph' job in Scalar [1]. This sequence is as follows:
>
> 1. git commit-graph write --reachable --split
> 2. git commit-graph verify --shallow
> 3. If the verify succeeds, stop.
> 4. Delete the commit-graph-chain file.
> 5. git commit-graph write --reachable --split
>
> By writing an incremental commit-graph file using the "--split"
> option we minimize the disruption from this operation. The default
> behavior is to merge layers until the new "top" layer is less than
> half the size of the layer below. This provides quick writes "most"
> of the time, with the longer writes following a power law
> distribution.
>
> Most importantly, concurrent Git processes only look at the
> commit-graph-chain file for a very short amount of time, so they
> will very likely not be holding a handle to the file when we try
> to replace it. (This only matters on Windows.)
>
> If a concurrent process reads the old commit-graph-chain file, but
> our job expires some of the .graph files before they can be read,
> then those processes will see a warning message (but not fail).
> This could be avoided by a future update to use the --expire-time
> argument when writing the commit-graph.
>
> By using 'git commit-graph verify --shallow' we can ensure that
> the file we just wrote is valid. This is an extra safety precaution
> that is faster than our 'write' subcommand. In the rare situation
> that the newest layer of the commit-graph is corrupt, we can "fix"
> the corruption by deleting the commit-graph-chain file and rewrite
> the full commit-graph as a new one-layer commit graph. This does
> not completely prevent _that_ file from being corrupt, but it does
> recompute the commit-graph by parsing commits from the object
> database. In our use of this step in Scalar and VFS for Git, we
> have only seen this issue arise because our microsoft/git fork
> reverted 43d3561 ("commit-graph write: don't die if the existing
> graph is corrupt" 2019-03-25) for a while to keep commit-graph
> writes very fast. We dropped the revert when updating to v2.23.0.
> The verify still has potential for catching corrupt data across
> the layer boundary: if the new file has commit X with parent Y
> in an old file but the commit ID for Y in the old file had a
> bitswap, then we will notice that in the 'verify' command.
>
> [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/CommitGraphStep.cs
>
> RFC QUESTIONS:
>
> 1. Are links like [1] helpful?
Yes, I appreciate being able to reference these. However, given that
these will show up in commit logs for ~forever, and ongoing work might
happen on Scalar, perhaps the links should be pinned to a specific
revision?
>
> 2. Can anyone think of a way to test the rewrite fallback?
> It requires corrupting the latest file between two steps of
> this one command, so that is a tricky spot to inject a fault.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> Documentation/git-run-job.txt | 21 ++++++++++--
> builtin/run-job.c | 60 ++++++++++++++++++++++++++++++++++-
> commit-graph.c | 2 +-
> commit-graph.h | 1 +
> t/t7900-run-job.sh | 32 +++++++++++++++++++
> 5 files changed, 112 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
> index 0627b3ed259..8bf2762d650 100644
> --- a/Documentation/git-run-job.txt
> +++ b/Documentation/git-run-job.txt
> @@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
> SYNOPSIS
> --------
> [verse]
> -'git run-job <job-name>'
> +'git run-job commit-graph'
>
>
> DESCRIPTION
> @@ -28,7 +28,24 @@ BEHAVIOR MAY BE ALTERED IN THE FUTURE.
> JOBS
> ----
>
> -TBD
> +'commit-graph'::
> +
> +The `commit-graph` job updates the `commit-graph` files incrementally,
> +then verifies that the written data is correct. If the new layer has an
> +issue, then the chain file is removed and the `commit-graph` is
> +rewritten from scratch.
> ++
> +The verification only checks the top layer of the `commit-graph` chain.
> +If the incremental write merged the new commits with at least one
> +existing layer, then there is potential for on-disk corruption being
> +carried forward into the new file. This will be noticed and the new
> +commit-graph file will be clean as Git reparses the commit data from
> +the object database.
> ++
> +The incremental write is safe to run alongside concurrent Git processes
> +since it will not expire `.graph` files that were in the previous
> +`commit-graph-chain` file. They will be deleted by a later run based on
> +the expiration delay.
>
>
> GIT
> diff --git a/builtin/run-job.c b/builtin/run-job.c
> index 2c78d053aa4..dd7709952d3 100644
> --- a/builtin/run-job.c
> +++ b/builtin/run-job.c
> @@ -1,12 +1,65 @@
> #include "builtin.h"
> #include "config.h"
> +#include "commit-graph.h"
> +#include "object-store.h"
> #include "parse-options.h"
> +#include "repository.h"
> +#include "run-command.h"
>
> static char const * const builtin_run_job_usage[] = {
> - N_("git run-job"),
> + N_("git run-job commit-graph"),
> NULL
> };
>
> +static int run_write_commit_graph(void)
> +{
> + struct argv_array cmd = ARGV_ARRAY_INIT;
> +
> + argv_array_pushl(&cmd, "commit-graph", "write",
> + "--split", "--reachable", "--no-progress",
> + NULL);
> +
> + return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int run_verify_commit_graph(void)
> +{
> + struct argv_array cmd = ARGV_ARRAY_INIT;
> +
> + argv_array_pushl(&cmd, "commit-graph", "verify",
> + "--shallow", "--no-progress", NULL);
> +
> + return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int run_commit_graph_job(void)
> +{
> + char *chain_path;
> +
> + if (run_write_commit_graph()) {
> + error(_("failed to write commit-graph"));
> + return 1;
> + }
> +
> + if (!run_verify_commit_graph())
> + return 0;
> +
> + warning(_("commit-graph verify caught error, rewriting"));
> +
> + chain_path = get_chain_filename(the_repository->objects->odb);
Should we avoid new uses of `the_repository` and take a repo pointer
argument here instead?
> + if (unlink(chain_path)) {
> + UNLEAK(chain_path);
> + die(_("failed to remove commit-graph at %s"), chain_path);
> + }
> + free(chain_path);
> +
> + if (!run_write_commit_graph())
> + return 0;
Is there a reason why we don't verify the second write attempt?
> +
> + error(_("failed to rewrite commit-graph"));
> + return 1;
> +}
> +
> int cmd_run_job(int argc, const char **argv, const char *prefix)
> {
> static struct option builtin_run_job_options[] = {
> @@ -23,6 +76,11 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
> builtin_run_job_usage,
> PARSE_OPT_KEEP_UNKNOWN);
>
> + if (argc > 0) {
> + if (!strcmp(argv[0], "commit-graph"))
> + return run_commit_graph_job();
> + }
> +
> usage_with_options(builtin_run_job_usage,
> builtin_run_job_options);
> }
> diff --git a/commit-graph.c b/commit-graph.c
> index f013a84e294..6867f92d04a 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -56,7 +56,7 @@ static char *get_split_graph_filename(struct object_directory *odb,
> oid_hex);
> }
>
> -static char *get_chain_filename(struct object_directory *odb)
> +char *get_chain_filename(struct object_directory *odb)
> {
> return xstrfmt("%s/info/commit-graphs/commit-graph-chain", odb->path);
> }
> diff --git a/commit-graph.h b/commit-graph.h
> index e87a6f63600..8b7bd5a6cb1 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -13,6 +13,7 @@
> struct commit;
>
> char *get_commit_graph_filename(struct object_directory *odb);
> +char *get_chain_filename(struct object_directory *odb);
> int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
>
> /*
> diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
> index 1eac80b7ed3..18b9bd26b3a 100755
> --- a/t/t7900-run-job.sh
> +++ b/t/t7900-run-job.sh
> @@ -5,6 +5,8 @@ test_description='git run-job
> Testing the background jobs, in the foreground
> '
>
> +GIT_TEST_COMMIT_GRAPH=0
> +
> . ./test-lib.sh
>
> test_expect_success 'help text' '
> @@ -12,4 +14,34 @@ test_expect_success 'help text' '
> test_i18ngrep "usage: git run-job" err
> '
>
> +test_expect_success 'commit-graph job' '
> + git init server &&
> + (
> + cd server &&
> + chain=.git/objects/info/commit-graphs/commit-graph-chain &&
> + git checkout -b master &&
> + for i in $(test_seq 1 15)
> + do
> + test_commit $i || return 1
> + done &&
> + git run-job commit-graph 2>../err &&
> + test_must_be_empty ../err &&
> + test_line_count = 1 $chain &&
> + for i in $(test_seq 16 19)
> + do
> + test_commit $i || return 1
> + done &&
> + git run-job commit-graph 2>../err &&
> + test_must_be_empty ../err &&
> + test_line_count = 2 $chain &&
> + for i in $(test_seq 20 23)
> + do
> + test_commit $i || return 1
> + done &&
> + git run-job commit-graph 2>../err &&
> + test_must_be_empty ../err &&
> + test_line_count = 1 $chain
> + )
> +'
> +
> test_done
> --
> gitgitgadget
>
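Taken together, the job reviewed above boils down to the following sequence (a sketch of the same commands the builtin spawns; the chain path assumes the default object directory layout):

  # incremental write, then a cheap validity check of the newest layer
  git commit-graph write --reachable --split --no-progress
  if ! git commit-graph verify --shallow --no-progress
  then
          # fallback: drop the chain file and rewrite from the object database
          rm -f .git/objects/info/commit-graphs/commit-graph-chain
          git commit-graph write --reachable --split --no-progress
  fi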
@@ -0,0 +1,64 @@
git-run-job(1)
On the Git mailing list, Josh Steadmon wrote (reply to this):
On 2020.04.03 20:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> When working with very large repositories, an incremental 'git fetch'
> command can download a large amount of data. If there are many other
> users pushing to a common repo, then this data can rival the initial
> pack-file size of a 'git clone' of a medium-size repo.
>
> Users may want to keep the data on their local repos as close as
> possible to the data on the remote repos by fetching periodically in
> the background. This can break up a large daily fetch into several
> smaller hourly fetches.
>
> However, if we simply ran 'git fetch <remote>' in the background,
> then the user running a foreground 'git fetch <remote>' would lose
> some important feedback when a new branch appears or an existing
> branch updates. This is especially true if a remote branch is
> force-updated and this isn't noticed by the user because it occurred
> in the background. Further, the functionality of 'git push
> --force-with-lease' becomes suspect.
It might be useful here to clarify the interaction between background
fetching & the expectations around --force-with-lease.
> When running 'git fetch <remote> <options>' in the background, use
> the following options for careful updating:
>
> 1. --no-tags prevents getting a new tag when a user wants to see
> the new tags appear in their foreground fetches.
>
> 2. --refmap= removes the configured refspec which usually updates
> refs/remotes/<remote>/* with the refs advertised by the remote.
>
> 3. By adding a new refspec "+refs/heads/*:refs/hidden/<remote>/*"
> we can ensure that we actually load the new values somewhere in
> our refspace while not updating refs/heads or refs/remotes. By
> storing these refs here, the commit-graph job will update the
> commit-graph with the commits from these hidden refs.
>
> 4. --prune will delete the refs/hidden/<remote> refs that no
> longer appear on the remote.
>
> We've been using this step as a critical background job in Scalar
> [1] (and VFS for Git). This solved a pain point that was showing up
> in user reports: fetching was a pain! Users do not like waiting to
> download the data that was created while they were away from their
> machines. After implementing background fetch, the foreground fetch
> commands sped up significantly because they mostly just update refs
> and download a small amount of new data. The effect is especially
> dramatic when paired with --no-show-forced-updates (through
> fetch.showForcedUpdates=false).
>
> [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/FetchStep.cs
>
> RFC QUESTIONS:
>
> 1. One downside of the refs/hidden pattern is that 'git log' will
> decorate commits with twice as many refs if they appear at a
> remote ref (<remote>/<ref> _and_ refs/hidden/<remote>/<ref>). Is
> there an easy way to exclude a refspace from decorations? Should
> we make refs/hidden/* a "special" refspace that is excluded from
> decorations?
>
> 2. This feature is designed for a desktop machine or equivalent
> that has a permanent wired network connection, and the machine
> stays on while the user is not present. For things like laptops
> with inconsistent WiFi connections (that may be metered) the
> feature can use the less stable connection more than the user
> wants. Of course, this feature is opt-in for Git, but in Scalar
> we have a "scalar pause" command [2] that pauses all maintenance
> for some amount of time. We should consider a similar mechanism
> for Git, but for the point of this series the user needs to set
> up the "background" part of these jobs manually.
>
> [2] https://github.com/microsoft/scalar/blob/master/Scalar/CommandLine/PauseVerb.cs
> [3] https://github.com/microsoft/scalar/blob/master/docs/advanced.md#controlling-background-maintenance
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> Documentation/git-run-job.txt | 13 ++++-
> builtin/run-job.c | 89 ++++++++++++++++++++++++++++++++++-
> t/t7900-run-job.sh | 22 +++++++++
> 3 files changed, 122 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
> index 8bf2762d650..eb92e891915 100644
> --- a/Documentation/git-run-job.txt
> +++ b/Documentation/git-run-job.txt
> @@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
> SYNOPSIS
> --------
> [verse]
> -'git run-job commit-graph'
> +'git run-job (commit-graph|fetch)'
>
>
> DESCRIPTION
> @@ -47,6 +47,17 @@ since it will not expire `.graph` files that were in the previous
> `commit-graph-chain` file. They will be deleted by a later run based on
> the expiration delay.
>
> +'fetch'::
> +
> +The `fetch` job updates the object directory with the latest objects
> +from all registered remotes. For each remote, a `git fetch` command is
> +run. The refmap is custom to avoid updating local or remote branches
> +(those in `refs/heads` or `refs/remotes`). Instead, the remote refs are
> +stored in `refs/hidden/<remote>/`. Also, no tags are updated.
> ++
> +This means that foreground fetches are still required to update the
> +remote refs, but the user is notified when the branches and tags are
> +updated on the remote.
>
> GIT
> ---
> diff --git a/builtin/run-job.c b/builtin/run-job.c
> index dd7709952d3..e59056b2918 100644
> --- a/builtin/run-job.c
> +++ b/builtin/run-job.c
> @@ -7,7 +7,7 @@
> #include "run-command.h"
>
> static char const * const builtin_run_job_usage[] = {
> - N_("git run-job commit-graph"),
> + N_("git run-job (commit-graph|fetch)"),
> NULL
> };
>
> @@ -60,6 +60,91 @@ static int run_commit_graph_job(void)
> return 1;
> }
>
> +static int fetch_remote(const char *remote)
> +{
> + int result;
> + struct argv_array cmd = ARGV_ARRAY_INIT;
> + struct strbuf refmap = STRBUF_INIT;
> +
> + argv_array_pushl(&cmd, "fetch", remote, "--quiet", "--prune",
> + "--no-tags", "--refmap=", NULL);
> +
> + strbuf_addf(&refmap, "+refs/heads/*:refs/hidden/%s/*", remote);
> + argv_array_push(&cmd, refmap.buf);
> +
> + result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +
> + strbuf_release(&refmap);
> + return result;
> +}
> +
> +static int fill_remotes(struct string_list *remotes)
> +{
> + int result = 0;
> + FILE *proc_out;
> + struct strbuf line = STRBUF_INIT;
> + struct child_process *remote_proc = xmalloc(sizeof(*remote_proc));
Shouldn't remote_proc just go on the stack here rather than getting
malloced?
> +
> + child_process_init(remote_proc);
> +
> + argv_array_pushl(&remote_proc->args, "git", "remote", NULL);
> +
> + remote_proc->out = -1;
> +
> + if (start_command(remote_proc)) {
> + error(_("failed to start 'git remote' process"));
> + result = 1;
> + goto cleanup;
> + }
> +
> + proc_out = xfdopen(remote_proc->out, "r");
> +
> + /* if there is no line, leave the value as given */
> + while (!strbuf_getline(&line, proc_out))
> + string_list_append(remotes, line.buf);
> +
> + strbuf_release(&line);
> +
> + fclose(proc_out);
> +
> + if (finish_command(remote_proc)) {
> + error(_("failed to finish 'git remote' process"));
> + result = 1;
> + }
> +
> +cleanup:
> + free(remote_proc);
> + return result;
> +}
> +
> +static int run_fetch_job(void)
> +{
> + int result = 0;
> + struct string_list_item *item;
> + struct string_list remotes = STRING_LIST_INIT_DUP;
> +
> + if (fill_remotes(&remotes)) {
> + error(_("failed to fill remotes"));
> + result = 1;
> + goto cleanup;
> + }
> +
> + /*
> + * Do not modify the result based on the success of the 'fetch'
> + * operation, as a loss of network could cause 'fetch' to fail
> + * quickly. We do not want that to stop the rest of our
> + * background operations.
> + */
> + for (item = remotes.items;
> + item && item < remotes.items + remotes.nr;
> + item++)
> + fetch_remote(item->string);
> +
> +cleanup:
> + string_list_clear(&remotes, 0);
> + return result;
> +}
> +
> int cmd_run_job(int argc, const char **argv, const char *prefix)
> {
> static struct option builtin_run_job_options[] = {
> @@ -79,6 +164,8 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
> if (argc > 0) {
> if (!strcmp(argv[0], "commit-graph"))
> return run_commit_graph_job();
> + if (!strcmp(argv[0], "fetch"))
> + return run_fetch_job();
> }
>
> usage_with_options(builtin_run_job_usage,
> diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
> index 18b9bd26b3a..d3faeba135b 100755
> --- a/t/t7900-run-job.sh
> +++ b/t/t7900-run-job.sh
> @@ -44,4 +44,26 @@ test_expect_success 'commit-graph job' '
> )
> '
>
> +test_expect_success 'fetch job' '
> + git clone "file://$(pwd)/server" client &&
> +
> + # Before fetching, build a client commit-graph
> + git -C client run-job commit-graph &&
> + chain=client/.git/objects/info/commit-graphs/commit-graph-chain &&
> + test_line_count = 1 $chain &&
> +
> + git -C client branch -v --remotes >before-refs &&
> + test_commit -C server 24 &&
> +
> + git -C client run-job fetch &&
> + git -C client branch -v --remotes >after-refs &&
> + test_cmp before-refs after-refs &&
> + test_cmp server/.git/refs/heads/master \
> + client/.git/refs/hidden/origin/master &&
> +
> + # the hidden ref should trigger a new layer in the commit-graph
> + git -C client run-job commit-graph &&
> + test_line_count = 2 $chain
> +'
> +
> test_done
> --
> gitgitgadget
>
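Concretely, the per-remote fetch that this patch performs is equivalent to something like the following (a sketch with 'origin' standing in for each configured remote):

  # update objects and hidden refs only; leave refs/remotes/* and tags alone
  git fetch origin --quiet --prune --no-tags --refmap= \
          "+refs/heads/*:refs/hidden/origin/*"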
@@ -0,0 +1,92 @@
git-run-job(1)
On the Git mailing list, Josh Steadmon wrote (reply to this):
On 2020.04.03 20:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The previous change cleaned up loose objects using the
> 'loose-objects' that can be run safely in the background. Add a
> similar job that performs similar cleanups for pack-files.
>
> One issue with running 'git repack' is that it is designed to
> repack all pack-files into a single pack-file. While this is the
> most space-efficient way to store object data, it is not time or
> memory efficient. This becomes extremely important if the repo is
> so large that a user struggles to store two copies of the pack on
> their disk.
>
> Instead, perform an "incremental" repack by collecting a few small
> pack-files into a new pack-file. The multi-pack-index facilitates
> this process ever since 'git multi-pack-index expire' was added in
> 19575c7 (multi-pack-index: implement 'expire' subcommand,
> 2019-06-10) and 'git multi-pack-index repack' was added in ce1e4a1
> (midx: implement midx_repack(), 2019-06-10).
>
> The 'pack-files' job runs the following steps:
>
> 1. 'git multi-pack-index write' creates a multi-pack-index file if
> one did not exist, and otherwise will update the multi-pack-index
> with any new pack-files that appeared since the last write. This
> is particularly relevant with the background fetch job.
>
> When the multi-pack-index sees two copies of the same object, it
> stores the offset data into the newer pack-file. This means that
> some old pack-files could become "unreferenced" which I will use
> to mean "a pack-file that is in the pack-file list of the
> multi-pack-index but none of the objects in the multi-pack-index
> reference a location inside that pack-file."
>
> 2. 'git multi-pack-index expire' deletes any unreferenced pack-files
> and updaes the multi-pack-index to drop those pack-files from the
Typo: updaes -> updates
> list. This is safe to do as concurrent Git processes will see the
> multi-pack-index and not open those packs when looking for object
> contents. (Similar to the 'loose-objects' job, there are some Git
Is it still safe for concurrent processes if the repo did not have a
multi-pack-index when the first process started?
> commands that open pack-files regardless of the multi-pack-index,
> but they are rarely used. Further, a user that self-selects to
> use background operations would likely refrain from using those
> commands.)
>
> 3. 'git multi-pack-index repack --bacth-size=<size>' collects a set
Typo: bacth-size -> batch-size
> of pack-files that are listed in the multi-pack-index and creates
> a new pack-file containing the objects whose offsets are listed
> by the multi-pack-index to be in those pack-files. The set of pack-
> files is selected greedily by sorting the pack-files by modified
> time and adding a pack-file to the set if its "expected size" is
> smaller than the batch size until the total expected size of the
> selected pack-files is at least the batch size. The "expected
> size" is calculated by taking the size of the pack-file divided
> by the number of objects in the pack-file and multiplied by the
> number of objects from the multi-pack-index with offset in that
> pack-file. The expected size approximats how much data from that
Typo: approximats -> approximates
> pack-file will contribute to the resulting pack-file size. The
> intention is that the resulting pack-file will be close in size
> to the provided batch size.
>
> The next run of the pack-files job will delete these repacked
> pack-files during the 'expire' step.
>
> In this version, the batch size is set to "0" which ignores the
> size restrictions when selecting the pack-files. It instead
> selects all pack-files and repacks all packed objects into a
> single pack-file. This will be updated in the next change, but
> it requires doing some calculations that are better isolated to
> a separate change.
>
> Each of the above steps update the multi-pack-index file. After
> each step, we verify the new multi-pack-index. If the new
> multi-pack-index is corrupt, then delete the multi-pack-index,
> rewrite it from scratch, and stop doing the later steps of the
> job. This is intended to be an extra-safe check without leaving
> a repo with many pack-files without a multi-pack-index.
>
> These steps are based on a similar background maintenance step in
> Scalar (and VFS for Git) [1]. This was incredibly effective for
> users of the Windows OS repository. After using the same VFS for Git
> repository for over a year, some users had _thousands_ of pack-files
> that combined to up to 250 GB of data. We noticed a few users were
> running into the open file descriptor limits (due in part to a bug
> in the multi-pack-index fixed by af96fe3392 (midx: add packs to
> packed_git linked list, 2019-04-29)).
>
> These pack-files were mostly small since they contained the commits
> and trees that were pushed to the origin in a given hour. The GVFS
> protocol includes a "prefetch" step that asks for pre-computed pack-
> files containing commits and trees by timestamp. These pack-files
> were grouped into "daily" pack-files once a day for up to 30 days.
> If a user did not request prefetch packs for over 30 days, then they
> would get the entire history of commits and trees in a new, large
> pack-file. This led to a large number of pack-files that had poor
> delta compression.
>
> By running this pack-file maintenance step once per day, these repos
> with thousands of packs spanning 200+ GB dropped to dozens of pack-
> files spanning 30-50 GB. This was done all without removing objects
> from the system and using a constant batch size of two gigabytes.
> Once the work was done to reduce the pack-files to small sizes, the
> batch size of two gigabytes means that not every run triggers a
> repack operation, so the following run will not expire a pack-file.
> This has kept these repos in a "clean" state.
>
> [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
> Documentation/git-run-job.txt | 18 ++++++-
> builtin/run-job.c | 90 ++++++++++++++++++++++++++++++++++-
> midx.c | 2 +-
> midx.h | 1 +
> t/t7900-run-job.sh | 39 +++++++++++++++
> 5 files changed, 147 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
> index 43ca1160b5a..108ed25b8bd 100644
> --- a/Documentation/git-run-job.txt
> +++ b/Documentation/git-run-job.txt
> @@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
> SYNOPSIS
> --------
> [verse]
> -'git run-job (commit-graph|fetch|loose-objects)'
> +'git run-job (commit-graph|fetch|loose-objects|pack-files)'
>
>
> DESCRIPTION
> @@ -71,6 +71,22 @@ a batch of loose objects. The batch size is limited to 50 thousand
> objects to prevent the job from taking too long on a repository with
> many loose objects.
>
> +'pack-files'::
> +
> +The `pack-files` job incrementally repacks the object directory using
> +the `multi-pack-index` feature. In order to prevent race conditions with
> +concurrent Git commands, it follows a two-step process. First, it
> +deletes any pack-files included in the `multi-pack-index` where none of
> +the objects in the `multi-pack-index` reference those pack-files; this
> +only happens if all objects in the pack-file are also stored in a newer
> +pack-file. Second, it selects a group of pack-files whose "expected
> +size" is below the batch size until the group has total expected size at
> +least the batch size; see the `--batch-size` option for the `repack`
> +subcommand in linkgit:git-multi-pack-index[1]. The default batch-size is
> +zero, which is a special case that attempts to repack all pack-files
> +into a single pack-file.
> +
> +
> GIT
> ---
> Part of the linkgit:git[1] suite
> diff --git a/builtin/run-job.c b/builtin/run-job.c
> index cecf9058c51..d3543f7ccb9 100644
> --- a/builtin/run-job.c
> +++ b/builtin/run-job.c
> @@ -1,13 +1,14 @@
> #include "builtin.h"
> #include "config.h"
> #include "commit-graph.h"
> +#include "midx.h"
> #include "object-store.h"
> #include "parse-options.h"
> #include "repository.h"
> #include "run-command.h"
>
> static char const * const builtin_run_job_usage[] = {
> - N_("git run-job (commit-graph|fetch|loose-objects)"),
> + N_("git run-job (commit-graph|fetch|loose-objects|pack-files)"),
> NULL
> };
>
> @@ -238,6 +239,91 @@ static int run_loose_objects_job(void)
> return prune_packed() || pack_loose();
> }
>
> +static int multi_pack_index_write(void)
> +{
> + struct argv_array cmd = ARGV_ARRAY_INIT;
> + argv_array_pushl(&cmd, "multi-pack-index", "write",
> + "--no-progress", NULL);
> + return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int rewrite_multi_pack_index(void)
> +{
> + char *midx_name = get_midx_filename(the_repository->objects->odb->path);
> +
> + unlink(midx_name);
> + free(midx_name);
> +
> + if (multi_pack_index_write()) {
> + error(_("failed to rewrite multi-pack-index"));
> + return 1;
> + }
> +
> + return 0;
> +}
> +
> +static int multi_pack_index_verify(void)
> +{
> + struct argv_array cmd = ARGV_ARRAY_INIT;
> + argv_array_pushl(&cmd, "multi-pack-index", "verify",
> + "--no-progress", NULL);
> + return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int multi_pack_index_expire(void)
> +{
> + struct argv_array cmd = ARGV_ARRAY_INIT;
> + argv_array_pushl(&cmd, "multi-pack-index", "expire",
> + "--no-progress", NULL);
> + return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int multi_pack_index_repack(void)
> +{
> + int result;
> + struct argv_array cmd = ARGV_ARRAY_INIT;
> + argv_array_pushl(&cmd, "multi-pack-index", "repack",
> + "--no-progress", "--batch-size=0", NULL);
> + result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +
> + if (result && multi_pack_index_verify()) {
> + warning(_("multi-pack-index verify failed after repack"));
> + result = rewrite_multi_pack_index();
> + }
> +
> + return result;
> +}
> +
> +static int run_pack_files_job(void)
> +{
> + if (multi_pack_index_write()) {
> + error(_("failed to write multi-pack-index"));
> + return 1;
> + }
> +
> + if (multi_pack_index_verify()) {
> + warning(_("multi-pack-index verify failed after initial write"));
> + return rewrite_multi_pack_index();
> + }
> +
> + if (multi_pack_index_expire()) {
> + error(_("multi-pack-index expire failed"));
> + return 1;
> + }
> +
> + if (multi_pack_index_verify()) {
> + warning(_("multi-pack-index verify failed after expire"));
> + return rewrite_multi_pack_index();
> + }
> +
> + if (multi_pack_index_repack()) {
> + error(_("multi-pack-index repack failed"));
> + return 1;
> + }
> +
> + return 0;
> +}
> +
> int cmd_run_job(int argc, const char **argv, const char *prefix)
> {
> static struct option builtin_run_job_options[] = {
> @@ -261,6 +347,8 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
> return run_fetch_job();
> if (!strcmp(argv[0], "loose-objects"))
> return run_loose_objects_job();
> + if (!strcmp(argv[0], "pack-files"))
> + return run_pack_files_job();
> }
>
> usage_with_options(builtin_run_job_usage,
> diff --git a/midx.c b/midx.c
> index 1527e464a7b..0f0d0a38812 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -36,7 +36,7 @@
>
> #define PACK_EXPIRED UINT_MAX
>
> -static char *get_midx_filename(const char *object_dir)
> +char *get_midx_filename(const char *object_dir)
> {
> return xstrfmt("%s/pack/multi-pack-index", object_dir);
> }
> diff --git a/midx.h b/midx.h
> index e6fa356b5ca..cf2c09dffc2 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -39,6 +39,7 @@ struct multi_pack_index {
>
> #define MIDX_PROGRESS (1 << 0)
>
> +char *get_midx_filename(const char *object_dir);
> struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local);
> int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id);
> int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
> diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
> index 41da083257b..416ba04989d 100755
> --- a/t/t7900-run-job.sh
> +++ b/t/t7900-run-job.sh
> @@ -6,6 +6,7 @@ Testing the background jobs, in the foreground
> '
>
> GIT_TEST_COMMIT_GRAPH=0
> +GIT_TEST_MULTI_PACK_INDEX=0
>
> . ./test-lib.sh
>
> @@ -93,4 +94,42 @@ test_expect_success 'loose-objects job' '
> test_cmp packs-between packs-after
> '
>
> +test_expect_success 'pack-files job' '
> + packDir=.git/objects/pack &&
> +
> + # Create three disjoint pack-files with size BIG, small, small.
> +
> + echo HEAD~2 | git -C client pack-objects --revs $packDir/test-1 &&
> +
> + test_tick &&
> + git -C client pack-objects --revs $packDir/test-2 <<-\EOF &&
> + HEAD~1
> + ^HEAD~2
> + EOF
> +
> + test_tick &&
> + git -C client pack-objects --revs $packDir/test-3 <<-\EOF &&
> + HEAD
> + ^HEAD~1
> + EOF
> +
> + rm -f client/$packDir/pack-* &&
> + rm -f client/$packDir/loose-* &&
> +
> + ls client/$packDir/*.pack >packs-before &&
> + test_line_count = 3 packs-before &&
> +
> + # the job repacks the two into a new pack, but does not
> + # delete the old ones.
> + git -C client run-job pack-files &&
> + ls client/$packDir/*.pack >packs-between &&
> + test_line_count = 4 packs-between &&
> +
> + # the job deletes the two old packs, and does not write
> + # a new one because only one pack remains.
> + git -C client run-job pack-files &&
> + ls client/.git/objects/pack/*.pack >packs-after &&
> + test_line_count = 1 packs-after
> +'
> +
> test_done
> --
> gitgitgadget
>
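For reference, the happy path of the 'pack-files' job reviewed above reduces to this sequence (a sketch; the verify step and the rewrite fallback that follow each command are omitted):

  # create or update the multi-pack-index to cover any new pack-files
  git multi-pack-index write --no-progress

  # drop pack-files whose objects are all referenced from newer packs
  git multi-pack-index expire --no-progress

  # combine a batch of small pack-files into one new pack-file
  git multi-pack-index repack --no-progress --batch-size=0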
Background maintenance. I've blogged about it [1]. I've talked about it [2]. I brought it up at the contributor summit [3]. I think it's important. It's a critical feature in Scalar and VFS for Git. It's the main reason we created the "scalar register" feature for existing repos (so the Scalar background maintenance would apply to a "vanilla" Git repo).
[1] https://devblogs.microsoft.com/devops/introducing-scalar/
[2] https://stolee.dev/docs/git-merge-2020.pdf
[3] https://lore.kernel.org/git/35FDF767-CE0C-4C51-88A8-12965CD2D4FF@jramsay.com.au/
This RFC does the "maintenance" half of "background maintenance". It creates two new builtins to assist users in scheduling their own background maintenance:
'git run-job <job-name>': This builtin will run a single instance of a maintenance job.
'git job-runner [--repo=<path>]': This builtin will run an infinite loop that executes 'git run-job' as a subcommand.
I believe that I could replace most maintenance steps in Scalar with the 'git run-job' command and all that we would lose is some logging. (The only one that would remain is the one that sets recommended config settings, but that could be changed to a one-time operation at service start-up.) I haven't tested this yet, but I plan to before sending a non-RFC version.
Of course, just because we think these maintenance operations work for our scenario does not mean that they immediately work as the "canonical" maintenance activities. I attempt to demonstrate the value of these jobs as they are implemented. My perspective is skewed towards very large repositories (2000+ full-time contributors), but even medium-size repositories (100-200 full-time contributors) can benefit from these jobs.
I've tried to split the series into logical commits. This includes each specific maintenance job being completely described in its own commit (plus maybe an extra one to allow a custom option).
Most commit messages have "RFC QUESTION(S)" for things I was unsure about.
I'm currently testing this locally by setting 'job.repo' in my global config to be a few important repos on my Linux VM, then running 'GIT_TRACE2_PERF=<path> git job-runner --daemonize' to launch a background process that logs the subcommands to '<path>'.
Here I will call out things that could definitely be improved:
process is not tested AT ALL. It's a bit tricky to test it since it shouldn't exit when things work as expected. I expect to create a--no-loop
option to stop after one iteration of the job loop. But I would like a bit more feedback on the concept before jumping into those tests. (I do test thegit run-job
builtin for each job, but not the customization through arguments and config.) Perhaps we could do some funny business about mockinggit
using--exec-path
to check that it is callinggit run-job
in the correct order (and more importantly, not calling it when certain config settings are present).The
--daemonize
option at the end is shamelessly stolen fromgit gc --daemonize
andgit daemon
, but has limited abilities on some platforms (I've tested on Linux and macOS). I have not done my research on how far this gets us to allowing users to launch this at startup or something.As I said, this is the "maintenance" "half" of "background maintenance". The "background" part is harder in my opinion because it involves creating platform-specific ways to consistently launch background processes. For example, Unix systems should have one way to
service start X
while Windows has another. macOS haslaunchd
to launch processes as users log in, which should be a good way forward. Scalar implements a Windows Service that runs as root but impersonates the latest user to log in, and it implements a macOS "service" that is running only with the current user. I expect to need to create these services myself as a follow-up, but I lack the expertise to do it (currently). If someone else has experience creating these things and wants to take over or advise that half then I would appreciate the help!I noticed late in the RFC process that I'm not clearing my argv_arrays carefully in the job-runner. This will need to be rectified and carefully checked with valgrind before merging this code. While it leaks memory very slowly, it will be important that we don't leak any memory at all since this is a long-lived process. There's also some places where I was careful to not include too much of libgit.a to help keep the memory footprint low.
Thanks,
-Stolee
Cc: peff@peff.net, jrnieder@google.com, stolee@gmail.com