forked from git/git
-
Notifications
You must be signed in to change notification settings - Fork 127
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pack-objects: introduce --exclude-delta=<pattern> option #1392
Open
adlternative
wants to merge
171
commits into
gitgitgadget:master
Choose a base branch
from
adlternative:adl/pack-object-no-try-delta
base: master
Could not load branches
Branch not found: {{ refName }}
Could not load tags
Nothing to show
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
pack-objects: introduce --exclude-delta=<pattern> option #1392
adlternative
wants to merge
171
commits into
gitgitgadget:master
from
adlternative:adl/pack-object-no-try-delta
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Even though the name of the environment variable is mentioned in "git config --help" from http.sslVerify, there is no description for it. Add one. Note that this is not a usual Boolean environment variable whose value can be yes/true/on vs no/false/off; the existence of it is enough to trigger the feature named by the variable. Signed-off-by: Junio C Hamano <gitster@pobox.com>
Many environment variables use the git_env_bool() API to parse their values, and allow the usual "true/yes/on are true, false/no/off are false. In addition non-zero numbers are true and zero is false. An empty string is also false." set of values. Mark them as such, and consistently say "true" or "false", instead of random mixes of '1', '0', 'yes', 'true', etc. in their description. Signed-off-by: Junio C Hamano <gitster@pobox.com>
This uses atoi() and checks if the result is not zero to decide what to do. Turning it into the usual Boolean environment variable to use git_env_bool() would not break those who have been using "set to 0, or set to non-zero, that can be parsed with atoi()" values, but will match the expectation of those who expected "true" to mean "yes". Signed-off-by: Junio C Hamano <gitster@pobox.com>
Get rid of unneeded "else" statements in func_by_opt(). Signed-off-by: Sergey Organov <sorganov@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Get rid of special-casing of 'suppress' in set_diff_merges(). Instead set 'merges_need_diff' flag correctly in every option handling function. Signed-off-by: Sergey Organov <sorganov@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Sergey Organov <sorganov@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The variable is consulted whenever we write the index file. Signed-off-by: Junio C Hamano <gitster@pobox.com>
The logic to handle worktree refs (worktrees/NAME/REF and main-worktree/REF) existed in two places: * ref_type() in refs.c * parse_worktree_ref() in worktree.c Collapse this logic together in one function parse_worktree_ref(): this avoids having to cross-check the result of parse_worktree_ref() and ref_type(). Introduce enum ref_worktree_type, which is slightly different from enum ref_type. The latter is a misleading name (one would think that 'ref_type' would have the symref option). Instead, enum ref_worktree_type only makes explicit how a refname relates to a worktree. From this point of view, HEAD and refs/bisect/abc are the same: they specify the current worktree implicitly. The files-backend must avoid packing refs/bisect/* and friends into packed-refs, so expose is_per_worktree_ref() separately. Signed-off-by: Han-Wen Nienhuys <hanwen@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Remove the extra space character between "tracked" and "by", which dates back to when this paragraph was originally written in cff9711 (multi-pack-index: prepare for 'expire' subcommand, 2019-06-10). Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `expire` sub-command of `git multi-pack-index` will never expire `.keep` packs, regardless of whether or not any of their objects were selected in the MIDX. This has always been the case since 19575c7 (multi-pack-index: implement 'expire' subcommand, 2019-06-10), which came after cff9711 (multi-pack-index: prepare for 'expire' subcommand, 2019-06-10), when this documentation was originally written. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `expire` sub-command unlinks any packs that are (a) contained in the MIDX, but (b) have no objects referenced by the MIDX. This sub-command ignores `.keep` packs, which remain on-disk even if they have no objects referenced by the MIDX. Cruft packs, however, aren't given the same treatment: if none of the objects contained in the cruft pack are selected from the cruft pack by the MIDX, then the cruft pack is eligible to be expired. This is less than desireable, since the cruft pack has important metadata about the individual object mtimes, which is useful to determine how quickly an object should age out of the repository when pruning. Ordinarily, we wouldn't expect the contents of a cruft pack to duplicated across non-cruft packs (and we'd expect to see the MIDX select all cruft objects from other sources even less often). But nonetheless, it is still possible to trick the `expire` sub-command into removing the `.mtimes` file in this circumstance. Teach the `expire` sub-command to ignore cruft packs in the same manner as it does `.keep` packs, in order to keep their metadata around, even when they are unreferenced by the MIDX. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `repack` sub-command of the `git multi-pack-index` builtin creates a new pack aggregating smaller packs contained in the MIDX up to some given `--batch-size`. When `--batch-size=0`, this instructs the MIDX builtin to repack everything contained in the MIDX into a single pack. In similar spirit as a previous commit, it is undesirable to repack the contents of a cruft pack in this step. Teach `repack` to ignore any cruft pack(s) when `--batch-size=0` for the same reason(s). (The case of a non-zero `--batch-size` will be handled in a subsequent commit). Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Replace a direct invocation of Git's `xcalloc()` wrapper with the `CALLOC_ARRAY()` macro instead. The latter is preferred since it is more conventional in Git's codebase, but also because it automatically picks the correct value for the record size. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The fill_included_packs_batch() routine is responsible for aggregating objects in packs with a non-zero value for the `--batch-size` option of the `git multi-pack-index repack` sub-command. Since this routine is explicitly called only when `--batch-size` is non-zero, there is no point in checking that this is the case in our loop condition. Remove the unnecessary part of this condition to avoid confusion. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Apply similar treatment with respect to cruft packs as in a few commits ago to `repack` with a non-zero `--batch-size`. Since the case of a non-zero `--batch-size` is handled separately (in `fill_included_packs_batch()` instead of `fill_included_packs_all()`), a separate fix must be applied for this case. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
These commands have no placeholders to be translated. Signed-off-by: Alex Henrie <alexhenrie24@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
After calling fsck_walk(), a tree object struct may be left in the parsed state, with the full tree contents available via tree->buffer. It's the responsibility of the caller to free these when it's done with the object to avoid having many trees allocated at once. In a regular "git fsck", we hit fsck_walk() only from fsck_obj(), which does call free_tree_buffer(). Likewise for "--connectivity-only", we see most objects via traverse_one_object(), which makes a similar call. The exception is in mark_unreachable_referents(). When using both "--connectivity-only" and "--dangling" (the latter of which is the default), we walk all of the unreachable objects, and there we forget to free. Most cases would not notice this, because they don't have a lot of unreachable objects, but you can make a pathological case like this: git clone --bare /path/to/linux.git repo.git cd repo.git rm packed-refs ;# now everything is unreachable! git fsck --connectivity-only That ends up with peak heap usage ~18GB, which is (not coincidentally) close to the size of all uncompressed trees in the repository. After this patch, the peak heap is only ~2GB. A few things to note: - it might seem like fsck_walk(), if it is parsing the trees, should be responsible for freeing them. But the situation is quite tricky. In the non-connectivity mode, after we call fsck_walk() we then proceed with fsck_object() which actually does the type-specific sanity checks on the object contents. We do pass our own separate buffer to fsck_object(), but there's a catch: our earlier call to parse_object_buffer() may have attached that buffer to the object struct! So by freeing it, we leave the rest of the code with a dangling pointer. Likewise, the call to fsck_walk() in index-pack is subtle. It attaches a buffer to the tree object that must not be freed! And so rather than calling free_tree_buffer(), it actually detaches it by setting tree->buffer to NULL. These cases would _probably_ be fixable by having fsck_walk() free the tree buffer only when it was the one who allocated it via parse_tree(). But that would still leave the callers responsible for freeing other cases, so they wouldn't be simplified. While the current semantics for fsck_walk() make it easy to accidentally leak in new callers, at least they are simple to explain, and it's not a function that's likely to get a lot of new call-sites. And in any case, it's probably sensible to fix the leak first with this simple patch, and try any more complicated refactoring separately. - a careful reader may notice that fsck_obj() also frees commit buffers, but neither the call in traverse_one_object() nor the one touched in this patch does so. And indeed, this is another problem for --connectivity-only (and accounts for most of the 2GB heap after this patch), but it's one we'll fix in a separate commit. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
When parsing a commit, the default behavior is to stuff the original buffer into a commit_slab (which takes ownership of it). But for a tool like fsck, this isn't useful. While we may look at the buffer further as part of fsck_commit(), we'll always do so through a separate pointer; attaching the buffer to the slab doesn't help. Worse, it means we have to remember to free the commit buffer in all call paths. We do so in fsck_obj(), which covers a regular "git fsck". But with "--connectivity-only", we forget to do so in both traverse_one_object(), which covers reachable objects, and mark_unreachable_referents(), which covers unreachable ones. As a result, that mode ends up storing an uncompressed copy of every commit on the heap at once. We could teach the code paths for --connectivity-only to also free commit buffers. But there's an even easier fix: we can just turn off the save_commit_buffer flag, and then we won't attach them to the commits in the first place. This reduces the peak heap of running "git fsck --connectivity-only" in a clone of linux.git from ~2GB to ~1GB. According to massif, the remaining memory goes where you'd expect: the object structs themselves, the obj_hash containing them, and the delta base cache. Note that we'll leave the call to free commit buffers in fsck_obj() for now; it's not quite redundant because of a related bug that we'll fix in a subsequent commit. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
If the global variable "save_commit_buffer" is set to 0, then parse_commit() will throw away the commit object data after parsing it, rather than sticking it into a commit slab. This goes all the way back to 60ab26d ([PATCH] Avoid wasting memory in git-rev-list, 2005-09-15). But there's another code path which may similarly stash the buffer: parse_object_buffer(). This is where we end up if we parse a commit via parse_object(), and it's used directly in a few other code paths like git-fsck. The original goal of 60ab26d was avoiding extra memory usage for rev-list. And there it's not all that important to catch parse_object(). We use that function only for looking at the tips of the traversal, and the majority of the commits are parsed by following parent links, where we use parse_commit() directly. So we were wasting some memory, but only a small portion. It's much easier to see the effect with fsck. Since we now turn off save_commit_buffer by default there, we _should_ be able to drop the freeing of the commit buffer in fsck_obj(). But if we do so (taking the first hunk of this patch without the rest), then the peak heap of "git fsck" in a clone of git.git goes from 136MB to 194MB. Teaching parse_object_buffer() to respect save_commit_buffer brings that down to 134.5MB (it's hard to tell from massif's output, but I suspect the savings comes from avoiding the overhead of the mostly-empty commit slab). Other programs should see a small improvement. Both "rev-list --all" and "fsck --connectivity-only" improve by a few hundred kilobytes, as they'd avoid loading the tip objects of their traversals. Most importantly, no code should be hurt by doing this. Any program that turns off save_commit_buffer is already making the assumption that any commit it sees may need to have its object data loaded on demand, as it doesn't know which ones were parsed by parse_commit() versus parse_object(). Not to mention that anything parsed by the commit graph may be in the same boat, even if save_commit_buffer was not disabled. This should be the only spot that needs to be fixed. Grepping for set_commit_buffer() shows that this and parse_commit() are the only relevant calls. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
We explicitly forbid the combination of "--bare" with "-o", but there doesn't seem to be any good reason to do so. The original logic came as part of e6489a1 (clone: do not accept more than one -o option., 2006-01-22), but that commit does not give any reason. Furthermore, the equivalent combination via config is allowed: git -c clone.defaultRemoteName=foo clone ... and works as expected. It may be that this combination was considered useless, because a bare clone does not set remote.origin.fetch (and hence there is no refs/remotes/origin hierarchy). But it does set remote.origin.url, and that name is visible to the user via "git fetch origin", etc. Let's allow the options to be used together, and switch the "forbid" test in t5606 to check that we use the requested name. That test came much later in 349cff7 (clone: add tests for --template and some disallowed option pairs, 2020-09-29), and does not offer any logic beyond "let's test what the code currently does". Reported-by: John A. Leuenhagen <john@zlima12.com> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
We return an error when trying to rename a remote that has no fetch refspec: $ git config --unset-all remote.origin.fetch $ git remote rename origin foo fatal: could not unset 'remote.foo.fetch' To make things even more confusing, we actually _do_ complete the config modification, via git_config_rename_section(). After that we try to rewrite the fetch refspec (to say refs/remotes/foo instead of origin). But our call to git_config_set_multivar() to remove the existing entries fails, since there aren't any, and it calls die(). We could fix this by using the "gently" form of the config call, and checking the error code. But there is an even simpler fix: if we know that there are no refspecs to rewrite, then we can skip that part entirely. Reported-by: John A. Leuenhagen <john@zlima12.com> Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
It is the expectation that credential helpers be liberal in what they accept and conservative in what they return, to allow for future growth and evolution of the protocol/interaction. All of the other helpers (store, cache, osxkeychain, libsecret, gnome-keyring) except `netrc` currently ignore any credential lines that are not recognised, whereas the Windows helper (wincred) instead dies. Fix the discrepancy and ignore unknown lines in the wincred helper. Signed-off-by: Matthew John Cheetham <mjcheetham@outlook.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Contrary to the documentation on credential helpers, as well as the help text for git-credential-netrc itself, this helper will `die` when presented with an unknown property/attribute/token. Correct the behaviour here by skipping and ignoring any tokens that are unknown. This means all helpers in the tree are consistent and ignore any unknown credential properties/attributes. Signed-off-by: Matthew John Cheetham <mjcheetham@outlook.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Like in all the other credential helpers, the osxkeychain helper ignores unknown credential lines. Add a comment (a la the other helpers) to make it clear and explicit that this is the desired behaviour. Signed-off-by: Matthew John Cheetham <mjcheetham@outlook.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Turn on sparse index and remove ensure_full_index(). Before this patch, `git-grep` utilizes the ensure_full_index() method to expand the index and search all the entries. Because this method requires walking all the trees and constructing the index, it is the slow part within the whole command. To achieve better performance, this patch uses grep_tree() to search the sparse directory entries and get rid of the ensure_full_index() method. Why grep_tree() is a better choice over ensure_full_index()? 1) grep_tree() is as correct as ensure_full_index(). grep_tree() looks into every sparse-directory entry (represented by a tree) recursively when looping over the index, and the result of doing so matches the result of expanding the index. 2) grep_tree() utilizes pathspecs to limit the scope of searching. ensure_full_index() always expands the index, which means it will always walk all the trees and blobs in the repo without caring if the user only wants a subset of the content, i.e. using a pathspec. On the other hand, grep_tree() will only search the contents that match the pathspec, and thus possibly walking fewer trees. 3) grep_tree() does not construct and copy back a new index, while ensure_full_index() does. This also saves some time. ---------------- Performance test - Summary: p2000 tests demonstrate a ~71% execution time reduction for `git grep --cached bogus -- "f2/f1/f1/*"` using tree-walking logic. However, notice that this result varies depending on the pathspec given. See below "Command used for testing" for more details. Test HEAD~ HEAD ------------------------------------------------------- 2000.78: git grep ... (full-v3) 0.35 0.39 (≈) 2000.79: git grep ... (full-v4) 0.36 0.30 (≈) 2000.80: git grep ... (sparse-v3) 0.88 0.23 (-73.8%) 2000.81: git grep ... (sparse-v4) 0.83 0.26 (-68.6%) - Command used for testing: git grep --cached bogus -- "f2/f1/f1/*" The reason for specifying a pathspec is that, if we don't specify a pathspec, then grep_tree() will walk all the trees and blobs to find the pattern, and the time consumed doing so is not too different from using the original ensure_full_index() method, which also spends most of the time walking trees. However, when a pathspec is specified, this latest logic will only walk the area of trees enclosed by the pathspec, and the time consumed is reasonably a lot less. Generally speaking, because the performance gain is acheived by walking less trees, which are specified by the pathspec, the HEAD time v.s. HEAD~ time in sparse-v[3|4], should be proportional to "pathspec enclosed area" v.s. "all area", respectively. Namely, the wider the <pathspec> is encompassing, the less the performance difference between HEAD~ and HEAD, and vice versa. That is, if we don't specify a pathspec, the performance difference [1] is indistinguishable: both methods walk all the trees and take generally same amount of time (even with the index construction time included for ensure_full_index()). [1] Performance test result without pathspec (hence walking all trees): Command used: git grep --cached bogus Test HEAD~ HEAD --------------------------------------------------- 2000.78: git grep ... (full-v3) 6.17 5.19 (≈) 2000.79: git grep ... (full-v4) 6.19 5.46 (≈) 2000.80: git grep ... (sparse-v3) 6.57 6.44 (≈) 2000.81: git grep ... (sparse-v4) 6.65 6.28 (≈) -------------------------- NEEDSWORK about submodules There are a few NEEDSWORKs that belong to improvements beyond this topic. See the NEEDSWORK in builtin/grep.c::grep_submodule() for more context. The other two NEEDSWORKs in t1092 are also relative. Suggested-by: Derrick Stolee <derrickstolee@github.com> Helped-by: Derrick Stolee <derrickstolee@github.com> Helped-by: Victoria Dye <vdye@github.com> Helped-by: Elijah Newren <newren@gmail.com> Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Commit 68d5d03 (rebase: teach --autosquash to match on sha1 in addition to message, 2010-11-04) taught autosquash to recognize subjects like "fixup! 7a235b" where 7a235b is an OID-prefix. It actually did more than advertised: 7a235b can be an arbitrary commit-ish (as long as it's not trailed by spaces). Accidental(?) use of this secret feature revealed a bug where we would silently drop a fixup commit. The bug can also be triggered when using an OID-prefix but that's unlikely in practice. Let the commit with subject "fixup! main" be the tip of the "main" branch. When computing the fixup target for this commit, we find the commit itself. This is wrong because, by definition, a fixup target must be an earlier commit in the todo list. We wrongly find the current commit because we added it to the todo list prematurely. Avoid these fixup-cycles by only adding the current commit to the todo list after we have finished looking for the fixup target. Reported-by: Erik Cervin Edin <erik@cervined.in> Signed-off-by: Johannes Altmanninger <aclopte@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
In 2708ce6 (branch: sort detached HEAD based on a flag, 2021-01-07) a call to wt_status_state_free_buffers, responsible of freeing the resources that could be allocated in the local struct wt_status_state state, was eliminated. The call to wt_status_state_free_buffers was introduced in 962dd7e (wt-status: introduce wt_status_state_free_buffers(), 2020-09-27). This commit brings back that call in get_head_description. Signed-off-by: Rubén Justo <rjusto@gmail.com> Reviewed-by: Martin Ågren <martin.agren@gmail.com> Acked-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The 'git maintenance unregister' subcommand has a step that removes the current repository from the multi-valued maitenance.repo config key. This fails if the repository is not listed in that key. This makes running 'git maintenance unregister' twice result in a failure in the second instance. This failure exit code is helpful, but its message is not. Add a new die() message that explicitly calls out the failure due to the repository not being registered. In some cases, users may want to run 'git maintenance unregister' just to make sure that background jobs will not start on this repository, but they do not want to check to see if it is registered first. Add a new '--force' option that will siltently succeed if the repository is not already registered. Also add an extra test of 'git maintenance unregister' at a point where there are no registered repositories. This should fail without --force. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
The 'scalar unregister' command removes a repository from the list of registered Scalar repositories and removes it from the list of repositories registered for background maintenance. If the repository was not already registered for background maintenance, then the command fails, even if the repository was still registered as a Scalar repository. After using 'scalar clone' or 'scalar register', the repository would be enrolled in background maintenance since those commands run 'git maintenance start'. If the user runs 'git maintenance unregister' on that repository, then it is still in the list of repositories which get new config updates from 'scalar reconfigure'. The 'scalar unregister' command would fail since 'git maintenance unregister' would fail. Further, the add_or_remove_enlistment() method in scalar.c already has this idempotent nature built in as an expectation since it returns zero when the scalar.repo list already has the proper containment of the repository. The previous change added the 'git maintenance unregister --force' option, so use it within 'scalar unregister' to make it idempotent. Signed-off-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Code clean-up. * jk/cleanup-callback-parameters: attr: drop DEBUG_ATTR code commit: avoid writing to global in option callback multi-pack-index: avoid writing to global in option callback test-submodule: inline resolve_relative_url() function
"GIT_EDITOR=: git branch --edit-description" resulted in failure, which has been corrected. * jc/branch-description-unset: branch: do not fail a no-op --edit-desc
The code to clean temporary object directories (used for quarantine) tried to remove them inside its signal handler, which was a no-no. * jc/tmp-objdir: tmp-objdir: skip clean up when handling a signal
Compilation fix for ancient compilers. * ab/unused-annotation: git-compat-util.h: GCC deprecated message arg only in GCC 4.5+
Update comment in the Makefile about the RUNTIME_PREFIX config knob. * dd/document-runtime-prefix-better: Makefile: clarify runtime relative gitexecdir
Clarify that "the sentence after <area>: prefix does not begin with a capital letter" rule applies only to the commit title. * jc/use-of-uc-in-log-messages: SubmittingPatches: use usual capitalization in the log message body
Code clean-up. * rs/use-fspathncmp: dir: use fspathncmp() in pl_hashmap_cmp()
Remove error detection from a function that fetches from promisor remotes, and make it die when such a fetch fails to bring all the requested objects, to give an early failure to various operations. * jt/promisor-remote-fetch-tweak: promisor-remote: die upon failing fetch promisor-remote: remove a return value
"git branch --edit-description" on an unborh branch misleadingly said that no such branch exists, which has been corrected. * rj/branch-edit-desc-unborn: branch: description for non-existent branch errors
Remove outdated test. * pw/remove-rebase-p-test: t3435: remove redundant test case
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Compiling with -O2 can interact badly with LSan's leak-checker, causing false positives. Imagine a simplified example like: char *str = allocate_some_string(); if (some_func(str) < 0) die("bad str"); free(str); The compiler may eliminate "str" as a stack variable, and just leave it in a register. The register is preserved through most of the function, including across the call to some_func(), since we'd eventually need to free it. But because die() is marked with NORETURN, the compiler knows that it doesn't need to save registers, and just clobbers it. When die() eventually exits, the leak-checker runs. It looks in registers and on the stack for any reference to the memory allocated by str (which would indicate that it's not leaked), but can't find one. So it reports it as a leak. Neither system is wrong, really. The C standard (mostly section 5.1.2.3) defines an abstract machine, and compilers are allowed to modify the program as long as the observable behavior of that abstract machine is unchanged. Looking at random memory values on the stack is undefined behavior, and not something that the optimizer needs to support. But there really isn't any other way for a leak checker to work; it inherently has to do undefined things like scouring memory for pointers. So the two things are inherently at odds with each other. We can't fix it by changing the code, because from the perspective of the program running in an abstract machine, there is no leak. This has caused real false positives in the past, like: - https://lore.kernel.org/git/patch-v3-5.6-9a44204c4c9-20211022T175227Z-avarab@gmail.com/ - https://lore.kernel.org/git/Yy4eo6500C0ijhk+@coredump.intra.peff.net/ - https://lore.kernel.org/git/Y07yeEQu+C7AH7oN@nand.local/ This patch makes those go away by forcing -O0 when compiling with LSan. There are a few ways we could do this: - we could just teach the linux-leaks CI job to set -O0. That's the smallest change, and means we wouldn't get spurious CI failures. But it doesn't help people looking for leaks manually or in a specific test (and because the problem depends on the vagaries of the optimizer, investigating these can waste a lot of time in head-scratching as the problem comes and goes) - we default to -O2 in CFLAGS; we could pull this out to a separate variable ("-O$(O)" or something) and modify "O" when LSan is in use. This is the most flexible, in that you could still build with "make O=2 SANITIZE=leak" if you really wanted to (say, for experimenting). But it would also fail to kick in if the user defines their own CFLAGS variable, which again leads to head-scratching. - we can just stick -O0 into BASIC_CFLAGS when enabling LSan. Since this comes after the user-provided CFLAGS, it will override any previous -O setting found there. This is more foolproof, albeit less flexible. If you want to experiment with an optimized leak-checking build, you'll have to put "-O2 -fsanitize=leak" into CFLAGS manually, rather than using our SANITIZE=leak Makefile magic. Since the final one is the least likely to break in normal use, this patch uses that approach. The resulting build is a little slower, of course, but since LSan is already about 2x slower than a regular build, another 10% slowdown isn't that big a deal. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
As we'll address in subsequent commits the "DC_SHA1=YesPlease" is not on by default on OSX, instead we use Apple Common Crypto's SHA-1 implementation. In 6beb268 (fsmonitor: relocate socket file if .git directory is remote, 2022-10-04) the build was broken with "DC_SHA1=YesPlease" (and probably other non-"APPLE_COMMON_CRYPTO" SHA-1 backends). So let's extract the fix for this from [1] to get the build working again with "DC_SHA1=YesPlease". In addition to the fix in [1] we also need to replace "SHA_DIGEST_LENGTH" with "GIT_MAX_RAWSZ". 1. https://lore.kernel.org/git/c085fc15b314abcb5e5ca6b4ee5ac54a28327cab.1665326258.git.gitgitgadget@gmail.com/ Signed-off-by: Eric DeCosta <edecosta@mathworks.com> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
Update CodingGuidelines to clarify what features to use and avoid in C99. * ab/coding-guidelines-c99: CodingGuidelines: recommend against unportable C99 struct syntax CodingGuidelines: mention C99 features we can't use CodingGuidelines: allow declaring variables in for loops CodingGuidelines: mention dynamic C99 initializer elements CodingGuidelines: update for C99
Code simplification. * rs/archive-dedup-printf: archive: deduplicate verbose printing
Work around older clang that warns against C99 zero initialization syntax for struct. * jh/struct-zero-init-with-older-clang: config.mak.dev: disable suggest braces error on old clang versions
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Enable macOS build with sha1dc hash function. * ab/macos-build-fix-with-sha1dc: fsmonitor OSX: compile with DC_SHA1=YesPlease
Avoid false-positive from LSan whose assumption may be broken with higher optimization levels. * jk/use-o0-in-leak-sanitizer: Makefile: force -O0 when compiling with SANITIZE=leak
After checking out a "branch" that is a symbolic-ref that points at another branch, "git symbolic-ref HEAD" reports the underlying branch, not the symbolic-ref the user gave checkout as argument. The command learned the "--no-recurse" option to stop after dereferencing a symbolic-ref only once. * jc/symbolic-ref-no-recurse: symbolic-ref: teach "--[no-]recurse" option
Giving "--invert-grep" and "--all-match" without "--grep" to the "git log" command resulted in an attempt to access grep pattern expression structure that has not been allocated, which has been corrected. * ab/grep-simplify-extended-expression: grep.c: remove "extended" in favor of "pattern_expression", fix segfault
Code clean-up. * ds/cmd-main-reorder: git.c: improve code readability in cmd_main()
"git branch --edit-description @{-1}" is now a way to edit branch description of the branch you were on before switching to the current branch. * rj/branch-edit-description-with-nth-checkout: branch: support for shortcuts like @{-1}, completed
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The server uses delta compression during git clone to reduce the amount of data transferred over the network, but delta compression for large binary blobs often does not reduce storage size significantly and wastes a lot of CPU. Git now disables delta compression for objects that meet these conditions: 1. files that have -delta set in .gitattributes 2. files that its size exceed the big_file_threshold However, in 1, .gitattributes needs to be set manually by the user, and in most cases the user does not actively set it, and it is not something that can be actively adjusted on the server aside. In 2, the big_file_threshold now defaults to 512MB, and many binary files smaller than that will be uselessly delta-compressed, and this is made worse if the server actively increases the big_file_threshold. Therefore, we need a way to be able to actively skip the delta compression of some files on the server. Introduces the `-exclude-delta=<pattern>` option, which can be used to disable delta compression for objects that satisfy the pattern. Signed-off-by: ZheNing Hu <adlternative@gmail.com>
/submit |
Submitted as pull.1392.git.1666453564661.gitgitgadget@gmail.com To fetch this version into
To fetch this version to local tag
|
On the Git mailing list, Junio C Hamano wrote (reply to this): "ZheNing Hu via GitGitGadget" <gitgitgadget@gmail.com> writes:
> From: ZheNing Hu <adlternative@gmail.com>
>
> The server uses delta compression during git clone to reduce
> the amount of data transferred over the network, but delta
> compression for large binary blobs often does not reduce
> storage size significantly and wastes a lot of CPU. Git now
> disables delta compression for objects that meet these conditions:
>
> 1. files that have -delta set in .gitattributes
> 2. files that its size exceed the big_file_threshold
>
> However, in 1, .gitattributes needs to be set manually by the user,
> and in most cases the user does not actively set it, and it is not
> something that can be actively adjusted on the server aside. In 2,
> the big_file_threshold now defaults to 512MB, and many binary files
> smaller than that will be uselessly delta-compressed, and this is
> made worse if the server actively increases the big_file_threshold.
Who are you trying to help, though? The user has to somehow
manually cause the --exclude-delta=<pattern> to be added at the
server side, so I do not see the new feature as correcting the
weakness you perceive in the approach to mark the undeltifiable
blobs in the attributes system at all. Does this feature assume
that the server operator knows better than the project that can
maintain their own .gitattributes in tree? If so, the server
operator can already use <bare-repository>/info/attributes file
to achieve that already, no?
Now hosting sites may not give hosted projects flexibility to
configure their own server side repositories, like its "config"
and "info/attributes", but that limitation would equally apply
what command line options pack-objects runs with.
So, again, it is not clear who this patch is trying to help and how.
If we assume that the hosting operators can give project owners more
control of how the server side is configured, then we can do that
already without a new option, no?
IOW
> Therefore, we need a way to be able to actively skip the delta
> compression of some files on the server.
I do not quite follow the "Therefore" here.
|
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
While analyzing some repositories using
git filter-repo -analyze
,I noticed that many huge binaries in the repositories were
delta-compressed without much reduction in size.
$ cat .git/filter-repo/analysis/path-all-sizes.txt | more
=== All paths by reverse accumulated size ===
Format: unpacked size, packed size, date deleted, path name
23816778 23765921 2022-08-22 managed/src/universal/ybc/ybc-1.0.0-b1-linux-x86_64.tar.gz
22504398 22445676 2022-08-22 managed/src/universal/ybc/ybc-1.0.0-b1-el8-aarch64.tar.gz
11726471 6424233 2022-08-09 managed/yba-installer/yba-installer_linux_amd64
294644800 5794201 src/yb/master/catalog_manager.cc
2912780 2872186 docs/static/images/yp/tables-view-ycql.png
2992192 2634232 docs/static/images/yb-cloud/cloud-clusters-backups.png
2757095 2501915 docs/static/images/deploy/aws/aws-cf-configure-options.png
...
The current solution to avoid delta compression is not very
suitable for git servers. First, files that exceed the
big_file_threshold are not delta compressed, but the above
analysis indicates that many big binary files do not exceed
the the big_file_threshold (default to 512MB). Second, there
is not .gitattrbutes to disable delta compression for them, we
also don't really can let repo administrators add it manually.
But we can also see that the large files in these repositories often
have some common characteristics: they end in ".tar.gz"or “.png".
So perhaps we can take advantage of this feature and disable delta
compression on the server for some common type binary files.
This is currently implemented by command line parameters
--exclude-delta=<pattern>
. But maybe we can also try passingit through git config.
cc: Christian Couder christian.couder@gmail.com
cc: Ævar Arnfjörð Bjarmason avarab@gmail.com
cc: Junio C Hamano gitster@pobox.com
cc: Derrick Stolee derrickstolee@github.com
cc: Johannes Schindelin johannes.schindelin@gmx.de
cc: Elijah Newren newren@gmail.com
cc: James Ramsay james@jramsay.com.au