Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix up index logic from gh-3009 #3365

Merged
merged 2 commits into from May 7, 2019
Merged

Conversation

@kyleam
Copy link
Contributor

@kyleam kyleam commented Apr 30, 2019

This PR

  • fixes an issue with the partial commit workaround in AnnexRepo.commit() introduced by gh-3009.
  • avoids using the temporary index in a case where it is unnecessary
Demo of the issue
#!/bin/sh

set -ex

cd $(mktemp -dt dl-XXXXXXX)
datalad create
printf '\n* annex.largefiles=(largerthan=4b)\n' >>.gitattributes
datalad add .gitattributes

printf "so big; to annex i go" >ishrink
datalad save ishrink

datalad unlock ishrink
: >ishrink

# A staged change that won't be specified as a path to save,
# triggering the partial commit.
: >staged && git add staged

datalad save ishrink

git status -s

Without this PR, the status output is

TT ishrink
A  staged

even though ishrink was just committed and should be clean.

kyleam added 2 commits Apr 30, 2019
git-annex's pre-commit hook throws the following error if an unlocked
file is given in a 'git commit PATH ...' call:

  Cannot make a partial commit with unlocked annexed files. You should
  `git annex add` the files you want to commit, and then run git
  commit.

The motivation for this is explained in git-annex's
adc5ca70a (pre-commit: Block partial commit of unlocked annexed file,
since that left a typechange staged in index, 2014-11-10)

In the typical case, we could fix this by running 'git annex add'
before the 'git commit PATH' call.  (And 'datalad save' already does
this, so this is an issue only when directly calling
AnnexRepo.commit().)  *But* a complication is introduced when 'git
annex add' adds the file to index as a regular file while the file in
HEAD is an annex symbolic link (e.g., the common case would be the
file size shrinking below the threshold specified via
annex.largefiles).  git-annex can't distinguish the typechange of the
added file from that of an unlocked file, so 'git commit PATH' fails
with the partial commit error [*].

To work around this, dataladgh-3009 taught AnnexRepo.commit() to stage the
specified paths in a temporary index and then call 'git commit'
without the pathspec.  After the commit, we need to reset the main
index to the state in HEAD to reflect changes that git-annex might
have made in the pre-commit hook (e.g., changing the regular file to
an annex symlink).  dataladgh-3009 did this for the specified paths that had
unstaged changes, but we need to do the same thing for paths with
staged changes.  Otherwise, calling 'datalad save PATH' with a path
moved below the annex.largefiles cutoff will show a typechange after
the save call.

[*]: https://git-annex.branchable.com/bugs/cannot_commit___34__annex_add__34__ed_modified_file_which_switched_its_largefile_status_to_be_committed_to_git_now/
We decide to commit with a temporary index if there are any files with
staged changes that should _not_ be committed or if any of the
specified paths have unstaged changes.  The latter actually doesn't
require a temporary index; we just need to update the paths in the
current index, as we already do.

As a side note, by adding the files, dataladgh-3009 introduced a subtle
change in behavior.  Before the specified files would be added to git
by 'git commit'.  After dataladgh-3009, they are added with 'git annex add'.
This is unlikely to matter because the behavior was undefined and is
only visible when calling AnnexRepo.commit() directly as save adds the
paths before calling .commit().
@codecov
Copy link

@codecov codecov bot commented Apr 30, 2019

Codecov Report

Merging #3365 into 0.11.x will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           0.11.x    #3365      +/-   ##
==========================================
+ Coverage   90.84%   90.84%   +<.01%     
==========================================
  Files         252      252              
  Lines       33127    33135       +8     
==========================================
+ Hits        30095    30103       +8     
  Misses       3032     3032
Impacted Files Coverage Δ
datalad/interface/tests/test_save.py 100% <ø> (ø) ⬆️
datalad/support/annexrepo.py 88.97% <100%> (-0.01%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 514545a...8bd908e. Read the comment docs.

1 similar comment
@codecov
Copy link

@codecov codecov bot commented Apr 30, 2019

Codecov Report

Merging #3365 into 0.11.x will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           0.11.x    #3365      +/-   ##
==========================================
+ Coverage   90.84%   90.84%   +<.01%     
==========================================
  Files         252      252              
  Lines       33127    33135       +8     
==========================================
+ Hits        30095    30103       +8     
  Misses       3032     3032
Impacted Files Coverage Δ
datalad/interface/tests/test_save.py 100% <ø> (ø) ⬆️
datalad/support/annexrepo.py 88.97% <100%> (-0.01%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 514545a...8bd908e. Read the comment docs.

@kyleam kyleam requested a review from yarikoptic May 6, 2019
@@ -2937,15 +2937,15 @@ def commit(self, msg=None, options=None, _datalad_msg=False,
# Files which were staged but not among files
staged_not_to_commit = all_changed_staged.difference(files_changed_staged)

if staged_not_to_commit or files_changed_notstaged:
if files_changed_notstaged:
self.add(files=list(files_changed_notstaged))
Copy link
Member

@yarikoptic yarikoptic May 6, 2019

uff... this whole changed/staged/etc hurts my brain ;-)
I agree that probably a single condition (if staged_not_to_commit) is the most sensible to decide either we need a temp index or not. But then I am getting lost in what needs to be reset and possible interactions from adding here which possibly could be not taken into account I think for staged_not_to_commit (since it differs with files_changed_staged so no unstaged before but staged now files are taken into account).

By doing this .add here we in effect stage files_changed_notstaged. So I wonder if we could/should move assignments for files_changed_staged and staged_not_to_commit after this add?:

diff --git a/datalad/support/annexrepo.py b/datalad/support/annexrepo.py
index 449fe297f..e954125b3 100644
--- a/datalad/support/annexrepo.py
+++ b/datalad/support/annexrepo.py
@@ -2919,27 +2919,29 @@ class AnnexRepo(GitRepo, RepoInterface):
                     # files "jumped" between git/annex.  Then also preparing a
                     # custom index and calling "commit" without files resolves
                     # the issue
-                    all_changed_staged = \
-                        set(self.get_changed_files(staged=True))
-
                     files_normalized = [
                         _normalize_path(self.path, f) if isabs(f) else f
                         for f in files
                     ]
 
-                    files_changed_staged = \
-                        set(self.get_changed_files(staged=True, files=files_normalized))
                     files_changed_notstaged = \
                         set() \
                         if direct_mode \
                         else set(self.get_changed_files(staged=False, files=files_normalized))
 
-                    # Files which were staged but not among files
-                    staged_not_to_commit = all_changed_staged.difference(files_changed_staged)
-
+                    # Stage all files which are not staged yet
                     if files_changed_notstaged:
                         self.add(files=list(files_changed_notstaged))
 
+                    files_changed_staged = \
+                        set(self.get_changed_files(staged=True, files=files_normalized))
+
+                    all_changed_staged = \
+                        set(self.get_changed_files(staged=True))
+
+                    # Files which were staged but not among files
+                    staged_not_to_commit = all_changed_staged.difference(files_changed_staged)
+
                     if staged_not_to_commit:
                         # Need an alternative index_file
                         with make_tempfile(dir=opj(self.path,

or am i ruining some logic (relying on original values before any changes) this way?

Copy link
Contributor Author

@kyleam kyleam May 6, 2019

uff... this whole changed/staged/etc hurts my brain ;-)

Mine too :]

By doing this .add here we in effect stage files_changed_notstaged.

Yes, though functionally it is no different then doing it in the old spot. We just avoid the unnecessary temporary index.

So I wonder if we could/should move assignments for files_changed_staged and staged_not_to_commit after this add?

There are two places where the values of files_changed_{not,}staged come in: (1) determining the value of staged_not_to_commit and (2) doing the clean-up reset of the final index. No. 1 doesn't change based on whether files_changed_staged and all_changed_staged are updated to include the just added files_changed_notstaged because that is adding files to both sets, which will be dropped with the all_changed_staged - files_changed_notstaged difference. No 2. isn't affected because the union of files_changed_notstaged and files_changed_staged are used for the reset, so how we partition files among the two sets doesn't matter.

Copy link
Member

@yarikoptic yarikoptic May 7, 2019

ok, let's then stick to your PR as is, thanks!

@yarikoptic yarikoptic merged commit 9107eda into datalad:0.11.x May 7, 2019
5 checks passed
@yarikoptic yarikoptic added this to the Release 0.11.5 milestone May 7, 2019
@kyleam kyleam deleted the annex-shrink-fixup branch May 7, 2019
yarikoptic added a commit that referenced this issue May 28, 2019
0.11.5 (May 23, 2019) -- stability is not overrated

Should be faster and less buggy, with a few enhancements.

 Fixes

- [create-sibling][]  ([#3318][])
  - Siblings are no longer configured with a post-update hook unless a
    web interface is requested with `--ui`.
  - `git submodule update --init` is no longer called from the
    post-update hook.
  - If `--inherit` is given for a dataset without a superdataset, a
    warning is now given instead of raising an error.
- The internal command runner failed on Python 2 when its `env`
  argument had unicode values.  ([#3332][])
- The safeguard that prevents creating a dataset in a subdirectory
  that already contains tracked files for another repository failed on
  Git versions before 2.14.  For older Git versions, we now warn the
  caller that the safeguard is not active.  ([#3347][])
- A regression introduced in v0.11.1 prevented [save][] from committing
  changes under a subdirectory when the subdirectory was specified as
  a path argument.  ([#3106][])
- A workaround introduced in v0.11.1 made it possible for [save][] to
  do a partial commit with an annex file that has gone below the
  `annex.largefiles` threshold.  The logic of this workaround was
  faulty, leading to files being displayed as typechanged in the index
  following the commit.  ([#3365][])
- The resolve_path() helper confused paths that had a semicolon for
  SSH RIs.  ([#3425][])
- The detection of SSH RIs has been improved.  ([#3425][])

 Enhancements and new features

- The internal command runner was too aggressive in its decision to
  sleep.  ([#3322][])
- The "INFO" label in log messages now retains the default text color
  for the terminal rather than using white, which only worked well for
  terminals with dark backgrounds.  ([#3334][])
- A short flag `-R` is now available for the `--recursion-limit` flag,
  a flag shared by several subcommands.  ([#3340][])
- The authentication logic for [create-sibling-github][] has been
  revamped and now supports 2FA.  ([#3180][])
- New configuration option `datalad.ui.progressbar` can be used to
  configure the default backend for progress reporting ("none", for
  example, results in no progress bars being shown).  ([#3396][])
- A new progress backend, available by setting datalad.ui.progressbar
  to "log", replaces progress bars with a log message upon completion
  of an action.  ([#3396][])
- DataLad learned to consult the [NO_COLOR][] environment variable and
  the new `datalad.ui.color` configuration option when deciding to
  color output.  The default value, "auto", retains the current
  behavior of coloring output if attached to a TTY ([#3407][]).
- [clean][] now removes annex transfer directories, which is useful
  for cleaning up failed downloads. ([#3374][])
- [clone][] no longer refuses to clone into a local path that looks
  like a URL, making its behavior consistent with `git clone`.
  ([#3425][])
- [wtf][]
  - Learned to fall back to the `dist` package if `platform.dist`,
    which has been removed in the yet-to-be-release Python 3.8, does
    not exist.  ([#3439][])
  - Gained a `--section` option for limiting the output to specific
    sections and a `--decor` option, which currently knows how to
    format the output as GitHub's `<details>` section.  ([#3440][])

* tag '0.11.5': (96 commits)
  [DATALAD RUNCMD] make update-changelog
  Version boost and finalize CHANGELOG.md record
  ENH: new Makefile rule linkissues-changelog to link issues, which now will also be prerequisite for update-changelog
  CHANGELOG.md: Add entries for recently merged PRs
  ENH: require "distro" for python >= 3.8
  ENH: compat with python 3.8 which removed .dist -- try distro
  CLN: wtf: Remove unused (and duplicated) import
  DOC: wtf: Avoid double period in -S's description
  ENH: -D|--decor html_details -- to make it ready for pasting to github issue without clutter
  BF: assure bytes while giving to pyperclip upon its demand (on Py2)
  RF: move always present path + type "section" into "location" section, retain order of sections from cmdline
  RF: switch from nargs="*" to action=append for wtf -S
  ENH: wtf -S to specify which sections to query/display (by default -- all)
  MNT: Avoid invalid escape sequences in strings
  BF: export_to_figshare: Don't test identity of string literal
  BF(TST): do not assume user naiveness - treat any url-like looking path as a path
  BF: Check for /, \ or # in the username@hostname part while detecting SSHRI
  CHANGELOG.md: Add entry for gh-3374
  BF: revert back (remove) check for path being PathRI
  BF: list annex-transfer also in cmdline opt choice for "what"
  ...
yarikoptic added a commit that referenced this issue May 28, 2019
0.11.5 (May 23, 2019) -- stability is not overrated

Should be faster and less buggy, with a few enhancements.

 Fixes

- [create-sibling][]  ([#3318][])
  - Siblings are no longer configured with a post-update hook unless a
    web interface is requested with `--ui`.
  - `git submodule update --init` is no longer called from the
    post-update hook.
  - If `--inherit` is given for a dataset without a superdataset, a
    warning is now given instead of raising an error.
- The internal command runner failed on Python 2 when its `env`
  argument had unicode values.  ([#3332][])
- The safeguard that prevents creating a dataset in a subdirectory
  that already contains tracked files for another repository failed on
  Git versions before 2.14.  For older Git versions, we now warn the
  caller that the safeguard is not active.  ([#3347][])
- A regression introduced in v0.11.1 prevented [save][] from committing
  changes under a subdirectory when the subdirectory was specified as
  a path argument.  ([#3106][])
- A workaround introduced in v0.11.1 made it possible for [save][] to
  do a partial commit with an annex file that has gone below the
  `annex.largefiles` threshold.  The logic of this workaround was
  faulty, leading to files being displayed as typechanged in the index
  following the commit.  ([#3365][])
- The resolve_path() helper confused paths that had a semicolon for
  SSH RIs.  ([#3425][])
- The detection of SSH RIs has been improved.  ([#3425][])

 Enhancements and new features

- The internal command runner was too aggressive in its decision to
  sleep.  ([#3322][])
- The "INFO" label in log messages now retains the default text color
  for the terminal rather than using white, which only worked well for
  terminals with dark backgrounds.  ([#3334][])
- A short flag `-R` is now available for the `--recursion-limit` flag,
  a flag shared by several subcommands.  ([#3340][])
- The authentication logic for [create-sibling-github][] has been
  revamped and now supports 2FA.  ([#3180][])
- New configuration option `datalad.ui.progressbar` can be used to
  configure the default backend for progress reporting ("none", for
  example, results in no progress bars being shown).  ([#3396][])
- A new progress backend, available by setting datalad.ui.progressbar
  to "log", replaces progress bars with a log message upon completion
  of an action.  ([#3396][])
- DataLad learned to consult the [NO_COLOR][] environment variable and
  the new `datalad.ui.color` configuration option when deciding to
  color output.  The default value, "auto", retains the current
  behavior of coloring output if attached to a TTY ([#3407][]).
- [clean][] now removes annex transfer directories, which is useful
  for cleaning up failed downloads. ([#3374][])
- [clone][] no longer refuses to clone into a local path that looks
  like a URL, making its behavior consistent with `git clone`.
  ([#3425][])
- [wtf][]
  - Learned to fall back to the `dist` package if `platform.dist`,
    which has been removed in the yet-to-be-release Python 3.8, does
    not exist.  ([#3439][])
  - Gained a `--section` option for limiting the output to specific
    sections and a `--decor` option, which currently knows how to
    format the output as GitHub's `<details>` section.  ([#3440][])

* tag '0.11.5':
  BF: make test for url download more reliable in cases where connection fails
  RF: remove stale commented out duecredit in setup.py.  It has now its own section
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants