Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BF: be able to commit annexed changed files which potentially jump between git/annex #3009

merged 11 commits into from Nov 27, 2018


Copy link

@yarikoptic yarikoptic commented Nov 23, 2018

This pull request fixes #1651, fixes #2752. Ran into it again in the course of preparing PR for #3007

  • includes some minor refactoring and BFing around the code

  • overall it boils down to adopting our "hack" for direct mode to be applied generally when we are working with a commit for specific subset of files. All the files listed to be committed and which were not staged, would get first annex add-ed, index would be copied into a temporary one, staged files which are not to be committed, reset from that index, we commit, and if we added some files explicitly - we reset original index to match the committed state.

What is more "useful" in this PR is actually the unittest which triggers original failure. Now I am testing only with first optionally unlocking of the files to be edited, but we should test also with "unannex"ing first (left TODO for the future or revolution).


side effect 1:

ERROR: datalad.interface.tests.test_save.test_recursive_save
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nose/", line 197, in runTest
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/datalad/tests/", line 596, in newfunc
    return t(*(arg + (filename,)), **kw)
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/datalad/interface/tests/", line 230, in test_recursive_save"savingtestmessage", super_datasets=True)
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/venv-ci/local/lib/python2.7/site-packages/wrapt/", line 562, in __call__
    args, kwargs)
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/datalad/distribution/", line 494, in apply_func
    return f(**kwargs)
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/venv-ci/local/lib/python2.7/site-packages/wrapt/", line 523, in __call__
    args, kwargs)
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/datalad/interface/", line 479, in eval_func
    return return_func(generator_func)(*args, **kwargs)
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/venv-ci/local/lib/python2.7/site-packages/wrapt/", line 523, in __call__
    args, kwargs)
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/datalad/interface/", line 467, in return_func
    results = list(results)
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/datalad/interface/", line 422, in generator_func
    result_renderer, result_xfm, _result_filter, **_kwargs):
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/datalad/interface/", line 509, in _process_results
    for res in results:
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/datalad/interface/", line 398, in __call__
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/datalad/interface/", line 142, in save_dataset
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/datalad/support/", line 2941, in commit
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/datalad/support/", line 1308, in commit
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/datalad/support/", line 1256, in _git_custom_command
    return super(AnnexRepo, self)._git_custom_command(*args, **kwargs)
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/datalad/support/", line 313, in newfunc
    result = func(self, files_new, *args, **kwargs)
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/datalad/support/", line 1747, in _git_custom_command
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/datalad/", line 673, in run
    cmd, env=self.get_git_environ_adjusted(env), *args, **kwargs)
  File "/home/buildbot/datalad-pr-docker-dl-nd16_04/build/datalad/", line 533, in run
    raise CommandError(str(cmd), msg, status, out[0], out[1])
CommandError: CommandError: command '['git', '-c', 'receive.autogc=0', '-c', '', 'commit', '-m', 'savingtestmessage']' failed with exitcode 1
Failed to run ['git', '-c', 'receive.autogc=0', '-c', '', 'commit', '-m', 'savingtestmessage'] under u'/tmp/datalad_temp_test_recursive_savewPsceM/sub'. Exit code=1. out=On branch master
Changes not staged for commit:
	modified:   subsub (untracked content)

Untracked files:

no changes added to commit
yarikoptic added 7 commits Nov 23, 2018
So we could create over existing annexed content
…mode) committing a list of files

This should avoid the issue of a plain "commit -- files" to fail if annex
migration happens in the background with

    Cannot make a partial commit with unlocked annexed files
…e annexed file(s)

A small price to pay to make sure that behaves consistently
one thing at a time, so postponed to dig into this for now
I have disabled careless mode while testing and forgot to enable it back.
That resulted in a commit attempt to commit nothing failing
Copy link

@codecov codecov bot commented Nov 23, 2018

Codecov Report

Merging #3009 into master will increase coverage by 0.01%.
The diff coverage is 97.91%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3009      +/-   ##
+ Coverage   90.36%   90.38%   +0.01%     
  Files         245      245              
  Lines       32270    32307      +37     
+ Hits        29161    29200      +39     
+ Misses       3109     3107       -2
Impacted Files Coverage Δ
datalad/tests/ 89.38% <100%> (ø) ⬆️
datalad/support/ 90.24% <100%> (+0.24%) ⬆️
datalad/support/ 88.38% <100%> (+0.27%) ⬆️
datalad/ 86.36% <100%> (+0.03%) ⬆️
datalad/support/tests/ 96.4% <96.42%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ec128a6...f5d6a6f. Read the comment docs.

Copy link
Member Author

@yarikoptic yarikoptic commented Nov 24, 2018

ok, seems to be good now. Here is some benchmark comparison although yet to check what was up with all the ls (shouldn't be relevant here anyways);

$> asv compare -m hopa master HEAD       

All benchmarks:

       before           after         ratio
     [5663d7ef]       [820c875d]
     <master>         <bf-commit-changed-annex>
       1.58±0.02s       1.68±0.03s     1.06  api.supers.time_createadd [hopa/virtualenv-py2.7]
       1.58±0.02s       1.64±0.01s     1.04  api.supers.time_createadd [hopa/virtualenv-py3.4]
       1.57±0.03s       1.65±0.01s     1.05  api.supers.time_createadd_to_dataset [hopa/virtualenv-py2.7]
       1.54±0.02s       1.64±0.02s     1.06  api.supers.time_createadd_to_dataset [hopa/virtualenv-py3.4]
       4.71±0.02s       4.67±0.02s     0.99  api.supers.time_installr [hopa/virtualenv-py2.7]
       4.15±0.02s       4.18±0.01s     1.01  api.supers.time_installr [hopa/virtualenv-py3.4]
           failed           failed      n/a  api.supers.time_ls [hopa/virtualenv-py2.7]
           failed           failed      n/a  api.supers.time_ls [hopa/virtualenv-py3.4]
           failed           failed      n/a  api.supers.time_ls_recursive [hopa/virtualenv-py2.7]
           failed           failed      n/a  api.supers.time_ls_recursive [hopa/virtualenv-py3.4]
           failed           failed      n/a  api.supers.time_ls_recursive_long_all [hopa/virtualenv-py2.7]
           failed           failed      n/a  api.supers.time_ls_recursive_long_all [hopa/virtualenv-py3.4]
       2.46±0.02s       2.48±0.02s     1.01  api.supers.time_remove [hopa/virtualenv-py2.7]
       2.43±0.01s       2.41±0.01s     0.99  api.supers.time_remove [hopa/virtualenv-py3.4]
          495±5ms          488±9ms     0.99  api.testds.time_create_test_dataset1 [hopa/virtualenv-py2.7]
         483±10ms         491±10ms     1.02  api.testds.time_create_test_dataset1 [hopa/virtualenv-py3.4]
       3.04±0.02s       3.06±0.01s     1.01  api.testds.time_create_test_dataset2x2 [hopa/virtualenv-py2.7]
       3.66±0.01s       3.86±0.02s     1.05  api.testds.time_create_test_dataset2x2 [hopa/virtualenv-py3.4]
       2.33±0.1ms       2.35±0.2ms     1.01  core.runner.time_echo [hopa/virtualenv-py2.7]
      1.98±0.08ms       2.02±0.1ms     1.02  core.runner.time_echo [hopa/virtualenv-py3.4]
       2.70±0.1ms      3.23±0.09ms    ~1.19  core.runner.time_echo_gitrunner [hopa/virtualenv-py2.7]
      2.13±0.07ms       2.32±0.2ms     1.09  core.runner.time_echo_gitrunner [hopa/virtualenv-py3.4]
         450±10ms          453±6ms     1.00  core.startup.time_help_np [hopa/virtualenv-py2.7]
          517±5ms          531±4ms     1.03  core.startup.time_help_np [hopa/virtualenv-py3.4]
         82.0±4ms         79.1±2ms     0.96  core.startup.time_import [hopa/virtualenv-py2.7]
          109±3ms          108±2ms     0.99  core.startup.time_import [hopa/virtualenv-py3.4]
          399±3ms          404±9ms     1.01  core.startup.time_import_api [hopa/virtualenv-py2.7]
          463±3ms          464±6ms     1.00  core.startup.time_import_api [hopa/virtualenv-py3.4]

@yarikoptic yarikoptic requested review from mih and kyleam Nov 24, 2018
@yarikoptic yarikoptic added this to the Release 0.11.1 milestone Nov 24, 2018
Copy link
Member Author

@yarikoptic yarikoptic commented Nov 24, 2018

given that codecov isn't reachable for me ATM at all, I guess the reduced coverage is also not really trustworthy

@yarikoptic yarikoptic mentioned this pull request Nov 24, 2018
@@ -2896,55 +2896,64 @@ def commit(self, msg=None, options=None, _datalad_msg=False,
direct_mode = self.is_direct_mode()
# we might need to avoid explicit paths
files_to_commit = None if direct_mode else files
if direct_mode and files:
if files:
Copy link

@bpoldrack bpoldrack Nov 24, 2018

At least one additional condition needed: "We are not in v6+"

Copy link

@kyleam kyleam commented Nov 25, 2018

I don't know. I've given a pass through the commit messages, linked references, and code, but I still have a very flimsy understanding of the problem and potential consequences of the proposed workaround. I'd need to spend substantially more time to gain any confidence that this is a sensible approach.

I don't see a reason to rush this into a release, as you suggest you'd like to in #3012.

Copy link
Member Author

@yarikoptic yarikoptic commented Nov 25, 2018

The reason to rush it as a part of the release is

  • metadata aggregation cannot work correctly if metadata file is to be rewritten (as old as #1651, #2752)
  • actually nothing can work correctly if annex files is rewritten and might be considered for migration between git/annex - no other solution (but possibly some in revolution) was suggested and i kept running into this over and over again
  • this PR for the first time introduces a test which points to this issue and fixes it (at least as far as the test goes) without breaking any other test. So I have some confidence. The only major fear I have is further performance impact (although I've tried to hide it while discussing with @bpoldrack ;))

Copy link

@kyleam kyleam commented Nov 25, 2018

The reason to rush it as a part of the release is [...]

I don't view any of those as convincing reasons to rush this into an unrelated 0.11.1 emergency release.

Copy link
Member Author

@yarikoptic yarikoptic commented Nov 25, 2018

And what would be the reason(s) to not include it in otherwise a generic bugfix release? that it wasn't tested enough? define "enough" then.
I am not planing to make release just for the annex emergency fix (plenty of other stuff to do), it would incorporate everything we have so far in master, so adding another fix is imho worthwhile. I promise to deal with the fallout (as always anyways) if any.

Copy link

@kyleam kyleam commented Nov 26, 2018

And what would be the reason(s) to not include it in otherwise a generic bugfix release?

"generic" bugfix release doesn't seem to be an apt description when you tag #3012 with "!! urgent !!" and say it needs to be released "asap". Instead of "merging all links PRs", I think a better response is to make an emergency release that just addresses the single issue rather than a rushed release. But anyway, I've given my opinion, and all of that is your call of course. Plus considering you've yet to cut the release, I'll update my interpretation of "!! urgent !!", particularly when requested to review something.

@yarikoptic yarikoptic merged commit f5d6a6f into datalad:master Nov 27, 2018
5 of 7 checks passed
yarikoptic added a commit that referenced this issue Nov 27, 2018
	## 0.11.1 (Nov 25, 2018) -- v7-better-than-v6

	Rushed out bugfix release to stay fully compatible with recent
	[git-annex] which introduced v7 to replace v6.

	### Fixes

	- [install]: be able to install recursively into a dataset ([#2982])
	- [save]: be able to commit/save changes whenever files potentially
	  could have swapped their storage between git and annex
	  ([#1651]) ([#2752]) ([#3009])
	- [aggregate-metadata]:
	  - dataset's itself is now not "aggregated" if specific paths are
		provided for aggregation ([#3002]). That resolves the issue of
		`-r` invocation aggregating all subdatasets of the specified dataset
		as well
	  - also compare/verify the actual content checksum of aggregated metadata
		while considering subdataset metadata for re-aggregation ([#3007])
	- `annex` commands are now chunked assuming 50% "safety margin" on the
	  maximal command line length. Should resolve crashes while operating
	  ot too many files at ones ([#3001])
	- `run` sidecar config processing ([#2991])
	- no double trailing period in docs ([#2984])
	- correct identification of the repository with symlinks in the paths
	  in the tests ([#2972])
	- re-evaluation of dataset properties in case of dataset changes ([#2946])
	- [text2git] procedure to use `ds.repo.set_gitattributes`
	  ([#2974]) ([#2954])
	- Switch to use plain `os.getcwd()` if inconsistency with env var
	  `$PWD` is detected ([#2914])
	- Make sure that credential defined in env var takes precedence
	  ([#2960]) ([#2950])

	### Enhancements and new features

	- [shub://datalad/datalad:git-annex-dev](
	  provides a Debian buster Singularity image with build environment for
	  [git-annex]. [tools/bisect-git-annex]() provides a helper for running
	  `git bisect` on git-annex using that Singularity container ([#2995])
	- Added [.zenodo.json]() for better integration with Zenodo for citation
	- [run-procedure] now provides names and help messages with a custom
	  renderer for ([#2993])
	- Documentation: point to [datalad-revolution] extension (prototype of
	  the greater DataLad future)
	- [run]
	  - support injecting of a detached command ([#2937])
	- `annex` metadata extractor now extracts `annex.key` metadata record.
	  Should allow now to identify uses of specific files etc ([#2952])
	- Test that we can install from
	- Proper rendering of `CommandError` (e.g. in case of "out of space"
	  error) ([#2958])

* tag '0.11.1':
  Adjust the date -- 25th fell through due to __version__ fiasco
  BF+ENH(TST): boost hardcoded version + provide a test to guarantee consistency in the future
  This (expensive) approach is not needed in v6+
  small tuneup to changelog
@yarikoptic yarikoptic deleted the bf-commit-changed-annex branch Feb 17, 2019
kyleam added a commit that referenced this issue Apr 29, 2019
As of gh-3009, if path arguments are given, AnnexRepo.commit() uses a
temporary index that excludes any staged changes that are not included
in the specified paths.  It does this to avoid git-annex failing with
"Cannot make a partial commit with unlocked annexed files".  This
logic broke calling 'datalad save' with a subdirectory argument
because the subdirectory was incorrectly counted as a "staged change
not be committed".

Fix this by teaching get_changed_files() how to restrict its output to
the specified paths and then finding the "staged changes not be
committed" by comparing the output of the file-restricted and
non-restricted calls of get_changed_files(staged=True).

Note: rev-save does not go through AnnexRepo.commit(), so there is no
need to add a switch to disable this in master.
kyleam added a commit to kyleam/datalad that referenced this issue Apr 30, 2019
git-annex's pre-commit hook throws the following error if an unlocked
file is given in a 'git commit PATH ...' call:

  Cannot make a partial commit with unlocked annexed files. You should
  `git annex add` the files you want to commit, and then run git

The motivation for this is explained in git-annex's
adc5ca70a (pre-commit: Block partial commit of unlocked annexed file,
since that left a typechange staged in index, 2014-11-10)

In the typical case, we could fix this by running 'git annex add'
before the 'git commit PATH' call.  (And 'datalad save' already does
this, so this is an issue only when directly calling
AnnexRepo.commit().)  *But* a complication is introduced when 'git
annex add' adds the file to index as a regular file while the file in
HEAD is an annex symbolic link (e.g., the common case would be the
file size shrinking below the threshold specified via
annex.largefiles).  git-annex can't distinguish the typechange of the
added file from that of an unlocked file, so 'git commit PATH' fails
with the partial commit error [*].

To work around this, dataladgh-3009 taught AnnexRepo.commit() to stage the
specified paths in a temporary index and then call 'git commit'
without the pathspec.  After the commit, we need to reset the main
index to the state in HEAD to reflect changes that git-annex might
have made in the pre-commit hook (e.g., changing the regular file to
an annex symlink).  dataladgh-3009 did this for the specified paths that had
unstaged changes, but we need to do the same thing for paths with
staged changes.  Otherwise, calling 'datalad save PATH' with a path
moved below the annex.largefiles cutoff will show a typechange after
the save call.

kyleam added a commit to kyleam/datalad that referenced this issue Apr 30, 2019
We decide to commit with a temporary index if there are any files with
staged changes that should _not_ be committed or if any of the
specified paths have unstaged changes.  The latter actually doesn't
require a temporary index; we just need to update the paths in the
current index, as we already do.

As a side note, by adding the files, dataladgh-3009 introduced a subtle
change in behavior.  Before the specified files would be added to git
by 'git commit'.  After dataladgh-3009, they are added with 'git annex add'.
This is unlikely to matter because the behavior was undefined and is
only visible when calling AnnexRepo.commit() directly as save adds the
paths before calling .commit().
yarikoptic added a commit that referenced this issue May 7, 2019
Merge pull request #3365 from kyleam/annex-shrink-fixup
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
None yet
None yet
3 participants