Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BF+RF: unlock: Rewrite using status() #3459

Merged
merged 10 commits into from
Jun 1, 2019
Merged

Conversation

kyleam
Copy link
Contributor

@kyleam kyleam commented May 25, 2019

If a file without content is passed to unlock(), it yields a
result (with a status of "impossible") saying that the file doesn't
have content.  But if the files without content weren't explicitly
provided as paths (e.g., no path was given or a directory containing
the files was given as the path), then files without content lead to a
CommandError from the underlying 'git annex unlock' call.  (Or in v6+,
they lead to the pointer being unlocked which is inconsistent with the
error given when a file is explicitly specified in v6+ and also
potentially confusing to callers.)

To fix the "no paths specified, missing content" behavior, rely on the
information provided by status() to restrict the unlock calls to files
with content.  This fix raises the question of how to report files
without content when the path isn't explicitly given.  Because
'datalad unlock' at the dataset or a directory level could potentially
lead to a lot of results, avoid reporting "can't unlock" results when
explicit paths aren't given.  Doing this at the __call__ level rather
than as a result filter means we can do the faster status(...,
untracked=no") query unless explicit, non-directory paths are given.

Aside from the fix above, this tries to match the behavior of the
previous unlock().  The main behavioral change, shown by the test
updates, is that unlock() no longer infers the dataset from a path
when the current working directory is outside a dataset and no dataset
is explicitly given.  This behavioral change is consistent with, for
example, distribution.add vs core.local.save.

This nudges annotate_paths() closer to retirement (gh-3368).

If the default renderer _were_ to run with the status results, it
would fail with

  relpath() argument must be str or bytes, not 'PosixPath'
  [genericpath.py:_check_arg_types:143] (TypeError)

This codepath isn't accessible when calling the status() command
because status() uses a tailored renderer as its custom renderer (see
dataladgh-2481).  However, if another command that uses the default renderer
runs status() with on_failure="ignore" and then yields the result, as
unlock() will do in the next commit, then we will hit the above
failure.
@kyleam
Copy link
Contributor Author

kyleam commented May 25, 2019

This fix raises the question of how [...]

I was on the fence about this part. Feedback welcome.

@kyleam
Copy link
Contributor Author

kyleam commented May 25, 2019

Demo of main failure
#!/bin/sh

set -eu

cd $(mktemp -dt dl-XXXXXXX)
datalad create
echo a >a
echo b >b
datalad save
datalad drop --nocheck a
datalad unlock

Before:

[INFO   ] Creating a new annex repo at /tmp/dl-EQ13qlv 
create(ok): /tmp/dl-EQ13qlv (dataset)
add(ok): b (file)
add(ok): a (file)
save(ok): . (dataset)
action summary:
  add (ok: 2)
  save (ok: 1)
drop(ok): /tmp/dl-EQ13qlv/a (file)
CommandError: command '['git', '-c', 'receive.autogc=0', '-c', 'gc.auto=0', 'annex', 'unlock', '--', '.']' failed with exitcode 1
Failed to run ['git', '-c', 'receive.autogc=0', '-c', 'gc.auto=0', 'annex', 'unlock', '--', '.'] under '/tmp/dl-EQ13qlv'. Exit code=1. out=unlock a 
failed
unlock b (copying...) ok
 err=  content not present; cannot unlock
git-annex: unlock: 1 failed

unlock a 
failed
unlock b (copying...) ok
  content not present; cannot unlock
git-annex: unlock: 1 failed

After:

[INFO   ] Creating a new annex repo at /tmp/dl-yPVgt3v
create(ok): /tmp/dl-yPVgt3v (dataset)
add(ok): b (file)
add(ok): a (file)
save(ok): . (dataset)
action summary:
  add (ok: 2)
  save (ok: 1)
drop(ok): /tmp/dl-yPVgt3v/a (file)
unlock(ok): b (file)

@mih mih mentioned this pull request May 25, 2019
15 tasks
@codecov
Copy link

codecov bot commented May 26, 2019

Codecov Report

Merging #3459 into master will decrease coverage by 0.27%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3459      +/-   ##
==========================================
- Coverage   91.27%   90.99%   -0.28%     
==========================================
  Files         272      272              
  Lines       34834    34855      +21     
==========================================
- Hits        31796    31718      -78     
- Misses       3038     3137      +99
Impacted Files Coverage Δ
datalad/core/local/run.py 100% <100%> (ø) ⬆️
datalad/interface/unlock.py 100% <100%> (ø) ⬆️
datalad/interface/tests/test_unlock.py 100% <100%> (ø) ⬆️
datalad/interface/utils.py 95.33% <100%> (+0.03%) ⬆️
datalad/support/due.py 48% <0%> (-32%) ⬇️
datalad/ui/utils.py 46.87% <0%> (-25%) ⬇️
datalad/tests/test_misc.py 90.47% <0%> (-9.53%) ⬇️
datalad/metadata/tests/test_extract_metadata.py 91.3% <0%> (-8.7%) ⬇️
datalad/log.py 82.69% <0%> (-7.22%) ⬇️
datalad/downloaders/http.py 82.53% <0%> (-3.97%) ⬇️
... and 17 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update afdaca2...a7551eb. Read the comment docs.

@codecov
Copy link

codecov bot commented May 26, 2019

Codecov Report

Merging #3459 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3459      +/-   ##
==========================================
- Coverage   91.27%   91.27%   -0.01%     
==========================================
  Files         272      272              
  Lines       34834    34865      +31     
==========================================
+ Hits        31796    31822      +26     
- Misses       3038     3043       +5
Impacted Files Coverage Δ
datalad/core/local/run.py 100% <100%> (ø) ⬆️
datalad/interface/unlock.py 100% <100%> (ø) ⬆️
datalad/interface/tests/test_unlock.py 100% <100%> (ø) ⬆️
datalad/interface/utils.py 95.33% <100%> (+0.03%) ⬆️
datalad/downloaders/tests/test_http.py 91.08% <0%> (-1.24%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update afdaca2...ced581e. Read the comment docs.

@kyleam kyleam force-pushed the status-powered-unlock branch 2 times, most recently from c15383a to 190c64c Compare May 26, 2019 04:37
If a file without content is passed to unlock(), it yields a
result (with a status of "impossible") saying that the file doesn't
have content.  But if the files without content weren't explicitly
provided as paths (e.g., no path was given or a directory containing
the files was given as the path), then files without content lead to a
CommandError from the underlying 'git annex unlock' call.  (Or in v6+,
they lead to the pointer being unlocked which is inconsistent with the
error given when a file is explicitly specified in v6+ and also
potentially confusing to callers.)

To fix the "no paths specified, missing content" behavior, rely on the
information provided by status() to restrict the unlock calls to files
with content.  This fix raises the question of how to report files
without content when the path isn't explicitly given.  Because
'datalad unlock' at the dataset or a directory level could potentially
lead to a lot of results, avoid reporting "can't unlock" results when
explicit paths aren't given.  Doing this at the __call__ level rather
than as a result filter means we can do the faster status(...,
untracked=no") query unless explicit, non-directory paths are given.

Aside from the fix above, this tries to match the behavior of the
previous unlock().  The main behavioral change, shown by the test
updates, is that unlock() no longer infers the dataset from a path
when the current working directory is outside a dataset and no dataset
is explicitly given.  This behavioral change is consistent with, for
example, distribution.add vs core.local.save.

This nudges annotate_paths() closer to retirement (dataladgh-3368).
@mih
Copy link
Member

mih commented May 26, 2019

First of all: Awesome! Thx for approaching this!

Quick note (primarily to myself, as I could not look at this in detail yet): Make sure that we make use of the Repo.diffstatus() caching ability for ls-files/tree calls, in order to avoid possibly duplicate calls.

datalad/interface/unlock.py Show resolved Hide resolved
datalad/interface/unlock.py Outdated Show resolved Hide resolved
datalad/interface/unlock.py Show resolved Hide resolved
# Do the actual unlocking.
for ds_path, files in to_unlock.items():
ds = Dataset(ds_path)
for r in ds.repo.unlock(files):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to recode file names to be relative to the repository. Otherwise, repos in symlinked paths will not work.

f.relative_to(ds.pathobj) for f in to_unlock...

if fs are Path instances.

If anyhow possible, it would be great to also incorporate AnnexRepo.unlock() into this RF:

  • shed normalize_paths deocrator -- which prevents actual generator-like behavior
  • call git annex unlock with --json --json-error-messages and relay the full results, rather than manual parsing
  • strip options argument -- there are no non-generic options supported by git annex unlock, our API should reflect that, and not make unlock-specific exceptions.
  • strip AnnexRepo.lock() from the codebase -- unused, untested, no high-level API companion (a simple save does the job)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to recode file names to be relative to the repository. Otherwise, repos in symlinked paths will not work.

Quick testing with a repo that had a symlink in the leading path, things seem to work, but I'm may not be testing the right case. I know you've dealt with this issue a good amount and I have no reason not to use a relative path, so I'll go with your suggestion.

If anyhow possible, it would be great to also incorporate AnnexRepo.unlock() into this RF [...]

Hmm looking into that hadn't occurred to me. Will do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Started to look into AnnexRepo.unlock()...

  • shed normalize_paths deocrator -- which prevents actual generator-like behavior

I'm for dropping normalize_paths (here and in general), but isn't any useful generator-like behavior also prevented by git annex unlock not supporting --batch? So, assuming we use _run_annex_command_json(), we get a list of json objects, and it doesn't seem particularly useful to, say, yield those in turn rather than just return the list.

  • call git annex unlock with --json --json-error-messages and relay the full results, rather than manual parsing

I agree we should use --json. --json-error-messages sounds like it'd be useful, but I'm not sure it is in its current form:

#!/bin/sh

set -eu

cd $(mktemp -dt gx-XXXXXXX)

git init
git annex init
git annex unlock --json --json-error-messages notthere
Initialized empty Git repository in /tmp/gx-rHLSVpQ/.git/
init  ok
(recording state in git...)
git-annex: notthere not found
git-annex: unlock: 1 failed
  • strip options argument -- there are no non-generic options supported by git annex unlock, our API should reflect that, and not make unlock-specific exceptions.

Sounds fine to me.

  • strip AnnexRepo.lock() from the codebase -- unused, untested, no high-level API companion (a simple save does the job)

Also OK with me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I've pushed updates to AnnexRepo.lock(). It only partially addresses your suggestions.

It

  • drops lock()
  • removes unlock's options parameter.
  • uses --json (via _run_annex_command_json).

It does not

  • strip the normalize_paths decorator
    I think it'd make more sense to do wholesale removal of it from methods rather than incrementally dropping it from existing methods.
  • bother using --json-error-messages for the reasons I noted above

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FTR: I disagree with your conclusion re not stripping normalize_paths. It silently changes the shape of the path going in an out. Personally I think there should be a single pattern: high-level code hands over path to a repo method in relative form, or absolute pointing inside ds.repo.path. Repo method must never try to be clever and guess how a caller might want the results and report paths in full and leave any recoding to the top-level code.

That being said, not removing it doesn't make it worse than it is, and the PR is still a massive net improvement.

@kyleam
Copy link
Contributor Author

kyleam commented May 26, 2019

@mih Thanks for your feedback!

@kyleam kyleam added the WIP work in progress label May 26, 2019
For better handling of symlinks in the leading path.  See
<datalad#3459 (comment)>.
If the file is already unlocked, it is more appropriate to tell the
caller to say that it is not needed rather than impossible.  2080df1
used impossible under the incorrect impression that this is what the
old unlock() did.
This isn't used or tested, there's no high-level lock dataset method,
and save() can be used to keep the modifications and switch back to a
locked state, so let's drop this until it's clear there's a need for
it.

Re: datalad#3459 (comment)
It's not currently used.  Let's drop it until it's clear we need it.

Re: datalad#3459 (comment)
Use the json output rather than manually parsing the plain text
output.  This makes the handling less brittle, but it doesn't improve
the efficiency (unlock doesn't support --batch) or the lack of error
information (--json-error-messages doesn't doesn't result in an unlock
error being represented in as a json record).

Re: datalad#3459 (comment)
@kyleam kyleam removed the WIP work in progress label May 28, 2019
@mih
Copy link
Member

mih commented Jun 1, 2019

Cool! Thx much!

@mih mih merged commit 3fd8b2a into datalad:master Jun 1, 2019
@kyleam kyleam deleted the status-powered-unlock branch June 3, 2019 12:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants