Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

status: Pass repo.pathobj-based paths to get_content_annexinfo() #4760

Merged
merged 1 commit into from Jul 28, 2020

Conversation

kyleam
Copy link
Collaborator

@kyleam kyleam commented Jul 22, 2020

This change avoids the two failures in the _DL_TMPDIR="/var/tmp/sym link" job that were triggered by running with a recent git-annex (gh-4756).

@kyleam kyleam force-pushed the git-annex-symlink-fail branch from 8298c58 to 5281005 Compare Jul 22, 2020
@kyleam kyleam marked this pull request as ready for review Jul 22, 2020
@codecov
Copy link

codecov bot commented Jul 22, 2020

Codecov Report

Merging #4760 into maint will increase coverage by 0.01%.
The diff coverage is 83.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##            maint    #4760      +/-   ##
==========================================
+ Coverage   89.60%   89.62%   +0.01%     
==========================================
  Files         288      288              
  Lines       40197    40225      +28     
==========================================
+ Hits        36019    36050      +31     
+ Misses       4178     4175       -3     
Impacted Files Coverage Δ
datalad/core/local/tests/test_status.py 96.33% <82.35%> (-2.59%) ⬇️
datalad/core/local/status.py 98.19% <100.00%> (+0.01%) ⬆️
datalad/support/tests/test_sshconnector.py 97.67% <0.00%> (-0.46%) ⬇️
datalad/support/annexrepo.py 86.42% <0.00%> (-0.09%) ⬇️
datalad/support/sshconnector.py 85.84% <0.00%> (-0.07%) ⬇️
datalad/core/local/create.py 94.44% <0.00%> (ø)
datalad/tests/test_config.py 99.27% <0.00%> (ø)
datalad/interface/common_cfg.py 100.00% <0.00%> (ø)
datalad/plugin/tests/test_addurls.py 100.00% <0.00%> (ø)
datalad/support/tests/test_annexrepo.py 96.58% <0.00%> (ø)
... and 9 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0dd5a5c...101b5cb. Read the comment docs.

assert_result_count(call(), 0)
else:
# As of 2a8fdfc7d (Display a warning message when asked to operate on a
# file inside a symlinked directory, 2020-05-11), git-annex will error.
Copy link
Member

@yarikoptic yarikoptic Jul 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oi... I is that if a directory within repository or somewhere up the hierarchy?

Copy link
Collaborator Author

@kyleam kyleam Jul 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oi... I is that if a directory within repository or somewhere up the hierarchy?

This test is specifically about the within directory case. In this situation, git-annex now fails instead of silently doing nothing. Either behavior is consistent with git, depending on which git command you look at :)

However, as mentioned in the commit message, "[git-annex] does so in cases like above, but also in cases where the symlinked directory is upstream of the top-level directory of the repository". That description should be improved because it doesn't explicitly mention that this is only relevant if absolute paths are used. Anyway, this is the change in behavior that is behind the _DL_TMPDIR="/var/tmp/sym link" failures.

I think this change in behavior is worth raising on git-annex's side because the git-annex change was motivated by git's behavior in the within-repo case, but it's more aggressive than git is by extending to the symlinked directories upstream of the repo.

I haven't gotten around to filing that yet, but I think, even if Joey agrees with adjusting the new behavior, we should still make the change here for at least these reasons:

  • It makes status work for the "linked upstream directory" case for git-annex >= 8.20200522, with no downside that I'm aware of.

  • status already passes these GitRepo.path-recoded paths to diffstatus. It then takes the result from that call, which is keyed by the recoded paths, and passes it as init to get_content_annexinfo, while passing the original Dataset.repo-based paths as the paths argument. When those two sets don't match, this happens to not cause any issues (because paths come out of annex find{,ref} as relative and are joined to the GitRepo.path), but it's still inconsistent and unnecessarily confusing.

  • The commands are run with GitRepo.path as the current directory. In the upstream symlink case, it's confusing to pass in absolute paths that are instead based off of Dataset.path, especially when we already have the GitRepo.path-adjusted paths at hand because we pass them to diffstatus.

Copy link
Collaborator Author

@kyleam kyleam left a comment

In the commit message I said:

status() converts paths to use .repo.pathobj (in which
symlinks have been resolved) instead of .pathobj (in which
they have not) before passing the paths to .diffstatus().

It does and this is important to note, but...

At least one advantage of this approach is that it handles symlink
scenarios like this:

<repo>
|-- bar -> <repo>/foo
`-- foo
    `-- f

In this case, some git commands like ls-files return empty when passed
the linked bar/f, and other commands like git-add complain that the
path "is beyond a symbolic link".

... saying that it helps with this situation is confusing and wrong. This above scenario is still worth describing though because it's the motivation for the changes on git-annex's side.

In contrast, status() passes the unconverted paths to
get_content_annexinfo(). However, as of 2a8fdfc7d (Display a warning
message when asked to operate on a file inside a directory that's a
symbolic link to elsewhere, 2020-05-11), git-annex behaves like
git-add and complains when asked to act on a path that "is beyond a
symbolic link". It does so in cases like above, but also in cases
where the symlinked directory is upstream of the top-level directory
of the repository.

As noted elsewhere, this should explicitly mention that the upstream scenario is relevant only when absolute paths are used.

Also, ideally git-annex would align more closely with git here and not fail in the upstream scenario. This message should argue why we should still make this adjustment on our end.

@kyleam kyleam marked this pull request as draft Jul 23, 2020
@kyleam kyleam force-pushed the git-annex-symlink-fail branch from 5281005 to 1bbfb0d Compare Jul 23, 2020
@kyleam kyleam marked this pull request as ready for review Jul 23, 2020
With a repository structured as

    <repo>
    |-- bar -> foo
    `-- foo
        `-- f

some git commands like ls-files return empty when passed the linked
bar/f, while other commands like git-add complain that the path "is
beyond a symbolic link".  In this scenario, git-annex commands take
the no-op/silent route prior to 2a8fdfc7d (Display a warning message
when asked to operate on a file inside a directory that's a symbolic
link to elsewhere, 2020-05-11), but following that commit git-annex
behaves like git-add and errors.

However, unlike git, git-annex also complains when the symlinked
directory is upstream of the top-level directory of the repository and
an absolute path is specified.  This change in behavior results in two
test failures in our test suite for the _DL_TMPDIR="/var/tmp/sym link"
job [0].  These failures have the same underlying cause: status()
calls get_content_annexinfo() with absolute paths that start with
"/var/tmp/sym link", leading to 'git annex find' emitting an error.

Avoid these failures by making status() call get_content_annexinfo()
in a way that is consistent with how it calls diffstatus().  In its
call to diffstatus(), status() converts paths to use
<dataset>.repo.pathobj (in which symlinks have been resolved) instead
of <dataset>.pathobj (in which they have not).

Also, add a new test case for the "symlinked directory within repo"
scenario depicted above, showing how git-annex-find behaves before
8.20200522 (empty result) and after (CommandError).

The change in behavior for linked directories outside of the
repository is probably worth discussing on git-annex's side.  Even if
addressed there, though, the change here is still good to make for the
following reasons.

  * It makes `status` work for the "linked upstream directory" case
    for git-annex >= 8.20200522, with no downside that I'm aware of.

  * `status` already passes these GitRepo.path-recoded paths to
    `diffstatus`.  It then takes the result from that call, which is
    keyed by the recoded paths, and passes it as `init` to
    `get_content_annexinfo`, while passing the original
    Dataset.repo-based paths as the `paths` argument.  When those two
    sets don't match, this happens to not cause any issues (because
    paths come out of 'annex find' as relative and are joined to
    GitRepo.path), but it's still inconsistent and unnecessarily
    confusing.

  * The commands are run with GitRepo.path as the current directory.
    In the upstream symlink case, it's confusing to pass in absolute
    paths that are instead based off of Dataset.path, especially when
    (1) we already have the GitRepo.path-adjusted paths at hand
    because we pass them to `diffstatus` and (2) the keys of the
    get_content_annexinfo() result will be GitRepo.path-based.

[0] https://travis-ci.org/github/datalad/datalad/jobs/710485450#L7806
@kyleam kyleam force-pushed the git-annex-symlink-fail branch from 1bbfb0d to 101b5cb Compare Jul 27, 2020
@kyleam
Copy link
Collaborator Author

kyleam commented Jul 27, 2020

I think this change in behavior is worth raising on git-annex's side because the git-annex change was motivated by git's behavior in the within-repo case, but it's more aggressive than git is by extending to the symlinked directories upstream of the repo.

Filed.

I'll plan to merge this tomorrow unless any objections are raised.

mih
mih approved these changes Jul 28, 2020
Copy link
Member

@mih mih left a comment

LGTM, thx -- especially for the detailed explanation!

@mih mih merged commit 6c82a64 into datalad:maint Jul 28, 2020
3 of 4 checks passed
@kyleam kyleam deleted the git-annex-symlink-fail branch Jul 28, 2020
@kyleam
Copy link
Collaborator Author

kyleam commented Aug 7, 2020

I think this change in behavior is worth raising on git-annex's side because the git-annex change was motivated by git's behavior in the within-repo case, but it's more aggressive than git is by extending to the symlinked directories upstream of the repo.

Filed.

For posterity: Joey fixed this in 506ffea5e (stop symlink check once the top of the working tree is reached, 2020-08-06).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants