Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BF: Prefer committed subdataset location info over guesses #5760

Merged
merged 3 commits into from Jul 11, 2021

Conversation

mih
Copy link
Member

@mih mih commented Jun 23, 2021

gh-5033 describes a situation where the installation of a sub-subdataset
fails due to the particular condition of the checkout the subdataset
was cloned from. The issues originally expressed the desire for a more
in-depth exploration of possible superdatasets and their subdataset
configurations. However, the undesirable failure can also be avoided by
simply preferring the location information that is committed in each
superdataset, at all clone iterations on the way to a target subdataset.

Effectively, the change reorders the locations priorities from

  1. Assumed availability within the assumed existing checkout of the
    superdataset
  2. Location info from .gitmodules

to

  1. Location info from .gitmodules
  2. Other candidates and guesses

Specifically, the following URLs will be tried by default:

  1. original URL given to datalad when registering the subdataset
  2. url from .gitmodules
  3. other candidates

Changes to this default order are possible via configuration changes, of
course.

Resolves #5033

@codecov
Copy link

codecov bot commented Jun 23, 2021

Codecov Report

Merging #5760 (c6ad6e4) into master (3fd9f66) will decrease coverage by 2.72%.
The diff coverage is 100.00%.

❗ Current head c6ad6e4 differs from pull request most recent head 99ea5c1. Consider uploading reports for the commit 99ea5c1 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5760      +/-   ##
==========================================
- Coverage   90.62%   87.90%   -2.73%     
==========================================
  Files         309      306       -3     
  Lines       42318    42304      -14     
==========================================
- Hits        38350    37186    -1164     
- Misses       3968     5118    +1150     
Impacted Files Coverage Δ
datalad/distribution/get.py 97.90% <ø> (ø)
datalad/distribution/tests/test_get.py 100.00% <ø> (ø)
datalad/distribution/tests/test_publish.py 100.00% <100.00%> (ø)
datalad/distribution/tests/test_update.py 99.12% <100.00%> (-0.88%) ⬇️
datalad/plugin/wtf.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/plugin/addurls.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/plugin/no_annex.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/plugin/add_readme.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/plugin/check_dates.py 0.00% <0.00%> (-100.00%) ⬇️
datalad/support/tests/utils.py 0.00% <0.00%> (-100.00%) ⬇️
... and 104 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3fd9f66...99ea5c1. Read the comment docs.

@mih mih marked this pull request as draft June 23, 2021 08:42
@mih
Copy link
Member Author

mih commented Jun 23, 2021

Two interesting and, superficially, surprising test failures:

ERROR: datalad.distribution.tests.test_update.test_update_simple
File "/home/appveyor/dlvenv/lib/python3.8/site-packages/datalad/distribution/tests/test_update.py", line 193, in test_update_simple
    dest.get('2/load.dat')
[{'action': 'get',
  'message': 'path does not exist',
  'path': '/home/appveyor/DLTMP/datalad_temp_test_update_simple27w7x8_3/2/load.dat',
  'refds': '/home/appveyor/DLTMP/datalad_temp_test_update_simple27w7x8_3',
  'status': 'impossible'}]

FAIL: datalad.distribution.tests.test_publish.test_publish_recursive
File "/home/appveyor/dlvenv/lib/python3.8/site-packages/datalad/distribution/tests/test_publish.py", line 395, in test_publish_recursive
    assert_status(('ok', 'notneeded'), res_)
  File "/home/appveyor/dlvenv/lib/python3.8/site-packages/datalad/tests/utils.py", line 1265, in assert_status
    raise AssertionError('Test {}/{}: expected status {} not found in...

Confirmed locally!

Update: so this is complicated, but not entirely unexpected. The situation in the update case is a rather twisted test setup. A prepared testrepo dataset (with subdatasets) is cloned to place A, then its origin "forgotten". Then it is cloned to place B where tests are performed. Close to the end of the test a subdataset checkout under A is modified, and the test fails because the subdataset under B was never cloned from A (explicit file:// URL config pointing to the testrepo origin), so it does not receive this update.

A similar situation was the source of the publish test failure.

Overall, given that only two tests, out of the variety of test setups we carry, fail gives me confidence that this change is meaningful and rather non-intrusive. The only failing tests were those that explicitly tried to "decouple" from their origin, and this decoupling now needs to be done more thoroughly. In the standard case, decoupling from origin is not desirable, and the fact that it could happen unintentionally is the very reason for #5033.

mih added 3 commits June 25, 2021 11:20
dataladgh-5033 describes a situation where the installation of a sub-subdataset
fails due to the particular condition of the _checkout_ the subdataset
was cloned from. The issues originally expressed the desire for a more
in-depth exploration of possible superdatasets and their subdataset
configurations. However, the undesirable failure can also be avoided by
simply preferring the location information that is committed in each
superdataset, at all clone iterations on the way to a target subdataset.

Effectively, the change reorders the locations priorities from

1. Assumed availability within the assumed existing checkout of the
   superdataset
2. Location info from .gitmodules

to

1. Location info from .gitmodules
2. Other candidates and guesses

Specifically, the following URLs will be tried by default:

1. original URL given to datalad when registering the subdataset
2. `url` from .gitmodules
3. other candidates

Changes to this default order are possible via configuration changes, of
course.

Resolves datalad#5033
The configuration of the test dataset is adjusted to reflect the
assumptions of the test, and maintain prior behavior despite the
change in the default behavior.
Besides explicitly setting up the superdataset with relative submodule
URLs this also required tuning some test conditions, because the
submodule addition is not longer the last change on the branch.
@mih mih marked this pull request as ready for review June 25, 2021 10:09
Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the minimalism of the change but have not yet fully grasped (and thus left comments) on either the priorities tune up requires tune up of .gitmodules in the test(s)?

# uncouple subdataset from testrepo sources after recursive install
# to make this clone the source of all `get` attempts
for sub in origin.subdatasets(result_xfm=lambda x: x['gitmodule_name']):
origin.subdatasets(path=sub, set_property=[('url', './{}'.format(sub))])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was the intention to provide a "new" test scenario or due to code changes it became required to alter url entries?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is a direct consequence of #5033. The testrepos configure an explicit URL in .gitmodules. But here we want to test what happens if there is a relative URL (it shall clone from the initial clone target done in this test). With the change introduced by the PR we would clone from the true origin. So in order to maintain the premise of the test, we need to reconfigure to a relative URL

# also forget the declared absolute location of the submodules, and turn them
# relative to this/a clone
for sub in source.subdatasets(result_xfm=lambda x: x['gitmodule_name']):
source.subdatasets(path=sub, set_property=[('url', './{}'.format(sub))])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here -- a new scenario or needed (i.e. something what worked before, no longer works)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and same answer as above. The very nature of this PR makes this change necessary.

@mih mih added this to the 0.15.0 milestone Jul 11, 2021
@mih
Copy link
Member Author

mih commented Jul 11, 2021

FTR: Now that #5749 is merged, this PR is needed in master in order to not break scenarios like the remodnav executable paper presented in the handbook. Current master still prefers a "local" checkout of a subdataset to clone from. Now that URL is mapped to a Github project URL (with #5749), and datalad asks for credentials to a project that does not exist (in the case of that paper). The relevant submodule record is:

[submodule "remodnav/tests/data/anderson_etal"]
        path = remodnav/tests/data/anderson_etal
        url = https://github.com/richardandersson/EyeMovementDetectorEvaluation.git

and this URL is the one that must be cloned.

With this PR on top of master, the desirable behavior is maintained/restored.

I added the 0.15 milestone. We should not release with such a change in behavior.

@yarikoptic
Copy link
Member

thanks @mih. let's proceed and see where sea takes us ;-)

@yarikoptic yarikoptic merged commit 86a55c8 into datalad:master Jul 11, 2021
@mih mih deleted the bf-5033 branch August 26, 2021 13:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

install/get: ideally should add to candidates a URL based on super's
2 participants