Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

save: Fix handling of explicit paths for nested new subdatasets #5049

Merged
merged 1 commit into from
Oct 19, 2020

Conversation

kyleam
Copy link
Contributor

@kyleam kyleam commented Oct 17, 2020

After collecting the status information, save() adds a status record
for any subdatasets between the reference dataset and the target
dataset. Here's an example that shows when these extra records are
needed:

$ datalad create
$ datalad create -d. a
$ datalad create -d. a/b

During the last step, create() calls save() with "a/b" as the path.
After save() collects the status() information in paths_by_ds, the
value looks like this:

{'{top}/a': {PosixPath('{top}/a/b'): {'state': 'untracked',
                                      'type': 'directory'}}}

Before calling GitRepo.save_(), save() adds another record for "a":

{'{top}': {PosixPath('{top}/a'): {'state': 'untracked',
                                  'type': 'dataset'}},
 '{top}/a': {PosixPath('{top}/a/b'): {'state': 'untracked',
                                      'type': 'directory'}}}

If it didn't tack on the first entry, "a" would be left modified in
the top-level dataset.

This handling, however, interferes with explicitly adding new nested
subdatasets, as in the example from gh-5021:

$ datalad create
$ datalad create a
$ datalad create a/b
$ datalad save -d. a a/b

For the last step, the status information is originally

{'{top}': {PosixPath('{top}/a'): {'state': 'untracked',
                                  'type': 'directory'}},
 '{top}/a': {PosixPath('{top}/a/b'): {'state': 'untracked',
                                      'type': 'directory'}}}

The handling described above results in save() effectively switching
the type of "a" to "dataset":

{'{top}': {PosixPath('{top}/a'): {'state': 'untracked',
                                  'type': 'dataset'}},
 '{top}/a': {PosixPath('{top}/a/b'): {'state': 'untracked',
                                      'type': 'directory'}}}

When GitRepo.save_() receives status information, it doesn't properly
register the submodule: it feeds the dataset to git add rather than
to git submodule add because, as far as it can tell, "a" is an
existing submodule.

To avoid this issue, stop save() from overriding the status
information for records that have a type of "directory" and status of
"untracked", letting GitRepo.save_() treat these paths in the same way
as other unregistered submodules.

Note that this fix is only relevant when the intermediate subdatasets
are explicitly passed. For example, dropping "a" from the last
command above (i.e. datalad save -d. a/b) would still result in the
state reported in gh-5021. One way to address that would be to update
GitRepo.save_() to send "type=dataset state=untracked" records down
the same code path as "type=directory state=untracked", but hold off
on that because in some cases it would cost additional (and
unnecessary) get_content_info() calls.

Fixes #5021.

After collecting the status information, save() adds a status record
for any subdatasets between the reference dataset and the target
dataset.  Here's an example that shows when these extra records are
needed:

    $ datalad create
    $ datalad create -d. a
    $ datalad create -d. a/b

During the last step, create() calls save() with "a/b" as the path.
After save() collects the status() information in paths_by_ds, the
value looks like this:

    {'{top}/a': {PosixPath('{top}/a/b'): {'state': 'untracked',
                                          'type': 'directory'}}}

Before calling GitRepo.save_(), save() adds another record for "a":

    {'{top}': {PosixPath('{top}/a'): {'state': 'untracked',
                                      'type': 'dataset'}},
     '{top}/a': {PosixPath('{top}/a/b'): {'state': 'untracked',
                                          'type': 'directory'}}}

If it didn't tack on the first entry, "a" would be left modified in
the top-level dataset.

This handling, however, interferes with explicitly adding new nested
subdatasets, as in the example from dataladgh-5021:

    $ datalad create
    $ datalad create a
    $ datalad create a/b
    $ datalad save -d. a a/b

For the last step, the status information is originally

    {'{top}': {PosixPath('{top}/a'): {'state': 'untracked',
                                      'type': 'directory'}},
     '{top}/a': {PosixPath('{top}/a/b'): {'state': 'untracked',
                                          'type': 'directory'}}}

The handling described above results in save() effectively switching
the type of "a" to "dataset":

    {'{top}': {PosixPath('{top}/a'): {'state': 'untracked',
                                      'type': 'dataset'}},
     '{top}/a': {PosixPath('{top}/a/b'): {'state': 'untracked',
                                          'type': 'directory'}}}

When GitRepo.save_() receives status information, it doesn't properly
register the submodule: it feeds the dataset to `git add` rather than
to `git submodule add` because, as far as it can tell, "a" is an
existing submodule.

To avoid this issue, stop save() from overriding the status
information for records that have a type of "directory" and status of
"untracked", letting GitRepo.save_() treat these paths in the same way
as other unregistered submodules.

Note that this fix is only relevant when the intermediate subdatasets
are explicitly passed.  For example, dropping "a" from the last
command above (i.e. `datalad save -d. a/b`) would still result in the
state reported in dataladgh-5021.  One way to address that would be to update
GitRepo.save_() to send "type=dataset state=untracked" records down
the same code path as "type=directory state=untracked", but hold off
on that because in some cases it would cost additional (and
unnecessary) get_content_info() calls.

Fixes datalad#5021.
@codecov
Copy link

codecov bot commented Oct 17, 2020

Codecov Report

Merging #5049 into master will decrease coverage by 0.19%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5049      +/-   ##
==========================================
- Coverage   89.94%   89.74%   -0.20%     
==========================================
  Files         292      292              
  Lines       40834    41103     +269     
==========================================
+ Hits        36727    36888     +161     
- Misses       4107     4215     +108     
Impacted Files Coverage Δ
datalad/core/local/save.py 98.73% <100.00%> (+0.04%) ⬆️
datalad/core/local/tests/test_save.py 98.00% <100.00%> (+0.03%) ⬆️
datalad/distribution/tests/test_create_sibling.py 80.53% <0.00%> (-18.74%) ⬇️
datalad/support/tests/test_sshconnector.py 84.53% <0.00%> (-13.26%) ⬇️
datalad/support/sshconnector.py 82.18% <0.00%> (-5.37%) ⬇️
datalad/tests/test_direct_mode.py 44.89% <0.00%> (-5.11%) ⬇️
datalad/downloaders/http.py 81.85% <0.00%> (-2.71%) ⬇️
datalad/distribution/drop.py 96.42% <0.00%> (-2.37%) ⬇️
datalad/interface/annotate_paths.py 88.00% <0.00%> (-2.19%) ⬇️
datalad/downloaders/tests/test_http.py 88.22% <0.00%> (-1.93%) ⬇️
... and 72 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c728888...effa32c. Read the comment docs.

@yarikoptic
Copy link
Member

Thank you @kyleam ! I wondered, in the first example, why a is "untracked" in

{'{top}': {PosixPath('{top}/a'): {'state': 'untracked',
                                  'type': 'dataset'}},

if prior datalad create -d. a should have "registered" it, shouldn't it be "modified"? (If there is such state, as I am taking without knowledge of the codebase there)

@yarikoptic
Copy link
Member

And the last comment about a singular a/b invocation reminded me about #3943 . IMHO they are "related" through the implementation detail that "what are the subdatasets" discovered only once at the beginning, and does not react to the fact that an untracked subdataset might be "discovered" needing saving at run time given its subpaths (or recursion option as in #3943), and thus should become also a part of that listing to act upon (in the context of the subpaths specified).
Anyways, In addurls I believe we do have an exhaustive list of datasets boundaries and will list them all explicitly, so should be not yet hitting this other issue.
Thanks again

@yarikoptic
Copy link
Member

But can we have this for maint, not master?

@kyleam
Copy link
Contributor Author

kyleam commented Oct 17, 2020

But can we have this for maint, not master?

Sorry, I forgot to select it when opening the PR. It's based on maint already.

@kyleam kyleam changed the base branch from master to maint October 17, 2020 04:15
@kyleam
Copy link
Contributor Author

kyleam commented Oct 17, 2020

I wondered, in the first example, why a is "untracked" in

{'{top}': {PosixPath('{top}/a'): {'state': 'untracked',
                                  'type': 'dataset'}},

if prior datalad create -d. a should have "registered" it, shouldn't it be "modified"? (If there is such state, as I am taking without knowledge of the codebase there)

Good question. I started to add something about this in the commit message, but dropped it because I decided it might be getting too into the weeds.

save injects this status information in order to trigger the save of intermediate subdatasets. Whether it chooses state=untracked or state=modified doesn't matter for the result: type=dataset records with state=untracked go down the exact same handling as state=modified in GitRepo.save_. (On maint and this PR, if you s/state='untracked'/state='modified'/ in save, all of the test_save tests still pass.)

However, the path these injected records trigger is the "modified subdatset" path (both before and after this PR), and I do agree that using a state of "modified" here would be clearer. And with that change, AFAICT the state=untracked, state=dataset handling could be dropped entirely from GitRepo.save_. (Again, that's the case before and after this PR.)

Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It gets us 1 step further to correct operation, so I vote for the merge

@yarikoptic
Copy link
Member

@mih your feedback would be appreciated on this one. Merging this fix should help to finalize #5022.

@yarikoptic
Copy link
Member

Merging this one as is we would probably need to issue a follow up on #5021 with that 2nd problematic case @kyleam have identified (i.e. datalad save -d. a/b)

@yarikoptic
Copy link
Member

FWIW, I have tested merged (including fresh master) into #5022 -- test_addurls tests pass. I will push the merge as a part of #5022 to see what CI thinks about it overall

Copy link
Member

@mih mih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thx!

@mih mih merged commit cf97236 into datalad:maint Oct 19, 2020
@kyleam kyleam deleted the save-nested-ds branch October 19, 2020 13:45
@yarikoptic yarikoptic added this to the 0.13.5 milestone Dec 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

save: does not add subdataset(s) to .gitmodules if nested
3 participants