Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BF: gitrepo.save_: Don't save unspecified subdatasets #3288

Merged
merged 1 commit into from Apr 5, 2019

Conversation

@kyleam
Copy link
Contributor

@kyleam kyleam commented Apr 3, 2019

Calling save_() with explicit paths that do not include an untracked
directory will result in an empty list being passed as the paths
argument to get_content_info(). This causes get_content_info() to
operate on all content and leads to any untracked subdatasets being
saved.

Avoid this behavior by checking that there are untracked directories
before passing them as paths to get_content_info(). Another option
would be to adjust get_content_info() to only fall back to all content
when paths=None, not paths=[], but that deviates from its currently
documented behavior and might affect other callers.

Fixes #3285.

# get content info for any untracked directory
[f.relative_to(self.pathobj) for f, props in iteritems(status)
if props.get('state', None) == 'untracked' and
props.get('type', None) == 'directory'],
Copy link
Contributor Author

@kyleam kyleam Apr 3, 2019

Aside from pulling this argument out as a variable, this diff boils down to indenting the current code under a new conditional.

@kyleam
Copy link
Contributor Author

@kyleam kyleam commented Apr 3, 2019

I'll have to revisit this. The new test failed in the TMPDIR="/var/tmp/sym link" run:
https://travis-ci.org/datalad/datalad/jobs/515298810#L3897

@kyleam kyleam added the WIP label Apr 3, 2019
Copy link
Member

@yarikoptic yarikoptic left a comment

Thank you for looking into it!

Re

Another option would be to adjust get_content_info() to only fall back to all content
when paths=None, not paths=[], but that deviates from its currently
documented behavior and might affect other callers.

ATM it indeed says "If none are given, info is reported for all content.", but that indeed precludes easy use of it without explicit check at every point of invocation, that we didn't end up with an empty list of paths.
IMHO suboptimal and probably would result and thus should not call this function (used only in a few spots ATM as far as I see it). Given also the fact of desiring this to be a core API stable, if we see a way to improve it now, instead of coming up with workarounds.

If we change doc/behavior to be

If None is given info is reported for all content.

which wouldn't even require spelling out that for an empty list, nothing is returned, which would only be logical. I wonder if any test would fail (may be you've tried already?)

Overall I do not quite get it why we should even analyze any other untracked path (might be many), if originally just some specific paths are passed into the function, so if it could be avoided - could be for the better. But may be it is not possible in principle?

PS I pushed directly tiny docstring fix f9ff64e so please merge/rebase in case you want to adjust the docstring for the get_content_info

datalad/core/local/tests/test_save.py Outdated Show resolved Hide resolved
@codecov
Copy link

@codecov codecov bot commented Apr 3, 2019

Codecov Report

Merging #3288 into master will decrease coverage by 0.02%.
The diff coverage is 80%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3288      +/-   ##
==========================================
- Coverage   91.01%   90.99%   -0.03%     
==========================================
  Files         263      263              
  Lines       34149    34156       +7     
==========================================
- Hits        31082    31080       -2     
- Misses       3067     3076       +9
Impacted Files Coverage Δ
datalad/core/local/tests/test_save.py 99.69% <100%> (ø) ⬆️
datalad/support/gitrepo.py 88.92% <70%> (+0.01%) ⬆️
datalad/downloaders/tests/test_http.py 88.86% <0%> (-2.23%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8772fd1...388f535. Read the comment docs.

@kyleam
Copy link
Contributor Author

@kyleam kyleam commented Apr 4, 2019

@yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Apr 4, 2019

Applying the more local fix to save_() avoids the possibility of this being held up by a larger discussion about how get_content_info() should treat paths=[] (a discussion which I agree is worth having).

I agree. My only (not tested) possible concern that added analysis for untracked files could be a costly (in time) endeavor. But we don't have a benchmark for that yet to test anyhow sensibly. (#3290)

so I guess let's just get it fixed up for travis failure and merge

======================================================================
FAIL: datalad.core.local.tests.test_save.test_bf3285
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/datalad/tests/utils.py", line 434, in newfunc
    return t(*(arg + (d,)), **kw)
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/datalad/core/local/tests/test_save.py", line 639, in test_bf3285
    assert_repo_status(ds.path, untracked=[subds.path])
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/datalad/tests/utils.py", line 1539, in assert_repo_status
    % (state, state_files, oktobefound))
AssertionError: unexpected content of state "untracked": [PosixPath('/tmp/datalad_temp_tree_test_bf3285u05p4927/subds')] != [PosixPath('/var/tmp/sym link/datalad_temp_tree_test_bf3285u05p4927/subds')]
-------------------- >> begin captured logging << --------------------
datalad.core.local.create: INFO: RSS/VMS: 88272/384956 kB ...>create:301  Creating a new annex repo at /var/tmp/sym link/datalad_temp_tree_test_bf3285u05p4927
datalad.core.local.create: INFO: RSS/VMS: 88272/384956 kB ...>create:301  Creating a new annex repo at /var/tmp/sym link/datalad_temp_tree_test_bf3285u05p4927/subds
--------------------- >> end captured logging << ---------------------

but would be nice to have a dedicated issue to discuss possible change of semantic for get_content_info

@mih
Copy link
Member

@mih mih commented Apr 4, 2019

FTR: Something feels wrong here, and testing for untracked paths does not sound good. However, I don't comprehend the scenario yet. Do you have a demo handy?

@yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Apr 4, 2019

Demo beyond full example in #3285 and a test here?

Calling save_() with explicit paths that do not include an untracked
directory will result in an empty list being passed as the `paths`
argument to get_content_info().  This causes get_content_info() to
operate on _all_ content and leads to any untracked subdatasets being
saved.

Avoid this behavior by checking that there are untracked directories
before passing them as paths to get_content_info().  Another option
would be to adjust get_content_info() to only fall back to all content
when paths=None, not paths=[], but that deviates from its currently
documented behavior and might affect other callers.  Go with the more
direct fix for now.  Even if get_content_info() is later adjusted, the
change here is arguably worth keeping for readability.

Fixes datalad#3285.
@kyleam kyleam force-pushed the save-filter-subds branch from 0652f72 to 388f535 Apr 4, 2019
@kyleam
Copy link
Contributor Author

@kyleam kyleam commented Apr 4, 2019

@yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Apr 5, 2019

in one fix at a time fashion, merging!

@yarikoptic yarikoptic merged commit f8752f9 into datalad:master Apr 5, 2019
3 of 5 checks passed
@kyleam kyleam deleted the save-filter-subds branch Apr 5, 2019
kyleam added a commit to kyleam/datalad that referenced this issue May 22, 2021
GitRepo.get_content_info() and AnnexRepo.get_content_annexinfo()
report on all paths in a tree if the `paths` argument is false-y,
including if it is an empty list.

This behavior requires callers to guard against the empty list case if
they really want to restrict results to the paths in the list.  That
means it's usually not safe to do something like

  get_content_info(paths=[<some filtering list comprehension>], ...)

which was the source of the bug in dataladgh-3288.

This behavior is also problematic because GitRepo.diff()
GitRepo.status() claim to treat None and an empty list differently in
their docstrings.  However, this isn't the case because underneath
they use get_content_info().

Update get_content_info() and get_content_annexinfo() to distinguish
None versus an empty list.  It's a breaking change in behavior.
However, it seems unlikely that many third party callers are using
get_content_{,annex}info and even more unlikely that those that do
rely on an empty list reporting the full tree.  So, risk the breakage
to align diff() and status() with their documented behavior and to
avoid future bugs like dataladgh-3288.

Closes datalad#3292.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants