
OPT: save - do not bother running full status within subdatasets unless recursive #4526


Closed
wants to merge 4 commits into from


yarikoptic
Member

@yarikoptic yarikoptic commented May 13, 2020

Closes: #4523

If all the other tests pass, then this change presumably does not break any existing semantics, which would be great.

Notes:

  • Even though it is very basic, the unit test is quite slow (24 sec) on my laptop. save is "non-trivial" even in these tiny datasets, heh. My bet is that it has to do with the infinite recursion mentioned below.

Possible TODOs:

  • It is an incomplete solution: with recursive=True it would still do a "full" evaluation (and thus e.g. fail the test, if such a case were added) even where recursion should be cut off by the recursion_limit parameter. This suggests that this location is not ideal for the fix (if there is a better one).
  • The exception I am catching in the test, which is supposed to be a CommandError, causes some other meltdown somewhere.
Here is the displayed traceback, which I do not know what to attribute to:
======================================================================
ERROR: datalad.core.local.tests.test_save.test_subsuperdataset_save
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/yoh/proj/datalad/datalad-master/datalad/tests/utils.py", line 691, in newfunc
    return t(*(arg + (filename,)), **kw)
  File "/home/yoh/proj/datalad/datalad-master/datalad/core/local/tests/test_save.py", line 232, in test_subsuperdataset_save
    assert_raises(CommandError, parent.save, 'sub1', recursive=True)
  File "/usr/lib/python3.7/unittest/case.py", line 756, in assertRaises
    return context.handle('assertRaises', args, kwargs)
  File "/usr/lib/python3.7/unittest/case.py", line 178, in handle
    callable_obj(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/wrapt/wrappers.py", line 603, in __call__
    args, kwargs)
  File "/home/yoh/proj/datalad/datalad-master/datalad/distribution/dataset.py", line 498, in apply_func
    return f(**kwargs)
  File "/usr/lib/python3/dist-packages/wrapt/wrappers.py", line 564, in __call__
    args, kwargs)
  File "/home/yoh/proj/datalad/datalad-master/datalad/interface/utils.py", line 494, in eval_func
    return return_func(generator_func)(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/wrapt/wrappers.py", line 564, in __call__
    args, kwargs)
  File "/home/yoh/proj/datalad/datalad-master/datalad/interface/utils.py", line 482, in return_func
    results = list(results)
  File "/home/yoh/proj/datalad/datalad-master/datalad/interface/utils.py", line 413, in generator_func
    allkwargs):
  File "/home/yoh/proj/datalad/datalad-master/datalad/interface/utils.py", line 552, in _process_results
    for res in results:
  File "/home/yoh/proj/datalad/datalad-master/datalad/core/local/save.py", line 211, in __call__
    result_renderer='disabled'):
  File "/usr/lib/python3/dist-packages/wrapt/wrappers.py", line 564, in __call__
    args, kwargs)
  File "/home/yoh/proj/datalad/datalad-master/datalad/interface/utils.py", line 494, in eval_func
    return return_func(generator_func)(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/wrapt/wrappers.py", line 564, in __call__
    args, kwargs)
  File "/home/yoh/proj/datalad/datalad-master/datalad/interface/utils.py", line 482, in return_func
    results = list(results)
  File "/home/yoh/proj/datalad/datalad-master/datalad/interface/utils.py", line 413, in generator_func
    allkwargs):
  File "/home/yoh/proj/datalad/datalad-master/datalad/interface/utils.py", line 552, in _process_results
    for res in results:
  File "/home/yoh/proj/datalad/datalad-master/datalad/core/local/status.py", line 413, in __call__
    content_info_cache):
  File "/home/yoh/proj/datalad/datalad-master/datalad/core/local/status.py", line 152, in _yield_status
    cache):
  File "/home/yoh/proj/datalad/datalad-master/datalad/core/local/status.py", line 152, in _yield_status
    cache):
  File "/home/yoh/proj/datalad/datalad-master/datalad/core/local/status.py", line 152, in _yield_status
    cache):
  [Previous line repeated 924 more times]
  File "/home/yoh/proj/datalad/datalad-master/datalad/core/local/status.py", line 114, in _yield_status
    fr='HEAD' if repo.get_hexsha() else None,
  File "/home/yoh/proj/datalad/datalad-master/datalad/support/gitrepo.py", line 1747, in get_hexsha
    commitish)
  File "/home/yoh/proj/datalad/datalad-master/datalad/support/gitrepo.py", line 1718, in format_commit
    '', cmd, expect_stderr=True, expect_fail=True)
  File "/home/yoh/proj/datalad/datalad-master/datalad/support/gitrepo.py", line 328, in newfunc
    result = func(self, files_new, *args, **kwargs)
  File "/home/yoh/proj/datalad/datalad-master/datalad/support/gitrepo.py", line 2058, in _git_custom_command
    expect_fail=expect_fail)
  File "/home/yoh/proj/datalad/datalad-master/datalad/cmd.py", line 143, in run_gitcommand_on_file_list_chunks
    results.append(func(cmd, *args, **kwargs))
  File "/home/yoh/proj/datalad/datalad-master/datalad/cmd.py", line 1114, in run
    *args, **kwargs)
  File "/home/yoh/proj/datalad/datalad-master/datalad/cmd.py", line 883, in run
    stdin=stdin)
  File "/usr/lib/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.7/subprocess.py", line 1472, in _execute_child
    for dir in os.get_exec_path(env))
  File "/home/yoh/proj/datalad/datalad-master/venvs/dev3/lib/python3.7/os.py", line 637, in get_exec_path
    warnings.simplefilter("ignore", BytesWarning)
  File "/home/yoh/proj/datalad/datalad-master/venvs/dev3/lib/python3.7/warnings.py", line 179, in simplefilter
    _add_filter(action, None, category, None, lineno, append=append)
  File "/home/yoh/proj/datalad/datalad-master/venvs/dev3/lib/python3.7/warnings.py", line 186, in _add_filter
    filters.remove(item)
RecursionError: maximum recursion depth exceeded in comparison
Running with a high log level, I eventually got bombarded with the lines
at the bottom of this:
2020-05-13 19:14:45,797 [DEBUG  ] ...>gitrepo:3399  Done AnnexRepo(/home/yoh/.tmp/datalad_temp_test_subsuperdataset_savewi9yqcg_/sub1/sub2/sub3).get_content_info(...) 
2020-05-13 19:14:45,798 [Level 5] ...>cmd:1010  Running: ['git', 'ls-files', '-z', '-m'] 
2020-05-13 19:14:45,825 [Level 8] ...>cmd:1010  Finished running ['git', 'ls-files', '-z', '-m'] with status 0 
2020-05-13 19:14:45,827 [DEBUG  ] ...>gitrepo:3289  AnnexRepo(/home/yoh/.tmp/datalad_temp_test_subsuperdataset_savewi9yqcg_/sub1/sub2/sub3).get_content_info(...) 
2020-05-13 19:14:45,827 [DEBUG  ] ...>gitrepo:3339  Query repo: ['git', 'ls-tree', 'HEAD', '-z', '-r', '--full-tree', '-l'] 
2020-05-13 19:14:45,827 [Level 5] ...>cmd:1010  Running: ['git', 'ls-tree', 'HEAD', '-z', '-r', '--full-tree', '-l'] 
2020-05-13 19:14:45,865 [Level 8] ...>cmd:1010  Finished running ['git', 'ls-tree', 'HEAD', '-z', '-r', '--full-tree', '-l'] with status 0 
2020-05-13 19:14:45,866 [DEBUG  ] ...>gitrepo:3356  Done query repo: ['git', 'ls-tree', 'HEAD', '-z', '-r', '--full-tree', '-l'] 
2020-05-13 19:14:45,867 [DEBUG  ] ...>gitrepo:3399  Done AnnexRepo(/home/yoh/.tmp/datalad_temp_test_subsuperdataset_savewi9yqcg_/sub1/sub2/sub3).get_content_info(...) 
2020-05-13 19:14:45,868 [DEBUG  ] ...>status:112  query AnnexRepo(/home/yoh/.tmp/datalad_temp_test_subsuperdataset_savewi9yqcg_/sub1/sub2/sub3).diffstatus() for paths: None 
2020-05-13 19:14:45,870 [Level 5] ...>cmd:1010  Running: ['git', 'show', '-z', '--no-patch', '--format=%H', '--'] 
2020-05-13 19:14:45,916 [Level 8] ...>cmd:1010  Finished running ['git', 'show', '-z', '--no-patch', '--format=%H', '--'] with status 0 
2020-05-13 19:14:45,917 [DEBUG  ] ...>status:112  query AnnexRepo(/home/yoh/.tmp/datalad_temp_test_subsuperdataset_savewi9yqcg_/sub1/sub2/sub3).diffstatus() for paths: None 
2020-05-13 19:14:45,918 [Level 5] ...>cmd:1010  Running: ['git', 'show', '-z', '--no-patch', '--format=%H', '--'] 
2020-05-13 19:14:45,956 [Level 8] ...>cmd:1010  Finished running ['git', 'show', '-z', '--no-patch', '--format=%H', '--'] with status 0 
2020-05-13 19:14:45,957 [DEBUG  ] ...>status:112  query AnnexRepo(/home/yoh/.tmp/datalad_temp_test_subsuperdataset_savewi9yqcg_/sub1/sub2/sub3).diffstatus() for paths: None 
2020-05-13 19:14:45,959 [Level 5] ...>cmd:1010  Running: ['git', 'show', '-z', '--no-patch', '--format=%H', '--'] 
2020-05-13 19:14:46,001 [Level 8] ...>cmd:1010  Finished running ['git', 'show', '-z', '--no-patch', '--format=%H', '--'] with status 0 
2020-05-13 19:14:46,005 [DEBUG  ] ...>status:112  query AnnexRepo(/home/yoh/.tmp/datalad_temp_test_subsuperdataset_savewi9yqcg_/sub1/sub2/sub3).diffstatus() for paths: None 
...
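
The recursion_limit concern from the TODOs above can be sketched as follows. This is only an illustration under stated assumptions: `pick_eval_subdataset_state` and its `depth` argument are hypothetical names and not part of DataLad, and the only values taken from this PR are the two states 'full' and 'commit'.

```python
from typing import Optional


def pick_eval_subdataset_state(depth: int,
                               recursive: bool,
                               recursion_limit: Optional[int] = None) -> str:
    """Return 'full' only for subdatasets that save would descend into.

    Hypothetical helper, not DataLad API. `depth` is assumed to be 1 for a
    direct subdataset of the dataset being saved, 2 for its subdatasets, etc.
    """
    if not recursive:
        # without recursion only the recorded commit matters
        return 'commit'
    if recursion_limit is not None and depth > recursion_limit:
        # assumed: beyond the limit nothing gets saved, so the cheap
        # commit-only evaluation would be enough
        return 'commit'
    return 'full'
```

The change in this PR corresponds to the first branch only. The second branch is what a limit-aware variant might add, and it needs the current depth, which is not available at the place touched here.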

@yarikoptic yarikoptic added the UX user experience label May 13, 2020
Situation is probably abnormal, and something else would throw a proper
exception, or someone else would fix it properly in some other place.
@yarikoptic
Member Author

OK, a workaround (or maybe a fix?) for the infinite recursion is in d5d9670. The test became faster: only 11 sec on my laptop ;)
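
For illustration only, here is one generic way to keep a recursive traversal from looping over the same location, as `_yield_status` does in the traceback above. It is not the change made in d5d9670, whose content is not shown in this conversation. The function name, the `children` mapping, and the toy hierarchy are all hypothetical.

```python
from typing import Dict, Iterator, List, Optional, Set


def walk_subdatasets(path: str,
                     children: Dict[str, List[str]],
                     seen: Optional[Set[str]] = None) -> Iterator[str]:
    """Yield `path` and everything below it, visiting each path at most once."""
    if seen is None:
        seen = set()
    if path in seen:
        # already visited: stop here instead of recursing forever
        return
    seen.add(path)
    yield path
    for child in children.get(path, []):
        yield from walk_subdatasets(child, children, seen)


# toy hierarchy in which 'sub3' erroneously lists itself as its own child
hierarchy = {
    'parent': ['sub1'],
    'sub1': ['sub2'],
    'sub2': ['sub3'],
    'sub3': ['sub3'],
}
visited = list(walk_subdatasets('parent', hierarchy))
```

Without the `seen` check, the self-reference in the toy hierarchy would end in a RecursionError much like the one reported above.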

@codecov

codecov bot commented May 14, 2020

Codecov Report

Merging #4526 into master will increase coverage by 0.02%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #4526      +/-   ##
==========================================
+ Coverage   89.20%   89.23%   +0.02%     
==========================================
  Files         285      285              
  Lines       38558    38575      +17     
==========================================
+ Hits        34397    34423      +26     
+ Misses       4161     4152       -9     
Impacted Files                            Coverage Δ
datalad/core/local/save.py                98.68% <ø> (ø)
datalad/core/local/status.py              98.14% <100.00%> (+0.05%) ⬆️
datalad/core/local/tests/test_save.py     96.86% <100.00%> (+0.10%) ⬆️
datalad/support/gitrepo.py                90.11% <0.00%> (+0.15%) ⬆️
datalad/downloaders/base.py               77.09% <0.00%> (+0.36%) ⬆️
datalad/downloaders/http.py               75.29% <0.00%> (+0.39%) ⬆️
datalad/downloaders/tests/test_http.py    62.16% <0.00%> (+1.20%) ⬆️


Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 344c628...6684f0b. Read the comment docs.

@@ -206,6 +206,8 @@ def __call__(path=None, message=None, dataset=None,
     recursive=recursive,
     recursion_limit=recursion_limit,
     on_failure='ignore',
+    # for save without recursion only commit matters
+    eval_subdataset_state='full' if recursive else 'commit',
Contributor


@mih proposing the same change in gh-4531 is a pretty good approval of this change :)

Member Author

@yarikoptic yarikoptic May 14, 2020


Indeed ;-) but maybe he will come up with an even better one, which would also work for recursive with a limit?

Member


Because you bring up recursion limits: Have you ever seen any actual use of it? I mean outside of tests. My shell history doesn't have it. This feels like a historical artifact of "why not have a limit too". It causes quite a few complications by requiring a relatively high level of sophistication in some code pieces. I am afraid that we will add another one here.

Member Author


Consciously, I do not recollect using it. My eternal shell history setup is broken, but on smaug I found a grand total of 2 invocations, both for diff:

datalad diff -r --recursion-limit 1 openneuro
datalad diff -r --recursion-limit 2 openneuro

which kinda makes sense BUT could have been worked around by using -C or -d openneuro instead of -r. So indeed, I don't have immediate use cases. Moreover, I think it would make sense only in a "homogeneous" hierarchy, e.g. like the HCP dataset (if it had more subdataset levels ;-)), which is probably a rare use case on its own. In all other cases it would likely be used on a single path/subdataset, and thus could be worked around.
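
What a recursion limit selects in a hierarchy can be sketched with a toy depth-limited walk. This is a minimal sketch that assumes the limit counts subdataset levels below the starting dataset. `iter_datasets`, the `tree` mapping, and the dataset names are hypothetical and do not come from DataLad.

```python
from typing import Dict, Iterator, List, Optional, Tuple


def iter_datasets(root: str,
                  tree: Dict[str, List[str]],
                  recursion_limit: Optional[int] = None,
                  _depth: int = 0) -> Iterator[Tuple[str, int]]:
    """Yield (dataset, depth) pairs, descending at most `recursion_limit` levels."""
    yield root, _depth
    if recursion_limit is not None and _depth >= recursion_limit:
        # limit reached: report this dataset but do not look inside it
        return
    for sub in tree.get(root, []):
        yield from iter_datasets(sub, tree, recursion_limit, _depth + 1)


# hypothetical hierarchy: one superdataset, two subdatasets, one nested level
tree = {
    'super': ['super/a', 'super/b'],
    'super/a': ['super/a/x'],
}
level1 = [path for path, _ in iter_datasets('super', tree, recursion_limit=1)]
level2 = [path for path, _ in iter_datasets('super', tree, recursion_limit=2)]
```

Here `level1` contains only the superdataset and its direct subdatasets, while `level2` also includes 'super/a/x'. In a non-homogeneous hierarchy like this one, the same effect is reachable by pointing the command at a single subdataset, which is the workaround described above.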

Member Author


gy, search on github brings up https://github.com/recoveringyank/T1ify/blob/1772167b3d771632441377fc47c729cc33d82a4b/get_data_datalad.sh with

datalad install ///hcp-openaccess -r --recursion-limit 3

Even when I exclude some repos, I still do not see any more relevant 3rd-party hits turning up... so at least nobody uses it heavily in code, besides us passing it along from one function to another ;)

sub1.save('sub2')
# and should fail if we demand recursive operation
# Fun part: causes RecursionError, not just CommandError ATM but that is IMHO
# a separate issue, TODO.
Contributor


Isn't this comment stale as of d5d9670?

Member Author


D'oh, right, will remove. It seems that everyone agrees, so I will remove it, commit, merge locally, and push master to avoid a needless round of CI.

@yarikoptic
Member Author

Cleaned up the commits, merged locally (4f0876e), and pushed.

@yarikoptic yarikoptic closed this May 15, 2020
@yarikoptic yarikoptic deleted the bf-4523 branch May 21, 2020 17:41
Labels
UX user experience
Development

Successfully merging this pull request may close these issues.

save (without -r) sub-superdataset takes too long - checks its subdatasets
3 participants