Skip to content

addurls: Fixes to subpath handling #3561

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jul 26, 2019

Conversation

kyleam
Copy link
Contributor

@kyleam kyleam commented Jul 25, 2019

This series addresses the subdataset issues we hit on the last call.

The remaining issue discussed on that call was that it seemed like a subdirectory needed to exist for annex addurl --file <subdir>/file, but that's not the case when I test locally (either with git annex addurl or datalad addurls) and I don't see anything in annex's history that suggests that it was a recent change. I think that might have actually been related to one of the issues here, but please provide details here if you can trigger it on your end.

kyleam added 3 commits July 25, 2019 08:17
get_subpaths() uses "//" as the special subdataset marker, but it
constructs paths with os.path.sep, so it should be portable.
As get_subpaths() moves across the path, it builds a list of subpaths.
Each subpath is the *extension* of the one before it.  For example,
"a//b//c" is broken into ["a", "a/b"].  The logic for joining the
subpaths is flawed because *all* previous subpaths in the list, not
just the last one, are joined together at the beginning of the next
subpath.  This leads to subpaths with duplicated elements if
get_subpaths() is called with a path that has more than two "//".

Use only the last subpath in the list to construct the start of the
new subpath.
We need to create datasets that are higher up in the path first before
creating their subdatasets.
@codecov
Copy link

codecov bot commented Jul 25, 2019

Codecov Report

Merging #3561 into 0.11.x will decrease coverage by 0.03%.
The diff coverage is 66.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           0.11.x    #3561      +/-   ##
==========================================
- Coverage    81.3%   81.26%   -0.04%     
==========================================
  Files         255      255              
  Lines       33588    33603      +15     
==========================================
+ Hits        27308    27309       +1     
- Misses       6280     6294      +14
Impacted Files Coverage Δ
datalad/plugin/tests/test_addurls.py 100% <100%> (ø) ⬆️
datalad/plugin/addurls.py 18.34% <14.28%> (+0.02%) ⬆️
datalad/downloaders/tests/test_http.py 88.86% <0%> (-2.23%) ⬇️
datalad/support/json_py.py 95.94% <0%> (-1.36%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 54a8ff4...5415ce9. Read the comment docs.

@kyleam
Copy link
Contributor Author

kyleam commented Jul 26, 2019

I don't see how that codecov info could be correct. Things are quiet on Travis right now, so I'll re-trigger the build to see if that rights codecov.

@yarikoptic
Copy link
Member

FWIW

  1. that "addurl" issue seems was solely due to not enabled "datalad" remote (to be configured via -c of ENH: addurls: Add --cfg-proc option that is passed to create() #3562 ) for s3:// urls to work out in the subdataset

  2. for a full test I merged ENH: addurls: Add --cfg-proc option that is passed to create() #3562 locally ito this PR (thus brought all of the master upon it) and it all worked out:

$> datalad addurls -c datalad_remote ../testhcp/urls.csv '{original_url}' '{subject}//{preprocessing}//{filename}'             
[INFO   ] Creating a new annex repo at /tmp/testhcp11/101006                                           
create(ok): 101006 (dataset)                                                                                                             
[INFO   ] Running procedure cfg_datalad_remote 
[INFO   ] == Command start (output follows) ===== 
initremote datalad ok
(recording state in git...)
[INFO   ] == Command exit (modification check follows) ===== 
[INFO   ] Creating a new annex repo at /tmp/testhcp11/101006/MNINonLinear 
create(ok): 101006/MNINonLinear (dataset)                                                                                                
[INFO   ] Running procedure cfg_datalad_remote 
[INFO   ] == Command start (output follows) ===== 
initremote datalad ok
(recording state in git...)
[INFO   ] == Command exit (modification check follows) ===== 
[INFO   ] Creating a new annex repo at /tmp/testhcp11/101006/unprocessed 
create(ok): 101006/unprocessed (dataset)                                                                                                 
[INFO   ] Running procedure cfg_datalad_remote 
[INFO   ] == Command start (output follows) ===== 
initremote datalad ok
(recording state in git...)
[INFO   ] == Command exit (modification check follows) ===== 
[INFO   ] Creating a new annex repo at /tmp/testhcp11/101006/MNINonLinear/Results/rfMRI_REST1_LR 
create(ok): 101006/MNINonLinear/Results/rfMRI_REST1_LR (dataset)                                                                         
[INFO   ] Running procedure cfg_datalad_remote 
[INFO   ] == Command start (output follows) ===== 
initremote datalad ok
(recording state in git...)
[INFO   ] == Command exit (modification check follows) ===== 
addurl(ok): /tmp/testhcp11/101006/unprocessed/3T/rfMRI_REST1_LR/101006_3T_BIAS_BC.nii.gz (file) [from datalad; to 3T/rfMRI_REST1_LR/101006_3T_BIAS_BC.nii.gz]
addurl(ok): /tmp/testhcp11/101006/MNINonLinear/Results/rfMRI_REST1_LR/rfMRI_REST1_LR_hp2000.ica/Noise.txt (file) [from datalad; to rfMRI_REST1_LR_hp2000.ica/Noise.txt]
add(ok): /tmp/testhcp11/101006/unprocessed/3T/rfMRI_REST1_LR/101006_3T_BIAS_BC.nii.gz (file)                                             
metadata(ok): /tmp/testhcp11/101006/unprocessed/3T/rfMRI_REST1_LR/101006_3T_BIAS_BC.nii.gz (file)
add(ok): /tmp/testhcp11/101006/MNINonLinear/Results/rfMRI_REST1_LR/rfMRI_REST1_LR_hp2000.ica/Noise.txt (file)
metadata(ok): /tmp/testhcp11/101006/MNINonLinear/Results/rfMRI_REST1_LR/rfMRI_REST1_LR_hp2000.ica/Noise.txt (file)
save(ok): 101006/unprocessed (dataset)                                                                                                   
save(ok): 101006/MNINonLinear/Results/rfMRI_REST1_LR (dataset)
add(ok): Results/rfMRI_REST1_LR (file)
save(ok): 101006/MNINonLinear (dataset)
add(ok): MNINonLinear (file)
add(ok): unprocessed (file)
save(ok): 101006 (dataset)
add(ok): 101006 (file)
save(ok): . (dataset)
action summary:
  add (ok: 6)
  addurl (ok: 2)
  create (ok: 4)
  metadata (ok: 2)
  save (ok: 5)
datalad addurls -c datalad_remote ../testhcp/urls.csv '{original_url}'   11.02s user 4.23s system 78% cpu 19.437 total
(dev) 1 13396 [1].....................................:Fri 26 Jul 2019 12:11:58 AM EDT:.
(git-annex)hopa:/tmp/testhcp11[master]git
$> datalad ls -r .
.                                            [annex]  master  ✗ 2019-07-26/00:11:58  ✓
101006                                       [annex]  master  ✗ 2019-07-26/00:11:58  ✓
101006/MNINonLinear                          [annex]  master  ✗ 2019-07-26/00:11:58  ✓
101006/MNINonLinear/Results/rfMRI_REST1_LR   [annex]  master  ✗ 2019-07-26/00:11:57  ✓
101006/unprocessed                           [annex]  master  ✗ 2019-07-26/00:11:57  ✓

@yarikoptic
Copy link
Member

I don't see how that codecov info could be correct.

could well be that inconsistency brought in by the difference of runs between master and PRs -- whatever I was screaming for us to be careful to avoid ;)

@kyleam
Copy link
Contributor Author

kyleam commented Jul 26, 2019

2\. for a full test I merged #3562 locally ito this PR (thus brought all of the master upon it) and it all worked out:

Great, thanks for testing it out on your end.

@kyleam
Copy link
Contributor Author

kyleam commented Jul 26, 2019

I don't see how that codecov info could be correct.

could well be that inconsistency brought in by the difference of runs between master and PRs -- whatever I was screaming for us to be careful to avoid ;)

That doesn't make sense to me. I'm talking about the patch coverage (it was reporting that no lines were covered), not the overall project coverage.

@yarikoptic
Copy link
Member

I don't see how that codecov info could be correct.

could well be that inconsistency brought in by the difference of runs between master and PRs -- whatever I was screaming for us to be careful to avoid ;)

That doesn't make sense to me. I'm talking about the patch coverage (it was reporting that no lines were covered), not the overall project coverage.

And I was talking about overall project coverage since nothing in the original sentence mentioned patch . Now we are still on different pages but aware of that ;)

Anyways -- something is funky with codecov. I stopped getting colored annotations of the diff via browser extension and do not see them on codecov either, so can't even see what in the patch it considers not covered. But I bet it is all good, I have no code comments, so let's proceed with the merge! Thanks again for the fixes!

@yarikoptic yarikoptic merged commit a0809ed into datalad:0.11.x Jul 26, 2019
@yarikoptic yarikoptic added this to the Release 0.11.6 milestone Jul 26, 2019
@kyleam kyleam deleted the addurls-path-logic branch July 26, 2019 21:21
kyleam added a commit that referenced this pull request Jul 26, 2019
yarikoptic added a commit that referenced this pull request Jul 31, 2019
0.11.6 (Jul 30, 2019) -- am I the last of 0.11.x?

Primarily bug fixes to achieve more robust performance

Fixes

- Our tests needed various adjustments to keep up with upstream
  changes in Travis and Git. ([#3479][]) ([#3492][]) ([#3493][])

- `AnnexRepo.is_special_annex_remote` was too selective in what it
  considered to be a special remote.  ([#3499][])

- We now provide information about unexpected output when git-annex is
  called with `--json`.  ([#3516][])

- Exception logging in the `__del__` method of `GitRepo` and
  `AnnexRepo` no longer fails if the names it needs are no longer
  bound.  ([#3527][])

- [addurls][] botched the construction of subdataset paths that were
  more than two levels deep and failed to create datasets in a
  reliable, breadth-first order.  ([#3561][])

- Cloning a `type=git` special remote showed a spurious warning about
  the remote not being enabled.  ([#3547][])

Enhancements and new features

- For calls to git and git-annex, we disable automatic garbage
  collection due to past issues with GitPython's state becoming stale,
  but doing so results in a larger .git/objects/ directory that isn't
  cleaned up until garbage collection is triggered outside of DataLad.
  Tests with the latest GitPython didn't reveal any state issues, so
  we've re-enabled automatic garbage collection.  ([#3458][])

- [rerun][] learned an `--explicit` flag, which it relays to its calls
  to [run][[]].  This makes it possible to call `rerun` in a dirty
  working tree ([#3498][]).

- The [metadata][] command aborts earlier if a metadata extractor is
  unavailable.  ([#3525][])

* tag '0.11.6': (56 commits)
  [DATALAD RUNCMD] make update-changelog
  finalize CHANGELOG.md entry and boost version
  BF(DOC): close [create] with [] to not cause WARNING by md-strict pandoc
  CHANGELOG.md: Link entry from b3e8adb
  CHANGELOG.md: Add entry for gh-3547
  CHANGELOG.md: Add entry for gh-3561
  CHANGELOG.md: Add link for addurls
  RF: inform about special remotes based on autoenable config
  CHANGELOG.md: Second batch for 0.11.6
  BF: addurls: Process datasets in a stable, breadth-first order
  BF: addurls: Fix construction of nested subpaths
  TST: addurls: Don't hard-code path separator
  BF(TST): skip test_v7_detached_get in direct mode - fails to annex upgrade
  TST: benchmark-travis-pr: Swap 'pip install' and 'git show'
  TST: benchmark-travis-pr: Move repeated logic to run_asv()
  TST: benchmark-travis-pr: Support other bases
  TST: benchmark-travis-pr: Tweak message about current HEAD
  TST: benchmark-travis-pr: Simplify two git commands into one
  TST: benchmark-travis-pr: Reorder and break up lines
  TST: benchmark-travis-pr: Move command for running asv into function
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants