-
Notifications
You must be signed in to change notification settings - Fork 112
addurls: Fixes to subpath handling #3561
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
get_subpaths() uses "//" as the special subdataset marker, but it constructs paths with os.path.sep, so it should be portable.
As get_subpaths() moves across the path, it builds a list of subpaths. Each subpath is the *extension* of the one before it. For example, "a//b//c" is broken into ["a", "a/b"]. The logic for joining the subpaths is flawed because *all* previous subpaths in the list, not just the last one, are joined together at the beginning of the next subpath. This leads to subpaths with duplicated elements if get_subpaths() is called with a path that has more than two "//". Use only the last subpath in the list to construct the start of the new subpath.
We need to create datasets that are higher up in the path first before creating their subdatasets.
Codecov Report
@@ Coverage Diff @@
## 0.11.x #3561 +/- ##
==========================================
- Coverage 81.3% 81.26% -0.04%
==========================================
Files 255 255
Lines 33588 33603 +15
==========================================
+ Hits 27308 27309 +1
- Misses 6280 6294 +14
Continue to review full report at Codecov.
|
I don't see how that codecov info could be correct. Things are quiet on Travis right now, so I'll re-trigger the build to see if that rights codecov. |
FWIW
$> datalad addurls -c datalad_remote ../testhcp/urls.csv '{original_url}' '{subject}//{preprocessing}//{filename}'
[INFO ] Creating a new annex repo at /tmp/testhcp11/101006
create(ok): 101006 (dataset)
[INFO ] Running procedure cfg_datalad_remote
[INFO ] == Command start (output follows) =====
initremote datalad ok
(recording state in git...)
[INFO ] == Command exit (modification check follows) =====
[INFO ] Creating a new annex repo at /tmp/testhcp11/101006/MNINonLinear
create(ok): 101006/MNINonLinear (dataset)
[INFO ] Running procedure cfg_datalad_remote
[INFO ] == Command start (output follows) =====
initremote datalad ok
(recording state in git...)
[INFO ] == Command exit (modification check follows) =====
[INFO ] Creating a new annex repo at /tmp/testhcp11/101006/unprocessed
create(ok): 101006/unprocessed (dataset)
[INFO ] Running procedure cfg_datalad_remote
[INFO ] == Command start (output follows) =====
initremote datalad ok
(recording state in git...)
[INFO ] == Command exit (modification check follows) =====
[INFO ] Creating a new annex repo at /tmp/testhcp11/101006/MNINonLinear/Results/rfMRI_REST1_LR
create(ok): 101006/MNINonLinear/Results/rfMRI_REST1_LR (dataset)
[INFO ] Running procedure cfg_datalad_remote
[INFO ] == Command start (output follows) =====
initremote datalad ok
(recording state in git...)
[INFO ] == Command exit (modification check follows) =====
addurl(ok): /tmp/testhcp11/101006/unprocessed/3T/rfMRI_REST1_LR/101006_3T_BIAS_BC.nii.gz (file) [from datalad; to 3T/rfMRI_REST1_LR/101006_3T_BIAS_BC.nii.gz]
addurl(ok): /tmp/testhcp11/101006/MNINonLinear/Results/rfMRI_REST1_LR/rfMRI_REST1_LR_hp2000.ica/Noise.txt (file) [from datalad; to rfMRI_REST1_LR_hp2000.ica/Noise.txt]
add(ok): /tmp/testhcp11/101006/unprocessed/3T/rfMRI_REST1_LR/101006_3T_BIAS_BC.nii.gz (file)
metadata(ok): /tmp/testhcp11/101006/unprocessed/3T/rfMRI_REST1_LR/101006_3T_BIAS_BC.nii.gz (file)
add(ok): /tmp/testhcp11/101006/MNINonLinear/Results/rfMRI_REST1_LR/rfMRI_REST1_LR_hp2000.ica/Noise.txt (file)
metadata(ok): /tmp/testhcp11/101006/MNINonLinear/Results/rfMRI_REST1_LR/rfMRI_REST1_LR_hp2000.ica/Noise.txt (file)
save(ok): 101006/unprocessed (dataset)
save(ok): 101006/MNINonLinear/Results/rfMRI_REST1_LR (dataset)
add(ok): Results/rfMRI_REST1_LR (file)
save(ok): 101006/MNINonLinear (dataset)
add(ok): MNINonLinear (file)
add(ok): unprocessed (file)
save(ok): 101006 (dataset)
add(ok): 101006 (file)
save(ok): . (dataset)
action summary:
add (ok: 6)
addurl (ok: 2)
create (ok: 4)
metadata (ok: 2)
save (ok: 5)
datalad addurls -c datalad_remote ../testhcp/urls.csv '{original_url}' 11.02s user 4.23s system 78% cpu 19.437 total
(dev) 1 13396 [1].....................................:Fri 26 Jul 2019 12:11:58 AM EDT:.
(git-annex)hopa:/tmp/testhcp11[master]git
$> datalad ls -r .
. [annex] master ✗ 2019-07-26/00:11:58 ✓
101006 [annex] master ✗ 2019-07-26/00:11:58 ✓
101006/MNINonLinear [annex] master ✗ 2019-07-26/00:11:58 ✓
101006/MNINonLinear/Results/rfMRI_REST1_LR [annex] master ✗ 2019-07-26/00:11:57 ✓
101006/unprocessed [annex] master ✗ 2019-07-26/00:11:57 ✓ |
could well be that inconsistency brought in by the difference of runs between master and PRs -- whatever I was screaming for us to be careful to avoid ;) |
Great, thanks for testing it out on your end. |
That doesn't make sense to me. I'm talking about the patch coverage (it was reporting that no lines were covered), not the overall project coverage. |
And I was talking about overall project coverage since nothing in the original sentence mentioned patch . Now we are still on different pages but aware of that ;) Anyways -- something is funky with codecov. I stopped getting colored annotations of the diff via browser extension and do not see them on codecov either, so can't even see what in the patch it considers not covered. But I bet it is all good, I have no code comments, so let's proceed with the merge! Thanks again for the fixes! |
0.11.6 (Jul 30, 2019) -- am I the last of 0.11.x? Primarily bug fixes to achieve more robust performance Fixes - Our tests needed various adjustments to keep up with upstream changes in Travis and Git. ([#3479][]) ([#3492][]) ([#3493][]) - `AnnexRepo.is_special_annex_remote` was too selective in what it considered to be a special remote. ([#3499][]) - We now provide information about unexpected output when git-annex is called with `--json`. ([#3516][]) - Exception logging in the `__del__` method of `GitRepo` and `AnnexRepo` no longer fails if the names it needs are no longer bound. ([#3527][]) - [addurls][] botched the construction of subdataset paths that were more than two levels deep and failed to create datasets in a reliable, breadth-first order. ([#3561][]) - Cloning a `type=git` special remote showed a spurious warning about the remote not being enabled. ([#3547][]) Enhancements and new features - For calls to git and git-annex, we disable automatic garbage collection due to past issues with GitPython's state becoming stale, but doing so results in a larger .git/objects/ directory that isn't cleaned up until garbage collection is triggered outside of DataLad. Tests with the latest GitPython didn't reveal any state issues, so we've re-enabled automatic garbage collection. ([#3458][]) - [rerun][] learned an `--explicit` flag, which it relays to its calls to [run][[]]. This makes it possible to call `rerun` in a dirty working tree ([#3498][]). - The [metadata][] command aborts earlier if a metadata extractor is unavailable. ([#3525][]) * tag '0.11.6': (56 commits) [DATALAD RUNCMD] make update-changelog finalize CHANGELOG.md entry and boost version BF(DOC): close [create] with [] to not cause WARNING by md-strict pandoc CHANGELOG.md: Link entry from b3e8adb CHANGELOG.md: Add entry for gh-3547 CHANGELOG.md: Add entry for gh-3561 CHANGELOG.md: Add link for addurls RF: inform about special remotes based on autoenable config CHANGELOG.md: Second batch for 0.11.6 BF: addurls: Process datasets in a stable, breadth-first order BF: addurls: Fix construction of nested subpaths TST: addurls: Don't hard-code path separator BF(TST): skip test_v7_detached_get in direct mode - fails to annex upgrade TST: benchmark-travis-pr: Swap 'pip install' and 'git show' TST: benchmark-travis-pr: Move repeated logic to run_asv() TST: benchmark-travis-pr: Support other bases TST: benchmark-travis-pr: Tweak message about current HEAD TST: benchmark-travis-pr: Simplify two git commands into one TST: benchmark-travis-pr: Reorder and break up lines TST: benchmark-travis-pr: Move command for running asv into function ...
This series addresses the subdataset issues we hit on the last call.
The remaining issue discussed on that call was that it seemed like a subdirectory needed to exist for
annex addurl --file <subdir>/file
, but that's not the case when I test locally (either withgit annex addurl
ordatalad addurls
) and I don't see anything in annex's history that suggests that it was a recent change. I think that might have actually been related to one of the issues here, but please provide details here if you can trigger it on your end.