Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NF: copy_file() command #4430

Merged
merged 57 commits into from May 2, 2020
Merged

NF: copy_file() command #4430

merged 57 commits into from May 2, 2020

Conversation

mih
Copy link
Member

@mih mih commented Apr 21, 2020

This command is an approach to fix #3681 and fix #600. However, instead of lumping things together and having git-annex sort things out, this is a more targeted attempt at picking individual files, looking for their availability info, and transfer/enable necessary special remote configuration.

This can only be considered a first step. Many (speed) optimizations are possible, various features could be added (e.g. arbitrary special remote support, for now only datalad remote). However, it is reasonably complete and functional to a degree that it is capable of filtering a dataset like https://github.com/datalad-datasets/human-connectome-project-openaccess

Acknowledgements

This work was, in part, supported by the European Union's Horizon 2020 Research and Innovation Programme under Grant Agreement no. 785907 (HBP SGA2).

@codecov
Copy link

@codecov codecov bot commented Apr 22, 2020

Codecov Report

Merging #4430 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #4430   +/-   ##
=======================================
  Coverage   88.95%   88.95%           
=======================================
  Files         287      287           
  Lines       38253    38253           
=======================================
  Hits        34029    34029           
  Misses       4224     4224           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c3cf96a...c3cf96a. Read the comment docs.

@mih mih force-pushed the nf-copy branch 2 times, most recently from dcdd0e1 to e437fc8 Compare Apr 22, 2020
@mih
Copy link
Member Author

@mih mih commented Apr 22, 2020

This is far from being "done", but a bunch of tests showing the intended behavior are in place. If anyone could give it a test ride and report how it feels that would be very useful! Thx.

@kyleam
Copy link
Contributor

@kyleam kyleam commented Apr 22, 2020

If anyone could give it a test ride and report how it feels that would be very useful!

The first thing I tried was

datalad create a
echo one >a/one
datalad -C a save
datalad create b
datalad copyfile a/one b/two
tree --charset=ascii a b | colrm 50
[...]
a
`-- one -> .git/annex/objects/zZ/KF/MD5E-s4--5bbf
b
`-- one -> .git/annex/objects/zZ/KF/MD5E-s4--5bbf

0 directories, 2 files

There might be good implementation or design reasons for not being able to specify a target with a different name via path, but perhaps that being the first thing I tried suggests that it'd be worth a warning or documentation that the non-directory part of the target will be lopped off.

@adswa
Copy link
Member

@adswa adswa commented Apr 23, 2020

If anyone could give it a test ride and report how it feels that would be very useful!

I only played around with it a bit and did not look into the code, but here are the things that caught my attention:

If specifying different target names is not possible, maybe warn/report better in accidental attempts of copying identically named files?

After I saw @kyleam's comment, I wondered what would happen if I copied a file into a directory with a file of the same name but different contents

datalad create a
echo one >a/one
datalad -C a save
datalad create b
echo somethingelse > b/one
datalad -C b save
datalad copyfile a/one b/two

It would be cool if the warning could state something along the lines of "File b/one already exist, and is not identical - can't copy". Its a bit hard to find the command that is run in the output, and the Failed to clean up temporary directory along with the MD5E keys look so much like an internal error that it is hard to infer a mistake on the user side

create(ok): /tmp/a (dataset)
add(ok): one (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
[INFO   ] Creating a new annex repo at /tmp/b 
create(ok): /tmp/b (dataset)
add(ok): one (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  save (ok: 1)
[WARNING] Running fromkey resulted in stderr output: git-annex: fromkey: 1 failed
 
[WARNING] Failed to clean up temporary directory: [Errno 39] Directory not empty: '/tmp/b/.git/tmp/datalad-copy' 
CommandError: 'git-annex reinject -c annex.dotfiles=true -c annex.retry=3 /tmp/b/.git/tmp/datalad-copy/MD5E-s4--5bbf5a52328e7439ae6e719dfe712200 /tmp/b/one' failed with exitcode 1 under /tmp/b
reinject /tmp/b/.git/tmp/datalad-copy/MD5E-s4--5bbf5a52328e7439ae6e719dfe712200 
failed
git-annex: /tmp/b/.git/tmp/datalad-copy/MD5E-s4--5bbf5a52328e7439ae6e719dfe712200 does not have expected content of /tmp/b/one
git-annex: reinject: 1 failed

Copying things into datasets without an annex -- maybe warn?

By chance, the very first thing I did was to copy an annexed file (with retrieved content) into a dataset with no annex (a few HCP subjects files into the top-level human-connectome-project-openacess dataset). Afterwards I end up with a broken symlink (understandable), but I wondered whether this info message is accurate (in particular: Copying non-annexed file or copy - it sounds like the json file would not be annexed, but it is)

─adina@muninn /tmp/hcp on master
╰─➤ datalad copyfile HCP1200/103111/.xdlm/103111_3T_rfMRI_REST1_fixextended.json .
[INFO   ] Copying non-annexed file or copy into non-annex dataset: /tmp/hcp/HCP1200/103111/.xdlm/103111_3T_rfMRI_REST1_fixextended.json -> <GitRepo path=/tmp/hcp (<class 'datalad.support.gitrepo.GitRepo'>)> 
copy(ok): /tmp/hcp/HCP1200/103111/.xdlm/103111_3T_rfMRI_REST1_fixextended.json
add(ok): 103111_3T_rfMRI_REST1_fixextended.json (file)
save(ok): . (dataset)
action summary:
  add (ok: 1)
  copy (ok: 1)
  save (ok: 1)

Would it be maybe good to warn when it is detected that annexed contents are copied into datasets without an annex? I'm thinking of someone copying a dataset with large data into a plain Git repository (e.g, maybe one for their code) and being not able to retrieve any data afterwards.

Also: When I copyfile an annexed file I didn't get yet into a non-annex dataset it fails, but the warning is misleading. To me, the warning would suggest that I would need to get content first, which is futile in this case - even if I did, copying it into a dataset without an annex will turn it into a broken symlink.

datalad copyfile HCP1200/103010/.xdlm/103010_3T_rfMRI_REST1_fixextended.json .
[INFO   ] Copying non-annexed file or copy into non-annex dataset: /tmp/hcp/HCP1200/103010/.xdlm/103010_3T_rfMRI_REST1_fixextended.json -> <GitRepo path=/tmp/hcp (<class 'datalad.support.gitrepo.GitRepo'>)> 
copy(impossible): /tmp/hcp/HCP1200/103010/.xdlm/103010_3T_rfMRI_REST1_fixextended.json [file has no content available]

Behavior of --recursive:

I was confused that recursive copy did not preserve the directory structure.

datalad create -c yoda yodads
echo "my script" > yodads/code/script
datalad -C yodads save
datalad create noyodads      
datalad copyfile -r yodads/* noyodads

tree yodads noyodads
yodads
├── CHANGELOG.md
├── code
│   ├── README.md
│   └── script
└── README.md
noyodads
├── CHANGELOG.md
├── README.md
└── script

(I was also a bit confused that --recursive in the case of copyfile is about directories, when all other --recursive operations are about subdatasets, but on second thought, it matches the behavior of cp, and its written in the help text)


Something I anticipate to be confusing, but can't see a way around

If I have a dataset with configurations that prevent files from being annexed, copying these files into a new dataset will annex them. This is behavior makes sense as long someone knows about files in annex versus in Git, but for anyone using the command naively it may be confusing:

datalad create anotherds
datalad copyfile yodads/code/script anotherds
tree anotherds
anotherds
└── script -> .git/annex/objects/vz/z1/MD5E-s10--4af6de6a8afa5377c382b539d8479406/MD5E-s10--4af6de6a8afa5377c382b539d8479406

I don't think this can be easily prevented, but maybe its something to keep in mind.

@mih
Copy link
Member Author

@mih mih commented Apr 23, 2020

Thx @kyleam @adswa that is very helpful. It is actually pointing to the necessity to turn thing a bit inside out internally. Will report once done.

@yarikoptic
Copy link
Member

@yarikoptic yarikoptic commented Apr 23, 2020

minor comment: I would have made it copy-file (copy_file in Python) to be more inline with the rest of the datalad API separating action and object (e.g. create-sibling, download-url, etc)

mih added a commit to mih/datalad that referenced this issue Apr 23, 2020
datalad#4430 (comment)

Now it looks like this:

```
datalad create a
echo one >a/one
datalad -C a save
datalad create b
datalad copyfile a/one b/two
tree --charset=ascii a b | colrm 50
a
`-- one -> .git/annex/objects/zZ/KF/MD5E-s4--5bbf
b
`-- two -> .git/annex/objects/zZ/KF/MD5E-s4--5bbf
```
@mih
Copy link
Member Author

@mih mih commented Apr 23, 2020

There might be good implementation or design reasons for not being able to specify a target with a different name

There weren't ;-)

Works now (857fd8d), added to tests, too.

@mih
Copy link
Member Author

@mih mih commented Apr 23, 2020

@adswa

If specifying different target names is not possible, maybe warn/report better in accidental attempts of copying identically named files?

I went another way. cp would overwrite the original file, and now this command does that too. In combination with the rectified behavior of being able to perform renaming copies, I think this addresses this aspect.

@mih
Copy link
Member Author

@mih mih commented Apr 23, 2020

minor comment: I would have made it copy-file (copy_file in Python) to be more inline with the rest of the datalad API separating action and object (e.g. create-sibling, download-url, etc)

@yarikoptic This is implemented now.

@mih
Copy link
Member Author

@mih mih commented Apr 23, 2020

By chance, the very first thing I did was to copy an annexed file (with retrieved content) into a dataset with no annex (a few HCP subjects files into the top-level human-connectome-project-openacess dataset). Afterwards I end up with a broken symlink (understandable)

This is not intentional. It should have copied the content, not the symlink. I will investigate.

Would it be maybe good to warn when it is detected that annexed contents are copied into datasets without an annex? I'm thinking of someone copying a dataset with large data into a plain Git repository (e.g, maybe one for their code) and being not able to retrieve any data afterwards.

I have not thought about it this way so far, but I think that is a fair point: If a file was annexed in its original location, it probably was for a reason. Hence might might make sense to fail....

This might hint at the necessity to introduce a --force argument with some modes, one of them being to commit originally annexed files into git at the destination.

Also: When I copyfile an annexed file I didn't get yet into a non-annex dataset it fails, but the warning is misleading. To me, the warning would suggest that I would need to get content first, which is futile in this case - even if I did, copying it into a dataset without an annex will turn it into a broken symlink.

The warning in this case says what I intended, the file should get annexed.

@mih
Copy link
Member Author

@mih mih commented Apr 23, 2020

Behavior of --recursive

This is fixed now, and the prev behavior was unintentional:

% tree yodads noyodads
yodads
├── CHANGELOG.md
├── code
│   ├── README.md
│   └── script
└── README.md
noyodads
├── CHANGELOG.md
├── code
│   ├── README.md
│   └── script
└── README.md

@mih
Copy link
Member Author

@mih mih commented Apr 23, 2020

Something I anticipate to be confusing, but can't see a way around

If I have a dataset with configurations that prevent files from being annexed, copying these files into a new dataset will annex them. This is behavior makes sense as long someone knows about files in annex versus in Git, but for anyone using the command naively it may be confusing

I have to investigate that. I would really really like to avoid having to anticipate/inspect what would happen on save, but maybe there is a way to catch common scenarios and work them int o a force mode.

@mih
Copy link
Member Author

@mih mih commented Apr 24, 2020

@adswa I think I have found and fixed all immediate bugs you discovered. There is no --force yet, and the warning strategy stays the same for now. But If you could have another look, and see if the present behavior is an improvment, I'd be glad. Thx!

@mih mih force-pushed the nf-copy branch 2 times, most recently from e1338dd to 1ad7123 Compare Apr 24, 2020
@kyleam
Copy link
Contributor

@kyleam kyleam commented Apr 24, 2020

I pushed some commits, most of them related to the failing test. Please revert/rework any of them if needed.

@mih
Copy link
Member Author

@mih mih commented Apr 24, 2020

@kyleam wonderful, thx!

@mih mih changed the title NF: Rough draft of a copy() command NF: copy_file() command Apr 26, 2020
@mih
Copy link
Member Author

@mih mih commented Apr 26, 2020

I hope the last push will make the cripples FS tests pass. The only thing missing from my POV is a docstring.

@mih mih marked this pull request as ready for review Apr 27, 2020
@mih
Copy link
Member Author

@mih mih commented Apr 27, 2020

I can tune the history, if desired. I would like to maintain the contributions by individual people, though.

Copy link
Contributor

@kyleam kyleam left a comment

Thanks for the examples, @adswa.

datalad/local/copy_file.py Outdated Show resolved Hide resolved
datalad/local/copy_file.py Outdated Show resolved Hide resolved
datalad/local/copy_file.py Outdated Show resolved Hide resolved
datalad/local/copy_file.py Outdated Show resolved Hide resolved
@kyleam
Copy link
Contributor

@kyleam kyleam commented Apr 27, 2020

I was hoping to see more people chime in on the interface.

From my perspective, it looks nice and the simple things I tried behaved as I expected. The one thing I notice from the examples is that common use cases would require repeating the dataset path as both the target directory and dataset. There is a comment asking whether the target directory, if not given, should be set to the dataset. That'd remove the need to repeat in many cases, but then I guess there's an interaction between options that makes the interface conceptually more complicated. @mih, is that your main concern with making the target directory default to --dataset?

mih and others added 23 commits May 2, 2020
`annex fromkey` doesn't work there, hence built a special copy mode
that simply copies the (in this case) required file content, and let's
save do the rest.
This changed in the previous commit.
test_copy_file_prevent_dotgit_placement calls copy_file() with
.git/config.  This path gets passed to get_content_annexinfo(), which
returns an empty dict because 'git ls-files -o' reasonably does not
produce output for .git/config files [0].  This copy_file() call ends
up signaling a KeyError [1] because we unconditionally call popitem()
on the return value of get_content_annexinfo().

Guard against calling popitem() on an empty dict.

[0]: Note, though, that there was a longstanding regression where
     calling 'git ls-files -o' on .git/ files would return output.
     This was fixed in Git v2.25.0, specifically b9670c1f5e (dir: fix
     checks on common prefix directory, 2019-12-19).

[1]: https://ci.appveyor.com/project/mih/datalad/builds/32479988/job/6d90edw45gbc03p1#L1080
…ormer not

This avoids duplication of input argument values (see examples), without
changing behavior of the command. All examples and unittest calls are
minified to make use of this new feature.
Drop the comment about setting target_dir to ds.pathobj because
6b227f4 took care of that.

Drop part of the comment about guarding against .git/ copying, which
was addressed by d2d7411 (BF: Prevent total breakage of target Git
by copying Git-internals) and f77d5b9 (BF: Prevent placement of a
rogue '.git' via recursion or madness).
@mih
Copy link
Member Author

@mih mih commented May 2, 2020

I decided to rebase on master in order to strip

  • 929102a (function rename)
  • 48adce4 (approximate special remote filtering)

rather than adding explicit commits that revert these.

I left the exc_str() call in. I still very much dislike the messing setup presently used by DataLad. It is targeting developers, optimizing for debugging. It confuses users with wordy, hard to parse, duplicate messages. But this single message won't make a dent. I filed a general issue re exc_str(). #4478

@mih
Copy link
Member Author

@mih mih commented May 2, 2020

I force-pushed the rebase and added a final documentation tuning. I will merge, once the tests are completed.

@mih mih merged commit c1a660c into datalad:master May 2, 2020
12 checks passed
@mih mih deleted the nf-copy branch May 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants