Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dandi move command #1006

Closed
yarikoptic opened this issue May 11, 2022 · 24 comments · Fixed by #1024
Closed

dandi move command #1006

yarikoptic opened this issue May 11, 2022 · 24 comments · Fixed by #1024
Assignees
Labels
enhancement New feature or request released

Comments

@yarikoptic
Copy link
Member

yarikoptic commented May 11, 2022

Prior mentioning/need for such (dandi mv) command is at #848 (comment) and it came up again in context of zarr archives. Since they are mutable ATM, asset ids are changing upon metadata change and not stored locally, there is no sense of identity of a zarr folder. That forbids any efficient renaming of a zarr archive locally.

With dandi move, similar one to dandi delete we have, we could rename locally and on the server really fast, thus making subsequent dandi upload if so needed update an existing zarr archive instead of uploading a new one and deleting an old one.

Extra points to keep in mind

  • IIRC dandi organize can operate in-place (within existing dandiset). Then in such scenario "by default" it should use dandi move internally to move files. But it should be some kind of a switch which would do server side change as well or not.
  • Originally I thought to file it as dandi mv but decided to name it dandi move instead -- Harmonize commands flavor #1005
@jwodder
Copy link
Member

jwodder commented May 12, 2022

@yarikoptic

  • What exactly do you want the syntax of this command to be? Should it be possible to run it outside of a local Dandiset folder, or is a local Dandiset required?
  • What exactly are you asking for with regard to dandi organize? That sort of seems like it should be a separate issue+PR.

@yarikoptic
Copy link
Member Author

  • What exactly do you want the syntax of this command to be?

something simple/obvious, thus basic functionality very close to

$> mv --help
Usage: mv [OPTION]... [-T] SOURCE DEST
  or:  mv [OPTION]... SOURCE... DIRECTORY
  or:  mv [OPTION]... -t DIRECTORY SOURCE...
Rename SOURCE to DEST, or move SOURCE(s) to DIRECTORY.

for starters I think we can avoid needing -T or -t, and just have move [OPTIONS] SOURCE(S) DEST, and if DEST is not a directory -- just a single SOURCE should it be, otherwise -- error. If DEST is a directory -- move ALL SOURCE(S) into that directory. SOURCE(S) could be folders.

OPTIONS:

  • --regex FROM_REGEX, so if specified, SOURCEs would be regexes to select paths by [*], DEST is a regex "TO" rename to each of the SOURCE(S) (if application of regex changes the filename).
  • --work-on auto,local,remote,both: if --dataset (below) is specified -- explicit 'local' or 'both' should cause error. Otherwise 'auto' == 'both' i.e. changes would performed on local and remote dandisets. if 'remote' is specified, must be in a dandiset or have --dandiset option to point to dandiset
  • --existing error,overwrite,skip: if destination path (not just DEST/ but the destination path) already exists -- what to do?!

FWIW -- I would be ok if restriction for --regex REGEX mode would be to have a single SOURCE to not gather a union from multiple regex match sets.

Should it be possible to run it outside of a local Dandiset folder, or is a local Dandiset required?

Good question. I think it would be valuable to allow to operate on remote dandiset.
Let's add -d|--dandiset URI (so could be -d DANDI:000029) option to specify remote dandiset. Then SOURCE(S) and must be paths relative to the top of the dandiset.

So sample uses

  • UC1: rename all .tiff to .tif in a dandiset in the main archive: dandi move -d DANDI:009999 --regex '\.tiff$' '.*' '.tif' which would just apply that regex to all files but will not do anything to non-.tiff files since transformation would not change their names. More specific targetting would be dandi move -d DANDI:009999 --regex '\.tiff$' '.*\.tiff$' '.tif'
  • UC2: move data files (or could be dirs - who knows) starting with sub- under a micr/ folder: dandi move --regex '\(ses-[^/]+\)/sub-' '.*' '\1/micr/sub-' (demonstrates reuse of group from FROM_REGEX in DEST)

NB alternative is just to make it --regex option, and then SOURCE is both regex for selection and specification (via groups) on what to replace in DEST. So UC1: dandi move -d DANDI:009999 --regex '\.tiff$' '.tif'; UC2: dandi move --regex '\(ses-[^/]+\)/sub-' '\1/micr/sub-'. WDYT @jwodder -- would this be better?

  • What exactly are you asking for with regard to dandi organize? That sort of seems like it should be a separate issue+PR.

Sure -- could be a separate one (will do later), but I thought that might be good to have that use case in mind.

Notes:

  • regex option makes it easier to shoot in the foot so that multiple renamed files point to the same filename. In such a case we must just error out
    • to implement that , first the renames: dict of from: to should be prepared and target list should have only unique entries len(list(renames.values())) == len(set(renames.values())))
    • renames includes only renames -- i.e. key != value
  • a tricky case could be renames like swap filenames between A and B but I don't think we can easily specify such rename with regex. But let's guard for that specifically:
    • assert not set(renames).intersection(renames.values()) since if there is intersection - there might (but not guaranteed) to be a loop, and also renames order would matter -- we can just raise NotImplemntedError for now
  • [*] eventually using native regex support add glob/regex support in the dandi library #979 but could start with getting all assets and applying regex to include if matches. Also if --existing is not equal to "overwrite" - we are likely doomed to get a full list of assets first to ensure that we are not overwriting some existing one. Of cause in case of just few files renames it is cheaper just to test for those specific paths... dunno yet if we should check destinations one by one or get a full list first.

WDYT @jwodder and @satra ?

@jwodder
Copy link
Member

jwodder commented May 31, 2022

@yarikoptic

  • I think I prefer the --regex MATCH DEST syntax to --regex MATCH FIND REPLACE.
  • Should there be a corresponding move() function, like there is for upload(), download(), and organize()? If so, what should its signature be?
  • What should happen when --work-on both is given and the local & remote file hierarchies don't match?

@yarikoptic
Copy link
Member Author

  • I think I prefer the --regex MATCH DEST syntax to --regex MATCH FIND REPLACE.

good. so like in the NB above -- Let's proceed with it then. So in a way MATCH will provide FIND since if there is no MATCH, there is no REPLACE and thus no "move" to happen.

  • Should there be a corresponding move() function, like there is for upload(), download(), and organize()? If so, what should its signature be?

I think so. Signature -- following CLI, I guess smth like

def move(source: Union[str,list], destination: str, *, regex:bool=False, work_on:str='auto', dandiset: Optional[Union[str,Dandiset]]=None, existing: str='error')

so regex would turn source into a regex to match (on list just raise ValueError that should be a single regex) and destination is what to rename into. So following the NB

  • What should happen when --work-on both is given and the local & remote file hierarchies don't match?

Good point! let's just error out with

RuntimeError('Local & remote hierarchies do not match. 10 files (e.g., mumba.nwb) are local only. 5 files (e.g., onserver.nwb) are only on server')

or alike - i.e. providing minimal sufficient information to point to the problem and possibly troubleshoot it.

@jwodder jwodder mentioned this issue May 31, 2022
6 tasks
@jwodder
Copy link
Member

jwodder commented May 31, 2022

@yarikoptic

dandiset: Optional[Union[str,Dandiset]]=None

I assume that the str option is a path to a local Dandiset, and None means to the use the current directory. Is Dandiset here supposed to be dandi.dandiset.Dandiset or a typo for dandi.dandiapi.RemoteDandiset? If the former, how would one specify the archive instance without being able to pass a RemoteDandiset?

@jwodder
Copy link
Member

jwodder commented Jun 1, 2022

@yarikoptic Also:

  • What should be the default value of --existing?
  • Exactly what should trigger an error when the local & remote file hierarchies don't match? Should move() only care if a given source or destination path doesn't match, or should it also error if a path not involved in the move doesn't match?
  • Do you want the interface to use pyout like dandi delete does, or should it just log each renaming and call it a day?

@yarikoptic
Copy link
Member Author

dandiset: Optional[Union[str,Dandiset]]=None

I assume that the str option is a path to a local Dandiset, and None means to the use the current directory. Is Dandiset here supposed to be dandi.dandiset.Dandiset or a typo for dandi.dandiapi.RemoteDandiset? If the former, how would one specify the archive instance without being able to pass a RemoteDandiset?

couldn't give an answer right away... as I depicted in UC1 I thought to have it correspond to the remote dandiset so people could perform operations on it if work_on is "remote" or "both"...

ATM: in most of the high level APIs we rely on CWD to point to local dandiset:

  • dandi organize: There is -d|--dandiset-path in organize only.
  • dandi delete : 1. exposes -i|--dandi-instance to be used along with PATHs. 2. uses provided arg PATH to possibly point to remote dandiset/path - so does not use/need -d . Relies on CWD
  • dandi download - relies on URL args but also has the -i|--dandi-instance (I guess for the cases to overload or disambiguate server deduced from URL?? or why?). Uses CWD to dump to.
  • dandi upload - deduces from CWD, has --dandi-instance
  • dandi digest uses -d but for --digest and since never to operate on dandiset afaik -- no problem/inconsistency

So, none command besides organize has explicit way to point to local dandiset path and at large relies on CWD if it is needed to figure out a local dandiset. And then based on id for local, uses dandi-instance to figure out remote. delete and download also use URI arguments to identify remote dandisets.

It feels that for some consistency we want

  • stay close to delete syntax/semantics
  • get --dandi-instance here too to map to remote dandiset similarly to how delete and upload do
  • need a way to point to remote dandiset for work_on="remote" so -d should allow to take URI to identify the remote dandiset
  • rely on CWD being in the target dandiset if work_on local or both (we might want later on to add --dandiset-path, to allow for "operate on a dandiset not under CWD, and add it also to upload)

WDYT @jwodder -- may be you see some other, simpler/more intuitive interface?

@yarikoptic
Copy link
Member Author

  • What should be the default value of --existing?

"error" as I have it in #1006 (comment) -- we do not want to blindly overwrite.

  • Exactly what should trigger an error when the local & remote file hierarchies don't match? Should move() only care if a given source or destination path doesn't match, or should it also error if a path not involved in the move doesn't match?

I would KISS - compare entire hierarchies in case of work_on "both" . If any difference -- error.

  • Do you want the interface to use pyout like dandi delete does, or should it just log each renaming and call it a day?

I don't think there is much of job for pyout but we could keep it consistent with delete for now -- at some point we should provide alternative "frontends" to render those messages. Could also then (ab)use its parallelization. Can have source, target local, remote, message columns I guess, where local and remote are for corresponding status'es and would be present/absent depending on work_on setting.

another wishlist item which came to mind due to presence of --regex: --dry-run which would just show what would have been done without doing it. Then those status'es (local/remote) could just say 'dry'.

And later we would add some flag like ls has -f, --format to switch from pyout to simple logging or whatnot... some time...

@jwodder
Copy link
Member

jwodder commented Jun 1, 2022

@yarikoptic That command-line syntax for dandi move sounds fine, but I was primarily asking about the move() function. I'm thinking that it should take dandiset: Union[str, Path, RemoteDandiset, None] = None, where the RemoteDandiset option is for when -d is given on the command line, along with a dandi_instance argument for use when dandiset is a path and work_on is not local.

Another question: Should this be able to operate locally on folders that are not Dandisets (i.e., that do not have a dandiset.yaml)?

@yarikoptic
Copy link
Member Author

@yarikoptic That command-line syntax for dandi move sounds fine, but I was primarily asking about the move() function.

well, those (CLI and function) should at large be the same where CLI just passes in args into move function. move indeed just could be smarter to accept the RemoteDandiset and Path so could be more flexible I guess, but if it would be -- it should probably be "consistent" with CLI:

... where the RemoteDandiset option is for when -d is given on the command line ...

IMHO it would be "consistent" if dandiset: str when passed to move function is treated as a URI of a remote dandiset (thus mapping into RemoteDandiset), and to point to local path it would have to be Path. Then CLI -d "DANDI:000009" and Python code move(dandiset="DANDI:000009",...) would mean the same thing.

Another question: Should this be able to operate locally on folders that are not Dandisets (i.e., that do not have a dandiset.yaml)?

ATM "I don't think so" although I can see then how someone could start using dandi move --work_on local --regex as some kind of a generic renamer helper ;)

@jwodder
Copy link
Member

jwodder commented Jun 1, 2022

@yarikoptic When determining whether the local and remote file hierarchies differ, what should be considered part of the local file hierarchy? Keep in mind that upload() only considers NWB, Zarr, and NGFF assets to be part of the Dandiset unless --allow-any-path is given.

@yarikoptic
Copy link
Member Author

oh right... that makes it tricky/impossible... but may be we want to error out with above information , mention --dry-run, and add explicit flag (--force) to do despite discrepancies? but then it would need to allow for a superset of matching paths (if regex) where some are present only locally or only remotely I guess.

@jwodder
Copy link
Member

jwodder commented Jun 2, 2022

@yarikoptic I'm not clear on what behavior you're describing. Are you saying that only NWB, Zarr, and NGFF assets should be honored in local Dandisets, and if anything else is present, we error out?

@jwodder
Copy link
Member

jwodder commented Jun 2, 2022

@yarikoptic Also, how should this work when run from inside a subdirectory of a Dandiset? Should assets outside the sub-hierarchy be movable?

@yarikoptic
Copy link
Member Author

@yarikoptic I'm not clear on what behavior you're describing. Are you saying that only NWB, Zarr, and NGFF assets should be honored in local Dandisets, and if anything else is present, we error out?

I was thinking -- not specifically any type of an asset, just

  • if no --force:
    • get all local files as we were during dandi upload
    • compare that list to the list of assets on server
    • if different -- raise an Error

but I reflected back on your comment/question yesterday

Should move() only care if a given source or destination path doesn't match...?

and may be let's do just that! i.e. instead of the list of files for dandi upload, do get the files as specified in CLI (regex or not), compare that one to matching assets on the server. They should match. For now then we can even avoid providing --forceed mode of operation until someone asks for it, since it might also be needed to be parametrized so let's wait for use case.

@yarikoptic Also, how should this work when run from inside a subdirectory of a Dandiset? Should assets outside the sub-hierarchy be movable?

I think it would be great to be able to support that since I might want to move * ../somefolder or alike. So, as long as we are not moving outside of the dandiset boundary -- should be all legit/supported.

@jwodder
Copy link
Member

jwodder commented Jun 6, 2022

@yarikoptic

  • When --work-on is both, if moving a local path fails (assuming the local move happens first), should the corresponding remote path still be moved?
  • When --work-on is remote or both, should the client fetch all asset paths from the server at the start, or should it make a request whenever it needs to look up a path or collection of paths?

@yarikoptic
Copy link
Member Author

  • I don't think so, let's error that file move so we retain consistency.
  • I think all at the start, so we could analyze if move would overwrite anything

@jwodder
Copy link
Member

jwodder commented Jun 16, 2022

@yarikoptic You said you want to support moving assets outside of the current directory when running inside a subdirectory. Should this also apply when using regexes? Specifically, when --regex is given, should only assets in the current directory be taken into account, or all assets in the Dandiset?

@yarikoptic
Copy link
Member Author

yarikoptic commented Jun 16, 2022

I think assets from current directory. So, whenever ran on the top directory of dandiset - would be all assets

@jwodder
Copy link
Member

jwodder commented Jun 16, 2022

@yarikoptic If a non-regex move command says to move a file to itself, should that be a warning or error, or should the movement just be ignored? For comparison, mv allows moving a file to itself, and the --regex code currently discards any replacements that don't change anything.

@jwodder
Copy link
Member

jwodder commented Jun 16, 2022

@yarikoptic Also, if the user runs move --work-on=<"both" or "remote"> somepath somedir, where somedir is a directory that exists locally but not remotely (and it doesn't end in a slash), what should happen?

@yarikoptic
Copy link
Member Author

Hmm, if both, and exists locally paths on server should be adjusted as if there was a directory on the server (but we don't have a notion of directory). But if it is 'remote' then we shouldn't AFAIK consult local filesystem and it becomes tricky - consider it a directory only if trailing / was provided in its name

@yarikoptic
Copy link
Member Author

@yarikoptic If a non-regex move command says to move a file to itself, should that be a warning or error, or should the movement just be ignored? For comparison, mv allows moving a file to itself, and the --regex code currently discards any replacements that don't change anything.

I guess just ignore (thus in effect move into itself) with a debug log message

yarikoptic added a commit that referenced this issue Jun 27, 2022
Add `dandi move` command
@github-actions
Copy link

🚀 Issue was released in 0.41.0 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request released
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants