Skip to content

Conversation

@ianthomas23
Copy link
Collaborator

Fixes #1234.

This PR fixes the cp, get and put of a list of files in different directories to a single target directory. For example

cp(["source/file1", "source/subdir/subfile2"], "target/")

results in

📁 source
├── 📄 file1
└── 📄 subfile2

This fix has involved adding a new kwarg to other_paths to tell it when to ignore the directory structure of the input files as this information cannot be derived from the existing arguments. I've called this argument flatten for brevity, but this could be changed to another names such as flatten_input or ignore_dirs for example.

@martindurant
Copy link
Member

Windows, huh

@ianthomas23
Copy link
Collaborator Author

Indeed! I will check on this next week when I have access to my Windows machine.

@ianthomas23
Copy link
Collaborator Author

This now works on Windows. With hindsight it is obvious that a check for endswith("/") doesn't work on Windows as the slash can be the other way around, depending on whether it has already been posix-ised or not. I've added two new functions for this, trailing_sep and trailing_sep_maybe_asterisk and put them near make_posix_path.

The fix here only applied to sync.py not asyn.py. If I try the same fixes to asyn.py they work but break something else and I veer off on a tangent. So here I am keeping the scope as tight as possible. When I have fixed the other cp/get/put issues for the sync code I will port them all to async and get them working with the derived tests in s3fs and gcsfs.

@martindurant
Copy link
Member

Ugh, all a bit painful :|

This is fine, but I wonder the following:

  • should we posix-ise paths sooner to avoid this kind of thing (and this only need happen for the local filesystem)
  • are we contributing significant overhead here? My guess is no, the string checks shouldn't be happening millions of times, but I would appreciate a thought to it

@ianthomas23
Copy link
Collaborator Author

Taking your questions in reverse order:

I don't think we are introducing significant overhead. Each of the trailing_whatever functions only needs to look at the last and possibly last but one character of the supplied string, and we call the trailing_whatever functions a maximum of twice per call to cp, get or put.

In comparison, make_path_posix is quite complicated and although there are some fast code paths it is possible that the function needs to walk the supplied path multiple times, e.g. for the "~" in path and the two re.match calls (but the latter are compiled and therefore fast?). If we put a compulsory make_path_posix call on each of the 2 path args to cp, get and put we will also have to do so for expand_path and other_paths as they are external functions, so that is 4 make_path_posix calls per cp, get or put call, and will be more if either source or target paths are lists of files. This must typically be a lot slower than the maximum two trailing_whatever calls.

However, it hadn't really occurred to me that the make_path_posix is only needed for the local filesystem. So we could override every external function in LocalFileSystem that takes a path to posix-fy it, and then we'd be sure that all other code only ever needs to use forward slashes. I guess that is more work than this PR but the end point will be simpler code overall, but probably we will (at least temporarily) break something in the transition.

Do you have preference?

@martindurant
Copy link
Member

OK, that sounds like good reasoning.

Let's leave a set of TODO comments in trailing_whatever explaining this with a link to this PR, so that if we feel like making the effort to refactor later, the trail is clear. Then this can go in as is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Copying a list of files from different directories gives (possibly) wrong result

2 participants