-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ignore keyword argument to copytree #272
Conversation
Approved the CI tests. Note: the live tests will fail since this is a PR from a fork, but that is fine. We'll run those on a local branch once this PR is ready. |
Fixed the warnings from Black 👌 Wasn't sure if this action was used in the project. I think there's already an issue suggesting the inclusion of a CONTRIBUTING.md file, but it has my +1 to explicitly state that Black is being used and also explain how to execute the tests ( |
Yep, thanks. Would you mind noting anything you hit in that issue so it can go in when we write it? |
Ok, seems like the latest issue was related to typing. I was using the wrong return type for the
Sure, I'll leave a comment with some ideas |
cloudpathlib/cloudpath.py
Outdated
@@ -773,9 +792,16 @@ def copytree( | |||
"Destination path {destination} of copytree must be a directory." | |||
) | |||
|
|||
if ignore is not None: | |||
ignored_names = ignore(self.fspath, [x.name for x in self.iterdir()]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two thoughts here:
(1) I think that we may want self._no_prefix_no_drive
instead of self.fspath
, which removes the s3://bucket
portion but keeps the rest. This way the first argument won't include the path to the potentially arbitrary local folder in the cache.
(2) I know it follows the CPython implementation, but it's not ideal to call iterdir
twice since there's network call overhead on both of them. Would it be feasible to do something like the following within the main loop instead?
...
for subpath in self.iterdir():
if not ignore(subpath.parent._no_prefix_no_drive, [subpath.name]):
continue
...
Maybe not ideal if calls to ignore
are expensive, but there's a tradeoff with the network overhead of iterdir
that may be worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just looked a little closer at the CPython one—at least point 2 we could address like their implementation where we introduce a _copytree
that gets passed entries
so that we don't do the expensive operation twice (like they do to avoid calling scandir
twice)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I get it. Is not a bad idea since it replaces a single call on every file for multiple calls on a single file each time, but keeping the interface untouched. Would it be a bad idea to cache or store the result of self.iterdir()
and use it twice?
contents = list(self.iterdir()) # Using list because iterdir is a generator
if ignore is not None:
ignored_names = ignore(subpath.parent._no_prefix_no_drive, [x.name for x in contents])
else:
ignored_names = set()
# ...
for subpath in contents:
# ...
EDIT: Just noticed your second comment, I'll give it a try later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like what I proposed is equivalent to what you described. They do the step of creating a list of entries here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Oh yeah, that looks simple and saves us some network overhead! Let's do it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, maybe we should still use their solution. copytree
is a recursive function, so we still are making network calls (at least is half now). Seems like the only way to make a single network call is using their approach to determine the list of entries before actually calling copytree
(which will be named _copytree
then)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the recursive call (when subpath
is a directory) also needs the ignore
argument passed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, maybe we should still use their solution.
copytree
is a recursive function, so we still are making network calls (at least is half now). Seems like the only way to make a single network call is using their approach to determine the list of entries before actually callingcopytree
(which will be named_copytree
then)
NVM, seems like this won't make a difference since iterdir
is not recursive, so we actually need to call iterdir
on each depth level of the recursive function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the recursive call (when
subpath
is a directory) also needs theignore
argument passed
Fixed in the latest push
The `ignore` argument expects a callable that returns a list of names that will be ignored while copying. It supports `shutil.ignore_patterns` [1] to allow using a similar interface. Also, include proper typing according to the behaviour described in the Python docs on `shutil.copytree` and expanding it to support the `cloudpathlib` environment. [1]: https://docs.python.org/3/library/shutil.html#shutil.ignore_patterns Signed-off-by: Antonio Ossa Guerra <aaossa@uc.cl>
When the argument is not used (defaults to `None`), the function should work normally. The arguments is expected to be a callable that, given a list of names, returns a list of ignored names to skip those names while performing the copy. The tests create additional files in the reference path (`p`): a Python file (`ignored.py`) and two directories (`dir1/` and `dir2/`). These files are ignored in two different ways, and tested separatelly: using `shutil.ignore_patterns` and using a custom ignore function The tests are performed by copying the tree (and ignoring the files) and then comparing the source and destination (checking that every file in the destination is also in the source), and asserting that the ignored files do not exist in the destination. Signed-off-by: Antonio Ossa Guerra <aaossa@uc.cl>
* Add ignore keyword argument to copytree (#272) * Add `ignore` keyword argument to `copytree` The `ignore` argument expects a callable that returns a list of names that will be ignored while copying. It supports `shutil.ignore_patterns` [1] to allow using a similar interface. Also, include proper typing according to the behaviour described in the Python docs on `shutil.copytree` and expanding it to support the `cloudpathlib` environment. [1]: https://docs.python.org/3/library/shutil.html#shutil.ignore_patterns Signed-off-by: Antonio Ossa Guerra <aaossa@uc.cl> * Add tests for `ignore` argument on `copytree` When the argument is not used (defaults to `None`), the function should work normally. The arguments is expected to be a callable that, given a list of names, returns a list of ignored names to skip those names while performing the copy. The tests create additional files in the reference path (`p`): a Python file (`ignored.py`) and two directories (`dir1/` and `dir2/`). These files are ignored in two different ways, and tested separatelly: using `shutil.ignore_patterns` and using a custom ignore function The tests are performed by copying the tree (and ignoring the files) and then comparing the source and destination (checking that every file in the destination is also in the source), and asserting that the ignored files do not exist in the destination. Signed-off-by: Antonio Ossa Guerra <aaossa@uc.cl> Signed-off-by: Antonio Ossa Guerra <aaossa@uc.cl> * Update changelog and version Signed-off-by: Antonio Ossa Guerra <aaossa@uc.cl> Co-authored-by: Antonio Ossa-Guerra <aaossa@uc.cl>
Following #145 , the
ignore
argument works similarly toshutil.copytree
, and supportsshutil.ignore_patterns
to allow using a similar interface. Added two tests by creating additional files and ignoring them usingshutil.ignore_patterns
and a custom ignore function.Closes #145