Add ignore keyword argument to copytree #272

aaossa · 2022-09-26T20:21:57Z

Following #145 , the ignore argument works similarly to shutil.copytree, and supports shutil.ignore_patterns to allow using a similar interface. Added two tests by creating additional files and ignoring them using shutil.ignore_patterns and a custom ignore function.
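For readers unfamiliar with the `shutil.copytree` convention being mirrored here, the following is a minimal sketch of the two kinds of callables the description mentions. It uses only the standard library; the custom function `ignore_long_names` and the sample names are hypothetical and not taken from this PR.

```python
import shutil

# shutil.ignore_patterns builds a callable with the (directory, names) signature
# that shutil.copytree expects for its `ignore` argument.
ignore_by_pattern = shutil.ignore_patterns("*.py", "dir*")


def ignore_long_names(directory, names):
    """Hypothetical custom ignore function: skip names longer than 8 characters."""
    return [name for name in names if len(name) > 8]


names = ["ignored.py", "dir1", "dir2", "data.csv"]
skipped_by_pattern = ignore_by_pattern("some/dir", names)
skipped_by_custom = ignore_long_names("some/dir", names)
```

Both callables receive the directory being visited and the names inside it, and return the subset of names to skip.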

Closes #145

pjbull · 2022-09-26T20:36:49Z

Approved the CI tests. Note: the live tests will fail since this is a PR from a fork, but that is fine. We'll run those on a local branch once this PR is ready.

aaossa · 2022-09-26T20:43:51Z

Fixed the warnings from Black 👌 Wasn't sure if this action was used in the project. I think there's already an issue suggesting the inclusion of a CONTRIBUTING.md file, but it has my +1 to explicitly state that Black is being used and also explain how to execute the tests (make test)

pjbull · 2022-09-26T23:03:59Z

> I think there's already an issue suggesting the inclusion of a CONTRIBUTING.md file, but it has my +1 to explicitly state that Black is being used and also explain how to execute the tests (make test)

Yep, thanks. Would you mind noting anything you hit in that issue so it can go in when we write it?

aaossa · 2022-09-27T00:35:34Z

Ok, seems like the latest issue was related to typing. I was using the wrong return type for the ignore keyword argument. Instead of collections.abc.Sequence the correct type is collections.abc.Iterable because a set is not a sequence.
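A small standard-library check illustrates the typing point: the callable built by `shutil.ignore_patterns` returns a `set`, which is iterable but does not satisfy `Sequence`. The directory and file names below are placeholders.

```python
import shutil
from collections.abc import Iterable, Sequence

# The callable built by ignore_patterns returns a set of matching names.
result = shutil.ignore_patterns("*.py")("src", ["a.py", "b.txt"])

is_sequence = isinstance(result, Sequence)  # sets have no ordering or indexing
is_iterable = isinstance(result, Iterable)  # but they can be iterated over
```

This is why annotating the return value as `Iterable` accepts both lists from custom functions and sets from `shutil.ignore_patterns`.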

> Yep, thanks. Would you mind noting anything you hit in that issue so it can go in when we write it?

Sure, I'll leave a comment with some ideas

pjbull · 2022-09-27T01:12:54Z

cloudpathlib/cloudpath.py

@@ -773,9 +792,16 @@ def copytree(
                "Destination path {destination} of copytree must be a directory."
            )

+        if ignore is not None:
+            ignored_names = ignore(self.fspath, [x.name for x in self.iterdir()])


Two thoughts here:

(1) I think that we may want self._no_prefix_no_drive instead of self.fspath, which removes the s3://bucket portion but keeps the rest. This way the first argument won't include the path to the potentially arbitrary local folder in the cache.

(2) I know it follows the CPython implementation, but it's not ideal to call iterdir twice since there's network call overhead on both of them. Would it be feasible to do something like the following within the main loop instead?

    ...
    for subpath in self.iterdir():
        # ignore() returns the names to skip, so a non-empty result means this entry is ignored
        if ignore(subpath.parent._no_prefix_no_drive, [subpath.name]):
            continue
        ...

Maybe not ideal if calls to ignore are expensive, but there's a tradeoff with the network overhead of iterdir that may be worth it.

Just looked a little closer at the CPython one: at least for point 2, we could follow their implementation and introduce a _copytree that gets passed the entries, so that we don't do the expensive operation twice (like they do to avoid calling scandir twice).

Yeah, I get it. It's not a bad idea: it replaces a single call to ignore over every file with multiple calls on a single file each, while keeping the interface untouched. Would it be a bad idea to cache or store the result of self.iterdir() and use it twice?

    contents = list(self.iterdir())  # Using list because iterdir is a generator
    if ignore is not None:
        ignored_names = ignore(self._no_prefix_no_drive, [x.name for x in contents])
    else:
        ignored_names = set()
    # ...
    for subpath in contents:
        # ...

EDIT: Just noticed your second comment, I'll give it a try later

Seems like what I proposed is equivalent to what you described. They do the step of creating a list of entries here

👍 Oh yeah, that looks simple and saves us some network overhead! Let's do it
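The saving discussed above can be sketched without any cloud access. Here `FakeDir` is a hypothetical stand-in for a cloud directory whose `iterdir` counts how often it is called (each call would be a network round trip); none of these names come from cloudpathlib itself.

```python
import shutil
from typing import Iterator, List


class FakeDir:
    """Hypothetical stand-in for a cloud directory; counts listing calls."""

    def __init__(self, names: List[str]) -> None:
        self.names = names
        self.list_calls = 0

    def iterdir(self) -> Iterator[str]:
        self.list_calls += 1  # each listing would be a network call
        yield from self.names


def names_to_copy(directory: FakeDir, ignore=None) -> List[str]:
    contents = list(directory.iterdir())  # materialize once, reuse below
    ignored = set(ignore("dir", contents)) if ignore is not None else set()
    return [name for name in contents if name not in ignored]


fake = FakeDir(["ignored.py", "data.csv"])
kept = names_to_copy(fake, shutil.ignore_patterns("*.py"))
```

Storing the listing in a list lets the same entries feed both the `ignore` call and the copy loop, so the directory is listed only once.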

Ah, maybe we should still use their solution. copytree is a recursive function, so we are still making network calls (at least it's half as many now). Seems like the only way to make a single network call is using their approach to determine the list of entries before actually calling copytree (which will be named _copytree then)

Also, the recursive call (when subpath is a directory) also needs the ignore argument passed

> Ah, maybe we should still use their solution. copytree is a recursive function, so we are still making network calls (at least it's half as many now). Seems like the only way to make a single network call is using their approach to determine the list of entries before actually calling copytree (which will be named _copytree then)

NVM, seems like this won't make a difference since iterdir is not recursive, so we actually need to call iterdir on each depth level of the recursive function

> Also, the recursive call (when subpath is a directory) also needs the ignore argument passed

Fixed in the latest push
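As a rough illustration of the two points settled in this thread (one listing per directory level, and forwarding `ignore` on the recursive call), here is a sketch over a nested dict standing in for a directory tree. The function `copy_tree` and the sample data are hypothetical and are not the cloudpathlib implementation.

```python
import shutil


def copy_tree(tree, ignore=None, path="", listings=None):
    """Copy a nested dict where a dict is a directory and a str is a file."""
    if listings is None:
        listings = []
    listings.append(path)  # record one listing per directory visited
    names = list(tree)
    ignored = set(ignore(path, names)) if ignore is not None else set()
    copied = {}
    for name in names:
        if name in ignored:
            continue
        child = tree[name]
        if isinstance(child, dict):
            # Forward `ignore` so nested directories are filtered as well.
            copied[name] = copy_tree(child, ignore, f"{path}/{name}", listings)
        else:
            copied[name] = child
    return copied


source = {"a.txt": "a", "ignored.py": "x", "sub": {"b.txt": "b", "nested.py": "y"}}
listings = []
result = copy_tree(source, shutil.ignore_patterns("*.py"), listings=listings)
```

Without forwarding `ignore`, `nested.py` would be copied even though it matches the pattern; the listing count equals the number of directories visited, which cannot be reduced further when each listing is non-recursive.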

The `ignore` argument expects a callable that returns a list of names that will be ignored while copying. It supports `shutil.ignore_patterns` [1] to allow using a similar interface.

Also, include proper typing according to the behaviour described in the Python docs on `shutil.copytree` and expanding it to support the `cloudpathlib` environment.

[1]: https://docs.python.org/3/library/shutil.html#shutil.ignore_patterns

Signed-off-by: Antonio Ossa Guerra <aaossa@uc.cl>

When the argument is not used (defaults to `None`), the function should work normally. The argument is expected to be a callable that, given a list of names, returns a list of ignored names to skip those names while performing the copy.

The tests create additional files in the reference path (`p`): a Python file (`ignored.py`) and two directories (`dir1/` and `dir2/`). These files are ignored in two different ways, and tested separately: using `shutil.ignore_patterns` and using a custom ignore function.

The tests are performed by copying the tree (and ignoring the files) and then comparing the source and destination (checking that every file in the destination is also in the source), and asserting that the ignored files do not exist in the destination.

Signed-off-by: Antonio Ossa Guerra <aaossa@uc.cl>
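The test strategy described in this commit message can be approximated locally with the standard library. This sketch uses `shutil.copytree` on a temporary directory as a stand-in for the cloud `copytree`; the helper `copy_and_list`, the file `kept.txt`, and `custom_ignore` are assumptions for illustration, not the tests added in this PR.

```python
import shutil
import tempfile
from pathlib import Path


def custom_ignore(directory, names):
    """Hypothetical custom ignore function matching the same entries."""
    return [n for n in names if n.startswith("dir") or n.endswith(".py")]


def copy_and_list(ignore):
    """Build a small tree, copy it with `ignore`, return (source, destination) names."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "src"
        (src / "dir1").mkdir(parents=True)
        (src / "dir2").mkdir()
        (src / "ignored.py").write_text("print('skip me')")
        (src / "kept.txt").write_text("keep me")
        dst = Path(tmp) / "dst"
        shutil.copytree(src, dst, ignore=ignore)
        return ({p.name for p in src.iterdir()}, {p.name for p in dst.iterdir()})


src_names, dst_by_pattern = copy_and_list(shutil.ignore_patterns("*.py", "dir*"))
_, dst_by_custom = copy_and_list(custom_ignore)
```

The checks mirror the description: everything in the destination also exists in the source, and none of the ignored entries appear in the destination.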

pjbull · 2022-09-27T04:15:40Z

Thanks @aaossa! I've pulled your changes into a branch that will run tests against the live servers and opened a PR for that branch at #273. If those all pass, we'll merge it. Much appreciated.

* Add ignore keyword argument to copytree (#272)

* Add `ignore` keyword argument to `copytree`

The `ignore` argument expects a callable that returns a list of names that will be ignored while copying. It supports `shutil.ignore_patterns` [1] to allow using a similar interface.

Also, include proper typing according to the behaviour described in the Python docs on `shutil.copytree` and expanding it to support the `cloudpathlib` environment.

[1]: https://docs.python.org/3/library/shutil.html#shutil.ignore_patterns

Signed-off-by: Antonio Ossa Guerra <aaossa@uc.cl>

* Add tests for `ignore` argument on `copytree`

When the argument is not used (defaults to `None`), the function should work normally. The argument is expected to be a callable that, given a list of names, returns a list of ignored names to skip those names while performing the copy.

The tests create additional files in the reference path (`p`): a Python file (`ignored.py`) and two directories (`dir1/` and `dir2/`). These files are ignored in two different ways, and tested separately: using `shutil.ignore_patterns` and using a custom ignore function.

The tests are performed by copying the tree (and ignoring the files) and then comparing the source and destination (checking that every file in the destination is also in the source), and asserting that the ignored files do not exist in the destination.

Signed-off-by: Antonio Ossa Guerra <aaossa@uc.cl>

* Update changelog and version

Signed-off-by: Antonio Ossa Guerra <aaossa@uc.cl>

Co-authored-by: Antonio Ossa-Guerra <aaossa@uc.cl>

aaossa mentioned this pull request Sep 26, 2022

Add ignore argument to copytree that lets one filter stuff out #145

Closed

pjbull reviewed Sep 27, 2022

aaossa added 2 commits September 27, 2022 00:17

pjbull changed the base branch from master to 272-merge-copytree September 27, 2022 04:13

pjbull merged commit 5bfa91e into drivendataorg:272-merge-copytree Sep 27, 2022
