Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ignore keyword argument to copytree #272

Merged
merged 2 commits into from Sep 27, 2022
Merged

Add ignore keyword argument to copytree #272

merged 2 commits into from Sep 27, 2022

Conversation

aaossa
Copy link
Contributor

@aaossa aaossa commented Sep 26, 2022

Following #145 , the ignore argument works similarly to shutil.copytree, and supports shutil.ignore_patterns to allow using a similar interface. Added two tests by creating additional files and ignoring them using shutil.ignore_patterns and a custom ignore function.

Closes #145

@pjbull
Copy link
Member

pjbull commented Sep 26, 2022

Approved the CI tests. Note: the live tests will fail since this is a PR from a fork, but that is fine. We'll run those on a local branch once this PR is ready.

@aaossa
Copy link
Contributor Author

aaossa commented Sep 26, 2022

Fixed the warnings from Black 👌 Wasn't sure if this action was used in the project. I think there's already an issue suggesting the inclusion of a CONTRIBUTING.md file, but it has my +1 to explicitly state that Black is being used and also explain how to execute the tests (make test)

@pjbull
Copy link
Member

pjbull commented Sep 26, 2022

I think there's already an issue suggesting the inclusion of a CONTRIBUTING.md file, but it has my +1 to explicitly state that Black is being used and also explain how to execute the tests (make test)

Yep, thanks. Would you mind noting anything you hit in that issue so it can go in when we write it?

@aaossa
Copy link
Contributor Author

aaossa commented Sep 27, 2022

Ok, seems like the latest issue was related to typing. I was using the wrong return type for the ignore keyword argument. Instead of collections.abc.Sequence the correct type is collections.abc.Iterable because a set is not a sequence.


Yep, thanks. Would you mind noting anything you hit in that issue so it can go in when we write it?

Sure, I'll leave a comment with some ideas

@@ -773,9 +792,16 @@ def copytree(
"Destination path {destination} of copytree must be a directory."
)

if ignore is not None:
ignored_names = ignore(self.fspath, [x.name for x in self.iterdir()])
Copy link
Member

@pjbull pjbull Sep 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two thoughts here:

(1) I think that we may want self._no_prefix_no_drive instead of self.fspath, which removes the s3://bucket portion but keeps the rest. This way the first argument won't include the path to the potentially arbitrary local folder in the cache.

(2) I know it follows the CPython implementation, but it's not ideal to call iterdir twice since there's network call overhead on both of them. Would it be feasible to do something like the following within the main loop instead?

...

for subpath in self.iterdir():
    if not ignore(subpath.parent._no_prefix_no_drive, [subpath.name]):
        continue
    ...  

Maybe not ideal if calls to ignore are expensive, but there's a tradeoff with the network overhead of iterdir that may be worth it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just looked a little closer at the CPython one—at least point 2 we could address like their implementation where we introduce a _copytree that gets passed entries so that we don't do the expensive operation twice (like they do to avoid calling scandir twice)

Copy link
Contributor Author

@aaossa aaossa Sep 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I get it. Is not a bad idea since it replaces a single call on every file for multiple calls on a single file each time, but keeping the interface untouched. Would it be a bad idea to cache or store the result of self.iterdir() and use it twice?

contents = list(self.iterdir())  # Using list because iterdir is a generator

if ignore is not None:
    ignored_names = ignore(subpath.parent._no_prefix_no_drive, [x.name for x in contents])
else:
    ignored_names = set()

# ...

for subpath in contents:
    # ...

EDIT: Just noticed your second comment, I'll give it a try later

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like what I proposed is equivalent to what you described. They do the step of creating a list of entries here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Oh yeah, that looks simple and saves us some network overhead! Let's do it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, maybe we should still use their solution. copytree is a recursive function, so we still are making network calls (at least is half now). Seems like the only way to make a single network call is using their approach to determine the list of entries before actually calling copytree (which will be named _copytree then)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the recursive call (when subpath is a directory) also needs the ignore argument passed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, maybe we should still use their solution. copytree is a recursive function, so we still are making network calls (at least is half now). Seems like the only way to make a single network call is using their approach to determine the list of entries before actually calling copytree (which will be named _copytree then)

NVM, seems like this won't make a difference since iterdir is not recursive, so we actually need to call iterdir on each depth level of the recursive function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the recursive call (when subpath is a directory) also needs the ignore argument passed

Fixed in the latest push

The `ignore` argument expects a callable that returns a list of
names that will be ignored while copying. It supports
`shutil.ignore_patterns` [1] to allow using a similar interface.

Also, include proper typing according to the behaviour described in
the Python docs on `shutil.copytree` and expanding it to support the
`cloudpathlib` environment.

[1]: https://docs.python.org/3/library/shutil.html#shutil.ignore_patterns

Signed-off-by: Antonio Ossa Guerra <aaossa@uc.cl>
When the argument is not used (defaults to `None`), the function
should work normally. The arguments is expected to be a callable that,
given a list of names, returns a list of ignored names to skip those
names while performing the copy.

The tests create additional files in the reference path (`p`): a
Python file (`ignored.py`) and two directories (`dir1/` and `dir2/`).
These files are ignored in two different ways, and tested separatelly:
using `shutil.ignore_patterns` and using a custom ignore function

The tests are performed by copying the tree (and ignoring the files)
and then comparing the source and destination (checking that every file
in the destination is also in the source), and asserting that the
ignored files do not exist in the destination.

Signed-off-by: Antonio Ossa Guerra <aaossa@uc.cl>
@pjbull pjbull changed the base branch from master to 272-merge-copytree September 27, 2022 04:13
@pjbull pjbull merged commit 5bfa91e into drivendataorg:272-merge-copytree Sep 27, 2022
@pjbull
Copy link
Member

pjbull commented Sep 27, 2022

Thanks @aaossa! I've pulled your changes into a branch that will run tests against the live servers and opened a PR for that branch at #273. If those all pass, we'll merge it. Much appreciated.

pjbull added a commit that referenced this pull request Sep 27, 2022
* Add ignore keyword argument to copytree (#272)

* Add `ignore` keyword argument to `copytree`

The `ignore` argument expects a callable that returns a list of
names that will be ignored while copying. It supports
`shutil.ignore_patterns` [1] to allow using a similar interface.

Also, include proper typing according to the behaviour described in
the Python docs on `shutil.copytree` and expanding it to support the
`cloudpathlib` environment.

[1]: https://docs.python.org/3/library/shutil.html#shutil.ignore_patterns

Signed-off-by: Antonio Ossa Guerra <aaossa@uc.cl>

* Add tests for `ignore` argument on `copytree`

When the argument is not used (defaults to `None`), the function
should work normally. The arguments is expected to be a callable that,
given a list of names, returns a list of ignored names to skip those
names while performing the copy.

The tests create additional files in the reference path (`p`): a
Python file (`ignored.py`) and two directories (`dir1/` and `dir2/`).
These files are ignored in two different ways, and tested separatelly:
using `shutil.ignore_patterns` and using a custom ignore function

The tests are performed by copying the tree (and ignoring the files)
and then comparing the source and destination (checking that every file
in the destination is also in the source), and asserting that the
ignored files do not exist in the destination.

Signed-off-by: Antonio Ossa Guerra <aaossa@uc.cl>

Signed-off-by: Antonio Ossa Guerra <aaossa@uc.cl>

* Update changelog and version

Signed-off-by: Antonio Ossa Guerra <aaossa@uc.cl>
Co-authored-by: Antonio Ossa-Guerra <aaossa@uc.cl>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add ignore argument to copytree that lets one filter stuff out
2 participants