Globbing top-level bucket returns malformed CloudPaths #311

Closed
ssoj13 opened this issue Jan 5, 2023 · 2 comments

Comments

@ssoj13

ssoj13 commented Jan 5, 2023

As I understand it, simple code like this is supposed to work just fine, but it doesn't:

from pprint import pp
from cloudpathlib import AnyPath

root = AnyPath('s3://')
gen = root.glob('*')
buckets = list(gen)
files = list(buckets[0].glob('*'))
pp(files)

The bucket S3Paths returned by the first glob contain malformed URLs such as "s3:////bucket1".
The error happens here: https://github.com/drivendataorg/cloudpathlib/blob/master/cloudpathlib/cloudpath.py#L398
It occurs when s3:// gets joined with /bucket1 via a slash in
https://github.com/drivendataorg/cloudpathlib/blob/master/cloudpathlib/client.py#L64
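
To illustrate how a join like the one described above can produce four slashes, here is a minimal sketch using plain strings. It is not the library's code: `naive_join` and `safe_join` are hypothetical helpers written only for this example.

```python
def naive_join(prefix: str, key: str) -> str:
    """Join two pieces with a slash without normalising either side (hypothetical)."""
    return f"{prefix}/{key}"


def safe_join(prefix: str, key: str) -> str:
    """Join while stripping stray slashes at the seam (hypothetical)."""
    scheme, sep, rest = prefix.partition("://")
    # Keep only non-empty pieces so an empty bucket part adds no extra slash.
    parts = [p for p in (rest.strip("/"), key.strip("/")) if p]
    return f"{scheme}{sep}{'/'.join(parts)}"


print(naive_join("s3://", "/bucket1"))  # s3:////bucket1
print(safe_join("s3://", "/bucket1"))   # s3://bucket1
```

The naive version keeps the two slashes of the scheme, adds its own separator, and then keeps the leading slash of the key, which matches the malformed URL in the report.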

The next problem is that these "bucket" entries don't actually have a bucket attribute set. That causes confusion internally, so the next bucket.glob('*') wreaks havoc: it somehow pulls the 2nd bucket into the 1st one:
raise ValueError("{!r} is not in the subpath of {!r}"
ValueError: '/bucket2' is not in the subpath of '/bucket1' OR one path is relative and the other is absolute.
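
The wording of that message matches what pathlib's `relative_to` raised in Python versions current at the time. The sketch below reproduces the same kind of failure with the standard library alone. It assumes the library calls `relative_to` somewhere during globbing, which the traceback suggests but this thread does not confirm. `is_under` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def is_under(child: str, parent: str) -> bool:
    """Return True if `child` lies inside `parent` (hypothetical helper)."""
    try:
        # relative_to raises ValueError when child is not below parent.
        PurePosixPath(child).relative_to(PurePosixPath(parent))
        return True
    except ValueError:
        return False


print(is_under("/bucket1/root_folder", "/bucket1"))  # True
print(is_under("/bucket2", "/bucket1"))              # False
```

Two sibling buckets are never subpaths of one another, so any code that treats results from bucket2 as children of bucket1 will hit this error.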

Moreover, using the library like this:
files = list(AnyPath('s3://bucket1').glob('*'))
produces the following: S3Path('s3://bucket1/bucket1/root_folder')
which is obviously incorrect, with the bucket name appearing twice in the path (and any subsequent .glob('*') failing as well).

Am I doing something horribly wrong, or is S3 support broken right now?

@pjbull pjbull changed the title Library is broken for S3 buckets? Globbing top-level bucket returns malformed CloudPaths Jan 5, 2023
@pjbull
Member

pjbull commented Jan 5, 2023

Thanks for the report @ssoj13. There are 2 separate issues here, both of which just affect .glob, not any other methods.

Issue 1 - Globbing across buckets

Globbing across buckets is not currently implemented, and likely will not be since it would need to be specially handled.

CloudPath("s3://").glob("*")  # this throws an error

To get all of the buckets that a user can see, you can use iterdir:

CloudPath("s3://").iterdir()  # this lists buckets

In this case, we should at least raise a user-friendly error that indicates that globbing across buckets is not supported.
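
As a rough sketch of what such a guard could look like (this is not the library's implementation, and both `BucketRequiredError` and `require_bucket` are hypothetical names), a check on the parsed URL is enough to detect a missing bucket before globbing:

```python
from urllib.parse import urlparse


class BucketRequiredError(ValueError):
    """Hypothetical error for glob calls on a path with no bucket."""


def require_bucket(url: str) -> str:
    """Return the bucket name, or raise a readable error if there is none."""
    bucket = urlparse(url).netloc
    if not bucket:
        raise BucketRequiredError(
            f"{url!r} has no bucket; globbing across buckets is not supported. "
            "Use iterdir() to list buckets instead."
        )
    return bucket


print(require_bucket("s3://bucket1/folder"))  # bucket1
```

The same check also rejects a malformed URL such as "s3:////bucket1", because everything after the scheme is then parsed as a path rather than a bucket.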

Issue 2 - Globbing at bucket-level results in malformed paths

Your second issue, CloudPath("s3://bucket").glob("*"), is a bug that looks like it was just introduced by #304. You can try version 0.11.0 and see whether it reproduces. It likely will not reproduce there, but globbing will be substantially slower. As a workaround for now, you could use iterdir at the top level, which behaves the same as glob("*"). Also, .glob should work as expected within folders, e.g. CloudPath("s3://bucket/folder").glob("*").
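
A minimal sketch of that workaround, written against any path object that offers `iterdir()` and `.name` (both `pathlib.Path` and `CloudPath` do). `top_level_glob` is a hypothetical helper, not part of cloudpathlib, and it only covers non-recursive patterns.

```python
from fnmatch import fnmatchcase


def top_level_glob(root, pattern="*"):
    """Emulate a non-recursive glob using iterdir() plus name matching."""
    return [child for child in root.iterdir() if fnmatchcase(child.name, pattern)]


# Intended use (assumes cloudpathlib is installed and S3 credentials are set up):
#   from cloudpathlib import CloudPath
#   files = top_level_glob(CloudPath("s3://bucket"))
```

Because it relies only on `iterdir()`, the helper avoids the bucket-level `.glob` code path affected by this bug.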

The fix here should be to properly form paths at the top level so this doesn't happen.

@pjbull
Member

pjbull commented Jan 5, 2023

@ssoj13 as of #312, Issue 1 now raises a helpful error, and Issue 2 should be fixed. You can get it by upgrading to 0.12.1, which is on PyPI now.

@pjbull pjbull closed this as completed Jan 5, 2023