Globbing top-level bucket returns malformed CloudPaths #311

ssoj13 · 2023-01-05T00:41:24Z

As I understand it, simple code like this is supposed to work just fine, but it doesn't:

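from pprint import pp
from cloudpathlib import AnyPath

root = AnyPath('s3://')
gen = root.glob('*')
buckets = list(gen)  # expected: one S3Path per bucket
files = list(buckets[0].glob('*'))  # glob inside the first bucket
pp(files)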

As a result, the bucket-level S3Paths have malformed URLs like "s3:////bucket1".
The error happens here: https://github.com/drivendataorg/cloudpathlib/blob/master/cloudpathlib/cloudpath.py#L398
It happens when s3:// gets joined with /bucket1 via a slash in
https://github.com/drivendataorg/cloudpathlib/blob/master/cloudpathlib/client.py#L64
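A minimal sketch of how the extra slashes can arise, using plain string operations. This is not cloudpathlib's actual code; the variable names are made up for illustration, and the "fixed" line is only one possible way to avoid the doubled separator.

```python
# Illustrative only: plain string arithmetic, not the library's implementation.
prefix = "s3://"     # cloud prefix of the root path
child = "/bucket1"   # child name that already starts with a slash

# Joining with an explicit "/" stacks the separators.
malformed = f"{prefix}/{child}"
print(malformed)  # s3:////bucket1

# Dropping the child's leading slash before joining avoids the problem.
fixed = prefix + child.lstrip("/")
print(fixed)  # s3://bucket1
```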

The next problem is that these "bucket" entries don't actually have a bucket attribute set. That causes confusion internally: the next bucket.glob('*') somehow pulls the 2nd bucket into the 1st one and fails with:
raise ValueError("{!r} is not in the subpath of {!r}"
ValueError: '/bucket2' is not in the subpath of '/bucket1' OR one path is relative and the other is absolute.
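The message has the same shape as the one pathlib raises from relative_to when a path is not located under the given parent. Assuming that is where it originates here (the traceback is truncated, so this is a guess), the same kind of error can be reproduced with the standard library alone. The exact wording of the message varies between Python versions.

```python
from pathlib import PurePosixPath

error = None
try:
    # '/bucket2' is a sibling of '/bucket1', not a child, so this must fail.
    PurePosixPath("/bucket2").relative_to("/bucket1")
except ValueError as exc:
    error = exc
    print(f"ValueError: {exc}")
```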

Moreover, using the library like this:
files = list(AnyPath('s3://bucket1').glob('*'))
produces: S3Path('s3://bucket1/bucket1/root_folder')
which is obviously incorrect, with the bucket name appearing twice in the path (and a subsequent .glob('*') fails as well).

Am I doing something horribly wrong, or is S3 broken right now?


pjbull · 2023-01-05T01:05:00Z

Thanks for the report @ssoj13. There are 2 separate issues here, both of which just affect .glob, not any other methods.

Issue 1 - Globbing across buckets

Globbing across buckets is not currently implemented, and likely will not be since it would need to be specially handled.

CloudPath("s3://").glob("*")  # this throws an error

To get all of the buckets that a user can see, you can use iterdir:

CloudPath("s3://").iterdir()  # this lists buckets

In this case, we should at least raise a user-friendly error that indicates that globbing across buckets is not supported.
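As a rough sketch of what such a check could look like: the helper name, the exception type, and the message below are all hypothetical and are not taken from cloudpathlib. It only uses the standard library to detect a URI that has no bucket component.

```python
from urllib.parse import urlparse


def check_glob_root(uri: str) -> str:
    """Hypothetical guard: return the bucket name, or raise if the URI has none."""
    bucket = urlparse(uri).netloc  # empty string for a bare "s3://"
    if not bucket:
        raise NotImplementedError(
            f"Globbing across buckets is not supported for {uri!r}; "
            "use iterdir() to list buckets, then glob inside each one."
        )
    return bucket


print(check_glob_root("s3://bucket1/folder"))  # bucket1
```

Calling check_glob_root("s3://") raises the NotImplementedError with the message above instead of failing later with a confusing path error.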

Issue 2 - Globbing at bucket-level results in malformed paths

Your second issue, CloudPath("s3://bucket").glob("*"), is a bug that looks like it was just introduced by #304. You can try version 0.11.0 and see whether it reproduces. It likely will not reproduce, but it will be substantially slower. As a workaround for now, you could use iterdir at the top level, which behaves equivalently to glob("*"). Also, .glob should work as expected within folders, e.g. CloudPath("s3://bucket/folder").glob("*").
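A small sketch of the iterdir workaround. The helper name top_level_entries is made up, and a local temporary directory stands in for the bucket so the snippet runs without cloud credentials; it relies only on iterdir() and glob(), which both pathlib.Path and CloudPath provide.

```python
import tempfile
from pathlib import Path


def top_level_entries(root):
    """Hypothetical helper: list direct children via iterdir() instead of glob('*')."""
    return sorted(root.iterdir())


# Local stand-in for a bucket; with cloudpathlib you would pass a CloudPath instead.
with tempfile.TemporaryDirectory() as tmp:
    bucket = Path(tmp)
    (bucket / "root_folder").mkdir()
    (bucket / "data.csv").write_text("a,b\n")

    names = [p.name for p in top_level_entries(bucket)]
    # At a single directory level, iterdir() and glob("*") list the same entries.
    same_as_glob = top_level_entries(bucket) == sorted(bucket.glob("*"))

print(names)
```

On an affected version, the same helper could be called as top_level_entries(CloudPath("s3://bucket")), assuming credentials are configured.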

The fix here should be to properly form paths at the top level so this doesn't happen.

pjbull · 2023-01-05T16:56:08Z

@ssoj13 as of #312, Issue 1 now raises a helpful error, and Issue 2 should be fixed. You can get it by upgrading to 0.12.1, which is on PyPI now.

pjbull changed the title ~~Library is broken for S3 buckets?~~ Globbing top-level bucket returns malformed CloudPaths Jan 5, 2023

pjbull mentioned this issue Jan 5, 2023

Fix globbing top level buckets and #312

Merged

pjbull closed this as completed Jan 5, 2023
