Handle `mkdir` for cloud providers that support creating directories #51

jayqi · 2020-08-29T04:08:57Z

S3 has an interesting situation with folders.

Like other object stores like Azure, it has a flat structure, and when you upload a file to a/b/c.txt for example, it creates an object literally named a/b/c.txt. The directories a and b aren't real and don't exist. The web console has special behavior to fake those as folders in the UI. When you delete c.txt, a and b will automatically be gone.

However, S3 does have another mechanism that lets you have folders. There is a "Create Folders" button in the web console, which lets you make a folders that exist even while empty. These turn out to actually be dummy files with a trailing slash. So if you create a/, there is actually an object in your bucket a/ which is not actually a folder, but the S3 console will treat it like a folder for the UI. You can equivalently upload a file to a/ and it will do the same thing.

We need to think through the implications of this and what cloudpathlib should support.

pathlib PurePosixPath strips trailing / on instantiation, so this is something that doesn't map to a representation cleanly through PurePosixPath
~~As a result, it's not currently possible to create an S3Path object that points to an S3 folder.~~ EDIT: This was incorrect. See discussion in Handle and test for s3 fake directories #190. This is possible because the string representation of the input URI is the main basis for a CloudPath object, not a PurePosixPath.
Additionally, the S3Path.mkdir method is not implemented.

The text was updated successfully, but these errors were encountered:

jayqi · 2020-09-01T23:54:02Z

This already leads to some tricky unintuitive behavior in our existing implementation.

In filesystems, you can't have a directory and a file share a name:

>>> p = Path()
>>> (p / "foo").touch()
>>> (p / "foo").mkdir()
Traceback (most recent call last):
...
FileExistsError: [Errno 17] File exists: 'foo'

S3 has no problem doing this. Not only that, but there is path dependence for how cloudpathlib handles it.

Making "foo" first and then "foo/bar.txt" works, but now masks the directory.

>>> s = CloudPath("s3://my-bucket/")
>>> (s / "foo").touch()
>>> (s / "foo" / "bar.txt").touch()
>>> (s / "foo").is_dir()
False

(s is an S3Path)

Making "foo/bar.txt" first and then making "foo" throws an error, because pathlib has some notion of representing the virtual folders, but then gets confused when you try to manipulate it.

>>> (s / "foo" / "bar.txt").touch()
>>> (s / "foo").exists()
True
>>> (s / "foo").is_dir()
True
>>> (s / "foo").touch()
Traceback (most recent call last):
...
botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.

jayqi · 2020-09-02T01:56:04Z

Overall, I think this is just going to be a problem with all flat object stores. I can also create foo, foo/, and foo/bar.txt files in Azure Blob Storage and run into problems with the disambiguation. I'm not sure there are any great solutions around this. We will have to have limitations because we're both modeling the object store with a file system (with CloudPath interface and mechanics) and actually using a file system to cache the object store.

Doing some testing with AWS CLI:

foo/ and foo/bar.txt both exist:
- aws s3 ls shows foo/ as a directory
- aws s3 sync will download foo/bar.txt and ignore foo/
only foo/ exists
- aws s3 ls shows foo/ as a directory
- aws s3 sync will ignore foo/, not even create a directory locally
foo and foo/bar.txt both exist:
- aws s3 ls shows foo/ as a directory and foo as a file
- aws s3 sync will download foo/bar.txt first, and then error on foo because the directory foo/ already exists locally. This isn't fatal, and the process will continue to download any other files that are in the current directory.

In that case, I think we will need to come up with a specification of what happens, and document how it works. For example, we may decide:

cloudpathlib cannot directly interact with files with trailing slashes (explicitly will not support S3 "folders")
if foo and foo/bar.txt both exist, the directory will take precedence in all cloudpathlib methods. foo will be masked and users will not be able to directly interact with it so long as foo/bar.txt exists. If foo was in the local cache, it gets deleted from the local cache.

This seems like it would best match what the AWS CLI will do.

pjbull · 2020-09-04T17:01:49Z

When we add #17 we can make tests that handle these scenarios when we define what the behavior should be.

analog-cbarber · 2021-08-28T17:10:12Z

This is also broken for blob storage using Data Lake 2 Storage which DOES have a concept of hierarchical directory namespaces.

pjbull · 2021-08-28T17:11:53Z

Thanks @analog-cbarber. Can you provide a code snippet that repros the problem and some more details on the configuration?

analog-cbarber · 2021-08-28T19:41:13Z

If you create a data lake 2 storage container and create a directory in the container using mkdir(), which works, the current implementation returns False for is_dir() and subsequently does not support rmtree(). I am not sure if there is a way to remove such a directory using your current API (I tried unlink but I got an error). I can iterate over the files using glob and remove those, but not the directory entries themselves.

az_test_client = AzureBlobClient(AZ_TEST_ACCOUNT, AZ_SAS)
az_test_path = AzureBlobPath(f'az://{AZ_TEST_CONTAINER}', client=az_test_client)
subdir = az_test_path.joinpath('subdir')
subdir.mkdir()
assert subdir.is_dir() # fails

SlimFrkh · 2022-09-13T14:09:55Z

If the trick to create an S3 folder is to touch (make sure its path ends with "/" to not get mixed with a regular file), why don't we just do the following ?

class S3Path(CloudPath):
    ...
    # simplified version
    def mkdir(self, parents=False, exist_ok=False):
        path = path if str(self).endswith("/") else S3Path(f"{str(self)}/")
        path.touch(exist_ok=exist_ok)

pjbull · 2022-12-19T00:34:57Z

Additional useful discussion in #295, which has been consolidated to this issue.

pjbull · 2022-12-19T06:07:29Z

Handling treating fake directories on S3 as directories should work correctly now with #302. Changing this issue to just be about implementing the mkdir command for the cloud providers that support it.

jayqi mentioned this issue Aug 29, 2020

rmdir and rmtree #52

Merged

jayqi changed the title ~~Handle folders in S3~~ Figure out how to handle directories and files that look like directories Sep 2, 2020

jayqi added the help wanted Extra attention is needed label Sep 4, 2020

pjbull added the S3 label Oct 2, 2020

jayqi mentioned this issue Jan 5, 2022

Handle and test for s3 fake directories #190

Merged

the-matt-morris mentioned this issue Nov 30, 2022

Enable S3Path.mkdir #295

Closed

pjbull mentioned this issue Dec 18, 2022

Use list_objects_v2 API for S3 #302

Merged

pjbull changed the title ~~Figure out how to handle directories and files that look like directories~~ Handle mkdir for cloud providers that support creating directories Dec 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle `mkdir` for cloud providers that support creating directories #51

Handle `mkdir` for cloud providers that support creating directories #51

jayqi commented Aug 29, 2020 •

edited

jayqi commented Sep 1, 2020 •

edited

jayqi commented Sep 2, 2020 •

edited

pjbull commented Sep 4, 2020

analog-cbarber commented Aug 28, 2021

pjbull commented Aug 28, 2021

analog-cbarber commented Aug 28, 2021

SlimFrkh commented Sep 13, 2022 •

edited

pjbull commented Dec 19, 2022

pjbull commented Dec 19, 2022

Handle mkdir for cloud providers that support creating directories #51

Handle mkdir for cloud providers that support creating directories #51

Comments

jayqi commented Aug 29, 2020 • edited

jayqi commented Sep 1, 2020 • edited

jayqi commented Sep 2, 2020 • edited

pjbull commented Sep 4, 2020

analog-cbarber commented Aug 28, 2021

pjbull commented Aug 28, 2021

analog-cbarber commented Aug 28, 2021

SlimFrkh commented Sep 13, 2022 • edited

pjbull commented Dec 19, 2022

pjbull commented Dec 19, 2022

Handle `mkdir` for cloud providers that support creating directories #51

Handle `mkdir` for cloud providers that support creating directories #51

jayqi commented Aug 29, 2020 •

edited

jayqi commented Sep 1, 2020 •

edited

jayqi commented Sep 2, 2020 •

edited

SlimFrkh commented Sep 13, 2022 •

edited