Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle mkdir for cloud providers that support creating directories #51

Open
jayqi opened this issue Aug 29, 2020 · 9 comments
Open

Handle mkdir for cloud providers that support creating directories #51

jayqi opened this issue Aug 29, 2020 · 9 comments
Labels
help wanted Extra attention is needed S3

Comments

@jayqi
Copy link
Member

jayqi commented Aug 29, 2020

S3 has an interesting situation with folders.

Like other object stores like Azure, it has a flat structure, and when you upload a file to a/b/c.txt for example, it creates an object literally named a/b/c.txt. The directories a and b aren't real and don't exist. The web console has special behavior to fake those as folders in the UI. When you delete c.txt, a and b will automatically be gone.

However, S3 does have another mechanism that lets you have folders. There is a "Create Folders" button in the web console, which lets you make a folders that exist even while empty. These turn out to actually be dummy files with a trailing slash. So if you create a/, there is actually an object in your bucket a/ which is not actually a folder, but the S3 console will treat it like a folder for the UI. You can equivalently upload a file to a/ and it will do the same thing.

We need to think through the implications of this and what cloudpathlib should support.

  • pathlib PurePosixPath strips trailing / on instantiation, so this is something that doesn't map to a representation cleanly through PurePosixPath
  • As a result, it's not currently possible to create an S3Path object that points to an S3 folder. EDIT: This was incorrect. See discussion in Handle and test for s3 fake directories #190. This is possible because the string representation of the input URI is the main basis for a CloudPath object, not a PurePosixPath.
  • Additionally, the S3Path.mkdir method is not implemented.
@jayqi jayqi mentioned this issue Aug 29, 2020
@jayqi
Copy link
Member Author

jayqi commented Sep 1, 2020

This already leads to some tricky unintuitive behavior in our existing implementation.

In filesystems, you can't have a directory and a file share a name:

>>> p = Path()
>>> (p / "foo").touch()
>>> (p / "foo").mkdir()
Traceback (most recent call last):
...
FileExistsError: [Errno 17] File exists: 'foo'

S3 has no problem doing this. Not only that, but there is path dependence for how cloudpathlib handles it.

Making "foo" first and then "foo/bar.txt" works, but now masks the directory.

>>> s = CloudPath("s3://my-bucket/")
>>> (s / "foo").touch()
>>> (s / "foo" / "bar.txt").touch()
>>> (s / "foo").is_dir()
False

(s is an S3Path)

Making "foo/bar.txt" first and then making "foo" throws an error, because pathlib has some notion of representing the virtual folders, but then gets confused when you try to manipulate it.

>>> (s / "foo" / "bar.txt").touch()
>>> (s / "foo").exists()
True
>>> (s / "foo").is_dir()
True
>>> (s / "foo").touch()
Traceback (most recent call last):
...
botocore.errorfactory.NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.

@jayqi jayqi changed the title Handle folders in S3 Figure out how to handle directories and files that look like directories Sep 2, 2020
@jayqi
Copy link
Member Author

jayqi commented Sep 2, 2020

Overall, I think this is just going to be a problem with all flat object stores. I can also create foo, foo/, and foo/bar.txt files in Azure Blob Storage and run into problems with the disambiguation. I'm not sure there are any great solutions around this. We will have to have limitations because we're both modeling the object store with a file system (with CloudPath interface and mechanics) and actually using a file system to cache the object store.

Doing some testing with AWS CLI:

  • foo/ and foo/bar.txt both exist:
    • aws s3 ls shows foo/ as a directory
    • aws s3 sync will download foo/bar.txt and ignore foo/
  • only foo/ exists
    • aws s3 ls shows foo/ as a directory
    • aws s3 sync will ignore foo/, not even create a directory locally
  • foo and foo/bar.txt both exist:
    • aws s3 ls shows foo/ as a directory and foo as a file
    • aws s3 sync will download foo/bar.txt first, and then error on foo because the directory foo/ already exists locally. This isn't fatal, and the process will continue to download any other files that are in the current directory.

In that case, I think we will need to come up with a specification of what happens, and document how it works. For example, we may decide:

  • cloudpathlib cannot directly interact with files with trailing slashes (explicitly will not support S3 "folders")
  • if foo and foo/bar.txt both exist, the directory will take precedence in all cloudpathlib methods. foo will be masked and users will not be able to directly interact with it so long as foo/bar.txt exists. If foo was in the local cache, it gets deleted from the local cache.

This seems like it would best match what the AWS CLI will do.

@jayqi jayqi added the help wanted Extra attention is needed label Sep 4, 2020
@pjbull
Copy link
Member

pjbull commented Sep 4, 2020

When we add #17 we can make tests that handle these scenarios when we define what the behavior should be.

@pjbull pjbull added the S3 label Oct 2, 2020
@analog-cbarber
Copy link

This is also broken for blob storage using Data Lake 2 Storage which DOES have a concept of hierarchical directory namespaces.

@pjbull
Copy link
Member

pjbull commented Aug 28, 2021

Thanks @analog-cbarber. Can you provide a code snippet that repros the problem and some more details on the configuration?

@analog-cbarber
Copy link

If you create a data lake 2 storage container and create a directory in the container using mkdir(), which works, the current implementation returns False for is_dir() and subsequently does not support rmtree(). I am not sure if there is a way to remove such a directory using your current API (I tried unlink but I got an error). I can iterate over the files using glob and remove those, but not the directory entries themselves.

az_test_client = AzureBlobClient(AZ_TEST_ACCOUNT, AZ_SAS)
az_test_path = AzureBlobPath(f'az://{AZ_TEST_CONTAINER}', client=az_test_client)
subdir = az_test_path.joinpath('subdir')
subdir.mkdir()
assert subdir.is_dir() # fails

@SlimFrkh
Copy link

SlimFrkh commented Sep 13, 2022

If the trick to create an S3 folder is to touch (make sure its path ends with "/" to not get mixed with a regular file), why don't we just do the following ?

class S3Path(CloudPath):
    ...
    # simplified version
    def mkdir(self, parents=False, exist_ok=False):
        path = path if str(self).endswith("/") else S3Path(f"{str(self)}/")
        path.touch(exist_ok=exist_ok)

@pjbull pjbull changed the title Figure out how to handle directories and files that look like directories Handle mkdir for cloud providers that support creating directories Dec 18, 2022
@pjbull
Copy link
Member

pjbull commented Dec 19, 2022

Additional useful discussion in #295, which has been consolidated to this issue.

@pjbull
Copy link
Member

pjbull commented Dec 19, 2022

Handling treating fake directories on S3 as directories should work correctly now with #302. Changing this issue to just be about implementing the mkdir command for the cloud providers that support it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed S3
Projects
None yet
Development

No branches or pull requests

4 participants