Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ceph compatibility - is_dir() fails for directories when "/" included at end of path #148

Closed
remi-braun opened this issue Jun 29, 2021 · 16 comments · Fixed by #190
Closed

Comments

@remi-braun
Copy link

Hello,

For directories that have a suffix such as .SAFE directories for Sentinel satellites data, pathlib recognizes the directories correctly, although cloudpathlib fails:

>>> # On disk
>>> from cloudpathlib import AnyPath
>>> A = AnyPath(r"D:\S2B_MSIL2A_20200114T065229_N0213_R020_T40REQ_20200114T094749.SAFE")
>>> A.is_file()
False
>>> A.is_dir()
True
>>> # In th Cloud
>>> client = S3Client(
    endpoint_url=f"https://{AWS_S3_ENDPOINT}",
    aws_access_key_id=os.getenv(AWS_ACCESS_KEY_ID),
    aws_secret_access_key=os.getenv(AWS_SECRET_ACCESS_KEY),
)
>>> client.set_as_default_client()
>>> B = AnyPath("s3://my-bucket/S2B_MSIL2A_20200114T065229_N0213_R020_T40REQ_20200114T094749.SAFE/")
>>> B.is_file()
True
>>> B.is_dir()
False

However, it works when removing the final \ !

>>> # In th Cloud
>>> client = S3Client(
    endpoint_url=f"https://{AWS_S3_ENDPOINT}",
    aws_access_key_id=os.getenv(AWS_ACCESS_KEY_ID),
    aws_secret_access_key=os.getenv(AWS_SECRET_ACCESS_KEY),
)
>>> client.set_as_default_client()
>>> B = AnyPath("s3://my-bucket/S2B_MSIL2A_20200114T065229_N0213_R020_T40REQ_20200114T094749.SAFE")
>>> B.is_file()
False
>>> B.is_dir()
True
@remi-braun remi-braun changed the title is_dir() fails for directories with suffix is_dir() fails for directories with suffix and antislash Jun 29, 2021
@pjbull
Copy link
Member

pjbull commented Jun 29, 2021

Hi @remi-braun, I don't reproduce this based on what you specified. Here's an example with a public asset on S3:

>>> file = S3Path('s3://drivendata-public-assets/nested_dir/test.dot/a_file.txt')
>>> file.is_file()
True
>>> file.is_dir()
False
>>> directory = S3Path('s3://drivendata-public-assets/nested_dir/test.dot/')
>>> directory.is_file()
False
>>> directory.is_dir()
True

A few questions:

Here's the logic that checks directories/files on S3. It's not too complicated, but bsaically we check if S3 lists the exact path when we ask it for files. If it does, we assume it is a file.

@remi-braun
Copy link
Author

remi-braun commented Jun 30, 2021

Hi !

Thanks for your quick reply :)

* Can you print out `B` from your second example so we can see what kind of path that resolved to?
>>> B
S3Path('s3://sertit-eoreader-ci/optical/S2B_MSIL2A_20200114T065229_N0213_R020_T40REQ_20200114T094749.SAFE/')
* Is there anything in that directory or is it empty? There's not an actual concept of a "directory" on S3 so some of it is faked if the directory is empty, which may cause the behavior you are seeing: https://serverfault.com/questions/435827/what-is-the-difference-between-buckets-and-folders-in-amazon-s3

This directory is full and the glob function works:

>>> [path.name for path in B.glob("**/*")]
['S2B_MSIL2A_20200114T065229_N0213_R020_T40REQ_20200114T094749.SAFE', 'AUX_DATA', 'DATASTRIP', 'DATASTRIP', 'DS_MTI__20200114T094749_S20200114T065245', 'DS_MTI__20200114T094749_S20200114T065245', 'MTD_DS.xml', 'QI_DATA', 'QI_DATA', 'FORMAT_CORRECTNESS.xml', 'GENERAL_QUALITY.xml', 'GEOMETRIC_QUALITY.xml', 'RADIOMETRIC_QUALITY.xml', 'SENSOR_QUALITY.xml', 'GRANULE', 'GRANULE', 'L2A_T40REQ_A014918_20200114T065245', 'L2A_T40REQ_A014918_20200114T065245', 'AUX_DATA', 'AUX_DATA', 'AUX_ECMWFT', 'IMG_DATA', 'IMG_DATA', 'R10m', 'R10m', 'T40REQ_20200114T065229_AOT_10m.jp2', 'T40REQ_20200114T065229_B02_10m.jp2', 'T40REQ_20200114T065229_B03_10m.jp2', 'T40REQ_20200114T065229_B04_10m.jp2', 'T40REQ_20200114T065229_B08_10m.jp2', 'T40REQ_20200114T065229_TCI_10m.jp2', 'T40REQ_20200114T065229_WVP_10m.jp2', 'R20m', 'R20m', 'T40REQ_20200114T065229_AOT_20m.jp2', 'T40REQ_20200114T065229_B02_20m.jp2', 'T40REQ_20200114T065229_B03_20m.jp2', 'T40REQ_20200114T065229_B04_20m.jp2', 'T40REQ_20200114T065229_B05_20m.jp2', 'T40REQ_20200114T065229_B06_20m.jp2', 'T40REQ_20200114T065229_B07_20m.jp2', 'T40REQ_20200114T065229_B11_20m.jp2', 'T40REQ_20200114T065229_B12_20m.jp2', 'T40REQ_20200114T065229_B8A_20m.jp2', 'T40REQ_20200114T065229_SCL_20m.jp2', 'T40REQ_20200114T065229_TCI_20m.jp2', 'T40REQ_20200114T065229_WVP_20m.jp2', 'R60m', 'R60m', 'T40REQ_20200114T065229_AOT_60m.jp2', 'T40REQ_20200114T065229_B01_60m.jp2', 'T40REQ_20200114T065229_B02_60m.jp2', 'T40REQ_20200114T065229_B03_60m.jp2', 'T40REQ_20200114T065229_B04_60m.jp2', 'T40REQ_20200114T065229_B05_60m.jp2', 'T40REQ_20200114T065229_B06_60m.jp2', 'T40REQ_20200114T065229_B07_60m.jp2', 'T40REQ_20200114T065229_B09_60m.jp2', 'T40REQ_20200114T065229_B11_60m.jp2', 'T40REQ_20200114T065229_B12_60m.jp2', 'T40REQ_20200114T065229_B8A_60m.jp2', 'T40REQ_20200114T065229_SCL_60m.jp2', 'T40REQ_20200114T065229_TCI_60m.jp2', 'T40REQ_20200114T065229_WVP_60m.jp2', 'MTD_TL.xml', 'QI_DATA', 'QI_DATA', 'FORMAT_CORRECTNESS.xml', 'GENERAL_QUALITY.xml', 'GEOMETRIC_QUALITY.xml', 'MSK_CLDPRB_20m.jp2', 'MSK_CLDPRB_60m.jp2', 'MSK_CLOUDS_B00.gfs', 'MSK_CLOUDS_B00.gml', 'MSK_DEFECT_B01.gfs', 'MSK_DEFECT_B01.gml', 'MSK_DEFECT_B02.gfs', 'MSK_DEFECT_B02.gml', 'MSK_DEFECT_B03.gfs', 'MSK_DEFECT_B03.gml', 'MSK_DEFECT_B04.gfs', 'MSK_DEFECT_B04.gml', 'MSK_DEFECT_B05.gfs', 'MSK_DEFECT_B05.gml', 'MSK_DEFECT_B06.gfs', 'MSK_DEFECT_B06.gml', 'MSK_DEFECT_B07.gml', 'MSK_DEFECT_B08.gfs', 'MSK_DEFECT_B08.gml', 'MSK_DEFECT_B09.gml', 'MSK_DEFECT_B10.gml', 'MSK_DEFECT_B11.gfs', 'MSK_DEFECT_B11.gml', 'MSK_DEFECT_B12.gfs', 'MSK_DEFECT_B12.gml', 'MSK_DEFECT_B8A.gfs', 'MSK_DEFECT_B8A.gml', 'MSK_DETFOO_B01.gfs', 'MSK_DETFOO_B01.gml', 'MSK_DETFOO_B02.gfs', 'MSK_DETFOO_B02.gml', 'MSK_DETFOO_B03.gfs', 'MSK_DETFOO_B03.gml', 'MSK_DETFOO_B04.gfs', 'MSK_DETFOO_B04.gml', 'MSK_DETFOO_B05.gfs', 'MSK_DETFOO_B05.gml', 'MSK_DETFOO_B06.gfs', 'MSK_DETFOO_B06.gml', 'MSK_DETFOO_B07.gml', 'MSK_DETFOO_B08.gfs', 'MSK_DETFOO_B08.gml', 'MSK_DETFOO_B09.gml', 'MSK_DETFOO_B10.gml', 'MSK_DETFOO_B11.gfs', 'MSK_DETFOO_B11.gml', 'MSK_DETFOO_B12.gfs', 'MSK_DETFOO_B12.gml', 'MSK_DETFOO_B8A.gfs', 'MSK_DETFOO_B8A.gml', 'MSK_NODATA_B01.gfs', 'MSK_NODATA_B01.gml', 'MSK_NODATA_B02.gfs', 'MSK_NODATA_B02.gml', 'MSK_NODATA_B03.gfs', 'MSK_NODATA_B03.gml', 'MSK_NODATA_B04.gfs', 'MSK_NODATA_B04.gml', 'MSK_NODATA_B05.gfs', 'MSK_NODATA_B05.gml', 'MSK_NODATA_B06.gfs', 'MSK_NODATA_B06.gml', 'MSK_NODATA_B07.gml', 'MSK_NODATA_B08.gfs', 'MSK_NODATA_B08.gml', 'MSK_NODATA_B09.gml', 'MSK_NODATA_B10.gml', 'MSK_NODATA_B11.gfs', 'MSK_NODATA_B11.gml', 'MSK_NODATA_B12.gfs', 'MSK_NODATA_B12.gml', 'MSK_NODATA_B8A.gfs', 'MSK_NODATA_B8A.gml', 'MSK_SATURA_B01.gfs', 'MSK_SATURA_B01.gml', 'MSK_SATURA_B02.gfs', 'MSK_SATURA_B02.gml', 'MSK_SATURA_B03.gfs', 'MSK_SATURA_B03.gml', 'MSK_SATURA_B04.gfs', 'MSK_SATURA_B04.gml', 'MSK_SATURA_B05.gfs', 'MSK_SATURA_B05.gml', 'MSK_SATURA_B06.gfs', 'MSK_SATURA_B06.gml', 'MSK_SATURA_B07.gml', 'MSK_SATURA_B08.gfs', 'MSK_SATURA_B08.gml', 'MSK_SATURA_B09.gml', 'MSK_SATURA_B10.gml', 'MSK_SATURA_B11.gfs', 'MSK_SATURA_B11.gml', 'MSK_SATURA_B12.gfs', 'MSK_SATURA_B12.gml', 'MSK_SATURA_B8A.gfs', 'MSK_SATURA_B8A.gml', 'MSK_SNWPRB_20m.jp2', 'MSK_SNWPRB_60m.jp2', 'MSK_TECQUA_B01.gfs', 'MSK_TECQUA_B01.gml', 'MSK_TECQUA_B02.gfs', 'MSK_TECQUA_B02.gml', 'MSK_TECQUA_B03.gfs', 'MSK_TECQUA_B03.gml', 'MSK_TECQUA_B04.gfs', 'MSK_TECQUA_B04.gml', 'MSK_TECQUA_B05.gfs', 'MSK_TECQUA_B05.gml', 'MSK_TECQUA_B06.gfs', 'MSK_TECQUA_B06.gml', 'MSK_TECQUA_B07.gml', 'MSK_TECQUA_B08.gfs', 'MSK_TECQUA_B08.gml', 'MSK_TECQUA_B09.gml', 'MSK_TECQUA_B10.gml', 'MSK_TECQUA_B11.gfs', 'MSK_TECQUA_B11.gml', 'MSK_TECQUA_B12.gfs', 'MSK_TECQUA_B12.gml', 'MSK_TECQUA_B8A.gfs', 'MSK_TECQUA_B8A.gml', 'RADIOMETRIC_QUALITY.xml', 'SENSOR_QUALITY.xml', 'T40REQ_20200114T065229_PVI.jp2', 'HTML', 'HTML', 'UserProduct_index.html', 'UserProduct_index.xsl', 'banner_1.png', 'banner_2.png', 'banner_3.png', 'star_bg.jpg', 'INSPIRE.xml', 'MTD_MSIL2A.xml', 'manifest.safe', 'rep_info', 'rep_info', 'S2_PDI_Level-2A_Datastrip_Metadata.xsd', 'S2_PDI_Level-2A_Tile_Metadata.xsd', 'S2_User_Product_Level-2A_Metadata.xsd']
* How was the directory created? Some tools like the console GUI can create these fake "folders" on s3: https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-folders.html

This folder has been uploaded to a S3 compatible storage as usual, and it can be used without the final /. Only this prevents the normal use. But what is problematic is that glob function gives path with this / in the end for directories.
my currunt workaround is:

 >>> # WORKAROUND
 >>> if str(path).endswith("/"):
 >>>     path = AnyPath(str(path)[:-1])
* Is there a public S3 path you can point to that reproduces this for us to test?

I'm sorry, I've not found any public S3 enabling .SAFE products, but you could download as they are free: https://sentinelhub-py.readthedocs.io/en/latest/aws_cli.html
I could send you some product but they are quite heavy (several hundred of Mo minimum).

@remi-braun
Copy link
Author

remi-braun commented Jun 30, 2021

I think I have the same problem in another place:

I am downloading a multi level directory, such as:
2021-06-30_17h55_08
(the sub-directories are not empty), with the code:

path.download_to(dir.joinpath(path.name))

And the downloaded product obtained is:
2021-06-30_17h53_19

You can see that directories have been downloaded as files !

The code of the function does 2 different things according if we are in face of a file or a directory (and it adds a / if it is a directory !), so maybe it is related:

def download_to(self, destination: Union[str, os.PathLike]) -> Path:
    destination = Path(destination)
    if self.is_file():
        if destination.is_dir():
            destination = destination / self.name
        return self.client._download_file(self, destination)
    else:
        destination.mkdir(exist_ok=True)
        for f in self.iterdir():
            rel = str(self)
            if not rel.endswith("/"):
                rel = rel + "/"

            rel_dest = str(f)[len(rel) :]
            f.download_to(destination / rel_dest)

        return destination

All this is happening on a S3 compatible storage, I don't know if it is related.

@pjbull
Copy link
Member

pjbull commented Jun 30, 2021

All this is happening on a S3 compatible storage

What library/implementation? We do test MinIO, but maybe we are missing a test like this.

It does seem that the root cause is that our is_file_or_dir check does not work reliably for every set of files/metadata/s3 implementations. Just need to track down enough info to repro and fix.

@remi-braun
Copy link
Author

Ouch I don't know, it is my university's storage...

I have no clue on this question 😢

If you want me to run additional tests on my side, just ask !

@remi-braun
Copy link
Author

remi-braun commented Jul 1, 2021

I'm told they are using Ceph and maybe OpenStack.
I don't know if it helps you...

@pjbull
Copy link
Member

pjbull commented Jul 1, 2021

🙇 thanks, I'll see if we can find a Ceph server to test against.

@pjbull pjbull changed the title is_dir() fails for directories with suffix and antislash Ceph compatibility - is_dir() fails for directories when "/" included at end of path Jul 1, 2021
@karolzlot
Copy link
Contributor

I have the same issue, but I'm using ordinary S3 (I think it is ordinary S3 at least)

@pjbull
Copy link
Member

pjbull commented Jan 6, 2022

@remi-braun @karolzlot The issues you were seeing should (hopefully) be fixed by #190. I'm not sure about ceph since we test on MinIO and AWS S3.

If you get a chance, could you test the latest on master before we cut a release with these changes?

pip install git+https://github.com/drivendataorg/cloudpathlib

@remi-braun
Copy link
Author

Hello,

Thanks for the update :)
Here is what I got:

cloudpathlib

@pjbull
Copy link
Member

pjbull commented Jan 7, 2022

Thanks @remi-braun. The recursion problem looks like a separate issue that we can investigate. Can you confirm that the is_dir calls that work here fail on the latest release?

@remi-braun
Copy link
Author

Yes !
cloudpathlib(1)

@remi-braun
Copy link
Author

Thanks @remi-braun. The recursion problem looks like a separate issue that we can investigate. Can you confirm that the is_dir calls that work here fail on the latest release?

Do I need to open an issue on that ? Or you are investigating it ?

@GeorgeSabu
Copy link
Contributor

@pjbull @remi-braun I also faced the same issue fixed it like this

@pjbull
Copy link
Member

pjbull commented Feb 2, 2022

Thanks @GeorgeSabu that fix looks good, and we'd accept a PR if you want to contribute!

@pjbull
Copy link
Member

pjbull commented Feb 2, 2022

@remi-braun @GeorgeSabu ^ filed #198 to track the recursion issue and the fix you pointed out. If you can contribute referencing that issue, it would be great. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants