Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle and test for s3 fake directories #190

Merged
merged 4 commits into from
Jan 6, 2022
Merged

Handle and test for s3 fake directories #190

merged 4 commits into from
Jan 6, 2022

Conversation

pjbull
Copy link
Member

@pjbull pjbull commented Jan 5, 2022

#148 reported inconsistent is_dir behavior between having a trailing / on S3 and not having that /.

The root cause of this is that AWS does create "fake" folders. According to this boto3 thread these are just zero-size keys ending with /. This fix changes our file/dir check on s3 to count these zero-size keys as directories rather than files.

Note: since we use load to check for existence, this change does not incur any additional network calls.

Testing:

  • Added explicit is_dir/is_file test (note: I discovered that these pass without the change since we do not create these fake folders normally, but may be useful so I left them in)
  • Added explicit test when using "live" s3 backends. This test fails without the change and succeeds with it.
  • Manually tested "folder" created with cyberduck
  • Manually tested "folder" created using AWS Console UI

Closes #148

@pjbull pjbull requested a review from jayqi January 5, 2022 22:40
@github-actions
Copy link
Contributor

github-actions bot commented Jan 5, 2022

@codecov
Copy link

codecov bot commented Jan 5, 2022

Codecov Report

Merging #190 (57bc4ff) into master (93b6576) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff            @@
##           master    #190     +/-   ##
========================================
+ Coverage    94.1%   94.3%   +0.1%     
========================================
  Files          21      21             
  Lines        1189    1193      +4     
========================================
+ Hits         1120    1125      +5     
+ Misses         69      68      -1     
Impacted Files Coverage Δ
cloudpathlib/s3/s3client.py 95.0% <100.0%> (+1.2%) ⬆️


Ref: https://github.com/boto/boto3/issues/377
"""
if live_s3_like_rig.live_server:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than this if inside the test body, I think it would be clearer to skip the whole test.

https://docs.pytest.org/en/latest/how-to/skipping.html#id1

@pytest.mark.skipif(os.getenv("USE_LIVE_CLOUD") == "1")
def test_fake_directories(live_s3_like_rig):

or maybe we can set a global flag USE_LIVE_CLOUD = True in this if block in conftest.py

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing tests used this format, so I cleaned up all of them to do if not live_s3_like_rig.live_server: pytest.skip().

I think this is best of both worlds since now we could have different env vars per rig/backend at some point if we wanted without any changes at the test level, but we still show skip rather than pass if we don't run it.

tests/conftest.py Outdated Show resolved Hide resolved
@jayqi
Copy link
Member

jayqi commented Jan 5, 2022

I actually didn't realize / maybe forgot that we could explicitly create objects with trailing / paths with cloudpathlib. This doesn't match regular pathlib behavior and actually makes me kind of uncomfortable. It leads to some weird behavior.

from pathlib import Path

p1 = Path("foo")
p1
#> PosixPath('foo')
p1.parts
#> ('foo',)
p2 = Path("foo/")
p2
#> PosixPath('foo')
p2.parts
#> ('foo',)

p1 == p2
#> True


from cloudpathlib import CloudPath

cp1 = CloudPath("s3://bucket/foo")
cp1.key
#> 'foo'
cp1.name
#> 'foo'
cp1.parts
#> ('s3://', 'bucket', 'foo')

cp2 = CloudPath("s3://bucket/foo/")
cp2.key
#> 'foo/'
cp2.name
#> 'foo'
cp2.parts
#> ('s3://', 'bucket', 'foo')

cp1 == cp2
#> False
cp2 == CloudPath("s3://bucket/foo/bar").parent
#> False
cp2 / "bar"
#> S3Path('s3://bucket/foo/bar')
cp2 == (cp2 / "bar").parent
#> False

Created at 2022-01-05 18:04:04 EST by reprexlite v0.4.2

@pjbull
Copy link
Member Author

pjbull commented Jan 5, 2022

I actually didn't realize / maybe forgot that we could explicitly create objects with trailing / paths with cloudpathlib. This doesn't match regular pathlib behavior and actually makes me kind of uncomfortable. It leads to some weird behavior.

We could file an issue for discussion since unlike local file systems, S3 treats with/without / as different keys:
Screen Shot 2022-01-05 at 3 23 22 PM

@jayqi
Copy link
Member

jayqi commented Jan 5, 2022

#51 is the existing issue. I guess one learning is that I was wrong when I said:

As a result, it's not currently possible to create an S3Path object that points to an S3 folder.

@pjbull
Copy link
Member Author

pjbull commented Jan 6, 2022

@jayqi updated to address comments + add note to changelog

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ceph compatibility - is_dir() fails for directories when "/" included at end of path
2 participants