Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: Use head_object for .info #24

Closed
wants to merge 1 commit into from

Conversation

TomAugspurger
Copy link
Contributor

This should be more efficient for .info calls.
The response dict is almost the same. They use ContentLength instead
of Size and don't include the key in the response.

One question I had was on exception handling. To get the test suite to
pass, I had to catch botocore's ParamvalidationError. Otherwise things
like s3.open('x', 'rb') failed with a ParamvalidationError, since 'x'
isn't a valid bucket name, while the tests were expecting an IOError or
OSError. Would be curious to hear what you think on this.

This change was necessary to make the pandas test-suite pass. We had issues in the
past where a users's bucket allowed them to list some, but not all the items in a bucket.
The old implementation for .info did an ls on the bucket, and then filtered down to
the single file matching the key.
With this the pandas test-suite pretty much passes.
My changes are
here
Just a couple tests to fix up that were expecting different error
messages.

http://boto3.readthedocs.org/en/latest/reference/services/s3.html#S3.Client.head_object

This should be more efficient for `.info` calls.
The response dict is almost the same. They use `ContentLength` instead
of `Size` and don't include the key in the response.

One question I had was on exception handling. To get the test suite to
pass, I had to catch botocore's `ParamvalidationError.` Otherwise things
like `s3.open('x', 'rb')` failed with a ParamvalidationError, since `'x'`
isn't a valid bucket name, while the tests were expecting an IOError or
OSError. Would be curious to hear what you think on this.

With this the pandas test-suite pretty much passes.
My changes are
[here](pandas-dev/pandas@e89e54a)
Just a couple tests to fix up that were expecting different error
messages.

http://boto3.readthedocs.org/en/latest/reference/services/s3.html#S3.Client.head_object
@martindurant
Copy link
Member

Does head_object return other info like access times?

@TomAugspurger
Copy link
Contributor Author

Just last modified

In [5]: fs.s3.head_object(Bucket='pandas-test', Key='tips.csv')
Out[5]:
{'AcceptRanges': 'bytes',
 'ContentLength': 7943,
 'ContentType': 'application/octet-stream',
 'ETag': '"b8e189917e1e12e5e34247f90db7834f"',
 'LastModified': datetime.datetime(2015, 9, 12, 20, 19, 12, tzinfo=tzutc()),
 'Metadata': {},
 'ResponseMetadata': {'HTTPStatusCode': 200,
  'HostId': 'tR9J0P0Si+OHpRAFrENSDBA5Cos8pFC+bjJFQHkz4ESKqEc2x0chU7zNw93Rvo62cWBAfCYtLUs=',
  'RequestId': 'B12B2C8015E611AC'}}

@martindurant
Copy link
Member

On the ParamValidationError, actually we should have caught that before passing to boto3, because the path you give in that open() doesn't include both bucket and key.

@martindurant
Copy link
Member

On head_object, you are probably right that it is faster when fetching details for a single key (i.e., info), but at the moment we get the info for many keys at once with _ls and cache, and info picks from the cache.
Your suggestion is connected with not listing all entries when there are over 1000?
One issue I have yet to check, is what happens in buckets where there is a mixture of accessible and inaccessible keys.

@TomAugspurger
Copy link
Contributor Author

One issue I have yet to check, is what happens in buckets where there is a mixture of accessible and inaccessible keys

This is the issue I was working around for pandas, this test specifically. It failed before when .info() on that file triggered an _ls, so boto raised a permissions error.

I see your concern about performance. Does adding a parameter to .info controlling whether it should do a full ls and cache make sense? I'm not sure of a way to make list_objects not raise when it encounters a Key the current user doesn't have access to, so I think pandas does need a way to selectively get info on an object.

For the param validation error, are you saying that

fs = S3FileSystem()
fs.info('foo')

should raise an IOError('File not found'), or something like a ValueError('Unable to parse bucket')?

@martindurant
Copy link
Member

Yes, I suppose info() should give FileNotFound when the key isn't given or the bucket doesn't exist or the bucket name is invalid.

How did you set up s3://cant_get_it/tips.csv to cause the permissions error?

I agree that we can cause info() to first look in the file cache, and then do head_object() if not found; but I'm still looking for a listing that would return all accessible keys.

@jreback
Copy link
Contributor

jreback commented Apr 7, 2016

@martindurant not sure who owns that bucket. I don't seem to have access :<

@TomAugspurger
Copy link
Contributor Author

I'll look tonight or this weekend on if there's a better way around this.

FWIW here's the original issue in pandas: pandas-dev/pandas#10604

@TomAugspurger
Copy link
Contributor Author

I haven't found anything in boto3 to control the exception handling in s3.list_objects. It seems like as soon as you hit a key you don't have permission on, it'll raise an error.

Could we break these two use-cases into two methods?

  1. S3FileSystem.info: the current method that lists and caches everything.
  2. S3File.?: the method in this PR that calls head_object on a single key?

@martindurant
Copy link
Member

Yes, OK, that sounds totally reasonable.

@martindurant
Copy link
Member

Would this do?
48bbc19

@TomAugspurger
Copy link
Contributor Author

Would this do? 48bbc19

That's what I was going to suggest. To be honest, I thought I was working on S3File.info, and not S3FileSystem.info so I my first few messages may not make any sense 😄.

I can repurpose this PR to use it, or you can. Whatever you think is easier (sorry I've been slow to respond this week).

There's just a couple test failures with that patch,

    def test_bad_open(s3):
        with pytest.raises(IOError):
>           s3.open('')


    def test_errors(s3):
        with pytest.raises((IOError, OSError)):
            s3.open(test_bucket_name+'/tmp/test/shfoshf', 'rb')

        ## This is fine, no need for interleving directories on S3
        #with pytest.raises((IOError, OSError)):
        #    s3.touch('tmp/test/shfoshf/x')

        with pytest.raises((IOError, OSError)):
            s3.rm(test_bucket_name+'/tmp/test/shfoshf/x')

        with pytest.raises((IOError, OSError)):
            s3.mv(test_bucket_name+'/tmp/test/shfoshf/x', 'tmp/test/shfoshf/y')

        #with pytest.raises((IOError, OSError)):
        #    s3.open('x', 'wb')

        with pytest.raises((IOError, OSError)):
>           s3.open('x', 'rb')

Those raise ParamValidationError instead. I think you said we should be catching that earlier?

@martindurant
Copy link
Member

I appreciate any and all time you can spend on this!

I have reformed this into PR #27 - OK?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants