REF: Use `head_object` for .info #24

TomAugspurger · 2016-04-02T02:44:45Z

This should be more efficient for .info calls.
The response dict is almost the same. They use ContentLength instead
of Size and don't include the key in the response.

One question I had was on exception handling. To get the test suite to
pass, I had to catch botocore's ParamvalidationError. Otherwise things
like s3.open('x', 'rb') failed with a ParamvalidationError, since 'x'
isn't a valid bucket name, while the tests were expecting an IOError or
OSError. Would be curious to hear what you think on this.

This change was necessary to make the pandas test-suite pass. We had issues in the
past where a users's bucket allowed them to list some, but not all the items in a bucket.
The old implementation for .info did an ls on the bucket, and then filtered down to
the single file matching the key.
With this the pandas test-suite pretty much passes.
My changes are
here
Just a couple tests to fix up that were expecting different error
messages.

http://boto3.readthedocs.org/en/latest/reference/services/s3.html#S3.Client.head_object

This should be more efficient for `.info` calls. The response dict is almost the same. They use `ContentLength` instead of `Size` and don't include the key in the response. One question I had was on exception handling. To get the test suite to pass, I had to catch botocore's `ParamvalidationError.` Otherwise things like `s3.open('x', 'rb')` failed with a ParamvalidationError, since `'x'` isn't a valid bucket name, while the tests were expecting an IOError or OSError. Would be curious to hear what you think on this. With this the pandas test-suite pretty much passes. My changes are [here](pandas-dev/pandas@e89e54a) Just a couple tests to fix up that were expecting different error messages. http://boto3.readthedocs.org/en/latest/reference/services/s3.html#S3.Client.head_object

martindurant · 2016-04-05T15:16:36Z

Does head_object return other info like access times?

TomAugspurger · 2016-04-05T20:12:34Z

Just last modified

In [5]: fs.s3.head_object(Bucket='pandas-test', Key='tips.csv')
Out[5]:
{'AcceptRanges': 'bytes',
 'ContentLength': 7943,
 'ContentType': 'application/octet-stream',
 'ETag': '"b8e189917e1e12e5e34247f90db7834f"',
 'LastModified': datetime.datetime(2015, 9, 12, 20, 19, 12, tzinfo=tzutc()),
 'Metadata': {},
 'ResponseMetadata': {'HTTPStatusCode': 200,
  'HostId': 'tR9J0P0Si+OHpRAFrENSDBA5Cos8pFC+bjJFQHkz4ESKqEc2x0chU7zNw93Rvo62cWBAfCYtLUs=',
  'RequestId': 'B12B2C8015E611AC'}}

martindurant · 2016-04-06T17:16:47Z

On the ParamValidationError, actually we should have caught that before passing to boto3, because the path you give in that open() doesn't include both bucket and key.

martindurant · 2016-04-06T17:21:22Z

On head_object, you are probably right that it is faster when fetching details for a single key (i.e., info), but at the moment we get the info for many keys at once with _ls and cache, and info picks from the cache.
Your suggestion is connected with not listing all entries when there are over 1000?
One issue I have yet to check, is what happens in buckets where there is a mixture of accessible and inaccessible keys.

TomAugspurger · 2016-04-06T22:18:48Z

One issue I have yet to check, is what happens in buckets where there is a mixture of accessible and inaccessible keys

This is the issue I was working around for pandas, this test specifically. It failed before when .info() on that file triggered an _ls, so boto raised a permissions error.

I see your concern about performance. Does adding a parameter to .info controlling whether it should do a full ls and cache make sense? I'm not sure of a way to make list_objects not raise when it encounters a Key the current user doesn't have access to, so I think pandas does need a way to selectively get info on an object.

For the param validation error, are you saying that

fs = S3FileSystem()
fs.info('foo')

should raise an IOError('File not found'), or something like a ValueError('Unable to parse bucket')?

martindurant · 2016-04-07T14:11:34Z

Yes, I suppose info() should give FileNotFound when the key isn't given or the bucket doesn't exist or the bucket name is invalid.

How did you set up s3://cant_get_it/tips.csv to cause the permissions error?

I agree that we can cause info() to first look in the file cache, and then do head_object() if not found; but I'm still looking for a listing that would return all accessible keys.

jreback · 2016-04-07T14:28:04Z

@martindurant not sure who owns that bucket. I don't seem to have access :<

TomAugspurger · 2016-04-07T14:28:41Z

I'll look tonight or this weekend on if there's a better way around this.

FWIW here's the original issue in pandas: pandas-dev/pandas#10604

TomAugspurger · 2016-04-09T12:13:58Z

I haven't found anything in boto3 to control the exception handling in s3.list_objects. It seems like as soon as you hit a key you don't have permission on, it'll raise an error.

Could we break these two use-cases into two methods?

S3FileSystem.info: the current method that lists and caches everything.
S3File.?: the method in this PR that calls head_object on a single key?

martindurant · 2016-04-09T13:23:49Z

Yes, OK, that sounds totally reasonable.

martindurant · 2016-04-09T21:32:17Z

Would this do?
48bbc19

TomAugspurger · 2016-04-10T15:26:58Z

Would this do? 48bbc19

That's what I was going to suggest. To be honest, I thought I was working on S3File.info, and not S3FileSystem.info so I my first few messages may not make any sense 😄.

I can repurpose this PR to use it, or you can. Whatever you think is easier (sorry I've been slow to respond this week).

There's just a couple test failures with that patch,

    def test_bad_open(s3):
        with pytest.raises(IOError):
>           s3.open('')


    def test_errors(s3):
        with pytest.raises((IOError, OSError)):
            s3.open(test_bucket_name+'/tmp/test/shfoshf', 'rb')

        ## This is fine, no need for interleving directories on S3
        #with pytest.raises((IOError, OSError)):
        #    s3.touch('tmp/test/shfoshf/x')

        with pytest.raises((IOError, OSError)):
            s3.rm(test_bucket_name+'/tmp/test/shfoshf/x')

        with pytest.raises((IOError, OSError)):
            s3.mv(test_bucket_name+'/tmp/test/shfoshf/x', 'tmp/test/shfoshf/y')

        #with pytest.raises((IOError, OSError)):
        #    s3.open('x', 'wb')

        with pytest.raises((IOError, OSError)):
>           s3.open('x', 'rb')

Those raise ParamValidationError instead. I think you said we should be catching that earlier?

martindurant · 2016-04-10T20:49:53Z

I appreciate any and all time you can spend on this!

I have reformed this into PR #27 - OK?

TomAugspurger closed this Apr 11, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

REF: Use `head_object` for .info #24

REF: Use `head_object` for .info #24

TomAugspurger commented Apr 2, 2016

martindurant commented Apr 5, 2016

TomAugspurger commented Apr 5, 2016

martindurant commented Apr 6, 2016

martindurant commented Apr 6, 2016

TomAugspurger commented Apr 6, 2016

martindurant commented Apr 7, 2016

jreback commented Apr 7, 2016

TomAugspurger commented Apr 7, 2016

TomAugspurger commented Apr 9, 2016

martindurant commented Apr 9, 2016

martindurant commented Apr 9, 2016

TomAugspurger commented Apr 10, 2016

martindurant commented Apr 10, 2016

REF: Use head_object for .info #24

REF: Use head_object for .info #24

Conversation

TomAugspurger commented Apr 2, 2016

martindurant commented Apr 5, 2016

TomAugspurger commented Apr 5, 2016

martindurant commented Apr 6, 2016

martindurant commented Apr 6, 2016

TomAugspurger commented Apr 6, 2016

martindurant commented Apr 7, 2016

jreback commented Apr 7, 2016

TomAugspurger commented Apr 7, 2016

TomAugspurger commented Apr 9, 2016

martindurant commented Apr 9, 2016

martindurant commented Apr 9, 2016

TomAugspurger commented Apr 10, 2016

martindurant commented Apr 10, 2016

REF: Use `head_object` for .info #24

REF: Use `head_object` for .info #24