GetObjectRequest in S3 should support final bytes as a Range header value #1551

Open
bbranan opened this issue Apr 13, 2018 · 9 comments
Labels
feature-request A feature should be added or improved.

Comments

@bbranan

bbranan commented Apr 13, 2018

According to https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35, a Range header may use a suffix form (a hyphen followed by a length, which reads like a single negative value) to indicate that the last X bytes of a file should be retrieved. For example, bytes=-500 is a valid Range value that requests the final 500 bytes of a file.

This Range header option is currently supported by S3, as verified through the AWS S3 CLI.

Currently, GetObjectRequest includes setRange(long start) and setRange(long start, long end), which support Range values like bytes=100- and bytes=100-200. However, there is no way to make GetObjectRequest produce a Range value such as "bytes=-100", even though this is a valid value that S3 already supports.
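
To make the three Range forms concrete, here is a minimal sketch of a helper that builds each header value as a plain string. The class `RangeHeaders` and its method names are hypothetical and are not part of the AWS SDK; the sketch only illustrates the header syntax discussed above, including the suffix form that GetObjectRequest cannot currently express.

```java
/**
 * Hypothetical helper (NOT part of the AWS SDK) that formats the three
 * byte-range forms discussed in this issue as HTTP Range header values.
 */
final class RangeHeaders {
    private RangeHeaders() {}

    /** "bytes=START-": everything from offset start to the end of the object. */
    static String from(long start) {
        if (start < 0) {
            throw new IllegalArgumentException("start must be >= 0");
        }
        return "bytes=" + start + "-";
    }

    /** "bytes=START-END": the inclusive range from start to end. */
    static String between(long start, long end) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("require 0 <= start <= end");
        }
        return "bytes=" + start + "-" + end;
    }

    /** "bytes=-LENGTH": the final LENGTH bytes (the form requested here). */
    static String suffix(long length) {
        if (length <= 0) {
            throw new IllegalArgumentException("length must be > 0");
        }
        return "bytes=-" + length;
    }
}
```

For example, `RangeHeaders.suffix(100)` yields the string `bytes=-100`, which is the value that setRange(...) has no way to generate today.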

@shorea shorea added the feature-request A feature should be added or improved. label Apr 13, 2018
@shorea
Contributor

shorea commented Apr 13, 2018

Makes sense. We'd have to see if we can do this in a backward-compatible way. In the meantime, I think you should be able to work around this by doing something like the following:

        GetObjectRequest req = new GetObjectRequest("bucket", "key");
        req.putCustomRequestHeader("Range", "-500");
        amazonS3.getObject(req);

@bbranan
Author

bbranan commented Apr 13, 2018

The simplest way to stay backward compatible here would likely be to add a new method, perhaps something like setRangeEnd(long end), which produces the expected header value.

Thanks for the workaround. I will use that strategy for now, though I believe the call would need to be:

req.putCustomRequestHeader("Range", "bytes=-500");

@shorea
Contributor

shorea commented Apr 14, 2018

Yes, good catch.

@bbranan
Author

bbranan commented Apr 18, 2018

Using the suggested workaround results in the following error:

com.amazonaws.SdkClientException: Unable to verify integrity of data download.  Client calculated content hash didn't match hash calculated by Amazon S3.  The data may be corrupt.
	com.amazonaws.services.s3.internal.DigestValidationInputStream.validateMD5Digest(DigestValidationInputStream.java:79)
	com.amazonaws.services.s3.internal.DigestValidationInputStream.read(DigestValidationInputStream.java:61)
	com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)

By default, when a getObject() request is made, the checksum of the retrieved file is verified against the complete file checksum by the client. Of course, the subset of bytes retrieved with a Range request will not have the expected checksum. When GetObjectRequest.setRange() is used, the checksum validation step is disabled (based on an internal getRange() check). Setting Range as a custom header does not result in the checksum validation being disabled, so it fails consistently.
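
The mismatch described above can be reproduced in miniature without S3. The following is a minimal sketch using only the JDK: it compares the MD5 of a whole byte array with the MD5 of its last N bytes. The class `PartialDigestDemo` and its methods are hypothetical names for illustration; this is not the SDK's validation code, only the same kind of comparison.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Hypothetical demo: a digest of a byte subset does not match the whole-object digest. */
final class PartialDigestDemo {
    private PartialDigestDemo() {}

    /** MD5 of the given bytes. */
    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** The final n bytes of data (or all of data if it is shorter than n). */
    static byte[] lastBytes(byte[] data, int n) {
        int count = Math.min(n, data.length);
        return Arrays.copyOfRange(data, data.length - count, data.length);
    }

    /** True only if the digest of the last n bytes equals the whole-object digest. */
    static boolean tailMatchesWholeObjectDigest(byte[] object, int n) {
        return MessageDigest.isEqual(md5(object), md5(lastBytes(object, n)));
    }
}
```

With any realistic object, `tailMatchesWholeObjectDigest(object, n)` is false whenever n is smaller than the object, which is why a ranged download must skip the whole-object comparison.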

This update to the workaround makes it work by first setting the range (thus disabling the checksum check), then overwriting the Range header value with the custom header:

    GetObjectRequest req = new GetObjectRequest("bucket", "key");
    req.setRange(0);
    req.putCustomRequestHeader("Range", "bytes=-500");
    amazonS3.getObject(req);

Unfortunately, this is based on the assumption that the internal implementation will continue to override the Range value with the custom header. That does not seem like a good assumption to make.

@zoewangg
Contributor

You can disable MD5 checks for GET requests using a system property:
https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-s3/src/main/java/com/amazonaws/services/s3/internal/SkipMd5CheckStrategy.java#L34

Note: this will disable md5 checks for ALL get requests.
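
As a rough sketch of what setting that property could look like: the property name below is an assumption based on the constant defined in the linked SkipMd5CheckStrategy.java and should be checked against that file for your SDK version. The class `DisableGetMd5` is a hypothetical name. The property is JVM-wide, so it affects every GET made by the process.

```java
/**
 * Hypothetical sketch: sets the JVM-wide system property that the linked
 * SkipMd5CheckStrategy class is understood to consult for GET requests.
 */
final class DisableGetMd5 {
    private DisableGetMd5() {}

    // ASSUMPTION: property name recalled from SkipMd5CheckStrategy.java;
    // verify it against the linked source before relying on it.
    static final String PROPERTY =
            "com.amazonaws.services.s3.disableGetObjectMD5Validation";

    /** Disables MD5 validation for ALL GetObject calls in this JVM. */
    static void disableForAllGets() {
        System.setProperty(PROPERTY, "true");
    }
}
```

The same effect can usually be had by passing the property with `-D` on the JVM command line instead of setting it in code.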

@bbranan
Author

bbranan commented Apr 19, 2018

Thanks for the pointer @zoewangg. Unfortunately, the majority of requests I will be making are full-object requests, and I really do want md5 checks to occur for those transfers. I'm just looking for a way to disable the md5 checks specifically for Range-limited requests.

@omalley

omalley commented Dec 6, 2019

This will also be useful for file formats like ORC and Parquet that want to read the file footer first.

omalley added a commit to omalley/aws-sdk-java that referenced this issue Dec 20, 2019
omalley added a commit to omalley/aws-sdk-java that referenced this issue Dec 20, 2019
@kyprifog

@omalley I have exactly this use case. Did you find an acceptable workaround?

@kyprifog

kyprifog commented Mar 20, 2020

I was able to pull the content length from the header and then make a second call that uses that content length to fetch the footer, although I am guessing this issue is about being able to do this without making two calls.
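
The two-call approach described above comes down to a small offset calculation. Here is a minimal sketch of it, assuming the content length is already known from a first metadata call; the class `FooterRange` and its method are hypothetical names, and the SDK calls appear only in comments as an indication of where the values would be used.

```java
/**
 * Hypothetical helper: given an object's total length, computes the inclusive
 * start and end offsets that cover its final footerLength bytes.
 */
final class FooterRange {
    private FooterRange() {}

    /** Returns {start, end}, both inclusive, for the last footerLength bytes. */
    static long[] lastBytes(long contentLength, long footerLength) {
        if (contentLength <= 0 || footerLength <= 0) {
            throw new IllegalArgumentException("lengths must be > 0");
        }
        // Clamp to 0 when the object is shorter than the requested footer.
        long start = Math.max(0L, contentLength - footerLength);
        long end = contentLength - 1;
        return new long[] {start, end};
    }

    // Sketch of use with the SDK (assumes an AmazonS3 client named amazonS3):
    //   long len = amazonS3.getObjectMetadata("bucket", "key").getContentLength();
    //   long[] r = FooterRange.lastBytes(len, 500);
    //   GetObjectRequest req = new GetObjectRequest("bucket", "key");
    //   req.setRange(r[0], r[1]);   // existing two-argument setRange
    //   amazonS3.getObject(req);
}
```

Because this goes through setRange(long start, long end), it avoids the checksum failure, at the cost of the extra metadata request that a suffix range would make unnecessary.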
