Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid another extra getObjectMetadata request #983

Conversation

NikolayAtSony
Copy link
Contributor

If client already knows last modified time for S3 object,
there is no reason to pull object metadata from the S3 service
inside TransferManager just to retrieve this information again.

If client already knows last modified time for S3 object,
there is no reason to pull object metadata from the S3 service
inside TransferManager just to retrieve this information again.
@shorea shorea requested review from zhangzhx and removed request for zhangzhx January 25, 2017 15:42
@NikolayAtSony
Copy link
Contributor Author

Will you consider merging it? If not, what concerns do you have?

@@ -77,7 +74,7 @@ public synchronized ObjectMetadata getObjectMetadata() {
if (s3Object != null) {
return s3Object.getObjectMetadata();
}
return objectMetadata;
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be a breaking change as customers who are getting object metadata before will get null now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, they wont. Null will only be returned to the customers who will be using the new API where object metadata is not retrieved before object is downloaded. After that even while using new API, object metadata will be populated properly. It's definitely not a breaking change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the s3Object is only populated for serial downloads. For parallel downloads, the s3Object will always be null. That is the reason we store object metadata and return it for parallel downloads.

* @see GetObjectRequest#GetObjectRequest(String, String, String)
* @see GetObjectRequest#GetObjectRequest(String, String, boolean)
*/
public GetObjectRequest(String bucketName, String key, Long lastModifiedTime) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We model the request and response classes as per service team (Amazon S3 in this case) specifications. The members in GetObjectRequest are based on the S3 Get Object specification. We cannot add it to the request class if we are not sending it over wire.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if you can't accept changes in GetObjectRequest, the other approach is to modify doDownload in TransferManager and pass this timestamp as a separate parameter instead of part of GetObjectRequest. Will you accept such approach?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use object metadata in 3 different places during Download operation.

  1. Get last modification time.
  2. Get content length
  3. In DownloadImpl task, return metadata when requested by customer. See the other comment.

It is highly unlikely that customers will provide 1 and 2 values. Even then, we need object metadata to use in (3) for parallel downloads. The only solution I see to avoid this call is providing the object metadata as a parameter to download method. This is a rare case as most customers want to download an object from bucket/key values and won't have object metadata. So if we expect them to provide it, they have to make an additional call to get object metadata before calling download method which is not great user experience.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use case might looks like not being very likely, but it's legit use case and even AWS recommends to use external lookup tables for storing file data attributes outside of S3 (such as key, size, timestamp). Based on this and various other use cases, I can't see a reason why metadata needs to be pulled 100% of the time ahead of object itself. Yes, it might be needed sometimes, but it's not always needed and there is no way to avoid this extra roundtrip now (which btw was avoidable using earlier versions of AWS SDK, so it should be considered as a regression). Refer (for example to 1.9.25) to

private Download doDownload(final GetObjectRequest getObjectRequest,
where it was possible to actually use TransferManager without pulling metadata beforehand.

First, object metadata is available as soon as object download process is started as it's delivered as part of HEAD. For the purpose of pause/resume, there is no reason to pull the metadata beforehand as last modification time could be obtained from the object itself when pause action will be triggered because it's part of S3Object which is about to be put on pause (which might not even happened). If download not even started, it's not needed, right?

Content length could be already known to the client as part of listAvailableObjects() or by other measures and should be used instead of another round trip to the S3 (if available).

As for the point 3, it wasn't the case before (

), so I fail to see why you think it's imperative that metadata is available in advance of S3Object download start (it was never promised). The whole point of introducing this separate instance of object metadata into DownloadImpl is to achieve the retrieval of last modified time for persisting the object (https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-s3/src/main/java/com/amazonaws/services/s3/transfer/internal/DownloadImpl.java#L193). Does it really necessary? If download wasn't started, persist it as 0, if started - pull from the S3Object metadata.

Yes, probably the current solution offered by me is a bit straightforward and nicer one is required, but let's decide how to fix it. I'm open to suggestions. It doesn't make sense to sequentially pull the metadata for each object just to know information we already know...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand your concerns. I agree that making HEAD request to get metadata for each call might be excessive. I am worried that if we remove object metadata from DownloadImpl, it might be a breaking change. I have to think some more on how to fix it in a clean way. I don't have bandwidth this week to work on this. I will look at this issue next week and get back to you. Thanks.

@ejono
Copy link

ejono commented Feb 6, 2017

I agree that it doesn't make sense to add lastModifiedTime to GetObjectRequest. However, what about adding a new overloaded version of TransferManager.download() that takes an ObjectMetadata class? If specified and non-null, it can be used in place of having doDownload() make its own GetObjectMetadataRequest. A similar approach can be used in TransferManager.copy() in order to resolve #988.

@NikolayAtSony
Copy link
Contributor Author

The suggestion about passing ObjectMetadata to download() is not very appealing as it's not that the real object metadata is present (with all headers, etc) - it's only range & timestamp that is relevant for TransferManager here. So, aside from range & last modified date there is nothing to fill into it and it seems wrong to partially initialize the object when the whole purpose now for pulling it is to obtain range & timestamp. Range could be in GetObjectRequest and it leaves only timestamp as a potential initialized object inside this ObjectMetadata instance.

Could it be better to just pass this timestamp to the overloaded version of download() and tweak doDownload to only do the pull of metadata if both range & timestamp are not present? If yes, I could create a new pull request for it.

@ejono
Copy link

ejono commented Feb 7, 2017

My suggestion was coming more from the perspective of fixing it this way in TransferManager.copy() (where you very well might already have a full ObjectMetadata instance that you could then pass to TransferManager.copy()) and then also fixing it similarly here in TransferManager.download(). You could conceivably already have a full ObjectMetadata object in this case too, right? I guess that's not quite what you had in mind though, which is why you are suggesting passing just the lastModifiedTime. Could you please describe a little more about how you might already have this lastModifiedTime but not a full ObjectMetadata object? (I'm not saying there's no case but am just curious what your exact use case is.)

@ejono
Copy link

ejono commented Feb 7, 2017

Oops, nvm, I just realized that you already explained this above.

@shorea shorea assigned shorea and unassigned zhangzhx May 15, 2017
@millems
Copy link
Contributor

millems commented Nov 10, 2017

Sorry about the long delay on this one. Did we still want to iterate on this? We're mostly code frozen for transfer manager on 1.11.x at this point, other than for fixing bugs.

We'll make a note to integrate this feedback when we implement 2.0.x's transfer manager: aws/aws-sdk-java-v2#37. Feel free to add any other transfer manager-specific feature requests on that issue as well.

@NikolayAtSony
Copy link
Contributor Author

I think we can close this one and implement this as part of v2 instead. What is the estimated ETA for v2 to be in production status?

@millems
Copy link
Contributor

millems commented Nov 13, 2017

Thanks!

We haven't published a date we'll GA for V2. Once we figure it out, we'll give an announcement on a blog post.

@millems millems closed this Nov 13, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants