Cache S3FileAttributes in S3Path instances to dramatically reduce requests to S3 #47

Closed
wants to merge 4 commits

Conversation

heikkipora

Given this code snippet running with S3-FileSystem:

List<Path> files = Files.walk(folder, 1).filter(Files::isRegularFile).collect(Collectors.toList());

It makes four (!) S3 HTTP requests for each directory entry (files and immediate subfolders).

With this pull request, the number of S3 HTTP requests is reduced to 1 + N x 2, where N is the number of immediate subfolders.

For a folder containing 10 000 files but no subfolders, the change is 40 000 -> 1 ⚡
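
For reference, here is the snippet above as a complete, compilable example (the class name is illustrative; folder is assumed to be an S3Path obtained from the S3 filesystem). Each Files::isRegularFile check reads the entry's attributes, which is where the per-entry S3 requests come from when nothing is cached:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ListRegularFiles {

    // Lists the regular files directly inside 'folder' (depth 1).
    // Without attribute caching, every isRegularFile call triggers its own
    // attribute lookup against S3.
    static List<Path> listRegularFiles(Path folder) throws IOException {
        try (Stream<Path> entries = Files.walk(folder, 1)) {
            return entries.filter(Files::isRegularFile).collect(Collectors.toList());
        }
    }
}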

@pditommaso

Actually, I sent a similar pull request months ago, see #34. Hope that both of them will be merged.

@heikkipora
Author

Oh, I didn’t even spot that one - let’s hope there’s now enough pressure to get them merged!

:: Heikki


@jarnaiz
Member

jarnaiz commented Aug 13, 2015

Thanks @heikkipora and @pditommaso, I promise to check both PRs during the weekend :)

@heikkipora
Author

That sounds good! I haven't yet been able to make all the mocked tests pass on my side, but it actually seemed to work for me :)

@jarnaiz
Member

jarnaiz commented Aug 14, 2015

OK, I did a first check of the code and it looks great, but I have one question:
When is the cache refreshed? If I make a manual change in Amazon S3 (using the web interface), I'm going to end up in some kind of inconsistency, no?
An example:

Path s3Path = fileSystem.getPath("s3://path/to/file");
Files.readAttributes(s3Path, BasicFileAttributes.class); // now we have an s3Path with the attributes cached
// do something in the Amazon S3 web interface or on the command line
Files.readAttributes(s3Path, BasicFileAttributes.class); // wrong result, no?

If you store an s3Path in some kind of session and after a while you call Files.exists(s3Path), you are going to get a false positive (if some other process deleted the file before the call).

What do you think? I am not sure if it is a real problem...
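
To illustrate that second scenario, a minimal sketch (fileSystem and the object key are placeholders, and the outcome assumes attributes stay cached for the lifetime of the S3Path, as in this PR):

// Illustration only: 'fileSystem' is an already-open S3 filesystem instance.
Path sessionPath = fileSystem.getPath("/bucket/reports/summary.csv");
Files.readAttributes(sessionPath, BasicFileAttributes.class); // attributes cached on sessionPath

// ... sessionPath is kept in a session; another process deletes the object meanwhile ...

boolean stillThere = Files.exists(sessionPath); // may still report true if answered from the cache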

@pditommaso

Yes, I agree on this. A good strategy is to allow the cached attributes to be used only for the next readAttributes operation and then invalidate them. You can see this in my implementation:

https://github.com/Upplication/Amazon-S3-FileSystem-NIO2/pull/34/files#diff-c97f0d097f57a3d98291ce62c0b9d525R482
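
A minimal sketch of that "use once, then invalidate" idea (illustrative names, not the actual code behind the link above):

import java.nio.file.attribute.BasicFileAttributes;

// Sketch: cached attributes are handed out at most once and then cleared,
// so the following readAttributes call goes back to S3 for fresh metadata.
class ReadOnceAttributesCache {

    private BasicFileAttributes cached;

    synchronized void put(BasicFileAttributes attrs) {
        this.cached = attrs;
    }

    // Returns the cached attributes and invalidates them in the same step;
    // returns null when nothing is cached, and the caller then queries S3.
    synchronized BasicFileAttributes consume() {
        BasicFileAttributes result = cached;
        cached = null;
        return result;
    }
}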

@heikkipora
Author

You are absolutely right, the attributes will be cached for the lifetime of an S3Path instance - which was my purpose to begin with :) It could be a problem for some uses, true.

@jarnaiz
Member

jarnaiz commented Aug 16, 2015

Thanks for the feedback, I think @pditommaso's strategy is a good point.

I'm working on some kind of time-expiration cache. If the cache time expires, then readAttributes regenerates the S3FileAttributes.
If you are working only with this API you can set the cache timeout to -1, and if you want to mix access methods you can set it to some time. The default is 1 minute.

I am working on this, but what do you think?
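
A minimal sketch of that time-based expiration, assuming a configurable timeout in milliseconds where -1 means "never expire" and the default is one minute (names are illustrative, not the actual implementation):

import java.nio.file.attribute.BasicFileAttributes;

// Sketch: attributes expire after 'cacheMillis' milliseconds; -1 disables expiration.
class TimedAttributesCache {

    private final long cacheMillis; // e.g. 60_000 by default, -1 = never expire
    private BasicFileAttributes cached;
    private long cachedAt;

    TimedAttributesCache(long cacheMillis) {
        this.cacheMillis = cacheMillis;
    }

    synchronized void put(BasicFileAttributes attrs) {
        this.cached = attrs;
        this.cachedAt = System.currentTimeMillis();
    }

    // Returns the cached attributes while still fresh, otherwise null
    // (the provider would then regenerate the S3FileAttributes from S3).
    synchronized BasicFileAttributes getIfFresh() {
        if (cached == null) {
            return null;
        }
        if (cacheMillis < 0 || System.currentTimeMillis() - cachedAt <= cacheMillis) {
            return cached;
        }
        cached = null;
        return null;
    }
}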

@pditommaso

Is the time-based expiration an alternative to the "expire by read" strategy I was suggesting, or are they supposed to work together?

I think a good strategy could be: the cached attributes expire after the first access through the readAttributes method and, in any case, after 1 minute.

Also, the documentation should stress that, to avoid useless network round trips, the user should cache the file attributes when possible.
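
Combining both rules, a cached entry would be served only if it has not already been consumed and is still younger than the timeout; a sketch of that validity check (names and the one-minute constant are illustrative):

// Sketch: attributes stay valid only until they are read once AND while they
// are younger than one minute; otherwise the provider refetches them from S3.
static boolean cacheIsStillValid(boolean alreadyRead, long cachedAtMillis) {
    long ageMillis = System.currentTimeMillis() - cachedAtMillis;
    return !alreadyRead && ageMillis <= 60_000;
}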

@jarnaiz
Member

jarnaiz commented Aug 17, 2015

Sounds good to me, thanks @pditommaso :)
I have some problems with the filewalker test and I am working on it.

@heikkipora
Author

My commits did actually break a few places; I pushed a commit to fix that :) Tests are 100% green for me now.

jarnaiz added a commit that referenced this pull request Aug 18, 2015
@jarnaiz
Member

jarnaiz commented Aug 18, 2015

Already integrated in the developer branch, thanks for the commits 👍 I used the time-based expiration together with expire-by-use, as @pditommaso suggested.

@jarnaiz
Member

jarnaiz commented Aug 18, 2015

I am going to publish a new 1.0.3 version. @heikkipora, do you want to be added to the developers list in the pom.xml? If yes, please tell me your email ;)

@heikkipora
Author

Excellent! My email is heikki.pora at gmail dot com

@jarnaiz
Member

jarnaiz commented Aug 18, 2015

thanks!

jarnaiz closed this Aug 18, 2015