Cache S3FileAttributes in S3Path instances to dramatically reduce requests to S3 #47
Conversation
Actually, I sent a similar pull request months ago. See #34. Hope both of them will be merged.
Oh, I didn’t even spot that one - let’s hope there’s now enough pressure to get them merged! :: Heikki
Thanks @heikkipora and @pditommaso, I promise to check both PRs during the weekend :)
THAT sounds good! I wasn't yet able to make all the mocked tests pass for my part, but it seemed to actually work for me :)
Ok, I did a first check of the code and it looks great, but I have one question:

```java
Path s3Path = Paths.get(URI.create("s3://path/to/file"));
Files.readAttributes(s3Path, BasicFileAttributes.class); // now s3Path caches the attributes
// modify or delete the object through the Amazon S3 web interface or the command line
Files.readAttributes(s3Path, BasicFileAttributes.class); // wrong result, no?
```

If you store an S3Path in some kind of session and after a while you call `Files.exists(s3Path)`, you are going to get a false positive (if some other process deleted the file before the call). What do you think? I am not sure if it is a real problem...
Yes, I agree on this. A good strategy would be to allow the cached attributes to be used only for the next access.
You are absolutely right, the attributes will be cached for the lifetime of an S3Path instance - which was my purpose to begin with :) Could be a problem for some uses, true.
Thanks for the feedback, I think the @pditommaso strategy is a good point. I'm working on some kind of time-expiration cache: if the cache time expires, then readAttributes regenerates the S3FileAttributes. I am working on this, but what do you think?
Is the time-based expiration an alternative to the "expire by read" strategy I was suggesting, or are they supposed to work together? I think a good strategy could be: the cached attributes expire after the first access. Also, the documentation should stress that, to avoid useless network roundtrips, the user should cache the file attributes when possible.
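A minimal sketch of how the two strategies could be combined - all class and method names here are hypothetical, not the library's actual API. An entry is handed out at most once ("expire by use") and only while its time-to-live has not elapsed (time-based expiration):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.TimeUnit;

public class AttributesCacheSketch {

    // Hypothetical cache holding the attributes of a single path.
    static class AttributesCache {
        private final long ttlNanos;
        private BasicFileAttributes cached; // null = no valid entry
        private long storedAtNanos;

        AttributesCache(long ttl, TimeUnit unit) {
            this.ttlNanos = unit.toNanos(ttl);
        }

        synchronized void put(BasicFileAttributes attrs) {
            cached = attrs;
            storedAtNanos = System.nanoTime();
        }

        /** Returns the entry at most once ("expire by use"), and only within the TTL. */
        synchronized BasicFileAttributes getIfValid() {
            BasicFileAttributes result = cached;
            cached = null; // a single read consumes the entry
            boolean expired = System.nanoTime() - storedAtNanos > ttlNanos;
            return expired ? null : result;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("cache-demo", ".txt");
        AttributesCache cache = new AttributesCache(60, TimeUnit.SECONDS);
        cache.put(Files.readAttributes(tmp, BasicFileAttributes.class));
        System.out.println(cache.getIfValid() != null); // true: first read hits the cache
        System.out.println(cache.getIfValid() != null); // false: entry was consumed
        Files.delete(tmp);
    }
}
```

On a cache miss (null return), the caller would issue a fresh S3 request and re-populate the cache.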
sounds good to me, thx @pditommaso :)
My commits did actually break a few places, pushed a commit to fix that :) Tests are 100% green for me now.
Already integrated in the developer branch, thanks for the commits 👍 I used both the time-based expiration and expire-by-use, as @pditommaso suggested.
I am going to publish a new 1.0.3 version. @heikkipora, do you want to be added to the developers list in the pom.xml? If yes, please tell me your email ;)
Excellent! My email is heikki.pora at gmail dot com
thanks!
Given this code snippet running with S3-FileSystem:

```java
List<Path> files = Files.walk(folder, 1)
        .filter(Files::isRegularFile)
        .collect(Collectors.toList());
```

It makes four (!) S3 HTTP requests for each directory entry (files and immediate subfolders).
With this pull request, the number of S3 HTTP requests is reduced to 1 + N x 2, where N is the number of immediate subfolders.
For a folder containing 10 000 files but no subfolders, the change is 40 000 -> 1 ⚡
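The documentation advice discussed above (let the caller cache file attributes where possible) can also be applied on the user side. A hedged sketch with a hypothetical helper, demonstrated on a local temp directory in place of an S3 path - the same code works against any NIO `Path`:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Stream;

public class WalkOnceDemo {

    // Read each entry's attributes once and reuse them, instead of calling
    // Files.isRegularFile / Files.size separately (on a remote file system
    // each separate call may trigger its own attributes lookup).
    static long countRegularFiles(Path folder) throws IOException {
        Map<Path, BasicFileAttributes> attrs = new LinkedHashMap<>();
        try (Stream<Path> entries = Files.walk(folder, 1)) {
            for (Path p : (Iterable<Path>) entries::iterator) {
                attrs.put(p, Files.readAttributes(p, BasicFileAttributes.class));
            }
        }
        return attrs.values().stream()
                .filter(BasicFileAttributes::isRegularFile)
                .count();
    }

    public static void main(String[] args) throws IOException {
        Path folder = Files.createTempDirectory("walk-demo");
        Files.createFile(folder.resolve("a.txt"));
        Files.createFile(folder.resolve("b.txt"));
        Files.createDirectory(folder.resolve("sub"));
        System.out.println(countRegularFiles(folder)); // prints 2
    }
}
```

The attributes map can then serve later `size()` or `lastModifiedTime()` queries without going back to the file system.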