S3: provide java.nio.FileSystem implementation #1388

Closed
ashleymercer opened this issue Aug 17, 2019 · 24 comments
Labels
feature-request A feature should be added or improved. p2 This is a standard priority issue

Comments

@ashleymercer

ashleymercer commented Aug 17, 2019

Expected Behavior

The java.nio.FileSystem API provides an abstraction for dealing with different types of file systems, and for accessing files and folders within that file system. AWS should provide an implementation of this interface backed by an s3 client.

Current Behavior

Application code either has to explicitly know about s3 (by passing around S3Client everywhere) or else use some custom abstraction, which inevitably ends up being a half-baked implementation of parts of the FileSystem and Path APIs anyway.

Possible Solution

There are three existing attempts to solve this problem that I can find:

  • Upplication/Amazon-S3-FileSystem-NIO2

    • built against the old 1.11 AWS SDK, so incompatible with codebases that use aws-sdk-java-v2
    • no code changes, and very little activity in issues, for over a year so not actively maintained?
  • elerch/Amazon-S3-FileSystem-NIO2

    • fork of the Upplication implementation, but switches to aws-sdk-java-v2
    • however this largely appears to be a solo effort, no guarantees about support, bugfixes etc going forwards
  • carlspring/s3fs-nio

Fortunately, this code is MIT-licensed so perhaps could form the basis of an official library?

Context

Developing application code against s3 can be problematic because it's not always possible to have a live s3 instance available: access policies might be very strict (no access to s3 from outside corp network, or only from specific (non-dev) machines), or developers might not even have internet access at all when e.g. working on the move.

Attempts to solve this problem in a different way exist (e.g. libraries which provide a local webserver with an s3-like interface), but this adds another set of dependencies and another tool to learn / configure / debug.

In my view, a cleaner solution would be to provide an implementation of java.nio.FileSystem which is the standard Java abstraction for dealing with different file systems. Application code would only need to talk to java.nio.file.Path and friends and developers could be confident of their code working reliably regardless of whether it's running against local disk storage, or s3.
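
To illustrate the point, here is a minimal sketch of application code that only depends on java.nio.file.Path. The class and method names are made up for illustration. The example runs against a local temp file; the same method would accept a Path from any other installed provider, assuming that provider supports reads.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical example class; nothing here is S3-specific.
final class PathAgnosticExample {

    // Depends only on Path and Files; the backing provider is the caller's choice.
    static long countNonBlankLines(Path file) throws IOException {
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        return lines.stream().filter(line -> !line.trim().isEmpty()).count();
    }

    public static void main(String[] args) throws IOException {
        // Local disk stands in for any file system here.
        Path local = Files.createTempFile("report", ".txt");
        try {
            Files.write(local, "alpha\n\nbeta\n".getBytes(StandardCharsets.UTF_8));
            System.out.println(countNonBlankLines(local));
        } finally {
            Files.deleteIfExists(local);
        }
    }
}
```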

@millems
Contributor

millems commented Aug 19, 2019

We've talked about this recently, actually. It would be cool to have, but our friends over in .NET (who have done something similar) have stated that it's actually surprisingly tricky to do correctly.

Marking as feature request.

@millems millems added the feature-request A feature should be added or improved. label Aug 19, 2019
@debora-ito debora-ito added this to Backlog (Not Ordered) in New Features (Public) via automation Sep 4, 2019
@pditommaso

Big +1 for this feature. We maintain a fork of the Upplication project (see here), but it would be very useful to have an official implementation.

@steve-todorov

Have there been any changes, and are there any plans for official support? :)

@millems
Contributor

millems commented Apr 2, 2020

We still think it would be a cool idea. Right now the team is focused on getting customers' favorite V1 features into V2. Once we've gotten further along in that process, we can start more seriously considering new, cool features like this one.

@steve-todorov

Great! Thanks for the update! Please keep us in the loop :)

@lbergelson

@millems This would be very useful for us. The proliferation of forks of the Upplication provider (which is itself a fork of an older provider) causes a lot of confusion.

Google has a very robust open source Path provider for gs: buckets. It would be great if Amazon did too.

@carlspring

Are there any updates on this?

@millems
Contributor

millems commented Jul 6, 2020

Sorry, this still has not been prioritized.

@carlspring

carlspring commented Jul 22, 2020

Google's NIO storage provider works really well and is pretty straightforward to integrate into any project.

What would be required in order to get the ball rolling for an S3 provider as well?

If somebody were to, say, walk over the different forks of Upplication/Amazon-S3-FileSystem-NIO2 and merge the useful changes that people have made in their forks, would the Amazon team be interested in adopting such a fork and continuing the work?

@millems
Contributor

millems commented Jul 22, 2020

Unfortunately we aren't able to take on the project ourselves right now, even in just a maintenance capacity. That might change in the future, as demand for this feature rises (both here on Github and via any other official AWS channels of communication) or demand for our time elsewhere falls.

Until such time, we would be surprised and delighted if the open source community were to take up the mantle and develop such a feature. We'd be willing to provide any kind of AWS expertise you might require in the design or development of such a project.

@carlspring

> We've talked about this recently, actually. It would be cool to have, but our friends over in .NET (who have done something similar) have stated that it's actually surprisingly tricky to do correctly.

@millems, would you mind elaborating? What were the issues? What were they trying to do and what exactly didn't work?

@lbergelson

Having been involved in the development of the google implementation as well as currently developing a generic https filesystem provider, I can say that it's a reasonable amount of work but definitely not insurmountable. One person working part time for a year should be able to come up with a very good solution. It probably has to be iterated though as new error modes are discovered / appear due to changes in the underlying infrastructure.

I would say the hardest part is making it robust against intermittent failure. A file system can't fail at the same rate the internet does so every operation has to be able to continue and retry in the face of failures.
Authentication is also tricky and I don't know how Amazon handles this.
Performance is tricky because of the ludicrously high latency compared to local disk operations so some sort of caching or prefetching layer is very helpful.
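
As a rough sketch of the "continue and retry" idea, assuming transient failures surface as IOException: the helper below (a hypothetical name, not taken from any of the libraries mentioned here) retries an operation with exponential backoff and gives up after a fixed number of attempts. A real provider would also need to decide which errors are retryable and would likely add jitter.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper for illustration only.
final class RetryingCall {

    // Runs the operation, retrying on IOException and doubling the delay each time.
    // Rethrows the last IOException once maxAttempts is reached.
    static <T> T withRetries(Callable<T> operation, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts: surface the failure to the caller
                }
                Thread.sleep(delay);
                delay *= 2;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger calls = new AtomicInteger();
        // Simulated remote read that fails twice before succeeding.
        String result = withRetries(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new IOException("simulated transient failure");
            }
            return "ok";
        }, 5, 10);
        System.out.println(result + " after " + calls.get() + " attempts");
    }
}
```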

This is the sort of project that definitely benefits from a set of dedicated maintainers rather than a hodgepodge of forks with their own solutions.

@millems
Contributor

millems commented Jul 23, 2020

@lbergelson's summary is great. @normj can weigh in on the struggles encountered doing it for .NET.

@lbergelson

lbergelson commented Jul 23, 2020

The "year" of time might sound more intense than I meant. It took initial work but then needed continual adjustment over time as we discovered new rare edge cases through use. Not a solid year of someone writing code.

@normj
Member

normj commented Jul 24, 2020

For the .NET SDK we have a similar feature where we make S3 look like a file system matching the .NET File IO API. Although it does make S3 easier to traverse, it causes pitfalls that are not obvious to the user, because S3 really isn't a filesystem. For example, the .NET File IO API has operations to append to an existing file. It looks simple and is a very tempting API for users to call, but under the covers we have to download the object, concatenate the new data, and re-upload the result. Similarly, for a simple file system operation like moving or renaming a directory: S3 doesn't really have directories, so you end up having to list all of the objects, copy them over, and then delete the originals. If there are a lot of objects under that S3 virtual directory, this can be very costly.
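
The cost described above can be sketched with a toy in-memory key/value store that only offers whole-object operations. Everything here (class name, methods, the request counter) is a made-up model for illustration, not the .NET or Java SDK API. It assumes a copy is a single request and ignores list pagination.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of a flat object store: keys map to whole objects, no directories.
final class FlatObjectStore {
    private final Map<String, byte[]> objects = new TreeMap<>();
    int requests = 0; // number of simulated round trips to the store

    void put(String key, byte[] data) { requests++; objects.put(key, data.clone()); }
    byte[] get(String key) { requests++; return objects.get(key).clone(); }
    void copy(String from, String to) { requests++; objects.put(to, objects.get(from)); }
    void delete(String key) { requests++; objects.remove(key); }

    List<String> list(String prefix) {
        requests++;
        List<String> keys = new ArrayList<>();
        for (String key : objects.keySet()) {
            if (key.startsWith(prefix)) keys.add(key);
        }
        return keys;
    }

    // "Append" is really: download the whole object, concatenate, upload the whole object.
    void append(String key, byte[] extra) {
        byte[] old = get(key);
        byte[] merged = new byte[old.length + extra.length];
        System.arraycopy(old, 0, merged, 0, old.length);
        System.arraycopy(extra, 0, merged, old.length, extra.length);
        put(key, merged);
    }

    // "Rename directory" is really: list the prefix, then copy and delete every object.
    void renamePrefix(String from, String to) {
        for (String key : list(from)) {
            copy(key, to + key.substring(from.length()));
            delete(key);
        }
    }

    public static void main(String[] args) {
        FlatObjectStore store = new FlatObjectStore();
        store.put("logs/a.txt", "hello".getBytes(StandardCharsets.UTF_8));
        store.put("logs/b.txt", "other".getBytes(StandardCharsets.UTF_8));

        store.requests = 0;
        store.append("logs/a.txt", " world".getBytes(StandardCharsets.UTF_8));
        System.out.println("append cost: " + store.requests + " requests");

        store.requests = 0;
        store.renamePrefix("logs/", "archive/");
        System.out.println("rename cost: " + store.requests + " requests");
    }
}
```

In this model an append always transfers the full object twice regardless of how little data is added, and renaming a prefix with n objects takes 1 + 2n requests.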

So although we have a similar approach in .NET, it has caused a lot of confusion for users, especially those new to S3, to the point that I actually regret us having the feature. I would rather users of S3 know what manipulations they are doing to S3 than do what looks like a simple operation and get a big surprise when it turns out to be very slow and costly.

@carlspring

Thanks for sharing your experience with GCS, as well as the S3 .NET implementation!

@carlspring

carlspring commented Nov 8, 2020

Hi guys,

We are pleased to let you know that we've created a spin-off project (rebranded fork called s3fs-nio) of Upplication/Amazon-S3-FileSystem-NIO2. This is a spin-off (of the latest Upplication/Amazon-S3-FileSystem-NIO2 master with fully preserved history), instead of just another fork, as the upstream has a plethora of forks which contain fixes for bits and pieces that had pull requests against the upstream which were never merged. Most of these forks appear to have sadly died out, just like the upstream seems to have died (Upplication/Amazon-S3-FileSystem-NIO2#135).

As there is a need for such a library, we have decided to take on this task and rebuild an active project with a knowledge base, chat channel and helpful community around it. Ultimately, we would really appreciate it, if we could have some of the Amazon folks helping out with advice and reviewing pull requests, as this would be a massive help for us!

We've done a big clean up of the code, upgraded its dependencies and migrated to AWS SDK v2 (special thanks to @ptirador for all the hard work, as well as to @elerch + @markjschreiber for their advice and reviews!). The project is actively tested against JDK 8 and 11 via GitHub Actions. Our work is not done and we would like to keep working on this project! We intend to invite all the contributors with open pull requests to join our efforts and forward-port their fixes to our project.

If anyone is interested in lending a hand and joining our project, please reach out, as we have plenty to do! :)

@ashleymercer : Could you please add us to your list above? Thanks! :)

cc: @ptirador , @steve-todorov , @sbespalov

@oeph

oeph commented Jan 12, 2023

It seems that s3fs-nio never got to the point where a first version was finished.
During my search for such a library, I came across https://github.com/awslabs/aws-java-nio-spi-for-s3. Could this be a good starting point?

@markjschreiber

I hope so. I wrote that package as I needed something that used the AWS SDK v2 and offered the option of standard S3Clients and Async S3 clients while also making a clean break from some of the approaches of the Upplication library. Currently aws-java-nio-spi-for-s3 (https://github.com/awslabs/aws-java-nio-spi-for-s3) only offers read access to S3. Implementing write access is certainly possible and probably easy, although S3 doesn't support random writes, so any write is a complete put of the entire object. Let me know if writes are important, as an implementation would need to consider the most likely use cases.
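
One way the "complete put" constraint could look from the caller's side is an OutputStream that buffers everything and uploads once on close. This is only a sketch under that assumption, not how aws-java-nio-spi-for-s3 implements or plans to implement writes. The class name and the uploader callback are hypothetical, and a Map stands in for the bucket.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical stream: collects all bytes in memory, then performs one whole-object put.
final class WholeObjectOutputStream extends OutputStream {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final String key;
    private final BiConsumer<String, byte[]> uploader; // stand-in for a real put call
    private boolean closed;

    WholeObjectOutputStream(String key, BiConsumer<String, byte[]> uploader) {
        this.key = key;
        this.uploader = uploader;
    }

    @Override
    public void write(int b) throws IOException {
        ensureOpen();
        buffer.write(b);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        ensureOpen();
        buffer.write(b, off, len);
    }

    // Nothing reaches the store until close(); then the entire object is sent once.
    @Override
    public void close() {
        if (closed) return;
        closed = true;
        uploader.accept(key, buffer.toByteArray());
    }

    private void ensureOpen() throws IOException {
        if (closed) throw new IOException("stream closed");
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> bucket = new HashMap<>(); // in-memory stand-in for a bucket
        try (OutputStream out = new WholeObjectOutputStream("notes.txt", bucket::put)) {
            out.write("first part, ".getBytes(StandardCharsets.UTF_8));
            out.write("second part".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(new String(bucket.get("notes.txt"), StandardCharsets.UTF_8));
    }
}
```

Buffering in memory keeps the sketch short; large objects would presumably need a temp file or a multipart upload instead.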

@oeph

oeph commented Jan 13, 2023

@markjschreiber thank you! Yes, we certainly need write access and would obviously prefer to have it within the library you wrote. We also don't need random writes, so a complete put of the file would be enough for our use case at least.

@markjschreiber

Makes sense (we should probably also move this discussion to that project). I'm certainly happy to review any pull request, even one that is a work in progress (WIP). Let's follow up with more detailed requirements in a discussion at https://github.com/awslabs/aws-java-nio-spi-for-s3

@debora-ito
Member

Hi everyone, thank you for your interest in seeing S3 FileSystem integration supported in the Java SDK v2.

Given that the aws-java-nio-spi-for-s3 lib is under the awslabs org, we recommend you use it - thank you @markjschreiber!
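
For anyone trying this out, a quick way to check whether an NIO provider is visible to the JVM is to look it up by URI scheme. The sketch below uses only JDK APIs; the "s3" scheme, bucket and key are assumptions for illustration (check the library's README for the scheme it actually registers), and the S3 branch only runs if such a provider is on the classpath.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.spi.FileSystemProvider;

// Hypothetical helper class for illustration.
final class ProviderLookup {

    // True if some installed FileSystemProvider handles the given URI scheme.
    static boolean hasProvider(String scheme) {
        for (FileSystemProvider provider : FileSystemProvider.installedProviders()) {
            if (provider.getScheme().equalsIgnoreCase(scheme)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("file provider installed: " + hasProvider("file"));

        // Assumption: the S3 provider registers the "s3" scheme.
        if (hasProvider("s3")) {
            Path remote = Paths.get(URI.create("s3://example-bucket/example-key"));
            System.out.println("resolved: " + remote);
        } else {
            System.out.println("no s3 provider on the classpath");
        }
    }
}
```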

Closing.

New Features (Public) automation moved this from Backlog (Not Ordered) to Done Feb 24, 2023
@carlspring

carlspring commented May 24, 2023

Hey guys,

After a long wait, we'd like to let you know that we've cut a release for s3fs-nio as org.carlspring.cloud.aws:s3fs-nio:1.0.0. This is now available via Maven Central (https://repo.maven.apache.org/maven2/). Here are our Release Notes.

We are also working on improving our documentation and contributions would be highly appreciated.

We would like to welcome you to test and report back any findings!

For those of you interested in contributing, there is plenty yet to be done and we'd be more than happy to have you aboard!

Looking forward to your feedback! Happy coding! :)
