
Add FUSE capability #111

Closed

martindurant opened this issue Jan 15, 2018 · 5 comments

Comments

@martindurant
Member

Replication of the functionality in fsspec/gcsfs#53.
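For illustration, a minimal sketch of what this capability looks like in the `fsspec.fuse` module that eventually grew out of this line of work (an assumption here, since that module postdates this issue); the bucket path and mount point are hypothetical, and the fusepy package is required:

```python
# Illustrative sketch only: mount an S3 prefix as a local directory via
# fsspec's FUSE runner. Requires the fusepy package to be installed.
import fsspec
from fsspec.fuse import run

fs = fsspec.filesystem("s3", anon=True)

# Expose "my-bucket/data/" (hypothetical) at /mnt/s3 so that ordinary
# POSIX tools and libraries can read the objects as plain files.
run(fs, "my-bucket/data/", "/mnt/s3", foreground=True)
```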

@martindurant
Member Author

Note that several projects already exist that claim to offer this functionality, so I do not plan to resolve this issue until someone asks for it.

@JiaweiZhuang

JiaweiZhuang commented Feb 14, 2018

I am working with a lot of NetCDF files on S3 and am quite interested in this. How would this be different from s3fs-fuse? Similarly, how is dask/gcsfs different from gcsfuse? Say, would it lead to better I/O performance for xarray?

Besides xarray, I also want to use the NetCDF Fortran API to read data on S3 (the input data for our group's GEOS-Chem model). Each NetCDF file is pretty small (~100 MB) and s3fs-fuse seems to perform OK. Do you expect dask/s3fs to increase or decrease the performance of the NetCDF Fortran/C API, compared to s3fs-fuse?
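For comparison, the mount-free route under discussion looks roughly like the sketch below; the bucket and file names are hypothetical, and the h5netcdf engine is assumed because it accepts Python file-like objects:

```python
# Sketch of reading a NetCDF file on S3 without a FUSE mount, going
# through s3fs directly. Bucket/key names are hypothetical.
import s3fs
import xarray as xr

s3 = s3fs.S3FileSystem(anon=True)
with s3.open("my-bucket/GEOSChem_input.nc", "rb") as f:
    # h5netcdf can read from a file-like object, so no local copy is made
    ds = xr.open_dataset(f, engine="h5netcdf")
    print(ds)  # prints metadata only; data values are read lazily
```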

@mrocklin
Collaborator

@JiaweiZhuang can you quantify how well s3fs-fuse does? I would be very curious to see numbers on how fast you can get a small bit of data and how fast you can get a large amount of data using this method.

In principle there is no difference between what is proposed here and existing FUSE systems; this would be redundant. That being said, it would be nice to have easy access to build on and modify the behavior. It might end up being a good idea to write code to make HDF on FUSE work decently across a few cloud object stores.

Excerpt from http://matthewrocklin.com/blog/work/2018/02/06/hdf-in-the-cloud

This slowdown is significant because the HDF library makes many small 4kB reads in order to gather the metadata necessary to pull out a chunk of data. Each of those tiny reads made sense when the data was local, but now we're sending out a web request each time. This means that users can sit for minutes just to open a file.

Fortunately, we can be clever. By buffering and caching data, we can reduce the number of web requests. For example, when asked to download 4kB we actually download 100kB or 1MB. If some of the future 4kB reads are within this 1MB then we can return them immediately. Looking at HDF traces, it looks like we can probably reduce "dozens" of web requests to "a few".
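A minimal sketch of that read-ahead idea (not s3fs's actual implementation; `fetch(start, end)` stands in for a single HTTP range request):

```python
# Read-ahead cache sketch: turns many tiny 4 kB reads into a few large
# block fetches, as described in the excerpt above.
class ReadAheadCache:
    def __init__(self, fetch, blocksize=1_000_000):
        self.fetch = fetch          # callable: (start, end) -> bytes
        self.blocksize = blocksize  # e.g. 1 MB per web request
        self.start = self.end = 0
        self.cache = b""

    def read(self, offset, length):
        # On a cache miss, fetch a whole block beginning at `offset`;
        # subsequent nearby reads are then served from memory.
        if offset < self.start or offset + length > self.end:
            self.start = offset
            self.end = offset + max(length, self.blocksize)
            self.cache = self.fetch(self.start, self.end)
        pos = offset - self.start
        return self.cache[pos : pos + length]
```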

@JiaweiZhuang

JiaweiZhuang commented Feb 14, 2018

@JiaweiZhuang can you quantify how well s3fs-fuse does? I would be very curious to see numbers on how fast you can get a small bit of data and how fast you can get a large amount of data using this method.

I am testing s3fs-fuse and will keep you posted.

Preliminary results: For a file size of 100~200 MB, the latency is close to reading data from EBS volumes. But reading 1 GB of data is significantly slower. Even just getting the metadata with ncdump or xr.open_dataset() takes ~2 s.

@mrocklin
Collaborator

Preliminary results: For a file size of 100~200 MB, the latency is close to reading data from EBS volumes

What are you testing when you test for latency? I would expect this to be something like getting a single value from an array, e.g. %time x[0, 0, 0].
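A hedged sketch of one way to collect both numbers over a FUSE mount; the mount point, file name, variable name, and 3-D shape are all assumptions:

```python
# Time-to-first-value (latency) versus a full-variable read (throughput)
# over a hypothetical FUSE mount at /mnt/s3. Variable assumed 3-D.
import time
import xarray as xr

ds = xr.open_dataset("/mnt/s3/GEOSChem_input.nc")
var = ds["SpeciesConc_O3"]

t0 = time.perf_counter()
_ = var[0, 0, 0].values          # single value: dominated by latency
print(f"first value: {time.perf_counter() - t0:.2f} s")

t0 = time.perf_counter()
_ = var.values                   # whole array: dominated by throughput
print(f"full read:   {time.perf_counter() - t0:.2f} s")
```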
