
Add FUSE capability #111

Closed

martindurant opened this issue Jan 15, 2018 · 5 comments

Comments

@martindurant
Member

Replication of the functionality in fsspec/gcsfs#53.
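For illustration, a minimal sketch of what this capability looks like in the `fsspec.fuse` module that eventually grew out of this line of work (an assumption here, since that module postdates this issue); the bucket path and mount point are hypothetical, and the fusepy package is required:

```python
# Illustrative sketch only: mount an S3 prefix as a local directory via
# fsspec's FUSE runner. Requires the fusepy package to be installed.
import fsspec
from fsspec.fuse import run

fs = fsspec.filesystem("s3", anon=True)

# Expose "my-bucket/data/" (hypothetical) at /mnt/s3 so that ordinary
# POSIX tools and libraries can read the objects as plain files.
run(fs, "my-bucket/data/", "/mnt/s3", foreground=True)
```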

@martindurant
Member Author

Note that several projects already exist that claim to offer this functionality, so I do not plan to resolve this issue until someone asks for it.

@JiaweiZhuang

JiaweiZhuang commented Feb 14, 2018

I am working with a lot of NetCDF files on S3 and am quite interested in this. How would this be different from s3fs-fuse? Similarly, how is dask/gcsfs different from gcsfuse? Say, would it lead to better I/O performance for xarray?

Besides xarray, I also want to use the NetCDF Fortran API to read data on S3 (the input data for our group's GEOS-Chem model). Each NetCDF file is pretty small (~100 MB) and s3fs-fuse seems to perform OK. Do you expect dask/s3fs to increase or decrease the performance of the NetCDF Fortran/C API, compared to s3fs-fuse?
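For comparison, the mount-free route under discussion looks roughly like the sketch below; the bucket and file names are hypothetical, and the h5netcdf engine is assumed because it accepts Python file-like objects:

```python
# Sketch of reading a NetCDF file on S3 without a FUSE mount, going
# through s3fs directly. Bucket/key names are hypothetical.
import s3fs
import xarray as xr

s3 = s3fs.S3FileSystem(anon=True)
with s3.open("my-bucket/GEOSChem_input.nc", "rb") as f:
    # h5netcdf can read from a file-like object, so no local copy is made
    ds = xr.open_dataset(f, engine="h5netcdf")
    print(ds)  # prints metadata only; data values are read lazily
```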

@mrocklin
Collaborator

@JiaweiZhuang can you quantify how well s3fs-fuse does? I would be very curious to see numbers on how fast you can get a small bit of data and how fast you can get a large amount of data using this method.

In principle there is no difference between what is proposed here and existing FUSE systems; this would be redundant. That being said, it would be nice to have easy access to build on and modify the behavior. It might end up being a good idea to write code to make HDF on FUSE work decently across a few cloud object stores.

Excerpt from http://matthewrocklin.com/blog/work/2018/02/06/hdf-in-the-cloud

This slowdown is significant because the HDF library makes many small 4kB reads in order to gather the metadata necessary to pull out a chunk of data. Each of those tiny reads made sense when the data was local, but now we're sending out a web request each time. This means that users can sit for minutes just to open a file.

Fortunately, we can be clever. By buffering and caching data, we can reduce the number of web requests. For example, when asked to download 4kB we actually download 100kB or 1MB. If some of the future 4kB reads are within this 1MB then we can return them immediately. Looking at HDF traces, it looks like we can probably reduce "dozens" of web requests to "a few".
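A minimal sketch of that read-ahead idea (not s3fs's actual implementation; `fetch(start, end)` stands in for a single HTTP range request):

```python
# Read-ahead cache sketch: turns many tiny 4 kB reads into a few large
# block fetches, as described in the excerpt above.
class ReadAheadCache:
    def __init__(self, fetch, blocksize=1_000_000):
        self.fetch = fetch          # callable: (start, end) -> bytes
        self.blocksize = blocksize  # e.g. 1 MB per web request
        self.start = self.end = 0
        self.cache = b""

    def read(self, offset, length):
        # On a cache miss, fetch a whole block beginning at `offset`;
        # subsequent nearby reads are then served from memory.
        if offset < self.start or offset + length > self.end:
            self.start = offset
            self.end = offset + max(length, self.blocksize)
            self.cache = self.fetch(self.start, self.end)
        pos = offset - self.start
        return self.cache[pos : pos + length]
```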

@JiaweiZhuang

JiaweiZhuang commented Feb 14, 2018

@JiaweiZhuang can you quantify how well s3fs-fuse does? I would be very curious to see numbers on how fast you can get a small bit of data and how fast you can get a large amount of data using this method.

I am testing s3fs-fuse and will keep you posted.

Preliminary results: For a file size of 100~200 MB, the latency is close to reading data from EBS volumes. But reading 1 GB of data is significantly slower. Even just getting the metadata with ncdump or xr.open_dataset() takes ~2 s.

@mrocklin
Collaborator

Preliminary results: For a file size of 100~200 MB, the latency is close to reading data from EBS volumes

What are you testing when you test for latency? I would expect this to be something like getting a single value from an array, e.g. %time x[0, 0, 0].
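A hedged sketch of one way to collect both numbers over a FUSE mount; the mount point, file name, variable name, and 3-D shape are all assumptions:

```python
# Time-to-first-value (latency) versus a full-variable read (throughput)
# over a hypothetical FUSE mount at /mnt/s3. Variable assumed 3-D.
import time
import xarray as xr

ds = xr.open_dataset("/mnt/s3/GEOSChem_input.nc")
var = ds["SpeciesConc_O3"]

t0 = time.perf_counter()
_ = var[0, 0, 0].values          # single value: dominated by latency
print(f"first value: {time.perf_counter() - t0:.2f} s")

t0 = time.perf_counter()
_ = var.values                   # whole array: dominated by throughput
print(f"full read:   {time.perf_counter() - t0:.2f} s")
```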
