
Fuse #53

Merged
merged 27 commits into from
Jan 15, 2018

Conversation

martindurant
Member

supersedes #51

Example:
In one terminal:

python gcsfuse.py anaconda-public-data/iris fuser

In another:

mdurant@blackleaf:~/code/gcsfs/gcsfs$ ls fuser
iris.csv
mdurant@blackleaf:~/code/gcsfs/gcsfs$ cat fuser/iris.csv
petal_length,sepal_length,petal_width,sepal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
...

@mrocklin
Contributor

mrocklin commented Jan 2, 2018

mrocklin@carbon:~/workspace/gcsfs/gcsfs$ mkdir fuser
mrocklin@carbon:~/workspace/gcsfs/gcsfs$ python gcsfuse.py anaconda-public-data/iris fuser
# this hangs here
mrocklin@carbon:~/workspace/gcsfs/gcsfs$ ls fuser
ls: cannot access 'fuser': No such file or directory

@martindurant
Member Author

@mrocklin , you shouldn't need to make the directory, so it's doubly weird that it doesn't exist when you try in the other terminal. Note that I am not passing any project to GCSFileSystem, which I should have been (perhaps as a command-line argument, or just hard-coded for now).
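
A minimal sketch of how a project could be passed on the command line (the argument names and `--project` flag here are hypothetical, not the actual gcsfuse.py interface):

```python
import argparse

def parse_args(argv):
    # Hypothetical CLI: bucket/path to mirror, local mount point,
    # and an optional GCP project to hand to GCSFileSystem.
    parser = argparse.ArgumentParser(description="Mount a GCS bucket via FUSE")
    parser.add_argument("root", help="bucket[/path] to expose")
    parser.add_argument("mountpoint", help="local directory to mount on")
    parser.add_argument("--project", default=None,
                        help="GCP project passed to GCSFileSystem")
    return parser.parse_args(argv)

args = parse_args(["anaconda-public-data/iris", "fuser", "--project", "my-project"])
print(args.root, args.mountpoint, args.project)
```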

@mrocklin
Contributor

mrocklin commented Jan 2, 2018

mrocklin@carbon:~/workspace/gcsfs/gcsfs$ rmdir fuser
mrocklin@carbon:~/workspace/gcsfs/gcsfs$ python gcsfuse.py anaconda-public-data/iris fuser
fuse: bad mount point `fuser': No such file or directory
Traceback (most recent call last):
  File "gcsfuse.py", line 110, in <module>
    main(sys.argv[2], sys.argv[1])
  File "gcsfuse.py", line 107, in main
    FUSE(GCSFS(root, ), mountpoint, nothreads=True, foreground=True)
  File "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/fuse.py", line 480, in __init__
    raise RuntimeError(err)
RuntimeError: 1

@martindurant
Member Author

I would try with a new directory name each time (sorry for pollution) as fuse does not necessarily release the previous mount. I'm sure there is a proper way to do that.
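
For reference, the usual way to release a stale FUSE mount on Linux is `fusermount -u`; a small helper might look like this (this subprocess wrapper is a sketch, not part of this PR):

```python
import subprocess

def unmount(mountpoint, dry_run=False):
    # On Linux, FUSE mounts are released with `fusermount -u`;
    # on macOS the equivalent would be `umount`.
    cmd = ["fusermount", "-u", mountpoint]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd

# dry_run just returns the command that would be executed
print(unmount("fuser", dry_run=True))
```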

@mrocklin
Contributor

mrocklin commented Jan 2, 2018

mrocklin@carbon:~/workspace/gcsfs/gcsfs$ python gcsfuse.py anaconda-public-data/iris fuse2
fuse: bad mount point `fuse2': No such file or directory
Traceback (most recent call last):
  File "gcsfuse.py", line 110, in <module>
    main(sys.argv[2], sys.argv[1])
  File "gcsfuse.py", line 107, in main
    FUSE(GCSFS(root, ), mountpoint, nothreads=True, foreground=True)
  File "/home/mrocklin/Software/anaconda/lib/python3.6/site-packages/fuse.py", line 480, in __init__
    raise RuntimeError(err)
RuntimeError: 1

@mrocklin
Contributor

mrocklin commented Jan 3, 2018

Your recent commit helps. I suspect that I needed to specify a project

@mrocklin
Contributor

mrocklin commented Jan 3, 2018

This fails in an interesting way for me

In [1]: import netCDF4 

In [2]: netCDF4.Dataset?

In [3]: ds = netCDF4.Dataset('fuse4/newmann-met-ensemble-netcdf/conus_ens_001.nc')
HDF5-DIAG: Error detected in HDF5 (1.8.17) thread 139690544854784:
  #000: H5L.c line 1183 in H5Literate(): link iteration failed
    major: Symbol table
    minor: Iteration failed
  #001: H5Gint.c line 844 in H5G_iterate(): error iterating over links
    major: Symbol table
    minor: Iteration failed
  #002: H5Gobj.c line 693 in H5G__obj_iterate(): can't iterate over dense links
    major: Symbol table
    minor: Iteration failed
  #003: H5Gdense.c line 1069 in H5G__dense_iterate(): error building table of links
    major: Symbol table
    minor: Can't get value
  #004: H5Gdense.c line 863 in H5G__dense_build_table(): error iterating over links
    major: Symbol table
    minor: Can't move to next iterator location
  #005: H5Gdense.c line 1060 in H5G__dense_iterate(): link iteration failed
    major: Symbol table
    minor: Iteration failed
  #006: H5B2.c line 389 in H5B2_iterate(): node iteration failed
    major: B-Tree node
    minor: Unable to list node
  #007: H5B2int.c line 2059 in H5B2_iterate_node(): unable to protect B-tree leaf node
    major: B-Tree node
    minor: Unable to protect metadata
  #008: H5B2int.c line 1870 in H5B2_protect_leaf(): unable to protect B-tree leaf node
    major: B-Tree node
    minor: Unable to protect metadata
  #009: H5AC.c line 1262 in H5AC_protect(): H5C_protect() failed.
    major: Object cache
    minor: Unable to protect metadata
  #010: H5C.c line 3574 in H5C_protect(): can't load entry
    major: Object cache
    minor: Unable to load metadata into cache
  #011: H5C.c line 7954 in H5C_load_entry(): unable to load entry
    major: Object cache
    minor: Unable to load metadata into cache
  #012: H5B2cache.c line 875 in H5B2__cache_leaf_load(): wrong B-tree leaf node signature
    major: B-Tree node
    minor: Unable to load metadata into cache
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-3-74b3e4fd7f67> in <module>()
----> 1 ds = netCDF4.Dataset('fuse4/newmann-met-ensemble-netcdf/conus_ens_001.nc')

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.__init__ (netCDF4/_netCDF4.c:13992)()

OSError: NetCDF: HDF error

@martindurant
Member Author

The caching (which also fixes in-filling when jumping about in a file) reduced the time to open an xarray dataset from 36s to 11s.

@mrocklin
Contributor

mrocklin commented Jan 3, 2018

Yeah I was running into that last night and was surprised that my machine was still downloading data an hour later :)

@martindurant
Member Author

I could use eyes on the logic in _fetch to make sure I am doing the right thing. Getting this right will help a lot with performance, although we may still want to differentiate between large (data) and small (metadata) reads in the fuse class.

gcsfs/core.py Outdated
new = _fetch_range(self.gcsfs.header, self.details,
                   self.end, end + self.blocksize)
self.end = end + self.blocksize
self.cache = self.cache + new
Contributor

Why do we always add self.blocksize to end? What if end - start > self.blocksize?

Member Author

The idea is to read past the currently requested data by a blocksize, so that further (small) reads will not need additional requests.

Contributor

We might do something like max(start + self.blocksize, end) so that we don't do this behavior on large reads.

Member Author

Sure, we could do that; however, the little extra may make little difference for longer reads.

Contributor

Fair enough
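
The read-ahead behaviour under discussion can be sketched roughly like this (a simplification of `_fetch`; the function name and the `max(...)` variant are illustrative, not the code as merged):

```python
def fetch_range_with_readahead(fetch, cache_end, start, end, blocksize):
    # Read past the requested end by up to one blocksize, so that
    # subsequent small reads are served from cache. For reads already
    # larger than a block, max() avoids fetching much extra.
    new_end = max(start + blocksize, end)
    if new_end <= cache_end:
        return cache_end, b""          # already cached, nothing to fetch
    data = fetch(cache_end, new_end)   # bytes [cache_end, new_end)
    return new_end, data

# toy backing store standing in for a GCS object
blob = bytes(range(256)) * 10
fetch = lambda a, b: blob[a:b]

# small read: fetches a whole block (512 bytes) despite asking for 100
cache_end, new = fetch_range_with_readahead(fetch, 0, 0, 100, 512)
print(cache_end, len(new))  # 512 512
```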

@martindurant
Member Author

Tests pass in direct mode, not with vcr.
Need to write fuse-specific tests and re-record test yamls.

@mrocklin
Contributor

mrocklin commented Jan 5, 2018

Looks like we treat the directory also as a file

mrocklin@carbon:~/workspace/gcsfs/tmp/newmann-met-ensemble-netcdf$ ls
ls: cannot access 'newmann-met-ensemble-netcdf': No such file or directory
conus_ens_001.nc  conus_ens_022.nc  conus_ens_043.nc  conus_ens_064.nc  conus_ens_085.nc
conus_ens_002.nc  conus_ens_023.nc  conus_ens_044.nc  conus_ens_065.nc  conus_ens_086.nc
conus_ens_003.nc  conus_ens_024.nc  conus_ens_045.nc  conus_ens_066.nc  conus_ens_087.nc
conus_ens_004.nc  conus_ens_025.nc  conus_ens_046.nc  conus_ens_067.nc  conus_ens_088.nc
conus_ens_005.nc  conus_ens_026.nc  conus_ens_047.nc  conus_ens_068.nc  conus_ens_089.nc
conus_ens_006.nc  conus_ens_027.nc  conus_ens_048.nc  conus_ens_069.nc  conus_ens_090.nc
conus_ens_007.nc  conus_ens_028.nc  conus_ens_049.nc  conus_ens_070.nc  conus_ens_091.nc
conus_ens_008.nc  conus_ens_029.nc  conus_ens_050.nc  conus_ens_071.nc  conus_ens_092.nc
conus_ens_009.nc  conus_ens_030.nc  conus_ens_051.nc  conus_ens_072.nc  conus_ens_093.nc
conus_ens_010.nc  conus_ens_031.nc  conus_ens_052.nc  conus_ens_073.nc  conus_ens_094.nc
conus_ens_011.nc  conus_ens_032.nc  conus_ens_053.nc  conus_ens_074.nc  conus_ens_095.nc
conus_ens_012.nc  conus_ens_033.nc  conus_ens_054.nc  conus_ens_075.nc  conus_ens_096.nc
conus_ens_013.nc  conus_ens_034.nc  conus_ens_055.nc  conus_ens_076.nc  conus_ens_097.nc
conus_ens_014.nc  conus_ens_035.nc  conus_ens_056.nc  conus_ens_077.nc  conus_ens_098.nc
conus_ens_015.nc  conus_ens_036.nc  conus_ens_057.nc  conus_ens_078.nc  conus_ens_099.nc
conus_ens_016.nc  conus_ens_037.nc  conus_ens_058.nc  conus_ens_079.nc  conus_ens_100.nc
conus_ens_017.nc  conus_ens_038.nc  conus_ens_059.nc  conus_ens_080.nc  conus_ens_mean.nc
conus_ens_018.nc  conus_ens_039.nc  conus_ens_060.nc  conus_ens_081.nc  newmann-met-ensemble-netcdf
conus_ens_019.nc  conus_ens_040.nc  conus_ens_061.nc  conus_ens_082.nc
conus_ens_020.nc  conus_ens_041.nc  conus_ens_062.nc  conus_ens_083.nc
conus_ens_021.nc  conus_ens_042.nc  conus_ens_063.nc  conus_ens_084.nc

@mrocklin
Contributor

mrocklin commented Jan 5, 2018

For reference, here are some times from the Anaconda Austin office

In [1]: import xarray as xa

In [2]: %time ds = xa.open_dataset('conus_ens_005.nc')
CPU times: user 32 ms, sys: 24 ms, total: 56 ms
Wall time: 14.4 s

In [3]: ds
Out[3]: 
<xarray.Dataset>
Dimensions:    (lat: 224, lon: 464, time: 12054)
Coordinates:
  * time       (time) datetime64[ns] 1980-01-01 1980-01-02 1980-01-03 ...
  * lon        (lon) float64 -124.9 -124.8 -124.7 -124.6 -124.4 -124.3 ...
  * lat        (lat) float64 25.06 25.19 25.31 25.44 25.56 25.69 25.81 25.94 ...
Data variables:
    elevation  (lat, lon) float64 ...
    pcp        (time, lat, lon) float64 ...
    t_mean     (time, lat, lon) float64 ...
    t_range    (time, lat, lon) float64 ...
    mask       (lat, lon) int32 ...
    t_max      (time, lat, lon) float64 ...
    t_min      (time, lat, lon) float64 ...
Attributes:
    history:                   Version 1.0 of ensemble dataset, created Decem...
    nco_openmp_thread_number:  1
    institution:               National Center for Atmospheric Research (NCAR...
    title:                     CONUS daily 12-km gridded ensemble precipitati...
    source:                    Generated using version 1.0 of CONUS ensemble ...
    references:                Newman et al. 2015: Gridded Ensemble Precipita...

In [4]: %time ds.t_mean[0]
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 2.09 ms
Out[4]: 
<xarray.DataArray 't_mean' (lat: 224, lon: 464)>
[103936 values with dtype=float64]
Coordinates:
    time     datetime64[ns] 1980-01-01
  * lon      (lon) float64 -124.9 -124.8 -124.7 -124.6 -124.4 -124.3 -124.2 ...
  * lat      (lat) float64 25.06 25.19 25.31 25.44 25.56 25.69 25.81 25.94 ...
Attributes:
    long_name:  Daily estimated mean temperature
    units:      degC

In [5]: %time ds.t_mean[0].data
CPU times: user 24 ms, sys: 4 ms, total: 28 ms
Wall time: 6.5 s
Out[5]: 
array([[         nan,          nan,          nan, ...,          nan,
                 nan,          nan],
       [         nan,          nan,          nan, ...,          nan,
                 nan,          nan],
       [         nan,          nan,          nan, ...,          nan,
                 nan,          nan],
       ..., 
       [ -7.07293558,  -8.94900703,  -7.75014973, ..., -20.14479828,
        -20.12835121, -18.99111557],
       [ -6.04429531,  -7.69851446,  -7.93106461, ..., -20.83728981,
        -20.99421883, -20.46924019],
       [ -3.97688127,  -4.74640989,  -5.86901665, ..., -20.96295929,
        -21.32979012, -21.29034042]])

In [6]: %time ds.t_mean[100].data
CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 3.38 s
Out[6]: 
array([[        nan,         nan,         nan, ...,         nan,
                nan,         nan],
       [        nan,         nan,         nan, ...,         nan,
                nan,         nan],
       [        nan,         nan,         nan, ...,         nan,
                nan,         nan],
       ..., 
       [-1.7363404 , -2.25378013, -0.83159602, ..., -1.26883531,
        -1.68508101, -1.71476936],
       [-0.02053847, -1.07118607, -0.70093918, ..., -1.00373149,
        -1.16537523, -1.74216485],
       [ 2.34900451,  2.06779528,  1.23093593, ..., -1.26731825,
        -1.45158088, -1.9439559 ]])

I'm also somewhat concerned about these NaNs. cc @jhamman

@martindurant
Member Author

The NaNs should be fine - there is a mask variable showing where you expect valid data.
I am not seeing the directory-file, and am getting much better timings; are you using the latest merged version, or perhaps am I seeing some good OS-level caching?

@mrocklin
Contributor

mrocklin commented Jan 5, 2018

It turns out that I wasn't fully updated. I just updated though and am getting similar results.

I plan to work on a small click-based CLI and push it up here

@martindurant
Member Author

Correction: it really matters which file you are looking at for the timings, because of the layout of the chunks that hdf5 accesses. Again, this suggests we probably want to cache multiple small chunks of a given file for speed.
Still, downloading 1MB of data or a 5-10MB standard block shouldn't take more than a couple of seconds, unless we are establishing new SSL connections each time.
Interesting that hdf5 seems to load metadata in 4k and 8k chunks, and data in 64k chunks.
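
Caching multiple small chunks of a given file, as suggested above, might look like a block-indexed cache (a sketch under assumed names, not the implementation in this PR):

```python
class BlockCache:
    """Cache fixed-size blocks of a remote file, fetched on demand."""

    def __init__(self, fetch, blocksize=64 * 1024):
        self.fetch = fetch          # fetch(start, end) -> bytes
        self.blocksize = blocksize
        self.blocks = {}            # block index -> bytes

    def read(self, start, end):
        bs = self.blocksize
        out = []
        # fetch every block touched by [start, end), keeping each for reuse
        for i in range(start // bs, (end - 1) // bs + 1):
            if i not in self.blocks:
                self.blocks[i] = self.fetch(i * bs, (i + 1) * bs)
            out.append(self.blocks[i])
        data = b"".join(out)
        offset = (start // bs) * bs
        return data[start - offset : end - offset]

blob = b"x" * 200000
cache = BlockCache(lambda a, b: blob[a:b])
assert cache.read(100, 200) == blob[100:200]   # second read of this range is free
```

This would serve hdf5's repeated 4k/8k metadata reads from memory while only ever requesting whole 64k blocks from the store.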

@martindurant
Member Author

@mrocklin , you are welcome to push a CLI here, but thoughts on how to sensibly test this are also welcome.
Also, you may want to experiment with the arguments to FUSE(): nothreads, foreground, raw_fi.

@martindurant
Member Author

Tests now pass following #62 - but there are no tests here of FUSE itself. I will try whether starting the interface in-process, with or without threads, works.
This remains experimental, but are there suggestions of further work that would be nice to see here?

@martindurant martindurant mentioned this pull request Jan 10, 2018
@martindurant
Member Author

@mrocklin , I believe you solved how to use FUSE on TravisCI?
(the test is not being run, above, because I have not yet added fusepy to the requirements for travis)

@mrocklin
Contributor

@martindurant I didn't solve FUSE on TravisCI. However I did solve fuse on a docker container, which is presumably harder.

I recommend running travis with sudo: true and sudo apt-get install libfuse-dev

@martindurant
Member Author

The simple test now passes on travis, thanks @mrocklin
I think this should be merged with documentation labelling it as "experimental" and listing caveats (read-only, incorrect permissions).
Anything else that should be tied up?

@mrocklin
Contributor

I haven't taken a deep look, but a cursory glance makes me pretty happy.

Martin Durant added 2 commits January 12, 2018 10:27
Implemented some writing in FUSE too.
@martindurant martindurant changed the title from WIP: Fuse to Fuse Jan 12, 2018
@martindurant
Member Author

@mrocklin : the code now passes around File objects explicitly, rather than using a hand-rolled cache. This might make things more efficient, but won't make anything worse, and I think is cleaner. Also, you can now write, including opening files in text mode from python (which is what the test does now).

fuse.FUSE(
    GCSFS(TEST_BUCKET, gcs=gcs), mountpath, nothreads=False,
    foreground=True))
th.daemon = True
Contributor

This approach works in Python 2 or Python 3. I recommend using it in both cases.

assert 'hello' in files
with open(os.path.join(mountpath, 'hello'), 'r') as f:
    # NB this is in TEXT mode
    assert f.read() == 'hello'
Contributor

It's nice to see this example in tests.

Is there a way to clean up the thread after this test? Not a big deal, just curious.

Member Author

I am not sure how to achieve that. Calling FUSE in the thread causes it to block, so join() would never return, and I don't see how to use an interrupt signal, which is what FUSE is expecting. That is the reason for daemon above, to make sure that the thread and the process do exit when the testing is done and the main thread ends.

Now the tests pass locally but not on travis, and I don't know why.
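
The daemon-thread arrangement described above can be sketched in isolation (using a stand-in blocking loop in place of the real fuse.FUSE call, which never returns):

```python
import threading
import time

def blocking_mount():
    # Stand-in for fuse.FUSE(...), which blocks until the mount is
    # interrupted; join() on this thread would never return.
    while True:
        time.sleep(0.1)

th = threading.Thread(target=blocking_mount)
th.daemon = True   # lets the process exit when the main thread ends
th.start()
assert th.daemon and th.is_alive()
# No th.join(): when the test process exits, the daemon thread dies with it.
```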

@@ -29,6 +29,7 @@ def test_fuse(token_restore):
    with open(os.path.join(mountpath, 'hello'), 'w') as f:
        # NB this is in TEXT mode
        f.write('hello')
    time.sleep(1)
Contributor

Typically I would do something like the following:

import os
from time import time, sleep

start = time()
while 'hello' not in os.listdir(mountpath):
    sleep(0.1)
    assert time() < start + 5

However this may not work well with VCR. In general I'm surprised that VCR wouldn't result in entirely equivalent results on travis-ci

Member Author

A good idea if this works, but I don't know the basic reason it's different. Maybe fuse on travis works differently somehow, perhaps laggy or with different caching.

Member Author

...but it still doesn't work :(

Member Author

Note that the test does pass in docker linux. Should that be good enough for merging?

@mrocklin
Contributor

mrocklin commented Jan 15, 2018 via email

@martindurant martindurant merged commit 720aea8 into fsspec:master Jan 15, 2018
@martindurant martindurant deleted the fuse branch January 15, 2018 17:49
@mrocklin
Contributor

mrocklin commented Jan 15, 2018 via email

hanseaston pushed a commit to hanseaston/gcsfs that referenced this pull request Aug 7, 2023