Using Parquet for large references #293

Closed
rsignell-usgs opened this issue Feb 3, 2023 · 43 comments

@rsignell-usgs
Collaborator

rsignell-usgs commented Feb 3, 2023

I've been testing out DFReferenceFileSystem, which loads the references from a collection of Parquet files (one Parquet file for metadata and one for each variable) instead of from a single JSON file.

Two questions:

  1. I understand the references are only loaded when a specific variable is requested, but then all the references for that variable are loaded, correct? (Or does it load just the references for the subset requested?)
  2. I tried loading some data from a specific time step into a DataArray, and then replaced that DataArray with data loaded from another variable at the same time step. Each time I switched the variable, I saw the memory usage increasing, and after changing the variable four times I blew out my 8GB of memory. Is it possible to purge the references from memory when you are no longer using them?
@martindurant
Member

Correct - at the moment, all the references for a variable are loaded on first access and kept around.

A purgeable cache would be very reasonable, as would be splitting each variable into reference chunks to allow even more partial loading. These are the principal benefits of @agoodm 's approach, which could/should be adopted.
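
For illustration only, a purgeable cache over blocks of references could be as simple as an LRU-cached loader (the function and block layout below are hypothetical, not existing ReferenceFileSystem API):

from functools import lru_cache

BLOCK_SIZE = 1000  # references per block (hypothetical)

@lru_cache(maxsize=128)
def load_reference_block(varname: str, block: int) -> dict:
    # Hypothetical loader: read only rows
    # [block * BLOCK_SIZE, (block + 1) * BLOCK_SIZE) of the on-disk reference
    # store for `varname` and return them as a {chunk_key: reference} dict.
    # The LRU cache evicts blocks that haven't been touched recently, so the
    # in-memory footprint stays bounded however many variables are accessed.
    raise NotImplementedError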

@agoodm
Contributor

agoodm commented Feb 3, 2023

@rsignell-usgs since we are on the subject, you can give my approach a try with your test dataset. The code is currently in this gist and you can use it as follows (making sure you also have the latest development version of fastparquet installed):

import fsspec
import xarray as xr

# make_parquet_store and ParquetReferenceMapper come from the gist linked above;
# refs is the original references JSON dict
make_parquet_store('refs_test', refs)
mapper = ParquetReferenceMapper('refs_test')
fs = fsspec.filesystem('reference', fo=mapper, remote_protocol='file')
ds = xr.open_zarr(fs.get_mapper(''))

This uses an LRU cache along with splitting the references into row groups (of 1000 references by default) to help keep the in-memory footprint down. More details are provided here.

@martindurant
Member

Thanks for the snippet! Also interested to see how well it fares.

It would also be interesting to see whether we can convert the individual files' JSON data and get an improvement in the memory footprint of the combine step - that would be a big win too.

@rsignell-usgs
Collaborator Author

@agoodm, I'm trying -- I just need to find a place to run with more memory!

@rsignell-usgs
Collaborator Author

Progress report: I created a large enough instance on our AWS JupyterHub to run the code, but I can't build the latest fastparquet from github/main because the JupyterHub doesn't have gcc. @martindurant says he can cut a dev release later today so that the build on conda-forge will trigger.

@rsignell-usgs
Collaborator Author

rsignell-usgs commented Feb 14, 2023

@agoodm, Martin released the new version of fastparquet on conda-forge and I was able to run your test. It spent about 10 minutes generating the files for each variable, but then died trying to generate the file for the crs variable, which I suppose is not surprising, considering this variable type/content:

refs['refs']['crs/.zarray']

'{"chunks":[],"compressor":null,"dtype":"|S1","fill_value":"IA==","filters":null,"order":"C","shape":[],"zarr_format":2}'

I was going to try just deleting the crs variable from the references, but I wasn't sure how to do that.
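
Maybe something like this would work (untested, and assuming the usual kerchunk layout where refs['refs'] maps zarr keys to reference entries)?

# Untested sketch: drop every reference key belonging to the crs variable
refs['refs'] = {k: v for k, v in refs['refs'].items()
                if k != 'crs' and not k.startswith('crs/')}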

The full reproducible notebook I tried is here: https://nbviewer.org/gist/2233712db4679628d2d67769b9cc5937

I moved the references JSON file to an Open Storage Network pod, so it's available with no egress fees and no credentials needed!

@agoodm
Contributor

agoodm commented Feb 14, 2023

@rsignell-usgs Thank you so much for testing this! I have my own potential use-case which has a full reference set of similar size to the one you are using, so these results are of great importance to me too.

It looks like my code wasn't taking into account the possibility of variables with an empty chunk shape, so I have updated the gist (https://gist.github.com/agoodm/25d41ce0c47cd714271be66d0db0459d). I am wondering whether this was even intended behavior when the original references file was generated, since the value of crs is just an empty string.

Before you attempt to run this again, I have a few more tips / suggestions:

First, I just remembered that fastparquet doesn't have compression on by default, so when making the parquet files be sure to set the compression, e.g.:

make_parquet_store('refs_test', refs, compression='zstd')

This reduces the size of the references on disk by a factor of 10 compared to the original JSON. Martin and I were discussing how we could possibly improve this further by using categorical data, as the DFReferenceFileSystem implementation does.
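
As a rough illustration of why categoricals help (a toy comparison, not the actual DFReferenceFileSystem code): repeated URL strings collapse to one copy of each unique string plus small integer codes:

import pandas as pd

# Toy example: many chunk references pointing at a handful of files
paths = pd.Series([f's3://bucket/file_{i % 10}.nc' for i in range(100_000)])
cat = paths.astype('category')

print(paths.memory_usage(deep=True))  # every string stored as a separate object
print(cat.memory_usage(deep=True))    # 10 unique strings + small integer codes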

Next, I highly recommend adding a .zmetadata key to the store (saved in refs_test/.zmetadata), since opening the dataset in xarray will take an especially long time without it. Access times for each chunk key are going to be slower with the parquet references than with the original JSON, which will be noticeable when iterating over the entire store. To save you some time, I have attached the file here: https://gist.github.com/agoodm/4e1512b9655dbcfc3b1db17ea8e4020d
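
If you would rather build it yourself than download the attachment, a rough sketch (assuming the metadata values in refs['refs'] are JSON strings, as kerchunk normally writes them) would be:

import json

# Collect all zarr metadata keys from the original reference dict into a
# consolidated-metadata document and write it into the parquet store directory
meta = {
    'zarr_consolidated_format': 1,
    'metadata': {
        k: json.loads(v)
        for k, v in refs['refs'].items()
        if k.endswith(('.zarray', '.zattrs', '.zgroup'))
    },
}
with open('refs_test/.zmetadata', 'w') as f:
    json.dump(meta, f)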

Finally since your data is on S3 you need to change your fsspec options:

mapper = ParquetReferenceMapper('refs_test')
fs = fsspec.filesystem('reference', fo=mapper, remote_protocol='s3', remote_options=dict(anon=True))

With all of these changes I was able to load the dataset and it seems to be working reasonably well:

In [6]: ds = xr.open_zarr(fs.get_mapper(''))

In [7]: %time ds.isel(time=0).ACCET.values
CPU times: user 5.8 s, sys: 1.05 s, total: 6.85 s
Wall time: 7.01 s
Out[7]: 
array([[ nan,  nan,  nan, ..., 0.  , 0.  , 0.  ],
       [ nan,  nan,  nan, ..., 0.  , 0.  , 0.  ],
       [ nan,  nan,  nan, ..., 0.01, 0.  , 0.  ],
       ...,
       [ nan,  nan,  nan, ..., 0.  , 0.  , 0.  ],
       [ nan,  nan,  nan, ..., 0.  , 0.  , 0.  ],
       [ nan,  nan,  nan, ..., 0.  , 0.  , 0.  ]])

@rsignell-usgs
Collaborator Author

@agoodm , yep that worked just fine!
https://nbviewer.org/gist/d71a0eb5afeb801ae5df8431e0dbc293
I'll start doing a few tests to see how access time/memory requirements vary between the two approaches.

@rsignell-usgs
Collaborator Author

rsignell-usgs commented Feb 15, 2023

@agoodm, I copied the zarr/parquet references to the Open Storage Network, thinking that someone could then just use the references from the cloud without generating them locally.

I'm trying to get this to work. Can you help?
https://nbviewer.org/gist/3201f41408cb2c844a918fa39079d85c

@rsignell-usgs
Collaborator Author

rsignell-usgs commented Feb 15, 2023

For the National Water Model reanalysis 1km grid, Alex's method takes a bit longer to write the references, and the references are a bit bigger, but the access speeds from refs stored locally are awesome. I haven't been able to figure out how to access the refs for Alex's method from a bucket yet, so waiting on that...

                                         DFReferenceFileSystem approach    agoodm approach
time to parse 9GB references JSON        5min 24s                          5min 25s
time to write refs                       7min 40s                          11min 41s
size of refs                             492MB                             756MB
local refs, 1st read of var              10s                               1s
local refs, 2nd read of same var         6s                                1s
cloud (OSN) refs, 1st read of var        50s                               -- s
cloud (OSN) refs, 2nd read of same var   5s                                -- s

@martindurant
Member

That seems pretty conclusive :|

So let's merge the two efforts:

  • enable the refmapper to use remote FSs (should be easy)
  • add some of the space saving tricks

It leaves two features out:

  • passing additional columns to any decoding (or even checksum verification) we want to do on a per-chunk basis (planned but not implemented in ReferenceFS)
  • allowing multiple references per key, to be combined with a given function (requirement from preffs)

@agoodm , do you see any way to get those too?

@agoodm
Contributor

agoodm commented Feb 15, 2023

Regarding your first two points, I have just updated the reference mapper to work with references on any fsspec supported FS (https://gist.github.com/agoodm/25d41ce0c47cd714271be66d0db0459d):

mapper = ParquetReferenceMapper(lazy_refs, fs=fs)

With these changes I was able to run @rsignell-usgs's example loading the references from S3, though maybe this could be cleaned up a bit more. I'll go ahead and try to incorporate the additional space-saving tricks from the DFReferenceFileSystem implementation next. I am guessing the main reason for the longer write times is the large number of row groups in the parquet files, which I have set to a size of 1000 by default since this gives fast first load times.

Do you have some concrete examples (perhaps of some references files) that have these other two more advanced features? The latter one doesn't sound too difficult to support (if I am understanding it correctly, this would mean multiple lists of paths and byte ranges to the data files and/or b64 strings assigned to one zarr chunk key?). For the former, are you referring to this option in DFReferenceFilesystem?

@rsignell-usgs
Collaborator Author

rsignell-usgs commented Feb 16, 2023

With these changes I was able to run @rsignell-usgs's example loading the references from S3 though maybe this could be cleaned up a bit more.

Tried reading from remote refs also: https://nbviewer.org/gist/067827019f247ddfa1f614eff17f8d0e
Wow, that's impressive!

@martindurant
Member

The number of references to store per row group could be a tuneable knob to trade-off:

  • bigger means better initialization time (less metadata to read) and faster throughput when reading a lot of the data
  • smaller means faster random access and smaller memory footprint (although the size of the LRU cache is also a parameter)

I do think that storing the data in typed columns rather than JSON blobs should be optimal, especially as it allows for categoricals on the URLs.

I'll work on trying to combine the two approaches next week. As for what to do with preffs-style multiple refs, I'm not sure yet.

@agoodm
Contributor

agoodm commented Feb 16, 2023

That sounds good to me. Would it be alright if I move the code I have implemented so far into a PR or two to track further progress? The code for making the parquet files would go here of course but I take it the mapper code should be in the main fsspec repo? I am also thinking it may be desirable to just keep it more invisible to end users and have the constructor to ReferenceFilesystem wrap it automatically if the input is a directory.

As for typed and categorical columns, I should have some time today or tonight to give this a try myself; I'm definitely curious to see how much of an improvement that will make in the compression ratio when applied to Rich's reference set.

I see #254 has an example of what you were mentioning as a use-case for per-column/chunk decoding. I also took a look at the preffs example and do see now why that might be tricky to adapt with the reference mapper as it is currently implemented. If the references are kept as JSON blobs this wouldn't be a problem, since multiple references can just be included as a list, but then the categorical trick won't work. Conversely, using separate rows would break the current assumption the mapper makes of having one zarr chunk per row, which eliminates the need to explicitly store key names and makes it quick and easy to look up the correct row group and row number from the chunk key. I'll have to see if there is a more elegant solution I can come up with.
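
Roughly, the lookup that assumption buys us looks like this (the names and fixed row-group size here are just illustrative, not the exact gist code):

import numpy as np

def key_to_location(chunk_key, chunk_grid_shape, rows_per_group=1000):
    # Illustrative only: map a zarr chunk key like "ACCET/12.0.3" to a
    # (row_group, row) position, assuming references are stored in C order
    # over the variable's chunk grid with a fixed number of rows per group.
    chunk_index = [int(i) for i in chunk_key.split('/')[-1].split('.')]
    flat = int(np.ravel_multi_index(chunk_index, chunk_grid_shape))
    return flat // rows_per_group, flat % rows_per_group

key_to_location('ACCET/12.0.3', (100, 5, 5))  # -> (0, 303)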

@martindurant
Member

Indeed, creating parquet should definitely live here, and so far the FS implementation has lived in fsspec. There is an argument that it is too niche and experimental and should also live here or in its own package...

how much of an improvement that will make in the compression ratio

It will make a bigger difference to in-memory size (because zstd should be doing a good job finding repeated strings, but they get expanded into independent objects on load). This is also true for integers, which are 8 bytes in memory as an array, but 28 bytes for a python int.
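
A quick check of that, for instance:

import sys
import numpy as np

a = np.arange(1_000_000, dtype='int64')
print(a.nbytes // len(a))         # 8 bytes per element in the array
print(sys.getsizeof(int(a[-1])))  # 28 bytes for the same value as a Python int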

it may be desirable to just keep it more invisible to end users and have the constructor to ReferenceFilesystem wrap it automatically if the input is a directory

Yes indeed, once everything is working smoothly.

the problem of using separate rows

Yes, agree with all you say. Typed columns and categoricals should be a decent win; maybe we can have a choice of what format to use per row group, depending on whether we have/allow multi-refs or not.

@rsignell-usgs
Collaborator Author

rsignell-usgs commented Feb 16, 2023

it may be desirable to just keep it more invisible to end users and have the constructor to ReferenceFilesystem wrap it automatically if the input is a directory

And also for allowing it to work with intake, right?

@martindurant
Member

And also for allowing it to work with intake

You can bet on it

@agoodm
Contributor

agoodm commented Feb 20, 2023

@martindurant @rsignell-usgs I have just tested generating the parquet files with separate typed columns instead of raw JSON blobs; that change alone cuts the file sizes nearly in half. I tried saving the paths as categoricals too, but this instead made the compression much worse. So I think it should be fine to just wrap the paths up as categoricals upon loading the references. I'll incorporate these changes into the PRs later today.

I have given the multiple-refs question some more thought and am thinking perhaps we could just use MultiIndex columns, where the number of sublevels corresponds to the maximum number of references found for a single key in the entire set. Something like this:

import numpy as np
import pandas as pd

columns = [
    np.array(['path', 'path', 'offset', 'offset', 'size', 'size', 'raw', 'raw']),
    np.array(['0', '1', '0', '1', '0', '1', '0', '1'])
]
df = pd.DataFrame([['a.dat', 'b.dat', 123, 12, 12, 17, None, None],
                   ['b', None, 0, 0, 0, 0, b'data', None]],
                  columns=columns)
df
df
Out[68]: 
    path        offset     size       raw      
       0      1      0   1    0   1     0     1
0  a.dat  b.dat    123  12   12  17  None  None
1      b   None      0   0    0   0  b'data'  None

@rsignell-usgs
Collaborator Author

If you cut the size in half you will be at the same level as the @martindurant variable-refs-per-parquet approach! Nice! I'm hoping Martin will chime in on the multiple refs issue as I don't really understand it.

@martindurant
Member

I didn't immediately understand the multiref description there either - I'll come back to it. The use of pandas multiindex is probably a bad idea, though, because it neither maps clearly to parquet nor is at all performant.

@agoodm
Contributor

agoodm commented Feb 22, 2023

Apologies for any confusion in the description of my multirefs example. Basically the idea was to show the same dataframe from the preffs README and allow multiple references to be mapped to a single key by having as many copies of each column as the maximum number of possible refs per key. I thought of a MultiIndex being one possible way to do that, but if that is inefficient or poorly supported by parquet, perhaps we could just explicitly have separate columns for each possible value (e.g. "path.0", "path.1")? The HDF-EOS file format has a convention like that for splitting up core metadata into multiple attributes.
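
For concreteness, the same toy rows as my earlier example but with flat column names that map directly onto parquet columns (just a sketch, nothing implemented yet):

import pandas as pd

df = pd.DataFrame([
    {'path.0': 'a.dat', 'path.1': 'b.dat', 'offset.0': 123, 'offset.1': 12,
     'size.0': 12, 'size.1': 17, 'raw.0': None, 'raw.1': None},
    {'path.0': 'b', 'path.1': None, 'offset.0': 0, 'offset.1': 0,
     'size.0': 0, 'size.1': 0, 'raw.0': b'data', 'raw.1': None},
])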

Leaving that aside, I have implemented all the other necessary changes in both PRs, including typed columns and wrapping the mapper directly inside the filesystem instantiation. I did a test run with Rich's NWM retrospective reference set and indeed found the file size was 449 MB. It seems to be working well:

In [1]: import fsspec

In [2]: fs = fsspec.filesystem('reference', fo='refs_noaa', remote_protocol='s3', remote_options=dict(anon=True))

In [3]: import xarray as xr

In [4]: ds = xr.open_dataset(fs.get_mapper(''), engine='zarr')

In [5]: ds
Out[5]: 
<xarray.Dataset>
Dimensions:   (time: 116631, y: 3840, x: 4608, vis_nir: 2, soil_layers_stag: 4)
Coordinates:
  * time      (time) datetime64[ns] 1979-02-01T03:00:00 ... 2020-12-31T21:00:00
  * x         (x) float64 -2.303e+06 -2.302e+06 ... 2.303e+06 2.304e+06
  * y         (y) float64 -1.92e+06 -1.919e+06 ... 1.918e+06 1.919e+06
Dimensions without coordinates: vis_nir, soil_layers_stag
Data variables: (12/21)
    ACCET     (time, y, x) float64 ...
    ACSNOM    (time, y, x) float64 ...
    ALBEDO    (time, y, x) float64 ...
    ALBSND    (time, y, vis_nir, x) float64 ...
    ALBSNI    (time, y, vis_nir, x) float64 ...
    COSZ      (time, y, x) float64 ...
    ...        ...
    SNOWH     (time, y, x) float64 ...
    SOIL_M    (time, y, soil_layers_stag, x) float64 ...
    SOIL_W    (time, y, soil_layers_stag, x) float64 ...
    TRAD      (time, y, x) float64 ...
    UGDRNOFF  (time, y, x) float64 ...
    crs       object ...
Attributes:
    Conventions:                CF-1.6
    GDAL_DataType:              Generic
    TITLE:                      OUTPUT FROM WRF-Hydro v5.2.0-beta2
    code_version:               v5.2.0-beta2
    model_configuration:        retrospective
    model_initialization_time:  1979-02-01_00:00:00
    model_output_type:          land
    model_output_valid_time:    1979-02-01_03:00:00
    model_total_valid_times:    472
    proj4:                      +proj=lcc +units=m +a=6370000.0 +b=6370000.0 ...

In [6]: ds.isel(time=0).ACCET.compute()
Out[6]: 
<xarray.DataArray 'ACCET' (y: 3840, x: 4608)>
array([[ nan,  nan,  nan, ..., 0.  , 0.  , 0.  ],
       [ nan,  nan,  nan, ..., 0.  , 0.  , 0.  ],
       [ nan,  nan,  nan, ..., 0.01, 0.  , 0.  ],
       ...,
       [ nan,  nan,  nan, ..., 0.  , 0.  , 0.  ],
       [ nan,  nan,  nan, ..., 0.  , 0.  , 0.  ],
       [ nan,  nan,  nan, ..., 0.  , 0.  , 0.  ]])
Coordinates:
    time     datetime64[ns] 1979-02-01T03:00:00
  * x        (x) float64 -2.303e+06 -2.302e+06 ... 2.303e+06 2.304e+06
  * y        (y) float64 -1.92e+06 -1.919e+06 -1.918e+06 ... 1.918e+06 1.919e+06
Attributes:
    esri_pe_string:  PROJCS["Lambert_Conformal_Conic",GEOGCS["GCS_Sphere",DAT...
    grid_mapping:    crs
    long_name:       Accumulated total ET
    units:           mm
    valid_range:     [-100000, 100000000]

Other things to note:

  • References are loaded directly as numpy arrays (and paths optionally wrapped as pd.Categorical) instead of being loaded into a dataframe, since this has quite a bit less overhead. I had to use some fairly low-level API calls to do this with fastparquet; maybe there is a better way?
  • I noticed that fsspec doesn't currently require numpy/pandas, so I moved the imports for those inside LazyReferenceMapper.

Please go ahead and test the changes out and review when you have the chance.

@rsignell-usgs
Collaborator Author

@agoodm thanks for continuing to push on this! I know @martindurant is super busy this week, but after this gets merged are you interested in helping write a blog post about this new capability? I started writing one about the original parquet-file-per-variable approach -- maybe we can just adapt it for this new approach? You could be the author...

@agoodm
Contributor

agoodm commented Feb 22, 2023

Absolutely, I would be happy to. Feel free to send me your current draft when you are ready and we can work from there.

@rsignell-usgs
Collaborator Author

rsignell-usgs commented Mar 1, 2023

@agoodm sorry this slipped off my radar. For the draft blog post, please request access to this google doc: https://docs.google.com/document/d/1qSpI24kjXz15bRJ0Deqe0-F6eraJUtuHYrt-UfkjSBM/edit?usp=sharing

@rsignell-usgs
Collaborator Author

@agoodm, also, is there updated code (or a notebook) for converting the massive JSON into the parquet/zarr reference files?

@agoodm
Contributor

agoodm commented Mar 1, 2023

I have not kept the original gist up to date, but the code is now implemented as refs_to_dataframe in #298, see:

https://github.com/fsspec/kerchunk/blob/d32224b74aa2627921af17b4b9b423bc9f3387f7/kerchunk/df.py

Then just call it with

refs_to_dataframe(refs, 'refs_test')

It should run a lot more quickly than before.

@rsignell-usgs
Collaborator Author

rsignell-usgs commented Mar 13, 2023

@agoodm, how to proceed on the proposed blog post?

Would you like me to take a stab at modifying the text to describe the new and improved best practice?

@agoodm
Contributor

agoodm commented Mar 13, 2023

Before you go ahead and do so, I should note that we have changed the approach a little: we now save multiple small parquet files for each variable rather than one larger parquet file with multiple row groups, in order to store the URLs categorically without blowing up the file size (see discussion here). The updated refs_to_dataframe() is now merged into kerchunk as of today, while the necessary changes to RefFS in fsspec (including the LazyReferenceMapper) are still undergoing review here.

I would suggest updating your benchmarks to use the latest code from both places to verify that everything still works as expected and see how the results changed. At the very least, the file sizes and time to write out the refs should be improved pretty significantly compared to before. If you pull the changes you can skip explicitly using the LazyReferenceMapper and just pass the root directory of the parquet refs to RefFS like you normally would with JSON/dict references, and it should take care of the rest (see the snippet below).

You may also be interested in the additional benchmark I just posted in that PR today, which showcases the improvements in total access time and memory usage for iterating over the entire NWM retrospective reference set (which may also be relevant for kerchunk.combine workflows).
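
That is, something along these lines (the paths and options here are just placeholders matching your earlier examples):

import fsspec
import xarray as xr

# Pass the root directory of the parquet refs straight to the reference
# filesystem - no explicit LazyReferenceMapper needed with the latest branch
fs = fsspec.filesystem('reference', fo='refs_test',
                       remote_protocol='s3', remote_options=dict(anon=True))
ds = xr.open_dataset(fs.get_mapper(''), engine='zarr', chunks={})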

In the meantime I can go ahead and do a quick lookover of the text tonight or tomorrow and add some changes and comments. Sound good?

@rsignell-usgs
Collaborator Author

rsignell-usgs commented Mar 14, 2023

This all sounds great! I asked because people are asking when the blog post will come out -- a bunch of folks will probably jump on this as soon as they understand it! I can hopefully get this done over the next couple of days.

@rsignell-usgs
Collaborator Author

rsignell-usgs commented Mar 28, 2023

guys, okay swinging back to this again....

In the last few cells of this notebook I'm trying to convert the big JSON on s3 to parquet on s3 using the new approach.

OBVI I don't have the syntax quite right:

refs_to_dataframe(fs_json.open(refs), fs_json.open('s3://esip-qhub/noaa/nwm/grid1km/parquet/refs_test', mode='wb'))

bombs out with: ValueError: File not in read mode.

@martindurant
Member

    refs: str | dict
        Location of a JSON file containing references or a reference set already loaded
        into memory.
    url: str
        Location for the output, together with protocol. This must be a writable
        directory.

The input (first arg) should be either a string URL or a dict of references (i.e., already decoded). The output should always be a string, and is the directory name, not the name of any file.

For the input, I notice that although it can be remote, there is currently no way to pass parameters for opening it, which we should definitely add.
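
So for now, something along these lines should work (reusing the fs_json and refs names from your notebook; the output URL is just the directory you were writing to):

import json
from kerchunk.df import refs_to_dataframe

# Read and decode the JSON yourself, then pass the dict, since there is not
# yet a way to pass storage options for a remote input URL
with fs_json.open(refs) as f:
    ref_dict = json.load(f)

# the second argument is the output *directory* URL, not an open file
refs_to_dataframe(ref_dict, 's3://esip-qhub/noaa/nwm/grid1km/parquet/refs_test')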

@rsignell-usgs
Collaborator Author

rsignell-usgs commented Mar 28, 2023

Getting closer:

# references are on an OSN pod (no credentials needed)
url = 's3://rsignellbucket2/noaa/nwm/grid1km/refs/'

target_opts = {'anon':True, 'skip_instance_cache':True,
              'client_kwargs': {'endpoint_url': 'https://renc.osn.xsede.org'}}

# netcdf files are on the AWS public dataset program (no credentials needed)
remote_opts = {'anon':True}

fs = fsspec.filesystem("reference", fo=url, 
                       remote_protocol='s3', remote_options=remote_opts,
                      target_options=target_opts)
m = fs.get_mapper("")

ds = xr.open_dataset(m, engine='zarr', chunks={})

bombs with

TypeError: slice indices must be integers or None or have an __index__ method

@martindurant
Member

With the following setup

!pip install git+https://github.com/fsspec/kerchunk
!pip install git+https://github.com/martindurant/filesystem_spec.git@ref_mode
!pip install fastparquet

your code worked ok for me

<xarray.Dataset>
Dimensions:   (time: 116631, y: 3840, x: 4608, vis_nir: 2, soil_layers_stag: 4)
Coordinates:
  * time      (time) datetime64[ns] 1979-02-01T03:00:00 ... 2020-12-31T21:00:00
  * x         (x) float64 -2.303e+06 -2.302e+06 ... 2.303e+06 2.304e+06
  * y         (y) float64 -1.92e+06 -1.919e+06 ... 1.918e+06 1.919e+06
Dimensions without coordinates: vis_nir, soil_layers_stag
Data variables: (12/21)
    ACCET     (time, y, x) float64 dask.array<chunksize=(1, 768, 922), meta=np.ndarray>
    ACSNOM    (time, y, x) float64 dask.array<chunksize=(1, 768, 922), meta=np.ndarray>
    ALBEDO    (time, y, x) float64 dask.array<chunksize=(1, 768, 922), meta=np.ndarray>
    ALBSND    (time, y, vis_nir, x) float64 dask.array<chunksize=(1, 960, 1, 1152), meta=np.ndarray>
    ALBSNI    (time, y, vis_nir, x) float64 dask.array<chunksize=(1, 960, 1, 1152), meta=np.ndarray>
    COSZ      (time, y, x) float64 dask.array<chunksize=(1, 768, 922), meta=np.ndarray>
    ...        ...
    SNOWH     (time, y, x) float64 dask.array<chunksize=(1, 768, 922), meta=np.ndarray>
    SOIL_M    (time, y, soil_layers_stag, x) float64 dask.array<chunksize=(1, 768, 1, 922), meta=np.ndarray>
    SOIL_W    (time, y, soil_layers_stag, x) float64 dask.array<chunksize=(1, 768, 1, 922), meta=np.ndarray>
    TRAD      (time, y, x) float64 dask.array<chunksize=(1, 768, 922), meta=np.ndarray>
    UGDRNOFF  (time, y, x) float64 dask.array<chunksize=(1, 768, 922), meta=np.ndarray>
    crs       object ...
Attributes:
    Conventions:                CF-1.6
    GDAL_DataType:              Generic
    TITLE:                      OUTPUT FROM WRF-Hydro v5.2.0-beta2
    code_version:               v5.2.0-beta2
    model_configuration:        retrospective
    model_initialization_time:  1979-02-01_00:00:00
    model_output_type:          land
    model_output_valid_time:    1979-02-01_03:00:00
    model_total_valid_times:    472
    proj4:                      +proj=lcc +units=m +a=6370000.0 +b=6370000.0 ..

@martindurant
Member

I made those changes into fsspec/filesystem_spec#1224

@martindurant
Member

(but something is wrong with CI :|)

@rsignell-usgs
Collaborator Author

rsignell-usgs commented Mar 28, 2023

Ah, right, I forgot to use those versions of fsspec and fastparquet.

Success!!! : https://nbviewer.org/gist/rsignell-usgs/4a482cbcfebf84387a89d1ef0ca6e3bf

@rsignell-usgs changed the title from "DFReferenceFileSystem questions" to "Using Parquet for large references" on Mar 28, 2023
@rsignell-usgs reopened this on Mar 28, 2023
@rsignell-usgs
Collaborator Author

I think we should wait until these new package versions land on conda-forge before we release the blog post. But I can start working on updating it now...

@agoodm
Contributor

agoodm commented Mar 28, 2023

Good to hear you were able to get things working again. In the meantime I can make a quick update to refs_to_dataframe to allow for passing urls to remote json references file instead of currently requiring an object.

@martindurant
Member

to allow for passing urls to remote json references file instead of currently requiring an object

It does allow a URL, but doesn't take the typical target_kwargs for opening it with fsspec

@agoodm
Contributor

agoodm commented Mar 28, 2023

to allow for passing urls to remote json references file instead of currently requiring an object

It does allow a URL, but doesn't take the typical target_kwargs for opening it with fsspec

No, I think I overlooked this when writing the docstring. It says strings are accepted but the actual code that follows immediately assumes a mapping object:

kerchunk/kerchunk/df.py, lines 135 to 142 at 33b00d6:

if "refs" in refs:
    refs = refs["refs"]
fs, _ = fsspec.core.url_to_fs(url)
fs.makedirs(url, exist_ok=True)
fields = get_variables(refs, consolidated=True)
# write into .zmetadata at top level, one fewer read on access
refs[".zmetadata"]["record_size"] = record_size

@martindurant
Member

OK :|

@martindurant
Member

Will close this now that the PR in fsspec is merged.
