potential performance improvements for GRIB files #127

Open
d70-t opened this issue Feb 21, 2022 · 18 comments

@d70-t

d70-t commented Feb 21, 2022

I've been playing around a bit with reading GRIB files, but quickly ran into the performance impact of the temporary files created by kerchunk/grib2.py, so I tried to find ways around this. As far as I understand it so far, cfgrib requires access to entire files and also requires some file API, while eccodes is happy with in-memory grib messages as well. So I tried to read grib files using mostly eccodes, circumventing cfgrib where possible, which is orders of magnitude faster than the current method implemented in kerchunk; sadly, it doesn't do all the magic cfgrib does in assembling proper datasets in all cases. This lack of generality is the reason why I'm not proposing a PR (yet?), but rather seeking further ideas on the topic:

  • Do others work on this as well?
  • Do you have ideas on how to do the dataset assembly more generically?

Here's how I'd implement the "decompression", which I believe is relatively generic (but may still be incompatible with what the current kerchunk-grib does):

import eccodes
import numcodecs
from numcodecs.compat import ndarray_copy, ensure_contiguous_ndarray

class RawGribCodec(numcodecs.abc.Codec):
    """Numcodecs codec whose stored form is a single raw GRIB message."""

    codec_id = "rawgrib"

    def encode(self, buf):
        # the encoded representation is the raw GRIB message itself
        return buf

    def decode(self, buf, out=None):
        # eccodes can parse the message straight from memory, no file needed
        buf = ensure_contiguous_ndarray(buf)
        mid = eccodes.codes_new_from_message(bytes(buf))
        try:
            data = eccodes.codes_get_array(mid, "values")
        finally:
            eccodes.codes_release(mid)

        # newer eccodes versions may return a lazy wrapper; materialize it
        if hasattr(data, "build_array"):
            data = data.build_array()

        if out is not None:
            return ndarray_copy(data, out)
        return data

This gist shows how it may be possible to scan GRIB files without the need for temporary files.
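
For illustration, here is a minimal sketch of that idea, assuming GRIB edition 2 (whose 16-byte indicator section stores the total message length as an unsigned 64-bit big-endian integer at offset 8); the file name is a placeholder:

import struct

# Scan a GRIB2 buffer for message (offset, length) pairs without any
# temporary file, then decode each message with the codec defined above.
def scan_grib2(data: bytes):
    pos = 0
    while (pos := data.find(b"GRIB", pos)) != -1:
        # indicator section: "GRIB" (4) + reserved (2) + discipline (1)
        # + edition (1) + total message length (8 bytes, big-endian)
        (length,) = struct.unpack_from(">Q", data, pos + 8)
        yield pos, length
        pos += length

with open("example.grib2", "rb") as f:
    data = f.read()

for offset, length in scan_grib2(data):
    values = RawGribCodec().decode(data[offset:offset + length])
    print(offset, length, values.shape)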

@martindurant
Member

@TomAugspurger, can you please link your "cogrib" experiments here? It indeed scans files without first downloading them, and we plan to upstream it here. I'm not sure whether it answers all your points, @d70-t; perhaps you have gone further.

Aside from this, it should also be possible to peek into the binary layout of the data and directly find the buffers representing the main array of each message (see the sketch after this list). This assumes we can understand the encoding, which is very likely. This would allow:

  • somewhat smaller downloads on read (the main array normally dominates a message's size)
  • no need to call cfgrib (or eccodes) to interpret the array, and no need to create the codec. We may need a different codec, depending on how the array is actually encoded.
  • no creation of coordinate arrays for every message read. This is pretty fast, but can cause a big memory spike in eccodes and is wholly redundant.
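
For example, here is a hedged sketch of walking a single GRIB2 message's sections to locate the packed data (section 7): every section after the 16-byte indicator section starts with a 4-byte big-endian length and a 1-byte section number, and "7777" marks the end of the message.

import struct

def find_data_section(message: bytes):
    # skip the fixed-size indicator section (section 0)
    pos = 16
    while message[pos:pos + 4] != b"7777":
        (sec_len,) = struct.unpack_from(">I", message, pos)
        sec_num = message[pos + 4]
        if sec_num == 7:
            # the packed payload starts after the 5-byte section header
            return pos + 5, sec_len - 5
        pos += sec_len
    return None  # no data section found

How the payload itself is packed (simple packing, JPEG2000, PNG, ...) is declared by the data representation template in section 5, which is what would determine the codec to use.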

@TomAugspurger
Contributor

That's at https://github.com/TomAugspurger/cogrib.

“It indeed scans files without first downloading them, and we plan to upstream it here.”

That cogrib experiment does need to download the whole file while it's being "kerchunked". Users accessing it through fsspec's reference filesystem don't need to download it, and it doesn't need a temporary file.

It'd be nice to avoid the temporary file for scanning too, but one of my desires was to match the output of cfgrib.

@d70-t
Author

d70-t commented Feb 21, 2022

cogrib looks very nice 👍

And yes, the issue of cfgrib compatibility is what bothers me most in my current attempt as well (I chose to drop compatibility for speed). I'd really hope we can figure out a way to do both: no temporary files and cfgrib compatibility.

@martindurant
Member

Actually, can you please enlighten me as to what "compatibility" means here? I thought cfgrib was a pretty thin wrapper around eccodes.

@d70-t
Author

d70-t commented Feb 21, 2022

As far as I understand GRIB (I'm really bad at this), GRIB doesn't know about dimensions and coordinates that are shared between arrays. GRIB files consist of messages (which are chunks plus per-chunk metadata), and nothing is shared between those messages. cfgrib guesses how to assemble those messages into a Dataset based on what it finds among the per-message metadata.
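
To make that concrete, here is a minimal sketch (with a placeholder file name) of the per-message keys such guessing is based on; these are standard eccodes keys:

import eccodes

# print, for each message, a few of the metadata keys that assembly
# heuristics typically group and sort by
with open("example.grib2", "rb") as f:
    while (mid := eccodes.codes_grib_new_from_file(f)) is not None:
        try:
            for key in ("shortName", "typeOfLevel", "level",
                        "validityDate", "validityTime"):
                print(key, eccodes.codes_get(mid, key))
        finally:
            eccodes.codes_release(mid)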

@d70-t
Author

d70-t commented Feb 21, 2022

As always with guessing, there are multiple options for how you might want to do this and which conventions should be followed, so when rolling this guesswork on your own, you might end up with something different.

@TomAugspurger
Contributor

That matches what I mean by compatibility too. The output of kerchunking a GRIB file should be a list of datasets equal to what you get from cfgrib.open_datasets(file). I'll want to stretch that definition a bit to handle concatenating data from many GRIB files along time, but the basic idea is that I don't want to guess how to assemble messages into datasets.
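
(For reference, that compatibility target in code; cfgrib.open_datasets is the real cfgrib API, and the path is a placeholder:)

import cfgrib

# one xarray.Dataset per hypercube that cfgrib's heuristics can assemble
datasets = cfgrib.open_datasets("example.grib2")
for ds in datasets:
    print(list(ds.data_vars))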

@martindurant
Member

martindurant commented Feb 21, 2022

From working previously with GRIBs, I also want to add that for some files you cannot use open_datasets without supplying appropriate filters, because of coordinate mismatches between messages.

@TomAugspurger
Contributor

Do you mean open_datasets (plural) or open_dataset (singular)? I don't think I've run into files where open_datasets fails, but I haven't tried on too many different types of files.

@martindurant
Member

Yes, the singular, open_dataset.
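
For anyone following along, supplying such a filter looks roughly like this; filter_by_keys is cfgrib's real option, while the level type chosen here is just an arbitrary example of restricting to a consistent subset of messages:

import xarray as xr

# open only the messages whose coordinates are mutually consistent
ds = xr.open_dataset(
    "example.grib2",
    engine="cfgrib",
    backend_kwargs={"filter_by_keys": {"typeOfLevel": "isobaricInhPa"}},
)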

@d70-t
Author

d70-t commented May 19, 2022

We've been working a bit more on our gribscan, which is now also available at gribscan/gribscan. It's still very fragile, deliberately doesn't care about being compatible with the output of cfgrib, and potentially requires users to implement their own Magician.

@martindurant
Member

Magician?? :)

Do you intend to integrate any of the work into or even replace grib2 in this repo? Do you have any comments on how it compares with @TomAugspurger 's cogrib?

Note that with the latest version of numcodecs, you no longer need to import and register your codec; you can instead declare entry points that are installed with your package.
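
(For contrast, the import-and-register step that entry points make unnecessary; register_codec is the real numcodecs API, applied here to the RawGribCodec class from the first comment:)

import numcodecs

# manual registration at import time; with entry-point support, numcodecs
# discovers the codec on demand without this call
numcodecs.register_codec(RawGribCodec)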

@d70-t
Author

d70-t commented May 19, 2022

:-) yes, we call the customization points a Magician because that's the part where users have to put their guesswork about how to assemble datasets, to "magically" stitch the grib messages together.

That's also the biggest difference from cogrib: we do not try to have a universal tool which makes some dataset out of almost any GRIB. Instead, we require customization to make the resulting dataset nicer, under the assumption that someone who was involved in creating the initial GRIBs might put some valuable knowledge into it.

The latest version of numcodecs isn't released yet... We've got the entry points set up, but they don't work yet 😬

@d70-t
Author

d70-t commented May 19, 2022

Currently it works for some GRIBs, but it's not really stable yet and we need to gain more experience... Thus we thought it might need a little time before we really want it in kerchunk.

@martindurant
Member

@TomAugspurger , I'm sure your thoughts on gribscan would be appreciated, if you have the time to look.

@martindurant
Member

The magician looks quite a lot like what happens in MultiZarrToZarr: if each of the messages of a grib were made into an independent dataset and combined with that, then maybe you wouldn't need your own mages. Sorry, sorcerers, ... er, magicians.

@d70-t
Author

d70-t commented May 19, 2022

It would probably be possible to stuff some of the magicians into something like the coo_map... I have to think more about that.
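
For illustration, a rough sketch of that direction, assuming one reference set per message (or per file) and using kerchunk's MultiZarrToZarr with its coo_map option; the file names and the "cf:time" selector are placeholders for whatever the messages actually need:

from kerchunk.combine import MultiZarrToZarr

# combine per-message/per-file reference sets along time, deriving the
# time coordinate from each dataset's CF metadata
mzz = MultiZarrToZarr(
    ["msg0.json", "msg1.json"],
    concat_dims=["time"],
    coo_map={"time": "cf:time"},
)
combined = mzz.translate()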

@d70-t
Author

d70-t commented May 19, 2022

Initially we had a design that built one dataset per grib file and then put all of them into MultiZarrToZarr. We moved away from that design because we needed something that looks at the collection of all individual messages. But we hadn't come up with the idea of making datasets out of each individual message.
