## Retrieving subsets from GRIB files via GribJump

This example demonstrates how the experimental `gribjump` source efficiently retrieves individual grid cells from GRIB messages stored in an FDB. The source is a thin wrapper around the Python bindings of [GribJump](https://github.com/ecmwf/gribjump).

In [1]:
import os
import numpy as np
import earthkit.data

GribJump can retrieve ranges of grid cells from GRIB files in an FDB that were
previously indexed by GribJump (e.g. using `gribjump-scan`). To use the
`gribjump` source in earthkit-data, the environment must point to an FDB and
must also set the GribJump-specific environment variables.

⚠️ Please be aware that this source currently does not validate that the grid
indices specified by the user actually correspond to the fields' underlying
grids. Make sure that every field referenced by the specified FDB request uses
the grid you expect. Because no validation is performed, we also need to tell
GribJump to ignore any missing grid validation information via the
`GRIBJUMP_IGNORE_GRID` environment variable.

In [2]:
# Configure FDB via either the FDB_HOME or the FDB5_CONFIG_FILE environment variable.
# os.environ.setdefault("FDB_HOME", "<your fdb home directory>")
os.environ.setdefault("FDB5_CONFIG_FILE", "<your fdb5 config file>")
os.environ.setdefault("GRIBJUMP_CONFIG_FILE", "<your gribjump config file>")
os.environ.setdefault("GRIBJUMP_IGNORE_GRID", "1")

'1'
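
Placeholder values or unset variables are easy to overlook before the first request is made. The following is a minimal sketch of a pre-flight check, assuming only the variable names used in the cell above; the helper `missing_settings` is hypothetical and not part of earthkit-data, FDB, or GribJump.

```python
import os

# Variable names taken from the cell above; FDB_HOME is the alternative
# mentioned in its comment.
FDB_VARS = ("FDB_HOME", "FDB5_CONFIG_FILE")
GRIBJUMP_VARS = ("GRIBJUMP_CONFIG_FILE", "GRIBJUMP_IGNORE_GRID")


def missing_settings(environ=os.environ):
    """Return a list of human-readable problems with the environment."""
    problems = []
    # At least one way of locating the FDB must be configured.
    if not any(environ.get(name) for name in FDB_VARS):
        problems.append("set one of: " + ", ".join(FDB_VARS))
    for name in GRIBJUMP_VARS:
        if not environ.get(name):
            problems.append("set " + name)
    # Detect "<your ...>" placeholders that were never replaced.
    for name in FDB_VARS + GRIBJUMP_VARS:
        value = environ.get(name, "")
        if value.startswith("<") and value.endswith(">"):
            problems.append(name + " still holds a placeholder")
    return problems
```

Calling `missing_settings()` right after the configuration cell returns an empty list when everything is set, which makes it cheap to assert on before creating the source.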

### How To Use

The `gribjump` source works similarly to the `fdb` source and receives a dictionary with an FDB request.
Please note that the MARS syntax for ranges and lists using "/" is not supported. Only scalar values and
Python lists are supported.
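
If a request was originally written with MARS-style "/" lists, it has to be converted before it is passed to the source. The sketch below shows one way to do that; `expand_mars_values` is a hypothetical helper, not part of earthkit-data, and it deliberately refuses `to`/`by` ranges because expanding them correctly depends on the key's type.

```python
def expand_mars_values(request):
    """Return a copy of `request` with "a/b/c" strings split into lists."""
    expanded = {}
    for key, value in request.items():
        if isinstance(value, str) and "/" in value:
            parts = [part.strip() for part in value.split("/")]
            # "0/to/24/by/6" style ranges are not expanded here.
            if any(part.lower() in ("to", "by") for part in parts):
                raise ValueError(
                    f"expand the range for {key!r} by hand: {value!r}"
                )
            expanded[key] = parts
        else:
            # Scalars and existing Python lists pass through unchanged.
            expanded[key] = value
    return expanded
```

For example, `expand_mars_values({"time": "0000/0600", "step": 6})` yields a request in which `time` is the list `["0000", "0600"]` and `step` is still the integer `6`.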

The second required parameter is one of `ranges`, `indices`, or `mask`, which selects the grid cells to
extract. For convenience, you can set the additional parameter `fetch_coords_from_fdb=True`. The source then
makes an extra request directly to the FDB to retrieve the latitude and longitude of the extracted cells and
includes them in the metadata of the returned fields.

In [3]:
source = earthkit.data.from_source(
    "gribjump",
    {
        "class": "ce",
        "expver": "0001",
        "stream": "efcl",
        "date": "20230101",
        "model": "lisflood",
        "domain": "g",
        "origin": "ecmf",
        "step": 6,
        "type": "sfo",
        "levtype": "sfc",
        "param": "240023",
        "time": ["0000", "0600"],
        "hdate": ["20200101", "20200102"],
    },
    ranges=[(1234, 2345)],
    fetch_coords_from_fdb=True,
)

In [4]:
source.ls()

Gribjump Engine: Built file map: 0.022177 second elapsed, 0.011457 second cpu
Starting 8 threads
Gribjump Progress: 1 of 1 tasks complete
Gribjump Engine: All tasks finished: 0.334884 second elapsed, 0.162512 second cpu
Gribjump Engine: Repackaged results: 8e-06 second elapsed, 7e-06 second cpu
Engine::extract: 1.7e-05 second elapsed, 1.5e-05 second cpu


|   | param | level | base_datetime | valid_datetime | step | number |
|---|---|---|---|---|---|---|
| 0 | 240023 | | 2020-01-01T00:00:00 | 2020-01-01T06:00:00 | 6 | |
| 1 | 240023 | | 2020-01-01T06:00:00 | 2020-01-01T12:00:00 | 6 | |
| 2 | 240023 | | 2020-01-02T00:00:00 | 2020-01-02T06:00:00 | 6 | |
| 3 | 240023 | | 2020-01-02T06:00:00 | 2020-01-02T12:00:00 | 6 | |


In [5]:
ds = source.to_xarray()
ds

### Selection and Groupings

The `gribjump` source offers limited support for the selection methods (`.sel()`
and `.isel()`), the grouping method (`.group_by()`), and anything else implemented
for a `SimpleFieldList`. However, please keep in mind that the only metadata
available for these operations comes from the specified FDB request dictionary.
Any selection value must therefore have the same type as the corresponding value
in the dictionary supplied by the user.
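
The type requirement is easiest to see with plain Python equality, which never treats the integer `6` and the string `"6"` as equal. The snippet below is only an illustration of that behaviour: `fields` and `select` are hypothetical stand-ins and do not reproduce how earthkit-data implements `.sel()`.

```python
# Stand-in for the per-field metadata derived from the request dictionary:
# "step" was given as an int, "hdate" and "time" as strings.
fields = [
    {"step": 6, "hdate": hdate, "time": time}
    for hdate in ("20200101", "20200102")
    for time in ("0000", "0600")
]


def select(fields, **criteria):
    """Keep the entries whose values equal every criterion (no type coercion)."""
    return [
        field
        for field in fields
        if all(field.get(key) == value for key, value in criteria.items())
    ]


print(len(select(fields, hdate="20200101")))  # 2: string matches string
print(len(select(fields, hdate=20200101)))    # 0: an int never equals a string
```

In the request used in this notebook, `step` is an integer while `hdate` and `time` are strings, so selections should be written as `step=6` and `hdate="20200101"`.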

In [6]:
groups = source.sel(hdate="20200101").group_by("time")
for group in groups:
    print(group, group.to_numpy().shape, group.metadata('base_datetime'))

data=SimpleFieldList(2) 2
SimpleFieldList(1) (1, 1111) ['2020-01-01T00:00:00']
SimpleFieldList(1) (1, 1111) ['2020-01-01T06:00:00']


### Extraction Options

You can specify the extraction points through one of three options. GribJump
treats all fields as flattened 1D arrays, so every grid index you pass must
refer to this flattened representation.

* **Ranges:** A list of tuples `(start, end)` defining contiguous ranges of grid
    points to extract. As shown in the example above, each tuple specifies a start
    index (inclusive) and end index (exclusive) in the flattened 1D array
    representation of the grid. For example, `[(0, 100), (200, 300)]` would extract
    grid points 0-99 and 200-299.

* **Indices:** A 1D numpy array or list of specific grid point indices to extract
    from the flattened grid. This allows for non-contiguous extraction of
    individual grid points. For example, `np.array([5, 10, 15, 20])` would extract
    exactly those four grid points. This array must be sorted in ascending order.

* **Masks:** A numpy boolean array where `True` indicates grid points to extract
    and `False` indicates points to skip. The mask must have the same length as
    the total number of grid points in the field. However, no such validation is
    performed and passing a mask with an invalid shape will silently return wrong
    results.
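
Because the source performs no validation itself, it can be worth checking a selection before sending it. The following sketch assumes you know the total number of grid points of your fields (`n_points`, e.g. `4530 * 2970` for the grid used in this notebook); `check_selection` is a hypothetical helper, not part of earthkit-data or GribJump.

```python
def check_selection(n_points, ranges=None, indices=None, mask=None):
    """Raise ValueError if the selection cannot fit a grid of `n_points` cells."""
    given = [
        name
        for name, value in (("ranges", ranges), ("indices", indices), ("mask", mask))
        if value is not None
    ]
    if len(given) != 1:
        raise ValueError(f"pass exactly one of ranges, indices, mask; got {given}")

    if ranges is not None:
        for start, end in ranges:
            # Ranges are half-open: start is included, end is not.
            if not 0 <= start < end <= n_points:
                raise ValueError(f"bad range ({start}, {end}) for {n_points} points")
    elif indices is not None:
        values = [int(i) for i in indices]
        # Duplicates are not rejected here; only the ordering is checked.
        if values != sorted(values):
            raise ValueError("indices must be sorted in ascending order")
        if values and not (0 <= values[0] and values[-1] < n_points):
            raise ValueError("indices fall outside the grid")
    else:
        if len(mask) != n_points:
            raise ValueError(f"mask has {len(mask)} entries, expected {n_points}")
```

The function accepts plain lists as well as 1D numpy arrays, so it can be called with the same objects that are later passed to `from_source`.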

Only one of these methods can be used at a time. Please also note that GribJump
uses ranges internally regardless of what the user specifies. Converting the
user's chosen representation to ranges can be expensive when multiple
fields are accessed simultaneously.
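
To get a feeling for what this conversion involves, the sketch below turns indices and masks into half-open `(start, end)` ranges. It is an illustration of the idea only and makes no claim about how GribJump implements the conversion; `indices_to_ranges` and `mask_to_ranges` are hypothetical names.

```python
def indices_to_ranges(indices):
    """Merge sorted integer indices into half-open (start, end) ranges."""
    ranges = []
    for index in indices:
        index = int(index)
        if ranges and ranges[-1][1] == index:
            # Index directly follows the previous range: extend it.
            ranges[-1] = (ranges[-1][0], index + 1)
        else:
            # Start a new single-cell range.
            ranges.append((index, index + 1))
    return ranges


def mask_to_ranges(mask):
    """Turn a flat boolean mask into half-open ranges of its True runs."""
    return indices_to_ranges(i for i, keep in enumerate(mask) if keep)
```

Note that every isolated index becomes its own one-cell range, so a sparse random mask over a large grid produces a very large number of ranges. If your selection is naturally contiguous, passing `ranges` directly avoids this step.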

#### Code Examples

In [7]:
request = {
    "class": "ce",
    "expver": "0001",
    "stream": "efcl",
    "date": "20230101",
    "model": "lisflood",
    "domain": "g",
    "origin": "ecmf",
    "step": 6,
    "type": "sfo",
    "levtype": "sfc",
    "param": "240023",
    "time": "0000",
    "hdate": "20200101",
}

# Example 1: Using ranges
source_ranges = earthkit.data.from_source(
    "gribjump",
    request,
    ranges=[(1234, 2345), (3456, 4567)],
)
ds = source_ranges.to_xarray()
print("Extracted dataset (ranges):", ds)

# Example 2: Using indices to extract specific grid points
indices = np.array([10, 50, 100, 150, 200])
source_indices = earthkit.data.from_source(
    "gribjump",
    request,
    indices=indices,
)
print("Extracted dataset (indices):", source_indices.to_xarray())

# Example 3: Using a boolean mask with random selection
shape = 4530 * 2970  # Total number of grid points; depends on your grid
mask = np.random.choice([True, False], size=shape, p=[0.05, 0.95])

source_mask = earthkit.data.from_source(
    "gribjump",
    request,
    mask=mask,
)
print("Extracted dataset (mask):", source_mask.to_xarray())

Gribjump Engine: Built file map: 0.010474 second elapsed, 0.008713 second cpu
Gribjump Progress: 1 of 1 tasks complete
Gribjump Engine: All tasks finished: 0.039335 second elapsed, 0.039178 second cpu
Gribjump Engine: Repackaged results: 6e-06 second elapsed, 5e-06 second cpu
Engine::extract: 2e-05 second elapsed, 2e-05 second cpu
Extracted dataset (ranges): <xarray.Dataset> Size: 36kB
Dimensions:  (index: 2222)
Coordinates:
  * index    (index) int64 18kB 1234 1235 1236 1237 1238 ... 4563 4564 4565 4566
Data variables:
    240023   (index) float64 18kB ...
Attributes: (12/13)
    param:        240023
    class:        ce
    stream:       efcl
    levtype:      sfc
    type:         sfo
    expver:       0001
    ...           ...
    hdate:        20200101
    time:         0000
    origin:       ecmf
    domain:       g
    Conventions:  CF-1.8
    institution:  ECMWF
Gribjump Engine: Built file map: 0.009283 second elapsed, 0.007779 second cpu
Gribjump Progress: 1 of 1 tasks comple