# Example: Opening large local datasets

> [!NOTE]
> The online laboratory has only been tested in recent Firefox and Chrome browsers. Some features may not (yet) be supported in Safari browsers.

> [!CAUTION]
> Any changes you make to this notebook will be lost once the page is closed or refreshed. Please download any files you would like to keep.

In [1]:
import sys
sys.path.insert(0, "..")

In [2]:
import utils

[pyodide]: Loading pandas, tzdata, dask, click, cloudpickle, importlib_metadata, zipp, Jinja2, MarkupSafe, partd, locket, toolz, pyyaml, kerchunk, fsspec, numcodecs, msgpack, ujson, zarr, asciitree, cftime, xarray, h5py, pkgconfig, cfgrib, attrs, eccodes, cffi, pycparser, findlibs, scipy, openblas, sympy, distutils, mpmath, ipyfilite, ipywidgets, widgetsnbextension, jupyterlab_widgets
[pyodide]: Loaded Jinja2, MarkupSafe, asciitree, attrs, cffi, cfgrib, cftime, click, cloudpickle, dask, distutils, eccodes, findlibs, fsspec, h5py, importlib_metadata, ipyfilite, ipywidgets, jupyterlab_widgets, kerchunk, locket, mpmath, msgpack, numcodecs, openblas, pandas, partd, pkgconfig, pycparser, pyyaml, scipy, sympy, toolz, tzdata, ujson, widgetsnbextension, xarray, zarr, zipp
[pyodide]: Memory usage has grown to 177.6MiB (from 49.9MiB) for this notebook


## Motivation

The online laboratory operates within a memory-constrained environment, so downloading large datasets into the lab is often not possible.

If the data is stored remotely, e.g. because it exceeds the size of your machine's working memory or even its file storage, [`02-remote.ipynb`](02-remote.ipynb) shows you how to open the remote data and stream it in as needed.

However, if you already have the dataset stored in your local filesystem, mounting the local file into the online laboratory is the preferred option. This approach is explored in this notebook.

Note that you only need to use this approach when running notebooks in the online laboratory on <https://lab.climet.eu>. If you are running notebooks locally, you can simply `open()` the local file directly.
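As a rough illustration of this distinction, the sketch below checks whether the code is running inside a Pyodide-based kernel, which is what the `[pyodide]` log lines above suggest the online laboratory uses. The helper name `running_in_online_lab` is hypothetical and not part of `utils`. The check relies on Pyodide reporting `sys.platform` as `"emscripten"`, so it is a heuristic rather than a guarantee.

```python
import sys


def running_in_online_lab() -> bool:
    """Heuristic check (hypothetical helper): Pyodide reports 'emscripten'."""
    return sys.platform == "emscripten"


if running_in_online_lab():
    # In the browser, the local file must first be mounted into the lab
    print("Online laboratory: mount the local file before opening it.")
else:
    # In a regular Python installation, the file path can be used directly
    print("Local Python: open the file directly from its path.")
```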

## Mounting a local file into the laboratory

Mounting a local file might seem similar to uploading it. However, the two differ in important ways:

1. Mounting does not copy any data and does not read the file into memory, thus allowing arbitrarily large files to be made accessible.
2. A mounted file never leaves your machine and is not uploaded to any server. This is especially important if your data contains sensitive information.

It is worth remembering that large files can still only be read if the algorithm that processes them supports streaming or chunking and does not try to load all data into memory at once.
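To make the streaming requirement concrete, here is a minimal sketch that uses only the standard library. The function `streamed_byte_sum` and the small temporary file are illustrative stand-ins, not part of the laboratory's `utils` module. The function processes a file one fixed-size chunk at a time, so its peak memory use depends on the chunk size rather than on the file size.

```python
import tempfile
from pathlib import Path


def streamed_byte_sum(path, chunk_size=1024 * 1024):
    """Sum all byte values in a file while holding one chunk in memory at a time."""
    total = 0
    with open(path, "rb") as f:
        # f.read() returns an empty bytes object at end of file, ending the loop
        while chunk := f.read(chunk_size):
            total += sum(chunk)
    return total


# A small temporary file stands in for a large mounted file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.bin"
    demo_path.write_bytes(bytes(range(256)) * 4)
    result = streamed_byte_sum(demo_path, chunk_size=100)

print(result)  # 130560
```

An algorithm that instead called `f.read()` without a size argument would try to pull the entire file into memory, which is exactly what fails for large mounted files.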

In [3]:
upload_path = await utils.mount_user_local_file()
upload_path

FileUploadLite(value=(), description='Upload')

PosixPath('/uploads/60ae67af-b393-47c5-9276-8c94cf956df3/03-t2m.nc')

## Loading the file into `xarray`

In [4]:
import cfgrib
import netCDF4
import zarr

import xarray as xr

[pyodide]: Loading netcdf4
[pyodide]: Loaded netcdf4
[pyodide]: Memory usage has grown to 213.2MiB (from 177.6MiB) for this notebook


Finally, we can load the data into `xarray` as usual.

When opening a GRIB dataset, `cfgrib` looks for or creates an index file for the dataset. Since we have mounted the local GRIB file as read-only, however, `cfgrib` is unable to create the index file at its usual location and will fail with a cryptic error. You can either disable the generation of an index file using

```python
xr.open_dataset(dataset_path, backend_kwargs=dict(indexpath=""))
```

or provide an explicit index path instead, e.g.

```python
from pathlib import Path

xr.open_dataset(dataset_path, backend_kwargs=dict(
    indexpath=f"./{Path(dataset_path).name}.{{short_hash}}.idx",
))
```

The `utils.open_dataset(..)` helper function uses the first strategy and automatically disables the generation of an index file.
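For the second strategy, it may help to see what the f-string actually evaluates to. In the sketch below, the dataset path is a hypothetical placeholder. The doubled braces `{{short_hash}}` escape the f-string formatting, so the resulting string still contains a literal `{short_hash}` placeholder. As far as we understand, `cfgrib` later fills this placeholder in itself.

```python
from pathlib import Path

# Hypothetical path standing in for a mounted GRIB file
dataset_path = Path("/uploads/example/data.grib")

# Only the file name is interpolated here; {{short_hash}} stays a placeholder
indexpath = f"./{Path(dataset_path).name}.{{short_hash}}.idx"

print(indexpath)  # ./data.grib.{short_hash}.idx
```

Because the index path is relative to the current working directory, which is writable, the index file no longer needs to be created next to the read-only mounted file.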

In [5]:
ds = xr.open_dataset(upload_path)
ds