In [None]:
!pip install -U s3fs

In [None]:
from rich import print

# Pangeo Forge: Transforming archival data into analysis-ready, cloud-optimized (ARCO) formats

**Charles Stern** ([@cisaacstern](http://github.com/cisaacstern)), Data Infrastructure Engineer, Lamont-Doherty Earth Observatory (LDEO)

Presentation Repo: https://github.com/cisaacstern/pangeo-forge-slides

### NOAA Optimum Interpolation Sea Surface Temperature (OISST)

For this example we will consider the [NOAA OISST](https://www.psl.noaa.gov/data/gridded/data.noaa.oisst.v2.html) dataset. It is a typical case for Pangeo Forge: many source files, each holding a temporal subset of a larger dataset, are transformed into a single, consolidated ARCO datastore.

## NOAA OISST `http` index
https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/
```
 │
 ├──198109/
 │   ├──oisst-avhrr-v02r01.19810901.nc
 │  ...
 │   └──oisst-avhrr-v02r01.19810930.nc
...
 └──202110/
```
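
Because the index follows a regular naming scheme, the URL for any day can be generated from its date. The sketch below is a minimal illustration, assuming every day in the range follows the `YYYYMM/oisst-avhrr-v02r01.YYYYMMDD.nc` layout shown above; the names `BASE_URL`, `make_url`, and `daily_urls` are hypothetical helpers, not part of any library.

```python
from datetime import date, timedelta

# Root of the http index shown above
BASE_URL = (
    "https://www.ncei.noaa.gov/data/"
    "sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr"
)


def make_url(day: date) -> str:
    """Build the source URL for one daily file (hypothetical helper)."""
    # ASSUMPTION: every file follows the naming scheme in the listing above
    return f"{BASE_URL}/{day:%Y%m}/oisst-avhrr-v02r01.{day:%Y%m%d}.nc"


def daily_urls(start: date, end: date) -> list:
    """Return one URL per day from start to end, inclusive."""
    n_days = (end - start).days + 1
    return [make_url(start + timedelta(days=i)) for i in range(n_days)]


urls = daily_urls(date(1981, 9, 1), date(2021, 1, 5))
print(len(urls))
print(urls[0])
print(urls[-1])
```

A function of this shape, mapping a position along the time axis to a file location, is the kind of information a recipe needs in order to "see" every source file.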

In [None]:
!wget 'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/198109/oisst-avhrr-v02r01.19810901.nc'

In [None]:
import xarray as xr

ds = xr.open_dataset("oisst-avhrr-v02r01.19810901.nc")
nbytes = ds.nbytes
print(f"{nbytes/1e6:.2f} MB")
ds

In [None]:
import pandas as pd

# Our target range includes 14372 files
dates = pd.date_range("1981-09-01", "2021-01-05", freq="D")

print(f"{len(dates)} files -> {(len(dates)*nbytes)/1e9:.2f} GB")

### The old way: start downloading

This will take a while!

And the end result is likely not well suited for parallel computation.

# A better way: Pangeo Forge

`pangeo-forge-recipes` provides the logic for transforming all of these source files into a single, consolidated Zarr store.

## What's a `recipe`?

A `recipe` is a Python file that can "see" all of the source files and knows how to arrange them logically into a cohesive dataset.

<img src='architecture.png'>

In [None]:
# https://github.com/pangeo-forge/staged-recipes/blob/master/recipes/noaa-oisst/recipe.py

from noaa_oisst_recipe import recipe

for i, kv in enumerate(recipe.file_pattern.items()):
    if i < 2 or i > 14369:
        print(kv[0], kv[1])

# Zarr format

- Compressed, N-dimensional arrays of any NumPy data type
- Chunking in any dimension
- Read & write array chunks concurrently from multiple threads or processes

https://zarr.readthedocs.io/en/stable/
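
To make chunking concrete, here is a minimal sketch of how an element index maps to a chunk. In the Zarr v2 layout, each chunk is stored under a key made of dot-separated chunk indices (for example `0.0.0.0`). The chunk shape, the `(time, zlev, lat, lon)` dimension order, and the helpers `chunk_key` and `offset_in_chunk` are assumptions for illustration; they are not read from the OISST store or taken from the Zarr API.

```python
from datetime import date


def chunk_key(index, chunks):
    """Return the dot-separated chunk key that holds the given element index."""
    return ".".join(str(i // c) for i, c in zip(index, chunks))


def offset_in_chunk(index, chunks):
    """Return the position of the element inside its chunk."""
    return tuple(i % c for i, c in zip(index, chunks))


# ASSUMPTION: hypothetical chunk shape for a (time, zlev, lat, lon) array
chunks = (10, 1, 720, 1440)

# Position of 2000-01-01 on a daily time axis that starts on 1981-09-01
t = (date(2000, 1, 1) - date(1981, 9, 1)).days

print(chunk_key((t, 0, 0, 0), chunks))
print(offset_in_chunk((t, 0, 0, 0), chunks))
```

Because each key identifies an independent object in the store, separate workers can read or write different chunks without coordinating with one another.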

# Zarr build
1. Cache files to cloud
2. Write Zarr "skeleton" to target location
3. Write Zarr chunks
4. Consolidate metadata

> For parallel computation with Dask, 50-100 MB chunks tend to work well. In Pangeo Forge, contributors specify chunking with the `XarrayZarrRecipe.target_chunks` keyword; e.g., `target_chunks={"time": 10}`.
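
As a rough check on chunk size, the arithmetic can be sketched as follows. The grid shape (720 x 1440), the singleton depth dimension, and the 4-byte (float32) item size are assumptions for illustration rather than values read from the dataset; substitute the shape and dtype that `ds` reports. The helpers `chunk_nbytes` and `time_steps_for_target` are hypothetical.

```python
from math import prod


def chunk_nbytes(chunk_shape, itemsize=4):
    """Uncompressed size in bytes of one chunk of a single variable."""
    return prod(chunk_shape) * itemsize


def time_steps_for_target(target_nbytes, step_shape, itemsize=4):
    """Largest number of time steps per chunk that stays at or under the target size."""
    return max(1, target_nbytes // chunk_nbytes(step_shape, itemsize))


# ASSUMPTION: one daily step of one variable is a (zlev, lat, lon) = (1, 720, 1440) array
step_shape = (1, 720, 1440)

# Size of a chunk holding 10 time steps, as in target_chunks={"time": 10}
size_mb = chunk_nbytes((10, *step_shape)) / 1e6
print(f"{size_mb:.2f} MB per chunk")

# Time steps per chunk needed to land in the 50-100 MB range
print(time_steps_for_target(50_000_000, step_shape))
print(time_steps_for_target(100_000_000, step_shape))
```

These are uncompressed sizes for a single variable; the bytes actually stored depend on the compressor.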

In [None]:
import s3fs
import xarray as xr

endpoint_url = "https://ncsa.osn.xsede.org"
fs_osn = s3fs.S3FileSystem(anon=True, client_kwargs={"endpoint_url": endpoint_url})

path = "s3://Pangeo/pangeo-forge/noaa_oisst/v2.1-avhrr.zarr"
ds = xr.open_zarr(fs_osn.get_mapper(path), consolidated=True)
print(f"{ds.nbytes/1e9:.2f} GB")
ds

# An unbroken _provenance chain_

From source archive ...

https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/

... through `recipe` ...

https://github.com/pangeo-forge/staged-recipes/blob/master/recipes/noaa-oisst/recipe.py

... to Zarr dataset.

**s3://Pangeo/pangeo-forge/noaa_oisst/v2.1-avhrr.zarr**

## Today
- Start working on a recipe: https://pangeo-forge.readthedocs.io/en/latest/
- Add it to the queue for automated builds: https://github.com/pangeo-forge/staged-recipes/pulls
- Build it using a notebook

## Soon
- Build your recipe in an automated "Bakery"
- Browse (and contribute to) a STAC catalog of available Zarr datasets