# Lab 1: Exploring the Catalog

**Goal**: Discover what's available

In this lab, you'll log into a pre-configured AWS SageMaker Studio environment and explore the pre-loaded data catalog (ERA5, DEM, etc.) in a Jupyter Notebook. You'll run basic Xarray queries that are accelerated by Icechunk, experiencing the platform's speed firsthand.

**Note:** If you are not running this in an AWS-facilitated workshop environment, you will need to `pip install` `xarray`, `zarr`, `icechunk`, `matplotlib` and `cartopy`, as well as the latest version of `arraylake` (`v0.23.1`).
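
If you are setting up your own environment, one quick way to confirm what is installed is the standard-library `importlib.metadata` module. This is only a sketch: the package list mirrors the note above, and the helper name `installed_versions` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the note above.
REQUIRED = ["xarray", "zarr", "icechunk", "matplotlib", "cartopy", "arraylake"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name:12s} {ver or 'NOT INSTALLED'}")
```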

## Connect to Arraylake

Now let's log in to arraylake. We can [log in to the web app](https://app.earthmover.io/earthsciencesnz/dashboard), or log in from the notebook programmatically via the client.

In [None]:
from arraylake import Client

In [None]:
client = Client()
client.login()

This `client` object is how we interact with arraylake from Python. Using it we can find data repositories ("repos"), create and edit repos, and read and write data to and from repos.

## Explore Arraylake's Data Catalog

Data in arraylake is organised into "repos" (think GitHub repositories), which contain data, and "orgs" (think GitHub organizations), which contain a managed set of repos.

In this workshop we will use two orgs:
- The `earthmover-public` org, containing examples of public datasets (e.g. ERA5),
- The `earthsciencesnz` org, containing data specific to Earth Sciences New Zealand.

All repositories can be public or private, and your accounts for today are able to view all public repositories, such as those in `earthmover-public`, as well as all the private repositories within the `earthsciencesnz` org.

We can view all the repos in the `earthsciencesnz` org by navigating to the [earthsciencesnz org homepage](https://app.earthmover.io/earthsciencesnz/dashboard) in the Arraylake web app.

Alternatively, we can access the same information programmatically via the client:

In [None]:
# requires arraylake v0.23.1 for nice HTML repr
client.list_repos(org="earthsciencesnz")

Either way, we can see that this single org contains multiple repos, which may be public or private.

## Explore a specific repo (`copernicus_dem`)

Let's explore a specific data repository: the `copernicus_dem` repo. We can [view the repo](https://app.earthmover.io/earthsciencesnz/copernicus_dem) in the web app.

Or we can get information about the repo via the client.

In [None]:
# requires arraylake v0.23.1 to use sync version of this method
client.get_repo_object("earthsciencesnz/copernicus_dem")

Whilst the web app shows us various metadata about the contents of the repo, to actually access data we need to use the client.

In [None]:
ic_repo = client.get_repo("earthsciencesnz/copernicus_dem")

In [None]:
ic_repo

Icechunk is Earthmover's open-source transactional storage engine. You can think of it as "version-controlled, multiplayer Zarr".

Icechunk is incredibly powerful, and you can read more about it in the [icechunk documentation](https://icechunk.io/en/latest/), and on the [Earthmover Blog](https://earthmover.io/blog).

For today, in this notebook, Icechunk will mainly stay behind the scenes.

To access the data in Icechunk via zarr, we need to start a `Session`, then get the Zarr store object.

In [None]:
session = ic_repo.readonly_session("main")
session

In [None]:
icechunk_store = session.store
icechunk_store

Now we have something that xarray can read from!

In [None]:
import xarray as xr

In [None]:
ds = xr.open_dataset(
    icechunk_store, group="90m_new_zealand_complete", engine="zarr", zarr_format=3
)
ds

This `xarray.Dataset` represents a lazy view of the data in the `90m_new_zealand_complete` group of the Zarr data in the `copernicus_dem` repo.

The metadata shown is the same as in the [web app view](https://app.earthmover.io/earthsciencesnz/copernicus_dem/tree/main/90m_new_zealand) of the same group of that repo.

## Plot some data

Though this `Dataset` contains over a GB of data, only a tiny fraction of that (the metadata) has so far been downloaded to the machine on which we are running this notebook.

We can select a region over the North Island, and plot just that bounding box.

In [None]:
ds["latitude"].sel(latitude=-41, method="nearest")

In [None]:
# note: once https://github.com/pydata/xarray/pull/10711 is merged exact decimal grid values won't be necessary,
# instead we can use integer values with `method="nearest"`.
bbox = {
    "longitude": slice(174.00041667, 178.00041667),
    "latitude": slice(-36.99958333, -40.99958333),
}
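
Notice that the latitude slice runs from north to south (`-36.99…` down to `-40.99…`). That suggests the latitude coordinate is stored in descending order, which is common for gridded products. Label-based slicing in xarray follows the order of the index, so a slice written the other way round selects nothing. A small sketch on a synthetic array (the 1-degree grid and the name `toy` are invented for illustration):

```python
import numpy as np
import xarray as xr

# Synthetic 1-degree grid with latitude stored north-to-south (descending),
# which is what we are assuming for the DEM.
lat = np.linspace(-36.0, -42.0, 7)
toy = xr.DataArray(np.arange(lat.size), coords={"latitude": lat}, dims="latitude")

north_to_south = toy.sel(latitude=slice(-37.0, -41.0))
south_to_north = toy.sel(latitude=slice(-41.0, -37.0))

print(north_to_south.size)  # 5 points: -37, -38, -39, -40, -41
print(south_to_north.size)  # 0 points: the slice runs against the index order
```

If a bounding box ever comes back empty, checking the direction of the latitude slice is a good first step.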

In [None]:
ds["elevation"].sel(**bbox).plot()

Apparently New Zealand has at least one point below sea level, which is why xarray has defaulted to a diverging colormap. We can override this to make a more informative plot:

In [None]:
ds["elevation"].sel(**bbox).plot(vmin=-100, center=False)

## Coarsen

We could plot the elevation over the whole of NZ at our full resolution, but that would involve downloading >1GB of data to the machine our notebook is running on. That's totally possible, but what if we just wanted to work with a coarsened view of the data, for example for a regional climate model?

We can achieve that very easily using [xarray's `.coarsen` functionality](https://docs.xarray.dev/en/stable/user-guide/computation.html#coarsen-large-arrays). 

Here we coarsen the data by taking a mean over boxes of 400x400 lat-lon points.

In [None]:
ds["elevation"].coarsen(latitude=400, longitude=400).mean().plot()
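
The call above works when each dimension length is an exact multiple of the window. With xarray's default `boundary="exact"`, any other length raises a `ValueError`; passing `boundary="trim"` drops the leftover points instead. A minimal sketch on a synthetic array (the 10 x 12 grid is invented for illustration):

```python
import numpy as np
import xarray as xr

# Synthetic 10 x 12 grid standing in for the elevation array.
toy = xr.DataArray(
    np.ones((10, 12)),
    dims=("latitude", "longitude"),
    coords={"latitude": np.arange(10), "longitude": np.arange(12)},
)

# 10 is not a multiple of 4, so the default boundary="exact" raises.
try:
    toy.coarsen(latitude=4, longitude=4).mean()
    raised = False
except ValueError:
    raised = True

# boundary="trim" drops the two leftover latitude rows instead.
coarse = toy.coarsen(latitude=4, longitude=4, boundary="trim").mean()
print(raised, coarse.shape)
```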

## Explore another repo (`era5-surface-aws`)

Let's explore another repo. Earthmover maintains some public datasets - let's get ERA5!

In [None]:
ic_repo = client.get_repo("earthmover-public/era5-surface-aws")
session = ic_repo.readonly_session("main")

We can see information about this repo in the [web app page for ERA5](https://app.earthmover.io/earthmover-public/era5-surface-aws).

There are two groups - these contain the same data, but with chunking optimized for different access patterns. For now let's get the `spatial` group.

In [None]:
ds = xr.open_dataset(session.store, group="spatial", engine="zarr", zarr_format=3)
ds

A lot more data in here! How much data...?

In [None]:
ds.nbytes / 1e12

Wow, there's 32 TB in here!
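
Keep in mind that `nbytes` reports the size the arrays would occupy in memory, so the compressed size in object storage may be smaller. Dividing by `1e12` converts bytes to decimal terabytes; if you prefer not to pick the unit by hand, a small helper like this one can do it (the name `human_bytes` is just an illustration, not part of xarray):

```python
def human_bytes(n_bytes):
    """Format a byte count using decimal (SI) units, e.g. 1500 -> '1.5 kB'."""
    value = float(n_bytes)
    for unit in ["B", "kB", "MB", "GB", "TB", "PB"]:
        # Stop at the first unit that keeps the number below 1000.
        if value < 1000 or unit == "PB":
            return f"{value:.1f} {unit}"
        value /= 1000


print(human_bytes(32e12))  # 32.0 TB
```

In the notebook this could be called as `human_bytes(ds.nbytes)`.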

### Total cloud cover

This dataset has lots of interesting variables, but let's try plotting just one first - total cloud cover. We can look at the metadata of the `tcc` variable to confirm that that's the one that represents total cloud cover.

In [None]:
ds["tcc"].attrs

Now as this dataset is global, we should pick a map projection to use, for which we need the `cartopy` library.

In [None]:
import cartopy.crs as ccrs

In [None]:
p = (
    ds["tcc"]
    .isel(time=-1)
    .plot(
        subplot_kws={"projection": ccrs.Orthographic(173, -42), "facecolor": "gray"},
        transform=ccrs.PlateCarree(),
    )
)
p.axes.set_global()
p.axes.coastlines();

The total cloud cover over New Zealand on New Year's Eve 2024!

### Vorticity

Let's try calculating and plotting a simple derived quantity that you hopefully remember from GFD classes - vorticity.

In [None]:
def vorticity(u, v):
    """
    Calculate the vertical component of vorticity from horizontal velocity fields u and v.
    """

    du_dy = u.differentiate("latitude")
    dv_dx = v.differentiate("longitude")

    return dv_dx - du_dy

(This operation will compute eagerly, so we need to select just the timestep we want first, otherwise we will load ~4TB of data into memory! To do this lazily or in parallel for the entire dataset, we would need to take advantage of xarray's integration with parallel computing frameworks such as [Dask](https://docs.xarray.dev/en/stable/user-guide/dask.html).)
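
One caveat on the function above: `differentiate` takes derivatives with respect to the coordinate values, which here are degrees of latitude and longitude. The result is therefore in (m/s) per degree and ignores the convergence of meridians, which is fine for a quick qualitative plot. If you want a physical vorticity in 1/s, a minimal sketch on a sphere might look like the following. It assumes a regular lat-lon grid in degrees, a constant Earth radius, and the dimension names used in this dataset; `vorticity_spherical` is an illustrative name, not an xarray function.

```python
import numpy as np
import xarray as xr

EARTH_RADIUS_M = 6.371e6  # mean Earth radius in metres (assumed constant)


def vorticity_spherical(u, v, lat="latitude", lon="longitude"):
    """Relative vorticity (1/s) on a regular lat-lon grid given in degrees.

    zeta = 1 / (a cos(phi)) * (dv/dlambda - d(u cos(phi))/dphi)
    """
    cos_phi = np.cos(np.deg2rad(u[lat]))
    per_radian = 180.0 / np.pi  # differentiate() works per degree here
    dv_dlam = v.differentiate(lon) * per_radian
    d_ucos_dphi = (u * cos_phi).differentiate(lat) * per_radian
    return (dv_dlam - d_ucos_dphi) / (EARTH_RADIUS_M * cos_phi)


# Check against solid-body rotation, u = U cos(phi), v = 0,
# for which the exact answer is zeta = 2 U sin(phi) / a.
lat = np.linspace(60.0, -60.0, 121)
lon = np.arange(0.0, 360.0, 1.0)
u = xr.DataArray(
    10.0 * np.cos(np.deg2rad(lat))[:, None] * np.ones(lon.size),
    dims=("latitude", "longitude"),
    coords={"latitude": lat, "longitude": lon},
)
v = xr.zeros_like(u)
zeta = vorticity_spherical(u, v)
```

With the ERA5 data this could be called as `vorticity_spherical(final_timestep["u100"], final_timestep["v100"])` once `final_timestep` is defined. Values at the poles, where the cosine of latitude goes to zero, will not be meaningful.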

In [None]:
final_timestep = ds.isel(time=-1)

In [None]:
vort100 = vorticity(
    u=final_timestep["u100"],
    v=final_timestep["v100"],
)

In [None]:
p = vort100.plot(
    subplot_kws={"projection": ccrs.Orthographic(173, -42), "facecolor": "gray"},
    transform=ccrs.PlateCarree(),
    robust=True,
)
p.axes.set_global()
p.axes.coastlines();

This is all so fast because xarray only requests the Zarr chunks that it needs to make the plot, and Icechunk's IO layer fetches them very efficiently and concurrently.

### Try exploring yourself!

Now have a go at exploring yourself! Here are some ideas of things you could try:

- Find all public repos in arraylake.

Solution:

Navigate to [https://app.earthmover.io/public/repositories](https://app.earthmover.io/public/repositories).

- Plot a different variable from ERA5.

In [None]:
# Click to reveal the solution!

p = (
    ds["sd"]
    .isel(time=-1)
    .plot(
        subplot_kws={"projection": ccrs.Orthographic(173, -42), "facecolor": "gray"},
        transform=ccrs.PlateCarree(),
    )
)
p.axes.set_global()
p.axes.coastlines();

- Compute the global mean sea surface temperature at a point in time.

In [None]:
# Click to reveal the solution!

ds["sst"].isel(time=-1).mean()
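
A plain `.mean()` treats every grid point equally, but on a regular lat-lon grid the cells shrink towards the poles, so high latitudes are over-represented. xarray's `.weighted()` lets us weight by the cosine of latitude instead. Here is a minimal sketch on a synthetic field (the 1-degree grid and the `toy` array are invented for illustration):

```python
import numpy as np
import xarray as xr

# Synthetic 1-degree field whose value is |latitude|,
# so the largest values sit near the poles.
lat = np.linspace(-90.0, 90.0, 181)
lon = np.arange(0.0, 360.0, 1.0)
toy = xr.DataArray(
    np.abs(lat)[:, None] * np.ones(lon.size),
    dims=("latitude", "longitude"),
    coords={"latitude": lat, "longitude": lon},
)

# Grid-cell area on a regular lat-lon grid scales with cos(latitude).
weights = np.cos(np.deg2rad(toy["latitude"]))

unweighted = float(toy.mean())
weighted = float(toy.weighted(weights).mean(("latitude", "longitude")))
print(unweighted, weighted)
```

The same pattern should carry over to `ds["sst"].isel(time=-1)`, using its own `latitude` coordinate to build the weights.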

- Plot a timeseries of ERA5 data at a single point (this will perform better if you open the `temporal` group instead of the `spatial` group, which has the same data but chunked in a way more optimized for a timeseries access pattern).

In [None]:
# Click to reveal the solution!

# open time-optimized-chunking version of ERA5
ds = xr.open_dataset(session.store, group="temporal", engine="zarr", zarr_format=3)

# convective precipitation (ERA5 short name `cp`)
precip = ds["cp"]

# timeseries over Wellington, NZ
wellington_precip = precip.sel(
    latitude=-41.288889, longitude=174.777222, method="nearest"
)

wellington_precip.plot()
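
To see why the `temporal` group suits this access pattern, it helps to count how many chunks a read has to touch. The sketch below uses made-up chunk shapes on a 721 x 1440 grid with one year of hourly steps; the real chunking of this repo may well differ, so treat the numbers as an illustration of the idea only.

```python
from math import ceil, prod

# Hypothetical chunk shapes in (time, latitude, longitude) order.
# These are invented for illustration, not read from the repo.
SPATIAL_CHUNKS = (1, 721, 1440)
TEMPORAL_CHUNKS = (8760, 10, 10)


def chunks_touched(selection_shape, chunk_shape):
    """Chunks needed to cover a selection that starts on a chunk boundary."""
    return prod(ceil(s / c) for s, c in zip(selection_shape, chunk_shape))


one_map = (1, 721, 1440)        # a single global snapshot
one_year_series = (8760, 1, 1)  # one year of hourly values at one grid point

for label, chunks in [("spatial", SPATIAL_CHUNKS), ("temporal", TEMPORAL_CHUNKS)]:
    print(label, chunks_touched(one_map, chunks), chunks_touched(one_year_series, chunks))
```

Under these assumed shapes a map needs one chunk from the spatial layout but thousands from the temporal one, and the reverse holds for a timeseries.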

- Compute a monthly climatology at a specific location using [xarray's `.groupby` pattern](https://docs.xarray.dev/en/stable/user-guide/groupby.html).

In [None]:
# Click to reveal the solution!
wellington_precip.groupby("time.month").mean().plot()
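
If the `"time.month"` shorthand is new to you: it groups every timestamp by its calendar month, so the result has a `month` dimension of length 12 in place of `time`. A toy example with a synthetic daily series (invented here so that the answer is obvious):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two years of daily data whose value is simply the month number.
time = pd.date_range("2001-01-01", "2002-12-31", freq="D")
series = xr.DataArray(
    time.month.values.astype(float),
    coords={"time": time},
    dims="time",
    name="toy",
)

# Average all Januaries together, all Februaries together, and so on.
climatology = series.groupby("time.month").mean()
print(climatology.dims, climatology.values)
```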

- Explore some of the other example repos in `earthmover-public`, such as `gfs` or `hrrr`.

In [None]:
# Click to reveal the solution!

ic_repo = client.get_repo("earthmover-public/hrrr")
session = ic_repo.readonly_session("main")

hrrr = xr.open_dataset(session.store, group="solar", engine="zarr", zarr_format=3)

# plot a forecast of solar radiation
hrrr["dswrf"].isel(step=10, time=-5).plot()

## Conclusion

Hopefully this gives you a little taste of how easy it is to find and explore data with the combination of arraylake, icechunk, zarr, and xarray!