# ERA5 Analysis Demo

ERA5 is a weather reanalysis produced by ECMWF, providing hourly estimates of atmospheric and land variables from 1940 to the present at ~0.25° resolution. The complete dataset is approximately 1 PB in size. For this demo, however, we'll use a subset of the data that includes 18 surface fields from 1975 to 2024.
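
To get a feel for how large even this subset is, here is a rough back-of-the-envelope estimate. It is only a sketch: the 721 x 1440 grid, float32 storage, and complete hourly coverage for 1975 through 2024 are assumptions, not properties confirmed for the store used in this demo.

```python
from datetime import datetime

# Assumptions (not confirmed for this store): regular 0.25-degree global grid,
# float32 values, complete hourly coverage for 1975 through 2024.
N_LAT = 721           # 90N to 90S inclusive at 0.25-degree spacing
N_LON = 1440          # 0 to 359.75 at 0.25-degree spacing
N_FIELDS = 18         # number of surface fields in the subset
BYTES_PER_VALUE = 4   # float32

# Number of hourly time steps between the start of 1975 and the end of 2024
span = datetime(2025, 1, 1) - datetime(1975, 1, 1)
n_hours = span.days * 24

bytes_per_field = N_LAT * N_LON * n_hours * BYTES_PER_VALUE
total_tb = N_FIELDS * bytes_per_field / 1e12

print(f"{n_hours:,} hourly steps")
print(f"~{bytes_per_field / 1e12:.1f} TB per field, "
      f"~{total_tb:.0f} TB for {N_FIELDS} fields (uncompressed)")
```

Under these assumptions the subset comes to tens of terabytes before compression, which is why we read it lazily from object storage rather than downloading it.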

The goals of the demo are to show participants how to:

1. Use Arraylake to discover and access datasets
2. Use Icechunk and Xarray to access the ERA5 dataset directly from cloud object storage
3. Explore Icechunk's version history
4. Create a subset of the ERA5 dataset and write it to a new Icechunk repo

We'll start by importing the Python libraries we need for the demo:

In [None]:
from arraylake import Client

import xarray as xr
from dask.diagnostics import ProgressBar
import cartopy.crs as ccrs

## Authenticate with Arraylake

The Arraylake platform uses OAuth2-style authentication, which lets us govern access to datasets based on each user's identity. The datasets we'll be using today are all public, but when you create your own Icechunk repo, it will be private by default.

Instructions: after running the cell below, open the link provided and follow the login instructions. Use the email address you registered for the hackweek with. If your email is associated with a Google Workspace or Gmail account, you can use that to log in as well.

Note that you'll have to log in separately to the Arraylake web app: https://app.earthmover.io/

In [None]:
client = Client()
client.login()

## Discovering data

In addition to using the Arraylake web app to explore the catalog of data, we can use the Arraylake client to discover datasets in any organization we are allowed to view. Here we'll list the repos in the `ICESAT-2HackWeek` organization:

In [None]:
repos = client.list_repos('ICESAT-2HackWeek')
repos

## Accessing data

We will come back to the ICESat data in the next notebook. For now, we're going to work with the ERA5 surface dataset in the `earthmover-public` organization.

In [None]:
repo = client.get_repo("earthmover-public/era5-surface-aws")  # get the icechunk repository
session = repo.readonly_session("main")                       # checkout a read-only session
era5 = xr.open_zarr(session.store, group="spatial")           # open the dataset with Xarray

display(era5)

Now that we've opened this dataset with Xarray, we can immediately start querying it. Below, we'll make a simple plot from the `t2` variable at a single time step.

In [None]:
# we can immediately start querying this dataset
da = era5['t2'].sel(time='2024-08-20T23')
da.plot()

Of course, because we're now working with Xarray, we can pull in other parts of the open source ecosystem. For example, in the next cell, we'll turn the above plot into a nicely formatted map using Cartopy.

In [None]:
# using Cartopy, we can polish this plot a bit further
plot_kws = dict(
    transform=ccrs.PlateCarree(), 
    subplot_kws=dict(projection=ccrs.Robinson()),
    cbar_kwargs={'orientation': 'horizontal', 'shrink': 0.8, 'pad': 0.1}
)

p = da.plot(robust=True, **plot_kws)

p.axes.set_global()
p.axes.coastlines()

## Exercise 1 -- Plot the Dec 2021 average snow depth over the Northern Hemisphere

In the cell below, plot the average snow depth from Dec 2021 in the Northern Hemisphere.

***Hint: you'll need to use Xarray to sub-select and process the correct part of the dataset using the `.sel` and `.mean` methods.***
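
If the `.sel` and `.mean` pattern is new to you, the sketch below shows it on a small synthetic dataset rather than on ERA5 itself. Everything in it is made up for illustration: the variable name `sd`, the coarse grid, and the values. It assumes latitude is stored north-to-south, which matches the slice bounds used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the real dataset (hypothetical variable name "sd")
time = pd.date_range("2021-11-01", "2022-01-31", freq="D")
lat = np.arange(90, -91, -30)   # 90, 60, ..., -90 (descending)
lon = np.arange(0, 360, 90)     # 0, 90, 180, 270

# Fill each day with its month number so the December mean is easy to check
values = np.broadcast_to(
    time.month.values[:, None, None], (time.size, lat.size, lon.size)
).astype("float32")

toy = xr.Dataset(
    {"sd": (("time", "latitude", "longitude"), values)},
    coords={"time": time, "latitude": lat, "longitude": lon},
)

# A partial date string selects the whole month; because latitude is descending,
# the slice runs from the larger value to the smaller one
dec_nh = toy["sd"].sel(time="2021-12", latitude=slice(90, 0))

# Average over time only, leaving a 2D (latitude, longitude) field to plot
dec_nh_mean = dec_nh.mean("time")
print(dec_nh_mean.shape)
```

The same two steps should carry over to the real dataset once you have found the name of the snow depth variable in the `display(era5)` output.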

## Trend Analysis

We can also use Xarray to estimate temperature trends in the ERA5 dataset. We'll use the `temporal` group in our Icechunk store, which holds the same data but is chunked differently to support more efficient time-series queries.

***Hint: if you are running on a smaller computer or VM, you may want to further limit the spatial area or time bounds in the example below.***

In [None]:
era5 = xr.open_zarr(session.store, group="temporal")           # open the dataset with Xarray

# 1. Greenland region extraction (efficient spatial subsetting)
greenland = era5.sel(latitude=slice(85, 59), longitude=slice(290, 350))
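
The bounds in the selection above may look odd if you are used to longitudes in the -180 to 180 range. The sketch below shows where they would come from, assuming the store uses 0 to 360 longitudes and north-to-south latitudes (inferred from the slice bounds, not confirmed here). The helper `to_0_360` is hypothetical and the west/east values are approximate.

```python
def to_0_360(lon_deg: float) -> float:
    """Convert a longitude from the -180..180 convention to 0..360."""
    return lon_deg % 360


# Approximate west and east bounds for Greenland in degrees east (illustrative)
west, east = -70, -10

lon_slice = slice(to_0_360(west), to_0_360(east))

# With latitude stored north-to-south, the slice runs from high to low
lat_slice = slice(85, 59)

print(lon_slice, lat_slice)
```

If you pick your own region, convert any negative longitudes the same way before passing them to `.sel`.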

In [None]:
with ProgressBar():
    t2_trend = greenland['t2'].groupby('time.year').mean().polyfit('year', 1, skipna=True).load()

In [None]:
# plot the slope of the linear fit (the degree-1 coefficient) for each grid cell
t2_trend.polyfit_coefficients.sel(degree=1).plot(robust=True)

## Exercise 2 -- Calculate the trend in peak winter snowfall over Alaska

## Back to Icechunk

Now that we've got our hands dirty with Xarray+Icechunk, let's come back to some of the Icechunk features we skipped past earlier on. We're specifically interested here in exploring the version history of the ERA5 repository. Let's start by listing the branches in the repository then looking at the commit history (aka ancestry):

In [None]:
repo.list_branches()

In [None]:
hist = repo.ancestry(branch="main")
for ancestor in hist:
    print(ancestor.id, ancestor.message, ancestor.written_at)

Inspecting this version history, we can see how the dataset was constructed! 

## Exercise 3 -- Check out the repository at a prior snapshot

***Hint: `repo.readonly_session` takes a `snapshot_id` parameter.***

In [None]:
repo.readonly_session?

## Write your first Icechunk repo

As the final part of this notebook, we're each going to write our first Icechunk repo to Arraylake. 

***Hint: I recommend keeping your subset of ERA5 relatively small - particularly if you are on a small VM.***

In [None]:
## create your first icechunk dataset

org = 'ICESAT-2HackWeek'
my_name = 'jhamman'  # <-- put your name or github id here

repo = client.create_repo(f'{org}/era5-subset-{my_name}')

In [None]:
session = repo.writable_session('main')

# bonus -- replace the greenland dataset with a subset of ERA5 that is more interesting to you!
subset = greenland[['t2', 'tcc']].isel(time=slice(24)).drop_encoding().load()

subset.to_zarr(session.store, zarr_format=3, consolidated=False, mode='w')

# inspect the status of your session before committing
session.status()

In [None]:
session.commit('🧊 my first icechunk commit! 🚀')