## Exercise 3: Doing it all again with Icechunk 🧊

[Icechunk](https://icechunk.io/en/latest/overview/) augments the Zarr storage model to provide additional features such as data version control and transactions. For production-grade datasets, it is an ideal complement to the Zarr storage model for geospatial data. Let's run through the same data cube exercise above, but with Icechunk!

In [1]:
import rioxarray
import xarray as xr
import icechunk

In [2]:
# Initialize Icechunk storage and repo
storage = icechunk.local_filesystem_storage("/Users/tom/Documents/Work/Code/workshop-cng-2025-zarr/ic")
ic_repo = icechunk.Repository.create(storage)

In [38]:
# Start a new writable Icechunk session on the main branch
session = ic_repo.writable_session(branch="main")
icechunk_store = session.store

Let's get some data and write it to the store.

In [25]:
def get_lulc_data(year: str, filename: str) -> xr.DataArray:
    # use the `filename` argument rather than relying on a global variable
    url = f"https://storage.googleapis.com/earthenginepartners-hansen/GLCLU2000-2020/v2/{year}/{filename}.tif"
    da = rioxarray.open_rasterio(url, chunks={"x": 4000, "y": 4000})

    # remove the useless band dimension
    da = da.squeeze("band", drop=True)

    # set the name of the data variable to something more informative
    da.name = "lulc"
    return da

In [26]:
file_name = "40N_120W"  # Feel free to change this to any of the other files in the dataset

In [27]:
da_2000 = get_lulc_data("2000", file_name)

In [28]:
# Add a 'year' dimension
da_2000 = da_2000.expand_dims(dim={"year": [2000]})  

In [29]:
da_2000

| | Array | Chunk |
|---|---|---|
| Bytes | 1.49 GiB | 15.26 MiB |
| Shape | (1, 40000, 40000) | (1, 4000, 4000) |
| Dask graph | 100 chunks in 4 graph layers | 100 chunks in 4 graph layers |
| Data type | uint8 numpy.ndarray | uint8 numpy.ndarray |


In [31]:
%%time
da_2000.to_zarr(icechunk_store, consolidated=False)

CPU times: user 20.2 s, sys: 4.01 s, total: 24.2 s
Wall time: 1min 15s


<xarray.backends.zarr.ZarrStore at 0x179a2cf70>

That took a while because the array is lazy: writing it forced xarray to fetch every chunk from remote storage into memory, and only then could each chunk be written out to disk.
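To put that wall time in context, the amount of data moved can be worked out from the shape, chunk shape, and dtype shown in the repr above. The following is a minimal sketch of that arithmetic; the helper name `array_size_summary` is made up for illustration and is not part of xarray or Dask.

```python
from math import ceil, prod


def array_size_summary(shape, chunks, itemsize):
    """Return total size (GiB), chunk size (MiB) and chunk count for an array."""
    # number of chunks along each axis, multiplied together
    n_chunks = prod(ceil(s / c) for s, c in zip(shape, chunks))
    return {
        "array_gib": prod(shape) * itemsize / 2**30,
        "chunk_mib": prod(chunks) * itemsize / 2**20,
        "n_chunks": n_chunks,
    }


# uint8 has an item size of 1 byte
summary = array_size_summary((1, 40000, 40000), (1, 4000, 4000), itemsize=1)
print(
    f"{summary['array_gib']:.2f} GiB, "
    f"{summary['chunk_mib']:.2f} MiB per chunk, "
    f"{summary['n_chunks']} chunks"
)
# prints: 1.49 GiB, 15.26 MiB per chunk, 100 chunks
```

Passing `itemsize=8` gives the sizes for the same array held as float64, which is eight times larger.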

In [32]:
session.commit("wrote data for the year 2000")

'RFKD68YZ1CHNC78VMZ80'

Let's open the Icechunk store we just created and take a look. An Icechunk session exposes a Zarr store, so we can open it with xarray's `open_zarr` function:

In [36]:
roundtrip = xr.open_zarr(icechunk_store, consolidated=False)
roundtrip

| | Array | Chunk |
|---|---|---|
| Bytes | 11.92 GiB | 122.07 MiB |
| Shape | (1, 40000, 40000) | (1, 4000, 4000) |
| Dask graph | 100 chunks in 2 graph layers | 100 chunks in 2 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


In [None]:
# TODO there's a bug here: `spatial_ref` should still be a coordinate

Now let's add the other years of data in a loop, committing each year separately.

In [39]:
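for year in [2005, 2010, 2015]:
    # a session becomes read-only once committed, so open a fresh one per commit
    session = ic_repo.writable_session(branch="main")

    da_this_year = get_lulc_data(str(year), file_name)
    # add the 'year' dimension so the data can be appended along it
    da_this_year = da_this_year.expand_dims(dim={"year": [year]})

    # TODO change the chunking before writing here...?

    da_this_year.to_zarr(session.store, append_dim="year", consolidated=False)
    session.commit(f"wrote data for the year {year}")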

ValueError: failed to prevent overwriting existing key add_offset in attrs. This is probably an encoding field used by xarray to describe how a variable is serialized. To proceed, remove this key from the variable's attributes manually.
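The error message suggests removing the conflicting key from the variable's attributes by hand. Below is a minimal sketch of that step, assuming the clash comes from `add_offset` (and possibly `scale_factor`) being carried in the DataArray's `attrs`. The helper name `strip_encoding_attrs` and the default key list are hypothetical, and the helper works on a plain dictionary so it runs without any data.

```python
from typing import Iterable, Mapping

# Keys assumed (not confirmed) to clash with xarray's encoding on append
ENCODING_KEYS = ("add_offset", "scale_factor")


def strip_encoding_attrs(attrs: Mapping, keys: Iterable[str] = ENCODING_KEYS) -> dict:
    """Return a copy of `attrs` without the given encoding-related keys."""
    drop = set(keys)
    return {k: v for k, v in attrs.items() if k not in drop}


# Illustrative attributes only, not taken from the real dataset
example_attrs = {"add_offset": 0.0, "scale_factor": 1.0, "long_name": "land cover"}
cleaned = strip_encoding_attrs(example_attrs)
print(cleaned)
```

In the loop this might be applied as `da_this_year.attrs = strip_encoding_attrs(da_this_year.attrs)` before calling `to_zarr`. Whether that is enough to make the append succeed has not been verified here.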

In [None]:
ds = xr.open_zarr(icechunk_store, consolidated=False)
ds

We have a datacube! It's Zarr, so this approach scales, and it's also Icechunk, so we have the full version history of the data.

In [None]:
# TODO print commit history somehow
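One possible way to fill in this step is to walk the ancestry of the `main` branch and print one line per commit. The sketch below only formats snapshot-like objects. It assumes each one has `id`, `message`, and `written_at` attributes, which is how the Icechunk Python API is understood to describe snapshots; check this against the installed version. The function name `format_history` and the stand-in objects are made up so that the example runs without a repository.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def format_history(snapshots) -> list[str]:
    """Format snapshot-like objects (with .id, .message, .written_at) as text lines."""
    lines = []
    for snap in snapshots:
        when = snap.written_at.strftime("%Y-%m-%d %H:%M")
        lines.append(f"{snap.id}  {when}  {snap.message}")
    return lines


# Stand-in objects, NOT real snapshot IDs or timestamps
fake_history = [
    SimpleNamespace(
        id="SNAPSHOT_B",
        message="second commit",
        written_at=datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc),
    ),
    SimpleNamespace(
        id="SNAPSHOT_A",
        message="first commit",
        written_at=datetime(2025, 1, 1, 9, 30, tzinfo=timezone.utc),
    ),
]

history_lines = format_history(fake_history)
print("\n".join(history_lines))
```

With the repository from this exercise, the call would presumably be `format_history(ic_repo.ancestry(branch="main"))`, where `ancestry` is assumed to yield the snapshots from newest to oldest.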

But wait! We forgot the final year of data (2020).

With plain Zarr this could be a big problem: we would either have to rewrite the entire store (wasteful) or write only the new chunks and edit the metadata in place (unsafe if someone else is reading from the store at the same time).

But with Icechunk this is both efficient and safe!

In [None]:
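# the previous session became read-only on commit, so start a new one
session = ic_repo.writable_session(branch="main")

da_this_year = get_lulc_data("2020", file_name)
da_this_year = da_this_year.expand_dims(dim={"year": [2020]})

# TODO change the chunking before writing here...?

da_this_year.to_zarr(session.store, append_dim="year", consolidated=False)
session.commit("wrote data for the year 2020")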

## Virtualization

This is cool, but we're duplicating the data in the TIFF files into a new location (in the chunks of the Zarr store). What if we didn't have to duplicate the data?

Icechunk stores references to chunks. Those chunks can live in the Icechunk store itself, or outside of it, as "virtual" chunks.
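The idea of native versus virtual chunk references can be illustrated with a small toy model. This is a conceptual sketch only: the class names, the manifest dictionary, and the chunk keys are hypothetical and do not reflect Icechunk's actual on-disk format or API. A native reference carries the bytes itself, while a virtual reference points at a byte range inside some other file.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class NativeChunk:
    """Chunk whose bytes live inside the store itself."""
    data: bytes


@dataclass(frozen=True)
class VirtualChunk:
    """Chunk described by a byte range in an external file."""
    location: str
    offset: int
    length: int


ChunkRef = Union[NativeChunk, VirtualChunk]


def read_chunk(ref: ChunkRef) -> bytes:
    """Resolve a chunk reference to bytes, wherever they are stored."""
    if isinstance(ref, NativeChunk):
        return ref.data
    with open(ref.location, "rb") as f:
        f.seek(ref.offset)
        return f.read(ref.length)


# A stand-in for an existing file (e.g. a TIFF) that we do not want to copy
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(b"HEADERpixelsFOOTER")
    external_path = tmp.name

# Toy manifest: chunk key -> reference
manifest = {
    "lulc/c/0/0/0": NativeChunk(data=b"native"),
    "lulc/c/1/0/0": VirtualChunk(location=external_path, offset=6, length=6),
}

native_bytes = read_chunk(manifest["lulc/c/0/0/0"])
virtual_bytes = read_chunk(manifest["lulc/c/1/0/0"])
os.remove(external_path)
print(native_bytes, virtual_bytes)
```

The reader of a chunk does not need to know which kind of reference it has. That is what lets existing files be exposed as part of a data cube without duplicating their bytes.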