### Survey
TODO: Quick survey of who is in the room, with what background?
TODO: make some slides and a QR code for this survey

## Concepts

2. Conceptually: Level-2 swath vs Level-3+ datacubes
    1. Xarray data model; `glad.tif`
        1. dims/coordinates/data variables
    2. GDAL Raster data model; `glad.tif`
    3. Zarr data model : `glad.zarr`
    4. Icechunk version control model?
3. Spatial Reference Systems in GeoTIFF & CF
    1. how is this stored, how do you access it.

### Level 2 vs. Level 3 data
- L2:
- L3: 

### COG Data Model

### Xarray Data Model

### Zarr Storage Model

## GLAD LULC Data

In this tutorial, we will be working with the Global Land Analysis & Discovery (GLAD) Global Land Cover and Land Use Change, 2000-2020 dataset: https://glad.umd.edu/dataset/GLCLUC2020 [1]

This 30m dataset consists of 10x10 degree tiles of combined land cover for 2000, 2005, 2010, 2015 and 2020, as well as 2000-2020 land cover/use change. We will be using the annual land use and land cover data for this tutorial, but feel free to challenge yourself and ingest the land use change data on your own!

TODO

### Legend
TODO

### Data Format

This dataset is formatted as individual 10x10 degree granules stored in TIF files, one per year. 
TODO

![alt text](../assets/GLAD_data_coverage.png "Title")
Taken from: https://storage.googleapis.com/earthenginepartners-hansen/GLCLU2000-2020/v2/download.html

We will start by converting one of these tiles for a single year into Zarr and visualizing the result. Then we will convert the same tile over all available years into a Zarr data cube. Finally, we will build up to creating a global Zarr data cube for all years.

# Part 1: *Something catchy here*

## Exercise 1: COG to Zarr with a single tile

In this exercise, we will load in a single GeoTIFF into xarray using [rioxarray](https://corteva.github.io/rioxarray/html/modules.html) and show how to navigate the xarray repr. We will then do some quick visualizations of the tile and save out the xarray data to Zarr. 

To start, let's read in a single GLAD LULC tile from the year 2000 from Google Cloud. The data can also be downloaded to local files [here](https://storage.googleapis.com/earthenginepartners-hansen/GLCLU2000-2020/v2/download.html). We will use rioxarray's [`open_rasterio`](https://corteva.github.io/rioxarray/html/rioxarray.html#rioxarray-open-rasterio) for this operation:

In [7]:
import rioxarray

year = 2000  # Feel free to change this to 2005, 2010, 2015, or 2020
file_name = "40N_120W"  # Feel free to change this to any of the other files in the dataset
url = f"https://storage.googleapis.com/earthenginepartners-hansen/GLCLU2000-2020/v2/{year}/{file_name}.tif"

da = rioxarray.open_rasterio(url, masked=True)
da

It's important to note that we did not read any of the tile's pixel data here; we only read the metadata, which is why this was so quick! The data will have to be loaded for operations that require direct data access, such as plotting and writing to Zarr. For tiles this large, those operations will require some additional optimizations.

Now, let's examine the data structure....

TODO: Tom to fill in

You will find that for a lot of Level 3 geospatial datasets, the data is stored in a single band (often named "band" -- very original!) as it is here. Let's rename this band to "lulc" just to be a bit more explicit. 

We will also remove the `lulc` dim since it only has one value. Since it's size 1, it doesn't hold additional information along that axis, so removing it will simplify the array shape from a 3D array to a 2D array.

In [8]:
da = da.rename({"band": "lulc"})
da = da.squeeze("lulc")
da
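To see what the rename and squeeze steps do to the array shape without downloading anything, here is a minimal sketch on a synthetic array. The `toy` array and its tiny `(1, 4, 5)` shape are stand-ins invented for illustration; only the `band` and `lulc` names come from the cell above.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a single-band raster: (band, y, x) = (1, 4, 5)
toy = xr.DataArray(
    np.zeros((1, 4, 5), dtype="uint8"),
    dims=("band", "y", "x"),
    coords={"band": [1]},
)

# Rename the size-1 dimension, then drop it from the shape
toy = toy.rename({"band": "lulc"}).squeeze("lulc")

print(toy.dims)   # ('y', 'x')
print(toy.shape)  # (4, 5)
```

The `lulc` coordinate is not lost: after `squeeze` it remains attached as a scalar coordinate, it just no longer contributes an axis.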

### Plotting

Visualization is essential for geospatial data. How can we know that our data was correctly loaded into xarray without actually looking at it? Below are a few different approaches to plotting xarray data in a notebook. 

**Cloud vs. Local Latencies**

Note that the data must be loaded in before it can be plotted; loading data in from a cloud source vs. from your local machine can cause a large disparity in runtime. Loading data from the cloud has higher latency. This tile is also quite large. To avoid long runtimes or kernel crashes, consider downloading the data locally, only trying to visualize a slice of the full tile, or TODO

**Visualizing in QGIS**

QGIS natively supports TIFF and GeoTIFF files...

In [4]:
# Load the data into memory
# This may take a while depending on your internet connection, the size of the file, and whether it is local or in cloud storage
da = da.load()
da

#### hvPlot

[hvPlot](https://hvplot.holoviz.org/) is great for large xarray datasets because it integrates well with xarray, supports Dask for lazy evaluation, and leverages Datashader to efficiently render millions of points without performance loss. It also enables interactive, zoomable plots with minimal code, making it ideal for exploring complex geospatial or time-series data.

We discourage the use of dask-backed xarray dataset in this plotting example because **

In [None]:
import hvplot.xarray

def plot_hvplot(da, x_slice, y_slice):
    data_to_plot = da.isel(x=x_slice, y=y_slice) if (x_slice and y_slice) else da
    # rasterize=True will enable datashading for large datasets
    return data_to_plot.hvplot.image(x='x', y='y', cmap='viridis', rasterize=True, frame_width=500, dynamic=True, geo=True)

plot_hvplot(da, x_slice=None, y_slice=None)
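The helper above accepts `x_slice` and `y_slice` so that only part of the tile is plotted. As a rough sketch of what those arguments look like, the example below applies Python `slice` objects with `isel` on a small synthetic array; `toy` and its `(100, 200)` shape are assumptions made purely for illustration.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the tile: a small (y, x) array instead of 40,000 x 40,000
toy = xr.DataArray(np.zeros((100, 200), dtype="uint8"), dims=("y", "x"))

# Integer-position slices: the first 10 rows and the first 20 columns
y_slice = slice(0, 10)
x_slice = slice(0, 20)

subset = toy.isel(x=x_slice, y=y_slice)
print(subset.shape)  # (10, 20)
```

With the real tile, something like `plot_hvplot(da, x_slice=slice(0, 4000), y_slice=slice(0, 4000))` would restrict the plot to one corner of the tile.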

#### Leafmap

[leafmap](https://leafmap.org/) is good for plotting xarray data because it combines the mapping power of Leaflet (via `ipyleaflet` or `folium`) with convenient tools for handling raster and vector geospatial data, including xarray. It can automatically convert xarray DataArrays into interactive map layers, supporting time sliders, colorbars, and basemaps — making it especially useful for visualizing geospatial timeseries or remote sensing data with minimal setup.

In [5]:
import leafmap

def plot_leafmap(da, x_slice, y_slice):
    data_to_plot = da.isel(x=x_slice, y=y_slice) if (x_slice and y_slice) else da
    m = leafmap.Map(center=(40, -100), zoom=4)
    m.add_raster(data_to_plot)
    return m

plot_leafmap(da, x_slice=None, y_slice=None)


### Writing Data to Zarr

Before we move on from this single data tile, let's write our data to Zarr using xarray's [`to_zarr`](https://docs.xarray.dev/en/latest/generated/xarray.Dataset.to_zarr.html) method.

- `store`:
- `group`:

In [None]:
store = ""
group = ""
da.to_zarr(store=store, group=group)

We can also easily read this dataset back into Xarray with [`open_zarr`](https://docs.xarray.dev/en/stable/generated/xarray.open_zarr.html). Note that we set `chunks` here to load the data into `dask` arrays. We will discuss chunking more in the next section of the workshop.

In [None]:
import xarray as xr

ds = xr.open_zarr(store=store, group=group, chunks={"x": 2048, "y": 2048})
ds

## Exercise 2: Creating a Zarr Data Cube From a Timeseries

We will now create a timeseries data cube over a single tile. To do this, we will read each year's file for that tile into an xarray DataArray, add a `year` dimension to it, and then "stack" the arrays together along the `year` dimension. This will yield a 3-dimensional cube `(year, y, x)` of LULC data.

We will employ the following xarray methods:
- [`expand_dims`](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.expand_dims.html) adds a new dimension to the data array. In this case, we are adding a new dimension called "year" with a single value of either 2000, 2005, 2010, 2015, or 2020.
- [`concat`](https://docs.xarray.dev/en/stable/generated/xarray.concat.html) concatenates multiple data arrays along a specified dimension. In this case, we are concatenating the data arrays along the "year" dimension. This is useful for creating a time series dataset from multiple time steps.
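The two methods can be seen working together on small synthetic layers. In this sketch the `(3, 4)` layers and their fill values are made up for illustration; only the `year` dimension name and the year values follow the text above.

```python
import numpy as np
import xarray as xr

years = [2000, 2005, 2010]
layers = []

for i, year in enumerate(years):
    # Synthetic 2D layer filled with a different value for each year
    layer = xr.DataArray(np.full((3, 4), i, dtype="uint8"), dims=("y", "x"))
    # expand_dims adds a length-1 'year' dimension as the first axis
    layers.append(layer.expand_dims({"year": [year]}))

# concat joins the length-1 arrays along the shared 'year' dimension
cube = xr.concat(layers, dim="year")

print(cube.dims)   # ('year', 'y', 'x')
print(cube.shape)  # (3, 3, 4)
```

Because `year` is now a labelled coordinate, a single layer can be pulled back out with `cube.sel(year=2005)`.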

Let's first consider a **** approach to this strategy:

In [None]:
### BEWARE: RUN ME AT YOUR (AND YOUR KERNEL'S) OWN RISK
import rioxarray
import xarray as xr

file_name = "40N_120W"  # Feel free to change this to any of the other files in the dataset
years = [2000, 2005, 2010, 2015, 2020]
data_arrays = []

for year in years:
    url = f"https://storage.googleapis.com/earthenginepartners-hansen/GLCLU2000-2020/v2/{year}/{file_name}.tif"
    da = rioxarray.open_rasterio(url, masked=True)
    # NOTE: expand_dims will read the data in 
    da = da.expand_dims({"year": [year]})  # Add a 'year' dimension
    data_arrays.append(da)

# Concatenate the data arrays along the 'year' dimension
combined = xr.concat(data_arrays, dim="year")
combined

If you attempted to run the above cell, you probably sat idly for several minutes before your kernel gave up. We are attempting to load a large amount of data here, so we will need to optimize a bit. That is where chunking comes in!

### Chunking

Data chunking in xarray (with Dask) is a way to break up large datasets into smaller, manageable pieces ("chunks") that can be processed lazily and in parallel. It’s essential when working with out-of-core data — data too big to fit into memory.

Before we call `expand_dims`, which will load our data in, we need to chunk our xarray data. This leads to the age-old question: **How should I chunk my data?**

While we could rely on `chunks="auto"` to determine optimal chunks for us, let's do some actual math (I know, scary! 😱)


- Each tile is `(y: 40,000, x: 40,000)`, so there are `40,000 x 40,000 = 1.6 billion values with dtype=float32`
- Each `float32` is 4 bytes, so the whole array is `1.6e9 x 4 bytes = 6.4 GB`

We want to keep **chunk size between ~50 MB and ~200 MB** for efficiency, and to choose a chunk shape that suits our **access patterns** (i.e. processing entire rows vs. entire tiles).

Let's target ~100 MB chunks. Each `float32` value is 4 bytes, so:
```
chunk_size = (chunk_y, chunk_x)
chunk_memory = chunk_y * chunk_x * 4 bytes
```

**Option 1: Chunk by tiles (e.g. 1000 x 1000)**
```
chunks = {"y": 1000, "x": 1000}
memory_per_chunk = 1000 * 1000 * 4 = 4 MB
```
Too small: this leads to **40** chunks per axis = **1,600 chunks total**, each only 4 MB 😱 (overhead!)

**Option 2: Bigger tiles (e.g. 4000 x 4000)**
```
chunks = {"y": 4000, "x": 4000}
memory_per_chunk = 4000 * 4000 * 4 = 64 MB
```
This results in 10 `y` chunks and 10 `x` chunks, so `100 total chunks`. This strikes a nice balance between chunk size and number.
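The arithmetic for both options can be checked with a few lines of plain Python. The helper name `chunk_stats` is hypothetical and introduced only for this sketch; the tile shape and the 4-byte `float32` size are the values used in the text above.

```python
import math


def chunk_stats(shape, chunks, itemsize):
    """Return (bytes per chunk, total number of chunks) for a regular chunk grid."""
    bytes_per_chunk = math.prod(chunks) * itemsize
    n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))
    return bytes_per_chunk, n_chunks


tile_shape = (40_000, 40_000)  # (y, x)
float32_bytes = 4

for chunks in [(1000, 1000), (4000, 4000)]:
    size, count = chunk_stats(tile_shape, chunks, float32_bytes)
    print(f"{chunks}: {size / 1e6:.0f} MB per chunk, {count} chunks")
```

Trying a few other chunk shapes this way is a quick check on whether a candidate lands in the target size range before committing to it.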




Let's try the above code again, but with 4000 x 4000 chunks:

In [14]:
import rioxarray
import xarray as xr

years = [2000, 2005, 2010, 2015, 2020]
data_arrays = []

for year in years:
    url = f"https://storage.googleapis.com/earthenginepartners-hansen/GLCLU2000-2020/v2/{year}/40N_080W.tif"
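    # Passing chunks returns a lazy, dask-backed array; without masked=True the data stays uint8
    da = rioxarray.open_rasterio(url, chunks={"x": 4000, "y": 4000})
    da = da.rename({"band": "lulc"})
    da = da.squeeze("lulc")
    # NOTE: on a dask-backed array, expand_dims stays lazy and does not read the data
    da = da.expand_dims(dim={"year": [year]})  # Add a 'year' dimension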
    data_arrays.append(da)

# Concatenate the data arrays along the 'year' dimension
combined = xr.concat(data_arrays, dim="year")
combined

| | Array | Chunk |
|---|---|---|
| Bytes | 7.45 GiB | 15.26 MiB |
| Shape | (5, 40000, 40000) | (1, 4000, 4000) |
| Dask graph | 500 chunks in 21 graph layers | |
| Data type | uint8 numpy.ndarray | |


See? So much faster! Let's discuss the data model here for a moment: we now have a 3D array of shape `(year: 5, y: 40000, x: 40000)`. We have essentially stacked 5 years' worth of 2D `(y, x)` array data into a data cube. TODO
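The sizes reported in the output above can be reproduced by hand. The sketch below assumes one byte per value, matching the `uint8` data type shown in the output; this is presumably because the chunked cell does not pass `masked=True`, which is also why each chunk is smaller than the 64 MB estimated for `float32`.

```python
import math

shape = (5, 40_000, 40_000)  # (year, y, x)
chunk = (1, 4_000, 4_000)
itemsize = 1                 # uint8 is one byte per value

array_bytes = math.prod(shape) * itemsize
chunk_bytes = math.prod(chunk) * itemsize
n_chunks = math.prod(s // c for s, c in zip(shape, chunk))

print(f"{array_bytes / 2**30:.2f} GiB")  # 7.45 GiB
print(f"{chunk_bytes / 2**20:.2f} MiB")  # 15.26 MiB
print(n_chunks)                          # 500
```

Note that the repr uses binary units (GiB, MiB), so 8 billion bytes shows up as 7.45 GiB rather than 8 GB.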

Now, let's do some analysis with our cube!

In [None]:
# TODO: do stuff with combined!

TODO: segue into Icechunk

## Exercise 3: Doing it all again with Icechunk 🧊

1. one year per commit.

## Virtualization

TODO: virtualizarr discussion
2. point out that virtual zarr is possible.
    1. Data model is separable from the file format 

# Part 2: something about the whole world!

## Zonal Statistics

## Masking

## Reprojection

# The Grand Finale

# Appendix

### Supplemental Materials
- Zarr documentation:
- Xarray documentation:
- Rioxarray documentation:
- Icechunk documentation:
- VirtualiZarr documentation:
- Dask documentation:
- Arraylake documentation:

### References

[1] Potapov, P., Hansen, M.C., Pickens, A., Hernandez-Serna, A., Tyukavina, A., Turubanova, S., Zalles, V., Li, X., Khan, A., Stolle, F. and Harris, N., 2022. The global 2000-2020 land cover and land use change dataset derived from the Landsat archive: first results. Front. Remote Sens. 3:856903. https://doi.org/10.3389/frsen.2022.856903

Data is provided under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).