# Part 1: *Something catchy here*

## Exercise 1: GeoTIFF to Zarr with a single tile

In this exercise, we will load a single GeoTIFF into Xarray using [rioxarray](https://corteva.github.io/rioxarray/html/modules.html) and show how to navigate the Xarray repr. We will then make a few quick visualizations of the tile and save the Xarray data to Zarr.

To start, let's read in a single GLAD LULC tile from the year 2000 from Google Cloud. The data can also be downloaded to local files [here](https://storage.googleapis.com/earthenginepartners-hansen/GLCLU2000-2020/v2/download.html). We will use rioxarray's [`open_rasterio`](https://corteva.github.io/rioxarray/html/rioxarray.html#rioxarray-open-rasterio) for this operation:

In [None]:
import rioxarray

year = 2000  # Feel free to change this to 2005, 2010, 2015, or 2020
file_name = "50N_120W"  # Feel free to change this to any of the other files in the dataset

url = f"https://storage.googleapis.com/earthenginepartners-hansen/GLCLU2000-2020/v2/{year}/{file_name}.tif"

da = rioxarray.open_rasterio(url, masked=True)
da

Now, let's examine the data structure....

TODO: Tom to fill in

It's important to note that we did not read all of the tile data here; we only read the metadata, which is why this was so quick! The data itself is loaded only when an operation needs direct access to the values, such as plotting or writing to Zarr. For tiles this large, those operations will need some additional optimizations.
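
To see why lazy loading matters, it helps to estimate how large a full tile would be in memory. The sketch below is back-of-the-envelope arithmetic only. It assumes a tile of 40,000 × 40,000 pixels (a 10° × 10° tile at 4000 pixels per degree) stored as one byte per pixel, and it assumes that `masked=True` promotes the values to 4-byte floats so that missing data can be represented as NaN. `estimate_nbytes` and `tile_shape` are illustrative names; check `da.shape` and `da.dtype` for the real values.

```python
import math


def estimate_nbytes(shape, itemsize):
    """Return the in-memory size in bytes of a dense array."""
    return math.prod(shape) * itemsize


# Hypothetical full-tile shape (band, y, x); verify against da.shape.
tile_shape = (1, 40_000, 40_000)

# 1 byte per pixel (assumed 8-bit integer storage)
raw_gb = estimate_nbytes(tile_shape, itemsize=1) / 1e9
# 4 bytes per pixel (assumed float32 after masking)
masked_gb = estimate_nbytes(tile_shape, itemsize=4) / 1e9

print(f"unmasked: {raw_gb:.1f} GB, masked: {masked_gb:.1f} GB")
```

Under these assumptions a single tile takes several gigabytes once loaded, which is why the first exercises work on a small subset.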

For this first part of the tutorial, we will subset this tile to expedite the first few exercises. We will discuss optimizations when we get to building the global Zarr dataset. 

In [None]:
# Select a subset of the data
x_slice = slice(-112.5, -111.5)  # use slice(None) for the whole tile
y_slice = slice(41, 40.5)  # use slice(None) for the whole tile
da_sample = da.sel(x=x_slice, y=y_slice)
da_sample
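
Note that the `y` slice above runs from the larger value to the smaller one. In a north-up raster the `y` coordinate usually decreases from top to bottom, and label-based slicing has to follow the direction of the index. Here is a minimal sketch on a synthetic array (the `toy` array and its coordinates are made up for illustration):

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a north-up raster: y decreases from top to bottom.
y = np.array([41.0, 40.75, 40.5, 40.25])
x = np.array([-112.5, -112.0, -111.5])
toy = xr.DataArray(
    np.arange(12).reshape(4, 3),
    coords={"y": y, "x": x},
    dims=("y", "x"),
)

# Slice in the same direction as the (descending) y index.
north_to_south = toy.sel(y=slice(41, 40.5))
# Slicing against the index direction selects nothing.
south_to_north = toy.sel(y=slice(40.5, 41))

print(north_to_south.sizes["y"], south_to_north.sizes["y"])
```

If a selection unexpectedly comes back empty, checking the order of the coordinate values is a good first step.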

You will find that many Level 3 geospatial datasets store their data in a single band. rioxarray exposes it along a dimension named "band" (very original!), as it does here. Let's rename this dimension to "lulc" to be a bit more explicit.

We will also remove the `lulc` dim. Since it has only one value, it holds no additional information along that axis, so removing it simplifies the array from 3D to 2D.

In [None]:
da_sample = da_sample.rename({"band": "lulc"})
da_sample = da_sample.squeeze("lulc")
da_sample
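
The effect of the rename and squeeze is easiest to see on a tiny array. This is a sketch with synthetic data (the `toy` array is made up for illustration), not the real tile:

```python
import numpy as np
import xarray as xr

# A (band, y, x) array with a single band, mimicking the layout of a one-band raster.
toy = xr.DataArray(
    np.zeros((1, 2, 3), dtype="uint8"),
    coords={"band": [1], "y": [41.0, 40.5], "x": [-112.5, -112.0, -111.5]},
    dims=("band", "y", "x"),
)

renamed = toy.rename({"band": "lulc"})  # rename the dimension and its coordinate
squeezed = renamed.squeeze("lulc")      # drop the length-1 dimension

print(renamed.dims, "->", squeezed.dims)
# The old band label survives as a scalar (non-dimension) coordinate.
print(squeezed.coords["lulc"].item())
```

Pass `drop=True` to `squeeze` if you would rather discard the leftover scalar coordinate as well.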

### Plotting

Visualization is essential for geospatial data. How can we know that our data was correctly loaded into xarray without actually looking at it? Below are a few different approaches to plotting xarray data in a notebook. 

**Cloud vs. Local Latencies**

Note that the data must be loaded before it can be plotted. Reading from cloud storage has higher latency than reading from local disk, so the same plot can take much longer to render when the data comes from a cloud source.

**Visualizing in QGIS**

QGIS natively supports TIFFs and GeoTIFFs...

In [None]:
# Load the data into memory from the cloud
# This may take a while depending on your internet connection, the size of the file,
# and whether the file is local or in cloud storage
# Reads are slower still when the source file is not cloud optimized
# TODO: is this cloud optimized data?
da_sample = da_sample.load()
da_sample

#### Leafmap

[leafmap](https://leafmap.org/) is a good fit for plotting Xarray data because it combines the mapping power of Leaflet (via `ipyleaflet` or `folium`) with convenient tools for handling raster and vector geospatial data, including Xarray. It can convert Xarray DataArrays into interactive map layers with support for time sliders, colorbars, and basemaps, which makes it especially useful for visualizing geospatial time series or remote sensing data with minimal setup.

In [None]:
import leafmap

def plot_leafmap(data_to_plot):
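    # Center the map on the data (lat, lon) rather than on a fixed location
    center = (float(data_to_plot.y.mean()), float(data_to_plot.x.mean()))
    m = leafmap.Map(center=center, zoom=11)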
    m.add_raster(data_to_plot, colormap="tab20", layer_name="LULC")
    m.add("inspector")
    return m

In [None]:
plot_leafmap(da_sample)

#### hvPlot

[hvPlot](https://hvplot.holoviz.org/) is well suited to large Xarray datasets because it integrates with Xarray, supports Dask for lazy evaluation, and uses Datashader to render millions of points by aggregating them down to screen resolution. It also produces interactive, zoomable plots with minimal code, which makes it a good choice for exploring complex geospatial or time series data.

We discourage the use of a Dask-backed Xarray dataset in this plotting example because **

In [None]:
import hvplot.xarray  # needed for hvplot
import hvplot.pandas  # needed for tile sources
import holoviews as hv
from holoviews.element.tiles import EsriImagery  # or other tile source

hv.extension('bokeh')

def plot_hvplot(data_to_plot):
    # rasterize=True enables Datashader for large datasets and downsamples
    # the data using the given aggregation method
    img = data_to_plot.hvplot.image(
        x='x',
        y='y',
        cmap='viridis',
        aggregator="first",
        rasterize=True,
        frame_width=500,
        dynamic=True,
        geo=True,
    )
    return EsriImagery() * img

In [None]:
plot_hvplot(da_sample)

### Writing Data to Zarr

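Before we move on from this single data tile, let's write our subset of data to Zarr using Xarray's [`to_zarr`](https://docs.xarray.dev/en/latest/generated/xarray.Dataset.to_zarr.html) method, which is available on both Datasets and DataArrays. The two arguments we pass are:

- `store`: where to write the data, for example a path to a directory on a local or remote file system.
- `group`: the path of the group within the store to write into.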

In [None]:
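store = ""  # TODO: set to the path where the Zarr store should be written
group = ""  # TODO: set to the group to write into
# Write the subset (da_sample), not the full tile (da)
da_sample.to_zarr(store=store, group=group)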

TODO: discussion of the Zarr data model using written out files

We can also easily read this dataset back into Xarray with [`open_zarr`](https://docs.xarray.dev/en/stable/generated/xarray.open_zarr.html):

In [None]:
import xarray as xr

ds = xr.open_zarr(store=store, group=group)
ds

## Exercise 2: Creating a Zarr Data Cube From a Timeseries

We will now create a time series data cube over a single tile. To do this, we will read each tile into an Xarray DataArray, add a `year` dimension, and then "stack" the tiles together along that dimension. This will yield a three-dimensional cube `(year, y, x)` of LULC data.

We will first use a naive approach to illustrate this general flow with a very small sample of data. We will then work up to using more advanced approaches like Icechunk for version control and virtualization for ****.

But let's start with a straightforward example. We will use Xarray's [`concat`](https://docs.xarray.dev/en/stable/generated/xarray.concat.html) function, which concatenates multiple data arrays along a specified dimension. Here that dimension is a new one called "year", with one value per tile: 2000, 2005, 2010, 2015, or 2020. Concatenating along "year" creates a "stack" of 2D arrays, one per year. This pattern is useful for building a time series dataset from multiple time steps.
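
Before running this on real tiles, here is the same stacking pattern on a few tiny synthetic arrays. It is only a sketch: the `layers` are filled with dummy values, and the names are made up for illustration.

```python
import numpy as np
import xarray as xr

years = [2000, 2005, 2010]
y = [41.0, 40.5]
x = [-112.5, -112.0, -111.5]

# One small 2D array per year, filled with the year's position as a dummy value.
layers = [
    xr.DataArray(
        np.full((2, 3), i, dtype="uint8"),
        coords={"y": y, "x": x},
        dims=("y", "x"),
    )
    for i, _ in enumerate(years)
]

# A DataArray passed as `dim` creates the new dimension and labels it.
year_coord = xr.DataArray(years, dims="year", name="year")
cube = xr.concat(layers, dim=year_coord)

print(cube.dims, cube.shape)
```

The new dimension is placed first, so the result has dimensions `(year, y, x)`.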

In [None]:
import rioxarray
import xarray as xr

file_name = "50N_120W"  # Feel free to change this to any of the other files in the dataset
years = [2000, 2005, 2010, 2015, 2020]
x_slice = slice(-112.5, -111.5)
y_slice = slice(41, 40.5)
data_arrays = []

for year in years:
    url = f"https://storage.googleapis.com/earthenginepartners-hansen/GLCLU2000-2020/v2/{year}/{file_name}.tif"
    da = rioxarray.open_rasterio(url)
    da = da.rename({"band": "lulc"})
    da = da.squeeze("lulc")
    # Subset the data to the area of interest
    da = da.sel(x=x_slice, y=y_slice)
    data_arrays.append(da)

# Concatenate the data arrays along the 'year' dimension
# NOTE: this call reads all the data into memory and may take a while for large datasets
combined = xr.concat(data_arrays, dim=xr.DataArray(years, dims="year"))
combined

We just read five times as much data into memory as we did in Exercise 1. The subset we are using is small, so the operation was relatively quick. If we wanted to use this naive approach with a larger AOI (say, the whole tile), we would want to consider chunking.

In the [`rioxarray.open_rasterio()`](https://corteva.github.io/rioxarray/html/rioxarray.html#rioxarray-open-rasterio) call, we have the option of specifying `chunks`. Chunking in Xarray (with Dask) breaks a large dataset into smaller, manageable pieces ("chunks") that can be processed lazily and in parallel. It is essential when working with out-of-core data, that is, data too big to fit into memory.

This often leads to the age-old question: **How should I chunk my data?** See the Appendix for a walkthrough on how to calculate chunks based on the desired chunk size.
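
As a rough starting point, a chunk shape can be derived from a target chunk size in bytes. The sketch below is illustrative arithmetic only: the helper `square_chunk_side`, the 16 MiB target, and the 256-pixel alignment are all assumptions chosen for the example, not recommendations from this tutorial.

```python
import math


def square_chunk_side(target_bytes, itemsize, multiple_of=1):
    """Side length in pixels of a square 2D chunk of roughly target_bytes."""
    side = math.isqrt(target_bytes // itemsize)
    # Round down to a multiple (for example, a file's internal block size)
    # so chunks stay aligned, but never go below one multiple.
    return max(multiple_of, (side // multiple_of) * multiple_of)


# Hypothetical target: 16 MiB chunks of 1-byte pixels, aligned to 256 pixels.
side = square_chunk_side(target_bytes=16 * 1024**2, itemsize=1, multiple_of=256)
print(side)

# The result could then be passed when opening a tile, for example:
# da = rioxarray.open_rasterio(url, chunks={"x": side, "y": side})
```

Larger data types shrink the side length for the same byte budget, so it is worth recomputing after any dtype change such as masking.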

Let's discuss the data model here for a moment: we now have a 3D array of shape `(year: 5, y: 2000, x: 4000)`. We have essentially stacked five years' worth of 2D `(y, x)` array data into a data cube. TODO
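
One way to get a feel for the cube is to slice it along different axes. The sketch below uses a small synthetic cube with random class values (made up for illustration) rather than the real `combined` array:

```python
import numpy as np
import xarray as xr

years = [2000, 2005, 2010, 2015, 2020]
rng = np.random.default_rng(0)

# Synthetic (year, y, x) cube with dummy class values 0 to 3.
cube = xr.DataArray(
    rng.integers(0, 4, size=(5, 2, 3), dtype="uint8"),
    coords={"year": years, "y": [41.0, 40.5], "x": [-112.5, -112.0, -111.5]},
    dims=("year", "y", "x"),
)

# One year: a 2D (y, x) layer.
snapshot = cube.sel(year=2010)
# One pixel through time: a 1D (year,) series.
pixel_series = cube.sel(y=41.0, x=-112.0)
# Pixels whose class differs between the first and last year: a boolean (y, x) map.
changed = cube.sel(year=2020) != cube.sel(year=2000)

print(snapshot.dims, pixel_series.dims, changed.dims)
```

The same selections apply to the real cube; only the sizes of the `y` and `x` dimensions differ.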

Now let's visualize our data cube!

In [None]:
plot_leafmap(combined)

In [None]:
plot_hvplot(combined)

Let's take this exercise a step further and explore how we can integrate Icechunk into this workflow for data versioning and management. 