## Exercise 2: Creating a Zarr Data Cube From a Timeseries

We will now create a timeseries data cube over a single tile. To do this, we will read each tile into an xarray `DataArray`, add a `year` dimension to it, and then "stack" the arrays together along the `year` dimension. This yields a three-dimensional cube (year, y, x) of LULC data.

We will employ the following xarray methods:
- [`expand_dims`](https://docs.xarray.dev/en/stable/generated/xarray.DataArray.expand_dims.html) adds a new dimension to the data array. In this case, we are adding a new dimension called "year" with a single value of either 2000, 2005, 2010, 2015, or 2020.
- [`concat`](https://docs.xarray.dev/en/stable/generated/xarray.concat.html) concatenates multiple data arrays along a specified dimension. In this case, we are concatenating the data arrays along the "year" dimension. This is useful for creating a time series dataset from multiple time steps.
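
To make the two methods above concrete before touching the real tiles, here is a minimal sketch on tiny synthetic arrays. The 2 x 3 arrays, their values, and the name `toy_cube` are hypothetical stand-ins, not part of the dataset.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-ins for the real tiles: tiny 2 x 3 arrays, one per year
years = [2000, 2005, 2010]
layers = []
for i, year in enumerate(years):
    tile = xr.DataArray(
        np.full((2, 3), i, dtype="uint8"),
        dims=("y", "x"),
        coords={"y": [0, 1], "x": [0, 1, 2]},
    )
    # Add a length-1 'year' dimension labelled with this tile's year
    layers.append(tile.expand_dims({"year": [year]}))

# Stack the per-year layers into a single (year, y, x) cube
toy_cube = xr.concat(layers, dim="year")
print(toy_cube.dims, toy_cube.shape)
```

Each layer gains a leading `year` axis of length 1, and `concat` joins those axes into one labelled dimension, so a single year can later be selected with `toy_cube.sel(year=2005)`.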

Let's first consider a **naive** approach to this strategy:

TODO: get rid of chunking discussion since this should be small

In [None]:
### BEWARE: RUN ME AT YOUR (AND YOUR KERNEL'S) OWN RISK
import rioxarray
import xarray as xr

file_name = "40N_120W"  # Feel free to change this to any of the other files in the dataset
years = [2000, 2005, 2010, 2015, 2020]
data_arrays = []

for year in years:
    url = f"https://storage.googleapis.com/earthenginepartners-hansen/GLCLU2000-2020/v2/{year}/{file_name}.tif"
    # TODO: slice data here
    da = rioxarray.open_rasterio(url, masked=True)
    # NOTE: expand_dims will read the data in 
    da = da.expand_dims({"year": [year]})  # Add a 'year' dimension
    data_arrays.append(da)

# Concatenate the data arrays along the 'year' dimension
combined = xr.concat(data_arrays, dim="year")  # skip expand_dims
combined

If you attempted to run the above cell, you probably sat idly for several minutes before your kernel gave up. We are attempting to load a large amount of data here, so we will need to optimize a bit. That is where chunking comes in!

### Chunking

Data chunking in xarray (with Dask) is a way to break up large datasets into smaller, manageable pieces ("chunks") that can be processed lazily and in parallel. It’s essential when working with out-of-core data — data too big to fit into memory.
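
As a rough illustration of what "lazy" means here, the sketch below chunks a small in-memory array and defers a computation. It assumes Dask is installed alongside xarray; the 8 x 8 array and the variable names are hypothetical.

```python
import numpy as np
import xarray as xr

# Hypothetical toy array standing in for a much larger tile
arr = xr.DataArray(np.ones((8, 8), dtype="float32"), dims=("y", "x"))

# Split into 4 x 4 chunks; this requires dask to be installed
chunked = arr.chunk({"y": 4, "x": 4})
print(chunked.chunks)  # chunk sizes along each dimension

# Operations on a chunked array only build a task graph...
lazy_total = chunked.sum()

# ...and nothing is computed until we explicitly ask for the result
total = float(lazy_total.compute())
print(total)
```

The same pattern applies to the real tiles: once they are opened with chunks, most xarray operations stay lazy until you call `.compute()`, `.load()`, or plot the result.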

Before we call `expand_dims`, which will load our data in, we need to chunk our xarray data. This leads to the age-old question: **How should I chunk my data?**

While we could rely on `chunks="auto"` to determine optimal chunks for us, let's do some actual math (I know, scary! 😱)


#### Math

- Each tile is `(y: 40,000, x: 40,000)`, so there are `40,000 x 40,000 = 1.6 billion values with dtype=float32`
- Each `float32` is 4 bytes, so the whole array is `1.6e9 x 4 bytes = 6.4 GB`

We want to keep **chunk size between roughly 50 MB and 200 MB** for efficiency and to optimize for **access patterns** (i.e., processing entire rows vs. entire tiles).

Let's target ~100 MB chunks. Each `float32` is 4 bytes, so:
```
chunk_size = (chunk_y, chunk_x)
chunk_memory = chunk_y * chunk_x * 4 bytes
```

**Option 1: Chunk by tiles (e.g. 1000 x 1000)**
```
chunks = {"y": 1000, "x": 1000}
memory_per_chunk = 1000 * 1000 * 4 = 4 MB
```
Too small: this leads to **40** chunks per axis = **1,600 chunks total** 😱 (overhead!)

**Option 2: Bigger tiles (e.g. 4000 x 4000)**
```
chunks = {"y": 4000, "x": 4000}
memory_per_chunk = 4000 * 4000 * 4 = 64 MB
```
This results in 10 `y` chunks and 10 `x` chunks, so `100 total chunks`. This strikes a nice balance between chunk size and number.
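
If you would like to check this arithmetic or try other chunk shapes, a small helper along these lines may be useful. The function name `chunk_stats` is made up for illustration, and it assumes a 2D `(y, x)` array of `float32` values as in the math above.

```python
import math


def chunk_stats(shape, chunks, itemsize=4):
    """Return (bytes per chunk, number of chunks) for a 2D array.

    shape and chunks are (y, x) tuples; itemsize is the number of bytes
    per value (4 for float32).
    """
    chunk_bytes = chunks[0] * chunks[1] * itemsize
    # Round up so that partial chunks at the array edges are counted too
    n_chunks = math.ceil(shape[0] / chunks[0]) * math.ceil(shape[1] / chunks[1])
    return chunk_bytes, n_chunks


tile_shape = (40_000, 40_000)
for chunks in [(1000, 1000), (4000, 4000)]:
    chunk_bytes, n_chunks = chunk_stats(tile_shape, chunks)
    print(f"{chunks}: {chunk_bytes / 1e6:.0f} MB per chunk, {n_chunks} chunks")
```

Passing a different `itemsize` (for example 1 for `uint8`) shows how strongly the dtype affects the chunk size you end up with.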




Let's try the above code again, but with 4000 x 4000 chunks:

In [None]:
import icechunk
import rioxarray
import xarray as xr

file_name = "40N_120W"  # Feel free to change this to any of the other files in the dataset
years = [2000, 2005, 2010, 2015, 2020]
data_arrays = []

for year in years:
    url = f"https://storage.googleapis.com/earthenginepartners-hansen/GLCLU2000-2020/v2/{year}/{file_name}.tif"
    da = rioxarray.open_rasterio(url, chunks={"x": 4000, "y": 4000})
    da = da.rename({"band": "lulc"})
    da = da.squeeze("lulc")
    # NOTE: with Dask-backed chunks, expand_dims stays lazy (no data is read yet)
    da = da.expand_dims(dim={"year": [year]})  # Add a 'year' dimension
    data_arrays.append(da)

# Concatenate the data arrays along the 'year' dimension
combined = xr.concat(data_arrays, dim="year")
combined

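See? So much faster! Because the arrays are now Dask-backed, building the cube only reads metadata and sets up a task graph; no pixel data is loaded until we compute something. Let's discuss the data model here for a moment: we now have a 3D array of shape `(year: 5, y: 40000, x: 40000)`. We have essentially stacked 5 years' worth of 2D (y, x) array data into a data cube. TODO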
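
To get a feel for this data model, here is a small sketch on a toy cube. The class codes, coordinates, and variable names are hypothetical and are not real LULC values; the real `combined` cube can be indexed the same way.

```python
import numpy as np
import xarray as xr

# Hypothetical toy cube: 3 years of 2 x 2 class codes (not real LULC values)
values = np.array(
    [
        [[1, 1], [2, 2]],  # 2000
        [[1, 3], [2, 2]],  # 2010
        [[3, 3], [2, 4]],  # 2020
    ],
    dtype="uint8",
)
toy_cube = xr.DataArray(
    values,
    dims=("year", "y", "x"),
    coords={"year": [2000, 2010, 2020], "y": [0, 1], "x": [0, 1]},
)

# Selecting one year returns an ordinary 2D (y, x) layer
layer_2010 = toy_cube.sel(year=2010)

# Comparing two years flags the pixels whose class changed
changed = toy_cube.sel(year=2000) != toy_cube.sel(year=2020)
n_changed = int(changed.sum())
print(layer_2010.shape, n_changed)
```

Label-based selection along `year` is the main payoff of stacking: each time step stays addressable by its year rather than by a positional index.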

Now, let's do some analysis with our cube!

In [None]:
# TODO: do stuff with combined!
# Replot this data



TODO: segue into Icechunk