# Scaling up compute - Working with CM2.6

This is supposed to be a 'real-life' science workflow example: no tiny, well-behaved test cases. Instead, you will see an approximation of how your science project could evolve, based on some of my experiences. The actual results do not make a lot of sense yet, however, so there is still some work needed 😜.

I will try to convey some core concepts that hopefully convince you of the awesomeness of **Analysis-Ready Cloud Optimized** (ARCO) data + flexible/scalable cloud computing:

- Quickly loading/exploring huge datasets without *downloading them*

- Analysis, scaled on demand, transitioning from exploration to heavy processing in minutes.

As an example, I have chosen to work on the [NOAA-GFDL CM2.6 high-resolution coupled climate simulation](https://www.gfdl.noaa.gov/high-resolution-climate-modeling/).

## Reading the data (lazily)

In [3]:
# copy paste from catalog

In [4]:
# cut the dataset to the first ~3600 time steps (about 10 years)
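
To make the "lazy" part concrete, here is a minimal sketch that uses a small synthetic, dask-backed dataset instead of the real CM2.6 store. The grid size, chunking, and the dimension names `yt_ocean`/`xt_ocean` are assumptions for illustration; only `surface_temp` and the ~3600-step cut come from this notebook.

```python
import dask.array as da
import numpy as np
import xarray as xr

# Synthetic stand-in for the real store (assumed sizes, much smaller grid).
n_time, n_y, n_x = 7300, 90, 120
data = da.zeros((n_time, n_y, n_x), chunks=(365, n_y, n_x), dtype="float32")
ds = xr.Dataset(
    {"surface_temp": (("time", "yt_ocean", "xt_ocean"), data)},
    coords={"time": np.arange(n_time)},
)

# Cut to the first ~3600 time steps. Nothing is computed or loaded yet.
ds_cut = ds.isel(time=slice(0, 3600))
print(ds_cut.surface_temp.shape)
print(type(ds_cut.surface_temp.data))  # still a dask array, not NumPy
print(f"{ds_cut.surface_temp.nbytes / 1e6:.1f} MB if fully loaded")

# Only now is data actually materialized, and only for one time slice.
first = ds_cut.surface_temp.isel(time=0).load()
print(type(first.data))
```

With a real cloud store the pattern is the same: opening and slicing only touch metadata, and bytes are fetched when you call `.load()`, `.compute()`, or plot.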

## Exploring the data

### A quick plot

### Something a bit more exciting

[hvplot docs](https://hvplot.holoviz.org/index.html)

## Moving from exploration to processing large amounts of data
So far, all of the data has been streamed on demand: we only load data when we need it, e.g. when plotting a certain time slice.

Let's run a somewhat larger computation.

In [None]:
# from dask.diagnostics import ProgressBar
# with ProgressBar():
#     # .load() processes the full data array and brings the result into memory as a NumPy array
#     display(ds.surface_temp.mean().load())

This will eventually finish...

...but I **hate** waiting!

![](https://media.giphy.com/media/d31vwWHR0gLcLU76/giphy.gif)

This uses dask under the hood to parallelize the computation, but there is only so much we can do with 4 CPU cores (or up to 16, depending on the server you chose). After all, we are averaging 300 GB here!

Depending on the dataset size, it might be worth setting up a distributed dask cluster.
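
A minimal sketch of what starting such a cluster could look like with `dask_gateway`, assuming the hub you are running on has a Dask Gateway preconfigured. The helper name `start_gateway_cluster` is hypothetical, and the worker count of 10 simply mirrors the number used in this notebook; the cores per worker depend on the deployment.

```python
def start_gateway_cluster(n_workers=10):
    """Start a Dask Gateway cluster and return (cluster, client)."""
    # Imported here so the helper can be defined even without a gateway nearby.
    from dask_gateway import Gateway

    gateway = Gateway()              # picks up the hub's default gateway config
    cluster = gateway.new_cluster()  # ask the gateway for a new scheduler
    cluster.scale(n_workers)         # request workers; new nodes can take minutes
    client = cluster.get_client()    # later dask computations run on the cluster
    print(cluster.dashboard_link)    # the dashboard link for this cluster
    return cluster, client


# Uncomment on a hub with Dask Gateway available:
# cluster, client = start_gateway_cluster()
```

Remember to call `cluster.shutdown()` when you are done so the workers do not keep running.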

**Make sure to copy the link 👆 into the dask sidebar**

So wait, what is happening here? 

By starting a Gateway cluster, we provisioned more Kubernetes nodes in the cloud. These nodes run dask and can do work for us (we got 10*16 cores), as instructed from this notebook (which only has a few cores)!


This might take a few minutes to start up, but it is definitely worth it for very large computations.
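
To get a feel for why the extra cores matter, here is a back-of-envelope sketch. The per-core throughput of 50 MB/s is a made-up number for illustration only, and the estimate assumes perfect scaling with no network or scheduling overhead, which real clusters will not achieve.

```python
def seconds_to_scan(total_bytes, n_cores, bytes_per_core_per_s):
    """Idealized time to stream `total_bytes` through `n_cores` cores."""
    return total_bytes / (n_cores * bytes_per_core_per_s)


TOTAL_BYTES = 300e9   # ~300 GB, the size quoted in this notebook
PER_CORE_RATE = 50e6  # ASSUMPTION: 50 MB/s per core, purely illustrative

for label, cores in [("notebook server", 4), ("gateway cluster", 10 * 16)]:
    t = seconds_to_scan(TOTAL_BYTES, cores, PER_CORE_RATE)
    print(f"{label:>16}: {cores:3d} cores, roughly {t / 60:.1f} min")
```

Under these assumptions the speedup is simply the ratio of core counts (160 / 4 = 40x), which comfortably outweighs a few minutes of cluster startup for a job this size.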

## How about some filtering?

Following the example in the [gcm-filters docs](https://gcm-filters.readthedocs.io/en/latest/gpu.html#filtering-on-cpus-versus-gpus).
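
As a rough sketch of what such a filtering step could look like with `gcm_filters`: the helper name `filter_surface_temp`, the filter scale, the dimension names `yt_ocean`/`xt_ocean`, and the way the land mask is derived from NaNs are all assumptions for illustration, not necessarily what the linked docs or this notebook use.

```python
def filter_surface_temp(da, filter_scale=4, dx_min=1):
    """Spatially smooth `da` with a Gaussian filter, treating NaNs as land."""
    # Imported lazily so the helper can be defined without gcm-filters installed.
    import gcm_filters

    # ASSUMPTION: land points are NaN in the first time step.
    wet_mask = da.isel(time=0, drop=True).notnull().astype("float32").load()

    gfilter = gcm_filters.Filter(
        filter_scale=filter_scale,  # in units of dx_min (grid points here)
        dx_min=dx_min,
        filter_shape=gcm_filters.FilterShape.GAUSSIAN,
        grid_type=gcm_filters.GridTypes.REGULAR_WITH_LAND,
        grid_vars={"wet_mask": wet_mask},
    )
    # ASSUMPTION: horizontal dimension names; adjust to match the dataset.
    filtered = gfilter.apply(da.fillna(0.0), dims=["yt_ocean", "xt_ocean"])
    # Put the land mask back so land points stay NaN in the result.
    return filtered.where(wet_mask > 0)


# Hypothetical usage on the dataset from above:
# filtered = filter_surface_temp(ds.surface_temp)
```

For dask-backed input the result should stay lazy, but the horizontal dimensions generally need to sit in a single chunk, so chunking along `time` only is the safer layout.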

In [None]:
# let's write the results out

In [5]:
# and plot the results again

## Oh wait, my original dataset is not enough. 

I just figured out I need to look at some reanalysis data 😱

But it turns out someone else has already [ingested ERA5 for us](https://github.com/google-research/arco-era5) so we can continue the ARCO awesomeness!

In [None]:
import xarray
era5 = xarray.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/1959-2022-full_37-1h-0p25deg-chunk-1.zarr-v2",
    chunks={'time': 48},
    consolidated=True,
)
era5 = era5.isel(time=slice(0, 50000))
era5

In [None]:
era5['2m_temperature']

In [None]:
mean_era5 = era5['2m_temperature'].mean('time').load()

In [None]:
mean_era5.plot(robust=True)

In [None]:
cluster.shutdown()