# Basic ingredients of cloud computing

## Jupyter notebook / Jupyterlab

This is where you are right now!

In the URL, you can replace `lab` with `tree` to see the classic notebook interface instead of the JupyterLab one.

Documentation and online demo: https://docs.jupyter.org/en/latest/index.html

## Xarray

A library to handle labeled multi-dimensional data.

`DataArray`: a labeled N-dimensional array. `Dataset`: a dictionary-like container of multiple `DataArray` objects that share dimensions.

Alongside the data you get metadata, such as coordinates and attributes.

Documentation and tutorial: https://tutorial.xarray.dev/intro.html
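As a minimal sketch of the data-plus-metadata idea, the snippet below builds a small synthetic `DataArray` and wraps it in a `Dataset`. The name `temperature`, the coordinate values and the attributes are made up for illustration; they are not taken from the dataset opened in the next cells.

```python
import numpy as np
import xarray as xr

# Synthetic values: 3 time steps on a 2 x 4 lat/lon grid (illustrative only)
values = np.arange(24, dtype=float).reshape(3, 2, 4)

temperature = xr.DataArray(
    values,
    dims=("time", "lat", "lon"),
    coords={
        "time": np.array(
            ["2000-01-01", "2000-01-02", "2000-01-03"], dtype="datetime64[ns]"
        ),
        "lat": [10.0, 20.0],
        "lon": [0.0, 90.0, 180.0, 270.0],
    },
    attrs={"units": "degC", "long_name": "synthetic temperature"},
    name="temperature",
)

# A Dataset groups one or more DataArrays under their variable names
toy_ds = xr.Dataset({"temperature": temperature})

print(toy_ds)
print(toy_ds["temperature"].attrs["units"])
```

The printed representation lists dimensions, coordinates, data variables and attributes, which is the same layout you see when displaying `ds` in the cells below.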

In [None]:
import xarray as xr

In [None]:
# Let's open a dataset that is sitting on the cloud
store = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/cmip6-feedstock/CMIP6.CMIP.CCCma.CanESM5.historical.r1i1p1f1.Omon.zos.gn.v20190429.zarr'
ds = xr.open_dataset(store, engine='zarr', chunks={})
ds

In [None]:
# Total size of the dataset
ds.nbytes/1e9

In [None]:
# We extract one variable
ds['zos']

In [None]:
# What is the value at a particular location
ds['zos'][0,100,100].values

In [None]:
# We plot a map at one date
ds['zos'].isel(time=0).plot()

In [None]:
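# Same kind of map, this time selecting the date by label instead of by index
# squeeze() drops the time dimension if the selection keeps it with length 1
ds['zos'].sel(time="2014-01-16").squeeze().plot()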
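The two previous cells use `isel` (selection by integer position) and `sel` (selection by coordinate label). The sketch below shows the difference on a small synthetic field; the array `field` and its coordinate values are hypothetical and only serve the illustration.

```python
import numpy as np
import xarray as xr

# Synthetic 3 x 4 field with labeled coordinates (illustrative values)
field = xr.DataArray(
    np.arange(12.0).reshape(3, 4),
    dims=("lat", "lon"),
    coords={"lat": [-10.0, 0.0, 10.0], "lon": [0.0, 90.0, 180.0, 270.0]},
)

# Positional selection: third latitude, second longitude
by_position = field.isel(lat=2, lon=1)

# Label-based selection: the same point, addressed by its coordinate values
by_label = field.sel(lat=10.0, lon=90.0)

# Label-based selection with a tolerance for values that are not on the grid
nearest = field.sel(lat=8.7, lon=100.0, method="nearest")

print(float(by_position), float(by_label), float(nearest))
```

All three selections point at the same grid cell, so the three printed values are identical.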

In [None]:
# A time series at one location
ds['zos'][:,100,100].plot()

In [None]:
# The dataset is so small we can compute means without parallel computation
ds['zos'].mean(dim='time').plot()

## Intake

A package to organize and disseminate datasets.

Widely used on the cloud to handle catalogs of data.

Documentation: https://intake.readthedocs.io/en/latest/

In [None]:
from intake import open_catalog

Pangeo's online catalog: https://catalog.pangeo.io/

In [None]:
# We can explore the catalog both online and from Python
cat = open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/master.yaml")
list(cat)

In [None]:
# One level down
cat = open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/ocean.yaml")
list(cat)

In [None]:
# Now we open one dataset
from intake import open_catalog
cat = open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/ocean.yaml")
ds = cat["sea_surface_height"].to_dask()
ds

In [None]:
# Total size of the dataset
ds.nbytes/1e9

We don't have this much memory, so this time we need parallel computing.

## Dask

Parallel computing in Python, with tasks scheduled on workers.

Allows computations to scale from a laptop to HPC systems and the cloud.

Documentation: https://docs.dask.org/en/stable/

Click on the Dask tab on the left side of JupyterLab, then on +NEW.

A LocalCluster is launched. Drag and drop it into the notebook below; this inserts a cell that connects a client to the cluster.

Select some dashboard metrics to follow (Progress, Task Stream, Graph, CPU and Cluster Memory), and rearrange the lab windows.
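If the Dask tab is not available, a comparable setup can be created from code. The sketch below is an assumption of what an equivalent cell could look like, not the exact cell generated by the drag and drop: it starts a small in-process `LocalCluster`, connects a `Client` to it, and submits a trivial task. The worker counts are arbitrary, and the dashboard is disabled here only to keep the sketch lightweight.

```python
from dask.distributed import Client, LocalCluster

# Assumption: a small in-process cluster, standing in for the one created
# from the Dask tab. dashboard_address=None turns the dashboard off.
cluster = LocalCluster(
    n_workers=1,
    threads_per_worker=2,
    processes=False,
    dashboard_address=None,
)
client = Client(cluster)

# Submit a trivial task to a worker and wait for its result
future = client.submit(sum, [1, 2, 3])
total = future.result()
print(total)

# Release the resources once done
client.close()
cluster.close()
```

In the notebook itself you would keep the client open and leave the dashboard enabled, so that the following computations show up in the dashboard panels.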

In [None]:
# Size of one variable, in GB
ds.sla.nbytes/1e9

In [None]:
# We only have 15 GB of memory available, but we can still handle this variable
sla_timeseries = ds.sla.mean(dim=('latitude', 'longitude'))

In [None]:
# Nothing is computed until we load the result
sla_timeseries.load()
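The lazy behaviour used above can be reproduced on a small scale. In the sketch below, a synthetic array stands in for `ds.sla` (the shape, values and chunk size are made up): the mean only builds a task graph, and the actual computation happens when `.load()` is called. This assumes Dask is installed alongside Xarray.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for ds.sla, split into two chunks along time
sla = xr.DataArray(
    np.ones((10, 4, 6)),
    dims=("time", "latitude", "longitude"),
    name="sla",
).chunk({"time": 5})

# Lazy: this only records the operations to perform
lazy_mean = sla.mean(dim=("latitude", "longitude"))
chunks_before = lazy_mean.chunks  # not None while the array is Dask-backed
print(chunks_before)

# load() triggers the computation and stores the result in memory
# (it modifies lazy_mean in place and returns it)
loaded = lazy_mean.load()
print(loaded.chunks)
print(loaded.values)
```

After `.load()`, the result is an ordinary in-memory array, which is why the plotting cell that follows does not trigger any further computation.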

In [None]:
# Let's make a plot

import matplotlib.pyplot as plt
sla_timeseries.plot(label='full data')
sla_timeseries.rolling(time=365, center=True).mean().plot(label='rolling annual mean')
plt.ylabel('Sea Level Anomaly [m]')
plt.title('Global Mean Sea Level')
plt.legend()
plt.grid()
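The rolling mean in the plot uses a centered window of 365 time steps, which corresponds to one year if the data are daily. As a small sketch of how a centered rolling mean behaves, the example below applies a window of 3 to a synthetic series (`series` is hypothetical): the edges, where the window is incomplete, come out as NaN.

```python
import numpy as np
import xarray as xr

# Synthetic series 0, 1, ..., 9 along a time dimension
series = xr.DataArray(np.arange(10.0), dims="time")

# Centered window of 3 points: each value is averaged with its two neighbours
smooth = series.rolling(time=3, center=True).mean()
print(smooth.values)
```

The same effect explains why the rolling annual mean curve is shorter than the full time series at both ends of the plot.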

That is all for the basics. Now let's redo some computations from Takaya's paper: [spectra](Spectra-eNATL60.ipynb)