# Earth System Data Cube

## The Data Analytics Toolkit for Python

This notebook describes how to access the Earth System Data Cube (ESDC) using the Python Data Analytics Toolkit (DAT). It is meant as a starting point for the exploration and analysis of the ESDC. The Python DAT draws heavily on [xarray](http://xarray.pydata.org/en/stable/), a "pandas-like and pandas-compatible toolkit for analytics on multi-dimensional arrays". Xarray implements the common data model of NetCDF in memory and conserves the structure and all metadata of the data in the ESDC. Thus, the full power of xarray and dask (for out-of-core computation) is immediately available to work with the ESDC.

In the following, typical steps that a first-time user may take to explore the ESDC are introduced, along with common analytical procedures and visualisations. Note, however, that this example is by no means exhaustive: the DAT is fully compatible with the entire Python ecosystem and therefore offers almost unlimited approaches to specific analytical needs.

###  Import the Cablab DAT

In [None]:
from cablab import Cube
import xarray as xr

###  Access Cube on disk

In [None]:
ESDC_path = "/home/jovyan/work/datacube/cablab-datacube-1.0.0/low-res"
cube = Cube.open(ESDC_path)

###  Open returns a Cube object 

In [None]:
cube

### List variable names in the Cube

In [None]:
cube.data.variable_names

###  Data are best handled as xarray datasets. Just like NetCDF files, datasets contain dimensions, variables, and further metadata

In [None]:
ESDC = cube.data.dataset()
ESDC

In [None]:
ESDC.precipitation
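
If the cube is not available on disk, the structure of the dataset can be imitated with a small synthetic `xarray.Dataset`. The following is a minimal sketch: the grid spacing, time range, values, units, and the name `toy` are assumptions chosen for illustration and do not reflect the real ESDC.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical coarse grid; the real ESDC has its own resolution and time axis.
time = pd.date_range("2007-01-01", periods=24, freq="D")
lat = np.arange(85.0, -90.0, -10.0)   # descending (north to south)
lon = np.arange(-175.0, 180.0, 10.0)  # ascending (west to east)

rng = np.random.default_rng(0)
shape = (time.size, lat.size, lon.size)

toy = xr.Dataset(
    {
        "precipitation": (("time", "lat", "lon"), rng.gamma(2.0, 1.0, shape)),
        "evaporation": (("time", "lat", "lon"), rng.normal(2.0, 0.5, shape)),
    },
    coords={"time": time, "lat": lat, "lon": lon},
)
toy["precipitation"].attrs["units"] = "mm/day"  # illustrative metadata only

print(toy)                     # dimensions, coordinates, and variables
print(toy.precipitation.dims)  # dimension names of a single variable
```

Attribute access (`toy.precipitation`) and dictionary access (`toy["precipitation"]`) return the same `DataArray`.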

### The array of values can also be accessed directly

In [None]:
ESDC['lon'].values

###  xarray offers a rich set of built-in convenience functions
See the [API reference](http://xarray.pydata.org/en/stable/api.html) for the full list.

### Mean over all dimensions

In [None]:
ESDC.mean(skipna=True)

### Mean over time and latitude; the result is a DataArray with only the longitude dimension left

In [None]:
precip_avg = ESDC['precipitation'].mean(dim = ["time","lat"], skipna=True)

In [None]:
precip_avg.compute()
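
The effect of `dim` and `skipna` is easiest to see on a tiny array. A minimal sketch with made-up values (the array `da` is hypothetical and not part of the ESDC):

```python
import numpy as np
import xarray as xr

# 2 time steps x 2 latitudes x 3 longitudes, with one missing value
values = np.array(
    [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
     [[7.0, np.nan, 9.0], [10.0, 11.0, 12.0]]]
)
da = xr.DataArray(values, dims=("time", "lat", "lon"), name="precipitation")

# Reducing over time and lat leaves only the lon dimension
lon_mean = da.mean(dim=["time", "lat"], skipna=True)
print(lon_mean.dims)    # ('lon',)
print(lon_mean.values)  # 5.5, 6.0, 7.5 (the NaN is ignored)

# With skipna=False a single NaN propagates into the affected column
lon_mean_strict = da.mean(dim=["time", "lat"], skipna=False)
print(lon_mean_strict.values)
```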

###  Simple plotting with xarray's matplotlib wrappers
#### Import additional libraries

In [None]:
import xarray as xr
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sn

### Select a 2D image (lat/lon) for a given time, using an integer index

In [None]:
precip2d = ESDC.precipitation.isel(time=123)

In [None]:
precip2d

### or given a specific date

In [None]:
precip2d = ESDC['precipitation'].sel(time='2007-04-12', method = 'nearest')
precip2d

In [None]:
precip1d = ESDC['precipitation'].sel(lon = 12.67,lat = 41.83, method = 'nearest')  
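
The difference between positional selection (`isel`) and label-based selection (`sel` with `method='nearest'`) can be checked on a small array. This is a sketch on made-up coordinates; the 8-day time step and the grid values are assumptions, not properties of the ESDC.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2007-04-01", periods=5, freq="8D")  # hypothetical 8-day steps
lat = np.array([50.0, 45.0, 40.0, 35.0])
lon = np.array([5.0, 10.0, 15.0, 20.0])
data = np.arange(time.size * lat.size * lon.size, dtype=float).reshape(5, 4, 4)
da = xr.DataArray(data, coords={"time": time, "lat": lat, "lon": lon},
                  dims=("time", "lat", "lon"))

# Positional selection: the second time step, whatever its date is
by_index = da.isel(time=1)

# Label-based selection: the time step closest to the requested date
by_date = da.sel(time="2007-04-12", method="nearest")
print(by_date.time.values)

# Nearest grid cell to a point; the result is a time series
point = da.sel(lon=12.67, lat=41.83, method="nearest")
print(float(point.lat), float(point.lon))
```

Without `method='nearest'`, `sel` requires the label to match a coordinate value exactly and raises a `KeyError` otherwise.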

#### Plot 2d image with matplotlib

In [None]:
precip2d.plot.pcolormesh(vmax = 5)

#### Time series at a given location (here ESRIN), and histogram of values

In [None]:
fig, ax = plt.subplots(figsize = [12,5], ncols=2)

precip1d.plot(ax = ax[0], color ='red', marker ='.')
ax[0].set_title("Precipitation at ESRIN")
precip1d.plot.hist(ax = ax[1], color ='blue')
ax[1].set_xlabel("precipitation")
plt.tight_layout()

### Make use of the known and stable structure of all data in the ESDC to create high-level methods for visualization


In [None]:
from mpl_toolkits.basemap import Basemap
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

In [None]:
def map_plot(ds, var=None, time = 0, title_str='No title', **kwargs):
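    '''
    Plot a global map of one variable of an ESDC dataset.

    ds: xarray Dataset; var: name of the variable to plot.
    time: integer index, date string (YYYY-MM-DD), or None if ds[var] is already 2D.
    Optional keyword arguments: vmin, vmax (colour limits), plot_me (save figure as PNG).
    Returns the figure, the axes, and the Basemap instance.
    '''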
    if isinstance(time,int):
        res = ds[var].isel(time=time)
    elif time is None:
        res = ds[var]
    else:
        try:
            res = ds[var].sel(time=time, method='nearest')
        except Exception:
            print("Wrong date format, should be YYYY-MM-DD")
            raise
   
    lons, lats = np.meshgrid(np.array(res.lon),np.array(res.lat))
    ma_res = np.ma.array(res, mask =np.isnan(res))
    
    # optional colour limits; None lets matplotlib choose them
    vmin = kwargs.get("vmin")
    vmax = kwargs.get("vmax")
    if title_str == "No title":
        title_str = var + ' ' +str(time)
    else:
        title_str = title_str + ' ' +str(res.time.values)[0:10]
        
    fig = plt.figure()
    ax = fig.add_axes([0.05,0.05,0.9,0.9])
    m = Basemap(projection='kav7',lon_0=0,resolution=None)
    m.drawmapboundary(fill_color='0.3')
    ccmap = plt.cm.jet
    ccmap.set_bad("gray",1.)
    im = m.pcolormesh(lons,lats,ma_res,shading='flat',cmap=ccmap,latlon=True, vmin = vmin, vmax=vmax)
    # lay-out 
    m.drawparallels(np.arange(-90.,99.,30.))
    m.drawmeridians(np.arange(-180.,180.,60.))
    cb = m.colorbar(im,"bottom", size="5%", pad="2%")
    cb.set_label(ds[var].attrs['standard_name']+' ('+ds[var].attrs['units']+')')
    ax.set_title(title_str)
    # write to disk if specified 
    if kwargs.get("plot_me"):
        plt.savefig(title_str[0:15] + '.png', dpi=600)
            
    fig.set_size_inches(8,12)
    return fig ,ax, m

In [None]:
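def DAT_corr(ds, var1 = None, var2 = None, dim ='time'):
    '''
    Pearson correlation between var1 and var2 of dataset ds along dim.
    If var2 is None, var1 is correlated with itself. Returns None if
    ds is not an xarray Dataset or var1 is not given.
    '''
    if not isinstance(ds,xr.Dataset):
        print('Input object ',ds,' is no xarray Dataset!')
        return None
    if var1 is None:
        return None
    if var2 is None:
        var2 = var1
    ds_tmean = ds.mean(skipna=True, dim = dim)
    ds_tstd =  ds.std(skipna=True, dim = dim)
    covar_1 = (ds[var1] - ds_tmean[var1])*(ds[var2] - ds_tmean[var2])
    # average along the same dimension that was used for mean and std
    res = covar_1.mean(dim = dim, skipna=True)/(ds_tstd[var1]*ds_tstd[var2])
    return res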
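
The correlation formula used here (mean product of anomalies divided by the product of the standard deviations) can be cross-checked against NumPy on synthetic data. This is a minimal sketch: the helper `pearson_along_dim` and the random inputs are hypothetical and only illustrate the formula, they are not part of the DAT.

```python
import numpy as np
import xarray as xr

def pearson_along_dim(a, b, dim="time"):
    """Pearson correlation of two DataArrays along `dim` (hypothetical helper)."""
    a_anom = a - a.mean(dim=dim, skipna=True)
    b_anom = b - b.mean(dim=dim, skipna=True)
    covariance = (a_anom * b_anom).mean(dim=dim, skipna=True)
    # xarray's std defaults to the population estimator (ddof=0),
    # which matches the plain mean used for the covariance above.
    return covariance / (a.std(dim=dim, skipna=True) * b.std(dim=dim, skipna=True))

rng = np.random.default_rng(42)
dims = ("time", "lat", "lon")
x = xr.DataArray(rng.normal(size=(50, 3, 4)), dims=dims)
y = 0.5 * x + xr.DataArray(rng.normal(size=(50, 3, 4)), dims=dims)

r = pearson_along_dim(x, y)

# Cross-check one grid cell against NumPy's reference implementation
r_numpy = np.corrcoef(x.values[:, 0, 0], y.values[:, 0, 0])[0, 1]
print(float(r[0, 0]), r_numpy)
```

If the two variables have gaps at different time steps, the means and standard deviations are computed over different samples than the covariance, so the result is then only an approximation of the pairwise-complete correlation.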

In [None]:
fig, ax, m = map_plot(ESDC,'evaporation','2006-03-01',vmax = 6.)


#### Subsetting a geographical sub-region. Note that the slice of the latitude dimension has to be given in reverse order (from north to south).

In [None]:
Europe = ESDC.sel(lat = slice(70.,30.), lon = slice(-20.,35.))
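
Why the order of the latitude slice matters can be seen on a small array whose latitude coordinate is stored in descending order. The coordinate values below are made up for illustration; that the ESDC stores latitude north to south is inferred from the note above.

```python
import numpy as np
import xarray as xr

lat = np.array([80.0, 60.0, 40.0, 20.0, 0.0])     # descending (assumption)
lon = np.array([-30.0, -10.0, 10.0, 30.0, 50.0])  # ascending
da = xr.DataArray(np.zeros((lat.size, lon.size)),
                  coords={"lat": lat, "lon": lon}, dims=("lat", "lon"))

# The slice follows the coordinate order: north to south
north_to_south = da.sel(lat=slice(70.0, 30.0), lon=slice(-20.0, 35.0))
# The slice runs against the coordinate order and selects nothing
south_to_north = da.sel(lat=slice(30.0, 70.0), lon=slice(-20.0, 35.0))

print(north_to_south.lat.values)  # 60 and 40
print(south_to_north.lat.size)    # 0
```

No error is raised in the second case, so an empty result after subsetting is usually a sign that the slice bounds are in the wrong order.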

In [None]:
Europe.mean(dim='time',skipna=True).soil_moisture.plot()

#### Monthly averages (seasonal cycle)

In [None]:
Air_temp_monthly = Europe.air_temperature_2m.groupby('time.month').mean(dim='time')

In [None]:
Air_temp_monthly.plot.imshow(x='lon',y='lat',col='month',col_wrap=3)
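
Grouping by `'time.month'` pools all time steps that fall into the same calendar month, regardless of the year. A minimal sketch on a made-up daily temperature series (the sinusoidal annual cycle and its parameters are assumptions):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two years of hypothetical daily temperatures with a sinusoidal annual cycle
time = pd.date_range("2005-01-01", "2006-12-31", freq="D")
day_of_year = time.dayofyear.values
temperature = 10.0 + 8.0 * np.sin(2.0 * np.pi * (day_of_year - 105) / 365.25)
da = xr.DataArray(temperature, coords={"time": time}, dims=("time",),
                  name="air_temperature_2m")

# One value per calendar month, averaged over both years
monthly = da.groupby("time.month").mean(dim="time")
print(monthly.month.values)

# Month with the highest mean temperature in the toy series
warmest = int(monthly.month.values[np.argmax(monthly.values)])
print(warmest)
```

The result has a new `month` dimension with 12 entries that replaces `time`, which is why the plot above can use `col='month'`.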

#### Z-scores

In [None]:
Europe_zscore = (Europe-Europe.mean(dim='time'))/Europe.std(dim='time')

In [None]:
ESRIN_zscore = Europe_zscore.sel(lon = 12.67,lat = 41.83, method = 'nearest')
ESRIN_zscore.precipitation.plot()
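
A quick sanity check for the standardisation: along the time dimension, z-scores should have a mean of 0 and a standard deviation of 1 at every grid cell. A sketch on random data (the gamma-distributed input is an arbitrary choice for illustration):

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(1)
da = xr.DataArray(rng.gamma(2.0, 3.0, size=(200, 2, 2)),
                  dims=("time", "lat", "lon"))

# Subtract the temporal mean and divide by the temporal standard deviation
zscore = (da - da.mean(dim="time")) / da.std(dim="time")

print(zscore.mean(dim="time").values)  # close to 0 everywhere
print(zscore.std(dim="time").values)   # close to 1 everywhere
```

Grid cells with a constant time series have a standard deviation of zero, so their z-scores are undefined (NaN or infinite).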

#### Using apply() to apply an arbitrary function to all variables in the dataset

In [None]:
Europe.apply(np.nanmax)

### Define your own function for anomaly detection

In [None]:
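def above_Nsigma(x,Nsigma):
    # True where the absolute z-score exceeds Nsigma;
    # the built-in abs() works on xarray objects in all xarray versions
    return abs(x)>Nsigma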

In [None]:
res = Europe_zscore.apply(above_Nsigma,Nsigma = 2)
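
The result of such a function is a boolean dataset, and summing it over time counts the exceedances per grid cell. A minimal sketch with hand-picked z-scores: the helper `above_nsigma` and the values are hypothetical. It uses `Dataset.map`, which in recent xarray releases is the preferred spelling of `Dataset.apply`.

```python
import numpy as np
import xarray as xr

def above_nsigma(x, nsigma):
    """Boolean mask of values whose magnitude exceeds nsigma (hypothetical helper)."""
    return abs(x) > nsigma

# Made-up z-scores: 3 time steps at 2 locations
z = xr.Dataset(
    {
        "precipitation": (("time", "lon"),
                          np.array([[0.5, -2.5], [3.0, 1.0], [-2.1, 2.2]])),
        "evaporation": (("time", "lon"),
                        np.array([[0.1, 0.2], [-0.3, 2.5], [1.9, -4.0]])),
    }
)

# Apply the function to every variable; extra keyword arguments are passed through
mask = z.map(above_nsigma, nsigma=2)

# Summing a boolean mask counts the True values
counts = mask.sum(dim="time")
print(counts.precipitation.values)  # 2 and 2
print(counts.evaporation.values)    # 0 and 2
```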

In [None]:
fig2, ax2 = plt.subplots(figsize = [12,5], ncols=2)

res["precipitation"].sum(dim="time").plot(ax = ax2[0])
ax2[0].set_title("Precipitation: no. of obs. beyond 2 sigma")

res["evaporation"].sum(dim="time").plot(ax = ax2[1])
ax2[1].set_title("Evaporation: no. of obs. beyond 2 sigma")

plt.tight_layout()



In [None]:
df = Europe_zscore.to_dataframe()
df.boxplot(column=["precipitation","evaporation","soil_moisture","ozone"])

In [None]:
df

In [None]:
sn.boxplot(data=df[['precipitation','evaporation','soil_moisture','air_temperature_2m','ozone']])

### Compute the correlation between arbitrary variables in the ESDC

In [None]:
cv = DAT_corr(ESDC, 'precipitation', 'evaporation')

In [None]:
cv.plot.imshow(vmin = -1., vmax = 1.)

#### Test if the function works as expected: the correlation of a variable with itself should be 1.

In [None]:
cv2 = DAT_corr(ESDC, 'precipitation', 'precipitation')
cv2.plot.imshow(vmin = 0.5, vmax = 1.5)