# Annual updating of AusEFlux <img align="right" src="https://github.com/cbur24/AusEFlux/blob/master/results/banner_picture.png?raw=True" width="40%">

This notebook contains the workflow for annual updating of the product. It has four main steps, with instructions provided in the subsections below. Pay close attention to the `Analysis Parameters` sections and ensure that paths and other settings are correct.

***
**Ideal compute environment:**

Assuming 5-km resolution

- NCI's 'normal' queue
- X-large (24 cores, 95GiB)
- Python 3.10.0
- Python venv: `/g/data/os22/chad_tmp/AusEFlux/env/py310`
- Folders: `gdata/os22+gdata/ub8+gdata/xc0+gdata/gh70`
***
> **Expected completion time to run all steps: ~3 hours**

## Import libraries and set up Dask

In [1]:
%load_ext autoreload
%autoreload 2

import numpy as np
import warnings
warnings.simplefilter(action='ignore')

import sys
sys.path.append('/g/data/os22/chad_tmp/AusEFlux/src/')
from _utils import start_local_dask

In [2]:
client = start_local_dask(mem_safety_margin='2Gb')
client



Client: distributed.LocalCluster (connection method: Cluster object)
Dashboard: /proxy/8787/status
Status: running, Using processes: True
Workers: 1, Total threads: 26, Total memory: 124.00 GiB
Scheduler comm: tcp://127.0.0.1:40569
Worker comm: tcp://127.0.0.1:35907, Nanny: tcp://127.0.0.1:38659
Local directory: /jobfs/119839944.gadi-pbs/dask-scratch-space/worker-x2ci2oed


## Set up project directory structure

This workflow assumes a specific file/folder structure. Here we create that folder structure to support the rest of the process.

Below, enter the root directory location where project results and data are stored.

In [3]:
base='/g/data/os22/chad_tmp/AusEFlux/'

In [4]:
from _utils import create_project_directories
create_project_directories(root_dir=base)

Directory /g/data/os22/chad_tmp/AusEFlux//data/5km already exists
Directory /g/data/os22/chad_tmp/AusEFlux//data/interim already exists
Directory /g/data/os22/chad_tmp/AusEFlux//data/ozflux_netcdf already exists
Directory /g/data/os22/chad_tmp/AusEFlux//data/training_data already exists
Directory /g/data/os22/chad_tmp/AusEFlux//results/AusEFlux already exists
Directory /g/data/os22/chad_tmp/AusEFlux//results/cross_val already exists
Directory /g/data/os22/chad_tmp/AusEFlux//results/figs already exists
Directory /g/data/os22/chad_tmp/AusEFlux//results/models already exists
Directory /g/data/os22/chad_tmp/AusEFlux//results/predictions already exists
Directory /g/data/os22/chad_tmp/AusEFlux//results/predictions/ensemble/historical/GPP already exists
Directory /g/data/os22/chad_tmp/AusEFlux//results/models/ensemble/GPP already exists
Directory /g/data/os22/chad_tmp/AusEFlux//results/AusEFlux/GPP already exists
Directory /g/data/os22/chad_tmp/AusEFlux//results/cross_val/ensemble/GPP already
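The directory set-up can be sketched with the standard library. This is a minimal illustration only, not the implementation of `create_project_directories`: the helper name `make_project_dirs` is hypothetical and the subfolder list is copied from the first lines of the log output above. Building paths with `pathlib` also avoids the doubled `//` seen in that output, which presumably arises from string concatenation with a `base` that already ends in `/`.

```python
import tempfile
from pathlib import Path

# Hypothetical subfolder list, copied from the log output above.
SUBDIRS = (
    "data/5km",
    "data/interim",
    "data/ozflux_netcdf",
    "data/training_data",
    "results/AusEFlux",
    "results/cross_val",
    "results/figs",
    "results/models",
    "results/predictions",
)


def make_project_dirs(root_dir, subdirs=SUBDIRS):
    """Create each subfolder under root_dir; return the paths newly created."""
    created = []
    for sub in subdirs:
        path = Path(root_dir) / sub  # pathlib absorbs a trailing slash on root_dir
        if path.exists():
            print(f"Directory {path} already exists")
        else:
            path.mkdir(parents=True)
            created.append(str(path))
    return created


# Demonstrate in a throwaway directory rather than on /g/data.
with tempfile.TemporaryDirectory() as tmp:
    first_run = make_project_dirs(tmp + "/")   # creates every folder
    second_run = make_project_dirs(tmp + "/")  # finds them all, creates none
```

Running the helper twice is safe: the second call only reports the folders that already exist, which mirrors the behaviour shown in the output above.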

## Step 1: Spatiotemporal harmonisation of input datasets

Most datasets are originally from here: https://thredds.nci.org.au/thredds/catalog/ub8/au/catalog.html

Datasets from this process are output as annual layers in `data/interim`.

**Expected completion time ~2hrs**

### Analysis Parameters

* `base`: Path to where most of the data is stored
* `results`: Path to store interim datasets after they have undergone harmonisation
* `year_start`: The first year in the series to process. If processing a single year, make _year_start_ and _year_end_ the same (the cell below sets both from a single `year` variable).
* `year_end`: The last year in the series to process. If processing a single year, make _year_start_ and _year_end_ the same.
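The relationship between `year_start` and `year_end` can be illustrated with a short sketch. It assumes the range is inclusive at both ends, which is what the "make them the same for a single year" instruction implies; the helper name `years_to_process` is hypothetical and is not part of the AusEFlux code.

```python
def years_to_process(year_start, year_end):
    """Return the inclusive list of years from year_start to year_end."""
    if year_end < year_start:
        raise ValueError("year_end must not be earlier than year_start")
    return list(range(year_start, year_end + 1))


single = years_to_process(2023, 2023)  # one-year update
multi = years_to_process(2021, 2023)   # three-year series
```

Checking the range before launching a two-hour job is a cheap way to catch swapped arguments.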

In [5]:
base = '/g/data/ub8/au/'
results='/g/data/os22/chad_tmp/AusEFlux/data/interim/'
year = 2023

### Run step 1



In [6]:
from _harmonisation import spatiotemporal_harmonisation

In [None]:
spatiotemporal_harmonisation(
    year_start=year,
    year_end=year,
    base_path=base,
    results_path=results,
    verbose=True
)

Create land/sea mask
Process NDWI, estimated time 10 mins/year
  2023


  return func(*(_execute_task(a, cache) for a in args))

Process kNDVI, estimated time 10 mins/year
  2023
Process NDVI, estimated time 1 min/year
  2023
Process LST, estimated time 5 mins/year
  2023
Process Veg Height, estimated time 1 mins/year
  2023
Process Tavg, estimated time 80 mins/year
  Tmin 2023


## Step 2: Create feature datasets

Combine the results of the spatiotemporal harmonisation into temporally stacked netcdf files, and create new features/variables based on the climate datasets (e.g. anomalies) and the remote sensing datasets (e.g. vegetation fractions).
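As an illustration of what an anomaly feature is, the sketch below subtracts the long-term mean of each calendar month from the values for that month. It is a generic example under stated assumptions: the helper `monthly_anomalies` and the rainfall numbers are invented for the demonstration, and the AusEFlux code may use a different baseline period or a standardised anomaly.

```python
from collections import defaultdict
from statistics import mean


def monthly_anomalies(series):
    """series: list of (year, month, value) tuples.

    Returns (year, month, anomaly) tuples, where the anomaly is the value
    minus the mean of all values for the same calendar month.
    """
    by_month = defaultdict(list)
    for _, month, value in series:
        by_month[month].append(value)
    climatology = {month: mean(values) for month, values in by_month.items()}
    return [(y, m, v - climatology[m]) for y, m, v in series]


# Made-up monthly rainfall (mm) for two calendar months over three years.
rain = [
    (2021, 1, 100.0), (2022, 1, 140.0), (2023, 1, 120.0),
    (2021, 7, 20.0), (2022, 7, 30.0), (2023, 7, 40.0),
]
anoms = monthly_anomalies(rain)
```

Here January 2021 is 20 mm below the January mean of 120 mm, so its anomaly is negative, while January 2023 sits exactly on the mean.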

**Expected completion time ~6 mins**

### Analysis Parameters

* `base`: Path to where the harmonised datasets output from Step 1 are stored. 
* `results`: Path to store temporally stacked netcdf files i.e. where the outputs of Step 2 will be stored
* `exclude`: Variables to exclude from combining, i.e. variables in `/interim` output by Step 1 that are not needed hereafter.
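The effect of `exclude` can be sketched as a simple filter over the entries of the interim folder. This assumes each variable sits in its own entry under `data/interim`, which is a guess about the layout; `list_variables` is a hypothetical helper, not a function from `_feature_datasets`.

```python
import os
import tempfile


def list_variables(folder, exclude):
    """Return the sorted entry names in folder that are not listed in exclude."""
    return sorted(name for name in os.listdir(folder) if name not in exclude)


# Build a small stand-in for data/interim in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["NDWI", "kNDVI", "Tavg", "Tmax", "Tmin", ".ipynb_checkpoints"]:
        os.mkdir(os.path.join(tmp, name))
    kept = list_variables(
        tmp, exclude=[".ipynb_checkpoints", "kTavg", "Tmax", "Tmin", "EVI"]
    )
```

Names in `exclude` that do not exist in the folder (here `kTavg` and `EVI`) are simply ignored, so the same list can be reused across runs.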

In [None]:
base = '/g/data/os22/chad_tmp/AusEFlux/data/interim/'
results='/g/data/os22/chad_tmp/AusEFlux/data/5km/'
exclude = ['.ipynb_checkpoints', 'kTavg', 'Tmax', 'Tmin', 'EVI']

### Run step 2

In [None]:
from _feature_datasets import create_feature_datasets

In [None]:
%%time
create_feature_datasets(base=base,
                       results_path=results,
                       exclude=exclude,
                       verbose=True
                       )

## Step 3: Predict ensemble

Using the ensemble of models, we will generate an ensemble of gridded predictions.
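Conceptually, every model in the ensemble is applied to the same gridded features, giving one prediction layer per member. The sketch below shows that idea with toy stand-ins: the lambda "models", the two-feature pixels, and the helper `predict_with_ensemble` are all assumptions for illustration and do not reflect how `predict_ensemble` is implemented.

```python
def predict_with_ensemble(models, features):
    """Apply every model to every feature row.

    Returns one list of predictions per ensemble member.
    """
    return [[model(row) for row in features] for model in models]


# Toy "models": linear functions of two features, standing in for trained regressors.
toy_models = [
    lambda row: 10.0 * row[0] + 0.5 * row[1],
    lambda row: 12.0 * row[0] + 0.4 * row[1],
    lambda row: 9.0 * row[0] + 0.6 * row[1],
]

# Two pixels, each described by two made-up feature values.
pixels = [(0.5, 20.0), (0.2, 30.0)]

ensemble = predict_with_ensemble(toy_models, pixels)
```

The spread between members at each pixel is what the later combining step summarises as an uncertainty range.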

**Expected completion time ~45 mins**

### Analysis Parameters

* `model_var`: Which variable are we modelling? Must be one of 'GPP', 'ER', 'NEE', or 'ET'
* `base`: Path to the project root. The other path strings are built from `base` to reduce their length.
* `results_path`: Path to store the ensemble predictions, i.e. where the outputs of Step 3 will be stored
* `year_start`: The first year in the series to predict. If predicting for a single year, make _year_start_ and _year_end_ the same.
* `year_end`: The last year in the series to predict. If predicting for a single year, make _year_start_ and _year_end_ the same.
* `models_folder`: Path to the folder where the ensemble of models is stored
* `features_list`: Path to the text file listing the features used by the models
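Because the paths for this step are all derived from `base`, `model_var` and `year_start`, a small helper can validate the inputs and build them in one place. This is a sketch only: `build_step3_paths` is a hypothetical name, and the path templates are copied from the parameter cell in this section.

```python
VALID_VARS = ("GPP", "ER", "NEE", "ET")


def build_step3_paths(base, model_var, year_start):
    """Hypothetical helper: validate model_var and derive the Step 3 paths."""
    if model_var not in VALID_VARS:
        raise ValueError(f"model_var must be one of {VALID_VARS}, got {model_var!r}")
    base = base.rstrip("/") + "/"  # tolerate a missing or doubled trailing slash
    return {
        "results_path": f"{base}results/predictions/ensemble/annual_update/{year_start}/{model_var}/",
        "models_folder": f"{base}results/models/ensemble/{model_var}/",
        "features_list": f"{base}results/variables.txt",
    }


paths = build_step3_paths("/g/data/os22/chad_tmp/AusEFlux", "GPP", 2023)
```

Normalising the trailing slash matters because the f-strings concatenate directly onto `base`; without it, a `base` lacking the final `/` would silently produce a wrong path.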

In [None]:
model_var = 'GPP'  # one of 'GPP', 'ER', 'NEE', 'ET'
base = '/g/data/os22/chad_tmp/AusEFlux/'
year_start, year_end=2023, 2023
results_path = f'{base}results/predictions/ensemble/annual_update/{year_start}/{model_var}/'
models_folder = f'{base}results/models/ensemble/{model_var}/'
features_list = f'{base}results/variables.txt'

### Run Step 3


In [None]:
from _ensemble_prediction import predict_ensemble

In [None]:
%%time
predict_ensemble(
   base=base,
   model_var=model_var,
   models_folder=models_folder,
   features_list=features_list,
   results_path=results_path,
   year_start=year_start,
   year_end=year_end,
   compute_early=True,
   verbose=True
)

## Step 4: Combine ensembles

Having run an ensemble of predictions, we now compute the ensemble median and the uncertainty range.

This step also outputs production-ready datasets with appropriate metadata.

**Expected completion time < 1 min**

### Analysis Parameters

* `model_var`: Which variable are we combining? Must be one of 'GPP', 'ER', 'NEE', or 'ET'
* `base`: Path to where the modelling and data processing is occurring. We build the other path strings from the `base` path to reduce the length of path strings.
* `results_path`: Path where final AusEFlux datasets will be output.
* `year_start`: The first year in the series. If running for a single year, make _year_start_ and _year_end_ the same.
* `year_end`: The last year in the series. If running for a single year, make _year_start_ and _year_end_ the same.
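* `quantiles`: The quantiles used to determine the middle value and the uncertainty range. The default is 0.05 and 0.95 for the uncertainty envelope, and 0.5 (median) for the middle estimate, and you are advised not to change these. Note that the cell below sets `[0.25, 0.5, 0.75]` (the interquartile range), so check which envelope you intend before running.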
* `predictions_folder`: where are the ensemble predictions stored? Those output from the previous step.
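The combining step reduces the ensemble to a few quantiles per pixel. The sketch below shows the arithmetic in plain Python, using linear interpolation between order statistics (the same convention as NumPy's default quantile method). Whether `combine_ensemble` uses this interpolation is an assumption, and the helper names and toy values are invented for the example.

```python
def quantile(values, q):
    """Quantile of values using linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)        # fractional index into the sorted values
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + frac * (ordered[upper] - ordered[lower])


def combine(ensemble, quantiles=(0.05, 0.5, 0.95)):
    """ensemble: one list of per-pixel predictions per member.

    Returns a dict mapping each quantile to a list with one value per pixel.
    """
    per_pixel = list(zip(*ensemble))  # regroup so each tuple holds one pixel's members
    return {q: [quantile(p, q) for p in per_pixel] for q in quantiles}


# Five toy ensemble members, each predicting two pixels.
members = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]]
summary = combine(members, quantiles=(0.25, 0.5, 0.75))
```

With these values the median of the first pixel is 3.0 and its interquartile range runs from 2.0 to 4.0.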

> There are also several metadata fields (e.g. `full_name`, `units`) that will change with the variable being modelled. Make sure you update these for each model run, as these attributes are appended to the exported netcdf files.

In [None]:
base = '/g/data/os22/chad_tmp/AusEFlux/'
model_var = 'GPP'  # one of 'GPP', 'ER', 'NEE', 'ET'
results_path = f'{base}results/AusEFlux/{model_var}/'
year_start, year_end=2023,2023
quantiles=[0.25,0.5,0.75] # interquartile range
predictions_folder= f'{base}results/predictions/ensemble/annual_update/{year_start}/{model_var}/'
# predictions_folder= f'{base}results/predictions/ensemble/historical/{model_var}/'

# metadata for netcdf attributes
full_name = 'Gross Primary Productivity'  # or 'Net Ecosystem Exchange', 'Ecosystem Respiration', 'Evapotranspiration'
version = 'v1.2'
crs='EPSG:4326'
units = 'gC/m\N{SUPERSCRIPT TWO}/month'  # or 'mm/month' (e.g. for ET)
description = f'AusEFlux {full_name} is created by empirically upscaling the OzFlux eddy covariance network using machine learning methods coupled with climate and remote sensing datasets. The estimates provided within this dataset were extracted from an ensemble of predictions and represent the median and uncertainty range.'


#### Create attributes dictionary

In [None]:
attrs_dict={}
attrs_dict['nodata'] = np.nan
attrs_dict['crs'] = crs
attrs_dict['short_name'] = model_var
attrs_dict['long_name'] = full_name
attrs_dict['units'] = units
attrs_dict['version'] = version
attrs_dict['description'] = description
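Since `full_name` and `units` must change with `model_var`, one way to avoid forgetting an update is to look them up from the variable name. The sketch below is an assumption-laden illustration: the `METADATA` table and `build_attrs` helper are hypothetical, the pairing of units to variables (carbon fluxes in gC/m²/month, ET in mm/month) is inferred from the comments in the parameter cell, and `description` is omitted for brevity.

```python
# Assumed mapping of variable to (long name, units); not taken from the AusEFlux code.
METADATA = {
    "GPP": ("Gross Primary Productivity", "gC/m\N{SUPERSCRIPT TWO}/month"),
    "ER": ("Ecosystem Respiration", "gC/m\N{SUPERSCRIPT TWO}/month"),
    "NEE": ("Net Ecosystem Exchange", "gC/m\N{SUPERSCRIPT TWO}/month"),
    "ET": ("Evapotranspiration", "mm/month"),
}


def build_attrs(model_var, version="v1.2", crs="EPSG:4326"):
    """Return an attributes dictionary for the given variable."""
    full_name, units = METADATA[model_var]
    return {
        "nodata": float("nan"),  # the notebook uses np.nan; float('nan') is equivalent
        "crs": crs,
        "short_name": model_var,
        "long_name": full_name,
        "units": units,
        "version": version,
    }


gpp_attrs = build_attrs("GPP")
et_attrs = build_attrs("ET")
```

A lookup like this raises a `KeyError` for an unknown variable, which surfaces a typo in `model_var` before any files are written.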

### Run step 4

In [None]:
from _combine_ensemble import combine_ensemble

In [None]:
combine_ensemble(
    base=base,
    model_var=model_var,
    results_path=results_path,
    predictions_folder=predictions_folder,
    year_start=year_start,
    year_end=year_end,
    attrs=attrs_dict,
    quantiles=quantiles,
    verbose=True
)