# Annual updating of AusEFlux <img align="right" src="https://github.com/cbur24/AusEFlux/blob/master/banner_picture.png?raw=True" width="40%">

This notebook executes a workflow for annual updating of AusEFlux terrestrial carbon and water fluxes. It contains four main steps, with instructions provided in the subsections below. Pay close attention to the `Analysis Parameters` sections and ensure that paths and other settings are correct.

***
**Ideal compute environment:**

Assuming 1-km resolution

- NCI's 'normalsr' queue
- X-large (26 cores, 124 GiB)
- Python 3.10.0
- Python venv: `/g/data/os22/chad_tmp/AusEFlux/env/py310`
- Storage Folders: `scratch/os22+gdata/os22+gdata/ub8+gdata/xc0+gdata/gh70+gdata/r78`
***

## Set up Dask

In [None]:
import warnings
warnings.simplefilter(action='ignore')

import sys
sys.path.append('/g/data/os22/chad_tmp/AusEFlux/src/')
from _utils import start_local_dask

In [None]:
client = start_local_dask(mem_safety_margin='2Gb')
client

## Set up project directory structure

This workflow assumes a specific file/folder structure. Here we create that folder structure to support the rest of the process.

Below, enter the `root directory location` where project results and data are stored.

In [None]:
root='/g/data/os22/chad_tmp/AusEFlux/'
target_grid = '1km'

In [None]:
from _utils import create_project_directories
create_project_directories(root_dir=root, target_grid=target_grid)
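
For readers without access to `_utils`, the idea can be sketched with the standard library. This is a minimal illustration, not the actual `create_project_directories` implementation: the helper name `make_project_dirs` is hypothetical, and the subfolder list is only inferred from paths that appear later in this notebook.

```python
from pathlib import Path

# Subfolders inferred from paths used later in this notebook (an assumption;
# the real create_project_directories may create a different layout).
SUBFOLDERS = [
    "data/interim",
    "results/models/ensemble",
    "results/predictions/ensemble",
    "results/AusEFlux",
]


def make_project_dirs(root_dir, subfolders=SUBFOLDERS):
    """Create each subfolder under root_dir, leaving existing folders untouched."""
    created = []
    for sub in subfolders:
        path = Path(root_dir) / sub
        # parents=True builds intermediate folders; exist_ok=True makes reruns safe
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Because `exist_ok=True` is set, rerunning the helper on an existing project does not raise or overwrite anything.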

## Step 1: Spatiotemporal harmonisation of input datasets

Most datasets are originally from here: https://thredds.nci.org.au/thredds/catalog/ub8/au/catalog.html

Datasets from this process are output as annual layers in `data/interim`.

**Expected completion time X.  Most of this time is spent resampling the temperature datasets.**

### Analysis Parameters

* `base`: Path to where most of the data is stored
* `results`: Path to store interim datasets after they have undergone harmonisation
* `year`: The year of data we are processing.

In [None]:
base = '/g/data/ub8/au/'
results='/g/data/os22/chad_tmp/AusEFlux/data/interim/'
year = 2024
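
Before launching the run, it can help to confirm that the paths set above actually exist. The following is a small optional sketch; `missing_paths` is a hypothetical helper and is not part of the AusEFlux source.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [str(p) for p in paths if not Path(p).exists()]


# Example usage with the parameters defined above (left commented so the
# sketch runs anywhere):
# problems = missing_paths([base, results])
# if problems:
#     raise FileNotFoundError(f"Fix these paths first: {problems}")
```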

### Run step 1



In [None]:
from _harmonisation import spatiotemporal_harmonisation

In [None]:
spatiotemporal_harmonisation(
    year_start=year,
    year_end=year,
    base_path=base,
    target_grid=target_grid,
    results_path=results,
    verbose=True
)

## Step 2: Create feature datasets

Combine results of the spatiotemporal harmonisation into temporally stacked netcdf files, and create new features/variables based on the climate (e.g. anomalies) and remote sensing (e.g. vegetation fractions) datasets.
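
As an illustration of what an anomaly feature can look like, the sketch below subtracts each calendar month's long-term mean from a monthly series. The notebook does not state how `create_feature_datasets` defines its anomalies, so this definition and the helper name `monthly_anomalies` are assumptions made for illustration only.

```python
import numpy as np


def monthly_anomalies(series):
    """Subtract each calendar month's long-term mean from a monthly series.

    `series` is a 1-D sequence whose length is a multiple of 12 and whose
    first element corresponds to January.
    """
    values = np.asarray(series, dtype=float)
    if values.size % 12 != 0:
        raise ValueError("series length must be a multiple of 12")
    by_year = values.reshape(-1, 12)           # rows: years, columns: months
    climatology = np.nanmean(by_year, axis=0)  # long-term mean per calendar month
    return (by_year - climatology).reshape(-1)


# Two toy years: every month is 1.0 in year one and 3.0 in year two,
# so the climatology is 2.0 and the anomalies are -1.0 then +1.0.
anoms = monthly_anomalies([1.0] * 12 + [3.0] * 12)
```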

**Expected completion time X mins**

### Analysis Parameters

* `base`: Path to where the harmonised datasets output from Step 1 are stored. 
* `results`: Path to store temporally stacked netcdf files i.e. where the outputs of Step 2 will be stored
* `exclude`: Variables to exclude from combining, i.e. the variables in `/interim` output by Step 1 that are not needed hereafter.

In [None]:
# results are stored in a different directory due to storage issues
base = '/g/data/r78/cb3058/phd/interim_1km/'
results='/g/data/r78/cb3058/phd/1km/'
target_grid = '1km'
exclude = ['.ipynb_checkpoints', 'kTavg', 'Tmax', 'Tmin', 'EVI']

### Run step 2

In [None]:
from _feature_datasets import create_feature_datasets

In [None]:
create_feature_datasets(
    base=base,
    results_path=results,
    exclude=exclude,
    target_grid=target_grid,
    verbose=True
)

## Step 3: Predict ensemble

Using the ensemble of models, we will generate an ensemble of gridded predictions. The code below will loop through the carbon and water fluxes.
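
Conceptually, every model in the ensemble is applied to the same feature data and the outputs are stacked along a new ensemble axis. The sketch below shows that pattern with plain callables standing in for trained models; `predict_with_ensemble` and `toy_models` are hypothetical names and do not reflect the internals of `predict_ensemble`.

```python
import numpy as np


def predict_with_ensemble(models, features):
    """Apply every model to the same features.

    Returns an array of shape (n_models, n_samples).
    """
    return np.stack(
        [np.asarray(model(features), dtype=float) for model in models], axis=0
    )


# Stand-in "models": simple callables, not the trained AusEFlux models.
toy_models = [
    lambda X: X.sum(axis=1),
    lambda X: X.sum(axis=1) * 1.1,
    lambda X: X.sum(axis=1) * 0.9,
]

X = np.array([[1.0, 2.0], [3.0, 4.0]])  # 2 samples, 2 features
ensemble = predict_with_ensemble(toy_models, X)
print(ensemble.shape)  # (3, 2)
```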

**Expected completion time X mins/flux, so approximately X hours total if running all four fluxes**

### Analysis Parameters
* `fluxes`: The list of fluxes to loop through and predict. Most times this shouldn't need to change.
* `base`: Path to the root directory
* `year_start`: The first year in the series to predict. If predicting for a single year, make _year_start_ and _year_end_ the same.
* `year_end`: The last year in the series to predict. If predicting for a single year, make _year_start_ and _year_end_ the same.
* `features_list`: Path to the text file listing the features used by the model.

In [None]:
fluxes = ['GPP','NEE','ER','ET']
base = '/g/data/os22/chad_tmp/AusEFlux/'
year_start, year_end=2024, 2024
target_grid='1km'
features_list = f'{base}results/variables.txt'
prediction_data='/g/data/r78/cb3058/phd/'

### Run Step 3


In [None]:
from _ensemble_prediction import predict_ensemble

In [None]:
for f in fluxes:
    # set up paths
    # results_path = f'{base}results/predictions/ensemble/annual_update/{year_start}/{f}/'
    results_path = f'/scratch/os22/chad/AusEFlux/annual_update/{year_start}/{f}/'
    models_folder = f'{base}results/models/ensemble/{f}/'

    #predict ensemble
    predict_ensemble(
       base=base,
       prediction_data=prediction_data,
       model_var=f,
       models_folder=models_folder,
       features_list=features_list,
       results_path=results_path,
       year_start=year_start,
       year_end=year_end,
       target_grid=target_grid,
       compute_early=False,
       verbose=True
    )

## Step 4: Combine ensembles

Having run an ensemble of predictions, we now need to compute the ensemble median and the uncertainty range.

This step will also output production ready datasets with appropriate metadata. The code below will loop through the carbon and water fluxes.
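
The reduction from ensemble members to a median and an uncertainty envelope can be sketched with NumPy quantiles along the ensemble axis. This is a minimal illustration under the assumption that the members are stacked on the leading axis; `summarise_ensemble` is a hypothetical helper, and the real `combine_ensemble` may work differently.

```python
import numpy as np


def summarise_ensemble(stack, quantiles=(0.25, 0.5, 0.75)):
    """Reduce the leading ensemble axis to per-pixel quantiles, ignoring NaNs."""
    return np.nanquantile(np.asarray(stack, dtype=float), quantiles, axis=0)


# Five toy ensemble members over a 1 x 2 grid.
stack = np.array(
    [[[1.0, 10.0]], [[2.0, 20.0]], [[3.0, 30.0]], [[4.0, 40.0]], [[5.0, 50.0]]]
)
lower, median, upper = summarise_ensemble(stack)
# lower is [[2., 20.]], median is [[3., 30.]], upper is [[4., 40.]]
```

Using `nanquantile` rather than `quantile` means pixels with a few missing members still get an estimate from the remaining ones.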

**Expected completion time X mins**

### Analysis Parameters

* `fluxes`: A dictionary linking the fluxes to be modelled (e.g. 'GPP', 'NEE') with their full names (e.g. 'Gross Primary Productivity'). This dictionary is used to loop through the fluxes when combining the ensembles, and the full name is used in the metadata of the exported netcdf.
* `base`: Path to the root directory
* `year_start`: The first year in the series. If running for a single year, make _year_start_ and _year_end_ the same.
* `year_end`: The last year in the series. If running for a single year, make _year_start_ and _year_end_ the same.
* `quantiles`: The quantiles used to determine the middle value and uncertainty range. The default is 0.25 and 0.75 for the uncertainty envelope, and 0.5 (median) for the middle estimate.
* `version`: The version of the dataset being produced.


In [None]:
fluxes = {
    'GPP':'Gross Primary Productivity',
    'NEE':'Net Ecosystem Exchange',
    'ER':'Ecosystem Respiration',
    'ET':'Evapotranspiration'
         }

base = '/g/data/os22/chad_tmp/AusEFlux/'
predictions_path = '/scratch/os22/chad/AusEFlux/annual_update/' #temporary due to storage issues
year_start, year_end=2024,2024
quantiles=[0.25,0.5,0.75] # interquartile range
version = 'v1.2'

### Run step 4

In [None]:
import numpy as np  # needed for np.nan in the attributes dictionary below
from _combine_ensemble import combine_ensemble

In [None]:
for f,n in fluxes.items():

    # paths
    results_path = f'{base}results/AusEFlux/{f}/'
    # predictions_folder= f'{base}results/predictions/ensemble/annual_update/{year_start}/{f}/'
    predictions_folder= f'{predictions_path}{year_start}/{f}/'

    # metadata for netcdf attributes (`version` is set in the parameters cell)
    crs='EPSG:4326'
    
    if f =='ET':
        units = 'mm/month'
    else:
        units = 'gC/m\N{SUPERSCRIPT TWO}/month'
    
    description = f'AusEFlux {n} is created by empirically upscaling the OzFlux eddy covariance network using machine learning methods coupled with climate and remote sensing datasets. The estimates provided within this dataset were extracted from an ensemble of predictions and represent the median and uncertainty range.'

    # Create attributes dictionary
    attrs_dict={}
    attrs_dict['nodata'] = np.nan
    attrs_dict['crs'] = crs
    attrs_dict['short_name'] = f
    attrs_dict['long_name'] = n
    attrs_dict['units'] = units
    attrs_dict['version'] = version
    attrs_dict['description'] = description

    #combine ensembles and save netcdf
    combine_ensemble(
        base=base,
        model_var=f,
        results_path=results_path,
        predictions_folder=predictions_folder,
        year_start=year_start,
        year_end=year_end,
        attrs=attrs_dict,
        quantiles=quantiles,
        verbose=True
    )


## You're done!

### Last Run:

In [None]:
from datetime import date
print(date.today())