# Annual updating of AusEFlux <img align="right" src="https://github.com/cbur24/AusEFlux/blob/master/banner_picture.png?raw=True" width="40%">

This notebook executes a workflow for annual updating of AusEFlux terrestrial carbon and water fluxes. It contains four main steps; instructions are provided in the subsections below. Pay close attention to the `Analysis Parameters` sections and ensure that paths and other settings are correct.

***
**Ideal compute environment:**

The settings below assume a 500m target resolution, at which the process is quite slow.

- NCI's 'hugemem' queue
- X-large (24 cores, 765GiB)
- Python 3.10.0
- Python venv: `/g/data/xc0/project/AusEFlux/env/py310`
- Storage Folders: `gdata/ub8+gdata/xc0`
***

## Set up Dask

In [None]:
import warnings
warnings.simplefilter(action='ignore')

import sys
sys.path.append('/g/data/xc0/project/AusEFlux/src/')
from _utils import start_local_dask

In [None]:
client = start_local_dask(mem_safety_margin='2Gb')
client

## Set up project directory structure

This workflow assumes a specific file/folder structure. Here we create that folder structure to support the rest of the process.

Below, enter the `root directory location` where project results and data are stored.

In [None]:
root='/g/data/xc0/project/AusEFlux'
target_grid = '500m'

In [None]:
from _utils import create_project_directories
create_project_directories(root_dir=root, target_grid=target_grid)

## Step 1: Spatiotemporal harmonisation of input datasets

Most datasets are originally from here: https://thredds.nci.org.au/thredds/catalog/ub8/au/catalog.html

Datasets from this process are output as annual layers in `data/interim`.

**Expected completion time at 500m ~3.5 hours**

### Analysis Parameters

* `target_grid`: The spatial resolution of the product we are building
* `results`: Path to store interim datasets after they have undergone harmonisation
* `year`: The year of data we are processing.

In [None]:
target_grid = '500m'
results=f'/g/data/xc0/project/AusEFlux/data/interim_{target_grid}/'
year = 2024

### Run step 1



In [None]:
from _harmonisation import spatiotemporal_harmonisation

In [None]:
spatiotemporal_harmonisation(
    year_start=year,
    year_end=year,
    target_grid=target_grid,
    results_path=results,
    verbose=True
)

## Step 2: Create feature datasets

Combine results of the spatiotemporal harmonisation into temporally stacked netcdf files, and create new features/variables based on the climate (e.g. anomalies) and remote sensing (e.g. vegetation fractions) datasets. 

**Expected completion time at 500m resolution is ~6 hours**

### Analysis Parameters

* `target_grid`: The spatial resolution of the product we are building
* `base`: Path to where the harmonised datasets output from Step 1 are stored. 
* `results`: Path to store temporally stacked netcdf files i.e. where the outputs of Step 2 will be stored
* `exclude`: Variables to exclude from combining, i.e. variables in `/interim` output by Step 1 that are not needed hereafter.

In [None]:
#results are stored in different directory due to storage issues
target_grid = '500m'
base = f'/g/data/xc0/project/AusEFlux/data/interim_{target_grid}/'
results=f'/g/data/xc0/project/AusEFlux/data/{target_grid}/'
exclude = ['.ipynb_checkpoints', 'kTavg', 'Tmax', 'Tmin']

### Run step 2

start time = 1511

In [None]:
from _feature_datasets import create_feature_datasets

In [None]:
create_feature_datasets(
    base=base,
    results_path=results,
    exclude=exclude,
    target_grid=target_grid,
    verbose=True
)

## Step 3: Predict ensemble

Using the ensemble of models, we will generate an ensemble of gridded predictions. The code below will loop through the carbon and water fluxes and submit each job to the Gadi queue. If, for some reason, you need to adapt the parameters of the shell script that submits the jobs, it can be found at `/g/data/xc0/project/AusEFlux/src/_qsub_ensemble_member.sh`

It's advisable to check that all 30 ensemble members for each flux are exported, as it is not uncommon for a Gadi job to fail for no apparent reason, in which case its results won't be exported (this is very likely when running 120 large jobs simultaneously). If any are missing, run the code block again: it first checks whether each result already exists, so only the missing members are resubmitted.

**Annual fluxes will begin to be exported within ~10-15 mins, but it depends on the compute available as to how long it will take to get through all 120 files.**

### Analysis Parameters
* `fluxes`: The list of fluxes to loop through and predict. Most times this shouldn't need to change.
* `base`: Path to the root directory
* `target_grid`: The spatial resolution of the product we are building
* `year_start`: The first year in the series to predict. If predicting for a single year, make _year_start_ and _year_end_ the same.
* `year_end`: The last year in the series to predict. If predicting for a single year, make _year_start_ and _year_end_ the same.
* `features_list`: Where are the list of features used by the model?
* `prediction_data`: Where are the combined feature datasets stored?
* `n_workers`: When we start the dask client, how many cores will the client have?
* `memory_limit`: The amount of memory to assign to each dask client

In [None]:
fluxes = ['GPP','NEE','ER','ET']
base = '/g/data/xc0/project/AusEFlux/'
year_start, year_end=2024, 2024
target_grid='500m'
features_list = f'{base}results/variables.txt'
prediction_data=f'{base}data/{target_grid}'
n_workers=26
memory_limit='140GiB'

### Run step 3


In [None]:
import os

for f in fluxes:
    # set up paths
    results_path = f'{base}results/predictions/annual_update/{year_start}/{f}/'
    models_folder = f'{base}results/models/ensemble/{f}/'
    
    #paths to models
    model_list = [models_folder+file for file in os.listdir(models_folder) if file.endswith(".joblib")]
    model_list.sort()
    os.chdir('/g/data/xc0/project/AusEFlux/') #so the job's .o and .e log files are written here
    
    #submit each model to Gadi separately for prediction
    for m in model_list:
        
        name =  m.split('/')[-1].split('.')[0]

        #check if it has already been predicted
        if os.path.exists(f'{results_path}{name}.nc'):
            pass
        else:
            #submit to Gadi
            os.system(f"qsub -v model_path={m},model_var={f},year_start={year_start},year_end={year_end},target_grid={target_grid},base={base},results_path={results_path},prediction_data={prediction_data},features_list={features_list},n_workers={n_workers},memory_limit={memory_limit} /g/data/xc0/project/AusEFlux/src/_qsub_ensemble_member.sh"
                     )
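
The submission above builds one long shell string for `os.system`. As an optional alternative, the sketch below assembles the same kind of `qsub -v` call as an argument list, which makes it easier to inspect before submitting. The helper name `build_qsub_command` is hypothetical and not part of the AusEFlux source, and the variables passed in the example are only a subset of those used in this notebook.

```python
import shlex


def build_qsub_command(script, variables):
    """Build an argument list for `qsub -v key=value,... script`.

    `build_qsub_command` is a hypothetical helper for illustration only.
    qsub separates the -v entries with commas, so values containing a
    comma are rejected here rather than silently split.
    """
    for key, value in variables.items():
        if ',' in str(value):
            raise ValueError(f'{key} contains a comma, which -v uses as a separator')
    var_string = ','.join(f'{key}={value}' for key, value in variables.items())
    return ['qsub', '-v', var_string, script]


# Example with a subset of the variables used in this notebook
cmd = build_qsub_command(
    '/g/data/xc0/project/AusEFlux/src/_qsub_ensemble_member.sh',
    {'model_var': 'GPP', 'year_start': 2024, 'year_end': 2024},
)

# Print the command so it can be checked before anything is submitted
print(shlex.join(cmd))

# On Gadi the list could then be passed to subprocess.run, e.g.
#   result = subprocess.run(cmd, capture_output=True, text=True)
#   if result.returncode != 0: print(result.stderr)
```

Passing a list to `subprocess.run` avoids shell quoting problems and exposes the return code, so a failed submission can be reported rather than passing unnoticed.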


In [None]:
!qstat
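
Once the queue has emptied, the check for missing ensemble members described at the start of this step can be sketched as follows. This is a minimal example rather than part of the AusEFlux source: the helper name `missing_members` is hypothetical, and it assumes, as in the submission loop above, that each `<name>.joblib` model should produce a matching `<name>.nc` prediction.

```python
from pathlib import Path


def missing_members(models_folder, results_folder):
    """Return the model names that have no matching .nc prediction.

    Hypothetical helper for illustration. Names are taken as the part of
    the file name before the first '.', mirroring the submission loop.
    """
    models = Path(models_folder)
    results = Path(results_folder)
    # A folder that does not exist is treated as containing no files
    model_names = sorted(
        p.name.split('.')[0] for p in models.glob('*.joblib')
    ) if models.is_dir() else []
    done = {
        p.name.split('.')[0] for p in results.glob('*.nc')
    } if results.is_dir() else set()
    return [name for name in model_names if name not in done]


# Paths follow the layout used in the Step 3 parameters
base = '/g/data/xc0/project/AusEFlux/'
year_start = 2024

for flux in ['GPP', 'NEE', 'ER', 'ET']:
    missing = missing_members(
        f'{base}results/models/ensemble/{flux}/',
        f'{base}results/predictions/annual_update/{year_start}/{flux}/',
    )
    print(f'{flux}: {len(missing)} missing', missing)
```

If any flux reports missing members, rerunning the submission cell should resubmit only those models.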

## Step 4: Combine ensembles

We ran an ensemble of predictions; now we need to compute the ensemble median and the uncertainty range.

This step will also output the datasets with appropriate metadata. All datasets will be exported to:
`/g/data/xc0/project/AusEFlux/results/AusEFlux/<flux>/`

The code below will loop through the carbon and water fluxes.

**Expected completion time ~30 mins**

### Analysis Parameters

* `fluxes`: A dictionary linking the fluxes to be modelled (e.g 'GPP', 'NEE' etc.) with their full names (e.g. 'Gross Primary Productivity'). This dictionary is used to loop through the fluxes for combining the ensembles, and the full name is used for metadata on the exported netcdf.
* `predictions_path`: Where is the top level folder that the predictions were exported to?
* `year_start`: The first year in the series. If running for a single year, make _year_start_ and _year_end_ the same.
* `year_end`: The last year in the series. If running for a single year, make _year_start_ and _year_end_ the same.
* `target_grid`: The spatial resolution of the product we are building
* `quantiles`: What quantiles are we using to determine the middle value and uncertainty range? The default is 0.25 and 0.75 for the uncertainty envelope, and 0.5 (median) for the middle estimate.
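* `version`: What version of the dataset is this?
* `dask_chunks`: The dask chunk sizes for each dimension (`x`, `y`, `time`), passed to `combine_ensemble`.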
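
As a rough illustration of how the `quantiles` parameter translates into a middle estimate and an uncertainty envelope, the sketch below reduces a synthetic ensemble along its member axis with NumPy. It is not the implementation inside `combine_ensemble`, which is not shown in this notebook; the array shape, random values and variable names are assumptions made for the example.

```python
import numpy as np

# Synthetic stand-in for an ensemble: 30 members on a small 4 x 5 grid
rng = np.random.default_rng(0)
ensemble = rng.normal(loc=100.0, scale=10.0, size=(30, 4, 5))

# Pretend one member has no data at one pixel
ensemble[3, 0, 0] = np.nan

quantiles = [0.25, 0.5, 0.75]

# Reduce over the member axis (axis 0), ignoring NaNs; the result has
# one layer per requested quantile
lower, median, upper = np.nanquantile(ensemble, quantiles, axis=0)

print(median.shape)  # (4, 5)
print(float(lower[0, 0]), float(median[0, 0]), float(upper[0, 0]))
```

With the default `quantiles`, the 0.5 layer is the ensemble median and the 0.25 and 0.75 layers bound the interquartile range at each pixel.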


In [None]:
fluxes = {
    'GPP':'Gross Primary Productivity',
    'NEE':'Net Ecosystem Exchange',
    'ER':'Ecosystem Respiration',
    'ET':'Evapotranspiration'
         }

base = '/g/data/xc0/project/AusEFlux/'
predictions_path = f'{base}results/predictions/annual_update/'
year_start, year_end=2024,2024
quantiles=[0.25,0.5,0.75] # interquartile range
version = 'v2.0'
target_grid='500m'
dask_chunks=dict(x=500, y=500, time=-1)

### Run step 4

In [None]:
import numpy as np
from _combine_ensemble import combine_ensemble

In [None]:
for f,n in fluxes.items():
    print(f)
    # paths
    predictions_folder= f'{predictions_path}{year_start}/{f}/'

    # metadata for netcdf attributes
    if f =='ET':
        units = 'mm/month'
    else:
        units = 'gC/m\N{SUPERSCRIPT TWO}/month'
    
    description = f'AusEFlux {n} is created by empirically upscaling the OzFlux eddy covariance network using machine learning methods coupled with climate and remote sensing datasets. The estimates provided within this dataset were extracted from an ensemble of predictions and represent the median and uncertainty range.'

    # Create attributes dictionary
    attrs_dict={}
    attrs_dict['nodata'] = np.nan
    attrs_dict['crs'] = 'EPSG:4326'
    attrs_dict['short_name'] = f
    attrs_dict['long_name'] = n
    attrs_dict['units'] = units
    attrs_dict['version'] = version
    attrs_dict['description'] = description

    #combine ensembles and save netcdf
    combine_ensemble(
        model_var=f,
        results_path=f'{base}results/AusEFlux/{f}/',
        predictions_folder=predictions_folder,
        year_start=year_start,
        year_end=year_end,
        target_grid=target_grid,
        dask_chunks=dask_chunks,
        attrs=attrs_dict,
        quantiles=quantiles,
        verbose=True
    )

## You're done!

**The next step** is to open and run `/g/data/xc0/project/AusEFlux/notebooks/Move_auseflux_to_production.ipynb` which will move the datasets into the production `ub8` folder, along with creating annual summaries.

### Last Run:

In [None]:
from datetime import date
print(date.today())