# MOD16 Calibration (via MCMC) Tutorial

**Contents:**

1. [Overview](#Overview)
2. [Before You Get Started](#Before-You-Get-Started)
   - [The HDF5 Cal-Val Data](#The-HDF5-Cal-Val-Data)
   - [The YAML Configuration File](#The-YAML-Configuration-File)
   - [The YAML Priors File](#The-YAML-Priors-File)
3. [Running the Calibration](#Running-the-Calibration)
   - [Output Files](#Output-Files)
   - [Using k-folds Cross Validation](#Using-k-folds-Cross-Validation)
4. [Diagnostics](#Diagnostics)
   - [Diagnosing Autocorrelation](#Diagnosing-Autocorrelation)
   - [Exporting the Posterior](#Exporting-the-Posterior)

---

## Overview

The MOD16 calibration suite is intended to be used at the command line.

So, for starters, let's confirm where we are on our file system.

In [None]:
! pwd

**Note that when we start an example with a `!` character, we're actually sending that command to the command line (the Unix shell or Windows PowerShell); it's not Python code.**

The `notebooks` directory isn't a useful place to start, but a directory change made with `! cd` doesn't persist beyond its own cell in a Jupyter Notebook. So, for now, note that we will interact with the `calibration.py` script by using a relative path to that file; it is located in the `mod16` directory:

In [None]:
! ls ../mod16/calibration.py

The script is intended to be called like:

```sh
python calibration.py <command> <options>
```

If you run the `calibration.py` script without any arguments, you'll see what `COMMANDS` are available.

In [None]:
! python ../mod16/calibration.py

And you can get help on individual commands using `--help`:

In [None]:
! python ../mod16/calibration.py tune --help

---

## Before You Get Started

You need three files prepared before you can start calibration:

- **HDF5 Cal-Val data file:** This is an HDF5 file that contains all the data (inputs and observed fluxes) at the tower sites needed to run the calibration.
- **YAML configuration file:** This is a YAML file (`*.yaml`) that specifies all the options and file paths for *your* calibration run.
- **YAML priors file:** This is a YAML file that specifies the priors for every parameter in MOD16.

### The HDF5 Cal-Val Data

The structure of this file is defined in the [module-level docstring of `calibration.py`](https://arthur-e.github.io/MOD16/calibration.html).

### The YAML Configuration File

Below is a template for the YAML configuration file; you can copy it and modify it to suit your needs.

```yaml
---
BPLUT:
  ET: "/home/user/MOD16_BPLUT_CX.X_05deg_MCD43B_Albedo_MERRA.csv"
data:
  file: "/home/user/VIIRS_MOD16_tower_site_latent_heat_and_drivers_v5.h5"
  class_map: state/PFT # HDF5 field name for PFT map
  classes: [1,2,3,4,5,6,7,8,9,10,12] # The unique and valid PFT codes
  # If a time-varying PFT map is used, this should be true and the corresponding
  #   "class_map" array (see above) should be a (T x N) array
  classes_are_dynamic: false
  target_observable: FLUXNET/latent_heat # HDF5 field name
  sites_blacklisted: []
  # The name of the HDF5 group that contains surface meteorology data
  met_group: MERRA2
  # The name of the HDF5 datasets for albedo, fPAR, LAI, etc.
  datasets:
    albedo: MODIS/MCD43GF_black_sky_sw_albedo
    fPAR: MODIS/MOD15A2HGF_fPAR_interp
    LAI: MODIS/MOD15A2HGF_LAI_interp
optimization:
  backend_template: "/home/user/20231218_MOD16_%s_calibration_PFT%d.nc4"
  prior: "/home/user/MOD16/mod16/data/MOD16_BPLUT_prior_20231218.yaml"
  chains: 3 # Number of chains to run
  draws: 1000 # Number of draws from the posterior distribution
  tune: scaling # Which hyperparameter to tune
  scaling: 0.001 # Initial scale factor for epsilon
  objective: RMSD # Objective function
```

Some configuration options deserve particular attention.

In the `BPLUT` group:

- `ET:` This is the path to the existing MOD16 BPLUT, i.e., the old BPLUT. It is used to fill in any parameters that are not being calibrated.

In the `data` group:

- `file:` This is the file path to the HDF5 calibration-validation (Cal-Val) file.
- `classes:` This is a list of the valid numeric PFT codes. The Calibration API will use this to determine when all PFTs have been calibrated.
- `classes_are_dynamic:` The PFT classes can be static (`false`) or can vary in time (`true`). In the latter case, a tower's PFT may potentially change at every time step. Most likely, a tower's PFT changes yearly at most, but if `classes_are_dynamic` is `true`, then the `class_map` should be a (T x N) array, where T is the number of (daily) time steps.
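
To make the difference between a static and a dynamic PFT map concrete, here is a minimal sketch that uses nested Python lists as stand-ins for the HDF5 arrays. It assumes that N is the number of tower sites; the variable names and toy values are hypothetical, and `calibration.py` may select sites differently.

```python
# Toy example with N = 3 tower sites and T = 2 time steps (hypothetical values)
static_map = [1, 4, 1]        # Shape (N,): one PFT code per site
dynamic_map = [
    [1, 4, 1],                # PFT codes at time step 0
    [1, 4, 4],                # The last site changes from PFT 1 to PFT 4
]                             # Shape (T x N): one code per time step and site

pft = 1

# Static case: a site either belongs to the PFT or it doesn't
static_mask = [code == pft for code in static_map]

# Dynamic case: membership is evaluated separately at every time step
dynamic_mask = [[code == pft for code in row] for row in dynamic_map]

print(static_mask)   # [True, False, True]
print(dynamic_mask)  # [[True, False, True], [True, False, False]]
```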

In the `optimization` group:

- `backend_template:` This is the file path to an output file that *will be created* after a successful run. The filename should contain two format specifiers: `%s`, where the model name (usually `"ET"`) should go, and `%d`, where the numeric PFT code should go. If this file already exists, it will be overwritten!
- `prior:` This is the file path to the YAML priors file, **which is described in the next section.**
- `draws:` This is the number of draws from the posterior distribution. Starting with a small number of draws, for faster completion, will help you verify that everything is working as intended; you may want to increase it afterwards.
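
As a quick check on `backend_template`, the sketch below expands the template for every PFT code in `classes`. It assumes the template is filled in with Python's `%` string-formatting operator, which is what the `%s` and `%d` placeholders suggest; the exact call inside `calibration.py` is not shown here.

```python
# Values copied from the configuration template above
backend_template = "/home/user/20231218_MOD16_%s_calibration_PFT%d.nc4"
classes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]

# "%s" receives the model name and "%d" receives the numeric PFT code
backend_files = [backend_template % ("ET", pft) for pft in classes]

print(backend_files[0])   # /home/user/20231218_MOD16_ET_calibration_PFT1.nc4
print(backend_files[-1])  # /home/user/20231218_MOD16_ET_calibration_PFT12.nc4
```

Printing the expanded paths before a long run is a cheap way to catch a template that is missing one of the two placeholders.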

### The YAML Priors File

The priors file will look different depending on the prior distributions you have settled on for your unique use case. In the current version of MOD16, [these are the free parameters to be calibrated.](https://github.com/arthur-e/MOD16?tab=readme-ov-file#free-parameters) However, the minimum temperature and vapor pressure deficit (VPD) ramp functions are fixed at the same values as MOD17, so these are not calibrated in the current scheme.

In the current scheme of our calibration software, we use the following functional forms for each prior:

- `gl_sh`, or $g_{SH}$, uses a LogNormal prior
- `gl_wv`, or $g_{WV}$, uses a LogNormal prior
- `g_cuticular`, or $g_{cuticular}$, uses a LogNormal prior
- `csl`, or $C_L$, uses a LogNormal prior
- `rbl_min`, or $r_{\text{BL,min}}$, uses a Uniform prior
- `rbl_max`, or $r_{\text{BL,max}}$, uses a Uniform prior
- `beta`, or $\beta$, uses a Uniform prior

An example priors file is below.

```yaml
---
# NOTE that the tilde ~ below is a NULL; it should be used in the first
#   position (where Python starts counting, at 0) when there is no PFT 0
gl_sh:
  mu: [~, -2.41, -4.25, -2.41, -4.25, -3.45, -3.91, -3.91, -3.45, -3.45, -3.22, ~, -3.45]
  sigma: [~, 0.1, 0.1, 0.1, 0.1, 0.71, 0.1, 0.1, 0.71, 0.71, 0.1, ~, 0.71]
gl_wv:
  mu: [~, -2.41, -4.25, -2.41, -4.25, -3.45, -3.91, -3.91, -3.45, -3.45, -3.22, ~, -3.45]
  sigma: [~, 0.1, 0.1, 0.1, 0.1, 0.71, 0.1, 0.1, 0.71, 0.71, 0.1, ~, 0.71]
g_cuticular:
  mu: [~, -10.2, -10.75, -10.12, -10.08, -10.38, -10.42, -10.42, -10.19, -10.19, -9.57, ~, -8.79]
  sigma: [~, 1.02, 1.14, 0.89, 0.82, 1.05, 1.04, 1.04, 1.44, 1.44, 1.26, ~, 0.26]
csl:
  mu: [~, -5.5, -5.5, -5.5, -5.5, -5.5, -5.5, -5.5, -5.5, -5.5, -5.11, ~, -4.29]
  sigma: [~, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.89, ~, 0.65]
# These parameters have a Uniform prior, so we specify the lower and upper bounds
rbl_min:
  lower: [~, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, ~, 10]
  upper: [~, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, ~, 99]
rbl_max:
  lower: [~, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, ~, 100]
  upper: [~, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, ~, 1000]
beta:
  lower: [~, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~, 0]
  upper: [~, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, ~, 1000]
```

**Each MOD16 parameter has one or more statistical parameters describing its prior distribution.** These statistical parameters are given as lists because there should be one value for each PFT. The tilde character, `~`, indicates that no prior parameters are provided for that PFT; it is used to make sure that the position of each value in the list corresponds to the numeric PFT code, given that Python starts counting at zero (0). That's why every list begins with `~`: there is no PFT coded 0. In the example above, the `~` in position 11 plays the same role, because 11 is not among the PFT codes listed under `classes` in the configuration template.

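- **For LogNormal (or Normal) priors,** the `mu` and `sigma` keys indicate the mean and standard deviation of the prior, respectively. For a LogNormal prior, these describe the logarithm of the parameter, which is why the `mu` values in the example are negative.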
- **For Uniform priors,** the `lower` and `upper` keys indicate the minimum and maximum bounds on the Uniform distribution.
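
The list-position convention can be illustrated with a short sketch. YAML's `~` is read into Python as `None`, so the lists below reproduce the `gl_sh` entries of the example priors file. The helper `prior_for()` is hypothetical and only shows how a PFT code doubles as a list index; it is not part of `calibration.py`.

```python
# The gl_sh entries from the example priors file, with YAML's ~ written as None
gl_sh = {
    "mu": [None, -2.41, -4.25, -2.41, -4.25, -3.45, -3.91,
           -3.91, -3.45, -3.45, -3.22, None, -3.45],
    "sigma": [None, 0.1, 0.1, 0.1, 0.1, 0.71, 0.1,
              0.1, 0.71, 0.71, 0.1, None, 0.71],
}

def prior_for(prior, pft):
    """Return the statistical parameters of a prior for one PFT code."""
    # The PFT code is used directly as the position in each list
    params = {key: values[pft] for key, values in prior.items()}
    if any(value is None for value in params.values()):
        raise ValueError("No prior is defined for PFT %d" % pft)
    return params

print(prior_for(gl_sh, 1))   # {'mu': -2.41, 'sigma': 0.1}
print(prior_for(gl_sh, 12))  # {'mu': -3.45, 'sigma': 0.71}
```

Asking for PFT 0 or PFT 11 raises an error in this sketch, because those positions hold `None`.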

**These prior distributions are defined in the `MOD16StochasticSampler.compile_et_model()` function in `calibration.py`.** For more information, consult [the PyMC documentation.](https://www.pymc.io/projects/docs/en/latest/api/distributions.html)
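
When choosing `mu` and `sigma` for a LogNormal prior, it can help to translate them back to the parameter's own scale. For a LogNormal distribution, the median is $e^{\mu}$ and the mean is $e^{\mu + \sigma^2/2}$. The sketch below applies these two standard formulas to the `gl_sh` prior for PFT 1 from the example file; it uses only the Python standard library and does not call PyMC.

```python
import math

mu, sigma = -2.41, 0.1  # gl_sh prior for PFT 1, from the example priors file

median = math.exp(mu)                  # Median of a LogNormal(mu, sigma)
mean = math.exp(mu + sigma ** 2 / 2)   # Mean of a LogNormal(mu, sigma)

print("median: %.4f" % median)
print("mean:   %.4f" % mean)
```

Both values come out near 0.09, so this prior concentrates `gl_sh` around that magnitude.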

---

## Running the Calibration

When the files described above are in place, you're ready to calibrate MOD16! Below, we run the calibration for PFT 1. It's that simple! Here, we also add `--save-fig=True` so that we save a file version of the trace plot at the end.

In [None]:
! python ../mod16/calibration.py tune --pft=1 --save-fig=True

If we wanted to provide a path to a specific configuration file (other than the default, `data/MOD16_calibration_config.yaml`), then we could do so with:

```sh
python ../mod16/calibration.py tune --pft=1 --config=path/to/my/configuration_file.yaml
```

You should see some output that looks like this:

```
Using configuration file: "/usr/local/dev/MOD16/mod16/data/MOD16_calibration_config.yaml"
Masking out validation data...
Loading driver datasets...
Initializing sampler...
-- RMSD at the initial point: 19.035
Compiling model...
Multiprocess sampling (3 chains in 3 jobs)
DEMetropolisZ: [gl_sh, gl_wv, g_cuticular, csl, rbl_min, rbl_max, beta]
 |██-----------| 21.38% [1283/6000 00:18<01:07 Sampling 3 chains, 0 divergences]
```
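
The "RMSD at the initial point" is the value of the objective function before any sampling has taken place. As a reminder of what that number measures, here is a minimal sketch of a root-mean-square deviation that skips missing observations. It uses the standard definition of RMSD; the `rmsd()` helper, its NaN handling, and the flux values are assumptions for illustration, not code taken from `calibration.py`.

```python
import math

def rmsd(predicted, observed):
    """Root-mean-square deviation, skipping pairs where either value is NaN."""
    squared_errors = [
        (p - o) ** 2
        for p, o in zip(predicted, observed)
        if not (math.isnan(p) or math.isnan(o))
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical latent heat fluxes; NaN marks a missing tower observation
predicted = [50.0, 80.0, 120.0, 95.0]
observed = [47.0, 84.0, float("nan"), 95.0]

# Squared errors are 9, 16 and 0 (the NaN pair is skipped), so the result
# is sqrt(25 / 3)
print(rmsd(predicted, observed))
```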

### Output Files

There is usually a single output file associated with a successful run: a netCDF4 file (`*.nc4`) whose path is determined by the `backend_template` option of your configuration file.
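
Because each run calibrates one PFT and writes one backend file, obtaining a sample for every PFT means repeating the `tune` command once per code in `classes`. The sketch below builds those commands in Python, using only the `--pft` and `--config` options shown above. The `tune_command()` helper and the configuration path are placeholders, and the loop that would launch the runs is left commented out.

```python
import sys

# Valid PFT codes, as listed under "classes" in the configuration template
classes = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12]
config_path = "path/to/my/configuration_file.yaml"  # Placeholder path

def tune_command(pft, config=config_path):
    """Build the argument list for calibrating a single PFT."""
    return [
        sys.executable, "../mod16/calibration.py", "tune",
        "--pft=%d" % pft, "--config=%s" % config,
    ]

commands = [tune_command(pft) for pft in classes]
print(" ".join(commands[0][1:]))

# To launch the runs one after another, uncomment the lines below:
# import subprocess
# for command in commands:
#     subprocess.run(command, check=True)
```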

### Using k-folds Cross Validation

If you have a small training dataset and want to use $k$-folds cross-validation, you can do so by adding a command-line argument:

```sh
# e.g., 3-fold cross-validation
python ../mod16/calibration.py tune --pft=1 --k-folds=3
```

What's different when using $k$-folds?

- The sampler will run $k$ times. No trace plot(s) will be shown.
- There will be $k$ output netCDF4 (backend) files.
- There will be an additional output HDF5 file. This file contains the numeric indices of the samples used; the indices correspond to the flattened (1D) array of tower observations *after NaNs have been removed.*
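
To clarify what "indices into the flattened array after NaNs have been removed" means, here is a minimal sketch of a $k$-folds split over such indices. The `k_fold_indices()` helper, the shuffling, and the way folds are assigned are all assumptions; the actual splitting in `calibration.py` may differ.

```python
import math
import random

def k_fold_indices(observations, k, seed=0):
    """Split the indices of the non-NaN observations into k shuffled folds."""
    # Indices refer to positions in the array *after* NaNs have been removed
    n_valid = sum(1 for value in observations if not math.isnan(value))
    indices = list(range(n_valid))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i
    return [sorted(indices[i::k]) for i in range(k)]

# Hypothetical flattened (1D) tower observations; two of the eight are missing
observations = [3.1, float("nan"), 2.7, 4.0, float("nan"), 3.3, 2.9, 3.8]
folds = k_fold_indices(observations, k=3)

for i, held_out in enumerate(folds):
    print("fold %d holds out indices %s" % (i, held_out))
```

Note that the indices run from 0 to 5, not 0 to 7: they count only the six valid observations.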

---

## Diagnostics

Once you have an output netCDF4 backend associated with calibrating one of your PFTs, you're ready to inspect the trace to see that you have a good sample of the posterior.

### Diagnosing Autocorrelation

The first step should be to diagnose autocorrelation in the trace.

In [None]:
! python ../mod16/calibration.py plot-autocorr --pft=1

If there is significant autocorrelation (bars exceed the height of the gray shaded area), thinning the chain can reduce it. You can also apply a burn-in (discard the first $N$ samples); this probably won't help with autocorrelation, but it removes samples drawn at the beginning of the chain, before the sampler settled.

Below, we thin by 10 (take every 10th sample) and burn-in 100 samples (throw away the first 100).

In [None]:
! python ../mod16/calibration.py plot-autocorr --pft=1 --burn=100 --thin=10
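
The effect of burn-in and thinning can be illustrated with plain Python slicing. The sketch below is not the code that `calibration.py` runs: it simulates a strongly autocorrelated chain (an AR(1) process, used here as a hypothetical stand-in for sampler output), then compares the lag-1 autocorrelation before and after discarding the first 100 draws and keeping every 10th one.

```python
import random

def autocorr(x, lag):
    """Sample autocorrelation of the sequence x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var

# Hypothetical chain: each draw depends strongly on the previous one
rng = random.Random(42)
chain = [0.0]
for _ in range(9999):
    chain.append(0.9 * chain[-1] + rng.gauss(0, 1))

burn, thin = 100, 10
kept = chain[burn::thin]  # Drop the first 100 draws, then keep every 10th

print("draws kept: %d of %d" % (len(kept), len(chain)))
print("lag-1 autocorrelation, raw chain: %.2f" % autocorr(chain, 1))
print("lag-1 autocorrelation, thinned:   %.2f" % autocorr(kept, 1))
```

The trade-off is visible in the first line of output: thinning lowers autocorrelation at the cost of keeping far fewer draws, so a heavily thinned chain may call for a larger `draws` setting.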

---

### Exporting the Posterior

When you're satisfied there is no autocorrelation in the posterior, you can export it to an HDF5 file for later statistical summary.

```sh
python ../mod16/calibration.py export-posterior <model_name> <parameter_name> <output_path>
```

Where:

- `<model_name>` should always be `ET`, for now.
- `<parameter_name>` is the name of the parameter for which you want to export the posterior.
- `<output_path>` is the file path of an HDF5 file, to be created on your file system.

See the example below.

**This step should only be run (and will only work) once you have a sample for every PFT listed in the `classes` configuration option.**

In [None]:
! python ../mod16/calibration.py export-posterior ET g_cuticular /home/arthur/MOD16_g_cuticular_sample.h5