# Calibrating tied meta parameters

## About this document

This document illustrates how to set up a calibration where a global parameterization is set at the catchment level, with scaled values for each subarea. This method helps to keep the degrees of freedom of an optimisation to a minimum.

In [None]:
from swift2.doc_helper import pkg_versions_info

print(pkg_versions_info("This document was generated from a jupyter notebook"))

## Use case and sample data

For convenience, this workflow uses daily time series data for the Ovens River catchment. The sources of the data are described in the Data section below.

## Imports


In [None]:
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
from cinterop.timeseries import as_timestamp, xr_ts_end
from swift2.doc_helper import (
    configure_daily_gr4j,
    configure_hourly_gr4j,
    create_test_catchment_structure,
    get_free_params,
    gr4j_scaled_parameteriser,
    sample_series,
)
from swift2.parameteriser import (
    create_parameter_sampler,
    create_sce_optim_swift,
    get_default_sce_parameters,
    get_marginal_termination,
    set_calibration_logger,
)
from swift2.simulation import get_state_value, get_subarea_ids, swap_model
from swift2.utils import as_xarray_series, mk_full_data_id, paste
from swift2.vis import OptimisationPlots, plot_two_series

In [None]:
%matplotlib inline

## Data

The sample data for this tutorial are daily series for the Ovens Catchment in Victoria. Daily streamflow was sourced from https://data.water.vic.gov.au/ at Bright (VIC), and rainfall and Morton PET were sourced from https://www.longpaddock.qld.gov.au/silo/point-data/ at Eurobin (VIC). Note that Eurobin is a little downstream from Bright and is perhaps not the "best" point climate data, but it is adequate for this vignette.


In [None]:
loc_key = "Ovens-Bright"
daily_rain = sample_series(loc_key, "rain")
daily_pet = sample_series(loc_key, "pet")
daily_streamflow_mlday = sample_series(loc_key, "streamflow")

In [None]:
daily_streamflow_mlday.plot();

In [None]:
catchment_area = 495  # km^2
# ML/day to mm/day: 1 ML = 1000 m3, 1 m = 1000 mm, 1 km^2 = 1e6 m^2
daily_runoff = daily_streamflow_mlday * 1000 * 1000 / (catchment_area * 1e6)

In [None]:
daily_runoff.plot();

In [None]:
daily_cumecs = daily_streamflow_mlday * 1000 / 86400 # ML/day to m3/s
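
The two unit conversions above can be checked by hand. The following is a minimal, dependency-free sketch; the function names are illustrative only and are not part of swift2.

```python
def ml_per_day_to_mm_per_day(flow_ml_day: float, area_km2: float) -> float:
    """Convert a flow in ML/day to a runoff depth in mm/day over a catchment."""
    volume_m3 = flow_ml_day * 1000.0      # 1 ML = 1000 m3
    area_m2 = area_km2 * 1e6              # 1 km2 = 1e6 m2
    return volume_m3 / area_m2 * 1000.0   # depth in m, converted to mm


def ml_per_day_to_cumecs(flow_ml_day: float) -> float:
    """Convert a flow in ML/day to m3/s."""
    return flow_ml_day * 1000.0 / 86400.0  # 86400 seconds in a day


# 495 ML/day spread over 495 km2 is a depth of about 1 mm/day
print(ml_per_day_to_mm_per_day(495.0, 495.0))
# 86.4 ML/day is about 1 m3/s
print(ml_per_day_to_cumecs(86.4))
```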

## Creating a synthetic but realistic model

We create a system with total area similar to the real use case, but we use an arbitrary test catchment structure (multiple subareas). This is suitable for this tutorial.

In [None]:
areas_portions = np.array([91, 95, 6, 128, 93]) # arbitrary weights.
areas_portions = areas_portions / sum(areas_portions)
areasKm2 = areas_portions * catchment_area
sum(areasKm2)

In [None]:
summary, ms = create_test_catchment_structure(areas_km2=areasKm2)

In [None]:
summary

In [None]:
sum(areasKm2)

## Channel routing

This is beside the main point of this tutorial, but let's take a detour showing how to set up a uniform channel routing using a pure lag routing.

In [None]:
ms = swap_model(ms, "PureLag", "channel_routing")

`PureLag` has a `Tau` parameter that can be a positive floating point value. If we query the variable identifiers of one of the links now:

In [None]:
ms.get_variable_ids('link.lnk1')

In [None]:
from swift2.parameteriser import create_parameteriser
p = pd.DataFrame.from_dict({
    "Name": ["Tau"],
    "Value": [0.25],
    "Min": [0.25],
    "Max": [0.25],
})
pure_lag_six_hours = create_parameteriser('generic links', specs=p)

In [None]:
ms.get_state_value('link.lnk1.Tau')

In [None]:
pure_lag_six_hours.apply_sys_config(ms)

In [None]:
ms.get_state_value('link.lnk1.Tau')

We will run over a few years and calibrate with a warmup of two years.

## Assign simulation inputs

In [None]:
sa_ids = ms.get_subarea_ids()
rainfall_ids = mk_full_data_id('subarea', sa_ids, "P")
evap_ids = mk_full_data_id('subarea', sa_ids, "E")

In [None]:
ms.get_state_value(rainfall_ids)

In [None]:
for rids in rainfall_ids:
    ms.play_input(daily_rain, rids)
for evids in evap_ids:
    ms.play_input(daily_pet, evids)
ms.set_simulation_time_step('daily')

## Define a calibration time span

We define a calibration with objective calculation over 10 years, plus a 2-year warmup period.

In [None]:
e = pd.Timestamp("2024-12-31")

w = e - pd.DateOffset(years=10)
s = w - pd.DateOffset(years=2)

print(f"Calibration run: simulation from {s} to {e}, with a warmup till {w}")

The package includes a function that flags possible inconsistencies prior to running a model (inconsistent time steps, etc.).

In [None]:
ms.check_simulation()

We need to adjust a couple of parameters for proper operation on hourly data for the GR4 model structure.

Wait, what? The message is admittedly not the clearest, but in this case we have not yet set the simulation time span.

In [None]:
ms.set_simulation_span(s, e)

Now, the check can compare simulation span and time series spans, and finds no problem:

In [None]:
ms.check_simulation()

## GR4J (GR4H) modes

GR4J (j for "journalier" i.e. "daily") and GR4H (h for hourly) differ by the values of two parameters. There are two helper functions to switch modes on all GR4J models in the system.

In [None]:
configure_hourly_gr4j(ms)
ms.get_state_value("subarea.lnk1.UHExponent")

In [None]:
configure_daily_gr4j(ms)
ms.get_state_value("subarea.lnk1.UHExponent")

## Feasible parameter space and parsimony

We have a catchment with 5 subareas, each with a GR4J model. Leaving aside the links, which we will not calibrate, this still means 20 parameters overall to calibrate. This can be problematic, as there is likely an inflated parameter equifinality (many different combinations leading to essentially similar performance), and the resulting parameters may not be robust or physically sensible.

Instead, we can define a meta parameter set with only 4 degrees of freedom, with area scaling applied to `x4` and time scaling applied to `x2` and `x3`. The time scaling makes the parameterisation invariant if the simulation time step changes from daily to hourly, but in this sample the most telling scaling is the one for the "lag parameter" `x4`. A single `x4` meta-parameter is reflected in each subarea with values that are scaled according to a function (square root) of the unit's area. Intuitively, it makes sense that the bigger the subarea, the longer the flow routing lag.

In [None]:
ref_area = 250 # The area for which the scaling of x4 is invariant
time_span = 86400 # The time step of the simulation, one day is 86400 seconds
# time_span = 3600 # if we had an hourly simulation, and hourly inputs 

While it is possible to construct meta-parameterisers from scratch, it is tedious. The GR4J/H scaling strategy is well known and pre-implemented in `gr4j_scaled_parameteriser`.

In [None]:
p = gr4j_scaled_parameteriser(ref_area, time_span)

In [None]:
print(p.as_dataframe())

In [None]:
# set x4 bounds to be in "days", not hours
p_x4 = pd.DataFrame.from_dict({
    "Name": ["x4"],
    "Value": [1.0],
    "Min": [0.25],
    "Max": [10.0],
})

In [None]:
p.set_hypercube(p_x4)
p

In [None]:
subarea_ids = paste("subarea", get_subarea_ids(ms), sep=".")
areas = get_state_value(ms, paste(subarea_ids, "areaKm2", sep="."))
areas

Let us have a look at the values of the `x4` parameters in each subarea, before and after applying this meta-parameteriser `p`

In [None]:
x4_param_ids = paste(subarea_ids, "x4", sep=".")
get_state_value(ms, x4_param_ids)

In [None]:
p.apply_sys_config(ms)
get_state_value(ms, x4_param_ids)

The values of the individual `x4` parameters are scaled according to the area of each subarea: the larger the subarea, the longer the routing delay, and the larger `x4`. The reference area for which the scaling factor is 1.0 is 250 km^2, so the closer a subarea's area is to 250 km^2, the closer its area-based scaling factor is to 1.0.
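
To make the area scaling concrete, here is a minimal sketch in plain Python. It assumes the scaling factor is exactly the square root of the ratio of the subarea's area to the reference area; this matches the description above, but the precise formula used inside `gr4j_scaled_parameteriser` is not shown in this document, so treat the function below as an illustration rather than the library's implementation.

```python
import math


def area_scaled_x4(x4_meta: float, area_km2: float, ref_area_km2: float = 250.0) -> float:
    """Scale a catchment-level x4 for one subarea.

    ASSUMPTION: factor = sqrt(area / reference area); illustrative only.
    """
    return x4_meta * math.sqrt(area_km2 / ref_area_km2)


# Same arbitrary weights and total area as used earlier in this tutorial
weights = [91, 95, 6, 128, 93]
catchment_area = 495.0
areas_km2 = [w / sum(weights) * catchment_area for w in weights]

for area in areas_km2:
    print(f"{area:8.2f} km2 -> x4 = {area_scaled_x4(1.0, area):.3f}")
```

With a meta-parameter value of 1.0, every subarea here is smaller than 250 km^2, so under this assumption each scaled `x4` is below 1.0, and the smallest subarea gets the shortest lag.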

We can compose a parameter transformation on top of the tied parameteriser. It is typical to calibrate on log(x4) rather than x4.

In [None]:
p = p.wrap_transform()
p.add_transform("log_x4", "x4", "log10")

In [None]:
p
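
The effect of calibrating on `log_x4` can be sketched without swift2. The snippet below assumes that the feasible bounds of `x4` (0.25 to 10, as set earlier) are transformed together with the value, and that the back-transform is simply a power of ten; the helper names are illustrative.

```python
import math

X4_MIN, X4_MAX = 0.25, 10.0  # bounds on x4 set earlier in this tutorial, in days


def to_log10(x4: float) -> float:
    """Forward transform: the optimiser would search on this value."""
    return math.log10(x4)


def from_log10(log_x4: float) -> float:
    """Back-transform: recover x4 from the transformed value."""
    return 10.0 ** log_x4


log_bounds = (to_log10(X4_MIN), to_log10(X4_MAX))
midpoint = 0.5 * (log_bounds[0] + log_bounds[1])

print(log_bounds)
# The middle of the log-space interval maps back to the geometric mean of the bounds
print(from_log10(midpoint))
```

Sampling uniformly in log space spends as much effort between 0.25 and about 1.6 as between about 1.6 and 10, which is why a log transform is often preferred for a lag parameter spanning more than one order of magnitude.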

In [None]:
outflowVarname = "Catchment.StreamflowRate"
ms.record_state(outflowVarname)

In [None]:
ms.exec_simulation()
calc = ms.get_recorded(outflowVarname)

In [None]:
flow = as_xarray_series(daily_cumecs)

In [None]:
vis_e = as_timestamp(xr_ts_end(flow))
vis_s = vis_e - pd.DateOffset(years=3)

plot_two_series(flow, calc, names=["Observed", "Calculated"], start_time=vis_s, end_time=vis_e)

## Optimiser

Let's create an NSE evaluator, and check what the default parameter set yields as a goodness of fit.

In [None]:
objective = ms.create_objective(outflowVarname, flow, "NSE", w, e)
score = objective.get_score(p)
print(score)
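
For reference, the Nash-Sutcliffe efficiency (NSE) can be written in a few lines of plain Python. This is a sketch of the standard formula only; how the swift2 objective handles missing values and the warmup period is not shown here, and skipping NaN pairs as done below is an assumption.

```python
import math


def nse(observed, modelled):
    """Nash-Sutcliffe efficiency over the pairs where both values are present."""
    pairs = [
        (o, m)
        for o, m in zip(observed, modelled)
        if not (math.isnan(o) or math.isnan(m))  # ASSUMPTION: NaN pairs are skipped
    ]
    mean_obs = sum(o for o, _ in pairs) / len(pairs)
    sum_sq_errors = sum((o - m) ** 2 for o, m in pairs)
    sum_sq_deviations = sum((o - mean_obs) ** 2 for o, _ in pairs)
    return 1.0 - sum_sq_errors / sum_sq_deviations


obs = [1.0, 3.0, 2.0, 5.0, 4.0]
print(nse(obs, obs))               # a perfect fit
print(nse(obs, [3.0] * len(obs)))  # a model that always predicts the observed mean
```

A perfect fit scores 1, a model no better than the mean of the observations scores 0, and worse models score below 0.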

We have our objectives defined, and the parameter space 'p' in which to search. Let's create an optimiser and we are ready to go. While the optimiser can be created in one line, we show how to choose one custom termination criterion and how to configure the optimiser to capture a detailed log of the process.

In [None]:
if "SWIFT_FULL" in os.environ.keys():
    max_hours = 0.2
else:
    max_hours = 0.02

term = get_marginal_termination(tolerance = 1e-05, cutoff_no_improvement = 30, max_hours = max_hours)
# term = get_max_runtime_termination(max_hours=max_hours)
sce_params = get_default_sce_parameters()
urs = create_parameter_sampler(0, p, "urs")
optimiser = create_sce_optim_swift(objective, term, sce_params, urs)
calib_logger = set_calibration_logger(optimiser, "")

At this point you may want to specify the maximum number of cores that can be used by the optimiser, for instance if you wish to keep one core free to work in parallel on something else.

In [None]:
sce_params

The number of complexes is 6; by default the optimiser will try to use 6 CPU cores in parallel, or N-1 cores, where N is the number of cores on your machine, if N-1 is less than 6. It is possible to limit the level of parallelism if needed, for instance to keep a few cores free for other work while an optimiser runs for some time.

In [None]:
optimiser.set_maximum_threads_free_cores(2)
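
The rule of thumb described above can be sketched as a small function. This is an assumption about the behaviour based on the description in this tutorial, not a transcription of the optimiser's code; the function and argument names are hypothetical.

```python
def max_worker_threads(n_complexes: int, n_cores: int, free_cores: int = 1) -> int:
    """Number of threads an SCE run might use.

    ASSUMPTION: one thread per complex, capped so that `free_cores` cores
    stay available, and never fewer than one thread.
    """
    return max(1, min(n_complexes, n_cores - free_cores))


# 6 complexes, keeping 2 cores free, on machines of different sizes
for cores in (2, 4, 8, 16):
    print(cores, "cores ->", max_worker_threads(6, cores, free_cores=2), "threads")
```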

In [None]:
%%time
calib_results = optimiser.execute_optimisation()

Processing the calibration log below. We subset the full log to keep only some types of optimiser messages, in this case we do not keep the "shuffling" stages of the SCE algorithm.

In [None]:
opt_log = optimiser.extract_optimisation_log(fitness_name="NSE")
geom_ops = opt_log.subset_by_message(pattern="Initial.*|Reflec.*|Contrac.*|Add.*")
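
The pattern passed to `subset_by_message` is a regular expression. The sketch below shows which messages such a pattern keeps, using the standard `re` module. The message labels are hypothetical examples, and matching from the start of the message is an assumption; the actual labels and matching rule in the optimisation log may differ.

```python
import re

pattern = re.compile("Initial.*|Reflec.*|Contrac.*|Add.*")

# Hypothetical message labels, for illustration only
messages = [
    "Initial Population",
    "Reflection",
    "Contraction",
    "Adding a random point",
    "Shuffling complexes",
]

# re.match anchors at the start of the string, so only messages beginning
# with one of the four alternatives are kept
kept = [m for m in messages if pattern.match(m)]
print(kept)
```

Under these assumptions the "shuffling" entry is the only one dropped, which is the intent stated above.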

We can then visualize how the calibration evolved. There are several types of visualisations included in the **mhplot** package, and numerous customizations possible, but starting with the overall population evolution:

In [None]:
geom_ops._data["NSE"].describe()

In [None]:
p_var_ids = p.as_dataframe().Name.values
p_var_ids

In [None]:
v = OptimisationPlots(geom_ops)
for pVarId in p_var_ids:
    v.parameter_evolution(pVarId, obj_lims=[0, 1])
    plt.gcf().set_size_inches(10, 8)

In [None]:
# sortedResults = sortByScore(calib_results, 'NSE')
# best_pset = getScoreAtIndex(sortedResults, 1)
# best_pset = GetSystemConfigurationWila_R(best_pset)

In [None]:
best_pset = calib_results.get_best_score("NSE").parameteriser

*swift* can back-transform a parameteriser to obtain the untransformed parameter set(s):

In [None]:
best_pset

In [None]:
untfPset = best_pset.backtransform()
score = objective.get_score(best_pset)
score

In [None]:
score = objective.get_score(untfPset)
score

Finally, let's have a visual of the fitted streamflow data at Bright:

In [None]:
best_pset.apply_sys_config(ms)
ms.exec_simulation()
mod_runoff = ms.get_recorded(outflowVarname)

In [None]:
plot_two_series(
    flow, mod_runoff, start_time=vis_s, end_time=vis_e, names=["observed", "modelled"]
)

In [None]:
# runoff = flow / sum(areasKm2)
# runoff.plot()

# plot_two_series(
#     rainfall, runoff, start_time=vis_s, end_time=vis_e, names=["observed rain", "observed runoff"]
# )