examples/basic_example/example.Rmd

---
title: "Basic Hmsc-HPC usage example"
output:
  html_document:
    df_print: paged
---

This notebook demonstrates the concept of using Hmsc-HPC extension for `Hmsc` package. Unlike the core `Hmsc`, the Hmsc-HPC extension is written in Python programming language and is executed with Python interpreter. Hence, before a user can use it, a proper Python installation is required.

# Preparing Python environment

**If you are familiar with using Python within R**, then please configure Python in your preferred way and pip-install the Python package from the distributed zip package (`pip install .../path/to/hmsc-hpc`), and skip to the section [Checking Python environment].

**If you are not familiar with using Python within R**, then please follow the detailed instructions below.

## Detailed instructions

### 1. Finding Python installation

Depending on your hardware, operating system and user rights, the set of steps to acquire and configure a proper Python distribution may greatly vary. Thus, we would like to relay the installation process itself either to one of multitude guides available on the web, or to the IT support that manages your device. This section merely checks that a Python installation can be found.

Please test the next chunk of code that tries to check version of Python available in you system:

```{r}
system_python = "python3"
# system_python = "/Users/username/opt/anaconda3/envs/tf/bin/python3"
system2(system_python, "--version")
```

If the Python distribution in your system is configured well, then the code shall print the version of Python. If this is failing, then you are likely missing Python altogether, or its path is not configured correctly. In some cases you may have several distributions of Python available, and then you shall either explicitly specify the path to the desired Python distribution --- as exemplified in the commented-out `system_python = ...` line above.

### 2. Creating a new virtual environment

Please note that this notebook is configured **NOT** to execute the codes that install any software or packages --- both in this and next sections during whole notebook execution. Please execute them manually one by one if needed.


The next line creates an empty virtual environment where we will set up Hmsc-HPC:

```{r}
system2(system_python, "-m venv hmsc-venv")
```

Then, we activate this Python environment by defining an appropriate `python` path variable and check that it works by printing the version of Python:

```{r}
python = file.path(getwd(), "hmsc-venv", "bin", "python")  # for Linux and macOS
# python = file.path(getwd(), "hmsc-venv", "Scripts", "python")  # for Windows
system2(python, "--version")
```

If this is failing, then you need to adjust the path to the correct Python executable (note that the path depends on operating system -- see comments in the code block above).

### 3. Install Hmsc-HPC package

For installing the Hmsc-HPC Python package, we need to define the path to the Hmsc-HPC package. Assuming that you have downloaded this notebook as a part of the distributed zip package, the pre-set `package_path` shall work fine. Otherwise, please set the correct `package_path`. (Note for reviewers: after the blind review, the Hmsc-HPC package will be published and can be directly installed from web repository.):

```{r}
package_path = file.path(getwd(), "..", "..")
system2(python, "-m pip install --upgrade pip")
system2(python, paste("-m pip install", shQuote(package_path)))
```

After this, you should have a functioning Python environment.

## Checking Python environment

This section is for checking whether the examples in this notebook shall be expected to execute well or not.

The next code chunk tests that the Python environment works by executing a basic TensorFlow-based command and importing Hmsc-HPC package. Please define the correct `python` path to the Python executable. If you have configured Python outside R, then the default should work fine.

```{r}
# Choose correct python by uncommenting correct line:
# python = "python3"  # default
# python = file.path(getwd(), "hmsc-venv", "bin", "python")  # hmsc-venv for Linux and macOS
# python = file.path(getwd(), "hmsc-venv", "Scripts", "python")  # hmsc-venv for Windows

Sys.setenv(TF_CPP_MIN_LOG_LEVEL=3)  # reduce debug output from tensorflow
system2(python, "-c \"import tensorflow as tf; print(tf.constant(1))\"")
system2(python, "-c \"import hmsc\"")
```

Your Python setup is working correctly if the code does not produce any errors.

### Troubleshooting

If the above check and the detailed instructions do not work for you, please consult one of multitude guides available on the web. You can configure Python within R in multiple ways - for instance, `reticulate` package features several functions aimed to achieve that or you can prepare and activate the correct Python environment outside R.

# Setting up a toy Hmsc model

First, we shall acquire a sufficiently recent `Hmsc` package. Most likely, the actual distribution on CRAN is already suitable, but most certainly it can be done from the master branch of `Hmsc` repo on GitHub.

```{r eval=FALSE, include=FALSE}
library(devtools)
install_github("hmsc-r/HMSC")
```

Next, we load the required packages. We also set up the path to the working directory and the path to the Hmsc-HPC package. Assuming that you have downloaded this notebook as a part of the distributed Hmsc-HPC extension, the pre-set relative paths shall work fine. Otherwise, please note that these are user and system-specific, therefore you shall ensure their correctness yourself.

```{r}
library(Hmsc)
library(jsonify)
```

Next, we introduce the fundamental model fitting parameters determining the MCMC sampling: number of samples to obtain per MCMC chain, thinning and number of chains. We also define the regularity of progress printing during MCMC sampling. We set the transient phase being equally long as the sampling phase.

```{r}
nSamples = 100
thin = 2
nChains = 4
verbose = 100
transient = nSamples * thin
```

For the sake of this example close-to-online run times, we deliberately use short MCMC chains that are not expected to reach sufficient convergence. This example can be rerun with longer MCMC chains to ensure their convergence.

We use the `TD` synthetic data, included to the `Hmsc` package for the demonstration purposes. In practice, the users would have to define the model with their own data. We set up the model with `Hmsc(...)` call and fit it with

```{r}
summary(TD)
m = Hmsc(Y=TD$Y,
				 XData=TD$X, XFormula=~.,
				 TrData=TD$Tr[,-1], TrFormula=~.,
				 phyloTree=TD$phy,
				 studyDesign=TD$studyDesign,
				 ranLevels=list(plot=TD$rL1, sample=TD$rL2))
```

### Standard Hmsc-R fitting

Next, we obtain the reference by fitting `m` using the standard way with `sampleMcmc(...)` call.

```{r include=FALSE, results='hide'}
fitR = sampleMcmc(m, samples=nSamples, thin=thin,
									transient=transient, nChains=nChains,
									verbose=verbose, engine="R")
```

We also visualize some possible post-fitting postprocessing available in `Hmsc` package --- by the estimated partition of variance in the fitted model.

```{r}
plotVariancePartitioning(fitR, computeVariancePartitioning(fitR),
												 args.legend=list(x="bottomright"))
```

# Exporting model for Hmsc-HPC

In order to use Hmsc-HPC, we need to export the model object, created by `Hmsc(...)` call. Also, the current design of `Hmsc-R` and `Hmsc-HPC` interplay requires that the model initialization, which happens just before the MCMC chains are run, is conducted in R. Thus, we use special keyword in the call `sampleMcmc(..., engine="HPC")`, which denotes that we are not interested in sampling the model, but only to initialize the sampling. We save the resulted object in obligatory JSON+RDS format to the working directory.

```{r}
init_obj = sampleMcmc(m, samples=nSamples, thin=thin,
											transient=transient, nChains=nChains,
											verbose=verbose, engine="HPC")
init_file_path = file.path(getwd(), "init_file.rds")
saveRDS(to_json(init_obj), file=init_file_path)
```

# Hmsc-HPC for sequential execution of chains

As the Hmsc-HPC operates in Python, in the next step we programmatically formulate the required call.

```{r}
post_file_path = file.path(getwd(), "post_file.rds")
python_cmd_args = paste("-m hmsc.run_gibbs_sampler",
												"--input", shQuote(init_file_path),
												"--output", shQuote(post_file_path),
												"--samples", nSamples,
												"--transient", transient,
												"--thin", thin,
												"--verbose", verbose)
cat(paste(shQuote(python), python_cmd_args), "\n")
```

In this example we implicitly focus on the case of using local machine for MCMC execution, but any properly set-up machine can be used --- the user just need to move the initialized object there.

### Running Python model fitting script

If the user is savvy in running Python scripts from R, then the outputted calls can be executed as a part of R script execution. While this can be accomplished by executing the next chunk of code, we personally have found it to be very sensitive to a very proper Python configuration. Thus, at this stage we recommend to run a shell (command line) and simply paste the call produced by the previous chunk - at least during the user's learning of Hmsc-HPC workflow.

```{r}
system2(python, python_cmd_args)
```

### Importing computed posterior to R

Once the Python call has conducted, the following step is to import the calculated posterior samples back to R. We start by reading the output of `Hmsc-HPC`, which is a stacked list of fitted chains and time elapsed for model fitting.

```{r}
importFromHPC = from_json(readRDS(file = post_file_path)[[1]])
postList = importFromHPC[1:nChains]
cat(sprintf("fitting time %.1f sec\n", importFromHPC[[nChains+1]]))
```

A fitted Hmsc-R model differs from its unfitted counterpart in two aspects. First, it contains information on the sampling being done. Next, it accommodates the list of fitted MCMC, each of which is a list of posterior samples for that chain. Both adjustments are made with a novel function of the `Hmsc` package, called `importPosteriorFromHPC(...)` that takes the unfitted model, imported list of chains produced by Hmsc-HPC and the core MCMC settings that were used for the fitting.

```{r}
fitTF = importPosteriorFromHPC(m, postList, nSamples, thin, transient)
```

We finalize by plotting the variance partition again --- just to ensure the integrity of the resulted object.

```{r}
plotVariancePartitioning(fitTF, computeVariancePartitioning(fitTF), args.legend=list(x="bottomright"))
```

# Hmsc-HPC for parallel execution of multiple chains

If the model fitting with Hmsc-HPC is conducted locally, then there is typically no need for parallel execution of chains. However, if the user's machine features multiple GPUs or a specialized CPU with very large number of cores, there is a need to provide an opportunity for the MCMC being run in parallel. Furthermore, this is the intended pathway especially once fitting is conducted with HPC infrastructure. Hence, we provide an example of how to execute such pathway as well.

For this purpose, we need to create chain-specific Python calls, so that each one will handle a single particular chain.

```{r}
chain_cmd_args_list = vector("list", nChains)
for(cInd in 1:nChains){
	chain_file_path = file.path(getwd(), sprintf("post_chain%.2d_file.rds", cInd-1))
	chain_cmd_args = paste("-m hmsc.run_gibbs_sampler",
												 "--input", shQuote(init_file_path),
												 "--output", shQuote(chain_file_path),
												 "--samples", nSamples,
												 "--transient", transient,
												 "--thin", thin,
												 "--verbose", verbose,
												 "--chain", cInd-1)
	cat(paste(shQuote(python), chain_cmd_args), "\n")
	chain_cmd_args_list[[cInd]] = chain_cmd_args
}
```

Next step is to appropriately execute each of these calls. For the sake of runnability of this notebook in most users' devices, here we manually mimic the job scheduler, while an example of doing it properly in a Slurm-based HPC cluster is provided in the next section. The principal difference from the single call considers above is that each call's output is a separate file with chain-specific results.

```{r}
for(cInd in 1:nChains){
	system2(python, chain_cmd_args_list[[cInd]])
}
```

Therefore, once all Python calls have conducted, the following step is to import the calculated posterior samples back to R. We exemplify it as following.

```{r}
chainList = vector("list", nChains)
for(cInd in 1:nChains){
	chain_file_path = file.path(getwd(), sprintf("post_chain%.2d_file.rds", cInd-1))
	chainList[[cInd]] = from_json(readRDS(file = chain_file_path)[[1]])[[1]]
}
```

The remaining steps remain identical to the previously shown alternative.

```{r}
fitSepTF = importPosteriorFromHPC(m, chainList, nSamples, thin, transient)
plotVariancePartitioning(fitSepTF, computeVariancePartitioning(fitSepTF), args.legend=list(x="bottomright"))
```

# Running Hmsc-HPC in a cluster

Different computational clusters greatly diverge in how software and scripts are run, both in terms of execution characteristics and in user interface. Thus, it is close to impossible to properly cover all the alternatives related to how Hmsc-HPC can be run in HPC that the particular user is interested and has access to. Therefore, we focus on how we have done it with the HPC under Slurm cluster management and job scheduling system, which is used at the supercomputers maintained by CSC (Finland) that provided HPC resources for this research project.

### Script for parallel execution of chains

The following script is an example to be run with `sbatch` command from the login node of the cluster.

```{bash eval=FALSE}
#!/bin/bash
#SBATCH --account=project_123456789
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=32G
#SBATCH --gpus-per-node=1
#SBATCH --time=00:15:00
#SBATCH --partition=small-g
#SBATCH --array=0-3

module load tensorflow/2.12
# pip install hmsc ...

SAM=${1:-100}
THIN=${2:-10}

input_path="init_file.rds"
output_path="post_file.rds"
output_path=$(printf "post_chain%.2d_file.rds" $SLURM_ARRAY_TASK_ID)

srun python3 -m hmsc.run_gibbs_sampler --input $input_path --output $output_path --samples $SAM --transient $(($SAM*$THIN)) --thin $THIN --verbose 100 --chain $SLURM_ARRAY_TASK_ID
```

Note that the `#SBATCH` settings in the header of the script and the module environment will be different depending on the supercomputer used, so please consult the help webpage or IT support of the HPC resource that you intend to use. This presented script was used for runs in LUMI supercomputer.