# Parallel normative modelling

This notebook will go through the options of the runner class. We will show how to fit and evaluate a model in parallel, and how to do cross-validation. 

This notebook is an adaptation of the 'normative_modelling.ipynb' notebook, so we recommend working through that one first.

The notebook is tailored to the Slurm environment on the Donders HPC cluster, but can be adapted to other Slurm or Torque environments. 


## Setting up the environment on the cluster

First, SSH into the cluster. If you are using VS Code, you can use the Remote - SSH extension to connect to the cluster. It's a breeze.

We start with a clean environment and install the PCNtoolkit package. We do this in an interactive job, because it is much faster than using a login node.

```bash
sbash --time=10:00:00 --mem=16gb -c 4 --ntasks-per-node=1
module load anaconda3
conda create -n pcntoolkit_cluster_tutorial python=3.12
source activate pcntoolkit_cluster_tutorial
pip install pcntoolkit
pip install ipykernel
pip install graphviz
```

Next, we want to use the newly created environment in our notebook. 

If you are running this notebook in VS Code, you can select the environment with the kernel picker in the top right corner of the notebook.

Click "Select Another Kernel...", "Python environments...", and then from the dropdown, select the `pcntoolkit_cluster_tutorial` environment.

You may have to reload the window after creating the environment before it becomes available in VS Code: open the command palette (Mac: Cmd+Shift+P, Windows: Ctrl+Shift+P) and type "Reload Window".

After selecting the environment, the kernel picker in the top right corner should show the environment name.
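
To confirm from inside the notebook that the kernel really runs in the new environment, a quick check along these lines can help. It is a minimal sketch that assumes the usual POSIX conda layout, `<env>/bin/python`; the helper name `env_name_from_executable` is made up for this example.

```python
import sys
from pathlib import PurePosixPath


def env_name_from_executable(executable: str) -> str:
    """Return the environment folder name, assuming the layout <env>/bin/python."""
    return PurePosixPath(executable).parent.parent.name


# Compare the environment of the running kernel with the one we created.
expected = "pcntoolkit_cluster_tutorial"
actual = env_name_from_executable(sys.executable or "")
print(f"Kernel environment: {actual} (expected: {expected})")
```

If the two names differ, the notebook is still attached to another kernel.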

### Imports

In [1]:
import warnings
import logging

import pandas as pd
from pcntoolkit.dataio.norm_data import NormData
from pcntoolkit.normative_model import NormativeModel
from pcntoolkit.regression_model.blr import BLR
from pcntoolkit.math.basis_function import BsplineBasisFunction
from pcntoolkit.util.runner import Runner
import numpy as np
import pcntoolkit.util.output
import os
import sys

# Get the conda environment path: two levels up from the Python executable (<env>/bin/python)
conda_env_path = os.path.dirname(os.path.dirname(sys.executable))
print(f"This should be the conda environment path: {conda_env_path}")

# Suppress some annoying warnings and logs
pymc_logger = logging.getLogger("pymc")
pymc_logger.setLevel(logging.WARNING)
pymc_logger.propagate = False
warnings.simplefilter(action="ignore", category=FutureWarning)
pd.options.mode.chained_assignment = None  # default='warn'
# Unlike the filters above, keep PCNtoolkit's own messages switched on
pcntoolkit.util.output.Output.set_show_messages(True)

This should be the conda environment path: /project/3022000.05/projects/stijdboe/envs/pcntoolkit_cluster_tutorial


## Load data

First we download a small example dataset from GitHub. Saving it locally (under 'resources/data/fcon1000.csv', for example) saves time and bandwidth when you re-run this notebook.

In [2]:
resource_dir = "resources"
os.makedirs(os.path.join(resource_dir, "data"), exist_ok=True)
if not os.path.exists("resources/data/fcon1000.csv"):
    pd.read_csv(
        "https://raw.githubusercontent.com/predictive-clinical-neuroscience/PCNtoolkit-demo/refs/heads/main/data/fcon1000.csv"
    ).to_csv("resources/data/fcon1000.csv", index=False)

data = pd.read_csv("resources/data/fcon1000.csv")
data["sex"] = np.where(data["sex"], "Male", "Female")
print(os.path.abspath("resources/data/fcon1000.csv"))
covariates = ["age"]
batch_effects = ["sex", "site"]
# For a larger run, select more columns, e.g. data.columns[3:220]; here we model only four.
response_vars = data.columns[3:7]
print(f"We model {len(response_vars)} variables")
norm_data = NormData.from_dataframe(
    name="full",
    dataframe=data,
    covariates=covariates,
    batch_effects=batch_effects,
    response_vars=response_vars,
)
transfer_sites = ["Milwaukee_b", "Oulu"]
transfer_data, fit_data = norm_data.split_batch_effects({"site": transfer_sites}, names=("transfer", "fit"))
train, test = fit_data.train_test_split()
transfer_train, transfer_test = transfer_data.train_test_split()


/project/3022000.05/projects/stijdboe/Projects/PCNtoolkit/examples/resources/data/fcon1000.csv
We model 4 variables
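
Conceptually, the `split_batch_effects` call above sets aside every subject from the listed sites and keeps the rest for fitting. The sketch below illustrates that idea on a few made-up rows in plain Python. It is not PCNtoolkit's implementation; the helper `split_by_site`, the toy rows, and the site names `SiteA` and `SiteB` are assumptions made for illustration.

```python
def split_by_site(rows, held_out_sites):
    """Split rows into (rows from the held-out sites, all remaining rows)."""
    held_out_sites = set(held_out_sites)
    held_out = [row for row in rows if row["site"] in held_out_sites]
    remaining = [row for row in rows if row["site"] not in held_out_sites]
    return held_out, remaining


# Toy rows, invented for this example (not taken from fcon1000.csv).
toy_rows = [
    {"site": "Oulu", "age": 21.0},
    {"site": "SiteA", "age": 34.5},
    {"site": "Milwaukee_b", "age": 52.0},
    {"site": "SiteB", "age": 28.0},
]

# Same order as in the notebook: first the transfer part, then the fit part.
toy_transfer, toy_fit = split_by_site(toy_rows, ["Milwaukee_b", "Oulu"])
print(len(toy_transfer), len(toy_fit))
```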


## Configure the regression model


In [3]:
# BLR model
blr_regression_model = BLR(
    name="template",
    n_iter=1000,
    tol=1e-8,
    optimizer="l-bfgs-b",
    l_bfgs_b_epsilon=0.1,
    l_bfgs_b_l=0.1,
    l_bfgs_b_norm="l2",
    fixed_effect=True,
    basis_function_mean=BsplineBasisFunction(basis_column=0, degree=3, nknots=5),
    heteroskedastic=True,
    basis_function_var=BsplineBasisFunction(basis_column=0, degree=3, nknots=5),
    fixed_effect_var=True,
    intercept=True,
    intercept_var=True,
)
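
For readers unfamiliar with B-spline bases, the sketch below evaluates a cubic B-spline basis with the Cox-de Boor recursion. It only illustrates the kind of expansion a B-spline basis performs on a covariate. How `BsplineBasisFunction` places its knots and interprets `nknots` is not shown in this notebook, so the clamped, evenly spaced knot vector used here is an assumption, and `clamped_knots` and `bspline_basis` are hypothetical helper names.

```python
def clamped_knots(lower, upper, n_knots, degree):
    """Evenly spaced knots on [lower, upper]; each end knot is repeated `degree` extra times."""
    step = (upper - lower) / (n_knots - 1)
    inner = [lower + i * step for i in range(n_knots)]
    return [lower] * degree + inner + [upper] * degree


def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions of the given degree at x.

    Uses half-open knot intervals, so x must satisfy knots[0] <= x < knots[-1].
    """
    n_intervals = len(knots) - 1
    # Degree 0: indicator of the knot interval that contains x.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0 for i in range(n_intervals)]
    # Raise the degree one step at a time (Cox-de Boor recursion).
    for d in range(1, degree + 1):
        raised = []
        for i in range(n_intervals - d):
            left_width = knots[i + d] - knots[i]
            right_width = knots[i + d + 1] - knots[i + 1]
            # Terms with a zero-width support are defined as zero.
            left = (x - knots[i]) / left_width * basis[i] if left_width > 0 else 0.0
            right = (knots[i + d + 1] - x) / right_width * basis[i + 1] if right_width > 0 else 0.0
            raised.append(left + right)
        basis = raised
    return basis


knots = clamped_knots(0.0, 1.0, n_knots=5, degree=3)
basis = bspline_basis(0.3, knots, degree=3)
print(len(basis), round(sum(basis), 6))
```

Each covariate value is thus replaced by a short vector of non-negative weights that sum to one, which lets a linear model capture a smooth nonlinear trend.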

In [4]:
model = NormativeModel(
    template_reg_model=blr_regression_model,
    # Whether to save the model after fitting.
    savemodel=True,
    # Whether to evaluate the model after fitting.
    evaluate_model=True,
    # Whether to save the results after evaluation.
    saveresults=True,
    # Whether to save the plots after fitting.
    saveplots=True,
    # The directory to save the model, results, and plots.
    save_dir="resources/blr/save_dir",
    # The scaler to use for the input data. One of "standardize", "minmax", "robustminmax", "none"
    inscaler="standardize",
    # The scaler to use for the output data. One of "standardize", "minmax", "robustminmax", "none"
    outscaler="standardize",
)
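
To make the `"standardize"` option concrete: standardizing subtracts the mean and divides by the standard deviation, and values can be mapped back with the inverse transform. The sketch below shows that round trip in plain Python. It is an illustration, not PCNtoolkit's scaler; the function names are hypothetical, the ages are toy values, and the use of the population standard deviation is an assumption.

```python
from statistics import fmean, pstdev


def fit_standardizer(values):
    """Return the mean and (population) standard deviation of the values."""
    return fmean(values), pstdev(values)


def standardize(values, mean, std):
    """Map values to zero mean and unit standard deviation."""
    return [(v - mean) / std for v in values]


def unstandardize(scaled, mean, std):
    """Invert `standardize`, returning values on the original scale."""
    return [s * std + mean for s in scaled]


ages = [20.0, 30.0, 40.0, 50.0]  # toy values, invented for this example
mean, std = fit_standardizer(ages)
scaled = standardize(ages, mean, std)
restored = unstandardize(scaled, mean, std)
print(scaled)
print(restored)
```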

## Fit the model
Normally we would just call `fit_predict` on the model directly, but because we want to fit in parallel on the cluster, we first need to create a runner object. Cross-validation is controlled by the runner's `cross_validate` argument, which is set to `False` in this example.

In [5]:
runner = Runner(
    cross_validate=False,
    parallelize=True,
    environment=conda_env_path,
    job_type="slurm",  # or "torque" if you are on a torque cluster
    n_jobs=2,
    time_limit="00:10:00",
    log_dir="resources/runner_output/log_dir",
    temp_dir="resources/runner_output/temp_dir",
)
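
The runner above is created with `cross_validate=False`. For reference, the notion of a fold can be sketched as follows: the samples are divided into K disjoint test sets, and each fold trains on everything outside its own test set. This is a generic K-fold illustration with a hypothetical helper `k_fold_indices`; the way the runner actually forms its folds may differ.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs for a simple K-fold split."""
    indices = list(range(n_samples))
    # Deal the indices round-robin into n_folds disjoint test sets.
    test_sets = [indices[i::n_folds] for i in range(n_folds)]
    for test in test_sets:
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


folds = list(k_fold_indices(n_samples=10, n_folds=5))
for fold_number, (train_idx, test_idx) in enumerate(folds):
    print(fold_number, len(train_idx), test_idx)
```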


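The runner object will now fit the model in parallel, split over `n_jobs=2` cluster jobs, and save the results. When cross-validation is enabled, it creates a separate save directory for each fold. With `observe=True`, the call reports the job status, as shown in the output below.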

In [6]:
runner.fit_predict(model, train, test, observe=True)


---------------------------------------------------------
              PCNtoolkit Job Status Monitor ®
---------------------------------------------------------
Task uuid: e3547c6c-312b-44b4-adeb-841456357e37
---------------------------------------------------------
Job ID      Name          State      Time      Nodes
---------------------------------------------------------

46951638    ptk_job_0 COMPLETED                          
46951639    ptk_job_1 COMPLETED                          

---------------------------------------------------------
Total active jobs: 0
Total completed jobs: 2
Total failed jobs: 0
---------------------------------------------------------


---------------------------------------------------------
No more running jobs!
---------------------------------------------------------



### Loading a fold model
We can load a model for a specific fold by calling `load_model` on the runner object. This will return a `NormativeModel`, which we can inspect and use to predict on new data.


In [7]:
fitted_model = runner.load_model(1)
display(fitted_model)

<pcntoolkit.normative_model.NormativeModel at 0x7f3e2b3cd4c0>

## Model extension

BLR models can only be extended, not transferred (yet).


In [8]:
runner.extend_predict(model, transfer_train, transfer_test)


---------------------------------------------------------
              PCNtoolkit Job Status Monitor ®
---------------------------------------------------------
Task uuid: 889e4103-b842-4338-8589-292b83d282c8
---------------------------------------------------------
Job ID      Name          State      Time      Nodes
---------------------------------------------------------

46951640    ptk_job_0 COMPLETED                          
46951641    ptk_job_1 COMPLETED                          

---------------------------------------------------------
Total active jobs: 0
Total completed jobs: 2
Total failed jobs: 0
---------------------------------------------------------


---------------------------------------------------------
No more running jobs!
---------------------------------------------------------



<pcntoolkit.normative_model.NormativeModel at 0x7f3e2cdcdcd0>

Datasets with a zscores DataArray will have the `.plot_qq()` function available.

## More to do with the runner

The following functions are available:
- `transfer(transfer_data)`: Transfer the model to transfer_data.
- `extend(extend_data)`: Extend the model to extend_data.
- `transfer_predict(transfer_data, transfer_test)`: Transfer to transfer_data and predict on transfer_test.
- `extend_predict(extend_data, extend_test)`: Extend to extend_data and predict on extend_test.