# Parallel normative modelling

With large datasets, it is often necessary to fit normative models in parallel. This notebook goes through the options of the `Runner` class: we show how to fit and evaluate a model in parallel, and how to do cross-validation.

This notebook is just an adaptation of the 'normative_modelling.ipynb' notebook, so it is recommended that you look at that one first.

The notebook is tailored to Slurm jobs on the Donders HPC cluster, but can be adapted to other Slurm or Torque environments. 

### Imports

In [1]:
import warnings
import logging


import pandas as pd
from pcntoolkit.dataio.norm_data import NormData
from pcntoolkit.normative_model import NormativeModel
from pcntoolkit.util.plotter import plot_centiles, plot_qq
from pcntoolkit.regression_model.blr import BLR
from pcntoolkit.math.basis_function import BsplineBasisFunction
from pcntoolkit.util.runner import Runner
import numpy as np
import pcntoolkit.util.output
import seaborn as sns
import os

# Suppress some annoying warnings and logs
pymc_logger = logging.getLogger("pymc")

pymc_logger.setLevel(logging.WARNING)
pymc_logger.propagate = False

warnings.simplefilter(action="ignore", category=FutureWarning)
pd.options.mode.chained_assignment = None  # default='warn'
pcntoolkit.util.output.Output.set_show_messages(True)

# Load data

First we download a small example dataset from GitHub. Saving this dataset on your local device (under 'resources/data/fcon1000.csv' for example) saves time and bandwidth if you re-run this notebook.

In [2]:
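if not os.path.exists("resources/data/fcon1000.csv"):
    # Create the target directory first; to_csv does not create missing directories
    os.makedirs("resources/data", exist_ok=True)
    pd.read_csv(
        "https://raw.githubusercontent.com/predictive-clinical-neuroscience/PCNtoolkit-demo/refs/heads/main/data/fcon1000.csv"
    ).to_csv("resources/data/fcon1000.csv", index=False)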

data = pd.read_csv("resources/data/fcon1000.csv")
data["sex"] = np.where(data["sex"], "Male", "Female")

covariates = ["age"]
batch_effects = ["sex", "site"]
response_vars = data.columns[3:220]
print(f"We model {len(response_vars)} variables")
norm_data = NormData.from_dataframe(
    name="full",
    dataframe=data,
    covariates=covariates,
    batch_effects=batch_effects,
    response_vars=response_vars,
)
transfer_sites = ["Milwaukee_b", "Oulu"]
transfer_data, fit_data = norm_data.split_batch_effects({"site": transfer_sites}, names=("transfer", "fit"))
train, test = fit_data.train_test_split()
transfer_train, transfer_test = transfer_data.train_test_split()


We model 217 variables
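To make the splitting steps above concrete, here is a minimal pandas sketch of the same idea on a toy table. It is not how `NormData` implements `split_batch_effects` or `train_test_split`; the toy values, the site names `SiteA` and `SiteB`, and the 80/20 split ratio are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for the real table; all values are made up for illustration.
toy = pd.DataFrame(
    {
        "age": [21, 34, 45, 52, 29, 61, 38, 47],
        "site": ["Oulu", "SiteA", "Milwaukee_b", "SiteA", "Oulu", "SiteB", "SiteB", "SiteA"],
    }
)
transfer_sites = ["Milwaukee_b", "Oulu"]

# Rows from the transfer sites go to one set, all remaining rows to the other.
is_transfer = toy["site"].isin(transfer_sites)
toy_transfer = toy[is_transfer]
toy_fit = toy[~is_transfer]

# A plain random train/test split of the remaining rows (80/20 is an arbitrary choice).
toy_train = toy_fit.sample(frac=0.8, random_state=0)
toy_test = toy_fit.drop(toy_train.index)

print(len(toy_transfer), len(toy_fit), len(toy_train), len(toy_test))
```

The transfer sites are held out entirely, so the model fitted on `toy_train` never sees them; that is what makes them suitable for demonstrating transfer later on.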


## Configure the regression model


In [3]:
# BLR model
blr_regression_model = BLR(
    name="template",
    n_iter=1000,
    tol=1e-8,
    optimizer="l-bfgs-b",
    l_bfgs_b_epsilon=0.1,
    l_bfgs_b_l=0.1,
    l_bfgs_b_norm="l2",
    fixed_effect=True,
    basis_function_mean=BsplineBasisFunction(basis_column=0, degree=3, nknots=5),
    heteroskedastic=True,
    basis_function_var=BsplineBasisFunction(basis_column=0, degree=3, nknots=5),
    fixed_effect_var=True,
)
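The `BsplineBasisFunction` arguments above expand the covariate in column 0 into a set of smooth basis functions. As a rough illustration of what such an expansion looks like, the following sketch evaluates a clamped cubic B-spline basis with the Cox-de Boor recursion. The helper `bspline_basis` and the evenly spaced knots are assumptions made for this example; PCNtoolkit's own knot placement and number of basis columns may differ.

```python
import numpy as np


def bspline_basis(x, knots, degree):
    """Evaluate all clamped B-spline basis functions at x (Cox-de Boor recursion)."""
    x = np.asarray(x, dtype=float)
    # Clamp the knot vector by repeating each boundary knot `degree` extra times.
    t = np.concatenate([[knots[0]] * degree, knots, [knots[-1]] * degree])
    n_intervals = len(t) - 1

    # Degree 0: indicator function of each half-open knot interval.
    B = np.zeros((len(x), n_intervals))
    for i in range(n_intervals):
        B[:, i] = (t[i] <= x) & (x < t[i + 1])
    # Assign the right boundary point to the last non-empty interval.
    B[x == t[-1], len(t) - degree - 2] = 1.0

    # Raise the degree one step at a time, skipping terms with zero-width support.
    for d in range(1, degree + 1):
        B_next = np.zeros((len(x), n_intervals - d))
        for i in range(n_intervals - d):
            left_width = t[i + d] - t[i]
            right_width = t[i + d + 1] - t[i + 1]
            if left_width > 0:
                B_next[:, i] += (x - t[i]) / left_width * B[:, i]
            if right_width > 0:
                B_next[:, i] += (t[i + d + 1] - x) / right_width * B[:, i + 1]
        B = B_next
    return B


ages = np.linspace(18.0, 80.0, 50)   # hypothetical age range
knots = np.linspace(18.0, 80.0, 5)   # five evenly spaced knots (an assumption)
Phi = bspline_basis(ages, knots, degree=3)
print(Phi.shape)
```

Each row of `Phi` sums to one, and the regression weights are learned on these columns instead of on raw age, which lets a linear model follow a nonlinear trajectory.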

In [4]:
model = NormativeModel(
    template_reg_model=blr_regression_model,
    # Whether to save the model after fitting.
    savemodel=True,
    # Whether to evaluate the model after fitting.
    evaluate_model=True,
    # Whether to save the results after evaluation.
    saveresults=True,
    # Whether to save the plots after fitting.
    saveplots=False,
    # The directory to save the model, results, and plots.
    save_dir="resources/blr/save_dir",
    # The scaler to use for the input data. Can be either one of "standardize", "minmax", "robustminmax", "none"
    inscaler="standardize",
    # The scaler to use for the output data. Can be either one of "standardize", "minmax", "robustminmax", "none"
    outscaler="standardize",
)
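The `"standardize"` option for `inscaler` and `outscaler` refers to rescaling each column to zero mean and unit variance. A minimal sketch of that transformation and its inverse is shown below; the class `StandardizeSketch` is a hypothetical stand-in written for this example, not PCNtoolkit's scaler class.

```python
import numpy as np


class StandardizeSketch:
    """Hypothetical stand-in for a standardizing scaler."""

    def fit(self, X):
        # Store the per-column mean and standard deviation of the training data.
        self.m = X.mean(axis=0)
        self.s = X.std(axis=0)
        return self

    def transform(self, X):
        # Map the data to zero mean and unit variance.
        return (X - self.m) / self.s

    def inverse_transform(self, X):
        # Map scaled values back to the original units.
        return X * self.s + self.m


rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=15.0, size=(200, 1))  # made-up covariate values
scaler = StandardizeSketch().fit(X)
X_scaled = scaler.transform(X)
print(X_scaled.mean(), X_scaled.std())
```

The inverse transform matters for the output scaler in particular: predictions are made in the scaled space and have to be mapped back before they can be compared with the raw measurements.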

## Fit the model
Normally we would call `fit_predict` on the model directly, but to run cross-validation or to fit in parallel we first need to create a runner object.
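For readers unfamiliar with k-fold cross-validation, the sketch below shows how a dataset can be divided into folds so that every sample is used for testing exactly once. The helper `kfold_indices` is hypothetical and uses a plain random shuffle; the runner may split differently (for example, stratified by batch effects), so treat this as an illustration of the concept only.

```python
import numpy as np


def kfold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_folds)
    for k in range(n_folds):
        test_idx = folds[k]
        # All other folds together form the training set for fold k.
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train_idx, test_idx


splits = list(kfold_indices(n_samples=10, n_folds=3))
for k, (train_idx, test_idx) in enumerate(splits):
    print(f"fold {k}: {len(train_idx)} train, {len(test_idx)} test")
```

Each fold requires its own model fit, which is why cross-validation combines naturally with running jobs in parallel.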

In [5]:
runner = Runner(
    cross_validate=False,
    cv_folds=3,
    parallelize=False,
    environment="/project/3022000.05/projects/stijdboe/envs/dev_refactor",
    job_type="slurm",  # or "torque" if you are on a torque cluster
    n_jobs=10,
    log_dir="resources/runner_output/log_dir",
    temp_dir="resources/runner_output/temp_dir",
)


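Calling `fit_predict` on the runner fits the model on `train` and evaluates it on `test`. With `parallelize=True` the work is submitted as cluster jobs, and with `cross_validate=True` the runner creates a separate save directory for each fold. Both options are set to `False` in the runner above, so set them to `True` to run cross-validation in parallel.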
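One natural way to parallelize a normative model is to divide the response variables over jobs, since each response variable gets its own regression model. The sketch below shows such a division for 217 variables and 10 jobs. How the runner actually assigns variables to jobs is not shown in this notebook, so the helper `split_into_batches` and the placeholder names `roi_0`, `roi_1`, ... are assumptions.

```python
def split_into_batches(items, n_batches):
    """Split items into at most n_batches contiguous chunks of near-equal size."""
    items = list(items)
    n_batches = max(1, min(n_batches, len(items)))
    base, extra = divmod(len(items), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        # The first `extra` batches each receive one additional item.
        size = base + (1 if i < extra else 0)
        batches.append(items[start:start + size])
        start += size
    return batches


toy_response_vars = [f"roi_{i}" for i in range(217)]  # placeholder names
batches = split_into_batches(toy_response_vars, n_batches=10)
print([len(b) for b in batches])  # [22, 22, 22, 22, 22, 22, 22, 21, 21, 21]
```

Because the regression models for different response variables do not depend on each other, the batches can be fitted independently and the results merged afterwards.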

In [6]:
runner.fit_predict(model, train, test)


Process: 52593 - UUID for runner task created: 5f4c41b1-9c65-4528-b8be-9a71885a7f13
Process: 52593 - Temporary directory created:
	/Users/stijndeboer/Projects/PCN/PCNtoolkit/examples/resources/runner_output/temp_dir/5f4c41b1-9c65-4528-b8be-9a71885a7f13
Process: 52593 - Log directory created:
	/Users/stijndeboer/Projects/PCN/PCNtoolkit/examples/resources/runner_output/log_dir/5f4c41b1-9c65-4528-b8be-9a71885a7f13
Process: 52593 - Fitting models on 217 response variables.
Process: 52593 - Fitting model for lh_G&S_frontomargin_thickness.



### Loading a fold model
We can load a model for a specific fold by calling `load_model` on the runner object. This will return a `NormBLR` object, which we can inspect and use to predict on new data.


In [8]:
fitted_model = runner.load_model()
display(fitted_model)

Process: 1982211 - Configuration of normative model is valid.
Process: 1982211 - Configuration of normative model is valid.
Process: 1982211 - Configuration of regression model is valid.
Process: 1982211 - Configuration of regression model is valid.
Process: 1982211 - Configuration of regression model is valid.
Process: 1982211 - Configuration of regression model is valid.


<pcntoolkit.normative_model.norm_blr.NormBLR at 0x7fd6339e1b20>

## Inspecting the model 

The fitted `NormBLR` model contains a collection of regression models, one for each response variable. We can inspect those models individually by calling `fitted_model.regression_models.get("{responsevar}")`.

In [9]:
single_blr_model = fitted_model.regression_models.get("rh_MeanThickness_thickness")  # type: ignore

{'_name': 'rh_MeanThickness_thickness',
 '_reg_conf': BLRConf(n_iter=200, tol=1e-05, ard=False, optimizer='l-bfgs-b', l_bfgs_b_l=0.1, l_bfgs_b_epsilon=0.1, l_bfgs_b_norm='l2', intercept=True, fixed_effect=False, heteroskedastic=True, intercept_var=False, fixed_effect_var=False, warp='WarpSinhArcsinh', warp_reparam=True),
 'is_fitted': True,
 '_is_from_dict': True,
 'hyp': array([-0.00271425,  0.40459816,  0.57171555,  0.14023606, -0.02021072,
        -0.02146683, -0.00490682, -0.28481558, -0.09296211, -0.06571992,
        -0.00115522,  0.00524956,  0.00339416,  0.00250672, -0.03653256,
        -0.00530122,  0.00506945]),
 'nlZ': 979.881210875374,
 'N': 744,
 'D': 8,
 'lambda_n_vec': None,
 'Sigma_a': array([[1.06792757, 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        ],
        [0.        , 1.00115589, 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.9947642 , 0.        , 0.  

## Evaluation
Calling `predict` will extend the predict_data with:
1. `measures`: DataArray, which contains a number of evaluation statistics.
2. `Z`: the predicted z-scores for each datapoint.
3. `centiles`: the predicted centiles of variation evaluated at each covariate in the dataset.
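To illustrate what the z-scores and centiles in the list above represent, the following sketch computes both from a predicted mean and standard deviation. It assumes a plain Gaussian predictive distribution without any warping, and the numbers are made up; the functions `z_score` and `centile_value` are hypothetical helpers, not part of PCNtoolkit.

```python
from statistics import NormalDist


def z_score(y, mean, std):
    """Deviation of an observation from the predicted mean, in predicted standard deviations."""
    return (y - mean) / std


def centile_value(mean, std, centile):
    """Value below which a fraction `centile` (between 0 and 1) of the predicted Gaussian lies."""
    return mean + std * NormalDist().inv_cdf(centile)


# Hypothetical prediction for one subject at a given age.
mu, sigma = 2.5, 0.2
observed = 2.1

z = z_score(observed, mu, sigma)
centile_curve = [centile_value(mu, sigma, c) for c in (0.05, 0.25, 0.5, 0.75, 0.95)]
print(z)
print(centile_curve)
```

Evaluating `centile_value` over a grid of covariate values, with the mean and standard deviation predicted at each point, traces out the centile curves of the normative model.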


In [10]:
runner.predict(model, test)


--------------------------------------------------------
             PCNtoolkit Job Status Monitor ®
--------------------------------------------------------
Job ID      Name              State      Time      Nodes
--------------------------------------------------------

46821015    job_1             FAILED                        
46821014    job_0             FAILED                        

--------------------------------------------------------    
All jobs completed!
--------------------------------------------------------



Datasets with a zscores DataArray will have the `.plot_qq()` function available.
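A QQ plot compares the sorted z-scores with the quantiles expected under a standard normal distribution; for a well-calibrated model the points lie close to the identity line. The sketch below computes the two sets of coordinates without plotting them. The helper `qq_points` and the simulated z-scores are assumptions for illustration and do not reproduce the toolkit's plotting code.

```python
from statistics import NormalDist

import numpy as np


def qq_points(z):
    """Return (theoretical, observed) quantile pairs for a normal QQ plot."""
    z_sorted = np.sort(np.asarray(z, dtype=float))
    n = len(z_sorted)
    # Plotting positions strictly between 0 and 1.
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theoretical, z_sorted


rng = np.random.default_rng(0)
simulated_z = rng.standard_normal(500)  # stand-in for real z-scores
theoretical, observed = qq_points(simulated_z)
print(np.corrcoef(theoretical, observed)[0, 1])
```

Systematic departures from the identity line, such as an S-shape or curved tails, indicate skewness or heavy tails that the model has not captured.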

# Transferring and extending with the runner
The following functions are available:
- `transfer(transfer_data)`: Transfer the model to transfer_data.
- `extend(extend_data)`: Extend the model to extend_data.
- `transfer_predict(transfer_data, transfer_test)`: Transfer the model to transfer_data and predict on transfer_test.
- `extend_predict(extend_data, extend_test)`: Extend the model to extend_data and predict on extend_test.