<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Running-CNN-for-Interactive-Testing" data-toc-modified-id="Running-CNN-for-Interactive-Testing-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Running CNN for Interactive Testing</a></span></li><li><span><a href="#Visualizing-Outputs" data-toc-modified-id="Visualizing-Outputs-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Visualizing Outputs</a></span></li><li><span><a href="#Running-Slurm-batch-jobs" data-toc-modified-id="Running-Slurm-batch-jobs-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Running Slurm batch jobs</a></span><ul class="toc-item"><li><span><a href="#Example-config.ini" data-toc-modified-id="Example-config.ini-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Example config.ini</a></span></li></ul></li></ul></div>

# Notebook 4: Training the MRE Model

In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

import os,sys
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime

import SimpleITK as sitk
from scipy import ndimage as ndi
import skimage as skim
from skimage import feature
import xarray as xr
plt.rcParams['figure.figsize'] = (10,8)
import pickle as pkl
from torch.utils.data import Dataset, DataLoader
import torchvision.utils
from torchsummary import summary
import torch
import torch.nn as nn
from collections import defaultdict
import torch.nn.functional as F
import torch.optim as optim
from torch.optim import lr_scheduler
from torchvision import models
import time
import copy

%aimport mre_ai
from mre_ai import pytorch_arch_deeplab_3d
%aimport mre_ai.mre_datasets
from mre_ai.mre_datasets import MRETorchDataset, MREtoXr, TorchToXr
%aimport mre_ai.plotting
from mre_ai.plotting import hv_dl_vis, hv_pred_comp, xr_viewer_v2
%aimport mre_ai.train_mre_model
from mre_ai.train_mre_model import train_model_full
%aimport mre_ai.prediction
from mre_ai.prediction import add_predictions, get_linear_fit

import holoviews as hv
from holoviews import opts
hv.extension('bokeh')




ModuleNotFoundError: No module named 'mre_ai'

## Running CNN for Interactive Testing
- This notebook runs a full interactive training loop.
- The most important modules are `train_mre_model`, `prediction`, and `mre_datasets`.
    - The plan was to overhaul this setup into a more unified `training_class`, which is already used for the Physics-Informed sections.
    - For future improvements, the recommendation is to use `training_class` and deprecate the `train_mre_model` function.
- The `kwargs` hold the full set of training parameters, specifying every part of the model and the training methods.
    - I did my best to remove old args that are no longer in use.
- `dry_run=True` sets everything up but does not run the training loop. It outputs the augmented input data, which can be loaded into a QC visualizer to check that the augmentations are performing correctly.
- The core training loop is wrapped in a `try/except` block that handles keyboard interrupts (kernel interrupts in Jupyter), so you can break out of the loop gracefully without waiting for all epochs to finish.
- The final outputs are saved as xarray objects with the same structure as the inputs. Any saved outputs can be merged with the inputs to make a new xarray object.
- The model's parameters are also saved for future use.
- The date and time are grabbed automatically and used to name each new set of model parameters and outputs.
- The loop is hard-coded not to save any models until epoch 50, to minimize the effect of statistical fluctuations in early epochs.
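
The interrupt handling and the epoch-50 save threshold described above can be illustrated with a small stand-alone sketch. This is not the code in `train_mre_model`: `run_training`, `fake_epoch`, and `MIN_SAVE_EPOCH` are hypothetical names, and the sketch assumes that the "best" model is the one with the lowest validation loss and that the threshold is inclusive (`epoch >= 50`).

```python
import copy

MIN_SAVE_EPOCH = 50  # assumed to mirror the hard-coded threshold described above


def run_training(num_epochs, run_epoch, min_save_epoch=MIN_SAVE_EPOCH):
    """Run up to num_epochs epochs and return (best_epoch, best_loss, best_state).

    run_epoch is a hypothetical callable: run_epoch(epoch) -> (val_loss, state).
    """
    best_epoch, best_loss, best_state = None, float('inf'), None
    try:
        for epoch in range(num_epochs):
            val_loss, state = run_epoch(epoch)
            # Early epochs are skipped: their validation loss fluctuates too much.
            if epoch >= min_save_epoch and val_loss < best_loss:
                best_epoch, best_loss = epoch, val_loss
                best_state = copy.deepcopy(state)
    except KeyboardInterrupt:
        # A Jupyter kernel interrupt lands here; keep whatever was saved so far.
        print(f'Interrupted; keeping best epoch so far: {best_epoch}')
    return best_epoch, best_loss, best_state


def fake_epoch(epoch):
    """Stand-in for one epoch of training; raises an interrupt at epoch 60."""
    if epoch == 60:
        raise KeyboardInterrupt  # simulate a kernel interrupt
    # Epoch 10 has the lowest loss overall, but falls before the save threshold.
    val_loss = 1.0 if epoch == 10 else 100.0 - epoch
    return val_loss, {'epoch': epoch}


result = run_training(100, fake_epoch)
print(result)
```

In this toy run the low loss at epoch 10 is ignored because it falls before the threshold, and the interrupt at epoch 60 still returns the best state recorded between epochs 50 and 59.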

In [2]:
data_path = '/ocean/projects/asc170022p/shared/Data/MRE/XR_full_gold_clinical_v3/'
data_file = '*.nc'
output_path = '/ocean/projects/asc170022p/bpollack/mre_ai/data'
torch.cuda.empty_cache()
# now = '2020-02-20_12-04-13'
now = datetime.today().strftime('%Y-%m-%d_%H-%M-%S')
print(now)

kwargs = dict(seed=33, num_epochs=100, 
              batch_size=4, num_workers=1,
              gamma=0.3, lr_scheduler='step', step_size=30, lr=5e-4,
              # lr_scheduler='cyclic', step_size=15, lr_max=0.001, lr_min=1e-5,
              inputs=['t2', 't1_pre_water', 't1_pos_160_water', 't1_pre_fat'],
              loss='l2',
              class_only=False, # If you want a class-only loss (not pixel-wise prediction)
              out_channels=1,
              model_arch='deeplab',
              pixel_weight=0.5,
              sampling_breakdown='smart',
              dataset_ver='rad_freeze',
              do_clinical=True, norm_clinical=True,
              do_clinical_only=False, #For ignoring images, just using clinical info
              # inputs=['t2'],  # Requires a single dummy image for running the clinical-only model
              subj=['1045', '1329', '1714', '1935'], # Held-out test subjects
              erode_mask=10, # Erosion of the liver mask
              smear_amt=25, # Smearing of the region in the liver mask
              train_aug=True,
              train_smear='none',
              train_sample='shuffle',
              worker_init_fn = 'rand_epoch',
              test_aug=False,
              test_smear='none',
              val_aug=False,
              val_smear='none',
              val_sample='shuffle',
              do_val=True,
              norm='bn',
              dry_run=False)

train_model_output = train_model_full(data_path, data_file, output_path, model_version=f'notebook_test_{now}', **kwargs)


2021-05-24_15-23-27
/ocean/projects/asc170022p/bpollack/mre_ai/notebooks
/ocean/projects/asc170022p/bpollack/mre_ai/mre_ai
{'train_aug': True, 'train_sample': 'shuffle', 'val_aug': False, 'val_sample': 'shuffle', 'test_aug': False, 'train_num_samples': 200, 'val_num_samples': 100, 'train_smear': 'none', 'val_smear': 'none', 'test_smear': 'none', 'smear_amt': 25, 'lap_kernel': 25, 'class_only': False, 'batch_size': 4, 'subj': ['1045', '1329', '1714', '1935'], 'gamma': 0.3, 'num_epochs': 100, 'dry_run': False, 'loss': 'l2', 'mask_trimmer': False, 'mask_mixer': 'mixed', 'target_max': None, 'target_bins': 100, 'model_arch': 'deeplab', 'out_channels': 1, 'transfer_layer': False, 'seed': 33, 'worker_init_fn': 'rand_epoch', 'resize': False, 'patient_list': False, 'num_workers': 1, 'lr_scheduler': 'step', 'lr': 0.0005, 'lr_max': 0.01, 'lr_min': 0.0001, 'step_size': 30, 'pixel_weight': 0.5, 'sampling_breakdown': 'smart', 'do_clinical': True, 'do_clinical_only': False, 'dataset_ver': 'rad_freeze

train: pixel_loss: 681.448591, subj_loss: 303.019613, loss: 492.234098
val: pixel_loss: 1049.089291, subj_loss: 501.044606, loss: 775.066943
test: pixel_loss: 2348.494629, subj_loss: 1192.727783, loss: 1770.611206
1m 42s
Epoch 14/99
----------
LR 0.0005
train: pixel_loss: 690.185971, subj_loss: 289.136362, loss: 489.661170
val: pixel_loss: 783.181970, subj_loss: 414.929759, loss: 599.055859
test: pixel_loss: 1755.963257, subj_loss: 958.288757, loss: 1357.125977
1m 41s
Epoch 15/99
----------
LR 0.0005
train: pixel_loss: 684.434756, subj_loss: 326.378845, loss: 505.406797
val: pixel_loss: 999.517379, subj_loss: 455.537251, loss: 727.527318
test: pixel_loss: 1377.898071, subj_loss: 902.744446, loss: 1140.321289
1m 43s
Epoch 16/99
----------
LR 0.0005
train: pixel_loss: 781.311406, subj_loss: 326.315040, loss: 553.813217
val: pixel_loss: 985.043837, subj_loss: 660.672514, loss: 822.858167
test: pixel_loss: 770.374695, subj_loss: 345.716278, loss: 558.045471
1m 42s
Epoch 17/99
----------
LR

test: pixel_loss: 1810.122559, subj_loss: 868.957153, loss: 1339.539795
1m 40s
Epoch 46/99
----------
LR 0.00015
train: pixel_loss: 331.234523, subj_loss: 30.459313, loss: 180.846918
val: pixel_loss: 838.579197, subj_loss: 389.858758, loss: 614.218976
test: pixel_loss: 1465.284912, subj_loss: 691.916199, loss: 1078.600586
1m 40s
Epoch 47/99
----------
LR 0.00015
train: pixel_loss: 350.621790, subj_loss: 39.628994, loss: 195.125393
val: pixel_loss: 852.273634, subj_loss: 379.679461, loss: 615.976535
test: pixel_loss: 1785.281616, subj_loss: 779.348145, loss: 1282.314941
1m 40s
Epoch 48/99
----------
LR 0.00015
train: pixel_loss: 317.764767, subj_loss: 36.319744, loss: 177.042255
val: pixel_loss: 835.537879, subj_loss: 358.899767, loss: 597.218815
test: pixel_loss: 1652.408691, subj_loss: 672.227173, loss: 1162.317871
1m 40s
Epoch 49/99
----------
LR 0.00015
train: pixel_loss: 318.946786, subj_loss: 27.618353, loss: 173.282569
val: pixel_loss: 920.363715, subj_loss: 385.629361, loss: 652
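
The log excerpts above can be cross-checked against the `kwargs` with a little arithmetic. The sketch below assumes a standard step schedule (the learning rate is multiplied by `gamma` every `step_size` epochs) and assumes the total loss is a `pixel_weight`-weighted sum of the pixel and subject losses. Both formulas are inferred from the logged numbers, not taken from the `mre_ai` source, and with `pixel_weight=0.5` the weighted sum cannot be distinguished from a plain mean. `step_lr` and `combined_loss` are hypothetical helper names.

```python
def step_lr(base_lr, gamma, step_size, epoch):
    """Learning rate under a step schedule: decay by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


def combined_loss(pixel_loss, subj_loss, pixel_weight):
    """Assumed combination of the two loss terms reported in the log."""
    return pixel_weight * pixel_loss + (1 - pixel_weight) * subj_loss


# Values taken from the kwargs above: lr=5e-4, gamma=0.3, step_size=30
lr_epoch_14 = step_lr(5e-4, 0.3, 30, 14)  # the log reports "LR 0.0005"
lr_epoch_46 = step_lr(5e-4, 0.3, 30, 46)  # the log reports "LR 0.00015"

# First logged training line: pixel_loss 681.448591, subj_loss 303.019613
train_loss = combined_loss(681.448591, 303.019613, 0.5)  # the log reports 492.234098

print(lr_epoch_14, lr_epoch_46, round(train_loss, 4))
```

The small remaining difference between the computed and logged total loss is at the level of rounding in the printed values.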

## Visualizing Outputs
- A couple of different visualization methods are used in this notebook:
    - The `TorchToXr` class converts a set of PyTorch objects into xarray objects for debugging.
        - This is for use with the output of `train_mre_model` in `dry_run` mode ONLY.
        - Use this to debug data augmentation and normalization methods.
        - The images are shown as they are after all augmentation and processing, right before training.
    - The `MREtoXr` class can be used to combine the input files with the output predictions.
        - This is useful for examining the predictions for the test, train, and validation groups.

In [7]:
# Uncomment this to view a `dry-run` output
# This will NOT work outside of dry-run

# ds_torch = TorchToXr(train_model_output[0], train_model_output[2], train_model_output[1], train_model_output[3], sequences=['t2', 't1_pre_water', 't1_pos_160_water', 't1_pre_fat']).ds
# xr_viewer_v2(ds_torch, size=350, prediction=False, torch=True)

In [3]:
# This cell combines inputs with predictions.  The train group is uncommented.
# Make sure you trained your model for at least 50 epochs, or changed the minimum number of epochs for saving (it's hard-coded)

from_file = '/ocean/projects/asc170022p/shared/Data/MRE/XR_full_gold_clinical_v3/*.nc'
# if you want to look at an older run, set the `now` parameter
# now = some_older_date

xr_maker = MREtoXr(
    from_file=from_file,
    from_file_pred=f'/ocean/projects/asc170022p/bpollack/mre_ai/data/XR/notebook_test_{now}/train/*.nc')
ds_train = xr_maker.get_ds()
ds_train = ds_train.load()

# xr_maker = MREtoXr(
#     from_file=from_file,
#     from_file_pred=f'/ocean/projects/asc170022p/bpollack/mre_ai/data/XR/notebook_test_{now}/test/*.nc')
# ds_test = xr_maker.get_ds()
# ds_test = ds_test.load()

# xr_maker = MREtoXr(
#     from_file=from_file,
#     from_file_pred=f'/ocean/projects/asc170022p/bpollack/mre_ai/data/XR/notebook_test_{now}/val/*.nc')
# ds_val = xr_maker.get_ds()
# ds_val = ds_val.load()

['1577' '1851' '1103' '0173' '1106' '1790' '0747' '1979' '1077' '0954'
 '1530' '0291' '0210' '1550' '1798' '1574' '0890' '1336' '1395' '1853'
 '1896' '1967' '2046' '1722' '0491' '1715' '1526' '1720' '0344' '0932'
 '1791' '1311' '0659' '0234' '1785' '0126' '1287' '1839' '0219' '1595'
 '0693' '0461' '1789' '2007' '1083' '0235' '0929' '0628' '1367' '2034'
 '1360' '1667' '1382' '1400' '0940' '1679' '1578' '1727' '1893' '1899'
 '1903' '1948' '1447' '0737' '1341' '0872' '0904' '1453' '1404' '0655'
 '1765' '1694' '1121' '0734' '1706' '0914' '0401' '1491' '0509' '1829'
 '1217' '1337' '0937' '0492' '1456' '1541' '1642' '1554' '0415' '1748'
 '0704' '1819' '0043' '0995' '1980' '1940' '0612' '1561' '0748' '0029'
 '1412' '0556' '1504' '1435' '1417' '0975' '0510' '0979' '0860' '1603'
 '0564' '1072' '1123' '0830' '1671']
['0006' '0020' '0029' '0043' '0126' '0135' '0173' '0210' '0219' '0234'
 '0235' '0291' '0344' '0401' '0415' '0461' '0491' '0492' '0509' '0510'
 '0556' '0564' '0612' '0628' '0653' '065
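
The three prediction paths in the cell above differ only in the group name, so they can be built from one helper. A minimal sketch, assuming the `<output_path>/XR/<model_version>/<group>/*.nc` layout seen in the paths above; `prediction_glob` is a hypothetical helper, not part of `mre_ai`.

```python
from pathlib import PurePosixPath

VALID_GROUPS = ('train', 'val', 'test')


def prediction_glob(output_path, model_version, group):
    """Build the glob pattern for one group's saved predictions."""
    if group not in VALID_GROUPS:
        raise ValueError(f'group must be one of {VALID_GROUPS}, got {group!r}')
    return str(PurePosixPath(output_path) / 'XR' / model_version / group / '*.nc')


# Example with the output path and timestamp that appear in this notebook
output_path = '/ocean/projects/asc170022p/bpollack/mre_ai/data'
model_version = 'notebook_test_2021-05-24_15-23-27'
globs = {group: prediction_glob(output_path, model_version, group)
         for group in VALID_GROUPS}
print(globs['train'])
```

Each pattern could then be passed as `from_file_pred` to `MREtoXr`, which keeps the train, val, and test cells pointing at the same run.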

In [15]:
ds_train

In [4]:
# View the training group predictions
# Remember, only 4 input slices will have the actual MRE truth (usually around z=18)
xr_viewer_v2(ds_train, size=350, prediction=True)

<xarray.Dataset>
Dimensions:         (mask_type: 3, mre_type: 5, sequence: 9, subject: 115, x: 256, y: 256, z: 32, z_idx: 4)
Coordinates:
  * subject         (subject) object '1577' '1851' '1103' ... '0830' '1671'
  * y               (y) int64 255 254 253 252 251 250 249 248 ... 6 5 4 3 2 1 0
  * z               (z) int64 0 1 2 3 4 5 6 7 8 9 ... 23 24 25 26 27 28 29 30 31
  * z_idx           (z_idx) int64 0 1 2 3
  * x               (x) int64 0 1 2 3 4 5 6 7 ... 249 250 251 252 253 254 255
  * mre_type        (mre_type) object 'mre' 'mre_mask' ... 'mre_wave' 'mre_pred'
  * sequence        (sequence) object 't1_pre_water' ... 't1_pos_300_water'
  * mask_type       (mask_type) object 'liver' 'mre' 'combo'
Data variables: (12/19)
    image_mri       (subject, sequence, x, y, z) int16 0 0 0 0 0 0 ... 0 0 0 0 0
    mask_mri        (subject, mask_type, x, y, z) int16 0 0 0 0 0 ... 0 0 0 0 0
    image_mre       (subject, mre_type, x, y, z) int16 0 0 0 ... -121 -119 -117
    mask_mre        (s

In [6]:
# View the saved config (the kwargs)
# Use a context manager so the file handle is closed after loading
with open(f'../data/config/notebook_test_{now}_notebook.pkl', 'rb') as f:
    config = pkl.load(f)
config

{'train_aug': True,
 'train_sample': 'shuffle',
 'val_aug': False,
 'val_sample': 'shuffle',
 'test_aug': False,
 'train_num_samples': 200,
 'val_num_samples': 100,
 'train_smear': 'none',
 'val_smear': 'none',
 'test_smear': 'none',
 'smear_amt': 25,
 'lap_kernel': 25,
 'class_only': False,
 'batch_size': 4,
 'subj': ['1045', '1329', '1714', '1935'],
 'gamma': 0.3,
 'num_epochs': 100,
 'dry_run': False,
 'loss': 'l2',
 'mask_trimmer': False,
 'mask_mixer': 'mixed',
 'target_max': None,
 'target_bins': 100,
 'model_arch': 'deeplab',
 'out_channels': 1,
 'transfer_layer': False,
 'seed': 33,
 'worker_init_fn': 'rand_epoch',
 'resize': False,
 'patient_list': False,
 'num_workers': 1,
 'lr_scheduler': 'step',
 'lr': 0.0005,
 'lr_max': 0.01,
 'lr_min': 0.0001,
 'step_size': 30,
 'pixel_weight': 0.5,
 'sampling_breakdown': 'smart',
 'do_clinical': True,
 'do_clinical_only': False,
 'dataset_ver': 'rad_freeze',
 'norm_clinical': True,
 'norm_clin_vals': [array([ 58.086956  ,   0.5217391 , 1

## Running Slurm batch jobs
- Slurm batch jobs run the same core code. The only differences are the Slurm batch function and its associated config file.
- The Slurm batch function is `SlurmMaster`, which takes a config.ini file as input.
- The config.ini file has all the kwargs you'd specify above, plus Slurm-related settings that tell it where and how to run.
- `SlurmMaster` does the following:
    - Generates all logging and additional directories based on the date and time.
    - Makes a staging directory, copies all core software into it, modifies the names, and compiles them.
        - This step is necessary to prevent conflicts that arise from editing code after a job is submitted. If a queued job has not yet started, it would otherwise execute whatever version of the code exists when it starts, which may differ from the code at submission time.
    - Auto-generates the required Slurm submission file(s).
    - Submits the Slurm jobs.
- A couple of paths and names are hard-coded in `SlurmMaster`; these may have to be changed before another user can deploy it.
- The `configparser` that reads the .ini file makes a couple of assumptions:
    - Under the `Hyper` heading, kwargs can be ints, floats, strings, bools, lists, or nested lists.
    - An argument that expects a single value will generate multiple jobs if it is given a list:
        - `lr = [1e-2, 1e-3, 1e-4] # this makes 3 different jobs`
        - Be careful with too many list arguments, as the default behavior is to make every possible combination of all list-valued arguments:
        ```
        lr = [1e-2, 1e-3, 1e-4]
        gamma = [0.1, 0.3, 0.6]
        step_size = [15, 30, 45]
        # This makes 3x3x3=27 jobs
        ```
    - If an argument expects a list as an input, this should be specified as a nested list:
        ```
        inputs = [['t2', 't1_pre_water', 't1_pos_160_water', 't1_pre_fat']] # This makes 1 job with 4 input images.
        ```
        ```
        inputs = [['t2', 't1_pre_water', 't1_pos_160_water', 't1_pre_fat'],
                   ['t2', 't1_pre_water']]
                   # This makes 2 jobs, with 4 input images in job 1 and 2 input images in job 2.
        ```
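
The list-expansion rules above can be sketched with the standard library. This is only an illustration of the described behavior, not the actual `SlurmMaster` parsing code: `parse_hyper` and `expand_jobs` are hypothetical names, and the sketch assumes each value is a Python literal that `ast.literal_eval` can read.

```python
import ast
import configparser
import itertools

CONFIG_TEXT = """
[Hyper]
lr = [1e-2, 1e-3, 1e-4]
gamma = [0.1, 0.3, 0.6]
step_size = [15, 30, 45]
inputs = [['t2', 't1_pre_water']]
loss = 'l2'
"""


def parse_hyper(text):
    """Read the [Hyper] section and convert each value to a Python object."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {key: ast.literal_eval(value) for key, value in parser['Hyper'].items()}


def expand_jobs(hyper):
    """Return one kwargs dict per combination of list-valued arguments."""
    keys = list(hyper)
    # A bare value is a single option; a list holds one option per element,
    # so a nested list such as [['t2', 't1_pre_water']] is one list-valued option.
    options = [value if isinstance(value, list) else [value]
               for value in hyper.values()]
    return [dict(zip(keys, combo)) for combo in itertools.product(*options)]


jobs = expand_jobs(parse_hyper(CONFIG_TEXT))
print(len(jobs))
print(jobs[0])
```

Because the number of jobs is the product of the list lengths, adding one more three-element list to this config would triple the job count.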
    
    


```
# Run from command line
>> python mre_ai/SlurmMaster.py config_inis/MRE/test_config.ini
```
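
The staging step described above (copying the core package so that later edits cannot affect queued jobs) could look roughly like the following. This is a sketch under stated assumptions, not the `SlurmMaster` implementation: `stage_package` and the `<core_package>_<timestamp>` naming scheme are hypothetical, and the renaming and compilation steps are omitted. The demo uses a temporary directory rather than real project paths.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def stage_package(core_path, core_package, stage_path, stamp=None):
    """Copy core_path/core_package into a timestamped staging directory."""
    stamp = stamp or datetime.today().strftime('%Y-%m-%d_%H-%M-%S')
    source = Path(core_path) / core_package
    target = Path(stage_path) / f'{core_package}_{stamp}'
    # copytree creates the target (and missing parents); skip bytecode caches.
    shutil.copytree(source, target,
                    ignore=shutil.ignore_patterns('__pycache__', '*.pyc'))
    return target


# Demo: edits made after staging do not reach the staged copy.
with tempfile.TemporaryDirectory() as tmp:
    core = Path(tmp) / 'core'
    (core / 'mre_ai').mkdir(parents=True)
    (core / 'mre_ai' / 'train.py').write_text('VERSION = 1\n')

    staged = stage_package(core, 'mre_ai', Path(tmp) / 'staging',
                           stamp='2021-05-24_15-23-27')

    # Simulate editing the working copy after the job was submitted.
    (core / 'mre_ai' / 'train.py').write_text('VERSION = 2\n')
    staged_text = (staged / 'train.py').read_text()
    print(staged.name, staged_text.strip())
```

A queued job that imports from the staged directory keeps seeing the code as it was at submission time, which is the conflict the staging step is meant to avoid.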

### Example config.ini

```
# Define the Project (determines certain settings in SlurmMaster)
[Project]
task = MRE

# Write notes, saved in a notes folder, helpful for remembering the goal of the submission
[Notes]
note = Base settings: frozen for rad paper
    Test run for checking outputs and settings.

# Node settings
[Node]
partition = GPU-shared
ngpus = 1

# Various logging and staging directories
[Setup]
log_path = /ocean/projects/asc170022p/bpollack/mre_ai/data/slurm_logs
stage_path = /ocean/projects/asc170022p/bpollack/mre_ai/staging
notes_path = /ocean/projects/asc170022p/bpollack/mre_ai/data/notes
core_path = /ocean/projects/asc170022p/bpollack/mre_ai
core_package = mre_ai

# Hyperparameters, essentially identical to `kwargs`
[Hyper]
seed = [1234]
data_path = '/ocean/projects/asc170022p/shared/Data/MRE/XR_full_gold_clinical_v3/'
data_file = '*.nc'
output_path = '/ocean/projects/asc170022p/bpollack/mre_ai/data'

num_workers = 1
batch_size = 4

lr_scheduler='step'
gamma=0.3
lr=1e-3
step_size=30
weight_decay=[0]

inputs = [['t2', 't1_pre_water', 't1_pos_160_water', 't1_pre_fat']]


do_clinical = True
norm_clinical = True
# only_group = [1, 2, 16] # Only do these groups, if you need to rerun groups that failed
dataset_ver = 'rad_freeze'
subj_group = [['1045', '1329', '1714', '1935']]

model_arch = 'deeplab'

pixel_weight = 0.5
loss = 'l2'
out_channels = 1

sampling_breakdown = 'smart'
worker_init_fn = 'rand_epoch'

erode_mask = [2]
smear_amt = [25]
train_aug = True
train_smear = ['none']
train_sample = 'shuffle'
val_aug = False
val_smear = ['none']
val_sample = 'shuffle'
norm = ['bn']
dims=3

num_epochs = 12
```