## ImageBasedDataMiningContinuous.ipynb
‹ ImageBasedDataMining.ipynb › Copyright (C) ‹ 2019 › ‹ Andrew Green - andrew.green-2@manchester.ac.uk › This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

In [None]:
"""
| date      | who     | note                                                  |
|----------|----------|-------------------------------------------------------|
| 20190201 |    afg   |  First version, adapted from MadagaSKA code           |
| 20190208 |    afg   |  Updated code, completed up to point of plotting      |
| 20190208 |    afg   |  Update paths in code to work with docker container   |
""";

This notebook shows how to do image based data mining against a continuous outcome. The outcome can be anything you like, provided it is a continuous variable; examples include weight loss, muscle area loss and feeding tube duration.

In [1]:
## Import libraries and set up

import os
import time
import os.path
import numpy as np
try:
    from tqdm import tqdm
    haveTQDM = True
except ImportError:
    haveTQDM = False
import SimpleITK as sitk
import matplotlib.pyplot as plt

%matplotlib inline


Once again, we will use pandas to load the csv. The next cell contains all the pre-processing from the binary data mining, in a single cell. By the end of the cell, we should have a set of clean data again.

In [None]:
import pandas as pd

clinicalDataPath = "/data/clinicalData.csv"

clinicalData = pd.read_csv(clinicalDataPath)

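# Keep patients whose treatment summary starts with CCRT and contains no '+' (i.e. CCRT only)
isCCRT = clinicalData["Oncologic Treatment Summary"].str.contains(r'^CCRT', regex=True)
hasPlus = clinicalData["Oncologic Treatment Summary"].str.contains(r'\+', regex=True)
ccrtOnlyPatients = clinicalData[isCCRT & (hasPlus == False)]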
len(ccrtOnlyPatients["Oncologic Treatment Summary"])

selectedPatients = ccrtOnlyPatients[ccrtOnlyPatients["Number of Fractions"].astype(int) < 40]
len(selectedPatients["Number of Fractions"])

def calculateBEDCorrection(df, early=True):
    """
    Add a BEDfactor column, 1 + d/(alpha/beta), where d is the dose per fraction.
    Uses alpha/beta = 10 Gy for early effects and 3 Gy for late effects.
    """
    earlyAlphaBeta = 10.0
    lateAlphaBeta  = 3.0
    alphaBeta = earlyAlphaBeta if early else lateAlphaBeta
    return df.assign(BEDfactor = lambda d : 1.0 + d["Dose/Fraction (Gy/fx)"].astype(float)/alphaBeta)

selectedPatients = calculateBEDCorrection(selectedPatients)

selectedPatients
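
As a quick arithmetic check of the correction factor: `calculateBEDCorrection` computes $1 + d/(\alpha/\beta)$, where $d$ is the dose per fraction, and multiplying a physical dose by this factor gives the BED under the linear-quadratic model. The sketch below evaluates the same expression for a single made-up fractionation. The helper name `bedFactor` and the 2 Gy per fraction value are illustrative assumptions, not taken from the clinical data.

```python
def bedFactor(dosePerFraction, alphaBeta):
    """Hypothetical helper: BED correction factor 1 + d / (alpha/beta)."""
    return 1.0 + dosePerFraction / alphaBeta

# Illustrative value only: 2 Gy per fraction
earlyFactor = bedFactor(2.0, 10.0)   # early effects, alpha/beta = 10 Gy
lateFactor = bedFactor(2.0, 3.0)     # late effects, alpha/beta = 3 Gy

print(earlyFactor, lateFactor)
```

The late-effect factor is larger because a smaller $\alpha/\beta$ makes the tissue more sensitive to fraction size.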

Now we have our BED correction, we can start loading data ready to do mining. We will load each image using SimpleITK, convert it to a numpy array and put it into a big numpy array. The outcome values from the clinical data will go into a numpy array as well.

In the next cell, we will load all our data. Note that we pre-allocate the numpy array we will use to hold the data - this is a performance optimisation to make loading the data a bit quicker.
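
The pre-allocation pattern can be sketched with synthetic arrays standing in for the registered dose files. The array sizes and the variable names `images` and `stack` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three small synthetic "dose images" (hypothetical stand-ins for the .nii files)
images = [rng.random((4, 5, 6)) for _ in range(3)]

# Allocate once: the image shape plus a trailing patient axis
stack = np.zeros((*images[0].shape, len(images)))

# Fill in place, one patient per index along the last axis
for n, img in enumerate(images):
    stack[..., n] = img

print(stack.shape)
```

Filling a pre-allocated array avoids growing the array (and copying everything already loaded) each time a new patient is added.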

In [None]:
dosesPath = "/data/registeredDoses"

probeDose = sitk.GetArrayFromImage(sitk.ReadImage(os.path.join(dosesPath, "{0:04d}.nii".format(int(2)))))
print(probeDose.shape)
len(selectedPatients)

doseArray = np.zeros((*probeDose.shape, len(selectedPatients)))
statusArray = np.zeros((len(selectedPatients),1))

print(doseArray.shape)


n = 0
for idx, pt in selectedPatients.iterrows():
    doseArray[...,n] = pt.BEDfactor * sitk.GetArrayFromImage(sitk.ReadImage(os.path.join(dosesPath, "{0}.nii".format(pt.ID.split("-")[-1]))))
    n += 1

Now we have our data, and we have corrected all the doses to the same BED, we are ready to do continuous outcome image based data mining.

For this we need to select a suitable outcome variable - I suggest weight loss as a good one to start with. 

The next cell defines a function that calculates the Pearson correlation coefficient in each voxel of the dose distribution. To do this, we slightly modify the online calculation of variance used in the binary data mining to do online calculation of covariance. The formula for Pearson's correlation coefficient is then:

$ \rho = \frac{\mathrm{cov}(X,Y)}{\sigma_{X} \sigma_{Y}} $


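(note: strictly, this formula is written for population quantities. Sample estimates can be substituted provided the variances and the covariance use the same normalisation, $n$ or $n - 1$, because the factor then cancels in the ratio)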
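
The online covariance update can be sketched for a single pair of variables before it is applied voxel by voxel. This is a minimal illustration on synthetic data, checked against `np.corrcoef`; the helper name `onlineCorrelation` is hypothetical.

```python
import numpy as np

def onlineCorrelation(xs, ys):
    """Single-pass Pearson correlation of two equal-length sequences (hypothetical helper)."""
    meanX = meanY = 0.0
    m2X = m2Y = cXY = 0.0   # running sums of squared deviations and co-deviations
    n = 0
    for x, y in zip(xs, ys):
        n += 1
        dx = x - meanX            # deviation from the previous mean of x
        dy = y - meanY            # deviation from the previous mean of y
        meanX += dx / n
        meanY += dy / n
        m2X += dx * (x - meanX)   # previous-mean deviation times updated-mean deviation
        m2Y += dy * (y - meanY)
        cXY += dx * (y - meanY)
    # Any common normaliser (n or n - 1) cancels in the ratio
    return cXY / np.sqrt(m2X * m2Y)

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)

print(onlineCorrelation(x, y), np.corrcoef(x, y)[0, 1])
```

The two printed values should agree to within floating point error.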


In [None]:
def imagesCorrelation(doseData, continuousOutcome):
    """
    Calculate the per-voxel Pearson correlation coefficient between dose and a continuous outcome.
    Uses Welford's method to calculate mean, variance and covariance in a single pass.
    
    Inputs:
        - doseData: the dose data, should be structured such that the number of patients in it is along the last axis
        - continuousOutcome: the outcome values, one continuous number per patient, in the same order as doseData
    Returns:
        - rho: an array of the same size as one of the images which contains the per-voxel rho values
    """
    # Flatten so that an (N,1) outcome array behaves the same as an (N,) one
    outcome = np.asarray(continuousOutcome, dtype=float).ravel()

    # Copy the first subject: the in-place updates below must not overwrite the caller's data
    doseMean = doseData[...,0].astype(float)
    doseM2 = np.zeros_like(doseMean)   # running sum of squared dose deviations
    C = np.zeros_like(doseMean)        # running sum of dose/outcome co-deviations
    
    outcomeMean = outcome[0]
    outcomeM2 = 0.0
    subjectCount = 1.0
    
    for n in range(1, doseData.shape[-1]):
        x = doseData[...,n]
        y = outcome[n]
        subjectCount += 1.0
        dx = x - doseMean       # deviation from the previous dose mean
        dy = y - outcomeMean    # deviation from the previous outcome mean

        doseMean += dx/subjectCount
        outcomeMean += dy/subjectCount

        doseM2 += dx*(x - doseMean)
        outcomeM2 += dy*(y - outcomeMean)
        C += dx*(y - outcomeMean)
        
    # Use the same (n - 1) normalisation for all three estimates so that it cancels in rho
    doseVar = doseM2 / (subjectCount - 1)
    outcomeVar = outcomeM2 / (subjectCount - 1)
    covariance = C / (subjectCount - 1)

    rho = covariance / (np.sqrt(doseVar) * np.sqrt(outcomeVar))
    
    return rho

This function is very similar to the one we used in the binary IBDM code, but this time it also calculates the covariance between dose in each voxel and the continuous outcome variable. We then use that covariance to calculate the correlation between dose and outcome.
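
One way to check a correlation map of this kind is to compare it with a direct two-pass calculation on a small synthetic array. The sketch below is an independent cross-check, not the notebook's own method; the helper name `twoPassCorrelation` and the array sizes are assumptions.

```python
import numpy as np

def twoPassCorrelation(doseData, outcome):
    """Per-voxel Pearson rho via the direct two-pass formula (hypothetical helper)."""
    y = np.asarray(outcome, dtype=float).ravel()
    dx = doseData - doseData.mean(axis=-1, keepdims=True)   # centre each voxel over patients
    dy = y - y.mean()                                       # centre the outcome
    cov = (dx * dy).sum(axis=-1)
    return cov / np.sqrt((dx ** 2).sum(axis=-1) * (dy ** 2).sum())

rng = np.random.default_rng(2)
dose = rng.random((3, 4, 5, 20))        # synthetic 3x4x5 "image", 20 patients on the last axis
outcome = rng.normal(size=(20, 1))      # synthetic continuous outcome

rho = twoPassCorrelation(dose, outcome)

# Compare one voxel against numpy's own correlation coefficient
ref = np.corrcoef(dose[1, 2, 3, :], outcome.ravel())[0, 1]
print(rho.shape, np.isclose(rho[1, 2, 3], ref))
```

The two-pass version keeps a centred copy of the whole dose array in memory, which is one reason to prefer the online calculation for full-size dose distributions.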

Now we can apply the mining to some data.

In [None]:
dosesPath = "/data/registeredDoses"


probeDose = sitk.GetArrayFromImage(sitk.ReadImage(os.path.join(dosesPath, "{0:04d}.nii".format(int(2)))))
print(probeDose.shape)
len(selectedPatients)

doseArray = np.zeros((*probeDose.shape, len(selectedPatients)))
weightLoss = np.zeros((len(selectedPatients),1))
print(doseArray.shape)
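for n, (idx, pt) in enumerate(selectedPatients.iterrows()):
    doseArray[...,n] = pt.BEDfactor * sitk.GetArrayFromImage(sitk.ReadImage(os.path.join(dosesPath, "{0}.nii".format(pt.ID.split("-")[-1]))))
    # Weight change over treatment; negative means weight was lost
    # (assuming WeightStart and WeightStop are the weights at the start and end of treatment)
    weightLoss[n] = pt.WeightStop - pt.WeightStart
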
In [None]:
correlationmap = imagesCorrelation(doseArray, weightLoss)

Now we've got our correlation map with the true correspondence of weight loss to dose, we can compute the permutation distribution again to get the significance of the correlation.

This works just like before: we rearrange the weight loss values, re-calculate the correlation coefficient and look at how the most extreme voxels behave.

In [None]:
def doPermutation(doseData, outcome):
    """
    Permute the outcome values and return the extreme rho values for this permutation
    Inputs:
        - doseData: the dose data, should be structured such that the number of patients in it is along the last axis
        - outcome: the outcome values. Should be a continuous number. These will be permuted in this function to 
                    assess the null hypothesis of no dose interaction
    Returns:
        - (rhoMin, rhoMax): the extreme values of the whole rho map for this permutation
    """
    poutcome = np.random.permutation(outcome)
    permT = imagesCorrelation(doseData, poutcome)
    return (np.min(permT), np.max(permT))


def permutationTest(doseData, outcome, nperm=1000):
    """
    Perform a permutation test to get the global p-value and rho thresholds
    Inputs:
        - doseData: the dose data, should be structured such that the number of patients in it is along the last axis
        - outcome: the outcome values. Should be a continuous number.
        - nperm: The number of permutations to calculate. Defaults to 1000 which is the minimum for reasonable accuracy
    Returns:
        - globalPNeg: the global significance of the test for negative rho values
        - globalPPos: the global significance of the test for positive rho values
        - tThreshNeg: the sorted list of minimum rho values from all the permutations, use it to set a significance threshold.
        - tThreshPos: the sorted list of maximum rho values from all the permutations, use it to set a significance threshold.
    """
    tthresh = []
    gtCount = 0
    ltCount = 0
    trueT = imagesCorrelation(doseData, outcome)
    trueMaxT = np.max(trueT)
    trueMinT = np.min(trueT)
    # Show a progress bar when tqdm is available, otherwise loop silently
    permutations = tqdm(range(nperm)) if haveTQDM else range(nperm)
    for perm in permutations:
        tthresh.append(doPermutation(doseData, outcome))
        if tthresh[-1][1] > trueMaxT:
            gtCount += 1.0
        if tthresh[-1][0] < trueMinT:
            ltCount += 1.0
    
    globalpPos = gtCount / float(nperm)
    globalpNeg = ltCount / float(nperm)
    tthresh = np.array(tthresh)
    return (globalpNeg, globalpPos, sorted(tthresh[:,0]), sorted(tthresh[:,1]))

If we run this function with our data, we will get back the global significance and threshold values of $\rho$. We can then use a contour plot at the 95th percentile to indicate regions of significance.
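
A minimal sketch of the thresholding step follows, using synthetic stand-ins for the sorted minimum and maximum $\rho$ lists returned by the permutation test. The helper names `significanceThresholds` and `plotSignificance`, the `alpha` parameter and the synthetic distributions are assumptions for illustration. Note that `alpha` is applied to each tail separately here; halve it if a two-sided 5% level is wanted.

```python
import numpy as np

def significanceThresholds(rhoThreshNeg, rhoThreshPos, alpha=0.05):
    """Return (lower, upper) rho thresholds from the permutation extremes (hypothetical helper)."""
    lower = np.percentile(rhoThreshNeg, 100.0 * alpha)           # e.g. 5th percentile of the minima
    upper = np.percentile(rhoThreshPos, 100.0 * (1.0 - alpha))   # e.g. 95th percentile of the maxima
    return lower, upper

def plotSignificance(rhoSlice, lower, upper):
    """Overlay significance contours on one 2D slice of the rho map (hypothetical helper)."""
    import matplotlib.pyplot as plt   # imported here so the threshold helper runs without matplotlib
    fig, ax = plt.subplots()
    im = ax.imshow(rhoSlice, cmap="RdBu_r", vmin=-1, vmax=1)
    # Blue contour encloses significantly negative rho, red encloses significantly positive rho
    ax.contour(rhoSlice, levels=[lower, upper], colors=["blue", "red"])
    fig.colorbar(im, ax=ax, label="rho")
    return fig

# Synthetic stand-ins for the permutation extremes (not real results)
rng = np.random.default_rng(4)
minRho = np.sort(rng.uniform(-0.6, -0.3, size=1000))
maxRho = np.sort(rng.uniform(0.3, 0.6, size=1000))

lower, upper = significanceThresholds(minRho, maxRho)
print(lower, upper)
```

With real data, `plotSignificance(correlationmap[k], lower, upper)` would draw the contours on slice `k` of the map, where `k` is whichever slice is of interest.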