
DicomSplitter #2724

Open
asvcode opened this issue Aug 30, 2020 · 1 comment
asvcode (Contributor) commented Aug 30, 2020

Feature Request.
DICOM images carry unique metadata, such as the PatientID, that can help ensure the same patient does not appear in both the train and valid sets. Having the same patient in both sets leaks information between them and may skew the measured effectiveness of the model.

Update 12/19/2020

The previous version I proposed did not work fully. This version does: if a patient ID appears in both the train and valid indexes, the duplicate is removed from the valid index. This is explained further here

Proposed Method

import torch
from pydicom import dcmread
from fastcore.foundation import L

def dicomsplit(valid_pct=0.2, seed=None, **kwargs):
    "Splits `items` between train/valid with `valid_pct` and drops patient IDs that exist in both sets"
    def _inner(o, **kwargs):
        if seed is not None: torch.manual_seed(seed)
        rand_idx = L(int(i) for i in torch.randperm(len(o)))
        cut = int(valid_pct * len(o))
        trn = rand_idx[cut:]; trn_p = o[rand_idx[cut:]]
        val = rand_idx[:cut]; val_p = o[rand_idx[:cut]]
        # Read only the headers: the pixel data is not needed for splitting
        train_patient = [dcmread(f, stop_before_pixels=True).PatientID for f in trn_p]
        val_patient = [dcmread(f, stop_before_pixels=True).PatientID for f in val_p]
        # PatientIDs that appear in both splits
        is_duplicate = set(train_patient) & set(val_patient)
        # Map each valid PatientID to its index, then drop the duplicates
        m_dict = dict(zip(val_patient, val))
        for key in is_duplicate: m_dict.pop(key)
        new_val = list(m_dict.values())
        return trn, new_val
    return _inner
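The index-splitting portion can be mimicked in plain Python, with `random.shuffle` standing in for `torch.randperm` (`simple_split` is a made-up name for illustration, not part of the proposal):

```python
import random

def simple_split(n, valid_pct=0.2, seed=None):
    # Mirrors the index logic in dicomsplit, using the stdlib RNG
    # instead of torch.manual_seed / torch.randperm
    if seed is not None:
        random.seed(seed)
    rand_idx = list(range(n))
    random.shuffle(rand_idx)
    cut = int(valid_pct * n)
    return rand_idx[cut:], rand_idx[:cut]   # train, valid

trn, val = simple_split(10, valid_pct=0.2, seed=0)
# lengths are (8, 2): cut = int(0.2 * 10) = 2
```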

Example where there is no duplicate:

(screenshot: no_dup)

This is based on a test set of 10 images. With no duplicates, the valid index remains the same.

With Duplicates:

(screenshot: dup)

In this case there is a duplicate, and the valid set is updated to exclude it.
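The duplicate-handling step can be seen in isolation with plain lists standing in for the DICOM header reads (the patient IDs and valid indexes below are made up for illustration):

```python
# Stand-ins for what dicomsplit extracts from the DICOM headers
train_patient = ['a', 'b', 'c', 'd']   # PatientIDs in the train split
val_patient   = ['e', 'c']             # PatientIDs in the valid split
val           = [7, 3]                 # valid indexes, aligned with val_patient

# Same logic as the proposed method: find PatientIDs in both splits
is_duplicate = set(train_patient) & set(val_patient)   # {'c'}
m_dict = dict(zip(val_patient, val))                   # {'e': 7, 'c': 3}
for key in is_duplicate:
    m_dict.pop(key)                                    # drops index 3
new_val = list(m_dict.values())
# new_val == [7]: only the non-duplicate valid index survives
```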

antorsae (Contributor) commented

If I understand the code correctly, the above would read all the DICOMs and then attempt a split.

Given that fastai already provides a convenient (and parallelized) pd.DataFrame.from_dicoms(...) method, I suggest instead creating a generic (i.e. not DICOM-specific) splitter:

def ColGroupKFoldSplitter(col, n_folds:int=5, fold:int=0):
    "Split `items` (expected to be a DataFrame) by GroupKFold on `col`"
    def _inner(o):
        assert isinstance(o, pd.DataFrame), "ColGroupKFoldSplitter only works when your items are a pandas DataFrame"
        # TODO
    return _inner

and then, assuming you're instantiating the DataBlock on df_mel, you'd just do: ColGroupKFoldSplitter('PatientID')
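The TODO above could be filled in with sklearn's GroupKFold. The core idea is sketched here dependency-free, using a simple round-robin assignment of groups to folds (`group_kfold_split` is a hypothetical helper, not fastai API; the indexes and patient IDs are made up):

```python
def group_kfold_split(groups, n_folds=5, fold=0):
    # `groups[i]` is the group label (e.g. PatientID) of item i.
    # All items sharing a group land in the same fold, so no patient
    # ends up in both train and valid. sklearn's GroupKFold balances
    # fold sizes more carefully; round-robin keeps the sketch short.
    fold_of = {}
    for g in groups:
        if g not in fold_of:
            fold_of[g] = len(fold_of) % n_folds
    valid = [i for i, g in enumerate(groups) if fold_of[g] == fold]
    train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
    return train, valid

patients = ['p1', 'p1', 'p2', 'p3', 'p2', 'p4']
train, valid = group_kfold_split(patients, n_folds=2, fold=0)
# p1 and p3 go to fold 0, p2 and p4 to fold 1:
# valid == [0, 1, 3], train == [2, 4, 5]
```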
