
DicomSplitter #2724

Open
asvcode opened this issue Aug 30, 2020 · 1 comment
asvcode (Contributor) commented Aug 30, 2020

Feature Request.
DICOM images carry unique metadata, such as the PatientID, that can help ensure the same patient does not appear in both the train and valid sets. Having the same patient in both sets leaks information between them and may skew the measured effectiveness of the model.

Update 12/19/2020

The previous version I proposed did not work fully. This version does: if a patient ID appears in both the train and valid indexes, the duplicate is removed from the valid index. This is explained further here

Proposed Method

import torch
from pydicom import dcmread
from fastcore.foundation import L

def dicomsplit(valid_pct=0.2, seed=None, **kwargs):
    "Splits `items` between train/valid with `valid_pct` and drops patient IDs that exist in both sets"
    def _inner(o, **kwargs):
        if seed is not None: torch.manual_seed(seed)
        rand_idx = L(int(i) for i in torch.randperm(len(o)))
        cut = int(valid_pct * len(o))
        trn = rand_idx[cut:]; trn_p = o[rand_idx[cut:]]
        val = rand_idx[:cut]; val_p = o[rand_idx[:cut]]
        # Read only the headers: the pixel data is not needed for splitting
        train_patient = [dcmread(f, stop_before_pixels=True).PatientID for f in trn_p]
        val_patient = [dcmread(f, stop_before_pixels=True).PatientID for f in val_p]
        # PatientIDs that appear in both splits
        is_duplicate = set(train_patient) & set(val_patient)
        # Map each valid PatientID to its index, then drop the duplicates
        m_dict = dict(zip(val_patient, val))
        for key in is_duplicate: m_dict.pop(key)
        new_val = list(m_dict.values())
        return trn, new_val
    return _inner
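The index-splitting portion can be mimicked in plain Python, with `random.shuffle` standing in for `torch.randperm` (`simple_split` is a made-up name for illustration, not part of the proposal):

```python
import random

def simple_split(n, valid_pct=0.2, seed=None):
    # Mirrors the index logic in dicomsplit, using the stdlib RNG
    # instead of torch.manual_seed / torch.randperm
    if seed is not None:
        random.seed(seed)
    rand_idx = list(range(n))
    random.shuffle(rand_idx)
    cut = int(valid_pct * n)
    return rand_idx[cut:], rand_idx[:cut]   # train, valid

trn, val = simple_split(10, valid_pct=0.2, seed=0)
# lengths are (8, 2): cut = int(0.2 * 10) = 2
```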

Example where there is no duplicate:

(screenshot: no_dup)

This is based on a test set of 10 images. With no duplicates, the valid index remains the same.

With Duplicates:

(screenshot: dup)

In this case there is a duplicate, and the valid set is updated to exclude it.
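The duplicate-handling step can be seen in isolation with plain lists standing in for the DICOM header reads (the patient IDs and valid indexes below are made up for illustration):

```python
# Stand-ins for what dicomsplit extracts from the DICOM headers
train_patient = ['a', 'b', 'c', 'd']   # PatientIDs in the train split
val_patient   = ['e', 'c']             # PatientIDs in the valid split
val           = [7, 3]                 # valid indexes, aligned with val_patient

# Same logic as the proposed method: find PatientIDs in both splits
is_duplicate = set(train_patient) & set(val_patient)   # {'c'}
m_dict = dict(zip(val_patient, val))                   # {'e': 7, 'c': 3}
for key in is_duplicate:
    m_dict.pop(key)                                    # drops index 3
new_val = list(m_dict.values())
# new_val == [7]: only the non-duplicate valid index survives
```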

antorsae (Contributor) commented

If I understand the code correctly, the above would read all the DICOMs and then attempt a split.

Given that fastai already provides a convenient (and parallelized) pd.DataFrame.from_dicoms(...) method, I suggest instead creating a generic (i.e. not DICOM-specific) splitter:

def ColGroupKFoldSplitter(col, n_folds:int=5, fold:int=0):
    "Split `items` (expected to be a DataFrame) by GroupKFold on `col`"
    def _inner(o):
        assert isinstance(o, pd.DataFrame), "ColGroupKFoldSplitter only works when your items are a pandas DataFrame"
        # TODO
    return _inner

and then, assuming you're instantiating the DataBlock on df_mel, you'd just do: ColGroupKFoldSplitter('PatientID')
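The TODO above could be filled in with sklearn's GroupKFold. The core idea is sketched here dependency-free, using a simple round-robin assignment of groups to folds (`group_kfold_split` is a hypothetical helper, not fastai API; the indexes and patient IDs are made up):

```python
def group_kfold_split(groups, n_folds=5, fold=0):
    # `groups[i]` is the group label (e.g. PatientID) of item i.
    # All items sharing a group land in the same fold, so no patient
    # ends up in both train and valid. sklearn's GroupKFold balances
    # fold sizes more carefully; round-robin keeps the sketch short.
    fold_of = {}
    for g in groups:
        if g not in fold_of:
            fold_of[g] = len(fold_of) % n_folds
    valid = [i for i, g in enumerate(groups) if fold_of[g] == fold]
    train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
    return train, valid

patients = ['p1', 'p1', 'p2', 'p3', 'p2', 'p4']
train, valid = group_kfold_split(patients, n_folds=2, fold=0)
# p1 and p3 go to fold 0, p2 and p4 to fold 1:
# valid == [0, 1, 3], train == [2, 4, 5]
```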
