Feature Request.
DICOM images carry unique metadata that can help ensure the same patient does not appear in both the train and valid sets. Having the same patient in both sets can leak information between splits and inflate the apparent effectiveness of the model.
Update 12/19/2020
The version I previously proposed did not fully work. This version does: if the same patient appears in both the train and valid indexes, the duplicate is deleted from the valid index. This is explained further here
Proposed Method
import torch
from fastcore.foundation import L
from pydicom import dcmread

def dicomsplit(valid_pct=0.2, seed=None, **kwargs):
    "Splits `items` between train/valid with `valid_pct`, dropping from the valid set any patient whose `PatientID` also appears in the train set"
    def _inner(o, **kwargs):
        if seed is not None: torch.manual_seed(seed)
        rand_idx = L(int(i) for i in torch.randperm(len(o)))
        cut = int(valid_pct * len(o))
        trn, val = rand_idx[cut:], rand_idx[:cut]
        # Only the PatientID tag is needed for splitting; the pixel data is never read
        train_patient = [dcmread(o[i]).PatientID for i in trn]
        val_patient = [dcmread(o[i]).PatientID for i in val]
        # Patients that appear in both sets
        is_duplicate = set(train_patient) & set(val_patient)
        # Drop valid indexes whose patient also appears in train
        new_val = [i for pat, i in zip(val_patient, val) if pat not in is_duplicate]
        return trn, new_val
    return _inner
Example where there is no duplicate:
This is based on a test set of 10 images. With no duplicates the valid index remains the same.
With Duplicates:
In this case there is a duplicate, and the valid set is updated to exclude it.
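The duplicate-handling step itself can be exercised without any DICOM files. Here is a minimal sketch of the same logic on plain patient-ID lists (the IDs and indexes below are illustrative, not from the test set above):

```python
# Patient IDs as they would be read from the train and valid files,
# paired with the valid indexes produced by the random split
train_patient = ['p1', 'p2', 'p3', 'p4']
val_patient = ['p2', 'p5']
val = [7, 8]

# Patients present in both sets
is_duplicate = set(train_patient) & set(val_patient)

# Keep only the valid indexes whose patient is not also in train
new_val = [i for pat, i in zip(val_patient, val) if pat not in is_duplicate]
print(new_val)  # index 7 ('p2') is dropped, leaving [8]
```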
If I understand the code correctly, the above would read all DICOMs and attempt a split.
Given that fastai already provides a convenient (and parallelized) pd.DataFrame.from_dicoms(...) method, I suggest instead creating a generic (i.e. not DICOM-specific) splitter:
def ColGroupKFoldSplitter(col, n_folds:int=5, fold:int=0):
    "Split `items` (expected to be a DataFrame) by GroupKFold on `col`"
    def _inner(o):
        assert isinstance(o, pd.DataFrame), "ColGroupKFoldSplitter only works when your items are a pandas DataFrame"
        # TODO
    return _inner
and then, assuming you're instantiating the DataBlock on df_mel, you'd just do: ColGroupKFoldSplitter('PatientID')
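One possible way to fill in the TODO above, sketched with scikit-learn's GroupKFold (using sklearn here is my assumption, not part of the original suggestion). GroupKFold guarantees that no group value, here a patient, appears in both splits of any fold:

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

def ColGroupKFoldSplitter(col, n_folds: int = 5, fold: int = 0):
    "Split `items` (expected to be a DataFrame) by GroupKFold on `col`"
    def _inner(o):
        assert isinstance(o, pd.DataFrame), "ColGroupKFoldSplitter only works when your items are a pandas DataFrame"
        # Each fold's valid set holds out a disjoint subset of groups (patients)
        splits = list(GroupKFold(n_splits=n_folds).split(o, groups=o[col]))
        train_idx, valid_idx = splits[fold]
        return list(train_idx), list(valid_idx)
    return _inner

# Illustrative usage on a toy frame with two images per patient
df = pd.DataFrame({'PatientID': ['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd', 'e', 'e'],
                   'path': [f'img{i}.dcm' for i in range(10)]})
train, valid = ColGroupKFoldSplitter('PatientID')(df)
```

Varying the fold argument from 0 to n_folds-1 then gives the usual cross-validation rotation, each time with patient-disjoint train and valid sets.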