It's really handy to have all the DICOM info available in a single DataFrame, so let's create that! In this notebook, we'll just create the DICOM DataFrames. To see how to use them to analyze the competition data, see [this followup notebook](https://www.kaggle.com/jhoward/some-dicom-gotchas-to-be-aware-of-fastai).

First, we'll install the latest versions of pytorch and fastai v2 (not officially released yet) so we can use the fastai medical imaging module.

In [1]:
#default_exp metadata

In [2]:
#export
from rsna_retro.imports import *

Loading imports


Let's take a look at what files we have in the dataset.

In [3]:
#export
set_num_threads(1)
path = Path('~/data/rsna').expanduser()
path_meta = path/'meta'

In [4]:
#export
dir_trn = 'stage_2_train'
dir_tst = 'stage_2_test'
fth_lbl = path_meta/'labels2.fth'
fth_trn = path_meta/'df_trn2.fth'
fth_tst = path_meta/'df_tst2.fth'
fth_trn_comb = path_meta/'df_trn2_comb.fth'
fth_trn_comb_any = path_meta/'df_trn2_any.fth'

In [5]:
#export
path_trn = path/dir_trn
path_tst = path/dir_tst


In [6]:
#export
fn_splits = path_meta/'splits.pkl'
fn_splits_any = path_meta/'splits_any.pkl'
fn_splits_sample = path_meta/'splits_sample.pkl'
fn_grps = path_meta/'grps.pkl'
fn_grps_any = path_meta/'grps_any.pkl'

In [7]:
#export
htypes = ['any','epidural','intraparenchymal','intraventricular','subarachnoid','subdural']

In [8]:
#export
# Stage 1 training
# dir_trn = 'stage_1_train_images'
# dir_tst = 'stage_1_test_images'
# fth_lbl = path_meta/'labels.fth'
# fth_trn = path_meta/'df_trn.fth'
# fth_tst = path_meta/'df_tst.fth'

fth_df_comb1 = path_meta/'df_trn1_comb.fth'
fth_df_tst1 = path_meta/'df_tst1.fth'

fn_splits_stg1 = path_meta/'splits_stg1.pkl'
fn_splits_stg1_any = path_meta/'splits_stg1_any.pkl'

fn_grps_stg1 = path_meta/'grps_stg1.pkl'

Most lists in fastai v2, including the one returned by `Path.ls`, are [fastai.core.L](http://dev.fast.ai/core.html#L) objects, which have lots of handy methods, such as `attrgot` for grabbing an attribute (for example the file name) from every item.

In [9]:
fns_trn = path_trn.ls()
fns_tst = path_tst.ls()

len(fns_trn),len(fns_tst)

We can grab a file and take a look inside using the `dcmread` method that fastai v2 adds.

# Labels

Before we pull the metadata out of the DICOM files, let's process the labels into a convenient format and save it for later. We'll use *feather* format because it's lightning fast!

In [11]:
#export
def save_lbls():
    path_lbls = path/f'{dir_trn}.csv'
    if fth_lbl.exists(): return
    lbls = pd.read_csv(path_lbls)
    lbls[["ID","htype"]] = lbls.ID.str.rsplit("_", n=1, expand=True)
    lbls.drop_duplicates(['ID','htype'], inplace=True)
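    # Keyword arguments: recent pandas versions no longer accept these positionally
    pvt = lbls.pivot(index='ID', columns='htype', values='Label')
    pvt.reset_index(inplace=True)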
    pvt.to_feather(fth_lbl)
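
To make the reshaping step concrete, here is a minimal sketch on a toy table. The three-image CSV contents are made up for illustration, and the feather write is left out so the example has no file dependencies. It shows why the split is done on the last underscore only and why duplicates are dropped before pivoting (a duplicate `(ID, htype)` pair would make `pivot` raise).

```python
import pandas as pd

# Toy stand-in for the competition CSV: one row per (image, hemorrhage type).
lbls = pd.DataFrame({
    "ID": ["ID_aaa_any", "ID_aaa_epidural",
           "ID_bbb_any", "ID_bbb_epidural",
           "ID_bbb_any"],          # deliberate duplicate row
    "Label": [0, 0, 1, 1, 1],
})

# Split on the LAST underscore only: "ID_aaa_any" -> ("ID_aaa", "any").
parts = lbls["ID"].str.rsplit("_", n=1, expand=True)
lbls["ID"], lbls["htype"] = parts[0], parts[1]

# pivot() refuses duplicate (index, column) pairs, so drop them first.
lbls = lbls.drop_duplicates(["ID", "htype"])

# One row per image, one column per hemorrhage type.
pvt = lbls.pivot(index="ID", columns="htype", values="Label").reset_index()
print(pvt)
```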

In [12]:
save_lbls()

In [13]:
df_lbls = pd.read_feather(fth_lbl).set_index('ID')

In [14]:
df_lbls.head(8)

ID,any,epidural,intraparenchymal,intraventricular,subarachnoid,subdural
ID_000012eaf,0,0,0,0,0,0
ID_000039fa0,0,0,0,0,0,0
ID_00005679d,0,0,0,0,0,0
ID_00008ce3c,0,0,0,0,0,0
ID_0000950d7,0,0,0,0,0,0
ID_0000aee4b,0,0,0,0,0,0
ID_0000ca2f6,0,0,0,0,0,0
ID_0000f1657,0,0,0,0,0,0


In [15]:
df_lbls.mean()

any                 0.143375
epidural            0.004178
intraparenchymal    0.047978
intraventricular    0.034810
subarachnoid        0.047390
subdural            0.062654
dtype: float64

# DICOM Meta

To turn the DICOM file metadata into a DataFrame, we can use the `from_dicoms` function that fastai v2 adds. Passing `px_summ=True` also adds summary statistics of the image pixels (mean/min/max/std) to the DataFrame, although this takes much longer because the image data has to be uncompressed.

In [16]:
#export
def load_feather(fth_path):
    df = pd.read_feather(fth_path)
    return df.set_index('SOPInstanceUID').sort_values(['SeriesInstanceUID', "ImagePositionPatient2"])

def save_feather(df, fth_path): df.reset_index().sort_values(['SeriesInstanceUID', "ImagePositionPatient2"]).to_feather(fth_path)

In [17]:
#export
def process_metadata(fns, out_f, n_workers=12):
    if out_f.exists(): return
    # Pass the n_workers argument through instead of hard-coding 12
    df = pd.DataFrame.from_dicoms(fns, px_summ=True, window=dicom_windows.brain, n_workers=n_workers)
    save_feather(df, out_f)
    return df
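
As a rough illustration of what the pixel summary columns mean, the sketch below computes the same kind of statistics with plain NumPy. This is not fastai's implementation: `px_summary` is a hypothetical helper, and the brain window is assumed to be width 80, level 40, which matches the `WindowWidth1`/`WindowCenter1` values shown in the output tables of this notebook. The "percent in window" is taken here as the fraction of pixels strictly inside the window.

```python
import numpy as np

def px_summary(px, window=(80, 40)):
    """Summary statistics for one slice (hypothetical helper).

    `window` is (width, level); the default is an assumed brain window.
    """
    w, l = window
    # Pixels strictly between level - width/2 and level + width/2
    in_window = (px > l - w // 2) & (px < l + w // 2)
    return {
        "img_min": float(px.min()),
        "img_max": float(px.max()),
        "img_mean": float(px.mean()),
        "img_std": float(px.std()),
        "img_pct_window": float(in_window.mean()),
    }

# Tiny fake "slice": only 40 and 79 fall strictly inside (0, 80)
px = np.array([[-1000, 0], [40, 79]])
stats = px_summary(px)
print(stats)
```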
    

In [18]:
process_metadata(fns_tst, fth_tst)

In [19]:
df_tst = load_feather(fth_tst)
df_tst.shape

(121232, 41)

In [20]:
df_tst.head()

SOPInstanceUID,Modality,PatientID,StudyInstanceUID,SeriesInstanceUID,StudyID,ImagePositionPatient,ImageOrientationPatient,SamplesPerPixel,PhotometricInterpretation,Rows,...,PixelSpacing1,MultiWindowCenter,WindowCenter1,MultiWindowWidth,WindowWidth1,img_min,img_max,img_mean,img_std,img_pct_window
ID_714683b15,CT,ID_f997418a,ID_16aac16e79,ID_0018be306d,,-167.0,1.0,1,MONOCHROME2,512,...,0.488281,1.0,40.0,1.0,80.0,0,2526,329.507587,451.948261,0.108215
ID_e4201ed62,CT,ID_f997418a,ID_16aac16e79,ID_0018be306d,,-167.0,1.0,1,MONOCHROME2,512,...,0.488281,1.0,40.0,1.0,80.0,0,2536,341.396053,469.553211,0.103333
ID_ec585911c,CT,ID_f997418a,ID_16aac16e79,ID_0018be306d,,-167.0,1.0,1,MONOCHROME2,512,...,0.488281,1.0,40.0,1.0,80.0,0,2530,359.314346,489.364968,0.098858
ID_61149ac47,CT,ID_f997418a,ID_16aac16e79,ID_0018be306d,,-167.0,1.0,1,MONOCHROME2,512,...,0.488281,1.0,40.0,1.0,80.0,0,2528,383.288368,518.095082,0.102551
ID_70bf363a9,CT,ID_f997418a,ID_16aac16e79,ID_0018be306d,,-167.0,1.0,1,MONOCHROME2,512,...,0.488281,1.0,40.0,1.0,80.0,0,2621,400.832222,545.006538,0.108265


In [21]:
process_metadata(fns_trn, fth_trn)

In [22]:
df_trn = load_feather(fth_trn)
df_trn.shape

(752802, 41)

In [23]:
df_trn.head()

SOPInstanceUID,Modality,PatientID,StudyInstanceUID,SeriesInstanceUID,StudyID,ImagePositionPatient,ImageOrientationPatient,SamplesPerPixel,PhotometricInterpretation,Rows,...,PixelSpacing1,img_min,img_max,img_mean,img_std,img_pct_window,MultiWindowCenter,WindowCenter1,MultiWindowWidth,WindowWidth1
ID_76d55d9d0,CT,ID_b9797064,ID_00b9e1961f,ID_0000298a7d,,-125.0,1.0,1,MONOCHROME2,512,...,0.488281,-2000,2709,-16.228073,1157.254742,0.143887,,,,
ID_96d282ea9,CT,ID_b9797064,ID_00b9e1961f,ID_0000298a7d,,-125.0,1.0,1,MONOCHROME2,512,...,0.488281,-2000,3847,-10.814919,1165.228119,0.139881,,,,
ID_7d8a7c29d,CT,ID_b9797064,ID_00b9e1961f,ID_0000298a7d,,-125.0,1.0,1,MONOCHROME2,512,...,0.488281,-2000,2727,-0.899811,1174.712056,0.13282,,,,
ID_4d4401491,CT,ID_b9797064,ID_00b9e1961f,ID_0000298a7d,,-125.0,1.0,1,MONOCHROME2,512,...,0.488281,-2000,2715,0.538429,1174.011045,0.142532,,,,
ID_8f5ded0b7,CT,ID_b9797064,ID_00b9e1961f,ID_0000298a7d,,-125.0,1.0,1,MONOCHROME2,512,...,0.488281,-2000,2711,4.452103,1177.018285,0.157337,,,,


In [24]:
if not fth_trn_comb.exists():
    df_comb = df_trn.join(df_lbls, 'SOPInstanceUID')
    save_feather(df_comb, fth_trn_comb)

In [25]:
df_comb = load_feather(fth_trn_comb)

In [26]:
df_comb.head()

SOPInstanceUID,Modality,PatientID,StudyInstanceUID,SeriesInstanceUID,StudyID,ImagePositionPatient,ImageOrientationPatient,SamplesPerPixel,PhotometricInterpretation,Rows,...,MultiWindowCenter,WindowCenter1,MultiWindowWidth,WindowWidth1,any,epidural,intraparenchymal,intraventricular,subarachnoid,subdural
ID_76d55d9d0,CT,ID_b9797064,ID_00b9e1961f,ID_0000298a7d,,-125.0,1.0,1,MONOCHROME2,512,...,,,,,0,0,0,0,0,0
ID_96d282ea9,CT,ID_b9797064,ID_00b9e1961f,ID_0000298a7d,,-125.0,1.0,1,MONOCHROME2,512,...,,,,,0,0,0,0,0,0
ID_7d8a7c29d,CT,ID_b9797064,ID_00b9e1961f,ID_0000298a7d,,-125.0,1.0,1,MONOCHROME2,512,...,,,,,0,0,0,0,0,0
ID_4d4401491,CT,ID_b9797064,ID_00b9e1961f,ID_0000298a7d,,-125.0,1.0,1,MONOCHROME2,512,...,,,,,0,0,0,0,0,0
ID_8f5ded0b7,CT,ID_b9797064,ID_00b9e1961f,ID_0000298a7d,,-125.0,1.0,1,MONOCHROME2,512,...,,,,,0,0,0,0,0,0


## Split by SeriesInstanceUID

In [27]:
#export
def group_cv(idx, grps): return np.concatenate([grps[o] for o in range_of(grps) if o!=idx])

# column can also be PatientID
def split_data(df, cv_idx, grps, column):
    idx = L.range(df)
    grp_cv = group_cv(cv_idx, grps)
    mask_grp = df[column].isin(grp_cv)
    mask_col = df[column].isin(grps[cv_idx])
    return idx[mask_grp],idx[mask_col]

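Split by unique ID (`SeriesInstanceUID` by default; `PatientID` also works), so that all images sharing an ID land in the same fold.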

In [28]:
#export
def get_splits(df, column='SeriesInstanceUID', nfold=8, ifold=0):
    set_seed(42)
    unique_ids = df[column].unique()
    np.random.shuffle(unique_ids)
    grps = np.array_split(unique_ids, nfold)
    return split_data(df, ifold, grps, column=column), grps
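
The grouped split can be sketched without fastai as follows. This is a simplified re-implementation for illustration, not the exported code: `group_split` is a hypothetical name, NumPy's `default_rng` stands in for `set_seed`, and the toy DataFrame has 8 series of 3 slices each. The point is that whole series, not individual slices, are assigned to folds.

```python
import numpy as np
import pandas as pd

def group_split(df, column, nfold=4, ifold=0, seed=42):
    """Return ((train_idx, valid_idx), groups), splitting by unique IDs."""
    rng = np.random.default_rng(seed)
    # Shuffle the unique IDs, then cut them into nfold roughly equal groups
    unique_ids = rng.permutation(np.asarray(df[column].unique()))
    grps = np.array_split(unique_ids, nfold)
    # Every group except ifold is used for training
    train_ids = np.concatenate([g for i, g in enumerate(grps) if i != ifold])
    idx = np.arange(len(df))
    train_idx = idx[df[column].isin(train_ids).to_numpy()]
    valid_idx = idx[df[column].isin(grps[ifold]).to_numpy()]
    return (train_idx, valid_idx), grps

# Toy data: 8 series with 3 slices each
df = pd.DataFrame({
    "SeriesInstanceUID": np.repeat([f"series_{i}" for i in range(8)], 3),
})
(train_idx, valid_idx), grps = group_split(df, "SeriesInstanceUID")

train_series = set(df["SeriesInstanceUID"].iloc[train_idx])
valid_series = set(df["SeriesInstanceUID"].iloc[valid_idx])
print(len(train_idx), len(valid_idx), train_series & valid_series)
```

Because the split is made on IDs, no series appears on both sides, which avoids leaking near-identical adjacent slices into validation.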

In [29]:
splits, grps = get_splits(df_comb)

In [30]:
fn_splits.save(splits)
fn_grps.save(grps)

## Create Any DF

A smaller subsample to train on: remove every CT series in which all images are healthy.

In [31]:
if not fth_trn_comb_any.exists():
    column = 'SeriesInstanceUID'
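    # Only the 'any' column is needed; summing every column is slow and can fail on non-numeric data
    df_sum = df_comb.groupby(column)[['any']].sum()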
    any_ids = df_sum.loc[df_sum['any'] != 0].index.values
    df_any = df_comb.loc[df_comb[column].isin(any_ids)]
    save_feather(df_any, fth_trn_comb_any)
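
The filtering idea is easier to see on a toy frame. The sketch below uses made-up series IDs and a `groupby(...).transform("max")` variant instead of the sum-and-`isin` approach above; both keep every slice of a series that has at least one positive `any` label.

```python
import pandas as pd

df = pd.DataFrame({
    "SeriesInstanceUID": ["s1", "s1", "s2", "s2", "s3"],
    "any":               [0,    1,    0,    0,    1],
})

# True for every row whose series contains at least one positive slice
has_pos = df.groupby("SeriesInstanceUID")["any"].transform("max") > 0
df_any = df[has_pos]
print(df_any)
```

Note that healthy slices from positive series are kept (the first `s1` row here), so the positive rate in the filtered data is well below 100%.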

In [32]:
df_any = load_feather(fth_trn_comb_any)

In [33]:
df_comb.shape, df_any.shape

((752802, 47), (300934, 47))

## Create Any Subsample - Remove completely healthy series

In [34]:
any_ids = set(df_any.SeriesInstanceUID.values)

In [35]:
# Make sure to use the same groups as splits_full
grps_any = [list(set(x)&any_ids) for x in grps]
[len(x) for x in grps_any], [len(x) for x in grps]

([1090, 1135, 1115, 1098, 1106, 1091, 1150, 1097],
 [2718, 2718, 2718, 2718, 2718, 2718, 2718, 2718])
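
Intersecting each existing group with `any_ids` keeps every series in the fold it was originally assigned to. A small sketch with made-up IDs is below; it uses a list comprehension rather than a set intersection, an alternative that also preserves the original order within each group. That matters if the lists are later sliced (as with `[:50]` in the small sample section), because the iteration order of a set of strings is not guaranteed to be the same between Python sessions.

```python
# Two existing folds and the set of series that contain a positive slice
grps = [["s1", "s2", "s3"], ["s4", "s5", "s6"]]
any_ids = {"s2", "s3", "s6"}

# Keep fold membership and original order; drop series not in any_ids
grps_any = [[sid for sid in grp if sid in any_ids] for grp in grps]
print(grps_any)
```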

In [36]:
fn_grps_any.save(grps_any)

In [37]:
splits_any = split_data(df_comb, 0, grps_any, 'SeriesInstanceUID')

In [38]:
splits_any

((#264116) [369,370,371,372,373,374,375,376,377,378...],
 (#36818) [1076,1077,1078,1079,1080,1081,1082,1083,1084,1085...])

In [39]:
# Fraction of positive labels per hemorrhage type in the training split
df_lbls.loc[df_comb.index.values[splits_any[0]]].sum() / len(splits_any[0])

any                 0.358131
epidural            0.010624
intraparenchymal    0.120614
intraventricular    0.087795
subarachnoid        0.117475
subdural            0.156037
dtype: float64

In [40]:
fn_splits_any.save(splits_any)

## Create stage1 split

In [41]:
df_comb1 = pd.read_feather(fth_df_comb1)
df_tst1 = pd.read_feather(fth_df_tst1)

In [42]:
df_comb.shape, df_comb1.shape, df_tst1.shape

((752802, 47), (674257, 52), (78545, 46))

In [43]:
# sc = set(df_comb1.SeriesInstanceUID)
# st = set(df_tst1.SeriesInstanceUID)
# sc & st

In [49]:
grps_stg1 = [df_tst1.SeriesInstanceUID.unique(), df_comb1.SeriesInstanceUID.unique()]
splits_stg1 = split_data(df_comb, 0, grps_stg1, 'SeriesInstanceUID')
fn_grps_stg1.save(grps_stg1)

In [51]:
any_ids = set(df_any.SeriesInstanceUID.values)
# Make sure to use the same groups as splits_full
grps_stg1_any = [list(set(x)&any_ids) for x in grps_stg1]
[len(x) for x in grps_stg1_any], [len(x) for x in grps_stg1]

([879, 8003], [2214, 19530])

In [46]:
splits_stg1_any = split_data(df_comb, 0, grps_stg1_any, 'SeriesInstanceUID')

In [47]:
fn_splits_stg1_any.save(splits_stg1_any)

In [48]:
fn_splits_stg1.save(splits_stg1)

## Small sample

In [49]:
# Make sure to use the same groups as splits_full
grps_sample = [list(set(x)&any_ids)[:50] for x in grps]

In [50]:
splits_sample = split_data(df_comb, 0, grps_sample, 'SeriesInstanceUID')

In [51]:
fn_splits_sample.save(splits_sample)

## Meta class

The metadata is loaded lazily, on first attribute access. Otherwise the module takes far too long to import.

In [52]:
#export
lazy_loaders = {
    'df_any': lambda: pd.read_feather(fth_trn_comb_any).set_index('SOPInstanceUID'),
    'df_labels': lambda: pd.read_feather(fth_lbl).set_index('ID'),
    'df_comb': lambda: pd.read_feather(fth_trn_comb).set_index('SOPInstanceUID'),
    'df_tst': lambda: pd.read_feather(fth_tst).set_index('SOPInstanceUID'),
    'df_comb1': lambda: pd.read_feather(fth_df_comb1).set_index('SOPInstanceUID'),
    'fns_trn': lambda: path_trn.ls(),
    'fns_tst': lambda: path_tst.ls(),
    'splits': lambda: fn_splits.load(),
    'grps': lambda: fn_grps.load(),
    'grps_any': lambda: fn_grps_any.load(),
    'grps_stg1': lambda: fn_grps_stg1.load(),
    'splits_any': lambda: fn_splits_any.load(),
    'splits_sample': lambda: fn_splits_sample.load(),
    'splits_stg1': lambda: fn_splits_stg1.load(),
    'splits_stg1_any': lambda: fn_splits_stg1_any.load(),
}

class MetaType(type):
    def __dir__(self):
        return lazy_loaders.keys()
    def __getattr__(self, name: str):
        if name in self.__dict__: return self.__dict__[name]
        if name in lazy_loaders:
            setattr(self, name, lazy_loaders[name]())
            return self.__dict__[name]
        raise AttributeError(f"{self.__name__!r} has no attribute {name!r}")
        
class Meta(metaclass=MetaType): pass
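
The lazy-loading pattern can be tried in isolation with the standalone sketch below. The names `LazyType`, `Lazy`, `loaders` and `load_numbers` are hypothetical stand-ins for the classes and loaders above, and the loader returns a small list instead of reading a feather file. It relies on `__getattr__` only being called when normal attribute lookup fails, so once the value has been stored on the class, the loader is never called again.

```python
calls = []  # records each time a loader actually runs

def load_numbers():
    calls.append("numbers")
    return [1, 2, 3]

loaders = {"numbers": load_numbers}

class LazyType(type):
    def __dir__(cls):
        # Advertise the lazily available names to dir() and tab completion
        return list(loaders)

    def __getattr__(cls, name):
        # Reached only when `name` is not yet set on the class
        if name in loaders:
            value = loaders[name]()
            setattr(cls, name, value)  # cache for subsequent accesses
            return value
        raise AttributeError(f"{cls.__name__!r} has no attribute {name!r}")

class Lazy(metaclass=LazyType): pass

first = Lazy.numbers   # triggers the loader
second = Lazy.numbers  # served from the cached class attribute
print(first, calls)
```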

In [53]:
Meta.df_any

SOPInstanceUID,Modality,PatientID,StudyInstanceUID,SeriesInstanceUID,StudyID,ImagePositionPatient,ImageOrientationPatient,SamplesPerPixel,PhotometricInterpretation,Rows,...,MultiWindowCenter,WindowCenter1,MultiWindowWidth,WindowWidth1,any,epidural,intraparenchymal,intraventricular,subarachnoid,subdural
ID_4509f2560,CT,ID_4c16e232,ID_c174374b07,ID_002c9733b7,,-125.0,1.0,1,MONOCHROME2,512,...,1.0,40.0,1.0,80.0,0,0,0,0,0,0
ID_0969176c0,CT,ID_4c16e232,ID_c174374b07,ID_002c9733b7,,-125.0,1.0,1,MONOCHROME2,512,...,1.0,40.0,1.0,80.0,0,0,0,0,0,0
ID_2363aa3ef,CT,ID_4c16e232,ID_c174374b07,ID_002c9733b7,,-125.0,1.0,1,MONOCHROME2,512,...,1.0,40.0,1.0,80.0,0,0,0,0,0,0
ID_7b6119ddf,CT,ID_4c16e232,ID_c174374b07,ID_002c9733b7,,-125.0,1.0,1,MONOCHROME2,512,...,1.0,40.0,1.0,80.0,0,0,0,0,0,0
ID_a787384c2,CT,ID_4c16e232,ID_c174374b07,ID_002c9733b7,,-125.0,1.0,1,MONOCHROME2,512,...,1.0,40.0,1.0,80.0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ID_8eb7c45bc,CT,ID_984a3f15,ID_7891a70bf4,ID_fffde5ed33,,-125.0,1.0,1,MONOCHROME2,512,...,,,,,0,0,0,0,0,0
ID_cf66e9f08,CT,ID_984a3f15,ID_7891a70bf4,ID_fffde5ed33,,-125.0,1.0,1,MONOCHROME2,512,...,,,,,0,0,0,0,0,0
ID_6c779d850,CT,ID_984a3f15,ID_7891a70bf4,ID_fffde5ed33,,-125.0,1.0,1,MONOCHROME2,512,...,,,,,0,0,0,0,0,0
ID_b7aad552c,CT,ID_984a3f15,ID_7891a70bf4,ID_fffde5ed33,,-125.0,1.0,1,MONOCHROME2,512,...,,,,,0,0,0,0,0,0


## Export

In [54]:
#hide
from nbdev.export import notebook2script
notebook2script()

Converted 00_metadata.ipynb.
Converted 01_preprocess.ipynb.
Converted 01_preprocess_mean_std.ipynb.
Converted 02_train.ipynb.
Converted 03_train3d.ipynb.
Converted 03_train3d_01_train3d.ipynb.
Converted 03_train3d_01b_train_lstm.ipynb.
Converted 03_train3d_02_train_head.ipynb.
Converted 03_trainfull3d.ipynb.
Converted 04_trainSeq_01_lstm.ipynb.
Converted 04_trainSeq_02_transformer.ipynb.
Converted 04_trainSeq_03_lstm_seutao.ipynb.
Converted 05_train_adjacent.ipynb.
Converted 05_train_adjacent_01_5c_windowed.ipynb.
Converted 05_train_adjacent_01_5slice.ipynb.
Converted 05_train_adjacent_02_3c.ipynb.
Converted 05_train_adjacent_02_3c_stg1.ipynb.
Converted 06_seutao_features.ipynb.
Converted 06_seutao_features_01_simple_lstm_20ep.ipynb.
Converted 06_seutao_features_01b_simple_lstm_10ep.ipynb.
Converted 06_seutao_features_01c_simple_lstm_meta.ipynb.
Converted 06_seutao_features_01d_simple_lstm_meta_stg1.ipynb.
Converted 06_seutao_features_02_2ndPlace.ipynb.
Converted 06_seutao_features_03_