It's really handy to have all the DICOM info available in a single DataFrame, so let's create that! In this notebook, we'll just create the DICOM DataFrames. To see how to use them to analyze the competition data, see [this followup notebook](https://www.kaggle.com/jhoward/some-dicom-gotchas-to-be-aware-of-fastai).

This notebook relies on pytorch and fastai v2 (not officially released yet) for the fastai medical imaging module; the imports are collected in `rsna_retro.imports`.

In [1]:
#default_exp metadata

In [2]:
#export
from rsna_retro.imports import *

Loading imports


Let's take a look at what files we have in the dataset.

In [3]:
#export
set_num_threads(1)
path = Path('~/data/rsna').expanduser()
path_meta = path/'meta'

In [4]:
#export
dir_trn = 'stage_2_train'
dir_tst = 'stage_2_test'
fth_lbl = path_meta/'labels2.fth'
fth_trn = path_meta/'df_trn2.fth'
fth_tst = path_meta/'df_tst2.fth'
fth_trn_comb = path_meta/'df_trn2_comb.fth'
fth_trn_comb_any = path_meta/'df_trn2_any.fth'

In [5]:
#export
path_trn = path/dir_trn
path_tst = path/dir_tst


In [6]:
#export
fn_splits = path_meta/'splits.pkl'
fn_splits_any = path_meta/'splits_any.pkl'
fn_splits_sample = path_meta/'splits_sample.pkl'
fn_grps = path_meta/'grps.pkl'
fn_grps_any = path_meta/'grps_any.pkl'

In [7]:
#export
htypes = ['any','epidural','intraparenchymal','intraventricular','subarachnoid','subdural']

In [8]:
#export
# Stage 1 training
# dir_trn = 'stage_1_train_images'
# dir_tst = 'stage_1_test_images'
# fth_lbl = path_meta/'labels.fth'
# fth_trn = path_meta/'df_trn.fth'
# fth_tst = path_meta/'df_tst.fth'

fth_df_comb1 = path_meta/'df_trn1_comb.fth'
fth_df_tst1 = path_meta/'df_tst1.fth'

fn_splits_stg1 = path_meta/'splits_stg1.pkl'
fn_splits_stg1_any = path_meta/'splits_stg1_any.pkl'

Most lists in fastai v2, including that returned by `Path.ls`, are returned as a [fastai.core.L](http://dev.fast.ai/core.html#L), which has lots of handy methods, such as `attrgot`, which grabs a named attribute (for example, the file name) from every item.

In [9]:
fns_trn = path_trn.ls()
fns_tst = path_tst.ls()
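
For readers unfamiliar with `attrgot`, the following is a plain-Python sketch of the idea, not fastai's implementation: it pulls one named attribute out of every item in a list. The file names are made up for illustration, and `attrgot_sketch` is a hypothetical helper.

```python
from operator import attrgetter
from pathlib import PurePosixPath

def attrgot_sketch(items, attr):
    """Return the attribute `attr` of every item (hypothetical stand-in for `L.attrgot`)."""
    return [attrgetter(attr)(o) for o in items]

# Made-up paths standing in for the result of `path_trn.ls()`
fns = [PurePosixPath('stage_2_train/ID_aaa.dcm'),
       PurePosixPath('stage_2_train/ID_bbb.dcm')]

names = attrgot_sketch(fns, 'name')   # file names with extension
stems = attrgot_sketch(fns, 'stem')   # file names without extension
```

The stems correspond to the `ID_...` values that appear as `SOPInstanceUID` in the DataFrames further down.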

In [10]:
len(fns_trn),len(fns_tst)

(752803, 121232)

We can grab a file and take a look inside using the `dcmread` method that fastai v2 adds.

# Labels

Before we pull the metadata out of the DICOM files, let's process the labels into a convenient format and save it for later. We'll use *feather* format because it's lightning fast!

In [11]:
#export
def save_lbls():
    path_lbls = path/f'{dir_trn}.csv'
    if fth_lbl.exists(): return
    lbls = pd.read_csv(path_lbls)
    lbls[["ID","htype"]] = lbls.ID.str.rsplit("_", n=1, expand=True)
    lbls.drop_duplicates(['ID','htype'], inplace=True)
    pvt = lbls.pivot(index='ID', columns='htype', values='Label')
    pvt.reset_index(inplace=True)
    pvt.to_feather(fth_lbl)
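
To see what the reshaping in `save_lbls` does, here is a small sketch on made-up rows. The real CSV is assumed to have one row per image and hemorrhage type, with the type appended to the ID after the last underscore; the toy IDs and labels below are invented.

```python
import pandas as pd

# Toy stand-in for the competition CSV, including one duplicated row
lbls = pd.DataFrame({
    'ID': ['ID_aaa_any', 'ID_aaa_epidural', 'ID_bbb_any', 'ID_bbb_epidural', 'ID_bbb_any'],
    'Label': [1, 0, 0, 0, 0],
})

# Split 'ID_aaa_any' into the image ID ('ID_aaa') and the hemorrhage type ('any')
lbls[['ID', 'htype']] = lbls['ID'].str.rsplit('_', n=1, expand=True)

# pivot requires unique (ID, htype) pairs, so duplicates have to go first
lbls = lbls.drop_duplicates(['ID', 'htype'])

# One row per image, one column per hemorrhage type
pvt = lbls.pivot(index='ID', columns='htype', values='Label')
```

Without the `drop_duplicates` step, `pivot` would raise an error on the repeated `ID_bbb_any` row.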

In [12]:
save_lbls()

In [13]:
df_lbls = pd.read_feather(fth_lbl).set_index('ID')

In [14]:
df_lbls.head(8)

ID,any,epidural,intraparenchymal,intraventricular,subarachnoid,subdural
ID_000012eaf,0,0,0,0,0,0
ID_000039fa0,0,0,0,0,0,0
ID_00005679d,0,0,0,0,0,0
ID_00008ce3c,0,0,0,0,0,0
ID_0000950d7,0,0,0,0,0,0
ID_0000aee4b,0,0,0,0,0,0
ID_0000ca2f6,0,0,0,0,0,0
ID_0000f1657,0,0,0,0,0,0


In [15]:
df_lbls.mean()

any                 0.143375
epidural            0.004178
intraparenchymal    0.047978
intraventricular    0.034810
subarachnoid        0.047390
subdural            0.062654
dtype: float64

# DICOM Meta

To turn the DICOM file metadata into a DataFrame we can use the `from_dicoms` function that fastai v2 adds. Passing `px_summ=True` also adds summary statistics of the image pixels (mean/min/max/std) to the DataFrame, although this takes much longer, since the image data has to be decompressed.
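
To make the pixel summary concrete, here is a rough NumPy sketch of the kind of statistics that end up in the `img_*` columns. It is an illustration rather than fastai's code: the helper `px_summary`, the toy pixel values, the (width, level) window convention, and the reading of `img_pct_window` as the fraction of pixels strictly inside the window are all assumptions.

```python
import numpy as np

def px_summary(px, window):
    """Summary statistics for one image (hypothetical helper, not fastai's API)."""
    width, level = window                      # assumed (width, level) convention
    lo, hi = level - width // 2, level + width // 2
    in_window = (px > lo) & (px < hi)          # pixels strictly inside the window
    return {
        'img_min': px.min().item(),
        'img_max': px.max().item(),
        'img_mean': px.mean().item(),
        'img_std': px.std().item(),
        'img_pct_window': in_window.mean().item(),
    }

# Toy 2x2 "image" and an assumed brain window of width 80 centred on level 40
px = np.array([[-1000, 0], [40, 79]])
stats = px_summary(px, window=(80, 40))
```

Two of the four toy pixels fall strictly between 0 and 80, so `img_pct_window` is 0.5 here.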

In [16]:
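def process_metadata(fns, out_f, n_workers=12):
    # Skip the expensive pass over the DICOM files if the feather file already exists
    if out_f.exists(): return
    # Forward n_workers instead of hard-coding 12
    df = pd.DataFrame.from_dicoms(fns, px_summ=True, window=dicom_windows.brain, n_workers=n_workers)
    df.to_feather(out_f)
    return df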

In [17]:
process_metadata(fns_tst, fth_tst)

In [18]:
df_tst = pd.read_feather(fth_tst).set_index('SOPInstanceUID')

In [19]:
df_tst.head()

SOPInstanceUID,Modality,PatientID,StudyInstanceUID,SeriesInstanceUID,StudyID,ImagePositionPatient,ImageOrientationPatient,SamplesPerPixel,PhotometricInterpretation,Rows,...,PixelSpacing1,MultiWindowCenter,WindowCenter1,MultiWindowWidth,WindowWidth1,img_min,img_max,img_mean,img_std,img_pct_window
ID_de973ed39,CT,ID_9a848de8,ID_636648ddc3,ID_3334944046,,-125.0,1.0,1,MONOCHROME2,512,...,0.488281,1.0,47.0,1.0,80.0,0,2766,458.317307,581.025693,0.238255
ID_124564e24,CT,ID_0f6dde7d,ID_5f6fc2a2cd,ID_7514a7fb7b,,-139.0,1.0,1,MONOCHROME2,512,...,0.488281,1.0,40.0,1.0,80.0,0,2441,481.852821,565.443686,0.252171
ID_15c8e85ea,CT,ID_00556373,ID_f5ef8bf6ea,ID_41bab3d33d,,-125.0,1.0,1,MONOCHROME2,512,...,0.488281,,,,,-2000,3299,19.316921,1172.887786,0.16597
ID_0e37b7ef5,CT,ID_ba67f475,ID_75e823b787,ID_cdbc8473f3,,-131.0,1.0,1,MONOCHROME2,512,...,0.488281,1.0,36.0,1.0,80.0,6,2566,416.983677,571.039439,0.232601
ID_25fab03f0,CT,ID_ebec6b48,ID_bb3711c18b,ID_7bbbbdb05f,,-119.5,1.0,1,MONOCHROME2,512,...,0.466797,1.0,36.0,1.0,80.0,4,2420,347.349949,566.757696,0.098629


In [20]:
process_metadata(fns_trn, fth_trn)

In [21]:
df_trn = pd.read_feather(fth_trn)

In [22]:
df_trn.head()

,SOPInstanceUID,Modality,PatientID,StudyInstanceUID,SeriesInstanceUID,StudyID,ImagePositionPatient,ImageOrientationPatient,SamplesPerPixel,PhotometricInterpretation,...,PixelSpacing1,img_min,img_max,img_mean,img_std,img_pct_window,MultiWindowCenter,WindowCenter1,MultiWindowWidth,WindowWidth1
0,ID_352e89f1c,CT,ID_d557ddd2,ID_05074a0d95,ID_be6165332c,,-125.0,1.0,1,MONOCHROME2,...,0.488281,-2000,2787,35.112926,1166.720843,0.164139,,,,
1,ID_3cf4fb50f,CT,ID_16b2ad86,ID_c3a404ea2e,ID_2c1454e208,,-125.0,1.0,1,MONOCHROME2,...,0.488281,0,2412,234.549896,392.132243,0.076015,1.0,36.0,1.0,80.0
2,ID_e3674b189,CT,ID_eb712bf0,ID_db83193795,ID_e1facea145,,-125.0,1.0,1,MONOCHROME2,...,0.488281,-2000,2749,50.59132,1216.541625,0.243259,,,,
3,ID_2a8702d25,CT,ID_ff137633,ID_d17053848c,ID_7098f7c836,,-126.437378,1.0,1,MONOCHROME2,...,0.494863,0,2806,482.248981,571.235614,0.241489,,,,
4,ID_7be0f1b3c,CT,ID_cd9169c2,ID_b42de79024,ID_f5bd86b25b,,-125.0,1.0,1,MONOCHROME2,...,0.488281,-2000,2776,10.762859,1164.588862,0.251751,,,,


In [23]:
if not fth_trn_comb.exists():
    df_comb = df_trn.join(df_lbls, 'SOPInstanceUID')
    df_comb.to_feather(fth_trn_comb)
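
The join above matches the `SOPInstanceUID` column of the metadata against the index of the labels DataFrame. A minimal sketch with made-up rows shows the behaviour, including what happens to an image without labels:

```python
import pandas as pd

# Made-up metadata rows; 'ID_c' has no matching label row
meta = pd.DataFrame({
    'SOPInstanceUID': ['ID_a', 'ID_b', 'ID_c'],
    'img_mean': [1.0, 2.0, 3.0],
})

# Made-up labels, indexed by image ID like `df_lbls`
labels = pd.DataFrame({'any': [1, 0]}, index=pd.Index(['ID_a', 'ID_b'], name='ID'))

# Left join: the `on` column of `meta` is looked up in the index of `labels`
comb = meta.join(labels, on='SOPInstanceUID')
```

Because `join` defaults to a left join, unmatched rows are kept and their label columns are filled with `NaN`.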

In [24]:
df_comb = pd.read_feather(fth_trn_comb).set_index('SOPInstanceUID')

In [25]:
df_comb.head()

SOPInstanceUID,Modality,PatientID,StudyInstanceUID,SeriesInstanceUID,StudyID,ImagePositionPatient,ImageOrientationPatient,SamplesPerPixel,PhotometricInterpretation,Rows,...,MultiWindowCenter,WindowCenter1,MultiWindowWidth,WindowWidth1,any,epidural,intraparenchymal,intraventricular,subarachnoid,subdural
ID_352e89f1c,CT,ID_d557ddd2,ID_05074a0d95,ID_be6165332c,,-125.0,1.0,1,MONOCHROME2,512,...,,,,,0,0,0,0,0,0
ID_3cf4fb50f,CT,ID_16b2ad86,ID_c3a404ea2e,ID_2c1454e208,,-125.0,1.0,1,MONOCHROME2,512,...,1.0,36.0,1.0,80.0,0,0,0,0,0,0
ID_e3674b189,CT,ID_eb712bf0,ID_db83193795,ID_e1facea145,,-125.0,1.0,1,MONOCHROME2,512,...,,,,,0,0,0,0,0,0
ID_2a8702d25,CT,ID_ff137633,ID_d17053848c,ID_7098f7c836,,-126.437378,1.0,1,MONOCHROME2,512,...,,,,,1,0,1,1,0,0
ID_7be0f1b3c,CT,ID_cd9169c2,ID_b42de79024,ID_f5bd86b25b,,-125.0,1.0,1,MONOCHROME2,512,...,,,,,0,0,0,0,0,0


## Split by SeriesInstanceUID

In [26]:
#export
def group_cv(idx, grps): return np.concatenate([grps[o] for o in range_of(grps) if o!=idx])

# column can also be PatientID
def split_data(df, cv_idx, grps, column):
    idx = L.range(df)
    grp_cv = group_cv(cv_idx, grps)
    mask_grp = df[column].isin(grp_cv)
    mask_col = df[column].isin(grps[cv_idx])
    return idx[mask_grp],idx[mask_col]

Split by the unique values of `column` (`SeriesInstanceUID` by default; `PatientID` also works)

In [27]:
#export
def get_splits(df, column='SeriesInstanceUID', nfold=8, ifold=0):
    set_seed(42)
    unique_ids = df[column].unique()
    np.random.shuffle(unique_ids)
    grps = np.array_split(unique_ids, nfold)
    return split_data(df, ifold, grps, column=column), grps
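
The point of splitting by group rather than by row is that all slices of one series land on the same side of the split. Below is a self-contained sketch of that idea on made-up data. It uses plain NumPy index arrays and `np.random.default_rng` in place of the notebook's `L`, `set_seed` and `np.random.shuffle`, so it is an approximation of the approach and not the exported code.

```python
import numpy as np
import pandas as pd

# Toy metadata: 8 slices belonging to 5 series (made-up IDs)
df = pd.DataFrame({'SeriesInstanceUID': ['s1', 's1', 's2', 's3', 's3', 's3', 's4', 's5']})

# Shuffle the unique series and cut them into folds (3 here instead of 8)
rng = np.random.default_rng(42)
unique_ids = rng.permutation(df['SeriesInstanceUID'].unique().tolist())
grps = np.array_split(unique_ids, 3)

# Fold 0 is held out for validation; every other fold is used for training
valid_ids = grps[0]
train_ids = np.concatenate(grps[1:])

# Turn the group membership into row indices
idx = np.arange(len(df))
train_idx = idx[df['SeriesInstanceUID'].isin(train_ids).to_numpy()]
valid_idx = idx[df['SeriesInstanceUID'].isin(valid_ids).to_numpy()]
```

Every row ends up in exactly one of the two index arrays, and no series contributes rows to both.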

In [28]:
splits, grps = get_splits(df_comb)

In [29]:
fn_splits.save(splits)
fn_grps.save(grps)

## Create Any DF

A smaller subsample to train on: we remove every CT series whose images are all healthy.

In [30]:
if not fth_trn_comb_any.exists():
    column = 'SeriesInstanceUID'
    # Only the 'any' column is needed, so avoid summing the non-numeric metadata columns
    df_sum = df_comb.groupby(column)['any'].sum()
    any_ids = df_sum.loc[df_sum != 0].index.values
    df_any = df_comb.loc[df_comb[column].isin(any_ids)].reset_index()
    df_any.to_feather(fth_trn_comb_any)
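
As a small illustration of the filter above, using made-up series and labels: a series is kept if at least one of its slices has `any == 1`, and all of its slices are kept with it, healthy ones included.

```python
import pandas as pd

# Made-up slices: s1 has one positive slice, s2 is fully healthy, s3 is positive
df = pd.DataFrame({
    'SeriesInstanceUID': ['s1', 's1', 's2', 's2', 's3'],
    'any': [0, 1, 0, 0, 1],
})

# Number of positive slices per series
pos_per_series = df.groupby('SeriesInstanceUID')['any'].sum()

# Series with at least one positive slice
any_ids = pos_per_series[pos_per_series != 0].index

# Keep every slice of those series and drop the fully healthy ones
df_any = df[df['SeriesInstanceUID'].isin(any_ids)].reset_index(drop=True)
```

This is why the `any` rate in the subsample is higher than in the full training set but still well below 1.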

In [31]:
df_any = pd.read_feather(fth_trn_comb_any).set_index('SOPInstanceUID')

In [32]:
df_comb.shape, df_any.shape

((752802, 47), (300934, 47))

## Create Any Subsample - Remove completely healthy series

In [33]:
any_ids = set(df_any.SeriesInstanceUID.values)

In [34]:
# Make sure to use the same groups as splits_full
grps_any = [list(set(x)&any_ids) for x in grps]
[len(x) for x in grps_any], [len(x) for x in grps]

([1115, 1101, 1129, 1089, 1097, 1114, 1132, 1105],
 [2718, 2718, 2718, 2718, 2718, 2718, 2718, 2718])

In [35]:
fn_grps_any.save(grps_any)

In [36]:
splits_any = split_data(df_comb, 0, grps_any, 'SeriesInstanceUID')

In [37]:
splits_any

((#263179) [3,5,7,9,14,15,20,22,23,25...],
 (#37755) [34,101,119,127,134,139,183,203,211,213...])

In [38]:
# Fraction of positive labels per hemorrhage type in the training split
df_lbls.loc[df_comb.index.values[splits_any[0]]].sum() / len(splits_any[0])

any                 0.358965
epidural            0.010027
intraparenchymal    0.120177
intraventricular    0.087974
subarachnoid        0.118034
subdural            0.156958
dtype: float64

In [39]:
fn_splits_any.save(splits_any)

## Create stage1 split

In [40]:
df_comb1 = pd.read_feather(fth_df_comb1)
df_tst1 = pd.read_feather(fth_df_tst1)

In [55]:
df_comb.shape, df_comb1.shape, df_tst1.shape

((752802, 47), (674257, 52), (78545, 46))

In [44]:
# sc = set(df_comb1.SeriesInstanceUID)
# st = set(df_tst1.SeriesInstanceUID)
# sc & st

In [56]:
grps_stg1 = [df_tst1.SeriesInstanceUID.values, df_comb1.SeriesInstanceUID.values]
splits_stg1 = split_data(df_comb, 0, grps_stg1, 'SeriesInstanceUID')

In [57]:
any_ids = set(df_any.SeriesInstanceUID.values)
# Make sure to use the same groups as splits_full
grps_stg1_any = [list(set(x)&any_ids) for x in grps_stg1]
[len(x) for x in grps_stg1_any], [len(set(x)) for x in grps_stg1]

([879, 8003], [2214, 19530])

In [58]:
splits_stg1_any = split_data(df_comb, 0, grps_stg1_any, 'SeriesInstanceUID')

In [59]:
fn_splits_stg1_any.save(splits_stg1_any)

In [60]:
fn_splits_stg1.save(splits_stg1)

## Small sample

In [41]:
# Make sure to use the same groups as splits_full
grps_sample = [list(set(x)&any_ids)[:50] for x in grps]

In [42]:
splits_sample = split_data(df_comb, 0, grps_sample, 'SeriesInstanceUID')

In [43]:
fn_splits_sample.save(splits_sample)
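
One thing worth noting about the small sample: `list(set(...))[:50]` takes whichever 50 IDs happen to come first in the set's iteration order, which for strings can differ between Python processes because of hash randomization. If a reproducible sample matters, one option (a sketch, not what the notebook does) is to sort before slicing. The IDs and the helper `sample_groups` below are made up.

```python
# Made-up positive series IDs and fold groups
any_ids = {'s1', 's3', 's4', 's6'}
grps = [['s1', 's2', 's3'], ['s4', 's5', 's6']]

def sample_groups(grps, keep_ids, n):
    """Keep at most `n` IDs per fold, chosen deterministically (hypothetical helper)."""
    return [sorted(set(g) & keep_ids)[:n] for g in grps]

# Same result on every run, regardless of the hash seed
grps_sample = sample_groups(grps, any_ids, n=1)
```

Since the saved `splits_sample.pkl` is what later notebooks load, the ordering only matters when the file is regenerated.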

## Meta class

The `Meta` class loads the metadata lazily, on first access. Otherwise the module would take far too long to import.

In [61]:
#export
lazy_loaders = {
    'df_any': lambda: pd.read_feather(fth_trn_comb_any).set_index('SOPInstanceUID'),
    'df_labels': lambda: pd.read_feather(fth_lbl).set_index('ID'),
    'df_comb': lambda: pd.read_feather(fth_trn_comb).set_index('SOPInstanceUID'),
    'df_tst': lambda: pd.read_feather(fth_tst).set_index('SOPInstanceUID'),
    'df_comb1': lambda: pd.read_feather(fth_df_comb1).set_index('SOPInstanceUID'),
    'fns_trn': lambda: path_trn.ls(),
    'fns_tst': lambda: path_tst.ls(),
    'splits': lambda: fn_splits.load(),
    'grps': lambda: fn_grps.load(),
    'grps_any': lambda: fn_grps_any.load(),
    'splits_any': lambda: fn_splits_any.load(),
    'splits_sample': lambda: fn_splits_sample.load(),
    'splits_stg1': lambda: fn_splits_stg1.load(),
    'splits_stg1_any': lambda: fn_splits_stg1_any.load(),
}

class MetaType(type):
    def __dir__(self):
        return lazy_loaders.keys()
    def __getattr__(self, name: str):
        if name in self.__dict__: return self.__dict__[name]
        if name in lazy_loaders:
            setattr(self, name, lazy_loaders[name]())
            return self.__dict__[name]
        raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
        
class Meta(metaclass=MetaType): pass
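
The mechanism can be tried out without any data files. The sketch below mirrors the pattern with a made-up loader and a call counter (`Lazy`, `LazyType` and `_load_numbers` are hypothetical names), to show that a loader runs only once and that the result is then cached on the class.

```python
calls = {'n': 0}

def _load_numbers():
    """Stand-in for an expensive loader such as reading a feather file."""
    calls['n'] += 1
    return [1, 2, 3]

lazy_loaders = {'numbers': _load_numbers}

class LazyType(type):
    def __dir__(cls):
        return list(lazy_loaders)
    def __getattr__(cls, name):
        # Only reached when normal lookup fails, i.e. before the value is cached
        if name in lazy_loaders:
            value = lazy_loaders[name]()
            setattr(cls, name, value)   # cache on the class for later accesses
            return value
        raise AttributeError(f"{cls.__name__} has no attribute {name!r}")

class Lazy(metaclass=LazyType): pass

first = Lazy.numbers    # triggers the loader
second = Lazy.numbers   # served from the class __dict__, loader not called again
```

Because `__getattr__` is only consulted after the regular lookup fails, the second access never reaches the loader.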

In [40]:
Meta.df_any

SOPInstanceUID,Modality,PatientID,StudyInstanceUID,SeriesInstanceUID,StudyID,ImagePositionPatient,ImageOrientationPatient,SamplesPerPixel,PhotometricInterpretation,Rows,...,MultiWindowCenter,WindowCenter1,MultiWindowWidth,WindowWidth1,any,epidural,intraparenchymal,intraventricular,subarachnoid,subdural
ID_2a8702d25,CT,ID_ff137633,ID_d17053848c,ID_7098f7c836,,-126.437378,1.0,1,MONOCHROME2,512,...,,,,,1,0,1,1,0,0
ID_66891ac22,CT,ID_42940b2c,ID_17e33f43d0,ID_e14dd0090b,,-125.000000,1.0,1,MONOCHROME2,512,...,,,,,1,0,1,0,0,0
ID_8e6e5b51f,CT,ID_76fbed32,ID_1d8eaa14ef,ID_e3919709a0,,-125.000000,1.0,1,MONOCHROME2,512,...,,,,,1,0,0,0,0,1
ID_cb8b9b514,CT,ID_20039b63,ID_1cfe3e70dd,ID_e3b5d8d9b8,,-125.000000,1.0,1,MONOCHROME2,512,...,,,,,0,0,0,0,0,0
ID_ee683911f,CT,ID_a065f3ac,ID_9c727ac231,ID_99e83a310d,,-125.000000,1.0,1,MONOCHROME2,512,...,1.0,40.0,1.0,80.0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ID_57f8e8605,CT,ID_841d28eb,ID_201ea3e707,ID_0037be4db7,,-125.000000,1.0,1,MONOCHROME2,512,...,,,,,0,0,0,0,0,0
ID_9839c95cb,CT,ID_9aab6e0d,ID_099843577f,ID_1266274136,,-125.000000,1.0,1,MONOCHROME2,512,...,1.0,47.0,1.0,80.0,0,0,0,0,0,0
ID_fe2430e26,CT,ID_16796133,ID_17d9e9af2b,ID_ad0a925334,,-137.000000,1.0,1,MONOCHROME2,512,...,1.0,40.0,1.0,80.0,1,0,0,0,1,0
ID_8272aee56,CT,ID_229559fd,ID_174e8ae89d,ID_eddea12732,,-125.000000,1.0,1,MONOCHROME2,512,...,,,,,0,0,0,0,0,0


## Export

In [1]:
#hide
from nbdev.export import notebook2script
notebook2script()

Converted 00_metadata.ipynb.
Converted 00_metadata_split2.ipynb.
Converted 01_preprocess.ipynb.
Converted 01_preprocess_mean_std.ipynb.
Converted 02_train.ipynb.
Converted 03_train3d.ipynb.
Converted 03_train3d_01_train3d.ipynb.
Converted 03_train3d_01b_train_lstm.ipynb.
Converted 03_train3d_02_train_head.ipynb.
Converted 03_trainfull3d.ipynb.
Converted 04_trainSeq_01_lstm.ipynb.
Converted 04_trainSeq_02_transformer.ipynb.
Converted 04_trainSeq_03_lstm_seutao.ipynb.
Converted 05_train_adjacent.ipynb.
Converted 05_train_adjacent_01_5c.ipynb.
Converted 05_train_adjacent_02_3c.ipynb.
Converted 06_seutao_features.ipynb.
Converted 06_seutao_features_01_simple_lstm_20ep.ipynb.
Converted 06_seutao_features_01b_simple_lstm_10ep.ipynb.
Converted 06_seutao_features_01c_simple_lstm_meta.ipynb.
Converted 06_seutao_features_01d_simple_lstm_meta_fulldataset.ipynb.
Converted 06_seutao_features_02_2ndPlace.ipynb.
Converted 06_seutao_features_03_1stPlace.ipynb.
Converted 07_train_3d_lstm.ipynb.
Convert