It's really handy to have all the DICOM info available in a single DataFrame, so let's create that! In this notebook, we'll just create the DICOM DataFrames. To see how to use them to analyze the competition data, see [this followup notebook](https://www.kaggle.com/jhoward/some-dicom-gotchas-to-be-aware-of-fastai).

First, we need recent versions of PyTorch and fastai v2 (not officially released yet) so we can use the fastai medical imaging module. In this notebook they are pulled in through the project's shared imports.

In [1]:
#default_exp metadata

In [2]:
#export
from rsna_retro.imports import *

Let's take a look at what files we have in the dataset.

In [3]:
#export
set_num_threads(1)
path = Path('~/data/rsna').expanduser()
path_meta = path/'meta'

In [4]:
#export
dir_trn = 'stage_2_train'
dir_tst = 'stage_2_test'
fth_lbl = path_meta/'labels2.fth'
fth_trn = path_meta/'df_trn2.fth'
fth_tst = path_meta/'df_tst2.fth'
fth_trn_comb = path_meta/'df_trn2_comb.fth'

In [5]:
#export
fn_splits = path/'splits.pkl'
fn_splits_wgt = path/'splits_wgt.pkl'

In [6]:
#export
htypes = ['any','epidural','intraparenchymal','intraventricular','subarachnoid','subdural']

In [7]:
# Stage 1 training
# dir_trn = 'stage_1_train_images'
# dir_tst = 'stage_1_test_images'
# fth_lbl = path_meta/'labels.fth'
# fth_trn = path_meta/'df_trn.fth'
# fth_tst = path_meta/'df_tst.fth'

Most lists in fastai v2, including the one returned by `Path.ls`, are [fastai.core.L](http://dev.fast.ai/core.html#L) objects. `L` has lots of handy methods, such as `attrgot`, which can be used to grab an attribute (for example the file name) from every item.
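As a rough illustration of what `attrgot` does, here is a plain-Python sketch that collects the `name` attribute from every path in a directory listing. It uses only `pathlib` and a temporary directory with made-up file names, so it is an analogy for something like `path.ls().attrgot('name')`, not fastai's implementation.

```python
import tempfile
from pathlib import Path

# Hypothetical file names, created in a throwaway directory for illustration.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    for stem in ("ID_a", "ID_b", "ID_c"):
        (tmp_path / f"{stem}.dcm").touch()

    files = list(tmp_path.iterdir())       # roughly what `Path.ls` gives back
    names = sorted(f.name for f in files)  # roughly `attrgot('name')`, sorted for stable order

print(names)
```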

In [8]:
#export
path_trn = path/dir_trn
fns_trn = path_trn.ls()

path_tst = path/dir_tst
fns_tst = path_tst.ls()

In [9]:
len(fns_trn),len(fns_tst)

(752803, 121232)

We can grab a file and take a look inside using the `dcmread` method that fastai v2 adds.

# Labels

Before we pull the metadata out of the DICOM files, let's process the labels into a convenient format and save it for later. We'll use *feather* format because it's lightning fast!

In [10]:
#export
def save_lbls():
    path_lbls = path/f'{dir_trn}.csv'
    if fth_lbl.exists(): return
    lbls = pd.read_csv(path_lbls)
    lbls[["ID","htype"]] = lbls.ID.str.rsplit("_", n=1, expand=True)
    lbls.drop_duplicates(['ID','htype'], inplace=True)
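    # Keyword arguments: recent pandas versions no longer accept positional ones here
    pvt = lbls.pivot(index='ID', columns='htype', values='Label')
    pvt.reset_index(inplace=True)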
    pvt.to_feather(fth_lbl)
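
To make the reshaping in `save_lbls` concrete, here is a small sketch on a made-up label table (the IDs and labels below are invented for illustration). It splits the combined `ID` column on its last underscore, drops duplicate rows, and pivots to one row per image with one column per hemorrhage type. Saving to feather is left out so the sketch only needs pandas.

```python
import pandas as pd

# Toy stand-in for the competition CSV: one row per (image, hemorrhage type).
raw = pd.DataFrame({
    "ID": ["ID_abc_any", "ID_abc_epidural",
           "ID_def_any", "ID_def_epidural", "ID_def_epidural"],  # last row is a duplicate
    "Label": [1, 0, 0, 0, 0],
})

# "ID_abc_any" -> ID "ID_abc", htype "any"
raw[["ID", "htype"]] = raw["ID"].str.rsplit("_", n=1, expand=True)
raw = raw.drop_duplicates(["ID", "htype"])

# One row per image, one column per hemorrhage type.
wide = raw.pivot(index="ID", columns="htype", values="Label").reset_index()
print(wide)
```

Dropping duplicates first matters because `pivot` raises an error when the same (`ID`, `htype`) pair appears more than once.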

In [11]:
save_lbls()

In [12]:
df_lbls = pd.read_feather(fth_lbl).set_index('ID')

In [13]:
df_lbls.head(8)

ID,any,epidural,intraparenchymal,intraventricular,subarachnoid,subdural
ID_000012eaf,0,0,0,0,0,0
ID_000039fa0,0,0,0,0,0,0
ID_00005679d,0,0,0,0,0,0
ID_00008ce3c,0,0,0,0,0,0
ID_0000950d7,0,0,0,0,0,0
ID_0000aee4b,0,0,0,0,0,0
ID_0000ca2f6,0,0,0,0,0,0
ID_0000f1657,0,0,0,0,0,0


In [14]:
df_lbls.mean()

any                 0.143375
epidural            0.004178
intraparenchymal    0.047978
intraventricular    0.034810
subarachnoid        0.047390
subdural            0.062654
dtype: float64

# DICOM Meta

To turn the DICOM file metadata into a DataFrame, we can use the `from_dicoms` function that fastai v2 adds. Passing `px_summ=True` also adds summary statistics of the image pixels (mean/min/max/std) to the DataFrame. This takes much longer, since the image data has to be uncompressed.
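
For intuition about the pixel summary columns, the sketch below computes comparable statistics on a tiny synthetic array with NumPy. The helper name `summarize_pixels` is hypothetical, and the window (level 40, width 80) and the strict comparison at the window edges are assumptions. fastai's own computation may differ in details such as how the standard deviation is normalized.

```python
import numpy as np

def summarize_pixels(px, level=40, width=80):
    """Summary statistics for an array of pixel values (hypothetical helper)."""
    lo, hi = level - width / 2, level + width / 2
    in_window = (px > lo) & (px < hi)  # assumed: strict bounds
    return {
        "img_min": float(px.min()),
        "img_max": float(px.max()),
        "img_mean": float(px.mean()),
        "img_std": float(px.std()),  # population std (ddof=0)
        "img_pct_window": float(in_window.mean()),  # fraction of pixels inside the window
    }

# Synthetic 2x2 "image": two values inside the assumed window (0, 80), two outside.
px = np.array([[-1000, 10], [40, 100]])
stats = summarize_pixels(px)
print(stats)
```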

In [15]:
def process_metadata(fns, out_f, n_workers=12):
    if out_f.exists(): return
    # Pass the n_workers argument through instead of hard-coding 12
    df = pd.DataFrame.from_dicoms(fns, px_summ=True, window=dicom_windows.brain, n_workers=n_workers)
    df.to_feather(out_f)
    return df
    

In [16]:
process_metadata(fns_tst, fth_tst)

In [17]:
#export
df_tst = pd.read_feather(fth_tst).set_index('SOPInstanceUID')

In [18]:
df_tst.head()

SOPInstanceUID,Modality,PatientID,StudyInstanceUID,SeriesInstanceUID,StudyID,ImagePositionPatient,ImageOrientationPatient,SamplesPerPixel,PhotometricInterpretation,Rows,...,PixelSpacing1,MultiWindowCenter,WindowCenter1,MultiWindowWidth,WindowWidth1,img_min,img_max,img_mean,img_std,img_pct_window
ID_de973ed39,CT,ID_9a848de8,ID_636648ddc3,ID_3334944046,,-125.0,1.0,1,MONOCHROME2,512,...,0.488281,1.0,47.0,1.0,80.0,0,2766,458.317307,581.025693,0.238255
ID_124564e24,CT,ID_0f6dde7d,ID_5f6fc2a2cd,ID_7514a7fb7b,,-139.0,1.0,1,MONOCHROME2,512,...,0.488281,1.0,40.0,1.0,80.0,0,2441,481.852821,565.443686,0.252171
ID_15c8e85ea,CT,ID_00556373,ID_f5ef8bf6ea,ID_41bab3d33d,,-125.0,1.0,1,MONOCHROME2,512,...,0.488281,,,,,-2000,3299,19.316921,1172.887786,0.16597
ID_0e37b7ef5,CT,ID_ba67f475,ID_75e823b787,ID_cdbc8473f3,,-131.0,1.0,1,MONOCHROME2,512,...,0.488281,1.0,36.0,1.0,80.0,6,2566,416.983677,571.039439,0.232601
ID_25fab03f0,CT,ID_ebec6b48,ID_bb3711c18b,ID_7bbbbdb05f,,-119.5,1.0,1,MONOCHROME2,512,...,0.466797,1.0,36.0,1.0,80.0,4,2420,347.349949,566.757696,0.098629


In [19]:
process_metadata(fns_trn, fth_trn)

In [20]:
df_trn = pd.read_feather(fth_trn)

In [21]:
df_trn.head()

,SOPInstanceUID,Modality,PatientID,StudyInstanceUID,SeriesInstanceUID,StudyID,ImagePositionPatient,ImageOrientationPatient,SamplesPerPixel,PhotometricInterpretation,...,PixelSpacing1,img_min,img_max,img_mean,img_std,img_pct_window,MultiWindowCenter,WindowCenter1,MultiWindowWidth,WindowWidth1
0,ID_352e89f1c,CT,ID_d557ddd2,ID_05074a0d95,ID_be6165332c,,-125.0,1.0,1,MONOCHROME2,...,0.488281,-2000,2787,35.112926,1166.720843,0.164139,,,,
1,ID_3cf4fb50f,CT,ID_16b2ad86,ID_c3a404ea2e,ID_2c1454e208,,-125.0,1.0,1,MONOCHROME2,...,0.488281,0,2412,234.549896,392.132243,0.076015,1.0,36.0,1.0,80.0
2,ID_e3674b189,CT,ID_eb712bf0,ID_db83193795,ID_e1facea145,,-125.0,1.0,1,MONOCHROME2,...,0.488281,-2000,2749,50.59132,1216.541625,0.243259,,,,
3,ID_2a8702d25,CT,ID_ff137633,ID_d17053848c,ID_7098f7c836,,-126.437378,1.0,1,MONOCHROME2,...,0.494863,0,2806,482.248981,571.235614,0.241489,,,,
4,ID_7be0f1b3c,CT,ID_cd9169c2,ID_b42de79024,ID_f5bd86b25b,,-125.0,1.0,1,MONOCHROME2,...,0.488281,-2000,2776,10.762859,1164.588862,0.251751,,,,


In [22]:
if not fth_trn_comb.exists():
    df_comb = df_trn.join(df_lbls, 'SOPInstanceUID')
    df_comb.to_feather(fth_trn_comb)

In [23]:
#export
df_comb = pd.read_feather(fth_trn_comb).set_index('SOPInstanceUID')

In [24]:
df_comb.head()

SOPInstanceUID,Modality,PatientID,StudyInstanceUID,SeriesInstanceUID,StudyID,ImagePositionPatient,ImageOrientationPatient,SamplesPerPixel,PhotometricInterpretation,Rows,...,MultiWindowCenter,WindowCenter1,MultiWindowWidth,WindowWidth1,any,epidural,intraparenchymal,intraventricular,subarachnoid,subdural
ID_352e89f1c,CT,ID_d557ddd2,ID_05074a0d95,ID_be6165332c,,-125.0,1.0,1,MONOCHROME2,512,...,,,,,0,0,0,0,0,0
ID_3cf4fb50f,CT,ID_16b2ad86,ID_c3a404ea2e,ID_2c1454e208,,-125.0,1.0,1,MONOCHROME2,512,...,1.0,36.0,1.0,80.0,0,0,0,0,0,0
ID_e3674b189,CT,ID_eb712bf0,ID_db83193795,ID_e1facea145,,-125.0,1.0,1,MONOCHROME2,512,...,,,,,0,0,0,0,0,0
ID_2a8702d25,CT,ID_ff137633,ID_d17053848c,ID_7098f7c836,,-126.437378,1.0,1,MONOCHROME2,512,...,,,,,1,0,1,1,0,0
ID_7be0f1b3c,CT,ID_cd9169c2,ID_b42de79024,ID_f5bd86b25b,,-125.0,1.0,1,MONOCHROME2,512,...,,,,,0,0,0,0,0,0


## Split Patients by ID

In [25]:
# sops = set(Path('val_sops.pkl').load())

In [26]:
#export
def patient_cv(idx, patient_grps): return np.concatenate([patient_grps[o] for o in range_of(patient_grps) if o!=idx])

def split_data(df, cv_idx, patient_grps):
    idx = L.range(df)
    pgrp = patient_cv(cv_idx, patient_grps)
    mask = df.PatientID.isin(pgrp)
    return idx[mask],idx[~mask]

Split by unique Patient ID

In [27]:
#export
def get_splits(df_comb, nfold=8, ifold=0):
    set_seed(42)
    patients = df_comb.PatientID.unique()
    np.random.shuffle(patients)
    patient_grps = np.array_split(patients, nfold)
    return split_data(df_comb, ifold, patient_grps)
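
The point of splitting on `PatientID` rather than on individual images is that all slices from one patient end up on the same side of the split. Here is a minimal standalone sketch of that idea using only NumPy and pandas. The function name `patient_level_split`, the toy data, and the use of `np.random.default_rng` are illustrative choices, not the notebook's own code.

```python
import numpy as np
import pandas as pd

def patient_level_split(df, nfold=4, ifold=0, seed=42):
    """Return (train_idx, valid_idx) so that no patient is in both (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    patients = rng.permutation(df["PatientID"].unique())  # shuffled copy of the unique IDs
    groups = np.array_split(patients, nfold)              # nfold groups of patients
    is_valid = df["PatientID"].isin(groups[ifold]).to_numpy()
    idx = np.arange(len(df))
    return idx[~is_valid], idx[is_valid]

# Toy data: 8 patients with 3 images each.
df = pd.DataFrame({"PatientID": [f"p{i % 8}" for i in range(24)]})
train_idx, valid_idx = patient_level_split(df)

train_patients = set(df["PatientID"].iloc[train_idx])
valid_patients = set(df["PatientID"].iloc[valid_idx])
print(len(train_idx), len(valid_idx), train_patients & valid_patients)
```

The empty intersection at the end is the property that matters: no patient contributes images to both training and validation.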

In [28]:
splits = get_splits(df_comb)

In [29]:
fn_splits.save(splits)

In [30]:
#export
splits = fn_splits.load()

Weighted split by Patient ID: remove healthy patients (those with no positive labels)

In [40]:
#export
def get_splits_healthy(df_comb, nfold=8, ifold=0, remove_pct=1.0):
    df_sum = df_comb.groupby('PatientID').sum()
    patients = df_comb.PatientID.unique()
    patients_healthy = df_sum.loc[df_sum['any'] == 0].index.values
    # np.random.shuffle(patients_healthy)
    remove_to_idx = int(remove_pct * len(patients_healthy))
    print(f'Removing num healthy: {remove_to_idx}/{len(patients_healthy)}')
    patients_wgt = np.array(list(set(patients) - set(patients_healthy[:remove_to_idx])))
    
    patient_grps_wgt = np.array_split(patients_wgt, nfold)
    splits_wgt = split_data(df_comb, ifold, patient_grps_wgt)
    return splits_wgt
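
To see why dropping healthy patients raises the share of positive labels, consider this small sketch on invented data. A patient counts as healthy when the sum of `any` over all of their images is zero, mirroring the `df_sum['any'] == 0` test above. The variable names are illustrative.

```python
import pandas as pd

# Toy data: p1 and p3 have no positive images, p2 has one.
df = pd.DataFrame({
    "PatientID": ["p1", "p1", "p2", "p2", "p3"],
    "any":       [0,    0,    0,    1,    0],
})

any_per_patient = df.groupby("PatientID")["any"].sum()
healthy = any_per_patient[any_per_patient == 0].index  # patients with no positive image
kept = df[~df["PatientID"].isin(healthy)]              # drop every image of a healthy patient

rate_before = df["any"].mean()
rate_after = kept["any"].mean()
print(rate_before, rate_after)
```

Note that negative images from patients with at least one positive image are kept, so the remaining data still contains negatives.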

In [41]:
splits_wgt = get_splits_healthy(df_comb)

Removing num healthy: 11286/11286


In [43]:
# Fraction of positive labels per class in the weighted training split
df_lbls.loc[df_comb.index.values[splits_wgt[0]]].sum() / len(splits_wgt[0])

any                 0.340246
epidural            0.009882
intraparenchymal    0.114255
intraventricular    0.083112
subarachnoid        0.113125
subdural            0.146619
dtype: float64

In [44]:
fn_splits_wgt.save(splits_wgt)

In [46]:
#export
splits_wgt = fn_splits_wgt.load()

## Export

In [47]:
#hide
from nbdev.export import notebook2script
notebook2script()

Converted 01_data_01_metadata_stage2.ipynb.
Converted 01_data_02_preprocess_windows.ipynb.
Converted 02_train_01_train.ipynb.
Converted 04_orig_replace_ashaw_refactor.ipynb.
Converted 04_replace_ashaw_refactor.ipynb.
Converted 04b_orig_replace_ashaw_refactor.ipynb.
Converted 10_qure.ipynb.
Converted 12_merge.ipynb.
Converted 14_xgboost.ipynb.
Converted 16_slice_e2e-shallow.ipynb.
Converted 16b_orig_slice_e2e-shallow.ipynb.
This cell doesn't have an export destination and was ignored:
e
Converted 17_slice_model-deep.ipynb.
Converted 21_cleanup-nocrop2.ipynb.
Converted 26_submit_final.ipynb.
Converted 27_ensemble_tabular_nn.ipynb.
Converted 99_index.ipynb.
Converted cleanup-combine-qure.ipynb.
Converted delete_03b_cleanup-tif.ipynb.
Converted submit.ipynb.
Converted walkthru.ipynb.
Converted x00_tcia-ct-segm-prep.ipynb.
Converted x00_tcia-ct-segm-train.ipynb.
