It's really handy to have all the DICOM info available in a single DataFrame, so let's create that! In this notebook, we'll just create the DICOM DataFrames. To see how to use them to analyze the competition data, see [this followup notebook](https://www.kaggle.com/jhoward/some-dicom-gotchas-to-be-aware-of-fastai).

First, we'll install the latest versions of pytorch and fastai v2 (not officially released yet) so we can use the fastai medical imaging module.

In [1]:
from fastai2.basics import *
from fastai2.medical.imaging import *

Let's take a look at what files we have in the dataset.

In [2]:
path = Path('~/data/rsna').expanduser()
path_meta = path/'meta'

Most lists in fastai v2, including that returned by `Path.ls`, are returned as a [fastai.core.L](http://dev.fast.ai/core.html#L), which has lots of handy methods, such as `attrgot` used here to grab file names.

In [3]:
path_trn = path/'stage_1_train_images'
fns_trn = path_trn.ls()
fns_trn[:5].attrgot('name')

(#5) [ID_352e89f1c.dcm,ID_3cf4fb50f.dcm,ID_2a8702d25.dcm,ID_66891ac22.dcm,ID_54f412d54.dcm]
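
If `L` is new to you, `attrgot('name')` can be read as roughly a list comprehension that pulls one attribute from each item. Here is a minimal plain-Python sketch; the file names are made up for illustration and `fns` merely stands in for the result of `path_trn.ls()`:

```python
from pathlib import PurePosixPath

# Made-up paths standing in for the result of `path_trn.ls()`
fns = [PurePosixPath('stage_1_train_images') / f'ID_{i:03d}.dcm' for i in range(8)]

# Roughly what `fns[:5].attrgot('name')` does: read one attribute from each item
names = [fn.name for fn in fns[:5]]
print(names)
```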

In [6]:
path_tst = path/'stage_1_test_images'
fns_tst = path_tst.ls()
len(fns_trn),len(fns_tst)

We can grab a file and take a look inside using the `dcmread` method that fastai v2 adds.

In [7]:
fn = fns_trn[0]
dcm = fn.dcmread()
dcm

(0008, 0018) SOP Instance UID                    UI: ID_352e89f1c
(0008, 0060) Modality                            CS: 'CT'
(0010, 0020) Patient ID                          LO: 'ID_d557ddd2'
(0020, 000d) Study Instance UID                  UI: ID_05074a0d95
(0020, 000e) Series Instance UID                 UI: ID_be6165332c
(0020, 0010) Study ID                            SH: ''
(0020, 0032) Image Position (Patient)            DS: ['-125.000000', '-119.997978', '44.732330']
(0020, 0037) Image Orientation (Patient)         DS: ['1.000000', '0.000000', '0.000000', '0.000000', '0.927184', '-0.374607']
(0028, 0002) Samples per Pixel                   US: 1
(0028, 0004) Photometric Interpretation          CS: 'MONOCHROME2'
(0028, 0010) Rows                                US: 512
(0028, 0011) Columns                             US: 512
(0028, 0030) Pixel Spacing                       DS: ['0.488281', '0.488281']
(0028, 0100) Bits Allocated                      US: 16
(0028, 0101) Bits Stored 

# Labels

Before we pull the metadata out of the DICOM files, let's process the labels into a convenient format and save it for later. We'll use *feather* format because it's lightning fast!

In [6]:
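def save_lbls():
    path_lbls = path/'stage_1_train.csv'
    lbls = pd.read_csv(path_lbls)
    # Split e.g. 'ID_xxx_epidural' at the last underscore into image ID and hemorrhage type
    lbls[["ID","htype"]] = lbls.ID.str.rsplit("_", n=1, expand=True)
    # Drop repeated rows so each (ID, htype) pair is unique, as pivot requires
    lbls.drop_duplicates(['ID','htype'], inplace=True)
    # One row per image, one column per hemorrhage type (keyword args are required in recent pandas)
    pvt = lbls.pivot(index='ID', columns='htype', values='Label')
    pvt.reset_index(inplace=True)
    pvt.to_feather(path_meta/'labels.fth')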

In [7]:
save_lbls()
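
To make the reshaping concrete, here is a small plain-Python sketch of the same split-and-pivot logic. The rows are made up and only mimic the shape of the label CSV (a combined `ID` string plus a `Label`); it illustrates the idea and is not a replacement for the pandas code:

```python
# Made-up rows in the shape of the label CSV: (ID, Label)
rows = [
    ('ID_aaa_epidural', 0),
    ('ID_aaa_any', 0),
    ('ID_bbb_epidural', 1),
    ('ID_bbb_any', 1),
    ('ID_bbb_any', 1),  # a repeated row, like those the pandas code drops
]

pivot = {}
for raw_id, label in rows:
    # rsplit with maxsplit=1 splits only at the last underscore,
    # so the 'ID_' prefix stays attached to the image ID
    img_id, htype = raw_id.rsplit('_', 1)
    # Keep the first value seen for each (ID, htype) pair
    pivot.setdefault(img_id, {}).setdefault(htype, label)

for img_id, labels in pivot.items():
    print(img_id, labels)
```

Each key of `pivot` corresponds to one row of the pivoted DataFrame, and each inner key to one of its columns.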

In [8]:
df_lbls = pd.read_feather(path_meta/'labels.fth').set_index('ID')
df_lbls.head(8)

ID,any,epidural,intraparenchymal,intraventricular,subarachnoid,subdural
ID_000039fa0,0,0,0,0,0,0
ID_00005679d,0,0,0,0,0,0
ID_00008ce3c,0,0,0,0,0,0
ID_0000950d7,0,0,0,0,0,0
ID_0000aee4b,0,0,0,0,0,0
ID_0000f1657,0,0,0,0,0,0
ID_000178e76,0,0,0,0,0,0
ID_00019828f,0,0,0,0,0,0


In [9]:
df_lbls.mean()

any                 0.144015
epidural            0.004095
intraparenchymal    0.048296
intraventricular    0.035248
subarachnoid        0.047641
subdural            0.063026
dtype: float64

# DICOM Meta

To turn the DICOM file metadata into a DataFrame, we can use the `from_dicoms` function that fastai v2 adds. Passing `px_summ=True` also adds summary statistics of the image pixels (mean/min/max/std) to the DataFrame, although this takes much longer because the image data has to be decompressed.

In [8]:
df_tst = pd.DataFrame.from_dicoms(fns_tst, px_summ=True, window=dicom_windows.brain)

In [9]:
df_tst.to_feather('df_tst.fth')
df_tst.head()

Unnamed: 0,BitsAllocated,BitsStored,Columns,HighBit,ImageOrientationPatient,ImageOrientationPatient1,ImageOrientationPatient2,ImageOrientationPatient3,ImageOrientationPatient4,ImageOrientationPatient5,...,WindowCenter,WindowCenter1,WindowWidth,WindowWidth1,fname,img_max,img_mean,img_min,img_pct_window,img_std
0,16,16,512,15,1.0,0.0,0.0,0.0,0.948324,-0.317305,...,30.0,,80.0,,/home/jhoward/data/rsna/stage_1_test_images/ID_e3674b189.dcm,2749,50.59132,-2000,0.243259,1216.541625
1,16,16,512,15,1.0,0.0,0.0,0.0,0.976296,-0.21644,...,30.0,,80.0,,/home/jhoward/data/rsna/stage_1_test_images/ID_7be0f1b3c.dcm,2776,10.762859,-2000,0.251751,1164.588862
2,16,16,512,15,1.0,0.0,0.0,0.0,0.927184,-0.374607,...,30.0,,80.0,,/home/jhoward/data/rsna/stage_1_test_images/ID_8abc0dbe8.dcm,3249,48.297985,-2000,0.277439,1190.333214
3,16,16,512,15,1.0,0.0,0.0,0.0,0.927184,-0.374607,...,30.0,,80.0,,/home/jhoward/data/rsna/stage_1_test_images/ID_89444fe23.dcm,3002,74.667126,-2000,0.174026,1292.603296
4,16,16,512,15,1.0,0.0,0.0,0.0,0.882948,-0.469472,...,30.0,,80.0,,/home/jhoward/data/rsna/stage_1_test_images/ID_a5d5df4b7.dcm,2800,44.38295,-2000,0.26078,1191.494654
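
As a rough guide to reading the `img_*` columns above, the sketch below computes the same kind of summary for a flat list of pixel values. It is an illustration under stated assumptions, not fastai's implementation: `px_summary` is a hypothetical helper, the window is assumed to be a `(width, level)` pair, the brain window values of 80 and 40 are assumed, and the standard deviation used here is the population one, which may differ from what fastai computes.

```python
import statistics

def px_summary(pixels, window):
    """Summarize a flat list of pixel values; `window` is assumed to be (width, level)."""
    width, level = window
    lo, hi = level - width // 2, level + width // 2
    # Count pixels strictly inside the window
    in_window = sum(1 for p in pixels if lo < p < hi)
    return {
        'img_min': min(pixels),
        'img_max': max(pixels),
        'img_mean': statistics.fmean(pixels),
        'img_std': statistics.pstdev(pixels),  # population std, an assumption
        'img_pct_window': in_window / len(pixels),  # a fraction between 0 and 1
    }

brain = (80, 40)  # assumed (width, level) for a brain window
pixels = [-2000, 0, 30, 50, 70, 3000]  # made-up values
summary = px_summary(pixels, brain)
print(summary)
```

Read this way, `img_pct_window` is the fraction of pixels that fall inside the chosen window, which gives a quick sense of how much of an image is likely to be brain tissue.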


In [13]:
%time df_trn = pd.DataFrame.from_dicoms(fns_trn, px_summ=True, window=dicom_windows.brain)
df_trn.to_feather('df_trn.fth')

CPU times: user 1d 13h 10min 59s, sys: 2h 33min 6s, total: 1d 15h 44min 6s
Wall time: 1h 15min 4s


There is one corrupted DICOM in the competition data, so the command above prints out information about this file. Despite the error message, the command completes successfully, and the data from the corrupted file is not included in the output DataFrame.
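
The report-and-continue behaviour described here is a common pattern when processing many files. The sketch below is a generic illustration of it, not how fastai handles the failure internally; `read_all` and `fake_read` are hypothetical names and the file names are made up:

```python
def read_all(paths, read_one):
    """Apply `read_one` to each path, reporting and skipping any that fail."""
    results = []
    for p in paths:
        try:
            results.append(read_one(p))
        except Exception as e:
            # Report the problem file but keep going
            print(f'Skipping {p}: {e}')
    return results

def fake_read(p):
    # Stand-in reader that fails on one made-up file
    if 'bad' in p:
        raise ValueError('unreadable pixel data')
    return {'fname': p}

rows = read_all(['a.dcm', 'bad.dcm', 'c.dcm'], fake_read)
print(rows)
```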

In [18]:
df_trn.query('SOPInstanceUID=="ID_6431af929"')

Unnamed: 0,BitsAllocated,BitsStored,Columns,HighBit,ImageOrientationPatient,ImageOrientationPatient1,ImageOrientationPatient2,ImageOrientationPatient3,ImageOrientationPatient4,ImageOrientationPatient5,...,WindowCenter,WindowCenter1,WindowWidth,WindowWidth1,fname,img_max,img_mean,img_min,img_pct_window,img_std
218429,16,16,512,15,1.0,0.0,0.0,0.0,0.97237,-0.233445,...,30.0,,80.0,,/home/jhoward/data/rsna/stage_1_train_images/ID_6431af929.dcm,0,0.0,0,,0.0
