<h1><center><font size="6">Visualize CT DICOM Data</font></center></h1>

<img src="https://kaggle2.blob.core.windows.net/datasets-images/1012/1826/dcf783d7dd5f628dccb49111ade64649/dataset-card.jpg" width=400></img>  

# <a id='0'>Content</a>

- <a href='#1'>Introduction</a>  
- <a href='#2'>Load packages</a>  
- <a href='#3'>Read the data</a> 
    - <a href='#31'>Read overview data</a> 
    - <a href='#32'>Read TIFF data</a> 
    - <a href='#33'>Read DICOM data</a>  
- <a href='#4'>Data exploration</a>
    - <a href='#41'>Check data consistency</a> 
    - <a href='#42'>Show TIFF images</a> 
    - <a href='#43'>Show DICOM data</a> 
- <a href='#7'>Conclusions</a>
- <a href='#8'>References</a>

# <a id="1">Introduction</a>

## Overview  

The dataset is designed for testing different methods of examining trends in CT image data associated with the use of contrast and with patient age. The basic idea is to identify image textures, statistical patterns, and features that correlate strongly with these traits, and possibly to build simple tools that automatically classify these images when they have been misclassified (or that find outliers, which could be suspicious cases, bad measurements, or poorly calibrated machines).

## Data
The data are a tiny subset of images from The Cancer Imaging Archive (TCIA). They consist of the middle slice of all CT images for which valid age, modality, and contrast tags could be found. TCIA archive link: https://wiki.cancerimagingarchive.net/display/Public/TCGA-LUAD

# <a id="2">Load packages</a>

We load the packages needed to show TIFF images and DICOM data.   
For DICOM data, we load the **dicom** package (the import name used by pydicom before version 1.0) and the **dicom_numpy** package.   


In [1]:
import numpy as np
import pandas as pd
from skimage.io import imread
import seaborn as sns
import matplotlib.pyplot as plt
from glob import glob
import dicom #for processing dicom files
import dicom_numpy  #for processing dicom files

import os
PATH="../input/"
print(os.listdir(PATH))

['tiff_images', 'overview.csv', 'dicom_dir', 'full_archive.npz']


This code is using an older version of pydicom, which is no longer 
maintained as of Jan 2017.  You can access the new pydicom features and API 
by installing `pydicom` from PyPI.
See 'Transitioning to pydicom 1.x' section at pydicom.readthedocs.org 
for more information.
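
The warning above points to the newer `pydicom` package. As a minimal sketch (not part of the original notebook), the import could be made tolerant of either package. The helper name `load_dicom_module` is hypothetical; the sketch assumes only that `pydicom` 1.x and later exposes `dcmread()`, while the legacy `dicom` package exposes `read_file()`.

```python
import importlib


def load_dicom_module():
    """Return the first importable DICOM module, preferring modern pydicom."""
    for name in ("pydicom", "dicom"):
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None  # neither package is installed


dicom_module = load_dicom_module()

# Pick the reader function that the imported package provides
reader_name = None
if dicom_module is not None:
    reader_name = "dcmread" if hasattr(dicom_module, "dcmread") else "read_file"

print(dicom_module.__name__ if dicom_module else "no DICOM package installed", reader_name)
```

With this in place, a file could be read through `getattr(dicom_module, reader_name)(path)` regardless of which package is installed.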



# <a id="3">Read the data</a>


## <a id="31">Read overview data</a>

In [2]:
data_df = pd.read_csv(os.path.join(PATH, "overview.csv"))

In [3]:
print("CT Medical images -  rows:",data_df.shape[0]," columns:", data_df.shape[1])

CT Medical images -  rows: 100  columns: 8


In [4]:
data_df.head()

| Unnamed: 0.1 | Unnamed: 0 | Age | Contrast | ContrastTag | raw_input_path | id | tiff_name | dicom_name |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 60 | True | NONE | ../data/50_50_dicom_cases\Contrast\00001 (1).dcm | 0 | ID_0000_AGE_0060_CONTRAST_1_CT.tif | ID_0000_AGE_0060_CONTRAST_1_CT.dcm |
| 1 | 1 | 69 | True | NONE | ../data/50_50_dicom_cases\Contrast\00001 (10).dcm | 1 | ID_0001_AGE_0069_CONTRAST_1_CT.tif | ID_0001_AGE_0069_CONTRAST_1_CT.dcm |
| 2 | 2 | 74 | True | APPLIED | ../data/50_50_dicom_cases\Contrast\00001 (11).dcm | 2 | ID_0002_AGE_0074_CONTRAST_1_CT.tif | ID_0002_AGE_0074_CONTRAST_1_CT.dcm |
| 3 | 3 | 75 | True | NONE | ../data/50_50_dicom_cases\Contrast\00001 (12).dcm | 3 | ID_0003_AGE_0075_CONTRAST_1_CT.tif | ID_0003_AGE_0075_CONTRAST_1_CT.dcm |
| 4 | 4 | 56 | True | NONE | ../data/50_50_dicom_cases\Contrast\00001 (13).dcm | 4 | ID_0004_AGE_0056_CONTRAST_1_CT.tif | ID_0004_AGE_0056_CONTRAST_1_CT.dcm |
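
The rows above suggest that `tiff_name` and `dicom_name` are derived from the `id`, `Age`, and `Contrast` columns. The sketch below rebuilds a file name from those three values. The pattern is inferred from the five rows shown, not from any dataset documentation, and `expected_file_name` is a hypothetical helper.

```python
def expected_file_name(case_id, age, contrast, extension):
    """Build the file name implied by one overview row (pattern inferred from the sample rows)."""
    return "ID_{:04d}_AGE_{:04d}_CONTRAST_{:d}_CT.{}".format(
        case_id, age, int(contrast), extension
    )


# First row of the overview table: id=0, Age=60, Contrast=True
print(expected_file_name(0, 60, True, "tif"))  # ID_0000_AGE_0060_CONTRAST_1_CT.tif
print(expected_file_name(0, 60, True, "dcm"))  # ID_0000_AGE_0060_CONTRAST_1_CT.dcm
```

Comparing such rebuilt names against the `tiff_name` and `dicom_name` columns would be one way to spot rows that do not follow the convention.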


## <a id="32">Read TIFF data</a>

In [None]:
print("Number of TIFF images:", len(os.listdir("../input/tiff_images")))

In [None]:
tiff_data = pd.DataFrame([{'path': filepath} for filepath in glob('../input/tiff_images/*.tif')])

### Process TIFF data

In [None]:
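def process_data(path):
    # File names follow the pattern ID_<id>_AGE_<age>_CONTRAST_<0|1>_<modality>.<ext>
    data = pd.DataFrame([{'path': filepath} for filepath in glob(PATH+path)])
    data['file'] = data['path'].map(os.path.basename)
    # Split the file name on '_' and pick each field by position
    data['ID'] = data['file'].map(lambda x: str(x.split('_')[1]))
    data['Age'] = data['file'].map(lambda x: int(x.split('_')[3]))
    data['Contrast'] = data['file'].map(lambda x: bool(int(x.split('_')[5])))
    # The last token is '<modality>.<ext>'; keep only the part before the extension
    data['Modality'] = data['file'].map(lambda x: str(x.split('_')[6].split('.')[-2]))
    return data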

In [None]:
tiff_data = process_data('tiff_images/*.tif')
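
The positional `split('_')` parsing above silently assumes that every file name follows the same pattern. As a hedged alternative sketch, a regular expression can extract the same four fields and fail loudly on an unexpected name. `FILENAME_PATTERN` and `parse_filename` are hypothetical names, and the pattern is inferred from the file names listed in overview.csv.

```python
import re

# Pattern inferred from names such as ID_0000_AGE_0060_CONTRAST_1_CT.tif
FILENAME_PATTERN = re.compile(
    r"^ID_(?P<ID>\d+)_AGE_(?P<Age>\d+)_CONTRAST_(?P<Contrast>[01])_(?P<Modality>[A-Za-z]+)\.\w+$"
)


def parse_filename(file_name):
    """Return the ID, Age, Contrast and Modality fields encoded in a file name."""
    match = FILENAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError("unexpected file name: {!r}".format(file_name))
    fields = match.groupdict()
    return {
        "ID": fields["ID"],                    # kept as a zero-padded string
        "Age": int(fields["Age"]),             # '0060' becomes 60
        "Contrast": fields["Contrast"] == "1", # '1' becomes True
        "Modality": fields["Modality"],
    }


print(parse_filename("ID_0000_AGE_0060_CONTRAST_1_CT.tif"))
# {'ID': '0000', 'Age': 60, 'Contrast': True, 'Modality': 'CT'}
```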

### Check TIFF data

Let's check the TIFF data.

In [None]:
tiff_data.head(10)

## <a id="33">Read DICOM data</a>

In [None]:
print("Number of DICOM files:", len(os.listdir(PATH+"dicom_dir")))

### Process DICOM data

In [None]:
dicom_data = process_data('dicom_dir/*.dcm')

### Check DICOM data

In [None]:
dicom_data.head(10)

# <a id="4">Data exploration</a>

## <a id="41">Check data consistency</a>

Let's verify that the content of overview.csv is consistent with the data in the tiff_images and dicom_dir folders.

In [None]:
def countplot_comparison(feature):
    # One panel per data source; the column is passed explicitly as x
    fig, (ax1, ax2, ax3) = plt.subplots(1,3, figsize = (16, 4))
    s1 = sns.countplot(x=data_df[feature], ax=ax1)
    s1.set_title("Overview data")
    s2 = sns.countplot(x=tiff_data[feature], ax=ax2)
    s2.set_title("Tiff files data")
    s3 = sns.countplot(x=dicom_data[feature], ax=ax3)
    s3.set_title("Dicom files data")

In [None]:
countplot_comparison('Contrast')

In [None]:
countplot_comparison('Age')

The values in the 3 data sources are consistent.
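
The visual comparison can be complemented by a programmatic check. The sketch below compares value counts across several columns using only the standard library. `same_distribution` is a hypothetical helper, and the toy lists reuse the five ages from the overview sample rather than the full data.

```python
from collections import Counter


def same_distribution(*columns):
    """Return True when all columns contain the same values with the same counts."""
    counts = [Counter(column) for column in columns]
    return all(count == counts[0] for count in counts[1:])


# Toy data: the same five ages in three different orders
overview_ages = [60, 69, 74, 75, 56]
tiff_ages = [56, 60, 69, 74, 75]
dicom_ages = [75, 74, 69, 60, 56]

print(same_distribution(overview_ages, tiff_ages, dicom_ages))  # True
print(same_distribution(overview_ages, [60, 60, 74, 75, 56]))   # False
```

In the notebook, a call such as `same_distribution(data_df['Age'], tiff_data['Age'], dicom_data['Age'])` should apply the same check to the real columns, since iterating a pandas Series yields its values.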

## <a id="42">Show TIFF images</a>

We show a subsample of 16 of the 100 images, selecting the first 16 in the data set.   
We use a grayscale colormap.   
We define a generic function that can display both TIFF and DICOM images.

In [None]:
def show_images(data, dim=16, imtype='TIFF'):
    # The 4x4 grid holds at most 16 images, so dim should not exceed 16
    img_data = list(data[:dim].T.to_dict().values())
    f, ax = plt.subplots(4,4, figsize=(16,20))
    for i,data_row in enumerate(img_data):
        if(imtype=='TIFF'):
            # TIFF: imread returns the pixel array directly
            data_row_img = imread(data_row['path'])
            ax[i//4, i%4].matshow(data_row_img,cmap='gray')
        elif(imtype=='DICOM'):
            # DICOM: read the dataset, then plot its pixel array
            data_row_img = dicom.read_file(data_row['path'])
            ax[i//4, i%4].imshow(data_row_img.pixel_array, cmap=plt.cm.bone)
        ax[i//4, i%4].axis('off')
        ax[i//4, i%4].set_title('Modality: {Modality} Age: {Age}\nSlice: {ID} Contrast: {Contrast}'.format(**data_row))
    plt.show()


We apply the function to show TIFF images.

In [None]:
show_images(tiff_data,16,'TIFF')
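
The function places image `i` at row `i//4` and column `i%4` of a fixed 4x4 grid. The sketch below shows the same index arithmetic in isolation and how the number of rows could be derived for other image counts. `grid_shape` and `grid_position` are hypothetical helpers, not part of the notebook.

```python
import math


def grid_shape(n_images, n_cols=4):
    """Return (rows, cols) of the smallest grid with n_cols columns that fits n_images."""
    return math.ceil(n_images / n_cols), n_cols


def grid_position(index, n_cols=4):
    """Return the (row, col) cell of the image at a given index, filling row by row."""
    return divmod(index, n_cols)  # same as (index // n_cols, index % n_cols)


print(grid_shape(16))     # (4, 4)
print(grid_shape(10))     # (3, 4)
print(grid_position(5))   # (1, 1)
print(grid_position(15))  # (3, 3)
```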

## <a id="43">Show DICOM data</a>

We show a subsample of 16 of the 100 images.   
We use a grayscale-like colormap (`bone`).   
Ideally, if the DICOM images were a set of slices from a single examination, they could be aggregated using a function like `extract_voxel_data`, shown here, which reads the DICOM slices (each in a separate file) and aggregates the image data into a 3D voxel tensor. This is not the case here, because the dataset stores slices from different patients and exams (one slice per exam per patient).

    # extract voxel data
    def extract_voxel_data(list_of_dicom_files):
        datasets = [dicom.read_file(f) for f in list_of_dicom_files]
        try:
            voxel_ndarray, ijk_to_xyz = dicom_numpy.combine_slices(datasets)
        except dicom_numpy.DicomImportException as e:
            # invalid DICOM data
            raise
        return voxel_ndarray
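
To illustrate the idea of aggregating slices, here is a dependency-free sketch that orders 2D slices by position and stacks them into a 3D structure. It is not how `dicom_numpy.combine_slices` is implemented. Nested lists stand in for NumPy arrays, and `stack_slices`, `z_position`, and the toy pixel values are all hypothetical.

```python
def stack_slices(slices):
    """Order 2D slices by position and return them as one 3D nested list.

    `slices` is a list of (z_position, pixel_rows) pairs, standing in for
    values that would come from the DICOM headers and pixel data.
    """
    ordered = sorted(slices, key=lambda item: item[0])
    # All slices must share one shape, otherwise they cannot form a volume
    shapes = {(len(rows), len(rows[0])) for _, rows in ordered}
    if len(shapes) > 1:
        raise ValueError("slices have different shapes: {}".format(sorted(shapes)))
    return [rows for _, rows in ordered]


# Three toy 2x2 slices supplied out of order
toy_slices = [
    (2.5, [[5, 6], [7, 8]]),
    (0.0, [[1, 2], [3, 4]]),
    (5.0, [[9, 10], [11, 12]]),
]
volume = stack_slices(toy_slices)
print(len(volume), len(volume[0]), len(volume[0][0]))  # 3 2 2
print(volume[0])                                       # [[1, 2], [3, 4]]
```

Stacking only makes sense when the slices belong to one series, which is why it does not apply to this dataset.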

In [None]:
show_images(dicom_data,16,'DICOM')

### More about DICOM data

A DICOM file contains much more information than the image we displayed. Let's take a look at this information for one of the DICOM files. We read only the first DICOM file and show its content.

In [None]:
dicom_file_path = dicom_data['path'].iloc[0]
dicom_file_dataset = dicom.read_file(dicom_file_path)
dicom_file_dataset

We can extract various fields from the DICOM FileDataset. Here are a few examples:  
* Modality  
* Manufacturer
* Patient Age  
* Patient Sex
* Patient Name  
* Patient ID



In [None]:
print("Modality: {}\nManufacturer: {}\nPatient Age: {}\nPatient Sex: {}\nPatient Name: {}\nPatient ID: {}".format(
    dicom_file_dataset.Modality, 
    dicom_file_dataset.Manufacturer,
    dicom_file_dataset.PatientAge,
    dicom_file_dataset.PatientSex,
    dicom_file_dataset.PatientName,
    dicom_file_dataset.PatientID))

Some of the information is anonymized (like Name and ID), which is common practice for public medical data.   
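
Anonymized or incomplete files may also lack some tags entirely, in which case attribute access on a pydicom dataset raises `AttributeError`. The sketch below reads fields with a fallback value. `describe` is a hypothetical helper, and a `SimpleNamespace` with made-up values stands in for a real dataset so that the example runs without a DICOM file.

```python
from types import SimpleNamespace


def describe(dataset, fields=("Modality", "Manufacturer", "PatientAge", "PatientSex")):
    """Return {field: value}, substituting 'N/A' for attributes that are absent."""
    return {field: getattr(dataset, field, "N/A") for field in fields}


# Hypothetical stand-in for a DICOM dataset that lacks the PatientSex tag
fake_dataset = SimpleNamespace(Modality="CT", Manufacturer="ACME", PatientAge="060Y")
print(describe(fake_dataset))
# {'Modality': 'CT', 'Manufacturer': 'ACME', 'PatientAge': '060Y', 'PatientSex': 'N/A'}
```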

We modify the visualization function to show the Modality and Patient Age read from the DICOM data instead of the values extracted from the file name; ID and Contrast still come from the file name.  



In [None]:
def show_dicom_images(data):
    img_data = list(data[:16].T.to_dict().values())
    f, ax = plt.subplots(4,4, figsize=(16,20))
    for i,data_row in enumerate(img_data):

        data_row_img = dicom.read_file(data_row['path'])
        modality = data_row_img.Modality
        age = data_row_img.PatientAge
        
        ax[i//4, i%4].imshow(data_row_img.pixel_array, cmap=plt.cm.bone) 
        ax[i//4, i%4].axis('off')
        ax[i//4, i%4].set_title('Modality: {} Age: {}\nSlice: {} Contrast: {}'.format(
         modality, age, data_row['ID'], data_row['Contrast']))
    plt.show()


In [None]:
show_dicom_images(dicom_data)
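
`PatientAge` uses the DICOM Age String format: three digits followed by a unit letter (`D`, `W`, `M`, or `Y`), so a value typically looks like `060Y` rather than the integer parsed from the file name. The sketch below converts such a string to years so the two sources can be compared. `age_in_years` is a hypothetical helper, and the day and week conversions are approximations.

```python
import re

# Number of units per year; days and weeks use an average year of 365.25 days
_UNITS_PER_YEAR = {"D": 365.25, "W": 365.25 / 7, "M": 12, "Y": 1}


def age_in_years(age_string):
    """Convert a DICOM Age String such as '060Y' or '018M' to years."""
    match = re.fullmatch(r"(\d{3})([DWMY])", age_string.strip())
    if match is None:
        raise ValueError("not a DICOM age string: {!r}".format(age_string))
    number, unit = match.groups()
    return int(number) / _UNITS_PER_YEAR[unit]


print(age_in_years("060Y"))  # 60.0
print(age_in_years("018M"))  # 1.5
```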

# <a id="7">Conclusions</a>

We demonstrated how to load and show **TIFF** images.  
Using the **dicom** package, we also demonstrated how to read and visualize **DICOM** data, and we outlined how **dicom-numpy** could combine the slices of a single exam.  
Finally, we took a preliminary look at the content of a DICOM file and modified the visualization function to use DICOM data for some of the image attributes.


# <a id="8">References</a>

[1] <a href="https://www.kaggle.com/kmader">Kevin Mader</a>,  <a href="https://www.kaggle.com/kmader/show-the-data-in-the-zip-file">Show the data in the Zip File</a>    
[2] <a href="https://www.kaggle.com/byrachonok">Vitaly Byrachonok</a>,  <a href="https://www.kaggle.com/byrachonok/study-ct-medical-images">Study CT Medical Images</a>    
[3] Python package for processing DICOM data, dicom-numpy, https://dicom-numpy.readthedocs.io     
[4] Viewing DICOM images in Python, https://pydicom.github.io/pydicom/stable/viewing_images.html   


