# Module 4 Exercise 1 - DICOM data processing

## Overview

Before conducting a study of medical images, it is useful to verify some basic information about them: how and when they were captured, and on what equipment and modality.  Discrepancies between the studies and series of the images should also be uncovered.  The purpose of this exercise is to extract the information needed to check that the series of images for each patient match one another.

## File Format
The files are the DICOM images used in the lab and practice for DICOM processing.  They are organized as follows:

```
Brain-Tumor-Progression
  |_ PGBM-001 (the patient identifier in the study)
    |_ 11-19-1991-FH-HEADBrain Protocols-40993 (folder for the scan date)
        |_ 11.000000-T1post-03326 (folder for the scan series)
            |_ 1-01.dcm (sequential scan file)
            ...
            |_ 1-25.dcm (sequential scan file)
    |_ 04-02-1992-FH-HEADBrain Protocols-79896 (folder for the scan date)
        |_ 11.000000-T1post-80644 (folder for the scan series)
            |_ 1-01.dcm (sequential scan file)
            ...
            |_ 1-24.dcm (sequential scan file)
  |_ PGBM-002
    |_ <scan date 1>
        |_ <series>
            |_ 1-01.dcm
            ...
            |_ 1-xx.dcm (the number of images in each series varies)
    |_ <scan date 2>
        |_ <series>
            |_ 1-01.dcm
            ...
            |_ 1-xx.dcm (the number of images in each series varies)
  ...
  |_ PGBM-005 (there are 5 patients)
    |_ <scan date 1>
        |_ <series>
            |_ 1-01.dcm
            ...
            |_ 1-xx.dcm (the number of images in each series varies)
    |_ <scan date 2>
        |_ <series>
            |_ 1-01.dcm
            ...
            |_ 1-xx.dcm (the number of images in each series varies)
  
```
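
Because the patient, scan date, and series are encoded in the folder names, they can be recovered from a file path alone, which can be handy for cross-checking against the values stored in the DICOM header. The following is a minimal sketch that assumes the four-level layout shown above; the helper name `parse_scan_path` and the dictionary keys are hypothetical and not part of the exercise.

```python
from pathlib import PurePosixPath


def parse_scan_path(path):
    """Split a path of the form .../<patient>/<scan date>/<series>/<file>.dcm.

    Hypothetical helper: it assumes the folder layout shown above.
    """
    parts = PurePosixPath(path).parts
    if len(parts) < 4:
        raise ValueError(f"path too short for the expected layout: {path}")
    patient, study, series, filename = parts[-4:]
    return {
        "patient_folder": patient,
        "study_folder": study,
        "series_folder": series,
        "file_name": filename,
    }


example = ("Brain-Tumor-Progression/PGBM-001/"
           "11-19-1991-FH-HEADBrain Protocols-40993/"
           "11.000000-T1post-03326/1-01.dcm")
info = parse_scan_path(example)
print(info["patient_folder"], "|", info["series_folder"])
```

Only the last four path components are used, so the sketch does not depend on where the `Brain-Tumor-Progression` folder itself is stored.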

There are five subjects, two scan dates per subject, and multiple series for each scan date.  You will loop over each DICOM image in the `Brain-Tumor-Progression` folder structure and load the image as well as the following metadata from the DICOM file into a pandas dataframe:

```
Patient ID 
Body Part Examined
Study Date                         
Series Date                        
Content Date                                         
Accession Number                   
Modality                           
Manufacturer                       
Study Description                  
Series Description                 
Manufacturer's Model Name          
Slice Location
Rows
Columns
Photometric Interpretation
Bits Allocated
```
Reference the DICOM documentation and resources below or prior labs/practices to find the necessary tag identifiers for these values.

Reference material:
* The official specification, which includes the tags and keywords: http://dicom.nema.org/Dicom/2011/11_06pu.pdf
* An online source for searching by tag: https://www.dicomlibrary.com/dicom/dicom-tags/
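
DICOM tags are usually written as a (group, element) pair of four-digit hexadecimal numbers, while code often packs both halves into a single integer such as `0x00100020`. The sketch below converts between the two forms so that a value from code can be pasted into a tag search. The helper names `split_tag` and `format_tag` are hypothetical; the sample tags are ones used by the loading code in this notebook.

```python
def split_tag(tag):
    """Return the (group, element) halves of a packed 32-bit DICOM tag."""
    return (tag >> 16) & 0xFFFF, tag & 0xFFFF


def format_tag(tag):
    """Render a packed tag in the conventional '(gggg,eeee)' form."""
    group, element = split_tag(tag)
    return f"({group:04X},{element:04X})"


# Packed tags as they appear in the loading code of this notebook.
TAGS = {
    "Patient ID": 0x00100020,
    "Modality": 0x00080060,
    "Series Description": 0x0008103E,
    "Rows": 0x00280010,
}

for name, tag in TAGS.items():
    print(f"{name:20s} {format_tag(tag)}")
```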

## Required Output
Along with the output of this notebook, you will respond to the questions located in the Quiz for this exercise in the Canvas site for this course.
        
## Grading
There are two parts to submission of this exercise. The first is submission of this notebook, and is worth 20 points. Not submitting code will result in a loss of 20 points. Submitting code that is not functional will result in a loss of 10 points. Code functionality is tested by running your submitted notebook from a restarted kernel and checking that it completes. 

The second part of the exercise is submission of the answers via the associated Canvas quiz. Each correct answer on the Canvas Quiz is worth 2 points.

Any numeric answer typed into Canvas will be considered correct if it is within $\pm$1% of the reference answer.  Answers in which you select a given choice will be graded based on the identified correct choice(s).  For multi-select questions, partial credit is given if a portion of the correct answers is selected.



In [1]:
import sys
!{sys.executable} -m pip install pydicom scikit-image
import pydicom
import skimage
import numpy as np
import pandas as pd


Collecting pydicom
  Downloading https://files.pythonhosted.org/packages/53/9a/98df4fb41e7905b587be2ee9ce38bab8a092990bd174f46fd915a23ec0ea/pydicom-2.2.2-py3-none-any.whl (2.0MB)
     |████████████████████████████████| 2.0MB 3.4MB/s eta 0:00:01
Installing collected packages: pydicom
Successfully installed pydicom-2.2.2


## Load the DICOM images and metadata into a pandas dataframe
Loop over all DICOM images under the `Brain-Tumor-Progression` directory structure and extract the required metadata and the image object itself into a pandas dataframe.

Convert the different metadata tags to the appropriate pandas datatypes.

This dataframe will be used for the remainder of the exercise.
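
DICOM stores dates as `YYYYMMDD` strings, and numeric values may arrive as text, so converting explicitly keeps later comparisons and grouping reliable. The following is a small sketch of one way to do the conversion, using a few hand-typed rows rather than values read from files; the frame names `raw` and `typed` are illustrative.

```python
import pandas as pd

# Hand-typed raw values in the form they might come out of a DICOM header.
raw = pd.DataFrame({
    "study_date": ["19911119", "19920402"],
    "slice_loc": ["-74.445322", "58.873349"],
    "rows": ["320", "320"],
    "modality": ["MR", "MR"],
})

typed = raw.assign(
    # DICOM dates (VR "DA") use the YYYYMMDD layout.
    study_date=pd.to_datetime(raw["study_date"], format="%Y%m%d"),
    slice_loc=pd.to_numeric(raw["slice_loc"]),
    rows=raw["rows"].astype("int64"),
    # Low-cardinality text columns can be stored as categories.
    modality=raw["modality"].astype("category"),
)
print(typed.dtypes)
```

Passing an explicit `format` to `pd.to_datetime` makes a malformed date raise an error instead of being guessed at.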

In [2]:
# your code here

from glob import glob

# there are a number of ways to build a pandas dataframe from scratch.  This method builds a list of lists,
# then converts that into a dataframe.
list_data = []
for filename in glob('../resources/**/*.dcm', recursive=True):
    img = pydicom.dcmread(filename)
    list_data.append([
        img[0x00100020].value, # patient id
        img[0x00180015].value, # body part examined
        pd.to_datetime(img[0x00080020].value, format='%Y%m%d'), # study date
        pd.to_datetime(img[0x00080021].value, format='%Y%m%d'), # series date
        pd.to_datetime(img[0x00080023].value, format='%Y%m%d'), # content date
        int(img[0x00080050].value), # accession number
        img[0x00080060].value, # modality
        img[0x00080070].value, # manufacturer
        img[0x00081030].value, # study description
        img[0x0008103E].value, # series description
        img[0x00081090].value, # manufacturer's model name
        img[0x00201041].value, # slice location
        img[0x00280010].value, # rows
        img[0x00280011].value, # columns
        img[0x00280004].value, # photometric interpretation
        img[0x00280100].value, # bits allocated
    ])
    
data = pd.DataFrame(list_data, 
                    columns=['patient_id', 'body_part_exmn', 'study_date', 'series_date', 'content_date', 'accession_nbr',
                            'modality', 'manufctr', 'study_desc', 'series_desc', 'manufctr_model_nm', 'slice_loc',
                            'rows', 'columns', 'photomtrc_intrpt', 'bits_alloc'])

print(data.dtypes)
display(data)

patient_id                   object
body_part_exmn               object
study_date           datetime64[ns]
series_date          datetime64[ns]
content_date         datetime64[ns]
accession_nbr                 int64
modality                     object
manufctr                     object
study_desc                   object
series_desc                  object
manufctr_model_nm            object
slice_loc                   float64
rows                          int64
columns                       int64
photomtrc_intrpt             object
bits_alloc                    int64
dtype: object


Unnamed: 0,patient_id,body_part_exmn,study_date,series_date,content_date,accession_nbr,modality,manufctr,study_desc,series_desc,manufctr_model_nm,slice_loc,rows,columns,photomtrc_intrpt,bits_alloc
0,PGBM-001,BRAIN,1992-04-02,1992-04-02,1992-04-02,5686274134839343,MR,SIEMENS,FH-HEAD^Brain Protocols,T1post,Verio,-74.445322,320,260,MONOCHROME2,16
1,PGBM-001,BRAIN,1992-04-02,1992-04-02,1992-04-02,5686274134839343,MR,SIEMENS,FH-HEAD^Brain Protocols,T1post,Verio,-67.945319,320,260,MONOCHROME2,16
2,PGBM-001,BRAIN,1992-04-02,1992-04-02,1992-04-02,5686274134839343,MR,SIEMENS,FH-HEAD^Brain Protocols,T1post,Verio,-61.445324,320,260,MONOCHROME2,16
3,PGBM-001,BRAIN,1992-04-02,1992-04-02,1992-04-02,5686274134839343,MR,SIEMENS,FH-HEAD^Brain Protocols,T1post,Verio,-54.945321,320,260,MONOCHROME2,16
4,PGBM-001,BRAIN,1992-04-02,1992-04-02,1992-04-02,5686274134839343,MR,SIEMENS,FH-HEAD^Brain Protocols,T1post,Verio,-48.445323,320,260,MONOCHROME2,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
248,PGBM-005,BRAIN,1991-07-02,1991-07-02,1991-07-02,2721434998651531,MR,SIEMENS,FH-HEAD^Brain Protocols,T1post,Verio,39.373348,320,260,MONOCHROME2,16
249,PGBM-005,BRAIN,1991-07-02,1991-07-02,1991-07-02,2721434998651531,MR,SIEMENS,FH-HEAD^Brain Protocols,T1post,Verio,45.873348,320,260,MONOCHROME2,16
250,PGBM-005,BRAIN,1991-07-02,1991-07-02,1991-07-02,2721434998651531,MR,SIEMENS,FH-HEAD^Brain Protocols,T1post,Verio,52.373348,320,260,MONOCHROME2,16
251,PGBM-005,BRAIN,1991-07-02,1991-07-02,1991-07-02,2721434998651531,MR,SIEMENS,FH-HEAD^Brain Protocols,T1post,Verio,58.873349,320,260,MONOCHROME2,16


## Eliminate series other than T1post
We only wish to study the series `T1post`.  If there are images associated with a series other than `T1post`, eliminate them from further analysis.
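
One way to approach this is to build a boolean mask once and use it twice: inverted to report which patients own the unwanted images, and directly to keep the wanted ones. The sketch below uses toy data with made-up patient identifiers, so it illustrates the pattern only.

```python
import pandas as pd

# Toy data: patient "P2" has images from a series that is not T1post.
toy = pd.DataFrame({
    "patient_id": ["P1", "P1", "P2", "P2", "P2"],
    "series_desc": ["T1post", "T1post", "T1post", "T1pre_reg", "T1pre_reg"],
})

is_t1post = toy["series_desc"] == "T1post"

# Patients that own at least one non-T1post image.
offenders = sorted(toy.loc[~is_t1post, "patient_id"].unique())

# .copy() gives an independent frame, so adding columns later
# does not trigger pandas' SettingWithCopyWarning.
kept = toy[is_t1post].copy()

print(offenders, len(kept))
```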

### Quiz 1 Question 1
Which patient had a series other than T1post?

In [3]:
# your code here

data[data['series_desc'] != 'T1post']

Unnamed: 0,patient_id,body_part_exmn,study_date,series_date,content_date,accession_nbr,modality,manufctr,study_desc,series_desc,manufctr_model_nm,slice_loc,rows,columns,photomtrc_intrpt,bits_alloc
162,PGBM-004,BRAIN,1994-01-12,1994-01-12,1994-01-12,4468773825686010,MR,Imaging Biometrics LLC,MR RCBV SEQUENCE FH,T1pre_reg,IB Delta Suite,-97.417328,512,512,MONOCHROME2,16
163,PGBM-004,BRAIN,1994-01-12,1994-01-12,1994-01-12,4468773825686010,MR,Imaging Biometrics LLC,MR RCBV SEQUENCE FH,T1pre_reg,IB Delta Suite,-91.162354,512,512,MONOCHROME2,16
164,PGBM-004,BRAIN,1994-01-12,1994-01-12,1994-01-12,4468773825686010,MR,Imaging Biometrics LLC,MR RCBV SEQUENCE FH,T1pre_reg,IB Delta Suite,-84.907379,512,512,MONOCHROME2,16
165,PGBM-004,BRAIN,1994-01-12,1994-01-12,1994-01-12,4468773825686010,MR,Imaging Biometrics LLC,MR RCBV SEQUENCE FH,T1pre_reg,IB Delta Suite,-78.652405,512,512,MONOCHROME2,16
166,PGBM-004,BRAIN,1994-01-12,1994-01-12,1994-01-12,4468773825686010,MR,Imaging Biometrics LLC,MR RCBV SEQUENCE FH,T1pre_reg,IB Delta Suite,-72.39743,512,512,MONOCHROME2,16
167,PGBM-004,BRAIN,1994-01-12,1994-01-12,1994-01-12,4468773825686010,MR,Imaging Biometrics LLC,MR RCBV SEQUENCE FH,T1pre_reg,IB Delta Suite,-66.142456,512,512,MONOCHROME2,16
168,PGBM-004,BRAIN,1994-01-12,1994-01-12,1994-01-12,4468773825686010,MR,Imaging Biometrics LLC,MR RCBV SEQUENCE FH,T1pre_reg,IB Delta Suite,-59.887482,512,512,MONOCHROME2,16
169,PGBM-004,BRAIN,1994-01-12,1994-01-12,1994-01-12,4468773825686010,MR,Imaging Biometrics LLC,MR RCBV SEQUENCE FH,T1pre_reg,IB Delta Suite,-53.632507,512,512,MONOCHROME2,16
170,PGBM-004,BRAIN,1994-01-12,1994-01-12,1994-01-12,4468773825686010,MR,Imaging Biometrics LLC,MR RCBV SEQUENCE FH,T1pre_reg,IB Delta Suite,-47.377533,512,512,MONOCHROME2,16
171,PGBM-004,BRAIN,1994-01-12,1994-01-12,1994-01-12,4468773825686010,MR,Imaging Biometrics LLC,MR RCBV SEQUENCE FH,T1pre_reg,IB Delta Suite,-41.122559,512,512,MONOCHROME2,16


In [4]:
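# .copy() makes an independent frame so later column assignments do not raise SettingWithCopyWarning
data = data[data['series_desc'] == 'T1post'].copy()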
data

Unnamed: 0,patient_id,body_part_exmn,study_date,series_date,content_date,accession_nbr,modality,manufctr,study_desc,series_desc,manufctr_model_nm,slice_loc,rows,columns,photomtrc_intrpt,bits_alloc
0,PGBM-001,BRAIN,1992-04-02,1992-04-02,1992-04-02,5686274134839343,MR,SIEMENS,FH-HEAD^Brain Protocols,T1post,Verio,-74.445322,320,260,MONOCHROME2,16
1,PGBM-001,BRAIN,1992-04-02,1992-04-02,1992-04-02,5686274134839343,MR,SIEMENS,FH-HEAD^Brain Protocols,T1post,Verio,-67.945319,320,260,MONOCHROME2,16
2,PGBM-001,BRAIN,1992-04-02,1992-04-02,1992-04-02,5686274134839343,MR,SIEMENS,FH-HEAD^Brain Protocols,T1post,Verio,-61.445324,320,260,MONOCHROME2,16
3,PGBM-001,BRAIN,1992-04-02,1992-04-02,1992-04-02,5686274134839343,MR,SIEMENS,FH-HEAD^Brain Protocols,T1post,Verio,-54.945321,320,260,MONOCHROME2,16
4,PGBM-001,BRAIN,1992-04-02,1992-04-02,1992-04-02,5686274134839343,MR,SIEMENS,FH-HEAD^Brain Protocols,T1post,Verio,-48.445323,320,260,MONOCHROME2,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
248,PGBM-005,BRAIN,1991-07-02,1991-07-02,1991-07-02,2721434998651531,MR,SIEMENS,FH-HEAD^Brain Protocols,T1post,Verio,39.373348,320,260,MONOCHROME2,16
249,PGBM-005,BRAIN,1991-07-02,1991-07-02,1991-07-02,2721434998651531,MR,SIEMENS,FH-HEAD^Brain Protocols,T1post,Verio,45.873348,320,260,MONOCHROME2,16
250,PGBM-005,BRAIN,1991-07-02,1991-07-02,1991-07-02,2721434998651531,MR,SIEMENS,FH-HEAD^Brain Protocols,T1post,Verio,52.373348,320,260,MONOCHROME2,16
251,PGBM-005,BRAIN,1991-07-02,1991-07-02,1991-07-02,2721434998651531,MR,SIEMENS,FH-HEAD^Brain Protocols,T1post,Verio,58.873349,320,260,MONOCHROME2,16


## Ensure all modalities are the same
We only want to study the MR modality.  Make sure there are no other modality images in the data.
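
An empty filtered dataframe answers the question visually, but a count is easier to read off and to reuse. The following is a minimal sketch on toy data (the `CT` row is invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"modality": ["MR", "MR", "CT", "MR"]})

# Tally every modality present, then count the rows that are not MR.
counts = toy["modality"].value_counts()
n_other = int((toy["modality"] != "MR").sum())

print(counts.to_dict(), n_other)
```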

### Quiz 1 Question 2
How many images had a modality other than MR?

In [5]:
# your code here
data[data['modality'] != 'MR']


Unnamed: 0,patient_id,body_part_exmn,study_date,series_date,content_date,accession_nbr,modality,manufctr,study_desc,series_desc,manufctr_model_nm,slice_loc,rows,columns,photomtrc_intrpt,bits_alloc


## How many different manufacturer and model combinations are in the study?
In an examination of images, the manufacturer or model may play a role.  We would want to know how many we will be dealing with in this study.
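
Counting distinct pairs can be done either by dropping duplicate rows or by asking a `groupby` how many groups it found. The sketch below shows both on toy data with placeholder manufacturer and model names; the column names match the dataframe built earlier in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "manufctr": ["A", "A", "A", "B", "B"],
    "manufctr_model_nm": ["x", "x", "y", "x", "x"],
})

# Option 1: keep one row per distinct (manufacturer, model) pair.
pairs = toy[["manufctr", "manufctr_model_nm"]].drop_duplicates()
n_combinations = len(pairs)

# Option 2: ngroups gives the same count without building the frame.
n_groups = toy.groupby(["manufctr", "manufctr_model_nm"]).ngroups

print(n_combinations, n_groups)
```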

### Quiz 1 Question 3
How many distinct manufacturer and model combinations are there?

In [6]:
# your code here

data.groupby(['manufctr','manufctr_model_nm'], as_index=False).size()

manufctr            manufctr_model_nm
GE MEDICAL SYSTEMS  DISCOVERY MR750      69
                    Optima MR450w        44
SIEMENS             Espree               44
                    Verio                73
dtype: int64

## Compare image sizes and color depth
For each patient, there should be two series.  Within each patient, are the image sizes the same for the first and last series?  Is the `Photometric Interpretation` the same?
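
Both questions can be answered with the same pattern: reduce the data to one row per patient and series date, then compare the first and last row of each patient column by column. The following is a sketch under the assumption that image size and photometric interpretation are constant within a series; the toy patients and dates are invented for illustration.

```python
import pandas as pd

# Toy data: one row per image, two series dates per patient.
toy = pd.DataFrame({
    "patient_id": ["P1"] * 4 + ["P2"] * 4,
    "series_date": pd.to_datetime(
        ["2000-01-01", "2000-01-01", "2000-06-01", "2000-06-01"] * 2),
    "rows": [320, 320, 320, 320, 320, 320, 512, 512],
    "columns": [260, 260, 260, 260, 260, 260, 512, 512],
    "photomtrc_intrpt": ["MONOCHROME2"] * 8,
})

# Collapse to one row per (patient, series date); groupby sorts the keys,
# so the rows of each patient end up in date order.
per_series = (toy.groupby(["patient_id", "series_date"], as_index=False)
                 .first())

# Compare the first and last series of each patient, column by column.
grouped = per_series.groupby("patient_id")[["rows", "columns", "photomtrc_intrpt"]]
changed = grouped.first() != grouped.last()

size_mask = changed["rows"] | changed["columns"]
size_changed = changed.loc[size_mask].index.tolist()
photo_changed = changed.loc[changed["photomtrc_intrpt"]].index.tolist()
print(size_changed, photo_changed)
```

Checking `photomtrc_intrpt` in the same pass means the photometric interpretation question is answered from the data rather than by inspection.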

### Quiz 1 Question 4
Which patients had different image sizes between their first and last series?

### Quiz 1 Question 5
Which patients had different photometric interpretations between their first and last series?

In [10]:
# your code here

data['img_size'] = data[['rows', 'columns']].apply(tuple, axis=1)

data.groupby(['patient_id','series_date', 'img_size'], as_index=False).size()

patient_id  series_date  img_size  
PGBM-001    1991-11-19   (320, 260)    25
            1992-04-02   (320, 260)    24
PGBM-002    1996-08-13   (512, 512)    22
            1997-01-17   (512, 512)    22
PGBM-003    1995-03-29   (320, 280)    22
            1995-10-17   (512, 512)    24
PGBM-004    1993-06-22   (512, 512)    22
            1994-01-12   (512, 512)    23
PGBM-005    1991-07-02   (320, 260)    24
            1992-05-05   (320, 280)    22
dtype: int64

PGBM-003 and PGBM-005 had different image sizes between their first and last series. No patients had different photometric interpretations.