## Module 4 Practice 1 Answers - DICOM data extraction

In this exercise, you will extract metadata from the DICOM files of multiple subjects and perform a basic descriptive analysis of that data.

The directory is the same one used in the lab, but here we read the DICOM files of every patient rather than just one.

This is the structure:

```
Brain-Tumor-Progression
  |_ PGBM-001 (the patient identifier in the study)
    |_ 11-19-1991-FH-HEADBrain Protocols-40993 (folder for the scan date)
        |_ 11.000000-T1post-03326 (folder for the scan series)
            |_ 1-01.dcm (sequential scan file)
            ...
            |_ 1-25.dcm (sequential scan file)
    |_ 04-02-1992-FH-HEADBrain Protocols-79896 (folder for the scan date)
        |_ 11.000000-T1post-80644 (folder for the scan series)
            |_ 1-01.dcm (sequential scan file)
            ...
            |_ 1-24.dcm (sequential scan file)
  |_ PGBM-002
    |_ <scan date 1>
        |_ <series>
            |_ 1-01.dcm
            ...
            |_ 1-xx.dcm (the number of images in each series varies)
    |_ <scan date 2>
        |_ <series>
            |_ 1-01.dcm
            ...
            |_ 1-xx.dcm (the number of images in each series varies)
  ...
  |_ PGBM-005 (there are 5 patients)
    |_ <scan date 1>
        |_ <series>
            |_ 1-01.dcm
            ...
            |_ 1-xx.dcm (the number of images in each series varies)
    |_ <scan date 2>
        |_ <series>
            |_ 1-01.dcm
            ...
            |_ 1-xx.dcm (the number of images in each series varies)
  
```

There are five subjects, two scan dates per subject, and possibly multiple series for each scan date.
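As a rough illustration of that layout, the sketch below counts `.dcm` files per patient, scan date, and series using only the standard library. The helper name `count_images_per_series` and the folder names in the stand-in tree are hypothetical, and the code assumes every `.dcm` file sits exactly three folders below the root, as in the diagram above.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_images_per_series(root):
    """Count .dcm files per (patient, scan date, series) folder under root."""
    counts = Counter()
    for path in Path(root).rglob("*.dcm"):
        # Relative path parts are: patient / scan date / series / file name
        patient, scan_date, series = path.relative_to(root).parts[:3]
        counts[(patient, scan_date, series)] += 1
    return counts


# Build a tiny stand-in tree of empty files (hypothetical names, not real data).
demo_root = Path(tempfile.mkdtemp()) / "Brain-Tumor-Progression"
layout = {
    ("PGBM-001", "scan-date-1", "series-a"): 3,
    ("PGBM-001", "scan-date-2", "series-a"): 2,
    ("PGBM-002", "scan-date-1", "series-a"): 4,
}
for (patient, scan_date, series), n_files in layout.items():
    series_dir = demo_root / patient / scan_date / series
    series_dir.mkdir(parents=True)
    for i in range(1, n_files + 1):
        (series_dir / f"1-{i:02d}.dcm").touch()

series_counts = count_images_per_series(demo_root)
for key in sorted(series_counts):
    print(key, series_counts[key])
```

Pointing `count_images_per_series` at the real `Brain-Tumor-Progression` folder should give a quick sanity check on how many files the later cells are expected to read.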

In [None]:
import sys
!{sys.executable} -m pip install pydicom scikit-image
import pydicom
import skimage
import numpy as np
import pandas as pd

## Extract data from all DICOM files

Loop over all DICOM images under the `Brain-Tumor-Progression` directory structure and extract the following tags into a pandas dataframe. Refer to the lab on recursive directory looping for the technique to read all of the images.

```
(0010, 0010) Patient's Name                      
(0010, 0020) Patient ID                          
(0010, 0030) Patient's Birth Date                
(0010, 0040) Patient's Sex                       
(0010, 1010) Patient's Age                       
(0010, 1030) Patient's Weight                    
(0012, 0062) Patient Identity Removed            
(0008, 0050) Accession Number
(0008, 0021) Series Date
```

Some of these tags return a pydicom object rather than a built-in Python type:

* (0010, 0010) Patient's Name - returns a [pydicom.valuerep.PersonName](https://pydicom.github.io/pydicom/stable/reference/generated/pydicom.valuerep.PersonName.html)
    (this can be cast to a string)
* (0010, 1030) Patient's Weight - returns a [pydicom.valuerep.DSfloat](https://pydicom.github.io/pydicom/dev/reference/generated/pydicom.valuerep.DSfloat.html)
    (this can be cast to a float)
    
Some other values (like dates and ages) will be represented as strings, but are better loaded as a different type.

If necessary, reference the documentation and select an appropriate method to get the data in the format you desire.
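One possible way to handle the string-valued dates and ages is sketched below with plain Python. The helper names `parse_dicom_date` and `parse_dicom_age_years` are hypothetical. The sketch assumes dates use the DICOM `YYYYMMDD` form and ages use the `nnnU` form, where the unit `U` is `D`, `W`, `M`, or `Y`; the conversion of weeks and days to years is approximate.

```python
from datetime import datetime


def parse_dicom_date(value):
    """Convert a DICOM date string (YYYYMMDD) to a date; empty values become None."""
    if not value:
        return None
    return datetime.strptime(value, "%Y%m%d").date()


def parse_dicom_age_years(value):
    """Convert a DICOM age string such as '045Y' to an approximate age in years."""
    if not value:
        return None
    number, unit = int(value[:-1]), value[-1].upper()
    # Approximate number of each unit per year.
    units_per_year = {"Y": 1, "M": 12, "W": 52, "D": 365}
    return number / units_per_year[unit]


print(parse_dicom_date("19911119"))    # 1991-11-19
print(parse_dicom_age_years("045Y"))   # 45.0
print(parse_dicom_age_years("018M"))   # 1.5
```

Dividing by the unit keeps ages recorded in months, weeks, or days on the same scale as ages recorded in years, which simply stripping the trailing `Y` would not do.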

In [None]:
from glob import glob

# There are a number of ways to build a pandas dataframe from scratch. This method builds a list of lists
# (one inner list per image), then converts that into a dataframe.
list_data = []
for filename in glob('../resources/**/*.dcm', recursive=True):
    img = pydicom.dcmread(filename)
    list_data.append([
        str(img[0x00100010].value),
        img[0x00100020].value,
        img[0x00100030].value,
        img[0x00100040].value,
        # age is a string such as '045Y'; this assumes every age is recorded in years
        int(img[0x00101010].value.replace('Y', '')),
        float(img[0x00101030].value),
        img[0x00120062].value,
        img[0x00080050].value,
        pd.to_datetime(img[0x00080021].value, format='%Y%m%d')
    ])
    
data = pd.DataFrame(list_data, 
                    columns=['pt_name', 'pt_id', 'pt_birth_date', 'pt_sex', 'pt_age', 'pt_weight', 'pt_deid',
                            'acc_num', 'series_date'])

print(data.dtypes)
display(data)

## Has every image been deidentified?
One of the first things you should check when working with potential PHI is whether the data have been de-identified. If they have not, and you do not have legal access to the PHI, you should stop immediately.

The DICOM format has a tag for this: (0012, 0062) Patient Identity Removed, listed above.
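A minimal sketch of this kind of check on a toy dataframe follows. The column names match the ones used in this notebook, but the rows are made up for illustration; any value other than `'YES'`, including an empty string, is treated as not de-identified.

```python
import pandas as pd

# Toy stand-in for the extracted metadata (made-up values).
toy = pd.DataFrame({
    "pt_id": ["A", "A", "B", "B", "C"],
    "pt_deid": ["YES", "YES", "YES", "NO", ""],
})

# Keep only the rows that are not explicitly flagged as de-identified.
not_deidentified = toy[toy["pt_deid"].ne("YES")]

n_flagged = len(not_deidentified)                       # number of suspect images
flagged_ids = sorted(not_deidentified["pt_id"].unique())  # which subjects they belong to

print(n_flagged)     # 2
print(flagged_ids)   # ['B', 'C']
```

Listing the affected subject IDs alongside the count makes it easier to see where to look if the count is not zero.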

In [None]:
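# count the images whose "Patient Identity Removed" tag is not 'YES'; 0 means every image is flagged as de-identified
data[data['pt_deid'] != 'YES']['pt_id'].count()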

## How many images for each subject and series date?
Print the number of images for each combination of subject and series date.
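When counting rows per group in pandas, `count()` and `size()` can differ: `count()` skips missing values in the selected column, while `size()` counts every row. The toy data below is made up to show the difference, with the column names borrowed from this notebook.

```python
import pandas as pd

# Toy stand-in: one row per image, with one missing accession number.
toy = pd.DataFrame({
    "pt_id": ["A", "A", "A", "B", "B"],
    "series_date": ["d1", "d1", "d2", "d1", "d1"],
    "acc_num": ["1", None, "2", "3", "4"],
})

grouped = toy.groupby(["pt_id", "series_date"])

size_per_group = grouped.size()               # counts every row in the group
count_per_group = grouped["acc_num"].count()  # counts only non-missing acc_num

print(size_per_group)
print(count_per_group)
```

If the counted column can contain missing values, `size()` is the safer choice for counting images.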

In [None]:
# group by patient id and series date, then count just the accession number column
data.groupby(by=['pt_id', 'series_date'])[['acc_num']].count()

## How many of each sex in the study?

Count the number of subjects by sex.
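Because the data holds one row per image, subjects have to be counted rather than rows. One alternative to dropping duplicates first is to count distinct subject IDs within each sex, sketched here on made-up rows that reuse this notebook's column names.

```python
import pandas as pd

# Toy stand-in: several image rows per subject.
toy = pd.DataFrame({
    "pt_id": ["A", "A", "B", "C", "C"],
    "pt_sex": ["F", "F", "M", "F", "F"],
})

# Number of distinct subjects for each sex.
subjects_by_sex = toy.groupby("pt_sex")["pt_id"].nunique()

print(subjects_by_sex)
```

Here five image rows reduce to two female subjects and one male subject.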

In [None]:
# remember we have one row per image, so we first need just one row per subject, then group by sex and count

data[['pt_id', 'pt_sex']].drop_duplicates().groupby(by='pt_sex').count()

## What was the minimum, maximum, and average age of the subjects during their last series?

Find the last scan series by date for each subject, then find the minimum, maximum, and mean age of all subjects from this subset.

hint: look at the [idxmax()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.idxmax.html) method in pandas.
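To see what `idxmax()` does after a `groupby`, here is a small sketch on made-up rows that reuse this notebook's column names. It returns one index label per group, pointing at the first row holding that group's maximum, and those labels can then be passed to `loc`.

```python
import pandas as pd

# Toy stand-in: several image rows per subject across two series dates.
toy = pd.DataFrame({
    "pt_id": ["A", "A", "A", "B", "B"],
    "series_date": pd.to_datetime(
        ["1991-11-19", "1992-04-02", "1992-04-02", "1990-01-05", "1990-06-01"]
    ),
    "pt_age": [45, 46, 46, 60, 60],
})

# One index label per subject: the first row with that subject's latest date.
last_idx = toy.groupby("pt_id")["series_date"].idxmax()

# Use the labels to pull the full rows, then aggregate the age column.
last_rows = toy.loc[last_idx]
summary = last_rows["pt_age"].agg(["min", "max", "mean"])

print(list(last_idx))  # [1, 4]
print(summary)
```

Subject A has two rows on its latest date, and `idxmax()` picks the first of them, so each subject contributes exactly one row to the summary.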

In [None]:
# idxmax returns the index label of the first occurrence of the maximum in each group.
# We use those labels to get the full rows from the data, then run the aggregate functions on patient age.
data.loc[data.groupby('pt_id')['series_date'].idxmax()]['pt_age'].agg(['min', 'max', 'mean'])

## What was the minimum, maximum, and average weight of the subjects during their last series? 

Similar to the above question, but for weight.

In [None]:
data.loc[data.groupby('pt_id')['series_date'].idxmax()]['pt_weight'].agg(['min', 'max', 'mean'])

## What was the mean weight loss/gain between the first and the last series?

hint: is there an idxmin()?
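Once the first and last weights are each indexed by subject, pandas matches them by index label rather than by position when subtracting. The sketch below uses made-up weights to show that the order of the rows does not matter.

```python
import pandas as pd

# Made-up first and last weights, indexed by subject id.
first = pd.Series({"A": 80.0, "B": 70.0, "C": 65.0})
last = pd.Series({"C": 60.0, "A": 82.5, "B": 70.0})  # deliberately in a different order

# Subtraction aligns on the index labels, so A is paired with A, B with B, and so on.
change = last - first

print(change["A"])    # 2.5
print(change["C"])    # -5.0
print(change.mean())
```

A positive value means the subject gained weight between the two series, and a negative value means the subject lost weight.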

In [None]:
# Use idxmin and idxmax to get the first and last series rows, then index by pt_id so pandas can align the rows by subject.
first = data.loc[data.groupby('pt_id')['series_date'].idxmin()].set_index('pt_id')['pt_weight']
last = data.loc[data.groupby('pt_id')['series_date'].idxmax()].set_index('pt_id')['pt_weight']

display(first)
display(last)

# subtract the first from the last, so weight loss is negative and weight gain is positive
display(last - first)

display((last - first).mean())