<h3> Introduction </h3>

This notebook defines functions that load and process DICOM files. 

</br>

<h4> Functions </h4>

Specifically, the functions load compressed or uncompressed DICOM images or PNG breast scan images, resize them, save them as NumPy (`.npy`) files, and combine each image with its associated breast density. The combined image and density are saved as NumPy archive (`.npz`) files. 
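
As a rough illustration of the final step, the sketch below bundles an image array with its density value into one `.npz` archive and reads it back. It uses a random array in place of a real mammogram and writes to a temporary directory. The key names `arr_0` and `arr_1` match those used by `numpyz_file` later in this notebook; the array contents and the temporary location are assumptions made for the example.

```python
import os
import tempfile

import numpy as np

# Stand-in for a resized mammogram; a real one would come from the resize helpers.
rng = np.random.default_rng(0)
image = rng.random((229, 229))

# Breast density is stored as a one-element array next to the image.
density = np.array([3])

with tempfile.TemporaryDirectory() as tmp_dir:
    # File name follows the Subject ID naming seen later in this notebook.
    out_path = os.path.join(tmp_dir, "Mass-Training_P_00495_RIGHT_CC.npz")
    np.savez(out_path, arr_0=image, arr_1=density)

    # np.load on an .npz file returns a dict-like archive keyed by array name.
    with np.load(out_path) as archive:
        loaded_image = archive["arr_0"]
        loaded_density = int(archive["arr_1"][0])

print(loaded_image.shape, loaded_density)
```

Keeping the density inside the same archive means a single file carries both the model input and the value its label is derived from.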

</br>

<h4> Supporting datasets </h4>

The images are the Mass-Training Full Mammogram Images (DICOM) and Mass-Testing Full Mammogram Images (DICOM) obtained from <a href="https://wiki.cancerimagingarchive.net/display/Public/CBIS-DDSM">the Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM)</a>. A `metadata` csv file keeps track of which DICOM images were downloaded (the NBIA Data Retriever provides this file when it downloads the images). The `Mass-Training-Description` (csv) and `Mass-Testing-Description` (csv) files contain the breast density information and information on all files available in the original dataset. 
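
The Subject ID values in the `metadata` file appear to encode the split, the patient ID, the breast side, and the view, as in `Mass-Training_P_00495_RIGHT_CC` (the file name loaded later in this notebook). Under that assumption, a regular expression can pull the parts out in one step. The helper name `parse_subject_id`, the view codes `CC` and `MLO`, and the optional numeric suffix are assumptions for illustration; IDs that do not match are rejected rather than guessed at.

```python
import re

# Assumed layout: Mass-<split>_<patient id>_<side>_<view>, optionally followed by _<number>.
SUBJECT_ID_PATTERN = re.compile(
    r"^Mass-(?P<split>Training|Test)_"
    r"(?P<patient_id>P_\d{5})_"
    r"(?P<side>LEFT|RIGHT)_"
    r"(?P<view>CC|MLO)"
    r"(?:_\d+)?$"
)


def parse_subject_id(subject_id):
    """Split a Subject ID into its named parts, or raise ValueError."""
    match = SUBJECT_ID_PATTERN.match(subject_id)
    if match is None:
        raise ValueError(f"Unrecognized Subject ID: {subject_id!r}")
    return match.groupdict()


parts = parse_subject_id("Mass-Training_P_00495_RIGHT_CC")
print(parts["patient_id"], parts["side"], parts["view"])
```

Failing loudly on an unexpected ID avoids silently labelling an image as `RIGHT` just because the string `LEFT` is absent.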

</br>



In [None]:
%pip install pydicom
import pydicom as pdm


Collecting pydicom
  Downloading pydicom-2.2.2-py3-none-any.whl (2.0 MB)

In [2]:
# import libraries

import os
import pandas as pd
import numpy as np
import skimage as skm
import skimage.io  # load the io submodule explicitly so skm.io.imread resolves
from skimage.transform import resize

In [31]:
# Mounting drive 

from google.colab import drive
drive.mount('/content/gdrive')
os.chdir("/content/gdrive/MyDrive/Fa21/UCSF/Onboarding/manifest")

Mounted at /content/gdrive


In [None]:
def resize_png_image(image, dimensions):
  '''
  Image: (STRING) file path of the image
  Dimensions: (TUPLE of INTEGERS) dimension to reshape to 
  -----
  Requires the function 'resize' from skimage.transform
  Requires io from skimage 
  -----
  Resizing png image
  '''
  read_image = skm.io.imread(image)
  resized_image = resize(read_image, dimensions, anti_aliasing=False)
  return resized_image

def resize_dicom_image(image, dimensions):
  '''
  Image: (STRING) file path of the image 
  Dimensions: (TUPLE of INTEGERS) dimension to reshape to 
  -----
  Requires the function 'resize' from skimage.transform
  Requires pydicom
  -----
  Resizing uncompressed and compressed dicom image
  '''
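  # force=True reads the file even if the DICOM File Meta Information header is missing
  read_image = pdm.dcmread(image, force=True)
  if not image.endswith(".dcm"): 
    # decompress() modifies the dataset in place and returns None, so do not reassign
    read_image.decompress()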
  pixel_arr = read_image.pixel_array
  resized_image = resize(pixel_arr, dimensions, anti_aliasing=False)
  return resized_image 

def save_to_numpy(fpath, file):
  '''
  fpath: (STRING) file path to save to 
  file: (NUMPY ARRAY) the file to save into a numpy file 
  -----
  Requires numpy 
  -----
  Save to a numpy file
  '''
  np.save(fpath, file)

def patient_id(subject_id):
  '''
  subject_id: (STRING) subject_id column from the metadata table that's
  provided using NBIA Data Retriever 

  get the patient_id from the subject_id
  '''
  id_len = 7
  if subject_id.startswith("Mass-Test_"):
    start = len("Mass-Test_")
    return subject_id[start:start+id_len]
  else:
    start = len("Mass-Training_")
    return subject_id[start:start+id_len]

def left_or_right(subject_id):
  '''
  subject_id: (STRING) subject_id column from the metadata table that's
  provided using NBIA Data Retriever 

  find out if the image is of the right or left breast from the subject_id
  '''
  if "LEFT" in subject_id:
    return "LEFT"
  else:
    return "RIGHT"

def find_breast_density(patient_id, left_right, des):
  '''
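  patient_id: (STRING) patient id, e.g. "P_00016"
  left_right: (STRING) "LEFT" or "RIGHT"
  des: (PANDAS TABLE) table of description to retrieve breast density 
  -----
  Find the breast density given patient_id and whether the 
  image is of the left or the right breast.
  Returns the raw density value (1, 2, 3 or 4) from the first matching row. 
  Conversion to a binary label (1, 2 -> 0, i.e. not dense; 
  3, 4 -> 1, i.e. dense) happens later in create_label_csv.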
  '''

  patient = des[des["patient_id"] == patient_id]
  breast = patient[patient["left or right breast"] == left_right]
  density = breast["breast_density"].values[0] 
  return density

def numpyz_file(fpath, save_to, bd_info):
  '''
  fpath: (STRING) file path to load numpy file. 
  save_to: (STRING) file path to save numpyZ file.
  bd_info: (PANDAS TABLE) table with breast density information. 

  Load the numpy file, find breast density information, 
  and save as numpyZ file. 
  '''

  loaded_file = np.load(file=fpath + '.npy', allow_pickle=False)
  subject_id = fpath[fpath.index("Mass-"):]
  id_patient = patient_id(subject_id)
  which_breast = left_or_right(subject_id)
  density = find_breast_density(id_patient, which_breast, bd_info)

  density_arr = np.array([density])

  np.savez(save_to, arr_0=loaded_file, arr_1=density_arr)

def loading_data_pipeline(testn, trainn, metan, dimensions, verbose=False):
  '''
  testn: (STRING) name of the Mass-Test-Description table 
  trainn: (STRING) name of the Mass-Training-Description table 
  metan: (STRING) name of the metadata table that comes with the NBIA Data Retriever
  dimensions: (TUPLE of INTEGERS) dimensions to resize the images 
  verbose: (BOOLEAN) option to print progress of function 
  -------
  Run the entire pipeline: load the dicom images, 
  resize them, save them as numpy files, find the breast density 
  associated with each image, and combine the density with the 
  image to create numpyz files. 
  The 'numpy_files' and 'numpyz_files' directories must already exist.
  '''
  
  # load the necessary tables 
  test = pd.read_csv(testn)
  train = pd.read_csv(trainn)
  des = pd.concat([test, train], axis=0) 
  meta = pd.read_csv(metan)

  f_locations = meta["File Location"].str[2:]
  patient_ids = meta["Subject ID"]
  
  for idx in range(meta.shape[0]):

    # putting together file path 
    location = f_locations[idx]
    subject_id = patient_ids[idx]
    f_name = os.listdir(location)[0]
    fpath = os.path.join(location, f_name)

    # resize data
    if f_name.endswith("png"):
      resized = resize_png_image(fpath, dimensions)
    else: 
      resized = resize_dicom_image(fpath, dimensions)

    # save as numpy files
    np_name = os.path.join('numpy_files', subject_id)
    save_to_numpy(np_name, resized)

    # find breast density and combine to create numpyz file
    save_to = os.path.join("numpyz_files", subject_id)
    numpyz_file(np_name, save_to, des)

    # print progress
    p_id = patient_id(subject_id)
    msg = "Finished with patient " + p_id
    print_progress(msg, verbose)


def print_progress(msg, verbose=False):
  '''
  msg: (STRING) message to print 
  verbose: (BOOLEAN) whether to print the message
  -------
  Helper function to track progress 
  in the data loading and processing pipeline.
  '''
  if verbose:
    print(msg)
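
`find_breast_density` indexes `.values[0]` directly, which raises an `IndexError` when no row matches the patient and side. The sketch below shows one way to make that case explicit, using a small hand-made table in place of the real description files. The column names are the ones used above; the helper name `lookup_density` and the sample rows are invented for illustration.

```python
import pandas as pd

# Hand-made stand-in for the concatenated description tables.
des = pd.DataFrame(
    {
        "patient_id": ["P_00016", "P_00016", "P_00017"],
        "left or right breast": ["LEFT", "LEFT", "RIGHT"],
        "breast_density": [4, 4, 2],
    }
)


def lookup_density(des, patient_id, side):
    """Return the density of the first matching row, or None if nothing matches."""
    mask = (des["patient_id"] == patient_id) & (des["left or right breast"] == side)
    matches = des.loc[mask, "breast_density"]
    if matches.empty:
        return None
    return int(matches.iloc[0])


found = lookup_density(des, "P_00016", "LEFT")
missing = lookup_density(des, "P_00099", "LEFT")
print(found, missing)
```

Returning `None` lets the caller decide whether to skip the image or stop the run, instead of failing partway through a long loop.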


In [None]:
loading_data_pipeline(testn="mass_case_description_test_set.csv", 
                      trainn="mass_case_description_train_set.csv", 
                      metan="metadata.csv", 
                      dimensions=(229, 229), 
                      verbose=True)

Finished with patient P_00016
Finished with patient P_00016
Finished with patient P_00017
Finished with patient P_00032
Finished with patient P_00017
Finished with patient P_00032
Finished with patient P_00037
Finished with patient P_00056
Finished with patient P_00037
Finished with patient P_00066
Finished with patient P_00066
Finished with patient P_00114
Finished with patient P_00099
Finished with patient P_00116
Finished with patient P_00116
Finished with patient P_00118
Finished with patient P_00124
Finished with patient P_00126
Finished with patient P_00118
Finished with patient P_00126
Finished with patient P_00131
Finished with patient P_00131
Finished with patient P_00145
Finished with patient P_00145
Finished with patient P_00156
Finished with patient P_00147
Finished with patient P_00147
Finished with patient P_00158
Finished with patient P_00171
Finished with patient P_00159
Finished with patient P_00173
Finished with patient P_00171
Finished with patient P_00173
Finished w

In [9]:
npz_file = np.load('numpyz_files/Mass-Training_P_00495_RIGHT_CC.npz')

In [19]:
npz_file['arr_1']

array([3])

In [20]:
files_name = np.array([])

In [24]:
files_name = np.append(files_name, 0)

In [25]:
files_name

array([0., 0.])

In [26]:
label = 1 if 0 < 1 else -1

In [27]:
label

1

In [51]:
# creating a label csv

def create_label_csv(dpath, save_to):
  '''
  dpath: (STRING) path to directory containing the numpyz files. 
  save_to: (STRING) path to save the resulting csv file to. 
  ---------
  Helper function that loads the numpyz files 
  and creates a pandas table with the following format: 
  Example | Label 
  The label is binary: 0 if breast density is 1 or 2, 
  1 otherwise. The pandas table is saved as a csv file. 
  '''
  # plain lists avoid re-copying an array on every append
  # and keep the labels as integers rather than floats
  file_names = []
  labels = []

  for f in os.listdir(dpath):
    path = os.path.join(dpath, f)
    file_names.append(f)

    # extract breast density from numpyz file
    label = np.load(path)['arr_1'][0]
    # convert density to a binary value 
    label = 0 if label < 3 else 1
    labels.append(label)



  results_tbl = pd.DataFrame({"Example": file_names, 
                              "Label": labels})
  results_tbl.to_csv(save_to, index=False)


In [52]:
create_label_csv("numpyz_files", "labels.csv")
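
A quick look at class balance can catch labelling mistakes before training. The sketch below applies the same rule as `create_label_csv` (density 1 or 2 maps to 0, density 3 or 4 maps to 1) to a few made-up rows and counts the resulting labels. It uses only the standard library and an in-memory CSV; the file names and density values are invented, and `density_to_label` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def density_to_label(density):
    """Map density 1-2 to 0 (not dense) and 3-4 to 1 (dense)."""
    if density not in (1, 2, 3, 4):
        raise ValueError(f"Unexpected breast density: {density}")
    return 0 if density < 3 else 1


# Made-up (file name, density) pairs standing in for the .npz contents.
examples = [
    ("example_a.npz", 1),
    ("example_b.npz", 2),
    ("example_c.npz", 3),
    ("example_d.npz", 4),
    ("example_e.npz", 3),
]

# Write the same two columns that create_label_csv produces.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Example", "Label"])
for name, density in examples:
    writer.writerow([name, density_to_label(density)])

# Read the CSV back and count how many rows carry each label.
buffer.seek(0)
label_counts = Counter(int(row["Label"]) for row in csv.DictReader(buffer))
print(label_counts)
```

Rejecting densities outside 1 to 4 is an extra check that `create_label_csv` does not perform; there, any value of 3 or above is labelled 1.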