### Convert the images from DICOM to PNG format 

The CNN models in this study do not accept DICOM input, so converting the images to PNG is the first pre-processing step. The script walks through the image folders and writes each image as a .png file to a new folder, preserving the original folder structure so that the images can still be matched to the categorical data in the accompanying CSV files. During conversion, each image whose pixel values exceed the 8-bit range (0-255) is individually min-max normalized, which stretches the range of pixel intensities to improve contrast and readability.

In [None]:
import os
import cv2
import pydicom

# Input folder - DICOM files
input_folder = r'E:\vindr-mammo-1.0.0\images'
# Output folder - PNG files
output_folder = r'E:\vindr-mammo-1.0.0\png_images'

# Iterate through sub-folders and files
for root, folders, files in os.walk(input_folder):
    for file in files:
        if file.endswith(".dicom"):
            dicom_path = os.path.join(root, file)
            relative_path = os.path.relpath(root, input_folder)
            output_subfolder = os.path.join(output_folder, relative_path)  # Maintain original folder structure
            os.makedirs(output_subfolder, exist_ok=True)  # Create output subfolder
            # Swap only the file extension, leaving the rest of the file name untouched
            output_path = os.path.join(output_subfolder, os.path.splitext(file)[0] + '.png')
            dicom_file = pydicom.dcmread(dicom_path)
            img = dicom_file.pixel_array

            if img.max() > 255:
                # Min-max normalize to 0-255 (stretch range of pixel intensities)
                lo, hi = img.min(), img.max()
                img = ((img - lo) / (hi - lo) * 255).astype('uint8')

            cv2.imwrite(output_path, img)  # Save as PNG
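
The normalization step in the cell above can be isolated into a small helper. The following is a minimal sketch using NumPy only. The function name `normalize_to_uint8` and the synthetic 12-bit array are illustrative assumptions, not part of the original script. Unlike the cell above, this sketch rounds rather than truncates, and it returns zeros for a flat image to avoid dividing by zero.

```python
import numpy as np


def normalize_to_uint8(pixels: np.ndarray) -> np.ndarray:
    """Min-max stretch an array to the 0-255 range and return it as uint8."""
    pixels = pixels.astype(np.float64)
    lo, hi = pixels.min(), pixels.max()
    if hi == lo:
        # Flat image: there is no contrast to stretch, so avoid dividing by zero
        return np.zeros(pixels.shape, dtype=np.uint8)
    scaled = (pixels - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)


# Synthetic 12-bit "image" standing in for dicom_file.pixel_array (assumption)
demo = np.array([[0, 1024], [2048, 4095]], dtype=np.uint16)
normalized = normalize_to_uint8(demo)
print(normalized.dtype, normalized.min(), normalized.max())
```

In the conversion loop, a call such as `normalize_to_uint8(dicom_file.pixel_array)` could stand in for the inline normalization, assuming every image should be stretched regardless of its original range.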

### Resize the images

After conversion, the images are resized, because most CNN models require consistent input dimensions. Standardizing the size also reduces the computational load, so the models can process the data efficiently. The images are resized to 256x256, a common input size for CNN models. Note that this forces every image to a square, so the original aspect ratio is not preserved. cv2.INTER_AREA resamples using the pixel area relation: each output pixel is derived from the source pixels it covers, which limits aliasing and preserves image quality when downscaling.


In [None]:
from tqdm import tqdm  # Progress bar

images_folder = r'E:\vindr-mammo-1.0.0\png_images'

# Load images into a list
all_images = []
for root, _, files in os.walk(images_folder):
    for file in files:
        if file.endswith(".png"):  # Only process PNG images
            all_images.append(os.path.join(root, file))

# Loop through images and resize - including progress bar
for img_path in tqdm(all_images):  # Wrapping the list in tqdm displays the progress bar
    img = cv2.imread(img_path)
    resized_img = cv2.resize(img, (256, 256), interpolation=cv2.INTER_AREA) # INTER_AREA better for downscaling
    cv2.imwrite(img_path, resized_img)  # Overwrite original images (not necessary to retain them)

print('Resizing Complete!')
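
To illustrate why area-based resampling suits downscaling, the sketch below averages non-overlapping pixel blocks with NumPy. It is a simplified stand-in for the idea behind cv2.INTER_AREA at integer scale factors, not OpenCV's implementation. The helper name `block_mean_downscale` and the 4x4 test array are hypothetical.

```python
import numpy as np


def block_mean_downscale(img: np.ndarray, factor: int) -> np.ndarray:
    """Downscale a 2D image by averaging non-overlapping factor x factor blocks."""
    h, w = img.shape
    if h % factor or w % factor:
        raise ValueError("image dimensions must be divisible by factor")
    # Split rows and columns into (block index, offset within block) pairs
    blocks = img.reshape(h // factor, factor, w // factor, factor)
    # Average over the two within-block axes
    return blocks.mean(axis=(1, 3))


# Synthetic 4x4 image (assumption) reduced to 2x2
demo = np.arange(16, dtype=np.float64).reshape(4, 4)
small = block_mean_downscale(demo, 2)
print(small)
```

Because every source pixel contributes to exactly one output pixel, no detail is skipped outright, which is the property that makes area-based methods a sensible choice when shrinking large mammograms.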

### Apply CLAHE contrast enhancement

CLAHE (Contrast Limited Adaptive Histogram Equalization) enhances local contrast rather than
the global contrast of the image. With CLAHE the image is split into multiple sections or ‘tiles’ which are
individually equalized to avoid darker tiles influencing lighter ones, and the contrast gain in
each tile is clipped to limit noise amplification. Medical images commonly contain large areas
of low contrast, which makes this method a good fit. Using OpenCV (the cv2 module), this
equalization method can be easily applied to the images in this dataset. As with each step in
this process, care is taken to preserve the original folder structure.

In [None]:
images_folder = r'E:\vindr-mammo-1.0.0\png_images'
clahe_images_folder = r'E:\vindr-mammo-1.0.0\clahe_images'

os.makedirs(clahe_images_folder, exist_ok=True)
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)) # Values used in the OpenCV CLAHE tutorial; 8x8 is also the default tile grid

# Loop through study ids to preserve folder structure
for study_id in os.listdir(images_folder):
    study_folder = os.path.join(images_folder, study_id)
    if not os.path.isdir(study_folder):
        continue  # Skip any stray files at the top level
    clahe_study_folder = os.path.join(clahe_images_folder, study_id)
    os.makedirs(clahe_study_folder, exist_ok=True)

    for file in os.listdir(study_folder):
        if file.endswith('.png'):
            image_path = os.path.join(study_folder, file)
            output_path = os.path.join(clahe_study_folder, file)

            img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE) # Mammogram images are grayscale anyway
            cv2.imwrite(output_path, clahe.apply(img))
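
The effect of the contrast limit can be seen on a single tile. The sketch below is a simplified NumPy illustration, not OpenCV's algorithm: it equalizes one tile with a clipped histogram and omits the interpolation between neighbouring tiles that a full CLAHE implementation performs. The helper `clipped_equalize_tile` is hypothetical, and its `clip_limit` is an absolute count per histogram bin, which is an assumption made for simplicity and is not on the same scale as the `clipLimit` argument of `cv2.createCLAHE`.

```python
import numpy as np


def clipped_equalize_tile(tile: np.ndarray, clip_limit: float) -> np.ndarray:
    """Equalize one uint8 tile using a histogram clipped at clip_limit counts per bin."""
    hist = np.bincount(tile.ravel(), minlength=256).astype(np.float64)
    # Counts above the limit are removed and spread evenly over all 256 bins
    excess = np.clip(hist - clip_limit, 0, None).sum()
    hist = np.minimum(hist, clip_limit) + excess / 256.0
    # The normalized cumulative histogram becomes the intensity lookup table
    cdf = hist.cumsum()
    lut = np.round(cdf / cdf[-1] * 255.0).astype(np.uint8)
    return lut[tile]


# Synthetic low-contrast 8x8 tile (assumption): values 100..103, 16 pixels each
demo_tile = np.repeat(np.arange(100, 104, dtype=np.uint8), 16).reshape(8, 8)

clipped = clipped_equalize_tile(demo_tile, clip_limit=4)
unclipped = clipped_equalize_tile(demo_tile, clip_limit=demo_tile.size)

spread_in = int(demo_tile.max()) - int(demo_tile.min())
spread_clipped = int(clipped.max()) - int(clipped.min())
spread_unclipped = int(unclipped.max()) - int(unclipped.min())
print(spread_in, spread_clipped, spread_unclipped)
```

Both variants widen the intensity range of the tile, but the clipped version does so less aggressively. This is the trade-off the clip limit controls: more contrast in flat regions without amplifying noise as strongly as unrestricted equalization would.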
