

Hey there!

I wanted to share this awesome notebook I've been working on. It's all about processing medical imaging data and getting it ready for some serious machine learning action. Let me walk you through what's going on here:

What's Inside

First off, we've got a bunch of libraries that are super helpful for different parts of the process:

os, zipfile, glob: These are for handling files and directories. Think of them as our handy tools for organizing and accessing data.

pydicom: This one's a lifesaver for reading DICOM files, which are super common in medical imaging.

numpy, pandas: These are our go-to libraries for crunching numbers and managing data. They make it easy to work with large datasets.

xml.etree.ElementTree: This helps us deal with XML data, which often comes with medical images to store important metadata.

tensorflow, keras: These are the brains behind our deep learning models. They help us build and train neural networks.

h5py: This is great for handling HDF5 files, which are perfect for storing large amounts of data efficiently.

json: We use this for parsing JSON data, which is handy for configuration files and data exchange.

multiprocessing: This lets us speed things up by using multiple CPU cores. It's like having a team of helpers working together to get the job done faster.

gc: This gives us manual control over Python's garbage collector, so we can free memory between steps, which is crucial when dealing with big datasets.

logging: This is for keeping track of what's happening in our code. It's like a journal that helps us debug and understand the process better.

The Game Plan

Here's a quick rundown of what we're doing in this notebook:

Data Extraction: We start by extracting medical imaging data from ZIP files and reading DICOM files.

Data Preprocessing: Next, we convert the raw data into a format that's ready for machine learning. Here that means min-max normalizing each slice to the range [0, 1] and resizing it to 128×128.

Metadata Handling: We parse XML files to grab important metadata and store it in a structured format using pandas.

Data Augmentation: We use tensorflow and keras to create more diverse training data, which can help our models perform better.

Model Training: We define and train a deep learning model. Once it's trained, we can save it in HDF5 format.

Parallel Processing: We use multiprocessing to speed up the CPU-bound steps (XML parsing and DICOM preprocessing) by splitting the work across multiple CPU cores.

Memory Management: We keep an eye on memory usage with gc to make sure we don't run into any issues.

Logging: We log important events and errors to keep track of what's happening and debug any problems.

Why It's Cool

This notebook is a great template for processing medical imaging data and training deep learning models. You can easily adapt it to different datasets and models by tweaking the data preprocessing and model training sections. Plus, with parallel processing and memory management, it's built to cope with large datasets.

Wrap-Up

By following this pipeline, you can turn raw medical imaging data into powerful machine learning models that can do things like classify images, segment structures, and detect anomalies. It's a pretty awesome way to leverage the power of deep learning in the medical field.

Hope you find this helpful! Let me know if you have any questions or need a hand with anything.

Cheers!

!pip install pydicom tqdm h5py
from google.colab import drive
drive.mount('/content/drive')




# Part 1: Importing Libraries
This section imports all the necessary libraries for the script. These libraries cover a range of functionalities, including file handling, data processing, XML parsing, machine learning, and parallel processing.

In [None]:
import os
import zipfile
import glob
import pydicom
import numpy as np
import pandas as pd
import xml.etree.ElementTree as ET
import tensorflow as tf
from keras.utils import to_categorical
import h5py
import json
import multiprocessing
from multiprocessing import Pool, cpu_count
import gc
import logging


# Part 2: Setting Up Logging
This line sets up basic logging configuration. It specifies that log messages should include the timestamp, log level, and the message itself. This is useful for tracking the progress and debugging issues during execution.

In [None]:
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
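
To see what that format string produces, here is a small sketch that applies the same format to a separate logger writing to an in-memory buffer. The logger name `demo` and the buffer are assumptions made for illustration; the notebook itself configures the root logger through `basicConfig`.

```python
import io
import logging

LOG_FORMAT = '%(asctime)s - %(levelname)s - %(message)s'

def build_demo_logger(stream):
    """Return a logger (hypothetical name 'demo') that writes to `stream` using LOG_FORMAT."""
    logger = logging.getLogger('demo')
    logger.setLevel(logging.INFO)
    logger.handlers.clear()      # avoid duplicate handlers if the cell is re-run
    logger.propagate = False     # keep the demo output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger.addHandler(handler)
    return logger

log_buffer = io.StringIO()
demo_logger = build_demo_logger(log_buffer)
demo_logger.info('Extracted %d files', 3)
demo_logger.debug('This message is below INFO, so it is dropped')

# Prints one line: a timestamp followed by " - INFO - Extracted 3 files"
print(log_buffer.getvalue(), end='')
```

Because the level is INFO, the DEBUG call leaves no trace in the output.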


# Part 3: Defining the MedicalImagePreprocessor Class
This class is the core of the preprocessing pipeline. It initializes with paths for the input ZIP file, metadata file, and output file. It also defines directories and files for extracted data and annotations.

In [None]:
class MedicalImagePreprocessor:
    def __init__(self, zipPath, metadataPath, outputPath):
        self.zipPath = zipPath
        self.metadataPath = metadataPath
        self.outputPath = outputPath
        self.extractedDir = '/content/extractedData'
        self.annotationsFile = 'annotations.json'


# Part 4: Extracting ZIP Files
This method extracts the contents of the ZIP file into a specified directory. It ensures the directory exists before extracting the files.

In [None]:
    def extractZipFiles(self):
        os.makedirs(self.extractedDir, exist_ok=True)
        with zipfile.ZipFile(self.zipPath, 'r') as zipRef:
            zipRef.extractall(self.extractedDir)
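
If you want to try the extraction step without the real dataset, here's a minimal sketch that builds a tiny ZIP in a temporary directory and extracts it the same way. The file names inside the archive and the `extract_zip` helper are made up for the demo.

```python
import os
import tempfile
import zipfile

def extract_zip(zip_path, target_dir):
    """Extract every member of zip_path into target_dir, creating it if needed."""
    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(target_dir)
    # Return the extracted files as paths relative to target_dir.
    return sorted(
        os.path.relpath(os.path.join(root, name), target_dir)
        for root, _, names in os.walk(target_dir)
        for name in names
    )

with tempfile.TemporaryDirectory() as work_dir:
    zip_path = os.path.join(work_dir, 'sample.zip')
    # Hypothetical archive contents, just to have something to extract.
    with zipfile.ZipFile(zip_path, 'w') as zf:
        zf.writestr('LIDC-IDRI/case1/scan.xml', '<root/>')
        zf.writestr('metadata.csv', 'Modality\nCT\n')
    extracted = extract_zip(zip_path, os.path.join(work_dir, 'extractedData'))
    print(extracted)
```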


# Part 5: Parsing XML Files
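This method parses an XML file to extract the study UID and the highest malignancy score among its nodules (0 when no score is present). It handles namespaces and searches for specific elements within the XML structure. If anything goes wrong while parsing, it returns (None, None).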

In [None]:
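    def parseXmlFile(self, xmlPath):
        try:
            tree = ET.parse(xmlPath)
            rootXml = tree.getroot()
            namespaces = {'nih': 'http://www.nih.gov'}

            # Prefer StudyInstanceUID; fall back to CXRSeriesInstanceUid when it is absent.
            studyUidElement = rootXml.find('.//nih:StudyInstanceUID', namespaces)
            if studyUidElement is None:
                studyUidElement = rootXml.find('.//nih:CXRSeriesInstanceUid', namespaces)

            studyUid = studyUidElement.text if studyUidElement is not None else None

            # Collect every malignancy rating found under the unblindedReadNodule elements.
            malignancyScores = []
            for nodule in rootXml.findall('.//nih:unblindedReadNodule', namespaces):
                malignancyElement = nodule.find('.//nih:malignancy', namespaces)
                if malignancyElement is not None and malignancyElement.text:
                    malignancyScores.append(int(malignancyElement.text))

            # Use the highest rating in the file; 0 means no rating was found.
            return studyUid, max(malignancyScores) if malignancyScores else 0
        except Exception:
            # Log the failure instead of silently dropping the file.
            logging.exception('Failed to parse %s', xmlPath)
            return None, None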
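
Here's a minimal sketch of the namespace lookup on a tiny, made-up XML document. The sample below is heavily simplified and is an assumption for illustration only, not the real annotation schema. The namespace URI and element names are the ones used in the method above.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {'nih': 'http://www.nih.gov'}

# Hypothetical, simplified annotation document for illustration only.
SAMPLE_XML = """<LidcReadMessage xmlns="http://www.nih.gov">
  <ResponseHeader>
    <StudyInstanceUID>1.2.3.4</StudyInstanceUID>
  </ResponseHeader>
  <readingSession>
    <unblindedReadNodule>
      <characteristics><malignancy>2</malignancy></characteristics>
    </unblindedReadNodule>
    <unblindedReadNodule>
      <characteristics><malignancy>5</malignancy></characteristics>
    </unblindedReadNodule>
    <unblindedReadNodule/>
  </readingSession>
</LidcReadMessage>
"""

def study_uid_and_max_malignancy(xml_text):
    """Return (study UID, highest malignancy score or 0) from an XML string."""
    root = ET.fromstring(xml_text)
    uid_element = root.find('.//nih:StudyInstanceUID', NAMESPACES)
    study_uid = uid_element.text if uid_element is not None else None
    scores = []
    for nodule in root.findall('.//nih:unblindedReadNodule', NAMESPACES):
        malignancy = nodule.find('.//nih:malignancy', NAMESPACES)
        if malignancy is not None and malignancy.text:
            scores.append(int(malignancy.text))
    return study_uid, max(scores) if scores else 0

result = study_uid_and_max_malignancy(SAMPLE_XML)
print(result)  # ('1.2.3.4', 5)

# Without the namespace prefix the same element is not found.
unprefixed = ET.fromstring(SAMPLE_XML).find('.//StudyInstanceUID')
print(unprefixed)  # None
```

The last two lines show why the `namespaces` mapping matters: the document declares a default namespace, so a plain tag name doesn't match anything.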


# Part 6: Parsing XML Annotations
This method processes all XML files in a directory to extract annotations. It uses parallel processing to speed up the operation and saves the results to a JSON file.

In [None]:
    def parseXmlAnnotations(self):
        lidcDir = os.path.join(self.extractedDir, 'LIDC-IDRI')
        xmlFiles = glob.glob(os.path.join(lidcDir, '**/*.xml'), recursive=True)

        with Pool(max(1, cpu_count() - 1)) as pool:
            results = pool.map(self.parseXmlFile, xmlFiles)

        annotations = {studyUid: malignancy for studyUid, malignancy in results if studyUid}

        with open(self.annotationsFile, "w") as f:
            json.dump(annotations, f)

        return annotations


# Part 7: Loading Annotations
This method loads annotations from a JSON file if it exists. It returns an empty dictionary if the file is not found.

In [None]:
    def loadAnnotations(self):
        if os.path.exists(self.annotationsFile):
            with open(self.annotationsFile, "r") as f:
                return json.load(f)
        return {}
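
The save-then-load round trip can be sketched like this, using a temporary directory so nothing is left behind. The helper names and the sample UIDs are hypothetical.

```python
import json
import os
import tempfile

def save_annotations(annotations, path):
    """Write the {studyUid: malignancy} mapping to a JSON file."""
    with open(path, 'w') as f:
        json.dump(annotations, f)

def load_annotations(path):
    """Return the stored mapping, or an empty dict if the file is missing."""
    if os.path.exists(path):
        with open(path, 'r') as f:
            return json.load(f)
    return {}

with tempfile.TemporaryDirectory() as work_dir:
    annotations_path = os.path.join(work_dir, 'annotations.json')
    missing = load_annotations(annotations_path)       # file not written yet
    save_annotations({'1.2.3.4': 5, '1.2.3.5': 0}, annotations_path)
    restored = load_annotations(annotations_path)

print(missing)   # {}
print(restored)  # {'1.2.3.4': 5, '1.2.3.5': 0}
```

One thing to keep in mind: JSON object keys are always strings, which works here because study UIDs are strings already.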


# Part 8: Processing DICOM Files
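This method processes a DICOM file to extract the pixel data and min-max normalize it to the range [0, 1]. It resizes the image to 128×128 and assigns a label based on the annotations: 1 if the study's malignancy score is above 3, otherwise 0.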

In [None]:
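    def processDicomFile(self, dicomPath, studyUid, annotations):
        try:
            ds = pydicom.dcmread(dicomPath)
            image = ds.pixel_array.astype(np.float32)

            # Min-max normalize to [0, 1]; a constant image has zero range, so map it to zeros.
            minValue, maxValue = np.min(image), np.max(image)
            valueRange = maxValue - minValue
            image = (image - minValue) / valueRange if valueRange > 0 else np.zeros_like(image)

            # Add a channel axis and resize to 128x128.
            image = np.array(tf.image.resize(image[..., np.newaxis], (128, 128)))

            # Malignancy scores above 3 are labeled positive (1); everything else is 0.
            label = 1 if annotations.get(studyUid, 0) > 3 else 0
            return image, label
        except Exception:
            # Log the failure so unreadable files are visible in the output.
            logging.exception('Failed to process %s', dicomPath)
            return None, None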
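
Min-max normalization divides by (max - min), which is zero when every pixel has the same value. Here's a minimal sketch of the normalization and labeling logic on small arrays, with a guard for that case. The helper names are hypothetical, and mapping a constant image to zeros is just one reasonable choice. The threshold of 3 comes from the method above.

```python
import numpy as np

def min_max_normalize(image):
    """Scale an array to [0, 1]; a constant array becomes all zeros."""
    image = image.astype(np.float32)
    low, high = image.min(), image.max()
    if high == low:
        # Zero range: avoid dividing by zero.
        return np.zeros_like(image)
    return (image - low) / (high - low)

def label_from_score(score, threshold=3):
    """Return 1 when the malignancy score is above the threshold, else 0."""
    return 1 if score > threshold else 0

ramp = np.array([[0, 50], [100, 200]], dtype=np.int16)
flat = np.full((2, 2), 7, dtype=np.int16)

normalized_ramp = min_max_normalize(ramp)
normalized_flat = min_max_normalize(flat)

print(normalized_ramp)  # values 0.0, 0.25, 0.5, 1.0
print(normalized_flat)  # all zeros
print(label_from_score(3), label_from_score(4))  # 0 1
```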


# Part 9: Preprocessing Images
This method preprocesses all the images in the dataset. It reads the metadata, processes DICOM files, and saves the preprocessed data to an HDF5 file.

In [None]:
    def preprocessImages(self):
        metadataDf = pd.read_csv(self.metadataPath)
        ctSeries = metadataDf[metadataDf['Modality'] == 'CT']

        annotations = self.loadAnnotations()
        if not annotations:
            annotations = self.parseXmlAnnotations()

        slices, labels, studyUids = [], [], []

        # Create the worker pool once and reuse it for every series.
        with Pool(max(1, cpu_count() // 2)) as pool:
            for _, row in ctSeries.iterrows():
                studyUid = row['Study UID']
                # Normalize separators and drop a leading './' prefix. A prefix check
                # is used because str.lstrip strips a set of characters, not a prefix.
                fileLocation = row['File Location'].replace('\\', '/')
                if fileLocation.startswith('./'):
                    fileLocation = fileLocation[2:]
                patientDir = os.path.join(self.extractedDir, fileLocation)

                if not os.path.exists(patientDir):
                    continue

                dicomFiles = glob.glob(os.path.join(patientDir, "*.dcm"))
                results = pool.starmap(self.processDicomFile,
                                       [(dicomFile, studyUid, annotations) for dicomFile in dicomFiles])

                # Keep only the slices that were read successfully.
                for image, label in results:
                    if image is not None and label is not None:
                        slices.append(image)
                        labels.append(label)
                        studyUids.append(studyUid)

        slices = np.array(slices)
        labels = np.array(labels)

        # dirname is '' for a bare filename, and os.makedirs('') raises an error.
        outputDir = os.path.dirname(self.outputPath)
        if outputDir:
            os.makedirs(outputDir, exist_ok=True)
        with h5py.File(self.outputPath, 'w') as hdf:
            hdf.create_dataset('slices', data=slices, compression='gzip', chunks=True)
            hdf.create_dataset('labels', data=labels, compression='gzip')
            hdf.create_dataset('studyUids', data=np.array(studyUids, dtype='S'), compression='gzip')

        return slices, labels, studyUids
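
One detail worth a closer look is how the 'File Location' value gets turned into a path. The sketch below assumes those values are relative paths that start with `./` or `.\` (that's a guess about the metadata file, not something the notebook states), and the helper name is hypothetical. It also shows why `str.lstrip` is a risky way to remove a prefix.

```python
import posixpath

def normalize_file_location(file_location):
    """Return a relative, forward-slash path for a './x' or '.\\x' style value."""
    cleaned = file_location.replace('\\', '/')
    if cleaned.startswith('./'):
        cleaned = cleaned[2:]
    # A leading '/' would make a path join discard the base directory.
    return cleaned.lstrip('/')

base_dir = '/content/extractedData'
windows_style = normalize_file_location('.\\LIDC-IDRI\\LIDC-IDRI-0001\\series')
posix_style = normalize_file_location('./LIDC-IDRI/LIDC-IDRI-0001/series')
dotted_name = normalize_file_location('./.cache/series')

# str.lstrip treats its argument as a set of characters. Here it removes the
# '.' but leaves the '/', so the result looks like an absolute path.
stripped = './LIDC-IDRI/series'.lstrip('.\\')

print(posixpath.join(base_dir, windows_style))
print(posixpath.join(base_dir, stripped))  # the base directory is lost
```

`posixpath` is used so the demo behaves the same on any operating system; on Colab, `os.path` is the same module.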


# Part 10: Cleaning Up
This method cleans up the extracted data and annotations file to free up space.

In [None]:
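    def cleanup(self):
        import shutil  # local import; shutil is not in the Part 1 import cell
        # Delete without a shell so paths containing spaces are handled safely.
        shutil.rmtree(self.extractedDir, ignore_errors=True)
        if os.path.exists(self.annotationsFile):
            os.remove(self.annotationsFile)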
