# Case Study: Robust DICOM Series Classification for Preprocessing Pipelines

**Goal:** Develop a reliable method for classifying medical imaging series (CT, MR) into specific subtypes (e.g., CTA, MRA).

**Problem:** Medical imaging data acquired from different hospitals and scanners exhibits high variability in its metadata. A single study type might have dozens of different SeriesDescription labels due to vendor differences, local hospital protocols, and even language variations.

**Application:** Accurate classification is critical for downstream automated processing. For example, CT series typically need conversion to Hounsfield Units (HU) followed by windowing, while MR series typically need z-score normalization or min-max scaling. Even within a single MR study, an MRA sequence and a T2-weighted sequence require different handling.
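
To make the difference between the two preprocessing paths concrete, here is a minimal sketch of the transformations named above. It assumes NumPy is available and that the slope and intercept come from the DICOM `RescaleSlope` and `RescaleIntercept` attributes. The function names and the window settings in the usage note are illustrative assumptions, not part of this notebook's pipeline.

```python
import numpy as np

def to_hounsfield(raw: np.ndarray, slope: float, intercept: float) -> np.ndarray:
    """Apply the linear rescale (slope * raw + intercept) to obtain HU values."""
    return raw.astype(np.float32) * slope + intercept

def apply_window(hu: np.ndarray, center: float, width: float) -> np.ndarray:
    """Clip HU values to a window and scale the result to [0, 1]."""
    low, high = center - width / 2.0, center + width / 2.0
    return (np.clip(hu, low, high) - low) / (high - low)

def zscore(volume: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize an MR volume to zero mean and (approximately) unit variance."""
    volume = volume.astype(np.float32)
    return (volume - volume.mean()) / (volume.std() + eps)
```

For a CT series, one might call `apply_window(to_hounsfield(pixels, slope, intercept), center=40, width=400)`, where the window values are placeholders to be chosen per task. For an MR series, `zscore(pixels)` is used instead, because MR intensities have no absolute scale comparable to HU.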

In [2]:
import re
import pydicom
import pandas as pd
from pathlib import Path
from collections import Counter
from pydicom.dataset import Dataset

data_root = Path("/fsx")

First, let's explore the raw metadata to understand the scale of the challenge. We read the first DICOM file from each series and collect three metadata fields: SeriesDescription, SpecificCharacterSet, and the CodeMeaning of each item in ProcedureCodeSequence.

In [3]:
series_dir = data_root / "raw" / "series"

series_descriptions = []
char_sets = []
procedure_code_meanings = []

for uid_dir in series_dir.iterdir():
    files = sorted(uid_dir.glob("*.dcm"))
    if not files:
        continue
    # Only header fields are needed, so skip the pixel data when reading.
    ds = pydicom.dcmread(str(files[0]), stop_before_pixels=True)
    desc = str(getattr(ds, "SeriesDescription", "")).upper()
    if desc:
        series_descriptions.append(desc)
    char_set = str(getattr(ds, "SpecificCharacterSet", "")).upper()
    if char_set:
        char_sets.append(char_set)
    try:
        # ProcedureCodeSequence is optional; treat a missing sequence as empty.
        for item in getattr(ds, "ProcedureCodeSequence", []):
            code_meaning = str(getattr(item, "CodeMeaning", "")).upper()
            if code_meaning:
                procedure_code_meanings.append(code_meaning)
    except Exception:
        # Best-effort exploration: skip series whose sequence cannot be parsed.
        pass

print("--- DICOM Metadata Summary ---")
print("\nSeries Descriptions:")
print(Counter(series_descriptions).most_common(10))

print("\nSpecific Character Sets:")
print(Counter(char_sets).most_common(10))

print("\nProcedure Code Meanings:")
print(Counter(procedure_code_meanings).most_common(10))
print("-----------------------------")

--- DICOM Metadata Summary ---

Series Descriptions:
[('T2 TSE', 165), ('MRA', 157), ('TOF_3D_MULTI-SLAB', 109), ('AX T2', 92), ('ANEURYSM', 90), ('AX COW', 78), ('AX T2 PROPELLER', 68), ('TOF_FL3D_TRA', 59), ('CTA HEAD & NECK, IDOSE (3)', 56), ('CTA 3.000 CE', 56)]

Specific Character Sets:
[('ISO_IR 100', 3900), ('ISO_IR 166', 92), ('ISO 2022 IR 100', 6), ('ISO_IR 101', 2), ('ISO_IR 192', 1)]

Procedure Code Meanings:
[('BT ANJIYOGRAFI, BEYIN - BOYUN', 45), ('MRA BRAIN WO/W CONTRAST', 28), ('MRG BEYIN, INME (DIFUZYON,SWI, BEYIN MRA DAHIL)', 25), ('BT ANJIYOGRAFI,BEYIN', 23), ('MRI BRAIN WO/W CONTRAST', 23), ('MRI BRAIN W/O CONTRAST', 13), ('CT ANGIOGRAM HEAD AND NECK WITH CONTRAST', 13), ('MRI/MRA BRAIN WITHOUT AND WITH CONTRAST (INCLUDE VWI)', 8), ('MR BRAIN/MR ANGIOGRAM WITH AND WITHOUT CONTRAST', 7), ('CT VASCULAR', 7)]
-----------------------------


The exploration reveals several challenges:

- High Variability: T2 scans are labeled as 'T2 TSE', 'AX T2', 'AX T2 PROPELLER', etc.

- Implicit Angiography: MRA scans are often not labeled "MRA" but use vendor-specific sequence names like 'TOF_3D_MULTI-SLAB' (Time-of-Flight) or refer to anatomy like 'AX COW' (Circle of Willis).

- Keyword Ambiguity: We need to identify strong keywords like 'ANEURYSM', 'CAROTID', and 'TOF' that signal an angiogram (CTA/MRA) even when the description doesn't explicitly state it.

- Language Barriers: As seen in ProcedureCodeMeaning, terms like 'ANJIYOGRAFI' (Turkish for Angiography) require handling.
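
The language and formatting issues listed above can be reduced by normalizing text before any keyword matching. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and the keyword tuple contains only the three terms the classifier below checks in `CodeMeaning`; a real deployment would likely need a longer, curated list.

```python
import re
import unicodedata

# Angiography terms; ANJIYOGRAFI is the Turkish spelling seen in the metadata.
ANGIO_KEYWORDS = ("ANGIO", "ANJIYOGRAFI", "MRA")

def normalize_text(text: str) -> str:
    """Strip diacritics, uppercase, and collapse separators to single spaces."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^A-Z0-9]+", " ", no_marks.upper()).strip()

def mentions_angiography(text: str) -> bool:
    """Return True if any angiography keyword occurs in the normalized text."""
    normalized = normalize_text(text)
    return any(keyword in normalized for keyword in ANGIO_KEYWORDS)
```

With this normalization, `'BT ANJIYOGRAFI, BEYIN - BOYUN'` and a variant written with dotted capital letters reduce to the same string, and vendor separators such as `_` and `-` in `'TOF_3D_MULTI-SLAB'` become plain spaces.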

In [71]:
def get_upper_str(ds: Dataset, attr: str) -> str:
    """Safely gets an attribute from a pydicom dataset and returns it as an uppercase string."""
    return str(getattr(ds, attr, "") or "").upper()

def _infer_subtype(ds_first: Dataset) -> str:
    """Infers a broad subtype from DICOM metadata using modality-specific keyword sets."""
    mod = get_upper_str(ds_first, "Modality")
    desc = get_upper_str(ds_first, "SeriesDescription") + " " + get_upper_str(ds_first, "ProtocolName")

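    # ProcedureCodeSequence is optional and describes the whole procedure. When it
    # is present, an angiography term in any CodeMeaning takes precedence over the
    # series-level description checked further down.
    try:
        proc_code = " ".join(
            get_upper_str(item, "CodeMeaning")
            for item in getattr(ds_first, "ProcedureCodeSequence", None) or []
        )
        if any(k in proc_code for k in ["ANGIO", "ANJIYOGRAFI", "MRA"]):
            # Only relabel CT and MR; other modalities fall through unchanged
            # instead of being reported as MRA.
            if mod == "CT":
                return "CTA"
            if mod == "MR":
                return "MRA"
    except Exception:
        # A malformed sequence should not prevent description-based inference.
        pass
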
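    # Keywords are matched as substrings of the combined description, so shorter
    # entries already cover longer ones (e.g. "TOF" covers "3DTOF").
    # "SPGR" appears in both the MRA and T1 lists; because MRA is checked first,
    # any MR description containing "SPGR" is classified as MRA.
    cta_keywords = ["CTA", "ANGIO", "CAROTID", "MIP", "AX-MIP"]
    mra_keywords = ["MRA", "ANGIO", "TOF", "3DTOF", "CEMRA", "FL3D", "SPGR"]
    t1_tokens = ["T1", "MPRAGE", "BRAVO", "SPGR", "FSPGR", "MPR"]
    t1_post_tokens = ["POST", "+C", "GAD", "GD", "CONTRAST"]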

    if mod == "CT":
        if any(k in desc for k in cta_keywords):
            return "CTA"
        return "CT"

    if mod == "MR":
        if any(k in desc for k in mra_keywords) or re.search(r'\bCOW\b', desc):
            return "MRA"
        
        if any(k in desc for k in t1_tokens) and any(k in desc for k in t1_post_tokens):
            return "MRI T1post"
        
        if "T2" in desc:
            return "MRI T2"
        
        return "MR"

    return mod if mod else "Unknown"
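
Because `_infer_subtype` relies on substring checks plus one `\b` regex, two edge cases are worth illustrating: a short keyword can match inside an unrelated word, and `\b` treats `_` as a word character. The sketch below compares those checks with token-based matching. The helper names and the first two example descriptions are hypothetical; this is an alternative to consider, not the method used in this notebook.

```python
import re

def tokenize(description: str) -> set[str]:
    """Split an uppercased description on any run of non-alphanumeric characters."""
    return {tok for tok in re.split(r"[^A-Z0-9]+", description.upper()) if tok}

def has_token(description: str, keywords: set[str]) -> bool:
    """Return True if at least one keyword appears as a whole token."""
    return not tokenize(description).isdisjoint(keywords)

# Substring matching treats "POSTERIOR" as a post-contrast marker; tokens do not.
substring_hit = "POST" in "T1 SAG POSTERIOR FOSSA"
token_hit = has_token("T1 SAG POSTERIOR FOSSA", {"POST", "GD", "GAD"})

# "_" counts as a word character, so \bCOW\b misses underscore-delimited names.
regex_hit = bool(re.search(r"\bCOW\b", "AX_COW_0.6"))
token_cow = has_token("AX_COW_0.6", {"COW"})

# Trade-off: whole-token matching misses fused names that substrings catch.
fused_hit = has_token("DE_CAROTIDANGIO AX SOFT THIN  F_0.7", {"ANGIO", "CAROTID"})
```

Token matching removes the false positive and the underscore miss, but it no longer recognizes fused vendor names such as `DE_CAROTIDANGIO`, which the substring rule classifies as CTA. A practical rule set may need both styles, chosen per keyword.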

In [72]:
series_dir = data_root / "raw" / "series"
val_results = []

for uid_dir in series_dir.iterdir():
    files = sorted(uid_dir.glob("*.dcm"))
    if files:
        # Header fields are sufficient for classification; skip the pixel data.
        ds = pydicom.dcmread(str(files[0]), stop_before_pixels=True)
        subtype = _infer_subtype(ds)
        val_results.append({
            "Modality": get_upper_str(ds, "Modality"),
            "SeriesDescription": get_upper_str(ds, "SeriesDescription"),
            "InferredSubtype": subtype,
        })

df_val = pd.DataFrame(val_results)
print(df_val.head(30))

   Modality                    SeriesDescription InferredSubtype
0        MR                          AX T2 FRFSE          MRI T2
1        MR                                  MRA             MRA
2        CT                                                   CT
3        MR                                                   MR
4        CT                          ANGIO THINS             CTA
5        MR                        AX TOF 3D COW             MRA
6        MR                       AX TOF COW 0.6             MRA
7        CT  DE_CAROTIDANGIO AX SOFT THIN  F_0.7             CTA
8        MR                             T2 AX FS          MRI T2
9        CT                 ANGIO 0.8, IDOSE (3)             CTA
10       MR                       AX TOF COW 0.6             MRA
11       MR           T1FS_3D_AX WHOLE BRAIN_GD.      MRI T1post
12       MR                          AX T2 FRFSE          MRI T2
13       MR                      AX T2 TSE BLADE          MRI T2
14       CT              
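
Reading the first rows by eye is a useful spot check, but a tally makes gaps easier to see. The sketch below uses only the standard library and a handful of rows copied from the output above. The `summarize` helper is a hypothetical name. It counts series per (modality, subtype) pair and collects the descriptions of series that fell through to the bare modality label.

```python
from collections import Counter

# Rows 0-5 of the validation output above, copied verbatim.
sample_rows = [
    {"Modality": "MR", "SeriesDescription": "AX T2 FRFSE", "InferredSubtype": "MRI T2"},
    {"Modality": "MR", "SeriesDescription": "MRA", "InferredSubtype": "MRA"},
    {"Modality": "CT", "SeriesDescription": "", "InferredSubtype": "CT"},
    {"Modality": "MR", "SeriesDescription": "", "InferredSubtype": "MR"},
    {"Modality": "CT", "SeriesDescription": "ANGIO THINS", "InferredSubtype": "CTA"},
    {"Modality": "MR", "SeriesDescription": "AX TOF 3D COW", "InferredSubtype": "MRA"},
]

def summarize(rows):
    """Count (modality, subtype) pairs and the descriptions left unrefined."""
    subtype_counts = Counter((r["Modality"], r["InferredSubtype"]) for r in rows)
    fallthrough = Counter(
        r["SeriesDescription"] for r in rows if r["InferredSubtype"] == r["Modality"]
    )
    return subtype_counts, fallthrough

subtype_counts, fallthrough = summarize(sample_rows)
```

In this sample, both fall-through series have an empty SeriesDescription, which suggests checking whether ProtocolName or other fields could classify them. On the full `df_val`, `pd.crosstab(df_val["Modality"], df_val["InferredSubtype"])` gives the same kind of overview.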

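**Outcome:** On the rows inspected above, the rule-based classifier maps heterogeneous DICOM metadata to a small set of consistent categories (CT, CTA, MR, MRA, MRI T1post, MRI T2). Series with an empty SeriesDescription fall back to the bare modality.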

**Impact:** These labels let an automated preprocessing pipeline apply the transformation appropriate to each scan type (e.g., windowing for CTA, intensity normalization for MRA), which reduces manual intervention. The iterative explore-build-validate process shown here is essential for handling real-world medical data.