# Comparing DICOMs with `rosamllib.compare_dcms` — A Hands-On Tutorial

This notebook walks through comparing two DICOM files using **`rosamllib.compare_dcms`**.

**What you'll do:**

1. Generate two tiny DICOMs (no external data needed).
2. Run a quick comparison and read the text/JSON results.
3. Render an interactive HTML report (nested, collapsible, filterable).
4. Use a YAML config for ignores and **sequence item key** matching.
5. Switch between **“show everything”** (including matches) and **“diffs only”**.
6. Do programmatic analysis with Pandas/CSV.

> The goal is to help you understand **what changed and where**, including inside **nested sequences**, while skipping huge byte blobs (e.g., PixelData).

## Table of Contents
1. [Setup](#setup)
2. [Create tiny DICOMs for the demo](#tiny)
3. [Quick compare (sensible defaults)](#quick)
4. [Human-readable text and JSON](#textjson)
5. [HTML report (inline)](#html)
6. [Save the HTML report](#savehtml)
7. [YAML config: ignores + sequence key matching](#yaml)
8. [“Show everything” vs “Diffs only”](#allvsdiffs)
9. [Programmatic analysis (Pandas/CSV)](#pandas)
10. [Tips & Troubleshooting](#tips)
11. [Use the CLI from the notebook (optional)](#cli)

## 1) Setup <a id='setup'></a>

In [1]:
%load_ext autoreload
%autoreload 2

from rosamllib.compare_dcms import DiffOptions, Tolerance, compare_files, load_yaml_config
from IPython.display import HTML, display
import pandas as pd
import os, textwrap, json

## 2) Create tiny DICOMs for the demo <a id='tiny'></a>

### Why generate tiny DICOMs?
- You can test the comparison logic without hunting for sample data.
- The files are small, simple, and deterministic.
- We deliberately change one field (`StationName`) so we can see what the diffs look like.

We’ll create two DICOM files:
- `demo_left.dcm`: the base file.
- `demo_right.dcm`: identical except for `StationName`. Its `SOPInstanceUID` also differs, because `make_tiny_dcm` derives that UID from the current time.

Both files include a simple **nested sequence** (`RequestAttributesSequence`) so we can demonstrate sequence handling.

In [2]:
from pydicom.dataset import Dataset, FileDataset
from pydicom.uid import ExplicitVRLittleEndian
import time, datetime as dt

def make_tiny_dcm(path, *, patient_name="DOE^JOHN", station_name="CT-01", study_desc="TEST", req_id="PROC-001", extra=None):
    """Create a minimal, valid DICOM file with a small nested sequence."""
    meta = Dataset()
    meta.MediaStorageSOPClassUID = "1.2.840.10008.5.1.4.1.1.2"  # CT Image Storage
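    # Timestamp-based UID for the demo; two calls within the same millisecond would
    # collide, so pydicom.uid.generate_uid() is the safer choice in real code.
    meta.MediaStorageSOPInstanceUID = f"1.2.826.0.1.3680043.2.1125.{int(time.time()*1000)}"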
    meta.TransferSyntaxUID = ExplicitVRLittleEndian

    ds = FileDataset(path, {}, file_meta=meta, preamble=b"\0"*128)
    ds.is_little_endian = True
    ds.is_implicit_VR = False

    # Common tags
    ds.PatientName = patient_name
    ds.PatientID = "12345"
    ds.StudyDescription = study_desc
    ds.StationName = station_name
    ds.StudyInstanceUID = "1.2.3.4.5.6"
    ds.SeriesInstanceUID = "1.2.3.4.5.6.7"
    ds.SOPClassUID = meta.MediaStorageSOPClassUID
    ds.SOPInstanceUID = meta.MediaStorageSOPInstanceUID
    ds.Modality = "CT"
    ds.StudyDate = dt.date.today().strftime("%Y%m%d")

    # Simple nested sequence
    ds.RequestAttributesSequence = []
    item = Dataset()
    item.RequestedProcedureID = req_id
    ds.RequestAttributesSequence.append(item)

    # Optional tweaks
    if extra:
        for k, v in extra.items():
            setattr(ds, k, v)

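    # write_like_original=False is the pydicom 2.x spelling; pydicom 3.x prefers
    # ds.save_as(path, enforce_file_format=True).
    ds.save_as(path, write_like_original=False)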
    return path

LEFT  = os.path.abspath("demo_left.dcm")
RIGHT = os.path.abspath("demo_right.dcm")

make_tiny_dcm(LEFT,  station_name="CT-01", req_id="PROC-001")
make_tiny_dcm(RIGHT, station_name="CT-02", req_id="PROC-001")  # intentional difference: StationName (SOPInstanceUID also differs, since it is timestamp-based)

LEFT, RIGHT

('c:\\Users\\yabdulkadir\\Desktop\\open_source\\Github\\rosamllib\\examples\\demo_left.dcm',
 'c:\\Users\\yabdulkadir\\Desktop\\open_source\\Github\\rosamllib\\examples\\demo_right.dcm')

## 3) Quick compare (sensible defaults) <a id='quick'></a>

By default we’ll:
- **ignore private tags**,
- **ignore bulk/byte-heavy fields** (e.g., PixelData),
- apply a small **numeric tolerance** for numeric VRs,
- and **collect matches** (`ok` rows) so the HTML shows *everything* (you can filter in the UI).

In [3]:
opts = DiffOptions(
    ignore_private=True,
    ignore_bulk=True,
    numeric_tol=Tolerance(1e-6),
    case_insensitive_strings=False,
    collect_all_matches=True,  # show ok rows too
)

report = compare_files(LEFT, RIGHT, opts)
summary = report.to_dict()["summary"]
summary

{'total': 12, 'by_severity': {'diff': 2, 'warn': 0, 'info': 0, 'ok': 10}}
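
To make the numeric tolerance concrete, here is a minimal sketch of an absolute-tolerance check. It is not rosamllib's implementation: `values_match` is a hypothetical helper, and the assumption is simply that numeric values count as equal when they differ by at most the tolerance, while everything else is compared exactly.

```python
import math


def values_match(left, right, abs_tol=1e-6):
    """Return True when two values should be treated as equal.

    Hypothetical helper for illustration only (not part of rosamllib).
    Numeric pairs use an absolute tolerance; other types use exact equality.
    """
    numeric = (int, float)
    if isinstance(left, numeric) and isinstance(right, numeric):
        # rel_tol=0.0 makes this a purely absolute comparison.
        return math.isclose(left, right, rel_tol=0.0, abs_tol=abs_tol)
    return left == right


print(values_match(1.0, 1.0000005))    # difference is below 1e-6
print(values_match(1.0, 1.00001))      # difference is above 1e-6
print(values_match("CT-01", "CT-02"))  # strings are compared exactly
```

An absolute tolerance suits values of known scale (for example, millimetre positions); values spanning many orders of magnitude would usually call for a relative tolerance instead.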

## 4) Human-readable text and JSON <a id='textjson'></a>

In [4]:
# Text (one line per row; because collect_all_matches=True, the "Diffs: N" header counts matches too)
print(report.to_text())

# JSON (peek at the summary)
data = report.to_dict()
print("Total rows:", data["summary"]["total"]) 
print("By severity:", data["summary"]["by_severity"])

Diffs: 12
- SOPClassUID: '1.2.840.10008.5.1.4.1.1.2' != '1.2.840.10008.5.1.4.1.1.2' [match]
- SOPInstanceUID: '1.2.826.0.1.3680043.2.1125.1758961220163' != '1.2.826.0.1.3680043.2.1125.1758961220165' [value differs (tol=1e-06)]
- StudyDate: '20250927' != '20250927' [match]
- Modality: 'CT' != 'CT' [match]
- StationName: 'CT-01' != 'CT-02' [value differs (tol=1e-06)]
- StudyDescription: 'TEST' != 'TEST' [match]
- PatientName: 'DOE^JOHN' != 'DOE^JOHN' [match]
- PatientID: '12345' != '12345' [match]
- StudyInstanceUID: '1.2.3.4.5.6' != '1.2.3.4.5.6' [match]
- SeriesInstanceUID: '1.2.3.4.5.6.7' != '1.2.3.4.5.6.7' [match]
- RequestAttributesSequence.length: 1 != 1 [match]
- RequestAttributesSequence.[0].RequestedProcedureID: 'PROC-001' != 'PROC-001' [match]
Total rows: 12
By severity: {'diff': 2, 'warn': 0, 'info': 0, 'ok': 10}


## 5) HTML report (inline) <a id='html'></a>

The HTML report is:
- **Nested & collapsible:** mirrors your sequence paths (e.g., `RequestAttributesSequence.[0].RequestedProcedureID`).
- **Filterable:** search box filters by any text.
- **Toggleable severities:** quickly switch between *All*, *Diffs only*, or *Non-matches*.

Because we set `collect_all_matches=True`, you’ll see both **matches** (`ok`) and **differences** (`diff`).

In [5]:
HTML(report.to_html(title="DICOM Diff — Inline Report"))

| Field | Value |
| --- | --- |
| path | `c:\Users\yabdulkadir\Desktop\open_source\Github\rosamllib\examples\demo_left.dcm` |
| SOPClassUID | 1.2.840.10008.5.1.4.1.1.2 |
| SOPInstanceUID | 1.2.826.0.1.3680043.2.1125.1758961220163 |
| Modality | CT |
| StudyInstanceUID | 1.2.3.4.5.6 |
| SeriesInstanceUID | 1.2.3.4.5.6.7 |

| Field | Value |
| --- | --- |
| path | `c:\Users\yabdulkadir\Desktop\open_source\Github\rosamllib\examples\demo_right.dcm` |
| SOPClassUID | 1.2.840.10008.5.1.4.1.1.2 |
| SOPInstanceUID | 1.2.826.0.1.3680043.2.1125.1758961220165 |
| Modality | CT |
| StudyInstanceUID | 1.2.3.4.5.6 |
| SeriesInstanceUID | 1.2.3.4.5.6.7 |

| Severity | Path | Left | Right | Note |
| --- | --- | --- | --- | --- |
| ok | SOPClassUID | 1.2.840.10008.5.1.4.1.1.2 | 1.2.840.10008.5.1.4.1.1.2 | match |
| diff | SOPInstanceUID | 1.2.826.0.1.3680043.2.1125.1758961220163 | 1.2.826.0.1.3680043.2.1125.1758961220165 | value differs (tol=1e-06) |
| ok | StudyDate | 20250927 | 20250927 | match |
| ok | Modality | CT | CT | match |
| diff | StationName | CT-01 | CT-02 | value differs (tol=1e-06) |
| ok | StudyDescription | TEST | TEST | match |
| ok | PatientName | DOE^JOHN | DOE^JOHN | match |
| ok | PatientID | 12345 | 12345 | match |
| ok | StudyInstanceUID | 1.2.3.4.5.6 | 1.2.3.4.5.6 | match |
| ok | SeriesInstanceUID | 1.2.3.4.5.6.7 | 1.2.3.4.5.6.7 | match |

| Severity | Path | Left | Right | Note |
| --- | --- | --- | --- | --- |
| ok | `RequestAttributesSequence.length` | 1 | 1 | match |

| Severity | Path | Left | Right | Note |
| --- | --- | --- | --- | --- |
| ok | `RequestAttributesSequence.[0].RequestedProcedureID` | PROC-001 | PROC-001 | match |
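
The dotted paths in the report (`Sequence.length`, `Sequence.[0].Field`) can be reproduced with a small recursive walk. The sketch below is an illustration under stated assumptions, not the library's code: plain dicts stand in for datasets, lists stand in for DICOM sequences, and `walk` is a hypothetical name. It visits only the top-level entries at each level and recurses into sequence items itself.

```python
def walk(node, prefix=""):
    """Yield (path, value) pairs for a nested dict/list structure.

    Hypothetical stand-in: dicts play the role of datasets and lists the
    role of DICOM sequences. Only top-level entries are visited at each
    level; nested items are reached through explicit recursion.
    """
    for name, value in node.items():
        path = f"{prefix}{name}"
        if isinstance(value, list):
            # Report the item count, then descend into each item.
            yield f"{path}.length", len(value)
            for index, item in enumerate(value):
                yield from walk(item, prefix=f"{path}.[{index}].")
        else:
            yield path, value


dataset = {
    "StationName": "CT-01",
    "RequestAttributesSequence": [{"RequestedProcedureID": "PROC-001"}],
}
paths = dict(walk(dataset))
for path, value in paths.items():
    print(path, "=", value)
```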


## 6) Save the HTML report <a id='savehtml'></a>

In [6]:
out_html = report.write_html("dicom_diff_report.html", title="DICOM Diff — Saved Report")
out_html

'dicom_diff_report.html'

## 7) YAML config: ignores + sequence key matching <a id='yaml'></a>

You can keep configuration in YAML (easier to reuse and share). Here’s what we’ll include:

- Global options (ignore private/bulk, numeric tolerance).
- Ignore list examples (e.g., `PatientName`).
- **Sequence item key rules**: choose which fields define identity for items in a sequence, so comparison doesn’t rely on index order.  
  For this demo, our simple `RequestAttributesSequence` has only one item with a `RequestedProcedureID`, so we’ll set that as the key.

If a key is missing, behavior is controlled by `sequence_fallback`:
- `"order"` → fall back to index-order compare (adds an `info` note).
- `"report"` → do not fall back; only report as unmatched.
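
The idea behind key-based matching and the two fallback modes can be sketched with plain dictionaries. This is a simplified illustration, not rosamllib's algorithm: `match_items` is a hypothetical function, sequence items are modelled as dicts, duplicate keys are not handled, and the order fallback pairs items only up to the shorter sequence.

```python
def match_items(left_items, right_items, key_fields, fallback="order"):
    """Pair sequence items by key fields; return (pairs, notes).

    Hypothetical sketch: items are plain dicts, and duplicate keys are
    not handled (a later item would overwrite an earlier one).
    """
    def key_of(item):
        try:
            return tuple(item[field] for field in key_fields)
        except KeyError:
            return None  # at least one key field is missing

    left_keys = [key_of(item) for item in left_items]
    right_keys = [key_of(item) for item in right_items]

    if None in left_keys or None in right_keys:
        if fallback == "order":
            # Pair by position and leave a diagnostic note.
            return list(zip(left_items, right_items)), ["info: fallback=order"]
        # fallback == "report": do not guess, just flag the problem.
        return [], ["warn: key fields missing; items left unmatched"]

    right_by_key = dict(zip(right_keys, right_items))
    pairs = [
        (item, right_by_key[key])
        for key, item in zip(left_keys, left_items)
        if key in right_by_key
    ]
    return pairs, []


left = [{"RequestedProcedureID": "A"}, {"RequestedProcedureID": "B"}]
right = [{"RequestedProcedureID": "B"}, {"RequestedProcedureID": "A"}]
pairs, notes = match_items(left, right, ["RequestedProcedureID"])
print(pairs, notes)
```

With keys in place, the reordered items on the right still pair up with their counterparts on the left, which an index-order comparison would report as differences.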

In [7]:
yaml_text = textwrap.dedent("""
ignore_private: true
ignore_bulk: true
numeric_tol: 1e-6
case_insensitive_strings: false

ignore:
  - PatientName

# Match items in sequences by key fields
sequence_keys:
  RequestAttributesSequence: ["RequestedProcedureID"]  # for our tiny demo

sequence_fallback: "order"

# HTML 'ok' rows controls (optional)
collect_all_matches: true
max_ok_rows: 100000
""").strip()

with open("compare_dcm_config.yaml", "w", encoding="utf-8") as f:
    f.write(yaml_text)

print("Wrote compare_dcm_config.yaml")

Wrote compare_dcm_config.yaml


In [8]:
opts_yaml = load_yaml_config("compare_dcm_config.yaml")
report_yaml = compare_files(LEFT, RIGHT, opts_yaml)
HTML(report_yaml.to_html("DICOM Diff — YAML Config"))

| Field | Value |
| --- | --- |
| path | `c:\Users\yabdulkadir\Desktop\open_source\Github\rosamllib\examples\demo_left.dcm` |
| SOPClassUID | 1.2.840.10008.5.1.4.1.1.2 |
| SOPInstanceUID | 1.2.826.0.1.3680043.2.1125.1758961220163 |
| Modality | CT |
| StudyInstanceUID | 1.2.3.4.5.6 |
| SeriesInstanceUID | 1.2.3.4.5.6.7 |

| Field | Value |
| --- | --- |
| path | `c:\Users\yabdulkadir\Desktop\open_source\Github\rosamllib\examples\demo_right.dcm` |
| SOPClassUID | 1.2.840.10008.5.1.4.1.1.2 |
| SOPInstanceUID | 1.2.826.0.1.3680043.2.1125.1758961220165 |
| Modality | CT |
| StudyInstanceUID | 1.2.3.4.5.6 |
| SeriesInstanceUID | 1.2.3.4.5.6.7 |

| Severity | Path | Left | Right | Note |
| --- | --- | --- | --- | --- |
| ok | SOPClassUID | 1.2.840.10008.5.1.4.1.1.2 | 1.2.840.10008.5.1.4.1.1.2 | match |
| diff | SOPInstanceUID | 1.2.826.0.1.3680043.2.1125.1758961220163 | 1.2.826.0.1.3680043.2.1125.1758961220165 | value differs (tol=1e-06) |
| ok | StudyDate | 20250927 | 20250927 | match |
| ok | Modality | CT | CT | match |
| diff | StationName | CT-01 | CT-02 | value differs (tol=1e-06) |
| ok | StudyDescription | TEST | TEST | match |
| ok | PatientID | 12345 | 12345 | match |
| ok | StudyInstanceUID | 1.2.3.4.5.6 | 1.2.3.4.5.6 | match |
| ok | SeriesInstanceUID | 1.2.3.4.5.6.7 | 1.2.3.4.5.6.7 | match |

| Severity | Path | Left | Right | Note |
| --- | --- | --- | --- | --- |
| ok | `RequestAttributesSequence.item[key=('PROC-001',)].count` | 1 | 1 | match |
| ok | `RequestAttributesSequence.item[key=('PROC-001',)].RequestedProcedureID` | PROC-001 | PROC-001 | match |


## 8) “Show everything” vs “Diffs only” <a id='allvsdiffs'></a>

Sometimes you want to review *everything* (including matches) for auditability; other times you only want the changes.

Toggle this with `collect_all_matches`:

- `True` → emit `ok` rows (matches).
- `False` → only diffs/warns/infos show up in the report.

In [9]:
# Show everything
opts_all = DiffOptions(collect_all_matches=True, numeric_tol=Tolerance(1e-6))
rep_all = compare_files(LEFT, RIGHT, opts_all)
HTML(rep_all.to_html("Show Everything (ok + diffs)"))

| Field | Value |
| --- | --- |
| path | `c:\Users\yabdulkadir\Desktop\open_source\Github\rosamllib\examples\demo_left.dcm` |
| SOPClassUID | 1.2.840.10008.5.1.4.1.1.2 |
| SOPInstanceUID | 1.2.826.0.1.3680043.2.1125.1758961220163 |
| Modality | CT |
| StudyInstanceUID | 1.2.3.4.5.6 |
| SeriesInstanceUID | 1.2.3.4.5.6.7 |

| Field | Value |
| --- | --- |
| path | `c:\Users\yabdulkadir\Desktop\open_source\Github\rosamllib\examples\demo_right.dcm` |
| SOPClassUID | 1.2.840.10008.5.1.4.1.1.2 |
| SOPInstanceUID | 1.2.826.0.1.3680043.2.1125.1758961220165 |
| Modality | CT |
| StudyInstanceUID | 1.2.3.4.5.6 |
| SeriesInstanceUID | 1.2.3.4.5.6.7 |

| Severity | Path | Left | Right | Note |
| --- | --- | --- | --- | --- |
| ok | SOPClassUID | 1.2.840.10008.5.1.4.1.1.2 | 1.2.840.10008.5.1.4.1.1.2 | match |
| diff | SOPInstanceUID | 1.2.826.0.1.3680043.2.1125.1758961220163 | 1.2.826.0.1.3680043.2.1125.1758961220165 | value differs (tol=1e-06) |
| ok | StudyDate | 20250927 | 20250927 | match |
| ok | Modality | CT | CT | match |
| diff | StationName | CT-01 | CT-02 | value differs (tol=1e-06) |
| ok | StudyDescription | TEST | TEST | match |
| ok | PatientName | DOE^JOHN | DOE^JOHN | match |
| ok | PatientID | 12345 | 12345 | match |
| ok | StudyInstanceUID | 1.2.3.4.5.6 | 1.2.3.4.5.6 | match |
| ok | SeriesInstanceUID | 1.2.3.4.5.6.7 | 1.2.3.4.5.6.7 | match |

| Severity | Path | Left | Right | Note |
| --- | --- | --- | --- | --- |
| ok | `RequestAttributesSequence.length` | 1 | 1 | match |

| Severity | Path | Left | Right | Note |
| --- | --- | --- | --- | --- |
| ok | `RequestAttributesSequence.[0].RequestedProcedureID` | PROC-001 | PROC-001 | match |


In [10]:
# Diffs only
opts_diffs_only = DiffOptions(collect_all_matches=False, numeric_tol=Tolerance(1e-6))
rep_diffs = compare_files(LEFT, RIGHT, opts_diffs_only)
HTML(rep_diffs.to_html("Diffs Only"))

| Field | Value |
| --- | --- |
| path | `c:\Users\yabdulkadir\Desktop\open_source\Github\rosamllib\examples\demo_left.dcm` |
| SOPClassUID | 1.2.840.10008.5.1.4.1.1.2 |
| SOPInstanceUID | 1.2.826.0.1.3680043.2.1125.1758961220163 |
| Modality | CT |
| StudyInstanceUID | 1.2.3.4.5.6 |
| SeriesInstanceUID | 1.2.3.4.5.6.7 |

| Field | Value |
| --- | --- |
| path | `c:\Users\yabdulkadir\Desktop\open_source\Github\rosamllib\examples\demo_right.dcm` |
| SOPClassUID | 1.2.840.10008.5.1.4.1.1.2 |
| SOPInstanceUID | 1.2.826.0.1.3680043.2.1125.1758961220165 |
| Modality | CT |
| StudyInstanceUID | 1.2.3.4.5.6 |
| SeriesInstanceUID | 1.2.3.4.5.6.7 |

| Severity | Path | Left | Right | Note |
| --- | --- | --- | --- | --- |
| diff | SOPInstanceUID | 1.2.826.0.1.3680043.2.1125.1758961220163 | 1.2.826.0.1.3680043.2.1125.1758961220165 | value differs (tol=1e-06) |
| diff | StationName | CT-01 | CT-02 | value differs (tol=1e-06) |


## 9) Programmatic analysis (Pandas/CSV) <a id='pandas'></a>

You might want to post-process results—e.g., export diffs to CSV, filter by severity, or build custom dashboards.

In [11]:
data = report.to_dict()
df = pd.DataFrame(data["diffs"])  # columns: path, left, right, note, severity
display(df.head())

# Only true differences
df_diffs = df[df["severity"] == "diff"].copy()
display(df_diffs)

# Save all rows to CSV
df.to_csv("dicom_diff_rows.csv", index=False)
"dicom_diff_rows.csv"

|   | path | left | right | note | severity |
| --- | --- | --- | --- | --- | --- |
| 0 | SOPClassUID | 1.2.840.10008.5.1.4.1.1.2 | 1.2.840.10008.5.1.4.1.1.2 | match | ok |
| 1 | SOPInstanceUID | 1.2.826.0.1.3680043.2.1125.1758961220163 | 1.2.826.0.1.3680043.2.1125.1758961220165 | value differs (tol=1e-06) | diff |
| 2 | StudyDate | 20250927 | 20250927 | match | ok |
| 3 | Modality | CT | CT | match | ok |
| 4 | StationName | CT-01 | CT-02 | value differs (tol=1e-06) | diff |

|   | path | left | right | note | severity |
| --- | --- | --- | --- | --- | --- |
| 1 | SOPInstanceUID | 1.2.826.0.1.3680043.2.1125.1758961220163 | 1.2.826.0.1.3680043.2.1125.1758961220165 | value differs (tol=1e-06) | diff |
| 4 | StationName | CT-01 | CT-02 | value differs (tol=1e-06) | diff |

'dicom_diff_rows.csv'
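
If pandas is not available, the same filtering and counting can be done with the standard library. The sketch below hard-codes three rows copied from the output above, in the row shape used by `data["diffs"]` (path, left, right, note, severity); `tally_by_severity` is a hypothetical helper name.

```python
from collections import Counter

# Three rows copied from the tutorial output, in the data["diffs"] row shape.
rows = [
    {"path": "Modality", "left": "CT", "right": "CT",
     "note": "match", "severity": "ok"},
    {"path": "StationName", "left": "CT-01", "right": "CT-02",
     "note": "value differs (tol=1e-06)", "severity": "diff"},
    {"path": "PatientID", "left": "12345", "right": "12345",
     "note": "match", "severity": "ok"},
]


def tally_by_severity(rows):
    """Count rows per severity, always reporting all four levels."""
    counts = Counter(row["severity"] for row in rows)
    return {sev: counts.get(sev, 0) for sev in ("diff", "warn", "info", "ok")}


by_severity = tally_by_severity(rows)
changed_paths = [row["path"] for row in rows if row["severity"] == "diff"]
print(by_severity)
print(changed_paths)
```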

## 10) Tips & Troubleshooting <a id='tips'></a>

### Reading severities
- **ok**: values match (emitted only when `collect_all_matches=True`).
- **diff**: real difference (value/VM/VR/presence/count).
- **warn**: something suspicious that could affect correctness (e.g., key fields missing while not falling back).
- **info**: diagnostic notes (e.g., `"fallback=order"`).
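
One way to act on these severities in an automated check is to turn the `by_severity` counts into a pass/fail result. This is a sketch of one possible policy, not a rosamllib feature: `exit_code` is a hypothetical helper, and treating `diff` and `warn` as failures is an assumption you may want to change.

```python
def exit_code(by_severity, fail_on=("diff", "warn")):
    """Return 1 if any severity listed in fail_on has a non-zero count, else 0.

    Hypothetical policy helper; the default fail_on choice is an assumption.
    """
    return 1 if any(by_severity.get(sev, 0) > 0 for sev in fail_on) else 0


# Summary copied from the quick-compare output earlier in this tutorial.
summary = {"total": 12, "by_severity": {"diff": 2, "warn": 0, "info": 0, "ok": 10}}
print(exit_code(summary["by_severity"]))

# A run with only informational notes passes under this policy.
print(exit_code({"diff": 0, "warn": 0, "info": 3, "ok": 9}))
```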

### Common knobs
- `ignore_private=True`: skip private tags.
- `ignore_bulk=True`: skip heavy byte fields (PixelData, WaveformData, etc.).
- `numeric_tol=Tolerance(1e-6)`: absolute tolerance for numeric VRs (DS/IS).
- `sequence_keys={...}`: match sequence items by key field(s) instead of by index.
- `sequence_fallback="order"`: fall back to index order when key fields are missing (adds an `info` row).
- `collect_all_matches=True`: show `ok` rows; set it to `False` for “diffs only”.
- `max_ok_rows`: cap on the number of `ok` rows, to keep “everything” reports from growing too large.

### If results look odd
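- Ensure your installed version walks only the **top-level elements** of each dataset and recurses into sequence items itself. pydicom's `Dataset.iterall()` already descends into nested sequences, so calling it at every recursion level would visit nested elements more than once and can produce duplicate or misplaced rows.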
- If you see sequence items mismatching, check your `sequence_keys` rules.
- If you want pixel comparisons, set `ignore_bulk=False` and add your own pixel checks (this tutorial focuses on metadata/sequence diffs).
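
As a starting point for your own pixel checks, the sketch below compares two byte strings by length and SHA-256 digest. It is a hedged illustration rather than part of rosamllib: `bulk_digest` and `bulk_equal` are hypothetical helpers, and the sample bytes stand in for real pixel data. With real files, the bytes would typically come from `pydicom.dcmread(path).PixelData`, and a raw byte comparison is only meaningful when both files use the same transfer syntax.

```python
import hashlib


def bulk_digest(data: bytes) -> str:
    """Return a SHA-256 hex digest, handy for logging or caching."""
    return hashlib.sha256(data).hexdigest()


def bulk_equal(left: bytes, right: bytes) -> bool:
    """Compare two byte blobs by length first, then by digest."""
    return len(left) == len(right) and bulk_digest(left) == bulk_digest(right)


# Stand-in byte strings; real code would pass each dataset's PixelData bytes.
left_pixels = bytes([0, 1, 2, 3])
right_pixels = bytes([0, 1, 2, 4])

print(bulk_equal(left_pixels, left_pixels))
print(bulk_equal(left_pixels, right_pixels))
```

Storing the digest alongside the metadata diff keeps the report small while still revealing whether the pixel payload changed.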

## 11) Use the CLI from the notebook (optional) <a id='cli'></a>

In [12]:
# Example: write an HTML report using the CLI (handy for pipelines)
# !python -m rosamllib.compare_dcms "{LEFT}" "{RIGHT}" --config compare_dcm_config.yaml --html out.html --title "CLI Report"