You may find this solution accelerator at https://github.com/databricks-industry-solutions/pixels. 

# Analytics on DICOM images should be simple

- Catalog all of your files in parallel and scale with Spark
- Spark SQL on top of Delta Lake powers fast metadata analytics
- Python-based Transformers and pandas UDFs form the building blocks for:
  - Metadata extraction
  - Simple composition and extension into de-identification and deep learning
- Uses the proven `pydicom` and `pylibjpeg` Python packages and C++ libraries
<!-- -->

The `dbx.pixels` solution accelerator turns DICOM images into SQL data.

## Requirements
This notebook requires Unity Catalog enabled compute. A dedicated cluster, a shared cluster, or serverless notebook compute (CPU) will work. Please use the latest LTS runtime.

In [0]:
%run ./config/setup

In [0]:
path,table,volume,write_mode = init_widgets()
init_catalog_schema_volume()

## Catalog the objects and files
`dbx.pixels.Catalog` looks only at file metadata.
Its `catalog` method recursively lists all files and parses each path and filename into a dataframe. This dataframe can be saved as a file catalog, which can then serve as the basis for further annotations.
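
To make the idea concrete, here is a minimal single-machine sketch of what cataloging means: walk a directory tree and emit one row per file with its path, size, extension, and tokens parsed from the path. This is not the `dbx.pixels` implementation, which runs on Spark. The helper name `catalog_files` and the tokenizing rule are assumptions for illustration; only the column names `path`, `length`, `extension`, and `path_tags` are borrowed from the table queried later in this notebook.

```python
import os
import re
import tempfile
from pathlib import Path


def catalog_files(root):
    """Hypothetical helper: list files under root and parse each path into a row."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            full = Path(dirpath) / name
            rel = full.relative_to(root)
            # Assumed rule: split the relative path on anything that is not a letter or digit.
            tags = [t for t in re.split(r"[^A-Za-z0-9]+", rel.as_posix()) if t]
            rows.append(
                {
                    "path": str(full),
                    "length": full.stat().st_size,  # file size in bytes
                    "extension": full.suffix.lstrip(".").lower(),
                    "path_tags": tags,
                }
            )
    return rows


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as root:
    sample = Path(root) / "patient5397" / "study1" / "img_001.dcm"
    sample.parent.mkdir(parents=True)
    sample.write_bytes(b"DICM")  # placeholder bytes, not a valid DICOM file
    rows = catalog_files(root)

print(rows[0]["extension"], rows[0]["length"], rows[0]["path_tags"])
```

Keeping path tokens in an array column is what makes a filter such as `array_contains(path_tags, 'patient5397')` possible without opening any file.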

In [0]:
from dbx.pixels import Catalog
from dbx.pixels.dicom import DicomMetaExtractor # The Dicom transformers

In [0]:
catalog = Catalog(spark, table=table, volume=volume)
catalog.init_tables()
catalog_df = catalog.catalog(path=path, extractZip=True)

## Extract Metadata from the DICOM images
Using the catalog dataframe, we can now open each DICOM file and extract the metadata from its header. This operation runs in parallel, which speeds up processing. The resulting `meta_df` does not inline the entire DICOM file; because DICOM files tend to be large, we process them only by reference.

Under the covers, we use `pydicom` and gdcm to parse the DICOM files.

The DICOM metadata is extracted into a JSON-formatted string column named `meta`.
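
The sketch below shows the shape such a JSON string can take and how a lookup like `meta:['00100010'].Value[0].Alphabetic` maps onto plain Python. The sample is hand-written in the DICOM JSON Model layout (eight-hex-digit tag keys, each with a `Value` list); the `vr` entries and all values are assumptions made up for illustration, and `first_value` is a hypothetical helper, not part of `dbx.pixels`.

```python
import json

# Hand-written sample in the DICOM JSON Model shape (tag -> {"vr", "Value"}).
# The values are invented for illustration and do not come from a real file.
meta = json.dumps(
    {
        "00100010": {"vr": "PN", "Value": [{"Alphabetic": "DOE^JANE"}]},
        "00100020": {"vr": "LO", "Value": ["P-0001"]},
        "0008103E": {"vr": "LO", "Value": ["AXIAL T1"]},
    }
)


def first_value(meta_json, tag):
    """Return the first Value entry for a tag, or None if the tag is absent or empty."""
    element = json.loads(meta_json).get(tag, {})
    values = element.get("Value") or [None]
    return values[0]


# Person names are nested objects; most other attributes are plain values.
patient_name = first_value(meta, "00100010")["Alphabetic"]
patient_id = first_value(meta, "00100020")
missing = first_value(meta, "00080018")  # tag not present in the sample

print(patient_name, patient_id, missing)
```

Because only the header is stored, the pixel data stays in the original file and the `meta` string remains small enough to query directly.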

In [0]:
meta_df = DicomMetaExtractor(catalog).transform(catalog_df)

## Save the metadata to a table

In [0]:
catalog.save(meta_df, mode=write_mode)

In [0]:
%sql describe IDENTIFIER(:table)

# Analyze DICOM Metadata with SQL

In [0]:
%sql select path, modificationTime, length, original_path, extension, file_type, path_tags, is_anon, meta from IDENTIFIER(:table)

In [0]:
%sql
with x as (
  select
  format_number(count(DISTINCT meta:['00100010'].Value[0].Alphabetic::STRING),0) as patient_count,
  format_number(count(1),0) num_dicoms,
  format_number(sum(length) /(1024*1024*1024), 1) as total_size_in_gb,
  format_number(avg(length), 0) avg_size_in_bytes
  from IDENTIFIER(:table) t
  where extension = 'dcm'
)
select patient_count, num_dicoms, total_size_in_gb, avg_size_in_bytes from x

### Decode DICOM attributes
The tag codes used below can be looked up in the DICOM Standard Browser (https://dicom.innolitics.com/ciods) by Innolitics.
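
Each eight-hex-digit key is a DICOM tag: the first four digits are the group and the last four are the element. As a reading aid, the sketch below formats a key in the usual `(gggg,eeee)` notation and attaches the attribute name for the tags used in the following query. `describe_tag` and `TAG_NAMES` are hypothetical names introduced here, and the table covers only these few tags rather than the full data dictionary.

```python
# Attribute names for the tags used in the following query, per the DICOM data dictionary.
TAG_NAMES = {
    "0020000D": "Study Instance UID",
    "0020000E": "Series Instance UID",
    "00080018": "SOP Instance UID",
    "00100020": "Patient ID",
    "00100010": "Patient's Name",
    "00082218": "Anatomic Region Sequence",
    "00080104": "Code Meaning",
    "0008103E": "Series Description",
    "00081030": "Study Description",
    "00540220": "View Code Sequence",
}


def describe_tag(key):
    """Format an 8-hex-digit key as '(gggg,eeee) name'; unknown tags get 'unknown'."""
    key = key.upper()
    if len(key) != 8 or any(c not in "0123456789ABCDEF" for c in key):
        raise ValueError(f"not an 8-hex-digit DICOM tag key: {key!r}")
    group, element = key[:4], key[4:]
    return f"({group},{element}) {TAG_NAMES.get(key, 'unknown')}"


for key in ("0020000D", "00082218", "00080104"):
    print(describe_tag(key))
```

Sequence attributes such as Anatomic Region Sequence hold nested items, which is why the query indexes into `Value[0]` and then into a second tag, Code Meaning, to reach the readable text.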

In [0]:
%sql
SELECT
    meta:['0020000D'].Value[0]::STRING as study_uid,
    meta:['0020000E'].Value[0]::STRING as series_instance_uid,
--    meta:['00080018'].Value[0]::STRING as sop_instance_uid,
    meta:['00100020'].Value[0]::STRING as patient_id,
    meta:['00100010'].Value[0].Alphabetic::STRING patient_name, 
    meta:['00082218'].Value[0]['00080104'].Value[0]::STRING `Anatomic Region Sequence Attribute decoded`,
    meta:['0008103E'].Value[0]::STRING `Series Description Attribute`,
    meta:['00081030'].Value[0]::STRING `Study Description Attribute`,
    meta:`00540220`.Value[0].`00080104`.Value[0]::STRING `projection` -- backticks work for numeric keys
FROM IDENTIFIER(:table)
GROUP BY ALL
ORDER BY 1,2,3

In [0]:
%sql
SELECT 
  --rowid,
  meta:['00100010'].Value[0].Alphabetic::STRING as patient_name,  -- Medical information from the DICOM header
  meta:hash::STRING, meta:img_min::STRING, meta:img_max::STRING, path,            -- technical metadata
  meta                                                    -- DICOM header metadata as JSON
FROM IDENTIFIER(:table)
WHERE array_contains( path_tags, 'patient5397' ) -- query based on a part of the filename
order by patient_name