# DICOM Tags Table Creation and Vector Search Database

This notebook creates the `dicom_tags` table, builds a vector search index on it, and provides the instructions and SQL examples needed to configure a Genie space. The instructions cover retrieving a DICOM tag from its description, extracting tag values, and identifying patient and scan data accurately.

In [0]:
%pip install databricks-vectorsearch
dbutils.library.restartPython()

In [0]:
dbutils.widgets.text("schema", "main.pixels_solacc", label="1.0 Catalog and Schema to store the dicom tags and vector search database")

schema = dbutils.widgets.get("schema")

In [0]:
import json

with open('dbx/pixels/resources/dicom_tags.ndjson', 'r') as f:
    tags = [json.loads(line) for line in f]

tags_df = spark.createDataFrame(tags)

tags_df.select("Tag","Name","Keyword","VR","VM","Retired").write.mode("overwrite").saveAsTable(schema+".dicom_tags")
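
The loading step above assumes every line of the NDJSON file is a complete JSON object containing all six selected columns. A minimal sketch of a stricter parser is shown below; `parse_tag_lines` and the sample record are hypothetical, and the format of the `Retired` value in the real file is an assumption.

```python
import json

# Columns that the notebook selects from the tags DataFrame.
REQUIRED_KEYS = ("Tag", "Name", "Keyword", "VR", "VM", "Retired")


def parse_tag_lines(lines):
    """Parse NDJSON lines, skipping blank lines and rejecting incomplete records."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing keys {missing}")
        records.append(record)
    return records


# Hypothetical sample line; the real file's values may be formatted differently.
sample = [
    '{"Tag": "00100020", "Name": "Patient ID", "Keyword": "PatientID", '
    '"VR": "LO", "VM": "1", "Retired": ""}',
    "",
]
tags = parse_tag_lines(sample)
```

Passing the open file handle instead of `sample` would apply the same checks before the DataFrame is created.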

In [0]:
%sql
ALTER TABLE $schema.dicom_tags SET TBLPROPERTIES (delta.enableChangeDataFeed = true)

In [0]:
from databricks.vector_search.client import VectorSearchClient

vs_endpoint = "pixels_vs_endpoint"

# The following line automatically generates a PAT Token for authentication
client = VectorSearchClient()

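# Look the endpoint up by name in the endpoint list; this avoids relying on
# get_endpoint() returning None when the endpoint does not exist.
existing_endpoints = [e["name"] for e in client.list_endpoints().get("endpoints", [])]

if vs_endpoint not in existing_endpoints:
  client.create_endpoint(
      name=vs_endpoint,
      endpoint_type="STANDARD"
  )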

index = client.create_delta_sync_index(
  endpoint_name=vs_endpoint,
  source_table_name=schema+".dicom_tags",
  index_name=schema+".dicom_tags_vs",
  pipeline_type="TRIGGERED",
  primary_key="Tag",
  embedding_source_column="Name",
  embedding_model_endpoint_name="databricks-bge-large-en"
)
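
Once the index has finished its initial sync, it can be queried with a free-text description to find the closest tag. The sketch below is one way to do that with the index's `similarity_search` method; `lookup_tag` is a hypothetical helper, and it is an assumption that the `retrieve_dicom_tag` function used later works along similar lines.

```python
def lookup_tag(index, description, num_results=1):
    """Return the best-matching tag rows for a free-text description."""
    columns = ["Tag", "Keyword", "VM"]
    response = index.similarity_search(
        query_text=description,
        columns=columns,
        num_results=num_results,
    )
    rows = response.get("result", {}).get("data_array", [])
    # Each row holds the requested columns followed by a similarity score;
    # zip() stops at the last requested column, so the score is dropped.
    return [dict(zip(columns, row)) for row in rows]
```

For example, `lookup_tag(index, "patient age")` would return a list with one dictionary holding the `Tag`, `Keyword`, and `VM` of the closest match, or an empty list if nothing is returned.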


# GENIE INSTRUCTIONS

Use the function **hls_radiology.ddsm.retrieve_dicom_tag** to retrieve a DICOM tag from its description when you don't know the tag. With that tag, query the object_catalog table using **hls_radiology.ddsm.extract_tag_value** to extract the corresponding DICOM tag value from the META column.

Always add meta:['hash'] to the select statement as the identifier of a single file.
To identify a single patient, use the following expression in the select statement: distinct(hls_radiology.ddsm.extract_tag_value(struct("00100020","PatientID","1"), meta)) as PATIENT_ID

The modificationTime column holds the file's creation time, which is not always related to the scan date.
To identify the scan date, use the 'AcquisitionDate' tag; if its value is null, fall back to the modificationTime column.
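
The fallback rule for the scan date can be sketched in Python as follows. `scan_date` is a hypothetical helper; it assumes the tag value arrives as a DICOM DA string (`YYYYMMDD`) or as null, and that `modificationTime` is available as a `datetime`.

```python
from datetime import date, datetime


def scan_date(acquisition_date, modification_time):
    """Prefer the AcquisitionDate tag (YYYYMMDD); fall back to the file time."""
    value = (acquisition_date or "").strip()
    if value:
        return datetime.strptime(value, "%Y%m%d").date()
    # No usable AcquisitionDate: use the date part of modificationTime.
    return modification_time.date()
```

Treating an empty or whitespace-only value like null is a design choice of this sketch, not something the instructions above specify.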

# GENIE SQL EXAMPLES

## How old are the patients?

select distinct
  hls_radiology.ddsm.extract_tag_value(
    struct('00100020', 'PatientID', '1'), meta
  ) as PATIENT_ID,
  cast(
    replace(
      hls_radiology.ddsm.extract_tag_value(
        hls_radiology.ddsm.retrieve_dicom_tag('patient age'), meta
      ),
      'Y', ''
    ) as integer
  ) as patient_age
from hls_radiology.ddsm.object_catalog;
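
The age query strips only the `Y` suffix, so it assumes every age is expressed in years. The DICOM Age String (AS) representation also allows days, weeks, and months (`nnnD`, `nnnW`, `nnnM`). A minimal sketch of a parser that converts any of these to years is shown below; `age_in_years` is a hypothetical helper, and the conversion factors are approximations.

```python
import re

# DICOM Age String: three digits followed by a unit (D, W, M or Y).
_AGE_PATTERN = re.compile(r"^(\d{3})([DWMY])$")

# How many of each unit make up one year (approximate for days and weeks).
_UNITS_PER_YEAR = {"D": 365.25, "W": 365.25 / 7, "M": 12.0, "Y": 1.0}


def age_in_years(age_string):
    """Convert a DICOM age string such as '045Y' to years, or None if invalid."""
    if age_string is None:
        return None
    match = _AGE_PATTERN.match(age_string.strip())
    if match is None:
        return None
    count, unit = match.groups()
    return int(count) / _UNITS_PER_YEAR[unit]
```

Returning `None` for malformed values mirrors how a null tag value would be handled, which keeps the helper safe to apply to every row.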

## Show me the distribution of the modalities

SELECT
  modality,
  COUNT(1) AS count
FROM
  (
    SELECT
      hls_radiology.ddsm.extract_tag_value(
        hls_radiology.ddsm.retrieve_dicom_tag('modality'), `meta`
      ) AS modality
    FROM
      `hls_radiology`.`ddsm`.`object_catalog`
  ) AS modalities
GROUP BY
  modality

## For the MG modality, are there any patients over 50?

SELECT DISTINCT
  hls_radiology.ddsm.extract_tag_value(struct('00100020', 'PatientID', '1'), `meta`) AS PATIENT_ID,
  CAST(
    REPLACE(
      hls_radiology.ddsm.extract_tag_value(hls_radiology.ddsm.retrieve_dicom_tag('patient age'), `meta`),
      'Y',
      ''
    ) AS INTEGER
  ) AS patient_age
FROM
  `hls_radiology`.`ddsm`.`object_catalog`
WHERE
  hls_radiology.ddsm.extract_tag_value(hls_radiology.ddsm.retrieve_dicom_tag('modality'), `meta`) = 'MG'
  AND CAST(
    REPLACE(
      hls_radiology.ddsm.extract_tag_value(hls_radiology.ddsm.retrieve_dicom_tag('patient age'), `meta`),
      'Y',
      ''
    ) AS INTEGER
  ) > 50