# Harvesting DICOM Standard Part 3
## Extract relationships between Tags and Value Sets

[Click to access an online XML viewer](https://jsonformatter.org/xml-viewer)

Links to the XML Objects
- [DICOM Part 3](https://dicom.nema.org/medical/dicom/current/source/docbook/part03/part03.xml)
- [DICOM Part 6](https://dicom.nema.org/medical/dicom/current/source/docbook/part06/part06.xml)
- [DICOM Part 16](https://dicom.nema.org/medical/dicom/current/source/docbook/part16/part16.xml)

This notebook demonstrates how to parse DICOM Part 3 in two stages:
1. Extract one IOD module table (A.36 Enhanced MR) and its reference tables
2. Extract all IOD module tables and their reference tables


## 1.a Extract one IOD Module Table: A.36 Enhanced MR

In [1]:
import requests
from bs4 import BeautifulSoup

# Fetch XML content from the URL
url = 'https://dicom.nema.org/medical/dicom/current/source/docbook/part03/part03.xml'  # Part 3 Standard XML
response = requests.get(url)  # send the request and keep the Response object
xml_content = response.content  # raw bytes of the response body

# Parse the XML content
soup = BeautifulSoup(xml_content, 'xml')  # parse the XML into a navigable tree

In [9]:
import requests
import pandas as pd

#Find the table with label 'A.36-1'
table = soup.find('table', {'label' : 'A.36-1'})
table

<table frame="box" label="A.36-1" rules="all" xml:id="table_A.36-1">
<caption>Enhanced MR Image IOD Modules</caption>
<thead>
<tr valign="top">
<th align="center" colspan="1" rowspan="1">
<para xml:id="para_2efea782-6f1b-47ce-8516-5b06a65a0fc5">IE</para>
</th>
<th align="center" colspan="1" rowspan="1">
<para xml:id="para_029f7031-5b56-436e-968d-23d9ba46a81e">Module</para>
</th>
<th align="center" colspan="1" rowspan="1">
<para xml:id="para_630bd1aa-5c0e-4ebc-b007-287684f6cd4b">Reference</para>
</th>
<th align="center" colspan="1" rowspan="1">
<para xml:id="para_4d8c0d47-27c1-4318-9b9e-dd59ef36c1e4">Usage</para>
</th>
</tr>
</thead>
<tbody>
<tr valign="top">
<td align="left" colspan="1" rowspan="2">
<para xml:id="para_d2335fce-97b7-4175-9953-e2c89b17b578">Patient</para>
</td>
<td align="left" colspan="1" rowspan="1">
<para xml:id="para_e2c47885-4138-43c3-8097-4f6e7c77d192">Patient</para>
</td>
<td align="left" colspan="1" rowspan="1">
<para xml:id="para_cb68633d-303b-4b1f-a3b2-83047554

In [11]:
#Find the section containing the table and extract its title
section_title = soup.find('section', {'xml:id':table.parent['xml:id']}).find('title').text.strip()
section_title

'Enhanced MR Image IOD Module Table'

In [16]:
#Extract the part of the title before "IOD"
iod_index = section_title.find(" IOD")
iod = section_title[:iod_index]
iod

'Enhanced MR Image'
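
One caveat with the slicing above: `str.find` returns `-1` when `" IOD"` is absent, so the slice would silently drop the last character of the title. A minimal sketch of a more defensive variant follows; `iod_name_from_title` is a hypothetical helper name, not part of the original notebook.

```python
def iod_name_from_title(section_title: str) -> str:
    """Return the text before the first ' IOD', or the whole title if absent."""
    name, _separator, _rest = section_title.partition(" IOD")
    # partition() returns the full string as `name` when the separator is missing
    return name.strip()


print(iod_name_from_title("Enhanced MR Image IOD Module Table"))
```

With `partition`, a title that does not contain `" IOD"` is returned unchanged instead of being truncated.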

In [18]:
#Extract table headers
headers = [th.text.strip() for th in table.find_all('th')]
headers

['IE', 'Module', 'Reference', 'Usage']

In [21]:
# Extract table rows
rows = []
current_ie = None
for tr in table.find_all('tr')[1:]:  # Skip the header row
    cells = tr.find_all(['td', 'th'])

    # Extract cell values
    cell_values = [cell.text.strip() for cell in cells]

    # Check if the cell values are complete (some cells may be missing due to rowspan)
    if len(cell_values) < len(headers):
        # Fill in the missing values based on the previous row
        for i in range(len(headers) - len(cell_values)):
            cell_values.insert(0, current_ie)

    # Store the IE value for the next iteration
    current_ie = cell_values[0]

    # Extract the reference value if available
    reference = None
    for cell in cells:
        xref = cell.find('xref', {'xrefstyle': 'select: labelnumber'})
        if xref:
            reference = xref['linkend']
            break

    # Insert the reference value into the correct position
    if reference:
        # Find the index of the cell containing the reference
        reference_index = next((idx for idx, val in enumerate(cell_values) if val == ''), None)
        if reference_index is not None:
            cell_values[reference_index] = reference

    # Insert the title of the table as the first element
    cell_values.insert(0, iod)

    rows.append(cell_values)

rows

[['Enhanced MR Image', 'Patient', 'Patient', 'sect_C.7.1.1', 'M'],
 ['Enhanced MR Image',
  'Patient',
  'Clinical Trial Subject',
  'sect_C.7.1.3',
  'U'],
 ['Enhanced MR Image', 'Study', 'General Study', 'sect_C.7.2.1', 'M'],
 ['Enhanced MR Image', 'Study', 'Patient Study', 'sect_C.7.2.2', 'U'],
 ['Enhanced MR Image', 'Study', 'Clinical Trial Study', 'sect_C.7.2.3', 'U'],
 ['Enhanced MR Image', 'Series', 'General Series', 'sect_C.7.3.1', 'M'],
 ['Enhanced MR Image', 'Series', 'Clinical Trial Series', 'sect_C.7.3.2', 'U'],
 ['Enhanced MR Image', 'Series', 'MR Series', 'sect_C.8.13.6', 'M'],
 ['Enhanced MR Image',
  'Frame of Reference',
  'Frame of Reference',
  'sect_C.7.4.1',
  'M'],
 ['Enhanced MR Image',
  'Frame of Reference',
  'Synchronization',
  'sect_C.7.4.2',
  'C - Required if time synchronization was applied.'],
 ['Enhanced MR Image', 'Equipment', 'General Equipment', 'sect_C.7.5.1', 'M'],
 ['Enhanced MR Image',
  'Equipment',
  'Enhanced General Equipment',
  'sect_C.7.5
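
The row loop above does two things: it carries the IE value forward into rows where a `rowspan` hides the first cell, and it replaces the empty Reference cell with the `linkend` of its `xref`. The sketch below reproduces that idea on a toy table. It is an illustration under assumptions: it uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, the sample table has no XML namespace (the real DocBook file declares one), and `cell_value` and `parse_module_rows` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Toy table shaped like an IOD module table (not the real Part 3 markup)
SAMPLE_TABLE = """
<table label="demo-1">
  <thead>
    <tr><th><para>IE</para></th><th><para>Module</para></th><th><para>Reference</para></th><th><para>Usage</para></th></tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="2"><para>Patient</para></td>
      <td><para>Patient</para></td>
      <td><para><xref linkend="sect_C.7.1.1" xrefstyle="select: labelnumber"/></para></td>
      <td><para>M</para></td>
    </tr>
    <tr>
      <td><para>Clinical Trial Subject</para></td>
      <td><para><xref linkend="sect_C.7.1.3" xrefstyle="select: labelnumber"/></para></td>
      <td><para>U</para></td>
    </tr>
  </tbody>
</table>
"""


def cell_value(cell):
    """Use the xref target if the cell holds one, otherwise the cell text."""
    xref = cell.find(".//xref")
    if xref is not None:
        return xref.get("linkend")
    return "".join(cell.itertext()).strip()


def parse_module_rows(table):
    headers = ["".join(th.itertext()).strip() for th in table.iter("th")]
    rows, current_ie = [], None
    for tr in table.find("tbody").findall("tr"):
        values = [cell_value(td) for td in tr.findall("td")]
        # Cells hidden by a rowspan are missing; pad with the previous IE value
        missing = len(headers) - len(values)
        values = [current_ie] * missing + values
        current_ie = values[0]
        rows.append(values)
    return headers, rows


headers, rows = parse_module_rows(ET.fromstring(SAMPLE_TABLE.strip()))
for row in rows:
    print(row)
```

Unlike the notebook loop, this sketch reads the `linkend` directly from the cell that contains the `xref`, rather than writing it into the first empty cell of the row.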

In [None]:
# Create a DataFrame
df_iod = pd.DataFrame(rows, columns=['IOD'] + headers)
df_iod
# IOD: Information Object Definition -> defines the overall structure of the DICOM data
# IE: Information Entity, e.g. Patient, Study, Series, Image
# Module: a group of tags (attributes)
# Reference: the section of the DICOM Standard document where the module is defined
# Usage: see the next cell


Unnamed: 0,IOD,IE,Module,Reference,Usage
0,Enhanced MR Image,Patient,Patient,sect_C.7.1.1,M
1,Enhanced MR Image,Patient,Clinical Trial Subject,sect_C.7.1.3,U
2,Enhanced MR Image,Study,General Study,sect_C.7.2.1,M
3,Enhanced MR Image,Study,Patient Study,sect_C.7.2.2,U
4,Enhanced MR Image,Study,Clinical Trial Study,sect_C.7.2.3,U
5,Enhanced MR Image,Series,General Series,sect_C.7.3.1,M
6,Enhanced MR Image,Series,Clinical Trial Series,sect_C.7.3.2,U
7,Enhanced MR Image,Series,MR Series,sect_C.8.13.6,M
8,Enhanced MR Image,Frame of Reference,Frame of Reference,sect_C.7.4.1,M
9,Enhanced MR Image,Frame of Reference,Synchronization,sect_C.7.4.2,C - Required if time synchronization was applied.


In [27]:
set(df_iod.Usage)
# M : Mandatory
# U : User Optional
# C - Required if ~~ : Conditional

{'C - Required if Image Type (0008,0008) Value 1 is ORIGINAL or MIXED. May be present otherwise.',
 'C - Required if Pixel Presentation (0008,9205) in the  equals COLOR or MIXED.',
 'C - Required if bulk motion synchronization was applied.',
 'C - Required if cardiac synchronization was applied.',
 'C - Required if contrast media were applied.',
 'C - Required if respiratory synchronization was applied.',
 'C - Required if the SOP Instance was created in response to a Frame-Level retrieve request',
 'C - Required if time synchronization was applied.',
 'M',
 'U'}
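
The Usage values combine a one-letter code with an optional condition. A small sketch of how the two parts could be separated, assuming the `" - "` delimiter seen in the output above; `split_usage` is a hypothetical helper name.

```python
def split_usage(usage: str):
    """Split 'C - Required if ...' into ('C', 'Required if ...')."""
    code, separator, condition = usage.partition(" - ")
    # Plain codes such as 'M' or 'U' carry no condition text
    return code.strip(), (condition.strip() if separator else None)


for value in ["M", "U", "C - Required if contrast media were applied."]:
    print(split_usage(value))
```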

## 1.b Extract Reference tables

In [31]:
import pandas as pd
from bs4 import BeautifulSoup

# Map each module section referenced in the IOD table to the table(s) it directly contains
rows = []

df_reference = df_iod[df_iod['Usage']=="M"]

for section_id in df_reference['Reference']:
    section_element = soup.find('section', {'xml:id': section_id})
    
    if section_element:
        tables = section_element.findChildren('table', recursive=False)
        # For each table, create a dictionary with section_id and table XML id and add it to the rows list
        for table in tables:
            if table.has_attr('xml:id'):
                row = {'section_id': section_id, 'table_xml_id': table['xml:id']}
                rows.append(row)
# Create a DataFrame from the rows list
table_df = pd.DataFrame(rows)
table_df

Unnamed: 0,section_id,table_xml_id
0,sect_C.7.1.1,table_C.7-1
1,sect_C.7.2.1,table_C.7-3
2,sect_C.7.3.1,table_C.7-5a
3,sect_C.8.13.6,table_C.8-101
4,sect_C.7.4.1,table_C.7-6
5,sect_C.7.5.1,table_C.7-8
6,sect_C.7.5.2,table_C.7-8b
7,sect_C.7.6.3,table_C.7-11a
8,sect_C.7.6.16,table_C.7.6.16-1
9,sect_C.7.6.17,table_C.7.6.17-1


In [36]:
def process_table(soup, table_xml_id):
    # Find the table using table_xml_id
    table = soup.find('table', {'xml:id': table_xml_id})  # first argument is the tag name to search for, e.g. 'table', 'section', 'div'
    section_xml_id = table.parent['xml:id']

    # Initialize lists to store extracted data
    attribute_names = []
    tags = []
    types = []
    attribute_descriptions = []
    cid = []

    # Extract data from the table rows
    for index, row in enumerate(table.find_all('tr')):
        columns = row.find_all('td')
        if len(columns) == 4:
            attribute_name = columns[0].text.strip()
            if attribute_name.startswith('>'):
                continue
            tag = columns[1].text.strip()
            type_ = columns[2].text.strip()
            description = columns[3].find_all('variablelist')

            defined_terms_dict = {}
            if description:
                for variablelist in description:
                    title = variablelist.title.text.strip().replace(':', '')
                    defined_terms = [term.text.strip() for term in variablelist.find_all('term')]
                    defined_terms_dict[title] = defined_terms
            
            attribute_names.append(attribute_name)
            tags.append(tag)
            types.append(type_)
            attribute_descriptions.append(defined_terms_dict)
            cid.append('')

        elif len(columns) == 2 and index != 0 and cid:
            first_col = columns[0].text.strip()
            if first_col.startswith('>>'):
                continue
            if first_col.startswith('>'):
                olink = columns[1].find('olink')
                if olink:
                    targetptr = olink.get('targetptr', '')
                    if 'CID_' in targetptr:
                        cid_value = targetptr.split('CID_')[-1]
                        cid[-1] += cid_value

    # Create a DataFrame from the extracted data
    data = {
        'section_id': section_xml_id,
        'Attribute Name': attribute_names,
        'Tag': tags,
        'Type': types,
        'Attribute Description': attribute_descriptions,
        'CID': cid
    }
    return pd.DataFrame(data)
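
The part of `process_table` that fills `Attribute Description` collects the `term` entries of each `variablelist`, keyed by the list title. A minimal sketch of that step on a toy cell follows. It uses `xml.etree.ElementTree` instead of BeautifulSoup, and the fallback key `"Untitled"` for a list without a title is my own assumption (the function above would raise an error in that case); `terms_by_title` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Toy description cell shaped like a Part 3 attribute description
CELL = """
<td>
  <para>Toy description text.</para>
  <variablelist spacing="compact">
    <title>Defined Terms:</title>
    <varlistentry><term>TEXT</term><listitem><para/></listitem></varlistentry>
    <varlistentry><term>RFID</term><listitem><para/></listitem></varlistentry>
    <varlistentry><term>BARCODE</term><listitem><para/></listitem></varlistentry>
  </variablelist>
</td>
"""


def terms_by_title(cell):
    """Return {list title: [terms]} for every variablelist in the cell."""
    result = {}
    for variablelist in cell.iter("variablelist"):
        title_el = variablelist.find("title")
        if title_el is not None and title_el.text:
            title = title_el.text.strip().replace(":", "")
        else:
            title = "Untitled"  # assumed fallback, not in the notebook
        result[title] = [(t.text or "").strip() for t in variablelist.iter("term")]
    return result


print(terms_by_title(ET.fromstring(CELL.strip())))
```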

In [None]:
reference_table_data = []
for table_xml_id in table_df['table_xml_id']:
    df = process_table(soup, table_xml_id)
    reference_table_data.append(df)

# combine all DataFrames into one
reference_tables = pd.concat(reference_table_data, ignore_index=True)
reference_tables

# Tag: unique identifier of a data element (group number, element number)
# Type: whether the data element is required: 1 (mandatory), 1C (conditional), 2 (must be present but may be empty), 2C (Type 2, conditional), 3 (optional)
# CID: Context Identifier - value set definition

Unnamed: 0,section_id,Attribute Name,Tag,Type,Attribute Description,CID
0,sect_C.7.1.1,Patient's Name,"(0010,0010)",2,{},
1,sect_C.7.1.1,Patient ID,"(0010,0020)",2,{},
2,sect_C.7.1.1,Type of Patient ID,"(0010,0022)",3,"{'Defined Terms': ['TEXT', 'RFID', 'BARCODE']}",
3,sect_C.7.1.1,Patient's Birth Date,"(0010,0030)",2,{},
4,sect_C.7.1.1,Patient's Birth Date in Alternative Calendar,"(0010,0033)",3,{},
...,...,...,...,...,...,...
166,sect_C.12.1,Conversion Source Attributes Sequence,"(0020,9172)",1C,{},
167,sect_C.12.1,Content Qualification,"(0018,9004)",3,"{'Enumerated Values': ['PRODUCT', 'RESEARCH', ...",
168,sect_C.12.1,Private Data Element Characteristics Sequence,"(0008,0300)",3,{},
169,sect_C.12.1,Instance Origin Status,"(0400,0600)",3,{},
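
The CID column comes from the `targetptr` attribute of an `olink`. A short sketch of that extraction, assuming the pointer has a form such as `sect_CID_7454`; `extract_cid` is a hypothetical helper name. It differs slightly from `process_table`, which keeps everything after `CID_`, whereas this version keeps only the digits.

```python
import re

CID_PATTERN = re.compile(r"CID_(\d+)")


def extract_cid(targetptr: str):
    """Return the numeric context identifier as a string, or None."""
    match = CID_PATTERN.search(targetptr)
    return match.group(1) if match else None


print(extract_cid("sect_CID_7454"))
print(extract_cid("sect_C.7.1.1"))
```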


## 2.a Extract Module Tables for all IODs

In [42]:
from bs4 import BeautifulSoup

# Fetch XML content from the URL
url = 'https://dicom.nema.org/medical/dicom/current/source/docbook/part03/part03.xml'
response = requests.get(url)
xml_content = response.content

# Parse the XML content
soup = BeautifulSoup(xml_content, 'xml')

In [43]:
import re
from bs4 import BeautifulSoup
import pandas as pd

data = []

for section_element in soup.find_all('section', {'label': re.compile(r'^A\.')}):  # labels that start with "A." (Annex A)
    # Find all table elements within the section_element
    tables = section_element.find_all('table')
    for table in tables:
        # Find the caption element for the current table
        caption = table.find('caption')
        # Check if the caption exists and matches the regular expression
        if caption and re.search('IOD Modules', caption.text):
            # Extract the xml:id attribute from the table
            table_id = table.get('xml:id')  # Use .get() to avoid KeyError if the attribute is missing
            # Append the xml:id and caption text to the data list
            data.append((table_id, caption.text.strip()))

# Create a DataFrame from the collected data
df_iod_tables = pd.DataFrame(data, columns=['xml_id', 'iod'])
df_iod_tables = df_iod_tables.drop_duplicates().reset_index(drop=True)

In [44]:
df_iod_tables

Unnamed: 0,xml_id,iod
0,table_A.2-1,Computed Radiography Image IOD Modules
1,table_A.3-1,CT Image IOD Modules
2,table_A.4-1,MR Image IOD Modules
3,table_A.5-1,Nuclear Medicine Image IOD Modules
4,table_A.6-1,Ultrasound Image IOD Modules
...,...,...
167,table_A.88.3-1,Inventory IOD Modules
168,table_A.89.3-1,Photoacoustic Image IOD Modules
169,table_A.90.1.3-1,Confocal Microscopy Image IOD Modules
170,table_A.90.2.3-1,Confocal Microscopy Tiled Pyramidal Image IOD ...
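
The pattern given to `re.compile` decides which section labels are selected, so it is worth checking how it behaves. In a regular expression an unescaped `.` matches any character, and a pattern without `^` can match anywhere in the string. The sketch below compares a loose pattern with an anchored, escaped one; the labels `BA1` and `A1` are made up to show the difference.

```python
import re

loose = re.compile("A.")      # 'A' followed by any character, anywhere in the label
strict = re.compile(r"^A\.")  # label must start with the literal text 'A.'

labels = ["A.36", "A.90.2.3", "C.8.13", "BA1", "A1"]

loose_hits = [label for label in labels if loose.search(label)]
strict_hits = [label for label in labels if strict.search(label)]

print(loose_hits)
print(strict_hits)
```

In this notebook the caption filter on `IOD Modules` narrows the result further, so a loose pattern may still produce the same table list, but the anchored form states the intent more precisely.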


In [45]:
import pandas as pd

first_table_headers = None
table_data = []  # A list to store the DataFrame for each table

for index, row in df_iod_tables.iterrows():
    xml_id = row['xml_id']
    caption = row['iod']

    # Find the table in 'soup' with the matching 'xml:id'
    table = soup.find('table', {'xml:id': xml_id})
    if table:
        # Extract table headers
        headers = [th.text.strip() for th in table.find_all('th')]
        # If it's the first table, store its headers
        if first_table_headers is None:
            first_table_headers = headers
        # For subsequent tables, check if the headers match the first table's headers
        elif headers != first_table_headers:
            # If headers don't match, skip this table
            continue

        # Reset the current_ie for each table to ensure data integrity
        current_ie = None  
        rows = []
        
        for tr in table.find_all('tr')[1:]:  # Skip the header row
            cells = tr.find_all(['td', 'th'])
            # Extract cell values
            cell_values = [cell.text.strip() for cell in cells]

            # Check if the cell values are complete (some cells may be missing due to rowspan)
            if len(cell_values) < len(headers):
                # Fill in the missing values based on the previous row
                for i in range(len(headers) - len(cell_values)):
                    cell_values.insert(0, current_ie)

            # Update current_ie based on the first cell value (the IE column)
            current_ie = cell_values[0]

            # Extract the reference value if available
            reference = None
            for cell in cells:
                xref = cell.find('xref', {'xrefstyle': 'select: labelnumber'})
                if xref:
                    reference = xref['linkend']
                    break

            # Insert the reference value into the correct position
            if reference:
                # Find the index of the cell containing the reference
                reference_index = next((idx for idx, val in enumerate(cell_values) if val == ''), None)
                if reference_index is not None:
                    cell_values[reference_index] = reference

            # Append the row data
            table_data.append([xml_id, caption] + cell_values)

# Create the final DataFrame
final_df = pd.DataFrame(table_data, columns=['xml_id', 'iod'] + first_table_headers)

In [59]:
final_df['Usage_code'] = final_df['Usage'].str[0]  # note: ['Usage'][0] would return the first row's value; .str[0] returns the first character (the code) of every row
final_df

Unnamed: 0,xml_id,iod,IE,Module,Reference,Usage,Usage_code
0,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M
1,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Clinical Trial Subject,sect_C.7.1.3,U,U
2,table_A.2-1,Computed Radiography Image IOD Modules,Study,General Study,sect_C.7.2.1,M,M
3,table_A.2-1,Computed Radiography Image IOD Modules,Study,Patient Study,sect_C.7.2.2,U,U
4,table_A.2-1,Computed Radiography Image IOD Modules,Study,Clinical Trial Study,sect_C.7.2.3,U,U
...,...,...,...,...,...,...,...
3302,table_A.91-1,Height Map Segmentation IOD Modules,Image,ICC Profile,sect_C.11.15,U,U
3303,table_A.91-1,Height Map Segmentation IOD Modules,Image,SOP Common,sect_C.12.1,M,M
3304,table_A.91-1,Height Map Segmentation IOD Modules,Image,Common Instance Reference,sect_C.12.2,M,M
3305,table_A.91-1,Height Map Segmentation IOD Modules,Image,Frame Extraction,sect_C.12.3,C - Required if the SOP Instance was created i...,C


In [56]:
final_df[final_df['xml_id']=="table_A.36-1"]

Unnamed: 0,xml_id,iod,IE,Module,Reference,Usage,Usage_code
1614,table_A.36-1,Enhanced MR Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M
1615,table_A.36-1,Enhanced MR Image IOD Modules,Patient,Clinical Trial Subject,sect_C.7.1.3,U,U
1616,table_A.36-1,Enhanced MR Image IOD Modules,Study,General Study,sect_C.7.2.1,M,M
1617,table_A.36-1,Enhanced MR Image IOD Modules,Study,Patient Study,sect_C.7.2.2,U,U
1618,table_A.36-1,Enhanced MR Image IOD Modules,Study,Clinical Trial Study,sect_C.7.2.3,U,U
1619,table_A.36-1,Enhanced MR Image IOD Modules,Series,General Series,sect_C.7.3.1,M,M
1620,table_A.36-1,Enhanced MR Image IOD Modules,Series,Clinical Trial Series,sect_C.7.3.2,U,U
1621,table_A.36-1,Enhanced MR Image IOD Modules,Series,MR Series,sect_C.8.13.6,M,M
1622,table_A.36-1,Enhanced MR Image IOD Modules,Frame of Reference,Frame of Reference,sect_C.7.4.1,M,M
1623,table_A.36-1,Enhanced MR Image IOD Modules,Frame of Reference,Synchronization,sect_C.7.4.2,C - Required if time synchronization was applied.,C


In [63]:
final_df.groupby('Usage_code')['xml_id'].count()  # number of rows per usage code

Usage_code
C     319
M    1782
U    1206
Name: xml_id, dtype: int64

In [67]:
final_df['iod'].nunique()

172

In [65]:
final_df['xml_id'].nunique()

172

In [66]:
final_df['Reference'].nunique()

360

### Exception handling
Some Reference sections need further adjustment. On inspection, the sections listed below redirect to one or more other sections of the standard. To handle these exceptions, I manually collected the redirect targets and stored them in a DataFrame named `mapping_data`.

In [74]:
#'sect_C.8.15.4': 'sect_C.8.2.2.1', 'sect_C.8.2.2.2', 'sect_C.8.2.2.3'
#'sect_C.11.11': 'sect_C.11.11.1'
#'sect_C.17.3': 'sect_C.17.3.3', 'sect_C.17.3.4'
#'sect_C.7.6.20': 'sect_C.10.12'
#'sect_C.28.1':'sect_C.10.9'
#'sect_C.29.3':'sect_C.29.3.1'
#'sect_C.36.18': 'sect_C.36.2.2.7', 'sect_C.36.2.2.8', 'sect_C.36.2.2.14'

mapping_data = {
    'Reference': ['sect_C.8.15.4', 'sect_C.8.15.4', 'sect_C.8.15.4', 'sect_C.11.11', 'sect_C.17.3', 'sect_C.17.3', 'sect_C.7.6.20', 'sect_C.28.1', 'sect_C.29.3', 'sect_C.36.18', 'sect_C.36.18', 'sect_C.36.18'],
    'Reference_adjusted': ['sect_C.8.2.2.1', 'sect_C.8.2.2.2', 'sect_C.8.2.2.3', 'sect_C.11.11.1', 'sect_C.17.3.3', 'sect_C.17.3.4', 'sect_C.10.12', 'sect_C.10.9', 'sect_C.29.3.1', 'sect_C.36.2.2.7', 'sect_C.36.2.2.8', 'sect_C.36.2.2.14']
}
mapping_data = pd.DataFrame(mapping_data)
mapping_data

Unnamed: 0,Reference,Reference_adjusted
0,sect_C.8.15.4,sect_C.8.2.2.1
1,sect_C.8.15.4,sect_C.8.2.2.2
2,sect_C.8.15.4,sect_C.8.2.2.3
3,sect_C.11.11,sect_C.11.11.1
4,sect_C.17.3,sect_C.17.3.3
5,sect_C.17.3,sect_C.17.3.4
6,sect_C.7.6.20,sect_C.10.12
7,sect_C.28.1,sect_C.10.9
8,sect_C.29.3,sect_C.29.3.1
9,sect_C.36.18,sect_C.36.2.2.7


In [None]:
final_df_expanded = pd.merge(final_df, mapping_data, on = 'Reference', how = 'left')
final_df_expanded['Reference_adjusted'] = final_df_expanded['Reference_adjusted'].fillna(final_df_expanded['Reference'])  # keep the original Reference where no adjusted value exists
final_df_expanded

Unnamed: 0,xml_id,iod,IE,Module,Reference,Usage,Usage_code,Reference_adjusted
0,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1
1,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Clinical Trial Subject,sect_C.7.1.3,U,U,sect_C.7.1.3
2,table_A.2-1,Computed Radiography Image IOD Modules,Study,General Study,sect_C.7.2.1,M,M,sect_C.7.2.1
3,table_A.2-1,Computed Radiography Image IOD Modules,Study,Patient Study,sect_C.7.2.2,U,U,sect_C.7.2.2
4,table_A.2-1,Computed Radiography Image IOD Modules,Study,Clinical Trial Study,sect_C.7.2.3,U,U,sect_C.7.2.3
...,...,...,...,...,...,...,...,...
3331,table_A.91-1,Height Map Segmentation IOD Modules,Image,ICC Profile,sect_C.11.15,U,U,sect_C.11.15
3332,table_A.91-1,Height Map Segmentation IOD Modules,Image,SOP Common,sect_C.12.1,M,M,sect_C.12.1
3333,table_A.91-1,Height Map Segmentation IOD Modules,Image,Common Instance Reference,sect_C.12.2,M,M,sect_C.12.2
3334,table_A.91-1,Height Map Segmentation IOD Modules,Image,Frame Extraction,sect_C.12.3,C - Required if the SOP Instance was created i...,C,sect_C.12.3
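
Because one Reference can redirect to several sections, the left merge produces one output row per redirect target, which is why the row count grows. A plain-Python sketch of the same expansion, using two of the redirects listed above; `expand_references` is a hypothetical helper name and the input rows are a toy subset.

```python
# Two of the manually collected redirects from mapping_data
REDIRECTS = {
    "sect_C.11.11": ["sect_C.11.11.1"],
    "sect_C.17.3": ["sect_C.17.3.3", "sect_C.17.3.4"],
}


def expand_references(rows):
    """Yield one row per redirect target; unmapped references map to themselves."""
    for row in rows:
        targets = REDIRECTS.get(row["Reference"], [row["Reference"]])
        for target in targets:
            yield {**row, "Reference_adjusted": target}


toy_rows = [
    {"Module": "Patient", "Reference": "sect_C.7.1.1"},
    {"Module": "Toy module", "Reference": "sect_C.17.3"},
]
expanded = list(expand_references(toy_rows))
for row in expanded:
    print(row)
```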


In [72]:
final_df_expanded[final_df_expanded['Reference']!= final_df_expanded['Reference_adjusted']]

Unnamed: 0,xml_id,iod,IE,Module,Reference,Usage,Usage_code,Reference_adjusted
896,table_A.33.1-1,Grayscale Softcopy Presentation State IOD Modules,Presentation State,Presentation State Relationship,sect_C.11.11,M,M,sect_C.11.11.1
924,table_A.33.2-1,Color Softcopy Presentation State IOD Modules,Presentation State,Presentation State Relationship,sect_C.11.11,M,M,sect_C.11.11.1
948,table_A.33.3-1,Pseudo-Color Softcopy Presentation State IOD M...,Presentation State,Presentation State Relationship,sect_C.11.11,M,M,sect_C.11.11.1
1014,table_A.33.6-1,XA/XRF Grayscale Softcopy Presentation State I...,Presentation State,Presentation State Relationship,sect_C.11.11,M,M,sect_C.11.11.1
1063,table_A.33.8-1,Variable Modality LUT Softcopy Presentation St...,Presentation State,Presentation State Relationship,sect_C.11.11,M,M,sect_C.11.11.1
...,...,...,...,...,...,...,...,...
3014,table_A.86.1.7-1,Robotic-Arm Radiation IOD Modules,RT Radiation,Robotic-Arm Delivery Device,sect_C.36.18,M,M,sect_C.36.2.2.8
3015,table_A.86.1.7-1,Robotic-Arm Radiation IOD Modules,RT Radiation,Robotic-Arm Delivery Device,sect_C.36.18,M,M,sect_C.36.2.2.14
3109,table_A.86.1.12-1,Robotic-Arm Radiation Record IOD Modules,RT Delivered Radiation,Robotic-Arm Delivery Device,sect_C.36.18,M,M,sect_C.36.2.2.7
3110,table_A.86.1.12-1,Robotic-Arm Radiation Record IOD Modules,RT Delivered Radiation,Robotic-Arm Delivery Device,sect_C.36.18,M,M,sect_C.36.2.2.8


In [73]:
final_df_expanded['Reference_adjusted'].nunique()

364

## 2.b Extract all Reference tables

In [42]:
# Map each referenced module section to the table(s) it directly contains
# This time use every IOD and every Usage value, not only mandatory modules
rows = []

df_reference = final_df_expanded['Reference_adjusted'].drop_duplicates()

for section_id in df_reference:
    section_element = soup.find('section', {'xml:id': section_id})
    
    if section_element:
        tables = section_element.findChildren('table', recursive=False)
        # For each table, create a dictionary with section_id and table XML id and add it to the rows list
        for table in tables:
            if table.has_attr('xml:id'):
                row = {'section_id': section_id, 'table_xml_id': table['xml:id']}
                rows.append(row)
# Create a DataFrame from the rows list
table_df = pd.DataFrame(rows)
table_df

Unnamed: 0,section_id,table_xml_id
0,sect_C.7.1.1,table_C.7-1
1,sect_C.7.1.3,table_C.7-2b
2,sect_C.7.2.1,table_C.7-3
3,sect_C.7.2.2,table_C.7-4a
4,sect_C.7.2.3,table_C.7-4b
...,...,...
371,sect_C.8.34.3,table_C.8.34.3-1
372,sect_C.8.34.4,table_C.8.34.4-1
373,sect_C.8.35.1,table_C.8.35.1-1
374,sect_C.8.35.3,table_C.8.35.3-1


In [75]:
table_df_agg = table_df.groupby('section_id')['table_xml_id'].agg(count = 'count').reset_index()
multiple_ref_tables = table_df_agg[table_df_agg['count']>1]['section_id']

In [77]:
dedup_table_xml = table_df[table_df['section_id'].isin(multiple_ref_tables)]
dedup_table_xml = dedup_table_xml.groupby('section_id').nth(1).reset_index(drop=True)
dedup_table_xml

Unnamed: 0,section_id,table_xml_id


In [78]:
table_df_new = table_df.merge(dedup_table_xml, on = 'section_id', how = 'left', suffixes=('', '_adj'))
table_df_new['table_xml_id_new'] = table_df_new['table_xml_id_adj'].combine_first(table_df_new['table_xml_id'])
table_df_new

Unnamed: 0,section_id,table_xml_id,table_xml_id_adj,table_xml_id_new
0,sect_C.7.1.1,table_C.7-1,,table_C.7-1
1,sect_C.7.2.1,table_C.7-3,,table_C.7-3
2,sect_C.7.3.1,table_C.7-5a,,table_C.7-5a
3,sect_C.8.13.6,table_C.8-101,,table_C.8-101
4,sect_C.7.4.1,table_C.7-6,,table_C.7-6
5,sect_C.7.5.1,table_C.7-8,,table_C.7-8
6,sect_C.7.5.2,table_C.7-8b,,table_C.7-8b
7,sect_C.7.6.3,table_C.7-11a,,table_C.7-11a
8,sect_C.7.6.16,table_C.7.6.16-1,,table_C.7.6.16-1
9,sect_C.7.6.17,table_C.7.6.17-1,,table_C.7.6.17-1


In [82]:
table_df_new = table_df_new[['section_id', 'table_xml_id_new']].drop_duplicates().reset_index(drop=True)
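
As I read the three cells above, the intent is to keep one table per section: when a section directly contains more than one table, the second one (`nth(1)`) is used, otherwise the only table is kept. A plain-Python sketch of that selection rule follows; the ids are made up and `pick_table_per_section` is a hypothetical helper name.

```python
from collections import defaultdict


def pick_table_per_section(pairs):
    """Map each section to its second table if it has several, else its only table."""
    grouped = defaultdict(list)
    for section_id, table_id in pairs:
        grouped[section_id].append(table_id)
    return {
        section_id: (tables[1] if len(tables) > 1 else tables[0])
        for section_id, tables in grouped.items()
    }


# Hypothetical (section_id, table_xml_id) pairs in document order
pairs = [
    ("sect_demo.1", "table_demo.1-1"),
    ("sect_demo.2", "table_demo.2-1"),
    ("sect_demo.2", "table_demo.2-2"),
]
chosen = pick_table_per_section(pairs)
print(chosen)
```

Whether the second table is always the right one depends on how each section is laid out in the standard, so the result is worth spot-checking.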

In [83]:
# same function as above 
import pandas as pd
from bs4 import BeautifulSoup

def process_table(soup, table_xml_id):
    # Find the table using table_xml_id
    table = soup.find('table', {'xml:id': table_xml_id})
    section_xml_id = table.parent['xml:id']

    # Initialize lists to store extracted data
    attribute_names = []
    tags = []
    types = []
    attribute_descriptions = []
    cid = []

    # Extract data from the table rows
    for index, row in enumerate(table.find_all('tr')):
        columns = row.find_all('td')
        if len(columns) == 4:
            attribute_name = columns[0].text.strip()
            if attribute_name.startswith('>'):
                continue
            tag = columns[1].text.strip()
            type_ = columns[2].text.strip()
            description = columns[3].find_all('variablelist')

            defined_terms_dict = {}
            if description:
                for variablelist in description:
                    title = variablelist.title.text.strip().replace(':', '')
                    defined_terms = [term.text.strip() for term in variablelist.find_all('term')]
                    defined_terms_dict[title] = defined_terms
            
            attribute_names.append(attribute_name)
            tags.append(tag)
            types.append(type_)
            attribute_descriptions.append(defined_terms_dict)
            cid.append('')

        elif len(columns) == 2 and index != 0 and cid:
            first_col = columns[0].text.strip()
            if first_col.startswith('>>'):
                continue
            if first_col.startswith('>'):
                olink = columns[1].find('olink')
                if olink:
                    targetptr = olink.get('targetptr', '')
                    if 'CID_' in targetptr:
                        cid_value = targetptr.split('CID_')[-1]
                        cid[-1] += cid_value

    # Create a DataFrame from the extracted data
    data = {
        'section_id': section_xml_id,
        'Attribute Name': attribute_names,
        'Tag': tags,
        'Type': types,
        'Attribute Description': attribute_descriptions,
        'CID': cid
    }
    return pd.DataFrame(data)

In [84]:
reference_table_data = []
for table_xml_id in table_df_new['table_xml_id_new']:
    df = process_table(soup, table_xml_id)
    reference_table_data.append(df)

# combine all DataFrames into one
reference_tables = pd.concat(reference_table_data, ignore_index=True)

In [85]:
# subset of tags based on type = required
reference_tables[reference_tables['Type']=='1']

Unnamed: 0,section_id,Attribute Name,Tag,Type,Attribute Description,CID
34,sect_C.7.2.1,Study Instance UID,"(0020,000D)",1,{},
54,sect_C.7.3.1,Modality,"(0008,0060)",1,{},
55,sect_C.7.3.1,Series Instance UID,"(0020,000E)",1,{},
76,sect_C.8.13.6,Modality,"(0008,0060)",1,{'Enumerated Values': ['MR']},
78,sect_C.7.4.1,Frame of Reference UID,"(0020,0052)",1,{},
99,sect_C.7.5.2,Manufacturer,"(0008,0070)",1,{},
100,sect_C.7.5.2,Manufacturer's Model Name,"(0008,1090)",1,{},
101,sect_C.7.5.2,Device Serial Number,"(0018,1000)",1,{},
102,sect_C.7.5.2,Software Versions,"(0018,1020)",1,{},
108,sect_C.7.6.16,Shared Functional Groups Sequence,"(5200,9229)",1,{},


In [86]:
# subset of tags with attribute description
reference_tables[reference_tables['Attribute Description']!= {}]

Unnamed: 0,section_id,Attribute Name,Tag,Type,Attribute Description,CID
2,sect_C.7.1.1,Type of Patient ID,"(0010,0022)",3,"{'Defined Terms': ['TEXT', 'RFID', 'BARCODE']}",
7,sect_C.7.1.1,Patient's Sex,"(0010,0040)",2,"{'Enumerated Values': ['M', 'F', 'O']}",
9,sect_C.7.1.1,Quality Control Subject,"(0010,0200)",3,"{'Enumerated Values': ['YES', 'NO']}",
31,sect_C.7.1.1,Patient Identity Removed,"(0012,0062)",3,"{'Enumerated Values': ['YES', 'NO']}",
57,sect_C.7.3.1,Laterality,"(0020,0060)",2C,"{'Enumerated Values': ['R', 'L']}",
74,sect_C.7.3.1,Anatomical Orientation Type,"(0010,2210)",1C,"{'Enumerated Values': ['BIPED', 'QUADRUPED']}",
76,sect_C.8.13.6,Modality,"(0008,0060)",1,{'Enumerated Values': ['MR']},
114,sect_C.7.6.16,Stereo Pairs Present,"(0022,0028)",3,"{'Enumerated Values': ['YES', 'NO']}",
123,sect_C.7.6.17,Dimension Organization Type,"(0020,9311)",3,"{'Defined Terms': ['3D', '3D_TEMPORAL', 'TILED...",
135,sect_C.8.13.1,Burned In Annotation,"(0028,0301)",1C,{'Enumerated Values': ['NO']},


In [87]:
# subset of tags with CIDs
reference_tables[reference_tables['CID']!='']

Unnamed: 0,section_id,Attribute Name,Tag,Type,Attribute Description,CID
15,sect_C.7.1.1,Ethnic Group Code Sequence,"(0010,2161)",3,{},6099
18,sect_C.7.1.1,Patient Species Code Sequence,"(0010,2202)",1C,{},7454
20,sect_C.7.1.1,Patient Breed Code Sequence,"(0010,2293)",2C,{},7480
33,sect_C.7.1.1,De-identification Method Code Sequence,"(0012,0064)",1C,{},7050
50,sect_C.7.2.1,Requesting Service Code Sequence,"(0032,1034)",3,{},7030
52,sect_C.7.2.1,Procedure Code Sequence,"(0008,1032)",3,{},101
85,sect_C.7.5.1,Institutional Department Type Code Sequence,"(0008,1041)",3,{},7030


In [88]:
# create one dataframe and export the file
mapping_table = pd.merge(final_df_expanded, reference_tables, left_on = 'Reference_adjusted', right_on = 'section_id', how = 'left')
mapping_table

Unnamed: 0,xml_id,iod,IE,Module,Reference,Usage,Usage_code,Reference_adjusted,section_id,Attribute Name,Tag,Type,Attribute Description,CID
0,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,sect_C.7.1.1,Patient's Name,"(0010,0010)",2,{},
1,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,sect_C.7.1.1,Patient ID,"(0010,0020)",2,{},
2,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,sect_C.7.1.1,Type of Patient ID,"(0010,0022)",3,"{'Defined Terms': ['TEXT', 'RFID', 'BARCODE']}",
3,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,sect_C.7.1.1,Patient's Birth Date,"(0010,0030)",2,{},
4,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,sect_C.7.1.1,Patient's Birth Date in Alternative Calendar,"(0010,0033)",3,{},
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23440,table_A.91-1,Height Map Segmentation IOD Modules,Image,SOP Common,sect_C.12.1,M,M,sect_C.12.1,sect_C.12.1,Instance Origin Status,"(0400,0600)",3,{},
23441,table_A.91-1,Height Map Segmentation IOD Modules,Image,SOP Common,sect_C.12.1,M,M,sect_C.12.1,sect_C.12.1,Barcode Value,"(2200,0005)",3,{},
23442,table_A.91-1,Height Map Segmentation IOD Modules,Image,Common Instance Reference,sect_C.12.2,M,M,sect_C.12.2,,,,,,
23443,table_A.91-1,Height Map Segmentation IOD Modules,Image,Frame Extraction,sect_C.12.3,C - Required if the SOP Instance was created i...,C,sect_C.12.3,,,,,,


In [89]:
mapping_table['Tag'] = mapping_table['Tag'].str.replace(r'[(),]', '', regex = True)
mapping_table = mapping_table.drop(columns=['section_id'])
mapping_table.head()

Unnamed: 0,xml_id,iod,IE,Module,Reference,Usage,Usage_code,Reference_adjusted,Attribute Name,Tag,Type,Attribute Description,CID
0,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,Patient's Name,100010,2,{},
1,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,Patient ID,100020,2,{},
2,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,Type of Patient ID,100022,3,"{'Defined Terms': ['TEXT', 'RFID', 'BARCODE']}",
3,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,Patient's Birth Date,100030,2,{},
4,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,Patient's Birth Date in Alternative Calendar,100033,3,{},
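
In the preview above the tags appear as `100010` rather than `00100010`, which suggests the leading zeros were lost somewhere, possibly only when the output was rendered or exported. A sketch of one way to keep tags as 8-character strings, assuming that is the intended format. The helper `normalize_tag` is hypothetical and not part of the original notebook.

```python
import pandas as pd

def normalize_tag(tag):
    """Turn '(gggg,eeee)' into 'ggggeeee'; leave missing values untouched."""
    if pd.isna(tag):
        return tag
    cleaned = str(tag).strip().strip("()").replace(",", "").replace(" ", "")
    # Left-pad in case leading zeros were lost in a numeric round trip.
    return cleaned.zfill(8)

# Hypothetical sample: two well-formed tags, a missing value, a zero-stripped tag.
tags = pd.Series(["(0010,0010)", "(2200,0005)", None, "100022"])
normalized = tags.map(normalize_tag)
print(normalized.tolist())
```

When such a table is read back from CSV, passing `dtype={'Tag': str}` to `pd.read_csv` keeps pandas from parsing the tags as integers.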


## Add SOP Class UID from Part 4
*Run the Part 4 extraction notebook ("DICOM_P4_harvest_SOP") before running this section.*

In [121]:
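# Strip the ' IOD Modules' suffix so the IOD name can be matched against the truncated SOP class names
mapping_table['short_iod'] = mapping_table['iod'].str.replace(' IOD Modules', '', regex = False)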

In [122]:
sop = pd.read_csv("../files/DICOM Standard/part4_sop_class.csv")
sop['Truncated SOP Class Name'] = sop['Truncated SOP Class Name'].str.replace(' Storage', '')

In [123]:
# SOP classes whose truncated name does not match the Part 3 IOD name; map them by hand
update_dict = {"Cardiac Electrophysiology Waveform Storage": "Basic Cardiac EP",
 "Ambulatory ECG Waveform Storage": "Ambulatory ECG",
 "General ECG Waveform Storage": "General ECG",
 "12-lead ECG Waveform Storage": "12-Lead ECG"
}

def update_truncated_sop_class_name(row):
    if row['SOP Class Name'] in update_dict:
        return update_dict[row['SOP Class Name']]
    return row['Truncated SOP Class Name']

# Apply the function to update the column
sop['Truncated SOP Class Name'] = sop.apply(update_truncated_sop_class_name, axis=1)

In [124]:
sop[sop['SOP Class Name'].isin(update_dict.keys())]

Unnamed: 0,SOP Class Name,SOP Class UID,Truncated SOP Class Name
24,12-lead ECG Waveform Storage,1.2.840.10008.5.1.4.1.1.9.1.1,12-Lead ECG
25,General ECG Waveform Storage,1.2.840.10008.5.1.4.1.1.9.1.2,General ECG
26,Ambulatory ECG Waveform Storage,1.2.840.10008.5.1.4.1.1.9.1.3,Ambulatory ECG
29,Cardiac Electrophysiology Waveform Storage,1.2.840.10008.5.1.4.1.1.9.3.1,Basic Cardiac EP
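
The row-wise `apply` above works, but the same override can be written without a Python-level loop over rows. A minimal sketch on toy data; `sop_demo` and `overrides` are hypothetical stand-ins for `sop` and `update_dict`.

```python
import pandas as pd

# Toy rows shaped like the Part 4 extract (hypothetical sample).
sop_demo = pd.DataFrame({
    "SOP Class Name": ["12-lead ECG Waveform Storage", "CT Image Storage"],
    "Truncated SOP Class Name": ["12-lead ECG Waveform", "CT Image"],
})
overrides = {"12-lead ECG Waveform Storage": "12-Lead ECG"}

# map() yields the override where the name is a key and NaN elsewhere;
# fillna() then falls back to the existing truncated name.
sop_demo["Truncated SOP Class Name"] = (
    sop_demo["SOP Class Name"]
    .map(overrides)
    .fillna(sop_demo["Truncated SOP Class Name"])
)
print(sop_demo["Truncated SOP Class Name"].tolist())
```

Both forms give the same result; the vectorized one tends to be shorter and faster on large frames.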


In [125]:
# SOP classes to add by hand to the Part 4 extract: full name, UID, and truncated name used for matching
data = {
    "SOP Class Name": [
        "Implantation Plan SR Document Storage",
        "RT Treatment Summary Record Storage",
        "RT Ion Plan Storage",
        "RT Ion Beams Treatment Record Storage",
        "RT Beams Delivery Instruction Storage",
        "Generic Implant Template Storage",
        "Implant Assembly Template Storage",
        "Implant Template Group Storage"
        ],
    "SOP Class UID": [
        "1.2.840.10008.5.1.4.1.1.88.70",
        "1.2.840.10008.5.1.4.1.1.481.7",
        "1.2.840.10008.5.1.4.1.1.481.8",
        "1.2.840.10008.5.1.4.1.1.481.9",
        "1.2.840.10008.5.1.4.34.7",
        "1.2.840.10008.5.1.4.43.1",
        "1.2.840.10008.5.1.4.44.1",
        "1.2.840.10008.5.1.4.45.1"
        ],
    "Truncated SOP Class Name": [
        "Implantation Plan SR Document",
        "RT Treatment Summary Record",
        "RT Ion Plan",
        "RT Ion Beams Treatment Record",
        "RT Beams Delivery Instruction",
        "Generic Implant Template",
        "Implant Assembly Template",
        "Implant Template Group"
        ]
}

data_input = pd.DataFrame(data)

# Concatenate the original DataFrame with the new rows DataFrame
sop = pd.concat([sop, data_input], ignore_index=True)

In [126]:
mapping_table = mapping_table.merge(sop, left_on = 'short_iod', right_on = 'Truncated SOP Class Name', how = 'left')
mapping_table.head()

Unnamed: 0,xml_id,iod,IE,Module,Reference,Usage,Usage_code,Reference_adjusted,Attribute Name,Tag,Type,Attribute Description,CID,SOP Class UID_x,short_iod,SOP Class Name,SOP Class UID_y,Truncated SOP Class Name
0,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,Patient's Name,100010,2,{},,1.2.840.10008.5.1.4.1.1.1,Computed Radiography Image,Computed Radiography Image Storage,1.2.840.10008.5.1.4.1.1.1,Computed Radiography Image
1,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,Patient ID,100020,2,{},,1.2.840.10008.5.1.4.1.1.1,Computed Radiography Image,Computed Radiography Image Storage,1.2.840.10008.5.1.4.1.1.1,Computed Radiography Image
2,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,Type of Patient ID,100022,3,"{'Defined Terms': ['TEXT', 'RFID', 'BARCODE']}",,1.2.840.10008.5.1.4.1.1.1,Computed Radiography Image,Computed Radiography Image Storage,1.2.840.10008.5.1.4.1.1.1,Computed Radiography Image
3,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,Patient's Birth Date,100030,2,{},,1.2.840.10008.5.1.4.1.1.1,Computed Radiography Image,Computed Radiography Image Storage,1.2.840.10008.5.1.4.1.1.1,Computed Radiography Image
4,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,Patient's Birth Date in Alternative Calendar,100033,3,{},,1.2.840.10008.5.1.4.1.1.1,Computed Radiography Image,Computed Radiography Image Storage,1.2.840.10008.5.1.4.1.1.1,Computed Radiography Image
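
The output above contains `SOP Class UID_x` and `SOP Class UID_y`. pandas adds these suffixes when both frames already hold a column with the same name, so one likely cause is that the merge cell was run on a `mapping_table` that already had a `SOP Class UID` column from an earlier run. A sketch of a merge that tolerates re-running and also lists IOD names without a match. All frames and names below are toy, hypothetical data.

```python
import pandas as pd

# Toy frames (hypothetical); mapping_demo already carries a UID column from an earlier run.
mapping_demo = pd.DataFrame({
    "short_iod": ["Computed Radiography Image", "Some Unmatched IOD"],
    "SOP Class UID": ["1.2.840.10008.5.1.4.1.1.1", None],
})
sop_demo = pd.DataFrame({
    "Truncated SOP Class Name": ["Computed Radiography Image"],
    "SOP Class UID": ["1.2.840.10008.5.1.4.1.1.1"],
})

# Drop the stale column first so the merge yields a single "SOP Class UID".
merged = (
    mapping_demo.drop(columns=["SOP Class UID"], errors="ignore")
    .merge(sop_demo, left_on="short_iod",
           right_on="Truncated SOP Class Name", how="left")
)

# IOD names that did not match any SOP class.
missing = merged.loc[merged["SOP Class UID"].isna(), "short_iod"].unique().tolist()
print(list(merged.columns))
print(missing)
```

Any name reported in `missing` is a candidate for `update_dict` or for the manually added rows.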


In [127]:
mapping_table = mapping_table.drop(columns = ['short_iod', 'Truncated SOP Class Name', 'SOP Class Name'])
mapping_table.head()

Unnamed: 0,xml_id,iod,IE,Module,Reference,Usage,Usage_code,Reference_adjusted,Attribute Name,Tag,Type,Attribute Description,CID,SOP Class UID_x,SOP Class UID_y
0,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,Patient's Name,100010,2,{},,1.2.840.10008.5.1.4.1.1.1,1.2.840.10008.5.1.4.1.1.1
1,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,Patient ID,100020,2,{},,1.2.840.10008.5.1.4.1.1.1,1.2.840.10008.5.1.4.1.1.1
2,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,Type of Patient ID,100022,3,"{'Defined Terms': ['TEXT', 'RFID', 'BARCODE']}",,1.2.840.10008.5.1.4.1.1.1,1.2.840.10008.5.1.4.1.1.1
3,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,Patient's Birth Date,100030,2,{},,1.2.840.10008.5.1.4.1.1.1,1.2.840.10008.5.1.4.1.1.1
4,table_A.2-1,Computed Radiography Image IOD Modules,Patient,Patient,sect_C.7.1.1,M,M,sect_C.7.1.1,Patient's Birth Date in Alternative Calendar,100033,3,{},,1.2.840.10008.5.1.4.1.1.1,1.2.840.10008.5.1.4.1.1.1


In [128]:
mapping_table.to_pickle('../files/DICOM Standard/part3_mapping.pkl')
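
A pickle file keeps column dtypes, so string tags survive a save and reload unchanged. A small round-trip sketch with a toy frame, written to a temporary directory; the file name is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame with string tags that start with zeros (hypothetical sample).
demo = pd.DataFrame({"Tag": ["00100010", "00100020"], "Type": ["2", "2"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "part3_mapping_demo.pkl")  # hypothetical file name
    demo.to_pickle(path)
    restored = pd.read_pickle(path)

# The reloaded frame should equal the original, leading zeros included.
print(restored.equals(demo))
```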