Let us assume that we are given a CSV or Excel file that contains some information that we should
query on our FHIR server. In the `example.csv` file we have a list of ICD-10 codes, but it could
just as well be a list of patient IDs or anything else.

The process:
 1. Find all patients with these conditions.
 2. Collect all of their CTs.
 3. Download the CTs.

We have created this example according to our FHIR server, where we can also download studies
using the Dicom Web Adapter. For this purpose we use the [DicomWebClient](https://dicomweb-client.readthedocs.io/en/latest/usage.html).
I am not sure how this works at other university hospitals, so if
you have any particular requests, please do not hesitate to ask.


In [1]:
import pandas as pd

condition_df = pd.read_csv("example.csv")
condition_df

Unnamed: 0,icd_10
0,C78.7
1,C24.9


We now run the query and retrieve all the Conditions that have one of these ICD-10 codes.


In [2]:
from fhir_pyrate import Ahoy, Pirate, DicomDownloader
import os
from dotenv import load_dotenv

load_dotenv()
# I have stored these variables in an .env file
BASIC_AUTH = os.environ["BASIC_AUTH"]
REFRESH_AUTH = os.environ["REFRESH_AUTH"]
SEARCH_URL = os.environ["SEARCH_URL"]
DICOM_WEB_URL = os.environ["DICOM_WEB_URL"]

# Init authentication
auth = Ahoy(
    auth_type="token",
    auth_method="env",
    auth_url=BASIC_AUTH,
    refresh_url=REFRESH_AUTH,
)
search = Pirate(
    auth=auth, base_url=SEARCH_URL, print_request_url=False, num_processes=1
)
condition_patient_df = search.trade_rows_for_dataframe(  # Runs one Condition query per DataFrame row
    df=condition_df,
    resource_type="Condition",
    request_params={
        "_sort": "_id",
        "_count": 100,
    },
    with_ref=True,
    df_constraints={"code": ("http://fhir.de/CodeSystem/dimdi/icd-10-gm%7C", "icd_10")},
    fhir_paths=[("patient_id", "subject.reference")],
    num_pages=1,  # Only collect the first page of results for each query;
    # the page size is set by the `_count` parameter above
)
redacted_df = condition_patient_df.copy()
redacted_df["patient_id"] = "<redacted-id>"
redacted_df

Query & Build DF (Condition): 100%|██████████| 2/2 [00:00<00:00,  2.02it/s]


Unnamed: 0,patient_id,icd_10
0,<redacted-id>,C78.7
1,<redacted-id>,C78.7
2,<redacted-id>,C78.7
3,<redacted-id>,C78.7
4,<redacted-id>,C78.7
...,...,...
195,<redacted-id>,C24.9
196,<redacted-id>,C24.9
197,<redacted-id>,C24.9
198,<redacted-id>,C24.9
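
To make the `df_constraints` argument more concrete, here is a minimal sketch of the kind of search string that each CSV row presumably corresponds to. It does not use fhir_pyrate and is not meant to mirror the library's internals: `build_condition_query` is a hypothetical helper, and the only assumption is the standard FHIR token search syntax `system|code`, where the `|` is URL-encoded as `%7C` (which is why the code system string in the cell above ends with `%7C`).

```python
from urllib.parse import urlencode

import pandas as pd

# Toy input that mirrors the content of example.csv
condition_df = pd.DataFrame({"icd_10": ["C78.7", "C24.9"]})

# Code system taken from the cell above, without the trailing encoded pipe
SYSTEM = "http://fhir.de/CodeSystem/dimdi/icd-10-gm"


def build_condition_query(icd_code: str) -> str:
    """Hypothetical helper: build the Condition search string for one row."""
    params = {
        "code": f"{SYSTEM}|{icd_code}",  # FHIR token search: system|code
        "_sort": "_id",
        "_count": 100,
    }
    # urlencode percent-encodes the pipe as %7C
    return "Condition?" + urlencode(params)


queries = [build_condition_query(code) for code in condition_df["icd_10"]]
for query in queries:
    print(query)
```

One query is built per row, which matches the `2/2` shown in the progress bar above.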


Now, we take the patients that we have found through Condition, and we make sure that there are no
duplicates.

In [3]:
patient_df = pd.DataFrame(
    condition_patient_df["patient_id"].drop_duplicates(keep="first")
)
len(patient_df)

157
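
The de-duplication step can be checked on a small toy DataFrame. This is only an illustration with made-up patient IDs: a patient with several matching Conditions appears once in the result, and the original row index of the first occurrence is kept.

```python
import pandas as pd

# Made-up IDs: p1 and p2 each have two matching Conditions
toy_conditions = pd.DataFrame(
    {
        "patient_id": ["p1", "p2", "p1", "p3", "p2"],
        "icd_10": ["C78.7", "C78.7", "C24.9", "C24.9", "C24.9"],
    }
)

# Same pattern as in the cell above: keep the first occurrence of each patient
unique_patients = pd.DataFrame(
    toy_conditions["patient_id"].drop_duplicates(keep="first")
)
print(unique_patients)
```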

We then take the patients and look for their ImagingStudies, where we also decide which
fields we want to store.


In [4]:
imaging_df = search.trade_rows_for_dataframe(  # Runs the patient queries in parallel with inputs
    # from the given DataFrame
    df=patient_df.head(
        2
    ),  # We only download for 2 patients to keep the DataFrame small for the visualization
    resource_type="ImagingStudy",
    request_params={
        "_sort": "_id",
        "modality": "CT",
        "_count": 100,
    },
    df_constraints={
        "subject": "patient_id",
    },
    with_ref=True,
    fhir_paths=[
        "started",
        ("modality", "modality.code"),
        ("procedureCode", "procedureCode.coding.code"),
        ("study_instance_uid", "identifier.where(system = 'urn:dicom:uid').value.replace('urn:oid:', '')"),
        ("series_instance_uid", "series.uid"),
        ("series_code", "series.modality.code"),
        ("numberOfInstances", "series.numberOfInstances"),
    ],
)
# The series are currently stored in a list
imaging_df = imaging_df.explode(
    [
        "series_instance_uid",
        "series_code",
        "numberOfInstances",
    ]
)
import numpy as np
import pydicom

redacted_df = imaging_df.copy()
redacted_df["patient_id"] = "<redacted-id>"
redacted_df["started"] = np.random.permutation(redacted_df["started"].values)
redacted_df["study_instance_uid"] = pydicom.uid.generate_uid(entropy_srcs=["salt"])
redacted_df["series_instance_uid"] = pydicom.uid.generate_uid(entropy_srcs=["pepper"])
redacted_df

Query & Build DF (ImagingStudy): 100%|██████████| 2/2 [00:00<00:00,  6.08it/s]


Unnamed: 0,started,modality,procedureCode,study_instance_uid,series_instance_uid,series_code,numberOfInstances,patient_id
0,2018-01-31T11:13:26.000+01:00,CT,CTT,1.2.826.0.1.3680043.8.498.24222694654806877939...,1.2.826.0.1.3680043.8.498.33463995182843850024...,PR,2,<redacted-id>
0,2018-04-13T10:25:37.000+02:00,CT,CTT,1.2.826.0.1.3680043.8.498.24222694654806877939...,1.2.826.0.1.3680043.8.498.33463995182843850024...,PR,2,<redacted-id>
0,2019-07-26T10:42:38.000+02:00,CT,CTT,1.2.826.0.1.3680043.8.498.24222694654806877939...,1.2.826.0.1.3680043.8.498.33463995182843850024...,PR,2,<redacted-id>
0,2018-09-19T10:35:20.000+02:00,CT,CTT,1.2.826.0.1.3680043.8.498.24222694654806877939...,1.2.826.0.1.3680043.8.498.33463995182843850024...,PR,2,<redacted-id>
0,2016-02-24T10:08:50.000+01:00,CT,CTT,1.2.826.0.1.3680043.8.498.24222694654806877939...,1.2.826.0.1.3680043.8.498.33463995182843850024...,CT,2,<redacted-id>
...,...,...,...,...,...,...,...,...
55,2018-04-13T10:25:37.000+02:00,PT,PCTGK,1.2.826.0.1.3680043.8.498.24222694654806877939...,1.2.826.0.1.3680043.8.498.33463995182843850024...,CT,199,<redacted-id>
55,2013-12-12T09:23:36.000+01:00,PT,PCTGK,1.2.826.0.1.3680043.8.498.24222694654806877939...,1.2.826.0.1.3680043.8.498.33463995182843850024...,CT,1,<redacted-id>
55,2012-08-01T12:19:14.000+02:00,PT,PCTGK,1.2.826.0.1.3680043.8.498.24222694654806877939...,1.2.826.0.1.3680043.8.498.33463995182843850024...,PT,3,<redacted-id>
55,2020-01-20T10:27:57.000+01:00,PT,PCTGK,1.2.826.0.1.3680043.8.498.24222694654806877939...,1.2.826.0.1.3680043.8.498.33463995182843850024...,PT,244,<redacted-id>
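
The repeated index values in the table above (`0, 0, 0, ...`) come from the `explode` call: each study row holds lists of series values, and exploding several columns at once turns it into one row per series while keeping the study's index. A minimal sketch with made-up UIDs, assuming pandas 1.3 or newer (needed for multi-column `explode`) and lists of equal length within each row:

```python
import pandas as pd

# Made-up studies: the first has two series, the second has one
toy_studies = pd.DataFrame(
    {
        "study_instance_uid": ["1.2.3", "4.5.6"],
        "series_instance_uid": [["1.2.3.1", "1.2.3.2"], ["4.5.6.1"]],
        "series_code": [["PR", "CT"], ["CT"]],
        "numberOfInstances": [[2, 199], [1]],
    }
)

# One row per series; study-level columns are repeated for each series
per_series = toy_studies.explode(
    ["series_instance_uid", "series_code", "numberOfInstances"]
)
print(per_series)
```

Because a study can contain series of different modalities, the series-level `series_code` column is the one to filter on afterwards.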


Filter the series by CT.

In [5]:
filtered = imaging_df.loc[imaging_df["series_code"] == "CT"]

Download the studies using the DicomDownloader, which needs a specific URL that points to the
DicomWebAdapter and specifies the PACS that we want to use.

In [6]:
downloader = DicomDownloader(
    auth=auth, dicom_web_url=DICOM_WEB_URL, output_format="nifti"
)

successful_df, error_df = downloader.download_data_from_dataframe(
    filtered.head(1),  # Download only the first row, just for testing
    output_dir="out",
    study_uid_col="study_instance_uid",
    series_uid_col="series_instance_uid",
    download_full_study=False,
)
error_df

Downloading Rows:   0%|          | 0/1 [00:00<?, ?it/s]
Downloading Instance: 0it [00:00, ?it/s]
Downloading Instance: 1it [00:00,  1.09it/s]
Downloading Instance: 2it [00:01,  2.00it/s]
Downloading Rows: 100%|██████████| 1/1 [00:01<00:00,  1.19s/it]


The error DataFrame is empty, which means that no errors have occurred. We can then check the
successful downloads to find out in which folder the series was stored.

In [7]:
# Dummy values
import hashlib

redacted_df = successful_df.copy()
redacted_df["study_instance_uid"] = pydicom.uid.generate_uid(entropy_srcs=["salt"])
redacted_df["series_instance_uid"] = pydicom.uid.generate_uid(entropy_srcs=["pepper"])
redacted_df["download_id"] = [
    hashlib.sha256(
        (
            row.study_instance_uid + "_" + row.series_instance_uid
        ).encode()
    ).hexdigest()
    for row in redacted_df.itertuples(index=False)
]
redacted_df

Unnamed: 0,study_instance_uid,series_instance_uid,deidentified_study_instance_uid,deidentified_series_instance_uid,download_id,download_path
0,1.2.826.0.1.3680043.8.498.24222694654806877939...,1.2.826.0.1.3680043.8.498.33463995182843850024...,2.25.236507833011739959424223550367290629720,2.25.201217751562255229154918195968192653161,65425e1330b0895737d07d0fa29cd2614ff5dd9c7b566c...,out/6c249114e1ebfc8c38b812efdf3ba859a1f57d56bb...


And here we have it: this is our mapping DataFrame, which references the folder where the files
were downloaded, their de-identified IDs, and their study and series instance UIDs.

In our system, we have a specific URL that can be used to obtain de-identified studies. The
`deidentified_study_instance_uid` and `deidentified_series_instance_uid` are the IDs that can be
found in the newly downloaded DICOM files. If your system does not de-identify the studies, these
two IDs will simply be the original IDs of the DICOM file.
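
Since the mapping DataFrame keeps the original study and series instance UIDs, it can be joined back to the imaging metadata to see which patient and acquisition date each downloaded folder belongs to. The following is a minimal sketch with made-up values; the column names follow the cells above, while the toy DataFrames and the choice of a left merge are assumptions for illustration.

```python
import pandas as pd

# Made-up stand-in for the filtered imaging metadata
toy_imaging = pd.DataFrame(
    {
        "study_instance_uid": ["1.2.3", "1.2.3"],
        "series_instance_uid": ["1.2.3.1", "1.2.3.2"],
        "started": ["2018-01-31", "2018-01-31"],
        "patient_id": ["p1", "p1"],
    }
)

# Made-up stand-in for the DataFrame of successful downloads
toy_downloads = pd.DataFrame(
    {
        "study_instance_uid": ["1.2.3"],
        "series_instance_uid": ["1.2.3.2"],
        "download_path": ["out/abc"],
    }
)

# Attach the metadata to each downloaded series; validate guards against
# duplicated (study, series) keys on either side
merged = toy_downloads.merge(
    toy_imaging,
    on=["study_instance_uid", "series_instance_uid"],
    how="left",
    validate="one_to_one",
)
print(merged[["patient_id", "started", "download_path"]])
```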