Let us assume that we are given a CSV or Excel file containing some information that we want to
query on our FHIR server. The `example.csv` file contains a list of ICD-10 codes, but it could
just as well be a list of patient IDs or anything else.

The process:
 1. Find all patients with these conditions.
 2. Collect all of their CTs.
 3. Download the CTs.

We have created this example for our own FHIR server, where we can also download studies
using the Dicom Web Adapter. For this purpose we use the [DicomWebClient](https://dicomweb-client.readthedocs.io/en/latest/usage.html).
I am not sure how this works at other university hospitals, so if
you have any particular requests for this, please do not hesitate to ask.


In [1]:
import pandas as pd

condition_df = pd.read_csv("example.csv")
condition_df

Unnamed: 0,icd_10
0,C78.7
1,C24.9
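
Codes such as `C78.7` look partly numeric, so it can be safer to tell pandas explicitly that the column holds strings. The following is a minimal sketch, assuming a CSV with the same layout as the output above; the content is inlined through `io.StringIO` so that the snippet runs without the file, and the name `example_df` is only illustrative.

```python
import io

import pandas as pd

# Inline stand-in for example.csv (assumption: same layout as the output above)
csv_text = ",icd_10\n0,C78.7\n1,C24.9\n"

# dtype=str keeps the codes as strings instead of letting pandas infer a type
example_df = pd.read_csv(
    io.StringIO(csv_text), index_col=0, dtype={"icd_10": str}
)
print(example_df)
```

With a real file, the same `dtype` argument can be passed to `pd.read_csv("example.csv", ...)`.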


We now run our query and retrieve all the Conditions that have one of these ICD-10 codes.

In [3]:
from fhir_pyrate import Ahoy, Pirate, DicomDownloader
import os
from dotenv import load_dotenv

load_dotenv()
# I have stored these variables in a .env file
BASIC_AUTH = os.environ["BASIC_AUTH"]
REFRESH_AUTH = os.environ["REFRESH_AUTH"]
SEARCH_URL = os.environ["SEARCH_URL"]
DICOM_WEB_URL = os.environ["DICOM_WEB_URL"]

# Init authentication
auth = Ahoy(
    auth_type="token",
    auth_method="env",
    auth_url=BASIC_AUTH,
    refresh_url=REFRESH_AUTH,
)
search = Pirate(
    auth=auth, base_url=SEARCH_URL, print_request_url=False, num_processes=1
)
# For each row of condition_df, query the Conditions that match its ICD-10 code
condition_patient_df = search.trade_rows_for_dataframe(  # One query per row of the DataFrame
    df=condition_df,
    resource_type="Condition",
    request_params={
        "_sort": "_id",
        "_count": 100,
    },
    df_constraints={"code": ("http://fhir.de/CodeSystem/dimdi/icd-10-gm%7C", "icd_10")},
    fhir_paths=["subject.reference"],
    stop_after_first_page=True,  # Only collect the results from the first page,
    # whose size is determined by the value set for `_count`
)
redacted_df = condition_patient_df.copy()
redacted_df["subject.reference"] = "<redacted-id>"
redacted_df

Query & Build DF: 100%|██████████| 2/2 [00:01<00:00,  1.81it/s]


Unnamed: 0,subject.reference,icd_10
0,<redacted-id>,C78.7
1,<redacted-id>,C78.7
2,<redacted-id>,C78.7
3,<redacted-id>,C78.7
4,<redacted-id>,C78.7
...,...,...
195,<redacted-id>,C24.9
196,<redacted-id>,C24.9
197,<redacted-id>,C24.9
198,<redacted-id>,C24.9
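
To make the `df_constraints` entry more concrete: FHIR token searches use the form `system|code`, and `%7C` is the percent-encoded `|`. The sketch below shows what the query string for a single CSV row could look like. The helper `build_condition_query` is hypothetical, and how the library assembles the request internally is an assumption; only the `system|code` syntax comes from the FHIR search rules.

```python
from urllib.parse import parse_qs, urlencode

ICD_SYSTEM = "http://fhir.de/CodeSystem/dimdi/icd-10-gm"


def build_condition_query(icd_code: str, count: int = 100) -> str:
    """Hypothetical helper: build the query string for one CSV row."""
    params = {
        "code": f"{ICD_SYSTEM}|{icd_code}",  # FHIR token search: system|code
        "_sort": "_id",
        "_count": count,
    }
    # urlencode percent-encodes the "|" separator as %7C
    return urlencode(params)


query = build_condition_query("C78.7")
print(query)
# Decoding the query string gives back the readable system|code token
print(parse_qs(query)["code"][0])
```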


Now we take the patients that we have found through their Conditions and make sure that there
are no duplicates.

In [4]:
patient_df = pd.DataFrame(
    condition_patient_df["subject.reference"].drop_duplicates(keep="first")
)
len(patient_df)

158
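
A patient can have several matching Conditions, which is why there are fewer patients than Condition rows. Here is a small sketch of the same deduplication on toy data; the `Patient/...` references and the `toy_*` names are made up for illustration.

```python
import pandas as pd

# Toy stand-in for condition_patient_df (assumption: made-up references)
toy_conditions = pd.DataFrame(
    {
        "subject.reference": ["Patient/a", "Patient/b", "Patient/a", "Patient/c"],
        "icd_10": ["C78.7", "C78.7", "C24.9", "C24.9"],
    }
)

# Keep the first occurrence of every patient and drop the repeats
toy_patients = (
    toy_conditions[["subject.reference"]]
    .drop_duplicates(keep="first")
    .reset_index(drop=True)
)
print(len(toy_patients))
```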

We then take the patients and look for their ImagingStudies, where we also decide which
attributes we would like to store.

In [5]:
imaging_df = search.trade_rows_for_dataframe(  # Runs the patient queries in parallel with inputs
    # from the given DataFrame
    df=patient_df.head(
        2
    ),  # We only download for 2 patients to keep the DataFrame small for the visualization
    resource_type="ImagingStudy",
    request_params={
        "_sort": "_id",
        "modality": "CT",
        "_count": 100,
    },
    df_constraints={
        "subject": "subject.reference",
    },
    fhir_paths=[
        "started",
        "modality.code",
        "procedureCode.coding.code",
        "identifier[1].value",
        "series.uid",
        "series.modality.code",
        "series.numberOfInstances",
    ],
)
# Fix some problems:
# Rename columns
imaging_df.rename(
    {"identifier[1].value": "study_instance_uid", "series.uid": "series_instance_uid"},
    inplace=True,
    axis=1,
)
# 1. The StudyInstanceUIDs have a "urn:oid:" prefix before the actual number
imaging_df["study_instance_uid"] = imaging_df["study_instance_uid"].str.replace(
    "urn:oid:", ""
)
# 2. The series are currently stored in a list
imaging_df = imaging_df.explode(
    [
        "series_instance_uid",
        "series.modality.code",
        "series.numberOfInstances",
    ]
)
# This would also be a case where it makes sense to use a processing function, so that you have
# more control over the attributes and the number of elements in a column
# Do some postprocessing to not show real IDs
import numpy as np
import pydicom

redacted_df = imaging_df.copy()
redacted_df["subject.reference"] = "<redacted-id>"
redacted_df["started"] = np.random.permutation(redacted_df["started"].values)
redacted_df["study_instance_uid"] = pydicom.uid.generate_uid(entropy_srcs=["salt"])
redacted_df["series_instance_uid"] = pydicom.uid.generate_uid(entropy_srcs=["pepper"])
redacted_df

Query & Build DF: 100%|██████████| 2/2 [00:00<00:00,  4.49it/s]


Unnamed: 0,started,modality.code,procedureCode.coding.code,study_instance_uid,series_instance_uid,series.modality.code,series.numberOfInstances,subject.reference
0,2016-02-24T10:08:50.000+01:00,CT,CTT,1.2.826.0.1.3680043.8.498.24222694654806877939...,1.2.826.0.1.3680043.8.498.33463995182843850024...,PR,2,<redacted-id>
0,2017-05-12T11:39:07.000+02:00,CT,CTT,1.2.826.0.1.3680043.8.498.24222694654806877939...,1.2.826.0.1.3680043.8.498.33463995182843850024...,PR,2,<redacted-id>
0,2018-06-02T13:05:01.000+02:00,CT,CTT,1.2.826.0.1.3680043.8.498.24222694654806877939...,1.2.826.0.1.3680043.8.498.33463995182843850024...,PR,2,<redacted-id>
0,2018-06-02T19:09:43.000+02:00,CT,CTT,1.2.826.0.1.3680043.8.498.24222694654806877939...,1.2.826.0.1.3680043.8.498.33463995182843850024...,PR,2,<redacted-id>
0,2018-04-13T10:25:37.000+02:00,CT,CTT,1.2.826.0.1.3680043.8.498.24222694654806877939...,1.2.826.0.1.3680043.8.498.33463995182843850024...,CT,2,<redacted-id>
...,...,...,...,...,...,...,...,...
55,2013-10-02T09:27:39.000+02:00,PT,PCTGK,1.2.826.0.1.3680043.8.498.24222694654806877939...,1.2.826.0.1.3680043.8.498.33463995182843850024...,CT,199,<redacted-id>
55,2016-10-17T11:43:44.000+02:00,PT,PCTGK,1.2.826.0.1.3680043.8.498.24222694654806877939...,1.2.826.0.1.3680043.8.498.33463995182843850024...,CT,1,<redacted-id>
55,2018-04-13T10:25:37.000+02:00,PT,PCTGK,1.2.826.0.1.3680043.8.498.24222694654806877939...,1.2.826.0.1.3680043.8.498.33463995182843850024...,PT,3,<redacted-id>
55,2017-02-02T11:40:25.000+01:00,PT,PCTGK,1.2.826.0.1.3680043.8.498.24222694654806877939...,1.2.826.0.1.3680043.8.498.33463995182843850024...,PT,244,<redacted-id>
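
The two clean-up steps from the cell above can be tried out on toy data. This is only a sketch with made-up UIDs and values; it assumes pandas 1.3 or newer, which is needed to explode several columns at once. It also shows why the index values repeat in the output: every series keeps the index of the study it came from.

```python
import pandas as pd

# Toy stand-in for imaging_df: one row per study, series stored as lists
toy_studies = pd.DataFrame(
    {
        "study_instance_uid": ["urn:oid:1.2.3", "urn:oid:4.5.6"],
        "series_instance_uid": [["1.2.3.1", "1.2.3.2"], ["4.5.6.1"]],
        "series.modality.code": [["CT", "PR"], ["PT"]],
        "series.numberOfInstances": [[120, 2], [244]],
    }
)

# 1. Strip the "urn:oid:" prefix (regex=False treats it as a literal string)
toy_studies["study_instance_uid"] = toy_studies["study_instance_uid"].str.replace(
    "urn:oid:", "", regex=False
)

# 2. One row per series; the lists in each row must have the same length
toy_series = toy_studies.explode(
    ["series_instance_uid", "series.modality.code", "series.numberOfInstances"]
)
print(toy_series)
```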


Keep only the series whose modality is CT.

In [6]:
filtered = imaging_df.loc[imaging_df["series.modality.code"] == "CT"]

Download the studies using the DicomDownloader, which needs a specific URL that points to the
DicomWebAdapter and specifies the PACS that we want to use.

In [7]:
downloader = DicomDownloader(
    auth=auth, dicom_web_url=DICOM_WEB_URL, output_format="nifti"
)

successful_df, error_df = downloader.download_data_from_dataframe(
    filtered.tail(1),  # Download only the last element, just for testing
    output_dir="out",
    study_uid_col="study_instance_uid",
    series_uid_col="series_instance_uid",
    download_full_study=False,
)
error_df

                                   

The error DataFrame is empty, which means that no errors have occurred. We can then check the
successful downloads to find out in which folder the series was stored.

In [8]:
# Dummy values
import hashlib

redacted_df = successful_df.copy()
redacted_df["study_instance_uid"] = pydicom.uid.generate_uid(entropy_srcs=["salt"])
redacted_df["series_instance_uid"] = pydicom.uid.generate_uid(entropy_srcs=["pepper"])
redacted_df["download_id"] = [
    hashlib.sha256(
        (
            row.study_instance_uid + "_" + row.series_instance_uid
        ).encode()
    ).hexdigest()
    for row in redacted_df.itertuples(index=False)
]
redacted_df

Unnamed: 0,study_instance_uid,series_instance_uid,deidentified_study_instance_uid,deidentified_series_instance_uid,download_id
0,1.2.826.0.1.3680043.8.498.24222694654806877939...,1.2.826.0.1.3680043.8.498.33463995182843850024...,2.25.329899711891834784290727202110072100256,2.25.153039349108166964661050207024114219638,65425e1330b0895737d07d0fa29cd2614ff5dd9c7b566c...


And here we have it: this is our mapping DataFrame, which references the folder where the files
were downloaded, their deidentified IDs and their study and series instance UIDs.
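
To look up a series on disk later, the same hash that the cell above uses for the dummy values can be recomputed. This is a sketch under two assumptions that are not confirmed here: that the real `download_id` is built in the same way, and that each series is stored in a folder named after it inside the output directory. Both helper functions and the UIDs are hypothetical.

```python
import hashlib
from pathlib import Path


def make_download_id(study_uid: str, series_uid: str) -> str:
    """Hypothetical helper: SHA-256 of '<study_uid>_<series_uid>', as in the cell above."""
    return hashlib.sha256(f"{study_uid}_{series_uid}".encode()).hexdigest()


def expected_folder(output_dir: str, study_uid: str, series_uid: str) -> Path:
    """Assumption: every series is stored under <output_dir>/<download_id>."""
    return Path(output_dir) / make_download_id(study_uid, series_uid)


# Placeholder UIDs, not real identifiers
folder = expected_folder("out", "1.2.3", "1.2.3.4")
print(folder)
```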