<a href="https://colab.research.google.com/github/denbonte/cloudyday/blob/main/notebooks/download_benchmark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Environment Setup

In [None]:
from google.colab import auth
auth.authenticate_user()

In [None]:
import os

import time
import pandas as pd

Note: to run the benchmarks, replace the value below with your own Google Cloud project ID.

In [None]:
my_ProjectID="idc-sandbox-000"

Download and install `s5cmd` v2.0.0 from its GitHub releases page.

In [None]:
%%capture

!wget https://github.com/peak/s5cmd/releases/download/v2.0.0/s5cmd_2.0.0_Linux-64bit.tar.gz
!mkdir -p s5cmd && tar zxf s5cmd_2.0.0_Linux-64bit.tar.gz -C s5cmd
!cp s5cmd/s5cmd /usr/bin && rm s5cmd_2.0.0_Linux-64bit.tar.gz

For benchmarking purposes, we prepare a manifest for downloading DICOM data from the Imaging Data Commons (IDC) Google Cloud Storage buckets. As an example, we pull CT data from the `nsclc_radiomics` collection.

In [None]:
%%bigquery cohort_df --project=$my_ProjectID

SELECT
  PatientID,
  StudyInstanceUID,
  SeriesInstanceUID,
  SOPInstanceUID,
  gcs_url
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  collection_id = "nsclc_radiomics"
  AND Modality = "CT"
ORDER BY
  PatientID

In [None]:
display(cohort_df.info())
display(cohort_df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51230 entries, 0 to 51229
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   PatientID          51230 non-null  object
 1   StudyInstanceUID   51230 non-null  object
 2   SeriesInstanceUID  51230 non-null  object
 3   SOPInstanceUID     51230 non-null  object
 4   gcs_url            51230 non-null  object
dtypes: object(5)
memory usage: 2.0+ MB


None

Unnamed: 0,PatientID,StudyInstanceUID,SeriesInstanceUID,SOPInstanceUID,gcs_url
0,LUNG1-001,1.3.6.1.4.1.32722.99.99.2393413539117143687725...,1.3.6.1.4.1.32722.99.99.2989917765213423750108...,1.3.6.1.4.1.32722.99.99.3078801584366390810576...,gs://idc-open-cr/ed62c42c-c261-44c4-a4a5-0bc77...
1,LUNG1-001,1.3.6.1.4.1.32722.99.99.2393413539117143687725...,1.3.6.1.4.1.32722.99.99.2989917765213423750108...,1.3.6.1.4.1.32722.99.99.3107100702111033256711...,gs://idc-open-cr/23b87c17-76eb-405d-a033-b8f55...
2,LUNG1-001,1.3.6.1.4.1.32722.99.99.2393413539117143687725...,1.3.6.1.4.1.32722.99.99.2989917765213423750108...,1.3.6.1.4.1.32722.99.99.1064644568755722921755...,gs://idc-open-cr/fdbe15bb-a030-4a8d-b041-b4a73...
3,LUNG1-001,1.3.6.1.4.1.32722.99.99.2393413539117143687725...,1.3.6.1.4.1.32722.99.99.2989917765213423750108...,1.3.6.1.4.1.32722.99.99.1378828584856047266870...,gs://idc-open-cr/7193a2f1-781a-4017-b92f-56f28...
4,LUNG1-001,1.3.6.1.4.1.32722.99.99.2393413539117143687725...,1.3.6.1.4.1.32722.99.99.2989917765213423750108...,1.3.6.1.4.1.32722.99.99.6917641808288785879158...,gs://idc-open-cr/ed7c0188-93ed-480b-a8b0-31887...


Because `gcloud storage cp` and `gsutil cp` can be very slow (at certain times, more than 200 s for 1000 files), we limit the number of `.dcm` files to download to `n_files`.

In [None]:
n_files = 1000

download_df = cohort_df.sample(n=n_files)
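
Because `sample` draws a different subset on every call, timings taken in separate sessions are not measured on the same files. The following is a minimal sketch, assuming repeatable runs are wanted: passing `random_state` makes the draw deterministic. The seed value `0` and the toy DataFrame (a stand-in for `cohort_df`, with placeholder URLs) are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for cohort_df; the URLs are placeholders, not real IDC objects.
toy_df = pd.DataFrame(
    {"gcs_url": [f"gs://example-bucket/file_{i}.dcm" for i in range(100)]}
)

n_files = 10

# With a fixed seed, repeated calls return the same rows in the same order.
sample_a = toy_df.sample(n=n_files, random_state=0)
sample_b = toy_df.sample(n=n_files, random_state=0)

print(sample_a.equals(sample_b))
```

In the notebook itself, the equivalent change would be adding `random_state=<seed>` to the existing `cohort_df.sample(...)` call.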

In [None]:
display(download_df.info())
display(download_df.head())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 43878 to 39236
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   PatientID          1000 non-null   object
 1   StudyInstanceUID   1000 non-null   object
 2   SeriesInstanceUID  1000 non-null   object
 3   SOPInstanceUID     1000 non-null   object
 4   gcs_url            1000 non-null   object
dtypes: object(5)
memory usage: 46.9+ KB


None

Unnamed: 0,PatientID,StudyInstanceUID,SeriesInstanceUID,SOPInstanceUID,gcs_url
43878,LUNG1-365,1.3.6.1.4.1.32722.99.99.1914110964821482780883...,1.3.6.1.4.1.32722.99.99.4596440252475098196896...,1.3.6.1.4.1.32722.99.99.1683121469980282535270...,gs://idc-open-cr/d1a5fdd3-b378-4849-af04-d73ff...
3001,LUNG1-027,1.3.6.1.4.1.32722.99.99.7675440033341345479533...,1.3.6.1.4.1.32722.99.99.2305160426467195519568...,1.3.6.1.4.1.32722.99.99.1032602163965400546395...,gs://idc-open-cr/c4a15b29-13b8-4534-9bb5-32ec5...
2982,LUNG1-027,1.3.6.1.4.1.32722.99.99.7675440033341345479533...,1.3.6.1.4.1.32722.99.99.2305160426467195519568...,1.3.6.1.4.1.32722.99.99.1721689183162355285739...,gs://idc-open-cr/8ec6ebff-c497-44bb-af48-e6776...
39844,LUNG1-332,1.3.6.1.4.1.32722.99.99.1302349440240646975378...,1.3.6.1.4.1.32722.99.99.3190617744073962718010...,1.3.6.1.4.1.32722.99.99.1296061863086161390647...,gs://idc-open-cr/9ab49004-da9f-46b1-9427-ac6fd...
941,LUNG1-009,1.3.6.1.4.1.32722.99.99.1737446948497249041452...,1.3.6.1.4.1.32722.99.99.1491965310436982884554...,1.3.6.1.4.1.32722.99.99.1395953512723386312280...,gs://idc-open-cr/677f1e95-a7f9-4965-8eb7-fc847...


# Generating Manifests

In [None]:
!mkdir -p data
!mkdir -p data/dicom_s5cmd data/dicom_gsutil data/dicom_gstorage

In [None]:
gs_file_path = "data/gcs_paths.txt"

download_df["gcs_url"].to_csv(gs_file_path, header = False, index = False)

In [None]:
download_path = "data/dicom_s5cmd/"

s5cmd_gs_file_path = "data/gcs_url_s5cmd.txt"

# s5cmd expects s3:// URLs; regex=False makes this a literal substring replacement.
gcsurl_temp = "cp " + download_df["gcs_url"].str.replace("gs://", "s3://", regex=False) + " " + download_path
gcsurl_temp.to_csv(s5cmd_gs_file_path, header=False, index=False)
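
To make the manifest format explicit, here is a small sketch of the per-line transformation in plain Python. The helper name `to_s5cmd_command` and the example URLs are hypothetical; only the `cp <source> <destination>` line shape and the `gs://` to `s3://` swap are taken from the cell above.

```python
def to_s5cmd_command(gcs_url: str, dest_dir: str) -> str:
    """Turn one gs:// URL into an s5cmd `cp` line (hypothetical helper)."""
    if not gcs_url.startswith("gs://"):
        raise ValueError(f"expected a gs:// URL, got {gcs_url!r}")
    # Only the scheme prefix is swapped; bucket and object name stay unchanged.
    s3_url = "s3://" + gcs_url[len("gs://"):]
    return f"cp {s3_url} {dest_dir}"


# Placeholder URLs for illustration; real ones come from the gcs_url column.
urls = ["gs://example-bucket/aaa.dcm", "gs://example-bucket/bbb.dcm"]
manifest_lines = [to_s5cmd_command(u, "data/dicom_s5cmd/") for u in urls]

print("\n".join(manifest_lines))
```

Each printed line is one command that `s5cmd run` executes, which is why the manifest holds full `cp` commands rather than bare URLs.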

# Benchmarking

In [None]:
elapsed = dict()

elapsed["gsutil"] = dict()
elapsed["gstorage"] = dict()
elapsed["s5cmd"] = dict()

## gsutil

In [None]:
%%capture

start = time.time()

!cat data/gcs_paths.txt | gsutil -m cp -Ir data/dicom_gsutil

end = time.time()

In [None]:
elapsed["gsutil"]["time"] = end - start
# Despite the key name, "n_subjects" counts downloaded .dcm files, not patients.
elapsed["gsutil"]["n_subjects"] = len([f for f in os.listdir("data/dicom_gsutil") if f.endswith(".dcm")])

## gcloud storage cp

In [None]:
%%capture

start = time.time()

!cat data/gcs_paths.txt | gcloud storage cp --read-paths-from-stdin data/dicom_gstorage

end = time.time()

In [None]:
elapsed["gstorage"]["time"] = end - start
elapsed["gstorage"]["n_subjects"] = len([f for f in os.listdir("data/dicom_gstorage") if f.endswith(".dcm")])

## s5cmd

In [None]:
%%capture

start = time.time()

!s5cmd --no-sign-request --endpoint-url https://storage.googleapis.com run data/gcs_url_s5cmd.txt

end = time.time()

In [None]:
elapsed["s5cmd"]["time"] = end - start
elapsed["s5cmd"]["n_subjects"] = len([f for f in os.listdir("data/dicom_s5cmd") if f.endswith(".dcm")])

# Results

Note: we found that the speed of `gsutil cp` and `gcloud storage cp` depends on the time of day, possibly because of network traffic.

For instance, it is not unusual for the `gsutil` and `gcloud storage` copy operations to take more than 200 seconds.
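
Given this variability, a single timing per tool can be misleading. The sketch below shows one possible way to repeat a command and summarize the spread; it is not what the notebook does. The helper `time_command` is hypothetical, and the command it times here is a harmless placeholder (an empty Python invocation) rather than an actual download.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, repeats=3):
    """Run `cmd` `repeats` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(
            cmd, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        durations.append(time.perf_counter() - start)
    return durations


# Placeholder command; for a real benchmark this would be the download command,
# with the target directory emptied between repeats.
durations = time_command([sys.executable, "-c", "pass"], repeats=3)

print(
    f"median: {statistics.median(durations):.3f}s, "
    f"min: {min(durations):.3f}s, max: {max(durations):.3f}s"
)
```

Reporting the median together with the minimum and maximum makes it easier to tell a slow tool from a slow time of day.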

In [None]:
elapsed_df = pd.DataFrame.from_dict(elapsed, orient="index")

elapsed_df

Unnamed: 0,time,n_subjects
gsutil,55.508122,1000
gstorage,58.137904,1000
s5cmd,3.70068,1000
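
To put these timings in perspective, the short calculation below converts them into approximate throughput and relative speed. It only uses the times from the table above and the 504M per-tool size reported by `du -h` further down; the size is treated as approximate MiB, which is an assumption about how `du` rounds.

```python
# Timings in seconds, copied from the results table of this run.
elapsed_s = {"gsutil": 55.508122, "gstorage": 58.137904, "s5cmd": 3.70068}

# Approximate size of the 1000 downloaded files, as reported by `du -h`.
size_mib = 504

throughput = {tool: size_mib / seconds for tool, seconds in elapsed_s.items()}
slowdown_vs_s5cmd = {
    tool: seconds / elapsed_s["s5cmd"] for tool, seconds in elapsed_s.items()
}

for tool in elapsed_s:
    print(
        f"{tool:>8}: {throughput[tool]:6.1f} MiB/s, "
        f"{slowdown_vs_s5cmd[tool]:4.1f}x the s5cmd time"
    )
```

On this particular run, `gsutil` and `gcloud storage` moved roughly 9 MiB/s and took about 15 to 16 times as long as `s5cmd`, which reached roughly 136 MiB/s.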


In [None]:
!date

Fri 24 Mar 2023 10:03:46 AM UTC


In [None]:
!du -h -d 1 data/

504M	data/dicom_s5cmd
504M	data/dicom_gsutil
504M	data/dicom_gstorage
1.5G	data/


In [22]:
!lscpu

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          2
On-line CPU(s) list:             0,1
Thread(s) per core:              2
Core(s) per socket:              1
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           79
Model name:                      Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping:                        0
CPU MHz:                         2199.998
BogoMIPS:                        4399.99
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       32 KiB
L1i cache:                       32 KiB
L2 cache:                        256 KiB
L3 cache:                        55 MiB
NUMA node0 CPU(s):               0,1
Vulnerability 

In [23]:
!free -m

              total        used        free      shared  buff/cache   available
Mem:          12985         757        8251           1        3976       11929
Swap:             0           0           0


In [24]:
!apt install speedtest-cli
!speedtest-cli

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  speedtest-cli
0 upgraded, 1 newly installed, 0 to remove and 23 not upgraded.
Need to get 24.0 kB of archives.
After this operation, 106 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal-updates/universe amd64 speedtest-cli all 2.1.2-2ubuntu0.20.04.1 [24.0 kB]
Fetched 24.0 kB in 0s (176 kB/s)
Selecting previously unselected package speedtest-cli.
(Reading database ... 128285 files and directories currently installed.)
Preparing to unpack .../speedtest-cli_2.1.2-2ubuntu0.20.04.1_all.deb ...
Unpacking speedtest-cli (2.1.2-2ubuntu0.20.04.1) ...
Setting up speedtest-cli (2.1.2-2ubuntu0.20.04.1) ...
Processing triggers for man-db (2.9.1-1) ...
Retrieving speedtest.net configuration...
Testing from Google Cloud (34.73.156.222)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted