**Title**: Upload the MedNIST dataset.   
**Date**:  05-May-2021  
**Description**:
This notebook downloads the MedNIST dataset and uploads it into a project on a Flywheel instance.
   
The MedNIST dataset was gathered from several sets from TCIA, the RSNA Bone Age Challenge, and the NIH Chest X-ray dataset.

The dataset is kindly made available by [Dr. Bradley J. Erickson M.D., Ph.D.](https://www.mayo.edu/research/labs/radiology-informatics/overview) (Department of Radiology, Mayo Clinic) under the [Creative Commons CC BY-SA 4.0 license](https://creativecommons.org/licenses/by-sa/4.0/).

If you use the MedNIST dataset, please acknowledge the source.

# Data Use Agreement
Before downloading this data, or any data, make sure you understand the restrictions on the use of data. 

# Requirements:
- **Python** (preferably >= 3.6)
- Administrator permissions to create Flywheel Groups and Projects

In [3]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

# Install and import dependencies

In [4]:
!pip install pandas pydicom flywheel-sdk monai tqdm



In [5]:
import copy
import csv
import datetime
import logging
import os
import time
from getpass import getpass
from pathlib import Path

import flywheel
import pandas as pd
from monai.apps import download_and_extract

In [6]:
# Instantiate a logger
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')
log = logging.getLogger('root')

# Download dataset
The following downloads the MedNIST archive, verifies its MD5 checksum, and extracts it to the local drive.

In [8]:
ROOT_DIR = Path('/tmp/data')
# exist_ok=True makes this a no-op when the directory is already there
ROOT_DIR.mkdir(parents=True, exist_ok=True)

In [9]:
resource = "https://www.dropbox.com/s/5wwskxctvcxiuea/MedNIST.tar.gz?dl=1"
md5 = "0bc7306e7427e00ad1c5526a6677552d"

compressed_file = os.path.join(ROOT_DIR, "MedNIST.tar.gz")
data_dir = os.path.join(ROOT_DIR, "MedNIST")
if not os.path.exists(data_dir):
    download_and_extract(resource, compressed_file, ROOT_DIR, md5)

MedNIST.tar.gz: 59.0MB [00:23, 2.66MB/s]                              



downloaded file: /tmp/data/MedNIST.tar.gz.
Verified 'MedNIST.tar.gz', md5: 0bc7306e7427e00ad1c5526a6677552d.
Verified 'MedNIST.tar.gz', md5: 0bc7306e7427e00ad1c5526a6677552d.
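
The "Verified" lines in the output report that the archive's MD5 digest matches the expected value. The same check can be reproduced by hand, for instance to re-validate an archive downloaded earlier. This is a minimal sketch using only the standard library; the helper names `md5_of_file` and `matches_md5` are hypothetical and this is not MONAI's implementation.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives do not have to fit in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected):
    """True when the file exists and its MD5 digest equals `expected`."""
    path = Path(path)
    return path.is_file() and md5_of_file(path) == expected.lower()
```

With the variables defined above, `matches_md5(compressed_file, md5)` should return `True` for an intact download.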


# Initialize Constants

In [23]:
GROUP_ID = "ml"
GROUP_Label = "ML"

In [24]:
PROJECT_LABEL = "MedNIST"

In [35]:
DESCRIPTION = """
The MedNIST dataset was gathered from several sets from TCIA, the RSNA Bone Age Challenge, and the NIH Chest X-ray dataset.

The dataset is kindly made available by [Dr. Bradley J. Erickson M.D., Ph.D.](https://www.mayo.edu/research/labs/radiology-informatics/overview) (Department of Radiology, Mayo Clinic) under the [Creative Commons CC BY-SA 4.0 license](https://creativecommons.org/licenses/by-sa/4.0/).

If you use the MedNIST dataset, please acknowledge the source.
"""

# Flywheel API Key and Client
You need a Flywheel API key to continue. The Flywheel SDK documentation explains how to obtain one [here](https://flywheel-io.gitlab.io/product/backend/sdk/branches/master/python/getting_started.html#api-key).

In [38]:
API_KEY = getpass('Enter API_KEY here: ')

Enter API_KEY here:  ·········································


Instantiate the Flywheel API client

In [39]:
fw_client = flywheel.Client(API_KEY if 'API_KEY' in locals() else os.environ.get('FW_KEY'))
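
The cell above falls back to the `FW_KEY` environment variable when `API_KEY` is not defined. That fallback order can be made explicit in a small helper. This is a sketch only; `resolve_api_key` is a hypothetical name and not part of the Flywheel SDK.

```python
import os
from getpass import getpass


def resolve_api_key(explicit=None, env_var="FW_KEY", prompt=getpass):
    """Return the first available key: explicit value, environment, then prompt."""
    if explicit:
        return explicit
    from_env = os.environ.get(env_var)
    if from_env:
        return from_env
    # Last resort: ask interactively without echoing the key
    return prompt("Enter API_KEY here: ")
```

The client could then be created with `flywheel.Client(resolve_api_key())`, which avoids inspecting `locals()`.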

Log the current user and the Flywheel site to confirm the login

In [40]:
log.info('You are now logged in as %s to %s', fw_client.get_current_user()['email'], fw_client.get_config()['site']['api_url'])

2021-05-15 15:27:20,600 INFO You are now logged in as nicolaspannetier@flywheel.io to https://rollout.ce.flywheel.io/api


# Container helpers
Import container helper functions to find existing or create new containers.

In [41]:
from container_helpers import (
    find_or_create_group, 
    find_or_create_project, 
    find_or_create_subject, 
    find_or_create_session, 
    find_or_create_acquisition,
    upload_file_to_acquisition
)
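
The `container_helpers` module is not shown in this notebook. As an illustration of the find-or-create pattern its functions are named after, here is a generic sketch. It assumes a finder object exposing `find_first(filter_string)` and a `label=<value>` filter syntax; the function name `find_or_create` is hypothetical and the real helpers may be implemented differently.

```python
def find_or_create(finder, label, create):
    """Return the first container whose label matches, creating it if absent.

    `finder` is any object exposing find_first(filter_string); `create` is a
    callable that takes the label and returns the new container.
    """
    existing = finder.find_first(f"label={label}")
    if existing is not None:
        return existing
    # Nothing matched, so delegate creation to the caller-supplied function
    return create(label)
```

Because lookup happens before creation, calling the helper twice with the same label returns the same container instead of creating a duplicate.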


# Find or Create Group and Project:
Create a group with the id and label set in `GROUP_ID` and `GROUP_Label` ("ml" and "ML" here).

Create a project with the label set in `PROJECT_LABEL` ("MedNIST" here).

Replace these with the id and labels of the group and project you want to create.

If the group and project already exist, the existing ones are returned instead of new ones being created.

In [42]:
# Initialize the group
public_data_group = find_or_create_group(fw_client, GROUP_ID, GROUP_Label)

# Initialize the project
project = find_or_create_project(PROJECT_LABEL, public_data_group)

if project:
    project.update(description=DESCRIPTION)

2021-05-15 15:27:23,311 INFO Group with label "ML" not found, creating.
2021-05-15 15:27:23,817 INFO Project with label "MedNIST" not found, creating.


# Inspect data structure

In [52]:
!tree -L 2 {ROOT_DIR}

/tmp/data
├── MedNIST
│   ├── AbdomenCT
│   ├── BreastMRI
│   ├── CXR
│   ├── ChestCT
│   ├── Hand
│   ├── HeadCT
│   └── README.md
└── MedNIST.tar.gz

7 directories, 2 files


Each folder contains JPG images. 

In [68]:
# Each body-part class is a direct subdirectory of MedNIST
bodyparts = [p for p in (ROOT_DIR / "MedNIST").iterdir() if p.is_dir()]
for x in bodyparts:
    print(f"{x}: {len(list(x.rglob('*')))}")

/tmp/data/MedNIST/Hand: 10000
/tmp/data/MedNIST/BreastMRI: 8954
/tmp/data/MedNIST/ChestCT: 10000
/tmp/data/MedNIST/HeadCT: 10000
/tmp/data/MedNIST/AbdomenCT: 10000
/tmp/data/MedNIST/CXR: 10000


# Upload

The following Flywheel hierarchy will be used:  
* subject.label = bodypart
* session.label = bodypart
* acquisition.label = filename
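
The mapping above can be written as a small pure function, which makes it easy to check labels before anything is uploaded. This is a sketch; `labels_for` is a hypothetical helper and the example file name is made up for illustration.

```python
from pathlib import Path


def labels_for(image_path):
    """Map MedNIST/<bodypart>/<name>.<ext> to Flywheel container labels."""
    image_path = Path(image_path)
    bodypart = image_path.parent.name
    return {
        "subject": bodypart,
        "session": bodypart,
        # File name up to the first dot, matching the upload loop
        "acquisition": image_path.name.split(".")[0],
    }


# Hypothetical file name, used only to show the mapping
print(labels_for("/tmp/data/MedNIST/Hand/000123.jpeg"))
# → {'subject': 'Hand', 'session': 'Hand', 'acquisition': '000123'}
```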

In [None]:
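from tqdm import tqdm

for part in bodyparts:
    log.info(f"Uploading {part}")
    # Subject and session share the body-part label
    subject_label = part.parts[-1]
    session_label = part.parts[-1]
    subject = find_or_create_subject(subject_label, project)
    session = find_or_create_session(session_label, subject)
    # Keep regular files only
    files = [f for f in part.rglob('*') if f.is_file()]
    for file_ in tqdm(files):
        acq_label = file_.parts[-1].split('.')[0]
        acq = find_or_create_acquisition(acq_label, session)
        # upload_file attaches the image; update_file only edits metadata of an existing file
        acq.upload_file(str(file_))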