**Title**: Upload Kaggle Chest X-Ray.   
**Date**:  12-Oct-2020     
**Description**:  
This notebook downloads the entire RSNA Pneumonia Detection Challenge Dataset (3.6 GB) and incorporates it into a Flywheel instance specified by the supplied API-Key.  A Data Use Agreement (DUA) is required to download this dataset.

Reference:
* https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/data  

# Data Use Agreement
Before downloading this data from Kaggle, you must agree to the rules of the competition:

* https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/rules

In [None]:
%reload_ext autoreload
%autoreload 2 
%matplotlib inline

# Requirements:
- **Python** 3.6 or later (the code below uses f-strings).

- Have admin permissions to create Flywheel Groups and Projects.

# Install and import dependencies

In [1]:
# getpass is part of the Python standard library and does not need to be installed
!pip install pandas pydicom flywheel-sdk tqdm kaggle

Collecting kaggle
  Downloading https://files.pythonhosted.org/packages/fc/14/9db40d8d6230655e76fa12166006f952da4697c003610022683c514cf15f/kaggle-1.5.8.tar.gz (59kB)
Collecting python-slugify (from kaggle)
  Downloading https://files.pythonhosted.org/packages/9f/42/e336f96a8b6007428df772d0d159b8eee9b2f1811593a4931150660402c0/python-slugify-4.0.1.tar.gz
Collecting slugify (from kaggle)
  Downloading https://files.pythonhosted.org/packages/7b/89/fbb7391d777b60c82d4e1376bb181b98e75adf506b3f7ffe837eca64570b/slugify-0.0.1.tar.gz
    ERROR: Command errored out with exit status 1:
     command: /Users/joshuajacobs/anaconda2/envs/Py36/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/2b/ltbqpj_123q_sj2dz39874s80000gq/T/pip-install-byfwrtmu/slugify/setup.py'"'"'; __file__='"'"'/private/var/folders/2b/ltbqpj_123q_sj2dz39874s80000gq/T/pip-install-byfwrtmu/slugify/setup.py'"'"';f=getattr(toke

In [6]:
import json
import logging
import os
import re
import time
import zipfile
from getpass import getpass
from pathlib import Path

import flywheel
import pandas as pd
import pydicom
from tqdm.notebook import tqdm

In [3]:
# Instantiate a logger
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')
log = logging.getLogger('root')

# Download kaggle dataset

This step requires your Kaggle credentials to be stored in `~/.kaggle/kaggle.json`. To obtain them, create a Kaggle account at kaggle.com and use "Create New API Token" on the user account page.

This dataset is currently 3.7 GB, and its size may change in the future. Depending on the bandwidth of your internet connection, the download may take some time.
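
A quick check that the token file is in place can save a failed download later. The sketch below assumes the token is a JSON object with `username` and `key` entries, which is the layout Kaggle's "Create New API Token" button produces; the helper name `kaggle_credentials_ok` is hypothetical and not part of the Kaggle package.

```python
import json
from pathlib import Path


def kaggle_credentials_ok(path=Path.home() / ".kaggle" / "kaggle.json"):
    """Return True if the token file exists and holds 'username' and 'key'.

    Hypothetical helper: it only inspects the file and never contacts Kaggle.
    """
    path = Path(path)
    if not path.is_file():
        return False
    try:
        token = json.loads(path.read_text())
    except (OSError, json.JSONDecodeError):
        return False
    # Both entries must be present and non-empty
    return isinstance(token, dict) and all(token.get(k) for k in ("username", "key"))
```

Calling `kaggle_credentials_ok()` with no argument checks the default location named above.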

In [None]:
!kaggle competitions download -c rsna-pneumonia-detection-challenge
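
The download arrives as a zip archive, while the later cells read `stage_2_train_labels.csv` and `stage_2_train_images/` from a plain directory. A minimal sketch of the extraction step follows; the helper name `extract_dataset` is hypothetical, and the archive filename in the usage note is an assumption about what the Kaggle CLI writes, so check the actual name in your working directory.

```python
import zipfile
from pathlib import Path


def extract_dataset(archive, dest):
    """Extract the zip file `archive` into directory `dest` and return `dest`.

    Hypothetical helper; `dest` is created if it does not exist.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return dest
```

For example, `extract_dataset('rsna-pneumonia-detection-challenge.zip', '/path/to/downloaded/dataset/')` would unpack into the directory used as `ROOT_KAGGLE_DATA` below, assuming that is the archive name.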

# Initialize Constants
Initialize the path to the download directory, the default session label, and the default acquisition label.

In [4]:
ROOT_KAGGLE_DATA = '/path/to/downloaded/dataset/'
DEFAULT_SESSION_LABEL = 'NA'
DEFAULT_ACQ_LABEL = 'Chest XR'

# Flywheel API Key and Client
Obtain a Flywheel API key. For details, see the Flywheel SDK documentation [here](https://flywheel-io.gitlab.io/product/backend/sdk/branches/master/python/getting_started.html#api-key).

In [7]:
API_KEY = getpass('Enter API_KEY here: ')

Instantiate the Flywheel API client

In [8]:
fw_client = flywheel.Client(API_KEY if 'API_KEY' in locals() else os.environ.get('FW_KEY'))
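
The cell above falls back to the `FW_KEY` environment variable only when `API_KEY` is undefined, so an empty entry at the prompt is passed to the client unchanged. The sketch below shows one way to make the fallback explicit and fail early; `resolve_api_key` is a hypothetical helper, not part of the Flywheel SDK.

```python
import os


def resolve_api_key(entered=None, env=os.environ, env_var="FW_KEY"):
    """Return the entered key, or the value of `env_var`, or raise ValueError.

    Hypothetical helper: blank or whitespace-only values count as missing.
    """
    key = (entered or "").strip() or env.get(env_var, "").strip()
    if not key:
        raise ValueError(f"No API key entered and {env_var} is not set")
    return key
```

The result could then be passed to the client, for example `flywheel.Client(resolve_api_key(API_KEY))`.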

Show Flywheel logging information

In [9]:
log.info('You are now logged in as %s to %s', fw_client.get_current_user()['email'], fw_client.get_config()['site']['api_url'])

2020-10-14 14:35:34,642 INFO You are now logged in as joshuajacobs@flywheel.io to https://covid19.flywheel.io/api


# Read the CSV

In [None]:
df = pd.read_csv(Path(ROOT_KAGGLE_DATA) / 'stage_2_train_labels.csv')
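
The upload loop later in this notebook reads the columns `patientId`, `x`, `y`, `width`, `height`, and `Target` from this file. A small sketch that checks the header before any upload starts is shown below; `missing_label_columns` is a hypothetical helper, and it uses only the standard library so it can run before pandas loads the file.

```python
import csv

# Columns the upload loop in this notebook reads from each row
REQUIRED_COLUMNS = ("patientId", "x", "y", "width", "height", "Target")


def missing_label_columns(csv_path, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the CSV header."""
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle), [])
    return [name for name in required if name not in header]
```

An empty list means the header has every expected column. The same comparison can be made against `df.columns` once the dataframe is loaded.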

# Container helpers
Import container helper functions to find existing or create new containers.

In [None]:
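from container_helpers import (
    find_or_create_group, 
    find_or_create_project, 
    find_or_create_subject, 
    find_or_create_session, 
    find_or_create_acquisition,
    # Called in the upload loop below; assumed to live in container_helpers
    # under the same spelling used at the call site
    upload_file_to_acquistion,
)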
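
The `container_helpers` module is not shown in this notebook. To illustrate the find-or-create pattern, here is a minimal sketch of what a subject helper might look like. It assumes the parent container exposes a `subjects` finder with `find_first(query)` that returns `None` when nothing matches, and an `add_subject(label=...)` method; treat both as assumptions to check against the Flywheel SDK documentation. The name `find_or_create_subject_sketch` is hypothetical and chosen so it does not shadow the imported helper.

```python
def find_or_create_subject_sketch(label, project):
    """Return the subject labelled `label` in `project`, creating it if absent.

    Illustration only: `project.subjects.find_first` and `project.add_subject`
    are assumed to behave as described in the text above.
    """
    subject = project.subjects.find_first(f"label={label}")
    if subject is None:
        # No match found, so create the container under the parent
        subject = project.add_subject(label=label)
    return subject
```

Calling the function twice with the same label should return the existing subject on the second call, which is what keeps the upload loop safe to re-run.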

# Create the project

In [None]:
# Initialize the group
public_data_group = find_or_create_group(fw_client, 'public_data', 'public_data')
# Initialize the project
project_label = 'kaggle-rsna-pneumonia-detection-challenge'
readme = 'https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/data'
chestxray_project = find_or_create_project(project_label, public_data_group)
if chestxray_project:
    chestxray_project.update(description=readme)

# Iterate through dataframe and upload
Iterate through the training data csv to create the container hierarchy for this project:
* find or create each subject encountered
* find or create each session (with `DEFAULT_SESSION_LABEL`) encountered
* find or create each acquisition (with 'SeriesDescription' or `DEFAULT_ACQ_LABEL`) and add enclosed files.

In [None]:
for i, row in tqdm(df.iterrows(), total=len(df)):
    log.info('Processing Subject %s.', row['patientId'])
    subject = find_or_create_subject(row['patientId'], chestxray_project)
    if row['Target']:
        row_dict = {
            'box': {
                'x': row['x'], 
                'y': row['y'], 
                'width': row['width'], 
                'height': row['height']
            }, 
            'Target': row['Target']
        }
    else:
        row_dict = {'Target': row['Target']}
    if subject:
        log.info('Processing Session %s.', DEFAULT_SESSION_LABEL)
        session = find_or_create_session(DEFAULT_SESSION_LABEL, subject)
        if session:
            filepath = Path(ROOT_KAGGLE_DATA) / 'stage_2_train_images' / f"{row['patientId']}.dcm"
            # Read only the header (no pixel data); dcmread replaces the deprecated read_file
            dcm = pydicom.dcmread(str(filepath), stop_before_pixels=True, force=True)
            # Pack the DICOM into a temporary zip file, stored under its bare filename
            zip_path = f'/tmp/{row["patientId"]}.zip'
            with zipfile.ZipFile(zip_path, 'w') as myzip:
                myzip.write(filepath, arcname=filepath.name)

            # Fall back to the default label when SeriesDescription is missing or empty
            acq_label = dcm.get('SeriesDescription') or DEFAULT_ACQ_LABEL
            log.info('Processing Acquisition %s.', acq_label)

            acq = find_or_create_acquisition(acq_label, session)
            log.info('Uploading file, %s, to acquisition, %s', zip_path, acq.label)
            kwarg_dict = {"type": "dicom", "modality": "X-ray", "info": row_dict}
            upload_file_to_acquistion(acq, zip_path, **kwarg_dict)
            # Remove the temporary zipped DICOM file
            os.remove(zip_path)
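
The loop above handles one CSV row at a time. As far as I understand the label file, a patient with several annotated boxes occupies several rows, so the same DICOM would be zipped and uploaded once per row, each time with a single box. The sketch below groups the rows per patient first, so each file could be uploaded once with all of its boxes. `boxes_by_patient` and the `boxes` key are hypothetical and differ from the single `box` entry used above.

```python
from collections import defaultdict


def boxes_by_patient(rows):
    """Group label rows by patientId.

    `rows` is an iterable of dict-like records with the keys patientId,
    x, y, width, height, and Target (for example df.to_dict('records')).
    Returns {patientId: {'Target': 0 or 1, 'boxes': [box, ...]}}.
    """
    grouped = defaultdict(list)
    for row in rows:
        # Touch the entry so patients without any box still appear
        boxes = grouped[row["patientId"]]
        if row["Target"]:
            boxes.append({k: row[k] for k in ("x", "y", "width", "height")})
    return {
        pid: {"Target": int(bool(boxes)), "boxes": boxes}
        for pid, boxes in grouped.items()
    }
```

Iterating over the returned dictionary instead of `df.iterrows()` would give one upload per patient, with the grouped dictionary passed as the file's `info`.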