**Title**: Upload Kaggle chest X-ray dataset.  
**Date**:  12-Oct-2020     
**Description**:  
This notebook downloads the entire RSNA Pneumonia Detection Challenge dataset (approximately 3.7 GB) and uploads it to the Flywheel instance identified by the supplied API key. You must accept a Data Use Agreement (DUA) before downloading this dataset.

Reference:
* https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/data  

# Data Use Agreement
Before downloading this data, or any data, from Kaggle, you must agree to the rules of this competition: 

* https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/rules

In [None]:
%reload_ext autoreload
%autoreload 2 
%matplotlib inline

# Requirements:
- **Python** (preferably >= 3.6)
- Admin permissions to create Flywheel Groups and Projects

# Install and import dependencies

In [None]:
!pip install pandas pydicom flywheel-sdk kaggle tqdm

In [None]:
import os
from pathlib import Path
import pandas as pd
import flywheel
import logging
from tqdm.notebook import tqdm
import pydicom
import re
import json
import time
import zipfile

In [None]:
# Instantiate a logger
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')
log = logging.getLogger('root')

# Download kaggle dataset

This step requires your Kaggle credentials to be stored in `~/.kaggle/kaggle.json`. To obtain them, create a Kaggle account at kaggle.com and use "Create New API Token" on the user account page. 

The dataset is approximately 3.7 GB at the time of writing and may change in the future. Depending on your connection bandwidth, the download may take some time.
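
Before starting a multi-gigabyte download, it can help to confirm that the credentials file is in place. The following is a minimal sketch, not part of the original notebook: `check_kaggle_credentials` is a hypothetical helper name, it assumes the token file contains `username` and `key` fields, and the permission check is only meaningful on POSIX systems.

```python
import json
import stat
from pathlib import Path


def check_kaggle_credentials(path=Path.home() / ".kaggle" / "kaggle.json"):
    """Return a list of problems found with the Kaggle credentials file."""
    path = Path(path)
    if not path.is_file():
        return [f"{path} does not exist"]
    try:
        creds = json.loads(path.read_text())
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(creds, dict):
        return [f"{path} does not contain a JSON object"]
    problems = []
    # Assumption: the token file carries a 'username' and a 'key' entry.
    for field in ("username", "key"):
        if not creds.get(field):
            problems.append(f"missing '{field}' field")
    # Flag token files that group members or other users can read.
    if path.stat().st_mode & (stat.S_IRGRP | stat.S_IROTH):
        problems.append("file is readable by other users; consider chmod 600")
    return problems
```

An empty list means no problems were detected; any other result lists what to fix before running the download cell.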

In [None]:
!kaggle competitions download -c rsna-pneumonia-detection-challenge
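
The later cells read `stage_2_train_labels.csv` and `stage_2_train_images` from a directory, so the download needs to be unpacked first. The sketch below assumes the download arrives as a single zip archive; `extract_archive` is a hypothetical helper, and the archive name in the usage note is an assumption about how the CLI names the file.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the sorted member names."""
    dest_dir = Path(dest_dir)
    # Create the destination if it does not exist yet.
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return sorted(archive.namelist())
```

A call such as `extract_archive('rsna-pneumonia-detection-challenge.zip', ROOT_KAGGLE_DATA)` would then place the files where the rest of the notebook expects them, provided the archive name matches what was actually downloaded.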

# Initialize Constants
Initialize the path to the download directory, the default session label, and the default acquisition label.

In [None]:
ROOT_KAGGLE_DATA = '/path/to/downloaded/dataset/'
DEFAULT_SESSION_LABEL = 'NA'
DEFAULT_ACQ_LABEL = 'Chest XR'

# Flywheel API Key and Client
Get an API_KEY. More on this in the Flywheel SDK doc [here](https://flywheel-io.gitlab.io/product/backend/sdk/branches/master/python/getting_started.html#api-key).

In [None]:
from getpass import getpass

# Prompt for the key without echoing it to the notebook output
API_KEY = getpass('Enter API_KEY here: ')

Instantiate the Flywheel API client

In [None]:
fw_client = flywheel.Client(API_KEY if 'API_KEY' in locals() else os.environ.get('FW_KEY'))

# Read the csv

In [None]:
df = pd.read_csv(Path(ROOT_KAGGLE_DATA) / 'stage_2_train_labels.csv')

# Container helpers
Import container helper functions to find existing or create new containers.

In [None]:
from container_helpers import (
    find_or_create_group, 
    find_or_create_project, 
    find_or_create_subject, 
    find_or_create_session, 
    find_or_create_acquisition,
)
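
The `container_helpers` module is not shown in this notebook. To illustrate the find-or-create idea only, here is a hypothetical sketch of the pattern; it is not the module's actual code, and the `find_first` and `create` callables are assumptions about how a lookup and a creation call might be supplied.

```python
def find_or_create(label, find_first, create):
    """Return the container labelled `label`, creating it only if absent.

    find_first: callable that takes a filter string such as 'label=abc'
                and returns a matching container or None (assumed interface).
    create:     callable that accepts label=... and returns the new container
                (assumed interface).
    """
    existing = find_first(f"label={label}")
    if existing is not None:
        # Reuse the existing container so reruns do not create duplicates.
        return existing
    return create(label=label)
```

Because the lookup runs first, repeating the upload over the same data reuses containers instead of duplicating them, which is what makes the loop further down safe to rerun.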

# Create the project

In [None]:
# Initialize the group
public_data_group = find_or_create_group(fw_client, 'public_data', 'public_data')
# Initialize the project
project_label = 'kaggle-rsna-pneumonia-detection-challenge'
readme = 'https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/data'
chestxray_project = find_or_create_project(project_label, public_data_group)
if chestxray_project:
    chestxray_project.update(description=readme)

# Iterate through dataframe and upload
Iterate through the training data csv to create the container hierarchy for this project:
* find or create each subject encountered
* find or create each session (with `DEFAULT_SESSION_LABEL`) encountered
* find or create each acquisition (with 'SeriesDescription' or `DEFAULT_ACQ_LABEL`) and add enclosed files.
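
The metadata attached to each acquisition depends on whether the row is a positive case: only rows with a non-zero `Target` carry a bounding box. A minimal sketch of that per-row logic follows. `row_to_info` is a hypothetical helper, the two sample rows are made up for illustration, and the explicit `int`/`float` conversions are an added precaution so the values are plain Python types.

```python
import pandas as pd


def row_to_info(row):
    """Build the metadata dict for one row of the labels CSV."""
    info = {"Target": int(row["Target"])}
    if info["Target"]:
        # Box columns are only meaningful for positive rows.
        info["box"] = {k: float(row[k]) for k in ("x", "y", "width", "height")}
    return info


# Illustrative rows only; real values come from stage_2_train_labels.csv.
labels = pd.DataFrame(
    {
        "patientId": ["patient-a", "patient-b"],
        "x": [10.0, float("nan")],
        "y": [20.0, float("nan")],
        "width": [30.0, float("nan")],
        "height": [40.0, float("nan")],
        "Target": [1, 0],
    }
)
infos = [row_to_info(row) for _, row in labels.iterrows()]
```

The negative row yields only `{'Target': 0}`, so its missing box coordinates never reach the metadata.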

In [None]:
for i, row in tqdm(df.iterrows(), total=len(df)):
    subject = find_or_create_subject(row['patientId'], None, chestxray_project)
    if row['Target']:
        row_dict = {
            'box': {
                'x': row['x'], 
                'y': row['y'], 
                'width': row['width'], 
                'height': row['height']
            }, 
            'Target': row['Target']
        }
    else:
        row_dict = {'Target': row['Target']}
    if subject:
        session = find_or_create_session(DEFAULT_SESSION_LABEL, None, subject)
        if session:
            filepath = str(Path(ROOT_KAGGLE_DATA) / 'stage_2_train_images' / f"{row['patientId']}.dcm")
            # Read only the header; pixel data is not needed to choose a label
            dcm = pydicom.dcmread(filepath, stop_before_pixels=True, force=True)
            zip_path = f'/tmp/{row["patientId"]}.zip'
            with zipfile.ZipFile(zip_path, 'w') as myzip:
                myzip.write(filepath)
            # Fall back to the default label when SeriesDescription is absent
            acq_label = dcm.get('SeriesDescription', DEFAULT_ACQ_LABEL)
            acq = find_or_create_acquisition(acq_label, row_dict, zip_path, session)
            # Remove the temporary archive once it has been uploaded
            os.remove(zip_path)
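
The zip step in the loop could be factored into a small helper. The sketch below is one possible variant, not the notebook's method: `zip_single_file` is a hypothetical name, it writes to the system temporary directory instead of a hard-coded `/tmp`, and, unlike the loop, it stores only the file's base name inside the archive rather than its full path.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_single_file(src, dest_dir=None):
    """Zip `src` into `<dest_dir>/<stem>.zip`, storing only its base name."""
    src = Path(src)
    dest_dir = Path(dest_dir or tempfile.gettempdir())
    zip_path = dest_dir / f"{src.stem}.zip"
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # arcname drops the directory part so the archive holds just the file.
        archive.write(src, arcname=src.name)
    return zip_path
```

The returned path can be passed to the upload call and deleted afterwards, mirroring the create, upload, remove sequence in the loop.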