**Title**: Upload Kaggle Chest X-Ray  
**Date**:  12-Oct-2020     
**Description**:  
Pneumonia accounts for over 15% of all deaths of children under 5 years old internationally. Advanced detection of pneumonia could save thousands of lives a year.

In 2018 the RSNA Pneumonia Detection Challenge was posted on Kaggle, an organization for machine learning training and purpose-driven competitions in Data Science.

This notebook downloads the entire RSNA Pneumonia Detection Challenge dataset (about 3.7 GB) and uploads it to the Flywheel instance specified by the supplied API key. A Data Use Agreement (DUA) is required to download this dataset.

Reference:
* https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/data  

# Data Use Agreement
Before downloading this data, or any data, from Kaggle, you must agree to the rules of this competition:

* https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/rules

In [None]:
%reload_ext autoreload
%autoreload 2 
%matplotlib inline

# Requirements:
- **Python** (preferably >= 3.6)

- Admin permissions to create Flywheel Groups and Projects.

# Install and import dependencies

In [None]:
!pip install pandas pydicom flywheel-sdk tqdm kaggle jupyter ipywidgets

In [None]:
import json
import logging
import os
import re
import time
import zipfile
from getpass import getpass
from pathlib import Path

import flywheel
import pandas as pd
import pydicom
from tqdm.notebook import tqdm

In [None]:
# Instantiate a logger
logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')
log = logging.getLogger('root')

# Download kaggle dataset

This requires that your Kaggle credentials are stored in `~/.kaggle/kaggle.json`. To get them, create a Kaggle account at kaggle.com and use "Create New API Token" on the user account page.
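Before starting a multi-gigabyte download, it can help to confirm that the credentials file is in place. The sketch below is not part of the original notebook; `load_kaggle_credentials` is a hypothetical helper, and it assumes the token file holds a JSON object with `username` and `key` fields, as the tokens generated by Kaggle do.

```python
import json
from pathlib import Path


def load_kaggle_credentials(path=None):
    """Return the parsed Kaggle token, raising a clear error if it is unusable.

    Hypothetical helper for illustration; defaults to ~/.kaggle/kaggle.json.
    """
    path = Path(path) if path is not None else Path.home() / '.kaggle' / 'kaggle.json'
    if not path.is_file():
        raise FileNotFoundError(f'No Kaggle credentials found at {path}')
    creds = json.loads(path.read_text())
    # A Kaggle API token is expected to carry these two fields.
    missing = {'username', 'key'} - set(creds)
    if missing:
        raise KeyError(f'{path} is missing fields: {sorted(missing)}')
    return creds
```

Calling `load_kaggle_credentials()` with no argument checks the default location and fails early with a readable message if the token is absent.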

This dataset is currently 3.7 GB and may change in the future. Depending on the bandwidth of your internet connection, this may take some time to download.

In [None]:
!kaggle competitions download -c rsna-pneumonia-detection-challenge
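The later cells read `stage_2_train_labels.csv` and `stage_2_train_images` from an unpacked directory, so the downloaded archive has to be extracted first. A minimal sketch follows, assuming the download arrives as a single zip archive; `extract_archive` is a hypothetical helper, and the archive name in the usage comment is an assumption to check against what the CLI actually wrote.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a zip archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Hypothetical usage; adjust both paths to your environment:
# extract_archive('rsna-pneumonia-detection-challenge.zip',
#                 '/path/to/repository/rsna-pneumonia-detection-challenge')
```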

# Initialize Constants
Initialize the path to the download directory, the default session label, and the default acquisition label.

In [None]:
ROOT_KAGGLE_DATA = '/path/to/repository/rsna-pneumonia-detection-challenge'
DEFAULT_SESSION_LABEL = 'NA'
DEFAULT_ACQ_LABEL = 'Chest XR'

# Flywheel API Key and Client
Get an API_KEY. More on this in the Flywheel SDK doc [here](https://flywheel-io.gitlab.io/product/backend/sdk/branches/master/python/getting_started.html#api-key).

In [None]:
API_KEY = getpass('Enter API_KEY here: ')

Instantiate the Flywheel API client

In [None]:
fw_client = flywheel.Client(API_KEY if 'API_KEY' in locals() else os.environ.get('FW_KEY'))

Show Flywheel logging information

In [None]:
log.info('You are now logged in as %s to %s', fw_client.get_current_user()['email'], fw_client.get_config()['site']['api_url'])

# Read the csv
The CSV file consists of the patient ID, the rectangular region of the image where pneumonia was found (x, y, width, height), and whether pneumonia was diagnosed (Target 0/1). The region fields are empty when Target is 0.

```
patientId,x,y,width,height,Target
0004cfab-14fd-4e49-80ba-63a80b6bddd6,,,,,0
00436515-870c-4b36-a041-de91049b9ab4,264.0,152.0,213.0,379.0,1
```

In [None]:
df = pd.read_csv(Path(ROOT_KAGGLE_DATA) / 'stage_2_train_labels.csv')
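To see how pandas represents these rows, the two sample lines shown above can be parsed from a string. This is an illustrative sketch only; the names `SAMPLE_CSV`, `sample_df`, `negatives`, `positives`, and `boxes_per_patient` are not part of the notebook. The per-patient count is included on the assumption that the full file may hold more than one positive row for a patient.

```python
from io import StringIO

import pandas as pd

SAMPLE_CSV = """patientId,x,y,width,height,Target
0004cfab-14fd-4e49-80ba-63a80b6bddd6,,,,,0
00436515-870c-4b36-a041-de91049b9ab4,264.0,152.0,213.0,379.0,1
"""

sample_df = pd.read_csv(StringIO(SAMPLE_CSV))

# Empty box fields are read as NaN, so negative rows have no usable box.
negatives = sample_df[sample_df['Target'] == 0]
positives = sample_df[sample_df['Target'] == 1]

# Number of positive rows (bounding boxes) recorded for each patient.
boxes_per_patient = positives.groupby('patientId').size()
```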

# Container helpers
Import container helper functions to find existing or create new containers.

In [None]:
from container_helpers import (
    find_or_create_group, 
    find_or_create_project, 
    find_or_create_subject, 
    find_or_create_session, 
    find_or_create_acquisition,
    upload_file_to_acquisition
)
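The `container_helpers` module is not shown in this notebook. As a rough idea of what one of these helpers might look like, here is a sketch of `find_or_create_subject`. It assumes the project object exposes the Flywheel SDK's `subjects.find_first` finder and `add_subject` method; the real helper may differ, for example in its error handling.

```python
def find_or_create_subject(label, project):
    """Return the subject labeled `label` in `project`, creating it if absent.

    Illustrative sketch, not the actual container_helpers implementation.
    """
    # Assumption: find_first returns None when no subject matches the filter.
    subject = project.subjects.find_first(f'label={label}')
    if subject is None:
        subject = project.add_subject(label=label)
    return subject
```

Because the lookup runs before the create call, rerunning the notebook should reuse existing containers rather than create duplicates.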

# Create the project

In [None]:
# Initialize the group
public_data_group = find_or_create_group(fw_client, 'public_data', 'public_data')
# Initialize the project
project_label = 'kaggle-rsna-pneumonia-detection-challenge'
readme = 'https://www.kaggle.com/c/rsna-pneumonia-detection-challenge/data'
chestxray_project = find_or_create_project(project_label, public_data_group)
if chestxray_project:
    chestxray_project.update(description=readme)

# Iterate through dataframe and upload
Iterate through the training data CSV to create the container hierarchy for this project:

1. Find or create each subject encountered.

   a. Encode presence/absence of pneumonia (Target=0/1) and the rectangular region it was found in (box) into a dictionary.
2. Find or create each session (with `DEFAULT_SESSION_LABEL`) encountered.
3. Find or create each acquisition (with 'SeriesDescription' or `DEFAULT_ACQ_LABEL`) and add enclosed files.

   a. Incorporate presence/absence of pneumonia (Target) and, if found, the rectangular region it was found in (box) into the metadata of the acquisition file.
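Step 1a can be isolated into a small function, which makes the metadata layout easy to inspect before anything is uploaded. This is a sketch; `row_to_info` is a hypothetical name, and the casts to `int` and `float` are an added assumption so that the dictionary holds plain Python values rather than NumPy scalars.

```python
def row_to_info(row):
    """Build the file metadata dictionary for one row of the labels CSV."""
    info = {'Target': int(row['Target'])}
    # Only positive rows carry a bounding box; negative rows have empty fields.
    if info['Target'] == 1:
        info['box'] = {key: float(row[key]) for key in ('x', 'y', 'width', 'height')}
    return info
```

The function accepts anything indexable by column name, so it works on a pandas row from `df.iterrows()` as well as on a plain dictionary.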


In [1]:
for i, row in tqdm(df.iterrows(), total=len(df)):
    patient_id = row['patientId']
    log.info('Processing Subject %s.', patient_id)
    # (1) Find or create subject
    subject = find_or_create_subject(patient_id, chestxray_project)
    # (1a) Encode pneumonia status and, for positive cases, the rectangular region.
    # Values are cast to built-in int/float so the metadata holds plain Python types.
    row_dict = {'Target': int(row['Target'])}
    if row['Target']:
        row_dict['box'] = {
            'x': float(row['x']),
            'y': float(row['y']),
            'width': float(row['width']),
            'height': float(row['height'])
        }
    if subject:
        log.info('Processing Session %s.', DEFAULT_SESSION_LABEL)
        # (2) Find or create session
        session = find_or_create_session(DEFAULT_SESSION_LABEL, subject)
        if session:
            filepath = Path(ROOT_KAGGLE_DATA) / 'stage_2_train_images' / f'{patient_id}.dcm'
            # Read the header only; dcmread replaces the older read_file alias
            dcm = pydicom.dcmread(str(filepath), stop_before_pixels=True, force=True)
            # Pack the DICOM into a zip file, stored under its bare file name
            # rather than its full directory path
            zip_path = f'/tmp/{patient_id}.zip'
            with zipfile.ZipFile(zip_path, 'w') as myzip:
                myzip.write(filepath, arcname=filepath.name)

            acq_label = dcm.get('SeriesDescription', DEFAULT_ACQ_LABEL)
            log.info('Processing Acquisition %s.', acq_label)
            # (3) Find or create acquisition
            acq = find_or_create_acquisition(acq_label, session)
            log.info(
                'Uploading file, %s, to acquisition, %s',
                zip_path,
                acq.label
            )
            kwarg_dict = {"type": "dicom", "modality": "X-ray"}
            kwarg_dict["info"] = row_dict
            # Upload file to acquisition and
            # (3a) incorporate Target and box into file metadata
            upload_file_to_acquisition(acq, zip_path, **kwarg_dict)
            # Remove the temporary zipped DICOM file
            os.remove(zip_path)
