# Preprocessing Template

Copy this notebook and set the paths to start preprocessing your data.

Note: If you are going to update this notebook, clear the outputs before committing.

## Loading the Data

The code below loads data and labels from GCS.

You should update the paths to save the data to the right place on
your local disk.

### Arrays

In [None]:
# Keeps the data in the local filesystem in sync with GCS
# (-d deletes local files that are not in the bucket)
!gsutil rsync -d -r gs://elvos/numpy PATH/TO/SAVE/DATA/TO

In [None]:
import os
import pathlib
import typing

import numpy as np

In [None]:
def load_data(data_dir: str) -> typing.Dict[str, np.ndarray]:
    """Returns a dictionary which maps patient ids
    to patient pixel data."""
    data_dict = {}
    for filename in os.listdir(data_dir):
        if not filename.endswith('.npy'):
            continue  # skip anything that is not a saved array
        patient_id = filename[:-len('.npy')]  # remove .npy extension
        data_dict[patient_id] = np.load(pathlib.Path(data_dir) / filename)
    return data_dict

In [None]:
data_dict = load_data('<PATH/FROM/ABOVE>')

### Labels

In [None]:
!gsutil cp gs://elvos/labels.csv PATH/TO/SAVE/LABELS/TO

In [None]:
import pandas as pd

In [None]:
labels_df = pd.read_csv('<PATH/FROM/ABOVE>',
                        index_col='patient_id')

## Preprocessing: Part I

If we use gs://elvos/numpy or gs://elvos/labels.csv, we'll have to do some minor
preprocessing first (removing bad data and duplicate labels).

In [None]:
def process_images(data: typing.Dict[str, np.ndarray]):
    # Drop the one bad image, which has a height of 1
    return {id_: arr for id_, arr in data.items() if len(arr) != 1}

In [None]:
data_dict = process_images(data_dict)

In [None]:
def process_labels(labels: pd.DataFrame, data: typing.Dict[str, np.ndarray]):
    # TODO: Remove duplicate HLXOSVDF27JWNCMJ, IYDXJTFVWJEX36DO from ELVO_key
    labels = labels.loc[~labels.index.duplicated()] # Remove duplicate ids
    labels = labels.loc[list(data.keys())]
    assert len(labels) == len(data)
    return labels

In [None]:
labels_df = process_labels(labels_df, data_dict)
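
To see what `process_labels` does to the index, here is a minimal sketch on a toy DataFrame. The patient ids and label values are made up for illustration; only the pandas calls mirror the cell above.

```python
import pandas as pd

# Hypothetical labels with one duplicated patient id ('b').
toy_labels = pd.DataFrame(
    {'label': [1, 0, 0, 1]},
    index=pd.Index(['a', 'b', 'b', 'c'], name='patient_id'),
)

# Index.duplicated() marks every occurrence after the first,
# so negating it keeps the first row for each id.
deduped = toy_labels.loc[~toy_labels.index.duplicated()]

# Selecting with a list of ids keeps only those patients, in list order.
toy_ids = ['c', 'a']
selected = deduped.loc[toy_ids]
```

Because `.loc` with a list raises a `KeyError` for missing ids, this selection also fails if an array has no matching label.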

## Data Exploration

Simple plotting of the (mostly) unprocessed data.

For the data in `numpy/`:
- The 6 smallest image heights are: 1, 160, 160, 162, 164, 181.
- The 5 smallest image lengths/widths are: 180, 191, 193, 195, 197.
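
The figures above can be recomputed from the loaded arrays. The sketch below assumes height is axis 0 and length/width is axis 1, as in the `crop` docstring later in this notebook. `smallest_dims` and the toy shapes are hypothetical stand-ins for the real `data_dict`.

```python
import numpy as np


def smallest_dims(data, n=5):
    """Returns the n smallest heights (axis 0) and widths (axis 1)."""
    heights = sorted(arr.shape[0] for arr in data.values())
    widths = sorted(arr.shape[1] for arr in data.values())
    return heights[:n], widths[:n]


# Stand-in arrays with made-up shapes, in place of the real data_dict.
toy_data = {
    'p1': np.zeros((1, 4, 4)),
    'p2': np.zeros((160, 8, 8)),
    'p3': np.zeros((162, 6, 6)),
}
heights, widths = smallest_dims(toy_data, n=2)
```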

In [None]:
%matplotlib inline

In [None]:
from matplotlib import pyplot as plt

In [None]:
def plot_images(data: typing.Dict[str, np.ndarray],
                labels: pd.DataFrame,
                num_cols: int,
                limit=20,
                offset=0):
    # Ceiling of (number of images actually plotted) / num_cols
    num_plots = max(min(len(data) - offset, limit), 0)
    num_rows = (num_plots + num_cols - 1) // num_cols
    fig = plt.figure(figsize=(10, 10))
    for i, patient_id in enumerate(data):
        if i < offset:
            continue
        if i >= offset + limit:
            break
        plot_num = i - offset + 1
        ax = fig.add_subplot(num_rows, num_cols, plot_num)
        ax.set_title(f'patient: {patient_id[:4]}...')
        label = 'positive' if labels.loc[patient_id]['label'] else 'negative'
        ax.set_xlabel(f'label: {label}')
        ax.imshow(data[patient_id])
    fig.tight_layout()
    plt.show()

In [None]:
# Change the input to .transpose to see different views of the data
mipped_all = {k: arr.transpose(0, 2, 1).max(axis=2) for k, arr in data_dict.items()}

In [None]:
# `mip` and `processed_dict` are only defined in Part II, so plot the
# projections computed above instead
plot_images(mipped_all, labels_df, 5, offset=20)

## Preprocessing: Part II

Cropping the image, applying mip, etc.

You should fill in `process_data` below with the transformations you want to apply.

In [None]:
import scipy.ndimage

In [None]:
def crop(image3d: np.ndarray, interactive=False) -> np.ndarray:
    """Crops a 3d image in ijk form (height as axis 0).

    Note: `interactive` is currently unused.
    """
    assert image3d.shape[1] == image3d.shape[2]
    lw_center = image3d.shape[1] // 2
    lw_min = lw_center - 80
    lw_max = lw_center + 80
    # Scan from the last slice toward the first for a non-negative center pixel
    for i in range(len(image3d) - 1, 0, -1):
        if image3d[i, lw_center, lw_center] >= 0:
            height_max = i
            break
    else:
        raise ValueError('no slice with a non-negative center pixel')
    height_min = height_max - 128 # TODO
    return image3d[height_min:height_max, lw_min:lw_max, lw_min:lw_max]


def transpose(image3d: np.ndarray) -> np.ndarray:
    """Move height from the first axis to the last.
    """
    return image3d.transpose(1, 2, 0)


def bound_pixels(image3d: np.ndarray,
                 min_bound: float,
                 max_bound: float) -> np.ndarray:
    """Clips pixel values to [min_bound, max_bound].

    Returns a new array, so the input is not modified in place.
    """
    return np.clip(image3d, min_bound, max_bound)


def mip(image3d: np.ndarray) -> np.ndarray:
    """Make sure that the array has been transposed first!
    """
    assert image3d.shape[0] == image3d.shape[1]
    return image3d.max(axis=2)


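def downsample(image3d: np.ndarray, factor) -> np.ndarray:
    """Rescales every axis by `factor` (values below 1 shrink the image)."""
    return scipy.ndimage.zoom(image3d, factor)


def to_grayscale(image2d: np.ndarray):
    """Stacks a 2d image into 3 identical channels: (rows, cols, 3)."""
    return np.stack([image2d, image2d, image2d], axis=2)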


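def process_data(data: typing.Dict[str, np.ndarray]) -> typing.Dict[str, np.ndarray]:
    processed = {}
    for id_, arr in data.items():
        # Replace the line below with your own steps and store the
        # result in processed[id_]
        raise NotImplementedError('Choose your own transformations')
    return processed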
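
As an illustration only, one possible chain of steps is sketched below on a synthetic scan. It uses plain NumPy so that it runs on its own. The bounds of -40 and 400, the name `example_pipeline`, and the toy array are assumptions for the example, not recommended settings.

```python
import numpy as np


def example_pipeline(image3d: np.ndarray) -> np.ndarray:
    """One possible pipeline: bound the pixels, then transpose."""
    # Placeholder bounds; choose values that suit your data.
    bounded = np.clip(image3d, -40, 400)
    # Move height from the first axis to the last, as transpose() does.
    return bounded.transpose(1, 2, 0)


# Synthetic scan: 8 slices of 4x4 pixels, height as axis 0.
toy_scan = np.full((8, 4, 4), -1000.0)
toy_scan[3, 1, 2] = 900.0  # one bright voxel

volume = example_pipeline(toy_scan)
# Maximum intensity projection over a slab of slices, as mip() does.
projected = volume[:, :, 2:6].max(axis=2)
```

The pipeline returns a 3D array with height last, which is the layout the `mip(arr[:, :, 20:40])` call in the validation cell expects.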

In [None]:
processed_dict = process_data(data_dict)

## Data Validation

Check to see that the data looks right.

In [None]:
plot_images({k: mip(arr[:, :, 20:40]) for k, arr in processed_dict.items()}, labels_df, 5, offset=180)
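
Plots can be complemented with a few programmatic checks. The sketch below is one possible set, assuming every processed array should share a shape, contain only finite values, and have a matching label. The `validate` helper and the toy inputs are hypothetical, and the checks that matter depend on your model.

```python
import numpy as np
import pandas as pd


def validate(data, labels):
    """Raises AssertionError if the arrays and labels look inconsistent."""
    shapes = {arr.shape for arr in data.values()}
    assert len(shapes) == 1, f'arrays have differing shapes: {shapes}'
    for id_, arr in data.items():
        assert np.isfinite(arr).all(), f'non-finite values in {id_}'
    assert set(data) == set(labels.index), 'ids and labels do not match'


# Toy inputs standing in for processed_dict and labels_df.
toy_data = {'p1': np.zeros((4, 4, 8)), 'p2': np.ones((4, 4, 8))}
toy_labels = pd.DataFrame(
    {'label': [0, 1]},
    index=pd.Index(['p1', 'p2'], name='patient_id'),
)
validate(toy_data, toy_labels)  # passes silently
```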

## Saving the Data

Once you've preprocessed data to your liking, you should save the data to 
disk. Load the data from disk in your model building notebook.

In [None]:
# Change the preprocessed path
processed_dirpath = '<PATH/TO/SAVE/DATA/TO>'
os.mkdir(processed_dirpath)
arr: np.ndarray
for id_, arr in processed_dict.items():
    np.save(pathlib.Path(processed_dirpath) / f'{id_}.npy', arr)

In [None]:
labels_df.to_csv('PATH/TO/SAVE/LABELS/TO')
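
Loading the saved arrays back in a model building notebook could look like the round trip below. It writes to a temporary directory so that it runs anywhere; the toy array and variable names are assumptions for the example.

```python
import pathlib
import tempfile

import numpy as np

# Toy stand-in for processed_dict.
toy_processed = {'p1': np.arange(6).reshape(2, 3)}

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = pathlib.Path(tmp_dir)
    for id_, arr in toy_processed.items():
        np.save(out_dir / f'{id_}.npy', arr)

    # Path.stem drops the .npy extension to recover the patient id.
    reloaded = {path.stem: np.load(path) for path in out_dir.glob('*.npy')}
```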

## Sharing the Data

If the data works well in models, you should share it with others.
Make sure to update the [Code and Data Doc](https://docs.google.com/document/d/1_vaNcfZ_E5KsOZH_rNf4w_wTIr7EvI8GqpZ5o3dAUV4/edit)
if you do upload to GCS.

In [None]:
!gsutil rsync -d -r PATH/TO/PROCESSED/DATA gs://PATH/TO/SAVE/TO

In [None]:
# rsync only works on directories, so copy the single labels file with cp
!gsutil cp PATH/TO/PROCESSED/LABELS gs://PATH/TO/SAVE/TO