# Create Mini Xray Dataset With Healthy and Diseased Xrays 

This notebook is modified from **K Scott Mader's** notebook [here](https://www.kaggle.com/code/kmader/create-a-mini-xray-dataset-equalized/notebook) to create a mini chest x-ray dataset that is split 50:50 between normal and diseased images.

In my notebook I will use this dataset to test a pretrained model on a binary classification task (diseased vs. healthy xray), and then visualize which specific labels the model has the most trouble with. 

Because disease classification is such an important task to get right, any AI/ML medical classification system will likely include a human in the loop. Reviewing which labels the model struggles with therefore more closely resembles how this sort of ML would be used in the real world.

Note that the original notebook on which this one was based had two versions: [Standard](https://www.kaggle.com/code/kmader/create-a-mini-xray-dataset-standard) and [Equalized](https://www.kaggle.com/code/kmader/create-a-mini-xray-dataset-equalized/notebook). In this notebook we will be using the equalized version in order to save ourselves the extra step of performing CLAHE during the tensor transformations.

### Goal

The goal of this notebook, as originally stated by Mader, is "to make a much easier to use mini-dataset out of the Chest X-Ray collection. The idea is to have something akin to MNIST or Fashion MNIST for medical images." In order to do this, we will preprocess, normalize, and scale down the images, and then save them into an HDF5 file with the corresponding tabular data.

### Import Libraries

In [None]:
from itertools import chain
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
from glob import glob
from tqdm import tqdm

import h5py
from cv2 import imread, createCLAHE # read and equalize images
from skimage.transform import resize

%matplotlib inline
import matplotlib.pyplot as plt

### Define Helper Functions

In [None]:
from warnings import warn  # used to report mismatched input shapes


def write_df_as_hdf(out_path,
                    out_df,
                    compression='gzip'):
    """Write each DataFrame column to its own dataset in an HDF5 file."""
    with h5py.File(out_path, 'w') as h:
        for k, arr_dict in tqdm(out_df.to_dict().items()):
            try:
                # np.stack needs a sequence, not a dict view
                s_data = np.stack(list(arr_dict.values()), 0)
                try:
                    h.create_dataset(k, data=s_data, compression=compression)
                except TypeError as e:
                    try:
                        # Fall back to byte strings for text columns
                        # (np.bytes_ replaces np.string_, removed in NumPy 2.0)
                        h.create_dataset(k, data=s_data.astype(np.bytes_),
                                         compression=compression)
                    except TypeError as e2:
                        print('%s could not be added to hdf5, %s, %s' % (
                            k, repr(e), repr(e2)))
            except ValueError as e:
                print('%s could not be created, %s' % (k, repr(e)))
                all_shape = [np.shape(x) for x in arr_dict.values()]
                warn('Input shapes: {}'.format(all_shape))
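
The fallback branch in `write_df_as_hdf` casts text columns to a fixed-width byte-string dtype, because h5py does not store NumPy unicode (`<U`) arrays directly. The sketch below illustrates only that conversion, using NumPy alone; the toy column values are made up for illustration, and `np.bytes_` is the current NumPy name for the byte-string type.

```python
import numpy as np

# Toy text column, shaped like one entry of DataFrame.to_dict() (illustrative values)
column = {0: "Effusion", 1: "Mass|Nodule", 2: "Normal"}

# Wrap the dict values in a list so np.stack receives a real sequence
stacked = np.stack(list(column.values()), 0)
print(stacked.dtype.kind)   # 'U' (unicode text)

# Cast to fixed-width byte strings, a form HDF5 can store
as_bytes = stacked.astype(np.bytes_)
print(as_bytes.dtype.kind)  # 'S' (byte strings)

# After reading such a dataset back, convert the bytes to text again
decoded = as_bytes.astype(str)
```

Columns written this way come back as byte strings, so they need the final decoding step before being compared with ordinary Python strings.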

In [None]:
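def imread_and_normalize(im_path):
    # CLAHE: contrast-limited adaptive histogram equalization
    clahe_tool = createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
    # Average the color channels to get a single-channel 8-bit image
    img_data = np.mean(imread(im_path), 2).astype(np.uint8)
    img_data = clahe_tool.apply(img_data)
    # resize returns floats in [0, 1]; rescale to 0-255 and cast back to uint8
    n_img = (255*resize(img_data, OUT_DIM, mode = 'constant')).clip(0,255).astype(np.uint8)
    # Add a trailing channel axis: (H, W) -> (H, W, 1)
    return np.expand_dims(n_img, -1)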

In [None]:
# Define configs
# SAMPLE_SIZE = sample size per category (diseased vs. healthy)
# Total samples saved to output file will be 2 * SAMPLE_SIZE

SAMPLE_SIZE = 100
FILE_NAME = "TEST"
OUT_DIM = (128, 128)

In [None]:
all_xray_df = pd.read_csv('../input/data/Data_Entry_2017.csv') 
all_image_paths = {os.path.basename(x): x for x in 
                   glob(os.path.join('..', 'input', 'data','images*', '*', '*.png'))}
print('Scans found:', len(all_image_paths), ', Total Headers', all_xray_df.shape[0])

all_xray_df['path'] = all_xray_df['Image Index'].map(all_image_paths.get)
all_xray_df.sample(3)

### Preprocess Labels
Here we one-hot encode the disease labels to make them easier to group and sort on later. We will also create a subset of the diseased Xray data that preserves the original disease distribution.

In [None]:
# Visualize distribution of the most common disease labels in the full population
# The first (most common) entry is 'No Finding', i.e. normal X-rays, so skip it
# and show only the next 15 labels
label_counts = all_xray_df['Finding Labels'].value_counts()[1:16]
fig, ax1 = plt.subplots(1,1,figsize = (12, 8))
ax1.bar(np.arange(len(label_counts))+0.5, label_counts)
ax1.set_xticks(np.arange(len(label_counts))+0.5)
_ = ax1.set_xticklabels(label_counts.index, rotation = 90)

In [None]:
# Split 'Finding Labels'
all_xray_df['Finding Labels'] = all_xray_df['Finding Labels'].map(lambda x: x.replace('No Finding', ''))
all_labels = np.unique(list(chain(*all_xray_df['Finding Labels'].map(lambda x: x.split('|')).tolist())))
print('All Labels', all_labels)

In [None]:
# Dummy encode disease classifications
for c_label in all_labels:
    if len(c_label)>1: # leave out empty labels
        all_xray_df[c_label] = all_xray_df['Finding Labels'].map(lambda finding: 1.0 if c_label in finding else 0)
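
The loop above builds one indicator column per label. As an alternative sketch (not the method this notebook uses), pandas provides `Series.str.get_dummies`, which splits on a separator and encodes in one call, matching whole labels rather than substrings. The toy frame below is invented for illustration.

```python
import pandas as pd

# Toy frame mimicking the 'Finding Labels' column after 'No Finding' is blanked out
toy_df = pd.DataFrame({"Finding Labels": ["Effusion|Mass", "", "Mass", "Hernia"]})

# Split on '|' and build one indicator column per distinct label;
# empty strings produce a row of zeros and no column of their own
one_hot = toy_df["Finding Labels"].str.get_dummies(sep="|").astype(float)

print(one_hot.columns.tolist())  # ['Effusion', 'Hernia', 'Mass']
toy_df = pd.concat([toy_df, one_hot], axis=1)
```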

### Create Normal Subset
Create a random sample of normal X-ray data (of size `SAMPLE_SIZE`).

In [None]:
# Create a subset of SAMPLE_SIZE normal samples and set it aside
normal_xray_df = all_xray_df[all_xray_df['Finding Labels']=='']
normal_xray_df = normal_xray_df.sample(SAMPLE_SIZE)
# Every row here has an empty label, so relabel them all as 'Normal'
normal_xray_df['Finding Labels'] = 'Normal'

### Create Diseased Subset
Create sample of diseased Xray data that preserves original distribution (of size `SAMPLE_SIZE`)

In [None]:
# Calculate disease sample weights
dis_xray_df = all_xray_df[all_xray_df['Finding Labels']!='']
# Make a subset of the diseased samples that preserves the original distribution
# Weight is 0.1 + number of findings, so images with more findings are more likely to be drawn
sample_weights = dis_xray_df['Finding Labels'].map(lambda x: len(x.split('|')) if len(x)>0 else 0).values + 1e-1
sample_weights /= sample_weights.sum()
dis_xray_df = dis_xray_df.sample(SAMPLE_SIZE, weights=sample_weights)
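
To make the weighting concrete, here is a small sketch of the same calculation on three made-up label strings. The names `toy_labels`, `n_findings`, and `weights` are illustrative only, and `random_state` is added here purely so the toy draw is repeatable.

```python
import pandas as pd

# Three toy label strings with 1, 2 and 3 findings respectively
toy_labels = pd.Series(["Mass", "Effusion|Mass", "Effusion|Mass|Nodule"])

# Number of findings = number of '|' separators + 1
n_findings = toy_labels.str.count(r"\|") + 1

# Same rule as above: weight = number of findings + 0.1, normalized to sum to 1
weights = n_findings + 0.1
weights = weights / weights.sum()
print(weights.round(3).tolist())

# Draw two rows without replacement using those weights
toy_sample = toy_labels.sample(2, weights=weights, random_state=0)
```

The raw weights are 1.1, 2.1 and 3.1, so the three-finding row is almost three times as likely to be drawn first as the single-finding row.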

In [None]:
# Visualize distribution of most popular diseases in sample population
label_counts = dis_xray_df['Finding Labels'].value_counts()[:15]
fig, ax1 = plt.subplots(1,1,figsize = (12, 8))
ax1.bar(np.arange(len(label_counts))+0.5, label_counts)
ax1.set_xticks(np.arange(len(label_counts))+0.5)
_ = ax1.set_xticklabels(label_counts.index, rotation = 90)

### Combine and Save Tabular Data
Combine `SAMPLE_SIZE` of normal and `SAMPLE_SIZE` of diseased tabular data and save to HDF5.

In [None]:
# Combine samples
# Shuffle samples with .sample(frac=1) and drop indices
final_xray_df = pd.concat([dis_xray_df, normal_xray_df], axis = 0).sample(frac=1).reset_index(drop=True)
final_xray_df.sample(3)
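
A quick way to confirm the 50:50 split after combining and shuffling is to check the share of rows labelled 'Normal'. The sketch below does this on two tiny made-up frames; `random_state` is an addition for repeatability and is not used in the cell above.

```python
import pandas as pd

# Toy stand-ins for the diseased and normal subsets (three rows each)
toy_diseased = pd.DataFrame({"Finding Labels": ["Mass", "Effusion", "Hernia"]})
toy_normal = pd.DataFrame({"Finding Labels": ["Normal"] * 3})

# Concatenate, shuffle every row, and rebuild a clean 0..n-1 index
combined = (
    pd.concat([toy_diseased, toy_normal], axis=0)
    .sample(frac=1, random_state=0)
    .reset_index(drop=True)
)

# Fraction of normal rows; a balanced dataset gives 0.5
is_normal = combined["Finding Labels"] == "Normal"
print(is_normal.mean())  # 0.5
```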

In [None]:
# Write tabular data to HDF5 file
write_df_as_hdf(f'{FILE_NAME}.h5', final_xray_df)

In [None]:
# Show breakdown of tabular data 
with h5py.File(f'{FILE_NAME}.h5', 'r') as h5_data:
    for c_key in h5_data.keys():
        print(c_key, h5_data[c_key].shape, h5_data[c_key].dtype)

### Create Image Subset and Save
Collect relevant images (referred to in tabular dataset) and add to HDF5.

In [None]:
# Show example Xray image
test_img = imread_and_normalize(all_xray_df['path'].values[0])
plt.matshow(test_img[:,:,0])

In [None]:
final_xray_df.shape

In [None]:
# preallocate output
out_image_arr = np.zeros((final_xray_df.shape[0],)+OUT_DIM+(1,), dtype=np.uint8)
if False:
    # a difficult to compress array for size approximations
    out_image_arr = np.random.uniform(0, 255,
                                  size = (final_xray_df.shape[0],)+OUT_DIM+(1,)).astype(np.uint8)

In [None]:
for i, c_path in enumerate(tqdm(final_xray_df['path'].values)):
    out_image_arr[i] = imread_and_normalize(c_path)

In [None]:
# Append the array
with h5py.File(f'{FILE_NAME}.h5', 'a') as h5_data:
    h5_data.create_dataset('images', data = out_image_arr, compression = None) # compression takes too long
    for c_key in h5_data.keys():
        print(c_key, h5_data[c_key].shape, h5_data[c_key].dtype)

In [None]:
print('Output File-size %2.2fMB' % (os.path.getsize(f'{FILE_NAME}.h5')/1e6))

### Next Steps
To read **this** HDF5 file into your own environment, set up your Kaggle credentials in your working directory and run:

`!kaggle kernels output abbymorgan/create-mini-xray-dataset-binary-classification -p path/to/dest`

If you copy this notebook and make **your own subset** of the original dataset, to load your HDF5 file:

- Make sure to **commit** the final version of your notebook ([see here for more info on committing to Kaggle](https://www.kaggle.com/general/224266)).
- Navigate to the 'Data' tab of your saved notebook and copy the bash command at the bottom of the page.
- Paste the bash command in your working directory containing a `kaggle.json` with your Kaggle credentials. 

For more information on how to read HDF5 files, see the following resources:

- `[colab link here]`
- [h5py Documentation](https://docs.h5py.org/en/stable/quick.html)
- [How to Read HDF5 Files in Python](https://www.pythonforthelab.com/blog/how-to-use-hdf5-files-in-python/), Python For the Lab
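
As a rough sketch of what reading such a file looks like with h5py, the example below writes a tiny stand-in file to a temporary directory and reads it back. The file name `toy.h5` and its contents are invented for illustration; a file produced by this notebook would contain an `images` dataset plus one dataset per tabular column.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.h5")  # hypothetical file name

    # Write a miniature file with the same layout: images plus a label column
    with h5py.File(toy_path, "w") as h5_out:
        h5_out.create_dataset("images", data=np.zeros((4, 128, 128, 1), dtype=np.uint8))
        h5_out.create_dataset(
            "Finding Labels",
            data=np.array([b"Normal", b"Mass", b"Normal", b"Effusion"]),
        )

    # Read it back: [:] loads a dataset into memory as a NumPy array
    with h5py.File(toy_path, "r") as h5_in:
        images = h5_in["images"][:]
        labels = h5_in["Finding Labels"][:].astype(str)  # bytes -> text

# Binary target for the diseased vs. healthy task
is_diseased = labels != "Normal"
print(images.shape, is_diseased.tolist())
```

Text columns are stored as byte strings, so they are converted to text with `.astype(str)` before the comparison.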