# Data Exploration

The notebook explores the NIH Chest X-ray dataset as described below:

>Potential Dataset 2 : NIH Chest X-ray dataset

>Link: https://academictorrents.com/details/e615d3aebce373f1dc8bd9d11064da55bdadede0

>Size: 112,120 frontal-view X-ray images of 30,805 unique patients

>Color Space: Grayscale

>Resolution: 1024 x 1024 pixels

>Downstream task: Classification of 14 common thoracic pathologies: Atelectasis, Consolidation, Infiltration, Pneumothorax, Edema, Emphysema, Fibrosis, Effusion, Pneumonia, Pleural_Thickening, Cardiomegaly, Nodule, Mass and Hernia.

>Benchmark: https://arxiv.org/abs/1705.02315

Note that each image can carry multiple labels, so this is a multi-label classification problem. I wonder whether the multi-label classification task is too complicated.
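
One common way to handle several findings per image is to treat every pathology as an independent yes/no decision and train with a per-label sigmoid and binary cross-entropy. The sketch below only illustrates that idea with NumPy; the helper names `encode_findings` and `multilabel_bce` are hypothetical and are not part of this notebook or of the benchmark code.

```python
import numpy as np

# The 14 pathology names from the dataset description above
# (spelled as in the 'Finding Labels' column of the CSV).
LABELS = ["Atelectasis", "Consolidation", "Infiltration", "Pneumothorax",
          "Edema", "Emphysema", "Fibrosis", "Effusion", "Pneumonia",
          "Pleural_Thickening", "Cardiomegaly", "Nodule", "Mass", "Hernia"]

def encode_findings(finding_labels):
    """Turn a string such as 'Cardiomegaly|Effusion' into a 14-element 0/1 vector.

    'No Finding' matches none of the names, so it maps to the all-zero vector.
    """
    present = set(finding_labels.split("|"))
    return np.array([1.0 if name in present else 0.0 for name in LABELS])

def multilabel_bce(logits, targets):
    """Mean binary cross-entropy with one independent sigmoid per label."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Numerically stable form of -[t*log(s(x)) + (1-t)*log(1-s(x))]
    loss = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return loss.mean()

target = encode_findings("Cardiomegaly|Effusion")
loss_at_zero_logits = multilabel_bce(np.zeros(len(LABELS)), target)
```

With this framing the model output is just a 14-element vector per image, so the multi-label setup adds little complexity compared with a single-label classifier.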





In [6]:
from __future__ import absolute_import, division, print_function, unicode_literals
import warnings
warnings.filterwarnings("ignore")  # silence library warnings in the notebook output
from scipy.io import loadmat
from pprint import pprint
import pandas as pd
import numpy as np
import torch
import torchvision
import torchvision.transforms as transforms
import torchvision.utils as vutils
import matplotlib.pyplot as plt
# XRayDataSet is used in section 1; uncomment once dataloaders.chest_xray is importable
#from dataloaders.chest_xray import XRayDataSet
import tensorflow as tf
from PIL import Image
# newer releases of this package are published as ydata_profiling
from pandas_profiling import ProfileReport

## 1. Import Dataset










In [None]:
import sys  # needed for tf.print(..., output_stream=sys.stdout)

use_cache = False
data_frame_path = "../data/01_raw/Data_Entry_2017.csv"
img_data_path = "../data/01_raw/images-224"
config = dict()
scratch_dir = None
cache_dir = "."  # assumption: cache_dir was never defined in the original cell
batch_size = 128
buffer_size = batch_size * 2

# XRayDataSet comes from dataloaders.chest_xray (its import is commented out above)
# and is assumed to behave like a tf.data.Dataset.
train_ds = XRayDataSet(img_data_path, data_frame_path, config=config, scratch_dir=scratch_dir)

if use_cache:
    # cache individual examples before shuffling so every epoch is reshuffled
    train_ds = train_ds.cache(cache_dir + "/tf_learn_cache")

# shuffle examples, then batch, and prefetch last so loading overlaps with compute
train_ds = train_ds \
    .shuffle(buffer_size) \
    .batch(batch_size) \
    .prefetch(tf.data.experimental.AUTOTUNE)

# print the first image of the first batch as a sanity check
for images, labels in train_ds:
    tf.print(images[0], output_stream=sys.stdout)
    print()
    print("Just printed image. Done!!")
    break

## 2. Visualize X-Ray Samples

In [None]:
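# NOTE: this cell looks like it was carried over from a different (ultrasound) notebook.
# It expects a DataFrame `df` with 'img', 'fat' and 'id' columns, where each 'img'
# is a 434x636 tensor. Data_Entry_2017.csv has none of these columns, so the cell
# cannot run against the NIH data as written.
img_shape = (1, 1, 434, 636)  # (batch, channel, height, width)
rows = range(430, 440)
for i in rows[1:]:
    print(df['fat'].iloc[i])
    print(df['id'].iloc[i])

# stack the ten single-channel images into one (10, 1, 434, 636) batch
my_img_grid = torch.cat([df['img'].iloc[i].view(*img_shape) for i in rows])

def imshow(img):
    """Display a (C, H, W) image tensor with matplotlib."""
    plt.figure(figsize=(20, 20))
    plt.axis("off")
    plt.title("10 Ultrasound images of the same person")
    #img = img / 2 + 0.5     # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))  # (C, H, W) -> (H, W, C)
    plt.show()

# show images
imshow(torchvision.utils.make_grid(my_img_grid.type(torch.int), nrow=5))
# print("The shape of the images is", images.shape)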
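
For the X-ray images themselves, a grid can be assembled without going through tensors. The following is a minimal sketch, not the notebook's method: `make_image_grid` is a hypothetical helper, and the commented usage assumes that `../data/01_raw/images-224` contains one file per `Image Index` value, which has not been verified here.

```python
import numpy as np

def make_image_grid(images, per_row=5, pad=2):
    """Tile equally sized 2-D grayscale images into one 2-D uint8 array."""
    arrays = [np.asarray(im, dtype=np.uint8) for im in images]
    if not arrays:
        raise ValueError("need at least one image")
    h, w = arrays[0].shape
    n_rows = -(-len(arrays) // per_row)  # ceiling division
    # black canvas with `pad` pixels around and between the tiles
    grid = np.zeros((n_rows * (h + pad) + pad, per_row * (w + pad) + pad), dtype=np.uint8)
    for idx, arr in enumerate(arrays):
        r, c = divmod(idx, per_row)
        top = pad + r * (h + pad)
        left = pad + c * (w + pad)
        grid[top:top + h, left:left + w] = arr
    return grid

# Synthetic stand-ins for ten small grayscale images
demo_grid = make_image_grid([np.full((4, 6), i) for i in range(10)], per_row=5, pad=1)

# Hypothetical usage with the real files (paths are an assumption):
# from PIL import Image
# import matplotlib.pyplot as plt
# paths = ["../data/01_raw/images-224/" + name for name in df["Image Index"].head(10)]
# grid = make_image_grid([Image.open(p).convert("L") for p in paths])
# plt.imshow(grid, cmap="gray"); plt.axis("off"); plt.show()
```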

## 3. Analyze Data Labels

There are 12 features (columns). The two bracketed header names contain a comma, so pandas splits each of them into two columns:
1. 'Image Index'
2. 'Finding Labels'
3. 'Follow-up #'
4. 'Patient ID'
5. 'Patient Age'
6. 'Patient Gender'
7. 'View Position'
8. 'OriginalImage[Width'
9. 'Height]'
10. 'OriginalImagePixelSpacing[x'
11. 'y]'
12. 'Unnamed: 11'
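
The split bracketed names are awkward to work with, so they could be renamed after loading. The sketch below is one possible cleanup: the new column names and the `tidy_columns` helper are hypothetical, and dropping 'Unnamed: 11' assumes that column is empty in every row, as it is in the first five.

```python
import io
import pandas as pd

# Hypothetical, more readable names for the four split header fields
RENAMES = {
    "OriginalImage[Width": "Original Width",
    "Height]": "Original Height",
    "OriginalImagePixelSpacing[x": "Pixel Spacing X",
    "y]": "Pixel Spacing Y",
}

def tidy_columns(df):
    """Rename the split bracketed columns and drop columns that are entirely empty."""
    renamed = df.rename(columns=RENAMES)
    return renamed.dropna(axis=1, how="all")

# One-row sample in the same layout as Data_Entry_2017.csv (note the trailing comma)
SAMPLE_CSV = (
    "Image Index,Finding Labels,Follow-up #,Patient ID,Patient Age,Patient Gender,"
    "View Position,OriginalImage[Width,Height],OriginalImagePixelSpacing[x,y],\n"
    "00000001_000.png,Cardiomegaly,0,1,058Y,M,PA,2682,2749,0.143,0.143,\n"
)
tidy = tidy_columns(pd.read_csv(io.StringIO(SAMPLE_CSV)))
print(list(tidy.columns))
```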


In [2]:
data_frame_path = "../data/01_raw/Data_Entry_2017.csv"
df = pd.read_csv(data_frame_path) 
print('There are ', len(df), ' labeled images')
df.head()

There are  112120  labeled images


| | Image Index | Finding Labels | Follow-up # | Patient ID | Patient Age | Patient Gender | View Position | OriginalImage[Width | Height] | OriginalImagePixelSpacing[x | y] | Unnamed: 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00000001_000.png | Cardiomegaly | 0 | 1 | 058Y | M | PA | 2682 | 2749 | 0.143 | 0.143 | |
| 1 | 00000001_001.png | Cardiomegaly\|Emphysema | 1 | 1 | 058Y | M | PA | 2894 | 2729 | 0.143 | 0.143 | |
| 2 | 00000001_002.png | Cardiomegaly\|Effusion | 2 | 1 | 058Y | M | PA | 2500 | 2048 | 0.168 | 0.168 | |
| 3 | 00000002_000.png | No Finding | 0 | 2 | 081Y | M | PA | 2500 | 2048 | 0.171 | 0.171 | |
| 4 | 00000003_000.png | Hernia | 0 | 3 | 081Y | F | PA | 2582 | 2991 | 0.143 | 0.143 | |


In [None]:
# Patient Age values carry a trailing 'Y' (e.g. '058Y').
# TODO: strip the 'Y' from every Patient Age value so the column becomes numeric.
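
The cell above is still empty. A minimal sketch of the intended cleanup, assuming every age is a number followed by 'Y' as in the rows shown earlier; the helper name `parse_age_years` is hypothetical, and any value that does not fit the pattern becomes NaN rather than raising:

```python
import pandas as pd

def parse_age_years(age):
    """Strip a trailing 'Y' and convert to numbers; unparseable values become NaN."""
    cleaned = age.astype(str).str.strip().str.replace(r"Y$", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

# Ages copied from the first rows of the DataFrame shown above
ages = parse_age_years(pd.Series(["058Y", "081Y", "081Y"]))
print(ages.tolist())

# Possible usage on the full DataFrame:
# df["Patient Age"] = parse_age_years(df["Patient Age"])
```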



### 3.1 Labels Exploration

In [3]:
# one-hot encode the '|'-separated findings into one 0/1 column per label
df_labels = pd.concat([df, df['Finding Labels'].str.get_dummies(sep='|')], axis=1)
df_labels.head()

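| | Image Index | Finding Labels | Follow-up # | Patient ID | Patient Age | Patient Gender | View Position | OriginalImage[Width | Height] | OriginalImagePixelSpacing[x | ... | Emphysema | Fibrosis | Hernia | Infiltration | Mass | No Finding | Nodule | Pleural_Thickening | Pneumonia | Pneumothorax |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 00000001_000.png | Cardiomegaly | 0 | 1 | 058Y | M | PA | 2682 | 2749 | 0.143 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 00000001_001.png | Cardiomegaly\|Emphysema | 1 | 1 | 058Y | M | PA | 2894 | 2729 | 0.143 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 00000001_002.png | Cardiomegaly\|Effusion | 2 | 1 | 058Y | M | PA | 2500 | 2048 | 0.168 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 00000002_000.png | No Finding | 0 | 2 | 081Y | M | PA | 2500 | 2048 | 0.171 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 00000003_000.png | Hernia | 0 | 3 | 081Y | F | PA | 2582 | 2991 | 0.143 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |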
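
Once the labels are one-hot encoded, two quick summaries help judge how hard the multi-label task is: how often each label occurs, and how many findings each image carries. The sketch below uses the five rows shown above as sample data; `label_summary` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# 'Finding Labels' values of the five rows displayed above
sample = pd.Series(
    ["Cardiomegaly", "Cardiomegaly|Emphysema", "Cardiomegaly|Effusion", "No Finding", "Hernia"],
    name="Finding Labels",
)

def label_summary(finding_labels):
    """Return (images per label, number of pathologies per image)."""
    dummies = finding_labels.str.get_dummies(sep="|")
    per_label = dummies.sum().sort_values(ascending=False)
    # 'No Finding' is not a pathology, so leave it out of the per-image count
    per_image = dummies.drop(columns="No Finding", errors="ignore").sum(axis=1)
    return per_label, per_image

per_label, per_image = label_summary(sample)
print(per_label)
print(per_image.value_counts().sort_index())
```

On the full data the same call would be `label_summary(df['Finding Labels'])`. A heavily skewed `per_label` would point to class imbalance, and `per_image` shows how common images with several findings actually are.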


In [4]:
profile = ProfileReport(df_labels, title='Pandas Profiling Report', explorative=True)

In [5]:
profile.to_file("chest_xray_exploration.html")
