# Exploratory Data Analysis (EDA)

The dataset is organized into three folders (train, test, val), each containing one subfolder per image category (PNEUMONIA/NORMAL). According to the dataset description, it holds 5,863 chest X-ray images in JPEG format.

Chest X-ray images (anterior-posterior) were selected from retrospective cohorts of pediatric patients aged one to five years at Guangzhou Women and Children’s Medical Center, Guangzhou. All chest X-ray imaging was performed as part of patients’ routine clinical care.

For the analysis of chest X-ray images, all chest radiographs were initially screened for quality control by removing all low-quality or unreadable scans. The diagnoses for the images were then graded by two expert physicians before being cleared for training the AI system. To account for any grading errors, the evaluation set was also checked by a third expert.

Dataset: https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia

## Imports 

In [1]:
import numpy as np
import pandas as pd
import cv2
import glob

## Loading the datasets with the default split (train, val, test)

### Loader 

In [2]:
import os

def load_split(split_path: str) -> pd.DataFrame:
    """Collect image paths and class labels for one split directory."""
    image_paths = []
    labels = []

    # Each subfolder of the split directory holds one class (NORMAL / PNEUMONIA).
    for class_dir in glob.glob(os.path.join(split_path, '*')):
        for image_path in glob.glob(os.path.join(class_dir, '*')):
            image_paths.append(image_path)
            # The label is the name of the folder that contains the image.
            labels.append(os.path.basename(class_dir))

    return pd.DataFrame({'image_path': image_paths, 'target': labels})

### Default dataset split - **train**

In [3]:
train_deafult_df = load_split('../data/chest_xray/train')
train_deafult_df

| | image_path | target |
|---|---|---|
| 0 | ../data/chest_xray/train/NORMAL/IM-0408-0001.jpeg | NORMAL |
| 1 | ../data/chest_xray/train/NORMAL/IM-0541-0001.jpeg | NORMAL |
| 2 | ../data/chest_xray/train/NORMAL/NORMAL2-IM-053... | NORMAL |
| 3 | ../data/chest_xray/train/NORMAL/NORMAL2-IM-133... | NORMAL |
| 4 | ../data/chest_xray/train/NORMAL/IM-0523-0001-0... | NORMAL |
| ... | ... | ... |
| 5211 | ../data/chest_xray/train/PNEUMONIA/person1349_... | PNEUMONIA |
| 5212 | ../data/chest_xray/train/PNEUMONIA/person1312_... | PNEUMONIA |
| 5213 | ../data/chest_xray/train/PNEUMONIA/person1411_... | PNEUMONIA |
| 5214 | ../data/chest_xray/train/PNEUMONIA/person703_b... | PNEUMONIA |


### Default dataset split - **validation**

In [4]:
val_deafult_df = load_split('../data/chest_xray/val')
val_deafult_df

| | image_path | target |
|---|---|---|
| 0 | ../data/chest_xray/val/NORMAL/NORMAL2-IM-1431-... | NORMAL |
| 1 | ../data/chest_xray/val/NORMAL/NORMAL2-IM-1440-... | NORMAL |
| 2 | ../data/chest_xray/val/NORMAL/NORMAL2-IM-1427-... | NORMAL |
| 3 | ../data/chest_xray/val/NORMAL/NORMAL2-IM-1438-... | NORMAL |
| 4 | ../data/chest_xray/val/NORMAL/NORMAL2-IM-1430-... | NORMAL |
| 5 | ../data/chest_xray/val/NORMAL/NORMAL2-IM-1437-... | NORMAL |
| 6 | ../data/chest_xray/val/NORMAL/NORMAL2-IM-1436-... | NORMAL |
| 7 | ../data/chest_xray/val/NORMAL/NORMAL2-IM-1442-... | NORMAL |
| 8 | ../data/chest_xray/val/PNEUMONIA/person1946_ba... | PNEUMONIA |
| 9 | ../data/chest_xray/val/PNEUMONIA/person1950_ba... | PNEUMONIA |


### Default dataset split - **test**

In [5]:
test_deafult_df = load_split('../data/chest_xray/test')
test_deafult_df

| | image_path | target |
|---|---|---|
| 0 | ../data/chest_xray/test/NORMAL/NORMAL2-IM-0058... | NORMAL |
| 1 | ../data/chest_xray/test/NORMAL/NORMAL2-IM-0278... | NORMAL |
| 2 | ../data/chest_xray/test/NORMAL/IM-0033-0001.jpeg | NORMAL |
| 3 | ../data/chest_xray/test/NORMAL/NORMAL2-IM-0373... | NORMAL |
| 4 | ../data/chest_xray/test/NORMAL/IM-0027-0001.jpeg | NORMAL |
| ... | ... | ... |
| 619 | ../data/chest_xray/test/PNEUMONIA/person1676_v... | PNEUMONIA |
| 620 | ../data/chest_xray/test/PNEUMONIA/person1649_v... | PNEUMONIA |
| 621 | ../data/chest_xray/test/PNEUMONIA/person150_ba... | PNEUMONIA |
| 622 | ../data/chest_xray/test/PNEUMONIA/person124_ba... | PNEUMONIA |


### Default dataset split - size comparison 

In [6]:
print(f'Train split size: {len(train_deafult_df)}')
print(f'Validation split size: {len(val_deafult_df)}')
print(f'Test split size: {len(test_deafult_df)}')

Train split size: 5216
Validation split size: 16
Test split size: 624


### Default dataset split - class occurrence comparison

In [7]:
train_deafult_df.target.value_counts()

PNEUMONIA    3875
NORMAL       1341
Name: target, dtype: int64

In [8]:
val_deafult_df.target.value_counts()

NORMAL       8
PNEUMONIA    8
Name: target, dtype: int64

In [9]:
test_deafult_df.target.value_counts()

PNEUMONIA    390
NORMAL       234
Name: target, dtype: int64

### Default dataset split - conclusions
* The default split is unusable as provided: the validation set has only 16 images, about 0.3% of the size of the training set. We therefore use our own data manager class to re-split the data, with a training-to-validation ratio of 4:1.
* Pneumonia cases outnumber normal cases by roughly 3 to 1 in the training set (3,875 vs. 1,341), so data augmentation is needed to offset the imbalance.
* The test set is left unchanged: it is provided with the dataset and there is no need to modify it.
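
The figures quoted in these conclusions follow directly from the counts printed in the previous cells. A quick check, using only those reported numbers (the variable names are illustrative and not part of the notebook):

```python
# Counts copied from the value_counts() outputs above.
train_counts = {"PNEUMONIA": 3875, "NORMAL": 1341}
val_counts = {"PNEUMONIA": 8, "NORMAL": 8}

train_size = sum(train_counts.values())
val_size = sum(val_counts.values())

# Size of the validation set relative to the training set, in percent.
val_vs_train_pct = 100 * val_size / train_size

# Number of pneumonia images per normal image in the training set.
imbalance_ratio = train_counts["PNEUMONIA"] / train_counts["NORMAL"]

print(f"Validation / train: {val_vs_train_pct:.1f}%")
print(f"PNEUMONIA : NORMAL in train = {imbalance_ratio:.2f} : 1")
```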

## Loading the datasets with a custom split (train, val, test) using PneumoniaDataManager

In [10]:
import sys
sys.path.insert(0, '../tools')

from data_tools import PneumoniaDataManager

In [11]:
pdm = PneumoniaDataManager('../data/chest_xray', val_size=0.2)   
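
`PneumoniaDataManager` is a project-specific class whose source is not shown in this notebook. Judging only from the sizes it reports in the following cells (4185 + 1047 = 5232 = 5216 + 16), it appears to pool the original train and validation images and re-split them with `val_size=0.2`. The sketch below shows one way such a class-stratified split could be done in plain Python. `stratified_split`, its parameters, and the toy file names are hypothetical and are not the actual implementation.

```python
import random
from collections import defaultdict

def stratified_split(records, val_size=0.2, seed=0):
    """Split (image_path, target) pairs into train/val, class by class.

    Hypothetical helper: each class contributes roughly `val_size` of its
    images to the validation set, so class proportions are preserved.
    """
    by_class = defaultdict(list)
    for path, target in records:
        by_class[target].append((path, target))

    rng = random.Random(seed)
    train, val = [], []
    for target in sorted(by_class):
        items = by_class[target]
        rng.shuffle(items)
        # Number of validation images taken from this class.
        n_val = round(len(items) * val_size)
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val

# Toy data with made-up file names and a 3:1 class imbalance.
records = [(f"p{i}.jpeg", "PNEUMONIA") for i in range(30)]
records += [(f"n{i}.jpeg", "NORMAL") for i in range(10)]

train, val = stratified_split(records, val_size=0.2)
print(len(train), len(val))
```

Splitting per class rather than over the whole pool keeps the PNEUMONIA to NORMAL ratio the same in both parts, which matters when one class is about three times larger than the other.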

### PDM dataset split - size comparison

In [12]:
print(f"Train split size: {len(pdm.splited_df_dict['train'])}")
print(f"Validation split size: {len(pdm.splited_df_dict['val'])}")
print(f"Test split size: {len(pdm.splited_df_dict['test'])}")

Train split size: 4185
Validation split size: 1047
Test split size: 624


### PDM dataset split - class occurrence comparison

In [15]:
pdm.splited_df_dict['train'].target.value_counts()

PNEUMONIA    3106
NORMAL       1079
Name: target, dtype: int64

In [16]:
pdm.splited_df_dict['val'].target.value_counts()

PNEUMONIA    777
NORMAL       270
Name: target, dtype: int64

In [17]:
pdm.splited_df_dict['test'].target.value_counts()

PNEUMONIA    390
NORMAL       234
Name: target, dtype: int64

### PDM dataset split - conclusions 
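* The custom split looks good: the validation set is now about 20% of the pooled train and validation images, and the train and validation sets have nearly the same PNEUMONIA to NORMAL ratio (about 2.9:1).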
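
The counts printed above can be used to check the validation fraction and the class balance of each split. This is a small sketch over the reported numbers only; the dictionary layout and names are illustrative.

```python
# Class counts copied from the value_counts() outputs above.
splits = {
    "train": {"PNEUMONIA": 3106, "NORMAL": 1079},
    "val": {"PNEUMONIA": 777, "NORMAL": 270},
    "test": {"PNEUMONIA": 390, "NORMAL": 234},
}

# Share of the pooled train + val images that ended up in validation.
train_val_total = sum(splits["train"].values()) + sum(splits["val"].values())
val_fraction = sum(splits["val"].values()) / train_val_total

# Fraction of pneumonia images within each split.
pneumonia_share = {
    name: counts["PNEUMONIA"] / sum(counts.values())
    for name, counts in splits.items()
}

print(f"val fraction of train+val: {val_fraction:.3f}")
for name, share in pneumonia_share.items():
    print(f"{name}: {share:.1%} PNEUMONIA")
```

The untouched test set has a different class balance from train and validation, which is worth keeping in mind when comparing validation and test metrics.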