# Data Exploration and Pre-Processing

This dataset contains chest x-ray images, some labeled 'NORMAL' and others labeled 'PNEUMONIA'; the latter come from patients who do have a case of the potentially lethal respiratory infection. There are 5856 images in total, in a range of dimensions, and all are greyscale. The dataset is sourced directly from [this Kaggle page](https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia) using the Kaggle API command `kaggle datasets download -d paultimothymooney/chest-xray-pneumonia`. The Kaggle page cites [Mendeley Data](https://data.mendeley.com/datasets/rscbjbr9sj/2) as its original data source.

## Data loading and sifting

In [1]:
# dependencies for data loading and sifting
import os
from keras.preprocessing.image import ImageDataGenerator
from PIL import Image

In [2]:
# file paths to data directories
train_dir = "data/train/"
test_dir = "data/test/"
val_dir = "data/val/"

In [3]:
# calculate class balance for each sample
train_class_balance = (len(os.listdir(train_dir+"NORMAL")),len(os.listdir(train_dir+"PNEUMONIA")))
test_class_balance = (len(os.listdir(test_dir+"NORMAL")),len(os.listdir(test_dir+"PNEUMONIA")))
val_class_balance = (len(os.listdir(val_dir+"NORMAL")),len(os.listdir(val_dir+"PNEUMONIA")))

# calculate volume for each sample (reuses the per-class counts from above)
train_total = sum(train_class_balance)
test_total = sum(test_class_balance)
val_total = sum(val_class_balance)

# calculate distribution of data volume between samples
sample_balance = (train_total,test_total,val_total)


print("Class balance for Train sample (Normal / Pneumonia): ",train_class_balance[0]/train_class_balance[1])
print("Class balance for Test sample (Normal / Pneumonia): ",test_class_balance[0]/test_class_balance[1])
print("Class balance for Validation sample (Normal / Pneumonia): ",val_class_balance[0]/val_class_balance[1])
print("Sampling distribution (train,test,val): ",(train_total/sum(sample_balance),test_total/sum(sample_balance),val_total/sum(sample_balance)))


Class balance for Train sample (Normal / Pneumonia):  0.3460645161290323
Class balance for Test sample (Normal / Pneumonia):  0.6
Class balance for Validation sample (Normal / Pneumonia):  1.0
Sampling distribution (train,test,val):  (0.8907103825136612, 0.10655737704918032, 0.00273224043715847)
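
The counts above come from `os.listdir`, which also counts any non-image files that happen to sit in a class folder. As a minimal sketch, assuming the same `data/<split>/<CLASS>/` layout, a small helper can count only image files per class. The name `count_images` and the extension set are hypothetical and not part of the original notebook.

```python
from pathlib import Path

# Assumed set of image extensions; adjust if the dataset uses others.
IMAGE_EXTENSIONS = {".jpeg", ".jpg", ".png"}


def count_images(split_dir, classes=("NORMAL", "PNEUMONIA")):
    """Return {class_name: number of image files} for one split directory."""
    split_path = Path(split_dir)
    return {
        name: sum(
            1
            for f in (split_path / name).iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
        for name in classes
    }
```

Calling `count_images("data/train/")` would return a dictionary with one count per class, which can feed both the class ratio and the split totals.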


The data comes already labeled and split three ways into train, test, and validation samples. The train sample is, naturally, the largest of the three with 5216 images, about 74% of which are cases of pneumonia, the target prediction. (The printed 0.346 is the Normal-to-Pneumonia ratio, so the pneumonia share is 1/(1 + 0.346).) The next largest is the test sample with 624 images, of which 62.5% are true cases of pneumonia. Finally, the validation set contains 16 images, with each class represented equally at 50%.
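
The values printed in the cell output are Normal-to-Pneumonia ratios, not percentages. The sketch below converts each ratio `r` into the fraction of pneumonia images, `1 / (1 + r)`. The function name `pneumonia_share` is an illustrative assumption; the ratios are copied from the output above.

```python
def pneumonia_share(normal_to_pneumonia_ratio):
    """Convert a Normal/Pneumonia count ratio into the pneumonia fraction."""
    return 1.0 / (1.0 + normal_to_pneumonia_ratio)


# Ratios as printed in the notebook's cell output.
ratios = {"train": 0.3460645161290323, "test": 0.6, "val": 1.0}

shares = {split: pneumonia_share(r) for split, r in ratios.items()}
for split, share in shares.items():
    print(f"{split}: {share:.1%} pneumonia")
```

This gives roughly 74.3% pneumonia in train, 62.5% in test, and 50.0% in validation, so pneumonia is the majority class in both the train and test samples.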

## Sanity Check: Can the images be displayed easily?

In [6]:
# an example of (first) a normal x-ray scan and (second) an x-ray scan showing pneumonia
Image.open("data/train/NORMAL/IM-0115-0001.jpeg").show()
Image.open("data/train/PNEUMONIA/person1001_bacteria_2932.jpeg").show()
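
Opening two images confirms they display, but it does not check the earlier statement that the images vary in size and are all greyscale. A minimal sketch of such a check with Pillow follows. The helper `summarize_images` and its `pattern` default are assumptions for illustration; mode `"L"` is Pillow's 8-bit greyscale mode.

```python
from collections import Counter
from pathlib import Path

from PIL import Image


def summarize_images(directory, pattern="*.jpeg"):
    """Tally color modes and (width, height) sizes for images in a directory."""
    modes, sizes = Counter(), Counter()
    for path in sorted(Path(directory).glob(pattern)):
        # Image.open reads the header lazily, so mode and size are cheap to get.
        with Image.open(path) as img:
            modes[img.mode] += 1
            sizes[img.size] += 1
    return modes, sizes
```

For example, `modes, sizes = summarize_images("data/train/NORMAL")` would show whether any mode other than `"L"` appears and how many distinct dimensions the folder contains.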

## Sanity Check: Is the data ready to fit a model?

In [None]:
# dependencies for modeling
from keras import layers
from keras import models
from keras import optimizers

In [5]:
# instantiating a data generator for each split sample
train_datagen = ImageDataGenerator(rescale=1./255)

test_datagen = ImageDataGenerator(rescale=1./255)

val_datagen = ImageDataGenerator(rescale=1./255)

train_data_generator = train_datagen.flow_from_directory(
                       train_dir,
                       target_size=(150,150),
                       batch_size=16,
                       class_mode='binary',
                       color_mode='grayscale')

test_data_generator = test_datagen.flow_from_directory(
                      test_dir,
                      target_size=(150,150),
                      batch_size=16,
                      class_mode='binary',
                      color_mode='grayscale')

val_data_generator = val_datagen.flow_from_directory(
                     val_dir,
                     target_size=(150,150),
                     batch_size=16,
                     class_mode='binary',
                     color_mode='grayscale')

Found 5216 images belonging to 2 classes.
Found 624 images belonging to 2 classes.
Found 16 images belonging to 2 classes.


In [6]:
# build a simple convolutional neural network
base_model = models.Sequential()
base_model.add(layers.Conv2D(32,(3,3),activation='relu',input_shape=(150,150,1)))
base_model.add(layers.MaxPooling2D((2, 2)))
base_model.add(layers.Flatten())
base_model.add(layers.Dense(1, activation='sigmoid'))
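
To get a feel for the size of this baseline, the layer shapes and parameter counts can be worked out by hand. The sketch below assumes the Keras defaults for these layers (stride 1 and `'valid'` padding for `Conv2D`, non-overlapping 2x2 windows for `MaxPooling2D`); the variable names are illustrative only.

```python
def conv2d_output(size, kernel):
    """Spatial size after a stride-1 convolution with 'valid' (no) padding."""
    return size - kernel + 1


side = conv2d_output(150, 3)                   # 150x150 input, 3x3 kernel
conv_params = 3 * 3 * 1 * 32 + 32              # kernel weights + one bias per filter
pooled_side = side // 2                        # 2x2 max pooling halves each side
flat_units = pooled_side * pooled_side * 32    # length of the flattened vector
dense_params = flat_units * 1 + 1              # one weight per unit + one bias
total_params = conv_params + dense_params

print("conv output:", (side, side, 32))
print("flattened units:", flat_units)
print("total trainable parameters:", total_params)
```

Almost all of the parameters sit in the final dense layer, because the flattened feature map is large. The totals should agree with what `base_model.summary()` reports.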

In [7]:
# compile the model from above; 'rmsprop' is the Keras default optimizer, stated explicitly here
base_model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['acc'])

In [8]:
# fit the model to the training data and validate with the test sample
# (batch size is already set by the generators, so it is not passed to fit)
base_model.fit(train_data_generator,
               epochs=10,
               steps_per_epoch=25,
               validation_data=test_data_generator,
               validation_steps=15
               )

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x26caf355d60>
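
Because `steps_per_epoch` and `validation_steps` count batches rather than images, it helps to work out how much data each epoch touches. The arithmetic below uses the batch size and image counts reported by the generators; the helper name `images_per_epoch` is an assumption for illustration.

```python
import math


def images_per_epoch(steps, batch_size):
    """Number of images drawn when running `steps` batches of `batch_size`."""
    return steps * batch_size


batch_size = 16
train_size, test_size = 5216, 624

train_seen = images_per_epoch(25, batch_size)          # training images per epoch
val_seen = images_per_epoch(15, batch_size)            # validation images per epoch
full_train_steps = math.ceil(train_size / batch_size)  # steps for one full pass
full_test_steps = math.ceil(test_size / batch_size)

print(f"train: {train_seen} of {train_size} images per epoch")
print(f"validation: {val_seen} of {test_size} images per epoch")
print(f"full pass: {full_train_steps} train steps, {full_test_steps} test steps")
```

With these settings each epoch draws 400 training images and validates on 240 test images, which keeps the sanity check fast. A full pass would need 326 training steps and 39 validation steps.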

## Conclusion
The base model successfully compiles and fits without any errors or warnings. The model also performs decently for a baseline model. I will continue tuning and optimizing the model in the [modeling notebook](/PROJECT-2/modeling.ipynb).