# OES Data Preparation

**Please read this notebook on [Kaggle](https://www.kaggle.com/matthewturnerirl/oes-data-prep/notebook)**


This notebook demonstrates and justifies the data preparation steps we applied to our data. We set up generators for the data, and within those generators we re-scale and re-size the images before they are fed into our model.

In [23]:
# Import statements
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import keras

## Directory Setup
Due to the large size of our dataset, we chose to store our data in the cloud on Kaggle. The directories are structured as follows:

```
├── chest-xray                    <- Top level directory
│   ├── test                      <- Test set images
│   │   ├── Normal                <- Normal lung photos      
│   │   │   └── ...
│   │   └──  Pneumonia            <- Pneumonia lung photos
│   │   │   └── ...
│   ├── train                     <- Training set images
│   │   ├── Normal                <- Normal lung photos 
│   │   │   └── ...
│   │   └──  Pneumonia            <- Pneumonia lung photos
│   │   │   └── ...
│   ├── val                       <- Validation set images
│   │   ├── Normal                <- Normal lung photos 
│   │   │   └── ...
│   │   └──  Pneumonia            <- Pneumonia lung photos
│   │   │   └── ...             

```
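As a quick sanity check on a layout like the one above, the files in each split and class folder can be counted before any generator is built. The sketch below is illustrative only: `count_images` is a hypothetical helper, not part of the original notebook, and the demo runs on a throwaway directory of empty placeholder files rather than the real dataset.

```python
import tempfile
from pathlib import Path


def count_images(root):
    """Count files per split and per class under a split/class directory layout."""
    counts = {}
    for split_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        counts[split_dir.name] = {
            class_dir.name: sum(1 for f in class_dir.iterdir() if f.is_file())
            for class_dir in sorted(split_dir.iterdir())
            if class_dir.is_dir()
        }
    return counts


# Demo on a temporary tree with empty placeholder files (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    layout = [("train", "NORMAL", 2), ("train", "PNEUMONIA", 3), ("val", "NORMAL", 1)]
    for split, label, n_files in layout:
        class_dir = Path(tmp) / split / label
        class_dir.mkdir(parents=True)
        for i in range(n_files):
            (class_dir / f"img_{i}.jpeg").touch()
    demo_counts = count_images(tmp)

print(demo_counts)
```

On Kaggle, the same helper could be pointed at the dataset's top-level directory to compare class balance across the training, test, and validation splits.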

This setup allows us to use Keras's `ImageDataGenerator` to load our data. We chose to use a generator for three reasons:
- Saving memory and disk space by not downloading the dataset
- Integrating the preprocessing into our modeling process
- Easy re-sizing and re-scaling of images

Using generators also makes our results easier to reproduce, since images fed into the model this way do not have to be preprocessed beforehand.
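The memory argument comes from generators producing one batch at a time instead of holding the whole dataset at once. The following is a minimal, library-free sketch of that idea, not Keras's actual implementation; `iter_batches` and the file names are hypothetical and used only for illustration.

```python
from typing import Iterator, List, Sequence


def iter_batches(paths: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive batches of file paths; each batch is built only when requested."""
    for start in range(0, len(paths), batch_size):
        yield list(paths[start:start + batch_size])


# Hypothetical file names standing in for image paths.
paths = [f"img_{i}.jpeg" for i in range(45)]
batch_sizes = [len(batch) for batch in iter_batches(paths, batch_size=20)]
print(batch_sizes)  # [20, 20, 5]
```

A real image generator would additionally load, re-size, and re-scale the files in each batch, so only one batch of decoded images has to sit in memory at a time.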

In [24]:
# Instantiating one generator per split and rescaling pixel values from [0, 255] to [0, 1]
traingen = keras.preprocessing.image.ImageDataGenerator(rescale=1/255)
testgen = keras.preprocessing.image.ImageDataGenerator(rescale=1/255)
valgen = keras.preprocessing.image.ImageDataGenerator(rescale=1/255)

# Creating the generator for the training data
train_data = traingen.flow_from_directory(
    # Specifying location of training data
    directory='../input/chest-xray-pneumonia/chest_xray/train',
    # Re-sizing images to 150x150
    target_size=(150, 150),
    # Binary class mode: the two directories "NORMAL" and "PNEUMONIA" are recognized as the labels
    class_mode='binary',
    batch_size=20,
    seed=42
)
# Creating the generator for the testing data
test_data = testgen.flow_from_directory(
    # Specifying location of testing data
    directory='../input/chest-xray-pneumonia/chest_xray/test',
    # Re-sizing images to 150x150
    target_size=(150, 150),
    # Binary class mode: the two directories "NORMAL" and "PNEUMONIA" are recognized as the labels
    class_mode='binary',
    batch_size=20,
    seed=42
)

# Setting aside a validation set
val_data = valgen.flow_from_directory(
    # Specifying location of validation data
    directory='../input/chest-xray-pneumonia/chest_xray/val',
    # Re-sizing images to 150x150
    target_size=(150, 150),
    # Binary class mode: the two directories "NORMAL" and "PNEUMONIA" are recognized as the labels
    class_mode='binary',
    batch_size=20,
    seed=42
)
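Two of the settings used above can be illustrated without the dataset. `rescale=1/255` multiplies every pixel value by 1/255, and when no explicit class list is given, `flow_from_directory` assigns label indices to the class subdirectories in alphanumeric order. The sketch below mimics both with plain NumPy and Python on synthetic values; it does not call Keras, and the pixel array is made up for illustration.

```python
import numpy as np

# Synthetic 8-bit pixel values standing in for an image (assumption: not real data).
pixels = np.array([[0, 128, 255]], dtype="uint8")

# The same multiplication that rescale=1/255 applies to each image.
rescaled = pixels * (1 / 255)
print(rescaled.dtype, rescaled.min(), rescaled.max())

# Mimic the alphanumeric ordering of class subdirectories.
class_names = ["PNEUMONIA", "NORMAL"]
class_indices = {name: idx for idx, name in enumerate(sorted(class_names))}
print(class_indices)  # {'NORMAL': 0, 'PNEUMONIA': 1}
```

On the real generators, the mapping that was actually used can be read from the `class_indices` attribute, for example `train_data.class_indices`.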

## Visualize Transformation
We visualize the first 10 images from one batch of the training data to check that the transformations were applied correctly.

In [25]:
# Draw one batch: train_batch[0] holds the images, train_batch[1] the labels
train_batch = next(train_data)
fig, axes = plt.subplots(2, 5, figsize=(16, 8))

for i in range(10):
    # Undo the 1/255 rescaling and convert to uint8 for display
    img = np.array(train_batch[0][i] * 255, dtype='uint8')
    ax = axes[i // 5, i % 5]
    ax.imshow(img)
fig.suptitle('Training Images')
plt.tight_layout()
plt.show()

The plot gives visual confirmation that the images have been re-sized to 150x150. We can now move on to the modeling phase.
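The visual check could be complemented by a programmatic one on the batch arrays. The sketch below is an assumption-laden example rather than part of the original notebook: `check_batch` is a hypothetical helper, and it is exercised here on random synthetic arrays shaped like one batch (20 RGB images of 150x150) instead of real X-rays.

```python
import numpy as np


def check_batch(images, labels, target_size=(150, 150)):
    """Return True if a batch has the expected image size, value range, and binary labels."""
    right_size = tuple(images.shape[1:3]) == tuple(target_size)
    in_range = images.min() >= 0.0 and images.max() <= 1.0
    binary_labels = np.isin(labels, [0, 1]).all()
    return bool(right_size and in_range and binary_labels)


# Synthetic stand-in for one generator batch (assumption: random data, not real images).
rng = np.random.default_rng(42)
fake_images = rng.random((20, 150, 150, 3), dtype=np.float32)
fake_labels = rng.integers(0, 2, size=20).astype("float32")

batch_ok = check_batch(fake_images, fake_labels)
print(batch_ok)
```

With the real generator, the same function could be applied to the arrays returned by `next(train_data)`.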