## Setting Dataset
In this notebook we will set up our dataset: we will read the images, apply all the preprocessing, and then save the result for later use.

### Importing Libraries
To read and resize the images we will need the OpenCV library (`cv2`), and we will also need the `numpy` library. After reading and resizing the images we will convert the list of images into a NumPy array so that it can be fed to a neural network. We will use the `pickle` library to store the NumPy arrays so that we can load them later.

In [13]:
import numpy as np
import matplotlib.pyplot as plt
import cv2
import os
import pickle
import random
from sklearn.model_selection import train_test_split

### Initializing variables
We will need some global variables which will be used in the notebook.

- Training dataset path.
- Testing dataset path.
- Total Classes in the dataset.
- Size of the image, height and width.

In [22]:
train_path = os.path.join(os.getcwd(), 'DataSet')
test_path = os.path.join(os.getcwd(), 'test')

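# List of classes (one sub-folder per class).
# os.listdir returns entries in arbitrary order, so sort to keep label indices stable across runs.
categories = sorted(os.listdir(train_path))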

# Image size width and height
IMG_SIZE = 125
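
Each class label used later is simply the position of the class name in `categories`, so the order of that list matters. The following is a minimal sketch of one way to pin the mapping down, assuming one sub-folder per class. The folder names and the `class_to_index` dictionary are hypothetical and only serve as an illustration.

```python
import os
import tempfile

# Build a throwaway dataset folder with hypothetical class names.
root = tempfile.mkdtemp()
for name in ["rust", "healthy", "blight"]:
    os.makedirs(os.path.join(root, name))
# A stray file that should not become a class.
open(os.path.join(root, "notes.txt"), "w").close()

# Keep only sub-folders, and sort them so every run gives the same order.
demo_categories = sorted(
    entry for entry in os.listdir(root)
    if os.path.isdir(os.path.join(root, entry))
)

# Explicit name -> label mapping (hypothetical helper dictionary).
class_to_index = {name: i for i, name in enumerate(demo_categories)}
print(class_to_index)
```

Saving such a mapping next to the pickled arrays makes it possible to translate predicted label numbers back into class names later.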

### Reading Dataset
This function will be used to read the images. We need to pass it the list of classes, the dataset path, and the maximum number of images to read for each class. The function first reads each image with **cv2.imread()** and then resizes it with **cv2.resize()**, which takes the target size as a (width, height) tuple; we pass **IMG_SIZE** for both.

For our project we will load the images in **Gray Scale**. A gray scale image has one channel instead of three, which makes the model computation cheaper than with an **RGB image**. The conversion happens while reading: passing **cv2.IMREAD_GRAYSCALE** as the second argument to **imread()** returns the image as a single-channel array.

In [6]:
def read_data(categories, data_path, max_images):
    """Read up to max_images gray scale images per class as [image, label] pairs."""
    data = []

    for class_num, category in enumerate(categories):
        path = os.path.join(data_path, category)
        for img in os.listdir(path)[:max_images]:
            img_path = os.path.join(path, img)
            # cv2.imread returns None (it does not raise) when a file cannot be read
            img_array = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
            if img_array is None:
                print(f"Skipping unreadable file: {img_path}")
                continue
            resized_array = cv2.resize(img_array, (IMG_SIZE, IMG_SIZE))
            data.append([resized_array, class_num])

    return data

### Setting Training Dataset

In [5]:
training_data = read_data(categories, train_path, 210)

Shuffle the training data. `read_data` returns the images grouped class by class, so shuffling mixes the classes before the data is split and fed to the model.

In [8]:
random.shuffle(training_data)
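
If the shuffle should give the same order on every run, one option is a seeded generator. This is a sketch on stand-in data, not a change to the cell above; the `pairs` list and the seed value 42 are assumptions for the example.

```python
import random

# Stand-in for training_data: [feature, label] pairs grouped by class.
pairs = [[f"img_{i}", i // 3] for i in range(9)]

# A dedicated, seeded generator gives the same order on every run
# without touching the global random state.
rng = random.Random(42)
shuffled = pairs[:]
rng.shuffle(shuffled)

# Features and labels stay paired because each pair moves as one item.
print(shuffled[:3])
```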

Create the `X_train` and `y_train` variables to store the training features and labels.

In [18]:
X_train = []
y_train = []

for feature, label in training_data:
    X_train.append(feature)
    y_train.append(label)

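In the section below we will reshape the training data into a NumPy array of shape (number of images, IMG_SIZE, IMG_SIZE, 1). The trailing 1 is the single gray scale channel; convolutional layers typically expect this explicit channel dimension.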

In [19]:
X_train = np.array(X_train).reshape(-1, IMG_SIZE, IMG_SIZE, 1)
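
The effect of the reshape is easiest to see on a tiny example. The sketch below uses four blank stand-in images rather than the real data.

```python
import numpy as np

IMG_SIZE = 125

# Four blank stand-in images, as a list of 2-D uint8 arrays.
images = [np.zeros((IMG_SIZE, IMG_SIZE), dtype=np.uint8) for _ in range(4)]

stacked = np.array(images)
print(stacked.shape)   # (4, 125, 125)

# -1 lets NumPy infer the number of images; the trailing 1 is the channel axis.
batch = stacked.reshape(-1, IMG_SIZE, IMG_SIZE, 1)
print(batch.shape)     # (4, 125, 125, 1)
```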

#### Splitting Training and Validation
Now we will split the data into training and validation sets. The validation set is held out from training, so comparing the training and validation results helps us detect overfitting and underfitting. Here 10% of the data is used for validation.

In [20]:
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=0.1, random_state=42)
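
With only 10% of the data held out, a plain random split can leave some classes with very few validation images. As an optional variation, `train_test_split` accepts a `stratify` argument that keeps the class proportions equal in both parts. The sketch below uses hypothetical stand-in data (5 classes, 20 samples each) to show the effect.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 5 classes with 20 samples each.
X = np.arange(100).reshape(100, 1)
y = [label for label in range(5) for _ in range(20)]

# stratify=y keeps the class proportions the same in both splits.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)

print(Counter(y_tr))  # 18 samples per class
print(Counter(y_va))  # 2 samples per class
```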

## Normalizing dataset
Now we will normalize the dataset. We read the images in gray scale mode, so each image is an array of integers from 0 to 255. Dividing each image by 255 scales the values to the range 0 to 1. Inputs on this scale usually make training more stable and help the model converge faster.

In [21]:
X_train = X_train.astype('float32')/255
X_valid = X_valid.astype('float32')/255
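
A quick check on three sample pixel values shows what the normalization does. Casting to `float32` first keeps the result at 4 bytes per value; dividing a `uint8` array directly would produce `float64`.

```python
import numpy as np

pixels = np.array([[0, 128, 255]], dtype=np.uint8)

# Convert to float32 before dividing, then scale into [0, 1].
scaled = pixels.astype('float32') / 255
print(scaled.dtype, scaled.min(), scaled.max())

# Dividing the uint8 array directly gives float64 instead.
print((pixels / 255).dtype)
```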

## Setting Testing Dataset
Now that the training dataset is set up, we will set up the testing dataset. The steps for reading and preparing the testing dataset are the same as for the training dataset.

In testing we will use 10 images for each class of the disease, so the model will be tested on 50 images in total.

In [23]:
testing_data = read_data(categories, test_path, 10)

In [26]:
X_test = []
y_test = []

for feature, label in testing_data:
    X_test.append(feature)
    y_test.append(label)

In [27]:
X_test = np.array(X_test).reshape(-1, IMG_SIZE, IMG_SIZE, 1)

In [28]:
X_test = X_test.astype('float32')/255

## Saving Testing, Training and Validation datasets 

In [31]:
# Make sure the output folder exists before writing
os.makedirs('Data', exist_ok=True)

with open('Data/X_train.pickle', 'wb') as f:
    pickle.dump(X_train, f)
    
with open('Data/X_valid.pickle', 'wb') as f:
    pickle.dump(X_valid, f)
    
with open('Data/X_test.pickle', 'wb') as f:
    pickle.dump(X_test, f)

In [32]:
with open('Data/y_train.pickle', 'wb') as f:
    pickle.dump(y_train, f)
    
with open('Data/y_valid.pickle', 'wb') as f:
    pickle.dump(y_valid, f)
    
with open('Data/y_test.pickle', 'wb') as f:
    pickle.dump(y_test, f)
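
To use the saved files later, they can be read back with `pickle.load`. The sketch below does a full round trip in a temporary folder with a small stand-in array; the folder and the file name `X_demo.pickle` are assumptions for the example, not files created by this notebook.

```python
import os
import pickle
import tempfile

import numpy as np

# Temporary output folder standing in for 'Data'.
out_dir = os.path.join(tempfile.mkdtemp(), "Data")
os.makedirs(out_dir, exist_ok=True)

# Small stand-in array with the same layout as X_train.
X_demo = np.zeros((2, 125, 125, 1), dtype=np.float32)
path = os.path.join(out_dir, "X_demo.pickle")

with open(path, "wb") as f:
    pickle.dump(X_demo, f)

# Files written with 'wb' must be opened with 'rb' for loading.
with open(path, "rb") as f:
    X_loaded = pickle.load(f)

print(X_loaded.shape, X_loaded.dtype)
```

Pickle files should only be loaded from sources we trust, since loading can execute arbitrary code.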