# Summer School Workshop - Image Data Preprocessing

Most deep learning frameworks already include high-level interfaces for image transformation, so that the data can be processed by the neural net. The images are converted to a uniform resolution and size.
Furthermore, converting the images to byte arrays decreases the processing time.

In this workshop we look at how to preprocess image data. The goal is to prepare the data so that training gives the best possible result.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import os
import shutil
from IPython.display import Image, display

## 1. Explore your data

We have already obtained the data for you!
The folder 'data/image_cats_dogs' contains 100 cat and 100 dog images. The images come from the Kaggle challenge 'Dogs vs. Cats': https://www.kaggle.com/c/dogs-vs-cats

These images now need to be prepared so that a dog vs. cat classifier can be trained on them.
- Take a look at the folder and become familiar with the images!

In [None]:
# This is our current path
print("current path: {}".format(os.getcwd()))

# let's save the home path
home = "/home/jovyan/"

# a list keeps the folder order predictable
folders = [home + "data/image_cats_dogs/"]
for folder in folders:
    # show the first four images of each folder
    for file in os.listdir(folder)[:4]:
        display(Image(filename=os.path.join(folder, file)))
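
Besides looking at a few pictures, a quick count of the files per extension can help to get familiar with a folder. The following is a minimal sketch using only the standard library; the helper `summarize_folder` and the demo file names are hypothetical and not part of the workshop data.

```python
import os
import tempfile
from collections import Counter

def summarize_folder(folder):
    """Count the files in a folder by lower-cased file extension."""
    extensions = Counter()
    for name in os.listdir(folder):
        if os.path.isfile(os.path.join(folder, name)):
            extensions[os.path.splitext(name)[1].lower()] += 1
    return extensions

# Demo on a throwaway folder with hypothetical file names
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ["866.jpg", "u27.jpg", "099.JPG", "notes.txt"]:
        open(os.path.join(demo_folder, name), "w").close()
    summary = summarize_folder(demo_folder)

print(dict(summary))
```

Calling `summarize_folder(home + "data/image_cats_dogs/")` would give the same kind of overview for the workshop images.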

## 2. Preparing Data

For the following steps it is easier to separate the image classes into different directories. Most toolkits can map the name of a directory to a specific class.

The file *~/data/catdoglabels.csv* contains the correct class for each image name.

- Use the csv file to sort the data into two subfolders, following the schema below:
    - Images
        - 1 (Label Dog):
            - 866.jpg
            - 783.jpg
            - ...
        - 0 (Label Cat):
            - u27.jpg
            - 099.jpg
            - ...

This can be done by completing the Python code with a few lines...

In [None]:
# Read the csv file with the pandas library
csvlabels = pd.read_csv(home + "data/catdoglabels.csv")
print ("data size:", csvlabels.shape)
labels = { 0 : "cat", 1 : "dog"}
csvlabels.sample(5)

In [None]:
# create the folders for each class
source_folder = home + "data/image_cats_dogs/"
target_folder = home + "temp/image_cats_dogs/"

if not os.path.exists(target_folder):
    os.makedirs(target_folder)
for i in range(2):
    label = labels[i]
    path =  target_folder + label + "/"
    if not os.path.exists(path):
        print ('Generated subfolder:', label)
        os.makedirs(path)

# loop through the images
copy_count = 0
for file in os.listdir(source_folder):
    # search for the label
    label = csvlabels[csvlabels.id == file.replace(".jpg","")].label.values[0]
    # copy it
    shutil.copy((source_folder+file), (target_folder+labels[label]+"/"))
    copy_count += 1
print("{} files copied".format(copy_count))
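
The cell above filters the whole table once per file. As a variation, the labels can be read into a dictionary once and then looked up by image id. The following is a minimal sketch that assumes the csv has the columns `id` and `label` (as used above); the helpers `build_label_lookup` and `class_folder_for` and the in-memory sample rows are hypothetical.

```python
import csv
import io
import os

def build_label_lookup(csv_file):
    """Map each image id to its integer label (columns 'id' and 'label')."""
    reader = csv.DictReader(csv_file)
    return {row["id"]: int(row["label"]) for row in reader}

def class_folder_for(filename, lookup, class_names):
    """Return the class folder name for an image file, or None if unlabeled."""
    image_id, _extension = os.path.splitext(filename)
    label = lookup.get(image_id)
    return None if label is None else class_names[label]

# Hypothetical sample rows standing in for catdoglabels.csv
sample_csv = io.StringIO("id,label\n866,1\nu27,0\n")
lookup = build_label_lookup(sample_csv)
class_names = {0: "cat", 1: "dog"}

for filename in ["866.jpg", "u27.jpg", "missing.jpg"]:
    print(filename, "->", class_folder_for(filename, lookup, class_names))
```

Returning `None` for files without a label lets the copy loop skip them instead of failing with an `IndexError`.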

## 3. Standardise size

Most machine learning methods need a fixed input size. This is why we normalize the images to a uniform, square format.

We work with an image size of 250 x 250 pixels. That is far more than in the Fashion MNIST dataset (28 x 28 pixels), but still small.

Surplus image parts are cut off (cropping).

***Hints:***
- `PIL.Image` has a `crop` as well as a `thumbnail` method

  
Crop all images to a resolution of 250 x 250 pixels using the Python Imaging Library (PIL). To do so, complete the following code with a crop method!
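
The central square of an image can be described by a box in the (left, upper, right, lower) form that PIL's `crop` method expects. The following is a minimal sketch of that arithmetic in plain Python; the helper name `center_crop_box` is hypothetical.

```python
def center_crop_box(width, height):
    """Return the (left, upper, right, lower) box of the central square."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return (left, upper, left + side, upper + side)

# landscape, portrait and already square examples
print(center_crop_box(400, 300))
print(center_crop_box(300, 500))
print(center_crop_box(250, 250))
```

With a PIL image `img`, the result could be passed on as `img.crop(center_crop_box(*img.size))`.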


In [None]:
from PIL import Image as PILImage # name conflict with IPython.display.Image
DESIRED_SIZE = 250, 250

count = 0
for root, dirs, files in os.walk(target_folder):
    for pic in files:
        img_path = os.path.join(root, pic)
        img = PILImage.open(img_path)
        width, height = img.size

        #insert the crop method here!
        if width > height:
            left = int((width-height)/2)
            img = img.crop((left, 0, left+height, height))
        else:
            upper = int((height-width)/2)
            img = img.crop((0, upper, width, upper+width))

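        # LANCZOS replaces the ANTIALIAS alias, which was removed in Pillow 10;
        # note that thumbnail only shrinks, so smaller images keep their size
        img.thumbnail(DESIRED_SIZE, PILImage.LANCZOS)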
        img.save(os.path.join(root, pic))
        count += 1
print('{} images done'.format(count))

- Now open a few images with the display method to check if it works!

In [None]:
listOfImageNames = [target_folder + labels[1] + '/1v4.jpg',
                    target_folder + labels[1] + '/4vv.jpg',
                    target_folder + labels[0] + '/x12.jpg',
                    target_folder + labels[0] + '/x13.jpg']

for imageName in listOfImageNames:
    display(Image(filename=imageName))

## 4. Create a test dataset

To check that the learning method does not simply memorize the examples, we hold back a part of the examples during training. If the model predicts well on those held-back examples, we know that it has learned something useful.

Use 80% of the data as training data and 20% as test data, and move them into separate directories.

In [None]:
import random

# To keep both classes evenly distributed in the test set as well,
# we process them separately
for i in range(2):
    filenames = os.listdir(target_folder + labels[i])
    # shuffle to generate a random split 
    random.shuffle(filenames) 
    # split
    split_index = int(0.8 * len(filenames))
    split_files = {}
    split_files['train'] = filenames[:split_index]
    split_files['test'] = filenames[split_index:]
    # 'subset' instead of 'set' so the built-in set type is not shadowed
    for subset in ["train", "test"]:
        # make dir and move
        os.makedirs(target_folder + subset + '/' + labels[i], exist_ok=True)
        for file in split_files[subset]:
            shutil.move(
                target_folder + labels[i] + "/" + file,
                target_folder + subset + '/' + labels[i] + "/" + file)

# delete old dirs
for i in range(2):
    os.removedirs(target_folder + labels[i])
    
# what's the result?
for folder in os.walk(target_folder):
    print("folder: {}, file count: {}".format(folder[0], len(folder[2])))
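
Because `random.shuffle` is not seeded above, the split differs on every run. The following is a minimal sketch of a reproducible 80/20 split; the helper `split_filenames`, the seed value and the demo file names are hypothetical.

```python
import random

def split_filenames(filenames, train_fraction=0.8, seed=42):
    """Return (train, test) lists; the same seed always gives the same split."""
    shuffled = sorted(filenames)           # fixed starting order
    random.Random(seed).shuffle(shuffled)  # local generator, global state untouched
    split_index = int(train_fraction * len(shuffled))
    return shuffled[:split_index], shuffled[split_index:]

# Hypothetical file names standing in for one class folder
names = ["img_{:03d}.jpg".format(n) for n in range(100)]
train_files, test_files = split_filenames(names)
print(len(train_files), "train files,", len(test_files), "test files")
```

Applying the function once per class, as in the loop above, keeps both classes evenly represented in the test set.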

## 5. Expand your training dataset

Because our dataset is too small for successful training, we have to expand it. One way of achieving this is called 'Data Augmentation': transformation steps such as rotation or changes of perspective create further versions of an image. This helps the model to generalize better and reduces overfitting.
- First, read up on Data Augmentation in our NovaTec blog (https://blog.novatec-gmbh.de/keras-data-augmentation-for-cnn/)
- Then apply Data Augmentation to one example image of your choice. Take note of how different parameters affect the images!
- Complete the code with an ImageDataGenerator (https://keras.io/preprocessing/image/)
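
To get a feeling for what such transformations do to the pixel data, two of them can be written by hand with NumPy. The following is a minimal sketch on a tiny array; the helpers `horizontal_flip` and `shift_right_nearest` are hypothetical and only imitate the idea of a horizontal flip and of a shift whose gap is filled with the nearest edge pixels. They are not how Keras implements these steps.

```python
import numpy as np

def horizontal_flip(image):
    """Mirror a (height, width, channels) image left to right."""
    return image[:, ::-1, :]

def shift_right_nearest(image, pixels):
    """Shift an image to the right and fill the gap with the left edge column."""
    if pixels <= 0:
        return image.copy()
    shifted = np.empty_like(image)
    shifted[:, pixels:, :] = image[:, :-pixels, :]
    shifted[:, :pixels, :] = image[:, :1, :]
    return shifted

# tiny 2 x 3 "image" with a single channel
demo = np.arange(6).reshape(2, 3, 1)
flipped = horizontal_flip(demo)
shifted = shift_right_nearest(demo, 1)

print(demo[:, :, 0])
print(flipped[:, :, 0])
print(shifted[:, :, 0])
```

Every variant still shows the same object, which is why the label of the original image can be reused for all augmented copies.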

In [None]:
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
from IPython.display import Image, display

#ImageDataGenerator datagen:
datagen = ImageDataGenerator(
            rotation_range=20,
            width_shift_range=0.1,
            height_shift_range=0.1,
            shear_range=0.1,
            zoom_range=0.1,
            horizontal_flip=True,
            fill_mode='nearest')
    
img = load_img('../temp/image_cats_dogs/train/dog/0h7.jpg')  
x = img_to_array(img)  # Numpy array with shape (250, 250, 3)
x = x.reshape((1,) + x.shape)  # Numpy array with shape (1, 250, 250, 3)
    
    
# generating batches of randomly transformed images
# save to the 'augmentation' directory
if not os.path.exists('../temp/image_cats_dogs/augmentation'):
    os.makedirs('../temp/image_cats_dogs/augmentation')
        
i = 0
for batch in datagen.flow(x, batch_size=1, save_to_dir='../temp/image_cats_dogs/augmentation', save_prefix='dog', save_format='jpeg'):
    i += 1
    # 10 images
    if i > 9:
        break 
    
listOfAugmentedImages = []
for root, dirs, files in os.walk('../temp/image_cats_dogs/augmentation/'):
    for pic in files:
        listOfAugmentedImages.append(os.path.join(root,pic))
        
for imageName in listOfAugmentedImages:
    display(Image(filename=imageName))

## 6. Convert to Byte-Array

The input to a neural net (here a convolutional neural net, CNN) must be available as a byte array. For CNNs in Keras this is by default a 4-D input tensor with the shape [batch_size, height, width, channels].
- Convert one image of your choice and output a byte array with shape [1, 250, 250, 3].

Hint: Use the Keras method 'img_to_array'!

In [None]:
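from matplotlib.pyplot import imshow
from PIL import Image as PILImage  # avoid shadowing IPython.display.Image
import numpy as np

%matplotlib inline
image = PILImage.open(target_folder + 'train/' + labels[1] + '/0jj.jpg')

#insert the method here:
image = img_to_array(image)  # float32 array with values from 0 to 255
print('Shape:', image.shape)

# scale to [0, 1] first: imshow clips float data outside this range
image = image / 255
imshow(image)

# add the batch dimension
image = np.expand_dims(image, axis=0)
print('Shape:', image.shape)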
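
For training, several images are usually combined into one batch. The following is a minimal sketch of that step with NumPy alone; the helper `to_batch` is hypothetical, and random arrays stand in for real 250 x 250 RGB images.

```python
import numpy as np

def to_batch(images):
    """Stack equally sized (height, width, channels) uint8 images into one
    float32 batch with values scaled to the range [0, 1]."""
    batch = np.stack(images).astype(np.float32)
    return batch / 255.0

# Random arrays as placeholders for four loaded images
rng = np.random.default_rng(0)
fake_images = [rng.integers(0, 256, size=(250, 250, 3), dtype=np.uint8)
               for _ in range(4)]
batch = to_batch(fake_images)
print(batch.shape, batch.dtype)
```

`np.stack` only works if all images have the same shape, which is one reason for standardising the size in step 3.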