# Summer School Workshop - Image Data Preprocessing

As you may have heard in the lecture, most machine learning algorithms need the data in a fixed and uniform format. For image data, this means a fixed size, resolution, color depth and so on.
Luckily, most deep learning frameworks already include high-level interfaces for image transformation, so the data can be brought into a usable format quite easily. The result of the preprocessing should simply be a byte array for each image that we can feed into a neural network.

In this exercise we learn how to preprocess image data. The goal is a dataset that gives the model the best chance of training well.

Steps:
- Explore the raw data
- Structure the dataset
- Standardize the size
- Split into training and testing data
- Augment the data to increase its variety

In [None]:
# some standard imports 
import pandas as pd
import matplotlib.pyplot as plt
import os
import shutil
from IPython.display import Image, display

## 1. Explore the data

We have already obtained some raw data for you.
The folder ['data/image_cats_dogs'](/tree/data/image_cats_dogs) contains 100 cat and 100 dog images. The images come from the dataset of the Kaggle challenge ['Dogs vs. Cats'](https://www.kaggle.com/c/dogs-vs-cats).

We now want to prepare these images in order to train a dog vs. cat classifier.

- Take a look at the folder and become familiar with the images!

In [None]:
# This is our current path
print("current path: {}".format(os.getcwd()))

# let's save the home path: this value is for the container environment.
# You can update the path according to the output of the command above
home = "/home/jovyan/"

# show the first four files of each folder
folders = [home + "data/image_cats_dogs/"]
for folder in folders:
    for file in os.listdir(folder)[:4]:
        display(Image(filename=(folder + file)))

## 2. Structure your dataset

We should first copy the files to a new location so that we do not have to download them again in case of errors in the preprocessing.

Because it is also often easier to have the classes in separate locations, we'll separate cat images from dog images in this step, too. Most toolkits are then able to map the name of a directory to a specific class.
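
To illustrate how a directory name can serve as a class label, here is a minimal sketch that uses only the standard library. The helper `collect_labeled_files` and the placeholder file names are assumptions made for this illustration; real toolkits implement the same idea with their own loaders.

```python
import tempfile
from pathlib import Path

def collect_labeled_files(root):
    """Return (file name, class name) pairs, using each subfolder name as the label."""
    pairs = []
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            for file_path in sorted(class_dir.iterdir()):
                pairs.append((file_path.name, class_dir.name))
    return pairs

# Build a tiny throwaway folder structure with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for class_name, file_name in [("cat", "a.jpg"), ("dog", "b.jpg"), ("dog", "c.jpg")]:
        class_dir = Path(tmp) / class_name
        class_dir.mkdir(exist_ok=True)
        (class_dir / file_name).touch()
    labeled_files = collect_labeled_files(tmp)

print(labeled_files)
```

With one folder per class, no separate label file is needed once the images are sorted.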

The labels, i.e. the information whether an image shows a cat or a dog, are contained in the file ['data/catdoglabels.csv'](/edit/data/catdoglabels.csv).

- Use the csv file to sort the images into two subfolders, following the schema below:

```
- Images folder
    -cat:
        -u27.jpg
        -099.jpg
        -...
    -dog:
        -866.jpg
        -783.jpg
        -...
```        
This can be done by completing the following Python code:

**Hint:** read the csv into a pandas dataframe, which is basically a table structure that you can access like a hash map.
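
The lookup pattern behind this hint can be sketched on a small in-memory table. The CSV content and the names `example_labels` and `example_names` are made up for this illustration, and the mapping 0 = cat, 1 = dog is an assumption that you should verify against the real file. Only the column names `id` and `label` are taken from the copy code of this notebook.

```python
import io
import pandas as pd

# Hypothetical CSV content; the real labels live in 'data/catdoglabels.csv'.
csv_text = "id,label\nu27,0\n866,1\n099,0\n"

# Reading the id column as text keeps leading zeros such as '099'.
example_labels = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})

# Assumed mapping from integer label to class name (check it against the data).
example_names = {0: "cat", 1: "dog"}

# Select the row whose id matches, then read its label value.
row = example_labels[example_labels.id == "099"]
label_id = int(row.label.values[0])
print(label_id, example_names[label_id])
```

A plain dictionary is enough for the mapping, because the copy code only needs `labels[0]` and `labels[1]`.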

In [None]:
# read the csv file with pandas
pd.read_csv?
csvlabels = ...

# define a mapping from the integer label (0/1) to a human readable class label (cat/dog, as in the folder schema above)
labels = ...

In [None]:
# This code is already complete because it is mostly boring file operation...

# define source folder and our working folder in which we store the copy of the data
source_folder = home + "data/image_cats_dogs/"
target_folder = home + "temp/image_cats_dogs/"

# create folder structure if necessary
os.makedirs(target_folder, exist_ok=True)

# create the folders for each class
for i in range(2):
    label = labels[i]
    path =  target_folder + label + "/"
    os.makedirs(path, exist_ok=True)

# loop through the images
copy_count = 0
for file in os.listdir(source_folder):
    # search for the label
    label = csvlabels[csvlabels.id == file.replace(".jpg","")].label.values[0]
    # copy it
    shutil.copy((source_folder+file), (target_folder+labels[label]+"/"))
    copy_count += 1
print("{} files copied".format(copy_count))

## 3. Standardise sizes

Most machine learning methods need a fixed input size. This is why we normalize the images to a uniform, square format.

We work with an (arbitrary) image size of 250 x 250 pixels. That is far more than the 28 x 28 pixels of the Fashion MNIST dataset, but still small.

Step 1: Make the image square by cutting surplus image parts from either top/bottom or left/right (cropping).

Step 2: Resize the square image to a fixed resolution of 250 x 250 pixels (thumbnailing).

Complete the following code with a crop method!

***Hints:***
- `PIL.Image` has a `crop` as well as a `thumbnail` method
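
How the two methods behave can be tried out on a synthetic image first. This is a minimal sketch, not the exercise solution itself: the helper `center_square_box` is a name invented here, and the image is generated in memory rather than loaded from the dataset.

```python
from PIL import Image as PILImage

def center_square_box(width, height):
    """Return the (left, upper, right, lower) box of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return (left, upper, left + side, upper + side)

# Synthetic landscape image, 400 pixels wide and 300 pixels high.
example = PILImage.new("RGB", (400, 300), color=(120, 60, 30))

box = center_square_box(*example.size)
square = example.crop(box)       # crop returns a new image, the original is unchanged
square.thumbnail((250, 250))     # thumbnail shrinks the image in place

print(box, square.size)
```

Note the difference between the two calls: the result of `crop` has to be assigned to a variable, while `thumbnail` modifies the image it is called on.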


In [None]:
from PIL import Image as PILImage # name conflict with IPython.display.Image
DESIRED_SIZE = 250, 250

count = 0
# goes through all folders
for root, dirs, files in os.walk(target_folder):
    # for each folder: goes through each file
    for pic in files:
        #loads image
        img_path = os.path.join(root, pic)
        img = PILImage.open(img_path)
        width, height = img.size
       
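        if width > height:
            left = int((width-height)/2)
            # insert the crop method here!

        else:
            upper = int((height-width)/2)
            # insert the crop method here!

        # this code resizes the cropped image and overwrites the file with the transformed image.
        # thumbnail only shrinks: images smaller than DESIRED_SIZE keep their size.
        # LANCZOS is the high-quality downsampling filter (the ANTIALIAS alias is gone in Pillow 10).
        img.thumbnail(DESIRED_SIZE, PILImage.LANCZOS)
        img.save(os.path.join(root, pic))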
        count += 1
print('{} images done'.format(count))

Now open a few images with the display method to check if it worked! The images should all be square and of constant size, with a nicely cropped animal.
(The code is already provided, just run it.)

In [None]:
listOfImageNames = [target_folder + labels[1] + '/1v4.jpg',
                    target_folder + labels[1] + '/4vv.jpg',
                    target_folder + labels[0] + '/x12.jpg',
                    target_folder + labels[0] + '/x13.jpg']

for imageName in listOfImageNames:
    display(Image(filename=imageName))

## 4. Create a test set

To check that the learning method does not simply memorize the examples, we withhold a part of the examples during training. If the model predicts well on those unseen examples, we know that it has learned something useful.
   
Take 80% of all data for training data and 20% for test data, and move them into separate directories:

```
- Images folder
    - train
        -cat:
            -u27.jpg
            -099.jpg
            -...
        -dog:
            -866.jpg
            -783.jpg
            -...
    - test
        -cat:
            -f11.jpg
            -d84.jpg
            -...
        -dog:
            -0h7.jpg
            -110.jpg
            -...

```
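
One possible way to produce such a split for a single class is sketched below on a throwaway directory with empty placeholder files. The helper `split_filenames`, the seed and the file names are assumptions for this illustration; the exercise cell may be solved differently.

```python
import os
import random
import shutil
import tempfile

def split_filenames(filenames, train_fraction=0.8, seed=0):
    """Shuffle a copy of the list and cut it into a training and a test part."""
    shuffled = sorted(filenames)
    random.Random(seed).shuffle(shuffled)
    split_index = int(len(shuffled) * train_fraction)
    return shuffled[:split_index], shuffled[split_index:]

with tempfile.TemporaryDirectory() as tmp:
    # Create ten empty placeholder files for one class.
    class_dir = os.path.join(tmp, "cat")
    os.makedirs(class_dir)
    for n in range(10):
        open(os.path.join(class_dir, "img{}.jpg".format(n)), "w").close()

    train_files, test_files = split_filenames(os.listdir(class_dir))

    # Move every file into <subset>/<class>/ as in the schema above.
    for subset, names in (("train", train_files), ("test", test_files)):
        subset_dir = os.path.join(tmp, subset, "cat")
        os.makedirs(subset_dir, exist_ok=True)
        for name in names:
            shutil.move(os.path.join(class_dir, name), os.path.join(subset_dir, name))

    train_count = len(os.listdir(os.path.join(tmp, "train", "cat")))
    test_count = len(os.listdir(os.path.join(tmp, "test", "cat")))

print(train_count, test_count)
```

Fixing the seed makes the split reproducible, which helps when results from different runs have to be compared.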


In [None]:
import random

# In order to guarantee that both classes are also evenly distributed in the test set,
# we process them separately
for i in range(2):
    filenames = os.listdir(target_folder + labels[i])
    # shuffle to generate a random split 
    random.shuffle(filenames) 
    # insert your code to split the data into two parts!
    split_index = ...
    # insert your code to make dirs and move the files
    shutil.move?


# delete old dirs if you are certain that everything works :)
#for i in range(2):
#    os.removedirs(target_folder + labels[i])
    
# what's the result?
for folder in os.walk(target_folder):
    print("folder: {}, file count: {}".format(folder[0], len(folder[2])))

## 5. Expand your train dataset

Because our dataset is very small, we want to expand it. One way of achieving this is called 'Data Augmentation': by applying transformations such as rotation or changes of perspective, we create further versions of an image. This helps the model to generalize better and reduces overfitting.
- First search for information about Data Augmentation in our NovaTec-Blog (https://blog.novatec-gmbh.de/keras-data-augmentation-for-cnn/)
- After that apply Data Augmentation to one example image of your choice. Take note of how different parameters affect the images!
- Complete the code with an `ImageDataGenerator` (https://keras.io/preprocessing/image/)
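
To get a feeling for what augmentation does before using Keras, the idea can be sketched with plain NumPy. This sketch deliberately does not use `ImageDataGenerator`: the function `random_flip_and_shift` and the toy array are made up for this illustration and cover only two simple transformations, a horizontal flip and a horizontal shift.

```python
import numpy as np

def random_flip_and_shift(image, rng, max_shift=2):
    """Return a randomly mirrored and horizontally shifted copy of an (H, W, C) array."""
    augmented = image.copy()
    if rng.random() < 0.5:
        augmented = augmented[:, ::-1, :]          # horizontal flip
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.zeros_like(augmented)             # pixels shifted in from outside are black
    if shift > 0:
        shifted[:, shift:, :] = augmented[:, :-shift, :]
    elif shift < 0:
        shifted[:, :shift, :] = augmented[:, -shift:, :]
    else:
        shifted = augmented.copy()
    return shifted

rng = np.random.default_rng(seed=0)
toy_image = np.arange(4 * 6 * 3, dtype=np.uint8).reshape(4, 6, 3)

# Every call yields another variant of the same source image.
variants = [random_flip_and_shift(toy_image, rng) for _ in range(5)]
print([v.shape for v in variants])
```

Each variant keeps the shape of the original, so augmented images can be fed to the same network without further resizing.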

In [None]:
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
from IPython.display import Image, display

# ImageDataGenerator datagen:
datagen = ...  # your code here!

# load one example image
img = load_img('../temp/image_cats_dogs/train/dog/0h7.jpg')  # you may have to change this file name! (random split 4TW!)
x = img_to_array(img)  # Numpy array with shape (250, 250, 3)
x = x.reshape((1,) + x.shape)  # Numpy array with shape (1, 250, 250, 3)
    
    
# generating batches of randomly transformed images
# save to the 'augmentation' directory
augmentation_dir = '../temp/image_cats_dogs/augmentation'
os.makedirs(augmentation_dir, exist_ok=True)

i = 0
for batch in datagen.flow(x, batch_size=1, save_to_dir=augmentation_dir, save_prefix='dog', save_format='jpeg'):
    i += 1
    # 10 images
    if i > 9:
        break

listOfAugmentedImages = []
for root, dirs, files in os.walk(augmentation_dir):
    for pic in files:
        listOfAugmentedImages.append(os.path.join(root, pic))

for imageName in listOfAugmentedImages:
    display(Image(filename=imageName))

## 6. Convert to byte array

The input to a neural net (here a convolutional neural net) must be available as a numeric array. For CNNs this is a 4-D input tensor with the shape `[batch_size, height, width, channels]`.
- Convert one image of your choice and output an array with shape `[1, 250, 250, 3]`.

***Hint***: Use the Keras method `img_to_array`!
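
The order of the dimensions is easiest to see on a non-square image. The following sketch uses `np.asarray` instead of the Keras helper and a synthetic image instead of a dataset file; both choices are assumptions made to keep the example self-contained.

```python
import numpy as np
from PIL import Image as PILImage

# Synthetic image that is 300 pixels wide and 200 pixels high.
example_img = PILImage.new("RGB", (300, 200), color=(255, 128, 0))

pixels = np.asarray(example_img, dtype=np.float32)  # shape (height, width, channels)
scaled = pixels / 255.0                             # values now lie in [0, 1]
batch = np.expand_dims(scaled, axis=0)              # add the batch dimension in front

print(pixels.shape, batch.shape)
```

PIL reports the size as (width, height), while the array stores rows first, so the height comes before the width in the tensor shape.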

In [None]:
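from matplotlib.pyplot import imshow
from PIL import Image as PILImage  # alias avoids shadowing IPython.display.Image
import numpy as np

%matplotlib inline
image = PILImage.open(target_folder + 'train/' + labels[1] + '/0jj.jpg', 'r')

# insert the method here:
image = ...
print('Shape:', image.shape)

# scale the pixel values to [0, 1]; imshow expects float images in this range
image = image / 255
imshow(image)

# add the batch dimension in front
image = np.expand_dims(image, axis=0)
print('Shape:', image.shape)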