# Summer School Workshop - Image Data Preprocessing

Most deep learning frameworks already include high-level interfaces for image transformation, so that the data can be processed by the neural net. Typically, the images are brought to the same resolution and size.
Furthermore, converting the images to byte arrays up front decreases the processing time.

This workshop is about preprocessing image data. Our goal is to prepare the data so that training yields the best possible result.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import shutil

## 1. Explore your data

We have already obtained the data for you!
The folder 'data' contains 100 cat and 100 dog images. The images come from the Kaggle dataset of the 'Dogs vs. Cats' challenge: https://www.kaggle.com/c/dogs-vs-cats

These images now need to be prepared so that a dog vs. cat classifier can be trained on them.
- Take a look at the folder and become familiar with the images!

In [None]:
csvlabels = pd.read_csv("../data/catdoglabels.csv")
csvlabels.head()

In [None]:
print("data size:", csvlabels.shape)

In [None]:
# Occurrence of labels
temp = pd.DataFrame(csvlabels.label.value_counts())
temp.reset_index(inplace=True)
temp.columns = ['label','count']
temp

In [None]:
# Plot the labels
plt.figure(figsize = (9, 8))
plt.title('frequency of labels')
sns.set_color_codes("pastel")
sns.barplot(x="label", y="count", data=temp,
            label="Count")
plt.show()

In [None]:
from IPython.display import Image, display
   
listOfImageNames = ['../data/image_cats_dogs/1v4.jpg',
                    '../data/image_cats_dogs/4vv.jpg',
                    '../data/image_cats_dogs/x12.jpg',
                    '../data/image_cats_dogs/x13.jpg']

for imageName in listOfImageNames:
    display(Image(filename=imageName))

## 2. Structure your data

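You might have noticed that the images are not assigned to their classes yet.
Training is easier to set up if the images are split into one subfolder per class, because many framework loaders (for example Keras' 'flow_from_directory') derive the labels from the folder names.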

- Therefore, use the csv file to sort the images into two subfolders following the schema below:
    - Images
        - 1 (Label Dog):
            - 866.jpg
            - 783.jpg
            - ...
        - 0 (Label Cat):
            - u27.jpg
            - 099.jpg
            - ...

This can be done by completing the Python code below with a few lines.

We already checked the labels and created the target folder for you ('../temp/image_cats_dogs', the 'Images' folder in the schema above):

In [None]:
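# Per the schema above: 0 = cat, 1 = dog. Read the labels from the csv itself,
# because the row order of value_counts() may vary when both counts are equal.
labels = sorted(csvlabels.label.unique())
print('Cat-Label:', labels[0])
print('Dog-Label:', labels[1])

# Create the target folder (no error if it already exists)
os.makedirs('../temp/image_cats_dogs', exist_ok=True)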

Now it's your turn: create a subfolder for each label in the target folder!

In [None]:
#Generate Subfolders for each class
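
If you get stuck, the following is one possible approach, written as a minimal sketch rather than the reference solution. The helper name `make_class_folders` is hypothetical, and the demo runs in a temporary directory so that it does not touch the workshop data.

```python
import os
import tempfile


def make_class_folders(target_root, labels):
    """Create one subfolder per label below target_root and return the paths."""
    paths = []
    for label in labels:
        path = os.path.join(target_root, str(label))
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        paths.append(path)
    return paths


# Demo in a throwaway directory. In the notebook the root would be
# '../temp/image_cats_dogs' and the labels 0 (cat) and 1 (dog).
demo_root = tempfile.mkdtemp()
created = make_class_folders(demo_root, [0, 1])
print(sorted(os.listdir(demo_root)))
```

Passing `exist_ok=True` makes the cell safe to run more than once.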

Now we have to assign the images to the right target folder.

To do that, complete the code with an if-clause that copies an image into the subfolder when its id is in 'image_ids'.

In [None]:
# loop over the two labels (0 = cat, 1 = dog)
for n in range(2):

    # select the csv rows that belong to the current label
    t = csvlabels[(csvlabels.label == n)]
    num_images = len(t.id)
    print('Number of images:', num_images)

    # get list of image ids
    image_ids = list(t.id)

    # check if list contains id and copy to subfolder
    for root, dirs, files in os.walk('../data/image_cats_dogs'):
        for pic in files:

            # get image name without extension
            p = os.path.splitext(pic)[0]
            inpath = os.path.join(root, pic)
            outpath = '../temp/image_cats_dogs/' + str(n)

            # please insert the if-clause here:
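
The if-clause can be tried out in isolation first. The sketch below is one possible way to copy files by id, not the official solution: `copy_images_by_id` is a hypothetical helper, the files are empty placeholders in a temporary directory, and the ids are assumed to be the file names without extension.

```python
import os
import shutil
import tempfile


def copy_images_by_id(src_dir, dst_dir, image_ids):
    """Copy every file in src_dir whose name (without extension) is in image_ids."""
    wanted = {str(i) for i in image_ids}
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for pic in sorted(os.listdir(src_dir)):
        stem = os.path.splitext(pic)[0]
        if stem in wanted:  # this is the if-clause the exercise asks for
            shutil.copy(os.path.join(src_dir, pic), os.path.join(dst_dir, pic))
            copied.append(pic)
    return copied


# Demo with placeholder files instead of real images
src = tempfile.mkdtemp()
dst = os.path.join(tempfile.mkdtemp(), '1')
for name in ['866.jpg', '783.jpg', 'u27.jpg']:
    with open(os.path.join(src, name), 'wb') as f:
        f.write(b'placeholder')

copied = copy_images_by_id(src, dst, ['866', '783'])
print(copied)
```

Copying instead of moving keeps the original 'data' folder intact, so the cell can be rerun after a mistake.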
            

## 3. Resize the images

The images are now collected in one place and sorted by class, so the preprocessing can continue.
One of the first steps is to ensure that all images have the same size and aspect ratio. It is common practice to choose a square format.
With so-called 'cropping', a square can be cut out of an image.

- Crop all images to a square and scale them to a resolution of 250 x 250 pixels using the Python Imaging Library (PIL).
- To do so, complete the following code with a crop method!


In [None]:
from PIL import Image, ImageOps


DESIRED_SIZE = 250, 250

for n in range(2):
    
    for root, dirs, files in os.walk('../temp/image_cats_dogs/' + str(n)):   
        for pic in files:
            # print(os.path.join(root, pic))
            img_path = os.path.join(root, pic)

            
            img = Image.open(img_path)
            width, height = img.size

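            if width > height:
                # landscape: cut equal strips from the left and right
                delta = width - height
                left = delta // 2
                upper = 0
                right = height + left
                lower = height
            else:
                # portrait or square: cut equal strips from the top and bottom
                delta = height - width
                left = 0
                upper = delta // 2
                right = width
                lower = width + upper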
            
            #Insert the crop method here! You will need four input values!
            
            
            
            # Image.ANTIALIAS was removed in Pillow 10; Image.LANCZOS is the same filter
            img.thumbnail(DESIRED_SIZE, Image.LANCZOS)

          
            img.save(os.path.join(root, pic))
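
The four values computed above describe the largest centered square. As a minimal sketch, the same arithmetic can be checked without any image at all. The helper name `center_crop_box` is hypothetical. With Pillow, the resulting tuple would be passed on as `img = img.crop(box)`, because `crop` returns a new image instead of changing the original.

```python
def center_crop_box(width, height):
    """Return (left, upper, right, lower) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return (left, upper, left + side, upper + side)


# Landscape example: 500 x 374 pixels
box = center_crop_box(500, 374)
print(box)  # (63, 0, 437, 374)

# Portrait example: 300 x 400 pixels
print(center_crop_box(300, 400))  # (0, 50, 300, 350)
```

Note that `thumbnail` only shrinks images, so a source image smaller than 250 x 250 pixels would keep its smaller size.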


- Now open a few images with the display method to check if it works!

In [None]:
from IPython.display import Image, display


## 4. Split your dataset

To check the model after each epoch, track its progress, and test it after training, the dataset is split into train, validation and test data.
   
- Which proportions would you choose?
   
- Split your dataset into these proportions.

Hint: You can use the 'train_test_split' function from scikit-learn or simple code based on lists!
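
As an illustration of the list-based variant, here is a minimal sketch using only the standard library. The 70/15/15 proportions, the fixed seed, and the helper name `split_dataset` are assumptions for the example, not a recommendation from the workshop.

```python
import random


def split_dataset(items, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle items reproducibly and cut them into train, validation and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train_frac)
    n_val = round(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # whatever is left over
    return train, val, test


# Hypothetical ids standing in for the 100 images of one class
ids = [f"img{i:03d}" for i in range(100)]
train, val, test = split_dataset(ids)
print(len(train), len(val), len(test))
```

Applying the function to each class folder separately keeps the share of cats and dogs equal in all three parts.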

## 5. Expand your train dataset

Because our dataset is too small for successful training, we have to expand it. One way of achieving this is called 'Data Augmentation': transformation steps such as rotations or changes of perspective create further versions of an image. This helps the model to generalize better and reduces overfitting.
- First, read up on Data Augmentation in our NovaTec blog (https://blog.novatec-gmbh.de/keras-data-augmentation-for-cnn/)
- After that, apply Data Augmentation to one example image of your choice. Take note of how different parameters affect the images!
- Complete the code with an ImageDataGenerator (https://keras.io/preprocessing/image/)
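
As a starting point for the experiments, the sketch below collects a few transformation parameters in a plain dictionary. The parameter names are keyword arguments of Keras' `ImageDataGenerator`. The values are hypothetical starting points, and the block itself runs without Keras because the generator line is commented out.

```python
# Hypothetical starting values; change them and watch how the images differ.
augmentation_params = {
    "rotation_range": 40,        # rotate randomly by up to 40 degrees
    "width_shift_range": 0.2,    # shift horizontally by up to 20 % of the width
    "height_shift_range": 0.2,   # shift vertically by up to 20 % of the height
    "shear_range": 0.2,          # shear intensity
    "zoom_range": 0.2,           # zoom in or out by up to 20 %
    "horizontal_flip": True,     # mirror images at random
    "fill_mode": "nearest",      # fill new pixels with the nearest existing ones
}

# With Keras installed, the generator for the cell below could be created as:
# datagen = ImageDataGenerator(**augmentation_params)
for name, value in augmentation_params.items():
    print(f"{name} = {value}")
```

Changing one value at a time makes it easier to see which parameter causes which effect.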

In [None]:
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
from IPython.display import Image, display

#ImageDataGenerator datagen:

    
img = load_img('../temp/image_cats_dogs/train/1/555.jpg')  
x = img_to_array(img)  # Numpy array with shape (250, 250, 3)
x = x.reshape((1,) + x.shape)  # Numpy array with shape (1, 250, 250, 3)
    
    
# generate batches of randomly transformed images
# and save them to the 'augmentation' directory
if not os.path.exists('../temp/image_cats_dogs/augmentation'):
    os.makedirs('../temp/image_cats_dogs/augmentation')
        
i = 0
for batch in datagen.flow(x, batch_size=1, save_to_dir='../temp/image_cats_dogs/augmentation', save_prefix='dog', save_format='jpeg'):
    i += 1
    # 10 images
    if i > 9:
        break 
    
listOfAugmentedImages = []
for root, dirs, files in os.walk('../temp/image_cats_dogs/augmentation/'):   
    for pic in files:
        listOfAugmentedImages.append(os.path.join(root,pic))
        
for imageName in listOfAugmentedImages:
    display(Image(filename=imageName))

## 6. Convert to Byte-Array

The input of a neural net (here a convolutional neural net, CNN) must be available as an array of numeric pixel values (a 'byte array'). For CNNs this is a 4-D input tensor with the shape [batch_size, height, width, channels].
- Convert one image of your choice and output a byte array with shape [1, 250, 250, 3].

Hint: Use the Keras method 'img_to_array'!

In [None]:
from matplotlib.pyplot import imshow
from PIL import Image
import numpy as np

%matplotlib inline
image = Image.open('../temp/image_cats_dogs/train/1/555.jpg', 'r')

#insert the method here:
