This notebook explains how to prepare the dataset:

- Augment the data (brightness and blur)
- Copy each image into the folder of its class
- Rename all the images
- Split the images into train and test folders

## Import packages

The `shutil` module offers a number of high-level operations on files and collections of files. In particular, functions are provided which support file copying and removal. For operations on individual files, see also the `os` module.
`OpenCV-Python` is the set of Python bindings for OpenCV, a library for computer vision problems; it is imported as `cv2`.
`PIL` is the Python Imaging Library (installed today as the `Pillow` package), which adds image editing capabilities to the Python interpreter.

In [None]:
import os
import shutil
import cv2
from PIL import Image, ImageEnhance

In [None]:
path = "/multilabel_data"  # root folder of the raw images, used by the cells below

`What we have`: the raw dataset has the following directory structure:

<pre>
|__ <b>multilabel_data</b>
    |______ <b>bateau</b>: [bateau.0.jpg, bateau.1.jpg, bateau.2.jpg ....]
    |______ <b>bol</b>: [bol.0.jpg, bol.1.jpg, bol.2.jpg ...]
    |______ <b>...</b>
</pre>

`What we want`: the prepared dataset should have the following directory structure:

<pre>
<b>dataset</b>
|__ <b>train</b>
    |______ <b>bateau</b>: [bateau.0.jpg, bateau.1.jpg, bateau.2.jpg ....]
    |______ <b>bol</b>: [bol.0.jpg, bol.1.jpg, bol.2.jpg ...]
    |______ <b>...</b>
|__ <b>test</b>
    |______ <b>bateau</b>: [bateau.2000.jpg, bateau.2001.jpg, bateau.2002.jpg ....]
    |______ <b>bol</b>: [bol.2000.jpg, bol.2001.jpg, bol.2002.jpg ...]
    |______ <b>...</b>
</pre>
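
The target layout can be created up front so that later copy steps always find their destination folders. The following is a minimal sketch; the helper name `make_dataset_dirs` and its parameters are illustrative and not part of the original notebook.

```python
import os

def make_dataset_dirs(root, class_names, splits=("train", "test")):
    """Create root/<split>/<class> for every split and class; return the paths."""
    created = []
    for split in splits:
        for name in class_names:
            folder = os.path.join(root, split, name)
            os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
            created.append(folder)
    return created
```

For example, `make_dataset_dirs("dataset", ["bateau", "bol"])` would create `dataset/train/bateau`, `dataset/train/bol`, `dataset/test/bateau` and `dataset/test/bol`.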

## Data Augmentation (with bright/blur)

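Having a large dataset is crucial for the performance of a deep learning model. When collecting more images is not an option, we can still improve the model's performance by augmenting the data we already have. Here we add brightened and blurred copies of the images.
- The `ImageEnhance` module from `PIL` contains classes for image enhancement; `ImageEnhance.Brightness` scales the brightness by a factor (1.0 returns the original image, larger values brighten it).
- `cv2.blur()` blurs an image with a normalized box filter, whose kernel is a matrix of ones divided by its area: $K = \frac{1}{w \cdot h}\,\mathbf{1}_{h \times w}$ for a kernel of width $w$ and height $h$. The blur cell below uses `cv2.GaussianBlur()` instead, which weights neighbouring pixels with a Gaussian kernel.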
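
To make the normalized box filter concrete, here is a small pure-Python sketch that builds such a kernel and applies it to a tiny grayscale grid. It is only an illustration: the function names are hypothetical, and unlike OpenCV it skips the border pixels instead of padding them.

```python
def box_kernel(width, height):
    """Return a height x width kernel whose entries are all 1 / (width * height)."""
    weight = 1.0 / (width * height)
    return [[weight] * width for _ in range(height)]

def apply_kernel(image, kernel):
    """Filter a 2-D list of numbers, keeping only positions where the kernel fits entirely."""
    k_h, k_w = len(kernel), len(kernel[0])
    out = []
    for y in range(len(image) - k_h + 1):
        row = []
        for x in range(len(image[0]) - k_w + 1):
            acc = 0.0
            for dy in range(k_h):
                for dx in range(k_w):
                    acc += image[y + dy][x + dx] * kernel[dy][dx]
            row.append(acc)
        out.append(row)
    return out

# A bright 2x2 square on a dark background
image = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]
blurred = apply_kernel(image, box_kernel(3, 3))  # each output is the mean of a 3x3 window
```

Because every weight is equal, each output pixel is simply the average of its neighbourhood, which is what smooths sharp edges.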

In [None]:
# Data augmentation with brightness

for directory in os.listdir(path):
    new_path = os.path.join(path, directory)
    if not os.path.isdir(new_path):
        continue  # skip stray files at the root of the dataset
    count = 0
    for filename in os.listdir(new_path):
        # only process images named after their class folder
        if not filename.startswith(directory):
            continue
        count += 1
        img = Image.open(os.path.join(new_path, filename))

        # image brightness enhancer
        enhancer = ImageEnhance.Brightness(img)

        factor = 1.5  # values above 1.0 brighten the image
        im_output = enhancer.enhance(factor)

        # save the result next to the original image
        im_output.save(os.path.join(new_path, "%s_clear_%d.jpg" % (directory, count)))

In [None]:
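# Data augmentation with blur

for directory in os.listdir(path):
    new_path = os.path.join(path, directory)
    if not os.path.isdir(new_path):
        continue  # skip stray files at the root of the dataset
    count = 0
    for filename in os.listdir(new_path):
        # only process images named after their class folder
        if not filename.startswith(directory):
            continue
        count += 1

        # read image (cv2.imread returns None when the file cannot be decoded)
        src = cv2.imread(os.path.join(new_path, filename))
        if src is None:
            continue

        # apply Gaussian blur with a 3x3 kernel; the third positional argument is
        # sigmaX (0 derives it from the kernel size), the border mode is a keyword
        rst = cv2.GaussianBlur(src, (3, 3), 0, borderType=cv2.BORDER_DEFAULT)

        # save the result next to the original image
        cv2.imwrite(os.path.join(new_path, "%s_blur_%d.jpg" % (directory, count)), rst)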
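
Both augmentation cells write their output into the class folder they read from, and the new names still start with the class name, so a second pass also picks up the copies made by the first. If that is not wanted, the file list can be filtered first. The sketch below relies only on the `_clear_` and `_blur_` markers that the cells above put in the file names; the helper names are illustrative.

```python
AUGMENT_TAGS = ("_clear_", "_blur_")  # markers written by the two augmentation cells

def is_augmented(filename):
    """True if the file name carries one of the augmentation markers."""
    return any(tag in filename for tag in AUGMENT_TAGS)

def originals(filenames, class_name):
    """Keep the files of a class that are not augmented copies."""
    return [
        name for name in filenames
        if name.startswith(class_name) and not is_augmented(name)
    ]
```

Iterating over `originals(os.listdir(new_path), directory)` instead of `os.listdir(new_path)` would then augment each source image exactly once per technique.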

## Copy images from one folder to another

In [None]:
name_class = "tortue"
path_data = "/multilabel_data/" + name_class
tangram_class = os.listdir(path_data)

In [None]:
# List the lowercased names of all images in the folder of name_class

tangram_class = [name.lower() for name in os.listdir(path_data)]

In [None]:
# Select the files of tangram_class that start with a given prefix and end with ".jpg"

def list_tangram_class(prefix="tortue_blur"):
    return [name for name in tangram_class
            if name.startswith(prefix) and name.endswith(".jpg")]

files_tangram_class = list_tangram_class()

In [None]:
# Copy the selected files from one folder to another

source = "/multilabel_data/" + name_class + "/"
dest = "/dataset/train/" + name_class

def copy_file():
    os.makedirs(dest, exist_ok=True)  # make sure the destination folder exists
    for name in files_tangram_class:
        print(name)
        # Copy the file to the destination directory
        newPath = shutil.copy(os.path.join(source, name), dest)
        print("Path of copied file : ", newPath)

copy_file()

## Rename images

In [None]:
# Copy every image of a class to the train folder under a sequential name

name_class = "tortue"
source = "/multilabel_data/" + name_class
dest = "/dataset/train/" + name_class

def rename_all_img():
    os.makedirs(dest, exist_ok=True)  # make sure the destination folder exists
    # sorted() makes the numbering reproducible between runs
    for count, filename in enumerate(sorted(os.listdir(source)), start=1):
        # Copy the file under its new name, e.g. tortue.1.jpg
        newPath = shutil.copy(os.path.join(source, filename),
                              os.path.join(dest, "%s.%d.jpg" % (name_class, count)))
        print("Path of copied file : ", newPath)

rename_all_img()
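
Before copying anything, it can help to preview which name each file will receive. The following sketch computes that mapping without touching the disk; `plan_renames` is a hypothetical helper, and like the cell above it forces the `.jpg` extension without converting the image.

```python
def plan_renames(filenames, class_name, start=1):
    """Map each original name to '<class_name>.<n>.jpg', numbering in sorted order."""
    return {
        old: "%s.%d.jpg" % (class_name, index)
        for index, old in enumerate(sorted(filenames), start=start)
    }
```

Calling `plan_renames(os.listdir(source), name_class)` and printing the result gives a dry run of the renaming step.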

## Split folder between train / test dataset

We first check that every category has the same number of images. We then build a smaller dataset: for each category, a random training sample plus a test set whose size is 20% of that sample.
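
The sampling logic for a single category can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's own code: the function name `split_class`, the `test_fraction` parameter and the optional `seed` are introduced here for clarity.

```python
import random

def split_class(filenames, n_train, test_fraction=0.2, seed=None):
    """Sample n_train files for training, then a test set from the remaining files.

    The test set holds round(n_train * test_fraction) files, so it never
    overlaps with the training sample.
    """
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train = rng.sample(sorted(filenames), k=n_train)
    remaining = sorted(set(filenames) - set(train))
    n_test = round(n_train * test_fraction)
    test = rng.sample(remaining, k=n_test)  # raises ValueError if too few files remain
    return train, test
```

With `n_train=400` and the default fraction, the test set holds 80 images, which matches the numbers used in the cells below.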


In [None]:
# Count the images in each category and list the image names of each category

train_dir = "/dataset/train"
string_train = []

for i in os.listdir(train_dir):
    category_files = os.listdir(os.path.join(train_dir, i))
    string_train.append(category_files)  # one list of file names per category
    print(train_dir + "/" + i)
    print(len(category_files))

In [None]:
# Draw a random sample of nb images from each category

import random

def split_train_balanced(string, nb):
    # random.sample raises ValueError if a category holds fewer than nb images
    return [random.sample(files, k=nb) for files in string]

train_balanced = split_train_balanced(string_train, nb=400)
# nb=400 keeps 400 images per category

In [None]:
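# Copy the sampled training files, one category at a time
# (assumes source_train and dest_train, which are not defined in this notebook,
# hold the source and destination root folders)
for category, files in zip(os.listdir(train_dir), train_balanced):
    os.makedirs(os.path.join(dest_train, category), exist_ok=True)
    for name in files:
        shutil.copy(os.path.join(source_train, category, name), os.path.join(dest_train, category))

In [None]:
# Build the test split from the images that were not sampled for training

test_list_choice = []
for all_files, train_files in zip(string_train, train_balanced):
    test_list_choice.append(list(set(all_files) - set(train_files)))

test_balanced = split_train_balanced(test_list_choice, nb=80)
# nb=80 is 20% of the 400 training images

# Copy the sampled test files the same way (source_test and dest_test are likewise assumed)
for category, files in zip(os.listdir(train_dir), test_balanced):
    os.makedirs(os.path.join(dest_test, category), exist_ok=True)
    for name in files:
        shutil.copy(os.path.join(source_test, category, name), os.path.join(dest_test, category))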