# **Data Preparation**

## Objectives

* Remove non-image files from the dataset
* Split the dataset into train, test and validation

## Inputs

* inputs/dataset/row/cherry-leaves 

## Outputs

* A cleaned dataset: each label folder under inputs/dataset/row/cherry-leaves contains only image files
* Three subsets of the given input, created in the same destination folder:
  * Train
  * Test
  * Validation


---

## Change working directory

* Change the working directory from its current folder to its parent folder

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/mildew-detection-in-cherry-leaves/jupyter_notebooks'

* Make the parent of the current directory the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


* Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/mildew-detection-in-cherry-leaves'

---

### Clean Data
- Provide the dataset directory to clean
- Check each label folder for files that are not images (.png, .jpg or .jpeg)
- Delete any non-image file found in the folder
- Count the image and non-image files in each folder

In [4]:
def remove_non_image_file(input_dir):
    """Delete non-image files in each label folder and report the counts."""
    image_extension = ('.png', '.jpg', '.jpeg')
    for folder in os.listdir(input_dir):
        folder_path = os.path.join(input_dir, folder)
        image_count = 0
        non_image_count = 0
        for given_file in os.listdir(folder_path):
            if given_file.lower().endswith(image_extension):
                image_count += 1
            else:
                # remove the non-image file
                os.remove(os.path.join(folder_path, given_file))
                non_image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [5]:
input_dir = "inputs/dataset/row/cherry-leaves"
remove_non_image_file(input_dir)

Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0
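
Because the cleaning step deletes files, it can be useful to inspect a folder without changing it first. The following is a minimal sketch of such a dry run; `count_file_extensions` is a hypothetical helper that is not part of the original notebook, and it assumes the same layout of one sub-folder per label.

```python
import os
from collections import Counter


def count_file_extensions(input_dir):
    """Hypothetical helper: tally file extensions per label folder.

    Nothing is deleted; the function only lists directory contents.
    """
    summary = {}
    for folder in sorted(os.listdir(input_dir)):
        folder_path = os.path.join(input_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files at the top level
        # lower-case the extension so '.JPG' and '.jpg' are counted together
        summary[folder] = Counter(
            os.path.splitext(name)[1].lower()
            for name in os.listdir(folder_path)
        )
    return summary
```

Calling `count_file_extensions("inputs/dataset/row/cherry-leaves")` returns one `Counter` per label folder, so any unexpected extension shows up before anything is removed.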


---

### Split the dataset into Train, Test and Validation

In [10]:
import math
import os
import random
import shutil


def split_train_validation_test_images(input_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    """Move each label folder's files into train/validation/test sub-folders."""

    # compare with a tolerance, since exact float equality is fragile
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # get the class labels; input_dir should contain only the label folders
    labels = os.listdir(input_dir)
    if 'test' in labels:
        # a 'test' folder suggests the dataset was already split, so do nothing
        return

    # create train, validation and test folders with a sub-folder per class label
    for folder in ['train', 'validation', 'test']:
        for label in labels:
            os.makedirs(name=os.path.join(input_dir, folder, label))

    for label in labels:
        label_dir = os.path.join(input_dir, label)
        files = os.listdir(label_dir)
        random.shuffle(files)

        train_set_files_qty = int(len(files) * train_set_ratio)
        validation_set_files_qty = int(len(files) * validation_set_ratio)

        for count, file_name in enumerate(files, start=1):
            if count <= train_set_files_qty:
                subset = 'train'
            elif count <= (train_set_files_qty + validation_set_files_qty):
                subset = 'validation'
            else:
                # the remaining files go to the test set
                subset = 'test'
            shutil.move(os.path.join(label_dir, file_name),
                        os.path.join(input_dir, subset, label, file_name))

        # the original label folder is now empty and can be removed
        os.rmdir(label_dir)
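
The split sizes are truncated with `int()`, and the test set receives whatever is left over, so the test share can be slightly larger than its nominal ratio. As an illustration only, the sketch below repeats that arithmetic for the 2104 images per folder reported by the cleaning step; `split_sizes` is a hypothetical helper, not part of the original notebook.

```python
def split_sizes(total_files, train_set_ratio, validation_set_ratio):
    """Hypothetical helper: mirror the integer arithmetic of the split."""
    train_qty = int(total_files * train_set_ratio)
    validation_qty = int(total_files * validation_set_ratio)
    # the test set is the remainder, not int(total_files * test_set_ratio)
    test_qty = total_files - train_qty - validation_qty
    return train_qty, validation_qty, test_qty


# 2104 is the per-folder image count printed by the cleaning step
print(split_sizes(2104, 0.7, 0.1))  # (1472, 210, 422)
```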

In [11]:
split_train_validation_test_images(input_dir="inputs/dataset/row/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
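
The split cell prints nothing, so a quick count of the files in each subset can confirm that it ran as intended. This is a minimal sketch under the assumption that the folders are laid out as `<input_dir>/<subset>/<label>`; `count_split_files` is a hypothetical helper, not part of the original notebook.

```python
import os


def count_split_files(input_dir, subsets=('train', 'validation', 'test')):
    """Hypothetical helper: return {subset: {label: number of files}}."""
    counts = {}
    for subset in subsets:
        subset_dir = os.path.join(input_dir, subset)
        # one entry per label folder found inside the subset folder
        counts[subset] = {
            label: len(os.listdir(os.path.join(subset_dir, label)))
            for label in sorted(os.listdir(subset_dir))
        }
    return counts
```

For example, `count_split_files("inputs/dataset/row/cherry-leaves")` returns a nested dictionary whose per-label totals across the three subsets should add up to the image counts reported by the cleaning step.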

---

## Summary

* In this notebook, all non-image files are removed from the dataset.
* The dataset is then divided into three subsets:
  * Train
  * Test
  * Validation