# **Data Collection**

## Objectives

* Fetch data from Kaggle
* Prepare data for further processes.

## Inputs

* Kaggle JSON file - the authentication token. 

## Outputs

* Generate Dataset: inputs/dataset/cherry-leaves

## Additional Comments

* The code in this notebook was taken from the Code Institute Malaria Detector walkthrough sample project and adapted to suit this project.

---

# Import packages

In [1]:
%pip install -r /workspaces/mildew-detection-in-cherry-leaves/requirements.txt


Note: you may need to restart the kernel to use updated packages.


In [2]:
import numpy
import os

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so the working directory must be changed when running a notebook in the editor.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/mildew-detection-in-cherry-leaves/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

'/workspaces/mildew-detection-in-cherry-leaves'

# Install Kaggle

In [6]:
%pip install kaggle==1.5.12


Note: you may need to restart the kernel to use updated packages.


Change the Kaggle configuration directory to the current working directory and set permissions for the Kaggle authentication JSON

In [7]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
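
The `chmod` call above assumes `kaggle.json` has already been placed in the project root. A minimal sketch of the same step in pure Python, with an explicit check that the token file exists; the helper name `secure_kaggle_token` is hypothetical and is not part of the Kaggle package:

```python
import os
import stat


def secure_kaggle_token(config_dir):
    """Point Kaggle at config_dir and restrict kaggle.json to owner read/write."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Expected Kaggle token at {token_path}")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Same permissions as `chmod 600 kaggle.json`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

Calling `secure_kaggle_token(os.getcwd())` would stand in for the cell above, and raises a clear error if the token has not been uploaded.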

Set the Kaggle dataset path and download the dataset into the destination folder.

In [8]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/dataset"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/dataset
 96%|████████████████████████████████████▌ | 53.0M/55.0M [00:01<00:00, 44.7MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:01<00:00, 44.0MB/s]


Unzip the downloaded file and delete the zip file.

In [9]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
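
After extraction it can be worth confirming which label folders were created before any cleaning runs. A minimal sketch, assuming the archive unpacks to `inputs/dataset/cherry-leaves` as listed under Outputs; the helper name `list_subfolders` is hypothetical:

```python
import os


def list_subfolders(path):
    """Return the sorted names of the directories directly inside path."""
    return sorted(
        entry for entry in os.listdir(path)
        if os.path.isdir(os.path.join(path, entry))
    )
```

For example, `list_subfolders("inputs/dataset/cherry-leaves")` should list one folder per class label. Unlike a bare `os.listdir`, it ignores any stray files at the top level.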

---

# Data Preparation


## Data Cleaning
Check that all files in both label folders are image files, and remove any file that is not.

For each image or non-image file, a 1 is appended to the corresponding list, so the length of each list gives the totals printed for that folder.

In [10]:
def remove_non_image_file(my_data_dir):
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        files = os.listdir(my_data_dir + '/' + folder)
        num_non_images = []
        num_images = []
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                file_location = my_data_dir + '/' + folder + '/' + given_file
                os.remove(file_location)  # remove non image file
                num_non_images.append(1)
            else:
                num_images.append(1)
        print(f"Folder: {folder} - has image file", len(num_images))
        print(f"Folder: {folder} - has non-image file", len(num_non_images))

In [11]:
remove_non_image_file(my_data_dir='inputs/dataset/cherry-leaves')

Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0
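
The same tally can be kept with plain integer counters instead of lists of 1s. A minimal sketch of a non-destructive variant that only reports counts and deletes nothing, which may be useful as a dry run; the function name `count_image_files` is hypothetical, and the extension tuple mirrors the one in `remove_non_image_file`:

```python
import os


def count_image_files(my_data_dir, image_extension=('.png', '.jpg', '.jpeg')):
    """Return {folder: (n_images, n_non_images)} without deleting anything."""
    counts = {}
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files at the top level
        n_images = 0
        n_non_images = 0
        for given_file in os.listdir(folder_path):
            if given_file.lower().endswith(image_extension):
                n_images += 1
            else:
                n_non_images += 1
        counts[folder] = (n_images, n_non_images)
    return counts
```

Running `count_image_files('inputs/dataset/cherry-leaves')` before the removal step shows how many files would be deleted in each folder.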


## Split train validation test set

The dataset will be split into three subsets: a training set, a validation set and a test set.

The split_train_validation_test_images function takes four parameters: my_data_dir, train_set_ratio, validation_set_ratio and test_set_ratio.
The values of these parameters are set when calling the function.

* The function first checks that the values of train_set_ratio, validation_set_ratio and test_set_ratio sum to 1.
* It then gets the data labels from the folder names in the dataset directory.
* If a 'test' folder is already present, the function does nothing further, as the split has already been made.
* Otherwise, folders are created for the train, validation and test subsets, each with a sub-folder for each dataset label.
* The files for each label are shuffled and moved into the matching label folder under train, validation or test. The train and validation folders are filled up to the number of files set by their ratio parameters, and the remaining files go to test.
* Each original label folder is removed once all of its files have been moved.

In [12]:
import math
import os
import shutil
import random


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):

    # compare with a tolerance: an exact float comparison with 1.0 can fail for valid ratios
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # gets classes labels
    labels = os.listdir(my_data_dir)  # it should get only the folder name
    if 'test' in labels:
        pass
    else:
        # create train, validation and test folders with classes labels sub-folder
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # move a given file to the train set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/train/' + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move a given file to the validation set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/' + file_name)

                else:
                    # move given file to test set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/' + file_name)

                count += 1

            os.rmdir(my_data_dir + '/' + label)
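
`random.shuffle` draws on the global random state, so each fresh run of the function above can assign different files to each subset. If a repeatable split is wanted, one option is to shuffle with a seeded `random.Random` instance. A minimal sketch; the helper name `shuffled_copy` and the seed value are assumptions and are not part of the original function:

```python
import random


def shuffled_copy(files, seed=42):
    """Return a shuffled copy of files; the same seed gives the same order."""
    ordered = sorted(files)  # os.listdir order is arbitrary, so sort first
    random.Random(seed).shuffle(ordered)
    return ordered
```

Replacing `random.shuffle(files)` with `files = shuffled_copy(files)` would keep the rest of the function unchanged.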

* The training set receives 70% of the data (ratio 0.70).
* The validation set receives 10% of the data (ratio 0.10).
* The test set receives 20% of the data (ratio 0.20).

In [13]:
split_train_validation_test_images(my_data_dir="inputs/dataset/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
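
Because the train and validation counts are truncated with `int()`, the test subset receives whatever files remain. A minimal sketch of that arithmetic, using the 2104 images per label reported by the cleaning step; the helper name `split_sizes` is hypothetical:

```python
def split_sizes(n_files, train_set_ratio, validation_set_ratio):
    """Mirror the counting logic of split_train_validation_test_images."""
    train_qty = int(n_files * train_set_ratio)
    validation_qty = int(n_files * validation_set_ratio)
    test_qty = n_files - train_qty - validation_qty  # remainder goes to test
    return train_qty, validation_qty, test_qty


print(split_sizes(2104, 0.7, 0.1))  # (1472, 210, 422)
```

Under these assumptions each label contributes 1472 training, 210 validation and 422 test images, so the realised split is very close to, but not exactly, 70-10-20.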

---

# Conclusions

* The cherry-leaves dataset has been successfully downloaded from Kaggle.
* The data has been cleaned so that only image files remain, as required for further image analysis processes.
* The data has been successfully split into train, validation and test subsets with a 70-10-20 ratio. Holding out separate validation and test data should help detect and limit overfitting or underfitting during ML model training.

# Next steps

* A visual study will be conducted to visually differentiate a cherry leaf that is healthy from one that contains powdery mildew, satisfying business requirement 1.