# Data Collection

## Objective

- Download the cherry leaves image dataset from Kaggle.
- Prepare data for analysis and modelling.

## Inputs

- Kaggle authentication token [JSON file].
- [Kaggle dataset](https://www.kaggle.com/codeinstitute/cherry-leaves).

## Outputs

- Dataset split into train, validation and test sets.
- Images saved to train, validation and test folders (within inputs/cherry_leaves_raw_dataset/cherry-leaves).


## Change working directory

Change working directory from current to parent folder.

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/mildew-detection/jupyter_notebooks'

In [2]:
os.chdir("/workspace/mildew-detection")
print("You set a new current directory.")

You set a new current directory.


Confirm new current directory.

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/mildew-detection'
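
The hard-coded path above only works inside this particular workspace. As a minimal sketch, the parent folder could instead be derived from the current directory. The helper name `parent_of` is hypothetical and not part of the original notebook.

```python
import os


def parent_of(path):
    """Return the parent folder of `path` (hypothetical helper)."""
    # normpath drops any trailing slash so dirname steps up exactly one level
    return os.path.dirname(os.path.normpath(path))


# In the notebook this could be used as: os.chdir(parent_of(os.getcwd()))
print(parent_of("/workspace/mildew-detection/jupyter_notebooks"))
```

This keeps the notebook portable if the repository is cloned to a different location.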

## Kaggle Installation

In [4]:
pip install kaggle

You should consider upgrading via the '/home/gitpod/.pyenv/versions/3.8.12/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


In [5]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
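
The cell above assumes that `kaggle.json` sits in the current directory. The sketch below shows one way to check the token before downloading. It assumes the token file holds the `username` and `key` fields; the helper name `check_kaggle_token` is hypothetical.

```python
import json
import os
import stat


def check_kaggle_token(config_dir):
    """Validate kaggle.json in `config_dir` and restrict it to its owner (hypothetical helper)."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"No kaggle.json found in {config_dir}")

    with open(token_path) as token_file:
        token = json.load(token_file)

    # Assumption: a valid token contains both of these fields
    missing = [field for field in ("username", "key") if field not in token]
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {missing}")

    # Same effect as `chmod 600 kaggle.json`: read/write for the owner only
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

Calling `check_kaggle_token(os.getcwd())` before the download would surface a missing or incomplete token early, rather than as a CLI error.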

Get the dataset path from the [Kaggle URL](https://www.kaggle.com/datasets/codeinstitute/cherry-leaves) and set the destination folder.

In [6]:
KaggleDatasetPathway = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry_leaves_raw_dataset"
! kaggle datasets download -d {KaggleDatasetPathway} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/cherry_leaves_raw_dataset
 84%|███████████████████████████████▊      | 46.0M/55.0M [00:00<00:00, 61.4MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:00<00:00, 68.6MB/s]


Unzip the data from the zipfile, and then delete the zipfile.

In [7]:
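import zipfile
zip_path = os.path.join(DestinationFolder, "cherry-leaves.zip")
# Avoid naming the handle `zip`, which would shadow the built-in function
with zipfile.ZipFile(zip_path, "r") as zip_ref:
    zip_ref.extractall(DestinationFolder)
os.remove(zip_path)  # delete the archive once it has been extracted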
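
The extract-then-delete step can also be wrapped in a small function that validates the archive first, so the zip file is only removed after a successful extraction. This is a sketch, and the helper name `extract_and_delete` is hypothetical.

```python
import os
import zipfile


def extract_and_delete(zip_path, destination):
    """Extract `zip_path` into `destination`, then delete the archive (hypothetical helper)."""
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid zip archive")

    with zipfile.ZipFile(zip_path, "r") as archive:
        names = archive.namelist()
        archive.extractall(destination)

    # Reached only if extraction did not raise
    os.remove(zip_path)
    return names
```

The returned list of archive members gives a quick way to confirm which folders were extracted.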

In [8]:
!ls

inputs		   kaggle.json	README.md	  runtime.txt
jupyter_notebooks  Procfile	requirements.txt  setup.sh


## Data Preparation

Remove non-image files from the dataset.

In [9]:
def remove_non_image_file(my_data_dir):
    """Delete every file that is not a .png, .jpg or .jpeg image and report counts per folder."""
    image_extension = ('.png', '.jpg', '.jpeg')
    for folder in os.listdir(my_data_dir):
        folder_path = os.path.join(my_data_dir, folder)
        image_count = 0
        non_image_count = 0
        for given_file in os.listdir(folder_path):
            if given_file.lower().endswith(image_extension):
                image_count += 1
            else:
                # Remove the non-image file
                os.remove(os.path.join(folder_path, given_file))
                non_image_count += 1
        print(f"Folder: {folder} - has image file {image_count}")
        print(f"Folder: {folder} - has non-image file {non_image_count}")

In [10]:
remove_non_image_file(my_data_dir='inputs/cherry_leaves_raw_dataset/cherry-leaves/')

Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0
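
Since the cleanup above deletes files in place, it can be worth inspecting the folders first. The following sketch counts files per extension in each class folder without deleting anything. The helper name `count_extensions` is hypothetical.

```python
import os
from collections import Counter


def count_extensions(my_data_dir):
    """Count files per lowercase extension in each class folder (hypothetical, read-only helper)."""
    counts = {}
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # ignore stray files next to the class folders
        counts[folder] = Counter(
            os.path.splitext(name)[1].lower() for name in os.listdir(folder_path)
        )
    return counts
```

Any extension other than `.png`, `.jpg` or `.jpeg` in the result indicates files that the cleanup function would remove.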


Split data into train, validation and test sets.

In [1]:
import math
import os
import random
import shutil


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    """Move the images of each class folder into train, validation and test sub-folders."""

    # Compare with a tolerance, since summed floats may not equal 1.0 exactly
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # Get class labels
    labels = os.listdir(my_data_dir)  # retrieve folder names

    # Skip the split if it has already been done
    if 'test' in labels:
        return

    # Create train, validation, and test folders with class label sub-folders
    for folder in ['train', 'validation', 'test']:
        for label in labels:
            os.makedirs(name=os.path.join(my_data_dir, folder, label))

    for label in labels:
        files = os.listdir(os.path.join(my_data_dir, label))
        random.shuffle(files)

        train_set_files_qty = int(len(files) * train_set_ratio)
        validation_set_files_qty = int(len(files) * validation_set_ratio)

        # Move files to train set until full, then validation set until full, then test set
        for count, file_name in enumerate(files, start=1):
            if count <= train_set_files_qty:
                target_set = 'train'
            elif count <= (train_set_files_qty + validation_set_files_qty):
                target_set = 'validation'
            else:
                target_set = 'test'

            shutil.move(os.path.join(my_data_dir, label, file_name),
                        os.path.join(my_data_dir, target_set, label, file_name))

        # The class folder is now empty and can be removed
        os.rmdir(os.path.join(my_data_dir, label))

    # Print the number of files in each set after splitting
    print("Number of files in Train set:")
    for label in labels:
        train_files = os.listdir(os.path.join(my_data_dir, 'train', label))
        print(f"Class {label}: {len(train_files)}")

    print("\nNumber of files in Validation set:")
    for label in labels:
        validation_files = os.listdir(os.path.join(my_data_dir, 'validation', label))
        print(f"Class {label}: {len(validation_files)}")

    print("\nNumber of files in Test set:")
    for label in labels:
        test_files = os.listdir(os.path.join(my_data_dir, 'test', label))
        print(f"Class {label}: {len(test_files)}")

A 70/10/20% split is used for the train/validation/test sets.
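
To see what these ratios mean in file counts, the sketch below reproduces the arithmetic of the split function for one class folder of 2104 images (the count reported by the cleanup step). The train and validation counts are truncated to integers, and the test set receives the remainder. The helper name `split_sizes` is hypothetical.

```python
def split_sizes(total, train_ratio, validation_ratio):
    """Return (train, validation, test) file counts for one class folder (hypothetical helper)."""
    train = int(total * train_ratio)            # truncated, as in the split function
    validation = int(total * validation_ratio)  # truncated, as in the split function
    test = total - train - validation           # the test set takes whatever is left
    return train, validation, test


print(split_sizes(2104, 0.7, 0.1))  # (1472, 210, 422)
```

Because of the truncation, the test set ends up slightly larger than exactly 20% of the files.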

In [None]:
split_train_validation_test_images(my_data_dir="inputs/cherry_leaves_raw_dataset/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
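
The split function shuffles with an unseeded `random.shuffle`, so each fresh run assigns different images to each set. If a repeatable split is wanted, one option is a seeded shuffle, sketched below. The helper name `seeded_shuffle` and the seed value are assumptions and are not part of the original notebook.

```python
import random


def seeded_shuffle(files, seed):
    """Return a shuffled copy of `files` that is identical for a given seed (hypothetical helper)."""
    # os.listdir gives no ordering guarantee, so sort before shuffling
    ordered = sorted(files)
    # A private Random instance leaves the global random state untouched
    random.Random(seed).shuffle(ordered)
    return ordered


first = seeded_shuffle(["c.jpg", "a.jpg", "b.jpg", "d.jpg"], seed=42)
second = seeded_shuffle(["d.jpg", "b.jpg", "a.jpg", "c.jpg"], seed=42)
print(first == second)  # True
```

Using the result in place of the in-place shuffle would make the train, validation and test membership reproducible across runs.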

## Comments
The data has been downloaded, extracted and split into train, validation and test sets for use in the modelling process.