# **Preparing data to train the models**

Objectives: Preprocess the cherry leaves dataset for model training.

Inputs: Cherry leaves dataset downloaded from Kaggle.

Outputs: Preprocessed dataset ready for model training.

## Important Disclaimer

Run the cells in this Jupyter Notebook strictly in order. Each cell depends on the state left by the previous ones, and several cells move, rename, or delete files, so skipping steps or running cells out of order may cause errors or unintended results.

# Steps for preparing the data:


## 1. Install prerequisites

In [1]:
# zipfile, os, shutil, random and glob ship with Python's standard library,
# so only the third-party packages need to be installed.
%pip install Pillow
%pip install keras
print("Prerequisites installed successfully")

Note: you may need to restart the kernel to use updated packages.

**Important Note**: After installing the required packages, you may need to restart the Jupyter Notebook kernel for the changes to take effect. This ensures that the newly installed libraries are properly loaded. You can restart the kernel by selecting "Kernel" > "Restart" in the Jupyter Notebook menu.
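After restarting the kernel, it can be useful to confirm which versions are actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_version` is our own and is not part of the original notebook. This matters here because `Image.Resampling`, used later for resizing, is only available in Pillow 9.1 and newer.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the third-party packages this notebook relies on
for name in ("Pillow", "keras"):
    print(name, "->", installed_version(name))
```

A result of `None` means the package is not visible to the current kernel and the install cell should be run again.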

## 2. Import libraries

In [2]:
import zipfile
import os
import shutil
from PIL import Image, ImageOps, ImageEnhance
import random
import glob
from keras.preprocessing.image import ImageDataGenerator
print("Libraries imported successfully")

Libraries imported successfully


## 3. Unzip files

This step extracts the cherry leaves dataset from its compressed .zip file into a designated directory so that it can be processed further. The .zip file is then deleted to keep the workspace clean.


In [3]:
os.chdir('/workspaces/Mildew-Detection-in-Cherry-Leaves')
print("Working directory changed to '/workspaces/Mildew-Detection-in-Cherry-Leaves'")

Working directory changed to '/workspaces/Mildew-Detection-in-Cherry-Leaves'


In [4]:
DestinationFolder = "inputs/mildew_dataset"
zip_path = os.path.join(DestinationFolder, 'cherry-leaves.zip')

# Extract the archive into its own folder, then delete it to free disk space
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(zip_path)
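
Because the cell above deletes the archive, running it a second time raises `FileNotFoundError`. As a hedged alternative, the sketch below wraps the same two steps in a hypothetical helper, `extract_and_remove` (our name, not from the original notebook), that simply reports `False` when the archive is already gone. The demo runs on a throwaway archive in a temporary directory, not on the real dataset.

```python
import os
import tempfile
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination and delete the archive.

    Returns False, without raising, if the archive no longer exists.
    """
    if not os.path.isfile(zip_path):
        return False
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(destination)
    os.remove(zip_path)
    return True


# Demo on a throwaway archive
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("leaf.txt", "placeholder")
    first = extract_and_remove(demo_zip, tmp)    # extracts and deletes
    second = extract_and_remove(demo_zip, tmp)   # archive already gone
    extracted = os.path.exists(os.path.join(tmp, "leaf.txt"))

print(first, second, extracted)
```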

## 4. Merge and Group Images by Category

This step reorganizes the dataset. The downloaded dataset is divided into 'train' and 'test' folders; we consolidate all of its images into two folders, 'healthy' and 'powdery_mildew'. After processing, the images are redistributed into new 'train', 'validation', and 'test' sets using our own split ratios.


### Define the functions

In [5]:
def merge_and_group_images(source_folders, destination_folders):
    # Create one destination folder per category
    for dest in destination_folders.values():
        os.makedirs(dest, exist_ok=True)

    # Move every image from each source split into its category folder
    for folder in source_folders:
        for category in ['healthy', 'powdery_mildew']:
            source_path = os.path.join(folder, category)
            for file in os.listdir(source_path):
                shutil.move(os.path.join(source_path, file), destination_folders[category])
            # The category folder is empty now, so it can be removed
            os.rmdir(source_path)

def remove_initial_directories(directories):
    for directory in directories:
        shutil.rmtree(directory)


def rename_images_in_folder(folder_path, new_name_prefix):
    # Prefix the file names (applied to the train images before they are merged with test)
    for i, file in enumerate(os.listdir(folder_path)):
        os.rename(os.path.join(folder_path, file), os.path.join(folder_path, f"{new_name_prefix}_{i}.jpg"))

def standardize_image_names(folder_path, prefix):
    # Rename every file to <prefix>_<n>.jpg, counting from 1
    for count, filename in enumerate(os.listdir(folder_path), start=1):
        src = os.path.join(folder_path, filename)
        dst = os.path.join(folder_path, f"{prefix}_{count}.jpg")
        os.rename(src, dst)
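
One caveat with single-pass renaming: if a folder already contains a file with one of the target names (for example when the cell is run twice), `os.rename` silently replaces it on POSIX systems and raises an error on Windows. The sketch below shows a two-pass variant that first parks every file under a temporary name. The function `rename_sequentially` is a hypothetical helper of ours, it assumes the folder contains only files, and the demo uses small text files in a temporary directory as stand-ins for images.

```python
import os
import tempfile
import uuid


def rename_sequentially(folder_path, prefix, extension=".jpg"):
    """Rename every file to <prefix>_<n><extension> without overwriting any file."""
    names = sorted(os.listdir(folder_path))
    token = uuid.uuid4().hex  # unique marker so temporary names cannot clash
    staged = []
    # Pass 1: park every file under a temporary name
    for index, name in enumerate(names, start=1):
        temp_name = f"{token}_{index}.tmp"
        os.rename(os.path.join(folder_path, name),
                  os.path.join(folder_path, temp_name))
        staged.append(temp_name)
    # Pass 2: assign the final sequential names
    for index, temp_name in enumerate(staged, start=1):
        os.rename(os.path.join(folder_path, temp_name),
                  os.path.join(folder_path, f"{prefix}_{index}{extension}"))
    return len(staged)


# Demo: one file already uses a target name ("healthy_2.jpg")
with tempfile.TemporaryDirectory() as tmp:
    for name, content in [("a.jpg", "A"), ("b.jpg", "B"), ("healthy_2.jpg", "C")]:
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write(content)
    renamed = rename_sequentially(tmp, "healthy")
    final_names = sorted(os.listdir(tmp))
    final_contents = set()
    for name in final_names:
        with open(os.path.join(tmp, name)) as handle:
            final_contents.add(handle.read())

print(renamed, final_names)
```

All three files survive the rename, which a single pass would not guarantee on POSIX.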

### Define source folders

In [6]:
source_folders = [
    "inputs/mildew_dataset/cherry-leaves/test",
    "inputs/mildew_dataset/cherry-leaves/train"
]

### Define destination folders

In [7]:
destination_folders = {
    "healthy": "inputs/mildew_dataset/cherry-leaves/healthy",
    "powdery_mildew": "inputs/mildew_dataset/cherry-leaves/powdery_mildew"
}

### Define the directories to be removed

In [8]:
directories_to_remove = [
    "inputs/mildew_dataset/cherry-leaves/test",
    "inputs/mildew_dataset/cherry-leaves/train"
]

### Define the directories to standardize

In [9]:
healthy_folder_path = '/workspaces/Mildew-Detection-in-Cherry-Leaves/inputs/mildew_dataset/cherry-leaves/healthy'
mildew_folder_path = '/workspaces/Mildew-Detection-in-Cherry-Leaves/inputs/mildew_dataset/cherry-leaves/powdery_mildew'

### Execute the functions

In [10]:
rename_images_in_folder("inputs/mildew_dataset/cherry-leaves/train/healthy", "train_healthy")
rename_images_in_folder("inputs/mildew_dataset/cherry-leaves/train/powdery_mildew", "train_mildew")
merge_and_group_images(source_folders, destination_folders)
remove_initial_directories(directories_to_remove)
standardize_image_names(healthy_folder_path, 'healthy')
standardize_image_names(mildew_folder_path, 'mildew')
print("All images have been successfully merged and grouped by category into their respective destination folders.")

All images have been successfully merged and grouped by category into their respective destination folders.


## 5. Equalizing Image Numbers to Avoid Bias

In this step, we ensure that the 'healthy' and 'powdery_mildew' categories contain the same number of images. A CNN trained on an imbalanced dataset can become biased toward the larger class, so we top up the smaller category with transformed copies of its own images until both counts match.

### Count Images in Each Folder:

In [11]:
healthy_images_count = len(os.listdir('/workspaces/Mildew-Detection-in-Cherry-Leaves/inputs/mildew_dataset/cherry-leaves/healthy'))
mildew_images_count = len(os.listdir('/workspaces/Mildew-Detection-in-Cherry-Leaves/inputs/mildew_dataset/cherry-leaves/powdery_mildew'))
print(healthy_images_count)
print(mildew_images_count)

2986
2793


### Determine the Difference and Set Image Generation Count:

In [12]:
images_to_generate = abs(healthy_images_count - mildew_images_count)

# Synthetic images go into whichever folder currently holds fewer images
base_dir = '/workspaces/Mildew-Detection-in-Cherry-Leaves/inputs/mildew_dataset/cherry-leaves'
if healthy_images_count > mildew_images_count:
    folder_to_use = os.path.join(base_dir, 'powdery_mildew')
else:
    folder_to_use = os.path.join(base_dir, 'healthy')
print(images_to_generate)

193
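
The same decision can be written as a small function that takes the per-class counts and returns which class to top up and by how many images. This is only an illustrative sketch; `plan_balancing` is a hypothetical name, and the counts are the ones printed earlier in this notebook.

```python
def plan_balancing(counts):
    """Return (minority_label, images_needed) for a dict of label -> image count."""
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    return minority, counts[majority] - counts[minority]


label, deficit = plan_balancing({"healthy": 2986, "powdery_mildew": 2793})
print(label, deficit)  # powdery_mildew 193
```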


### Define Image Transformation Functions:

In [13]:
from PIL import Image, ImageEnhance, ImageOps  # ImageEnhance is needed for the brightness adjustment

def random_transformation(image_path):
    image = Image.open(image_path)
    choice = random.randint(1, 3)
    if choice == 1:
        # Mirror
        return ImageOps.mirror(image)
    elif choice == 2:
        # Adjust brightness
        return ImageEnhance.Brightness(image).enhance(random.uniform(0.5, 1.5))
    else:
        # Rotate
        return image.rotate(random.choice([90, 180, 270]))

### Generate and Save Transformed Images:

In [14]:
# List the source images once, so newly saved synthetic images are never picked as inputs
source_images = os.listdir(folder_to_use)

for i in range(images_to_generate):
    random_image = random.choice(source_images)
    transformed_image = random_transformation(os.path.join(folder_to_use, random_image))
    save_path = f'{folder_to_use}/synthetic_{i}.jpg'
    transformed_image.save(save_path)

    print(f"Processing image number: {i}")
    print(f"Selected random image for transformation: {random_image}")
    print(f"Saving transformed image to: {save_path}")

Processing image number: 0
Selected random image for transformation: mildew_950.jpg
Saving transformed image to: /workspaces/Mildew-Detection-in-Cherry-Leaves/inputs/mildew_dataset/cherry-leaves/powdery_mildew/synthetic_0.jpg
Processing image number: 1
Selected random image for transformation: mildew_2012.jpg
Saving transformed image to: /workspaces/Mildew-Detection-in-Cherry-Leaves/inputs/mildew_dataset/cherry-leaves/powdery_mildew/synthetic_1.jpg
Processing image number: 2
Selected random image for transformation: mildew_54.jpg
Saving transformed image to: /workspaces/Mildew-Detection-in-Cherry-Leaves/inputs/mildew_dataset/cherry-leaves/powdery_mildew/synthetic_2.jpg
Processing image number: 3
Selected random image for transformation: mildew_2213.jpg
Saving transformed image to: /workspaces/Mildew-Detection-in-Cherry-Leaves/inputs/mildew_dataset/cherry-leaves/powdery_mildew/synthetic_3.jpg
Processing image number: 4
Selected random image for transformation: mildew_498.jpg
Saving tran
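
The selection above is random, so each run picks different source images. If reproducibility matters, one option is to seed the selection and draw without replacement, so that no original is transformed twice. The sketch below assumes this is desirable; `pick_sources`, the default seed, and the `synthetic_` prefix filter are our own illustrative choices, and the file names in the demo are made up.

```python
import random


def pick_sources(file_names, how_many, seed=42, synthetic_prefix="synthetic_"):
    """Choose source images reproducibly, skipping previously generated files."""
    originals = sorted(
        name for name in file_names if not name.startswith(synthetic_prefix)
    )
    rng = random.Random(seed)  # private generator, leaves the global state untouched
    if how_many <= len(originals):
        return rng.sample(originals, how_many)  # no image is used twice
    # More images requested than originals available: allow repeats
    return [rng.choice(originals) for _ in range(how_many)]


names = [f"mildew_{i}.jpg" for i in range(1, 11)] + ["synthetic_0.jpg"]
picks = pick_sources(names, 5)
print(picks)
```

Calling `pick_sources` again with the same arguments returns the same list, which makes the generated dataset repeatable.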

### Counting the updated number of files

In [15]:
healthy_images_count = len(os.listdir('/workspaces/Mildew-Detection-in-Cherry-Leaves/inputs/mildew_dataset/cherry-leaves/healthy'))
mildew_images_count = len(os.listdir('/workspaces/Mildew-Detection-in-Cherry-Leaves/inputs/mildew_dataset/cherry-leaves/powdery_mildew'))
print(healthy_images_count)
print(mildew_images_count)

2986
2986


## 6. Scale Down the Sample Images to Reduce CPU Load

In this step, we resize all images in both the 'healthy' and 'powdery_mildew' folders to 50x50 pixels. Smaller images mean fewer input values per sample, which reduces the computational load in the later steps.
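
To get a feel for the saving, the short calculation below compares the number of values per RGB image before and after resizing. The original size of 256x256 pixels is an assumption on our part (this notebook does not state it), so treat the ratio as a rough illustration.

```python
def values_per_image(width, height, channels=3):
    """Number of numeric values a model receives for one image."""
    return width * height * channels


original = values_per_image(256, 256)  # assumed original size, not confirmed here
resized = values_per_image(50, 50)     # size used in this notebook
print(original, resized, round(original / resized, 1))  # 196608 7500 26.2
```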

### Code Cell - Define Resizing Function:

In [16]:
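def resize_images(folder_path, output_size=(50, 50)):
    # Overwrite each .jpg in place with a resized copy (the aspect ratio is not preserved)
    for img_path in glob.glob(os.path.join(folder_path, '*.jpg')):
        with Image.open(img_path) as img:
            resized = img.resize(output_size, Image.Resampling.LANCZOS)
        resized.save(img_path)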

### Resize images in both folders

In [17]:
healthy_folder_path = '/workspaces/Mildew-Detection-in-Cherry-Leaves/inputs/mildew_dataset/cherry-leaves/healthy'
mildew_folder_path = '/workspaces/Mildew-Detection-in-Cherry-Leaves/inputs/mildew_dataset/cherry-leaves/powdery_mildew'
resize_images(healthy_folder_path)
resize_images(mildew_folder_path)

### Checking the new size of the images.

In [18]:
def check_random_image_dimensions(folder_path):
    images = os.listdir(folder_path)
    random_image = random.choice(images)
    img_path = os.path.join(folder_path, random_image)
    with Image.open(img_path) as img:
        width, height = img.size
        print(f"Selected Image: {random_image}")
        print(f"Dimensions: {width}x{height}")

check_random_image_dimensions(healthy_folder_path)
check_random_image_dimensions(mildew_folder_path)

Selected Image: healthy_2848.jpg
Dimensions: 50x50
Selected Image: mildew_450.jpg
Dimensions: 50x50


## 7. Splitting Data into Training, Validation, and Test Sets

Now, we will divide our images into training, validation, and test sets. This helps in model training, tuning, and evaluation.

### Code Cell - Define Splitting Function:

In [19]:
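import math

def split_images_into_sets(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    # Compare with a tolerance, because floating-point sums are not always exact
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("Sum of ratios must equal 1.0")
        return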

    labels = os.listdir(my_data_dir)
    for label in labels:

        for set_type in ['train', 'validation', 'test']:
            os.makedirs(os.path.join(my_data_dir, set_type, label), exist_ok=True)
        
        files = os.listdir(os.path.join(my_data_dir, label))
        random.shuffle(files)

        train_set_files_qty = int(len(files) * train_set_ratio)
        validation_set_files_qty = int(len(files) * validation_set_ratio)

        for i, file in enumerate(files):
            src_path = os.path.join(my_data_dir, label, file)
            if i < train_set_files_qty:
                dest_path = os.path.join(my_data_dir, 'train', label, file)
            elif i < train_set_files_qty + validation_set_files_qty:
                dest_path = os.path.join(my_data_dir, 'validation', label, file)
            else:
                dest_path = os.path.join(my_data_dir, 'test', label, file)
            
            shutil.move(src_path, dest_path)

### Define Function to erase the original folders

In [20]:
def remove_folder(path):
    if os.path.exists(path):
        shutil.rmtree(path)
        print(f"erased folder: {path}")
    else:
        print(f"this directory does not exist: {path}")

### Define the path to the folders to erase

In [21]:
healthy_folder = "inputs/mildew_dataset/cherry-leaves/healthy"
mildew_folder = "inputs/mildew_dataset/cherry-leaves/powdery_mildew"

### Code Cell - Execute Functions:

In [22]:
split_images_into_sets(
    my_data_dir='/workspaces/Mildew-Detection-in-Cherry-Leaves/inputs/mildew_dataset/cherry-leaves',
    train_set_ratio=0.7,
    validation_set_ratio=0.1,
    test_set_ratio=0.2
)
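
As a worked example of the arithmetic inside `split_images_into_sets`, the sketch below reproduces the set sizes for one class of 2986 images with the ratios used above. The train and validation sizes are truncated with `int`, and the test set receives whatever is left. `split_sizes` is a hypothetical helper written for this illustration only.

```python
def split_sizes(total, train_ratio, validation_ratio):
    """Mirror the split logic: truncate train and validation, the rest goes to test."""
    train = int(total * train_ratio)
    validation = int(total * validation_ratio)
    test = total - train - validation
    return train, validation, test


per_class = split_sizes(2986, 0.7, 0.1)
both_classes = tuple(2 * n for n in per_class)
print(per_class)     # (2090, 298, 598)
print(both_classes)  # (4180, 596, 1196)
```

The totals for both classes match the image counts that the generators report later in this notebook.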

In [23]:
remove_folder(healthy_folder)
remove_folder(mildew_folder)

erased folder: inputs/mildew_dataset/cherry-leaves/healthy
erased folder: inputs/mildew_dataset/cherry-leaves/powdery_mildew


## 8. Image Labeling for Training

### Labeling Process Overview

In this step, we label the cherry leaf images as 'healthy' or 'powdery_mildew' based on the name of the directory that contains them. The model needs these labels to learn to tell the two categories apart during training. Because the images are already separated into one folder per category, the Keras generators can assign the labels automatically, which gives us the foundation for training the CNN.
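
To illustrate how directory names turn into numeric labels, the sketch below builds a label mapping by sorting the subdirectory names alphabetically and numbering them from 0. This is a simplified stand-in written for this explanation, not the Keras implementation; `class_indices_from_directory` is a hypothetical name, and the demo uses empty folders in a temporary directory.

```python
import os
import tempfile


def class_indices_from_directory(directory):
    """Map each subdirectory name to an integer label, in alphabetical order."""
    class_names = sorted(
        entry for entry in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, entry))
    )
    return {name: index for index, name in enumerate(class_names)}


with tempfile.TemporaryDirectory() as tmp:
    # Creation order does not matter, only the sorted names do
    for name in ("powdery_mildew", "healthy"):
        os.makedirs(os.path.join(tmp, name))
    mapping = class_indices_from_directory(tmp)

print(mapping)  # {'healthy': 0, 'powdery_mildew': 1}
```

With `class_mode='binary'`, a label of 0 therefore stands for 'healthy' and 1 for 'powdery_mildew'.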

### Define Parameters:

In [24]:
train_dir = 'inputs/mildew_dataset/cherry-leaves/train'
validation_dir = 'inputs/mildew_dataset/cherry-leaves/validation'
test_dir = 'inputs/mildew_dataset/cherry-leaves/test'

img_width, img_height = 50, 50  # Adjusted image size
batch_size = 20  # Adjusted batch size

### Create ImageDataGenerators:

In [25]:
train_datagen = ImageDataGenerator(rescale=1./255)
validation_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

### Flow Images from Directories:

In [26]:
# Keras expects target_size as (height, width); labels are inferred from the subdirectory names
train_generator = train_datagen.flow_from_directory(
    train_dir, target_size=(img_height, img_width), batch_size=batch_size, class_mode='binary')

validation_generator = validation_datagen.flow_from_directory(
    validation_dir, target_size=(img_height, img_width), batch_size=batch_size, class_mode='binary')

test_generator = test_datagen.flow_from_directory(
    test_dir, target_size=(img_height, img_width), batch_size=batch_size, class_mode='binary')

Found 4180 images belonging to 2 classes.
Found 596 images belonging to 2 classes.
Found 1196 images belonging to 2 classes.


In [27]:
# Display class indices
print("Train Set Class Indices:", train_generator.class_indices)
print("Validation Set Class Indices:", validation_generator.class_indices)
print("Test Set Class Indices:", test_generator.class_indices)

# Optional: Display some image batches with their labels
for image_batch, label_batch in train_generator:
    print("Image batch shape:", image_batch.shape)
    print("Label batch shape:", label_batch.shape)
    break  # Display only the first batch

Train Set Class Indices: {'healthy': 0, 'powdery_mildew': 1}
Validation Set Class Indices: {'healthy': 0, 'powdery_mildew': 1}
Test Set Class Indices: {'healthy': 0, 'powdery_mildew': 1}
Image batch shape: (20, 50, 50, 3)
Label batch shape: (20,)
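
The batch shape above shows 20 images per batch. As a rough check of how many batches each set yields per pass, the sketch below divides the reported image counts by the batch size, assuming the last batch simply holds the remaining images. `batches_per_epoch` is a hypothetical helper for this calculation.

```python
def batches_per_epoch(sample_count, batch_size):
    """Return (number_of_batches, size_of_last_batch) for one pass over the data."""
    full_batches, remainder = divmod(sample_count, batch_size)
    total = full_batches + (1 if remainder else 0)
    last = remainder if remainder else batch_size
    return total, last


for set_name, count in [("train", 4180), ("validation", 596), ("test", 1196)]:
    print(set_name, batches_per_epoch(count, 20))
```

Under that assumption, the training set divides evenly into 209 batches, while the validation and test sets each end with a smaller batch of 16 images.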
