# Preprocessing Plant Disease Images

In this section, we preprocess the plant disease image dataset so that it is clean and ready for machine learning. This involves organizing directories, validating the data, normalizing images, and handling imbalanced classes.

## Libraries Overview

- **os, cv2, shutil, numpy, pandas**: These libraries handle file-system access and moving directories (`os`, `shutil`), image processing (`cv2`), and data manipulation (`numpy`, `pandas`).
- **sklearn.utils**: Provides utilities for resampling, which is used to handle class imbalance.


In [1]:
# Import necessary libraries for preprocessing
import os
import cv2
import shutil
import numpy as np
import pandas as pd
from sklearn.utils import resample

## Organize Directories

The dataset is initially stored in a complex directory structure. We simplify this by moving the train and validation directories to a more accessible location and removing any unnecessary folders.


In [12]:
# Define the source and destination paths
src_train = '../src/data/new-plant-diseases/New Plant Diseases Dataset(Augmented)/New Plant Diseases Dataset(Augmented)/train'
src_valid = '../src/data/new-plant-diseases/New Plant Diseases Dataset(Augmented)/New Plant Diseases Dataset(Augmented)/valid'
dest_train = '../src/data/new-plant-diseases/train'
dest_valid = '../src/data/new-plant-diseases/valid'


# Move the train directory if it doesn't already exist in the destination
if not os.path.exists(dest_train):
    shutil.move(src_train, dest_train)
    print(f"Moved 'train' directory to '{dest_train}'")
else:
    print(f"'train' directory already exists at '{dest_train}'")

# Move the valid directory if it doesn't already exist in the destination
if not os.path.exists(dest_valid):
    shutil.move(src_valid, dest_valid)
    print(f"Moved 'valid' directory to '{dest_valid}'")
else:
    print(f"'valid' directory already exists at '{dest_valid}'")

# Define the leftover folder path
unused_folder = '../src/data/new-plant-diseases/New Plant Diseases Dataset(Augmented)'

# Remove the leftover folder if it exists
# (shutil.rmtree also deletes anything still inside it)
if os.path.exists(unused_folder):
    shutil.rmtree(unused_folder)
    print(f"Removed the unused folder '{unused_folder}'")
else:
    print(f"Folder '{unused_folder}' does not exist")


'train' directory already exists at '../src/data/new-plant-diseases/train'
'valid' directory already exists at '../src/data/new-plant-diseases/valid'
Removed the unused folder '../src/data/new-plant-diseases/New Plant Diseases Dataset(Augmented)'
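
Because `shutil.rmtree` deletes a folder whether or not it is empty, a more defensive variant of this step could move a directory only when the destination is missing and delete the leftover folder only when no files remain in it. The following is a minimal sketch under those assumptions; the helper names `move_if_missing` and `remove_if_empty` and the temporary directory layout are hypothetical and not part of the original notebook.

```python
import shutil
import tempfile
from pathlib import Path


def move_if_missing(src: Path, dest: Path) -> bool:
    """Move src to dest unless dest already exists; return True if moved."""
    if dest.exists():
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dest))
    return True


def remove_if_empty(folder: Path) -> bool:
    """Delete folder only when it contains no files at any depth."""
    if not folder.is_dir():
        return False
    if any(p.is_file() for p in folder.rglob('*')):
        return False
    shutil.rmtree(folder)
    return True


# Demonstration on a throwaway directory tree (hypothetical layout)
root = Path(tempfile.mkdtemp())
nested = root / 'outer' / 'inner' / 'train'
nested.mkdir(parents=True)
(nested / 'leaf.jpg').write_bytes(b'demo')

moved = move_if_missing(nested, root / 'train')        # first call moves the folder
moved_again = move_if_missing(nested, root / 'train')  # second call is a no-op
removed = remove_if_empty(root / 'outer')              # only empty folders are left
```

Running the cell twice leaves the data untouched, which matches the idempotent behavior of the original `if not os.path.exists(...)` checks.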


## Load Images and Basic Analysis

The `load_images` function reads image file paths and labels into a DataFrame for both the training and validation datasets. This allows us to perform data manipulation and analysis using Pandas.


In [2]:
# Function to load images and labels from a folder
def load_images(folder):
    images = []
    labels = []
    for subdir, _, files in os.walk(folder):
        for file in files:
            if file.endswith(('.jpg', '.JPG', '.png')):
                label = os.path.basename(subdir)
                img_path = os.path.join(subdir, file)
                images.append(img_path)
                labels.append(label)
    return pd.DataFrame({'image_path': images, 'label': labels})

# Load Datasets
#plant_disease_df_train = load_images('../src/data/new-plant-diseases/random_train')
#plant_disease_df_valid = load_images('../src/data/new-plant-diseases/random_valid')

plant_disease_df_train = load_images(r'..\src\data\new-plant-diseases\random_train')
plant_disease_df_valid = load_images(r'..\src\data\new-plant-diseases\random_valid')


# Basic Analysis
print("Train DataFrame Shape:", plant_disease_df_train.shape)
print("Valid DataFrame Shape:", plant_disease_df_valid.shape)

Train DataFrame Shape: (144, 2)
Valid DataFrame Shape: (126, 2)


## Data Cleaning and Validation

We ensure data quality by checking for missing, invalid, or duplicate entries in our datasets. This step is crucial to prevent errors during model training and evaluation.


In [4]:
# Check for missing or invalid data
def check_missing_invalid(df):
    # Check for missing image paths or labels
    missing_images = df['image_path'].isnull().sum()
    missing_labels = df['label'].isnull().sum()
    print(f"Missing image paths: {missing_images}")
    print(f"Missing labels: {missing_labels}")

    # Check for invalid image paths (e.g., file does not exist)
    invalid_images = df[~df['image_path'].apply(os.path.exists)]
    print(f"Invalid image paths: {len(invalid_images)}")

    return invalid_images

# Check train dataset
print("Checking train dataset for missing or invalid data...")
invalid_train_images = check_missing_invalid(plant_disease_df_train)

# Check valid dataset
print("Checking valid dataset for missing or invalid data...")
invalid_valid_images = check_missing_invalid(plant_disease_df_valid)

# Check for duplicate entries
def check_duplicates(df):
    duplicates = df[df.duplicated(subset=['image_path'])]
    print(f"Duplicate entries: {len(duplicates)}")
    return duplicates

# Check train dataset
print("Checking train dataset for duplicates...")
duplicate_train_images = check_duplicates(plant_disease_df_train)

# Check valid dataset
print("Checking valid dataset for duplicates...")
duplicate_valid_images = check_duplicates(plant_disease_df_valid)


Checking train dataset for missing or invalid data...
Missing image paths: 0
Missing labels: 0
Invalid image paths: 0
Checking valid dataset for missing or invalid data...
Missing image paths: 0
Missing labels: 0
Invalid image paths: 0
Checking train dataset for duplicates...
Duplicate entries: 0
Checking valid dataset for duplicates...
Duplicate entries: 0
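
The duplicate check above compares file paths, and paths collected by walking a directory tree are normally unique, so it is expected to report zero. To catch the same image stored under two different names, one option is to compare file contents instead. The sketch below groups files by a SHA-256 digest of their bytes; the function name `find_content_duplicates` and the temporary demo files are assumptions made for illustration.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def find_content_duplicates(paths):
    """Group file paths by the SHA-256 digest of their bytes."""
    by_digest = defaultdict(list)
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        by_digest[digest].append(str(path))
    # Keep only digests shared by more than one file
    return [group for group in by_digest.values() if len(group) > 1]


# Demo with three small files, two of which have identical bytes
tmp = Path(tempfile.mkdtemp())
(tmp / 'a.jpg').write_bytes(b'same bytes')
(tmp / 'b.jpg').write_bytes(b'same bytes')
(tmp / 'c.jpg').write_bytes(b'different bytes')

duplicate_groups = find_content_duplicates(sorted(tmp.glob('*.jpg')))
```

With the notebook's DataFrames, the same function could be called as `find_content_duplicates(plant_disease_df_train['image_path'])`. Note that this only finds byte-identical files; re-encoded or resized copies would need a perceptual comparison.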


## Normalizing Images

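Images are read, normalized (pixel values scaled between 0 and 1), and processed in batches to limit memory usage. Before each image is saved, it is converted back to 8-bit values, so the files on disk hold pixel values from 0 to 255. This step ensures consistent image data input to the model.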


In [7]:
# Function to read and normalize images
def read_and_normalize_image(image_path):
    img = cv2.imread(image_path)
    if img is not None:
        img = img / 255.0  # Normalize the image
    return img

# Function to normalize and process images in batches
def normalize_dataset_in_batches(df, batch_size=32):
    total_images = len(df)
    for start in range(0, total_images, batch_size):
        end = min(start + batch_size, total_images)
        batch_df = df.iloc[start:end]

        normalized_images = []
        kept_labels = []
        for img_path, label in zip(batch_df['image_path'], batch_df['label']):
            img = read_and_normalize_image(img_path)
            if img is not None:
                # Keep the label only when its image loaded, so images and labels stay aligned
                normalized_images.append(img)
                kept_labels.append(label)

        # Stacking assumes that all images in the batch have the same shape
        yield np.array(normalized_images), np.array(kept_labels, dtype=object)

# Function to save normalized images
def save_normalized_images(batch_images, batch_labels, output_folder, start_index):
    os.makedirs(output_folder, exist_ok=True)

    for i, (img, label) in enumerate(zip(batch_images, batch_labels)):
        label_folder = os.path.join(output_folder, label)
        os.makedirs(label_folder, exist_ok=True)

        save_path = os.path.join(label_folder, f"image_{start_index + i}.jpg")
        # Round before converting back to uint8 to avoid truncation errors
        img_to_save = np.rint(img * 255).astype(np.uint8)
        cv2.imwrite(save_path, img_to_save)
        print(f"Saved: {save_path}")


data_folders = ['random_train', 'random_valid']  # TODO: remove the 'random_' prefix
base_input_path = r'..\src\data\new-plant-diseases'
base_output_path = r'..\src\data\new-plant-diseases\random_normalized'

# Normalize and save images
for folder in data_folders:
    input_path = os.path.join(base_input_path, folder)
    output_folder = os.path.join(base_output_path, folder)
    
    # Load the image paths and labels
    plant_disease_df = load_images(input_path)
    
    # Process and normalize images in batches
    batch_size = 32
    print(f"Processing {folder} dataset in batches of size: {batch_size}")
    start_index = 0
    for batch_images, batch_labels in normalize_dataset_in_batches(plant_disease_df, batch_size):
        print(f"Processed a batch of size: {len(batch_images)}")
        save_normalized_images(batch_images, batch_labels, output_folder, start_index)
        start_index += len(batch_images)


Processing random_train dataset in batches of size: 32
Processed a batch of size: 32
Saved: ..\src\data\new-plant-diseases\random_normalized\random_train\Apple___Apple_scab\image_0.jpg
Saved: ..\src\data\new-plant-diseases\random_normalized\random_train\Apple___Apple_scab\image_1.jpg
Saved: ..\src\data\new-plant-diseases\random_normalized\random_train\Apple___Apple_scab\image_2.jpg
Saved: ..\src\data\new-plant-diseases\random_normalized\random_train\Apple___Apple_scab\image_3.jpg
Saved: ..\src\data\new-plant-diseases\random_normalized\random_train\Apple___Apple_scab\image_4.jpg
Saved: ..\src\data\new-plant-diseases\random_normalized\random_train\Apple___Apple_scab\image_5.jpg
Saved: ..\src\data\new-plant-diseases\random_normalized\random_train\Apple___Apple_scab\image_6.jpg
Saved: ..\src\data\new-plant-diseases\random_normalized\random_train\Apple___Apple_scab\image_7.jpg
Saved: ..\src\data\new-plant-diseases\random_normalized\random_train\Apple___Apple_scab\image_8.jpg
Saved: ..\src\d
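
The normalization above scales pixel values to the range 0 to 1, and the save step converts them back to 8-bit integers. The following NumPy sketch illustrates that round trip on a synthetic array; the random image is an assumption standing in for the `uint8` array that `cv2.imread` returns.

```python
import numpy as np

# Synthetic 8-bit image standing in for the result of cv2.imread (assumption)
rng = np.random.default_rng(13)
original = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

# Scale to [0, 1], as read_and_normalize_image does
normalized = original / 255.0

# Convert back to 8-bit values, rounding first so that tiny
# floating-point errors cannot push a value down by one
restored = np.rint(normalized * 255).astype(np.uint8)

value_range = (float(normalized.min()), float(normalized.max()))
round_trip_is_exact = bool(np.array_equal(restored, original))
```

Since the saved files again contain values from 0 to 255, any code that later loads them would need to divide by 255 once more if it expects inputs in the range 0 to 1.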

In [8]:
# Define the path to check
path_to_check = r'..\src\data\new-plant-diseases\random_valid'

# Check whether the path exists
if os.path.exists(path_to_check):
    print(f"The path '{path_to_check}' exists.")
else:
    print(f"The path '{path_to_check}' does not exist.")
plant_disease_df_train

The path '..\src\data\new-plant-diseases\random_valid' exists.


Unnamed: 0,image_path,label
0,..\src\data\new-plant-diseases\random_train\Ap...,Apple___Apple_scab
1,..\src\data\new-plant-diseases\random_train\Ap...,Apple___Apple_scab
2,..\src\data\new-plant-diseases\random_train\Ap...,Apple___Apple_scab
3,..\src\data\new-plant-diseases\random_train\Ap...,Apple___Apple_scab
4,..\src\data\new-plant-diseases\random_train\Ap...,Apple___Apple_scab
...,...,...
139,..\src\data\new-plant-diseases\random_train\Ap...,Apple___healthy
140,..\src\data\new-plant-diseases\random_train\Ap...,Apple___healthy
141,..\src\data\new-plant-diseases\random_train\Ap...,Apple___healthy
142,..\src\data\new-plant-diseases\random_train\Ap...,Apple___healthy


## Handle Imbalanced Classes

The training set is imbalanced: it contains 108 "unhealthy" images but only 36 "healthy" ones. We balance the classes by downsampling the majority class to match the number of samples in the minority class, so that the model is not biased toward predicting the majority class.


In [9]:
# Create the target variable from the class label (the folder name)
plant_disease_df_train['target'] = plant_disease_df_train['label'].apply(lambda x: 'healthy' if 'healthy' in x.lower() else 'unhealthy')
plant_disease_df_valid['target'] = plant_disease_df_valid['label'].apply(lambda x: 'healthy' if 'healthy' in x.lower() else 'unhealthy')

print(plant_disease_df_train.target.value_counts())
print("\n")
print(plant_disease_df_valid.target.value_counts())

# Separate majority and minority classes since dataset is not balanced
df_majority = plant_disease_df_train[plant_disease_df_train.target == 'unhealthy']
df_minority = plant_disease_df_train[plant_disease_df_train.target == 'healthy']

# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                   replace=False,    # sample without replacement
                                   n_samples=len(df_minority),  # to match minority class
                                   random_state=13)  # reproducible results

# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])

# Display new class counts
print(df_downsampled.target.value_counts())

# Reassign the downsampled DataFrame to the original variable for further processing
plant_disease_df_train = df_downsampled


target
unhealthy    108
healthy       36
Name: count, dtype: int64


target
unhealthy    90
healthy      36
Name: count, dtype: int64
target
unhealthy    36
healthy      36
Name: count, dtype: int64
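
Downsampling discards 72 of the 108 "unhealthy" training images. An alternative that keeps all samples is to weight each class inversely to its frequency. The sketch below computes such weights with the formula `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn applies for `class_weight='balanced'`; the function name `balanced_class_weights` is hypothetical, and the label list simply reproduces the training counts printed above.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Training class counts taken from the output above
labels = ['unhealthy'] * 108 + ['healthy'] * 36
weights = balanced_class_weights(labels)
# healthy samples count three times as much as unhealthy ones
```

This is a different technique from the downsampling used in this notebook. With scikit-learn estimators that accept it, the same effect can usually be requested directly through the `class_weight='balanced'` parameter.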


In [10]:
# Import necessary libraries
import os
import cv2
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

In [19]:
# Load the preprocessed datasets from disk
# (this replaces the downsampled training DataFrame created earlier)
plant_disease_df_train = load_images(r'..\src\data\new-plant-diseases\random_normalized\random_train')
plant_disease_df_valid = load_images(r'..\src\data\new-plant-diseases\random_normalized\random_valid')

# Create target variable
plant_disease_df_train['target'] = plant_disease_df_train['label'].apply(lambda x: 'healthy' if 'healthy' in x.lower() else 'unhealthy')
plant_disease_df_valid['target'] = plant_disease_df_valid['label'].apply(lambda x: 'healthy' if 'healthy' in x.lower() else 'unhealthy')

In [20]:
# Inspect the validation DataFrame
plant_disease_df_valid

Unnamed: 0,image_path,label,target
0,..\src\data\new-plant-diseases\random_normaliz...,Apple___Apple_scab,unhealthy
1,..\src\data\new-plant-diseases\random_normaliz...,Apple___Apple_scab,unhealthy
2,..\src\data\new-plant-diseases\random_normaliz...,Apple___Apple_scab,unhealthy
3,..\src\data\new-plant-diseases\random_normaliz...,Apple___Apple_scab,unhealthy
4,..\src\data\new-plant-diseases\random_normaliz...,Apple___Apple_scab,unhealthy
...,...,...,...
121,..\src\data\new-plant-diseases\random_normaliz...,Apple___healthy,healthy
122,..\src\data\new-plant-diseases\random_normaliz...,Apple___healthy,healthy
123,..\src\data\new-plant-diseases\random_normaliz...,Apple___healthy,healthy
124,..\src\data\new-plant-diseases\random_normaliz...,Apple___healthy,healthy


In [21]:
# Function to flatten images
def flatten_images(df):
    images = []
    labels = []
    for _, row in df.iterrows():
        img = cv2.imread(row['image_path'])
        if img is not None:
            images.append(img.flatten())  # Flatten the image
            labels.append(row['target'])
    return np.array(images), np.array(labels)

# Flatten the images for training and validation
X_train, y_train = flatten_images(plant_disease_df_train)
X_valid, y_valid = flatten_images(plant_disease_df_valid)

In [22]:
# Report the shapes of the feature matrices and label vectors
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"X_valid shape: {X_valid.shape}")
print(f"y_valid shape: {y_valid.shape}")

# Print the current working directory
print("Current working directory:", os.getcwd())

# Base path relative to the current working directory
base_input_path = r'..\src\data\new-plant-diseases\random_normalized'

# Build the full paths of the input directories
# (the normalized folders are named 'random_train' and 'random_valid',
# so the two paths below are reported as missing)
train_folder = os.path.join(base_input_path, 'train')
valid_folder = os.path.join(base_input_path, 'valid')

print("Full path to the train folder:", os.path.abspath(train_folder))
print("Full path to the valid folder:", os.path.abspath(valid_folder))

# Check whether the directories exist
print("Train folder exists:", os.path.exists(train_folder))
print("Valid folder exists:", os.path.exists(valid_folder))
base_input_path = r'C:\Users\palla\Desktop\Data Scientist\data_plant_recognization\june24_bds_int_plant_reco\src\data\new-plant-diseases\random_normalized'

# Check whether the base path exists
if not os.path.exists(base_input_path):
    print(f"Base path does not exist: {base_input_path}")
else:
    print(f"Base path exists: {base_input_path}")


X_train shape: (144, 196608)
y_train shape: (144,)
X_valid shape: (126, 196608)
y_valid shape: (126,)
Current working directory: c:\Users\palla\Desktop\Data Scientist\data_plant_recognization\june24_bds_int_plant_reco\notebooks
Full path to the train folder: c:\Users\palla\Desktop\Data Scientist\data_plant_recognization\june24_bds_int_plant_reco\src\data\new-plant-diseases\random_normalized\train
Full path to the valid folder: c:\Users\palla\Desktop\Data Scientist\data_plant_recognization\june24_bds_int_plant_reco\src\data\new-plant-diseases\random_normalized\valid
Train folder exists: False
Valid folder exists: False
Base path exists: C:\Users\palla\Desktop\Data Scientist\data_plant_recognization\june24_bds_int_plant_reco\src\data\new-plant-diseases\random_normalized
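
The feature dimension of 196608 in the shapes above is consistent with flattening a 256 × 256 image with three color channels (256 × 256 × 3 = 196608). The following sketch reproduces that arithmetic with synthetic arrays; the 256 × 256 size is inferred from the printed shape rather than stated in the notebook, and the helper `all_same_shape` is hypothetical.

```python
import numpy as np


def all_same_shape(images):
    """Return True when every array in the list has the same shape."""
    return len({img.shape for img in images}) <= 1


# Two synthetic 256x256 three-channel images (size inferred from 196608 = 256 * 256 * 3)
images = [np.zeros((256, 256, 3), dtype=np.uint8) for _ in range(2)]

# Flattening turns each image into one row of the feature matrix
can_stack = all_same_shape(images)
flat = np.array([img.flatten() for img in images])
feature_count = flat.shape[1]
```

Stacking flattened images into a single matrix only works when all images share one shape, so a check like `all_same_shape` (or a resize step) may be worth adding before `flatten_images` is applied to data of mixed sizes.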


In [23]:
# Report the number of images in each DataFrame
print(f"Number of images in the train DataFrame: {len(plant_disease_df_train)}")
print(f"Number of images in the valid DataFrame: {len(plant_disease_df_valid)}")

# Inspect the first few entries of each DataFrame
print("First entries in the train DataFrame:")
print(plant_disease_df_train.head())

print("First entries in the valid DataFrame:")
print(plant_disease_df_valid.head())


Number of images in the train DataFrame: 144
Number of images in the valid DataFrame: 126
First entries in the train DataFrame:
                                          image_path               label  \
0  ..\src\data\new-plant-diseases\random_normaliz...  Apple___Apple_scab   
1  ..\src\data\new-plant-diseases\random_normaliz...  Apple___Apple_scab   
2  ..\src\data\new-plant-diseases\random_normaliz...  Apple___Apple_scab   
3  ..\src\data\new-plant-diseases\random_normaliz...  Apple___Apple_scab   
4  ..\src\data\new-plant-diseases\random_normaliz...  Apple___Apple_scab   

      target  
0  unhealthy  
1  unhealthy  
2  unhealthy  
3  unhealthy  
4  unhealthy  
First entries in the valid DataFrame:
                                          image_path               label  \
0  ..\src\data\new-plant-diseases\random_normaliz...  Apple___Apple_scab   
1  ..\src\data\new-plant-diseases\random_normaliz...  Apple___Apple_scab   
2  ..\src\data\new-plant-diseases\random_normaliz...  Apple___Apple_sca

In [24]:
# Initialize RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=13)

# Train the model
rf.fit(X_train, y_train)

# Make predictions
y_pred = rf.predict(X_valid)

# Evaluate the model
accuracy = accuracy_score(y_valid, y_pred)
print(f"Test Accuracy: {accuracy * 100:.2f}%")
print(classification_report(y_valid, y_pred))

Test Accuracy: 86.51%
              precision    recall  f1-score   support

     healthy       0.95      0.56      0.70        36
   unhealthy       0.85      0.99      0.91        90

    accuracy                           0.87       126
   macro avg       0.90      0.77      0.81       126
weighted avg       0.88      0.87      0.85       126
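
The report above shows why overall accuracy can hide weak performance on the minority class: the "healthy" recall is only 0.56 even though accuracy is about 87%. The sketch below recomputes accuracy and the macro-averaged recall (balanced accuracy) from per-class counts. The supports come from the report; the numbers of correct predictions (20 of 36 and 89 of 90) are inferred from the rounded recalls and the overall accuracy, so they should be read as an assumption.

```python
# Per-class support from the report; correct counts are inferred from the
# rounded recalls and the overall accuracy (assumption)
support = {'healthy': 36, 'unhealthy': 90}
correct = {'healthy': 20, 'unhealthy': 89}

# Recall per class: correctly predicted samples divided by class support
recall = {cls: correct[cls] / support[cls] for cls in support}

# Accuracy pools all samples, so the larger class dominates
accuracy = sum(correct.values()) / sum(support.values())

# Balanced accuracy averages the recalls, giving each class equal weight
balanced_accuracy = sum(recall.values()) / len(recall)
```

The balanced accuracy corresponds to the "macro avg" recall column in the report, and scikit-learn provides it directly as `sklearn.metrics.balanced_accuracy_score`.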



In [25]:
from sklearn.metrics import accuracy_score, classification_report
from sklearn.tree import DecisionTreeClassifier


# Train the decision tree model
dt = DecisionTreeClassifier(max_depth=12, random_state=13)
dt.fit(X_train, y_train)

# Make predictions with the model
y_pred = dt.predict(X_valid)

# Evaluate the model
accuracy = accuracy_score(y_valid, y_pred)
print(f"Test Accuracy: {accuracy * 100:.2f}%")
print(classification_report(y_valid, y_pred))

Test Accuracy: 78.57%
              precision    recall  f1-score   support

     healthy       0.63      0.61      0.62        36
   unhealthy       0.85      0.86      0.85        90

    accuracy                           0.79       126
   macro avg       0.74      0.73      0.74       126
weighted avg       0.78      0.79      0.78       126



In [28]:
# Generate and print the classification report for the decision tree predictions
from sklearn.metrics import classification_report


class_report = classification_report(y_valid, y_pred)
print('Classification Report:')
print(class_report)

Classification Report:
              precision    recall  f1-score   support

     healthy       0.63      0.61      0.62        36
   unhealthy       0.85      0.86      0.85        90

    accuracy                           0.79       126
   macro avg       0.74      0.73      0.74       126
weighted avg       0.78      0.79      0.78       126

