### Decision Tree model for Plant Disease Data

In this notebook, we apply a Decision Tree model to a dataset of plant diseases. The dataset is retrieved from Kaggle (see [Plant Disease Dataset](https://www.kaggle.com/datasets/saroz014/plant-disease)).
The notebook is organized as follows:
+  1. Importing necessary libraries for preprocessing and training the model
+  2. Setting paths to the plant-disease dataset
+  3. Defining a function to load images into a DataFrame for further processing
+  4. Checking for duplicate entries as part of preprocessing
+  5. Balancing 'healthy' and 'unhealthy' images, as the two classes are highly imbalanced
+  6. Defining a function to resize, normalize, and flatten images in batches to keep memory use manageable
+  7. Creating a DataFrame from the batches and separating it into features and labels
+  8. Applying PCA transformation to the data to enhance model performance by focusing on the most significant features
+  9. Training the Decision Tree Model
+ 10. Generating a classification report for all classes (subfolders)

In [1]:
# 1. Import necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import cv2
import shutil
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report
from sklearn.tree import DecisionTreeClassifier

#### 2. Setting paths to plant-disease dataset
Sets the paths to the dataset directories and converts them to absolute paths. Prints the current working directory and the absolute paths for verification.

In [2]:
# dataset directories
script_dir = os.getcwd()
train_path = os.path.join(script_dir, "src", "data", "plant-disease", "dataset", "train")
test_path = os.path.join(script_dir, "src", "data", "plant-disease", "dataset", "test")

# converting to absolute paths
absolut_train = os.path.abspath(train_path)
absolut_test = os.path.abspath(test_path)

# Printing the paths for manual verification (this does not check that they exist)
print("Current path:", script_dir)
print("Absolute training path:", absolut_train)
print("Absolute test path:", absolut_test)

Current path: c:\Program Files (x86)\Projects\Plants\june24_bds_int_plant_reco\notebooks
Absolute training path: c:\Program Files (x86)\Projects\Plants\june24_bds_int_plant_reco\notebooks\src\data\plant-disease\dataset\train
Absolute test path: c:\Program Files (x86)\Projects\Plants\june24_bds_int_plant_reco\notebooks\src\data\plant-disease\dataset\test
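
The cell above only prints the paths. If we also want the notebook to stop early when a directory is missing, a minimal sketch could look like the following. The helper name `require_directory` is hypothetical and not part of the original notebook.

```python
import os

def require_directory(path):
    """Return the absolute path if it is an existing directory, else raise."""
    absolute = os.path.abspath(path)
    if not os.path.isdir(absolute):
        # Fail early with a readable message instead of silently loading 0 images
        raise FileNotFoundError(f"Dataset directory not found: {absolute}")
    return absolute
```

It could be used as `absolut_train = require_directory(train_path)`, assuming `train_path` is defined as in the cell above.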


#### 3. Function to load images into a DataFrame
Walks through the directory structure, loads image paths and their labels into lists, then creates and returns a DataFrame containing these lists.

In [3]:
# Function that loads images and creates a DataFrame with the two columns 'image_path' and 'label'
def load_images(folder):
    image_paths = []
    labels = []
    for subdir, _, files in os.walk(folder):
        for file in files:
            if file.endswith(('.jpg', '.JPG', '.png')):
                label = os.path.basename(subdir)
                img_path = os.path.join(subdir, file)
                image_paths.append(img_path)
                labels.append(label)
    return pd.DataFrame({'image_path': image_paths, 'label': labels})

# Resulting DataFrames
images_train = load_images(absolut_train)
images_test = load_images(absolut_test)

# Basic Analysis
print("Train DataFrame Shape:", images_train.shape)
print("Valid DataFrame Shape:", images_test.shape)

Train DataFrame Shape: (43455, 2)
Valid DataFrame Shape: (10849, 2)


#### 4. Check for duplicate entries
Counts duplicate image paths in the DataFrame and returns the duplicated rows, so we can confirm there are no duplicate entries before further processing. Note that the check compares file paths, not image content.

In [4]:
# Function for checking duplicate entries in 'image_path' column
def check_duplicates(df):
    duplicates = df[df.duplicated(subset=['image_path'])]
    print(f"Duplicate entries: {len(duplicates)}")
    return duplicates

# Check train dataset
print("Checking train dataset for duplicates...")
duplicate_train_images = check_duplicates(images_train)

# Check valid dataset
print("Checking valid dataset for duplicates...")
duplicate_valid_images = check_duplicates(images_test)

Checking train dataset for duplicates...
Duplicate entries: 0
Checking valid dataset for duplicates...
Duplicate entries: 0
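
Because the check above compares paths only, two files with identical content but different names would not be flagged. A minimal sketch of a content-based check, assuming a DataFrame with an `image_path` column as built in step 3, is shown below. The helper names `file_sha256` and `check_content_duplicates` are hypothetical.

```python
import hashlib
import pandas as pd

def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check_content_duplicates(df):
    """Return rows whose file content repeats the content of an earlier row."""
    hashes = df["image_path"].map(file_sha256)
    return df[hashes.duplicated()]
```

Hashing every file reads the whole dataset once, so this is slower than the path check and is probably best run as a one-off.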


#### 5. Balancing 'healthy' and 'unhealthy' Images
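The dataset is imbalanced, with more "unhealthy" than "healthy" images (31,384 vs. 12,071 in the training set). To address this, we add a target column labeling images as 'healthy' or 'unhealthy' and balance the training data by downsampling the majority class ('unhealthy') to match the number of 'healthy' samples. This reduces bias towards the majority class during training. Only the training set is downsampled; the test set keeps its original distribution. Note that the later steps train on the original `label` column, so this step balances the healthy/unhealthy split rather than the individual classes.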

In [5]:
# Creating new target column labeling images as healthy or unhealthy
images_train['target'] = images_train['label'].apply(lambda x: 'healthy' if 'healthy' in x.lower() else 'unhealthy')
images_test['target'] = images_test['label'].apply(lambda x: 'healthy' if 'healthy' in x.lower() else 'unhealthy')

print(images_train.target.value_counts())
print("\n")
print(images_test.target.value_counts())

# Separate majority and minority classes which are highly unbalanced
df_majority_train = images_train[images_train.target == 'unhealthy']
df_minority_train = images_train[images_train.target == 'healthy']

# Downsample majority class of training data
df_majority_train_downsampled = resample(df_majority_train,
                                   replace=False,                       # sample without replacement
                                   n_samples=len(df_minority_train),    # to match minority class
                                   random_state=13)                     # reproducible results


# Combine minority class with downsampled majority class
df_downsampled_train = pd.concat([df_majority_train_downsampled, df_minority_train])

# Display new class counts
print(df_downsampled_train.target.value_counts())

# Reassign the downsampled DataFrame to the original variable
images_train = df_downsampled_train

target
unhealthy    31384
healthy      12071
Name: count, dtype: int64


target
unhealthy    7836
healthy      3013
Name: count, dtype: int64
target
unhealthy    12071
healthy      12071
Name: count, dtype: int64
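
To see what the downsampling does on a small scale, here is a sketch on toy data. The labels `A___scab`, `B___rot` and `A___healthy` are made up for illustration. The binary `target` ends up balanced, while the counts per original `label` can still differ.

```python
import pandas as pd
from sklearn.utils import resample

# Toy data: 6 'unhealthy' rows spread over two disease labels, 2 'healthy' rows
toy = pd.DataFrame({
    "label": ["A___scab"] * 5 + ["B___rot"] * 1 + ["A___healthy"] * 2,
})
toy["target"] = toy["label"].apply(
    lambda x: "healthy" if "healthy" in x.lower() else "unhealthy"
)

majority = toy[toy.target == "unhealthy"]
minority = toy[toy.target == "healthy"]

# Draw as many majority rows as there are minority rows, without replacement
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=13)
balanced = pd.concat([majority_down, minority])

print(balanced.target.value_counts())   # binary target: 2 vs 2
print(balanced.label.value_counts())    # per-label counts are not equalized
```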


#### 6. Function to resize, normalize and flatten images in batches
Reads each image from its path, resizes it to 64x64 pixels, scales the pixel values to [0, 1], and flattens the result into a one-dimensional feature vector. Images are processed in batches of 32.

In [6]:
# Function to resize and normalize pixel values
def read_and_normalize_image(image_path):
    img = cv2.imread(image_path)
    if img is not None:
        img = cv2.resize(img, (64, 64))                # Resize to 64x64 pixels
        img = img / 255.0                              # Normalize the image
    return img

# Function to flatten array and process images in batches
def flatten_dataset_in_batches(df, batch_size=32):
    total_images = len(df)
    for start in range(0, total_images, batch_size):
        end = min(start + batch_size, total_images)
        batch_df = df.iloc[start:end]

        flatten_images = []
        batch_labels = []
        for img_path, label in zip(batch_df['image_path'], batch_df['label']):
            img = read_and_normalize_image(img_path)
            if img is not None:
                flatten_images.append(img.flatten())    # Flatten the image
                batch_labels.append(label)              # Keep labels aligned if an image is unreadable

        flattened_images_df = pd.DataFrame(np.array(flatten_images))
        flattened_images_df['label'] = batch_labels

        yield flattened_images_df
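
The shape arithmetic behind the flattening can be checked without any image files. The sketch below uses a random array as a stand-in for a resized image; it is an illustration, not part of the original pipeline.

```python
import numpy as np

rng = np.random.default_rng(13)

# Synthetic stand-in for a resized image: 64x64 pixels, 3 colour channels, uint8
fake_img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

normalized = fake_img / 255.0      # float values scaled into [0, 1]
flat = normalized.flatten()        # one-dimensional feature vector

print(flat.shape)                  # (12288,) because 64 * 64 * 3 = 12288
```

This matches the 12288 feature columns reported for `X_train` and `X_test` in step 7.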

#### 7. Creating DataFrame of batches and separating features and labels
Processes images in batches and combines them into a single DataFrame. Encodes the labels and separates features (flattened images) from target variables (encoded labels).

In [7]:
# Concatenating all batches into a single DataFrame
def create_dataframe_from_batches(df, batch_size=32):
    all_batches = []
    for batch_df in flatten_dataset_in_batches(df, batch_size=batch_size):
        all_batches.append(batch_df)

    full_df = pd.concat(all_batches, ignore_index=True)

    return full_df

train_df = create_dataframe_from_batches(images_train)
test_df = create_dataframe_from_batches(images_test)


# Initialize LabelEncoder
le = LabelEncoder()

# Fit and transform target variable of training
train_df['label_encoded'] = le.fit_transform(train_df['label'])

# transform target variable of test
test_df['label_encoded'] = le.transform(test_df['label'])

# Separation of features and target variables
X_train = train_df.drop(columns=['label', 'label_encoded']).values
y_train = train_df['label_encoded'].values
X_test = test_df.drop(columns=['label', 'label_encoded']).values
y_test = test_df['label_encoded'].values

# Checking dimension of data
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

X_train shape: (24142, 12288)
y_train shape: (24142,)
X_test shape: (10849, 12288)
y_test shape: (10849,)
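
As a small illustration of the encoding step, the sketch below fits a `LabelEncoder` on a few toy labels and maps the integer codes back to names. The variable names prefixed with `toy_` are made up for this example.

```python
from sklearn.preprocessing import LabelEncoder

toy_train = ["Apple___healthy", "Apple___Black_rot", "Apple___healthy"]
toy_test = ["Apple___Black_rot"]

toy_le = LabelEncoder()
train_codes = toy_le.fit_transform(toy_train)   # classes are stored in sorted order
test_codes = toy_le.transform(toy_test)         # reuses the mapping learned on train

print(toy_le.classes_.tolist())                   # ['Apple___Black_rot', 'Apple___healthy']
print(train_codes.tolist(), test_codes.tolist())  # [1, 0, 1] [0]
print(toy_le.inverse_transform([0, 1]).tolist())  # ['Apple___Black_rot', 'Apple___healthy']
```

Calling `transform` with a label that was not seen during `fit` raises a `ValueError`, so the test set must not contain classes that are missing from the training set.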


#### 8. Applying PCA transformation to the data
Fits PCA to the training data to determine the number of principal components required to retain 95% of the variance, then applies this PCA transformation to both the training and test data.

In [8]:
# 8.1. Fit PCA on the flattened training data
pca = PCA().fit(X_train)

cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
n_components_95 = np.argmax(cumulative_variance >= 0.95) + 1
print(f"Number of components for 95% variance: {n_components_95}")

Number of components for 95% variance: 1340
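
The `np.argmax(...) + 1` idiom used above can be traced on a tiny example. The variance ratios below are invented for illustration.

```python
import numpy as np

explained_variance_ratio = np.array([0.60, 0.25, 0.11, 0.04])
cumulative = np.cumsum(explained_variance_ratio)   # approx. [0.60, 0.85, 0.96, 1.00]

# (cumulative >= 0.95) is [False, False, True, True]; argmax returns the first True
n_components = np.argmax(cumulative >= 0.95) + 1   # +1 turns the index into a count
print(n_components)                                # 3
```

scikit-learn also accepts a float between 0 and 1 for `n_components` (with `svd_solver='full'`) to select the number of components by explained variance, which may be an alternative to computing the count by hand.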


In [9]:
# 8.2. Apply PCA transformation with the number of components that explain 95% variance
pca = PCA(n_components=n_components_95)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

#### 9. Training Decision Tree Model
Trains a Decision Tree Classifier on the PCA-transformed training data. Makes predictions on the PCA-transformed test data. Evaluates the model’s performance and prints accuracy and classification metrics.

In [10]:
# Setting and training Decision Tree Model
dt = DecisionTreeClassifier(max_depth=6, random_state=13)
dt.fit(X_train_pca, y_train)

# Prediction of model
y_pred = dt.predict(X_test_pca)

# Evaluating model
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy * 100:.2f}%")
print(classification_report(y_test, y_pred))

Test Accuracy: 36.81%
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       126
           1       0.00      0.00      0.00       124
           2       0.00      0.00      0.00        55
           3       0.33      0.42      0.37       329
           4       0.48      0.54      0.51       300
           5       0.00      0.00      0.00       210
           6       0.37      0.28      0.32       170
           7       0.00      0.00      0.00       102
           8       0.86      0.72      0.78       238
           9       0.00      0.00      0.00       197
          10       0.83      0.81      0.82       232
          11       0.00      0.00      0.00       236
          12       0.00      0.00      0.00       276
          13       0.37      0.32      0.34       215
          14       0.00      0.00      0.00        84
          15       0.52      0.81      0.63      1101
          16       0.54      0.23      0.32       459
     

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
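
Several classes above receive no predictions at all. One possible contributing factor is the depth limit: a binary tree of depth 6 has at most 2^6 = 64 leaves. The sketch below shows how depth and leaf count relate, using synthetic data from `make_classification` as a stand-in (NOT the plant dataset); the specific sizes are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 10 classes, 40 features
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=40, n_informative=20,
    n_classes=10, n_clusters_per_class=1, random_state=13,
)

leaves = {}
for depth in (3, 6, 12):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=13)
    tree.fit(X_toy, y_toy)
    leaves[depth] = tree.get_n_leaves()
    # A binary tree of depth d has at most 2**d leaves
    print(depth, tree.get_depth(), leaves[depth], 2 ** depth)
```

Comparing test scores across a few `max_depth` values in the same way could show whether the limit of 6 is too restrictive for this task.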


#### 10. Generate the classification report for all classes (subfolders)
Generates a detailed classification report showing precision, recall, and F1-score for each class. Prints the full report.

In [11]:
# Full classification report
full_classification_report = classification_report(
    y_test, 
    y_pred, 
    target_names=le.classes_,  # Consider all classes
    zero_division=0            # Report 0 for undefined metrics and suppress the warning
)

# Print the full classification report for all classes
print("Full Classification Report for all classes:\n")
print(full_classification_report)


Full Classification Report for all classes:

                                                    precision    recall  f1-score   support

                                Apple___Apple_scab       0.00      0.00      0.00       126
                                 Apple___Black_rot       0.00      0.00      0.00       124
                          Apple___Cedar_apple_rust       0.00      0.00      0.00        55
                                   Apple___healthy       0.33      0.42      0.37       329
                               Blueberry___healthy       0.48      0.54      0.51       300
          Cherry_(including_sour)___Powdery_mildew       0.00      0.00      0.00       210
                 Cherry_(including_sour)___healthy       0.37      0.28      0.32       170
Corn_(maize)___Cercospora_leaf_spot Gray_leaf_spot       0.00      0.00      0.00       102
                       Corn_(maize)___Common_rust_       0.86      0.72      0.78       238
               Corn_(maize)___Nort