# **Data Modelling and Evaluation**

---

## Objectives

* Answer business requirement 2: 
    * The client seeks to predict whether a cherry leaf is healthy or infected with powdery mildew.

## Inputs

* inputs/cherry_leaves_dataset/cherry-leaves/train
* inputs/cherry_leaves_dataset/cherry-leaves/test
* inputs/cherry_leaves_dataset/cherry-leaves/validation
* image shape embeddings

## Outputs

* Image distribution plot for the train, validation, and test sets
* Image augmentation
* Class indices to map prediction output to labels
* Machine learning model creation and training
* Saved model
* Learning curve plot for model performance
* Model evaluation saved as a pickle file
* Prediction on a random image file





## Additional Comments:

N/A


---

## 1. Set Data Directory and Import Libraries

---

Import libraries

In [None]:
import os
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd
import seaborn as sns
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import load_model


---

## 2. Set working directory

---

In [None]:
cwd = os.getcwd()

In [None]:
os.chdir('/workspace/Portfolio_5_Cherry_Leaves_Mildew')
print("You set a new current directory")

In [None]:
work_dir = os.getcwd()
work_dir

---

## 3. Set input directories (Train, Validation, Test)

---

Set train, validation and test paths.

In [None]:
base_dir = '/workspace/Portfolio_5_Cherry_Leaves_Mildew/inputs/cherry_leaves_dataset/cherry-leaves'
train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')
test_dir = os.path.join(base_dir, 'test')


---

## 4. Set output directory

---

In [None]:

version = 'v1'
# Reuse base_dir from section 3; the versioned folder is created under <base_dir>/outputs
outputs_dir = os.path.join(base_dir, 'outputs')

# Function to automatically increment version
def increment_version(ver):
    # Assumes the format vN, where N is an integer with one or more digits
    return f"v{int(ver[1:]) + 1}"

# Create the 'outputs' directory if it does not exist yet
os.makedirs(outputs_dir, exist_ok=True)

# Check if the specific version exists
if version in os.listdir(outputs_dir):
    print('Old version is already available. Creating a new version...')
    # Increment the version until a new, unused version is found
    while version in os.listdir(outputs_dir):
        version = increment_version(version)
    message = f"New version created: {version}"
else:
    message = f"Version {version} created."

# Point file_path at the directory that is actually created
file_path = os.path.join(outputs_dir, version)
os.makedirs(file_path)
print(message)



---

## 5. Set label names

---

Set labels

In [None]:
labels = os.listdir(train_dir)
print('Labels for the images are', labels)

---

## 6. Set image shape

---

Import saved image shape embedding

In [None]:
import joblib

version = 'v1'
image_shape = joblib.load(filename=f"outputs/{version}/image_shape.pkl")
image_shape

---

## Images distribution

---

Count number of images per set and label

In [None]:
import pandas as pd

# Collect one row per (set, label) pair, then build the DataFrame in one step
rows = []
for folder in ['train', 'test', 'validation']:
    for label in labels:
        count = len(os.listdir(os.path.join(base_dir, folder, label)))
        rows.append({'Set': folder, 'Label': label, 'Count': count})
        print(f"* {folder} - {label}: {count} images")

df_freq = pd.DataFrame(rows)
print("\n")
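Note that `os.listdir` counts every entry in a folder, including hidden files or stray non-image files. A minimal sketch of a stricter count follows, assuming the dataset's images use the extensions in the hypothetical `IMAGE_EXTENSIONS` set; the helper name `count_images` and the demo file names are also illustrative and not part of this notebook.

```python
import os
import tempfile

# Assumption: these extensions cover the dataset's image files
IMAGE_EXTENSIONS = {'.jpg', '.jpeg', '.png'}

def count_images(root, sets=('train', 'test', 'validation')):
    """Return one row per (set, label) with the number of image files."""
    rows = []
    for folder in sets:
        folder_path = os.path.join(root, folder)
        for label in sorted(os.listdir(folder_path)):
            label_path = os.path.join(folder_path, label)
            if not os.path.isdir(label_path):
                continue  # skip stray files next to the label folders
            n_images = sum(
                os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS
                for name in os.listdir(label_path)
            )
            rows.append({'Set': folder, 'Label': label, 'Count': n_images})
    return rows

# Demo on a throwaway directory tree (file names are made up)
with tempfile.TemporaryDirectory() as demo_root:
    for folder in ('train', 'test', 'validation'):
        for label in ('healthy', 'powdery_mildew'):
            os.makedirs(os.path.join(demo_root, folder, label))
    open(os.path.join(demo_root, 'train', 'healthy', 'leaf_1.JPG'), 'w').close()
    open(os.path.join(demo_root, 'train', 'healthy', '.DS_Store'), 'w').close()
    demo_rows = count_images(demo_root)

print(demo_rows[0])
```

The returned list of dictionaries can be passed straight to `pd.DataFrame` to get the same `Set`, `Label`, `Count` columns used for the plots.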

Label Distribution - Bar Chart

In [None]:
import plotly.express as px
import pandas as pd

# Per-class image counts for each set, kept in separate lists
set_names = ["train", "train", "test", "test", "validation", "validation"]
label_names = ["healthy", "powdery_mildew", "healthy", "powdery_mildew", "healthy", "powdery_mildew"]
counts = [1472, 1472, 422, 422, 210, 210]

# Create the DataFrame (separate names keep the `labels` list from section 5 intact)
df_freq = pd.DataFrame({'Set': set_names, 'Label': label_names, 'Count': counts})

fig = px.bar(df_freq, 
             x='Set', 
             y='Count', 
             color='Label',
             title='Cherry Leaves Dataset',
             text_auto=True
            )

fig.update_layout(
    autosize=False,
    width=800, 
    height=500, 
    )
fig.show()
fig.write_image(f'outputs/v1/bar_chart.png')


In [None]:
import plotly.express as px
import pandas as pd

# Per-class counts for each set; the classes are balanced, so the shares match the full set sizes
sets = ["train", "test", "validation"]
counts = [1472, 422, 210]

# Create the DataFrame
df_sets = pd.DataFrame({'Set': sets, 'Count': counts})

# Create the pie chart
fig = px.pie(df_sets, 
             values='Count', 
             names='Set', 
             title='Dataset Split',
             color='Set',
             color_discrete_map={'train': 'blue', 'test': 'green', 'validation': 'orange'}  # Optionally specify color for each set
            )

# Show the chart
fig.show()
fig.write_image(f'outputs/v1/pie_chart.png')


---

## 7. Load Images from Train, Test, and Validation Sets

---

In [None]:
train_datagen = ImageDataGenerator(rescale=1./255,
                                   rotation_range=40,
                                   width_shift_range=0.2,
                                   height_shift_range=0.2,
                                   shear_range=0.2,
                                   zoom_range=0.2,
                                   horizontal_flip=True,
                                   fill_mode='nearest')

test_datagen = ImageDataGenerator(rescale=1./255)  # Validation and test data are only rescaled, never augmented
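As a rough illustration of two of the settings above, the sketch below applies the `rescale=1./255` step and a horizontal flip to a tiny made-up array with plain NumPy. It only mimics the effect of those two options; it is not how `ImageDataGenerator` is implemented, and the random transforms (rotation, shift, shear, zoom) are left out.

```python
import numpy as np

# Toy 2x3 RGB "image" with 8-bit pixel values (made-up data)
image = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3) * 10

# Effect of rescale=1./255: pixel values are mapped into [0, 1]
rescaled = image * (1.0 / 255)

# Effect of a horizontal flip: the width axis is reversed
flipped = rescaled[:, ::-1, :]

print(rescaled.min(), rescaled.max())
print(np.array_equal(flipped[:, 0, :], rescaled[:, -1, :]))
```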

In [None]:
test_generator = test_datagen.flow_from_directory(
        test_dir,
        target_size=(50, 50),
        batch_size=20,
        class_mode='binary')

test_generator.class_indices

In [None]:
train_generator = train_datagen.flow_from_directory(
        train_dir,
        target_size=(50, 50),
        batch_size=20,
        class_mode='binary')

train_generator.class_indices

In [None]:
validation_generator = test_datagen.flow_from_directory(
        validation_dir,
        target_size=(50, 50),
        batch_size=20,
        class_mode='binary')

validation_generator.class_indices

---

## 8. Image Data Augmentation

---

Preview the batches produced by the generators defined in section 7.

* Plot augmented training images

In [None]:
def plot_augmented_images(image_gen, num_images=20):
    """
    Plots a grid of `num_images` from an image generator.
    
    :param image_gen: The image generator object.
    :param num_images: The number of images to plot.
    """
    images, _ = next(image_gen)  # Get a batch of images and labels
    
    # Calculate grid size and create a new figure
    n_cols = 5  # Number of columns in the grid
    n_rows = num_images // n_cols + (num_images % n_cols > 0)  # Calculate required rows
    plt.figure(figsize=(n_cols * 3, n_rows * 3))
    
    for i in range(min(num_images, len(images))):  # never index past the batch
        plt.subplot(n_rows, n_cols, i + 1)
        plt.imshow(images[i])  # Images are already rescaled by the generator
        plt.axis('off')
    
    plt.tight_layout()
    plt.show()

# Plot augmented train images
plot_augmented_images(train_generator, num_images=20)


* Plot test images (rescaled only, not augmented)

In [None]:
def plot_test_images(directory_iterator, num_images=10):
    """
    Plots a specified number of test images from a DirectoryIterator.
    
    :param directory_iterator: A DirectoryIterator yielding batches of images and labels.
    :param num_images: The number of test images to plot.
    """
    # Get a batch of images and labels using next()
    images, _ = next(directory_iterator)
    
    # Determine the grid size for plotting
    n_cols = min(num_images, 5)  # Adjust the number of columns as needed, maxing out at 5
    n_rows = num_images // n_cols + (num_images % n_cols > 0)
    
    plt.figure(figsize=(n_cols * 3, n_rows * 3))
    
    for i in range(min(num_images, len(images))):  # Ensure we do not exceed the batch size or num_images
        plt.subplot(n_rows, n_cols, i + 1)
        plt.imshow(images[i])
        plt.axis('off')
    
    plt.tight_layout()
    plt.show()


In [None]:
# Assuming test_generator is your DirectoryIterator instance
plot_test_images(test_generator, num_images=10)


* Plot labelled training images

In [None]:

def plot_train_images(generator, num_images=8):
    """
    Plots a batch of images from the given generator.
    
    :param generator: The generator yielding batches of images and labels.
    :param num_images: The number of images to plot from a single batch.
    """
    images, labels = next(generator)  # Get a batch of images and labels

    n_cols = min(num_images, 4)  # Number of columns in the plot
    n_rows = num_images // n_cols + int(num_images % n_cols != 0)  # Calculate required number of rows

    plt.figure(figsize=(n_cols * 3, n_rows * 3))
    for i in range(num_images):
        plt.subplot(n_rows, n_cols, i + 1)
        plt.imshow(images[i])
        plt.title(f'Label: {labels[i]}')
        plt.axis('off')
    plt.tight_layout()
    plt.show()


In [None]:
plot_train_images(train_generator, num_images=8)


* Plot validation images (rescaled only, not augmented)

In [None]:
def plot_validation_images(generator, num_images=8):
    """
    Plots a batch of images from the given generator.
    
    :param generator: The generator yielding batches of images and labels.
    :param num_images: The number of images to plot from a single batch.
    """
    images, labels = next(generator)  # Get a batch of images and labels

    n_cols = min(num_images, 4)  # Number of columns in the plot
    n_rows = num_images // n_cols + int(num_images % n_cols != 0)  # Calculate required number of rows

    plt.figure(figsize=(n_cols * 3, n_rows * 3))
    for i in range(num_images):
        plt.subplot(n_rows, n_cols, i + 1)
        plt.imshow(images[i])
        plt.title(f'Label: {labels[i]}')
        plt.axis('off')
    plt.tight_layout()
    plt.show()


In [None]:
plot_validation_images(validation_generator, num_images=8)


---

## 9. Save Class Indices

---

In [None]:
# Take the mapping from the generator, e.g. {'healthy': 0, 'powdery_mildew': 1}
class_indices = train_generator.class_indices

# Save the class indices to a file using joblib
joblib.dump(class_indices, f'{file_path}/class_indices.pkl')

---

## 10. Model Creation and Training

---

Build a convolutional neural network (CNN) for binary classification, with manually chosen hyperparameters.

* Define and compile the model

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Initialize the model with the correct input shape
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(50, 50, 3)),  # Updated input shape
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(128, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(1, activation='sigmoid')  # Use 'softmax' for multi-class classification
])

# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',  # Use 'categorical_crossentropy' for multi-class classification
              metrics=['accuracy'])

# Summary of the model
model.summary()
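The shapes and parameter counts reported by `model.summary()` can be checked by hand. The sketch below assumes the Keras defaults used above (3x3 kernels, `'valid'` padding, stride 1 for the convolutions, and 2x2 pooling with stride 2); the helper names `conv_out` and `pool_out` are illustrative.

```python
def conv_out(size, kernel=3):
    """Output size of a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Output size of max pooling with stride equal to the pool size."""
    return size // pool

size, channels = 50, 3
total_params = 0
for filters in (32, 64, 128):
    # Each filter has 3*3*channels weights plus one bias
    total_params += (3 * 3 * channels + 1) * filters
    size = pool_out(conv_out(size))
    channels = filters
    print(f"after Conv2D({filters}) + MaxPooling2D: {size}x{size}x{channels}")

flat = size * size * channels
total_params += (flat + 1) * 128   # Dense(128): weights plus biases
total_params += (128 + 1) * 1      # Dense(1): weights plus bias
print(f"flattened features: {flat}, trainable parameters: {total_params}")
```

Most of the parameters sit in the first dense layer, which is why the flattened feature size matters more than the number of convolution filters here.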


* Define a Keras callback that monitors the training accuracy and stops training once it reaches a threshold (0.99 here), but not before a minimum number of epochs (5 here)

In [None]:
from tensorflow.keras.callbacks import Callback

class AccuracyThresholdCallback(Callback):
    """
    Custom callback to stop training when a specified accuracy threshold is reached.
    """
    def __init__(self, threshold, min_epochs):
        super(AccuracyThresholdCallback, self).__init__()
        self.threshold = threshold
        self.min_epochs = min_epochs

    def on_epoch_end(self, epoch, logs=None):
        # Stop once the training accuracy reaches the threshold, but not before min_epochs
        logs = logs or {}
        accuracy = logs.get('accuracy')
        if epoch >= self.min_epochs - 1:  # epochs are zero-indexed
            if accuracy is not None and accuracy >= self.threshold:
                print(f"\nEpoch {epoch+1}: Reached {self.threshold * 100}% accuracy. Stopping training.")
                self.model.stop_training = True

# Instantiate the callback with your desired threshold
accuracy_threshold_callback = AccuracyThresholdCallback(threshold=0.99, min_epochs=5)
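The stopping rule itself can be tried out without Keras. The following is a minimal sketch of the same condition as a plain function; `should_stop` is a hypothetical helper and the accuracy values are made up for illustration.

```python
def should_stop(epoch, accuracy, threshold=0.99, min_epochs=5):
    """Mirror the callback's rule; `epoch` is zero-indexed as in Keras."""
    if accuracy is None:
        return False
    return epoch >= min_epochs - 1 and accuracy >= threshold

# Hypothetical per-epoch training accuracies
accuracies = [0.995, 0.97, 0.98, 0.985, 0.991, 0.999]

# First zero-indexed epoch at which training would stop, or None
stop_epoch = next(
    (epoch for epoch, acc in enumerate(accuracies) if should_stop(epoch, acc)),
    None,
)
print(stop_epoch)
```

In this example the first epoch already exceeds the threshold, but the `min_epochs` guard delays the stop until the fifth epoch.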

* Fit model for model training

In [None]:
from tensorflow.keras.callbacks import Callback

class DetailedHistory(Callback):
    def on_train_batch_end(self, batch, logs=None):
        # Log training metrics every 10 batches
        logs = logs or {}
        if batch % 10 == 0:
            print(f"Batch {batch}: Loss = {logs.get('loss')}, Accuracy = {logs.get('accuracy')}")

    def on_epoch_end(self, epoch, logs=None):
        # Log validation metrics at the end of each epoch (epochs are zero-indexed)
        logs = logs or {}
        print(f"Epoch {epoch + 1}: Val Loss = {logs.get('val_loss')}, Val Accuracy = {logs.get('val_accuracy')}")

detailed_history = DetailedHistory()

In [None]:
history = model.fit(
    train_generator,
    epochs=25,
    steps_per_epoch=len(train_generator.classes) // 20,
    validation_data=validation_generator,  # Make sure this is provided
    validation_steps=len(validation_generator.classes) // 20,
    callbacks=[detailed_history, accuracy_threshold_callback],
    verbose=1
)
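The step counts passed to `fit` follow from floor division by the batch size. As a worked example, assuming the per-class counts hard-coded in the bar chart cell (1472 training and 210 validation images per class):

```python
batch_size = 20

# Per-class counts taken from the bar chart cell (two balanced classes)
train_images = 1472 + 1472
validation_images = 210 + 210

steps_per_epoch = train_images // batch_size
validation_steps = validation_images // batch_size

# Floor division leaves a partial batch outside the step budget
leftover_train = train_images - steps_per_epoch * batch_size

print(steps_per_epoch, validation_steps, leftover_train)
```

With these counts the validation set divides evenly, while a handful of training images fall outside the step budget in any given epoch.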

* Save model


In [None]:
model.save('outputs/v1/mildew_detector_model.h5')

---

## 11. Model Performance

---

### Model learning curve

In [None]:
loss = pd.DataFrame(history.history)  # per-epoch metrics recorded by model.fit
sns.set_style("darkgrid")

# Plot only if 'val_loss' is present
if 'val_loss' in loss.columns:
    loss[['loss', 'val_loss']].plot(style='.-')
    plt.title("Loss")
    plt.savefig(f'outputs/v1/model_training_losses.png', bbox_inches='tight', dpi=150)
    plt.show()

if 'val_accuracy' in loss.columns:
    loss[['accuracy', 'val_accuracy']].plot(style='.-')
    plt.title("Accuracy")
    plt.savefig(f'outputs/v1/model_training_acc.png', bbox_inches='tight', dpi=150)
    plt.show()
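Besides reading the curves by eye, the history dictionary can be summarised numerically. This is a minimal sketch, assuming a dictionary shaped like `history.history`; `summarize_history` is a hypothetical helper and the numbers are made up, not results from this notebook.

```python
def summarize_history(history_dict):
    """Return the best epoch (1-indexed) by val_loss and the train/val loss gap there."""
    val_loss = history_dict['val_loss']
    best_idx = min(range(len(val_loss)), key=val_loss.__getitem__)
    gap = val_loss[best_idx] - history_dict['loss'][best_idx]
    return {'best_epoch': best_idx + 1, 'val_loss': val_loss[best_idx], 'gap': gap}

# Hypothetical values, not results from this notebook
example_history = {
    'loss': [0.60, 0.35, 0.20, 0.12, 0.08],
    'val_loss': [0.55, 0.30, 0.22, 0.25, 0.31],
}
summary = summarize_history(example_history)
print(summary)
```

A validation loss that rises after the best epoch while the training loss keeps falling, as in this made-up example, is the usual sign of overfitting.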


### Model Evaluation on Test Data

* Load saved model

In [None]:
from tensorflow.keras.models import load_model
model = load_model('outputs/v1/mildew_detector_model.h5')

* Evaluate model on test set

In [None]:
evaluation = model.evaluate(test_generator)

* Save evaluation pickle

In [None]:
joblib.dump(value=evaluation ,
            filename=f"outputs/v1/evaluation.pkl")

* ROC Curve

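The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) across classification thresholds. The area under the curve (AUC) summarises the curve in a single number: 1.0 means the classifier ranks every infected leaf above every healthy one, and 0.5 is no better than random guessing.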

In [None]:
from sklearn.metrics import roc_curve, auc

# Use a non-shuffled generator so the predictions line up with `.classes`
roc_generator = test_datagen.flow_from_directory(
        test_dir,
        target_size=(50, 50),
        batch_size=20,
        class_mode='binary',
        shuffle=False)

# Make predictions on the test set
pred = model.predict(roc_generator).ravel()

# Calculate FPR, TPR, and classification thresholds
fpr, tpr, thresholds = roc_curve(roc_generator.classes, pred)

# Calculate area under the curve (AUC)
roc_auc = auc(fpr, tpr)

# Plot the ROC curve
plt.figure(figsize=(8, 5))
plt.plot(fpr, tpr, color='blue', lw=2,
         label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='grey', lw=2,
         linestyle='--', label='Random Guess')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.savefig('outputs/v1/roc_curve.png', bbox_inches='tight', dpi=150)
plt.show()
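To make the TPR and FPR definitions concrete, the curve can be traced by hand on a toy example. This is a minimal sketch with made-up labels and scores; `roc_points` and `trapezoid_auc` are illustrative helpers, not scikit-learn functions, and the sketch skips refinements such as dropping collinear points.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists with one point per distinct threshold, highest first."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    fpr, tpr = [0.0], [0.0]
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(p and t == 1 for p, t in zip(predicted, y_true))
        fp = sum(p and t == 0 for p, t in zip(predicted, y_true))
        tpr.append(tp / positives)
        fpr.append(fp / negatives)
    return fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )

# Made-up labels and scores (1 = powdery_mildew, 0 = healthy)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

demo_fpr, demo_tpr = roc_points(y_true, scores)
demo_auc = trapezoid_auc(demo_fpr, demo_tpr)
print(demo_fpr, demo_tpr, demo_auc)
```

Lowering the threshold moves the point up (more true positives) or to the right (more false positives), which is why a curve that hugs the top-left corner indicates a better classifier.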

* Grayscale conversion and color histograms

In [None]:
import cv2

def grayscale_with_color_histograms(my_image):
    # Load the image
    image = cv2.imread(my_image)
    
    # Check if the image is loaded successfully
    if image is None:
        raise ValueError("Error loading image. Please check the file path.")
    
    # Convert the image to grayscale
    grayscale_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    
    # Compute a histogram for each color channel (OpenCV loads images in BGR order)
    b_hist = cv2.calcHist([image], [0], None, [256], [0, 256])
    g_hist = cv2.calcHist([image], [1], None, [256], [0, 256])
    r_hist = cv2.calcHist([image], [2], None, [256], [0, 256])
    
    # Plot the color histograms
    plt.figure(figsize=(10, 6))
    plt.plot(b_hist, color='blue', label='Blue')
    plt.plot(g_hist, color='green', label='Green')
    plt.plot(r_hist, color='red', label='Red')
    plt.title('Color Histograms')
    plt.xlabel('Pixel Intensity')
    plt.ylabel('Frequency')
    plt.legend()
    plt.grid(True)
    plt.show()
    
    # Return the grayscale image
    return grayscale_image

# Example usage
my_image = 'static/images/powdery_mildew_3.jpg'
result_image = grayscale_with_color_histograms(my_image)
plt.imshow(result_image, cmap='gray')
plt.show()
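A per-channel histogram is simply a count of how many pixels take each intensity value. The sketch below reproduces that idea with NumPy on a made-up 2x2 image; with 256 bins over the range 0 to 256, `cv2.calcHist` yields the same counts, returned as a 256x1 float array.

```python
import numpy as np

# Made-up 2x2 image in OpenCV's BGR channel order
bgr = np.array([[[255, 0, 0], [255, 0, 0]],
                [[0, 128, 0], [0, 0, 64]]], dtype=np.uint8)

# One 256-bin histogram per channel: bin i counts the pixels with intensity i
histograms = {
    name: np.bincount(bgr[:, :, channel].ravel(), minlength=256)
    for channel, name in enumerate(('blue', 'green', 'red'))
}

for name, hist in histograms.items():
    print(name, np.flatnonzero(hist).tolist())
```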


* HSV (Hue, Saturation, Value) color space in OpenCV

In [None]:
import cv2
import matplotlib.pyplot as plt

def convert_to_hsv(image_path):
    # Load the image
    image = cv2.imread(image_path)
    
    # Check if the image is loaded successfully
    if image is None:
        raise ValueError("Error loading image. Please check the file path.")
    
    # Convert BGR to HSV
    hsv_image = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
    
    return hsv_image

# Example usage
image_path = '/workspace/Portfolio_5_Cherry_Leaves_Mildew/static/images/powdery_mildew_2.jpg'
hsv_image = convert_to_hsv(image_path)

# Display the original and HSV images
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.imshow(cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB))
plt.title('Original Image')

plt.subplot(1, 2, 2)
# Matplotlib has no HSV display mode, so the H, S, V channels are rendered as false color
plt.imshow(cv2.cvtColor(hsv_image, cv2.COLOR_BGR2RGB))
plt.title('HSV Image')

plt.show()
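For 8-bit images OpenCV stores hue in the range 0 to 179 (degrees divided by two) and saturation and value in the range 0 to 255. The sketch below shows that scaling with the standard library's `colorsys` module; `rgb_to_opencv_hsv` is a hypothetical helper, and its results may differ from OpenCV's by rounding.

```python
import colorsys

def rgb_to_opencv_hsv(r, g, b):
    """Convert 8-bit RGB to the 8-bit HSV ranges OpenCV uses (H: 0-179, S and V: 0-255)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return round(h * 180) % 180, round(s * 255), round(v * 255)

print(rgb_to_opencv_hsv(0, 255, 0))      # pure green
print(rgb_to_opencv_hsv(255, 255, 255))  # white
```

Separating hue from brightness in this way can make it easier to compare leaf color across images taken under different lighting.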


### Prediction on a Random Image File

* Load a random image as PIL

In [None]:
import random
from tensorflow.keras.preprocessing import image

# Select a random label from the 'labels' list
random_label = random.choice(labels)

# Construct the path for the chosen label
label_dir = os.path.join(test_dir, random_label)

# Get a list of all files in the chosen label directory
files_in_label_dir = os.listdir(label_dir)

# Select a random file from the list of files
random_file = random.choice(files_in_label_dir)

# Construct the full path for the randomly selected image
image_path = os.path.join(label_dir, random_file)

# Load the image at the model's input size (50x50 here), which may differ from image_shape
pil_image = image.load_img(image_path, target_size=model.input_shape[1:3], color_mode='rgb')

print(f'Randomly selected label: {random_label}')
print(f'Image path: {image_path}')
print(f'Image size: {pil_image.size}, Image mode: {pil_image.mode}')



* Display the image

In [None]:
import matplotlib.pyplot as plt

plt.imshow(pil_image)
plt.axis('off')
plt.show()

* Convert image to array and prepare for prediction

In [None]:
my_image = image.img_to_array(pil_image)
my_image = np.expand_dims(my_image, axis=0)/255
print(my_image.shape)

* Predict class probabilities

In [None]:
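pred_proba = model.predict(my_image)[0, 0]

# Invert class_indices so the predicted index (0 or 1) maps back to a label
target_map = {v: k for k, v in train_generator.class_indices.items()}
pred_class = target_map[int(pred_proba > 0.5)]

# The sigmoid output is the probability of class 1; report the probability of the predicted class
if pred_class == target_map[0]:
    pred_proba = 1 - pred_proba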

print(pred_proba)
print(pred_class)
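The same mapping from a sigmoid output to a label and a confidence can be wrapped in a small function. This is a minimal sketch; `interpret_prediction` and `demo_class_indices` are illustrative names, and the probabilities are made up.

```python
def interpret_prediction(proba_positive, class_indices, threshold=0.5):
    """Map a sigmoid output to (label, confidence in that label)."""
    index_to_label = {index: label for label, index in class_indices.items()}
    predicted_index = int(proba_positive > threshold)
    # The model outputs P(class 1); flip it when class 0 is predicted
    confidence = proba_positive if predicted_index == 1 else 1 - proba_positive
    return index_to_label[predicted_index], confidence

# Assumed mapping, matching the class indices saved in section 9
demo_class_indices = {'healthy': 0, 'powdery_mildew': 1}

print(interpret_prediction(0.92, demo_class_indices))
print(interpret_prediction(0.08, demo_class_indices))
```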

---

## 12. Conclusions

---

- **Effective Performance**: The model showcases impressive performance, even with a relatively small dataset, highlighting its efficiency in learning from limited data.

- **Consistent Learning**: Analysis of loss and accuracy curves reveals a stable and consistent training behavior, with no signs of overfitting or underfitting, indicating a well-tuned model.

- **Accurate Predictions**: The model accurately predicts the class of new, unseen images, which supports its ability to generalize.

- **Data Augmentation Impact**: The application of data augmentation techniques significantly contributed to the model's robustness, allowing it to handle a variety of image orientations and scales.

- **Real-World Applicability**: The model's reliability in classifying cherry leaf diseases underlines its potential for real-world agricultural applications, offering valuable support for early disease detection and management.

- **Future Improvement Avenues**: While current results are promising, exploring more complex architectures, deeper networks, and larger datasets could further enhance model performance and reliability.