# **Data Modelling and Evaluation**

---

## Objectives

* Answer business requirement 2: 
    * The client seeks to predict whether a cherry leaf is healthy or infected with powdery mildew.

## Inputs

* inputs/cherry_leaves_dataset/cherry-leaves/train
* inputs/cherry_leaves_dataset/cherry-leaves/test
* inputs/cherry_leaves_dataset/cherry-leaves/validation
* image shape embeddings

## Outputs

* Images distribution plot in train, validation, and test set
* Image augmentation
* Class indices to change prediction inference in labels
* Machine learning model creation and training
* Save model
* Learning curve plot for model performance
* Model evaluation on pickle file
* Prediction on the random image file





## Additional Comments:

N/A


---

## 1. Set Data Directory and Import Libraries

---

Import libraries

In [97]:
import os
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import pandas as pd
import seaborn as sns
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, auc, precision_score, recall_score, f1_score
import joblib


---

## 2. Set working directory

---

In [98]:
cwd= os.getcwd()

In [99]:
os.chdir('/workspace/Portfolio_5_Cherry_Leaves_Mildew')
print("Changed current directory to workspace.")

Changed current directory to workspace.


In [100]:
work_dir = os.getcwd()
work_dir

'/workspace/Portfolio_5_Cherry_Leaves_Mildew'

---

## 3. Set input directories (Train, Validation, Test)

---

Set train, validation and test paths.

In [101]:
base_dir = '/workspace/Portfolio_5_Cherry_Leaves_Mildew/inputs/cherry_leaves_dataset/cherry-leaves'
train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')
test_dir = os.path.join(base_dir, 'test')


---

## 4. Set output directory

---

* Set output directory

In [102]:
outputs_dir = '/workspace/Portfolio_5_Cherry_Leaves_Mildew'
version = 'v1'
file_path = os.path.join(outputs_dir, 'outputs', version)

* Function to automatically increment version

In [103]:
def increment_version(ver):
    base, num = ver[:-1], int(ver[-1])
    return f"{base}{num + 1}"

* Check if 'outputs' directory exists, if not, create it

In [104]:
if 'outputs' not in os.listdir(outputs_dir):
    os.makedirs(os.path.join(outputs_dir, 'outputs'))

---

## 5. Set label names

---

Set labels

In [105]:
labels = os.listdir(train_dir)
print('Label for the images are', labels)

Label for the images are ['healthy', 'powdery_mildew']


---

## 6. Set image shape

---

Import saved image shape embedding

In [106]:
version = 'v1'
image_shape = joblib.load(filename=f"outputs/{version}/image_shape.pkl")
image_shape

(50, 50)

---

## Images distribution

---

Function to count number of images per set and label

In [107]:
def count_images_per_label(base_dir, folders, labels):
    df_freq = pd.DataFrame([])
    for folder in folders:
        for label in labels:
            num_images = len(os.listdir(os.path.join(base_dir, folder, label)))
            df_freq = pd.concat([df_freq, pd.Series(data={'Set': folder, 'Label': label, 'Count': num_images})], ignore_index=True)
            print(f"* {folder} - {label}: {num_images} images")
    return df_freq

Function to plot bar chart

In [109]:
import plotly.express as px

sets = ["train", "train", "test", "test", "validation", "validation"]
labels = ["healthy", "powdery_mildew", "healthy", "powdery_mildew", "healthy", "powdery_mildew"]
counts = [1472, 1472, 422, 422, 210, 210]

# Create the DataFrame
df_freq = pd.DataFrame({'Set': sets, 'Label': labels, 'Count': counts})

# Define colors
color_palette = {"healthy": "darkgreen", "powdery_mildew": "yellow"}

fig = px.bar(df_freq, 
             x='Set', 
             y='Count', 
             color='Label',
             color_discrete_map=color_palette,
             title='Cherry Leaves Dataset',
             text_auto=True
            )

fig.update_layout(
    autosize=False,
    width=800, 
    height=500, 
    )
fig.show()
fig.write_image(f'outputs/v1/bar_chart.png')


Function to plot pie chart

In [None]:
def plot_pie_chart(df):
    fig = plt.figure(figsize=(8, 8))
    plt.pie(df['Count'], labels=df['Set'], autopct='%1.1f%%', colors=['blue', 'green', 'orange'])
    plt.title('Dataset Split')
    plt.savefig(f'outputs/v1/pie_chart.png')
    plt.show()


---

## 7. Image augmentation

---

In [None]:
# Import imagedatagenerator
from tensorflow.keras.preprocessing.image import ImageDataGenerator

Initialising the function

In [None]:
augmented_dataset = ImageDataGenerator(rotation_range=10,
                                          width_shift_range=0.10,
                                          height_shift_range=0.10,
                                          shear_range=0.1,
                                          zoom_range=0.1,
                                          horizontal_flip=True,
                                          vertical_flip=True,
                                          fill_mode='nearest',
                                          rescale=1./255
                                          )

Augmenting training dataset

In [None]:
batch_size = 15 # running batch of 15 at a time
train_data_set = augmented_dataset.flow_from_directory(train_dir,
                                                     target_size=image_shape[:2],
                                                     color_mode='rgb',
                                                     batch_size=batch_size,
                                                     class_mode='binary',
                                                     shuffle=True
                                                     )

train_data_set.class_indices

Augmentaing test dataset

In [None]:
test_data_set = ImageDataGenerator(rescale=1./255).flow_from_directory(test_dir,
                                                                  target_size=image_shape[:2],
                                                                  color_mode='rgb',
                                                                  batch_size=batch_size,
                                                                  class_mode='binary',
                                                                  shuffle=False
                                                                  )

test_data_set.class_indices

Augmenting validation dataset

In [None]:
validation_data_set = ImageDataGenerator(rescale=1./255).flow_from_directory(validation_dir,
                                                                        target_size=image_shape[:2],
                                                                        color_mode='rgb',
                                                                        batch_size=batch_size,
                                                                        class_mode='binary',
                                                                        shuffle=False
                                                                        )

validation_data_set.class_indices

Plotting augmented train images

In [None]:
for _ in range(5):
    img, label = train_data_set.next()
    print(img.shape) #  (batch_size,h,w,rgb)
    plt.imshow(img[0])
    plt.show()

Plotting augmented test images

In [None]:
for _ in range(5):
    img, label = test_data_set.next()
    print(img.shape)  #  (batch_size,h,w,rgb)
    plt.imshow(img[0])
    plt.show()

Plotting validation test images

In [None]:
for _ in range(5):
    img, label = validation_data_set.next()
    print(img.shape)  #  (batch_size,h,w,rgb)
    plt.imshow(img[0])
    plt.show()

---

## 8. Save Class Indices

---

In [None]:
class_indices = train_data_set.class_indices 
joblib.dump(class_indices, 'outputs/v1/class_indices.pkl')

---

## 9. Machine Learning Model Creation

---

### ML Model

In [None]:
def create_ml_model(image_shape):
    model = Sequential()

    model.add(Conv2D(filters=32, kernel_size=(3, 3),
              input_shape=image_shape, activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Conv2D(filters=64, kernel_size=(3, 3),
              activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Conv2D(filters=64, kernel_size=(3, 3),
              activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

    model.add(Flatten())
    model.add(Dense(128, activation='relu'))

    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    return model

Model Summary

In [None]:
# Assuming you have defined image_shape previously
image_shape = (50, 50, 3)  # Example image shape

# Create the model
model = create_ml_model(image_shape)

# Print model summary
model.summary()

Callback - Early Stoppage Define a call back functioin to monitor accuracy of 99% 

In [None]:
class myCallback(tf.keras.callbacks.Callback):
  def on_epoch_end(self, epoch, logs={}):
    if(logs.get('accuracy')>0.99):
      print(" \n Reached 99% accuracy so cancelling training!")
      self.model.stop_training = True

callbacks = myCallback()

In [None]:
model = create_ml_model(image_shape)
model.fit(train_data_set,
          epochs=20,
          steps_per_epoch=len(train_data_set.classes) // batch_size,
          validation_data=validation_data_set,
          callbacks=[callbacks],
          verbose=1
          )

Saving the model

In [None]:
model.save('outputs/v1/mildew_detection_model.h5')

---

## 11. Model Performace Metrics and Evaluation

---

Learning Curve

In [None]:
loss = pd.DataFrame(model.history.history)

sns.set_style("darkgrid")
loss[['loss', 'val_loss']].plot(style='.-')
plt.title("Loss")
plt.savefig(f'{file_path}/model_training_losses.png',
            bbox_inches='tight', dpi=150)
plt.show()

print("\n")
loss[['accuracy', 'val_accuracy']].plot(style='.-')
plt.title("Accuracy")
plt.savefig(f'{file_path}/model_training_acc.png',
            bbox_inches='tight', dpi=150)
plt.show()

Model Evaluation on Test Data

In [None]:
model = load_model(f"outputs/v1/best_model.h5")
evaluation = model.evaluate(test_data_set)
print("Test Loss:", evaluation[0])
print("Test Accuracy:", evaluation[1])

Save evaluation pickle

In [None]:
joblib.dump(value=evaluation ,
            filename=f"outputs/v1/evaluation.pkl")

In [None]:
test_predictions = model.predict(test_data_set)
test_predictions_binary = np.where(test_predictions > 0.5, 1, 0)

Classification Report

In [None]:
print("Classification Report:")
print(classification_report(test_data_set.classes, test_predictions_binary))


In [None]:
import matplotlib as mpl
clf_report = classification_report(test_data_set.classes, test_predictions_binary, output_dict=True)
fig, ax = plt.subplots(figsize=(8,5))
sns.heatmap(pd.DataFrame(clf_report).iloc[:-1, :].T, annot=True, cmap="Greens", cbar=False, linewidths=1)
plt.title('Classification Report')
plt.savefig('outputs/v1/classification_report.png')

# Confusion Matrix

In [None]:
conf_matrix = confusion_matrix(test_data_set.classes, test_predictions_binary)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Greens', cbar=False)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.savefig("outputs/v1/confusion_matrix.png")
plt.show()

# ROC Curve and AUC

In [None]:
from sklearn.metrics import roc_curve, auc

# Make predictions on the test set
pred = model.predict(test_data_set)

# Calculate FPR, TPR, and classification thresholds
fpr, tpr, thresholds = roc_curve(test_data_set.classes, pred)

# Calculate area under the curve (AUC)
roc_auc = auc(fpr, tpr)

# Plot the ROC curve
plt.figure(figsize=(8, 5))
plt.plot(fpr, tpr, color='green', lw=2,
         label=f'ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='grey', lw=2,
         linestyle='--', label='Random Guess')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.savefig(f'{file_path}/roc_curve.png', bbox_inches='tight', dpi=150)

---

## 12. Prediction on a Random Image File

---

* Load a random image as PIL

In [None]:
import random
from tensorflow.keras.preprocessing import image

# Select a random label from the 'labels' list
random_label = random.choice(labels)

# Construct the path for the chosen label
label_dir = os.path.join(test_dir, random_label)

# Get a list of all files in the chosen label directory
files_in_label_dir = os.listdir(label_dir)

# Select a random file from the list of files
random_file = random.choice(files_in_label_dir)

# Construct the full path for the randomly selected image
image_path = os.path.join(label_dir, random_file)

# Load the image
pil_image = image.load_img(image_path, target_size=image_shape, color_mode='rgb')

print(f'Randomly selected label: {random_label}')
print(f'Image path: {image_path}')
print(f'Image shape: {pil_image.size}, Image mode: {pil_image.mode}')



* Display the image

In [None]:
import matplotlib.pyplot as plt

plt.imshow(pil_image)
plt.axis('off')
plt.show()

* Convert image to array and prepare for prediction

In [None]:
my_image = image.img_to_array(pil_image)
my_image = np.expand_dims(my_image, axis=0)/255
print(my_image.shape)

* Predict class probabilities

In [None]:
pred_proba = model.predict(my_image)[0,0]

target_map = {v: k for k, v in train_data_set.class_indices.items()}
pred_class =  target_map[pred_proba > 0.5]  

if pred_class == target_map[0]: pred_proba = 1 - pred_proba

print(pred_proba)
print(pred_class)

----

## 13. Conlusions

---

- **Effective Performance**: The model showcases impressive performance, even with a relatively small dataset, highlighting its efficiency in learning from limited data.

- **Consistent Learning**: Analysis of loss and accuracy curves reveals a stable and consistent training behavior, with no signs of overfitting or underfitting, indicating a well-tuned model.

- **Accurate Predictions**: Demonstrates a strong capability to accurately predict the class of new, unseen images, confirming the model's generalization ability.

- **Data Augmentation Impact**: The application of data augmentation techniques significantly contributed to the model's robustness, allowing it to handle a variety of image orientations and scales.

- **Real-World Applicability**: The model's reliability in classifying cherry leaf diseases underlines its potential for real-world agricultural applications, offering valuable support for early disease detection and management.

- **Future Improvement Avenues**: While current results are promising, exploring more complex architectures, deeper networks, and larger datasets could further enhance model performance and reliability.