# **Data Modelling and Evaluation**

---

## Objectives

* Answer business requirement 2: 
    * The client seeks to predict whether a cherry leaf is healthy or infected with powdery mildew.

## Inputs

* inputs/cherry_leaves_dataset/cherry-leaves/train
* inputs/cherry_leaves_dataset/cherry-leaves/test
* inputs/cherry_leaves_dataset/cherry-leaves/validation
* image shape embeddings

## Outputs

* Image distribution plot for the train, validation, and test sets
* Image augmentation
* Class indices to map model predictions to labels
* Machine learning model creation and training
* Saved model
* Learning curve plot for model performance
* Model evaluation saved as a pickle file
* Prediction on a random image file





## Additional Comments

N/A


---

# Set Data Directory

---

## Import libraries

In [46]:
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt
import joblib
import cv2

## Set working directory

In [47]:
cwd = os.getcwd()

In [48]:
os.chdir('/workspace/Portfolio_5_Cherry_Leaves_Mildew')
print("You set a new current directory")

You set a new current directory


In [49]:
current_dir = os.getcwd()
current_dir

'/workspace/Portfolio_5_Cherry_Leaves_Mildew'

## Set input directories

Set train, validation and test paths.

In [50]:
my_data_dir = 'inputs/cherry_leaves_dataset/cherry-leaves'
train_path = my_data_dir + '/train'
val_path = my_data_dir + '/validation'
test_path = my_data_dir + '/test'
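
Before going further it can help to confirm that each split directory actually exists. The following is a minimal sketch, assuming the `train`/`validation`/`test` layout used above; the helper name `missing_split_dirs` is hypothetical and not part of the notebook, and the demonstration runs on a throwaway temporary folder rather than the real dataset.

```python
import os
import tempfile


def missing_split_dirs(base_dir, splits=("train", "validation", "test")):
    """Return the split folders that do not exist under base_dir."""
    return [s for s in splits if not os.path.isdir(os.path.join(base_dir, s))]


# Demonstration on a temporary directory that only contains 'train' and 'test'
with tempfile.TemporaryDirectory() as demo_dir:
    os.makedirs(os.path.join(demo_dir, "train"))
    os.makedirs(os.path.join(demo_dir, "test"))
    demo_missing = missing_split_dirs(demo_dir)
    print(demo_missing)  # ['validation']
```

In the notebook this could be called as `missing_split_dirs(my_data_dir)`; an empty list would mean all three paths are present.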

## Set output directory

In [51]:
version = 'v1'
file_path = f'outputs/{version}'

if os.path.isdir(os.path.join(current_dir, file_path)):
    print('Old version is already available; create a new version.')
else:
    os.makedirs(name=file_path)

Old version is already available; create a new version.
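
As an alternative to checking for the folder by hand, `os.makedirs` accepts `exist_ok=True`, which makes the call safe to repeat. The sketch below assumes the same `outputs/<version>` layout; the helper `ensure_output_dir` is a hypothetical name introduced for illustration, and it runs against a temporary folder.

```python
import os
import tempfile


def ensure_output_dir(base_dir, version):
    """Create base_dir/outputs/<version> if needed; return (path, already_existed)."""
    path = os.path.join(base_dir, "outputs", version)
    already_existed = os.path.isdir(path)
    os.makedirs(path, exist_ok=True)  # no error if the folder is already there
    return path, already_existed


with tempfile.TemporaryDirectory() as demo_dir:
    _, first_call_existed = ensure_output_dir(demo_dir, "v1")
    _, second_call_existed = ensure_output_dir(demo_dir, "v1")
    print(first_call_existed, second_call_existed)  # False True
```

Returning the `already_existed` flag keeps the "old version" warning possible without branching on directory listings.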


### Set label names

In [52]:
# Set the labels
labels = os.listdir(train_path)
print('Labels for the images are', labels)

Labels for the images are ['healthy', 'powdery_mildew']
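
`os.listdir` returns entries in arbitrary order and also lists stray files or hidden folders, so the label list can change between machines. A minimal sketch of a more defensive listing follows; `list_class_labels` is a hypothetical helper, and the folder contents in the demonstration are invented for illustration.

```python
import os
import tempfile


def list_class_labels(split_dir):
    """Return sorted names of the class subfolders, ignoring files and hidden entries."""
    return sorted(
        name for name in os.listdir(split_dir)
        if os.path.isdir(os.path.join(split_dir, name)) and not name.startswith(".")
    )


with tempfile.TemporaryDirectory() as demo_dir:
    # Two class folders, one hidden folder, and one stray file
    for folder in ("powdery_mildew", "healthy", ".ipynb_checkpoints"):
        os.makedirs(os.path.join(demo_dir, folder))
    open(os.path.join(demo_dir, "notes.txt"), "w").close()
    demo_labels = list_class_labels(demo_dir)
    print(demo_labels)  # ['healthy', 'powdery_mildew']
```

Sorting keeps the label order, and therefore any class index derived from it, stable across runs.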


### Set image shape

In [53]:
# Import saved image shape embedding
version = 'v1'
image_shape = joblib.load(filename=f"outputs/{version}/image_shape.pkl")
image_shape

(50, 50)

---

## Number of images in the train, test, and validation data

---

In [54]:
# Collect one row per (set, label) pair, then build the DataFrame in one step
# so that 'Set', 'Label' and 'Frequency' become columns
rows = []
for folder in ['train', 'validation', 'test']:
    for label in labels:
        n_images = len(os.listdir(os.path.join(my_data_dir, folder, label)))
        rows.append({'Set': folder, 'Label': label, 'Frequency': n_images})
        print(f"* {folder} - {label}: {n_images} images")

df_freq = pd.DataFrame(rows)

print("\n")
sns.set_style("whitegrid")
plt.figure(figsize=(8, 5))
# Keyword arguments are lowercase: x, not X
sns.barplot(data=df_freq, x='Set', y='Frequency', hue='Label')
plt.savefig(f'{file_path}/labels_distribution.png', bbox_inches='tight', dpi=150)
plt.show()


* train - healthy: 1472 images
* train - powdery_mildew: 1472 images
* validation - healthy: 315 images
* validation - powdery_mildew: 315 images
* test - healthy: 317 images
* test - powdery_mildew: 317 images

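
The printed counts are identical for both labels in every set, so the classes are balanced. The short calculation below works out the share of each split per class from those counts; it uses only the numbers reported above, and the variable names are illustrative.

```python
# Per-class image counts copied from the printed output above
counts = {"train": 1472, "validation": 315, "test": 317}

total_per_class = sum(counts.values())
split_share = {name: n / total_per_class for name, n in counts.items()}

for name, share in split_share.items():
    print(f"{name}: {share:.1%}")
```

The shares come out at roughly 70% train, 15% validation, and 15% test.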

---

## Image data augmentation

---

### Image augmentation with OpenCV

In [63]:
import cv2
import os
import numpy as np

# Augment a single image with OpenCV
def augment_image(image):
    # Random rotation by the sampled angle around the image centre
    angle = np.random.randint(0, 360)
    height, width = image.shape[:2]
    rotation_matrix = cv2.getRotationMatrix2D((width / 2, height / 2), angle, 1.0)
    image = cv2.warpAffine(image, rotation_matrix, (width, height))

    # Random horizontal flip
    if np.random.rand() > 0.5:
        image = cv2.flip(image, 1)

    # Other augmentations can be added similarly
    return image

# Load images from a directory of class subfolders, optionally augmenting them
def load_and_augment_images(directory_path, num_samples, target_size=(150, 150), augment=True):
    images = []
    labels = []
    # Sort the class folders so the label indices are deterministic
    classes = sorted(os.listdir(directory_path))
    class_indices = {class_name: i for i, class_name in enumerate(classes)}
    num_samples_per_class = num_samples // len(classes)

    for class_name in classes:
        class_path = os.path.join(directory_path, class_name)
        # Match extensions case-insensitively so files such as .JPG are not skipped
        image_files = [f for f in os.listdir(class_path)
                       if f.lower().endswith(('.jpg', '.png'))]

        for image_file in image_files[:num_samples_per_class]:
            image = cv2.imread(os.path.join(class_path, image_file))
            if image is None:
                continue  # skip files OpenCV cannot read
            image = cv2.resize(image, target_size)
            if augment:
                image = augment_image(image)
            images.append(image)
            labels.append(class_indices[class_name])

    return np.array(images), np.array(labels), class_indices

# Load and augment training data
train_images, train_labels, class_indices = load_and_augment_images(train_path, num_samples=1000)

# Load validation data without augmentation
validation_images, validation_labels, _ = load_and_augment_images(val_path, num_samples=200, augment=False)

# Print class indices
print("Class Indices:")
print(class_indices)


Class Indices:
{'healthy': 0, 'powdery_mildew': 1}


### Plot augmented training image

In [64]:
for i in range(3):
    # OpenCV loads images in BGR order; convert to RGB for matplotlib
    img = cv2.cvtColor(train_images[i], cv2.COLOR_BGR2RGB)
    plt.imshow(img)
    plt.show()
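
A loader that filters files by extension returns an empty array when no filename matches, and indexing that array then raises an `IndexError`. One possible cause, assumed here rather than confirmed for this dataset, is a case-sensitive extension check that skips names such as `.JPG`. The sketch below compares the two checks on hypothetical filenames.

```python
# Hypothetical filenames, for illustration only
filenames = ["leaf_001.JPG", "leaf_002.jpg", "leaf_003.PNG", "notes.txt"]

# Case-sensitive check: misses upper-case extensions
case_sensitive = [f for f in filenames if f.endswith(".jpg") or f.endswith(".png")]

# Case-insensitive check: lower-case the name first; endswith accepts a tuple
case_insensitive = [f for f in filenames if f.lower().endswith((".jpg", ".png"))]

print(case_sensitive)    # ['leaf_002.jpg']
print(case_insensitive)  # ['leaf_001.JPG', 'leaf_002.jpg', 'leaf_003.PNG']
```

Checking `len(train_images)` right after loading makes this kind of silent mismatch visible before any plotting or training step.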