# **Data Modelling and Evaluation**

---

## Objectives

* Answer business requirement 2: 
    * The client seeks to predict whether a cherry leaf is healthy or infected with powdery mildew.

## Inputs

* inputs/cherry_leaves_dataset/cherry-leaves/train
* inputs/cherry_leaves_dataset/cherry-leaves/test
* inputs/cherry_leaves_dataset/cherry-leaves/validation

## Outputs

1. **Images Distribution Plot**
   - Generate distribution plots for images in the training, validation, and test datasets to understand data balance.

2. **Image Augmentation**
   - Apply data augmentation techniques to enhance the training dataset's diversity and generalization.

3. **Class Indices Mapping**
   - Create a mapping of class indices to label names to interpret model predictions.

4. **Machine Learning Model Development**
   - Choose an appropriate machine learning algorithm or neural network architecture.
   - Train the selected model using the preprocessed training data.

5. **Model Saving**
   - Save the trained machine learning model for future use.

6. **Learning Curve Plot**
   - Create learning curve plots to visualize the model's performance over training epochs.

7. **Model Evaluation and Pickle File**
   - Evaluate the trained model using the validation dataset.
   - Save the evaluation results in a pickle file for future reference.

8. **Random Image Prediction**
   - Select a random image from the dataset.
   - Utilize the trained model to predict whether the leaf is infected with powdery mildew or not.

## Steps and Tasks

1. **Data Preprocessing**
   - Load and preprocess the training, validation, and test data.
   - Generate and save images distribution plots for each dataset.

2. **Data Augmentation**
   - Apply data augmentation techniques such as rotation, scaling, and flipping to augment the training dataset.

3. **Label Mapping**
   - Create a dictionary or mapping to convert model predictions (indices) into human-readable labels.

4. **Model Development**
   - Select an appropriate machine learning model or neural network architecture.
   - Train the chosen model using the preprocessed training data.

5. **Model Saving**
   - Save the trained model to a file for later use.

6. **Learning Curve Plotting**
   - Plot learning curves to visualize how the model's performance changes during training.

7. **Model Evaluation**
   - Evaluate the trained model's performance using the validation dataset.
   - Save the evaluation results, including accuracy, precision, recall, and F1-score, in a pickle file.

8. **Random Image Prediction**
   - Select a random image from the dataset.
   - Use the trained model to predict whether the selected leaf image is infected with powdery mildew.
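
As a rough illustration of step 7, the sketch below computes accuracy, precision, recall, and F1-score from binary labels and round-trips them through a pickle file. The helper `binary_metrics`, the sample labels, the file name `evaluation.pkl`, and the choice of 1 = powdery mildew as the positive class are all hypothetical and not taken from this notebook.

```python
import os
import pickle
import tempfile


def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or present
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}


# Hypothetical labels (assumption): 1 = powdery mildew, 0 = healthy
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
evaluation = binary_metrics(y_true, y_pred)

# Save the results to a pickle file in a temporary folder, then load them back
eval_path = os.path.join(tempfile.mkdtemp(), 'evaluation.pkl')
with open(eval_path, 'wb') as f:
    pickle.dump(evaluation, f)
with open(eval_path, 'rb') as f:
    restored = pickle.load(f)

print(restored)
```

In the notebook itself the predictions would come from the trained model and the file would live under the versioned output folder; the temporary folder here only keeps the sketch self-contained.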





## Additional Comments

N/A


---

# Set Data Directory

---

## Import libraries

In [None]:
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt

## Set working directory

In [None]:
cwd = os.getcwd()

In [None]:
os.chdir('/workspace/Portfolio_5_Cherry_Leaves_Mildew')
print("You set a new current directory")

In [None]:
current_dir = os.getcwd()
current_dir

## Set input directories

Set train, validation and test paths.

In [None]:
my_data_dir = 'inputs/cherry_leaves_dataset/cherry-leaves'
train_path = my_data_dir + '/train'
val_path = my_data_dir + '/validation'
test_path = my_data_dir + '/test'

## Set output directory

In [None]:
version = 'v1'
file_path = f'outputs/{version}'

# Create the versioned output folder unless it already exists
if os.path.isdir(file_path):
    print('This version already exists; create a new version.')
else:
    os.makedirs(name=file_path)

## Set label names

In [None]:
# Set the labels from the class subfolders of the training set
labels = os.listdir(train_path)
print('Labels for the images are', labels)
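
`os.listdir` returns entries in arbitrary order and can include stray files, so the label list may differ between machines. A minimal sketch, assuming each class is a subdirectory of the training folder: keep only directories, sort them, and build an index-to-label mapping. The helper `list_labels`, the temporary folder, and the class folder names `healthy` and `powdery_mildew` are hypothetical stand-ins used for illustration.

```python
import os
import tempfile


def list_labels(path):
    """Return the sorted names of the subdirectories found directly under `path`."""
    return sorted(
        entry for entry in os.listdir(path)
        if os.path.isdir(os.path.join(path, entry))
    )


# Hypothetical stand-in for the training folder (assumption, not the real dataset)
demo_train = tempfile.mkdtemp()
for name in ('powdery_mildew', 'healthy'):
    os.makedirs(os.path.join(demo_train, name))

# A stray file that should not be treated as a label
open(os.path.join(demo_train, '.DS_Store'), 'w').close()

demo_labels = list_labels(demo_train)

# Map integer indices to human-readable label names
index_to_label = dict(enumerate(demo_labels))
print(index_to_label)
```

Keras directory iterators are generally documented as assigning class indices in alphanumeric order, so a sorted list should line up with them, but it is worth confirming against the generator's `class_indices` before relying on this mapping.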

---

## Count images in the train, test, and validation sets before image augmentation

---

In [None]:
# Collect one record per dataset folder and label
records = []

# Iterate through each dataset folder (test, train, validation)
for folder in ['test', 'train', 'validation']:
    for label in labels:
        # Count the images for the current folder and label combination
        num_images = len(os.listdir(os.path.join(my_data_dir, folder, label)))

        # Store the statistics for this combination
        records.append({'Set': folder,
                        'Label': label,
                        'Frequency': num_images})

        # Print the number of images for the current folder and label
        print(f"* {folder} - {label}: {num_images} images")

# Build the DataFrame in one step (DataFrame.append was removed in pandas 2.0)
df = pd.DataFrame(records)

# Visualize the distribution of images using a bar plot
sns.set_style("whitegrid")
plt.figure(figsize=(8, 5))
sns.barplot(data=df, x='Set', y='Frequency', hue='Label')

# Save the distribution plot as an image
plt.savefig(f'{file_path}/labels_distribution.png',
            bbox_inches='tight', dpi=150)

# Display the plot
plt.show()
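
To put a number on the balance the bar plot shows, the share of each label within a set can be computed from the same kind of counts. The following is a minimal sketch in plain Python; the helper `label_shares` is hypothetical, and the counts are made up for illustration rather than taken from this dataset.

```python
from collections import defaultdict


def label_shares(records):
    """Return {(set, label): share of that label within its set}."""
    # Total number of images per dataset split
    totals = defaultdict(int)
    for record in records:
        totals[record['Set']] += record['Frequency']

    # Share of each label within its split; empty splits are skipped
    return {
        (record['Set'], record['Label']): record['Frequency'] / totals[record['Set']]
        for record in records
        if totals[record['Set']]
    }


# Made-up counts (assumption), shaped like the rows collected for the bar plot
demo_records = [
    {'Set': 'train', 'Label': 'healthy', 'Frequency': 70},
    {'Set': 'train', 'Label': 'powdery_mildew', 'Frequency': 30},
    {'Set': 'test', 'Label': 'healthy', 'Frequency': 10},
    {'Set': 'test', 'Label': 'powdery_mildew', 'Frequency': 10},
]

shares = label_shares(demo_records)
for (set_name, label), share in shares.items():
    print(f"* {set_name} - {label}: {share:.0%} of the set")
```

Shares close to 50% per label in every set suggest a balanced binary problem, while a strong skew would be a reason to look at class weighting or resampling.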