# **Data Preprocessing**

- This notebook contains the first approach to data preprocessing: resizing each image to 224x224. The images were resized in an attempt to reduce the gray background and focus more on the leaf.

- I also implemented a base model.

In [1]:
import os

from PIL import Image

# Define the base path
base_path = '/Users/catarinavuzi/Downloads/image data'

# Define the subdirectories
subdirs = ['test', 'train', 'validation']

In [None]:
# Iterate through each subdirectory
for subdir in subdirs:
    tomato_path = os.path.join(base_path, subdir, 'tomato')
    if os.path.exists(tomato_path):
        print(f"Contents of {tomato_path}:")
        for category in os.listdir(tomato_path):
            category_path = os.path.join(tomato_path, category)
            if os.path.isdir(category_path):
                for item in os.listdir(category_path):
                    item_path = os.path.join(category_path, item)
                    if os.path.isfile(item_path):
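                        try:
                            with Image.open(item_path) as img:
                                # Resize to 224x224 (aspect ratio is not preserved)
                                img = img.resize((224, 224))
                                # Note: this overwrites the original file in place
                                img.save(item_path)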
                                print(f"  - {item} in {category} resized to 224x224")
                        except Exception as e:
                            print(f"  - {item} in {category} could not be resized: {e}")
                    else:
                        print(f"  - {item} in {category} is not a file")
    else:
        print(f"{tomato_path} does not exist.")


Contents of /Users/catarinavuzi/Downloads/image data/test/tomato:
  - b8682577-be6c-42a9-9d02-48300363e796___GCREC_Bact.Sp 3498.JPG in bacterial spot resized to 224x224
  - 166a69a7-cae0-4e6e-ac75-bc207a5335f4___GCREC_Bact.Sp 5705.JPG in bacterial spot resized to 224x224
  - a129e8eb-e2b4-4a8a-a509-f6625da6b11c___GCREC_Bact.Sp 3002.JPG in bacterial spot resized to 224x224
  - 5cf74b8f-2a3a-4141-8309-1c02f01ab444___GCREC_Bact.Sp 5967.JPG in bacterial spot resized to 224x224
  - be4dd24b-c2d2-4e6b-bb77-f5d6866ede96___GCREC_Bact.Sp 3123.JPG in bacterial spot resized to 224x224
  - 67c8bbdb-ba79-4c64-897f-e9052728df45___GCREC_Bact.Sp 3008.JPG in bacterial spot resized to 224x224
  - 98df469d-20d0-4b6e-aae3-617473f3081f___GCREC_Bact.Sp 5792.JPG in bacterial spot resized to 224x224
  - eabc7def-52d6-4383-8d81-ffbe7beedce8___GCREC_Bact.Sp 5572.JPG in bacterial spot resized to 224x224
  - 393946a5-3f5d-4ed3-af1f-fcb7647b369d___GCREC_Bact.Sp 6190.JPG in bacterial spot resized to 224x224
  - 15d
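
Note that `Image.resize` rescales the whole frame, so the background shrinks in proportion to the leaf rather than being removed. The following is a minimal sketch of one possible variant, not what this notebook does, assuming the leaf sits roughly in the center of the picture. It computes a central crop box that could be passed to Pillow's `Image.crop` before resizing. The helper name `center_crop_box`, the `fraction` parameter, and the 256x256 example size are hypothetical.

```python
def center_crop_box(width, height, fraction=0.8):
    """Return a (left, upper, right, lower) box covering the central
    `fraction` of each dimension of a width x height image."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    crop_w = round(width * fraction)
    crop_h = round(height * fraction)
    left = (width - crop_w) // 2
    upper = (height - crop_h) // 2
    return (left, upper, left + crop_w, upper + crop_h)


# Hypothetical 256x256 input: keep the central 87.5%, i.e. a 224x224 window
box = center_crop_box(256, 256, fraction=0.875)
print(box)  # (16, 16, 240, 240)
```

With Pillow, the box would be used as `img.crop(box).resize((224, 224))`. Whether cropping helps depends on where the leaves actually sit in the images, which has not been checked here.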

## **Implementing the base model**

In [None]:
import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Define the path to your dataset
train_data_dir = '/Users/catarinavuzi/Downloads/image data/train/tomato'
validation_data_dir = '/Users/catarinavuzi/Downloads/image data/validation/tomato'

# Define image dimensions and batch size
img_width, img_height = 224, 224
batch_size = 32

# Rescaling for both training and validation data
datagen = ImageDataGenerator(rescale=1. / 255)

# Create training and validation data generators
train_generator = datagen.flow_from_directory(
    train_data_dir,
    target_size=(img_height, img_width),  # Keras expects (height, width)
    batch_size=batch_size,
    class_mode='categorical'
)

validation_generator = datagen.flow_from_directory(
    validation_data_dir,
    target_size=(img_height, img_width),  # Keras expects (height, width)
    batch_size=batch_size,
    class_mode='categorical'
)

# Load pre-trained VGG16 model (excluding top layers)
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(img_height, img_width, 3))

# Add custom top layers for disease prediction
x = base_model.output
x = Flatten()(x)
x = Dense(256, activation='relu')(x)
predictions = Dense(train_generator.num_classes, activation='softmax')(x)

# Create the final model
model = Model(inputs=base_model.input, outputs=predictions)

# Freeze layers in the base model (optional)
for layer in base_model.layers:
    layer.trainable = False

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
epochs = 5  # Adjust as needed
model.fit(
    train_generator,
    steps_per_epoch=train_generator.samples // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=validation_generator.samples // batch_size
)

# Save the trained model
model.save('leaf_disease_vgg16_benchmark.h5')

2024-11-01 13:18:33.688280: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Found 13083 images belonging to 10 classes.
Found 3265 images belonging to 10 classes.
Epoch 1/5




408/408 ━━━━━━━━━━━━━━━━━━━━ 0s 9s/step - accuracy: 0.6943 - loss: 1.3826
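
The `408/408` step count in the progress bar follows from the floor division used for `steps_per_epoch`. The short check below reproduces it from the image counts reported in the output. The names `leftover_train` and `leftover_val` are introduced here for illustration only.

```python
train_samples = 13083       # from "Found 13083 images belonging to 10 classes."
validation_samples = 3265   # from "Found 3265 images belonging to 10 classes."
batch_size = 32

# Same floor division as in the model.fit call
steps_per_epoch = train_samples // batch_size
validation_steps = validation_samples // batch_size

# Images that do not fit into the full batches counted above
leftover_train = train_samples - steps_per_epoch * batch_size
leftover_val = validation_samples - validation_steps * batch_size

print(steps_per_epoch, leftover_train)   # 408 27
print(validation_steps, leftover_val)    # 102 1
```

So 408 full batches cover 13,056 images, which is 27 fewer than the training set. Rounding up instead, for example with `math.ceil(train_samples / batch_size)`, would give 409 steps.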


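- An accuracy of 0.6943 means that the model correctly classified about 69.43% of the training images it saw during the first epoch. This is a training metric; the output shown here includes no validation accuracy.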


- The loss (categorical cross-entropy, as set in `model.compile`) measures how far the predicted class probabilities are from the true labels. A loss of 1.3826 indicates that the predictions are still noticeably off, although it is below the roughly 2.30 (ln 10) that uniform guessing over 10 classes would give. Lower loss values indicate that the model performs better.
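
To make the two metrics concrete, here is a minimal sketch in plain Python of how accuracy and categorical cross-entropy are computed on a small batch. It follows the textbook definitions rather than Keras's internal code, and the toy labels and probabilities are made up for illustration.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean categorical cross-entropy over a batch of one-hot labels."""
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        # Only the true class contributes, since the other labels are 0
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(true_row, pred_row))
    return total / len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of samples whose most probable class matches the label."""
    hits = sum(
        true_row.index(max(true_row)) == pred_row.index(max(pred_row))
        for true_row, pred_row in zip(y_true, y_pred)
    )
    return hits / len(y_true)


# Toy batch: 3 samples, 3 classes (made-up numbers)
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
y_pred = [[0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.6, 0.3, 0.1]]

print(round(accuracy(y_true, y_pred), 4))                   # 0.6667
print(round(categorical_cross_entropy(y_true, y_pred), 4))  # 1.1175

# Reference point: predicting 0.1 for each of 10 classes
uniform_loss = -math.log(1 / 10)
print(round(uniform_loss, 4))                               # 2.3026
```

The third sample shows why loss and accuracy can move differently: one confident wrong prediction costs about 2.30 on its own, while it only counts as a single miss for accuracy.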



- **Oversampling**: used when the minority class has too few examples. It duplicates samples from the minority class to create a more balanced dataset.

- **Undersampling**: used when the majority class has far more examples than the others. It removes samples from the majority class to create a more balanced dataset.
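
Both ideas can be sketched with the standard library alone. The example below is a simple random version of each technique, not something applied in this notebook. The function names, the placeholder file names, and the `class_a`/`class_b` labels are hypothetical.

```python
import random
from collections import Counter


def group_by_class(samples, labels):
    """Map each label to the list of samples that carry it."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(label, []).append(sample)
    return groups


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority samples until all classes match the largest."""
    rng = random.Random(seed)
    groups = group_by_class(samples, labels)
    target = max(len(items) for items in groups.values())
    out_samples, out_labels = [], []
    for label, items in groups.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples until all classes match the smallest."""
    rng = random.Random(seed)
    groups = group_by_class(samples, labels)
    target = min(len(items) for items in groups.values())
    out_samples, out_labels = [], []
    for label, items in groups.items():
        out_samples.extend(rng.sample(items, target))
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Imbalanced toy dataset: 6 samples of class_a, 2 of class_b
samples = [f"img_{i}.jpg" for i in range(8)]
labels = ["class_a"] * 6 + ["class_b"] * 2

_, over_labels = random_oversample(samples, labels)
_, under_labels = random_undersample(samples, labels)

print(dict(Counter(over_labels)))   # {'class_a': 6, 'class_b': 6}
print(dict(Counter(under_labels)))  # {'class_a': 2, 'class_b': 2}
```

If either technique were used here, it would normally be applied to the training split only, so that validation and test metrics still reflect the original class distribution.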