# **3.2.1 Data Collection**

## **Dataset**


### **Fruits and Vegetables Image Recognition Dataset**

The researchers used a Kaggle dataset of fruit and vegetable images intended for training, testing, and validating machine learning models. The images in this dataset were scraped from Bing Image Search (Seth, 2020). The dataset covers 36 image classes:
  - Fruits: banana, apple, pear, grapes, orange, kiwi, watermelon, pomegranate, pineapple, and mango.
  - Vegetables: cucumber, carrot, capsicum, onion, potato, lemon, tomato, radish, beetroot, cabbage, lettuce, spinach, soy bean, cauliflower, bell pepper, chili pepper, turnip, corn, sweetcorn, sweet potato, paprika, jalapeño, ginger, garlic, peas, and eggplant.

The dataset is organized into three folders:
  - train (100 images per class)
  - test (10 images per class)
  - validation (10 images per class)
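
The per-class figures above are nominal, so it can help to tally what is actually on disk. The following is a minimal sketch, assuming each split folder holds one subfolder per class; the helper name `count_images_per_class` and the list of extensions are illustrative and not part of the original notebook.

```python
import os

# Hypothetical helper (not from the original notebook): tally image files per class folder.
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')  # assumed set of image extensions

def count_images_per_class(split_dir):
    """Return {class_name: image_count} for a folder laid out as split_dir/<class>/<image>."""
    counts = {}
    for class_name in sorted(os.listdir(split_dir)):
        class_dir = os.path.join(split_dir, class_name)
        if not os.path.isdir(class_dir):
            continue  # ignore stray files at the split level
        counts[class_name] = sum(
            1 for name in os.listdir(class_dir)
            if name.lower().endswith(IMAGE_EXTENSIONS)
        )
    return counts
```

Calling `count_images_per_class` on the train, test, and validation folders and summing the values gives the total number of images in each split.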

## **Pre-processing**


To ensure consistency in the input data, the sampled images are first converted to the RGB color space if they are not already in that format, since the model expects three color channels (red, green, and blue). The images are then resized to a standard size of 224x224 pixels so that they all share the same dimensions before being used as the basis of the adversarial images. Next, the images are converted to tensors (multi-dimensional arrays), the form in which the model processes them. Lastly, the images are normalized using predefined mean and standard deviation values for each color channel. This gives the input values (pixel intensities) a similar distribution, which makes the learning process more efficient.
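
The normalization step can be illustrated on a single pixel. The sketch below assumes that pixel values are first scaled to the range [0, 1] and that the commonly used ImageNet channel statistics apply. The text only says the values are predefined, so `MEAN`, `STD`, and `normalize_pixel` are assumptions made for illustration.

```python
# Assumed values: the widely used ImageNet channel statistics.
# The original text does not state which mean/std were actually used.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple((value / 255.0 - m) / s for value, m, s in zip(rgb, mean, std))

# Example: an orange-like pixel
print(normalize_pixel((255, 128, 0)))
```

Each channel is shifted by its own mean and divided by its own standard deviation, so the three channels end up on a comparable scale.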


## **Dataset Sampling**


### **Sampling Method**

This study used only the test folder, since the previously discussed limitations called for a relatively small number of images. To avoid bias, the researchers applied random sampling to the fruit and vegetable image dataset, drawing 50 of the 359 test images: 25 fruit images and 25 vegetable images. These are stored in two folders (fruits and vegetables), each containing subfolders that hold the test images and are named after the true class of the images inside them.


In [1]:
# Clone the GitHub repository
!git clone https://github.com/ansem7/cs199specialproject.git

Cloning into 'cs199specialproject'...
remote: Enumerating objects: 495, done.
remote: Counting objects: 100% (495/495), done.
remote: Compressing objects: 100% (478/478), done.
remote: Total 495 (delta 28), reused 0 (delta 0), pack-reused 0
Receiving objects: 100% (495/495), 242.66 MiB | 23.58 MiB/s, done.
Resolving deltas: 100% (28/28), done.
Updating files: 100% (257/257), done.


In [None]:
import os
import shutil
import random
from google.colab import files

# Define the directories
fruits_dir = '/content/cs199specialproject/images/full_dataset/fruits'
vegetables_dir = '/content/cs199specialproject/images/full_dataset/vegetables'
sampled_dir = '/content/cs199specialproject/images/sampled_dataset'

# Create the sampled directory if it doesn't exist
os.makedirs(sampled_dir, exist_ok=True)

# Get a list of all the image filenames in the fruits and vegetables directories
fruits_images = os.listdir(fruits_dir)
vegetables_images = os.listdir(vegetables_dir)

# Randomly select 25 images from each list
selected_fruits = random.sample(fruits_images, 25)
selected_vegetables = random.sample(vegetables_images, 25)

# Copy the selected images to the new directory
for image in selected_fruits:
    shutil.copy(os.path.join(fruits_dir, image), sampled_dir)

for image in selected_vegetables:
    shutil.copy(os.path.join(vegetables_dir, image), sampled_dir)
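
The cell above draws a different sample on every run because the random generator is not seeded. If a repeatable draw is wanted, one option is a seeded generator over a sorted file list. This is only a sketch: `sample_filenames` and the seed value are hypothetical and were not used to produce the study's sample.

```python
import random

def sample_filenames(filenames, k, seed):
    """Draw k filenames without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)  # private generator, leaves the global random state untouched
    # Sort first: os.listdir returns entries in arbitrary order
    return rng.sample(sorted(filenames), k)

# Toy usage with made-up filenames
names = [f"kiwi_{i}.jpg" for i in range(1, 11)]
print(sample_filenames(names, 3, seed=0))
```

With the same seed and the same set of filenames, the function returns the same selection regardless of the order in which the files were listed.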

#### **Save the sampled dataset in a folder**
Place each image in a folder named after the class encoded in its filename (the filename contains the image's true class name).

In [None]:
# Define the directory containing the copied images
sampled_dir = '/content/cs199specialproject/images/sampled_dataset'

# Get a list of all the image filenames in the directory
image_files = os.listdir(sampled_dir)

# Iterate over the image filenames
for image_file in image_files:
    # Skip class folders left by an earlier run so that re-running this cell is safe
    if not os.path.isfile(os.path.join(sampled_dir, image_file)):
        continue

    # Drop the part after the last underscore (the image index) to get the class name
    class_name = '_'.join(image_file.split('_')[:-1])

    # Create a new directory for this class, if it doesn't exist already
    new_dir = os.path.join(sampled_dir, class_name)
    os.makedirs(new_dir, exist_ok=True)

    # Move the image file into the new directory
    shutil.move(os.path.join(sampled_dir, image_file), os.path.join(new_dir, image_file))
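
The class name extraction in the loop above can be isolated into a small function for checking. The following is a sketch using `str.rsplit`, with the hypothetical name `class_from_filename`; it assumes every filename ends in `_<index>.<extension>`, as in `kiwi_10.jpg`.

```python
def class_from_filename(filename):
    """Return the text before the last underscore, e.g. the class part of 'kiwi_10.jpg'."""
    # Unpacking raises ValueError if the filename has no underscore
    class_name, _index = filename.rsplit('_', 1)
    return class_name

for name in ('kiwi_10.jpg', 'Granny_Smith_2.jpg'):
    print(name, '->', class_from_filename(name))
```

Multi-word class names such as `Granny_Smith` are kept intact because only the last underscore is used as the split point.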

#### **Archive and download the dataset folder**
The archive contains the sampled dataset, which will be uploaded to the GitHub repository for use in the procedure.

In [None]:
!zip -r /content/sampled_dataset.zip /content/cs199specialproject/images/sampled_dataset

files.download("/content/sampled_dataset.zip")

  adding: content/cs199specialproject/images/sampled_dataset/ (stored 0%)
  adding: content/cs199specialproject/images/sampled_dataset/kiwi/ (stored 0%)
  adding: content/cs199specialproject/images/sampled_dataset/kiwi/kiwi_10.jpg (deflated 3%)
  adding: content/cs199specialproject/images/sampled_dataset/kiwi/kiwi_7.jpg (deflated 0%)
  adding: content/cs199specialproject/images/sampled_dataset/kiwi/kiwi_9.jpg (deflated 2%)
  adding: content/cs199specialproject/images/sampled_dataset/kiwi/kiwi_2.jpg (deflated 0%)
  adding: content/cs199specialproject/images/sampled_dataset/kiwi/kiwi_5.jpg (deflated 1%)
  adding: content/cs199specialproject/images/sampled_dataset/kiwi/kiwi_3.jpg (deflated 0%)
  adding: content/cs199specialproject/images/sampled_dataset/Granny_Smith/ (stored 0%)
  adding: content/cs199specialproject/images/sampled_dataset/Granny_Smith/Granny_Smith_2.jpg (deflated 7%)
  adding: content/cs199specialproject/images/sampled_dataset/pomegranate/ (stored 0%)
  adding: content/cs
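
Outside Colab, where the `!zip` shell command may not be available, the same folder can be archived with the standard library. The following is a minimal sketch using `shutil.make_archive`; the helper name `archive_folder` is illustrative, and the archive stores paths relative to the folder's parent rather than the `content/...` paths shown in the output above.

```python
import os
import shutil

def archive_folder(folder, archive_base):
    """Zip `folder` (including its own name) into archive_base + '.zip' and return the archive path."""
    folder = os.path.abspath(folder)
    return shutil.make_archive(
        archive_base,                       # output path without the .zip suffix
        'zip',
        root_dir=os.path.dirname(folder),   # paths in the archive are relative to this directory
        base_dir=os.path.basename(folder),  # only this folder is added
    )
```

For example, `archive_folder('/content/cs199specialproject/images/sampled_dataset', '/content/sampled_dataset')` would write `/content/sampled_dataset.zip`, assuming the folder exists.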
