# Notebook 01 - Data Collection

## Objectives

* Fetch data from Kaggle and save as raw data in organised folders
* Inspect the data and check for non-image files
* Split the data into Train, Test and Validation sets
* Save the split data

## Inputs

* Kaggle JSON file - authentication token
* Kaggle dataset

## Outputs

* Image files saved in separate folders for healthy leaves and those with powdery mildew, within each of train, test and validation sets

## Additional Comments

* By the end of this notebook, the dataset has no non-image files, and the files are split in the desired ratio between train, validation and test sets.
* Next, we will proceed to data visualisation.


---

# Import Packages

In [15]:
import os
import stat
import shutil
import random
import joblib

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so the working directory must be changed before running them in the editor.

We need to change the working directory from the notebook folder to its parent folder.
* `os.getcwd()` returns the current directory

In [2]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\franc\\cherry-leaves-mildew-detector\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\franc\\cherry-leaves-mildew-detector'
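
Re-running the `os.chdir` cell above moves up one more level each time. As a minimal sketch of one way to guard against that, the hypothetical helper below (not part of the original notebook) only moves up when the current folder is still the notebooks folder. The folder name `jupyter_notebooks` is taken from the path shown above; the function name is an assumption.

```python
import os


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Move to the parent directory only while still inside the notebooks folder.

    Hypothetical helper: calling it repeatedly does not keep climbing the tree.
    """
    current_dir = os.getcwd()
    if os.path.basename(current_dir) == notebook_folder:
        os.chdir(os.path.dirname(current_dir))
    return os.getcwd()
```

Calling `move_to_project_root()` a second time leaves the working directory unchanged, because the project root is no longer named `jupyter_notebooks`.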

---

# Fetch data from Kaggle

The dataset can now be fetched from Kaggle, where it is stored.

Firstly, the Kaggle package is installed to allow fetching of the data:

In [6]:
# install kaggle package
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.





The `kaggle.json` file is then added to the workspace to authenticate the request for the data from Kaggle.

This file does not appear in the public repository: it is linked to my personal Kaggle account and is therefore listed in the `.gitignore` file.

The following cell sets the Kaggle API config directory, builds the path to the `kaggle.json` file and then sets the permissions on that file.

In [9]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
kaggle_json_path = os.path.join(os.getcwd(), 'kaggle.json')
os.chmod(kaggle_json_path, stat.S_IREAD | stat.S_IWRITE)
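
If `kaggle.json` is missing, the `os.chmod` call above fails with a `FileNotFoundError` that does not explain what is expected. The sketch below wraps the same three steps in a hypothetical function, `configure_kaggle`, which checks for the token first. The function name and the error message are assumptions, not part of the original notebook. On POSIX systems `stat.S_IREAD | stat.S_IWRITE` corresponds to owner read/write (mode `0o600`); on Windows `os.chmod` only controls the read-only flag.

```python
import os
import stat


def configure_kaggle(config_dir):
    """Point the Kaggle API at config_dir and make kaggle.json owner read/write.

    Hypothetical helper: raises a descriptive error if the token is absent.
    """
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Expected a Kaggle API token at {token_path}")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    os.chmod(token_path, stat.S_IREAD | stat.S_IWRITE)
    return token_path
```

In this notebook it would be called as `configure_kaggle(os.getcwd())`.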

The dataset can now be downloaded:

In [10]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry_leaves_dataset"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/cherry_leaves_dataset





Finally, the downloaded archive is unzipped and deleted, and the `kaggle.json` file is removed.

In [11]:
for file in os.listdir(DestinationFolder):
    if file.endswith(".zip"):
        file_path = os.path.join(DestinationFolder, file)
        shutil.unpack_archive(file_path, DestinationFolder)
        os.remove(file_path)

os.remove("kaggle.json")
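
The unzip step can be tried without a Kaggle download by pointing it at any folder that contains a `.zip` file. The sketch below repeats the loop above as a hypothetical function, `unpack_zip_archives` (the name and the returned list are assumptions), so that a re-run on an already unpacked folder simply reports that nothing was handled.

```python
import os
import shutil


def unpack_zip_archives(folder):
    """Extract every .zip in folder into that folder, delete the archive,
    and return the archive names that were handled."""
    handled = []
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(".zip"):
            archive_path = os.path.join(folder, name)
            shutil.unpack_archive(archive_path, folder)
            os.remove(archive_path)  # the archive is no longer needed once extracted
            handled.append(name)
    return handled
```

A similar guard applies to the token: checking `os.path.exists("kaggle.json")` before `os.remove` avoids an error when the cell is run twice.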

---

# Data Preparation

Check for and remove any non-image files:

In [12]:
def remove_non_image_file(my_data_dir):
    """Delete every file that is not a .png, .jpg or .jpeg and report counts per folder."""
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        files = os.listdir(os.path.join(my_data_dir, folder))
        image_count = 0
        non_image_count = 0
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                file_location = os.path.join(my_data_dir, folder, given_file)
                os.remove(file_location)  # remove non-image file
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [14]:
remove_non_image_file(my_data_dir='inputs/cherry_leaves_dataset/cherry-leaves')

Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0


After this step, the dataset should not contain any non-image files.
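
Because the function above deletes files as it goes, it can be useful to look at what is in each folder before running it. The sketch below is a hypothetical dry run (the names `count_extensions` and `non_image_total` are assumptions, not part of the original notebook): it tallies file extensions per folder and deletes nothing.

```python
import os
from collections import Counter

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


def count_extensions(data_dir):
    """Return {folder: Counter of lowercase file extensions} without deleting anything."""
    summary = {}
    for folder in sorted(os.listdir(data_dir)):
        folder_path = os.path.join(data_dir, folder)
        if os.path.isdir(folder_path):
            summary[folder] = Counter(
                os.path.splitext(name)[1].lower() for name in os.listdir(folder_path)
            )
    return summary


def non_image_total(extension_counts):
    """Number of files whose extension is not in IMAGE_EXTENSIONS."""
    return sum(n for ext, n in extension_counts.items() if ext not in IMAGE_EXTENSIONS)
```

For example, `count_extensions('inputs/cherry_leaves_dataset/cherry-leaves')` would return one `Counter` per label folder, and `non_image_total` on each of them gives the number of files the removal step would delete.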

---

# Split Train, Test and Validation sets

The images are now split into train, test and validation sets and divided into respective folders, with separate folders within each of these for images of healthy leaves and images of leaves infected with powdery mildew.

In [16]:
def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
  
  # compare with a tolerance: floating-point sums of ratios are not always exactly 1.0
  if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
    print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
    return

  # gets classes' labels
  labels = os.listdir(my_data_dir) # gets the folder names
  if 'test' in labels:
    # the split folders already exist, so the data was split before; do nothing
    pass
  else: 
    # create train, validation and test folders, each with one sub-folder per class label
    for folder in ['train','validation','test']:
      for label in labels:
        os.makedirs(name=my_data_dir+ '/' + folder + '/' + label)

    for label in labels:

      files = os.listdir(my_data_dir + '/' + label)
      random.shuffle(files)

      train_set_files_qty = int(len(files) * train_set_ratio)
      validation_set_files_qty = int(len(files) * validation_set_ratio)

      count = 1
      for file_name in files:
        if count <= train_set_files_qty:
          # move given file to train set
          shutil.move(my_data_dir + '/' + label + '/' + file_name,
                      my_data_dir + '/train/' + label + '/' + file_name)
          

        elif count <= (train_set_files_qty + validation_set_files_qty ):
          # move given file to validation set
          shutil.move(my_data_dir + '/' + label + '/' + file_name,
                      my_data_dir + '/validation/' + label + '/' + file_name)

        else:
          # move given file to test set
          shutil.move(my_data_dir + '/' + label + '/' + file_name,
                  my_data_dir + '/test/' +label + '/'+ file_name)
          
        count += 1

      os.rmdir(my_data_dir + '/' + label)
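
The `random.shuffle` call in the function above is unseeded, so splitting a fresh download a second time assigns different files to each set. If a repeatable split is wanted, one option is sketched below, assuming a hypothetical helper `shuffled_copy` and an arbitrary seed of 42, neither of which is part of the original notebook. The names are sorted first because `os.listdir` does not guarantee an order.

```python
import random


def shuffled_copy(file_names, seed=42):
    """Return a shuffled copy of file_names; the same seed gives the same order.

    The seed value is an arbitrary, illustrative choice.
    """
    rng = random.Random(seed)       # private generator, leaves the global random state alone
    shuffled = sorted(file_names)   # fixed starting order, independent of os.listdir
    rng.shuffle(shuffled)
    return shuffled
```

Inside the split function, `files = shuffled_copy(os.listdir(...))` would then replace the `os.listdir` and `random.shuffle` pair.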

Conventionally, the dataset is divided such that:

* The training set is 70% of the data.
* The validation set is 10% of the data.
* The test set is 20% of the data.

In [18]:
split_train_validation_test_images(my_data_dir="inputs/cherry_leaves_dataset/cherry-leaves",
                        train_set_ratio = 0.7,
                        validation_set_ratio=0.1,
                        test_set_ratio=0.2
                        )

We check the number of image files in each folder:

In [19]:
sets = ['train', 'test', 'validation']
labels = ['healthy', 'powdery_mildew']
for set_name in sets:  # named set_name to avoid shadowing the built-in set
    for label in labels:
        number_of_files = len(os.listdir(f'inputs/cherry_leaves_dataset/cherry-leaves/{set_name}/{label}'))
        print(f'There are {number_of_files} images in {set_name}/{label}')

There are 1472 images in train/healthy
There are 1472 images in train/powdery_mildew
There are 422 images in test/healthy
There are 422 images in test/powdery_mildew
There are 210 images in validation/healthy
There are 210 images in validation/powdery_mildew


We can see that each set has an even distribution of images across both labels, healthy and powdery_mildew. The images are also split in the intended ratio: of the 2104 images per label, 1472 (about 70%) are in the train set, 210 (about 10%) in the validation set and 422 (about 20%) in the test set.
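
The counts above follow directly from how the split function rounds: the train and validation sizes are truncated with `int()`, and the test set receives whatever remains. A short sketch of that arithmetic, using a hypothetical helper `expected_split_counts` that is not part of the original notebook:

```python
def expected_split_counts(total, train_ratio, validation_ratio):
    """Mirror the split function's arithmetic for one label folder."""
    train = int(total * train_ratio)            # truncated, not rounded
    validation = int(total * validation_ratio)  # truncated, not rounded
    test = total - train - validation           # the remainder goes to the test set
    return train, validation, test


train, validation, test = expected_split_counts(2104, 0.7, 0.1)
print(train, validation, test)  # 1472 210 422
```

Because of the truncation, the test set ends up slightly above 20% (422 of 2104) rather than exactly 20%.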

---

# Conclusions and Next Steps

* The dataset has no non-image files, and the files are now split in the desired ratio between test, train and validation sets.
* Next, we will proceed to data visualisation.