# **Data Collection**

## Objectives

* Fetch data from Kaggle and save as raw data
* Check for non-image files
* Split into Test, Train and Validation sets

## Inputs

* Kaggle JSON file - authentication token

## Outputs

* Generate Dataset: inputs/cherry-leaf-dataset/cherry-leaves

## Additional Comments

* This notebook imports, cleans and prepares the data for the machine learning model.



---

# Change working directory

* The notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/cherry-leaf-mildew-detection-project/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/cherry-leaf-mildew-detection-project'
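
Re-running the `os.chdir` cell moves the working directory up one more level each time. A minimal sketch of a guarded version follows. It assumes the notebooks folder is named `jupyter_notebooks`, as in the output above, and the helper name `resolve_project_root` is hypothetical:

```python
import os
from pathlib import Path


def resolve_project_root(cwd, notebooks_dir="jupyter_notebooks"):
    """Return the parent of cwd only when cwd is the notebooks folder."""
    cwd = Path(cwd)
    # Step up one level only if we are still inside the notebooks folder;
    # otherwise assume the directory was already changed and keep it.
    return cwd.parent if cwd.name == notebooks_dir else cwd


if __name__ == "__main__":
    os.chdir(resolve_project_root(os.getcwd()))
    print(os.getcwd())
```

With this guard the cell can be run repeatedly without leaving the project folder.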

---

# Install Kaggle

Install Kaggle so the dataset can be imported.

In [4]:
# Install Kaggle Package
%pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Using cached kaggle-1.5.12.tar.gz (58 kB)
  Preparing metadata (setup.py) ... done
Collecting certifi
  Using cached certifi-2024.2.2-py3-none-any.whl (163 kB)
Collecting requests
  Using cached requests-2.31.0-py3-none-any.whl (62 kB)
Collecting tqdm
  Using cached tqdm-4.66.2-py3-none-any.whl (78 kB)
Collecting python-slugify
  Using cached python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Collecting urllib3
  Using cached urllib3-2.2.1-py3-none-any.whl (121 kB)
Collecting text-unidecode>=1.3
  Using cached text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Collecting idna<4,>=2.5
  Using cached idna-3.7-py3-none-any.whl (66 kB)
Collecting charset-normalizer<4,>=2
  Using cached charset_normalizer-3.3.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (141 kB)
Installing collected packages: text-unidecode, urllib3, tqdm, python-slugify, idna, charset-normalizer, certifi, requests, kaggle
  DEPRECATION: kaggle is being installed using the le

Note: you may need to restart the kernel to use updated packages.


Run the cell below to change the Kaggle configuration directory to the current working directory and set permissions for the Kaggle authentication JSON.

In [5]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
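
If kaggle.json is missing, the shell command above fails without saying where the file was expected. A sketch of the same step in pure Python, which checks for the token first, is given below. `prepare_kaggle_token` is a hypothetical helper, and it assumes the token file is named kaggle.json and sits in the chosen directory:

```python
import os
import stat


def prepare_kaggle_token(config_dir):
    """Point Kaggle at config_dir and restrict kaggle.json to its owner."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Expected Kaggle token at {token_path}")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Equivalent of `chmod 600`: read/write for the owner only
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

Calling `prepare_kaggle_token(os.getcwd())` would mirror the cell above on a POSIX system.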

Import the data from Kaggle

In [6]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry-leaf-dataset"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/cherry-leaf-dataset
 15%|█████▌                                | 8.00M/55.0M [00:00<00:02, 19.5MB/s]

100%|█████████████████████████████████████▉| 55.0M/55.0M [00:01<00:00, 39.0MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:01<00:00, 31.9MB/s]


In [7]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
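
Before extracting, it can help to confirm which folders the archive will create. A minimal sketch using only the standard `zipfile` module is shown here. The helper name `top_level_entries` is hypothetical, and nothing is assumed about the actual contents of cherry-leaves.zip:

```python
import zipfile


def top_level_entries(zip_path):
    """Return the sorted top-level names inside a zip archive."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # namelist() returns member paths that use '/' as the separator
        return sorted({name.split("/")[0] for name in zip_ref.namelist()})
```

It would need to be called on the archive before the extraction cell runs, since that cell deletes the zip file afterwards.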

---

# Data Preparation

Check the data for any files which are not images.

In [8]:
def remove_non_image_file(my_data_dir):
    """Delete every file in each label folder that lacks an image extension."""
    image_extension = ('.png', '.jpg', '.jpeg')
    for folder in os.listdir(my_data_dir):
        folder_path = os.path.join(my_data_dir, folder)
        image_count = 0
        non_image_count = 0
        for given_file in os.listdir(folder_path):
            if given_file.lower().endswith(image_extension):
                image_count += 1
            else:
                # remove non-image file
                os.remove(os.path.join(folder_path, given_file))
                non_image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

Remove any non-image files from the dataset.

In [9]:
remove_non_image_file(my_data_dir='inputs/cherry-leaf-dataset/cherry-leaves')

Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0
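
Because `remove_non_image_file` deletes files as it goes, a dry run can be useful first. The sketch below only reports which files would be removed. `find_non_image_files` is a hypothetical helper, and it assumes the same folder-per-label layout and the same three extensions used above:

```python
import os


def find_non_image_files(my_data_dir, image_extension=('.png', '.jpg', '.jpeg')):
    """Return paths of files without an image extension, deleting nothing."""
    non_images = []
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files at the top level
        for given_file in sorted(os.listdir(folder_path)):
            if not given_file.lower().endswith(image_extension):
                non_images.append(os.path.join(folder_path, given_file))
    return non_images
```

An empty list for this dataset would agree with the zero non-image counts printed above.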


---

# Split Data into Train, Test, Validation Sets

Split the data into train (70%), test (20%) and validation (10%) sets.

In [10]:
import math
import os
import shutil
import random


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):

    # compare with a tolerance: a floating-point sum can miss 1.0 by a rounding error
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # get the class labels
    labels = os.listdir(my_data_dir)  # it should get only the folder names
    if 'test' in labels:
        # the data has already been split
        return

    # create train, validation and test folders with class label sub-folders
    for folder in ['train', 'validation', 'test']:
        for label in labels:
            os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

    for label in labels:

        files = os.listdir(my_data_dir + '/' + label)
        random.shuffle(files)

        train_set_files_qty = int(len(files) * train_set_ratio)
        validation_set_files_qty = int(len(files) * validation_set_ratio)

        count = 1
        for file_name in files:
            if count <= train_set_files_qty:
                # move a given file to the train set
                shutil.move(my_data_dir + '/' + label + '/' + file_name,
                            my_data_dir + '/train/' + label + '/' + file_name)

            elif count <= (train_set_files_qty + validation_set_files_qty):
                # move a given file to the validation set
                shutil.move(my_data_dir + '/' + label + '/' + file_name,
                            my_data_dir + '/validation/' + label + '/' + file_name)

            else:
                # move a given file to the test set
                shutil.move(my_data_dir + '/' + label + '/' + file_name,
                            my_data_dir + '/test/' + label + '/' + file_name)

            count += 1

        # the original label folder is empty once every file has been moved
        os.rmdir(my_data_dir + '/' + label)

In [None]:
split_train_validation_test_images(my_data_dir="inputs/cherry-leaf-dataset/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
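
The number of files that end up in each set follows from the truncating `int()` calls in the split function. The sketch below mirrors that arithmetic for a single label. `split_sizes` is a hypothetical helper, and the count of 2104 images per label is taken from the output of the cleaning step:

```python
def split_sizes(n_files, train_set_ratio, validation_set_ratio):
    """Mirror the counting logic of the split function for one label."""
    train_qty = int(n_files * train_set_ratio)
    validation_qty = int(n_files * validation_set_ratio)
    # Everything not assigned to train or validation falls through to test
    test_qty = n_files - train_qty - validation_qty
    return train_qty, validation_qty, test_qty


print(split_sizes(2104, 0.7, 0.1))  # (1472, 210, 422)
```

Because the train and validation counts are rounded down, the remainder lands in the test set, which is therefore slightly larger than a strict 20% share.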

---

# Push files to Repo

* If you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
#import os
#try:
#    # create here your folder
#    # os.makedirs(name='')
#except Exception as e:
#    print(e)
