# **Data Collection**

## Introduction

This notebook is designed to collect and prepare a dataset from Kaggle, ensuring that it is ready for further analysis or model training. The dataset we are working with is sourced from Kaggle and is titled **"Cherry Leaves - Healthy or Powdery Mildew"**.

### Dataset Description:
- **Dataset Name**: Cherry Leaves - Healthy or Powdery Mildew
- **Source**: [Kaggle Dataset](https://www.kaggle.com/datasets/codeinstitute/cherry-leaves)
- **Description**: This dataset contains images of cherry leaves, categorized into two classes: 
  1. **Healthy**: Images of cherry leaves without any disease.
  2. **Powdery Mildew**: Images of cherry leaves affected by powdery mildew, a fungal disease.
- **Purpose**: The dataset can be used for training models to detect and classify the health status of cherry leaves, which is particularly useful in agricultural and plant disease research.

### Objectives

This notebook will perform the following tasks:
1. **Install Necessary Packages**:
   - Install the Kaggle package to enable downloading datasets directly from Kaggle.
   - Set up the environment to use the Kaggle API securely.

2. **Download the Dataset**:
   - Authenticate with Kaggle using a JSON file containing your API credentials.
   - Download the specified dataset (`codeinstitute/cherry-leaves`) and save it to a designated directory.

3. **Prepare the Dataset**:
   - **Unzip the Dataset**: Extract the contents of the downloaded zip file and remove the zip file to save space.
   - **Data Cleaning**: Ensure that the dataset only contains valid image files by removing any non-image files.
   - **Split the Dataset**: Divide the dataset into training, validation, and test sets with specified ratios (e.g., 70% train, 10% validation, 20% test).

4. **Verification**:
   - After the dataset is prepared and split, verify that the files are correctly categorized into the appropriate directories.

## Inputs

- **Kaggle JSON file**: A file containing your authentication token, necessary for accessing Kaggle's API.

## Outputs

- **Dataset Directory**: A structured directory containing the dataset split into train, validation, and test subsets (`inputs/cherry-leaves_dataset`).

## Additional Comments

* The client has supplied the data under a non-disclosure agreement (NDA), meaning it can only be shared with professionals directly involved in the project. The dataset pertains to binary image classification, specifically differentiating between healthy cherry leaves and those affected by powdery mildew.


---

# Import packages


In [1]:
%pip install -r /workspace/rare-and-sweet//requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [2]:
# Import necessary libraries for data manipulation and operating system interaction

import numpy
import os

# Change working directory

In [3]:
# Store the current working directory

import os
current_dir = os.getcwd()
current_dir

'/workspace/rare-and-sweet/jupyter_notebooks'

In [4]:
# Change the working directory to the parent directory and confirm the change

os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [5]:
# Confirm the new working directory after the change

current_dir = os.getcwd()
current_dir

'/workspace/rare-and-sweet'

---

# Section 1

## Install Kaggle

### Annotation:

Before running this cell, it is crucial to ensure that the Kaggle JSON file, which contains your API credentials, is correctly set up in the working directory. This JSON file should be named `kaggle.json` and must contain your Kaggle username and API key. 

#### Important Steps:
1. **Obtain the JSON File**: 
   - You can download your `kaggle.json` file from the "Account" section of your Kaggle profile under the "API" tab.

2. **Place the JSON File**:
   - Make sure the `kaggle.json` file is placed in the current working directory of your notebook.

3. **Verify Permissions**:
   - The cell will change the permissions of the JSON file to ensure it is secure (readable only by the user). This is an essential step to protect your credentials.

If the JSON file is not correctly set up, the Kaggle API will not be able to authenticate, and any attempts to download datasets will fail.


In [6]:
# install kaggle package
%pip install kaggle

Note: you may need to restart the kernel to use updated packages.


Run the cell below **to change kaggle configuration directory to current working directory and permission of kaggle authentication json**

In [7]:
# Set Kaggle configuration directory to current working directory and secure the Kaggle JSON file by setting appropriate permissions

os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

---

# Section 2

## Set Kaggle Dataset and Download it

In [8]:
# Download the specified dataset from Kaggle and store it in the designated folder

KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry-leaves_dataset"

!kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}


Dataset URL: https://www.kaggle.com/datasets/codeinstitute/cherry-leaves
License(s): unknown
Downloading cherry-leaves.zip to inputs/cherry-leaves_dataset
100%|█████████████████████████████████████▉| 55.0M/55.0M [00:02<00:00, 23.2MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:02<00:00, 21.3MB/s]


## Unzip the downloaded file, delete the zip file

In [9]:
# Extract the contents of the downloaded zip file and remove the zip file afterward to save space

import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')

---

# Section 3

## Data Preparation

### Data Cleaning

### Check and remove non images files

In [10]:
# Function to remove non-image files from the dataset directory
# It counts and deletes files that do not have a valid image extension

def remove_non_image_file(my_data_dir):
    image_extensions = ('.png', '.jpg', '.jpeg')
    
    # List all items in the root directory
    items = os.listdir(my_data_dir)
    
    for item in items:
        folder_path = os.path.join(my_data_dir, item)
        
        # Ensure the item is a directory
        if os.path.isdir(folder_path):
            files = os.listdir(folder_path)
            
            image_count = 0
            non_image_count = 0
            
            for given_file in files:
                file_path = os.path.join(folder_path, given_file)
                
                # Check if the file has a valid image extension
                if not given_file.lower().endswith(image_extensions):
                    try:
                        os.remove(file_path)  # Remove non-image file
                        non_image_count += 1
                    except Exception as e:
                        print(f"Error removing file {file_path}: {e}")
                else:
                    image_count += 1
            
            print(f"Folder: {item} - contains {image_count} image file(s)")
            print(f"Folder: {item} - contains {non_image_count} non-image file(s)")

In [11]:
my_data_dir = '/workspace/rare-and-sweet/inputs/cherry-leaves_dataset/cherry-leaves'

# Run the function to remove non-image files
remove_non_image_file(my_data_dir)

Error removing file /workspace/rare-and-sweet/inputs/cherry-leaves_dataset/cherry-leaves/test/healthy: [Errno 21] Is a directory: '/workspace/rare-and-sweet/inputs/cherry-leaves_dataset/cherry-leaves/test/healthy'
Error removing file /workspace/rare-and-sweet/inputs/cherry-leaves_dataset/cherry-leaves/test/powdery_mildew: [Errno 21] Is a directory: '/workspace/rare-and-sweet/inputs/cherry-leaves_dataset/cherry-leaves/test/powdery_mildew'
Folder: test - contains 0 image file(s)
Folder: test - contains 0 non-image file(s)
Error removing file /workspace/rare-and-sweet/inputs/cherry-leaves_dataset/cherry-leaves/train/healthy: [Errno 21] Is a directory: '/workspace/rare-and-sweet/inputs/cherry-leaves_dataset/cherry-leaves/train/healthy'
Error removing file /workspace/rare-and-sweet/inputs/cherry-leaves_dataset/cherry-leaves/train/powdery_mildew: [Errno 21] Is a directory: '/workspace/rare-and-sweet/inputs/cherry-leaves_dataset/cherry-leaves/train/powdery_mildew'
Folder: train - contains 0 i

### Split train validation test set

In [12]:
import os
import shutil
import random
import joblib

def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
  
    if train_set_ratio + validation_set_ratio + test_set_ratio != 1.0:
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # gets classes labels
    labels = os.listdir(my_data_dir)
    
    if 'test' in labels:
        pass
    else: 
        # create train, test folders with class labels sub-folder
        for folder in ['train','validation','test']:
            for label in labels:
                os.makedirs(os.path.join(my_data_dir, folder, label), exist_ok=True)

        for label in labels:
            label_dir = os.path.join(my_data_dir, label)
            files = os.listdir(label_dir)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                src_path = os.path.join(label_dir, file_name)

                if count <= train_set_files_qty:
                    # move given file to train set
                    dst_path = os.path.join(my_data_dir, 'train', label, file_name)
                    shutil.move(src_path, dst_path)
                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move given file to validation set
                    dst_path = os.path.join(my_data_dir, 'validation', label, file_name)
                    shutil.move(src_path, dst_path)
                else:
                    # move given file to test set
                    dst_path = os.path.join(my_data_dir, 'test', label, file_name)
                    shutil.move(src_path, dst_path)
                    
                count += 1

            os.rmdir(label_dir)

    print("Dataset successfully split into train, validation, and test sets.")

In [13]:
# Split the dataset into train, validation, and test sets with the specified ratios

split_train_validation_test_images(my_data_dir = "/workspace/rare-and-sweet/inputs/cherry-leaves_dataset/cherry-leaves",
                                   train_set_ratio = 0.7,
                                   validation_set_ratio = 0.1,
                                   test_set_ratio = 0.2)

Dataset successfully split into train, validation, and test sets.


---