# **01_Data Collection**

## Business Requirements

* The client is interested in conducting a study to visually differentiate a cherry leaf that is healthy from one that contains powdery mildew.
* The client is interested in predicting if a cherry tree is healthy or contains powdery mildew.

## Objectives

* Install the requirements, fetch the data from Kaggle via its API using the kaggle.json file, and save the raw data from [kaggle](https://www.kaggle.com/codeinstitute/cherry-leaves).

## Inputs

* kaggle.json file as the authentication token
* Existing requirements.txt file

## Outputs

* Installed libraries to use for this project
* Generate raw dataset: inputs/dataset/cherry_leaves_dataset
* Prepare dataset: Cleaning & Splitting



---

# Install packages

In [1]:
%pip install -r /workspaces/PP5-mildew-detection/requirements.txt

Collecting numpy==1.26.1 (from -r /workspaces/PP5-mildew-detection/requirements.txt (line 1))
  Downloading numpy-1.26.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting pandas==2.1.1 (from -r /workspaces/PP5-mildew-detection/requirements.txt (line 2))
  Downloading pandas-2.1.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting matplotlib==3.8.0 (from -r /workspaces/PP5-mildew-detection/requirements.txt (line 3))
  Downloading matplotlib-3.8.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.8 kB)
Collecting seaborn==0.13.2 (from -r /workspaces/PP5-mildew-detection/requirements.txt (line 4))
  Downloading seaborn-0.13.2-py3-none-any.whl.metadata (5.4 kB)
Collecting plotly==5.17.0 (from -r /workspaces/PP5-mildew-detection/requirements.txt (line 5))
  Downloading plotly-5.17.0-py2.py3-none-any.whl.metadata (7.0 kB)
Collecting Pillow==10.0.1 (from -r /workspaces/PP5-mildew-detection/requiremen

# Change working directory

Change the working directory from its current folder to its parent folder
> access the current directory with ```os.getcwd()```

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/PP5-mildew-detection/jupyter_notebooks'

Make the parent of the current directory the new current directory
> ```os.path.dirname()``` gets the parent directory; ```os.chdir()``` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/PP5-mildew-detection'

# Install Kaggle

In [4]:
# install kaggle package
%pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
  Preparing metadata (setup.py) ... done
Collecting tqdm (from kaggle==1.5.12)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting python-slugify (from kaggle==1.5.12)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle==1.5.12)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)
Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Downloading tqdm-4.67.1-py3-none-any.whl (78 kB)
Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73026 sha256=f792e86911fe4a5fc464cdd0ce5b0bf6aa9c4fe2bf0da2fed2b6b11a45a1e030
  Stored in directory: /home/cistudent/.cache/pip/wheels/f5/69/4d/d701fc604b9fb09be59718b4056f

Change the Kaggle configuration directory to the current working directory and set permissions for the Kaggle authentication JSON.

In [5]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
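
Before calling the Kaggle CLI, it can help to confirm that the token file is in place and readable only by its owner. The following is a minimal sketch, not part of the original walkthrough code: the helper name `check_kaggle_token` is hypothetical, and it assumes the token is a JSON file with `username` and `key` entries.

```python
import json
import os
import stat


def check_kaggle_token(config_dir):
    """Validate kaggle.json in config_dir and restrict it to owner read/write.

    Hypothetical helper; assumes the token holds "username" and "key" entries.
    """
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"No kaggle.json found in {config_dir}")

    with open(token_path) as token_file:
        token = json.load(token_file)

    missing = [field for field in ("username", "key") if field not in token]
    if missing:
        raise ValueError(f"kaggle.json is missing: {', '.join(missing)}")

    # Same effect as `chmod 600 kaggle.json`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

Calling `check_kaggle_token(os.getcwd())` after setting `KAGGLE_CONFIG_DIR` would raise early if the token is missing or incomplete, rather than failing later inside the download step.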

Set the [kaggle](https://www.kaggle.com/codeinstitute/cherry-leaves) dataset and download it.

In [6]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/dataset/cherry_leaves_dataset"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/dataset/cherry_leaves_dataset
 96%|████████████████████████████████████▌ | 53.0M/55.0M [00:02<00:00, 30.1MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:02<00:00, 21.9MB/s]


Unzip the downloaded file, and delete the zip file.

In [7]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')

---

## Data Preparation

### Data Cleaning

Check and remove non-image files

In [8]:
# code for Data Preparation and its functions taken from Walkthrough Project 01 Malaria Detector

def remove_non_image_file(my_data_dir):
    """Delete every file that is not a .png/.jpg/.jpeg image and report the counts per folder."""
    image_extension = ('.png', '.jpg', '.jpeg')
    for folder in os.listdir(my_data_dir):
        folder_path = os.path.join(my_data_dir, folder)
        image_count = 0
        non_image_count = 0
        for given_file in os.listdir(folder_path):
            if given_file.lower().endswith(image_extension):
                image_count += 1
            else:
                # remove non-image file
                os.remove(os.path.join(folder_path, given_file))
                non_image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [9]:
remove_non_image_file(my_data_dir='inputs/dataset/cherry_leaves_dataset/cherry-leaves')

Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0


All files were already images; any non-image file would have been removed. The dataset is balanced, with **2104** healthy and **2104** powdery_mildew images.
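
The class balance can also be checked programmatically without deleting anything. Below is a small sketch under the assumption that the dataset folder contains one sub-folder per class; the helper name `count_images_per_label` is illustrative and not taken from the walkthrough.

```python
import os


def count_images_per_label(data_dir, extensions=(".png", ".jpg", ".jpeg")):
    """Return {label: image count} for each class sub-folder of data_dir."""
    counts = {}
    for label in sorted(os.listdir(data_dir)):
        label_dir = os.path.join(data_dir, label)
        if not os.path.isdir(label_dir):
            continue  # skip stray files at the top level
        counts[label] = sum(
            1 for name in os.listdir(label_dir)
            if name.lower().endswith(extensions)
        )
    return counts
```

For example, `counts = count_images_per_label('inputs/dataset/cherry_leaves_dataset/cherry-leaves')` followed by `len(set(counts.values())) == 1` would indicate whether every class holds the same number of images.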

---

> Possibly add a step here to resize the images, in case deploying them causes problems later on

---

### Split into train, validation and test sets

In [10]:
import math
import shutil
import random
import joblib

# code for Data Preparation and its functions taken from Walkthrough Project 01 Malaria Detector 
def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):

    # compare with a tolerance so floating-point rounding in the sum is not rejected
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # get the class labels; my_data_dir should contain only the class folders
    labels = os.listdir(my_data_dir)
    if 'test' in labels:
        # the dataset has already been split, so do nothing
        pass
    else:
        # create train, validation and test folders, each with class-label sub-folders
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # move a given file to the train set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/train/' + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move a given file to the validation set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/' + file_name)

                else:
                    # move given file to test set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/' + file_name)

                count += 1

            os.rmdir(my_data_dir + '/' + label)
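
`random.shuffle` produces a different order on every run, so the files that land in each set change between runs. If a reproducible split is wanted, one option is to shuffle with a dedicated seeded generator. This is an assumption about what might be useful here, not something the walkthrough code does; the helper name `shuffled_copy` and the seed value are illustrative.

```python
import random


def shuffled_copy(items, seed=42):
    """Return a shuffled copy of items; the same seed gives the same order."""
    rng = random.Random(seed)  # independent generator, leaves global state alone
    # sort first so the result does not depend on the os.listdir() ordering
    result = sorted(items)
    rng.shuffle(result)
    return result
```

Inside the split function, `files = shuffled_copy(os.listdir(...))` could stand in for the `os.listdir` plus `random.shuffle` pair.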

Split the dataset using the following ratios: 70% train set, 10% validation set and 20% test set

In [11]:
split_train_validation_test_images(my_data_dir="inputs/dataset/cherry_leaves_dataset/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
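
Because the function truncates the train and validation counts with `int()` and moves every remaining file to the test set, the resulting set sizes can be worked out in advance. The sketch below mirrors that arithmetic for the 2104 images per class reported during cleaning; the helper name `expected_split_sizes` is hypothetical.

```python
def expected_split_sizes(n_files, train_set_ratio, validation_set_ratio):
    """Mirror the split arithmetic: truncate train/validation, rest goes to test."""
    train_qty = int(n_files * train_set_ratio)
    validation_qty = int(n_files * validation_set_ratio)
    test_qty = n_files - train_qty - validation_qty
    return train_qty, validation_qty, test_qty


print(expected_split_sizes(2104, 0.7, 0.1))  # (1472, 210, 422)
```

Each class should therefore end up with 1472 train, 210 validation and 422 test images, which can be compared against the folder contents after the split.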

---

# Outputs fulfilled

* All packages and libraries have been installed
* Input datasets have been generated
* Dataset has been cleaned
* Dataset has been split