# **Data Collection**

---

## Objectives

* Fetch data from Kaggle and prepare it for further processing.
  - Clean data
  - Split data

## Inputs

* Kaggle JSON file (`kaggle.json`): the authentication token.

## Outputs

* Generated dataset: `inputs/cherry_leaves_dataset/cherry-leaves`


## Comments | Insights | Conclusions

* These steps are necessary to fetch the data, clean it, and divide it into subsets for the purposes of machine learning. 

* The next step will be Data Visualization to understand the data and discover patterns.


---

In [1]:
! pip install -r /workspace/Portfolio_5_Cherry_Leaves_Mildew/requirements.txt



In [2]:
import numpy as np
import os

### Change working directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/Portfolio_5_Cherry_Leaves_Mildew/jupyter_notebooks'

In [4]:
os.chdir('/workspace/Portfolio_5_Cherry_Leaves_Mildew')
print("You set a new current directory")

You set a new current directory


In [5]:
current_dir = os.getcwd()
current_dir

'/workspace/Portfolio_5_Cherry_Leaves_Mildew'

In [6]:
# install kaggle package
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.


Run the cell below **to set the Kaggle configuration directory to the current working directory and to restrict the permissions of the Kaggle authentication JSON file**.

In [7]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
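
The `chmod 600` step above relies on a shell command. As a minimal sketch, assuming a POSIX file system, the same permission change could be made from Python, with a check that the token file exists first. The helper name `secure_kaggle_token` is hypothetical and is not part of the notebook or of the Kaggle package.

```python
import os
import stat


def secure_kaggle_token(token_path="kaggle.json"):
    """Restrict the token file to owner read/write (equivalent to chmod 600).

    Hypothetical helper for illustration; returns the resulting permission bits.
    """
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Kaggle token not found: {token_path}")
    # S_IRUSR | S_IWUSR == 0o600: owner may read and write, nobody else
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(token_path).st_mode)
```

Calling `secure_kaggle_token()` from the project root would raise a clear error when the token is missing, rather than letting the later download step fail.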

The cell below checks whether `DestinationFolder` exists and removes it if it does, creates a new folder, downloads the dataset, unzips it, and deletes the zip file. This ensures that any existing data is removed before the new dataset is downloaded.

In [8]:
import os
import shutil
import zipfile

# Define your Kaggle dataset path and destination folder
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry_leaves_dataset"

# Check if the data folder already exists
if os.path.exists(DestinationFolder):
    print("Data folder already exists. Removing existing data...")
    # Remove existing data folder and its contents
    shutil.rmtree(DestinationFolder)

# Create the destination folder
os.makedirs(DestinationFolder)

# Download the dataset
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

# Unzip the downloaded file and delete the zip file
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')


Downloading cherry-leaves.zip to inputs/cherry_leaves_dataset
 93%|███████████████████████████████████▏  | 51.0M/55.0M [00:01<00:00, 28.8MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:02<00:00, 27.9MB/s]


---

## Data Preparation

---

### Data cleaning

Check for and remove non-image files.

In [9]:
def remove_non_image_file(my_data_dir):
    """Delete files without an image extension and report counts per folder."""
    image_extension = ('.png', '.jpg', '.jpeg')
    for folder in os.listdir(my_data_dir):
        folder_path = os.path.join(my_data_dir, folder)
        image_count = 0
        non_image_count = 0
        for given_file in os.listdir(folder_path):
            if given_file.lower().endswith(image_extension):
                image_count += 1
            else:
                # remove non-image file
                os.remove(os.path.join(folder_path, given_file))
                non_image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)


In [10]:
remove_non_image_file(my_data_dir='inputs/cherry_leaves_dataset/cherry-leaves')


Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0
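
The cleaning step decides what counts as an image purely from the file name. As a small illustration, the sketch below isolates that test; the helper `is_image_file` and the sample file names are hypothetical. `str.endswith` accepts a tuple of suffixes, and lowercasing first makes the check case-insensitive.

```python
IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def is_image_file(file_name):
    """Return True if the file name ends with a known image extension."""
    return file_name.lower().endswith(IMAGE_EXTENSIONS)


# Hypothetical file names, for illustration only
for name in ['leaf_01.JPG', 'leaf_02.png', 'Thumbs.db', 'notes.txt']:
    print(name, is_image_file(name))
```

Note that this inspects only the extension; it does not verify that the file contents are a valid image.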


Split the data into train, validation, and test sets.

In [11]:
import os
import shutil
import random

def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
  
  # compare with a tolerance: a float sum such as 0.7 + 0.2 + 0.1 is not exactly 1.0
  if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
    print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
    return

  # get the class labels (the data directory should contain only class folders)
  labels = os.listdir(my_data_dir)
  if 'test' in labels:
    # the data has already been split, so there is nothing to do
    pass
  else:
    # create train, validation and test folders, each with class label sub-folders
    for folder in ['train','validation','test']:
      for label in labels:
        os.makedirs(name=my_data_dir+ '/' + folder + '/' + label)

    for label in labels:

      files = os.listdir(my_data_dir + '/' + label)
      random.shuffle(files)

      train_set_files_qty = int(len(files) * train_set_ratio)
      validation_set_files_qty = int(len(files) * validation_set_ratio)

      count = 1
      for file_name in files:
        if count <= train_set_files_qty:
          # move given file to train set
          shutil.move(my_data_dir + '/' + label + '/' + file_name,
                      my_data_dir + '/train/' + label + '/' + file_name)
          

        elif count <= (train_set_files_qty + validation_set_files_qty ):
          # move given file to validation set
          shutil.move(my_data_dir + '/' + label + '/' + file_name,
                      my_data_dir + '/validation/' + label + '/' + file_name)

        else:
          # move given file to test set
          shutil.move(my_data_dir + '/' + label + '/' + file_name,
                  my_data_dir + '/test/' +label + '/'+ file_name)
          
        count += 1

      os.rmdir(my_data_dir + '/' + label)
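
The ratio check at the top of the function sums three floats. As an illustrative sketch (not part of the notebook), such a sum can miss 1.0 by a rounding error depending on the values and the order of addition, which is why a tolerance-based comparison such as `math.isclose` is the safer test:

```python
import math

ratios = (0.7, 0.2, 0.1)
total = ratios[0] + ratios[1] + ratios[2]

print(total == 1.0)              # False: the sum is slightly below 1.0
print(math.isclose(total, 1.0))  # True: equal within floating-point tolerance
```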
    

Conventionally, the data is split as follows:

* The training set contains 70% of the data.
* The validation set contains 10% of the data.
* The test set contains 20% of the data.
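
To make the ratios concrete, the sketch below reproduces the count arithmetic used by `split_train_validation_test_images`: the train and validation sizes are truncated with `int()`, and the test set receives whatever remains. The helper `split_counts` is hypothetical; the figure of 2104 images per class comes from the cleaning output above.

```python
def split_counts(n_files, train_ratio, validation_ratio):
    """Return (train, validation, test) file counts for one class folder."""
    train = int(n_files * train_ratio)            # truncated, as in the notebook
    validation = int(n_files * validation_ratio)  # truncated, as in the notebook
    test = n_files - train - validation           # remainder goes to the test set
    return train, validation, test


print(split_counts(2104, 0.70, 0.10))  # (1472, 210, 422)
```

Because of the truncation, the test set ends up slightly larger than an exact 20% share.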

In [12]:
split_train_validation_test_images(my_data_dir="inputs/cherry_leaves_dataset/cherry-leaves",
                                   train_set_ratio=0.70,
                                   validation_set_ratio=0.10,
                                   test_set_ratio=0.20)
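
The split cell produces no output, so a quick check of the resulting folders can be useful. The following is a minimal sketch, assuming the `train`, `validation` and `test` folder layout created above; the helper `count_split_files` is hypothetical.

```python
import os


def count_split_files(data_dir, splits=('train', 'validation', 'test')):
    """Count the files in each class folder of each split that exists."""
    counts = {}
    for split in splits:
        split_dir = os.path.join(data_dir, split)
        if not os.path.isdir(split_dir):
            continue  # skip splits that have not been created
        counts[split] = {
            label: len(os.listdir(os.path.join(split_dir, label)))
            for label in sorted(os.listdir(split_dir))
        }
    return counts
```

For example, `count_split_files('inputs/cherry_leaves_dataset/cherry-leaves')` would return a nested dictionary of per-class counts for each split.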

---

## Conclusions

---

The data has been downloaded from Kaggle, cleaned, and organized into separate train, validation, and test folders, ready for further processing.