# **DATA COLLECTION**

## Objectives

* Fetch data from Kaggle and prepare it for further processing

## Inputs

* Kaggle JSON file (`kaggle.json`) containing the authentication token

## Outputs

* Generate dataset: inputs/cherry-leaves_dataset/cherry-leaves (from the Kaggle dataset codeinstitute/cherry-leaves)

## Additional Comments

* No comments 



---

# Import packages


In [5]:
import numpy
import os

# Change working directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/rare-and-sweet/jupyter_notebooks'

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/rare-and-sweet'
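Re-running the `os.chdir` cell above moves up one more level each time. A minimal sketch of a guarded version follows; it assumes the notebooks live in a folder named `jupyter_notebooks`, as in the path shown above, and the helper name `move_to_project_root` is hypothetical.

```python
import os
from pathlib import Path


def move_to_project_root(notebooks_dir_name="jupyter_notebooks"):
    """Step up one level only when the current directory is the notebooks folder.

    The folder name is an assumption taken from the path printed above.
    """
    cwd = Path.cwd()
    if cwd.name == notebooks_dir_name:
        os.chdir(cwd.parent)
    return Path.cwd()


# Safe to call repeatedly: a second call leaves the directory unchanged.
print(move_to_project_root())
```

Calling the helper a second time is a no-op, so re-executing the cell does not walk further up the directory tree.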

---

# Section 1

## Install Kaggle

In [4]:
# install kaggle package
%pip install kaggle



Collecting kaggle
  Downloading kaggle-1.6.17.tar.gz (82 kB)
  Preparing metadata (setup.py) ... done
Collecting tqdm (from kaggle)
  Downloading tqdm-4.66.5-py3-none-any.whl.metadata (57 kB)
Collecting python-slugify (from kaggle)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)
Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Downloading tqdm-4.66.5-py3-none-any.whl (78 kB)
Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.6.17-py3-none-any.whl size=105786 sha256=100aeebb4e94fbfd0caf9d9af3da69223fdc0afcd025fb3521746e1b32944e6c
  Stored in directory: /home/gitpod/.cache/pip/wheels/a5/6f/7b/837915771e94e181fa3052822926444e34f725ca38e70be77e
Successfull

Run the cell below **to set the Kaggle configuration directory to the current working directory and restrict the permissions of the Kaggle authentication JSON file**

In [8]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

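The configuration step above can also be done in plain Python, which fails early when the token is missing. This is a sketch only: the helper name `configure_kaggle` is hypothetical, and it assumes a POSIX file system where `chmod 600` is meaningful.

```python
import os
import stat
from pathlib import Path


def configure_kaggle(config_dir="."):
    """Point the Kaggle CLI at config_dir and make kaggle.json owner-only."""
    token = Path(config_dir).resolve() / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"Expected the Kaggle token at {token}")
    os.environ["KAGGLE_CONFIG_DIR"] = str(token.parent)
    # Equivalent of `chmod 600`: read and write for the owner only.
    token.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return token


# Example call (commented out so the sketch runs without a token present):
# configure_kaggle(os.getcwd())
```

Raising `FileNotFoundError` before the download step makes a missing `kaggle.json` easier to diagnose than a later authentication error from the CLI.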


---

# Section 2

## Set Kaggle Dataset and Download it

In [10]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry-leaves_dataset"

!kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}




Dataset URL: https://www.kaggle.com/datasets/codeinstitute/cherry-leaves
License(s): unknown
Downloading cherry-leaves.zip to inputs/cherry-leaves_dataset
 95%|███████████████████████████████████▉  | 52.0M/55.0M [00:02<00:00, 25.0MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:02<00:00, 20.8MB/s]


## Unzip the downloaded file, delete the zip file

In [11]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
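The unzip-and-delete step above can be wrapped in a small helper that checks the archive before extracting it. This is a sketch under assumptions: the name `extract_and_remove` is hypothetical and is not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive."""
    zip_path = Path(zip_path)
    # is_zipfile returns False for a missing or corrupted archive.
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is missing or is not a valid zip file")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(destination)
    # Remove the archive only after extraction finished without raising.
    zip_path.unlink()


# Example call, using the names from the cell above:
# extract_and_remove(DestinationFolder + "/cherry-leaves.zip", DestinationFolder)
```

Because the archive is deleted only after `extractall` returns, an interrupted extraction leaves the zip file in place for a retry.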

---

# Section 3

## Data Preparation

### Data Cleaning

### Check and remove non-image files

In [11]:
def remove_non_image_file(my_data_dir):
    image_extensions = ('.png', '.jpg', '.jpeg')
    
    # List all items in the root directory
    items = os.listdir(my_data_dir)
    
    for item in items:
        folder_path = os.path.join(my_data_dir, item)
        
        # Ensure the item is a directory
        if os.path.isdir(folder_path):
            files = os.listdir(folder_path)
            
            image_count = 0
            non_image_count = 0
            
            for given_file in files:
                file_path = os.path.join(folder_path, given_file)
                
                # Check if the file has a valid image extension
                if not given_file.lower().endswith(image_extensions):
                    try:
                        os.remove(file_path)  # Remove non-image file
                        non_image_count += 1
                    except Exception as e:
                        print(f"Error removing file {file_path}: {e}")
                else:
                    image_count += 1
            
            print(f"Folder: {item} - contains {image_count} image file(s)")
            print(f"Folder: {item} - contains {non_image_count} non-image file(s)")

In [12]:
my_data_dir = '/workspace/rare-and-sweet/inputs/cherry-leaves_dataset/cherry-leaves'

# Run the function to remove non-image files
remove_non_image_file(my_data_dir)

Folder: healthy - contains 2104 image file(s)
Folder: healthy - contains 0 non-image file(s)
Folder: powdery_mildew - contains 2104 image file(s)
Folder: powdery_mildew - contains 0 non-image file(s)
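The check above relies on file extensions only, so a renamed or corrupted file would still count as an image. As a complementary sketch, the file header can be compared against the standard PNG and JPEG signatures. The helper names `looks_like_image` and `find_mislabeled_files` are hypothetical, and the check only reads the first bytes, so it does not prove that a file decodes correctly.

```python
from pathlib import Path

# Standard file signatures ("magic bytes") for the two formats.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def looks_like_image(file_path):
    """Return True if the file starts with a PNG or JPEG signature."""
    with open(file_path, "rb") as f:
        header = f.read(8)
    return header.startswith(PNG_SIGNATURE) or header.startswith(JPEG_SIGNATURE)


def find_mislabeled_files(my_data_dir):
    """List files with an image extension whose header is not PNG or JPEG."""
    suspects = []
    for path in sorted(Path(my_data_dir).rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            if not looks_like_image(path):
                suspects.append(path)
    return suspects
```

Calling `find_mislabeled_files(my_data_dir)` returns a list of suspicious paths for manual review and deletes nothing.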


### Split into train, validation, and test sets

In [13]:
import os
import shutil
import random
import math
import joblib

def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
  
    # compare with a tolerance: a sum of floating-point ratios may not equal 1.0 exactly
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # gets classes labels
    labels = os.listdir(my_data_dir)
    
    if 'test' in labels:
        pass
    else: 
        # create train, validation, test folders with class labels sub-folder
        for folder in ['train','validation','test']:
            for label in labels:
                os.makedirs(os.path.join(my_data_dir, folder, label), exist_ok=True)

        for label in labels:
            label_dir = os.path.join(my_data_dir, label)
            files = os.listdir(label_dir)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                src_path = os.path.join(label_dir, file_name)

                if count <= train_set_files_qty:
                    # move given file to train set
                    dst_path = os.path.join(my_data_dir, 'train', label, file_name)
                    shutil.move(src_path, dst_path)
                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move given file to validation set
                    dst_path = os.path.join(my_data_dir, 'validation', label, file_name)
                    shutil.move(src_path, dst_path)
                else:
                    # move given file to test set
                    dst_path = os.path.join(my_data_dir, 'test', label, file_name)
                    shutil.move(src_path, dst_path)
                    
                count += 1

            os.rmdir(label_dir)

    print("Dataset successfully split into train, validation, and test sets.")

In [14]:
split_train_validation_test_images(my_data_dir = "/workspace/rare-and-sweet/inputs/cherry-leaves_dataset/cherry-leaves",
                                   train_set_ratio = 0.7,
                                   validation_set_ratio = 0.1,
                                   test_set_ratio = 0.2)

Dataset successfully split into train, validation, and test sets.
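The split function above truncates the train and validation counts with `int()`, so the test set receives whatever remains. The sketch below reproduces that arithmetic for the 2104 images per class reported earlier; the helper name `split_sizes` is hypothetical.

```python
def split_sizes(total_files, train_set_ratio, validation_set_ratio):
    """Mirror the counting logic of split_train_validation_test_images."""
    train_qty = int(total_files * train_set_ratio)
    validation_qty = int(total_files * validation_set_ratio)
    # Everything not assigned to train or validation falls through to test.
    test_qty = total_files - train_qty - validation_qty
    return train_qty, validation_qty, test_qty


# 2104 images per label, with the 0.7 / 0.1 / 0.2 ratios used above.
print(split_sizes(2104, 0.7, 0.1))  # (1472, 210, 422)
```

Each label folder should therefore hold 1472 training, 210 validation, and 422 test images, which can be confirmed by counting the files in each sub-folder.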


---