# **Data Collection**

## Objectives

* Fetch data from Kaggle and save as raw data
* Remove non-image files from full dataset
* Split data into train, validation and test sets

## Inputs

* kaggle.json - the authentication token for data

## Outputs

* Full dataset
* Dataset split into three subfolders (Train, Validation and Test)

## Additional Comments

* No additional comments



---

# Change working directory

* We assume the notebooks are stored in a subfolder, so the working directory must be changed when running a notebook in the editor

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [39]:
import numpy
import os
path = '/workspace/mildew-detector/jupyter_notebooks'
os.chdir(path)
current_dir = os.getcwd()
current_dir

'/workspace/mildew-detector/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [40]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [41]:
current_dir = os.getcwd()
current_dir

'/workspace/mildew-detector'

# Import the Kaggle Data

Install the Kaggle package, which provides the command-line tool used to download the dataset

In [42]:
%pip install kaggle


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


* Set the Kaggle config directory to the current directory
* Restrict the permissions on the kaggle.json authentication token

In [43]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
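
The `chmod` shell call assumes a Unix-like shell is available. As a minimal sketch, the same owner-only permission can be set from Python with `os.chmod`. The helper name `restrict_token_permissions` is hypothetical and is not part of the notebook or the Kaggle package.

```python
import os
import stat


def restrict_token_permissions(token_path):
    """Make the token file readable and writable by its owner only (mode 600)."""
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    # Return the resulting permission bits so the caller can confirm them
    return stat.S_IMODE(os.stat(token_path).st_mode)
```

In this notebook it could be called as `restrict_token_permissions('kaggle.json')`, assuming the token sits in the current working directory.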

* Download Kaggle dataset

In [44]:
KaggleDatasetPath = 'codeinstitute/cherry-leaves'
DestinationFolder = 'inputs/full_dataset'
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/full_dataset
100%|█████████████████████████████████████▉| 55.0M/55.0M [00:01<00:00, 45.8MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:01<00:00, 32.3MB/s]


* Unzip files
* Delete zipped files

In [45]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
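
The unzip-and-delete step can also be wrapped in a small helper that checks the archive before the zip file is removed. This is only a sketch: the function name `extract_and_remove` is hypothetical, and the integrity check uses the standard library's `ZipFile.testzip()`.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract a zip archive into `destination`, then delete the archive."""
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        # testzip() returns the name of the first corrupt member, or None
        bad_member = zip_ref.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member in archive: {bad_member}")
        zip_ref.extractall(destination)
        extracted = zip_ref.namelist()
    # Only reached when extraction succeeded
    os.remove(zip_path)
    return extracted
```

Because the archive is deleted only after a successful extraction, a corrupt download stays on disk for inspection.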

---

# Preparation of Data

Clean the data by removing all non-image files.

In [46]:
def non_image_file(my_data_dir):
    """
    Remove every file that does not end in .png, .jpg or .jpeg
    and report the image and non-image counts for each folder.
    """

    img_ext = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        files = os.listdir(my_data_dir + '/' + folder)

        image_count = 0
        non_image_count = 0

        for curr_file in files:
            if not curr_file.lower().endswith(img_ext):
                # Delete anything that is not a recognised image file
                file_location = my_data_dir + '/' + folder + '/' + curr_file
                os.remove(file_location)
                non_image_count += 1
            else:
                image_count += 1
        print(f"{folder} - image files - ", image_count)
        print(f"{folder} - non-image files - ", non_image_count)

In [47]:
non_image_file('inputs/full_dataset/cherry-leaves')

healthy - image files -  2104
healthy - non-image files -  0
powdery_mildew - image files -  2104
powdery_mildew - non-image files -  0
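
Because `non_image_file` deletes files as it goes, it can be useful to preview what each folder contains first. The sketch below only counts file extensions and removes nothing. The helper name `count_by_extension` is hypothetical, and it assumes the same layout of one subfolder per label.

```python
import os
from collections import Counter


def count_by_extension(my_data_dir):
    """Return {folder: Counter of lowercase file extensions} without deleting anything."""
    summary = {}
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files at the top level
        summary[folder] = Counter(
            os.path.splitext(name)[1].lower() for name in os.listdir(folder_path)
        )
    return summary
```

Any extension other than `.png`, `.jpg` or `.jpeg` in the result marks files that the cleaning step would delete.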


## Split data

Split the data into three sets:
* Train
* Test
* Validation

Imports

In [None]:
import os
import shutil
import random
import joblib

Define the function that splits the data

In [None]:
def split_images(my_data_dir, train_ratio, validation_ratio, test_ratio):
    """
    Split the full dataset into train, validation and test sets.
    The split follows the ratios passed in when the function is called.
    """

    # Check that the ratios sum to 1, allowing for floating-point rounding

    if abs(train_ratio + validation_ratio + test_ratio - 1.0) > 1e-9:
        print("The sum of the ratios should equal 1")
        return
    
    labels = os.listdir(my_data_dir)
    if 'test' in labels:
        pass
    else:
        folders = ['train', 'validation', 'test']
        for folder in folders:
            for label in labels:
                os.makedirs(name=my_data_dir+ '/' + folder + '/' + label)
        
        for label in labels:
            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_qty = int(len(files) * train_ratio)
            validation_set_qty = int(len(files) * validation_ratio)

            count=1

            for file_name in files:
                # Pick the target subset from the running file count
                if count <= train_set_qty:
                    subset = 'train'
                elif count <= (train_set_qty + validation_set_qty):
                    subset = 'validation'
                else:
                    subset = 'test'

                shutil.move(my_data_dir + '/' + label + '/' + file_name,
                            my_data_dir + '/' + subset + '/' + label + '/' + file_name)

                count += 1
            os.rmdir(my_data_dir +'/'+label)
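
A ratio check like the one in `split_images` involves floating-point sums, which are not always exact (for example, `0.1 + 0.2` is not exactly `0.3`). As a minimal sketch, the check can be expressed with the standard library's `math.isclose`. The helper name `ratios_sum_to_one` and the tolerance value are assumptions, not part of the notebook.

```python
import math


def ratios_sum_to_one(*ratios, tolerance=1e-9):
    """Return True when the ratios add up to 1 within a small tolerance."""
    return math.isclose(sum(ratios), 1.0, abs_tol=tolerance)
```

For instance, `ratios_sum_to_one(0.7, 0.1, 0.2)` accepts the split used in this notebook, while `ratios_sum_to_one(0.7, 0.1, 0.3)` rejects it.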


Use the function to split the data into the following ratios for each set:
* Train = 70%
* Validation = 10%
* Test = 20%
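
As a worked example of how these ratios turn into file counts, the sketch below mirrors the integer truncation used in `split_images`, assuming 2104 images per label as reported by the cleaning step. The helper name `split_counts` is hypothetical.

```python
def split_counts(n_files, train_ratio, validation_ratio):
    """Mirror split_images: truncate train and validation, the rest goes to test."""
    train_qty = int(n_files * train_ratio)
    validation_qty = int(n_files * validation_ratio)
    test_qty = n_files - train_qty - validation_qty
    return train_qty, validation_qty, test_qty


print(split_counts(2104, 0.7, 0.1))  # (1472, 210, 422)
```

Because the test set receives the remainder, it ends up slightly above 20% of each label's files.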

In [None]:
split_images("inputs/full_dataset/cherry-leaves", 0.7, 0.1, 0.2)

---

## Conclusions

* Data is downloaded from Kaggle
* Data is cleaned to remove any non-image files
* Data is split into three folders 
    - inputs/full_dataset/cherry-leaves/train
    - inputs/full_dataset/cherry-leaves/validation
    - inputs/full_dataset/cherry-leaves/test
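
To confirm the split, the files in each folder can be counted. This is a minimal sketch, assuming the `train`, `validation` and `test` folders each contain one subfolder per label. The helper name `count_split_files` is hypothetical.

```python
import os


def count_split_files(my_data_dir, subsets=('train', 'validation', 'test')):
    """Return {subset: {label: number of files}} for an already split dataset."""
    counts = {}
    for subset in subsets:
        subset_path = os.path.join(my_data_dir, subset)
        counts[subset] = {
            label: len(os.listdir(os.path.join(subset_path, label)))
            for label in sorted(os.listdir(subset_path))
        }
    return counts
```

In this notebook it could be called as `count_split_files('inputs/full_dataset/cherry-leaves')`.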
