# **Data Collection**

## Objectives

* Fetch the data from Kaggle, save it as raw data in the workspace, and prepare it for further processing.

## Inputs

* Kaggle JSON file - authentication token

## Outputs

* Generated dataset: inputs/cherryleaves_dataset/cherry-leaves

## Additional Comments

* The notebooks are stored in a subfolder, so when a notebook is run in an editor the working directory has to be changed from the notebook's folder to its parent folder.
* The current directory is read with os.getcwd()



---

# Import Packages

In [1]:
import numpy 
import os 

## Change Working Directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [2]:
current_dir = os.getcwd()
current_dir

'/workspaces/mildew-detection-cherry-leaves-p5/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspaces/mildew-detection-cherry-leaves-p5'
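Re-running the cell that calls os.chdir() moves the working directory up one more level each time. A minimal sketch of a guarded version follows; it assumes the notebooks folder is named jupyter_notebooks (as in the output above), and the helper name project_root is hypothetical.

```python
import os


def project_root(current_dir, notebooks_folder="jupyter_notebooks"):
    """Return the parent of current_dir only when current_dir is the notebooks folder.

    notebooks_folder is an assumption taken from the path printed above.
    """
    if os.path.basename(current_dir) == notebooks_folder:
        return os.path.dirname(current_dir)
    # already outside the notebooks folder: leave the path unchanged
    return current_dir


# safe to run more than once: the second run is a no-op
os.chdir(project_root(os.getcwd()))
```

With this guard the cell can be executed repeatedly without leaving the project folder.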

# Install Kaggle

In [5]:
# install the kaggle package with pip
%pip install kaggle==1.5.12

Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
  Preparing metadata (setup.py) ... done
Collecting tqdm (from kaggle==1.5.12)
  Obtaining dependency information for tqdm from https://files.pythonhosted.org/packages/00/e5/f12a80907d0884e6dff9c16d0c0114d81b8cd07dc3ae54c5e962cc83037e/tqdm-4.66.1-py3-none-any.whl.metadata
  Downloading tqdm-4.66.1-py3-none-any.whl.metadata (57 kB)
Collecting python-slugify (from kaggle==1.5.12)
  Downloading python_slugify-8.0.1-py2.py3-none-any.whl (9.7 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle==1.5.12)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)

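Kaggle Configuration

* Point KAGGLE_CONFIG_DIR at the current working directory, where kaggle.json must be saved, and restrict the token file's permissions to its owner (chmod 600)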


In [6]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
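The configuration cell fails with an unhelpful shell error when kaggle.json is missing. The following is a minimal sketch of the same two steps in plain Python with an explicit check; the helper name configure_kaggle is hypothetical, and os.chmod with owner read/write bits is used as the equivalent of chmod 600.

```python
import os
import stat


def configure_kaggle(config_dir):
    """Set KAGGLE_CONFIG_DIR and restrict kaggle.json to its owner."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(
            f"kaggle.json not found in {config_dir}; save the token there first")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # owner read + write only, the same permissions as `chmod 600`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path


# usage in the notebook (assumes the token sits in the project root):
# configure_kaggle(os.getcwd())
```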

Set the Kaggle dataset path and the destination folder, then download the dataset

In [7]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherryleaves_dataset"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/cherryleaves_dataset
 89%|█████████████████████████████████▊    | 49.0M/55.0M [00:01<00:00, 40.3MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:01<00:00, 35.8MB/s]


* Unzip the downloaded file, then delete the zip file

In [8]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
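Because the zip file is deleted after extraction, running the cell above a second time raises an error. A minimal sketch of a re-runnable variant is shown here; the function name extract_and_remove is hypothetical and it returns the extracted member names purely for inspection.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, delete the archive, return member names."""
    if not os.path.isfile(zip_path):
        # archive already extracted and removed on a previous run
        return []
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(destination)
        names = zip_ref.namelist()
    os.remove(zip_path)
    return names


# usage in the notebook:
# extract_and_remove(DestinationFolder + '/cherry-leaves.zip', DestinationFolder)
```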

---
# Data Preparation
---

## Data Cleaning 

### Check and Remove non-image files

* Check each class folder and remove any file whose extension is not .png, .jpg, or .jpeg

In [9]:
def remove_non_image_file(my_data_dir):
    """
    Remove every file that does not have an image extension
    and report the image / non-image counts for each folder.
    """
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        files = os.listdir(my_data_dir + '/' + folder)
        image_count = 0
        non_image_count = 0
        # iterate over every file in the folder
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                # not an image: delete it
                file_location = my_data_dir + '/' + folder + '/' + given_file
                os.remove(file_location)
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)



In [11]:
remove_non_image_file(my_data_dir='inputs/cherryleaves_dataset/cherry-leaves') 

Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0
Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
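The function above deletes files as it finds them. Where a preview is preferred, a dry run can list the candidates first. This is a minimal sketch using the same extension rule; the name find_non_image_files is hypothetical, and like the original it checks only the file extension, not the file contents.

```python
import os

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def find_non_image_files(my_data_dir):
    """Return {folder: [non-image file names]} without deleting anything."""
    non_images = {}
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files at the top level
        non_images[folder] = [
            name for name in sorted(os.listdir(folder_path))
            if not name.lower().endswith(IMAGE_EXTENSIONS)
        ]
    return non_images


# usage in the notebook:
# find_non_image_files('inputs/cherryleaves_dataset/cherry-leaves')
```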


## Split the Data: Train, Validation, Test Sets  

In [12]:
import math
import os
import random
import shutil


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    """
    Divide the dataset into train, validation, and test sets.
    The files of each class are shuffled and moved into the three sets
    according to the given ratios, which must sum to 1.0.
    """
    # compare with a tolerance, since float sums are not always exactly 1.0
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # get class labels
    labels = os.listdir(my_data_dir)  # it should get only the folder name
    if 'test' in labels:
        # the dataset has already been split, so there is nothing to do
        return

    # create train, validation, test folders with class label sub-folders
    for folder in ['train', 'validation', 'test']:
        for label in labels:
            os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

    for label in labels:
        files = os.listdir(my_data_dir + '/' + label)
        random.shuffle(files)

        train_set_files_qty = int(len(files) * train_set_ratio)
        validation_set_files_qty = int(len(files) * validation_set_ratio)

        for count, file_name in enumerate(files, start=1):
            if count <= train_set_files_qty:
                target_set = 'train'
            elif count <= (train_set_files_qty + validation_set_files_qty):
                target_set = 'validation'
            else:
                # the remaining files go to the test set
                target_set = 'test'

            shutil.move(my_data_dir + '/' + label + '/' + file_name,
                        my_data_dir + '/' + target_set + '/' + label + '/' + file_name)

        # the original class folder is now empty
        os.rmdir(my_data_dir + '/' + label)

The data is split into:

* A training set (70%)
* A validation set (10%)
* A test set (20%)
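The split function truncates the train and validation counts with int() and sends every remaining file to the test set, so the actual shares differ slightly from the nominal percentages. A short worked sketch for one class of 2104 images (the count reported above); the helper name split_sizes is hypothetical.

```python
def split_sizes(n_files, train_set_ratio, validation_set_ratio):
    """Reproduce the counting logic of the split: truncate, remainder goes to test."""
    train = int(n_files * train_set_ratio)
    validation = int(n_files * validation_set_ratio)
    test = n_files - train - validation
    return train, validation, test


print(split_sizes(2104, 0.7, 0.1))  # (1472, 210, 422)
```

Each class therefore contributes 1472 training, 210 validation, and 422 test images, which puts the test set marginally above 20%.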


In [13]:
split_train_validation_test_images(my_data_dir="inputs/cherryleaves_dataset/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2)
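The split cell produces no output, so a quick count per set and label is a useful sanity check. This is a minimal sketch; count_split_files is a hypothetical helper that assumes the train/validation/test folder layout created by the function above.

```python
import os


def count_split_files(my_data_dir, sets=('train', 'validation', 'test')):
    """Return {set_name: {label: number_of_files}} for an already split dataset."""
    counts = {}
    for set_name in sets:
        set_path = os.path.join(my_data_dir, set_name)
        counts[set_name] = {
            label: len(os.listdir(os.path.join(set_path, label)))
            for label in sorted(os.listdir(set_path))
        }
    return counts


# usage in the notebook:
# count_split_files('inputs/cherryleaves_dataset/cherry-leaves')
```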

### Conclusions and Next Steps 

* The data has been downloaded for analysis
* There are now 3 separate folders in inputs/cherryleaves_dataset/cherry-leaves, where the train, validation, and test sets are stored
* The next steps will be to visualise and augment the data to further prepare it for modelling