# **Data Collection**

## Objectives

* Fetch data from Kaggle
* Remove non-image files

## Inputs

* Kaggle JSON authentication token (`kaggle.json`)

## Outputs

* Generate cherry-leaves dataset



---

# Change working directory

* The notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from the current folder to its parent folder.
* `os.getcwd()` returns the current directory

In [1]:
import numpy
import os
current_dir = os.getcwd()
current_dir

'/workspaces/cherry-leaves-project/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/cherry-leaves-project'
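
Re-running the `os.chdir` cell moves the working directory up one more level each time. The sketch below guards against that, assuming the notebooks live in a folder named `jupyter_notebooks`, as the first output above shows. `move_to_project_root` is a hypothetical helper name, not part of the original notebook.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Change to the parent directory only while still inside notebook_folder.

    notebook_folder is an assumption taken from the path printed above.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_folder:
        os.chdir(cwd.parent)
    # Return the directory we ended up in so the cell can display it
    return Path.cwd()
```

Calling the function a second time leaves the working directory unchanged, so the cell is safe to re-run.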

# Kaggle Installation

Install the `kaggle` package, which provides the command-line tool used to download the dataset.

In [4]:
!pip install kaggle




---

In [5]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
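
The cell above points `KAGGLE_CONFIG_DIR` at the project root and restricts the permissions of `kaggle.json`, so it only works if the token file is already there. A hedged sketch of a pre-flight check follows, assuming the standard token layout with `username` and `key` fields and a POSIX file system. `check_kaggle_token` is a hypothetical helper, not part of the Kaggle package.

```python
import json
import os
import stat


def check_kaggle_token(config_dir):
    """Validate kaggle.json in config_dir and restrict it to owner read/write."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"No kaggle.json found in {config_dir}")

    with open(token_path) as token_file:
        token = json.load(token_file)

    # Assumption: the token holds exactly these two credentials
    missing = {"username", "key"} - token.keys()
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")

    # Equivalent to `chmod 600 kaggle.json`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

Running `check_kaggle_token(os.getcwd())` before the download cell turns a missing or incomplete token into a clear error instead of a failed download.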

In [6]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry_leaves_dataset"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/cherry_leaves_dataset
 85%|████████████████████████████████▍     | 47.0M/55.0M [00:00<00:00, 75.9MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:00<00:00, 75.1MB/s]


In [7]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
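
The cell above deletes the archive right after extraction. As a slightly more defensive variant, the sketch below checks the archive for corrupt members first and only removes it once extraction has succeeded. `extract_and_remove` is a hypothetical helper name; the `zipfile` calls are from the standard library.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # testzip() returns the name of the first corrupt member, or None
        bad_member = zip_ref.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member in archive: {bad_member}")
        zip_ref.extractall(destination)
        members = zip_ref.namelist()
    # Reached only if extraction raised no exception
    os.remove(zip_path)
    return members
```

In this notebook it could be called as `extract_and_remove(DestinationFolder + '/cherry-leaves.zip', DestinationFolder)`.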

# Data Cleaning

Remove every file that does not have a `.png`, `.jpg` or `.jpeg` extension from each class folder.

In [8]:
def remove_non_image(data_dir):
    """Delete files without an image extension and report counts per folder."""
    image_ext = ('.png', '.jpg', '.jpeg')
    for folder in os.listdir(data_dir):
        folder_path = os.path.join(data_dir, folder)
        image_count = 0
        non_image_count = 0
        for given_file in os.listdir(folder_path):
            if given_file.lower().endswith(image_ext):
                image_count += 1
            else:
                # remove non-image file
                os.remove(os.path.join(folder_path, given_file))
                non_image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [9]:
remove_non_image(data_dir='inputs/cherry_leaves_dataset/cherry-leaves')

Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0
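
Because `remove_non_image` deletes files permanently, it can help to inspect the folders first. The sketch below is a read-only dry run that tallies file extensions per class folder. `count_extensions` is a hypothetical helper, not part of the original notebook.

```python
import os
from collections import Counter


def count_extensions(data_dir):
    """Return {folder: Counter of lower-cased extensions} without deleting anything."""
    report = {}
    for folder in sorted(os.listdir(data_dir)):
        folder_path = os.path.join(data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files at the top level
        report[folder] = Counter(
            os.path.splitext(name)[1].lower() for name in os.listdir(folder_path)
        )
    return report
```

Any extension in the report other than `.png`, `.jpg` or `.jpeg` marks files that the cleaning step would remove.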


## Split the Images

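The images are split into train, validation and test sets. The ratios are passed to `splitfolders` in the order train/val/test, so `ratio=(.7, .1, .2)` gives:
* Train set = 70%
* Validation set = 10%
* Test set = 20%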

In [10]:
import os
import shutil
import random
import joblib
import splitfolders

input_folder = "inputs/cherry_leaves_dataset/cherry-leaves"
output = "inputs/cherry_leaves_dataset/cherry-leaves"
# Folder where the split datasets are saved; it is created if it does not exist.

splitfolders.ratio(input_folder, output=output, seed=42, ratio=(.7, .1, .2))
# The ratios are in the order train/val/test and can be changed as needed.
# For train/val sets only, you could use (.75, .25), for example.

Copying files: 4208 files [00:02, 1598.90 files/s]
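
To confirm the result of the split, the files in each set can be counted per class. This is a minimal sketch, assuming the output folder contains `train`, `val` and `test` subfolders with one folder per class inside each. `count_split_files` is a hypothetical helper, not part of `splitfolders`.

```python
import os


def count_split_files(split_root, splits=("train", "val", "test")):
    """Return {split: {class_name: file_count}} for the split folders that exist."""
    counts = {}
    for split in splits:
        split_path = os.path.join(split_root, split)
        if not os.path.isdir(split_path):
            continue  # this split was not created
        counts[split] = {}
        for class_name in sorted(os.listdir(split_path)):
            class_path = os.path.join(split_path, class_name)
            if os.path.isdir(class_path):
                counts[split][class_name] = len(os.listdir(class_path))
    return counts
```

Calling `count_split_files(output)` after the split returns a nested dictionary whose per-class counts should add up to the number of images in each original class folder.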


---

## **Conclusions and Next Steps**

**Conclusion**
* Removed non-image files
* Split data into train, validation and test sets
  
**Next Steps**
* Data Visualization
* Mean of images
* Variability of images
* Image Montage