# Part 1: Dogs-vs-Cats dataset preparation


## 1. Downloading the dataset from Kaggle

If you are not already registered, go to the [Kaggle website](https://www.kaggle.com) and create an account.

Once you are logged in, download the [dogs-vs-cats dataset](https://www.kaggle.com/c/dogs-vs-cats/data) and place the downloaded zip file (`dogs-vs-cats.zip`) in the same folder as this Jupyter Notebook.

The `dogs-vs-cats.zip` archive contains another pair of zip archives. `test1.zip` contains unlabelled images that were required as part of the original Kaggle challenge; we won't be using them.

`train.zip` contains 25,000 labelled images, where the filename of each JPEG image indicates its class (or label). For example, `cat.12.jpg` belongs to the cat class. We will divide these 25k images into a training set of approximately 19k images, a validation set of approximately 5k images and a test set of approximately 1k images.
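As a small illustration of this naming convention, the sketch below extracts the label from a filename. The helper name `label_from_filename` is hypothetical and is not part of the notebook's own code; it simply assumes that the label is everything before the first dot.

```python
import os

def label_from_filename(path):
    """Return the class label encoded in a Dogs-vs-Cats image filename."""
    # Drop any leading folders, then keep the text before the first dot.
    filename = os.path.basename(path)
    label, _rest = filename.split('.', 1)
    return label

print(label_from_filename('cat.12.jpg'))
print(label_from_filename('train/dog.7.jpg'))
```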

The training set is used to train the model. The validation set is used to measure accuracy at the end of each training epoch; the validation images are not used to train the model.

Finally, the test set is a small set of 'unseen' data that we will use to make predictions with the trained model.


## 2. Moving the images into folders

We will be using the Keras `.flow_from_directory()` method during training, so the images need to be divided into folders that reflect the classes:
<br>

![title](img/folders.png)

<br>
We start by importing the necessary libraries:

In [None]:
import os
import shutil
import zipfile

from random import seed, random

Then we create some variables that point to the current working directory and to the folders that we want to create.

In [None]:
SCRIPT_DIR = os.getcwd()
print('This script is located in: ', SCRIPT_DIR)

# dataset top level
DATASET_DIR = os.path.join(SCRIPT_DIR, 'dataset')

# train, validation and test folders
TRAIN_DIR = os.path.join(DATASET_DIR, 'train')
VALID_DIR = os.path.join(DATASET_DIR, 'valid')
TEST_DIR = os.path.join(DATASET_DIR, 'test')

# class folders
TRAIN_CAT_DIR = os.path.join(TRAIN_DIR, 'cat')
TRAIN_DOG_DIR = os.path.join(TRAIN_DIR, 'dog')
VALID_CAT_DIR = os.path.join(VALID_DIR, 'cat')
VALID_DOG_DIR = os.path.join(VALID_DIR, 'dog')
TEST_CAT_DIR = os.path.join(TEST_DIR, 'cat')
TEST_DOG_DIR = os.path.join(TEST_DIR, 'dog')

Now we delete any previous folders and then make new class folders, just like in the image above.

In [None]:
# remove any previous data, then recreate the empty top-level folder
if os.path.exists(DATASET_DIR):
    shutil.rmtree(DATASET_DIR)
os.makedirs(DATASET_DIR)
print('Directory', DATASET_DIR, 'created')

# make all necessary folders
# (os.makedirs also creates the parent train, valid and test folders)
dir_list = [TRAIN_CAT_DIR, TRAIN_DOG_DIR,
            VALID_CAT_DIR, VALID_DOG_DIR,
            TEST_CAT_DIR, TEST_DOG_DIR]

# 'folder' is used as the loop variable to avoid shadowing the built-in dir()
for folder in dir_list:
    os.makedirs(folder)
    print('Directory', folder, 'created')

Unzip the `dogs-vs-cats.zip` archive that we downloaded from Kaggle, then unzip the `train.zip` archive that was inside it.

In [None]:
# unzip the dogs-vs-cats archive that was downloaded from Kaggle
with zipfile.ZipFile(os.path.join(SCRIPT_DIR, 'dogs-vs-cats.zip'), 'r') as zip_ref:
    zip_ref.extractall(DATASET_DIR)

# unzip the train archive (inside the dogs-vs-cats archive)
with zipfile.ZipFile(os.path.join(DATASET_DIR, 'train.zip'), 'r') as zip_ref:
    zip_ref.extractall(DATASET_DIR)

print('Unzipped dataset.')

# remove files that are no longer needed
os.remove(os.path.join(DATASET_DIR, 'sampleSubmission.csv'))
os.remove(os.path.join(DATASET_DIR, 'test1.zip'))
os.remove(os.path.join(DATASET_DIR, 'train.zip'))
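If the clean-up step fails (`os.remove` raises `FileNotFoundError` when a file is missing), it can help to check what an archive actually contains. The sketch below is not part of the original notebook: `list_archive` is a hypothetical helper, and the small in-memory archive with placeholder members only stands in for the real Kaggle download so that the example runs on its own.

```python
import io
import zipfile

def list_archive(zip_source):
    """Return the names of all members of a zip archive."""
    with zipfile.ZipFile(zip_source, 'r') as archive:
        return archive.namelist()

# Build a tiny in-memory archive so the sketch runs without the download.
# The member contents are placeholders, not real zip files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('train.zip', b'placeholder')
    archive.writestr('test1.zip', b'placeholder')

print(list_archive(buffer))
```

With the real download in place, the same helper can be called with a path instead of a buffer, for example `list_archive('./dogs-vs-cats.zip')`.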

Make a list of all 25k images that are now in the `dataset/train` folder.

In [None]:
# make a list of all files currently in the train folder
imageList = list()
for (root, name, files) in os.walk(TRAIN_DIR):
    imageList += [os.path.join(root, file) for file in files]

Seed the random number generator so that the split is repeatable. Each call to `random()` then returns a floating-point number in the range [0, 1). We also set the fractions of the images that go to the test and validation sets.

In [None]:
# seed random number generator
seed(1)

test_ratio = 0.04
valid_ratio = 0.2
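As a rough check that these ratios give the intended set sizes, the sketch below simulates the split for 25,000 images without touching any files. It is an illustration only: the function name `simulate_split` is hypothetical, and it assumes one random draw per image, with thresholds at `test_ratio` and `test_ratio + valid_ratio`.

```python
from random import Random

def simulate_split(n_images, test_ratio, valid_ratio, seed_value=1):
    """Count how many of n_images would land in each subset."""
    # A private generator leaves the notebook's global seed untouched.
    rng = Random(seed_value)
    counts = {'test': 0, 'valid': 0, 'train': 0}
    for _ in range(n_images):
        r = rng.random()  # one draw per image
        if r <= test_ratio:
            counts['test'] += 1
        elif r <= test_ratio + valid_ratio:
            counts['valid'] += 1
        else:
            counts['train'] += 1
    return counts

print(simulate_split(25000, test_ratio=0.04, valid_ratio=0.2))
```

Because the assignment is random, the counts land near 1k, 5k and 19k rather than exactly on them.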

Now we move the files to their class folders inside the train, validation or test folders, based on a random number generated for each image. If the random number is less than or equal to `test_ratio`, the image is used for test. If it is greater than `test_ratio` but no greater than `test_ratio + valid_ratio`, the image is used for validation. Any larger value means the image is used for training.

In [None]:
# move the images to their class folders inside train, valid, test
for img in imageList:
    filename = os.path.basename(img)
    # the class ('cat' or 'dog') is the part of the filename before the first dot
    class_folder, _ = filename.split('.', 1)

    # draw one random number per image and use it for every comparison
    r = random()
    if r <= test_ratio:
        dst_dir = TEST_DIR
    elif r <= (test_ratio + valid_ratio):
        dst_dir = VALID_DIR
    else:
        dst_dir = TRAIN_DIR

    os.rename(img, os.path.join(dst_dir, class_folder, filename))

print('FINISHED CREATING DATASET')
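A quick way to confirm the result is to count the files in each class folder. The sketch below is not part of the original notebook: `count_images` is a hypothetical helper, and it is demonstrated on a throwaway folder tree so that it runs without the dataset.

```python
import os
import tempfile

def count_images(dataset_dir):
    """Return {(split, class): number of files} for every class folder."""
    counts = {}
    for split in sorted(os.listdir(dataset_dir)):
        split_dir = os.path.join(dataset_dir, split)
        if not os.path.isdir(split_dir):
            continue
        for class_name in sorted(os.listdir(split_dir)):
            class_dir = os.path.join(split_dir, class_name)
            if os.path.isdir(class_dir):
                counts[(split, class_name)] = len(os.listdir(class_dir))
    return counts

# Demonstrate on a temporary folder tree with a few empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for split, class_name, n in [('train', 'cat', 3), ('valid', 'dog', 1)]:
        folder = os.path.join(tmp, split, class_name)
        os.makedirs(folder)
        for i in range(n):
            open(os.path.join(folder, f'{class_name}.{i}.jpg'), 'w').close()
    demo_counts = count_images(tmp)

print(demo_counts)
```

In the notebook itself, `count_images(DATASET_DIR)` should report six class folders whose totals are roughly 19k for train, 5k for valid and 1k for test.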

We now have all of the data ready for training and can move to the Part 2 Notebook.