# Dogs vs Cats Transfer Learning Data Preparation

Transfer learning reuses a pre-trained network to learn new classes. This lesson uses networks trained on ImageNet to train a new classifier that predicts whether an image shows a dog or a cat.

This notebook converts the folders and files of data into structured TFRecord files for TensorFlow consumption.

In [None]:
import sys
import os
import numpy as np
import tensorflow as tf
import urllib
import glob
import pickle
from zipfile import ZipFile
from PIL import Image
from concurrent.futures import ProcessPoolExecutor

In [None]:
tf.__version__

The data can be obtained from the [Kaggle website](https://www.kaggle.com/c/dogs-vs-cats/data); I have put it into the `data_dir` location. On a remote machine this is also easy to do with the [kaggle-cli](https://github.com/floydwch/kaggle-cli).

In [None]:
# os.path.expanduser takes a single argument; os.path.join builds '<cwd>/data'
data_dir = os.path.join(os.getcwd(), 'data')
if not os.path.isdir(data_dir):
    os.makedirs(data_dir)

dev_data_dir = os.path.join(data_dir, 'dev')
train_data_dir = os.path.join(data_dir, 'train')
test_data_dir = os.path.join(data_dir, 'test1')
validation_data_dir = os.path.join(data_dir, 'validation')

In [None]:
def ensure_extracted(path, data_dir):
    # Extract '<path>.zip' into data_dir unless the folder already exists
    if not os.path.isdir(path):
        with ZipFile(path + '.zip') as z:
            z.extractall(path=data_dir)
ensure_extracted(test_data_dir, data_dir)
ensure_extracted(train_data_dir, data_dir)
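
As a quick check of the extraction helper, here is a minimal sketch that runs it against a throwaway archive. The temporary directory, the `train/cat.0.jpg` entry and its placeholder contents are made up for illustration. The sketch also assumes the archive holds a single top-level folder with the same name as the zip file, which is what the `os.path.isdir(path)` test relies on.

```python
import os
import tempfile
from zipfile import ZipFile

def ensure_extracted(path, data_dir):
    # Extract '<path>.zip' into data_dir unless the folder already exists
    if not os.path.isdir(path):
        with ZipFile(path + '.zip') as z:
            z.extractall(path=data_dir)

with tempfile.TemporaryDirectory() as data_dir:
    train_dir = os.path.join(data_dir, 'train')

    # Hypothetical archive with a single top-level 'train/' folder
    with ZipFile(train_dir + '.zip', 'w') as z:
        z.writestr('train/cat.0.jpg', b'placeholder bytes')

    ensure_extracted(train_dir, data_dir)   # extracts the archive
    ensure_extracted(train_dir, data_dir)   # folder exists now, so nothing happens
    extracted = sorted(os.listdir(train_dir))

print(extracted)
```

Because the second call is skipped, the cell can be re-run without unpacking the archives again.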

## Class to index mapping

The network predicts integers, so we need a mapping from class name to index.

Files are downloaded to a local disk for all intermediate processing, but the output records go onto a mapped Azure file share so the training task can use them. Use `afs_data_dir` for these outputs from here on.

In [None]:
afs_data_dir = '/afs/DogsVsCats'
if not os.path.isdir(afs_data_dir):
    os.makedirs(afs_data_dir)

In [None]:
class2idx = { 'dog': 0, 'cat': 1 }

In [None]:
with open(os.path.join(afs_data_dir, 'class2idx.p'), 'wb') as p:
    pickle.dump(class2idx, p)
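
The pickled mapping is only useful if it can be loaded again later. Here is a minimal sketch of the round trip, assuming a temporary directory in place of `afs_data_dir`. The inverted `idx2class` dictionary is a hypothetical helper and not part of this notebook.

```python
import os
import pickle
import tempfile

class2idx = {'dog': 0, 'cat': 1}

# Hypothetical scratch directory standing in for afs_data_dir
with tempfile.TemporaryDirectory() as scratch_dir:
    mapping_path = os.path.join(scratch_dir, 'class2idx.p')
    with open(mapping_path, 'wb') as p:
        pickle.dump(class2idx, p)

    # Read the mapping back the way a later task might
    with open(mapping_path, 'rb') as p:
        loaded = pickle.load(p)

# Invert the mapping so predicted integers can be turned back into class names
idx2class = {index: name for name, index in loaded.items()}
print(idx2class)
```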

Construct a NumPy array of `(filename, label)` tuples, one per sample. The label is looked up from the first three characters of the file name (`cat` or `dog`).

In [None]:
full_train_dataset = np.array([(os.path.join(train_data_dir, filename), class2idx[filename[:3]]) for filename in os.listdir(train_data_dir)])
np.random.shuffle(full_train_dataset)
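
One detail worth knowing: when NumPy builds an array from tuples that mix a string and an integer, it stores everything as strings, so the labels come back as `'0'` and `'1'`. A small sketch of that behaviour follows. The file names are made up and assume the `<class>.<number>.jpg` naming implied by `filename[:3]`.

```python
import numpy as np

class2idx = {'dog': 0, 'cat': 1}

# Hypothetical file names following the '<class>.<number>.jpg' pattern
filenames = ['cat.0.jpg', 'dog.0.jpg', 'dog.1.jpg']

samples = np.array([(name, class2idx[name[:3]]) for name in filenames])

# Kind 'U' means a unicode string array: the integer labels were converted too
print(samples.dtype.kind)

# Labels therefore need an explicit int() when they are read back
labels = [int(label) for _, label in samples]
print(labels)
```

This is why the conversion code further down wraps the label in `int(...)` before writing it.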

I'm using a 'dev' set of only 10 samples. This is particularly useful during development to check that the graph executes as expected. It's much quicker to go through 10 samples than a single epoch of the full training set :)

In [None]:
# Small sample to get operational
dev_dataset = full_train_dataset[:10]

In [None]:
# Hold out the last 200 shuffled samples for validation; the rest is for training
validation_dataset = full_train_dataset[-200:]
train_dataset = full_train_dataset[:-200]
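
To see what the three slices contain, here is a minimal sketch with a list of 1,000 made-up sample ids standing in for the shuffled dataset. The size is arbitrary; only the slicing pattern matters.

```python
# Stand-in for the shuffled dataset: 1,000 hypothetical sample ids
full = list(range(1000))

dev = full[:10]            # first 10 samples
validation = full[-200:]   # last 200 samples
train = full[:-200]        # everything except the last 200

assert len(train) + len(validation) == len(full)
assert not set(train) & set(validation)   # no sample is in both sets
assert set(dev) <= set(train)             # dev samples are also training samples
```

The dev set is carved from the front of the same array, so it overlaps the training set. That is fine for a smoke test, but it is not a held-out set.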

The test dataset has `None` as its labels (written as `-1` in the records) so that the TFRecords have the same structure for all data.

In [None]:
test_dataset = np.array([(os.path.join(test_data_dir, filename), None) for filename in os.listdir(test_data_dir)])

## TFRecord Conversion

We read each image from its directory and serialize the sample into the TFRecord format.

In [None]:
def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def convert_to_tfrecord(dataset_name, dataset, segments=1):
    
    def _to_record(dataset, name):
        # tf.python_io is the TensorFlow 1.x name; on TensorFlow 2.x use tf.io.TFRecordWriter
        with tf.python_io.TFRecordWriter(name) as writer:
            dataset_length = len(dataset)
            for index, (filename, label) in enumerate(dataset):
                sys.stdout.write("\rProcessing sample " + str(index+1) + " of " + str(dataset_length))
                sys.stdout.flush()
                
                # Open the image, force 3-channel RGB, and resize to a constant size
                image = Image.open(filename).convert('RGB').resize((224, 224))

                # Scale pixels from [0, 255] to [-0.5, 0.5] and store them as raw float32 bytes
                image_raw = ((np.array(image) / 255.) - 0.5).astype(np.float32).tobytes()
                
                features = {
                    'image': _bytes_feature(image_raw),
                    # Unlabelled (test) samples get -1 as a sentinel label
                    'label': _int64_feature(int(label) if label is not None else -1)
                }
                
                example = tf.train.Example(features=tf.train.Features(feature=features))
                writer.write(example.SerializeToString())
            print()
            
    sharded_dataset = np.array_split(dataset, segments)
    for index, sub_dataset in enumerate(sharded_dataset):
        filename = os.path.join(afs_data_dir, dataset_name + "-" + str(index) + ".tfrecord")
        print('Generating records into ' + filename)
        _to_record(sub_dataset, filename)
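
The `image` feature is a bare byte string, so whatever reads the records has to know the dtype and shape used here. Below is a minimal NumPy-only sketch of the round trip. The random pixel array is a stand-in for a real photo, and `restored` is only there to check that nothing is lost. It does not use TensorFlow, so it only approximates what a training-side parser would do.

```python
import numpy as np

# Hypothetical 224x224 RGB image filled with random uint8 pixel values
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

# Same arithmetic as in _to_record: scale to [-0.5, 0.5], store as float32 bytes
image_raw = ((pixels / 255.) - 0.5).astype(np.float32).tobytes()

# The bytes carry no shape or dtype, so the reader must supply both
decoded = np.frombuffer(image_raw, dtype=np.float32).reshape(224, 224, 3)

# 224 * 224 pixels * 3 channels * 4 bytes per float32
bytes_per_image = 224 * 224 * 3 * 4
print(len(image_raw) == bytes_per_image)

# Undo the normalization to confirm the original pixels are recoverable
restored = np.rint((decoded + 0.5) * 255).astype(np.uint8)
print(np.array_equal(restored, pixels))
```

At 602,112 bytes per sample the records are far larger than the source JPEGs. That is the price of storing pre-normalized floats.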

In [None]:
convert_to_tfrecord('dev', dev_dataset)

In [None]:
convert_to_tfrecord('validation', validation_dataset, segments=3)
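
The `segments` argument is passed to `np.array_split`, which allows shards of uneven length. Here is a small sketch of how 200 samples fall into 3 shards, with an array of made-up ids in place of the real validation set and the same file naming as `convert_to_tfrecord`.

```python
import numpy as np

# 200 hypothetical sample ids, matching the size of the validation set
validation_ids = np.arange(200)

shards = np.array_split(validation_ids, 3)
shard_sizes = [len(shard) for shard in shards]

# File names follow the '<dataset_name>-<index>.tfrecord' pattern used above
shard_names = ['validation-' + str(index) + '.tfrecord' for index in range(len(shards))]

print(list(zip(shard_names, shard_sizes)))
```

The first two shards receive 67 samples and the last one 66, so no sample is dropped when the dataset size is not divisible by the number of segments.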

In [None]:
convert_to_tfrecord('test', test_dataset, segments=3)

In [None]:
convert_to_tfrecord('train', train_dataset, segments=10)