# Dogs vs Cats Transfer Learning Data Preparation

Transfer learning reuses a pre-trained network to learn new classes. This lesson uses networks trained on ImageNet to train a new classifier that predicts whether an image shows a dog or a cat.

This notebook converts the folders and files of data into structured TFRecord files for TensorFlow consumption.

In [1]:
import sys
import os
import numpy as np
import tensorflow as tf
import urllib
import glob
import pickle
from PIL import Image
from concurrent.futures import ProcessPoolExecutor

The data can be downloaded from Kaggle; I have placed it in the `data_dir` location.

In [2]:
data_dir = os.path.expanduser('~/data/DogsVsCats')

dev_data_dir = os.path.join(data_dir, 'dev')
train_data_dir = os.path.join(data_dir, 'train')
test_data_dir = os.path.join(data_dir, 'test1')
validation_data_dir = os.path.join(data_dir, 'validation')

Get a class name to index mapping, as our network will predict integers.

## Class to index mapping

In [3]:
# os.listdir returns entries in arbitrary order; sort so the mapping is reproducible
class_names = sorted(os.listdir(dev_data_dir))
class_name2id = { label: index for index, label in enumerate(class_names) }

In [4]:
with open(os.path.join(data_dir, 'class_name2id.p'), 'wb') as p:
    pickle.dump(class_name2id, p)
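
As an illustration of how a later notebook might consume this file, here is a minimal sketch that pickles a mapping, reloads it, and inverts it so that predicted integers can be turned back into class names. The `cat`/`dog` entries, the temporary directory, and the `id2class_name` name are assumptions made for the example; the real mapping depends on the folders found in `dev_data_dir`.

```python
import os
import pickle
import tempfile

# Hypothetical mapping; the real one is built from the folder names in dev_data_dir.
example_class_name2id = {'cat': 0, 'dog': 1}

# Write and reload the mapping the same way the notebook does, using a temp directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    mapping_path = os.path.join(tmp_dir, 'class_name2id.p')
    with open(mapping_path, 'wb') as f:
        pickle.dump(example_class_name2id, f)
    with open(mapping_path, 'rb') as f:
        loaded_class_name2id = pickle.load(f)

# Invert the mapping to translate predicted ids back into class names.
id2class_name = {index: name for name, index in loaded_class_name2id.items()}
print(id2class_name[1])
```

Keeping the mapping on disk means the training and inference code can agree on which integer stands for which class without re-listing the directory.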

## TFRecord Conversion

We will read each image from its directory and serialize the sample into the TFRecord format.

In [15]:
def _int64_feature(value):
    # Wrap a single integer as an int64 feature
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _bytes_feature(value):
    # Wrap a single bytes object as a bytes feature
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def convert_to_tfrecord(dataset_name, data_directory, segments=1, directories_as_labels=True, files='**/*.jpg'):
    """Serialize the images under data_directory into one or more TFRecord files in data_dir."""
    
    # Get all files in the directory
    filenames = glob.glob(os.path.join(data_directory, files)) 
    # Create the dataset of filename class tuple
    classes = (os.path.basename(os.path.dirname(name)) for name in filenames) if directories_as_labels else [None] * len(filenames)
    dataset = list(zip(filenames, classes))
    
    def _to_record(dataset, name):
        with tf.python_io.TFRecordWriter(name) as writer:
            dataset_length = len(dataset)
            for index, (filename, label) in enumerate(dataset):
                sys.stdout.write(f"\rProcessing sample {index+1} of {dataset_length}")
                sys.stdout.flush()
                
                # Open, force three RGB channels, and resize to a constant size
                image = Image.open(filename).convert('RGB').resize((224, 224))
                
                # Normalize pixels from [0, 255] to [-0.5, 0.5] and serialize as float32 bytes
                image_raw = ((np.array(image) / 255.) - 0.5).astype(np.float32).tobytes()
                
                features = {
                    'image': _bytes_feature(image_raw),
                    'label': _int64_feature(class_name2id[label] if label else -1),
                    'text_label': _bytes_feature((label  if label else 'NA').encode('utf8'))
                }
                
                example = tf.train.Example(features=tf.train.Features(feature=features))
                writer.write(example.SerializeToString())
            print()
    
    if segments > 1:
        for index, sub_dataset in enumerate(np.array_split(dataset, segments)):
            filename = f'{dataset_name}-{index}.tfrecord'
            print(f'Generating records into {filename}')
            _to_record(sub_dataset, os.path.join(data_dir, filename))
    else:
        filename = f'{dataset_name}.tfrecord'
        print(f'Generating records into {filename}')
        _to_record(dataset, os.path.join(data_dir, filename))
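
The `image` feature stores only raw bytes, so a reader has to know the dtype and shape to rebuild the array. The sketch below shows that round trip with NumPy alone, assuming a three-channel 224x224 image. The random `pixels` array is a stand-in for a real photo, and `decoded` and `restored` are names introduced only for this example.

```python
import numpy as np

# Stand-in for a resized RGB image; in the notebook the pixels come from PIL.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

# Same normalization and serialization as in _to_record.
image_raw = ((pixels / 255.) - 0.5).astype(np.float32).tobytes()

# Decoding: reinterpret the bytes as float32 and restore the known shape.
decoded = np.frombuffer(image_raw, dtype=np.float32).reshape(224, 224, 3)

# Undo the normalization to recover the original 8-bit pixel values.
restored = np.rint((decoded + 0.5) * 255.).astype(np.uint8)

print(len(image_raw), decoded.shape)
```

Storing float32 values makes each record four times larger than the 8-bit pixels would be, in exchange for skipping normalization at read time.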

In [11]:
convert_to_tfrecord('dev', dev_data_dir)

Generating records into dev.tfrecord
Processing sample 10 of 10


In [12]:
convert_to_tfrecord('validation', validation_data_dir, segments=3)

Generating records into validation-0.tfrecord
Processing sample 2750 of 2750
Generating records into validation-1.tfrecord
Processing sample 2750 of 2750
Generating records into validation-2.tfrecord
Processing sample 2750 of 2750
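
The shard sizes above come from `np.array_split`, which, unlike `np.split`, accepts a sample count that does not divide evenly by `segments`. The following sketch shows how it distributes a small, made-up dataset. The file names and labels are hypothetical and stand in for the globbed `(filename, label)` pairs.

```python
import numpy as np

# Hypothetical (filename, label) pairs standing in for the globbed dataset.
dataset = [(f'img_{i}.jpg', 'cat' if i % 2 == 0 else 'dog') for i in range(10)]

# np.array_split allows uneven splits: the earlier chunks get the extra samples.
shards = np.array_split(dataset, 3)
shard_sizes = [len(shard) for shard in shards]
print(shard_sizes)  # [4, 3, 3]

# Rows become NumPy strings, which still work as keys into a plain dict.
first_filename, first_label = shards[0][0]
print({'cat': 0, 'dog': 1}[first_label])  # 0
```

Note that the files are split in the order `glob` returned them, so a shard is not guaranteed to contain a balanced mix of classes unless the list is shuffled first.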


In [13]:
convert_to_tfrecord('test', test_data_dir, directories_as_labels=False, files='*.jpg')

Generating records into test.tfrecord
Processing sample 12500 of 12500


In [14]:
convert_to_tfrecord('train', train_data_dir, segments=10)

Generating records into train-0.tfrecord
Processing sample 1675 of 1675
Generating records into train-1.tfrecord
Processing sample 1675 of 1675
Generating records into train-2.tfrecord
Processing sample 1675 of 1675
Generating records into train-3.tfrecord
Processing sample 1675 of 1675
Generating records into train-4.tfrecord
Processing sample 1675 of 1675
Generating records into train-5.tfrecord
Processing sample 1675 of 1675
Generating records into train-6.tfrecord
Processing sample 1675 of 1675
Generating records into train-7.tfrecord
Processing sample 1675 of 1675
Generating records into train-8.tfrecord
Processing sample 1675 of 1675
Generating records into train-9.tfrecord
Processing sample 1675 of 1675
