# Dogs vs Cats Transfer Learning Data Preparation

Transfer learning reuses a pre-trained network to learn new classes. This lesson uses networks trained on ImageNet to train a new classifier that predicts whether an image shows a dog or a cat.

This notebook converts the folders and files of data into structured TFRecord files for TensorFlow to consume.

In [20]:
import pickle
import sys
from pathlib import Path
import numpy as np
from PIL import Image
import tensorflow as tf
from typing import List, Union, Tuple

The data can be downloaded from the Kaggle website; I have placed it in the `data_dir` location.

In [2]:
data_dir = Path.home() / 'data/DogsVsCats'

dev_data_dir = data_dir / 'dev'
train_data_dir = data_dir / 'train'
test_data_dir = data_dir / 'test1'
validation_data_dir = data_dir / 'validation'

Build a mapping from class name to index, since the network predicts integers.

## Class to index mapping

In [3]:
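# Sort the class folders so the mapping is the same on every run;
# iterdir() yields entries in arbitrary order.
class2idx = { label.name: index for index, label in enumerate(sorted(dev_data_dir.iterdir())) }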

In [4]:
with (data_dir / 'class2idx.p').open('wb') as p:
    pickle.dump(class2idx, p)
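
A later notebook will probably need to turn predicted integers back into class names. The following is a minimal sketch of that round trip, assuming the mapping was pickled as above. The `idx2class` name and the `cat`/`dog` entries are illustrative and not part of the original notebook; a temporary directory stands in for `data_dir`.

```python
import pickle
import tempfile
from pathlib import Path

# Illustrative mapping; the real one depends on the folder names under dev/.
class2idx = {'cat': 0, 'dog': 1}

with tempfile.TemporaryDirectory() as tmp:
    mapping_path = Path(tmp) / 'class2idx.p'

    # Write the mapping the same way the notebook does.
    with mapping_path.open('wb') as p:
        pickle.dump(class2idx, p)

    # Read it back, as a training or inference notebook would.
    with mapping_path.open('rb') as p:
        loaded = pickle.load(p)

# Invert the mapping to decode integer predictions into class names.
idx2class = {index: name for name, index in loaded.items()}
print(idx2class[1])
```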

## TFRecord Conversion

We will read each image from its directory and serialize each sample into the TFRecord format.

In [17]:
def _int64_feature(value: int) -> tf.train.Feature:
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _bytes_feature(value: bytes) -> tf.train.Feature:
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

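def convert_to_tfrecord(dataset: Union[List, np.ndarray], dataset_name: str, data_dir: Path, segments: int=1):
    """Write (filename, label) pairs to `segments` TFRecord files named
    `{dataset_name}-{index}.tfrecord` inside `data_dir`."""
    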
    if not isinstance(dataset, np.ndarray):
        dataset = np.array(dataset)
    
    def _to_record(dataset, name):
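        # tf.python_io is the TensorFlow 1.x namespace; in TensorFlow 2.x the writer is tf.io.TFRecordWriter
        with tf.python_io.TFRecordWriter(name) as writer: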
            dataset_length = len(dataset)
            for index, (filename, label) in enumerate(dataset):
                sys.stdout.write(f"\rProcessing sample {index+1} of {dataset_length}")
                sys.stdout.flush()
                
                # Open, force 3 channels, and make images a constant size
                image = Image.open(filename).convert('RGB').resize((224, 224))
                
                # Normalize image to -0.5 -> 0.5 and get bytes
                image_raw = ((np.array(image) / 255.) - 0.5).astype(np.float32).tobytes() 
                
                features = {
                    'image': _bytes_feature(image_raw),
                    'label': _int64_feature(label)
                }
                
                example = tf.train.Example(features=tf.train.Features(feature=features))
                writer.write(example.SerializeToString())
            print()
    
    for index, dataset_segment in enumerate(np.array_split(dataset, segments)):
        filename = f'{dataset_name}-{index}.tfrecord'
        print(f'Generating records into {filename}')
        _to_record(dataset_segment, str(data_dir / filename))
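
The `image` feature stores raw `float32` bytes with no dtype or shape attached, so whatever reads the records has to supply both. The sketch below shows the encode/decode round trip in NumPy only. It assumes 3-channel RGB input at 224x224; `IMAGE_SHAPE` and the random stand-in image are illustrative, not part of the original notebook.

```python
import numpy as np

IMAGE_SHAPE = (224, 224, 3)  # assumed: RGB images resized to 224x224

# Stand-in for a resized RGB image; real pixels come from PIL.
pixels = np.random.randint(0, 256, size=IMAGE_SHAPE, dtype=np.uint8)

# Encode exactly as the conversion cell does: scale to [-0.5, 0.5], float32 bytes.
image_raw = ((pixels / 255.) - 0.5).astype(np.float32).tobytes()

# Decode: reinterpret the bytes as float32 and restore the known shape.
decoded = np.frombuffer(image_raw, dtype=np.float32).reshape(IMAGE_SHAPE)

# Undo the normalization to recover displayable 8-bit pixels.
restored = np.rint((decoded + 0.5) * 255.).astype(np.uint8)

print(len(image_raw), decoded.shape, decoded.dtype)
```

Each sample takes 224 * 224 * 3 * 4 bytes in this encoding, four times the size of the same image stored as `uint8`. That is the cost of normalizing before serialization rather than at read time.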

In [22]:
def labeled_dataset(dataset_dir: Path, has_labels: bool=True) -> List[Tuple[Path, int]]:
    files = list(dataset_dir.rglob('*.jpg'))
    
    if has_labels:
        labels = [class2idx[p.parts[-2]] for p in files]
    else:
        labels = [-1] * len(files)
    
    return list(zip(files, labels))

train_dataset = labeled_dataset(train_data_dir)
dev_dataset = labeled_dataset(dev_data_dir)
validation_dataset = labeled_dataset(validation_data_dir)
test_dataset = labeled_dataset(test_data_dir, has_labels=False)
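
`labeled_dataset` takes each label from `p.parts[-2]`, the name of the folder that directly contains the image, so it expects a `<dataset>/<class name>/<image>.jpg` layout. That layout is inferred from the code rather than stated in the notebook. A small sketch on a throwaway directory tree, with illustrative `cat`/`dog` folders and file names:

```python
import tempfile
from pathlib import Path

class2idx = {'cat': 0, 'dog': 1}  # illustrative mapping

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Build a tiny tree: <root>/<class name>/<image>.jpg
    for class_name, stem in [('cat', 'a'), ('dog', 'b')]:
        (root / class_name).mkdir()
        (root / class_name / f'{stem}.jpg').touch()

    files = sorted(root.rglob('*.jpg'))
    # p.parts[-2] is the name of the folder that directly contains the file.
    labels = [class2idx[p.parts[-2]] for p in files]
    names = [p.name for p in files]

print(list(zip(names, labels)))
```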

In [23]:
convert_to_tfrecord(dev_dataset, 'dev', data_dir)

Generating records into dev-0.tfrecord
Processing sample 10 of 10


In [8]:
convert_to_tfrecord(validation_dataset, 'validation', data_dir, segments=3)

Generating records into validation-0.tfrecord
Processing sample 2750 of 2750
Generating records into validation-1.tfrecord
Processing sample 2750 of 2750
Generating records into validation-2.tfrecord
Processing sample 2750 of 2750


In [9]:
convert_to_tfrecord(train_dataset, 'train', data_dir, segments=10)

Generating records into train-0.tfrecord
Processing sample 1675 of 1675
Generating records into train-1.tfrecord
Processing sample 1675 of 1675
Generating records into train-2.tfrecord
Processing sample 1675 of 1675
Generating records into train-3.tfrecord
Processing sample 1675 of 1675
Generating records into train-4.tfrecord
Processing sample 1675 of 1675
Generating records into train-5.tfrecord
Processing sample 1675 of 1675
Generating records into train-6.tfrecord
Processing sample 1675 of 1675
Generating records into train-7.tfrecord
Processing sample 1675 of 1675
Generating records into train-8.tfrecord
Processing sample 1675 of 1675
Generating records into train-9.tfrecord
Processing sample 1675 of 1675


In [10]:
convert_to_tfrecord(test_dataset, 'test', data_dir, segments=3)

Generating records into test-0.tfrecord
Processing sample 4167 of 4167
Generating records into test-1.tfrecord
Processing sample 4167 of 4167
Generating records into test-2.tfrecord
Processing sample 4166 of 4166
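
The uneven sizes in the last run (4167, 4167, 4166) come from `np.array_split`, which gives the first `total % segments` pieces one extra item when the split is not exact. The pure-Python sketch below applies the same rule. `segment_sizes` is a hypothetical helper, and the total of 12500 is the sum of the three counts printed above.

```python
from typing import List

def segment_sizes(total: int, segments: int) -> List[int]:
    """Sizes of the pieces when `total` items are split into `segments` parts,
    following the rule documented for np.array_split."""
    base, remainder = divmod(total, segments)
    # The first `remainder` segments receive one extra item.
    return [base + 1 if i < remainder else base for i in range(segments)]

print(segment_sizes(12500, 3))  # [4167, 4167, 4166]
```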
