# A TensorFlow Input Pipeline Guide

The goal is to write the CelebA images and their male/female labels (available [here](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html)) into a TFRecords file, then use the Dataset API to load that data into the graph. We will train a CNN to predict the gender label of each image, and we want to monitor the whole thing in TensorBoard.

In [1]:
import tensorflow as tf
import numpy as np
from scipy.misc import imread  # Removed in SciPy 1.2; others (e.g. imageio.imread) would work here.
from cv2 import resize  # Again, others to choose from.
import os  # Tools to read data off disk.


## Image Paths and Labels from Disk

In [2]:
def read_labeled_image_list(
        img_dir="/media/dylan/DATA/img_align_celeba",
        label_txt_path="/media/dylan/DATA/list_attr_celeba.txt"):
    """Read labeled images off disk.
    
    Args:
        img_dir: Path to folder containing all the images.
        label_txt_path: Full path to annotation txt file.
        
    Returns:
        A list of image paths for all images.
        A list of one-hot lists [male?, female?] for all images.

    """
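    # Get the image paths. os.listdir returns names in arbitrary
    # order, so sort them to line up with the rows of the label
    # file, which lists the images in filename order.
    img_filenames = sorted(os.listdir(img_dir))
    # This is a list.
    abs_img_paths = [os.path.join(img_dir, img_f)
                     for img_f in img_filenames]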
    
    # Get the labels by "r"eading the txt.
    with open(label_txt_path, "r") as f:
        img_fnames_and_lbls = f.read().splitlines()
    # We will make a one-hot list-of-lists for all imgs.
    labels = []
    for fname_and_lbl in img_fnames_and_lbls[2:]:  # Skip 2 header rows
        # Take everything after the .jpg part
        lbl_str = fname_and_lbl.split(".jpg")[1]
        lbl_str_split = lbl_str.split(" ")
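        # Compare strings with !=, not `is not` (identity is the
        # wrong test for string values).
        all_labels = [int(label)
                      for label in lbl_str_split if label != ""]
        # Now we have a list of all 40 attributes (each -1 or 1);
        # index 20 is the "Male" attribute.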
        is_male = all_labels[20] == 1
        labels.append([int(is_male), int(not is_male)])
    return abs_img_paths, labels
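
To make the label parsing concrete, here is a minimal sketch that handles a single annotation row. The helper name `parse_attr_line` and the sample row are made up for illustration (they are not part of the notebook or of the real annotation file); the only things taken from the code above are the 40 attribute columns and the male flag at index 20.

```python
def parse_attr_line(line, attr_index=20):
    """Turn one annotation row into (filename, [male?, female?])."""
    # str.split() with no argument splits on any run of whitespace,
    # so there are no empty strings to filter out afterwards.
    fields = line.split()
    fname = fields[0]
    attrs = [int(v) for v in fields[1:]]
    is_male = attrs[attr_index] == 1
    return fname, [int(is_male), int(not is_male)]


# Hypothetical row: 40 attribute values of -1, with index 20 set to 1.
values = ["-1"] * 40
values[20] = "1"
sample_line = "000001.jpg " + "  ".join(values)

print(parse_attr_line(sample_line))  # ('000001.jpg', [1, 0])
```

Returning the filename alongside the label is a cheap way to double-check that paths and labels stay paired up.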

## Make TFRecords
We invoke the function we just defined to get all the image paths and labels. For each path and label, write the TFR format data into a file located at `tfr_file`.

NOTE: in this example `TFRecordWriter` expects the `tfr` folder to **already exist** on **some systems** (for instance, I think `TFRecordWriter` can create the directory on Windows, but I know it cannot on Linux). It does, HOWEVER, **create** the file called `celeba_tutorial` itself.
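
One way to sidestep the folder question on every system is to create the parent directory yourself before opening the writer. This is only a sketch using the standard library; `ensure_parent_dir` is a hypothetical helper name, and the demo runs in a throwaway temp directory instead of the real `tfr` folder.

```python
import os
import tempfile


def ensure_parent_dir(file_path):
    """Create the folder that will hold `file_path` if it is missing."""
    parent = os.path.dirname(file_path)
    if parent:  # An empty string means "current directory".
        os.makedirs(parent, exist_ok=True)
    return parent


# Demo in a throwaway directory so nothing real is touched.
base = tempfile.mkdtemp()
tfr_file = os.path.join(base, "tfr", "celeba_tutorial")
ensure_parent_dir(tfr_file)

print(os.path.isdir(os.path.join(base, "tfr")))  # True
print(os.path.exists(tfr_file))                  # False: only the folder is made
```

Calling something like this right before `make_tfr` would leave the file creation to `TFRecordWriter`, as before.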

In [3]:
def make_tfr(
        all_image_paths, all_labels, tfr_file="tfr/celeba_tutorial"):
    """Write a TFR file at `tfr_file`.
    
    Get all the image paths and labels. Recall the labels come in
    a list of lists of ints. FOR EACH IMAGE/LABEL, read the image, 
    do any preprocessing while image is still np.ndarray, 
    and convert into `bytes`. Then construct an `Example` object. 
    The `Example` object contains a `tf.train.Features` object
    with a dictionary of `features`. We use helper functions
    to convert our data into `Feature` types. Then, we write the image
    and label to the file.
    
    COMMON GOTCHA: `bytes`, the raw encoding for the image, 
    is not a list type. However, `tf.train.Feature` needs a list!
    That's why `_bytes_feature` turns `value` into a list. On the
    other hand, each label **is** an integer list! So, we do **not** 
    need to make the argument of `_int64_list_feature` a list! 
    
    """
    def _int64_list_feature(a_list):
        return tf.train.Feature(
            int64_list=tf.train.Int64List(value=a_list))

    def _bytes_feature(value):
        return tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[value]))

    print("Writing TFR files.")
    writer = tf.python_io.TFRecordWriter(tfr_file)
    for n, (path, label) in enumerate(zip(all_image_paths, all_labels)):
        # n is only for a nice little status bar.
        print("Writing file {}\r".format(n), end="")
        disk_im = imread(path)  # numpy ndarray, 218 x 178 x 3
        resized_im = resize(disk_im, (128, 128))  # np, 128 x 128 x 3
        raw_im = resized_im.tobytes()  # <class 'bytes'>; tostring() is a deprecated alias
        # Construct an example proto-obj,
        example = tf.train.Example(
            # which wants a Features proto-obj,
            features=tf.train.Features(
                # which wants a dict.
                feature={
                    'image_raw': _bytes_feature(raw_im),
                    'label': _int64_list_feature(label)
        })) # close your example object
        serialized = example.SerializeToString()
        writer.write(serialized)
    writer.close()  # Flush everything to disk and release the file.
    print("\nDone.")
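
The raw-bytes step is worth seeing in isolation, because the bytes carry no shape or dtype: whoever reads the record has to know both in advance. Below is a small NumPy-only sketch; the random array is a stand-in for a resized image, not real CelebA data.

```python
import numpy as np

# Stand-in for a resized image: 128 x 128 x 3, uint8.
rng = np.random.RandomState(0)
resized_im = rng.randint(0, 256, size=(128, 128, 3)).astype(np.uint8)

raw_im = resized_im.tobytes()  # Flat bytes: no shape, no dtype.
print(type(raw_im), len(raw_im))  # Length is 128 * 128 * 3 = 49152.

# Reading it back only works if we already know dtype and shape.
restored = np.frombuffer(raw_im, dtype=np.uint8).reshape(128, 128, 3)
print(np.array_equal(restored, resized_im))  # True
```

This is the same job that `tf.decode_raw` and `tf.reshape` do on the reading side later on.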

In [4]:
# We're going to implement some simple overwrite protection
# because it takes a while to write all of celeba, even to SSD.

write_tfr = False  # Set to True to force an overwrite. Reset it while you're waiting :)
TFR_FILE = "/media/dylan/DATA/tfr/celeba_tutorial"

if write_tfr or not os.path.isfile(TFR_FILE):
    all_image_paths, all_labels = read_labeled_image_list()
    make_tfr(all_image_paths, all_labels, tfr_file=TFR_FILE)

Want a train/test split? Do something like

    import sklearn.model_selection as sk
    TFR_TRAIN = "tfr/celeba_train"
    TFR_TEST = "tfr/celeba_test"
    
    if write_tfr or not os.path.isfile(TFR_TRAIN):
        all_image_paths, all_labels = read_labeled_image_list()
        tr_paths, te_paths, tr_lbls, te_lbls = sk.train_test_split(
            all_image_paths, all_labels, test_size=0.2)
        make_tfr(tr_paths, tr_lbls, tfr_file=TFR_TRAIN)
        make_tfr(te_paths, te_lbls, tfr_file=TFR_TEST)
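
If you are curious what the split is doing, or don't want the sklearn dependency, here is a rough standard-library stand-in. It is NOT sklearn's implementation (the rounding of the test size may differ, for one); `simple_train_test_split` and the toy paths and labels are invented for this sketch. The return order mirrors the call above.

```python
import random


def simple_train_test_split(paths, labels, test_size=0.2, seed=0):
    """Shuffle indices once, then slice paths and labels identically."""
    assert len(paths) == len(labels)
    indices = list(range(len(paths)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    tr_paths = [paths[i] for i in train_idx]
    te_paths = [paths[i] for i in test_idx]
    tr_lbls = [labels[i] for i in train_idx]
    te_lbls = [labels[i] for i in test_idx]
    return tr_paths, te_paths, tr_lbls, te_lbls


# Toy data: ten fake paths with alternating one-hot labels.
paths = ["img_{}.jpg".format(i) for i in range(10)]
labels = [[i % 2, 1 - i % 2] for i in range(10)]
tr_p, te_p, tr_l, te_l = simple_train_test_split(paths, labels)
print(len(tr_p), len(te_p))  # 8 2
```

The important part is that paths and labels are indexed with the same shuffled indices, so every image keeps its own label.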

## Dataset
Now we have a data file to read, so let's get to the graph building. The first part of any graph is the data input pipeline, which we wrap in an `input_pipeline` function. Why a function? If we want to use a train-test split, we need a dataset for each TFR file. We would call `input_pipeline` twice: once with `TFR_TRAIN` and again with `TFR_TEST`.

In [5]:
def input_pipeline(batch_size, epochs, tfr_file="tfr/celeba_tutorial"):
    dataset = tf.data.TFRecordDataset(tfr_file)

    def parse_protocol_buffer(example_proto):
        """Read the TFR file into TENSORS! Read the bytes image
        into a tf.string tensor and read the label into a
        tf.int64 tensor!"""
        features = {'image_raw': tf.FixedLenFeature((), tf.string),
                    # NOTICE THE 2 HERE! It is the length of each label.
                    # (Written as a list, since `(2)` is just the int 2.)
                    'label': tf.FixedLenFeature([2], tf.int64)}
        parsed_features = tf.parse_single_example(
            example_proto, features)
        return parsed_features['image_raw'], parsed_features['label']

    # Apply the reading function to the dataset.
    dataset = dataset.map(parse_protocol_buffer)

    def convert_parsed_proto_to_input(image_string, label):
        """Convert that tf.string tensor with the raw image
        into a tf.uint8 tensor (natural data-type for disk imgs).
        Then restore the image shape. Finally, turn it into
        a float so we can do real-valued math to it."""
        image_decoded = tf.decode_raw(image_string, tf.uint8)
        image_resized = tf.reshape(image_decoded, (128, 128, 3))
        image = tf.cast(image_resized, tf.float32)
        # Reshape label as a sanity check, convert to a real number.
        label = tf.reshape(label, [2])
        label = tf.cast(label, tf.float32)
        # I usually put my image elements in [-1, 1]
        return image * (2. /255) -1, label

    # Apply the decoding function to the dataset.
    dataset = dataset.map(convert_parsed_proto_to_input)
    # Shuffle the dataset, using a buffer of 1000 examples.
    dataset = dataset.shuffle(buffer_size=1000)
    # Collect `batch_size` num. of elements. Image and label
    # tensors in the dataset change from [128 x 128 x 3]
    # and [2] into [bs x 128 x 128 x 3] and [bs x 2].
    dataset = dataset.batch(batch_size)
    # `repeat(n)` replays the whole dataset n times; it does not
    # set a number of examples or batches. One repeat is enough
    # to keep the iterator from running dry during training.
    dataset = dataset.repeat(epochs)
    return dataset
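
The `image * (2. / 255) - 1` line maps pixel values from [0, 255] into [-1, 1]. Here is a NumPy sketch of that scaling plus its inverse, which comes in handy if you ever want to look at the images again. Both function names are made up for the example.

```python
import numpy as np


def to_unit_range(image_uint8):
    """Map uint8 pixels in [0, 255] to float32 values in [-1, 1]."""
    return image_uint8.astype(np.float32) * (2.0 / 255) - 1


def to_uint8(image_float):
    """Undo to_unit_range, rounding back to the nearest integer."""
    return np.round((image_float + 1) * (255 / 2.0)).astype(np.uint8)


pixels = np.array([0, 127, 128, 255], dtype=np.uint8)
scaled = to_unit_range(pixels)
print(scaled)  # Runs from (about) -1 up to (about) 1.
print(np.array_equal(to_uint8(scaled), pixels))  # True
```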

## Model

In [6]:
DESIRED_OUTPUT_SIZE = 2  # This is num labels per img. MALE and FEMALE

def conv(imgs, filters_out, stride_size, kernel_size):
    """Simple 2D convolution helper function."""
    filters_in = imgs.get_shape().as_list()[3]
    Kernel = tf.get_variable(
        "kernel",
        [kernel_size[0], kernel_size[1], filters_in, filters_out],
        initializer=tf.truncated_normal_initializer(stddev=0.1))
    Bias = tf.get_variable(
        "bias",
        [filters_out],
        initializer=tf.zeros_initializer())
    evidence = tf.nn.conv2d(
        imgs,
        Kernel,
        strides=[1, stride_size[0], stride_size[1], 1],
        padding="SAME")
    return evidence + Bias

def model(image_tensor, keep_prob=0.5):
    # Since conv uses tf.get_variable, we NEED tf.variable_scope.
    # That tells tf.get_variable what variable to get, i.e. a new
    # one or one that already exists. tf.variable_scope otherwise
    # behaves similarly to tf.name_scope
    with tf.variable_scope("layer1"):
        # input: bs x 128 x 128 x 3
        z1 = conv(image_tensor, 16, (2, 2), (5, 5))
        a1 = tf.nn.relu(z1)
    with tf.variable_scope("layer2"):
        # input: bs x 64 x 64 x 16
        z2 = conv(a1, 32, (2, 2), (5, 5))
        a2 = tf.nn.relu(z2)
    # let's max-pool to hurry things up
    m2 = tf.nn.max_pool(a2, [1, 2, 2, 1], [1, 2, 2, 1], "SAME")
    with tf.variable_scope("layer3"):
        # input: bs x 16 x 16 x 32
        z3 = conv(m2, 64, (2, 2), (5, 5))
        a3 = tf.nn.relu(z3)
    with tf.variable_scope("layer4"):
        # input: bs x 8 x 8 x 64
        z4 = conv(a3, 64, (2, 2), (5, 5))
        a4 = tf.nn.relu(z4)
        # output: bs x 4 x 4 x 64
        
    # Let's quit convolving.
    final_shape = a4.get_shape().as_list()
    n_elems = final_shape[1] * final_shape[2] * final_shape[3]
    # if you're lazy,  n_elems = np.prod(final_shape[1:])
    flat_a4 = tf.reshape(a4, [-1, n_elems])
    # flat_a4: bs x 4*4*64
    d4 = tf.nn.dropout(flat_a4, keep_prob)
    with tf.variable_scope("fully_connected"):
        W = tf.get_variable(
            "weights",
            [n_elems, DESIRED_OUTPUT_SIZE],
            initializer=tf.truncated_normal_initializer(stddev=0.1))
        b = tf.get_variable(
            "bias",
            [DESIRED_OUTPUT_SIZE],
            initializer=tf.zeros_initializer())
        logits = tf.matmul(d4, W) + b
        # logits: Unscaled predictions. Notice we don't use an
        # activation function here.
        
        # bs x DESIRED_OUTPUT_SIZE == [bs x 2] == shape of labels!
    return logits
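
The shape comments in `model` can be checked with a few lines of arithmetic. With `"SAME"` padding, the output size along a spatial axis is the input size divided by the stride, rounded up. The sketch below just replays the strides used above; `same_out` is a hypothetical helper, not a TensorFlow function.

```python
import math


def same_out(size, stride):
    """Spatial size after a SAME-padded conv or pool with this stride."""
    return math.ceil(size / stride)


size = 128
trace = [size]
# Strides along one spatial axis, in the order the model applies them.
for name, stride in [("layer1", 2), ("layer2", 2), ("max_pool", 2),
                     ("layer3", 2), ("layer4", 2)]:
    size = same_out(size, stride)
    trace.append(size)

print(trace)  # [128, 64, 32, 16, 8, 4]

n_elems = size * size * 64  # 64 filters come out of layer4.
print(n_elems)  # 1024
```

So `flat_a4` ends up with 1024 columns, and `W` is 1024 x 2.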

## Evaluate and Train

In [7]:
def loss(logits=None, labels=None):
    with tf.name_scope("Eval"):
        elemwise_xent = tf.nn.softmax_cross_entropy_with_logits(
            logits=logits, labels=labels, dim=-1)
        avg_xent = tf.reduce_mean(elemwise_xent, axis=-1)
    return avg_xent
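
For intuition about the numbers this loss produces, here is a NumPy sketch of softmax cross-entropy on two hand-picked examples. It is a plain re-derivation for illustration, not TensorFlow's implementation. One handy reference point: a two-class model that outputs equal logits scores ln 2 (about 0.693) per example.

```python
import numpy as np


def softmax_xent(logits, labels):
    """Per-example cross-entropy between softmax(logits) and labels."""
    # Subtract the row max first for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(
        np.exp(shifted).sum(axis=-1, keepdims=True))
    return -(labels * log_probs).sum(axis=-1)


logits = np.array([[0.0, 0.0],    # Undecided between the two classes.
                   [4.0, -4.0]])  # Confident in the first class.
labels = np.array([[1.0, 0.0],
                   [1.0, 0.0]])

per_example = softmax_xent(logits, labels)
print(per_example)         # First entry is ln 2; second is close to 0.
print(per_example.mean())  # The batch average, like avg_xent above.
```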

In [8]:
def train(loss, learning_rate=1e-4):
    train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)
    return train_op

## Build the Graph

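This constructs the graph. Running this cell twice without restarting the kernel will cause ``tf.get_variable`` to raise a ``ValueError``, because the variables already exist in the default graph. Restart the kernel, or call ``tf.reset_default_graph()`` before rebuilding.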

In [9]:
# Put all the pieces together.
batch_size = 50
iterations = 10000

# Get a dataset, turn it into an iterator, return batch_image and
# batch_label (no data in here yet, just graph building).
train_dataset = input_pipeline(
    batch_size, iterations, tfr_file=TFR_FILE)
with tf.name_scope("Input"):
    train_iterator = train_dataset.make_initializable_iterator()
    batch_image, batch_label = train_iterator.get_next()

# Sanity check on the pipeline
tf.summary.image("inputs", batch_image, 1)

# Sanity check on the keep prob
keep_prob = tf.placeholder(tf.float32)
tf.summary.scalar("keep_prob", keep_prob)

# Run the batch of imgs through the convolutional stack
logits = model(batch_image, keep_prob=keep_prob)

# Get the loss
loss_op = loss(logits=logits, labels=batch_label)

# Monitor loss
tf.summary.scalar("xent", loss_op)

# Train
train_op = train(loss_op)

# Magically merge all the summaries into one op.
summary_op = tf.summary.merge_all()

How do we handle a train/test split now?

    train_dset = input_pipeline(
        batch_size, iterations, tfr_file=TFR_TRAIN)
    test_dset = input_pipeline(
        batch_size, iterations, tfr_file=TFR_TEST)
    is_training = tf.placeholder(tf.bool)
    # It really doesn't matter if you overshoot iterations.
    with tf.name_scope("Input"):
        train_iterator = train_dset.make_initializable_iterator()
        test_iterator = test_dset.make_initializable_iterator()
        batch_image, batch_label = tf.cond(
            is_training,
            lambda: train_iterator.get_next(),
            lambda: test_iterator.get_next(),
            name="choose_data")
            
Admittedly, it's a little contrived, but I haven't found a better way yet.

## Training

The graph and the summaries we defined above get written to `tb/tutorial`, so we can look at them in TensorBoard.

    $ tensorboard --logdir=tb/tutorial
    
We aren't automatically clearing the TensorBoard log directory. If you want to run this twice with the same log directory and don't want jumbled graphs, DELETE tb/tutorial first!

Once ``tensorboard`` is running, go to ``localhost:6006`` and check out the graph.
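
If deleting the log directory by hand gets old, a few lines of standard library will do it. This is just a sketch: `clear_log_dir` is a hypothetical helper, and the demo below works in a temp directory with a fake event file so it cannot touch your real logs.

```python
import os
import shutil
import tempfile


def clear_log_dir(log_dir):
    """Delete `log_dir` and everything in it, if it exists."""
    if os.path.isdir(log_dir):
        shutil.rmtree(log_dir)


# Demo in a throwaway directory with a fake event file.
base = tempfile.mkdtemp()
log_dir = os.path.join(base, "tb", "tutorial")
os.makedirs(log_dir)
open(os.path.join(log_dir, "events.out.tfevents.demo"), "w").close()

clear_log_dir(log_dir)
print(os.path.exists(log_dir))  # False
```

Be careful where you point it: `shutil.rmtree` does not ask for confirmation.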

In [10]:
# Run the training.
with tf.Session() as sess:
    # Initialize all the tf.get_variable()s
    sess.run(tf.global_variables_initializer())
    # Create a TensorBoard writer (this line adds the graph dashboard)
    train_writer = tf.summary.FileWriter("tb/tutorial", sess.graph)
    # INITIALIZE THE ITERATOR! START READING DATA OFF DISK!!
    
    sess.run(train_iterator.initializer)
    # sess.run(test_iterator.initializer)

    # NOTE: `epoch` here counts training steps (one batch each),
    # not full passes over the dataset.
    for epoch in range(iterations + 1):
        check_in = epoch % 100 == 0
        # test = epoch % 500 == 0
        feed_dict = {keep_prob: 0.5}

        # SPLIT:
        # feed_dict = {keep_prob: 1 if test else 0.5,
        #              is_training: not test}

        # if not test:
        _, summary = sess.run(
            [train_op, summary_op], feed_dict=feed_dict)
        train_writer.add_summary(summary, epoch)
        # elif test:
        #     summary = sess.run([summary_op], feed_dict=feed_dict)

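        if check_in:
            # NOTE: this extra run pulls a fresh batch from the
            # iterator (with dropout still on), so the printed loss
            # is for a different batch than the one just trained on.
            curr_loss = sess.run([loss_op], feed_dict=feed_dict)
            print(epoch, curr_loss)

    # Flush any pending summaries to disk.
    train_writer.close()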

0 [3.2054853]
100 [0.86350244]
200 [0.6849426]
300 [0.68953967]
400 [0.67173815]
500 [0.6842258]
600 [0.72459626]
700 [0.67978317]
800 [0.62469435]
900 [0.68821692]
1000 [0.67310441]
1100 [0.68012029]
1200 [0.73374277]
1300 [0.65069425]
1400 [0.67672628]
1500 [0.68643713]
1600 [0.68121016]
1700 [0.64919662]
1800 [0.68303972]
1900 [0.68060791]
2000 [0.70264405]
2100 [0.67227471]
2200 [0.65915924]
2300 [0.69329268]
2400 [0.68970263]
2500 [0.68072993]
2600 [0.64484537]
2700 [0.67863578]
2800 [0.71092546]
2900 [0.69675964]
3000 [0.68089616]
3100 [0.67159635]
3200 [0.69296592]
3300 [0.71214175]
3400 [0.69607812]
3500 [0.70406282]
3600 [0.69037366]
3700 [0.69690186]
3800 [0.68985528]
3900 [0.63170135]
4000 [0.67624885]
4100 [0.76049638]
4200 [0.7092613]
4300 [0.68205917]
4400 [0.68462998]
4500 [0.6523453]
4600 [0.65850502]
4700 [0.61176008]
4800 [0.67044312]
4900 [0.68276715]
5000 [0.66934198]
5100 [0.67133242]
5200 [0.68198776]
5300 [0.71684754]
5400 [0.67312533]
5500 [0.6705699]
5600 [0.68