## Loading and processing data efficiently in TensorFlow

Notes on TensorFlow tools for streamlining input data loading and processing
(following "Learning TensorFlow" by Hope, Resheff, and Lieder, O'Reilly)

### TFRecord

A TFRecord file is a binary file containing serialized input data. Serialization is based on protocol buffers.
All the data is held in a single block of memory, which cuts the time needed to read it.
Many TF tools are optimized for TFRecords.
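To make "binary file containing serialized input data" more concrete, here is a minimal pure-Python sketch of the per-record framing described in the TFRecord format documentation: a little-endian 64-bit length, a masked CRC-32C of that length, the payload bytes, and a masked CRC-32C of the payload. The helper names (`crc32c`, `masked_crc32c`, `frame_record`, `read_records`) are made up for this illustration and are not TensorFlow APIs. In real use the payload would be the output of `example.SerializeToString()`, and `tf.python_io.TFRecordWriter` does the framing for you.

```python
import struct

# CRC-32C (Castagnoli) lookup table, reflected polynomial 0x82F63B78.
_CRC32C_TABLE = []
for byte in range(256):
    crc = byte
    for _ in range(8):
        crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
    _CRC32C_TABLE.append(crc)


def crc32c(data):
    """Plain (unmasked) CRC-32C of a bytes object."""
    crc = 0xFFFFFFFF
    for b in bytearray(data):
        crc = _CRC32C_TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data):
    """CRC-32C with the rotate-and-offset mask used inside TFRecord files."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def frame_record(payload):
    """Wrap one serialized payload in TFRecord framing (hypothetical helper)."""
    length = struct.pack('<Q', len(payload))
    return (length
            + struct.pack('<I', masked_crc32c(length))
            + payload
            + struct.pack('<I', masked_crc32c(payload)))


def read_records(blob):
    """Yield the payloads from a bytes object made of framed records."""
    offset = 0
    while offset < len(blob):
        header = blob[offset:offset + 8]
        (length,) = struct.unpack('<Q', header)
        (length_crc,) = struct.unpack('<I', blob[offset + 8:offset + 12])
        if length_crc != masked_crc32c(header):
            raise ValueError('corrupt length field at offset %d' % offset)
        payload = blob[offset + 12:offset + 12 + length]
        (data_crc,) = struct.unpack(
            '<I', blob[offset + 12 + length:offset + 16 + length])
        if data_crc != masked_crc32c(payload):
            raise ValueError('corrupt payload at offset %d' % offset)
        yield payload
        offset += 16 + length


# Two stand-in payloads; real ones would be serialized tf.train.Example protos.
blob = frame_record(b'first example') + frame_record(b'second')
records = list(read_records(blob))
```

Each record carries 16 bytes of framing (8 + 4 + 4) on top of its payload, and records are simply concatenated, so a reader can stream through the file sequentially.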

### TFRecordWriter

Note: TFRecordWriter is actually rather slow, and the files it produces take up a lot of storage.

In [3]:
# __future__ imports must come before any other statement
from __future__ import print_function

import os

import numpy as np
import tensorflow as tf
from tensorflow.contrib.learn.python.learn.datasets import mnist

In [4]:
DATA_DIR = '/tmp/loading/mnist'

In [10]:
# Download data
data_sets = mnist.read_data_sets(DATA_DIR,
                                 dtype=tf.uint8,
                                 reshape=False,
                                 validation_size=1000)

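# data_sets is already divided. It is a namedtuple with the fields
# (train, validation, test), so look each split up by name; indexing by
# position would pair "test" with the validation data and vice versa.
data_splits = ["train", "test", "validation"]
for split in data_splits:
    print("saving " + split)
    data_set = getattr(data_sets, split)

    filename = os.path.join(DATA_DIR, split + '.tfrecords')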
    
    # Create a TFRecord writer
    writer = tf.python_io.TFRecordWriter(filename)
    
    for index in range(data_set.images.shape[0]):
        # Raw pixel bytes; tobytes() replaces the deprecated tostring()
        image = data_set.images[index].tobytes()
        example = tf.train.Example(features=tf.train.Features(
            feature={
                'height': tf.train.Feature(
                    int64_list=tf.train.Int64List(
                        value=[data_set.images.shape[1]])),
                'width': tf.train.Feature(
                    int64_list=tf.train.Int64List(
                        value=[data_set.images.shape[2]])),
                'depth': tf.train.Feature(
                    int64_list=tf.train.Int64List(
                        value=[data_set.images.shape[3]])),
                'label': tf.train.Feature(
                    int64_list=tf.train.Int64List(
                        value=[int(data_set.labels[index])])),
                'image_raw': tf.train.Feature(
                    bytes_list=tf.train.BytesList(
                        value=[image]))}
        ))
        writer.write(example.SerializeToString())
    writer.close()


# Read the first training record back to check what was written
filename = os.path.join(DATA_DIR, 'train.tfrecords')
record_iterator = tf.python_io.tf_record_iterator(filename)
serialized_img_example = next(record_iterator)

# Parse the serialized bytes back into an Example protocol buffer
example = tf.train.Example()
example.ParseFromString(serialized_img_example)
image = example.features.feature['image_raw'].bytes_list.value
label = example.features.feature['label'].int64_list.value[0]
width = example.features.feature['width'].int64_list.value[0]
height = example.features.feature['height'].int64_list.value[0]

# The raw bytes carry no shape, so rebuild it from the stored height and width
# (np.frombuffer replaces the deprecated np.fromstring)
img_flat = np.frombuffer(image[0], dtype=np.uint8)
img_reshaped = img_flat.reshape((height, width, -1))

print(img_flat.shape)
print(type(img_reshaped))
print(img_reshaped.shape)


Extracting /tmp/loading/mnist/train-images-idx3-ubyte.gz
Extracting /tmp/loading/mnist/train-labels-idx1-ubyte.gz
Extracting /tmp/loading/mnist/t10k-images-idx3-ubyte.gz
Extracting /tmp/loading/mnist/t10k-labels-idx1-ubyte.gz
saving train
saving test
saving validation
(784,)
<type 'numpy.ndarray'>
(28, 28, 1)
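The shapes printed above follow from how the image is stored: the `image_raw` feature holds only raw bytes, with no shape or dtype attached. The NumPy-only sketch below reproduces that round trip on a synthetic array standing in for one MNIST image (the array contents and variable names are made up for illustration). It also shows what happens if the reader assumes the wrong dtype, which is why the notebook stores `height`, `width`, and `depth` as separate features and decodes with `np.uint8`.

```python
import numpy as np

# Synthetic stand-in for one MNIST image: 28 x 28 pixels, 1 channel, uint8.
image = (np.arange(28 * 28) % 256).astype(np.uint8).reshape((28, 28, 1))

# Serialize to raw bytes, as done for the 'image_raw' feature.
raw = image.tobytes()

# Decoding yields a flat array; the original shape is not recoverable
# from the bytes alone.
flat = np.frombuffer(raw, dtype=np.uint8)

# Restore the shape from separately stored height and width.
height, width = 28, 28
restored = flat.reshape((height, width, -1))

# Decoding with the wrong dtype succeeds silently but gives different data:
# 784 bytes read as 2-byte integers become 392 values.
wrong = np.frombuffer(raw, dtype=np.uint16)

print(flat.shape)
print(restored.shape)
print(wrong.shape)
```

Because a wrong dtype does not raise an error, it is worth checking the decoded shape against the stored dimensions whenever records are read back.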


Other notes:
- feed_dict: performs a single-threaded copy of data from the Python runtime to the TensorFlow runtime, causing latency and slowdowns