In [1]:
import numpy as np
import tensorflow as tf

### Reading input data into your `Dataset` object
#### Using `tf.data.Dataset.from_tensor_slices()`
* If all of your input data fits in memory, the simplest way to create a `Dataset` is to pass it to `Dataset.from_tensor_slices()`, which converts the data to `tf.Tensor` objects and slices it along the first dimension.

In [2]:
dataset1 = tf.data.Dataset.from_tensor_slices(tf.random_uniform([4,10]))

#### Consuming NumPy arrays
* Passing NumPy arrays directly to `Dataset.from_tensor_slices()` embeds the feature and label arrays in the TensorFlow graph as `tf.constant()` operations.
* This works well for a small dataset, but wastes memory because the contents of the arrays will be copied multiple times.
* You can also run into the 2GB limit for the `tf.GraphDef` protocol buffer.

In [3]:
x = np.arange(100).reshape(10,10)
x

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
       [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
       [70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
       [80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
       [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]])

In [4]:
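import os
outfile = './tmp/training_data.npy'
os.makedirs(os.path.dirname(outfile), exist_ok=True)  # np.save does not create missing directories
np.save(outfile, x)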

In [5]:
data = np.load(outfile)
labels, features = data[:,0], data[:,1:]

In [6]:
labels

array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])

In [7]:
features

array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9],
       [11, 12, 13, 14, 15, 16, 17, 18, 19],
       [21, 22, 23, 24, 25, 26, 27, 28, 29],
       [31, 32, 33, 34, 35, 36, 37, 38, 39],
       [41, 42, 43, 44, 45, 46, 47, 48, 49],
       [51, 52, 53, 54, 55, 56, 57, 58, 59],
       [61, 62, 63, 64, 65, 66, 67, 68, 69],
       [71, 72, 73, 74, 75, 76, 77, 78, 79],
       [81, 82, 83, 84, 85, 86, 87, 88, 89],
       [91, 92, 93, 94, 95, 96, 97, 98, 99]])

In [8]:
# Each row of `features` must line up with the same row of `labels`; check that the row counts match.
assert features.shape[0] == labels.shape[0]
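
The embedding approach described in this section can be sketched as follows. This is a minimal illustration, not the notebook's own cell: it rebuilds the same 10x10 array in memory instead of loading it from disk, and it imports `tensorflow.compat.v1` (an assumption, so that the TF 1.x-style graph API used in this notebook also runs on newer installs).

```python
import numpy as np
import tensorflow.compat.v1 as tf  # assumption: TF 1.x-style graph API via the compat module

tf.disable_eager_execution()

# Stand-in for the arrays loaded earlier: column 0 is the label, the rest are features.
data = np.arange(100).reshape(10, 10)
labels, features = data[:, 0], data[:, 1:]

# Passing NumPy arrays directly embeds them in the graph as constants.
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    # Each element is one (features_row, label) pair.
    first_features, first_label = sess.run(next_element)
```

Each call to `sess.run(next_element)` yields the next row pair, so the first call returns the features and label of row 0.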

#### For larger datasets in NumPy format
* Define the `Dataset` in terms of `tf.placeholder()` tensors, and feed the NumPy arrays when you initialize an `Iterator` over the dataset.

In [9]:
# Load the training data into two NumPy arrays, for example using `np.load()`.
data = np.load(outfile)
labels, features = data[:,0], data[:,1:]

# Assume that each row of `features` corresponds to the same row as `labels`.
assert features.shape[0] == labels.shape[0]

features_placeholder = tf.placeholder(features.dtype, features.shape)
labels_placeholder = tf.placeholder(labels.dtype, labels.shape)

dataset = tf.data.Dataset.from_tensor_slices((features_placeholder, labels_placeholder))
# [Other transformations on `dataset`...]
# dataset = ...
iterator = dataset.make_initializable_iterator()

with tf.Session() as sess:
    sess.run(iterator.initializer, feed_dict={features_placeholder: features,
                                              labels_placeholder: labels})

### Consuming text data
#### Using `tf.data.TextLineDataset` 
* Many datasets are distributed as one or more text files. `tf.data.TextLineDataset` provides an easy way to extract lines from one or more text files. 
* Given one or more filenames, a `TextLineDataset` will produce one string-valued element per line of those files. 
* Like a `TFRecordDataset`, `TextLineDataset` accepts filenames as a tf.Tensor, so you can parameterize it by passing a `tf.placeholder(tf.string)`.
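
A minimal sketch of the placeholder-based parameterization, assuming two hypothetical text files created in a temporary directory (`train.txt` and `valid.txt` are illustrative names, not part of the original material). The same dataset definition is re-initialized with different filenames. The `tensorflow.compat.v1` import is also an assumption, used so the TF 1.x-style API runs on newer installs.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf  # assumption: TF 1.x-style graph API via the compat module

tf.disable_eager_execution()

# Hypothetical sample files standing in for real training and validation data.
tmp_dir = tempfile.mkdtemp()
train_path = os.path.join(tmp_dir, "train.txt")
valid_path = os.path.join(tmp_dir, "valid.txt")
with open(train_path, "w") as f:
    f.write("train line 1\ntrain line 2\n")
with open(valid_path, "w") as f:
    f.write("valid line 1\n")

# The filenames are supplied at initialization time rather than baked into the graph.
filenames = tf.placeholder(tf.string, shape=[None])
dataset = tf.data.TextLineDataset(filenames)
iterator = dataset.make_initializable_iterator()
next_line = iterator.get_next()

def read_all(sess, paths):
    """Re-initialize the iterator on `paths` and collect every line."""
    sess.run(iterator.initializer, feed_dict={filenames: paths})
    lines = []
    try:
        while True:
            lines.append(sess.run(next_line).decode("utf-8"))
    except tf.errors.OutOfRangeError:
        pass
    return lines

with tf.Session() as sess:
    train_lines = read_all(sess, [train_path])
    valid_lines = read_all(sess, [valid_path])
```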

In [10]:
filenames = ["/var/data/file1.txt", "/var/data/file2.txt"]
dataset = tf.data.TextLineDataset(filenames)

* By default, a `TextLineDataset` yields every line of each file, which may not be desirable, for example if the file starts with a header line, or contains comments. 
* These lines can be removed using the `Dataset.skip()` and `Dataset.filter()` transformations. 
* To apply these transformations to each file separately, we use `Dataset.flat_map()` to create a nested Dataset for each file.
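
The skip-and-filter pattern described above might look like the following sketch. The two sample files, their header line, and the `#` comment prefix are assumptions made for illustration, as is the `tensorflow.compat.v1` import.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf  # assumption: TF 1.x-style graph API via the compat module

tf.disable_eager_execution()

# Hypothetical files: each starts with a header line and may contain '#' comments.
tmp_dir = tempfile.mkdtemp()
filenames = []
for name, body in [("file1.txt", "header\n# comment\nalpha\nbeta\n"),
                   ("file2.txt", "header\ngamma\n# note\n")]:
    path = os.path.join(tmp_dir, name)
    with open(path, "w") as f:
        f.write(body)
    filenames.append(path)

dataset = tf.data.Dataset.from_tensor_slices(filenames)

# flat_map builds one TextLineDataset per file, so skip(1) drops the header
# of every file rather than only the first line of the combined stream.
dataset = dataset.flat_map(
    lambda filename: (
        tf.data.TextLineDataset(filename)
        .skip(1)
        .filter(lambda line: tf.not_equal(tf.strings.substr(line, 0, 1), "#"))))

next_line = dataset.make_one_shot_iterator().get_next()
lines = []
with tf.Session() as sess:
    try:
        while True:
            lines.append(sess.run(next_line).decode("utf-8"))
    except tf.errors.OutOfRangeError:
        pass
```

Applying `skip(1)` to a single `TextLineDataset` over both files would remove only the header of the first file, which is why the per-file nesting matters.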

## Supplementary Material Not Covered In Class
* **Consuming TFRecord data**
* **Parsing tf.Example protocol buffer messages**
* **Decoding image data and resizing it**

### Consuming TFRecord data
* The TFRecord file format is a simple record-oriented binary format that many TensorFlow applications use for training data.
* The `tf.data.TFRecordDataset` class enables you to stream over the contents of one or more TFRecord files as part of an input pipeline.
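
As a rough sketch of streaming records, the example below writes three raw byte strings to a hypothetical temporary file (`sample.tfrecord` is an illustrative name) and reads them back with `tf.data.TFRecordDataset`. The `tensorflow.compat.v1` import is an assumption so that the TF 1.x-style API runs on newer installs.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf  # assumption: TF 1.x-style graph API via the compat module

tf.disable_eager_execution()

# Hypothetical TFRecord file holding three raw byte-string records.
path = os.path.join(tempfile.mkdtemp(), "sample.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for payload in [b"first", b"second", b"third"]:
        writer.write(payload)

# Each dataset element is one serialized record, returned as a string scalar.
dataset = tf.data.TFRecordDataset([path])
next_record = dataset.make_one_shot_iterator().get_next()

records = []
with tf.Session() as sess:
    try:
        while True:
            records.append(sess.run(next_record))
    except tf.errors.OutOfRangeError:
        pass
```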

### Parsing `tf.Example` protocol buffer messages
Many input pipelines extract `tf.train.Example` protocol buffer messages from a TFRecord-format file (written, for example, using `tf.python_io.TFRecordWriter`). Each `tf.train.Example` record contains one or more "features", and the input pipeline typically converts these features into tensors.
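
One way this conversion could look is sketched below. The feature names `label` and `values`, their shapes, and the temporary file are all assumptions chosen for illustration. The parse function is applied to each record with `Dataset.map()`.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf  # assumption: TF 1.x-style graph API via the compat module

tf.disable_eager_execution()

# Write two hypothetical Example records, each with an int64 label and two float values.
path = os.path.join(tempfile.mkdtemp(), "examples.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for label, values in [(0, [1.0, 2.0]), (1, [3.0, 4.0])]:
        example = tf.train.Example(features=tf.train.Features(feature={
            "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
            "values": tf.train.Feature(float_list=tf.train.FloatList(value=values)),
        }))
        writer.write(example.SerializeToString())

def parse_example(serialized):
    """Turn one serialized Example into a (values, label) tensor pair."""
    spec = {
        "label": tf.io.FixedLenFeature([], tf.int64),
        "values": tf.io.FixedLenFeature([2], tf.float32),
    }
    parsed = tf.io.parse_single_example(serialized, spec)
    return parsed["values"], parsed["label"]

dataset = tf.data.TFRecordDataset(path).map(parse_example)
next_element = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    first_values, first_label = sess.run(next_element)
```

The feature spec passed to the parser has to match how the records were written. A mismatch in name, dtype, or shape surfaces as a runtime error when the element is evaluated.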

### Decoding image data and resizing it
#### Using `tf.image.decode_image` and `tf.image.resize_images`
When training a neural network on real-world image data, it is often necessary to convert images of different sizes to a common size, so that they can be fed to the network in batches.
* `tf.image.decode_image`
    - Detects whether an image is a BMP, GIF, JPEG, or PNG, and applies the matching decoder to convert the input byte string into a `Tensor` of type `uint8`.
* `tf.image.resize_images(images, size, method=ResizeMethod.BILINEAR, align_corners=False)`
    - Resizes `images` to `size` using the specified method.
    - Resized images will be distorted if their original aspect ratio differs from that of `size`. To avoid distortion, see `tf.image.resize_image_with_crop_or_pad`.
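
A minimal sketch of decoding and resizing, assuming a synthetic 6x4 image that is PNG-encoded in the graph rather than read from disk, and an arbitrary 28x28 target size. Because `tf.image.decode_image` may return a 4-D tensor for GIFs, its static shape is unknown, so the sketch sets a rank-3 shape by hand before resizing. That step and the `tensorflow.compat.v1` import are assumptions of this example.

```python
import numpy as np
import tensorflow.compat.v1 as tf  # assumption: TF 1.x-style graph API via the compat module

tf.disable_eager_execution()

# Synthetic 6x4 RGB image, PNG-encoded so there are bytes to decode.
pixels = np.zeros((6, 4, 3), dtype=np.uint8)
encoded = tf.image.encode_png(pixels)

# decode_image picks the decoder from the file header and returns uint8 data.
decoded = tf.image.decode_image(encoded, channels=3)
# Declare the rank explicitly; resize_images needs a known rank.
decoded.set_shape([None, None, 3])

# Bilinear resize to a common size (28x28 here is an arbitrary choice).
resized = tf.image.resize_images(decoded, [28, 28])

with tf.Session() as sess:
    decoded_value, resized_value = sess.run([decoded, resized])
```

In an input pipeline, the same decode and resize steps would typically live inside a function passed to `Dataset.map()`, with the encoded bytes coming from `tf.io.read_file`.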