# Quick start with tf.data Datasets

The [tf.data](https://www.tensorflow.org/api_docs/python/tf/data) module contains a collection of classes that allow you to load data, manipulate it, and pipe it into your model. This document introduces the API by walking through two simple examples:

  * Reading in-memory data from NumPy arrays.
  * Reading lines from a CSV file.

## Preparation

In [4]:
import tensorflow as tf
import iris_data

Fetch the Iris and MNIST data to use as example datasets.

In [5]:
# Fetch the Iris data as (features, labels) pairs for the train and test splits
iris_train, iris_test = iris_data.load_data()
iris_features, iris_labels = iris_train
print(type(iris_features))
print(iris_features.head())
print(type(iris_labels))
print(iris_labels.head())

<class 'pandas.core.frame.DataFrame'>
   SepalLength  SepalWidth  PetalLength  PetalWidth
0          6.4         2.8          5.6         2.2
1          5.0         2.3          3.3         1.0
2          4.9         2.5          4.5         1.7
3          4.9         3.1          1.5         0.1
4          5.7         3.8          1.7         0.3
<class 'pandas.core.series.Series'>
0    2
1    1
2    2
3    0
4    0
Name: Species, dtype: int64


In [7]:
# Fetch the MNIST data as (images, labels) pairs for the train and test splits
mnist_train, mnist_test = tf.keras.datasets.mnist.load_data()
mnist_X, mnist_y = mnist_train
print(type(mnist_X))
print(mnist_X.shape)

<class 'numpy.ndarray'>
(60000, 28, 28)


## tf.data.Dataset.from_tensor_slices()

In [8]:
iris_ds = tf.data.Dataset.from_tensor_slices((dict(iris_features), iris_labels))
print(iris_ds)

<TensorSliceDataset shapes: ({SepalLength: (), SepalWidth: (), PetalLength: (), PetalWidth: ()}, ()), types: ({SepalLength: tf.float64, SepalWidth: tf.float64, PetalLength: tf.float64, PetalWidth: tf.float64}, tf.int64)>


In [9]:
iris_ds_fs = tf.data.Dataset.from_tensor_slices(dict(iris_features))
print(iris_ds_fs)

<TensorSliceDataset shapes: {SepalLength: (), SepalWidth: (), PetalLength: (), PetalWidth: ()}, types: {SepalLength: tf.float64, SepalWidth: tf.float64, PetalLength: tf.float64, PetalWidth: tf.float64}>


In [10]:
mnist_ds_X = tf.data.Dataset.from_tensor_slices(mnist_X)
print(mnist_ds_X)

<TensorSliceDataset shapes: (28, 28), types: tf.uint8>
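The outputs above reflect that `from_tensor_slices` slices its input along the first dimension, so each dataset element is a single example. The following is a minimal sketch of that behaviour on toy data. The arrays and the names `toy_features`, `toy_labels`, and `toy_images` are hypothetical, and the direct iteration assumes eager execution (TensorFlow 2.x), which may differ from the version used for the cells above.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 4 examples with two named feature columns and a label.
toy_features = {
    "a": np.array([1.0, 2.0, 3.0, 4.0]),
    "b": np.array([10.0, 20.0, 30.0, 40.0]),
}
toy_labels = np.array([0, 1, 0, 1])

# Slicing happens along the first axis, so each element is one (features, label) pair.
toy_ds = tf.data.Dataset.from_tensor_slices((toy_features, toy_labels))

# Assumption: eager execution (TensorFlow 2.x), so the dataset is directly iterable.
first_features, first_label = next(iter(toy_ds))
first_example = (
    {name: float(value.numpy()) for name, value in first_features.items()},
    int(first_label.numpy()),
)
print(first_example)

# A stack of 5 hypothetical 28x28 images becomes a dataset of 28x28 elements.
toy_images = np.zeros((5, 28, 28), dtype=np.uint8)
first_image = next(iter(tf.data.Dataset.from_tensor_slices(toy_images)))
print(first_image.shape)
```

Passing a dict keeps the column names attached to each element, which is why the Iris dataset above reports one scalar shape per feature name.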


## Reading a Dataset from a CSV file

In [11]:
iris_train_path, iris_test_path = iris_data.maybe_download()
print(iris_train_path)
print(iris_test_path)

/Users/liuweijie/.keras/datasets/iris_training.csv
/Users/liuweijie/.keras/datasets/iris_test.csv


We start by building a `TextLineDataset` object, which reads the file one line at a time. We then call its `skip` method to skip the first line of the file, which contains a header rather than an example:

In [13]:
iris_train_ds = tf.data.TextLineDataset(iris_train_path).skip(1)
print(iris_train_ds)

<SkipDataset shapes: (), types: tf.string>
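To see what `skip(1)` leaves behind, the sketch below applies the same two calls to a small temporary file. The file contents, its header line, and the names `toy_path` and `toy_lines` are hypothetical stand-ins rather than the real Iris file, and iterating the dataset directly assumes eager execution (TensorFlow 2.x).

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical stand-in for a CSV file: one header line followed by two examples.
csv_text = (
    "SepalLength,SepalWidth,PetalLength,PetalWidth,Species\n"
    "6.4,2.8,5.6,2.2,2\n"
    "5.0,2.3,3.3,1.0,1\n"
)
toy_path = os.path.join(tempfile.mkdtemp(), "toy_iris.csv")
with open(toy_path, "w") as f:
    f.write(csv_text)

# Read the file line by line and drop the first (header) line.
toy_lines = tf.data.TextLineDataset(toy_path).skip(1)

# Assumption: eager execution (TensorFlow 2.x), so the dataset is directly iterable.
# Each element is a scalar tf.string tensor holding one raw, unparsed line.
lines = [line.numpy().decode("utf-8") for line in toy_lines]
print(lines)
```

Each element is still an unparsed string at this stage; splitting a line into features and a label is a separate step.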
