## Import Packages, Environment Setting

In [1]:
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

tfds.disable_progress_bar()

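physical_devices = tf.config.list_physical_devices('GPU')
# Enable memory growth only when a GPU is present; indexing [0] fails on CPU-only machines.
if physical_devices:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)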

## Available Datasets
### List all datasets
`tfds.list_builders` [API](https://www.tensorflow.org/datasets/api_docs/python/tfds/list_builders)

List all the available datasets provided by the `tensorflow_datasets` package.

In [2]:
tfds.list_builders()

['abstract_reasoning',
 'aflw2k3d',
 'amazon_us_reviews',
 'bair_robot_pushing_small',
 'bigearthnet',
 'binarized_mnist',
 'binary_alpha_digits',
 'caltech101',
 'caltech_birds2010',
 'caltech_birds2011',
 'cats_vs_dogs',
 'celeb_a',
 'celeb_a_hq',
 'chexpert',
 'cifar10',
 'cifar100',
 'cifar10_corrupted',
 'clevr',
 'cnn_dailymail',
 'coco',
 'coco2014',
 'coil100',
 'colorectal_histology',
 'colorectal_histology_large',
 'curated_breast_imaging_ddsm',
 'cycle_gan',
 'deep_weeds',
 'definite_pronoun_resolution',
 'diabetic_retinopathy_detection',
 'downsampled_imagenet',
 'dsprites',
 'dtd',
 'dummy_dataset_shared_generator',
 'dummy_mnist',
 'emnist',
 'eurosat',
 'fashion_mnist',
 'flores',
 'food101',
 'gap',
 'glue',
 'groove',
 'higgs',
 'horses_or_humans',
 'image_label_folder',
 'imagenet2012',
 'imagenet2012_corrupted',
 'imdb_reviews',
 'iris',
 'kitti',
 'kmnist',
 'lfw',
 'lm1b',
 'lsun',
 'mnist',
 'mnist_corrupted',
 'moving_mnist',
 'multi_nli',
 'nsynth',
 'omniglot',

### Load provided datasets
`tfds.load(name, split, shuffle_files=True, with_info=True)` [API](https://www.tensorflow.org/datasets/api_docs/python/tfds/load)

Load one of the provided datasets. The syntax for specifying a split is described [here](https://www.tensorflow.org/datasets/splits). The first return value is a TensorFlow [Dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) object; the second, returned only when `with_info=True`, is a [DatasetInfo](https://www.tensorflow.org/datasets/api_docs/python/tfds/core/DatasetInfo) object.

In [3]:
dataset, info = tfds.load('iris', split='train', shuffle_files=True, with_info=True)
print(info)

tfds.core.DatasetInfo(
    name='iris',
    version=1.0.0,
    description='This is perhaps the best known database to be found in the pattern recognition
literature. Fisher's paper is a classic in the field and is referenced
frequently to this day. (See Duda & Hart, for example.) The data set contains
3 classes of 50 instances each, where each class refers to a type of iris
plant. One class is linearly separable from the other 2; the latter are NOT
linearly separable from each other.
',
    urls=['https://archive.ics.uci.edu/ml/datasets/iris'],
    features=FeaturesDict({
        'features': Tensor(shape=(4,), dtype=tf.float32),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=3),
    }),
    total_num_examples=150,
    splits={
        'train': 150,
    },
    supervised_keys=('features', 'label'),
    citation="""@misc{Dua:2019 ,
    author = "Dua, Dheeru and Graff, Casey",
    year = "2017",
    title = "{UCI} Machine Learning Repository",
    url = "http://archive

## TensorFlow Dataset
TensorFlow provides a useful interface for processing datasets. Once the data is wrapped in a Dataset object, we can use its methods to shuffle, batch, filter, and transform the input data.

### Dataset creation

`tf.data.Dataset.from_tensor_slices` [API](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices), `tf.data.Dataset.from_generator` [API](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_generator)

Create a Dataset object from tensors or from a generator (it can also be created from TFRecord files; we will cover TFRecord later). We can pass a dictionary of tensors as the argument, which helps keep track of the name of each tensor in every example.

In [4]:
X = tf.convert_to_tensor(np.random.normal(0, 1, 500).reshape(100, 5))
y = tf.convert_to_tensor(np.random.normal(0, 1, 100))
dataset = tf.data.Dataset.from_tensor_slices({'X': X, 'y': y})
for example in dataset:
    pass  # do something with the example using example['X'] and example['y']
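The cell above only covers `from_tensor_slices`. A minimal sketch of the generator route follows, assuming a TensorFlow version that supports the `output_signature` argument of `from_generator` (TF 2.4 or later). The generator `example_generator` and the name `generated_dataset` are hypothetical.

```python
import numpy as np
import tensorflow as tf

def example_generator():
    # Hypothetical generator: yields 100 examples one at a time as dictionaries.
    rng = np.random.default_rng(0)
    for _ in range(100):
        yield {'X': rng.normal(0, 1, 5), 'y': rng.normal(0, 1)}

# output_signature declares the shape and dtype of every component the generator yields.
generated_dataset = tf.data.Dataset.from_generator(
    example_generator,
    output_signature={
        'X': tf.TensorSpec(shape=(5,), dtype=tf.float64),
        'y': tf.TensorSpec(shape=(), dtype=tf.float64),
    },
)

first = next(iter(generated_dataset))
print(first['X'].shape, first['y'].shape)
```

A generator is useful when the data does not fit in memory or is produced lazily, since examples are created only as the dataset is iterated.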

### Shuffle

`dataset.shuffle(buffer)` [API](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shuffle)

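Shuffle the order of instances using a buffer of the given size. The dataset fills the buffer with that many elements and draws each output at random from it, so a full uniform shuffle requires a buffer at least as large as the dataset. Returns a new Dataset; the original is not modified.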

In [5]:
dataset.shuffle(10)

<ShuffleDataset shapes: {X: (5,), y: ()}, types: {X: tf.float64, y: tf.float64}>
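To make the effect of the buffer size concrete, here is a small sketch on a toy dataset built with `tf.data.Dataset.range` (the toy data and variable names are not part of the original notebook).

```python
import tensorflow as tf

numbers = tf.data.Dataset.range(10)

# buffer_size=1: the buffer holds a single element, so the order cannot change.
unshuffled = [int(x) for x in numbers.shuffle(1)]

# buffer_size >= dataset size: every element sits in the buffer, so any order is possible.
fully_shuffled = [int(x) for x in numbers.shuffle(10, seed=0)]

print(unshuffled)              # same order as the input
print(sorted(fully_shuffled))  # same elements, whatever order they came out in
```

A small buffer only mixes elements that are close together in the input, which is why larger buffers are usually preferred when memory allows.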

### Get first n instances

`dataset.take(n)` [API](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#take)

Get the first `n` instances. Returns a new Dataset.

In [6]:
dataset.take(50)

<TakeDataset shapes: {X: (5,), y: ()}, types: {X: tf.float64, y: tf.float64}>
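A common companion to `take` is `skip`, which drops the first `n` elements instead of keeping them. The sketch below uses a toy range dataset, which is an assumption made for illustration, to split one dataset into a head and a remainder.

```python
import tensorflow as tf

numbers = tf.data.Dataset.range(10)

head = numbers.take(3)  # keeps the first three elements
rest = numbers.skip(3)  # keeps everything after the first three

head_values = [int(x) for x in head]
rest_values = [int(x) for x in rest]
print(head_values, rest_values)
```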

### Create simple batch

`dataset.batch` [API](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch)

Combine consecutive instances into batches. Returns a new Dataset in which each element is a batch: every tensor gains a leading batch dimension.

In [7]:
batched_dataset = dataset.batch(20)
for batch in batched_dataset:
    pass  # each batch is a dict of tensors: batch['X'] has shape (20, 5), batch['y'] has shape (20,)
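When the dataset size is not a multiple of the batch size, the last batch is smaller by default. The sketch below, on a toy range dataset chosen for illustration, shows this together with the `drop_remainder` argument of `batch`.

```python
import tensorflow as tf

numbers = tf.data.Dataset.range(10)

# Default behaviour: the final batch holds whatever elements are left over.
batch_sizes = [int(batch.shape[0]) for batch in numbers.batch(4)]

# drop_remainder=True discards the incomplete final batch.
full_only = [int(batch.shape[0]) for batch in numbers.batch(4, drop_remainder=True)]

print(batch_sizes, full_only)
```

Dropping the remainder gives every batch the same static shape, which some models and accelerators require.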

### Filter instances in the Dataset

`dataset.filter(conditional_fn)` [API](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#filter)

Keep only the instances that satisfy a given condition. `filter` passes each instance to the conditional function, which must return a scalar Boolean tensor. Returns a new Dataset.

In [8]:
dataset.filter(lambda x: tf.reduce_sum(x['X']) > 0)

<FilterDataset shapes: {X: (5,), y: ()}, types: {X: tf.float64, y: tf.float64}>
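To see which instances survive a filter, the sketch below applies the same kind of predicate to a three-row toy dataset. The values and the names `toy`, `positive`, and `kept_labels` are made up for illustration.

```python
import tensorflow as tf

toy = tf.data.Dataset.from_tensor_slices({
    'X': tf.constant([[1.0, 2.0], [-3.0, 1.0], [0.5, 0.5]]),
    'y': tf.constant([10, 20, 30]),
})

# Keep only the examples whose feature values sum to a positive number.
positive = toy.filter(lambda ex: tf.reduce_sum(ex['X']) > 0)

kept_labels = [int(ex['y']) for ex in positive]
print(kept_labels)
```

The row sums are 3.0, -2.0, and 1.0, so the second example is the only one removed.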

### Apply function to every instance in the Dataset

`dataset.map(transformation_func)` [API](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#map)

Apply a function to every instance in the Dataset. `map` passes each instance to the transformation function, which returns the transformed instance. Returns a new Dataset (a MapDataset).

In [9]:
def negative_transform(x):
    x['negative_X'] = -1 * x['X']
    return x

dataset.map(negative_transform)

<MapDataset shapes: {X: (5,), y: (), negative_X: (5,)}, types: {X: tf.float64, y: tf.float64, negative_X: tf.float64}>
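The transformation does not have to preserve the structure of its input. As a sketch, the hypothetical function `to_pair` below turns each dictionary into a `(features, target)` tuple, a layout that training APIs such as Keras commonly accept. The toy data is an assumption made for illustration.

```python
import tensorflow as tf

toy = tf.data.Dataset.from_tensor_slices({
    'X': tf.constant([[1.0, 2.0], [3.0, 4.0]]),
    'y': tf.constant([0.5, 1.5]),
})

def to_pair(example):
    # Build a new (features, target) tuple rather than modifying the input dict.
    return example['X'], example['y']

pairs = toy.map(to_pair)

features, target = next(iter(pairs))
print(features.numpy(), float(target))
```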

### Get specs of Dataset object

`tf.data.DatasetSpec(dataset)`

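Get the specs of the Dataset object. Note that the `tf.data.DatasetSpec` constructor expects an element spec as its first argument, so the call below simply wraps the Dataset object itself; the per-component shapes and dtypes are available directly from the `dataset.element_spec` property.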

In [10]:
tf.data.DatasetSpec(dataset)

DatasetSpec(<TensorSliceDataset shapes: {X: (5,), y: ()}, types: {X: tf.float64, y: tf.float64}>, TensorShape([]))
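For inspecting the shape and dtype of each component, the `element_spec` property of a Dataset is usually the more direct route. A minimal sketch, rebuilding the same dictionary dataset used earlier so that the block is self-contained:

```python
import numpy as np
import tensorflow as tf

X = tf.convert_to_tensor(np.random.normal(0, 1, 500).reshape(100, 5))
y = tf.convert_to_tensor(np.random.normal(0, 1, 100))
dataset = tf.data.Dataset.from_tensor_slices({'X': X, 'y': y})

# element_spec mirrors the structure of one element: here a dict of TensorSpec objects.
spec = dataset.element_spec
print(spec['X'].shape, spec['X'].dtype)
print(spec['y'].shape, spec['y'].dtype)
```

The same structure is what the `output_signature` argument of `from_generator` expects, so a spec read from one dataset can be reused to declare another.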