# Dataset tutorial 001

Taken from this webpage

https://medium.com/ymedialabs-innovation/how-to-use-dataset-and-iterators-in-tensorflow-with-code-samples-3bb98b6b74ab

In [11]:
import datetime
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import os
import os.path


import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

import numpy as np

In [12]:
print(tf.__version__)

2.0.0-alpha0


In [13]:
# works in 1.13.1
#print (tf.VERSION)

# Dataset Creation

## From tensor slices

`from_tensor_slices` accepts one or more NumPy arrays or Tensors. If you are feeding multiple objects, pass them as a tuple and make sure that they all have the same size in the zeroth dimension. The input is sliced along that dimension, so the dataset emits one slice at a time.



example 1

In [17]:
tf.range(10,15)

<tf.Tensor: id=51, shape=(5,), dtype=int32, numpy=array([10, 11, 12, 13, 14], dtype=int32)>

In [14]:
# Assume batch size is 1
dataset1 = tf.data.Dataset.from_tensor_slices(tf.range(10, 15))
# Emits data of 10, 11, 12, 13, 14, (One element at a time)
dataset1

<TensorSliceDataset shapes: (), types: tf.int32>

In [19]:
for element in dataset1:
    print(element)

tf.Tensor(10, shape=(), dtype=int32)
tf.Tensor(11, shape=(), dtype=int32)
tf.Tensor(12, shape=(), dtype=int32)
tf.Tensor(13, shape=(), dtype=int32)
tf.Tensor(14, shape=(), dtype=int32)


example 2

In [21]:
tf.range(30, 45, 3)

<tf.Tensor: id=95, shape=(5,), dtype=int32, numpy=array([30, 33, 36, 39, 42], dtype=int32)>

In [22]:
np.arange(60, 70, 2)

array([60, 62, 64, 66, 68])

In [15]:
dataset2 = tf.data.Dataset.from_tensor_slices((tf.range(30, 45, 3), np.arange(60, 70, 2)))
# Emits data of (30, 60), (33, 62), (36, 64), (39, 66), (42, 68)
# Emits one tuple at a time
dataset2

<TensorSliceDataset shapes: ((), ()), types: (tf.int32, tf.int64)>

In [20]:
for element in dataset2:
    print(element)

(<tf.Tensor: id=72, shape=(), dtype=int32, numpy=30>, <tf.Tensor: id=73, shape=(), dtype=int64, numpy=60>)
(<tf.Tensor: id=76, shape=(), dtype=int32, numpy=33>, <tf.Tensor: id=77, shape=(), dtype=int64, numpy=62>)
(<tf.Tensor: id=80, shape=(), dtype=int32, numpy=36>, <tf.Tensor: id=81, shape=(), dtype=int64, numpy=64>)
(<tf.Tensor: id=84, shape=(), dtype=int32, numpy=39>, <tf.Tensor: id=85, shape=(), dtype=int64, numpy=66>)
(<tf.Tensor: id=88, shape=(), dtype=int32, numpy=42>, <tf.Tensor: id=89, shape=(), dtype=int64, numpy=68>)


example 3

In [23]:
#dataset3 = tf.data.Dataset.from_tensor_slices((tf.range(10), np.arange(5)))
# Dataset not possible, as the zeroth dimensions differ (10 vs. 5)
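
To see the failure rather than take it on faith, the commented-out call can be wrapped in a `try`/`except`. This is a minimal sketch: the exact exception type and message may differ between TensorFlow versions (an assumption on my part), so it catches both `ValueError` and `tf.errors.InvalidArgumentError`. The name `mismatch_rejected` is only illustrative.

```python
import numpy as np
import tensorflow as tf

# Components with different sizes in the zeroth dimension: 10 vs. 5.
first = tf.range(10)
second = np.arange(5)

try:
    tf.data.Dataset.from_tensor_slices((first, second))
    mismatch_rejected = False
except (ValueError, tf.errors.InvalidArgumentError) as err:
    # Assumption: the exception type may vary by TensorFlow version.
    mismatch_rejected = True
    print(type(err).__name__, err)

print("rejected:", mismatch_rejected)
```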

## From tensors

Just like `from_tensor_slices`, this method also accepts one or more NumPy arrays or Tensors. However, it does not slice the input along the zeroth dimension, i.e. all the data is emitted at once as a single element. As a result, if you are passing multiple objects, they may have different sizes in the zeroth dimension. This method is useful when the dataset is very small or your model needs all the data at once.



In [24]:
dataset1 = tf.data.Dataset.from_tensors(tf.range(10, 15))
# Emits data of [10, 11, 12, 13, 14]
# Holds entire list as one element

dataset2 = tf.data.Dataset.from_tensors((tf.range(30, 45, 3), np.arange(60, 70, 2)))
# Emits data of ([30, 33, 36, 39, 42], [60, 62, 64, 66, 68])
# Holds entire tuple as one element

dataset3 = tf.data.Dataset.from_tensors((tf.range(10), np.arange(5)))
# Possible with from_tensors, regardless of zeroth dimension mismatch of constituent elements.
# Emits data of ([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4])
# Holds entire tuple as one element
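
The difference between the two constructors is easiest to see by counting elements. The sketch below builds both datasets from the same tensor and collects what each one emits. The variable names (`whole`, `sliced`, and so on) are illustrative only.

```python
import tensorflow as tf

values = tf.range(10, 15)

# from_tensors keeps the whole tensor as a single element.
whole = tf.data.Dataset.from_tensors(values)
# from_tensor_slices splits it along the zeroth dimension.
sliced = tf.data.Dataset.from_tensor_slices(values)

whole_elements = [e.numpy().tolist() for e in whole]
sliced_elements = [e.numpy().tolist() for e in sliced]

print(whole_elements)   # [[10, 11, 12, 13, 14]]  -> 1 element
print(sliced_elements)  # [10, 11, 12, 13, 14]    -> 5 elements
```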

## From generators
In this method, a generator function is passed as input. It is useful when you wish to generate the data at runtime, so that no raw data exists beforehand, or when your training data is too large to store on disk. I would strongly discourage using this method to generate data augmentations.



In [25]:
# Assume batch size is 1
def generator(sequence_type):
    if sequence_type == 1:
        for i in range(5):
            yield 10 + i
    elif sequence_type == 2:
        for i in range(5):
            yield (30 + 3 * i, 60 + 2 * i)
    elif sequence_type == 3:
        for i in range(1, 4):
            yield (i, ['Hi'] * i)

dataset1 = tf.data.Dataset.from_generator(generator, tf.int32, args=[1])
# Emits data of 10, 11, 12, 13, 14 (one element at a time)

dataset2 = tf.data.Dataset.from_generator(generator, (tf.int32, tf.int32), args=[2])
# Emits data of (30, 60), (33, 62), (36, 64), (39, 66), (42, 68)
# Emits one tuple at a time

dataset3 = tf.data.Dataset.from_generator(generator, (tf.int32, tf.string), args=[3])
# Emits data of (1, ['Hi']), (2, ['Hi', 'Hi']), (3, ['Hi', 'Hi', 'Hi'])
# Emits one tuple at a time

W0514 17:47:45.013040 4417709504 deprecation.py:323] From /Users/davis/miniconda3/envs/tf2/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py:410: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, there are two
    options available in V2.
    - tf.py_function takes a python function which manipulates tf eager
    tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
    an ndarray (just call tensor.numpy()) but having access to eager tensors
    means `tf.py_function`s can use accelerators such as GPUs as well as
    being differentiable using a gradient tape.
    - tf.numpy_function maintains the semantics of the deprecated tf.py_func
    (it is not differentiable, and manipulates numpy arrays). It drops the
    stateful argument making all functions stateful.
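
The cell above only constructs the datasets. As a rough check of what they emit, the sketch below repeats the first two branches of the generator and iterates over the results eagerly. It passes the output types positionally, as the cell above does. Newer TensorFlow releases recommend `output_signature` instead, so treat this as tied to the API used in this notebook. The names `scalars`, `pairs`, `scalar_values` and `pair_values` are illustrative.

```python
import tensorflow as tf

def generator(sequence_type):
    # Reduced copy of the generator above (branches 1 and 2 only).
    if sequence_type == 1:
        for i in range(5):
            yield 10 + i
    elif sequence_type == 2:
        for i in range(5):
            yield (30 + 3 * i, 60 + 2 * i)

scalars = tf.data.Dataset.from_generator(generator, tf.int32, args=[1])
pairs = tf.data.Dataset.from_generator(generator, (tf.int32, tf.int32), args=[2])

# Each iteration pulls one item from the Python generator.
scalar_values = [int(x.numpy()) for x in scalars]
pair_values = [(int(a.numpy()), int(b.numpy())) for a, b in pairs]

print(scalar_values)
print(pair_values)
```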
    


# Dataset Transformations
Once you have created a Dataset covering all the data (or all the scenarios, in cases such as runtime data generation), it is time to apply transformations. Let us go through some of the commonly used ones.

## Batch Transformation
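The batch transformation sequentially divides your dataset into groups of the specified batch size. By default, the last batch may be smaller if the number of elements is not divisible by the batch size.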
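
A minimal sketch of batching, reusing the five-element range from the earlier examples. It also shows the `drop_remainder` argument of `Dataset.batch`, which discards a final short batch. The variable names are illustrative.

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(tf.range(10, 15))

# Five elements in batches of two: the last batch holds the leftover element.
batches = [b.numpy().tolist() for b in dataset.batch(2)]
print(batches)  # [[10, 11], [12, 13], [14]]

# drop_remainder=True keeps only full batches.
full_batches = [b.numpy().tolist() for b in dataset.batch(2, drop_remainder=True)]
print(full_batches)  # [[10, 11], [12, 13]]
```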

## Repeat Transformation
Whatever Dataset you have generated, use this transformation to duplicate its existing data: the elements are emitted again, in the same order, for the requested number of repetitions.
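
A short sketch of `repeat`, assuming a three-element dataset for brevity. Called with a count, it emits the data that many times. Called without one, it repeats indefinitely, so the second example bounds it with `take`. The variable names are illustrative.

```python
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(tf.range(10, 13))

# repeat(2) emits the whole dataset twice, in order.
twice = [int(x.numpy()) for x in dataset.repeat(2)]
print(twice)  # [10, 11, 12, 10, 11, 12]

# repeat() with no count never ends on its own; take(7) stops after 7 elements.
bounded = [int(x.numpy()) for x in dataset.repeat().take(7)]
print(bounded)  # [10, 11, 12, 10, 11, 12, 10]
```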