The **tf.data.Dataset** API supports writing descriptive and efficient input pipelines. **Dataset** usage follows a common pattern:
1. Create a source dataset from your input data.
2. Apply dataset transformations to preprocess the data.
3. Iterate over the dataset and process the elements.
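
The three steps above can be sketched in a few lines. This is only a minimal illustration with made-up input values; the variable names `source`, `pipeline`, and `batches` are chosen here for the example.

```python
import tensorflow as tf

# 1. Create a source dataset from in-memory data.
source = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6])

# 2. Apply transformations: double each value, then group into batches of 2.
pipeline = source.map(lambda x: x * 2).batch(2)

# 3. Iterate over the dataset and process each element (here, each batch).
batches = [batch.numpy().tolist() for batch in pipeline]
print(batches)  # [[2, 4], [6, 8], [10, 12]]
```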

Some popular and useful methods of tf.data.Dataset objects are:
- as_numpy_iterator()
- cache()
- shuffle()
- batch()
- map()
- filter()
- prefetch()
- zip()
- take()
- skip()




In [8]:
import tensorflow as tf
import numpy as np

In [9]:
ds=tf.data.Dataset.from_tensor_slices(tf.range(1,21))

# The dataset is created. ds is an iterable: every loop over it starts a fresh iterator.

In [10]:
for element in ds: 
    print(element.numpy()) 

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20


In [11]:
# .as_numpy_iterator(): returns an iterator that converts all elements of the dataset to NumPy.

npy_iter=ds.as_numpy_iterator()
list(ds.as_numpy_iterator()) ## A fresh iterator, so we can also list all the elements

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

In [12]:
for arr in npy_iter:
    print(arr)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20


In [13]:
## We can also try to collect the elements of npy_iter in a list

print(list(npy_iter)) ## Returns an empty list because npy_iter was already exhausted by the loop above

[]


In [14]:
# .batch()

# This method combines consecutive elements into batches, so each iteration yields one batch

ds=tf.data.Dataset.from_tensor_slices(tf.range(1,21))

print('Number of samples in unbatched ds: ',len(ds))
ds_batch=ds.batch(6) ## If the last batch has fewer than 6 samples, it is created from the remaining samples only
print('Number of samples/batches in batched ds: ',len(ds_batch))

for element in ds_batch:
    print(element)

print()
print(list(ds.batch(5).as_numpy_iterator()))

Number of samples in unbatched ds:  20
Number of samples/batches in batched ds:  4
tf.Tensor([1 2 3 4 5 6], shape=(6,), dtype=int32)
tf.Tensor([ 7  8  9 10 11 12], shape=(6,), dtype=int32)
tf.Tensor([13 14 15 16 17 18], shape=(6,), dtype=int32)
tf.Tensor([19 20], shape=(2,), dtype=int32)

[array([1, 2, 3, 4, 5]), array([ 6,  7,  8,  9, 10]), array([11, 12, 13, 14, 15]), array([16, 17, 18, 19, 20])]
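
If the smaller final batch is unwanted, `batch()` accepts a `drop_remainder` argument. The following is a small sketch on the same 20-element dataset; `ds_full` and `ds_dropped` are names picked for this example.

```python
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices(tf.range(1, 21))

# Default behaviour: the last batch keeps the 2 leftover elements.
ds_full = ds.batch(6)

# drop_remainder=True discards the incomplete final batch.
ds_dropped = ds.batch(6, drop_remainder=True)

full_batches = [b.tolist() for b in ds_full.as_numpy_iterator()]
dropped_batches = [b.tolist() for b in ds_dropped.as_numpy_iterator()]

print(len(full_batches), len(dropped_batches))  # 4 3
print(dropped_batches[-1])  # [13, 14, 15, 16, 17, 18]
```

Dropping the remainder gives every batch the same static shape, which some models and accelerators require.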


In [9]:
# .map(): similar to Python's built-in map(); applies a function to every element

ds=tf.data.Dataset.from_tensor_slices(tf.range(1,21))
ds=ds.map(lambda x: x**2)

print(list(ds.as_numpy_iterator()))

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361, 400]


In [16]:
"""
.filter()
""";

ds=tf.data.Dataset.range(10)
ds=ds.filter(lambda x: x>4) ## Unlike map(), the function passed to filter() must return a boolean; ds keeps only the elements for which it returns True

for element in ds:
    print(element)

tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)


In [10]:

# .cache()
# This method caches the elements of the dataset. The first time the dataset is iterated over, its elements are cached either in the specified file or in memory.
# Subsequent iterations use the cached data.
# This method is especially useful when preprocessing takes a lot of time.

In [11]:
ds=tf.data.Dataset.from_tensor_slices(tf.range(1,21))
ds=ds.map(lambda x: x+3) # Apply map() or other preprocessing before caching so that its result is what gets cached
ds=ds.cache() # With no filename given, the elements are cached in memory. If the ds is too large for memory, pass a file path to cache it on disk instead


for element in ds:
    print(element)

tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
tf.Tensor(10, shape=(), dtype=int32)
tf.Tensor(11, shape=(), dtype=int32)
tf.Tensor(12, shape=(), dtype=int32)
tf.Tensor(13, shape=(), dtype=int32)
tf.Tensor(14, shape=(), dtype=int32)
tf.Tensor(15, shape=(), dtype=int32)
tf.Tensor(16, shape=(), dtype=int32)
tf.Tensor(17, shape=(), dtype=int32)
tf.Tensor(18, shape=(), dtype=int32)
tf.Tensor(19, shape=(), dtype=int32)
tf.Tensor(20, shape=(), dtype=int32)
tf.Tensor(21, shape=(), dtype=int32)
tf.Tensor(22, shape=(), dtype=int32)
tf.Tensor(23, shape=(), dtype=int32)
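
One way to see the effect of `cache()` is to count how often the preprocessing function actually runs. The sketch below is an illustration only: `slow_preprocess`, `tf_preprocess`, and the `calls` counter are hypothetical helpers written for this example, and `tf.py_function` is used solely so that a Python-side counter can be updated.

```python
import tensorflow as tf

calls = {"n": 0}  # Python-side counter of preprocessing calls

def slow_preprocess(x):
    # Stand-in for an expensive preprocessing step.
    calls["n"] += 1
    return x + 3

def tf_preprocess(x):
    # Wrap the Python function so it runs eagerly inside the pipeline.
    return tf.py_function(slow_preprocess, [x], tf.int32)

ds = tf.data.Dataset.from_tensor_slices(tf.range(1, 6))
ds = ds.map(tf_preprocess).cache()

first_pass = [int(v) for v in ds.as_numpy_iterator()]   # fills the cache
calls_after_first = calls["n"]
second_pass = [int(v) for v in ds.as_numpy_iterator()]  # served from the cache
calls_after_second = calls["n"]

print(first_pass, second_pass)
print(calls_after_first, calls_after_second)
```

The counter should not grow during the second pass, because the cached elements are reused instead of being recomputed. Note that the cache is only complete once the dataset has been iterated to the end.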


In [12]:

"""
-  .shuffle(buffer_size, seed=None, reshuffle_each_iteration=None, name=None): randomly shuffles the elements of this dataset.

The dataset fills a buffer with buffer_size elements, then randomly samples elements from this buffer, replacing the selected elements with new elements.
For perfect shuffling, a buffer size greater than or equal to the full size of the dataset is required.

For instance, if your dataset contains 10,000 elements but buffer_size is set to 1,000, then shuffle will initially select a random element from only the first 1,000 elements
in the buffer. Once an element is selected, its space in the buffer is replaced by the next (i.e. 1,001-st) element, maintaining the 1,000 element buffer.

reshuffle_each_iteration controls whether the shuffle order should be different for each epoch. In TF 1.X, the idiomatic way to create epochs was through the repeat transformation:

""";


dataset = tf.data.Dataset.range(3) 
dataset = dataset.shuffle(3, reshuffle_each_iteration=True)
dataset = dataset.repeat(2)
# [1, 0, 2, 1, 2, 0]

dataset = tf.data.Dataset.range(3)
dataset = dataset.shuffle(3, reshuffle_each_iteration=False)
dataset = dataset.repeat(2)
# [1, 0, 2, 1, 0, 2]
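
The influence of `buffer_size` described above can be checked on a small dataset. This is a rough sketch; the variable names are chosen for illustration, and the shuffled orders themselves are random, so only properties that hold for every run are checked.

```python
import tensorflow as tf

ds = tf.data.Dataset.range(10)

# buffer_size=1: the buffer holds a single element, so the order cannot change.
no_shuffle = [int(v) for v in ds.shuffle(1).as_numpy_iterator()]

# buffer_size >= dataset size: any permutation of the elements can occur.
full_shuffle = [int(v) for v in ds.shuffle(10).as_numpy_iterator()]

# buffer_size=3: the i-th output (0-based) can only come from the first i+3 elements.
partial_shuffle = [int(v) for v in ds.shuffle(3).as_numpy_iterator()]

print(no_shuffle)
print(full_shuffle)
print(partial_shuffle)
```

With a small buffer, elements can only move a limited distance forward from their original position, which is why a buffer at least as large as the dataset is needed for a uniform shuffle.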

In [15]:
"""
.prefetch():  Most dataset input pipelines should end with a call to prefetch. This allows later elements to be prepared while the current element is being processed. 
This often improves latency and throughput, at the cost of using additional memory to store prefetched elements.

Note: Like other Dataset methods, prefetch operates on the elements of the input dataset. It has no concept of examples vs. batches. examples.prefetch(2) will prefetch two elements (2 examples), 
while examples.batch(20).prefetch(2) will prefetch 2 elements (2 batches, of 20 examples each).

""";
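
A short sketch of the note above: prefetching on a batched dataset buffers whole batches. The example assumes TensorFlow 2.4 or newer, where `tf.data.AUTOTUNE` is available (older versions expose it as `tf.data.experimental.AUTOTUNE`); the variable names are illustrative.

```python
import tensorflow as tf

examples = tf.data.Dataset.range(100)

# Each prefetched element here is one batch of 20 examples.
batched = examples.batch(20).prefetch(2)

# Alternatively, let the tf.data runtime tune the buffer size dynamically.
tuned = examples.batch(20).prefetch(tf.data.AUTOTUNE)

# Prefetching changes when elements are prepared, not what is produced.
batch_sizes = [int(b.shape[0]) for b in batched]
first_batch = next(iter(tuned)).numpy().tolist()

print(batch_sizes)  # [20, 20, 20, 20, 20]
print(first_batch)
```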




In [34]:
"""
.take(count): creates a new dataset containing at most the first `count` elements of ds
""";

ds=tf.data.Dataset.from_tensor_slices(tf.range(1,101))
print('The number of element in ds is : ',len(ds))



ds_new=ds.take(20)
print()
print('Creating a new ds by taking first twenty samples from ds :',len(ds_new))



## What if we try to take more samples than the dataset contains?
ds_ok=ds.take(1000) ## Originally contains only 100 samples
print('\nlen of ds_ok :',len(ds_ok)) ## The length matches the length of the original dataset





The number of element in ds is :  100

Creating a new ds by taking first twenty samples from ds : 20

len of ds_ok : 100


In [37]:
# take() can also be used after batch()

ds=tf.data.Dataset.range(1,21)
ds_batch=ds.batch(5) ## Total number of batches will be 4 (20/5=4)

for batch_element in ds_batch.take(3): ## Only the first three batches are printed
    print(batch_element.numpy())

[1 2 3 4 5]
[ 6  7  8  9 10]
[11 12 13 14 15]


In [38]:
"""
Suppose we would like to skip the first few samples and then take a few samples. How do we do that? We can use ds.skip().take()
""";

ds=tf.data.Dataset.range(10)

ds_new=ds.skip(5).take(5)


for element in ds_new:
    print(element.numpy())

5
6
7
8
9
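
The method list at the top also mentions `zip()`, which has not been shown yet. A minimal sketch follows, using made-up feature and label values; it pairs up the elements of two datasets position by position, much like Python's built-in `zip`.

```python
import tensorflow as tf

features = tf.data.Dataset.range(1, 4)
labels = tf.data.Dataset.from_tensor_slices(["a", "b", "c"])

# Each element of the zipped dataset is a (feature, label) tuple.
pairs = tf.data.Dataset.zip((features, labels))

# String tensors come back from NumPy as bytes, so decode them for display.
result = [(int(x), y.decode()) for x, y in pairs.as_numpy_iterator()]
print(result)  # [(1, 'a'), (2, 'b'), (3, 'c')]
```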
