# Getting started with TensorFlow's `Dataset` API (continuation)

In this notebook we will learn how to divide the dataset over the ranks in distributed training.
This time we are going to use `tf.distribute`.

Here we look at `tf.distribute.Strategy.experimental_distribute_dataset`, the recommended API for sharding the data over the workers automatically. It does the following:

 * `tf.distribute` rebatches the input `tf.data.Dataset` instance with a new batch size equal to the given (global) batch size divided by the number of replicas in sync.
 * `tf.distribute` autoshards the input dataset in multi-worker training.
 * `tf.distribute` adds a prefetch transformation at the end of the user-provided `tf.data.Dataset` instance. Its `buffer_size` is equal to the number of replicas in sync.
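
The rebatching arithmetic from the first point can be checked in plain Python, without TensorFlow or the cluster. This is only a sketch of the arithmetic, not TensorFlow's implementation: the helper names `per_replica_batch_size` and `split_batch` are made up for this illustration, and the sketch only handles a global batch size that divides evenly by the number of replicas.

```python
def per_replica_batch_size(global_batch_size, num_replicas):
    """Batch size seen by each replica after rebatching (illustrative helper)."""
    if global_batch_size % num_replicas != 0:
        # Simplification of this sketch: only the evenly divisible case is handled.
        raise ValueError("global batch size must be divisible by num_replicas")
    return global_batch_size // num_replicas


def split_batch(batch, num_replicas):
    """Cut one global batch into `num_replicas` consecutive per-replica batches."""
    size = per_replica_batch_size(len(batch), num_replicas)
    return [batch[i:i + size] for i in range(0, len(batch), size)]


# The notebook uses batch(4) on 2 ranks, so each rank should see batches of 2.
print(per_replica_batch_size(4, 2))
print(split_batch([0, 1, 2, 3], 2))
```

With a batch size of 4 and 2 ranks, every rank is expected to print batches of 2 elements in the cells that use `strategy.experimental_distribute_dataset`.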


In [1]:
import ipcmagic

In [2]:
%ipcluster start -n 2 --mpi

IPCluster is ready! (9 seconds)


In [3]:
%%px
import numpy as np
import tensorflow as tf

In [4]:
%%px
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    cluster_resolver=tf.distribute.cluster_resolver.SlurmClusterResolver(),
    communication=tf.distribute.experimental.CollectiveCommunication.NCCL,
)

In [5]:
%%px
def dataset_generator():
    """A data-producing logic"""
    for i in range(8):
        yield (i, i)

In [8]:
%%px
dataset = tf.data.Dataset.from_generator(dataset_generator, output_types=(tf.int32, tf.int32))
dataset = dataset.batch(4)
dataset = dataset.repeat(2)

for x, y in dataset:
    print(f'    x: {x}    y: {y}')

[stdout:0] 
    x: [0 1 2 3]    y: [0 1 2 3]
    x: [4 5 6 7]    y: [4 5 6 7]
    x: [0 1 2 3]    y: [0 1 2 3]
    x: [4 5 6 7]    y: [4 5 6 7]
[stdout:1] 
    x: [0 1 2 3]    y: [0 1 2 3]
    x: [4 5 6 7]    y: [4 5 6 7]
    x: [0 1 2 3]    y: [0 1 2 3]
    x: [4 5 6 7]    y: [4 5 6 7]


In [9]:
%%px
dataset = tf.data.Dataset.from_generator(dataset_generator, output_types=(tf.int32, tf.int32))
dataset = dataset.batch(4)
dataset = dataset.repeat(2)
dataset = strategy.experimental_distribute_dataset(dataset)

for x, y in dataset:
    print(f'    x: {x}    y: {y}')

[stdout:0] 
    x: [0 1]    y: [0 1]
    x: [4 5]    y: [4 5]
    x: [0 1]    y: [0 1]
    x: [4 5]    y: [4 5]
[stdout:1] 
    x: [2 3]    y: [2 3]
    x: [6 7]    y: [6 7]
    x: [2 3]    y: [2 3]
    x: [6 7]    y: [6 7]
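
The output above is consistent with two steps applied in order: first every batch of 4 is rebatched into batches of 2, then each worker keeps every second batch. The plain-Python sketch below reproduces the same pattern. It is a simplified model written for this notebook, not what TensorFlow runs internally, and the names `NUM_WORKERS`, `rebatched` and `shards` are chosen only for the illustration.

```python
NUM_WORKERS = 2  # matches `%ipcluster start -n 2`

# batch(4) followed by repeat(2), as in the cell above
global_batches = [[0, 1, 2, 3], [4, 5, 6, 7]] * 2

# Rebatch: each global batch of 4 becomes NUM_WORKERS batches of 2
per_replica = 4 // NUM_WORKERS
rebatched = [
    batch[i:i + per_replica]
    for batch in global_batches
    for i in range(0, len(batch), per_replica)
]

# DATA-style sharding: worker `rank` keeps batches rank, rank + NUM_WORKERS, ...
shards = {rank: rebatched[rank::NUM_WORKERS] for rank in range(NUM_WORKERS)}

for rank, batches in shards.items():
    print(f"[worker {rank}]")
    for batch in batches:
        print(f"    x: {batch}")
```

Worker 0 ends up with `[0, 1]` and `[4, 5]` and worker 1 with `[2, 3]` and `[6, 7]`, each repeated twice, as in the `[stdout:0]` and `[stdout:1]` listings.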


In [10]:
%%px
dataset = tf.data.Dataset.from_generator(dataset_generator, output_types=(tf.int32, tf.int32))
dataset = dataset.batch(4)
dataset = dataset.repeat(2)
# The following is equivalent to the cell above: the default policy is
# `tf.data.experimental.AutoShardPolicy.AUTO`, which falls back to `DATA`
# here because the dataset is not file-based
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
dataset = dataset.with_options(options)
#
dataset = strategy.experimental_distribute_dataset(dataset)

for x, y in dataset:
    print(f'    x: {x}    y: {y}')

[stdout:0] 
    x: [0 1]    y: [0 1]
    x: [4 5]    y: [4 5]
    x: [0 1]    y: [0 1]
    x: [4 5]    y: [4 5]
[stdout:1] 
    x: [2 3]    y: [2 3]
    x: [6 7]    y: [6 7]
    x: [2 3]    y: [2 3]
    x: [6 7]    y: [6 7]


In [11]:
%%px
dataset = tf.data.Dataset.from_generator(dataset_generator, output_types=(tf.int32, tf.int32))
dataset = dataset.batch(4)
dataset = dataset.repeat(2)
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.OFF
dataset = dataset.with_options(options)
dataset = strategy.experimental_distribute_dataset(dataset)

for x, y in dataset:
    print(f'    x: {x}    y: {y}')

[stdout:0] 
    x: [0 1]    y: [0 1]
    x: [2 3]    y: [2 3]
    x: [4 5]    y: [4 5]
    x: [6 7]    y: [6 7]
    x: [0 1]    y: [0 1]
    x: [2 3]    y: [2 3]
    x: [4 5]    y: [4 5]
    x: [6 7]    y: [6 7]
[stdout:1] 
    x: [0 1]    y: [0 1]
    x: [2 3]    y: [2 3]
    x: [4 5]    y: [4 5]
    x: [6 7]    y: [6 7]
    x: [0 1]    y: [0 1]
    x: [2 3]    y: [2 3]
    x: [4 5]    y: [4 5]
    x: [6 7]    y: [6 7]
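
With `AutoShardPolicy.OFF` the batches are still rebatched to size 2, but nothing is filtered out, so both workers iterate over the whole dataset. The sketch below counts how often each sample is processed across all workers under the two policies. It is a plain-Python model of the outputs printed in this notebook, not TensorFlow code, and `worker_batches` is a hypothetical helper.

```python
from collections import Counter

NUM_WORKERS = 2

# Per-replica batches after batch(4), repeat(2) and rebatching to size 2
rebatched = [[0, 1], [2, 3], [4, 5], [6, 7]] * 2


def worker_batches(policy, rank):
    """Batches delivered to worker `rank` (illustrative model of the two policies)."""
    if policy == "OFF":
        return list(rebatched)               # no sharding: every batch
    if policy == "DATA":
        return rebatched[rank::NUM_WORKERS]  # every NUM_WORKERS-th batch
    raise ValueError(f"unknown policy: {policy}")


counts = {}
for policy in ("DATA", "OFF"):
    seen = Counter()
    for rank in range(NUM_WORKERS):
        for batch in worker_batches(policy, rank):
            seen.update(batch)
    counts[policy] = seen
    print(policy, sorted(seen.items()))
```

Under `DATA` every sample is counted twice (once per `repeat`), while under `OFF` it is counted four times because both workers process the full dataset. When autosharding is switched off, it is up to the user to make the workers read different data, for example with `tf.data.Dataset.shard`.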


In [12]:
%ipcluster stop

>> In practice, when training with Keras `Model.fit`, the user does not add `strategy.experimental_distribute_dataset(dataset)` to the input pipeline: `tf.distribute` applies it to the dataset automatically. In a custom training loop it has to be called explicitly.