# Getting started with TensorFlow's `Dataset` API (continuation)

In this notebook we will learn how to divide the dataset over the ranks in distributed training.

Let's run this notebook on two nodes and see what happens with the data on each worker. In distributed training one can use [`tf.data.Dataset.shard`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shard) to divide the dataset over the ranks. Without sharding, every worker receives the same data.
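
As a quick illustration of what `shard` does, the sketch below runs in a single process, without Horovod or MPI. The name `num_shards` stands in for `hvd.size()`, the loop variable `rank` stands in for `hvd.rank()`, and `shard_elements` is a hypothetical helper introduced only for this example. `shard(num_shards, index)` keeps the elements whose position modulo `num_shards` equals `index`.

```python
import tensorflow as tf

num_shards = 2  # assumption: stand-in for hvd.size()


def shard_elements(index):
    """Return the elements of range(8) that land in shard `index`."""
    ds = tf.data.Dataset.range(8)
    ds = ds.shard(num_shards=num_shards, index=index)
    return [int(x) for x in ds.as_numpy_iterator()]


for rank in range(num_shards):  # assumption: stand-in for hvd.rank()
    print(f'rank {rank}: {shard_elements(rank)}')
# rank 0: [0, 2, 4, 6]
# rank 1: [1, 3, 5, 7]
```

Each simulated rank sees a disjoint half of the data, and together the two shards cover the whole dataset.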

In [1]:
import ipcmagic

In [2]:
%ipcluster start -n 2 --mpi

IPCluster is ready! (3 seconds)


In [3]:
%%px
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

In [4]:
%%px
hvd.init()

In [5]:
%%px
hvd.size(), hvd.rank()

Out[0:3]: (2, 0)

Out[1:3]: (2, 1)

In [6]:
%%px
def dataset_generator():
    """Yield eight (x, y) pairs with x == y, for values 0 to 7."""
    for i in range(8):
        yield (i, i)

In [7]:
%%px
for x, y in dataset_generator():
    print(f'    x: {x}    y: {y}')

[stdout:0] 
    x: 0    y: 0
    x: 1    y: 1
    x: 2    y: 2
    x: 3    y: 3
    x: 4    y: 4
    x: 5    y: 5
    x: 6    y: 6
    x: 7    y: 7
[stdout:1] 
    x: 0    y: 0
    x: 1    y: 1
    x: 2    y: 2
    x: 3    y: 3
    x: 4    y: 4
    x: 5    y: 5
    x: 6    y: 6
    x: 7    y: 7


<mark>Exercise</mark>: Batch after shard or shard after batch? Try both orderings in the following pipeline and compare the results on each rank.

In [8]:
%%px

dataset = tf.data.Dataset.from_generator(dataset_generator, output_types=(tf.int32, tf.int32))
dataset = dataset.batch(2)
dataset = dataset.shard(hvd.size(), hvd.rank())
dataset = dataset.repeat(2)

for x, y in dataset:
    print(f'    x: {x}    y: {y}')

[stdout:0] 
    x: [0 1]    y: [0 1]
    x: [4 5]    y: [4 5]
    x: [0 1]    y: [0 1]
    x: [4 5]    y: [4 5]
[stdout:1] 
    x: [2 3]    y: [2 3]
    x: [6 7]    y: [6 7]
    x: [2 3]    y: [2 3]
    x: [6 7]    y: [6 7]
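
To compare the two orderings side by side, the following sketch simulates both ranks in a single process, again without Horovod. The function `pipeline` and its `shard_first` flag are hypothetical names used only here, and `tf.data.Dataset.range(8)` replaces the generator for brevity, so each element is a single integer rather than an `(x, y)` pair.

```python
import tensorflow as tf


def pipeline(index, num_shards=2, shard_first=True):
    """Build the toy pipeline for one simulated rank and return its batches."""
    ds = tf.data.Dataset.range(8)
    if shard_first:
        # Each rank keeps every num_shards-th element, then batches them.
        ds = ds.shard(num_shards, index).batch(2)
    else:
        # Batches are formed first, then every num_shards-th batch is kept.
        ds = ds.batch(2).shard(num_shards, index)
    return [batch.tolist() for batch in ds.as_numpy_iterator()]


for rank in range(2):
    print(f'rank {rank}, shard then batch: {pipeline(rank, shard_first=True)}')
    print(f'rank {rank}, batch then shard: {pipeline(rank, shard_first=False)}')
# rank 0, shard then batch: [[0, 2], [4, 6]]
# rank 0, batch then shard: [[0, 1], [4, 5]]
# rank 1, shard then batch: [[1, 3], [5, 7]]
# rank 1, batch then shard: [[2, 3], [6, 7]]
```

In both cases the ranks receive disjoint data. The difference is granularity: sharding first distributes individual elements, while batching first distributes whole batches, which matches the output above. The TensorFlow documentation generally recommends applying `shard` early in the pipeline, so that each worker does not process data it will later discard.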


In [9]:
%ipcluster stop