# Getting started with TensorFlow's `Dataset` API (continuation)

In this notebook we will learn how to split a dataset across the ranks in distributed training.

The following steps were done in one of the previous notebooks. If necessary, they can be run again in a new cell.
```bash
wget https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv
echo "sepal_length,sepal_width,petal_length,petal_width,species" > iris_setosa.csv
grep setosa iris.csv >> iris_setosa.csv
echo "sepal_length,sepal_width,petal_length,petal_width,species" > iris_versic.csv
grep versicolor iris.csv >> iris_versic.csv
echo "sepal_length,sepal_width,petal_length,petal_width,species" > iris_virgin.csv
grep virginica iris.csv >> iris_virgin.csv
```

In [1]:
import tensorflow as tf

In [2]:
tf.VERSION

'1.12.0'

In [3]:
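def parse_columns(*row, classes):
    """Split a CSV row into features and a one-hot encoded label.

    With classes=['setosa', 'virginica', 'versicolor']:
    setosa     -> [1, 0, 0]
    virginica  -> [0, 1, 0]
    versicolor -> [0, 0, 1]
    """
    # The first four columns are the numeric features.
    features = tf.convert_to_tensor(row[:4])
    # Position of the species name (fifth column) in `classes`; tf.where returns shape (1, 1).
    label_int = tf.where(tf.equal(classes, row[4]))
    # One-hot encode that position; the extra dimensions are kept, so the label has shape (1, 1, 3).
    label = tf.one_hot(label_int, 3)
    return features, label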


def get_csv_dataset(filename):
    return tf.data.experimental.CsvDataset(filename, header=True,
                                           record_defaults=[tf.float32,
                                                            tf.float32,
                                                            tf.float32,
                                                            tf.float32,
                                                            tf.string])

## Using Shards <a id='using_shards'></a>


Let's simulate distributed training with two ranks to see what data each worker receives. In distributed training one can use [`tf.data.Dataset.shard`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shard) to divide the dataset among the ranks; otherwise the same data might be sent to each of the workers.

> Note: `tf.data.Dataset.shard` is deprecated as of TensorFlow 1.13, but it will be replaced by another function that works very similarly.
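
To make the selection rule concrete, here is a minimal pure-Python sketch of what `shard(num_shards, index)` does: it keeps every `num_shards`-th element, starting at position `index`. The helper below is a hypothetical stand-in written for illustration, not TensorFlow code, and it works on a plain list rather than a `Dataset`.

```python
def shard(elements, num_shards, index):
    """Keep every `num_shards`-th element, starting at position `index`.

    Illustrative stand-in for the selection rule of `Dataset.shard`;
    it operates on a plain Python list, not on a TensorFlow dataset.
    """
    return [e for i, e in enumerate(elements) if i % num_shards == index]


files = ['iris_setosa.csv', 'iris_versic.csv', 'iris_virgin.csv']

# Simulate two ranks, each selecting its own share of the file list.
for rank in range(2):
    print('rank %d: %s' % (rank, shard(files, 2, rank)))
```

With three files and two ranks, rank 0 receives the first and third file and rank 1 receives the second, so the shares are disjoint but not necessarily equal in size.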

Let's consider:
 * `tf.data.Dataset.list_files` with `shuffle=True`.
 * `tf.data.Dataset.list_files` with `shuffle=False`.
 * Shard before interleaving.
 * Shard after interleaving.
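
The difference between the last two cases can be sketched in plain Python. The example below is a simplified model, not TensorFlow code: `shard` and `interleave` are hypothetical helpers, the row names are made up, and `interleave` is assumed to draw one row at a time from all of its sources in a fixed round-robin order.

```python
from itertools import zip_longest


def shard(elements, num_shards, index):
    """Keep every `num_shards`-th element, starting at position `index`."""
    return [e for i, e in enumerate(elements) if i % num_shards == index]


def interleave(sources):
    """Round-robin over all sources, one element at a time (simplified model)."""
    missing = object()  # sentinel for sources that are already exhausted
    return [e
            for group in zip_longest(*sources, fillvalue=missing)
            for e in group
            if e is not missing]


# Made-up rows standing in for the contents of three CSV files.
files = ['setosa', 'versic', 'virgin']
rows = {'setosa': ['s0', 's1'], 'versic': ['v0', 'v1'], 'virgin': ['g0', 'g1']}

# Shard before interleaving: each rank reads only the files assigned to it.
before = [interleave([rows[f] for f in shard(files, 2, rank)])
          for rank in range(2)]

# Shard after interleaving: every rank reads all files and keeps every other row.
stream = interleave([rows[f] for f in files])
after = [shard(stream, 2, rank) for rank in range(2)]

print('shard before interleave:', before)
print('shard after interleave: ', after)
```

In this model, sharding before interleaving gives each rank whole files, so a rank may never see some classes, while sharding after interleaving gives each rank rows from every file at the cost of every rank reading every file.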

In [4]:
# Common to both ranks
dataset = tf.data.Dataset.list_files(['iris_setosa.csv',
                                       'iris_versic.csv'],
                                      shuffle=True)
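
One thing to keep in mind with `shuffle=True` is that sharding selects files by position, so the result depends on the order each rank sees. The pure-Python sketch below illustrates this under the assumption that every rank shuffles the file list with its own random state; `list_files` and `shard` here are hypothetical stand-ins, not the TensorFlow functions, and the seeds are arbitrary.

```python
import random


def shard(elements, num_shards, index):
    """Keep every `num_shards`-th element, starting at position `index`."""
    return [e for i, e in enumerate(elements) if i % num_shards == index]


def list_files(files, seed):
    """Stand-in for a shuffled file listing with a per-rank random state."""
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)
    return shuffled


files = ['iris_setosa.csv', 'iris_versic.csv']

# Same seed on both ranks: both see the same order, so the shards are disjoint.
same_seed = [shard(list_files(files, seed=0), 2, rank) for rank in range(2)]

# A different seed per rank: the orders may differ, so the shards can overlap.
own_seed = [shard(list_files(files, seed=rank), 2, rank) for rank in range(2)]

print('same seed on every rank:', same_seed)
print('own seed on every rank: ', own_seed)
```

When all ranks share the same order, the shards always cover every file exactly once. When each rank has its own order, two ranks may end up with the same file while another file is never read.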

In [5]:
# Rank 0
dataset0 = dataset.shard(2, 0)  # hvd.size(), hvd.rank()
dataset0 = dataset0.interleave(get_csv_dataset,
                               cycle_length=2,
                               block_length=1,
                               num_parallel_calls=2)
dataset0 = dataset0.map(lambda *row: parse_columns(*row, classes=['setosa', 'virginica', 'versicolor']))
iterator0 = dataset0.make_one_shot_iterator()
next_item0 = iterator0.get_next()

with tf.Session() as sess:
    try:
        for i in range(10):
            features, label = sess.run(next_item0)
            print('features: %s  |  label: %s' % (features, label))
    except tf.errors.OutOfRangeError:
        print('The dataset ran out of entries!')

features: [5.1 3.5 1.4 0.2]  |  label: [[[1. 0. 0.]]]
features: [4.9 3.  1.4 0.2]  |  label: [[[1. 0. 0.]]]
features: [4.7 3.2 1.3 0.2]  |  label: [[[1. 0. 0.]]]
features: [4.6 3.1 1.5 0.2]  |  label: [[[1. 0. 0.]]]
features: [5.  3.6 1.4 0.2]  |  label: [[[1. 0. 0.]]]
features: [5.4 3.9 1.7 0.4]  |  label: [[[1. 0. 0.]]]
features: [4.6 3.4 1.4 0.3]  |  label: [[[1. 0. 0.]]]
features: [5.  3.4 1.5 0.2]  |  label: [[[1. 0. 0.]]]
features: [4.4 2.9 1.4 0.2]  |  label: [[[1. 0. 0.]]]
features: [4.9 3.1 1.5 0.1]  |  label: [[[1. 0. 0.]]]


In [6]:
# Rank 1
dataset1 = dataset.shard(2, 1)  # hvd.size(), hvd.rank()
dataset1 = dataset1.interleave(get_csv_dataset,
                               cycle_length=2,
                               block_length=1,
                               num_parallel_calls=2)
dataset1 = dataset1.map(lambda *row: parse_columns(*row, classes=['setosa', 'virginica', 'versicolor']))
iterator1 = dataset1.make_one_shot_iterator()
next_item1 = iterator1.get_next()

with tf.Session() as sess:
    try:
        for i in range(10):
            features, label = sess.run(next_item1)
            print('features: %s  |  label: %s' % (features, label))
    except tf.errors.OutOfRangeError:
        print('The dataset ran out of entries!')

features: [7.  3.2 4.7 1.4]  |  label: [[[0. 0. 1.]]]
features: [6.4 3.2 4.5 1.5]  |  label: [[[0. 0. 1.]]]
features: [6.9 3.1 4.9 1.5]  |  label: [[[0. 0. 1.]]]
features: [5.5 2.3 4.  1.3]  |  label: [[[0. 0. 1.]]]
features: [6.5 2.8 4.6 1.5]  |  label: [[[0. 0. 1.]]]
features: [5.7 2.8 4.5 1.3]  |  label: [[[0. 0. 1.]]]
features: [6.3 3.3 4.7 1.6]  |  label: [[[0. 0. 1.]]]
features: [4.9 2.4 3.3 1. ]  |  label: [[[0. 0. 1.]]]
features: [6.6 2.9 4.6 1.3]  |  label: [[[0. 0. 1.]]]
features: [5.2 2.7 3.9 1.4]  |  label: [[[0. 0. 1.]]]
