# Getting started with TensorFlow's `Dataset` API (continuation)

In this notebook we will construct `Dataset` objects from CSV files and learn how to interleave dataset files.

We will use [the iris dataset in CSV format](https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv) as an example. The iris dataset has 150 entries and includes three classes of iris flowers. The CSV file looks like this:
```
sepal_length,sepal_width,petal_length,petal_width,species
4.6,3.2,1.4,0.2,setosa
5.3,3.7,1.5,0.2,setosa
5,3.3,1.4,0.2,setosa
...
7,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.9,3.1,4.9,1.5,versicolor
...
6.1,3,4.9,1.8,virginica
6.4,2.8,5.6,2.1,virginica
7.2,3,5.8,1.6,virginica
...
```
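To make the column layout concrete, here is a minimal sketch that parses a few of the rows shown above with Python's standard `csv` module. The `sample` string and the `parse_row` helper are hypothetical names introduced only for this illustration; the notebook itself reads the file with `tf.data.experimental.CsvDataset`.

```python
import csv
import io

# A few rows copied from the listing above (not the full 150-row file).
sample = """sepal_length,sepal_width,petal_length,petal_width,species
4.6,3.2,1.4,0.2,setosa
7,3.2,4.7,1.4,versicolor
6.1,3,4.9,1.8,virginica
"""


def parse_row(row):
    """Split a CSV row into four float measurements and a string label."""
    *measurements, species = row
    return [float(value) for value in measurements], species


reader = csv.reader(io.StringIO(sample))
header = next(reader)                       # first line holds the column names
rows = [parse_row(row) for row in reader]   # remaining lines are data

for features, species in rows:
    print(features, species)
```

The four float columns followed by one string column are the same layout that the `record_defaults` argument describes later in the notebook.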

In [1]:
import tensorflow as tf

In [2]:
tf.enable_eager_execution()

In [3]:
tf.VERSION

'1.13.1'

## Some helper functions

In [4]:
def parse_columns_minimal(*row):
    features = tf.convert_to_tensor(row[:4])
    label = row[4]  # this is a string!
    return features, label

classes = ['setosa', 'virginica', 'versicolor']
# dataset = dataset.map(parse_columns_defaultargs)
def parse_columns_defaultargs(*row, classes=classes):
    """Convert the string classes to one-hot encoded labels:
    setosa     -> [1, 0, 0]
    virginica  -> [0, 1, 0]
    versicolor -> [0, 0, 1]
    """
    features = tf.convert_to_tensor(row[:4])
    # classes = tf.constant(classes)
    label_int = tf.where(tf.equal(classes, row[4]))
    label = tf.one_hot(label_int, 3)
    return features, label


# dataset = dataset.map(lambda *row: parse_columns(*row, classes=classes))
def parse_columns(*row, classes):
    """Convert the string classes to one-hot encoded labels:
    setosa     -> [1, 0, 0]
    virginica  -> [0, 1, 0]
    versicolor -> [0, 0, 1]
    """
    features = tf.convert_to_tensor(row[:4])
    label_int = tf.where(tf.equal(classes, row[4]))
    label = tf.one_hot(label_int, 3)    
    return features, label
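The mapping described in the docstrings can be checked without TensorFlow. The following is a plain-Python sketch of the same idea; `one_hot_label` is a hypothetical helper introduced here for illustration and is not used elsewhere in the notebook.

```python
classes = ['setosa', 'virginica', 'versicolor']


def one_hot_label(species, classes=classes):
    """Return a one-hot list with a 1.0 at the position of `species` in `classes`."""
    index = classes.index(species)  # raises ValueError for an unknown species
    return [1.0 if i == index else 0.0 for i in range(len(classes))]


for name in classes:
    print(name, '->', one_hot_label(name))
```

Unlike this flat list, the TensorFlow helpers above return a nested label: `tf.where` yields the matching index as a tensor of shape `[1, 1]`, so `tf.one_hot` produces a label of shape `[1, 1, 3]`. That is why labels appear as `[[[1. 0. 0.]]]` in the printed outputs.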

## Read a single dataset file in CSV format

In [5]:
dataset = tf.data.experimental.CsvDataset('iris.csv', header=True,
                                          record_defaults=[tf.float32, tf.float32, tf.float32,
                                                           tf.float32, tf.string])
dataset = dataset.map(lambda *row: parse_columns(*row, classes=['setosa', 'virginica', 'versicolor']))
dataset = dataset.shuffle(150)
dataset = dataset.batch(1)
dataset = dataset.repeat(1)

In [6]:
# Download the dataset file
! wget https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv

--2019-06-17 09:13:37--  https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.112.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.112.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3716 (3.6K) [text/plain]
Saving to: ‘iris.csv.1’


2019-06-17 09:13:38 (35.3 MB/s) - ‘iris.csv.1’ saved [3716/3716]



In [7]:
for features, label in dataset:
    print('features: %s  |  label: %s' % (features.numpy(), label.numpy()))

features: [[5.9 3.  4.2 1.5]]  |  label: [[[[0. 0. 1.]]]]
features: [[6.4 2.8 5.6 2.1]]  |  label: [[[[0. 1. 0.]]]]
features: [[6.  2.2 5.  1.5]]  |  label: [[[[0. 1. 0.]]]]
features: [[5.6 3.  4.1 1.3]]  |  label: [[[[0. 0. 1.]]]]
features: [[6.3 2.3 4.4 1.3]]  |  label: [[[[0. 0. 1.]]]]
features: [[6.6 3.  4.4 1.4]]  |  label: [[[[0. 0. 1.]]]]
features: [[6.1 2.8 4.7 1.2]]  |  label: [[[[0. 0. 1.]]]]
features: [[4.6 3.1 1.5 0.2]]  |  label: [[[[1. 0. 0.]]]]
features: [[5.8 4.  1.2 0.2]]  |  label: [[[[1. 0. 0.]]]]
features: [[6.4 2.9 4.3 1.3]]  |  label: [[[[0. 0. 1.]]]]
features: [[5.7 2.8 4.5 1.3]]  |  label: [[[[0. 0. 1.]]]]
features: [[7.3 2.9 6.3 1.8]]  |  label: [[[[0. 1. 0.]]]]
features: [[7.2 3.2 6.  1.8]]  |  label: [[[[0. 1. 0.]]]]
features: [[6.4 2.8 5.6 2.2]]  |  label: [[[[0. 1. 0.]]]]
features: [[4.9 3.1 1.5 0.1]]  |  label: [[[[1. 0. 0.]]]]
features: [[4.3 3.  1.1 0.1]]  |  label: [[[[1. 0. 0.]]]]


## Interleaving dataset files

### Notes
[`tf.data.Dataset.list_files`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#list_files) returns the filenames in a non-deterministic, randomly shuffled order by default. Passing either `seed=<int>` or `shuffle=False` makes the order deterministic:
 * `seed=<int>`: Random shuffle from a specified seed.
 * `shuffle=True`: Random shuffle from a random seed.
 * `shuffle=False`: Alphabetical order.
 
This is especially important when using `shard` together with `interleave`, as we will see in the next section.
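The three ordering modes listed above can be modelled in plain Python. This is only a toy sketch: `order_files` is a hypothetical function, and it uses Python's `random` module rather than TensorFlow's own shuffling, so the shuffled orders it produces will not match those of `list_files`.

```python
import random


def order_files(filenames, shuffle=True, seed=None):
    """Toy model of the ordering modes (not TensorFlow's implementation)."""
    ordered = sorted(filenames)  # shuffle=False: alphabetical order
    if shuffle:
        # seed=<int> gives a reproducible shuffle; seed=None gives a new one each run.
        random.Random(seed).shuffle(ordered)
    return ordered


files = ['iris_setosa.csv', 'iris_virgin.csv', 'iris_versic.csv']

print(order_files(files, shuffle=False))  # always the same alphabetical order
print(order_files(files, seed=42))        # same order on every run
print(order_files(files))                 # may differ from run to run
```

Note that alphabetical order places `iris_versic.csv` before `iris_virgin.csv`, whatever order the names are passed in. This matches the interleaved outputs in this notebook, where versicolor rows appear before virginica rows.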

In [8]:
%%bash
# Divide the iris dataset file into three dataset files, each containing a single variety of flower.
echo "sepal_length,sepal_width,petal_length,petal_width,species" > iris_setosa.csv
grep setosa iris.csv >> iris_setosa.csv
#
echo "sepal_length,sepal_width,petal_length,petal_width,species" > iris_versic.csv
grep versicolor iris.csv >> iris_versic.csv
#
echo "sepal_length,sepal_width,petal_length,petal_width,species" > iris_virgin.csv
grep virginica iris.csv >> iris_virgin.csv

In [9]:
def get_csv_dataset(filename):
    return tf.data.experimental.CsvDataset(filename, header=True,
                                           record_defaults=[tf.float32, tf.float32, tf.float32,
                                                            tf.float32, tf.string])

In [10]:
dataset = tf.data.Dataset.list_files(['iris_setosa.csv',
                                      'iris_virgin.csv',
                                      'iris_versic.csv'],
                                     shuffle=False)
dataset = dataset.interleave(get_csv_dataset,
                             cycle_length=3,
                             block_length=2,
                             num_parallel_calls=3)
dataset = dataset.map(lambda *row: parse_columns(*row, classes=['setosa', 'virginica', 'versicolor']))

for features, label in dataset:
    print('features: %s  |  label: %s' % (features.numpy(), label.numpy()))

features: [5.1 3.5 1.4 0.2]  |  label: [[[1. 0. 0.]]]
features: [4.9 3.  1.4 0.2]  |  label: [[[1. 0. 0.]]]
features: [7.  3.2 4.7 1.4]  |  label: [[[0. 0. 1.]]]
features: [6.4 3.2 4.5 1.5]  |  label: [[[0. 0. 1.]]]
features: [6.3 3.3 6.  2.5]  |  label: [[[0. 1. 0.]]]
features: [5.8 2.7 5.1 1.9]  |  label: [[[0. 1. 0.]]]
features: [4.7 3.2 1.3 0.2]  |  label: [[[1. 0. 0.]]]
features: [4.6 3.1 1.5 0.2]  |  label: [[[1. 0. 0.]]]
features: [6.9 3.1 4.9 1.5]  |  label: [[[0. 0. 1.]]]
features: [5.5 2.3 4.  1.3]  |  label: [[[0. 0. 1.]]]
features: [7.1 3.  5.9 2.1]  |  label: [[[0. 1. 0.]]]
features: [6.3 2.9 5.6 1.8]  |  label: [[[0. 1. 0.]]]
features: [5.  3.6 1.4 0.2]  |  label: [[[1. 0. 0.]]]
features: [5.4 3.9 1.7 0.4]  |  label: [[[1. 0. 0.]]]
features: [6.5 2.8 4.6 1.5]  |  label: [[[0. 0. 1.]]]
features: [5.7 2.8 4.5 1.3]  |  label: [[[0. 0. 1.]]]
features: [6.5 3.  5.8 2.2]  |  label: [[[0. 1. 0.]]]
features: [7.6 3.  6.6 2.1]  |  label: [[[0. 1. 0.]]]
features: [4.6 3.4 1.4 0.3] 

## More on interleave

Let's try different combinations of `cycle_length` and `block_length` when interleaving data files and see what happens.
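Before running the TensorFlow cells, it can help to see the roles of `cycle_length` and `block_length` in a simplified, sequential model written in plain Python. The `interleave` generator and the `_DONE` sentinel below are hypothetical names for this sketch. It ignores `num_parallel_calls` and is not TensorFlow's implementation, only an approximation of the ordering seen in the outputs.

```python
from itertools import islice

_DONE = object()  # sentinel marking "no more sources to open"


def interleave(sources, cycle_length, block_length):
    """Simplified sequential model of interleaving several sources.

    At most `cycle_length` sources are open at once; on each visit a
    source contributes up to `block_length` consecutive items.
    """
    pending = iter(sources)
    slots = [iter(src) for src in islice(pending, cycle_length)]
    while slots:
        i = 0
        while i < len(slots):
            block = list(islice(slots[i], block_length))
            yield from block
            if len(block) == block_length:
                i += 1  # source may still have items; move to the next slot
                continue
            # The source ran out: open the next pending source in this slot,
            # or drop the slot if nothing is left to open.
            nxt = next(pending, _DONE)
            if nxt is _DONE:
                del slots[i]
            else:
                slots[i] = iter(nxt)
                i += 1


a = ['a1', 'a2', 'a3']
b = ['b1', 'b2', 'b3']
c = ['c1', 'c2', 'c3']

print(list(interleave([a, b, c], cycle_length=3, block_length=2)))
# ['a1', 'a2', 'b1', 'b2', 'c1', 'c2', 'a3', 'b3', 'c3']
print(list(interleave([a, b, c], cycle_length=2, block_length=1)))
# ['a1', 'b1', 'a2', 'b2', 'a3', 'b3', 'c1', 'c2', 'c3']
```

With `cycle_length=2`, the third source is opened only after one of the first two is exhausted. This is consistent with the output of the next cell, where rows from only two of the three files alternate at the start.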

In [11]:
dataset = tf.data.Dataset.list_files(['iris_setosa.csv',
                                      'iris_virgin.csv',
                                      'iris_versic.csv'],
                                     shuffle=False)
dataset = dataset.interleave(get_csv_dataset,
                             cycle_length=2,
                             block_length=1,
                             num_parallel_calls=2)
dataset = dataset.map(lambda *row: parse_columns(*row, classes=classes))

for features, label in dataset:
    print('features: %s  |  label: %s' % (features.numpy(), label.numpy()))

features: [5.1 3.5 1.4 0.2]  |  label: [[[1. 0. 0.]]]
features: [7.  3.2 4.7 1.4]  |  label: [[[0. 0. 1.]]]
features: [4.9 3.  1.4 0.2]  |  label: [[[1. 0. 0.]]]
features: [6.4 3.2 4.5 1.5]  |  label: [[[0. 0. 1.]]]
features: [4.7 3.2 1.3 0.2]  |  label: [[[1. 0. 0.]]]
features: [6.9 3.1 4.9 1.5]  |  label: [[[0. 0. 1.]]]
features: [4.6 3.1 1.5 0.2]  |  label: [[[1. 0. 0.]]]
features: [5.5 2.3 4.  1.3]  |  label: [[[0. 0. 1.]]]
features: [5.  3.6 1.4 0.2]  |  label: [[[1. 0. 0.]]]
features: [6.5 2.8 4.6 1.5]  |  label: [[[0. 0. 1.]]]
features: [5.4 3.9 1.7 0.4]  |  label: [[[1. 0. 0.]]]
features: [5.7 2.8 4.5 1.3]  |  label: [[[0. 0. 1.]]]
features: [4.6 3.4 1.4 0.3]  |  label: [[[1. 0. 0.]]]
features: [6.3 3.3 4.7 1.6]  |  label: [[[0. 0. 1.]]]
features: [5.  3.4 1.5 0.2]  |  label: [[[1. 0. 0.]]]
features: [4.9 2.4 3.3 1. ]  |  label: [[[0. 0. 1.]]]
features: [4.4 2.9 1.4 0.2]  |  label: [[[1. 0. 0.]]]
features: [6.6 2.9 4.6 1.3]  |  label: [[[0. 0. 1.]]]
features: [4.9 3.1 1.5 0.1] 