# 1. Dataset Types

https://www.tensorflow.org/guide/data#dataset_structure

* **from tensors:** Creates a Dataset with a single element, comprising the given tensors.
* **from tensor slices:** Creates a Dataset whose elements are slices of the given tensors.
* **from tfrecord:** A Dataset comprising records from one or more TFRecord files.
* **from textline:** A Dataset comprising lines from one or more text files.
* **from generator:** Creates a Dataset whose elements are generated by generator.

```python
tf.data.Dataset.from_tensors()
tf.data.Dataset.from_tensor_slices()
tf.data.TFRecordDataset()
tf.data.TextLineDataset()
tf.data.Dataset.from_generator()
```

## [1.1 From Tensors](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensors)

In [None]:
import tensorflow as tf
import numpy as np

data = np.arange(10)
dataset = tf.data.Dataset.from_tensors(data)
dataset.take(1)

## [1.2 From tensor slices](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices)

In [None]:
data = np.arange(10)
dataset = tf.data.Dataset.from_tensor_slices(data)
dataset.take(1)
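
The difference between the two constructors above is easiest to see by counting elements. The following is a minimal sketch using the same `np.arange(10)` input; the variable names `whole` and `sliced` are chosen here for illustration only.

```python
import numpy as np
import tensorflow as tf

data = np.arange(10)

# from_tensors wraps the whole array as ONE element of shape (10,).
whole = tf.data.Dataset.from_tensors(data)
# from_tensor_slices splits along the first axis into TEN scalar elements.
sliced = tf.data.Dataset.from_tensor_slices(data)

whole_shapes = [x.shape.as_list() for x in whole]
sliced_values = [int(x.numpy()) for x in sliced]

print(len(whole_shapes), whole_shapes[0])
print(len(sliced_values), sliced_values[:3])
```

In practice, `from_tensor_slices` is usually the one you want when each row of an array is a separate training example.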

## [1.3 From TextLine](https://www.tensorflow.org/api_docs/python/tf/data/TextLineDataset)

In [None]:
with open('./test.txt', 'w') as f:
    # array2string wraps long output, so the file may span several lines
    f.write(np.array2string(np.random.rand(10), separator=','))

In [None]:
dataset = tf.data.TextLineDataset(['./test.txt'])  # note: the input is a list of file paths
dataset

In [None]:
for x in dataset.take(20):
    print(x)

In [None]:
!rm ./test.txt

## [1.4 From Generator](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_generator)

In [None]:
def generator():
    for i in range(10):
        yield 2*i
    
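# output_signature (TF >= 2.4) replaces the deprecated positional output_types argument
dataset = tf.data.Dataset.from_generator(
    generator, output_signature=tf.TensorSpec(shape=(), dtype=tf.int32))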
dataset

In [None]:
for x in dataset.take(5):
    print(x)

## [1.5 From tfrecords](https://www.tensorflow.org/tutorials/load_data/tfrecord#top_of_page)

In [None]:
import tensorflow as tf

data_arr = [
    {
        'int_data': 108,
        'float_data': 2.45,
        'str_data': 'String 100',
        'float_list_data': [256.78, 13.9]
    },
    {
        'int_data': 37,
        'float_data': 84.3,
        'str_data': 'String 200',
        'float_list_data': [1.34, 843.9, 65.22]
    }
]

def get_example_object(data_record):
    # Convert individual data into a list of int64 or float or bytes
    int_list1 = tf.train.Int64List(value = [data_record['int_data']])
    float_list1 = tf.train.FloatList(value = [data_record['float_data']])
    # Convert string data into list of bytes
    str_list1 = tf.train.BytesList(value = [data_record['str_data'].encode('utf-8')])
    float_list2 = tf.train.FloatList(value = data_record['float_list_data'])

    # Create a dictionary with above lists individually wrapped in Feature
    feature_key_value_pair = {
        'int_list1': tf.train.Feature(int64_list = int_list1),
        'float_list1': tf.train.Feature(float_list = float_list1),
        'str_list1': tf.train.Feature(bytes_list = str_list1),
        'float_list2': tf.train.Feature(float_list = float_list2)
    }

    # Create Features object with above feature dictionary
    features = tf.train.Features(feature = feature_key_value_pair)

    # Create Example object with features
    example = tf.train.Example(features = features)
    return example

with tf.io.TFRecordWriter('./example.tfrecord') as tfwriter:
    # Iterate through all records
    for data_record in data_arr:
        example = get_example_object(data_record)
        # Append each example into tfrecord
        tfwriter.write(example.SerializeToString())

In [None]:
dataset = tf.data.TFRecordDataset(['./example.tfrecord'])
dataset

In [None]:
for raw_record in dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    print(example)
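
Decoding each record with `tf.train.Example.ParseFromString` works for inspection, but inside a `tf.data` pipeline records are usually parsed with `tf.io.parse_single_example`. The sketch below assumes the feature names from the writer cell above (`int_list1`, `float_list1`, `str_list1`, `float_list2`) and writes its own one-record file to a temporary directory so that it runs standalone; `parse_fn` and `feature_description` are illustrative names.

```python
import os
import tempfile
import tensorflow as tf

# Write a single record with the same feature names as the writer cell above.
path = os.path.join(tempfile.mkdtemp(), "example.tfrecord")
example = tf.train.Example(features=tf.train.Features(feature={
    "int_list1": tf.train.Feature(int64_list=tf.train.Int64List(value=[108])),
    "float_list1": tf.train.Feature(float_list=tf.train.FloatList(value=[2.45])),
    "str_list1": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"String 100"])),
    "float_list2": tf.train.Feature(float_list=tf.train.FloatList(value=[256.78, 13.9])),
}))
with tf.io.TFRecordWriter(path) as writer:
    writer.write(example.SerializeToString())

# Describe the expected type of each feature. float_list2 has a variable
# length across records, so it is declared as a VarLenFeature.
feature_description = {
    "int_list1": tf.io.FixedLenFeature([], tf.int64),
    "float_list1": tf.io.FixedLenFeature([], tf.float32),
    "str_list1": tf.io.FixedLenFeature([], tf.string),
    "float_list2": tf.io.VarLenFeature(tf.float32),
}

def parse_fn(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_description)
    # VarLenFeature yields a SparseTensor; convert it to a dense tensor.
    parsed["float_list2"] = tf.sparse.to_dense(parsed["float_list2"])
    return parsed

parsed_dataset = tf.data.TFRecordDataset([path]).map(parse_fn)
records = list(parsed_dataset)
print(records[0])
```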

In [None]:
!rm ./example.tfrecord

## [1.6 From other formats](https://github.com/tensorflow/io#tensorflow-io) (BigQuery, GCS, hdf5, AVRO, parquet, etc.)


___
# 2. Dataset Operations

## 2.1 Batching
Combines consecutive elements of this dataset into batches of `batch_size`. The last batch may be smaller unless `drop_remainder=True` is set.


<img src="../images/2_datasets/batch.jpeg" width="400">

In [None]:
data = np.arange(8)
dataset = tf.data.Dataset.from_tensor_slices(data)
dataset = dataset.batch(3)
for x in dataset:
    print(x)
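
With 8 elements and a batch size of 3, the final batch holds only the 2 leftover elements. The sketch below compares the default behavior with `drop_remainder=True`, which discards the incomplete batch; this can matter when a model expects a fixed batch dimension.

```python
import numpy as np
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(np.arange(8))

# Default: the final batch keeps the leftover elements.
sizes_default = [int(b.shape[0]) for b in dataset.batch(3)]
# drop_remainder=True discards the incomplete final batch.
sizes_dropped = [int(b.shape[0]) for b in dataset.batch(3, drop_remainder=True)]

print(sizes_default)   # [3, 3, 2]
print(sizes_dropped)   # [3, 3]
```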

## 2.2 Repeat
Repeats this dataset `count` times. If `count` is omitted (or set to `-1`), the dataset repeats indefinitely.

<img src="../images/2_datasets/repeat.jpeg" width="400">

In [None]:
data = np.arange(4)
dataset = tf.data.Dataset.from_tensor_slices(data)
dataset = dataset.batch(4).repeat(2)
for x in dataset:
    print(x)
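
The order of `batch` and `repeat` matters when the dataset size is not a multiple of the batch size. The sketch below uses 5 elements (an illustrative choice) to show that batching first keeps epochs separate, while repeating first lets a batch straddle the epoch boundary.

```python
import numpy as np
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices(np.arange(5))

# batch -> repeat: each epoch ends with its own short batch.
batch_then_repeat = [b.numpy().tolist() for b in ds.batch(2).repeat(2)]
# repeat -> batch: elements from two epochs can share a batch.
repeat_then_batch = [b.numpy().tolist() for b in ds.repeat(2).batch(2)]

print(batch_then_repeat)  # [[0, 1], [2, 3], [4], [0, 1], [2, 3], [4]]
print(repeat_then_batch)  # [[0, 1], [2, 3], [4, 0], [1, 2], [3, 4]]
```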

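## [2.3 Shuffle](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#shuffle)
Randomly shuffles the elements of this dataset using a buffer of `buffer_size` elements. A full uniform shuffle requires a buffer at least as large as the dataset.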

<img src="../images/2_datasets/shuffle.jpeg" width="400">

In [None]:
data = np.arange(8)
dataset = tf.data.Dataset.from_tensor_slices(data)
dataset = dataset.shuffle(4)
for x in dataset:
    print(x)

In [None]:
for x in dataset.batch(8):
    print(x)
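
The buffer size bounds how far an element can move forward. As a rough sketch of that behavior: with a buffer of size `k`, the element emitted at position `i` is drawn from the first `i + k` input elements, so a buffer of 1 does not shuffle at all. The helper `values` and the seeds are illustrative choices.

```python
import numpy as np
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(np.arange(8))

def values(ds):
    """Collect the scalar elements of a dataset into a Python list."""
    return [int(x.numpy()) for x in ds]

no_shuffle = values(dataset.shuffle(buffer_size=1))        # order unchanged
partial = values(dataset.shuffle(buffer_size=4, seed=0))   # local shuffle
full = values(dataset.shuffle(buffer_size=8, seed=0))      # uniform shuffle

print(no_shuffle)
print(partial)
print(full)
```

With `buffer_size=4`, the first emitted value can only be one of 0 to 3, which is why a small buffer on a sorted dataset gives a weak shuffle.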

## 2.4 Map
Maps `map_func` across the elements of this dataset.

<img src="../images/2_datasets/map.jpeg" width="400">

In [None]:
data = np.arange(8)
dataset = tf.data.Dataset.from_tensor_slices(data)
dataset = dataset.batch(8).map(lambda x: x+1)
for x in dataset:
    print(x)

<img src="../images/2_datasets/map_2.png" width="600">

In [None]:
data = np.arange(8)
dataset = tf.data.Dataset.from_tensor_slices(data)
dataset = dataset.batch(8).map(lambda x: x + 1, num_parallel_calls=tf.data.experimental.AUTOTUNE)
for x in dataset:
    print(x)
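
The cells above call `map` after `batch`, so the function receives a whole batch at once. For element-wise operations the result is the same as mapping first and batching afterwards, but the function is invoked fewer times. A minimal sketch, with `add_one` as an illustrative function name:

```python
import numpy as np
import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices(np.arange(8))

def add_one(x):
    # Works on scalars and on vectors alike.
    return x + 1

# map -> batch: add_one is applied to each of the 8 scalar elements.
per_element = [b.numpy().tolist() for b in ds.map(add_one).batch(4)]
# batch -> map: add_one is applied to each of the 2 batched vectors.
per_batch = [b.numpy().tolist() for b in ds.batch(4).map(add_one)]

print(per_element)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(per_batch)    # [[1, 2, 3, 4], [5, 6, 7, 8]]
```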

## 2.5 Filter
Filters this dataset according to predicate.

<img src="../images/2_datasets/filter.jpeg" width="400">

In [None]:
data = np.arange(8)
dataset = tf.data.Dataset.from_tensor_slices(data)
dataset = dataset.filter(lambda x: x%2==0).batch(4)
for x in dataset:
    print(x)

## 2.6 Prefetch
Creates a Dataset that prefetches elements from this dataset.

<img src="../images/2_datasets/prefetch.png" width="600">
<img src="../images/2_datasets/prefetch_1.png" width="600">
<img src="../images/2_datasets/prefetch_2.png" width="600">

In [None]:
data = np.arange(8)
dataset = tf.data.Dataset.from_tensor_slices(data)
dataset = dataset.filter(lambda x: x%2==0).map(lambda y: y+1).batch(4)
dataset = dataset.prefetch(3)
for x in dataset:
    print(x)

# 3. Performance

See [the official guide](https://www.tensorflow.org/guide/data_performance#optimize_performance) for details; below is a summary.

* Use the prefetch transformation to overlap the work of a producer and consumer. In particular, we recommend adding prefetch to the end of your input pipeline to overlap the transformations performed on the CPU with the training done on the accelerator. You can either tune the buffer size manually or use tf.data.experimental.AUTOTUNE to delegate the decision to the tf.data runtime.
* Parallelize the map transformation by setting the num_parallel_calls argument. You can either tune the level of parallelism manually or use tf.data.experimental.AUTOTUNE to delegate the decision to the tf.data runtime.
* If you are working with data stored remotely and / or requiring deserialization, we recommend using the interleave transformation to parallelize the reading (and deserialization) of data from different files.
* Vectorize cheap user-defined functions passed in to the map transformation to amortize the overhead associated with scheduling and executing the function.
* If your data can fit into memory, use the cache transformation to cache it in memory during the first epoch, so that subsequent epochs can avoid the overhead associated with reading, parsing, and transforming it.
* If your pre-processing increases the size of your data, we recommend applying the interleave, prefetch, and shuffle first (if possible) to reduce memory usage.
* We recommend applying the shuffle transformation before the repeat transformation.
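
The recommendations above can be combined into a single pipeline. The following is one possible arrangement, not the only valid one: it assumes TF >= 2.4 (for `tf.data.AUTOTUNE`) and uses two small, hypothetical text shards written to a temporary directory so that it runs standalone. The names `paths` and `parse_batch` are illustrative.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE  # assumes TF >= 2.4

# Hypothetical input: two text shards with one number per line.
tmp_dir = tempfile.mkdtemp()
paths = []
for shard in range(2):
    path = os.path.join(tmp_dir, f"shard_{shard}.txt")
    with open(path, "w") as f:
        f.write("\n".join(str(shard * 10 + i) for i in range(10)))
    paths.append(path)

def parse_batch(lines):
    # Vectorized: converts a whole batch of strings in one call.
    return tf.strings.to_number(lines, out_type=tf.float32) / 10.0

dataset = (
    tf.data.Dataset.from_tensor_slices(paths)
    # Read the shards in parallel instead of one after another.
    .interleave(tf.data.TextLineDataset, cycle_length=2,
                num_parallel_calls=AUTOTUNE)
    # Cache the raw lines so later epochs skip the file reads.
    .cache()
    # Shuffle after cache so each epoch gets a fresh order.
    .shuffle(buffer_size=20)
    .batch(5)
    # Map after batch so parse_batch runs once per batch.
    .map(parse_batch, num_parallel_calls=AUTOTUNE)
    # Prefetch last to overlap preprocessing with consumption.
    .prefetch(AUTOTUNE)
)

batches = [b.numpy() for b in dataset]
print(len(batches), batches[0].shape)
```

Placing `cache` before `shuffle` is a deliberate choice here: caching after the shuffle would freeze a single order and replay it every epoch.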