<a name="top"></a><a id="top"></a>
# Tests on tf.train.Feature data types (non-scalar inputs)
   
<a href="https://colab.research.google.com/github/gbih/ml-notes/blob/main/tf_record_tftrain/nb_003_tftrainFeature_nonscalars.ipynb">
<strong>View in Colab</strong>
</a>

1. [Setup](#setup)
2. [Introduction](#2.0)
3. [Handling non-scalar input via tf.io.serialize_tensor](#3.0)
    * 3.1 [bytes](#3.1)
    * 3.2 [floats](#3.2)
    * 3.3 [int64s](#3.3)
---
**To-do**: 

Use the inverse operation `tf.io.parse_tensor` to transform the scalar string containing a serialized Tensor into a Tensor of a specified type.
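
A minimal sketch of that round trip, assuming eager execution (the TensorFlow 2 default). The variable names and values are illustrative only and do not come from the cells below.

```python
import numpy as np
import tensorflow as tf

# Illustrative non-scalar input (values are arbitrary)
original = tf.constant([10, 20, 30], dtype=tf.int64)

# Forward: tensor -> scalar tf.string holding a serialized TensorProto
serialized = tf.io.serialize_tensor(original)

# Inverse: out_type must name the dtype that was serialized
restored = tf.io.parse_tensor(serialized, out_type=tf.int64)

print(serialized.dtype, serialized.shape.rank)
print(restored)
```

Because `parse_tensor` needs `out_type`, the dtype of each feature has to be known (or stored separately) at read time.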

---
<a id="setup"></a><a name="setup"></a>
# 1. Setup
<a href="#top">[back to top]</a>

In [1]:
#import glob
import numpy as np
import os
import pprint as pp
import tensorflow as tf

# To make this notebook's output stable across runs
tf.random.set_seed(42)
np.random.seed(42)

def HR():
    print("-"*40)
    
print("Libraries loaded..")

Libraries loaded..


---
<a id="2.0"></a><a name="2.0"></a>
# 2. Introduction
<a href="#top">[back to top]</a>


According to the [official documentation](https://www.tensorflow.org/tutorials/load_data/tfrecord), the simplest way to handle non-scalar features is to use `tf.io.serialize_tensor` to convert tensors to binary strings (strings are scalars in TensorFlow).

This is the technique we explore here.

We want to test that each of the following sub-types, which were previously passed to its matching list type (`tf.train.BytesList`, `tf.train.FloatList`, or `tf.train.Int64List`), can instead be processed with `tf.io.serialize_tensor` and then passed only to `tf.train.BytesList`:


1. `tf.train.BytesList` (since everything becomes a binary string, this is the only list type that we use)
    - `byte`
    - `string`
2. `tf.train.FloatList`
    - `float` (`float32`)
    - `double` (`float64`)
3. `tf.train.Int64List`
    - `bool`
    - `enum`
    - `int32`
    - `uint32`
    - `int64`
    - `uint64`
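
To make the idea concrete, here is a hedged sketch in which features of three different dtypes all travel through `tf.train.BytesList` inside a single `tf.train.Example` and are then recovered. The feature names (`a_float`, `an_int`, `a_string`) and their values are hypothetical and chosen only for this illustration.

```python
import numpy as np
import tensorflow as tf

# Hypothetical features of different dtypes (names and values are illustrative)
tensors = {
    "a_float": tf.constant([1.5, 2.5], dtype=tf.float32),
    "an_int": tf.constant([1, 2, 3], dtype=tf.int64),
    "a_string": tf.constant(["x", "yz"]),
}

# Every tensor is serialized and stored in a BytesList, whatever its dtype
feature = {
    name: tf.train.Feature(
        bytes_list=tf.train.BytesList(
            value=[tf.io.serialize_tensor(t).numpy()]
        )
    )
    for name, t in tensors.items()
}
example = tf.train.Example(features=tf.train.Features(feature=feature))
record = example.SerializeToString()

# When parsing, every feature is declared as a scalar string...
spec = {name: tf.io.FixedLenFeature([], tf.string) for name in tensors}
parsed = tf.io.parse_single_example(record, spec)

# ...and then turned back into a tensor of its original dtype
restored = {
    name: tf.io.parse_tensor(parsed[name], out_type=tensors[name].dtype)
    for name in tensors
}
for name, value in restored.items():
    print(name, value.dtype, value.shape)
```

The trade-off of this design is that the Example no longer records the dtype in its schema, so the reader has to supply it through `out_type`.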

---
<a id="3.0"></a><a name="3.0"></a>
# 3. Handling non-scalar input via tf.io.serialize_tensor
<a href="#top">[back to top]</a>

Use `tf.io.serialize_tensor` to transform a Tensor into a serialized TensorProto proto. This operation transforms data in a `tf.Tensor` into a `tf.Tensor` of type `tf.string` containing the data in a binary string format. This operation can transform scalar data and linear arrays, but it is most useful in converting multidimensional arrays into a format accepted by binary storage formats such as a TFRecord or `tf.train.Example`.
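
As a small illustration of the multidimensional case, the sketch below serializes a rank-3 array and checks that the shape survives the round trip. The array contents and the name `volume` are arbitrary assumptions for this example.

```python
import numpy as np
import tensorflow as tf

# Arbitrary rank-3 array standing in for, e.g., a small image batch
volume = np.arange(2 * 3 * 4, dtype=np.int32).reshape(2, 3, 4)

# The serialized form is a single scalar string, regardless of input rank
serialized = tf.io.serialize_tensor(volume)
print(serialized.shape.rank)

# The shape is stored inside the TensorProto and restored on parsing
restored = tf.io.parse_tensor(serialized, out_type=tf.int32)
print(restored.shape)
```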

In [2]:
# scalar bytes
def bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    # tf.constant(0) is an arbitrary tensor; it is used only to obtain the EagerTensor type
    eager_tensor_type = type(tf.constant(0))
    
    if isinstance(value, eager_tensor_type):
        value = value.numpy()
    return tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[value])
    )
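
The sketch below is one way to see why serialization is needed before calling a helper like this. It assumes that `tf.train.BytesList` rejects a NumPy array as a single value; the exact exception type may depend on the installed protobuf version, so the code catches `Exception` broadly. The helper `to_bytes_list` is hypothetical and exists only for this example.

```python
import tensorflow as tf

non_scalar = tf.constant(["one", "two", "three"])


def to_bytes_list(value):
    """Hypothetical helper for this sketch: wrap one value in a BytesList."""
    return tf.train.BytesList(value=[value])


# Passing the raw (non-scalar) array as a single bytes value is expected to fail
try:
    to_bytes_list(non_scalar.numpy())
    rejected = False
except Exception as err:  # exact type may vary with the protobuf version
    rejected = True
    print("rejected:", type(err).__name__)

# After serialization the value is a single bytes object and is accepted
accepted = to_bytes_list(tf.io.serialize_tensor(non_scalar).numpy())
print(len(accepted.value))
```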

<a id='3.1'></a><a name="3.1"></a>
## 3.1 bytes
<a href="#top">[back to top]</a>

In [22]:
bytes1 = [b'this is sentence1 of byte-type', b'this is sentence2of byte-type']
assert isinstance(bytes1, list) # a plain Python list, not an EagerTensor
serialized_bytes1 = tf.io.serialize_tensor(bytes1)
assert isinstance(serialized_bytes1, type(tf.constant(0)))
print(serialized_bytes1)
print(bytes_feature(serialized_bytes1))
HR()


bytes2 = tf.constant(["one", "two", "three"])
print(type(bytes2))
# This is an EagerTensor, but it is non-scalar (rank 1), so bytes_feature will fail unless we serialize it first
serialized_bytes2 = tf.io.serialize_tensor(bytes2)
assert isinstance(serialized_bytes2, type(tf.constant(0)))
print(serialized_bytes2)
print(bytes_feature(serialized_bytes2))
HR()


bytes3 = [np.random.bytes(2), np.random.bytes(3), np.random.bytes(4)]
print(type(bytes3))
serialized_bytes3 = tf.io.serialize_tensor(bytes3)
assert isinstance(serialized_bytes3, type(tf.constant(0)))
print(serialized_bytes3)
print(bytes_feature(serialized_bytes3))
HR()

tf.Tensor(b'\x08\x07\x12\x04\x12\x02\x08\x02B\x1ethis is sentence1 of byte-typeB\x1dthis is sentence2of byte-type', shape=(), dtype=string)
bytes_list {
  value: "\010\007\022\004\022\002\010\002B\036this is sentence1 of byte-typeB\035this is sentence2of byte-type"
}

----------------------------------------
<class 'tensorflow.python.framework.ops.EagerTensor'>
tf.Tensor(b'\x08\x07\x12\x04\x12\x02\x08\x03B\x03oneB\x03twoB\x05three', shape=(), dtype=string)
bytes_list {
  value: "\010\007\022\004\022\002\010\003B\003oneB\003twoB\005three"
}

----------------------------------------
<class 'list'>
tf.Tensor(b'\x08\x07\x12\x04\x12\x02\x08\x03B\x02\x97>B\x03\x82\xa0\xa0B\x04\x95\x06E\x05', shape=(), dtype=string)
bytes_list {
  value: "\010\007\022\004\022\002\010\003B\002\227>B\003\202\240\240B\004\225\006E\005"
}

----------------------------------------
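
To check that the byte strings can be recovered, a hedged sketch of the inverse step for string data follows. The sample sentences are made up for this example, and decoding as UTF-8 is an assumption that holds only for text data (random bytes such as `bytes3` above would generally not decode).

```python
import tensorflow as tf

# Made-up sample data for this sketch
sentences = [b"first sentence", b"second sentence"]

serialized = tf.io.serialize_tensor(sentences)

# String tensors are parsed back with out_type=tf.string
restored = tf.io.parse_tensor(serialized, out_type=tf.string)

# Each element comes back as a Python bytes object
decoded = [s.decode("utf-8") for s in restored.numpy()]
print(decoded)
```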


<a id='3.2'></a><a name="3.2"></a>
## 3.2 floats
<a href="#top">[back to top]</a>

In [25]:
float32_1 = [np.exp(1, dtype=np.float32), 2.3]
assert isinstance(float32_1, list) 
serialized_float32_1 = tf.io.serialize_tensor(float32_1)
print(serialized_float32_1)
print(bytes_feature(serialized_float32_1))
HR()


float64_1 = [np.exp(1, dtype=np.float64), 4.5]
assert isinstance(float64_1, list) 
serialized_float64_1 = tf.io.serialize_tensor(float64_1)
print(serialized_float64_1)
print(bytes_feature(serialized_float64_1))

tf.Tensor(b'\x08\x01\x12\x04\x12\x02\x08\x02"\x08U\xf8-@33\x13@', shape=(), dtype=string)
bytes_list {
  value: "\010\001\022\004\022\002\010\002\"\010U\370-@33\023@"
}

----------------------------------------
tf.Tensor(b'\x08\x02\x12\x04\x12\x02\x08\x02"\x10iW\x14\x8b\n\xbf\x05@\x00\x00\x00\x00\x00\x00\x12@', shape=(), dtype=string)
bytes_list {
  value: "\010\002\022\004\022\002\010\002\"\020iW\024\213\n\277\005@\000\000\000\000\000\000\022@"
}
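
The two outputs above differ in their second byte (`\x01` versus `\x02`), which is the dtype code stored in the serialized `TensorProto` (`DT_FLOAT` and `DT_DOUBLE`). The sketch below reads that field directly; `serialized_dtype` is a hypothetical helper written for this example, and the input values are arbitrary.

```python
import numpy as np
import tensorflow as tf
from tensorflow.core.framework import tensor_pb2


def serialized_dtype(serialized):
    """Hypothetical helper: read the dtype stored in a serialized TensorProto."""
    proto = tensor_pb2.TensorProto()
    proto.ParseFromString(serialized.numpy())
    return tf.dtypes.as_dtype(proto.dtype)


# Explicit dtypes are used here so the result does not depend on type inference
f32 = tf.io.serialize_tensor(np.array([1.5, 2.5], dtype=np.float32))
f64 = tf.io.serialize_tensor(np.array([1.5, 2.5], dtype=np.float64))

print(serialized_dtype(f32), serialized_dtype(f64))

# The recovered dtype can be fed back into parse_tensor as out_type
restored32 = tf.io.parse_tensor(f32, out_type=serialized_dtype(f32))
restored64 = tf.io.parse_tensor(f64, out_type=serialized_dtype(f64))
print(restored32.dtype, restored64.dtype)
```

Reading the dtype this way happens in Python on the raw bytes, so it suits inspection in a notebook rather than use inside a `tf.data` pipeline.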



<a id='3.3'></a><a name="3.3"></a>
## 3.3 int64s
<a href="#top">[back to top]</a>

In [32]:
bool_1 = np.array([True, False], dtype=bool)
assert isinstance(bool_1, np.ndarray) 
serialized_bool_1 = tf.io.serialize_tensor(bool_1)
print(serialized_bool_1)
print(bytes_feature(serialized_bool_1))
HR()


# Different Enum types: EnumMeta, Enum, IntEnum, Flag, IntFlag, auto, unique
import enum
# Need to pass an enum which is a subclass of int, so it will be compatible with tf.train.Int64List
class Color(enum.IntEnum):
    RED = 1
    GREEN = 2
    BLUE = 3
enum_list_1 = [Color.RED, Color.BLUE]    
assert isinstance(Color.RED, enum.IntEnum)
assert isinstance(enum_list_1, list)
serialized_enum_1 = tf.io.serialize_tensor(enum_list_1)
print(serialized_enum_1)
print(bytes_feature(serialized_enum_1))
HR()


int32_1 = np.array([
    np.int32(np.iinfo(np.int32).min), 
    np.int32(np.iinfo(np.int32).max)
], dtype=np.int32)
assert isinstance(int32_1, np.ndarray)
serialized_int32_1 = tf.io.serialize_tensor(int32_1)
print(serialized_int32_1)
print(bytes_feature(serialized_int32_1))
HR()


uint32_1 = np.array([
    np.uint32(np.iinfo(np.uint32).min), 
    np.uint32(np.iinfo(np.uint32).max)
], dtype=np.uint32)
assert isinstance(uint32_1, np.ndarray)
serialized_uint32_1 = tf.io.serialize_tensor(uint32_1)
print(serialized_uint32_1)
print(bytes_feature(serialized_uint32_1))
HR()


int64_1 = np.array([
    np.int64(np.iinfo(np.int64).min), 
    np.int64(np.iinfo(np.int64).max)
], dtype=np.int64)
assert isinstance(int64_1, np.ndarray)
serialized_int64_1 = tf.io.serialize_tensor(int64_1)
print(serialized_int64_1)
print(bytes_feature(serialized_int64_1))
HR()


uint64_1 = np.array([
    np.uint64(np.iinfo(np.uint64).min), 
    np.uint64(np.iinfo(np.uint64).max)
], dtype=np.uint64)
assert isinstance(uint64_1, np.ndarray)
serialized_uint64_1 = tf.io.serialize_tensor(uint64_1)
print(serialized_uint64_1)
print(bytes_feature(serialized_uint64_1))

tf.Tensor(b'\x08\n\x12\x04\x12\x02\x08\x02"\x02\x01\x00', shape=(), dtype=string)
bytes_list {
  value: "\010\n\022\004\022\002\010\002\"\002\001\000"
}

----------------------------------------
tf.Tensor(b'\x08\x03\x12\x04\x12\x02\x08\x02"\x08\x01\x00\x00\x00\x03\x00\x00\x00', shape=(), dtype=string)
bytes_list {
  value: "\010\003\022\004\022\002\010\002\"\010\001\000\000\000\003\000\000\000"
}

----------------------------------------
tf.Tensor(b'\x08\x03\x12\x04\x12\x02\x08\x02"\x08\x00\x00\x00\x80\xff\xff\xff\x7f', shape=(), dtype=string)
bytes_list {
  value: "\010\003\022\004\022\002\010\002\"\010\000\000\000\200\377\377\377\177"
}

----------------------------------------
tf.Tensor(b'\x08\x16\x12\x04\x12\x02\x08\x02"\x08\x00\x00\x00\x00\xff\xff\xff\xff', shape=(), dtype=string)
bytes_list {
  value: "\010\026\022\004\022\002\010\002\"\010\000\000\000\000\377\377\377\377"
}

----------------------------------------
tf.Tensor(b'\x08\t\x12\x04\x12\x02\x08\x02"\x10\x00\x00\x00\x00\