# TFRecord

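TFRecord is the data storage format recommended by TensorFlow. It is a simple binary format that contains a sequence of variable-length binary records. Each record consists of a length, a CRC checksum of the length, the data, and a CRC checksum of the data.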
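The record layout described above can be sketched in plain Python. This is an illustrative sketch, not TensorFlow's own implementation: it assumes the layout documented for TFRecord files (a little-endian uint64 length, a masked CRC-32C of the length, the payload, and a masked CRC-32C of the payload). The helper names `crc32c`, `masked_crc`, `frame_record` and `read_records` are made up for this example.

```python
import struct

def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli polynomial); slow but dependency-free."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def masked_crc(data: bytes) -> int:
    """Rotate the CRC right by 15 bits and add a constant (assumed TFRecord masking)."""
    crc = crc32c(data)
    return (((crc >> 15) | (crc << 17)) + 0xA282EAD8) & 0xFFFFFFFF

def frame_record(payload: bytes) -> bytes:
    """Wrap a payload as: length, CRC of length, data, CRC of data."""
    length = struct.pack('<Q', len(payload))
    return (length + struct.pack('<I', masked_crc(length))
            + payload + struct.pack('<I', masked_crc(payload)))

def read_records(blob: bytes):
    """Yield each payload from a byte string of framed records, checking both CRCs."""
    offset = 0
    while offset < len(blob):
        length_bytes = blob[offset:offset + 8]
        (length,) = struct.unpack('<Q', length_bytes)
        (length_crc,) = struct.unpack_from('<I', blob, offset + 8)
        if length_crc != masked_crc(length_bytes):
            raise ValueError('corrupted length field')
        payload = blob[offset + 12:offset + 12 + length]
        (data_crc,) = struct.unpack_from('<I', blob, offset + 12 + length)
        if data_crc != masked_crc(payload):
            raise ValueError('corrupted payload')
        yield payload
        offset += 12 + length + 4  # 8 + 4 header bytes, payload, 4 trailer bytes

blob = (frame_record(b'This is the first record')
        + frame_record(b'And this is the second record'))
records = list(read_records(blob))
print(records)
```

Each record therefore carries 16 bytes of framing overhead on top of its payload, which is why the format stays cheap even for many small records.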

In [2]:
import tensorflow as tf

In [2]:
with tf.io.TFRecordWriter('my_data.tfrecord') as f:
    f.write(b'This is the first record')
    f.write(b'And this is the second record')

In [3]:
filepaths = ['my_data.tfrecord']
dataset = tf.data.TFRecordDataset(filepaths)
for item in dataset:
    print(item)

tf.Tensor(b'This is the first record', shape=(), dtype=string)
tf.Tensor(b'And this is the second record', shape=(), dtype=string)



TFRecord files can be compressed, which is especially useful when they have to be loaded over a network.

In [4]:
options = tf.io.TFRecordOptions(compression_type='GZIP')
with tf.io.TFRecordWriter('my_compressed.tfrecord', options) as f:
    f.write(b'This is the first record')
    f.write(b'And this is the second record')
    
for item in tf.data.TFRecordDataset(['my_compressed.tfrecord'], compression_type='GZIP'):
    print(item)

tf.Tensor(b'This is the first record', shape=(), dtype=string)
tf.Tensor(b'And this is the second record', shape=(), dtype=string)
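As a rough illustration of what compression buys, the sketch below writes the same records with and without GZIP and compares the resulting file sizes. The helper `write_records`, the file names and the sample payload are made up for this example, and the actual savings depend entirely on the data.

```python
import os
import tempfile

import tensorflow as tf

def write_records(path, records, compression_type=None):
    """Write byte strings to a TFRecord file and return the file size in bytes."""
    options = tf.io.TFRecordOptions(compression_type=compression_type)
    with tf.io.TFRecordWriter(path, options) as writer:
        for record in records:
            writer.write(record)
    return os.path.getsize(path)

# Highly repetitive payloads compress well; real data may compress far less.
records = [b'This is a fairly repetitive record'] * 1000

with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, 'plain.tfrecord')
    gzip_path = os.path.join(tmp_dir, 'gzip.tfrecord')
    plain_size = write_records(plain_path, records)
    gzip_size = write_records(gzip_path, records, compression_type='GZIP')
    # The reader must be told the compression type as well.
    restored = [item.numpy() for item in
                tf.data.TFRecordDataset([gzip_path], compression_type='GZIP')]

print(plain_size, gzip_size)
```

The reader does not detect compression on its own, so the same `compression_type` has to be passed on both the writing and the reading side.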


## Introduction to Protobuf

TFRecord is usually used together with Protobuf (protocol buffers), a portable serialization format.

In [5]:
%%writefile person.proto
syntax = "proto3";
message Person {
    string name = 1;
    int32 id = 2;
    repeated string email = 3;
}

Writing person.proto


In [6]:
!protoc person.proto --python_out=. --descriptor_set_out=person.desc --include_imports

In [7]:
from person_pb2 import Person

person = Person(name='A1', id=123, email=['hi@hs.com'])
print(person)

name: "A1"
id: 123
email: "hi@hs.com"



In [11]:
person.email.append('hello@gu.com')

In [12]:
serialized = person.SerializeToString()
serialized

b'\n\x02A1\x10{\x1a\thi@hs.com\x1a\x0chello@gu.com'
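The serialized bytes can be read by hand, which shows why the format is portable. The sketch below decodes them in plain Python, assuming the standard Protobuf wire format: each field starts with a varint tag equal to `(field_number << 3) | wire_type`, followed by a varint (wire type 0) or a length-prefixed byte string (wire type 2). The helpers `read_varint` and `decode_fields` are made up for this example and handle only these two wire types.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # the high bit is clear on the last byte
            return result, pos
        shift += 7

def decode_fields(buf):
    """Yield (field_number, wire_type, value) for varint and length-delimited fields."""
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x7
        if wire_type == 0:    # varint, e.g. int32 id = 2
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited, e.g. string name = 1
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise NotImplementedError(f'wire type {wire_type} is not handled in this sketch')
        yield field_number, wire_type, value

serialized = b'\n\x02A1\x10{\x1a\thi@hs.com\x1a\x0chello@gu.com'
fields = list(decode_fields(serialized))
for field in fields:
    print(field)
```

The repeated `email` field simply shows up as two separate entries with field number 3, and field names never appear in the bytes; only the numbers from `person.proto` do.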

In [13]:
person2 = Person()
person2.ParseFromString(serialized)

31

In [16]:
person_tf = tf.io.decode_proto(
    bytes = serialized,
    message_type = 'Person',
    field_names = ['name', 'id', 'email'],
    output_types = [tf.string, tf.int32, tf.string],
    descriptor_source = 'person.desc'
)

person_tf.values

[<tf.Tensor: shape=(1,), dtype=string, numpy=array([b'A1'], dtype=object)>,
 <tf.Tensor: shape=(1,), dtype=int32, numpy=array([123], dtype=int32)>,
 <tf.Tensor: shape=(2,), dtype=string, numpy=array([b'hi@hs.com', b'hello@gu.com'], dtype=object)>]

## TensorFlow's built-in Example Protobuf

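For the vast majority of use cases there is no need to define a custom Protobuf; the built-in ones are sufficient. The most important of them is Example, which is defined as follows: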

```proto
syntax = "proto3";

message BytesList { repeated bytes value = 1; }
message FloatList { repeated float value = 1 [packed = true]; }
message Int64List { repeated int64 value = 1 [packed = true]; }
message Feature {
    oneof kind {
        BytesList bytes_list = 1;
        FloatList float_list = 2;
        Int64List int64_list = 3;
    }
};
message Features { map<string, Feature> feature = 1; };
message Example { Features features = 1; };
```


In [3]:
from tensorflow.train import BytesList, FloatList, Int64List
from tensorflow.train import Feature, Features, Example

person_example = Example(
    features = Features(
        feature = {
            'name': Feature(bytes_list = BytesList(value = [b'A1'])),
            'id': Feature(int64_list = Int64List(value = [123])),
            'emails': Feature(bytes_list = BytesList(value = [b'hi@hs.com', b'hello@gu.com']))
        }
    )
)

with tf.io.TFRecordWriter('my_contacts.tfrecord') as f:
    for _ in range(5):
        f.write(person_example.SerializeToString())

In [19]:
feature_description = {
    'name': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'id': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'emails': tf.io.VarLenFeature(tf.string),
}

def parse(serialized_example):
    return tf.io.parse_single_example(serialized_example, feature_description)

dataset = tf.data.TFRecordDataset(['my_contacts.tfrecord']).map(parse)
for parsed_example in dataset:
    print(parsed_example)

{'emails': <tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x7f9e92b57310>, 'id': <tf.Tensor: shape=(), dtype=int64, numpy=123>, 'name': <tf.Tensor: shape=(), dtype=string, numpy=b'A1'>}
{'emails': <tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x7f9e92b573d0>, 'id': <tf.Tensor: shape=(), dtype=int64, numpy=123>, 'name': <tf.Tensor: shape=(), dtype=string, numpy=b'A1'>}
{'emails': <tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x7f9e92b57dc0>, 'id': <tf.Tensor: shape=(), dtype=int64, numpy=123>, 'name': <tf.Tensor: shape=(), dtype=string, numpy=b'A1'>}
{'emails': <tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x7f9e92b57340>, 'id': <tf.Tensor: shape=(), dtype=int64, numpy=123>, 'name': <tf.Tensor: shape=(), dtype=string, numpy=b'A1'>}
{'emails': <tensorflow.python.framework.sparse_tensor.SparseTensor object at 0x7f9e92b575b0>, 'id': <tf.Tensor: shape=(), dtype=int64, numpy=123>, 'name': <tf.Tensor: shape=(), dty
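The variable-length feature comes back as a SparseTensor, which is hard to read when printed. The sketch below, with made-up sample data and a hypothetical helper `make_example`, parses a whole batch at once with `tf.io.parse_example` and converts the sparse result to a dense tensor. Shorter rows are padded with an empty string.

```python
import tensorflow as tf

def make_example(name, person_id, emails):
    """Build an Example with the same three features as above (sample data only)."""
    return tf.train.Example(features=tf.train.Features(feature={
        'name': tf.train.Feature(bytes_list=tf.train.BytesList(value=[name])),
        'id': tf.train.Feature(int64_list=tf.train.Int64List(value=[person_id])),
        'emails': tf.train.Feature(bytes_list=tf.train.BytesList(value=emails)),
    }))

serialized_batch = [
    make_example(b'A1', 123, [b'hi@hs.com', b'hello@gu.com']).SerializeToString(),
    make_example(b'B2', 456, [b'b2@hs.com']).SerializeToString(),
]

feature_description = {
    'name': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'id': tf.io.FixedLenFeature([], tf.int64, default_value=0),
    'emails': tf.io.VarLenFeature(tf.string),
}

# parse_example handles a batch of serialized Examples in one call.
parsed = tf.io.parse_example(serialized_batch, feature_description)
emails_dense = tf.sparse.to_dense(parsed['emails'], default_value=b'')
ids = parsed['id'].numpy().tolist()

print(ids)
print(emails_dense.numpy())
```

The keys in `feature_description` must match the keys used when the Example was written; a key that is absent from the record yields the default value or an empty sparse tensor rather than an error.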

## SequenceExample Protobuf

SequenceExample is a more complex Protobuf. Its proto definition is:

```proto
syntax = "proto3";

message FeatureList { repeated Feature feature = 1; };
message FeatureLists { map<string, FeatureList> feature_list = 1; };
message SequenceExample {
    Features context = 1;
    FeatureLists feature_lists = 2;
};
```

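Below is an example for an article page. It contains context information such as the author, title, and publication date, as well as the content (split into sentences and tokens) and the comments.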

In [31]:
from tensorflow.train import FeatureList, FeatureLists, SequenceExample

context = Features(feature={
    'author_id': Feature(int64_list = Int64List(value = [123])),
    'title': Feature(bytes_list = BytesList(value = [b'A', b'desert', b'.'])),
    'pub_date': Feature(int64_list = Int64List(value = [1623, 12, 25]))
})

content = [['When', 'shall', 'we', 'three', 'meet', 'again', '?'],
            ['In', 'thunder', ',', "lightning", ",", "or", "in", "rain", "?"]]
comments = [["When", "the", "hurlyburly", "'s", "done", "."],
            ["When", "the", "battle", "'s", "lost", "and", "won", "."]]

def words_to_feature(words):
    return Feature(bytes_list = BytesList(value = [word.encode('utf-8') for word in words]))

content_features = [words_to_feature(sentence) for sentence in content]
comment_features = [words_to_feature(comment) for comment in comments]

sequence_example = SequenceExample(
    context = context,
    feature_lists = FeatureLists(feature_list = {
        'content': FeatureList(feature = content_features),
        'comments': FeatureList(feature = comment_features)
    })
)


In [32]:
sequence_example

context {
  feature {
    key: "author_id"
    value {
      int64_list {
        value: 123
      }
    }
  }
  feature {
    key: "pub_date"
    value {
      int64_list {
        value: 1623
        value: 12
        value: 25
      }
    }
  }
  feature {
    key: "title"
    value {
      bytes_list {
        value: "A"
        value: "desert"
        value: "."
      }
    }
  }
}
feature_lists {
  feature_list {
    key: "comments"
    value {
      feature {
        bytes_list {
          value: "When"
          value: "the"
          value: "hurlyburly"
          value: "\'s"
          value: "done"
          value: "."
        }
      }
      feature {
        bytes_list {
          value: "When"
          value: "the"
          value: "battle"
          value: "\'s"
          value: "lost"
          value: "and"
          value: "won"
          value: "."
        }
      }
    }
  }
  feature_list {
    key: "content"
    value {
      feature {
        bytes_list {
        

In [33]:
serialized_sequence_example = sequence_example.SerializeToString()

In [34]:
context_feature_descriptions = {
    'author_id': tf.io.FixedLenFeature([], tf.int64, default_value = 0),
    'title': tf.io.VarLenFeature(tf.string),
    'pub_date': tf.io.FixedLenFeature([3], tf.int64, default_value = [0, 0, 0])
}

sequence_feature_descriptions = {
    'content': tf.io.VarLenFeature(tf.string),
    'comments': tf.io.VarLenFeature(tf.string)
}

parsed_context, parsed_feature_lists = tf.io.parse_single_sequence_example(
    serialized_sequence_example, context_feature_descriptions,
    sequence_feature_descriptions)
parsed_content = tf.RaggedTensor.from_sparse(parsed_feature_lists['content'])

In [35]:
parsed_context

{'title': <tensorflow.python.framework.sparse_tensor.SparseTensor at 0x7f9fc85651c0>,
 'author_id': <tf.Tensor: shape=(), dtype=int64, numpy=123>,
 'pub_date': <tf.Tensor: shape=(3,), dtype=int64, numpy=array([1623,   12,   25])>}

In [36]:
parsed_context['title'].values

<tf.Tensor: shape=(3,), dtype=string, numpy=array([b'A', b'desert', b'.'], dtype=object)>

In [37]:
parsed_content

<tf.RaggedTensor [[b'When', b'shall', b'we', b'three', b'meet', b'again', b'?'], [b'In', b'thunder', b',', b'lightning', b',', b'or', b'in', b'rain', b'?']]>
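The same conversion applies to any other feature list, such as the comments. The self-contained sketch below rebuilds a smaller SequenceExample (only `author_id` and `comments`, chosen for brevity) and turns the parsed sparse tensor into a RaggedTensor so that each comment keeps its own length.

```python
import tensorflow as tf

def words_to_feature(words):
    """Encode a list of tokens as a single bytes_list Feature."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(
        value=[word.encode('utf-8') for word in words]))

comments = [['When', 'the', 'hurlyburly', "'s", 'done', '.'],
            ['When', 'the', 'battle', "'s", 'lost', 'and', 'won', '.']]

sequence_example = tf.train.SequenceExample(
    context=tf.train.Features(feature={
        'author_id': tf.train.Feature(int64_list=tf.train.Int64List(value=[123])),
    }),
    feature_lists=tf.train.FeatureLists(feature_list={
        'comments': tf.train.FeatureList(
            feature=[words_to_feature(comment) for comment in comments]),
    }))

parsed_context, parsed_feature_lists = tf.io.parse_single_sequence_example(
    sequence_example.SerializeToString(),
    context_features={'author_id': tf.io.FixedLenFeature([], tf.int64)},
    sequence_features={'comments': tf.io.VarLenFeature(tf.string)})

# One row per comment, one entry per token, no padding.
parsed_comments = tf.RaggedTensor.from_sparse(parsed_feature_lists['comments'])
comment_lengths = parsed_comments.row_lengths().numpy().tolist()
print(comment_lengths)
```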

## ExampleListWithContext Protobuf

This Protobuf is common in ranking tasks. It is somewhat similar to SequenceExample, but its structure is less compact.

```proto
syntax = "proto3";

message ExampleListWithContext {
    repeated Example examples = 1;
    Example context = 2;
};
```

In [4]:
context = Example(features = Features(feature = {
    'query_tokens': Feature(bytes_list = BytesList(value = [b'this', b'is', b'a', b'relevant', b'question'])),
}))

document1 = Example(features = Features(feature = {
    'document_tokens': Feature(bytes_list = BytesList(value = [b'this', b'is', b'a', b'relevant', b'answer'])),
    'relevance': Feature(int64_list = Int64List(value = [5]))
}))

document2 = Example(features = Features(feature = {
    'document_tokens': Feature(bytes_list = BytesList(value = [b'irrelevant', b'data'])),
    'relevance': Feature(int64_list = Int64List(value = [1]))
}))

In [5]:
from tensorflow_serving.apis.input_pb2 import ExampleListWithContext as ELWC

elwc = ELWC(examples = [document1, document2], context=context)
elwc

examples {
  features {
    feature {
      key: "document_tokens"
      value {
        bytes_list {
          value: "this"
          value: "is"
          value: "a"
          value: "relevant"
          value: "answer"
        }
      }
    }
    feature {
      key: "relevance"
      value {
        int64_list {
          value: 5
        }
      }
    }
  }
}
examples {
  features {
    feature {
      key: "document_tokens"
      value {
        bytes_list {
          value: "irrelevant"
          value: "data"
        }
      }
    }
    feature {
      key: "relevance"
      value {
        int64_list {
          value: 1
        }
      }
    }
  }
}
context {
  features {
    feature {
      key: "query_tokens"
      value {
        bytes_list {
          value: "this"
          value: "is"
          value: "a"
          value: "relevant"
          value: "question"
        }
      }
    }
  }
}

In [6]:
serialized_elwc = elwc.SerializeToString()

In [9]:
context_feature_descriptions = {
    'query_tokens': tf.io.RaggedFeature(tf.string)
}

example_feature_descriptions = {
    'document_tokens': tf.io.RaggedFeature(tf.string),
    'relevance': tf.io.FixedLenFeature([], tf.int64)
}

import tensorflow_ranking as tfr

parsed_elwc = tfr.data.parse_from_example_list(
    serialized=[serialized_elwc],
    list_size = 2,
    context_feature_spec=context_feature_descriptions,
    example_feature_spec=example_feature_descriptions,
    size_feature_name = '_list_size_',
    mask_feature_name = '_mask_'
)

parsed_elwc

{'relevance': <tf.Tensor: shape=(1, 2), dtype=int64, numpy=array([[5, 1]])>,
 'document_tokens': <tf.RaggedTensor [[[b'this', b'is', b'a', b'relevant', b'answer'], [b'irrelevant', b'data']]]>,
 'query_tokens': <tf.RaggedTensor [[b'this', b'is', b'a', b'relevant', b'question']]>,
 '_list_size_': <tf.Tensor: shape=(1,), dtype=int32, numpy=array([2], dtype=int32)>,
 '_mask_': <tf.Tensor: shape=(1, 2), dtype=bool, numpy=array([[ True,  True]])>}