Buffers

A buffer is a fixed-size block of bytes. Buffers are useful when you don't wish to calculate padding manually, or when you know the size of the data but not its exact contents.

Here we have a collection of various fields contained within a 100-byte block. The fields don't total 100 bytes, so we need to add padding.

class MyRecord < BinData::Record
  endian :little

  struct :one_hundred_bytes do
    uint8  :a
    string :b, length: 10
    uint16 :c
    ...
    string :padding, length: -> { 100 - padding.rel_offset }
  end
end

Using a buffer removes the need to calculate this padding.

class MyRecord < BinData::Record
  endian :little

  buffer :one_hundred_bytes, length: 100 do
    uint8  :a
    string :b, length: 10
    uint16 :c
    ...
  end
end

The amount of padding in a buffer can be calculated if needed.

buffer.num_bytes - buffer.raw_num_bytes
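
For example, with the buffered record above (counting only the three fields shown and ignoring the elided ones), a hypothetical session would be:

rec = MyRecord.new
rec.one_hundred_bytes.num_bytes     #=> 100
rec.one_hundred_bytes.raw_num_bytes #=> 13   (uint8 + 10 byte string + uint16)
# the implicit padding is 100 - 13 = 87 bytes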

A buffer is also useful when an array is defined by a number of bytes rather than a number of elements.

class StringTable < BinData::Record
  uint16le :table_size_in_bytes
  buffer :strings, length: :table_size_in_bytes do
    array type: :stringz, read_until: :eof
  end
end
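
As a sketch, reading a table of three null-terminated strings (with made-up sample bytes) would look like:

st = StringTable.read("\x0C\x00foo\x00bar\x00baz\x00")
st.table_size_in_bytes #=> 12
st.strings             #=> ["foo", "bar", "baz"]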

Sections

A Section is a layer on top of a stream that transforms the underlying data. This allows BinData to process a stream that has multiple encodings, e.g. when some data in the stream is compressed or encrypted.

There are several common transforms provided in bindata/transform.

Here is an example using the built-in zlib transform.

require 'bindata/transform/zlib'

class ZlibRecord < BinData::Record
  int32le :section_len, value: -> { s.num_bytes }
  section :s, transform: -> { BinData::Transform::Zlib.new(section_len) } do
    int32le :len, value: -> { str.length }
    string :str, read_length: :len
  end
end

obj = ZlibRecord.new
obj.s.str = "highly compressible" * 100
obj.num_bytes #=> 51
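
As a quick check, the record round-trips through the transform:

obj2 = ZlibRecord.read(obj.to_binary_s)
obj2.s.str == "highly compressible" * 100 #=> true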

If the transform you need is not provided, you will have to write the transformation code yourself, but fortunately this is not difficult.

Here is an example of an XOR-encrypted stream.

class XorTransform < BinData::IO::Transform
  def initialize(xor)
    super()
    @xor = xor
  end

  # XOR each byte read from the underlying stream
  def read(n)
    chain_read(n).bytes.map { |byte| (byte ^ @xor).chr }.join
  end

  # XOR each byte before passing it to the underlying stream
  def write(data)
    chain_write(data.bytes.map { |byte| (byte ^ @xor).chr }.join)
  end
end

obj = BinData::Section.new(transform: -> { XorTransform.new(0xff) },
                           type: [:string, read_length: 5])

obj.read("\x97\x9A\x93\x93\x90") #=> "hello"
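
Writing applies the transform in the other direction; the object just read serializes back to the original encrypted bytes:

obj.to_binary_s #=> "\x97\x9A\x93\x93\x90"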

Skipping over unused data

Some structures contain binary data that is irrelevant to your purposes.

Say you are interested in 50 bytes of data located 10 megabytes into the stream. One way of accessing this useful data is:

class MyData < BinData::Record
  string length: 10 * 1024 * 1024
  string :data, length: 50
end

The advantage of this method is that the irrelevant data is preserved when writing the record. The disadvantage is that the irrelevant data occupies memory even if you don't care about preserving it.

If you don't need to preserve this data, an alternative is to use skip instead of string. When reading, it will seek over the irrelevant data without consuming memory. When writing, it will write :length zero bytes.

class MyData < BinData::Record
  skip length: 10 * 1024 * 1024
  string :data, length: 50
end

Skip also has a :to_abs_offset convenience option as an alternative to :length. It will advance the stream to the given absolute offset. Skipping backwards is not supported; if you need to skip backwards, refer to Multi-pass I/O below.
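
For instance, the record above could be sketched with :to_abs_offset as:

class MyData < BinData::Record
  skip to_abs_offset: 10 * 1024 * 1024
  string :data, length: 50
end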

Sometimes you don't know the offset at which the data is located. You can skip directly to the data by using assertions to specify a pattern to search for. Be as specific as possible to avoid false matches.

class MyData < BinData::Record
  endian :little

  # skip to 'ST' followed by int16 between 10 and 1000
  skip do
    string read_length: 2, asserted_value: "ST"
    uint16 assert: -> { (10..1000).include? value }
  end

  # we are now positioned correctly
  string :sig, read_length: 2
  uint16 :count
  array  :my_data, initial_length: :count
  ...
end

Determining stream length

Some file formats don't use length fields but rather read until the end of the file. The stream length is needed when reading these formats. The count_bytes_remaining keyword will give the number of bytes remaining in the stream.

Consider a string followed by a 2 byte checksum. The length of the string is not specified but is implied by the file length.

class StringWithChecksum < BinData::Record
  count_bytes_remaining :bytes_remaining
  string :the_string, read_length: -> { bytes_remaining - 2 }
  int16le :checksum
end
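
As a sketch with made-up bytes:

swc = StringWithChecksum.read("hello\x34\x12")
swc.the_string #=> "hello"
swc.checksum   #=> 4660  (0x1234)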

These file formats only work with seekable streams (e.g. files). They do not stream well, as the client must buffer all the data before processing it. When creating a new file format, consider using an explicit length, as it is easier to work with.

Multi-pass I/O

BinData optimises for single pass file formats. A single pass format is one that is processed as a sequential stream of bytes.

Some file formats require multi-pass I/O, i.e. they require seeking backwards in the stream. BinData provides for this with delayed_io fields. These are similar to virtual fields in that they aren't read or written along with the other fields.

Let's consider a file format that contains a directory of people, represented by name and age. The person records are of variable length, so an offset array is provided for easy lookup.

class Person < BinData::Record
  uint8  :name_len, value: -> { name.length }
  string :name,     read_length: :name_len
  uint8  :age
end

class Directory < BinData::Record
  endian :little
  uint32 :num_entries
  array  :offsets, type: :uint32, initial_length: :num_entries

  array  :people, initial_length: :num_entries do
    delayed_io type: :person, read_abs_offset: -> { offsets[index] }
  end
end

This file format is multi-pass, as the offsets won't necessarily refer to the person data sequentially, e.g. the people could be stored in alphabetical order while the offsets order them by age.

Reading the directory will read the offsets field, but will not read any person data.

d = Directory.read(io)

d.num_entries       #=> 200
d.offsets.length    #=> 200
d.offsets.num_bytes #=> 800

d.people.length     #=> 200
d.people.num_bytes  #=>   0

d.people[50]        #=> { name_len: 0, name: "", age: 0 }

The person data won't be read until explicitly requested.

d = Directory.read(io)
d.people[50].read_now!
d.people[50] #=> { name_len: 13, name: "Charlie Brown", age: 8 }

To request all entries we could read like this:

d = Directory.read(io) { |obj| obj.people.each { |per| per.read_now! } }
d.people[50] #=> { name_len: 13, name: "Charlie Brown", age: 8 }

However, if we want BinData to perform the multiple passes for us automatically, we can use the auto_call_delayed_io keyword.

class Directory < BinData::Record
  auto_call_delayed_io

  endian :little
  uint32 :num_entries
  array  :offsets, type: :uint32, initial_length: :num_entries

  array  :people, initial_length: :num_entries do
    delayed_io type: :person, read_abs_offset: -> { offsets[index] }
  end
end

d = Directory.read(io)
d.people[50] #=> { name_len: 13, name: "Charlie Brown", age: 8 }

Multi-pass writing is similar in that writing is not performed until #write_now! is called.
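
A rough sketch, assuming the stream remains open for the second pass:

d = Directory.new  # populated elsewhere
File.open("directory.dat", "wb") do |io|
  d.write(io)                             # first pass: everything except the delayed_io fields
  d.people.each { |per| per.write_now! }  # second pass: seek to each offset and write
end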

Note that #num_bytes may behave unexpectedly when using delayed_io. It behaves as normal when the auto_call_delayed_io keyword is used. Without this keyword, #num_bytes will only return the number of bytes for a single pass, as the number of passes hasn't been specified yet.
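
Continuing the directory example above (without auto_call_delayed_io):

d = Directory.read(io)
d.num_bytes #=> 804   (4 + 800; the delayed person data is not counted)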