# Using Construct

The [construct](https://construct.readthedocs.io) library has replaced the previously used [enstructured](http://github.com/Defense-Cyber-Crime-Center/DC3-MWCP/blob/fe38f8eacc99dfce5b9fb4bf6fa0b63a697529dd/mwcp/resoures/enstructured.md) library for parsing binary data. It supports everything enstructured does and expands upon it with features such as validation, more advanced types, easy extension, and a cleaner syntax.

This tutorial is designed to familiarize you with the basic functionality of construct and show how to translate your knowledge of enstructured into construct. It does not attempt to cover every use of construct. For more in-depth information, I highly recommend reading the documentation at [construct.readthedocs.io](https://construct.readthedocs.io).

## TOC
- [Importing](#importing)
- [Parsing](#parsing)
- [Structs](#structs)
- [Integers](#integers)
- [Bytes](#bytes)
- [Strings](#strings)
- [Offsetting and Skipping](#offsetting-and-skipping)
- [SkipNull](#skipnull)
- [Subfields](#subfields)
- [Dynamic Parameters](#dynamic-parameters)
- [Formatting Values](#formatting-values)
- [Enums](#enums)
- [Switches](#switches)
- [Ranges (lists)](#ranges-lists)
- [Computed Values](#computed-values)
- [Delimited](#delimited)
- [PE Physical Address](#pe-physical-address)
- [PE Physical Address 64-bit](#pe-physical-address-64-bit)
- [Regular Expressions](#regular-expressions)
- [HTML Documentation](#html-documentation)
- [Building](#building)
- [Validation](#validation)
- [Debugging](#debugging)

## Importing

Along with the standard construct library, DC3-MWCP provides some extra helper utilities in the `mwcp.utils.construct` module. When you import with `from mwcp.utils import construct`, you get everything that the construct library includes, plus extra helper functions and types that have been developed specifically for parsing malware.

*For the rest of this documentation, you can assume the following was used to import.*

In [2]:
from mwcp.utils import construct
from mwcp.resources import enstructured  # This library has been deprecated and will be removed in the next release.

## Parsing

Every element in construct has a "parse()" function that will take data and return the parsed results for that data.

In [3]:
spec = [[enstructured.DWORD]]

enstructured.Extractor(data=b'\x01\x00\x00\x00', specification=spec).extract_members()

{'member0': {'formatted_value': 1,
  'ignore': False,
  'index': 0,
  'length': 4,
  'location': 0,
  'offset': 0,
  'type': 'uint32',
  'value': 1}}

In [4]:
spec = construct.Int32ul

spec.parse(b'\x01\x00\x00\x00')

1

By default, the result contains only the parsed value and does not include the metadata that you see in enstructured. However, you can get this information by wrapping your construct element with `RawCopy`.

(The originally parsed value is then available through the `value` attribute.)

In [5]:
spec = construct.RawCopy(construct.Int32ul)

spec.parse(b'\x01\x00\x00\x00')

Container(data=b'\x01\x00\x00\x00')(value=1)(offset1=0)(offset2=4)(length=4)

The `Container` class that gets returned by construct works just like a dictionary, but also allows for accessing and setting the keys by attribute. 

In [6]:
spec = construct.RawCopy(construct.Int32ul)
result = spec.parse(b'\x01\x00\x00\x00')

# Retrieving results.
print(result['offset1'], result['length'], result['value'])
print(result.offset1, result.length, result.value)

# Setting new results.
result.value = 2
print(result)

0 4 1
0 4 1
Container: 
    data = \x01\x00\x00\x00 (total 4)
    value = 2
    offset1 = 0
    offset2 = 4
    length = 4


## Structs

A `Struct` is the equivalent of a specification in enstructured. It allows you to contain multiple elements and optionally name elements using the `/` operator.

*NOTE: If you don't give an element a name, it will still be parsed, but its result will not show up. You would usually leave an element nameless if you don't care about it (an unknown field) or if you are performing validation. (More on that later.)*

In [7]:
spec = [
    [enstructured.BYTE, 'a'],
    [enstructured.WORD, 'b'],
    [enstructured.DWORD]
]

enstructured.Extractor(data=b'\x0A\x0B\x00\x0C\x00\x00\x00', specification=spec).extract_members()

{'a': {'formatted_value': 10,
  'ignore': False,
  'index': 0,
  'length': 1,
  'location': 0,
  'offset': 0,
  'type': 'uint8',
  'value': 10},
 'b': {'formatted_value': 11,
  'ignore': False,
  'index': 1,
  'length': 2,
  'location': 1,
  'offset': 1,
  'type': 'uint16',
  'value': 11},
 'member2': {'formatted_value': 12,
  'ignore': False,
  'index': 2,
  'length': 4,
  'location': 3,
  'offset': 3,
  'type': 'uint32',
  'value': 12}}

In [8]:
spec = construct.Struct(
    'a' / construct.Byte,
    'b' / construct.Int16ul,
    construct.Int32ul
)

spec.parse(b'\x0A\x0B\x00\x0C\x00\x00\x00')

Container(a=10)(b=11)

## Integers

Integers work very similarly to enstructured. However, the biggest difference between the two libraries is that construct allows you to specify the endianness per element.

The enstructured library requires you to specify endianness during parsing for all elements.

In construct, you can choose integers from a variety of bit lengths and endianness. (e.g. `Int32ul`, `Int16ul`, `Int32sb`, `Int32sl`, etc.) The `mwcp.utils.construct` helper module has aliased some of these formats to "WORD", "DWORD", etc. to match the style used in enstructured.

`construct.Byte` is equivalent to `construct.Int8ul`, just as in enstructured.




In [9]:
spec = [
    [enstructured.WORD, 'a'],
    [enstructured.DWORD, 'b'],
    ['int16', 'c']   # Not possible to have this be big endian!
]

enstructured.Extractor(data=b'\x0A\x00\x0B\x00\x00\x00\x00\x0C', specification=spec).extract_members()

{'a': {'formatted_value': 10,
  'ignore': False,
  'index': 0,
  'length': 2,
  'location': 0,
  'offset': 0,
  'type': 'uint16',
  'value': 10},
 'b': {'formatted_value': 11,
  'ignore': False,
  'index': 1,
  'length': 4,
  'location': 2,
  'offset': 2,
  'type': 'uint32',
  'value': 11},
 'c': {'formatted_value': 3072,
  'ignore': False,
  'index': 2,
  'length': 2,
  'location': 6,
  'offset': 6,
  'type': 'int16',
  'value': 3072}}

In [10]:
enstructured.Extractor(data=b'\x0A\x00\x0B\x00\x00\x00\x00\x0C', specification=spec, endian=enstructured.BIG_ENDIAN).extract_members()

{'a': {'formatted_value': 2560,
  'ignore': False,
  'index': 0,
  'length': 2,
  'location': 0,
  'offset': 0,
  'type': 'uint16',
  'value': 2560},
 'b': {'formatted_value': 184549376,
  'ignore': False,
  'index': 1,
  'length': 4,
  'location': 2,
  'offset': 2,
  'type': 'uint32',
  'value': 184549376},
 'c': {'formatted_value': 12,
  'ignore': False,
  'index': 2,
  'length': 2,
  'location': 6,
  'offset': 6,
  'type': 'int16',
  'value': 12}}

In [11]:
spec = construct.Struct(
    'a' / construct.WORD,
    'b' / construct.DWORD,
    'c' / construct.Int16sb
)

spec.parse(b'\x0A\x00\x0B\x00\x00\x00\x00\x0C')

Container(a=10)(b=11)(c=12)
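
For intuition, per-element endianness can also be sketched with only the standard library `struct` module. This is an illustration of the idea, not how construct is implemented; the format characters (`<`, `>`, `H`, `I`, `h`) belong to `struct`, not to construct.

```python
import struct

data = b'\x0A\x00\x0B\x00\x00\x00\x00\x0C'

# '<' selects little-endian; H = unsigned 16-bit, I = unsigned 32-bit.
a, b = struct.unpack_from('<HI', data, 0)

# '>' selects big-endian; h = signed 16-bit. Field 'c' starts at offset 6.
(c,) = struct.unpack_from('>h', data, 6)

print(a, b, c)
```

Each field picks its own byte order, which is what the `Int16sb` element does in the construct spec above.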

## Bytes

Bytes work the same way as in enstructured. You can specify the length of the bytes by supplying a parameter to `Bytes`.

In [12]:
spec = [[enstructured.BYTES, 'my_data', {'length': 3}]]

enstructured.Extractor(data=b'\x0A\x0B\x0C', specification=spec).extract_members()

3


{'my_data': {'formatted_value': b'\n\x0b\x0c',
  'ignore': False,
  'index': 0,
  'length': 3,
  'location': 0,
  'offset': 0,
  'params': {'length': 3},
  'type': 'bytes',
  'value': b'\n\x0b\x0c'}}

In [13]:
spec = construct.Struct(
    'my_data' / construct.Bytes(3)
)

spec.parse(b'\x0A\x0B\x0C')

Container(my_data=b'\n\x0b\x0c')

## Strings

Strings work slightly differently in construct than they do in enstructured.

Instead of having multiple different member types ("STR", "STR16", "STR32", etc.), construct uses the single "String" type and allows you to specify the encoding as a parameter. (i.e. STR16 == 'utf-16' and STR32 == 'utf-32')

*NOTE: By default, construct strips off any null characters before decoding the string. If you are using `utf-16` or `utf-32` (which are actually `utf-16-le` and `utf-32-le`), you will need to change the direction of the padding to the left. Otherwise, the null bytes that belong to the last character are cut from the right side, which causes a decode error. The helper functions `String16` and `String32` have been created to do this for you.*

In [14]:
spec = [
    [enstructured.STR, 'greeting', {'length': 5}],
    [enstructured.STR16, 'farewell', {'length': 14}]
]

enstructured.Extractor(data=b'hellog\x00o\x00o\x00d\x00b\x00y\x00e\x00', specification=spec).extract_members()

5


In [15]:
spec = construct.Struct(
    'greeting' / construct.String(5),
    'farewell' / construct.String(14, encoding='utf-16', paddir='left')  # OR construct.String16(14)
)
spec.parse(b'hellog\x00o\x00o\x00d\x00b\x00y\x00e\x00')

Container(greeting='hello')(farewell='goodbye')
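
The decode error mentioned in the note can be reproduced with plain `bytes` methods. This sketch assumes the null stripping behaves like `bytes.rstrip` / `bytes.lstrip`; it does not use construct itself.

```python
raw = 'goodbye'.encode('utf-16-le')  # 14 bytes, ending in b'e\x00'

# Stripping nulls from the right also removes the high byte of the final 'e',
# leaving an odd number of bytes.
right_stripped = raw.rstrip(b'\x00')

try:
    right_stripped.decode('utf-16-le')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Stripping from the left leaves this value intact, so it decodes cleanly.
left_stripped = raw.lstrip(b'\x00').decode('utf-16-le')

print(len(right_stripped), decode_failed, left_stripped)
```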

To parse a null-terminated string of unknown length, you can use the `CString` subconstruct.

In [16]:
spec = [[enstructured.STR]]

enstructured.Extractor(data=b'hello world\x00otherstuff', specification=spec).extract_members()

In [17]:
spec = construct.CString()

spec.parse(b'hello world\x00otherstuff')

b'hello world'

## Offsetting and Skipping

Offsetting in construct is done differently from enstructured. Construct cannot jump to an offset relative to the start of the struct it is currently parsing. Instead, you can use `Seek` or `Pointer` to jump to a position relative to either the current position or the start of the data, or use `Padding` to consume a given number of bytes before parsing the next element.

When using `Pointer`, your position is restored after you have finished parsing the given element.

In [18]:
spec = [
    [enstructured.BYTE, 'a'],
    [enstructured.BYTE, 'b', {'offset': 7}],
    [enstructured.BYTE, 'c']
]

enstructured.Extractor(data=b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09', specification=spec).extract_members()

7


{'a': {'formatted_value': 0,
  'ignore': False,
  'index': 0,
  'length': 1,
  'location': 0,
  'offset': 0,
  'type': 'uint8',
  'value': 0},
 'b': {'formatted_value': 7,
  'ignore': False,
  'index': 1,
  'length': 1,
  'location': 7,
  'offset': 7,
  'type': 'uint8',
  'value': 7},
 'c': {'formatted_value': 8,
  'ignore': False,
  'index': 2,
  'length': 1,
  'location': 8,
  'offset': 8,
  'type': 'uint8',
  'value': 8}}

In [19]:
import os

spec = construct.Struct(
    'a' / construct.Byte,
    construct.Seek(7, whence=os.SEEK_SET),
    'b' / construct.Byte,
    'c' / construct.Byte
)

spec.parse(b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09')

Container(a=0)(b=7)(c=8)

In [20]:
spec = construct.Struct(
    'a' / construct.Byte,
    construct.Padding(6),  # Pattern can be set and enforced as a form of validation.
    'b' / construct.Byte,
    'c' / construct.Byte
)

spec.parse(b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09')

Container(a=0)(b=7)(c=8)

In [21]:
spec = construct.Struct(
    'a' / construct.Byte,
    'b' / construct.Pointer(7, construct.Byte),  # Pointer can be dynamically computed.
    'c' / construct.Byte,
)

spec.parse(b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09')

Container(a=0)(b=7)(c=1)
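
The difference between seeking and a pointer-style read can be sketched with a standard library `io.BytesIO` stream. The helper names `read_byte` and `read_byte_at` are made up for this illustration and are not part of construct.

```python
import io

stream = io.BytesIO(b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09')

def read_byte(s):
    """Read one byte at the current position and advance past it."""
    return s.read(1)[0]

def read_byte_at(s, offset):
    """Pointer-style read: jump to offset, read, then restore the position."""
    saved = s.tell()
    s.seek(offset)
    value = read_byte(s)
    s.seek(saved)
    return value

a = read_byte(stream)        # position is now 1
b = read_byte_at(stream, 7)  # position is restored to 1 afterwards
c = read_byte(stream)        # reads the byte at offset 1

print(a, b, c)
```

Because the position is restored, `c` is read from offset 1 rather than offset 8, matching the `Pointer` result above.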

## SkipNull

Construct doesn't natively have "skipnull" support, but the element can easily be recreated using `Const` and `GreedyRange`.

*The following SkipNull implementation has been provided in the helpers repo.*

In [22]:
spec = [[enstructured.BYTE, 'a', {'skipnull': True}]]

enstructured.Extractor(data=b'\x00\x00\x00\x00\x01', specification=spec).extract_members()

True


{'a': {'formatted_value': 0,
  'ignore': False,
  'index': 0,
  'length': 1,
  'location': 0,
  'offset': 0,
  'type': 'uint8',
  'value': 0}}

In [23]:
SkipNull = construct.Const(b'\x00')[:]

spec = construct.Struct(
    SkipNull,
    'a' / construct.Byte
)

spec.parse(b'\x00\x00\x00\x00\x01')

Container(a=1)

## Subfields

To embed a struct within another struct, simply use the embedded struct's spec as the element for a member of the outer struct.

In [24]:
sub_spec = [
    [enstructured.DWORD, 'sub_a'],
    [enstructured.WORD, 'sub_b']
]
spec = [
    [enstructured.DWORD, 'a'],
    [enstructured.SUBFIELD, 'sub', {'spec': sub_spec}]
]

enstructured.Extractor(data=b'\x0A\x00\x00\x00\x0B\x00\x00\x00\x0C\x00', specification=spec).extract_members()

[['uint32', 'sub_a'], ['uint16', 'sub_b']]


{'a': {'formatted_value': 10,
  'ignore': False,
  'index': 0,
  'length': 4,
  'location': 0,
  'offset': 0,
  'type': 'uint32',
  'value': 10},
 'sub': {'ignore': False,
  'index': 1,
  'length': 6,
  'location': 4,
  'offset': 4,
  'params': {'spec': [['uint32', 'sub_a'], ['uint16', 'sub_b']]},
  'type': 'subfield',
  'value': {'sub_a': {'formatted_value': 11,
    'ignore': False,
    'index': 0,
    'length': 4,
    'location': 4,
    'offset': 0,
    'type': 'uint32',
    'value': 11},
   'sub_b': {'formatted_value': 12,
    'ignore': False,
    'index': 1,
    'length': 2,
    'location': 8,
    'offset': 4,
    'type': 'uint16',
    'value': 12}}}}

In [25]:
sub_spec = construct.Struct(
    'sub_a' / construct.DWORD,
    'sub_b' / construct.WORD
)
spec = construct.Struct(
    'a' / construct.DWORD,
    'sub' / sub_spec
)

spec.parse(b'\x0A\x00\x00\x00\x0B\x00\x00\x00\x0C\x00')

Container(a=10)(sub=Container(sub_a=11)(sub_b=12))

*NOTE: Members of the embedded struct can be accessed as attributes of the `sub` member.*

In [26]:
results = spec.parse(b'\x0A\x00\x00\x00\x0B\x00\x00\x00\x0C\x00')
results.a, results.sub.sub_a, results.sub.sub_b

(10, 11, 12)

## Dynamic Parameters

The construct library can dynamically determine parameters for elements in a similar way to the enstructured library.

While the enstructured library can replace a hardcoded parameter with a string that gets evaluated, the construct library can replace most parameters with a function that accepts the current parsing context and returns the value. The context passed in is the same thing that you would get if you stopped parsing at that point.

In [27]:
spec = [
    [enstructured.WORD, 'size'],
    [enstructured.BYTES, 'data', {'length': '`members["size"]["value"] * 2`'}]
]

enstructured.Extractor(data=b'\x05\x00helloworld\x01\x02\x03', specification=spec).extract_members()

`members["size"]["value"] * 2`


{'data': {'formatted_value': b'helloworld',
  'ignore': False,
  'index': 1,
  'length': 10,
  'location': 2,
  'offset': 2,
  'params': {'length': 10},
  'type': 'bytes',
  'value': b'helloworld'},
 'size': {'formatted_value': 5,
  'ignore': False,
  'index': 0,
  'length': 2,
  'location': 0,
  'offset': 0,
  'type': 'uint16',
  'value': 5}}

In [28]:
spec = construct.Struct(
    'size' / construct.WORD,
    'data' / construct.Bytes(lambda ctx: ctx.size * 2)
)

spec.parse(b'\x05\x00helloworld\x01\x02\x03')

Container(size=5)(data=b'helloworld')

The construct library has an alternative syntax for setting these functions by using the `this` singleton like the following:

*More information about this can be found at construct.readthedocs.io/en/latest/meta.html#using-this-expression*

In [29]:
spec = construct.Struct(
    'size' / construct.WORD,
    'data' / construct.Bytes(construct.this.size * 2)
)

spec.parse(b'\x05\x00helloworld\x01\x02\x03')

Container(size=5)(data=b'helloworld')
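
The idea of a parameter computed from the parsing context can be sketched without construct: parse the fields in order, store each result in a context, and call the function with that context when the parameter is needed. The plain `dict` context and the `length_of_data` name are assumptions made for this illustration.

```python
import struct

data = b'\x05\x00helloworld\x01\x02\x03'

# The function receives everything parsed so far and returns the parameter.
length_of_data = lambda ctx: ctx['size'] * 2

context = {}
(context['size'],) = struct.unpack_from('<H', data, 0)

# By the time 'data' is parsed, 'size' is already in the context.
length = length_of_data(context)
context['data'] = data[2:2 + length]

print(context)
```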

You can also initialize the context with external information before parsing by adding keyword arguments to the "parse()" function.

You normally won't need to do this, since the lambda function is usually in the scope of the external data. (It can be useful, however, when defining constructs separately from parsing.)

NOTE: To access these, you need to access the parent context by using the "_" key. (You would do this same thing if you were trying to access the context of a parent struct within an embedded struct.)

In [30]:
# (Defined somewhere way above, possibly as a class variable.)
spec = construct.Struct(
    construct.WORD,
    'data' / construct.Bytes(lambda ctx: ctx._.external_length)
)

# ...

spec.parse(b'\x05\x00helloworld\x01\x02\x03', external_length=7)

Container(data=b'hellowo')

## Formatting Values

You can change the resulting format of any element in a similar way to enstructured.

With enstructured you would set a "formatter" parameter that would be a function that takes a value and converts it to a new format. 

The construct library does this by wrapping an element in an `Adapter`. An adapter is a class that inherits from `construct.Adapter` and implements the `_encode` and `_decode` functions. These functions take in two parameters `obj` and `context`. `obj` is the value that you are encoding or decoding and `context` is the current context of parsed/built elements. (Both need to be implemented because construct can both parse and build.) 

A number of adapters have already been developed within the `mwcp.utils.construct` module. Please contribute to this module if you develop a new one.

In [31]:
spec = [[enstructured.DWORD, '', {'formatter': enstructured.format_timestamp}]]

enstructured.Extractor(data=b'\xfa\x40\xd2\x59', specification=spec).extract_members()

<function format_timestamp at 0x0000000005C2AD08>


{'': {'formatted_value': '2017-10-02T09:36:58',
  'formatter': 'format_timestamp',
  'ignore': False,
  'index': 0,
  'length': 4,
  'location': 0,
  'offset': 0,
  'type': 'uint32',
  'value': 1506951418}}

In [32]:
import time

class UTCTimeStampAdapter(construct.Adapter):
    def _decode(self, obj, context):
        return time.ctime(obj)
    def _encode(self, obj, context):
        return int(time.mktime(time.strptime(obj)))

UTCTimeStamp = UTCTimeStampAdapter(construct.Int32ul)

UTCTimeStamp.parse(b'\xfa\x40\xd2\x59')

'Mon Oct  2 09:36:58 2017'

In [33]:
UTCTimeStamp.build('Mon Oct 02 09:36:58 2017')

b'\xfa@\xd2Y'
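
Note that `time.ctime` formats the timestamp in the local time zone, so the output above depends on the machine it runs on. As a hedged alternative, the decode/encode pair could be written against UTC with the standard library `datetime` module. The function names below are illustrative, and the pair is shown as plain functions rather than as a `construct.Adapter` subclass.

```python
from datetime import datetime, timezone

def decode_timestamp(seconds):
    """Integer seconds since the epoch -> ISO 8601 string in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()

def encode_timestamp(text):
    """ISO 8601 string with a UTC offset -> integer seconds since the epoch."""
    return int(datetime.fromisoformat(text).timestamp())

raw = int.from_bytes(b'\xfa\x40\xd2\x59', 'little')
formatted = decode_timestamp(raw)
round_trip = encode_timestamp(formatted).to_bytes(4, 'little')

print(formatted, round_trip)
```

The two functions are inverses of each other, which is the property `_decode` and `_encode` need so that building reproduces the bytes that were parsed.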

## Enums

Enums can be created in construct by using `Enum` and specifying keyword arguments (or a dictionary using the `**` notation).

The main difference is that construct takes the enum mapping with the keys and values reversed from enstructured (names map to values rather than values to names).

Also, construct can take a default value to be used if the parsed value is not in the enum. If no default is provided, an exception will be raised when the parsed value is not one of the enum values. This is a form of validation. (More on that later.)
If `construct.Pass` is set as the default, the parsed value is returned as-is instead.

In [34]:
PE_machines_enum_values = { 
    0x014c: "Intel 386",
    0x014d: "Intel 486",
    0x014e: "Intel 586",
    0x0200: "Intel 64-bit",
    0x0162: "MIPS",
    0x8664: "AMD64",
} 

spec = [[enstructured.WORD, 'machine', 
         {'formatter': enstructured.format_enum_factory(PE_machines_enum_values)}]]

enstructured.Extractor(data=b'\x4c\x01', specification=spec).extract_members()

<function format_enum_factory.<locals>.formatter_enum at 0x00000000048DBE18>


{'machine': {'formatted_value': 'Intel 386',
  'formatter': 'formatter_enum',
  'ignore': False,
  'index': 0,
  'length': 2,
  'location': 0,
  'offset': 0,
  'type': 'uint16',
  'value': 332}}

In [35]:
PE_machines_enum_values = { 
    "Intel 386": 0x014c,
    "Intel 486": 0x014d,
    "Intel 586": 0x014e,
    "Intel 64-bit": 0x0200,
    "MIPS": 0x0162,
    "AMD64": 0x8664
} 

spec = construct.Enum(construct.WORD, default=construct.Pass, **PE_machines_enum_values)

spec.parse(b'\x4c\x01')

'Intel 386'

In [36]:
spec.build('Intel 386')

b'L\x01'
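
If you already have an enstructured-style mapping, a dictionary comprehension can flip it into the name-to-value form. This is a small sketch in plain Python; `lookup` is a hypothetical helper that only mimics the "return the value as-is" fallback and is not part of either library.

```python
# enstructured style: parsed value -> name
value_to_name = {0x014c: "Intel 386", 0x0162: "MIPS", 0x8664: "AMD64"}

# construct style: name -> value (keys and values swapped)
name_to_value = {name: value for value, name in value_to_name.items()}

def lookup(value):
    """Return the name for a value, or the value itself if it is unknown."""
    return value_to_name.get(value, value)

print(name_to_value)
print(lookup(0x014c), lookup(0x9999))
```

The flip only works cleanly when every name is unique; duplicate names would overwrite each other.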

## Switches

The equivalent of enstructured's `MAPSUBFIELD` is `Switch` in construct. It works the same way, except that you can also specify a default element or allow it to raise an exception if the parsed value doesn't match.

In [37]:
header1 = [
    [enstructured.BYTE, 'a'],
    [enstructured.BYTE, 'b']
]
header2 = [
    [enstructured.BYTE, 'c'],
    [enstructured.BYTE, 'd']
]
header_map = {
    224: header1,
    240: header2,
}  

spec = [
    [enstructured.BYTE, 'size'],
    [enstructured.MAPSUBFIELD, 'header', {'key':'`members["size"]["value"]`', 'specmap': header_map}]
]

enstructured.Extractor(data=b'\xF0\x01\x02', specification=spec).extract_members()

`members["size"]["value"]`
{224: [['uint8', 'a'], ['uint8', 'b']], 240: [['uint8', 'c'], ['uint8', 'd']]}


{'header': {'ignore': False,
  'index': 1,
  'length': 2,
  'location': 1,
  'offset': 1,
  'params': {'key': 240,
   'specmap': {224: [['uint8', 'a'], ['uint8', 'b']],
    240: [['uint8', 'c'], ['uint8', 'd']]}},
  'type': 'mapsubfield',
  'value': {'c': {'formatted_value': 1,
    'ignore': False,
    'index': 0,
    'length': 1,
    'location': 1,
    'offset': 0,
    'type': 'uint8',
    'value': 1},
   'd': {'formatted_value': 2,
    'ignore': False,
    'index': 1,
    'length': 1,
    'location': 2,
    'offset': 1,
    'type': 'uint8',
    'value': 2}}},
 'size': {'formatted_value': 240,
  'ignore': False,
  'index': 0,
  'length': 1,
  'location': 0,
  'offset': 0,
  'type': 'uint8',
  'value': 240}}

In [38]:
header1 = construct.Struct(
    'a' / construct.Byte,
    'b' / construct.Byte
)
header2 = construct.Struct(
    'c' / construct.Byte,
    'd' / construct.Byte
)
header_map = {
    224: header1,
    240: header2,
}  

spec = construct.Struct(
    'size' / construct.Byte,
    'header' / construct.Switch(lambda ctx: ctx.size, cases=header_map)
)

spec.parse(b'\xF0\x01\x02')

Container(size=240)(header=Container(c=1)(d=2))

Alternatively, it may sometimes be simpler to use an `If` or `IfThenElse` statement.

In [39]:
spec = construct.Struct(
    'size' / construct.Byte,
    'header' / construct.IfThenElse(
        lambda ctx: ctx.size == 224,  # if
        header1,  # then
        header2   # else
    )
)

spec.parse(b'\xF0\x01\x02')

Container(size=240)(header=Container(c=1)(d=2))
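
Conceptually, a switch is a dictionary dispatch on a previously parsed value. The following standard library sketch shows that idea, including the choice between a default and raising an error; the function names are invented for this example and this is not how construct implements `Switch`.

```python
import struct

def parse_header1(payload):
    a, b = struct.unpack('BB', payload)
    return {'a': a, 'b': b}

def parse_header2(payload):
    c, d = struct.unpack('BB', payload)
    return {'c': c, 'd': d}

# Maps the parsed key to the parser for the bytes that follow it.
cases = {224: parse_header1, 240: parse_header2}

def parse_packet(data, default=None):
    size = data[0]
    handler = cases.get(size, default)
    if handler is None:
        # No matching case and no default: fail, as a form of validation.
        raise ValueError('no case for key {}'.format(size))
    return {'size': size, 'header': handler(data[1:3])}

result = parse_packet(b'\xF0\x01\x02')
print(result)
```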

## Ranges (lists)

As the equivalent of enstructured's "SUBFIELDLIST" type and "count" parameter, construct lets you define an array or list of constructs by using the `[]` notation after any element.

In [40]:
sub_spec = [[enstructured.BYTE]]

spec = [[enstructured.SUBFIELDLIST, '', {'count': 5, 'spec': sub_spec}]]

enstructured.Extractor(data=b'\x01\x02\x03\x04\x05', specification=spec).extract_members()

5
[['uint8']]


{'': {'count': 5,
  'ignore': False,
  'index': 0,
  'length': 5,
  'location': 0,
  'offset': 0,
  'params': {'count': 5, 'spec': [['uint8']]},
  'type': 'subfieldlist',
  'value': [{'member0': {'formatted_value': 1,
     'ignore': False,
     'index': 0,
     'length': 1,
     'location': 0,
     'offset': 0,
     'type': 'uint8',
     'value': 1}},
   {'member0': {'formatted_value': 2,
     'ignore': False,
     'index': 0,
     'length': 1,
     'location': 1,
     'offset': 0,
     'type': 'uint8',
     'value': 2}},
   {'member0': {'formatted_value': 3,
     'ignore': False,
     'index': 0,
     'length': 1,
     'location': 2,
     'offset': 0,
     'type': 'uint8',
     'value': 3}},
   {'member0': {'formatted_value': 4,
     'ignore': False,
     'index': 0,
     'length': 1,
     'location': 3,
     'offset': 0,
     'type': 'uint8',
     'value': 4}},
   {'member0': {'formatted_value': 5,
     'ignore': False,
     'index': 0,
     'length': 1,
     'location': 4,
     'off

In [41]:
spec = construct.Byte[5]

spec.parse(b'\x01\x02\x03\x04\x05')

[1, 2, 3, 4, 5]

In [42]:
sub_spec = [[enstructured.BYTE]]

spec = [
    [enstructured.BYTE, 'size'],
    [enstructured.SUBFIELDLIST, 'items', {'count': '`members["size"]["value"]`', 'spec': sub_spec}]
]

enstructured.Extractor(data=b'\x03\x01\x02\x03\x04\x05', specification=spec).extract_members()

`members["size"]["value"]`
[['uint8']]


{'items': {'count': 3,
  'ignore': False,
  'index': 1,
  'length': 3,
  'location': 1,
  'offset': 1,
  'params': {'count': 3, 'spec': [['uint8']]},
  'type': 'subfieldlist',
  'value': [{'member0': {'formatted_value': 1,
     'ignore': False,
     'index': 0,
     'length': 1,
     'location': 1,
     'offset': 0,
     'type': 'uint8',
     'value': 1}},
   {'member0': {'formatted_value': 2,
     'ignore': False,
     'index': 0,
     'length': 1,
     'location': 2,
     'offset': 0,
     'type': 'uint8',
     'value': 2}},
   {'member0': {'formatted_value': 3,
     'ignore': False,
     'index': 0,
     'length': 1,
     'location': 3,
     'offset': 0,
     'type': 'uint8',
     'value': 3}}]},
 'size': {'formatted_value': 3,
  'ignore': False,
  'index': 0,
  'length': 1,
  'location': 0,
  'offset': 0,
  'type': 'uint8',
  'value': 3}}

In [43]:
spec = construct.Struct(
    'size' / construct.Byte,
    'items' / construct.Byte[lambda ctx: ctx.size]
)

spec.parse(b'\x03\x01\x02\x03\x04\x05')

Container(size=3)(items=[1, 2, 3])

Construct can also parse a "GreedyRange" using the `[:]` notation. This will parse the given element continuously until either it has run out of data or validation within the element has failed.

In [44]:
spec = construct.CString()[:]

# The last one doesn't show up because it has failed the validation of needing a '\x00' at the end.
spec.parse(b'hello\x00world\x00\xFAGARBAGE!')

[b'hello', b'world']
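
The greedy behavior can be pictured as a loop that keeps parsing one element at a time and stops at the first element that fails. Here is a minimal standard library sketch for null-terminated strings; `parse_cstrings` is a hypothetical helper, not a construct function.

```python
def parse_cstrings(data):
    """Collect null-terminated strings until one cannot be parsed."""
    results = []
    pos = 0
    while True:
        end = data.find(b'\x00', pos)
        if end == -1:
            # No terminator left: this element fails, so stop and keep
            # what has been parsed so far.
            break
        results.append(data[pos:end])
        pos = end + 1
    return results

strings = parse_cstrings(b'hello\x00world\x00\xFAGARBAGE!')
print(strings)
```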

## Computed Values

The construct library also has the ability to produce computed values using the `Computed` subconstruct. This can be useful if you would like to create new attributes based on the values of other attributes, which can then be used for further parsing.

In [45]:
spec = construct.Struct(
    'width' / construct.WORD,
    'height' / construct.WORD,
    'size' / construct.Computed(lambda ctx: ctx.width * ctx.height),
    'data' / construct.Bytes(lambda ctx: ctx.size)
)

spec.parse(b'\x02\x00\x03\x00hello!')

Container(width=2)(height=3)(size=6)(data=b'hello!')

## Delimited

You can parse delimited data by using the `Delimited` construct and defining the construct to use on each individual element within the delimited data.

In [46]:
spec = construct.Delimited(b'|',
    'first' / construct.CString(),
    'second' / construct.DWORD,
    # When using a Greedy construct, either all data till EOF or the next delimiter will be consumed.
    'third' / construct.GreedyBytes,
    'fourth' / construct.Byte
)

spec.parse(b'Hello\x00\x00|\x01\x00\x00\x00|world!!\x01\x02|\xff')

Container(first=b'Hello')(second=1)(third=b'world!!\x01\x02')(fourth=255)

If you don't care about a particular element, you can leave it nameless just like in Structs.

In [47]:
spec = construct.Delimited(b'|',
    'first' / construct.CString(),
    'second' / construct.DWORD,
    construct.Pass,
    'fourth' / construct.Byte
)

spec.parse(b'Hello\x00\x00|\x01\x00\x00\x00|world!!\x01\x02|\xff')

Container(first=b'Hello')(second=1)(fourth=255)

It may also be useful to use `Pass` or `Optional` for fields that may not exist or that you don't care about.

In [48]:
spec = construct.Delimited(b'|',
    'first' / construct.CString(),
    'second' / construct.Pass,
    'third' / construct.Optional(construct.DWORD)
)

spec.parse(b'Hello\x00\x00|dont care|\x01\x00\x00\x00')

Container(first=b'Hello')(second=None)(third=1)

In [49]:
spec.parse(b'Hello\x00\x00||')

Container(first=b'Hello')(second=None)(third=None)

## PE Physical Address

A PE virtual address can be automatically converted to a physical address by using `PEPhysicalAddress` found in our helpers module. This is used as an adapter to wrap around a construct that is parsing an integer representing the virtual address.

In order to use this, you must pass a PE object either within the constructor or as a keyword argument in the "parse()" function.

In [50]:
import pefile

pe = pefile.PE(r'C:\32bit_exe')
data = pe.trim()

In [51]:
spec = construct.PEPhysicalAddress(construct.DWORD, pe=pe)

spec.parse(b'd\x00@\x00')

100

In [52]:
spec = construct.PEPhysicalAddress(construct.DWORD)

spec.parse(b'd\x00@\x00', pe=pe)

100

This can be useful when used in combination with `Pointer` in order to retrieve data pointed to by an address.

*NOTE: `Pointer` is based on an offset into the input data and has no relation to the PE object we passed in. Therefore, we must pass in the full file data to parse.*

In [53]:
spec = construct.Struct(
    construct.Seek(0x6BD),
    'data_ptr' / construct.PEPhysicalAddress(construct.DWORD),
    'data' / construct.Pointer(lambda ctx: ctx.data_ptr, construct.CString())
)

spec.parse(data, pe=pe)

Container(data_ptr=515072)(data=b'../../../src/c_init.cpp')

Alternatively, you can use `PEPointer` to simplify this common routine.

In [54]:
spec = construct.Struct(
    construct.Seek(0x6BD),
    'data_ptr' / construct.DWORD,
    'data' / construct.PEPointer(lambda ctx: ctx.data_ptr, construct.CString())
)

spec.parse(data, pe=pe)

Container(data_ptr=4718592)(data=b'../../../src/c_init.cpp')
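
For readers unfamiliar with the address translation involved, here is a rough standard library sketch of converting a virtual address to a file offset using a section table. How the helpers actually perform the conversion is not shown in this tutorial; the `Section` tuple, the section values, and the fallback for header addresses are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical, simplified view of a PE section header.
Section = namedtuple('Section', 'virtual_address virtual_size raw_offset')

def va_to_offset(va, image_base, sections):
    """Translate a virtual address into an offset within the file."""
    rva = va - image_base  # address relative to the image base
    for section in sections:
        start = section.virtual_address
        if start <= rva < start + section.virtual_size:
            return rva - start + section.raw_offset
    # Assumption: addresses before the first section fall in the headers,
    # which are laid out identically in memory and on disk.
    return rva

sections = [Section(virtual_address=0x1000, virtual_size=0x2000, raw_offset=0x400)]

header_offset = va_to_offset(0x400064, 0x400000, sections)
section_offset = va_to_offset(0x401010, 0x400000, sections)
print(header_offset, hex(section_offset))
```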

## PE Physical Address 64-bit

There is a 64-bit version of `PEPointer` called `PEPointer64` that works the same way, but also requires an extra parameter that specifies the offset to the end of the instruction to base the relative pointer on.  The offset can be retrieved during parsing easily by using the `Tell` subconstruct.

In [55]:
import pefile

pe = pefile.PE(r'C:\64bit_exe')
data = pe.trim()

spec = construct.Struct(
    construct.Seek(0x6555),
    construct.Const(b'\x48\x8D\x15'),  # lea rdx, ...
    'data_ptr' / construct.DWORD,
    'inst_end' / construct.Tell,
    'data' / construct.PEPointer64(
        lambda ctx: ctx.data_ptr, lambda ctx: ctx.inst_end, construct.Bytes(10))
)

spec.parse(data, pe=pe)

Container(data_ptr=10348)(inst_end=25948)(data=b'/solutions')

## Regular Expressions

Regular expressions can be used to find and parse capture groups by using the `Regex` construct found in the helpers.

In [56]:
import re
regex = re.compile(b'\x01\x02(?P<size>.{4})\x03\x04(?P<path>[A-Za-z].*\x00)', re.DOTALL)

spec = construct.Regex(regex, size=construct.DWORD, path=construct.CString())
    
spec.parse(b'GARBAGE!\x01\x02\x0A\x00\x00\x00\x03\x04C:\\Windows\x00MORE GARBAGE!')

Container(size=10)(path=b'C:\\Windows')

Since the stream position is left after the matched data, this can also be used to jump to a particular part of the input file.

In [57]:
spec = construct.Struct(
    construct.Regex(b'hello '),
    'person' / construct.CString()
)

spec.parse(b'\x01\x03\x04hello bob\x00')

Container(person=b'bob')
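
The same "search, then continue parsing after the match" pattern can be sketched with the standard library `re` module alone. This illustrates the idea only and does not use the `Regex` helper.

```python
import re

data = b'\x01\x03\x04hello bob\x00'

match = re.search(b'hello ', data)

# Continue parsing from the end of the match, like the stream position
# that is left after the matched data.
pos = match.end()
person = data[pos:data.index(b'\x00', pos)]

print(pos, person)
```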

If the regular expression is not found, a `ConstructError` exception will be raised. This allows `Regex` to act as a form of [validation](#validation).

In [58]:
spec = construct.Struct(
    construct.Regex(b'hello '),
    'person' / construct.CString()
)

try:
    results = spec.parse(b'\x01\x03\x04goodbye bob\x00')
except construct.ConstructError as e:
    print('Unable to parse: {}'.format(e))

Unable to parse: regex did not match


Multiple regular expressions can be tried by wrapping the constructs in a `Select` construct.


In [59]:
spec = construct.Select(
    construct.Regex(re.compile(b'goodbye (?P<farewell_person>[A-Za-z]*)')),
    construct.Regex(re.compile(b'hello (?P<greet_person>[A-Za-z]*)')),
)

spec.parse(b'\x01\x02hello bob\x00\x03')

Container(greet_person=b'bob')

Be careful, though: the order within the `Select` matters, since it stops after the first successful parse.

In this example, we can get either the farewell or the greeting depending on the order of our regular expressions.

In [60]:
spec = construct.Select(
    construct.Regex(re.compile(b'goodbye (?P<farewell_person>[A-Za-z]*)')),
    construct.Regex(re.compile(b'hello (?P<greet_person>[A-Za-z]*)'))
)

spec.parse(b'\x01\x02hello bob\x00\x03\x04goodbye george\x00\x05\x06')

Container(farewell_person=b'george')

In [61]:
spec = construct.Select(
    construct.Regex(re.compile(b'hello (?P<greet_person>[A-Za-z]*)')),
    construct.Regex(re.compile(b'goodbye (?P<farewell_person>[A-Za-z]*)')),
)

spec.parse(b'\x01\x02hello bob\x00\x03\x04goodbye george\x00\x05\x06')

Container(greet_person=b'bob')

If you want to ensure both regular expressions are found but don't know the order in which they will appear, you can use `Union` instead.

*NOTE: When using `Union` the stream offset won't advance unless you specify a `buildfrom` parameter, which specifies which field to leave the stream offset at.*

In [62]:
spec = construct.Union(
    'farewell' / construct.Regex(re.compile(b'(?<=goodbye )[A-Za-z]*')),
    'greeting' / construct.Regex(re.compile(b'(?<=hello )[A-Za-z]*')),
)

spec.parse(b'\x01\x02hello bob\x00\x03\x04goodbye george\x00\x05\x06')

Container(farewell=b'george')(greeting=b'bob')
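
As a rough illustration of why the order of the phrases in the data does not matter here, the plain-Python sketch below searches each pattern independently from the same starting offset. It uses only the standard `re` module and is an assumption-level analogy, not the way `Union` is implemented; the `patterns` and `results` names are made up for the example.

```python
import re

data = b'\x01\x02hello bob\x00\x03\x04goodbye george\x00\x05\x06'

# Each pattern is searched from offset 0 of the same data, so neither
# search depends on where the other one happened to match.
patterns = {
    'farewell': rb'(?<=goodbye )[A-Za-z]*',
    'greeting': rb'(?<=hello )[A-Za-z]*',
}

results = {}
for name, pattern in patterns.items():
    match = re.search(pattern, data)
    if match is None:
        # Every pattern is required, unlike the first-match-wins Select.
        raise ValueError('{} not found'.format(name))
    results[name] = match.group(0)

print(results)  # {'farewell': b'george', 'greeting': b'bob'}
```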

The regular expression can also be applied repeatedly by using the [range](#ranges-lists) notation (`[]`).

In [63]:
spec = construct.Regex(re.compile(b'(?P<prefix>hello|goodbye) (?P<person>[A-Za-z]*)'))

spec[:].parse(b'\x01\x02hello bob\x00\x03\x04goodbye george\x00\x05\x06')

[Container(prefix=b'hello')(person=b'bob'),
 Container(prefix=b'goodbye')(person=b'george')]
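
For comparison, the same repeated matching can be sketched with the standard library alone. This assumes nothing about how the range notation is implemented; it only shows the equivalent result using `re.finditer`.

```python
import re

data = b'\x01\x02hello bob\x00\x03\x04goodbye george\x00\x05\x06'
pattern = re.compile(rb'(?P<prefix>hello|goodbye) (?P<person>[A-Za-z]*)')

# finditer resumes searching after each match, much like re-running the
# same regex construct until it no longer matches.
matches = [match.groupdict() for match in pattern.finditer(data)]

for entry in matches:
    print(entry)
```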

## HTML Documentation

In `mwcp.utils.construct`, we have created the function `html_hex()` which will create a user-friendly hex dump of the parsed data in the same way as enstructured.

Every element with a provided name (that doesn't start with "_") will be highlighted within the hex dump.

In [64]:
EMBED_PAYLOAD = construct.Struct(
    'Hardcoded Value' / construct.Int16ul,
    'Data' / construct.Bytes(5)
)

PACKET = construct.Struct(
    construct.Padding(0x9),
    # Parse three consecutive EMBED_PAYLOAD structures.
    'Payloads' / EMBED_PAYLOAD[3],
    construct.Padding(0x17),
    # construct.IP4Address is a helper construct that converts 4 bytes into an IP address.
    'Compromised Host IP' / construct.IP4Address,
    'Unknown IP Addresses' / construct.IP4Address[4],
    construct.Padding(8),
    'Unknown Indicator' / construct.CString(),
    'Number of CPUs' / construct.Int32ul,
    'CPU Mhz' / construct.Int32ul,
    'Total Memory (MB)' / construct.Int32ul,
    'Compromised System Kernel' / construct.CString(),
    'Possible Version' / construct.CString()
)

data = (b'\x01\x00\x00\x00}\x00\x00\x00\x00\xf4\x01hello\x02\x00world\xe8'
        b'\x03\x01\xFA\xFFYO\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01'
        b'\x01\x00\x00\x00\x00\x01\x00\x00\x00\xc0\xa8\x01\r\xc0\xa8\x01\r\xc0'
        b'\xa8\x01\r\xc0\xa8\x01\r\xc0\xa8\x01\r\xff\xff\x01\x00\x00\x00\x00\x00'
        b'foo bar!\x00\x01\x00\x00\x00d\n\x00\x00\xc4\x07\x00\x00'
        b'Linux 3.13.0-93-generic\x001.0.0\x00')

html_data = construct.html_hex(PACKET, data, depth=3)

# Render the HTML inside the notebook (only needed for this tutorial).
from IPython.display import display, HTML
display(HTML(html_data))

| Offset | Name | Value |
|--------|------|-------|
| 000009 | Payloads[0] / Hardcoded Value | 500 |
| 00000b | Payloads[0] / Data | hello |
| 000010 | Payloads[1] / Hardcoded Value | 2 |
| 000012 | Payloads[1] / Data | world |
| 000017 | Payloads[2] / Hardcoded Value | 1000 |
| 000019 | Payloads[2] / Data | \x01\xfa\xffYO |
| 000035 | Compromised Host IP | 192.168.1.13 |
| 000039 | Unknown IP Addresses | 192.168.1.13, 192.168.1.13, 192.168.1.13, 192.168.1.13 |
| 000051 | Unknown Indicator | foo bar! |
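
To make the offsets in the first rows easier to follow, here is a small sketch that recomputes them by hand with the standard `struct` module. It is not how `html_hex()` works internally; the `payload` format string is an assumed stand-in for `EMBED_PAYLOAD`, and `data` holds only the first 30 bytes of the sample packet above.

```python
import struct

# First 30 bytes of the sample packet: 9 bytes of padding followed by
# three 7-byte payloads (a little-endian uint16 plus 5 raw bytes each).
data = (b'\x01\x00\x00\x00}\x00\x00\x00\x00\xf4\x01hello\x02\x00world\xe8'
        b'\x03\x01\xFA\xFFYO')

payload = struct.Struct('<H5s')  # assumed stand-in for EMBED_PAYLOAD

rows = []
offset = 0x9  # skip the leading padding
for index in range(3):
    value, raw = payload.unpack_from(data, offset)
    rows.append((offset, 'Payloads[{}] / Hardcoded Value'.format(index), value))
    rows.append((offset + 2, 'Payloads[{}] / Data'.format(index), raw))
    offset += payload.size  # each payload occupies 7 bytes

for row_offset, name, value in rows:
    print('{:06x}  {:<32} {!r}'.format(row_offset, name, value))
```

The printed offsets (`000009`, `00000b`, `000010`, `000012`, `000017`, `000019`) line up with the payload rows of the generated table.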


## Building

Every construct also works in reverse: calling `build()` with a dictionary of values produces the corresponding binary data. The dictionary can be created from scratch or taken from a previous parse.

In [65]:
spec = construct.Struct(
    'a' / construct.DWORD,
    'b' / construct.CString(),
    'c' / construct.Byte
)

spec.build({'a': 1042, 'b': b'hello world', 'c': 0x54})

b'\x12\x04\x00\x00hello world\x00T'

In [66]:
results = spec.parse(b'\x01\x00\x00\x00hello world\x00\x54')

results.a = 2
results.b = b'how are you today?'

spec.build(results)

b'\x02\x00\x00\x00how are you today?\x00T'
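
For readers who want to see what building amounts to at the byte level, the sketch below reproduces the first example by hand with the standard `struct` module. It assumes `DWORD` is a little-endian unsigned 32-bit integer (which matches the output shown above); `build_record` and `parse_record` are hypothetical helpers for this illustration only.

```python
import struct


def build_record(a, b, c):
    """Pack a little-endian uint32, a null-terminated byte string, and one byte."""
    return struct.pack('<I', a) + b + b'\x00' + struct.pack('B', c)


def parse_record(data):
    """Inverse of build_record: split the bytes back into (a, b, c)."""
    (a,) = struct.unpack_from('<I', data, 0)
    end = data.index(b'\x00', 4)  # position of the C string terminator
    b = data[4:end]
    c = data[end + 1]
    return a, b, c


built = build_record(1042, b'hello world', 0x54)
print(built)  # b'\x12\x04\x00\x00hello world\x00T'

# Parsing the built bytes gives back the original values.
round_trip = parse_record(built)
print(round_trip)  # (1042, b'hello world', 84)
```

With construct, the single declarative spec replaces both hand-written functions, so the parser and builder cannot drift apart.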

## Validation

Data can be validated while parsing by supplying one of the many validating constructs (`Const`, `OneOf`, `NoneOf`, `Check`, `Regex`, etc.).

If the validation fails during parsing, a `ConstructError` will be raised.

In [67]:
spec = construct.Struct(
    construct.Const(b'MZ'),
    'a' / construct.Byte,
)

try:
    results = spec.parse(b'nope\x01')
except construct.ConstructError as e:
    print('Invalid data: {}'.format(e))

Invalid data: expected b'MZ' but parsed b'no'


In [68]:
spec = construct.OneOf(construct.Byte, [1, 5, 7])

try:
    results = spec.parse(b'\x02')
except construct.ConstructError as e:
    print('Invalid data: {}'.format(e))

Invalid data: ('object failed validation', 2)
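
The idea behind these validating constructs is simply "read a value, then refuse to continue if it is not acceptable". The plain-Python sketch below mimics that for a constant and a one-of check. `ValidationFailed`, `read_const`, and `read_one_of` are hypothetical names invented for this illustration; they are not construct APIs, and the error messages are only modelled on the ones shown above.

```python
class ValidationFailed(ValueError):
    """Hypothetical stand-in for the error a validating construct raises."""


def read_const(data, expected):
    """Consume `expected` from the front of `data` or fail."""
    actual = data[:len(expected)]
    if actual != expected:
        raise ValidationFailed(
            'expected {!r} but parsed {!r}'.format(expected, actual))
    return data[len(expected):]


def read_one_of(data, valid):
    """Consume one byte and fail unless its value is in `valid`."""
    value = data[0]
    if value not in valid:
        raise ValidationFailed('object failed validation: {}'.format(value))
    return value, data[1:]


# Valid input passes both checks.
rest = read_const(b'MZ\x05', b'MZ')
value, rest = read_one_of(rest, [1, 5, 7])
print(value)  # 5

# Invalid input stops at the first failed check.
try:
    read_const(b'nope\x01', b'MZ')
except ValidationFailed as e:
    message = str(e)
    print('Invalid data: {}'.format(message))
```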


## Debugging

You can debug the construct you are developing by inserting a `Probe` element wherever you would like to see a printout of the context parsed so far and the data that is about to be read. You can optionally pass a name as a parameter to tell multiple probes apart.

In [69]:
spec = construct.Struct(
    'count' / construct.Byte,
    construct.Probe('before'),
    'items' / construct.Byte[lambda ctx: ctx.count], 
    construct.Probe('after')
)

spec.parse(b'\x06abcdef')

Probe before
path is parsing, func is None
0000   61 62 63 64 65 66                                                                                 abcdef

Container: 
    count = 6
Probe after
path is parsing, func is None
EOF reached
Container: 
    count = 6
    items = ListContainer: 
        97
        98
        99
        100
        101
        102


Container(count=6)(items=[97, 98, 99, 100, 101, 102])

Alternatively, you can attach a pdb-based full python debugger by wrapping your construct in `Debugger`.

This will cause an interactive debugger prompt to pop up, letting you inspect and tweak the parsing state. When finished, use "q" to quit the debugger prompt and resume execution.

```python
spec = construct.Debugger(construct.Struct(
    'a' / construct.Byte,
    'b' / construct.Enum(construct.Byte, A=1, B=2, C=3)
))
spec.parse('\x0A\xFF')
================================================================================
Debugging exception of <Struct: None>:
path is parsing
   ...
MappingError: no decoding mapping for 255
    parsing -> b
(you can set the value of 'self.retval', which will be returned)
> c:\python\construct\core.py(2713)_parse()
-> raise e.__class__("%s\n    %s" % (e, path))
(Pdb) >? context
Container(a=10)
(Pdb) >? stream
<_io.BytesIO object at 0x0329CC00>
(Pdb) >? stream.tell()
2L
(Pdb) >? stream.seek(1)
1L
(Pdb) >? stream.read(1)
'\xff'
(Pdb) >? q
```

Finally, you can also force the parsing to fail at a certain point by adding an `Error` construct. This can be used as a sentinel that blows a whistle when a conditional branch goes the wrong way, or to raise an error explicitly in a declarative way.

*NOTE: You must catch `ExplicitError`. `ConstructError` is not a base class.*

In [70]:
spec = construct.Struct(
    'a' / construct.Int8sb,
    'b' / construct.IfThenElse(
        lambda ctx: ctx.a > 0,  # if
        construct.Byte,         # then 
        construct.Error         # else
    )
)

try:
    spec.parse(b'\xff\x05')
except construct.ExplicitError as e:    
    print('Error: {}'.format(e))

Error: Error field was activated during parsing
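
It may not be obvious why `b'\xff\x05'` lands in the `else` branch: `Int8sb` reads a signed byte, so `0xff` is interpreted as -1 and the `ctx.a > 0` condition is false. The standard-library sketch below shows the same logic by hand. `parse_pair` is a hypothetical helper for this illustration, and it raises a plain `ValueError` rather than construct's `ExplicitError`.

```python
import struct


def parse_pair(data):
    """Read a signed byte, then a second byte only if the first is positive."""
    (a,) = struct.unpack_from('>b', data, 0)  # signed 8-bit integer
    if a <= 0:
        # Mirrors the "else" branch that holds the Error construct.
        raise ValueError('Error field was activated during parsing')
    return a, data[1]


signed_value = struct.unpack('>b', b'\xff')[0]
print(signed_value)  # -1

good = parse_pair(b'\x01\x05')
print(good)  # (1, 5)

try:
    parse_pair(b'\xff\x05')
except ValueError as e:
    error_message = str(e)
    print('Error: {}'.format(error_message))
```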
