Skip to content

Latest commit

 

History

History
279 lines (232 loc) · 13 KB

Data-Type-Compatibility.md

File metadata and controls

279 lines (232 loc) · 13 KB

Zed/Zeek Data Type Compatibility

Introduction

As the Zed data model was in many ways inspired by the Zeek TSV log format, the rich Zed storage formats (ZSON, ZNG, etc.) maintain comprehensive interoperability with Zeek. When Zeek is configured to output its logs in NDJSON format, much of the rich type information is lost in translation, but this can be restored by following the guidance for shaping Zeek NDJSON. On the other hand, Zeek TSV can be converted to Zed storage formats and back to Zeek TSV without any loss of information.

This document describes how the Zed type system is able to represent each of the types that may appear in Zeek logs.

Tools like zq and Brim maintain an internal Zed-typed representation of any Zeek data that is read or imported. Therefore, knowing the equivalent types will prove useful when performing operations in the Zed language such as type casting or looking at the data when output as ZSON.

Equivalent Types

The following table summarizes which Zed data type corresponds to each Zeek data type that may appear in a Zeek TSV log. While most types have a simple 1-to-1 mapping from Zeek to Zed and back to Zeek again, the sections linked from the Additional Detail column describe cosmetic differences and other subtleties applicable to handling certain types.

Zeek Type Zed Type Additional Detail
bool bool
count uint64
int int64
double float64 See double details
time time
interval duration
string string See string details about escaping
port uint16 See port details
addr ip
subnet net
enum string See enum details
set set See set details
vector array
record record See record details

Note: The Zeek data type page describes the types in the context of the Zeek scripting language. The Zeek types available in scripting are a superset of the data types that may appear in Zeek log files. The encodings of the types also differ in some ways between the two contexts. However, we link to this reference because there is no authoritative specification of the Zeek TSV log format.

Example

The following example shows a TSV log that includes each Zeek data type, how it's output as ZSON by zq, and then how it's written back out again as a Zeek log. You may find it helpful to refer to this example when reading the Type-Specific Details sections.

Viewing the TSV log:

cat zeek_types.log

Output:

#separator \x09
#set_separator	,
#empty_field	(empty)
#unset_field	-
#fields	my_bool	my_count	my_int	my_double	my_time	my_interval	my_printable_string	my_bytes_string	my_port	my_addr	my_subnet	my_enum	my_set	my_vector	my_record.name	my_record.age
#types	bool	count	int	double	time	interval	string	string	port	addr	subnet	enum	set[string]	vector[string]	string	count
T	123	456	123.4560	1592502151.123456	123.456	smile😁smile	\x09\x07\x04	80	127.0.0.1	10.0.0.0/8	tcp	things,in,a,set	order,is,important	Jeanne	122

Reading the TSV log, outputting as ZSON, and saving a copy:

zq -Z zeek_types.log | tee zeek_types.zson

Output:

{
    my_bool: true,
    my_count: 123 (uint64),
    my_int: 456,
    my_double: 123.456,
    my_time: 2020-06-18T17:42:31.123456Z,
    my_interval: 2m3.456s,
    my_printable_string: "smile😁smile",
    my_bytes_string: "\t\u0007\u0004",
    my_port: 80 (port=uint16),
    my_addr: 127.0.0.1,
    my_subnet: 10.0.0.0/8,
    my_enum: "tcp" (=zenum),
    my_set: |[
        "a",
        "in",
        "set",
        "things"
    ]|,
    my_vector: [
        "order",
        "is",
        "important"
    ],
    my_record: {
        name: "Jeanne",
        age: 122 (uint64)
    }
}

Reading the saved ZSON output and outputting as Zeek TSV:

zq -f zeek zeek_types.zson

Output:

#separator \x09
#set_separator	,
#empty_field	(empty)
#unset_field	-
#fields	my_bool	my_count	my_int	my_double	my_time	my_interval	my_printable_string	my_bytes_string	my_port	my_addr	my_subnet	my_enum	my_set	my_vector	my_record.name	my_record.age
#types	bool	count	int	double	time	interval	string	string	port	addr	subnet	enum	set[string]	vector[string]	string	count
T	123	456	123.456	1592502151.123456	123.456000	smile😁smile	\x09\x07\x04	80	127.0.0.1	10.0.0.0/8	tcp	a,in,set,things	order,is,important	Jeanne	122

Type-Specific Details

As zq acts as a reference implementation for Zed storage formats such as ZSON and ZNG, it's helpful to understand how it reads the following Zeek data types into readable text equivalents in the ZSON format, then writes them back out again in the Zeek TSV log format. Other implementations of the Zed storage formats (should they exist) may handle these differently.

Multiple Zeek types discussed below are represented via a type definition to one of Zed's primitive types. The Zed type definitions maintain the history of the field's original Zeek type name such that zq may restore it if the field is later output in Zeek format. Knowledge of its original Zeek type may also enable special operations in Zed that are unique to values known to have originated as a specific Zeek type, though no such operations are currently implemented in zq.

double

As they do not affect accuracy, "trailing zero" decimal digits on Zeek double values will not be preserved when they are formatted into a string, such as via the ZSON/Zeek/table output options in zq (e.g., 123.4560 becomes 123.456).

enum

As they're encountered in common programming languages, enum variables typically hold one of a set of predefined values. While this is how Zeek's enum type behaves inside the Zeek scripting language, when the enum type is output in a Zeek log, the log does not communicate any such set of "allowed" values as they were originally defined. Therefore, these values are represented with a ZSON type name bound to the Zed string type. See the text above regarding type definitions for more details.

port

The numeric values that appear in Zeek logs under this type are represented in ZSON with a type name of port bound to the uint16 type. See the text above regarding type names for more details.

set

Because order within sets is not significant, no attempt is made to maintain the order of set elements as they originally appeared in a Zeek log.

string

Zeek's string data type is complicated by its ability to hold printable ASCII and UTF-8 as well as arbitrary unprintable bytes represented as \x escapes. Because such binary data may need to legitimately be captured (e.g. to record the symptoms of DNS exfiltration), it's helpful that Zeek has a mechanism to log it. Unfortunately, Zeek's use of the single string type for these multiple uses leaves out important details about the intended interpretation and presentation of the bytes that make up the value. For instance, one Zeek string field may hold arbitrary network data that coincidentally sometimes form byte sequences that could be interpreted as printable UTF-8, but they are not intended to be read or presented as such. Meanwhile, another Zeek string field may be populated such that it will only ever contain printable UTF-8. These details are currently only captured within the Zeek source code itself that defines how these values are generated.

Zed includes a primitive type called bytes that's suited to storing the former "always binary" case and a string type for the latter "always printable" case. However, Zeek logs do not currently communicate details that would allow an implementation to know which Zeek string fields to store as which of these two Zed data types. Instead, the Zed system does what the Zeek system does when writing strings to JSON: any \x escapes used in Zeek TSV strings are translated into valid Zed UTF-8 strings by escaping the backslash before the x. In this way, you can still see binary-corrupted strings that are generated by Zeek in the Zed data formats.

Unfortunately there is no way to distinguish whether a \x escape occurred or whether that string pattern happen to occur in the original data. A nice solution would be to convert Zeek strings that are valid UTF-8 strings into Zed strings and convert invalid strings into a Zed bytes type, or we could covert both of them into a Zed union of string and bytes. If you have interest in a capability like this, please let us know and we can elevate the priority.

If Zeek were to provide an option to output logs directly in one or more of Zed's richer storage storage formats, this would create an opportunity to assign the appropriate Zed bytes or string type at the point of origin, depending on what's known about how the field's value is intended to be populated and used.

record

Zeek's record type is unique in that every Zeek log line effectively is a record, with its schema defined via the #fields and #types directives in the headers of each log file. The word "record" never appears explicitly in the schema definition in Zeek logs.

Embedded records also subtly appear within Zeek log lines in the form of dot-separated field names. A common example in Zeek is the id record, which captures the source and destination IP addresses and ports for a network connection as fields id.orig_h, id.orig_p, id.resp_h, and id.resp_p. When reading such fields into their Zed equivalent, zq restores the hierarchical nature of the record as it originally existed inside of Zeek itself before it was output by its logging system. This enables operations in Zed that refer to the record at a higher level but affect all values lower down in the record hierarchy.

Revisiting the data from our example, we can output all fields within my_record via a Zed cut operation.

Command:

zq -f zeek 'cut my_record' zeek_types.zson

Output:

#separator \x09
#set_separator	,
#empty_field	(empty)
#unset_field	-
#fields	my_record.name	my_record.age
#types	string	count
Jeanne	122