
Header Rules

sat edited this page Jun 26, 2026 · 1 revision

The header stage extracts the structured front matter of a line — at least a message, and usually a timestamp and host. You describe it with a HeaderParser: an ordered list of Items that log2seq compiles into one regular expression.

from log2seq.header import HeaderParser, MonthAbbreviation, Digit, Time, Hostname, Statement

rule = [MonthAbbreviation(), Digit("day"), Time(), Hostname("host"), Statement()]
hp = HeaderParser(rule, separator=" ", defaults={"year": 2024})
hp.process_line("Mar  5 06:07:08 db1 disk full")
# {'host': 'db1', 'message': 'disk full', 'timestamp': datetime.datetime(2024, 3, 5, 6, 7, 8)}
  • Exactly one Statement() is mandatory in every rule; it captures the body under message.
  • Each item's value lands in the result under its value name (see the catalog). Timestamp-related items are reassembled into a single timestamp.
  • Missing timestamp fields (a syslog line has no year) come from defaults.
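To make the "compiles into one regular expression" idea concrete, here is a minimal standard-library sketch. It is not log2seq's implementation: the `compile_rule` helper, the pattern strings, and the group names are assumptions chosen only for illustration.

```python
import re

# Hypothetical (name, pattern) pairs standing in for log2seq Items.
RULE = [
    ("month", r"[A-Z][a-z]{2}"),
    ("day", r"\d{1,2}"),
    ("time", r"\d{2}:\d{2}:\d{2}"),
    ("host", r"\S+"),
    ("message", r".*"),
]

def compile_rule(rule, separator=" "):
    # One or more separator characters may sit between two items.
    sep = "[" + re.escape(separator) + "]+"
    # Each item becomes a named group; the groups are joined in rule order.
    body = sep.join(f"(?P<{name}>{pattern})" for name, pattern in rule)
    return re.compile("^" + body + "$")

header_re = compile_rule(RULE)
fields = header_re.match("Mar  5 06:07:08 db1 disk full").groupdict()
# fields["host"] is "db1" and fields["message"] is "disk full"
```

Because the last group is greedy and anchored at the end of the line, it absorbs everything after the final separator, which is the role Statement() plays in a rule.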

Placing items: separator vs full_format

A HeaderParser needs to know where one item ends and the next begins. Two ways:

  • separator (simple, recommended) — a set of characters that separate items. separator=" :[]" means runs of space/:/[/] divide the fields. Good when the layout is just whitespace/punctuation-delimited.
  • full_format — a template where <i> is replaced by item i's pattern and everything else is literal. Use it to pin fixed delimiters so they are not read as content. Runs of spaces become \s+; wrap optional items by hand with (...)?.
from log2seq.header import UserItem   # the other names are imported in the first example

rule = [MonthAbbreviation(), Digit("day"), Time(), Hostname("host"),
        UserItem("comp", r".+?"), Digit("pid"), Statement()]
# <0>..<6> refer to the items in rule by position; \[ \] and : are literal text.
hp = HeaderParser(rule, full_format=r"<0> <1> <2> <3> <4>\[<5>\]: <6>",
                  defaults={"year": 2024})
hp.process_line("Mar  5 06:07:08 db1 sshd[42]: accepted")
# {'host': 'db1', 'comp': 'sshd', 'pid': 42, 'message': 'accepted',
#  'timestamp': datetime.datetime(2024, 3, 5, 6, 7, 8)}

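Here [, ] and : are literal parts of the template, so the non-greedy comp pattern (.+?) stops at the opening bracket and pid is cleanly delimited. With a plain separator that included [, ] and :, those characters would instead be consumed as interchangeable separators rather than pinned to fixed positions. Choosing between the two is the subject of Practical Patterns.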
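The template expansion described above can be sketched with the standard library alone. This is an illustration under assumptions, not log2seq's code: `expand_template`, the `PATTERNS` list, and its patterns are hypothetical.

```python
import re

# Hypothetical item patterns, indexed by position like the items of a rule.
PATTERNS = [
    ("comp", r".+?"),
    ("pid", r"\d+"),
    ("message", r".*"),
]

def expand_template(template, patterns):
    def named_group(mo):
        name, pattern = patterns[int(mo.group(1))]
        return f"(?P<{name}>{pattern})"
    # Runs of spaces in the template become \s+ ...
    regex = re.sub(r" +", r"\\s+", template)
    # ... and each <i> becomes item i's pattern; everything else stays literal.
    regex = re.sub(r"<(\d+)>", named_group, regex)
    return re.compile("^" + regex + "$")

line_re = expand_template(r"<0>\[<1>\]: <2>", PATTERNS)
parts = line_re.match("sshd[42]: accepted").groupdict()
# parts["comp"] is "sshd", parts["pid"] is "42", parts["message"] is "accepted"
```

The brackets never reach a captured value because they are matched by the literal parts of the template, not by any item.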

Several formats in one parser (first match wins)

A LogParser can hold a list of HeaderParsers, tried from the front; the first that matches is used. Put the more specific rule first.

from log2seq import LogParser
from log2seq.statement import StatementParser, Split
from log2seq.header import Date

iso = HeaderParser([Date(), Time(), Hostname("host"), Statement()], separator=" ")
parser = LogParser([hp, iso], StatementParser([Split(" ")]))   # hp from above, then iso

parser.process_line("Mar  5 06:07:08 db1 sshd[42]: accepted")["message"]   # 'accepted'  (rule 1)
parser.process_line("2024-03-05 06:07:08 db1 plain message")["timestamp"]  # 2024-03-05 06:07:08 (rule 2)

If no rule matches, process_line raises LogParseFailure (unless the LogParser was built with ignore_failure=True).
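The sketch below shows why rule order matters under first-match-wins. It uses plain regular expressions rather than log2seq; `parse`, `NoRuleMatched`, and the three patterns are hypothetical, and returning None when `ignore_failure` is set is this sketch's choice, not a statement about what log2seq returns.

```python
import re

class NoRuleMatched(Exception):
    """Hypothetical stand-in for a parse-failure exception."""

SYSLOG = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+(?P<message>.*)$")
ISO = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s+(?P<message>.*)$")
CATCH_ALL = re.compile(r"^(?P<message>.*)$")

def parse(line, rules, ignore_failure=False):
    # Try the rules from the front and stop at the first one that matches.
    for index, rule in enumerate(rules):
        mo = rule.match(line)
        if mo:
            return index, mo.groupdict()
    if ignore_failure:
        return None
    raise NoRuleMatched(line)

line = "Mar  5 06:07:08 db1 disk full"
specific_first = parse(line, [SYSLOG, ISO, CATCH_ALL])   # rule 0 extracts a stamp
general_first = parse(line, [CATCH_ALL, SYSLOG, ISO])    # rule 0 swallows the line
```

With the catch-all in front, the more specific rules are never reached and the timestamp stays buried in the message, which is why the more specific rule should come first.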

The Item catalog

Items split into timestamp components (reassembled into timestamp) and plain fields.

Timestamp components

| Item | matches | value (name) |
| --- | --- | --- |
| Date() | 2024-03-05 | datetime.date (date) |
| Time() | 06:07:08, 06:07:08.012345+09:00 | datetime.time (time) |
| DatetimeISOFormat() | 2024-03-05T06:07:08+09:00 | datetime.datetime (timestamp) |
| MonthAbbreviation() | Jan to Dec | month int (month) |
| Digit("year"/"month"/"day"/"hour"/…) | digits | int, under the given name |
| YearWithoutCentury(century=20) | 24 | year int (year), computed as century*100 + nn |
| DateConcat(no_century=False, century=20) | 20240305 / 240305 | datetime.date (date) |
| TimeConcat() | 060708 | datetime.time (time) |
| DemicalSecond() | fractional digits | microseconds int (microsecond) |
| UnixTime(tz=timezone.utc) | 1551024123 | datetime.datetime (timestamp) |
| TimeZone() | Z, +0900, +09:00 | datetime.tzinfo (tzinfo) |

from log2seq import header as h

def val(item, s):
    # Probe one item in isolation, then convert the match to its final value.
    return item.pick_value(item.test(s))

val(h.MonthAbbreviation(), "Mar")              # 3
val(h.YearWithoutCentury(), "24")              # 2024   (default century 20)
val(h.YearWithoutCentury(century=19), "98")    # 1998
val(h.UnixTime(), "1551024123")                # datetime(2019, 2, 24, 16, 2, 3, tzinfo=utc)

Notes on determinism (see also Practical Patterns):

  • YearWithoutCentury / DateConcat(no_century=True) complete the century from the century argument (default 20 = 2000-2099), not from the wall clock.
  • UnixTime resolves the epoch in UTC by default; pass tz= for another zone.
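Both determinism notes can be checked with the standard library. The `complete_year` helper is a hypothetical restatement of the century*100 + nn rule from the catalog, not a log2seq function.

```python
from datetime import datetime, timedelta, timezone

def complete_year(two_digits, century=20):
    # The century comes from an explicit argument, never from today's date.
    return century * 100 + int(two_digits)

year_default = complete_year("24")            # 2024
year_1900s = complete_year("98", century=19)  # 1998

# The same epoch value rendered in two zones: one instant, different wall clocks.
epoch = 1551024123
in_utc = datetime.fromtimestamp(epoch, tz=timezone.utc)
in_plus9 = datetime.fromtimestamp(epoch, tz=timezone(timedelta(hours=9)))
# in_utc is 2019-02-24 16:02:03+00:00, in_plus9 is 2019-02-25 01:02:03+09:00
```

Passing an explicit zone keeps the result independent of the machine's local time zone, so the same log parses identically everywhere.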

The reassembly works on value names: an item named year/month/day/hour/minute/second/microsecond/tzinfo feeds the timestamp. Alternatively, use the aggregate items (Date → date, Time → time, DatetimeISOFormat/UnixTime → the whole timestamp). Supply any missing piece through defaults.
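As a rough sketch of name-based reassembly, the function below merges extracted values with defaults and builds one datetime. The `reassemble` helper is hypothetical, it handles only the component names (not the aggregate date/time items), and letting extracted values override defaults is an assumption of this sketch.

```python
from datetime import datetime

TIMESTAMP_FIELDS = ("year", "month", "day", "hour", "minute",
                    "second", "microsecond", "tzinfo")

def reassemble(values, defaults=None):
    merged = dict(defaults or {})
    merged.update(values)  # assumption: extracted values win over defaults
    parts = {k: merged[k] for k in TIMESTAMP_FIELDS if k in merged}
    # Non-timestamp fields pass through; the components collapse into one key.
    result = {k: v for k, v in values.items() if k not in TIMESTAMP_FIELDS}
    result["timestamp"] = datetime(**parts)
    return result

extracted = {"month": 3, "day": 5, "hour": 6, "minute": 7, "second": 8,
             "host": "db1"}
record = reassemble(extracted, defaults={"year": 2024})
# record["timestamp"] is datetime(2024, 3, 5, 6, 7, 8); record["host"] is "db1"
```

Without the year default, `datetime(**parts)` raises a TypeError in this sketch, which mirrors why a syslog-style rule needs `defaults`.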

Plain fields

| Item | matches | notes |
| --- | --- | --- |
| Hostname("host") | hostnames / IPv4 / IPv6 | e.g. 2001:db8::1 ✓, but a token with a space ✗ |
| String("name", symbols="_-") | [A-Za-z0-9]+ plus any symbols | letters/digits + the extra chars |
| Digit("name") | \d+ | returns an int |
| UserItem("name", r"…", strip=None) | your own regex | the most flexible item; strip trims the value |
| ItemGroup([...], separator=…, optional=…) | a sub-group | a cluster sharing a local separator, optionally absent |
| Statement() | the rest (.*) | the message body; exactly one per rule |

val(h.String("s", symbols="_-"), "a_b-c")          # 'a_b-c'
val(h.UserItem("u", r".+", strip=" "), " x ")      # 'x'
bool(h.Hostname("h").test("2001:db8::1"))           # True

Item flags

  • optional=True: the item may be absent. An absent optional item is omitted from the result (the key is not added), so "pid" in d tells you whether it matched. Example: Digit("pid", optional=True).
  • dummy=True: match but extract nothing. Use it for a fixed marker, or to avoid a duplicate value name when the same field appears twice.
  • strip="…" (UserItem only): applies str.strip("…") to the extracted value, trimming the given characters from both ends.

UserItem patterns must not contain ^, $, or optional groups (?); make a group optional through ItemGroup(optional=True) or the (...)? wrapper in full_format instead.
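The "absent optional item has no key" behaviour can be mimicked with a plain optional regex group. This is a standard-library sketch, not log2seq itself; the `LINE` pattern and the `extract` helper are hypothetical.

```python
import re

# The bracketed pid is wrapped in an optional non-capturing group,
# similar in spirit to the (...)? wrapper used in full_format.
LINE = re.compile(
    r"^(?P<comp>[A-Za-z0-9_-]+)(?:\[(?P<pid>\d+)\])?: (?P<message>.*)$")

def extract(line):
    mo = LINE.match(line)
    # Drop groups that did not take part in the match, so an absent
    # optional field has no key at all rather than a None value.
    return {k: v for k, v in mo.groupdict().items() if v is not None}

with_pid = extract("sshd[42]: accepted")
without_pid = extract("cron: job started")
# "pid" in with_pid is True, "pid" in without_pid is False
```

Testing for the key keeps "field absent" distinct from "field present but empty".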

Writing a custom Item

Subclass Item (or NamedItem for a named one), give it a pattern, and — if the matched text needs converting — a pick_value:

from log2seq.header import NamedItem

class HexId(NamedItem):
    @property
    def pattern(self):
        return r"0x[0-9a-fA-F]+"

    def pick_value(self, mo):
        return int(mo[self.match_name], 16)   # mo[self.match_name] is the matched text

HexId("id").pick_value(HexId("id").test("0x1f"))   # 31

The contract: pattern returns the item's regex without a capture group, because log2seq adds the named group itself. pick_value(mo) reads mo[self.match_name] and returns the final value, which may be of any type; if you don't override it, the matched text is returned unchanged. For a named item, the name serves as both the regex group name and the result-dict key. Use Item.test(s) to probe one item in isolation; it compiles a throwaway anchored pattern and is meant for debugging only.
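For readers who want to see the contract without the library, here is a miniature stand-in built on `re`. The `SketchItem` and `HexSketch` classes are hypothetical and only mirror the division of labour described above; they say nothing about how log2seq is actually implemented.

```python
import re

class SketchItem:
    """Hypothetical miniature of the pattern / pick_value contract."""
    pattern = r".+"

    def __init__(self, name):
        self.match_name = name

    def wrapped(self):
        # The framework, not the item, adds the named group.
        return f"(?P<{self.match_name}>{self.pattern})"

    def test(self, text):
        # Throwaway anchored pattern for probing one item in isolation.
        return re.match("^" + self.wrapped() + "$", text)

    def pick_value(self, mo):
        return mo[self.match_name]  # default: matched text, unchanged

class HexSketch(SketchItem):
    pattern = r"0x[0-9a-fA-F]+"

    def pick_value(self, mo):
        return int(mo[self.match_name], 16)  # convert the matched text

hex_value = HexSketch("id").pick_value(HexSketch("id").test("0x1f"))  # 31
raw_value = SketchItem("raw").pick_value(SketchItem("raw").test("abc"))  # 'abc'
```

Keeping the capture group out of `pattern` lets the same pattern be reused under different names.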
