Header Rules
The header stage extracts the structured front matter of a line — at least a
message, and usually a timestamp and host. You describe it with a
HeaderParser: an ordered list of Items that log2seq compiles into one
regular expression.
```python
from log2seq.header import HeaderParser, MonthAbbreviation, Digit, Time, Hostname, Statement

rule = [MonthAbbreviation(), Digit("day"), Time(), Hostname("host"), Statement()]
hp = HeaderParser(rule, separator=" ", defaults={"year": 2024})
hp.process_line("Mar 5 06:07:08 db1 disk full")
# {'host': 'db1', 'message': 'disk full', 'timestamp': datetime.datetime(2024, 3, 5, 6, 7, 8)}
```

- Exactly one `Statement()` is mandatory in every rule; it captures the body under `message`.
- Each item's value lands in the result under its value name (see the catalog). Timestamp-related items are reassembled into a single `timestamp`.
- Missing timestamp fields (a syslog line has no year) come from `defaults`.
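To make "an ordered list of items compiled into one regular expression" concrete, here is a minimal standard-library sketch. It is not log2seq's implementation: `ITEMS`, `compile_rule`, and the simplified patterns are hypothetical stand-ins, and the real items use richer patterns and convert the matched text.

```python
import re

# Hypothetical (name, pattern) pairs standing in for Items; not log2seq classes.
ITEMS = [
    ("month", r"[A-Z][a-z]{2}"),
    ("day", r"\d{1,2}"),
    ("time", r"\d{2}:\d{2}:\d{2}"),
    ("host", r"\S+"),
    ("message", r".*"),
]


def compile_rule(items, separator=" "):
    """Join one named group per item with 'one or more separator characters'."""
    sep = "[" + re.escape(separator) + "]+"
    body = sep.join(f"(?P<{name}>{pattern})" for name, pattern in items)
    return re.compile("^" + body + "$")


rule_re = compile_rule(ITEMS)
fields = rule_re.match("Mar  5 06:07:08 db1 disk full").groupdict()
print(fields["host"], "|", fields["message"])  # db1 | disk full
```

Each item contributes one named group, so the result keys follow directly from the item names.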
A HeaderParser needs to know where one item ends and the next begins. Two ways:
- `separator` (simple, recommended): a set of characters that separate items. `separator=" :[]"` means runs of space, `:`, `[`, or `]` divide the fields. Good when the layout is just whitespace- or punctuation-delimited.
- `full_format`: a template where `<i>` is replaced by item i's pattern and everything else is literal. Use it to pin fixed delimiters so they are not read as content. Runs of spaces become `\s+`; wrap optional items by hand with `(...)?`.
```python
from log2seq.header import UserItem  # the other names were imported in the first example

rule = [MonthAbbreviation(), Digit("day"), Time(), Hostname("host"),
        UserItem("comp", r".+?"), Digit("pid"), Statement()]
hp = HeaderParser(rule, full_format=r"<0> <1> <2> <3> <4>\[<5>\]: <6>",
                  defaults={"year": 2024})
hp.process_line("Mar 5 06:07:08 db1 sshd[42]: accepted")
# {'host': 'db1', 'comp': 'sshd', 'pid': 42, 'message': 'accepted',
#  'timestamp': datetime.datetime(2024, 3, 5, 6, 7, 8)}
```

Here the `[`, `]` and `:` are literal in the template, so `comp` (`.+?`) and `pid` are cleanly delimited. With a plain `separator` that included `[]:`, those brackets would be consumed as separators. Choosing between the two is the subject of Practical Patterns.
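The template expansion described for `full_format` can be sketched with the standard library alone. This is an illustration under assumptions, not the library's code: `expand_full_format` and `PATTERNS` are hypothetical names, and the patterns are simplified.

```python
import re


def expand_full_format(template, patterns):
    """Replace <i> with a named group for pattern i; runs of spaces become \\s+."""
    def named_group(mo):
        name, pattern = patterns[int(mo.group(1))]
        return f"(?P<{name}>{pattern})"

    # Everything that is neither <i> nor a space stays literal regex text.
    spaced = re.sub(r" +", lambda mo: r"\s+", template)
    return re.compile("^" + re.sub(r"<(\d+)>", named_group, spaced) + "$")


# Hypothetical (name, pattern) pairs, indexed by position like <0>, <1>, ...
PATTERNS = [("host", r"\S+"), ("comp", r".+?"), ("pid", r"\d+"), ("message", r".*")]

line_re = expand_full_format(r"<0> <1>\[<2>\]: <3>", PATTERNS)
fields = line_re.match("db1 sshd[42]: accepted").groupdict()
print(fields)  # {'host': 'db1', 'comp': 'sshd', 'pid': '42', 'message': 'accepted'}
```

Because the brackets and colon are fixed text in the template, the lazy `.+?` for `comp` stops exactly at the first `[`.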
A LogParser can hold a list of HeaderParsers, tried from the front; the
first that matches is used. Put the more specific rule first.
```python
from log2seq import LogParser
from log2seq.statement import StatementParser, Split
from log2seq.header import Date

iso = HeaderParser([Date(), Time(), Hostname("host"), Statement()], separator=" ")
parser = LogParser([hp, iso], StatementParser([Split(" ")]))  # hp from above, then iso
parser.process_line("Mar 5 06:07:08 db1 sshd[42]: accepted")["message"]    # 'accepted' (rule 1)
parser.process_line("2024-03-05 06:07:08 db1 plain message")["timestamp"]  # 2024-03-05 06:07:08 (rule 2)
```

If no rule matches, `process_line` raises `LogParseFailure` (unless the `LogParser` was built with `ignore_failure=True`).
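The "tried from the front, first match wins" behavior can be pictured with a small standard-library sketch. The names `RULES`, `parse`, and `NoRuleMatched` are hypothetical, and returning `None` when `ignore_failure` is set is an assumption of this sketch rather than documented library behavior.

```python
import re


class NoRuleMatched(Exception):
    """Hypothetical stand-in for a parse-failure exception."""


# Ordered rules: the more specific layout goes first.
RULES = [
    ("syslog", re.compile(
        r"^(?P<month>[A-Z][a-z]{2}) +(?P<day>\d{1,2}) "
        r"(?P<time>\d{2}:\d{2}:\d{2}) (?P<host>\S+) (?P<message>.*)$")),
    ("iso", re.compile(
        r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
        r"(?P<host>\S+) (?P<message>.*)$")),
]


def parse(line, ignore_failure=False):
    """Return (rule name, fields) for the first rule that matches."""
    for rule_name, pattern in RULES:
        mo = pattern.match(line)
        if mo:
            return rule_name, mo.groupdict()
    if ignore_failure:
        return None  # assumption: signal failure quietly
    raise NoRuleMatched(line)


print(parse("Mar 5 06:07:08 db1 sshd[42]: accepted")[0])   # syslog
print(parse("2024-03-05 06:07:08 db1 plain message")[0])   # iso
```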
Items split into timestamp components (reassembled into timestamp) and
plain fields.
| Item | matches | value (name) |
|---|---|---|
| `Date()` | `2024-03-05` | `datetime.date` (`date`) |
| `Time()` | `06:07:08`, `06:07:08.012345+09:00` | `datetime.time` (`time`) |
| `DatetimeISOFormat()` | `2024-03-05T06:07:08+09:00` | `datetime.datetime` (`timestamp`) |
| `MonthAbbreviation()` | `Jan`…`Dec` | month int (`month`) |
| `Digit("year"/"month"/"day"/"hour"/…)` | digits | int, under the given name |
| `YearWithoutCentury(century=20)` | `24` | year int (`year`): `century*100 + nn` |
| `DateConcat(no_century=False, century=20)` | `20240305` / `240305` | `datetime.date` (`date`) |
| `TimeConcat()` | `060708` | `datetime.time` (`time`) |
| `DemicalSecond()` | fractional digits | microseconds int (`microsecond`) |
| `UnixTime(tz=timezone.utc)` | `1551024123` | `datetime.datetime` (`timestamp`) |
| `TimeZone()` | `Z`, `+0900`, `+09:00` | `datetime.tzinfo` (`tzinfo`) |
```python
from log2seq import header as h

def val(item, s):
    return item.pick_value(item.test(s))

val(h.MonthAbbreviation(), "Mar")            # 3
val(h.YearWithoutCentury(), "24")            # 2024 (default century 20)
val(h.YearWithoutCentury(century=19), "98")  # 1998
val(h.UnixTime(), "1551024123")              # datetime(2019, 2, 24, 16, 2, 3, tzinfo=utc)
```

Notes on determinism (see also Practical Patterns):

- `YearWithoutCentury` / `DateConcat(no_century=True)` complete the century from the `century` argument (default `20` = 2000-2099), not from the wall clock.
- `UnixTime` resolves the epoch in UTC by default; pass `tz=` for another zone.
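The two determinism rules reduce to plain arithmetic and a timezone-aware epoch conversion. The sketch below uses only the standard library; `complete_year` and `from_epoch` are hypothetical helper names that mirror the described behavior, not functions exported by log2seq.

```python
from datetime import datetime, timedelta, timezone


def complete_year(two_digit, century=20):
    """Century completion from an explicit argument, never from today's date."""
    return century * 100 + int(two_digit)


def from_epoch(text, tz=timezone.utc):
    """Interpret epoch seconds in a fixed zone (UTC unless told otherwise)."""
    return datetime.fromtimestamp(int(text), tz=tz)


print(complete_year("24"))              # 2024
print(complete_year("98", century=19))  # 1998
print(from_epoch("1551024123"))         # 2019-02-24 16:02:03+00:00

jst = timezone(timedelta(hours=9))
print(from_epoch("1551024123", tz=jst))  # 2019-02-25 01:02:03+09:00
```

Both helpers give the same answer on any machine and on any day, which is the point of not consulting the wall clock or the local zone.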
The reassembly works on value names: an item named year/month/day/hour/
minute/second/microsecond/tzinfo feeds the timestamp; or use the
aggregate items (Date → date, Time → time, DatetimeISOFormat/UnixTime
→ the whole timestamp). Supply any missing piece through defaults.
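The reassembly rule can be pictured as a merge of named pieces followed by one `datetime` construction. This is a sketch under assumptions: `assemble_timestamp` is a hypothetical name, and treating absent time-of-day fields as zero is a choice made here for brevity, not something the text above specifies.

```python
from datetime import date, datetime, time


def assemble_timestamp(fields, defaults=None):
    """Build one datetime from named pieces, filling gaps from defaults."""
    merged = dict(defaults or {})
    merged.update(fields)  # parsed values win over defaults
    if "timestamp" in merged:  # a whole-timestamp item needs no assembly
        return merged["timestamp"]
    if "date" in merged:  # aggregate date expands to year/month/day
        d = merged["date"]
        merged.update(year=d.year, month=d.month, day=d.day)
    if "time" in merged:  # aggregate time expands to its components
        t = merged["time"]
        merged.update(hour=t.hour, minute=t.minute, second=t.second,
                      microsecond=t.microsecond)
        if t.tzinfo is not None:
            merged["tzinfo"] = t.tzinfo
    return datetime(
        merged["year"], merged["month"], merged["day"],
        merged.get("hour", 0), merged.get("minute", 0), merged.get("second", 0),
        merged.get("microsecond", 0), tzinfo=merged.get("tzinfo"),
    )


# Syslog-style pieces have no year, so it comes from defaults.
syslog_ts = assemble_timestamp({"month": 3, "day": 5, "time": time(6, 7, 8)},
                               defaults={"year": 2024})
iso_ts = assemble_timestamp({"date": date(2024, 3, 5), "time": time(6, 7, 8)})
print(syslog_ts)  # 2024-03-05 06:07:08
```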
| Item | matches | notes |
|---|---|---|
| `Hostname("host")` | hostnames / IPv4 / IPv6 | e.g. `2001:db8::1` ✓, but a token with a space ✗ |
| `String("name", symbols="_-")` | `[A-Za-z0-9]+` plus any `symbols` | letters/digits + the extra chars |
| `Digit("name")` | `\d+` | returns an `int` |
| `UserItem("name", r"…", strip=None)` | your own regex | the most flexible item; `strip` trims the value |
| `ItemGroup([...], separator=…, optional=…)` | a sub-group | a cluster sharing a local separator, optionally absent |
| `Statement()` | the rest (`.*`) | the message body, exactly one per rule |
```python
val(h.String("s", symbols="_-"), "a_b-c")       # 'a_b-c'
val(h.UserItem("u", r".+", strip=" "), " x ")   # 'x'
bool(h.Hostname("h").test("2001:db8::1"))       # True
```

- `optional=True`: the item may be absent. An absent optional item is omitted from the result (the key is not added), so `"pid" in d` tells you whether it matched (for example, `Digit("pid", optional=True)`).
- `dummy=True`: match but extract nothing. Use it for a fixed marker, or to avoid a duplicate value name when the same field appears twice.
- `strip="…"` (`UserItem` only): `str.strip()` the extracted value.
UserItem patterns must not contain ^, $, or optional groups (?); make a
group optional through ItemGroup(optional=True) or the (...)? wrapper in
full_format instead.
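The "absent optional item is omitted" behavior is easy to mimic with a hand-written optional group, in the spirit of the `(...)?` wrapper. The sketch below is standard-library only; `PATTERN` and `parse` are hypothetical and only imitate the described result shape.

```python
import re

# The pid part, brackets included, is wrapped in an optional non-capturing group.
PATTERN = re.compile(
    r"^(?P<comp>[A-Za-z0-9_-]+)(?:\[(?P<pid>\d+)\])?: (?P<message>.*)$"
)


def parse(line):
    """Return matched fields, leaving out optional groups that did not match."""
    mo = PATTERN.match(line)
    if mo is None:
        return None
    result = {key: value for key, value in mo.groupdict().items() if value is not None}
    if "pid" in result:
        result["pid"] = int(result["pid"])  # digits become an int
    return result


with_pid = parse("sshd[42]: accepted")
without_pid = parse("cron: job started")
print(with_pid)               # {'comp': 'sshd', 'pid': 42, 'message': 'accepted'}
print("pid" in without_pid)   # False
```

Dropping the key rather than storing `None` keeps the membership test `"pid" in d` meaningful.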
Subclass Item (or NamedItem for a named one), give it a pattern, and — if
the matched text needs converting — a pick_value:
```python
from log2seq.header import NamedItem

class HexId(NamedItem):
    @property
    def pattern(self):
        return r"0x[0-9a-fA-F]+"

    def pick_value(self, mo):
        return int(mo[self.match_name], 16)  # mo[self.match_name] is the matched text

HexId("id").pick_value(HexId("id").test("0x1f"))  # 31
```

The contract: `pattern` returns the item's regex (no capture group, because log2seq adds the named group itself); `pick_value(mo)` reads `mo[self.match_name]` and returns the final value (any type), or `mo[self.match_name]` unchanged if you don't override it. For a named item, the name doubles as both the regex group name and the result-dict key. Use `Item.test(s)` to probe one item in isolation (it compiles a throwaway anchored pattern, for debugging only).
- Statement Rules: the second stage.
- Building a Parser: assembling and driving the parser.
- Python API: `HeaderParser`, exceptions, result keys.