Practical Patterns
Writing a parser for a real log format is mostly a few recurring decisions. This
page collects them, plus how to debug a parser against sample data with the CLI.
The repository's example/loghub_*/parser.py are full worked examples of
everything here.
A workflow from raw lines to a working parser. Take one representative line:
2024-08-09 11:22:33 web01 nginx[4521]: request from 10.0.0.5:443 at 09:15:00 done
1. Read a sample first. Look at a few dozen real lines and separate the
header envelope (here: an ISO date and time, a host, a program[pid] tag)
from the free body (request from …). Note the fixed delimiters ([, ],
: ) and whether the data has more than one line shape (tag-less meta lines,
continuation lines — see Multiple rules below).
2. Design the header. List the header fields left to right, pick an Item
for each, and pin the fixed delimiters with full_format (see separator vs
full_format). Iterate process_line on samples until message is exactly the
body you want:
from log2seq.header import HeaderParser, Date, Time, Hostname, UserItem, Digit, Statement
hp = HeaderParser([Date(), Time(), Hostname("host"), UserItem("program", r".+?"),
Digit("pid"), Statement()],
full_format=r"<0> <1> <2> <3>\[<4>\]: <5>")
hp.process_line("2024-08-09 11:22:33 web01 nginx[4521]: request from 10.0.0.5:443 at 09:15:00 done")
# {'host': 'web01', 'program': 'nginx', 'pid': 4521,
# 'message': 'request from 10.0.0.5:443 at 09:15:00 done',
# 'timestamp': datetime.datetime(2024, 8, 9, 11, 22, 33)}
For several line shapes, write one rule per shape, most specific first (see Multiple rules).
3. Design the statement — iterate with verbose=True. Decide what must stay
whole and what to split on, build the action list, and watch it run. A naive
"split on spaces, then on colons" tears the clock time apart:
from log2seq.statement import StatementParser, Split, Fix
from log2seq.preset import pattern_time
body = "request from 10.0.0.5:443 at 09:15:00 done"
StatementParser([Split(" "), Split(":")]).process_line(body, verbose=True)
Split: 'request', 'from', '10.0.0.5:443', 'at', '09:15:00', 'done'
Split: 'request', 'from', '10.0.0.5', '443', 'at', '09', '15', '00', 'done'
The trace ('…' is UNKNOWN) shows 09:15:00 shattered into 09 15 00. It also
tells you the fix: protect the time before the : split.
StatementParser([Split(" "), Fix(pattern_time), Split(":")]).process_line(body, verbose=True)
Split: 'request', 'from', '10.0.0.5:443', 'at', '09:15:00', 'done'
Fix: 'request', 'from', '10.0.0.5:443', 'at', #09:15:00#, 'done'
Split: 'request', 'from', '10.0.0.5', '443', 'at', #09:15:00#, 'done'
Now #09:15:00# is FIXED and survives the final split. This protect-then-split
loop — read the trace, insert or reorder one action, re-run — is how you design
the statement stage. The action catalog and the (part, flag) model are in
Statement Rules.
4. Verify against real data. Bind the two stages into a LogParser, run the
CLI over a sample with --failures-only and drive failures to zero; if you have
ground truth, check the parsed message/words against it; then run the full
data for coverage. See Debugging a parser with the CLI and Full-data
robustness below.
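The protect-then-split idea from step 3 can be reproduced with the standard library alone, which may help when reasoning about action order away from the real parser. This is a sketch, not log2seq's implementation: `split`, `fix`, and `TIME` are hypothetical stand-ins for `Split`, `Fix`, and `pattern_time`, and each part is modelled as a `(text, fixed)` pair.

```python
import re

# Assumed stand-in for log2seq.preset.pattern_time (the real pattern may differ).
TIME = re.compile(r"\d{2}:\d{2}:\d{2}")


def split(parts, sep):
    """Split every part that is not fixed on sep; fixed parts pass through whole."""
    out = []
    for text, fixed in parts:
        if fixed:
            out.append((text, True))
        else:
            out.extend((piece, False) for piece in text.split(sep) if piece)
    return out


def fix(parts, pattern):
    """Mark parts that fully match pattern as fixed so later splits skip them."""
    return [(text, fixed or pattern.fullmatch(text) is not None)
            for text, fixed in parts]


body = "request from 10.0.0.5:443 at 09:15:00 done"
start = [(body, False)]

# Naive order: the ":" split also cuts the clock time.
naive_words = [t for t, _ in split(split(start, " "), ":")]

# Protect the time first, then split on ":".
protected_words = [t for t, _ in split(fix(split(start, " "), TIME), ":")]

print(naive_words)
print(protected_words)
```

The only difference between the two runs is the `fix` step placed before the `:` split, which mirrors the one-action change made in step 3.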
The rest of this page expands each decision in that workflow.
separator vs full_format
- Use separator when fields are just whitespace/punctuation-delimited and you don't care which separator appears where. It's the simplest option.
- Use full_format when the format has fixed delimiters that must stay literal: brackets around a time/level, a - between program and content, the : that ends a syslog tag. If you put those characters in a separator set instead, every occurrence (including ones inside the content) is eaten as a separator.
A concrete trap: Apache's [client <ip>] is part of the content. With
separator=" []" the brackets are consumed and the leading [ is lost; pinning
\[<Time>\] \[<Level>\] <Content> with full_format keeps it.
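The Apache trap can be seen with plain regular expressions. The sketch below uses only the standard library `re` module, not log2seq: a character-class split stands in for a `separator=" []"` set, and an anchored pattern stands in for a `full_format` that pins the brackets. The sample line and its address are made up for illustration.

```python
import re

# Made-up Apache-style error line; 192.0.2.1 is a documentation address.
line = "[Sun Dec 04 04:47:44 2005] [error] [client 192.0.2.1] Directory index forbidden"

# Separator-set behaviour: every space and bracket is a delimiter,
# including the brackets that belong to the content.
tokens = [t for t in re.split(r"[ \[\]]+", line) if t]

# Pinned-delimiter behaviour: only the header brackets are literal,
# so the content is captured untouched.
anchored = re.compile(r"\[(?P<time>.+?)\] \[(?P<level>\w+)\] (?P<content>.*)")
fields = anchored.fullmatch(line).groupdict()

print(tokens)
print(fields["content"])
```

In the first result the `[client ...]` brackets are gone and cannot be recovered; in the second the content still begins with `[client`.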
Some fields have no tidy character set — a syslog component can be
Kernel command line, syslogd 1.4.1, /sbin/mingetty, com.apple.xpc.launchd.
The right model is a non-greedy .+? anchored by an explicit delimiter in
full_format:
# <Month> <Date> <Time> <Level> <Component>(\[<PID>\])?: <Content>
header_rule = [MonthAbbreviation(), Digit("day"), Time(), Hostname("host"),
UserItem("component", r".+?"), Digit("processid", optional=True),
Statement()]
HeaderParser(header_rule, full_format=r"<0> <1> <2> <3> <4>(\[<5>\])?: <6>")
The : is what stops .+? from running away. This is the difference that
matters: .+? with a delimiter anchor models a real free-form field; a bare
permissive pattern or a catch-all that swallows the whole line hides structure
and mis-splits future logs. Reach for .+? only when (a) the field has no
fixed character set and (b) a literal delimiter pins its end.
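To see why the delimiter anchor matters, here is a standard-library sketch that writes the same header shape as a plain regular expression. It is an approximation for illustration, not what `HeaderParser` builds internally, and the sample lines are made up. The only difference between the two patterns is `.+?` versus `.+` for the component.

```python
import re

ENVELOPE = (r"(?P<month>[A-Z][a-z]{2}) (?P<day>\d+) "
            r"(?P<time>\d{2}:\d{2}:\d{2}) (?P<host>\S+) ")
TAIL = r"(?:\[(?P<pid>\d+)\])?: (?P<content>.*)"

# Non-greedy component: stops at the first ": " (optionally preceded by [pid]).
anchored = re.compile(ENVELOPE + r"(?P<component>.+?)" + TAIL)
# Greedy component: runs on to the last ": " in the line.
greedy = re.compile(ENVELOPE + r"(?P<component>.+)" + TAIL)

samples = [
    "Jun 14 15:16:01 combo sshd(pam_unix)[19939]: authentication failure",
    "Jun 14 15:16:02 combo syslogd 1.4.1: restart.",
    "Jun 14 15:16:03 combo kernel: audit: initializing netlink socket",
]

parsed = [anchored.fullmatch(s).groupdict() for s in samples]
components = [p["component"] for p in parsed]
pids = [p["pid"] for p in parsed]

# The greedy variant swallows part of the content on the third line.
runaway = greedy.fullmatch(samples[2]).group("component")

print(components, pids, runaway)
```

The third sample has a second `: ` inside its content, which is exactly the case where a greedy or unanchored pattern mis-splits.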
Multiple rules
A LogParser tries its header rules in order. Use that to express real line
classes, not to paper over failures:
- The primary rule mirrors the documented format.
- A secondary rule models another real class the data contains — and should still extract whatever structure it has.
Syslog streams interleave tag-less meta-lines that have no tag: content:
last message repeated 2 times, exiting on signal 15. They are not malformed —
they are a real syslog message class. Model them with a rule that keeps the
timestamp/host envelope and takes the remainder as the message, rather than
dropping the whole line to a blanket catch-all. (A blanket [Statement()] rule
that discards the timestamp is a last resort, for genuinely header-less lines
such as multi-line continuation output.)
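The ordering described above can be sketched with an ordered list of regular expressions tried first to last. This is a standard-library illustration of the idea, not the `LogParser` API; the rule names (`tagged`, `meta`, `bare`) and the sample lines are hypothetical.

```python
import re

ENVELOPE = r"(?P<timestamp>[A-Z][a-z]{2} \d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) "

# Most specific first; the blanket rule comes last.
RULES = [
    # Primary: envelope + "tag[pid]: content".
    ("tagged", re.compile(
        ENVELOPE + r"(?P<component>.+?)(?:\[(?P<pid>\d+)\])?: (?P<message>.*)")),
    # Secondary: tag-less meta line that still keeps timestamp and host.
    ("meta", re.compile(ENVELOPE + r"(?P<message>.*)")),
    # Last resort: header-less line, everything is the message.
    ("bare", re.compile(r"(?P<message>.*)")),
]


def parse(line):
    """Return (rule_name, fields) for the first rule that matches the whole line."""
    for name, pattern in RULES:
        match = pattern.fullmatch(line)
        if match:
            return name, match.groupdict()
    raise ValueError("no rule matched: " + line)


tagged = parse("Jun 14 15:16:01 combo sshd[19939]: session opened")
meta = parse("Jun 14 15:16:05 combo last message repeated 2 times")
bare = parse("    at com.example.Foo.bar(Foo.java:42)")

print(tagged[0], meta[0], bare[0])
```

Note that the `meta` rule still yields a timestamp and host for the meta-line; only the `bare` rule gives them up.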
Full-data robustness
The in-repo <Name>_2k.log is only a 2,000-line loghub sample; a real parser
must process the full dataset (often orders of magnitude larger) without
failures. Verify on two axes:
- Correctness: on the labelled sample, the parsed message equals the ground-truth content.
- Coverage: over the full data, zero LogParseFailure/exceptions. Stream it rather than loading it; watch for non-UTF-8 bytes (decode with errors="replace").
A pattern that looks fine on 2,000 lines can still fail or mis-split on the long tail, so both axes matter.
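A coverage pass can be a small streaming loop. The sketch below uses only the standard library and assumes nothing about the log2seq API: `parse` is a hypothetical stand-in that raises on a bad line (a real run would call your parser and catch its failure exception), and the in-memory bytes stand in for a large file.

```python
import io


def parse(line):
    """Hypothetical stand-in parser: accepts only lines that start with 'ok'."""
    if not line.startswith("ok"):
        raise ValueError("header format mismatch: " + line)
    return line.split()


def coverage(binary_stream, parse_line, keep=5):
    """Stream lines, decoding bad bytes leniently, and count parse outcomes."""
    text = io.TextIOWrapper(binary_stream, encoding="utf-8", errors="replace")
    ok = failed = 0
    first_failures = []
    for raw in text:  # one line at a time, never the whole file
        line = raw.rstrip("\n")
        try:
            parse_line(line)
            ok += 1
        except Exception as exc:
            failed += 1
            if len(first_failures) < keep:
                first_failures.append((line, str(exc)))
    return ok, failed, first_failures


# Contains two invalid UTF-8 sequences: a lone \xe9 and \xff\xfe.
data = b"ok one\nok caf\xe9\n\xff\xfe garbage\nok two\n"
ok, failed, first_failures = coverage(io.BytesIO(data), parse)

print(ok, failed, first_failures)
```

For a real dataset, pass `open(path, "rb")` instead of the `BytesIO` object; the invalid bytes become U+FFFD replacement characters rather than aborting the run.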
Debugging a parser with the CLI
Point the CLI at a sample with --parser. Successful results go to stdout
(pipeable); parse failures and a final summary go to stderr:
$ printf 'Jan 1 12:34:56 host a[1]: ok one\nGARBAGE no header\nFeb 2 01:02:03 host b[2]: ok two\n' \
| python -m log2seq -t words
a 1 ok one # stdout
b 2 ok two # stdout
parse failed: 'GARBAGE no header': header format mismatch: GARBAGE no header # stderr
# processed 3 lines: 2 ok, 1 failed # stderr
- --failures-only suppresses the stdout results, leaving just the failures and the summary. It is the quickest way to see what a parser still misses.
- --max-failures N caps the per-failure lines (default 5; 0 for all).
- Exit status is 0 when at least one line parses, 1 when none do (often a sign the parser doesn't fit the data), and 2 on a startup error (e.g. an unloadable parser script).
So python -m log2seq -p myparser.py sample.log --failures-only iterates a parser
against real data: the failures and the M ok, K failed count tell you exactly
where it stands.
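If you want to track the count across iterations (for example in a test script), the summary line can be read back from captured stderr. This is a minimal sketch that assumes the summary keeps exactly the shape shown in the example above; `read_summary` is a hypothetical helper, not part of log2seq.

```python
import re

# Assumed shape of the final stderr line, copied from the example output.
SUMMARY = re.compile(
    r"^# processed (\d+) lines: (\d+) ok, (\d+) failed\s*$", re.MULTILINE)


def read_summary(stderr_text):
    """Return the counts from the summary line, or None if it is absent."""
    match = SUMMARY.search(stderr_text)
    if match is None:
        return None
    total, ok, failed = map(int, match.groups())
    return {"total": total, "ok": ok, "failed": failed}


stderr_text = (
    "parse failed: 'GARBAGE no header': header format mismatch: GARBAGE no header\n"
    "# processed 3 lines: 2 ok, 1 failed\n"
)
summary = read_summary(stderr_text)

print(summary)
```

Returning `None` when the line is missing keeps a startup error (exit status 2, no summary) distinguishable from a run with zero failures.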
- Header Rules / Statement Rules: the parts.
- Building a Parser: assembling and driving a parser.
- Presets: example/loghub_* as full worked examples.