Practical Patterns

sat edited this page Jun 26, 2026 · 2 revisions

Writing a parser for a real log format mostly comes down to a few recurring decisions. This page collects them and shows how to debug a parser against sample data with the CLI. The example/loghub_*/parser.py files in the repository are full worked examples of everything here.

Designing a parser from your logs

A workflow from raw lines to a working parser. Take one representative line:

2024-08-09 11:22:33 web01 nginx[4521]: request from 10.0.0.5:443 at 09:15:00 done

1. Read a sample first. Look at a few dozen real lines and separate the header envelope (here: an ISO date and time, a host, a program[pid] tag) from the free body (request from …). Note the fixed delimiters ([, ], : ) and whether the data has more than one line shape (tag-less meta lines, continuation lines — see Multiple rules below).

2. Design the header. List the header fields left to right, pick an Item for each, and pin the fixed delimiters with full_format (see separator vs full_format). Iterate process_line on samples until message is exactly the body you want:

from log2seq.header import HeaderParser, Date, Time, Hostname, UserItem, Digit, Statement
hp = HeaderParser([Date(), Time(), Hostname("host"), UserItem("program", r".+?"),
                   Digit("pid"), Statement()],
                  full_format=r"<0> <1> <2> <3>\[<4>\]: <5>")
hp.process_line("2024-08-09 11:22:33 web01 nginx[4521]: request from 10.0.0.5:443 at 09:15:00 done")
# {'host': 'web01', 'program': 'nginx', 'pid': 4521,
#  'message': 'request from 10.0.0.5:443 at 09:15:00 done',
#  'timestamp': datetime.datetime(2024, 8, 9, 11, 22, 33)}

For several line shapes, write one rule per shape, most specific first (see Multiple rules).

3. Design the statement — iterate with verbose=True. Decide what must stay whole and what to split on, build the action list, and watch it run. A naive "split on spaces, then on colons" tears the clock time apart:

from log2seq.statement import StatementParser, Split, Fix
from log2seq.preset import pattern_time
body = "request from 10.0.0.5:443 at 09:15:00 done"
StatementParser([Split(" "), Split(":")]).process_line(body, verbose=True)
Split: 'request', 'from', '10.0.0.5:443', 'at', '09:15:00', 'done'
Split: 'request', 'from', '10.0.0.5', '443', 'at', '09', '15', '00', 'done'

In the trace, parts printed in quotes ('…') are flagged UNKNOWN. It shows 09:15:00 shattered into 09 15 00, and it also tells you the fix: protect the time before the : split.

StatementParser([Split(" "), Fix(pattern_time), Split(":")]).process_line(body, verbose=True)
Split: 'request', 'from', '10.0.0.5:443', 'at', '09:15:00', 'done'
Fix: 'request', 'from', '10.0.0.5:443', 'at', #09:15:00#, 'done'
Split: 'request', 'from', '10.0.0.5', '443', 'at', #09:15:00#, 'done'

Now #09:15:00# is FIXED and survives the final split. This protect-then-split loop — read the trace, insert or reorder one action, re-run — is how you design the statement stage. The action catalog and the (part, flag) model are in Statement Rules.
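To make the protect-then-split idea concrete outside the library, here is a minimal standard-library sketch of a (part, flag) pipeline. It is a toy model, not log2seq's implementation: the helper names split and fix, the flag constants, and the TIME pattern (a stand-in for pattern_time) are all assumptions made for illustration.

```python
import re

UNKNOWN, FIXED = "unknown", "fixed"  # toy flags, named after the trace above


def split(parts, seps):
    """Split every UNKNOWN part on any character in seps; FIXED parts pass through."""
    out = []
    for text, flag in parts:
        if flag == FIXED:
            out.append((text, flag))
            continue
        pieces = re.split("[" + re.escape(seps) + "]+", text)
        out.extend((p, UNKNOWN) for p in pieces if p)
    return out


def fix(parts, pattern):
    """Mark UNKNOWN parts that match the pattern in full as FIXED."""
    regex = re.compile(pattern)
    return [(t, FIXED if f == UNKNOWN and regex.fullmatch(t) else f)
            for t, f in parts]


TIME = r"\d{2}:\d{2}:\d{2}"  # hypothetical stand-in for log2seq.preset.pattern_time
body = "request from 10.0.0.5:443 at 09:15:00 done"
start = [(body, UNKNOWN)]

# Naive order: the colon split also tears the clock time apart.
naive = split(split(start, " "), ":")
# Protected order: the time is fixed before the colon split runs.
protected = split(fix(split(start, " "), TIME), ":")

print([t for t, _ in naive])
print([t for t, _ in protected])
```

The only difference between the two pipelines is the fix step inserted before the colon split, which mirrors the one-action change made in the trace above.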

4. Verify against real data. Bind the two stages into a LogParser, run the CLI over a sample with --failures-only, and drive failures to zero. If you have ground truth, check the parsed message/words against it. Then run the full data for coverage. See Debugging a parser with the CLI and Full-data robustness below.

The rest of this page expands each decision in that workflow.

separator vs full_format

  • Use separator when fields are just whitespace/punctuation-delimited and you don't care which separator appears where. It's the simplest option.
  • Use full_format when the format has fixed delimiters that must stay literal — brackets around a time/level, a - between program and content, the : that ends a syslog tag. If you put those characters in a separator set instead, every occurrence (including ones inside the content) is eaten as a separator.

A concrete trap: Apache's [client <ip>] is part of the content. With separator=" []" the brackets are consumed and the leading [ is lost; pinning \[<Time>\] \[<Level>\] <Content> with full_format keeps it.
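The trap can be reproduced with plain regular expressions. The sketch below is an analogy only, using the standard re module rather than log2seq, and the Apache-style sample line is made up for illustration. It contrasts treating space and brackets as interchangeable separators with pinning the brackets as literals at fixed positions.

```python
import re

# Illustrative Apache-style error line (not taken from a real dataset).
line = "[Sun Dec 04 04:47:44 2005] [error] [client 192.0.2.7] Directory index forbidden"

# Separator-style: any run of space or bracket characters is a boundary,
# so brackets inside the content are consumed along with the header ones.
tokens = [t for t in re.split(r"[ \[\]]+", line) if t]
content_sep = " ".join(tokens[6:])  # first 5 tokens are the time, the 6th the level

# full_format-style: brackets are literals around the time and the level only.
m = re.fullmatch(r"\[(?P<time>.+?)\] \[(?P<level>\w+)\] (?P<content>.*)", line)
content_full = m.group("content")

print(content_sep)   # brackets around "client 192.0.2.7" are gone
print(content_full)  # content keeps its leading "[client 192.0.2.7]"
```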

Free-form fields: anchor, don't widen blindly

Some fields have no tidy character set — a syslog component can be Kernel command line, syslogd 1.4.1, /sbin/mingetty, com.apple.xpc.launchd. The right model is a non-greedy .+? anchored by an explicit delimiter in full_format:

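from log2seq.header import (HeaderParser, MonthAbbreviation, Digit, Time,
                            Hostname, UserItem, Statement)

# Format: <Month> <Date> <Time> <Level> <Component>(\[<PID>\])?: <Content>
# (the <Level> slot is modeled as Hostname("host") below)
header_rule = [MonthAbbreviation(), Digit("day"), Time(), Hostname("host"),
               UserItem("component", r".+?"), Digit("processid", optional=True),
               Statement()]
HeaderParser(header_rule, full_format=r"<0> <1> <2> <3> <4>(\[<5>\])?: <6>")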

The : is what stops .+? from running away. The distinction matters: .+? with a delimiter anchor models a real free-form field, whereas a bare permissive pattern or a catch-all that swallows the whole line hides structure and mis-splits future logs. Reach for .+? only when (a) the field has no fixed character set and (b) a literal delimiter pins its end.
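The effect of the anchor is easy to check with the standard re module. This is a sketch of the regex behaviour only, not of how log2seq compiles full_format; the component names come from the list above, while the message contents are invented.

```python
import re

# Non-greedy component, optional [pid], then the literal ": " anchor.
anchored = re.compile(r"(?P<component>.+?)(?:\[(?P<pid>\d+)\])?: (?P<content>.*)")
# Same shape with a greedy component, for contrast.
greedy = re.compile(r"(?P<component>.+)(?:\[(?P<pid>\d+)\])?: (?P<content>.*)")

samples = [
    "syslogd 1.4.1: restart.",
    "Kernel command line: ro root=LABEL=/",
    "com.apple.xpc.launchd[1]: Service exited: code 1",
]

parsed = [anchored.fullmatch(s).groupdict() for s in samples]
for fields in parsed:
    print(fields)

# The greedy variant runs past the first ": " and stops at the last one.
runaway = greedy.fullmatch(samples[2]).groupdict()
print(runaway)
```

With the non-greedy pattern the component ends at the first delimiter, so a later ": " inside the content stays in the content. The greedy variant pulls the pid and part of the message into the component.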

Multiple rules: a faithful primary, then meaningful fallbacks

A LogParser tries its header rules in order. Use that to express real line classes, not to paper over failures:

  • The primary rule mirrors the documented format.
  • A secondary rule models another real class the data contains — and should still extract whatever structure it has.

Syslog streams interleave tag-less meta-lines, which lack the usual tag: content shape, for example last message repeated 2 times and exiting on signal 15. They are not malformed; they are a real syslog message class. Model them with a rule that keeps the timestamp/host envelope and takes the remainder as the message, rather than dropping the whole line to a blanket catch-all. (A blanket [Statement()] rule that discards the timestamp is a last resort, for genuinely header-less lines such as multi-line continuation output.)
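The ordering idea can be sketched with plain regular expressions: try the most specific rule first, and let each fallback keep as much structure as the line really has. This is a standard-library illustration, not log2seq code; the rule names, the ENVELOPE pattern, and the sample lines are assumptions.

```python
import re

# Shared timestamp/host envelope: "Jan  1 12:34:56 host "
ENVELOPE = (r"(?P<month>[A-Z][a-z]{2}) +(?P<day>\d{1,2}) "
            r"(?P<time>\d{2}:\d{2}:\d{2}) (?P<host>\S+) ")

RULES = [
    # Primary: the documented "component[pid]: message" shape.
    ("tagged", re.compile(ENVELOPE + r"(?P<component>.+?)(?:\[(?P<pid>\d+)\])?: (?P<message>.*)")),
    # Secondary: tag-less meta-lines, still keeping the envelope.
    ("meta", re.compile(ENVELOPE + r"(?P<message>.*)")),
    # Last resort: genuinely header-less lines.
    ("bare", re.compile(r"(?P<message>.*)")),
]


def parse(line):
    """Return (rule_name, fields) for the first rule that matches the whole line."""
    for name, regex in RULES:
        m = regex.fullmatch(line)
        if m:
            return name, m.groupdict()
    return None


tagged = parse("Jan  1 12:34:56 host a[1]: ok one")
meta = parse("Jan  1 12:34:57 host last message repeated 2 times")
bare = parse("    continuation of a multi-line message")
print(tagged, meta, bare, sep="\n")
```

The meta rule loses only the fields the line does not have, while the bare rule gives up the timestamp, which is why it sits last.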

Full-data robustness

The in-repo <Name>_2k.log is only a 2,000-line loghub sample; a real parser must process the full dataset (often orders of magnitude larger) without failures. Verify on two axes:

  1. Correctness — on the labelled sample, the parsed message equals the ground-truth content.
  2. Coverage — over the full data, zero LogParseFailure/exceptions. Stream it rather than loading it; watch for non-UTF-8 bytes (decode with errors="replace").

A pattern that looks fine on 2,000 lines can still fail or mis-split on the long tail, so both axes matter.
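A coverage pass over the full data can be sketched as a streaming loop that decodes defensively and counts failures. The coverage function and toy_parse below are hypothetical helpers written for this example; in practice you would pass your own parser's line-parsing callable in place of toy_parse.

```python
import io


def coverage(stream, parse_line, max_examples=5):
    """Parse a binary stream line by line; return (ok, failed, first failures)."""
    ok = failed = 0
    examples = []
    for raw in stream:  # a binary file object yields one bytes line at a time
        line = raw.decode("utf-8", errors="replace").rstrip("\r\n")
        try:
            parse_line(line)
            ok += 1
        except Exception as exc:  # count any failure, not just one exception type
            failed += 1
            if len(examples) < max_examples:
                examples.append((line, repr(exc)))
    return ok, failed, examples


def toy_parse(line):
    """Stand-in parser: requires a 'tag: content' shape."""
    if ": " not in line:
        raise ValueError("header format mismatch")
    return line.split(": ", 1)


# In-memory sample with one bad line and one non-UTF-8 byte (0xE9).
data = io.BytesIO(b"a[1]: ok one\nGARBAGE no header\nb[2]: caf\xe9 two\n")
ok, failed, examples = coverage(data, toy_parse)
print(ok, failed, examples)
```

For a real dataset, open the file in binary mode (open(path, "rb")) and pass the file object as stream, so the data is never loaded into memory at once.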

Debugging a parser with the CLI

Point the CLI at a sample with --parser. Successful results go to stdout (pipeable); parse failures and a final summary go to stderr:

$ printf 'Jan  1 12:34:56 host a[1]: ok one\nGARBAGE no header\nFeb  2 01:02:03 host b[2]: ok two\n' \
    | python -m log2seq -t words
a 1 ok one              # stdout
b 2 ok two              # stdout
parse failed: 'GARBAGE no header': header format mismatch: GARBAGE no header   # stderr
# processed 3 lines: 2 ok, 1 failed                                            # stderr

  • --failures-only suppresses the stdout results, leaving just the failures and the summary. It is the quickest way to see what a parser still misses.
  • --max-failures N caps the per-failure lines (default 5; 0 for all).
  • Exit status is 0 when at least one line parses, 1 when none do (often a sign the parser doesn't fit the data), and 2 on a startup error (e.g. an unloadable parser script).

So python -m log2seq -p myparser.py sample.log --failures-only iterates a parser against real data: the failures and the M ok, K failed count tell you exactly where it stands.
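If you want to run that loop from a script or a CI job, a thin wrapper around the command can translate the exit status into a message. This is a sketch under assumptions: build_command, describe, and run are hypothetical helper names, and the wrapper relies only on the flags and exit statuses described above.

```python
import subprocess
import sys

# Exit statuses as described on this page.
MEANING = {
    0: "at least one line parsed",
    1: "no line parsed (the parser may not fit the data)",
    2: "startup error (e.g. an unloadable parser script)",
}


def build_command(parser_script, sample, max_failures=5):
    """Build the CLI invocation for a failures-only run."""
    return [sys.executable, "-m", "log2seq", "-p", parser_script, sample,
            "--failures-only", "--max-failures", str(max_failures)]


def describe(returncode):
    """Map an exit status to a human-readable meaning."""
    return MEANING.get(returncode, "unexpected exit status %d" % returncode)


def run(parser_script, sample):
    """Run the CLI; return (meaning, stderr text with failures and summary)."""
    proc = subprocess.run(build_command(parser_script, sample),
                          capture_output=True, text=True)
    return describe(proc.returncode), proc.stderr


print(build_command("myparser.py", "sample.log")[1:])
print(describe(1))
```

Note that exit status 0 only means something parsed, so a strict check should also read the failed count from the stderr summary.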
