Practical Patterns

sat edited this page Jun 26, 2026 · 2 revisions

Writing a parser for a real log format mostly comes down to a few recurring decisions. This page collects them and shows how to debug a parser against sample data with the CLI. The repository's example/loghub_*/parser.py files are full worked examples of everything here.

separator vs full_format

  • Use separator when fields are just whitespace/punctuation-delimited and you don't care which separator appears where. It's the simplest option.
  • Use full_format when the format has fixed delimiters that must stay literal — brackets around a time/level, a - between program and content, the : that ends a syslog tag. If you put those characters in a separator set instead, every occurrence (including ones inside the content) is eaten as a separator.

A concrete trap: Apache's [client <ip>] is part of the content. With separator=" []" the brackets are consumed and the leading [ is lost; pinning \[<Time>\] \[<Level>\] <Content> with full_format keeps it.
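The trap can be reproduced with plain regular expressions. The sketch below is only an emulation using Python's `re` module, not log2seq's implementation: the sample line, the pattern names `LOOSE` and `PINNED`, and the time pattern are all illustrative assumptions.

```python
import re

# Illustrative Apache-style error line (made up for this sketch).
line = ("[Sun Dec 04 04:47:44 2005] [error] "
        "[client 10.0.0.1] Directory index forbidden by rule: /var/www/html/")

TIME = r"\w{3} \w{3} \d{2} [\d:]{8} \d{4}"

# Emulates separator=" []": any run of ' ', '[' or ']' between fields
# is treated as a delimiter and thrown away.
LOOSE = re.compile(
    r"^[ \[\]]*(?P<time>" + TIME + r")[ \[\]]+(?P<level>\w+)[ \[\]]+(?P<content>.*)$"
)

# Emulates full_format: the brackets are literal parts of the header,
# so nothing inside the content is treated as a delimiter.
PINNED = re.compile(
    r"^\[(?P<time>" + TIME + r")\] \[(?P<level>\w+)\] (?P<content>.*)$"
)

loose_content = LOOSE.match(line)["content"]
pinned_content = PINNED.match(line)["content"]

print(loose_content)   # the '[' before 'client' has been eaten
print(pinned_content)  # the content is intact
```

Both patterns extract the same time and level; they differ only in where the content starts, which is exactly the part a 2,000-line spot check tends to miss.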

Free-form fields: anchor, don't widen blindly

Some fields have no tidy character set — a syslog component can be Kernel command line, syslogd 1.4.1, /sbin/mingetty, com.apple.xpc.launchd. The right model is a non-greedy .+? anchored by an explicit delimiter in full_format:

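# Format string as written in loghub; its <Level> slot holds the hostname:
# <Month> <Date> <Time> <Level> <Component>(\[<PID>\])?: <Content>
# Import path assumed from the log2seq README; adjust it to your install.
from log2seq.header import *

header_rule = [MonthAbbreviation(), Digit("day"), Time(), Hostname("host"),
               UserItem("component", r".+?"), Digit("processid", optional=True),
               Statement()]
# <N> refers to the N-th item of header_rule (0-based).
HeaderParser(header_rule, full_format=r"<0> <1> <2> <3> <4>(\[<5>\])?: <6>")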

The : is what stops .+? from running away. This is the difference that matters: .+? with a delimiter anchor models a real free-form field; a bare permissive pattern or a catch-all that swallows the whole line hides structure and mis-splits future logs. Reach for .+? only when (a) the field has no fixed character set and (b) a literal delimiter pins its end.
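To see why the anchor matters, here is a minimal sketch in plain `re` (again an emulation, not log2seq itself) comparing a non-greedy and a greedy component pattern against the same literal `: ` delimiter. The sample lines and the names `anchored`, `greedy`, and `fields` are assumptions made for illustration.

```python
import re

HEADER = (r"^(?P<month>[A-Z][a-z]{2}) +(?P<day>\d{1,2}) "
          r"(?P<time>\d{2}:\d{2}:\d{2}) (?P<host>\S+) ")
# Optional [pid], then the literal ': ' that pins the end of the component.
TAIL = r"(?:\[(?P<pid>\d+)\])?: (?P<content>.*)$"

anchored = re.compile(HEADER + r"(?P<component>.+?)" + TAIL)  # non-greedy
greedy = re.compile(HEADER + r"(?P<component>.+)" + TAIL)     # runs away

samples = [
    "Jun  9 06:06:20 combo syslogd 1.4.1: restart.",
    "Jun 14 15:16:01 combo su(pam_unix)[1234]: session opened: for user news",
]

def fields(pattern, line):
    """Return (component, pid, content), or None if the line does not match."""
    m = pattern.match(line)
    return (m["component"], m["pid"], m["content"]) if m else None

for line in samples:
    print(fields(anchored, line))
    print(fields(greedy, line))
```

On the second sample the greedy variant stops at the last `: ` rather than the first, so the component swallows the PID and part of the message. The non-greedy variant stops at the first delimiter and still handles a component containing spaces and dots.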

Multiple rules: a faithful primary, then meaningful fallbacks

A LogParser tries its header rules in order. Use that to express real line classes, not to paper over failures:

  • The primary rule mirrors the documented format.
  • A secondary rule models another real class the data contains — and should still extract whatever structure it has.

Syslog streams interleave meta-lines that lack the tag: content structure, such as last message repeated 2 times or exiting on signal 15. They are not malformed; they are a real syslog message class. Model them with a rule that keeps the timestamp/host envelope and takes the remainder as the message, rather than dropping the whole line to a blanket catch-all. (A blanket [Statement()] rule that discards the timestamp is a last resort, for genuinely header-less lines such as multi-line continuation output.)
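The ordering idea can be sketched with an ordered list of regular expressions. This is a plain-`re` illustration of "first matching rule wins", not the LogParser implementation; the rule names (`tagged`, `meta`, `bare`), the `classify` helper, and the sample lines are hypothetical.

```python
import re

# Timestamp/host envelope shared by the two syslog rules.
ENVELOPE = (r"^(?P<month>[A-Z][a-z]{2}) +(?P<day>\d{1,2}) "
            r"(?P<time>\d{2}:\d{2}:\d{2}) (?P<host>\S+) ")

RULES = [
    # Primary: the documented "component[pid]: message" format.
    ("tagged", re.compile(
        ENVELOPE + r"(?P<component>.+?)(?:\[(?P<pid>\d+)\])?: (?P<message>.*)$")),
    # Secondary: tag-less meta-lines; the envelope is still extracted.
    ("meta", re.compile(ENVELOPE + r"(?P<message>.*)$")),
    # Last resort: header-less lines, nothing but the message survives.
    ("bare", re.compile(r"^(?P<message>.*)$")),
]

def classify(line):
    """Try each rule in order and return (rule_name, fields) for the first match."""
    for name, pattern in RULES:
        m = pattern.match(line)
        if m:
            return name, m.groupdict()
    raise ValueError(f"no rule matched: {line!r}")

print(classify("Jun 14 15:16:01 combo sshd[19939]: Accepted password"))
print(classify("Jun 14 15:16:02 combo last message repeated 2 times"))
print(classify("\tcontinuation of a multi-line message"))
```

Tracking how many lines land in each rule is a cheap health check: a growing share of `bare` matches usually means a real line class is still unmodelled.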

Full-data robustness

The in-repo <Name>_2k.log is only a 2,000-line loghub sample; a real parser must process the full dataset (often orders of magnitude larger) without failures. Verify on two axes:

  1. Correctness — on the labelled sample, the parsed message equals the ground-truth content.
  2. Coverage — over the full data, zero LogParseFailure/exceptions. Stream it rather than loading it; watch for non-UTF-8 bytes (decode with errors="replace").

A pattern that looks fine on 2,000 lines can still fail or mis-split on the long tail, so both axes matter.
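The coverage axis might be checked with a small streaming loop like the one below. It is a sketch under stated assumptions: `coverage`, `iter_lines`, and `toy_parse` are hypothetical names, and `toy_parse` with `ValueError` stands in for a real parser call and its failure exception.

```python
import io

def iter_lines(stream):
    """Yield decoded lines one at a time from a binary stream."""
    for raw in stream:
        # errors="replace" maps undecodable bytes to U+FFFD instead of raising.
        yield raw.decode("utf-8", errors="replace").rstrip("\r\n")

def coverage(stream, parse, failure_types=(ValueError,), keep=5):
    """Parse every line; return (total, failed, first `keep` failing lines)."""
    total = failed = 0
    samples = []
    for line in iter_lines(stream):
        total += 1
        try:
            parse(line)
        except failure_types as exc:
            failed += 1
            if len(samples) < keep:
                samples.append((line, str(exc)))
    return total, failed, samples

def toy_parse(line):
    """Stand-in parser (assumption): accept lines starting with 'Jan '."""
    if not line.startswith("Jan "):
        raise ValueError("header format mismatch")
    return line

# b"\xe9" on its own is not valid UTF-8; the loop must survive it.
data = io.BytesIO(
    b"Jan  1 12:34:56 host a[1]: ok\n"
    b"Jan  1 12:34:57 host a[1]: caf\xe9\n"
    b"GARBAGE no header\n"
)
total, failed, samples = coverage(data, toy_parse)
print(total, failed, samples)
```

For a real run, open the full log with `open(path, "rb")` and pass your parser's per-line call together with the failure exception it raises (LogParseFailure, per the text above). Memory use stays flat because only one line is held at a time.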

Debugging a parser with the CLI

Point the CLI at a sample with --parser. Successful results go to stdout (pipeable); parse failures and a final summary go to stderr:

$ printf 'Jan  1 12:34:56 host a[1]: ok one\nGARBAGE no header\nFeb  2 01:02:03 host b[2]: ok two\n' \
    | python -m log2seq -t words
a 1 ok one              # stdout
b 2 ok two              # stdout
parse failed: 'GARBAGE no header': header format mismatch: GARBAGE no header   # stderr
# processed 3 lines: 2 ok, 1 failed                                            # stderr

  • --failures-only suppresses the stdout results, leaving just the failures and the summary — the quickest way to see what a parser still misses.
  • --max-failures N caps the per-failure lines (default 5; 0 for all).
  • Exit status is 0 when at least one line parses, 1 when none do (often a sign the parser doesn't fit the data), and 2 on a startup error (e.g. an unloadable parser script).

So python -m log2seq -p myparser.py sample.log --failures-only is the loop for iterating on a parser against real data: the listed failures and the M ok, K failed count tell you exactly where it stands.
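When tracking progress across iterations, the summary line can be read back from captured stderr. The sketch below assumes the summary has exactly the shape shown in the example output above (`# processed N lines: M ok, K failed`); `parse_summary` is a hypothetical helper, not part of the CLI.

```python
import re

# Assumed summary shape, copied from the example output on this page.
SUMMARY_RE = re.compile(
    r"^# processed (?P<total>\d+) lines: (?P<ok>\d+) ok, (?P<failed>\d+) failed\s*$",
    re.MULTILINE,
)

def parse_summary(stderr_text):
    """Return {'total', 'ok', 'failed'} from the last summary line, or None."""
    matches = list(SUMMARY_RE.finditer(stderr_text))
    if not matches:
        return None
    return {key: int(value) for key, value in matches[-1].groupdict().items()}

captured = (
    "parse failed: 'GARBAGE no header': header format mismatch: GARBAGE no header\n"
    "# processed 3 lines: 2 ok, 1 failed\n"
)
print(parse_summary(captured))
```

Redirecting stderr to a file (`2> failures.txt`) and feeding its contents to a helper like this gives one number per iteration to compare.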
