Practical Patterns

Writing a parser for a real log format mostly comes down to a few recurring decisions. This page collects them and shows how to debug a parser against sample data with the CLI. The example/loghub_*/parser.py files in the repository are full worked examples of everything here.

`separator` vs `full_format`

Use separator when fields are just whitespace/punctuation-delimited and you don't care which separator appears where. It's the simplest option.
Use full_format when the format has fixed delimiters that must stay literal — brackets around a time/level, a - between program and content, the : that ends a syslog tag. If you put those characters in a separator set instead, every occurrence (including ones inside the content) is eaten as a separator.

A concrete trap: Apache's [client <ip>] is part of the content. With separator=" []" the brackets are consumed and the leading [ is lost; pinning \[<Time>\] \[<Level>\] <Content> with full_format keeps it.
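The trap is easy to reproduce with plain Python `re`, without log2seq at all. The following is a minimal sketch: the sample line and both patterns are illustrative assumptions that mimic the two modes, not the library's internals.

```python
import re

# Illustrative Apache error-log line (the sample text is an assumption).
line = ("[Sun Dec 04 04:47:44 2005] [error] [client 1.2.3.4] "
        "Directory index forbidden by rule: /var/www/html/")

# Separator-style: every space and bracket counts as a delimiter,
# including the brackets that belong to the content.
tokens = [t for t in re.split(r"[ \[\]]+", line) if t]

# full_format-style: the brackets around time and level are literal,
# and everything after them is kept verbatim as content.
pinned = re.compile(r"^\[(?P<time>.+?)\] \[(?P<level>\w+)\] (?P<content>.*)$")
fields = pinned.match(line).groupdict()

print(tokens[6:8])        # the brackets around "client 1.2.3.4" are gone
print(fields["content"])  # still begins with "[client 1.2.3.4]"
```

Only the pinned pattern can tell the structural brackets from the ones that are part of the message.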

Free-form fields: anchor, don't widen blindly

Some fields have no tidy character set — a syslog component can be Kernel command line, syslogd 1.4.1, /sbin/mingetty, com.apple.xpc.launchd. The right model is a non-greedy .+? anchored by an explicit delimiter in full_format:

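# <Month> <Date> <Time> <Level> <Component>(\[<PID>\])?: <Content>
# The 4th field, named <Level> in this format string, is modelled as the hostname.
from log2seq.header import (HeaderParser, MonthAbbreviation, Digit, Time,
                            Hostname, UserItem, Statement)  # import path as in upstream log2seq

header_rule = [MonthAbbreviation(), Digit("day"), Time(), Hostname("host"),
               UserItem("component", r".+?"), Digit("processid", optional=True),
               Statement()]
header_parser = HeaderParser(header_rule, full_format=r"<0> <1> <2> <3> <4>(\[<5>\])?: <6>")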

The : is what stops .+? from running away. This is the difference that matters: .+? with a delimiter anchor models a real free-form field; a bare permissive pattern or a catch-all that swallows the whole line hides structure and mis-splits future logs. Reach for .+? only when (a) the field has no fixed character set and (b) a literal delimiter pins its end.
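The effect of the anchor can be checked with plain `re`. This sketch assumes hand-written regexes that approximate the header rule above (they are not what log2seq generates) and contrasts the non-greedy component with a greedy one.

```python
import re

# Hand-written approximation of the syslog envelope (an assumption for illustration).
HEAD = (r"^(?P<month>[A-Z][a-z]{2}) +(?P<day>\d{1,2}) "
        r"(?P<time>\d{2}:\d{2}:\d{2}) (?P<host>\S+) ")
TAIL = r"(?:\[(?P<pid>\d+)\])?: (?P<content>.*)$"

anchored = re.compile(HEAD + r"(?P<component>.+?)" + TAIL)  # non-greedy
greedy = re.compile(HEAD + r"(?P<component>.+)" + TAIL)     # greedy, for contrast

samples = [
    "Jun 14 15:16:01 combo sshd(pam_unix)[19939]: authentication failure; logname= uid=0",
    "Jun  9 06:06:20 combo syslogd 1.4.1: restart.",
    "Jun  9 06:06:20 combo kernel: Kernel command line: ro root=LABEL=/ rhgb quiet",
]

# The non-greedy component stops at the first ": " (or "[pid]: ").
components = [anchored.match(s).group("component") for s in samples]
pids = [anchored.match(s).group("pid") for s in samples]

# The greedy variant runs on to the last ": " and swallows part of the content.
runaway = greedy.match(samples[2]).group("component")

print(components)
print(runaway)
```

The third sample is the interesting one: its content contains a second `: `, so only the non-greedy form splits at the real end of the tag.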

Multiple rules: a faithful primary, then meaningful fallbacks

A LogParser tries its header rules in order. Use that to express real line classes, not to paper over failures:

The primary rule mirrors the documented format.
A secondary rule models another real class the data contains — and should still extract whatever structure it has.

Syslog streams interleave meta-lines that have no tag: prefix before the content, such as last message repeated 2 times or exiting on signal 15. They are not malformed; they are a real syslog message class. Model them with a rule that keeps the timestamp/host envelope and takes the remainder as the message, rather than dropping the whole line to a blanket catch-all. (A blanket [Statement()] rule that discards the timestamp is a last resort, for genuinely header-less lines such as multi-line continuation output.)
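The ordering idea can be sketched with plain `re` as a list of patterns tried first to last. The rule names (`tagged`, `meta`, `bare`) and the regexes are hypothetical stand-ins for header rules, chosen only to show that the fallback still keeps the envelope.

```python
import re

# Shared timestamp/host envelope (hand-written approximation, an assumption).
ENVELOPE = (r"^(?P<month>[A-Z][a-z]{2}) +(?P<day>\d{1,2}) "
            r"(?P<time>\d{2}:\d{2}:\d{2}) (?P<host>\S+) ")

RULES = [
    # Primary: the documented format with a "component[pid]: " tag.
    ("tagged", re.compile(ENVELOPE + r"(?P<component>.+?)(?:\[(?P<pid>\d+)\])?: (?P<content>.*)$")),
    # Secondary: tag-less meta-lines; the envelope is still extracted.
    ("meta", re.compile(ENVELOPE + r"(?P<content>.*)$")),
    # Last resort: header-less lines such as continuation output.
    ("bare", re.compile(r"^(?P<content>.*)$")),
]

def classify(line):
    """Return (rule_name, fields) from the first rule that matches."""
    for name, pattern in RULES:
        m = pattern.match(line)
        if m:
            return name, m.groupdict()
    raise ValueError(f"no rule matched: {line!r}")

print(classify("Jun 15 04:06:20 combo last message repeated 2 times"))
```

Counting how many lines land in each rule is a cheap sanity check: a large share falling through to `bare` usually means a real line class has not been modelled yet.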

Full-data robustness

The in-repo <Name>_2k.log is only a 2,000-line loghub sample; a real parser must process the full dataset (often orders of magnitude larger) without failures. Verify on two axes:

Correctness — on the labelled sample, the parsed message equals the ground-truth content.
Coverage — over the full data, zero LogParseFailure/exceptions. Stream it rather than loading it; watch for non-UTF-8 bytes (decode with errors="replace").

A pattern that looks fine on 2,000 lines can still fail or mis-split on the long tail, so both axes matter.
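The coverage check can be sketched as a small streaming loop. Here `coverage` and `demo_parse` are hypothetical helpers, and `demo_parse` is a stand-in for whatever callable wraps your real parser and raises on failure; the demo file is generated on the fly and deliberately contains a byte that is not valid UTF-8.

```python
import os
import tempfile

def coverage(path, parse_line, max_examples=5):
    """Stream `path` line by line and count parse successes and failures."""
    ok = failed = 0
    examples = []
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    with open(path, encoding="utf-8", errors="replace") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            try:
                parse_line(line)
                ok += 1
            except Exception as exc:
                failed += 1
                if len(examples) < max_examples:
                    examples.append((lineno, line, str(exc)))
    return ok, failed, examples

def demo_parse(line):
    # Hypothetical stand-in for a real parser: requires a "tag: content" shape.
    if ": " not in line:
        raise ValueError("header format mismatch")
    return line.split(": ", 1)

# Demo input with an invalid UTF-8 byte (0xff) on the third line.
with tempfile.NamedTemporaryFile("wb", suffix=".log", delete=False) as tmp:
    tmp.write(b"sshd: ok one\nGARBAGE no header\ncron: caf\xff two\n")

ok, failed, examples = coverage(tmp.name, demo_parse)
os.unlink(tmp.name)
print(f"{ok} ok, {failed} failed")
```

Because the file is iterated rather than read whole, memory use stays flat regardless of the size of the dataset, and only the first few failures are retained for inspection.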

Debugging a parser with the CLI

Point the CLI at a sample with --parser. Successful results go to stdout (pipeable); parse failures and a final summary go to stderr:

$ printf 'Jan  1 12:34:56 host a[1]: ok one\nGARBAGE no header\nFeb  2 01:02:03 host b[2]: ok two\n' \
    | python -m log2seq -t words
a 1 ok one              # stdout
b 2 ok two              # stdout
parse failed: 'GARBAGE no header': header format mismatch: GARBAGE no header   # stderr
# processed 3 lines: 2 ok, 1 failed                                            # stderr

--failures-only suppresses the stdout results, leaving just the failures and the summary — the quickest way to see what a parser still misses.
--max-failures N caps the per-failure lines (default 5; 0 for all).
Exit status is 0 when at least one line parses, 1 when none do (often a sign the parser doesn't fit the data), and 2 on a startup error (e.g. an unloadable parser script).

So python -m log2seq -p myparser.py sample.log --failures-only is the loop for iterating on a parser against real data: the listed failures and the M ok, K failed count show exactly where it stands.
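When the CLI runs inside a script or CI job, the summary line can be read back from captured stderr. This is a minimal sketch that assumes the summary always has the shape shown in the transcript above (`# processed N lines: M ok, K failed`); `parse_summary` is a hypothetical helper, not part of log2seq.

```python
import re

# Pattern inferred from the sample transcript; the exact wording is an assumption.
SUMMARY = re.compile(r"^# processed (\d+) lines?: (\d+) ok, (\d+) failed\s*$",
                     re.MULTILINE)

def parse_summary(stderr_text):
    """Return (total, ok, failed) from the last summary line, or None if absent."""
    matches = SUMMARY.findall(stderr_text)
    if not matches:
        return None
    total, ok, failed = (int(n) for n in matches[-1])
    return total, ok, failed

stderr_text = (
    "parse failed: 'GARBAGE no header': header format mismatch: GARBAGE no header\n"
    "# processed 3 lines: 2 ok, 1 failed\n"
)
print(parse_summary(stderr_text))
```

Tracking the failed count between parser revisions gives a simple regression signal alongside the exit status.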
