Practical Patterns

Writing a parser for a real log format mostly comes down to a few recurring decisions. This page collects them and shows how to debug a parser against sample data with the CLI. The parser.py files under the repository's example/loghub_*/ directories are full worked examples of everything here.

Designing a parser from your logs

A workflow from raw lines to a working parser. Take one representative line:

2024-08-09 11:22:33 web01 nginx[4521]: request from 10.0.0.5:443 at 09:15:00 done

1. Read a sample first. Look at a few dozen real lines and separate the header envelope (here: an ISO date and time, a host, a program[pid] tag) from the free body (request from …). Note the fixed delimiters ([, ], : ) and whether the data has more than one line shape (tag-less meta lines, continuation lines — see Multiple rules below).

2. Design the header. List the header fields left to right, pick an Item for each, and pin the fixed delimiters with full_format (see separator vs full_format). Iterate process_line on samples until message is exactly the body you want:

from log2seq.header import HeaderParser, Date, Time, Hostname, UserItem, Digit, Statement
hp = HeaderParser([Date(), Time(), Hostname("host"), UserItem("program", r".+?"),
                   Digit("pid"), Statement()],
                  full_format=r"<0> <1> <2> <3>\[<4>\]: <5>")
hp.process_line("2024-08-09 11:22:33 web01 nginx[4521]: request from 10.0.0.5:443 at 09:15:00 done")
# {'host': 'web01', 'program': 'nginx', 'pid': 4521,
#  'message': 'request from 10.0.0.5:443 at 09:15:00 done',
#  'timestamp': datetime.datetime(2024, 8, 9, 11, 22, 33)}

For several line shapes, write one rule per shape, most specific first (see Multiple rules).

3. Design the statement — iterate with verbose=True. Decide what must stay whole and what to split on, build the action list, and watch it run. A naive "split on spaces, then on colons" tears the clock time apart:

from log2seq.statement import StatementParser, Split, Fix
from log2seq.preset import pattern_time
body = "request from 10.0.0.5:443 at 09:15:00 done"
StatementParser([Split(" "), Split(":")]).process_line(body, verbose=True)

Split: 'request', 'from', '10.0.0.5:443', 'at', '09:15:00', 'done'
Split: 'request', 'from', '10.0.0.5', '443', 'at', '09', '15', '00', 'done'

The trace ('…' is UNKNOWN) shows 09:15:00 shattered into 09 15 00. It also tells you the fix: protect the time before the : split.

StatementParser([Split(" "), Fix(pattern_time), Split(":")]).process_line(body, verbose=True)

Split: 'request', 'from', '10.0.0.5:443', 'at', '09:15:00', 'done'
Fix: 'request', 'from', '10.0.0.5:443', 'at', #09:15:00#, 'done'
Split: 'request', 'from', '10.0.0.5', '443', 'at', #09:15:00#, 'done'

Now #09:15:00# is FIXED and survives the final split. This protect-then-split loop — read the trace, insert or reorder one action, re-run — is how you design the statement stage. The action catalog and the (part, flag) model are in Statement Rules.
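
To make the (part, flag) idea concrete, here is a minimal plain-Python sketch of the protect-then-split loop. It is not log2seq's implementation: split_parts, fix_parts, and the simplified TIME regex are hypothetical stand-ins for Split, Fix, and pattern_time, and each part is modelled as a (text, fixed) tuple.

```python
import re

# Hypothetical, simplified stand-in for a clock-time pattern such as pattern_time.
TIME = re.compile(r"\d{2}:\d{2}:\d{2}")

def split_parts(parts, sep):
    """Split every non-fixed part on sep; fixed parts pass through whole."""
    out = []
    for text, fixed in parts:
        if fixed:
            out.append((text, True))
        else:
            out.extend((piece, False) for piece in text.split(sep) if piece)
    return out

def fix_parts(parts, pattern):
    """Mark parts that fully match pattern as fixed so later splits skip them."""
    return [(text, fixed or pattern.fullmatch(text) is not None)
            for text, fixed in parts]

body = "request from 10.0.0.5:443 at 09:15:00 done"

# Naive order: split on spaces, then on colons. The clock time is torn apart.
naive = split_parts(split_parts([(body, False)], " "), ":")

# Protect-then-split: fix the time before the colon split.
parts = split_parts([(body, False)], " ")
parts = fix_parts(parts, TIME)
parts = split_parts(parts, ":")

words = [text for text, _ in parts]
print([text for text, _ in naive])
print(words)
```

The only difference between the two runs is the fix_parts call placed before the colon split, which mirrors inserting Fix(pattern_time) between the two Split actions.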

4. Verify against real data. Bind the two stages into a LogParser, run the CLI over a sample with --failures-only and drive failures to zero; if you have ground truth, check the parsed message/words against it; then run the full data for coverage. See Debugging a parser with the CLI and Full-data robustness below.

The rest of this page expands each decision in that workflow.

`separator` vs `full_format`

Use separator when fields are just whitespace/punctuation-delimited and you don't care which separator appears where. It's the simplest option.
Use full_format when the format has fixed delimiters that must stay literal — brackets around a time/level, a - between program and content, the : that ends a syslog tag. If you put those characters in a separator set instead, every occurrence (including ones inside the content) is eaten as a separator.

A concrete trap: Apache's [client <ip>] is part of the content. With separator=" []" the brackets are consumed and the leading [ is lost; pinning \[<Time>\] \[<Level>\] <Content> with full_format keeps it.
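
The trap can be reproduced with the standard re module alone. The sketch below is not how log2seq implements either option; it only contrasts "every listed character is a delimiter" with "delimiters are literal anchors at fixed positions". The sample line is an illustrative Apache-style error line made up for this example.

```python
import re

# Illustrative Apache-style error line (made up for this sketch).
line = ("[Sun Dec 04 04:47:44 2005] [error] [client 1.2.3.4] "
        "Directory index forbidden by rule: /var/www/html/")

# Separator-set style: spaces and brackets are delimiters wherever they occur,
# so the brackets around "client 1.2.3.4" inside the content vanish too.
by_separator = [p for p in re.split(r"[ \[\]]+", line) if p]

# full_format style: only the brackets around time and level are pinned as
# literals; everything after them is content and is left untouched.
anchored = re.compile(r"\[(?P<time>.+?)\] \[(?P<level>\w+)\] (?P<content>.*)")
fields = anchored.fullmatch(line).groupdict()

print(by_separator[:8])
print(fields["content"])
```

With the separator set, nothing in the result tells you that client 1.2.3.4 was bracketed; with the anchored pattern, the content still starts with [client 1.2.3.4].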

Free-form fields: anchor, don't widen blindly

Some fields have no tidy character set — a syslog component can be Kernel command line, syslogd 1.4.1, /sbin/mingetty, com.apple.xpc.launchd. The right model is a non-greedy .+? anchored by an explicit delimiter in full_format:

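from log2seq.header import (HeaderParser, MonthAbbreviation, Digit, Time,
                            Hostname, UserItem, Statement)

# <Month> <Date> <Time> <Level> <Component>(\[<PID>\])?: <Content>
# Position <3> (labelled <Level> in the template above) is parsed as a hostname.
header_rule = [MonthAbbreviation(), Digit("day"), Time(), Hostname("host"),
               UserItem("component", r".+?"), Digit("processid", optional=True),
               Statement()]
HeaderParser(header_rule, full_format=r"<0> <1> <2> <3> <4>(\[<5>\])?: <6>")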

The : is what stops .+? from running away. This is the difference that matters: .+? with a delimiter anchor models a real free-form field; a bare permissive pattern or a catch-all that swallows the whole line hides structure and mis-splits future logs. Reach for .+? only when (a) the field has no fixed character set and (b) a literal delimiter pins its end.
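
The effect of the anchor is easy to see with a plain regular expression. This is a sketch using only the standard re module, not log2seq's matching code; the HEAD and TAIL fragments and the sample lines are assumptions made for illustration.

```python
import re

# Hypothetical envelope and tail fragments shared by both variants.
HEAD = r"(?P<month>\w{3}) +(?P<day>\d+) (?P<time>[\d:]+) (?P<host>\S+) "
TAIL = r"(\[(?P<pid>\d+)\])?: (?P<content>.*)"

# Non-greedy component: stops at the first point where the ": " anchor fits.
anchored = re.compile(HEAD + r"(?P<component>.+?)" + TAIL)
# Greedy component: runs on to the last ": " in the line.
greedy = re.compile(HEAD + r"(?P<component>.+)" + TAIL)

line = "Jan  1 12:34:56 host1 com.apple.xpc.launchd[1]: service exited: code 2"
good = anchored.fullmatch(line).groupdict()
bad = greedy.fullmatch(line).groupdict()

# A component containing spaces and dots, with no pid.
spaced = anchored.fullmatch("Jun  9 06:06:20 combo syslogd 1.4.1: restart.").groupdict()

print(good["component"], good["pid"], good["content"])
print(bad["component"], bad["pid"], bad["content"])
print(spaced["component"], spaced["content"])
```

The greedy variant silently moves part of the message into the component and loses the pid, which is the kind of mis-split that only shows up on lines whose content happens to contain the delimiter.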

Multiple rules: a faithful primary, then meaningful fallbacks

A LogParser tries its header rules in order. Use that to express real line classes, not to paper over failures:

The primary rule mirrors the documented format.
A secondary rule models another real class the data contains — and should still extract whatever structure it has.

Syslog streams interleave tag-less meta-lines, which have content but no tag: prefix, such as last message repeated 2 times or exiting on signal 15. They are not malformed; they are a real syslog message class. Model them with a rule that keeps the timestamp/host envelope and takes the remainder as the message, rather than dropping the whole line to a blanket catch-all. (A blanket [Statement()] rule that discards the timestamp is a last resort, for genuinely header-less lines such as multi-line continuation output.)
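
The ordering idea can be sketched with plain regular expressions. This is not LogParser's code: the rule names (tagged, meta, bare), the ENVELOPE fragment, and classify are hypothetical, and the point is only that each rule in the list models a distinct line class and the first match wins.

```python
import re

# Hypothetical timestamp/host envelope shared by the first two rules.
ENVELOPE = r"(?P<timestamp>\w{3} +\d+ [\d:]+) (?P<host>\S+) "

RULES = [
    # Primary: the documented "tag[pid]: message" shape.
    ("tagged", re.compile(ENVELOPE + r"(?P<tag>.+?)(\[(?P<pid>\d+)\])?: (?P<message>.*)")),
    # Secondary: tag-less meta-lines; the envelope is still extracted.
    ("meta", re.compile(ENVELOPE + r"(?P<message>.*)")),
    # Last resort: header-less lines such as continuation output.
    ("bare", re.compile(r"(?P<message>.*)")),
]

def classify(line):
    """Return (rule_name, fields) for the first rule that matches the line."""
    line = line.rstrip("\n")
    for name, pattern in RULES:
        match = pattern.fullmatch(line)
        if match:
            return name, match.groupdict()
    raise ValueError("no rule matched: %r" % line)

tagged = classify("Jan  1 12:34:56 host a[1]: ok one")
meta = classify("Jan  1 12:34:56 host last message repeated 2 times")
bare = classify("    at com.example.Foo(Foo.java:42)")
print(tagged[0], meta[0], bare[0])
```

Note that the meta rule still yields a timestamp and host, which is what separates a meaningful fallback from a blanket catch-all. In this simplified sketch a meta-line whose text contains ": " would be claimed by the first rule, so real rules need tighter item patterns than these.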

Full-data robustness

The in-repo <Name>_2k.log is only a 2,000-line loghub sample; a real parser must process the full dataset (often orders of magnitude larger) without failures. Verify on two axes:

Correctness — on the labelled sample, the parsed message equals the ground-truth content.
Coverage — over the full data, zero LogParseFailure/exceptions. Stream it rather than loading it; watch for non-UTF-8 bytes (decode with errors="replace").

A pattern that looks fine on 2,000 lines can still fail or mis-split on the long tail, so both axes matter.
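
A coverage pass over the full data might look like the following sketch. It streams the file line by line and decodes with errors="replace" as suggested above. The coverage helper and toy_parse are hypothetical: toy_parse merely stands in for a real parser's per-line call, and the broad except Exception would normally be narrowed to the parser's own failure exception.

```python
import os
import tempfile

def coverage(path, parse, max_samples=5):
    """Stream path line by line; return (ok, failed, first failing lines)."""
    ok = failed = 0
    samples = []
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.rstrip("\n")
            try:
                parse(line)
                ok += 1
            except Exception:  # assumption: narrow this for a real parser
                failed += 1
                if len(samples) < max_samples:
                    samples.append(line)
    return ok, failed, samples

def toy_parse(line):
    """Hypothetical stand-in parser: fails on lines without a ': ' delimiter."""
    head, sep, message = line.partition(": ")
    if not sep:
        raise ValueError("header format mismatch")
    return {"head": head, "message": message}

# Small demo file containing one unparsable line and one non-UTF-8 byte (0xE9).
with tempfile.NamedTemporaryFile("wb", suffix=".log", delete=False) as tmp:
    tmp.write(b"host a[1]: ok one\nGARBAGE no header\nhost b[2]: caf\xe9 two\n")
result = coverage(tmp.name, toy_parse)
os.unlink(tmp.name)
print(result)
```

Because the file is iterated rather than read whole, memory use stays flat regardless of dataset size, and the capped sample list keeps the report readable when many lines fail.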

Debugging a parser with the CLI

Point the CLI at a sample with --parser. Successful results go to stdout (pipeable); parse failures and a final summary go to stderr:

$ printf 'Jan  1 12:34:56 host a[1]: ok one\nGARBAGE no header\nFeb  2 01:02:03 host b[2]: ok two\n' \
    | python -m log2seq -t words
a 1 ok one              # stdout
b 2 ok two              # stdout
parse failed: 'GARBAGE no header': header format mismatch: GARBAGE no header   # stderr
# processed 3 lines: 2 ok, 1 failed                                            # stderr

--failures-only suppresses the stdout results, leaving just the failures and the summary — the quickest way to see what a parser still misses.
--max-failures N caps the per-failure lines (default 5; 0 for all).
Exit status is 0 when at least one line parses, 1 when none do (often a sign the parser doesn't fit the data), and 2 on a startup error (e.g. an unloadable parser script).

So python -m log2seq -p myparser.py sample.log --failures-only iterates a parser against real data: the failures and the M ok, K failed count tell you exactly where it stands.
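
When iterating in a script or CI job, the summary counts can be read back from the captured stderr text. The sketch below assumes the summary line has exactly the shape shown in the sample output above; read_summary and the SUMMARY pattern are hypothetical helpers, not part of log2seq. In practice stderr_text would come from running the CLI, for example via subprocess.run with capture_output=True and text=True.

```python
import re

# Assumption: the summary line looks like "# processed 3 lines: 2 ok, 1 failed".
SUMMARY = re.compile(
    r"processed (?P<total>\d+) lines: (?P<ok>\d+) ok, (?P<failed>\d+) failed"
)

def read_summary(stderr_text):
    """Return the summary counts as ints, or None if no summary line is found."""
    match = SUMMARY.search(stderr_text)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}

# Stderr text copied from the sample run shown earlier on this page.
stderr_text = (
    "parse failed: 'GARBAGE no header': header format mismatch: GARBAGE no header\n"
    "# processed 3 lines: 2 ok, 1 failed\n"
)
summary = read_summary(stderr_text)
print(summary)
```

A job could then fail whenever summary["failed"] is non-zero, which is stricter than the exit status alone, since the exit status is 0 as soon as one line parses.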
