
Example Parsers

sat edited this page Jun 26, 2026 · 1 revision

The repository ships a set of real parser scripts under example/, one per loghub open dataset (Apache, Linux, Mac, Thunderbird, Windows, BGL, HDFS, Hadoop, Spark, Zookeeper, OpenSSH, OpenStack, Proxifier, HealthApp, HPC, Android). They are the most complete worked examples of everything in Practical Patterns — read them after that page.

Running one

Each example/loghub_<Name>/ holds a parser.py and a 2,000-line sample <Name>_2k.log:

$ cd example/loghub_Linux
$ python -m log2seq -i -p parser.py Linux_2k.log

-i/--show-input echoes each source line next to its parse, and -p/--parser loads the script, which exposes its parser as a module-level variable named parser. See Practical Patterns for the other CLI flags.

Design principles

These scripts are meant to be exemplary, not merely passing. They follow a few rules worth copying:

  • Faithful to the dataset's log_format. Each parser mirrors the structure loghub itself documents for that dataset, rather than an ad-hoc guess.
  • Built for the full dataset. The in-repo <Name>_2k.log is only a sample; the parsers are written for the complete datasets (linked from loghub), which are far larger.
  • Free-form fields are anchored, not widened. A component that can contain spaces or slashes is matched by a non-greedy .+? pinned by a full_format delimiter, never by a loose catch-all (see Practical Patterns).
  • Every real line class gets a meaningful rule. Tag-less meta-lines and continuation lines are modelled with their own rule that still keeps whatever structure they have — not dropped to a whole-line catch-all.
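
The "anchored, not widened" rule can be illustrated outside log2seq with Python's standard re module. This is only a sketch: the two patterns and the sample strings are assumptions made up for the comparison, and they are not what log2seq builds internally. It contrasts a non-greedy component pinned by the "[pid]: " delimiter with a greedy catch-all on a message that itself contains ": ".

```python
import re

# Hypothetical patterns for the part of a syslog line that follows the host.
# ANCHORED: non-greedy component, pinned by the optional "[pid]" and ": ".
ANCHORED = re.compile(r"^(?P<component>.+?)(?:\[(?P<pid>\d+)\])?: (?P<message>.*)$")
# LOOSE: the same shape with a greedy component, for contrast.
LOOSE = re.compile(r"^(?P<component>.+)(?:\[(?P<pid>\d+)\])?: (?P<message>.*)$")


def split_tag(rest, pattern):
    """Return the named groups of `pattern` on `rest`, or None if no match."""
    match = pattern.match(rest)
    return match.groupdict() if match else None


# Made-up sample whose message itself contains ": ".
sample = "sshd(pam_unix)[19939]: session opened: user root"

anchored = split_tag(sample, ANCHORED)
loose = split_tag(sample, LOOSE)

# The anchored pattern stops at the first delimiter; the loose one runs on
# to the last ": " and swallows the pid and half the message.
print(anchored["component"], "|", anchored["message"])
print(loose["component"], "|", loose["message"])

# Spaces inside the component are fine because the delimiter does the pinning.
spaced = split_tag("syslogd 1.4.1: restart.", ANCHORED)
print(spaced["component"], "|", spaced["message"])
```

The greedy variant still matches, which is what makes it dangerous: nothing fails, the fields are just wrong.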

Walkthrough: loghub_Linux

The Linux parser is two header rules plus the default statement parser:

# Excerpt: the imports and the `defaults` value used below come from earlier in the script.
# Rule 1: the normal "<Component>[pid]: <Content>" syslog line.
# <Component> can be "syslogd 1.4.1", "/sbin/mingetty", "sshd(pam_unix)" — spaces
# and slashes — so it is a non-greedy .+? pinned by full_format's "[pid]: ".
header_rule1 = [MonthAbbreviation(), Digit("day"), Time(), Hostname("host"),
                UserItem("component", r".+?"), Digit("processid", optional=True),
                Statement()]
header_parser1 = HeaderParser(header_rule1,
                              full_format=r"<0> <1> <2> <3> <4>(\[<5>\])?: <6>",
                              defaults=defaults)

# Rule 2 — tag-less syslog meta-lines that have no "<Component>: " part, e.g.
#   "Sep 28 09:08:56 combo last message repeated 2 times"
# Keep the timestamp/host envelope and take the remainder as the message.
header_rule2 = [MonthAbbreviation(), Digit("day"), Time(), Hostname("host"),
                Statement()]
header_parser2 = HeaderParser(header_rule2, separator=" ", defaults=defaults)

parser = LogParser([header_parser1, header_parser2],
                   preset.default_statement_parser())

Each of the two line classes lands in the intended rule (the header parsers are tried in order, and the first match wins):

parser.process_line("Jun 14 15:16:01 combo sshd(pam_unix)[19939]: authentication failure; logname= uid=0")
# component='sshd(pam_unix)', host='combo', message='authentication failure; logname= uid=0'   (rule 1)

parser.process_line("Sep 28 09:08:56 combo last message repeated 2 times")
# host='combo', message='last message repeated 2 times'   (rule 2; no component key)

The meta-line keeps its timestamp and host and carries a real message, instead of being thrown away — exactly the faithful primary + meaningful fallback pattern.
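
The first-match-wins dispatch behind this pattern can be sketched with plain regular expressions. The code below does not use log2seq; ENVELOPE, RULES and parse_line are hypothetical names, and the regexes only approximate the two header rules above. It shows a specific primary rule tried first, with a structured fallback behind it.

```python
import re

# Shared "<Month> <day> <time> <host> " envelope. These regexes are
# illustrative stand-ins, not what log2seq generates internally.
ENVELOPE = (r"^(?P<month>[A-Z][a-z]{2}) +(?P<day>\d{1,2}) "
            r"(?P<time>\d{2}:\d{2}:\d{2}) (?P<host>\S+) ")

RULES = [
    # Rule 1: "<component>[pid]: <message>", with the pid part optional.
    re.compile(ENVELOPE + r"(?P<component>.+?)(?:\[(?P<processid>\d+)\])?: (?P<message>.*)$"),
    # Rule 2: tag-less fallback that keeps the envelope and takes the rest.
    re.compile(ENVELOPE + r"(?P<message>.*)$"),
]


def parse_line(line, rules=RULES):
    """Try each rule in order; return (rule_number, fields) or None."""
    for number, rule in enumerate(rules, start=1):
        match = rule.match(line)
        if match:
            # Drop groups that did not take part in the match (e.g. no pid).
            fields = {k: v for k, v in match.groupdict().items() if v is not None}
            return number, fields
    return None


TAGGED_LINE = "Jun 14 15:16:01 combo sshd(pam_unix)[19939]: authentication failure; logname= uid=0"
META_LINE = "Sep 28 09:08:56 combo last message repeated 2 times"

tagged = parse_line(TAGGED_LINE)
meta = parse_line(META_LINE)

print(tagged[0], tagged[1]["component"])   # 1 sshd(pam_unix)
print(meta[0], "component" in meta[1])     # 2 False
```

Order matters in this sketch: the fallback pattern also matches tagged lines, so listing it first would send every line to it and no component would ever be extracted.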

Verifying an example parser

The examples are checked on two axes:

  1. Correctness: on the labelled 2k sample, the parsed message equals loghub's ground-truth <Content> (logparser/logs/<Name>/<Name>_2k.log_structured.csv).

  2. Coverage: over the full dataset, zero parse failures. The CLI's summary line is the quick check (shown here on the 2k sample; run the same command on the full log for the real test):

    $ python -m log2seq -p parser.py Linux_2k.log
    # processed 2000 lines: 2000 ok, 0 failed

A parser that is right on 2,000 lines can still fail on the long tail, so the full-data run matters — see Full-data robustness in Practical Patterns.
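
Both checks can also be scripted. The sketch below is one possible shape, not the project's own test harness: toy_parse_message is a hypothetical stand-in for a real parser (swap in a call that returns the parsed message, or None on failure), the in-memory sample data is made up, and the "Content" column name is assumed from the <Content> field mentioned above.

```python
import csv
import io


def toy_parse_message(line):
    """Hypothetical stand-in for a real parser: return the message or None."""
    parts = line.split(None, 4)  # month, day, time, host, remainder
    if len(parts) < 5:
        return None  # treated as a parse failure
    _tag, sep, message = parts[4].partition(": ")
    return message if sep else parts[4]


def find_mismatches(log_lines, structured_csv, parse_message):
    """Correctness: compare each parsed message with the CSV "Content" column."""
    mismatches = []
    rows = csv.DictReader(structured_csv)
    # zip() pairs line N of the log with row N of the CSV and stops at the
    # shorter input, so both are assumed to have the same length and order.
    for lineno, (raw, row) in enumerate(zip(log_lines, rows), start=1):
        actual = parse_message(raw.rstrip("\n"))
        if actual != row["Content"]:
            mismatches.append((lineno, row["Content"], actual))
    return mismatches


def count_coverage(log_lines, parse_message):
    """Coverage: return (ok, failed) counts over an iterable of lines."""
    ok = failed = 0
    for raw in log_lines:
        if parse_message(raw.rstrip("\n")) is None:
            failed += 1
        else:
            ok += 1
    return ok, failed


# Made-up two-line sample standing in for <Name>_2k.log and its labelled CSV.
sample_log = [
    "Jun 14 15:16:01 combo sshd(pam_unix)[19939]: authentication failure; logname= uid=0\n",
    "Sep 28 09:08:56 combo last message repeated 2 times\n",
]
sample_csv = io.StringIO(
    "LineId,Content\n"
    "1,authentication failure; logname= uid=0\n"
    "2,last message repeated 2 times\n"
)

mismatches = find_mismatches(sample_log, sample_csv, toy_parse_message)
coverage = count_coverage(sample_log + ["garbage"], toy_parse_message)
print(mismatches, coverage)
```

With real data, open file objects can be passed in place of the list and the StringIO. Both functions read line by line, so a full-dataset coverage run does not need the whole log in memory.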

See also

  • Practical Patterns — the decisions these scripts embody.
  • Presets — the bundled parsers, a smaller starting point.
  • The scripts themselves: example/loghub_*/parser.py in the repository.
