-
Notifications
You must be signed in to change notification settings - Fork 0
Example Parsers
The repository ships a set of real parser scripts under example/, one per
loghub open dataset (Apache, Linux, Mac,
Thunderbird, Windows, BGL, HDFS, Hadoop, Spark, Zookeeper, OpenSSH, OpenStack,
Proxifier, HealthApp, HPC, Android). They are the most complete worked examples
of everything in Practical Patterns — read them after that
page.
Each example/loghub_<Name>/ holds a parser.py and a 2,000-line sample
<Name>_2k.log:
$ cd example/loghub_Linux
$ python -m log2seq -i -p parser.py Linux_2k.log-i/--show-input echoes each source line next to its parse, and -p/--parser
loads the script (a module-level parser). See Practical Patterns
for the other CLI flags.
These scripts are meant to be exemplary, not merely passing. They follow a few rules worth copying:
-
Faithful to the dataset's
log_format. Each parser mirrors the structure loghub itself documents for that dataset, rather than an ad-hoc guess. -
Built for the full dataset. The in-repo
<Name>_2k.logis only a sample; the parsers are written for the complete datasets (linked from loghub), which are far larger. -
Free-form fields are anchored, not widened. A component that can contain
spaces or slashes is
.+?pinned by afull_formatdelimiter — never a loose catch-all (see Practical Patterns). - Every real line class gets a meaningful rule. Tag-less meta-lines and continuation lines are modelled with their own rule that still keeps whatever structure they have — not dropped to a whole-line catch-all.
The Linux parser is two header rules plus the default statement parser:
# Rule 1 — the normal "<Component>[pid]: <Content>" syslog line.
# <Component> can be "syslogd 1.4.1", "/sbin/mingetty", "sshd(pam_unix)" — spaces
# and slashes — so it is a non-greedy .+? pinned by full_format's "[pid]: ".
header_rule1 = [MonthAbbreviation(), Digit("day"), Time(), Hostname("host"),
UserItem("component", r".+?"), Digit("processid", optional=True),
Statement()]
header_parser1 = HeaderParser(header_rule1,
full_format=r"<0> <1> <2> <3> <4>(\[<5>\])?: <6>",
defaults=defaults)
# Rule 2 — tag-less syslog meta-lines that have no "<Component>: " part, e.g.
# "Sep 28 09:08:56 combo last message repeated 2 times"
# Keep the timestamp/host envelope and take the remainder as the message.
header_rule2 = [MonthAbbreviation(), Digit("day"), Time(), Hostname("host"),
Statement()]
header_parser2 = HeaderParser(header_rule2, separator=" ", defaults=defaults)
parser = LogParser([header_parser1, header_parser2],
preset.default_statement_parser())The two classes land in the right rule (first match wins):
parser.process_line("Jun 14 15:16:01 combo sshd(pam_unix)[19939]: authentication failure; logname= uid=0")
# component='sshd(pam_unix)', host='combo', message='authentication failure; logname= uid=0' (rule 1)
parser.process_line("Sep 28 09:08:56 combo last message repeated 2 times")
# host='combo', message='last message repeated 2 times' (rule 2; no component key)The meta-line keeps its timestamp and host and carries a real message, instead of being thrown away — exactly the faithful primary + meaningful fallback pattern.
The examples are checked on two axes:
-
Correctness — on the labelled 2k sample, the parsed
messageequals loghub's ground-truth<Content>(logparser/logs/<Name>/<Name>_2k.log_structured.csv). -
Coverage — over the full dataset, zero parse failures. The CLI's summary line is the quick check:
$ python -m log2seq -p parser.py Linux_2k.log # processed 2000 lines: 2000 ok, 0 failed
A parser that is right on 2,000 lines can still fail on the long tail, so the full-data run matters — see Full-data robustness in Practical Patterns.
- Practical Patterns — the decisions these scripts embody.
- Presets — the bundled parsers, a smaller starting point.
- The scripts themselves:
example/loghub_*/parser.pyin the repository.