Building a Parser

sat edited this page Jun 26, 2026 · 1 revision

The whole job of log2seq is to turn a raw log line into structured fields. This page is the spine of the guide: it shows the shape of a parser, walks through assembling one from parts, and lists the three ways to drive it. The catalogs of parts live in Header Rules and Statement Rules; the internals are in Architecture Overview.

The two stages and the result

A LogParser runs every line through two stages:

  1. Header — split off the front matter (timestamp, host, …) and leave the free-format body as the message.
  2. Statement — tokenize that body into words and the symbols between them.

import log2seq

mes = ("Jan  1 12:34:56 host-device1 system[12345]: "
       "host 2001:0db8:1234::1 (interface:eth0) disconnected")
parser = log2seq.init_parser()          # the default parser
d = parser.process_line(mes)

d is a plain dict:

{
  'timestamp': datetime.datetime(2026, 1, 1, 12, 34, 56),
  'host': 'host-device1',
  'message': 'system[12345]: host 2001:0db8:1234::1 (interface:eth0) disconnected',
  'words': ['system', '12345', 'host', '2001:0db8:1234::1',
            'interface', 'eth0', 'disconnected'],
  'symbols': ['', '[', ']: ', ' ', ' (', ':', ') ', ''],
}

A few facts worth knowing up front (see Python API for the full contract):

  • The header keys are not fixed: host is present because a default rule names an item host. Each item's value lands under its own name, so the available header keys depend on the rule.
  • A syslog line carries no year, so the default parser fills in the current year (hence 2026 in the output above). Provide one explicitly (a <year> item, or defaults={"year": ...}) when you need a fixed value.
  • symbols is always one longer than words (len(symbols) == len(words) + 1): there is a separator before the first word and after the last; either end may be empty.
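The length invariant is easiest to see by interleaving the two lists. The sketch below uses only the values copied from the example result above; interleave is a hypothetical helper written for this page, not part of log2seq. For this example the interleaved text equals the message; whether that holds for every rule set is an assumption to check against your own parser.

```python
# Values copied from the example result above.
d = {
    "message": "system[12345]: host 2001:0db8:1234::1 (interface:eth0) disconnected",
    "words": ["system", "12345", "host", "2001:0db8:1234::1",
              "interface", "eth0", "disconnected"],
    "symbols": ["", "[", "]: ", " ", " (", ":", ") ", ""],
}


def interleave(words, symbols):
    """Hypothetical helper: symbols[0] + words[0] + symbols[1] + ... + symbols[-1]."""
    if len(symbols) != len(words) + 1:
        raise ValueError("expected exactly one more symbol than words")
    pieces = [symbols[0]]
    for word, sym in zip(words, symbols[1:]):
        pieces.append(word)
        pieces.append(sym)
    return "".join(pieces)


rebuilt = interleave(d["words"], d["symbols"])
print(rebuilt == d["message"])
```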

Assembling a parser from parts

A header rule is a list of Items; a statement parser is a list of Actions. You build each stage, then bind them with LogParser.

from log2seq import LogParser
from log2seq.header import (HeaderParser, MonthAbbreviation, Digit, Time,
                            Hostname, UserItem, Statement)
from log2seq.statement import StatementParser, Split, FixIP

# Stage 1: a header rule, placed with full_format (fixed "[pid]: " delimiter)
header_rule = [
    MonthAbbreviation(), Digit("day"), Time(), Hostname("host"),
    UserItem("program", r"[a-zA-Z0-9._-]+"), Digit("pid", optional=True),
    Statement(),
]
hp = HeaderParser(header_rule,
                  full_format=r"<0> <1> <2> <3> <4>(\[<5>\])?: <6>",
                  defaults={"year": 2024})

# Stage 2: split on spaces, keep IP addresses whole, then split on ":"
sp = StatementParser([Split(" "), FixIP(), Split(":")])

parser = LogParser(hp, sp)
d = parser.process_line("Aug  9 11:22:33 web01 nginx[4521]: connect from 10.0.0.5:443 ok")
d['timestamp']  # datetime.datetime(2024, 8, 9, 11, 22, 33)
d['host']       # 'web01'
d['program']    # 'nginx'
d['pid']        # 4521
d['words']      # ['connect', 'from', '10.0.0.5', '443', 'ok']
d['symbols']    # ['', ' ', ' ', ':', ' ', '']

Two ideas in that example carry most of the power of log2seq:

  • Placement. full_format pins the literal [, ] and : so they are not mistaken for content; the alternative, separator=..., is simpler when fields are just whitespace-delimited. See Header Rules.
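  • Order matters in the statement stage. FixIP() runs before the ":" split, so a token it recognizes as an IP address is marked as one word and the later split leaves it alone. This matters most for IPv6: without FixIP(), an address such as 2001:0db8:1234::1 would break apart at every colon. (In the line above, the ":" split only separates 10.0.0.5 from the port 443; a dotted IPv4 address would break into 10, 0, 0, 5 only under a split on ".".) See Statement Rules.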
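Why ordering matters can be shown with a toy model in plain Python. This is not log2seq's implementation: split_on, fix_ip and the (text, fixed) tuples are hypothetical stand-ins for the idea of marking a part so later actions skip it, and the toy drops the symbols entirely.

```python
import ipaddress


def split_on(parts, sep):
    """Split every unfixed part on sep; fixed parts pass through untouched."""
    out = []
    for text, fixed in parts:
        if fixed:
            out.append((text, True))
        else:
            out.extend((piece, False) for piece in text.split(sep) if piece)
    return out


def fix_ip(parts):
    """Mark any part that parses as an IPv4/IPv6 address as fixed."""
    out = []
    for text, fixed in parts:
        try:
            ipaddress.ip_address(text)
            out.append((text, True))
        except ValueError:
            out.append((text, fixed))
    return out


def words(parts):
    return [text for text, _ in parts]


start = [("host 2001:db8::1 down", False)]

# Fix addresses first, then split on ":" -> the address survives.
with_fix = words(split_on(fix_ip(split_on(start, " ")), ":"))
# Skip the fix step -> the ":" split cuts the address apart.
without_fix = words(split_on(split_on(start, " "), ":"))

print(with_fix)     # ['host', '2001:db8::1', 'down']
print(without_fix)  # ['host', '2001', 'db8', '1', 'down']
```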

An optional item that does not match is simply absent from the result (the key is omitted, not set to None), so "pid" in d tells you whether a pid was present.
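Because the key is omitted rather than set to None, code that consumes the result should test for membership (or use dict.get) instead of indexing directly. A minimal sketch, using hand-written dicts shaped like the result above; source_label is a hypothetical helper, not a log2seq function.

```python
def source_label(d):
    """Hypothetical helper: 'program[pid]' when a pid was parsed, else 'program'."""
    if "pid" in d:
        return f"{d['program']}[{d['pid']}]"
    return d["program"]


# Hand-written stand-ins for parser results with and without the optional item.
with_pid = {"program": "nginx", "pid": 4521}
without_pid = {"program": "cron"}

print(source_label(with_pid))      # nginx[4521]
print(source_label(without_pid))   # cron
print(without_pid.get("pid"))      # None, without raising KeyError
```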

Three ways to drive it

  1. In code — build a LogParser yourself, or start from init_parser() and replace only one stage. You can reuse a preset's statement parser while swapping the header rules, and vice versa:

    from log2seq import init_parser, preset
    parser = init_parser(header_parsers=[hp],
                         statement_parser=preset.default_statement_parser())

  2. A bundled preset — call a ready-made parser for a common format:

    from log2seq import preset
    parser = preset.default()                 # syslog / ISO-style date
    parser = preset.apache_errorlog_parser()  # Apache error_log

    See Presets.

  3. An external parser script + the CLI — put a LogParser in a .py file as a module-level variable named parser, then point the command line at it:

    # myparser.py
    from log2seq import LogParser
    from log2seq.header import *
    from log2seq.statement import *
    # ... build hp and sp ...
    parser = LogParser(hp, sp)

    $ log2seq --parser myparser.py app.log
    $ python -m log2seq -p myparser.py app.log

    The CLI imports the script with load_parser_script and runs every line through its parser. See Practical Patterns for using the CLI to debug a parser against sample data.
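The "module-level variable named parser" convention can be illustrated with the standard library alone. The sketch below is an assumption about how such a loader could work, not the source of load_parser_script; load_parser_from_script is a hypothetical name, and the demo script holds a placeholder string rather than a real LogParser so that it runs without log2seq installed.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_parser_from_script(path):
    """Hypothetical loader: execute a .py file and return its `parser` variable."""
    spec = importlib.util.spec_from_file_location("user_parser_script", str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    try:
        return module.parser
    except AttributeError:
        raise ValueError(f"{path} does not define a module-level 'parser'") from None


# Demo with a stand-in script; a real one would build a LogParser instead.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "myparser.py"
    script.write_text("parser = 'stand-in for a LogParser'\n")
    loaded = load_parser_from_script(script)

print(loaded)
```

A script that forgets to define parser fails with a clear ValueError here, which is the kind of early check worth having before running a large log file through it.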

Where to go next

  • Header Rules — the Item catalog and how to place them.
  • Statement Rules — the Action catalog and the (part, flag) model that makes ordering matter.
  • Presets — ready-made parsers, also readable as worked examples.
  • Practical Patterns — authoring real parsers and debugging them with the CLI.
