Architecture Overview

sat edited this page Jun 26, 2026 · 1 revision

This page is the entry point for developers who want to understand how log2seq is structured internally: the responsibility of each module, and how a single raw log line travels through the library until it becomes a structured record.

log2seq is a preprocessing library that turns a syslog-style line (timestamp host statement and its many variants) into structured fields. It works in two stages: first it splits the line into a header (the timestamp, host and other metadata) and a statement (the free-format body), then it tokenizes the statement into a sequence of words and the symbols that separate them. Both stages are rule-based and customizable.

Downstream consumers (e.g. amulog) use the returned timestamp / host / message / words as the input to event extraction, template mining and time-series analysis, so the quality of this output — especially timestamp determinism and the host/message boundary — propagates to everything built on top of it.

For the precise contracts of the rule classes referenced below (Item and Action subclasses, the methods you implement when writing a custom rule), see Header Rules and Statement Rules. The public API is summarized in Python API.

Package Structure

log2seq/
├── __init__.py     # Public API facade: re-exports LogParser, init_parser, exceptions, KEY_* (no logic)
├── _common.py      # Shared base (keys, exceptions, strip_linefeed) + LogParser orchestrator + init_parser / load_parser_script factories
├── header.py       # Header parsing: HeaderParser + the Item class hierarchy (extract timestamp / host)
├── statement.py    # Statement splitting: StatementParser + the Action class hierarchy (produce words / symbols)
├── preset.py       # Bundled ready-to-use parsers (default syslog/ISO, Apache error log)
└── __main__.py     # CLI entry point built on click (python -m log2seq)

header.py and statement.py hold most of the code (the Item and Action class families); _common.py, preset.py and __main__.py are small glue layers, and __init__.py is a thin re-export facade.

Module Responsibilities

| Module | Key classes / functions | Responsibility |
| --- | --- | --- |
| __init__.py | (re-exports only) | Public facade. Re-exports LogParser, init_parser, the exceptions ParserDefinitionError / LogParseFailure, and the result-dict key constants KEY_TIMESTAMP / KEY_STATEMENT / KEY_WORDS / KEY_SYMBOLS. Defines __version__ (__init__.py:1-12). |
| _common.py | LogParser, init_parser, load_parser_script, strip_linefeed, KEY_*, exceptions | Shared lowest layer plus the central orchestrator. LogParser binds one-or-more HeaderParsers to one StatementParser and drives the per-line pipeline (_common.py:31-163). init_parser builds the default configuration; load_parser_script dynamically imports a user parser script for the CLI (_common.py:166-205). |
| header.py | HeaderParser, Item and its subclasses | Parse the header. A HeaderParser compiles a list of Items into a single regex, matches it at the start of the line, extracts each item's value, and (optionally) reassembles the time fields into a datetime (header.py:121-211). |
| statement.py | StatementParser, Action (_ActionBase) and its subclasses | Tokenize the statement. A StatementParser applies an ordered list of Actions and finally separates the result into (words, symbols) (statement.py:23-112). |
| preset.py | default_header_parsers, default_statement_parser, default, apache_errorlog_parser | Bundled parsers for common formats; also serve as worked examples for writing your own (preset.py:15-148). |
| __main__.py | main, iter_lines | The python -m log2seq CLI: runs each line of the input files (or stdin) through a parser and prints the result (__main__.py:68-148). |

Module dependencies

__init__ and __main__ sit at the top; header, statement and preset form the rule layer; _common is the shared base. Two edges run "upward" from _common into the rule layer and are therefore done as lazy (in-function) imports to avoid import cycles:

  • LogParser.__init__ imports header._HeaderParserBase to normalize a single parser into a list (_common.py:75).
  • init_parser imports preset to obtain the default rules (_common.py:181-186).

So although _common is conceptually the lowest layer, it does know about header and preset at call time. header.py and statement.py depend only on _common (for the KEY_* constants and exceptions); preset.py pulls in header and statement via from ... import * and binds them with LogParser.
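Why the in-function import matters can be shown with a minimal, self-contained sketch. The module names toy_base and toy_preset are hypothetical stand-ins written to a temporary directory; this is not log2seq's code, only the same dependency shape (a base module that needs the rule layer, and a rule layer that needs the base module at import time).

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical rule-layer module: needs the base layer as soon as it is imported.
PRESET_SRC = '''
from toy_base import KEY_STATEMENT

def default_keys():
    return ["timestamp", "host", KEY_STATEMENT]
'''

# Base layer with a lazy import: the rule layer is resolved at call time only.
LAZY_BASE_SRC = '''
KEY_STATEMENT = "message"

def init_default():
    from toy_preset import default_keys
    return default_keys()
'''

# Base layer with an eager import: runs before KEY_STATEMENT has been defined.
EAGER_BASE_SRC = '''
from toy_preset import default_keys
KEY_STATEMENT = "message"

def init_default():
    return default_keys()
'''

def try_layout(base_src):
    """Import toy_base from a scratch directory and call init_default()."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "toy_base.py").write_text(base_src)
        Path(tmp, "toy_preset.py").write_text(PRESET_SRC)
        sys.path.insert(0, tmp)
        try:
            return importlib.import_module("toy_base").init_default()
        except ImportError as exc:
            return type(exc).__name__
        finally:
            # Leave the interpreter as we found it.
            sys.path.remove(tmp)
            for name in ("toy_base", "toy_preset"):
                sys.modules.pop(name, None)

lazy_outcome = try_layout(LAZY_BASE_SRC)
eager_outcome = try_layout(EAGER_BASE_SRC)
print(lazy_outcome)
print(eager_outcome)
```

With the lazy layout both modules are fully initialized before either needs the other; with the eager layout the rule layer asks for a name the base module has not defined yet, and the import fails.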

The Two-Stage Pipeline

The whole library is organized around LogParser, which composes two ordered, swappable rule sequences.

Stage 1 — header: an ordered list of HeaderParsers (first match wins)

LogParser holds a list of HeaderParsers. process_header tries them from the front, in order, and the first one that matches is used; the rest are skipped. If every parser fails, it raises LogParseFailure (_common.py:100-116).
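The first-match-wins behaviour can be sketched with plain regular expressions. The two patterns below are simplified, hypothetical stand-ins for compiled HeaderParsers (a syslog-like rule and an ISO-date rule), and LogParseFailure is a local stand-in for the library's exception; the real rules are built from Items, not written by hand.

```python
import re

class LogParseFailure(Exception):
    """Local stand-in for the exception named in the text."""

# Ordered rules: the more specific syslog-like rule comes first.
HEADER_RULES = [
    re.compile(
        r"^(?P<month>[A-Z][a-z]{2}) +(?P<day>\d{1,2}) "
        r"(?P<time>\d{2}:\d{2}:\d{2}) (?P<host>\S+) (?P<message>.*)$"
    ),
    re.compile(
        r"^(?P<date>\d{4}-\d{2}-\d{2}) "
        r"(?P<time>\d{2}:\d{2}:\d{2}) (?P<host>\S+) (?P<message>.*)$"
    ),
]

def process_header(line):
    """Try each rule from the front; the first match wins, the rest are skipped."""
    for rule in HEADER_RULES:
        match = rule.match(line)
        if match is not None:
            return match.groupdict()
    raise LogParseFailure(line)

syslog_fields = process_header("Jan  1 12:34:56 host-device1 system[12345]: started")
iso_fields = process_header("2024-01-01 12:34:56 host-device1 started")
print(syslog_fields["host"], "|", syslog_fields["message"])
print(iso_fields["date"], "|", iso_fields["message"])
```

The keys of the returned dict differ per rule (month/day versus date), which mirrors the point that header keys depend on whichever rule matched.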

Order is priority: put the more specific rule first. This is how one parser can accept several log formats at once — e.g. the default configuration has a syslog rule first and an ISO-date rule second (intended for Python-logging asctime output) (preset.py:40-49). Within a single HeaderParser, the list of Items is compiled into one re.Pattern; placement is defined either by a separator (the easier, recommended option) or by a full_format template that anchors fixed delimiters such as Apache's [time] [level] (header.py:163-176, header.py:207-211). Exactly one Statement item is mandatory in every rule (header.py:237-241); it captures the free-format body under the message key.
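As a rough illustration of the separator option, the sketch below joins per-item patterns into one regex and enforces the one-statement rule. The (value_name, pattern) pairs, compile_items, and the local ParserDefinitionError class are all hypothetical simplifications, not the library's Item classes or its actual compilation code; the full_format option is not covered here.

```python
import re

class ParserDefinitionError(Exception):
    """Local stand-in for the exception named in the text."""

# Hypothetical (value_name, pattern) pairs standing in for Item objects.
ITEMS = [
    ("month", r"[A-Z][a-z]{2}"),
    ("day", r"\d{1,2}"),
    ("time", r"\d{2}:\d{2}:\d{2}"),
    ("host", r"\S+"),
    ("message", r".*"),
]

def compile_items(items, separator=" ", statement_name="message"):
    """Join the item patterns with a run of separator characters into one regex."""
    if sum(name == statement_name for name, _ in items) != 1:
        raise ParserDefinitionError("exactly one statement item is required")
    sep = "[" + re.escape(separator) + "]+"
    body = sep.join(f"(?P<{name}>{pattern})" for name, pattern in items)
    return re.compile("^" + body + "$")

pattern = compile_items(ITEMS)
fields = pattern.match("Jan  1 12:34:56 host-device1 system[12345]: started").groupdict()
print(fields)
```

Each item lands in the result under its own name, so adding or renaming an item changes the keys of the output without touching the matching logic.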

Stage 2 — statement: a single StatementParser (ordered Action pipeline)

process_statement delegates to the one StatementParser, which applies its list of Actions sequentially to the statement and then calls _separate to produce the final (words, symbols) lists (statement.py:87-112).

Internally the statement is carried as a list of (substring, flag) tuples. The flag is one of three values (statement.py:17-20):

  • UNKNOWN — not yet processed; still a candidate for later actions to split or fix.
  • FIXED — confirmed as a single word; not changed by later actions.
  • SEPARATORS — a delimiter; becomes a symbol, never a word.

Each Action.do(parts) -> parts consumes that list and returns a new one, re-flagging substrings as it goes. Because actions only ever transform this tuple list, their order is significant, and you can freely insert or reorder them. The default statement parser is a 4-step pipeline (Split of common symbols, not including : → FixIP → Fix of timestamps and MAC addresses → Split of :). Deferring the : split to the end is what protects IPv6 addresses, clock times and MAC addresses from being broken apart (preset.py:68-74).
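The effect of action order can be reproduced with a reduced, self-contained model of the (substring, flag) list. The split and fix helpers, the flag values, and the two regexes below are hypothetical simplifications and not the library's Action classes; the pipeline is a shortened analogue of the default one (no FixIP step).

```python
import re

UNKNOWN, FIXED, SEPARATORS = "unknown", "fixed", "separators"  # hypothetical flag values

def split(chars):
    """Action: cut UNKNOWN parts at any run of the given characters."""
    pattern = re.compile("([" + re.escape(chars) + "]+)")
    def do(parts):
        out = []
        for text, flag in parts:
            if flag != UNKNOWN:
                out.append((text, flag))  # FIXED and SEPARATORS pass through untouched
                continue
            # re.split with a capturing group alternates text / delimiter
            for i, piece in enumerate(pattern.split(text)):
                if piece:
                    out.append((piece, SEPARATORS if i % 2 else UNKNOWN))
        return out
    return do

def fix(regex):
    """Action: mark UNKNOWN parts that fully match the regex as FIXED."""
    pattern = re.compile(regex)
    def do(parts):
        return [(t, FIXED if f == UNKNOWN and pattern.fullmatch(t) else f)
                for t, f in parts]
    return do

def run(actions, statement):
    """Apply the actions in order and return the non-separator parts."""
    parts = [(statement, UNKNOWN)]
    for action in actions:
        parts = action(parts)
    return [t for t, f in parts if f != SEPARATORS]

MAC = r"[0-9a-f]{2}(?::[0-9a-f]{2}){5}"
CLOCK = r"\d{2}:\d{2}:\d{2}"
STATEMENT = "sshd[123]: connect from 00:1a:2b:3c:4d:5e at 12:34:56"

# ":" split deferred to the end: the MAC address and clock time are FIXED first.
deferred = run([split(" []"), fix(MAC), fix(CLOCK), split(":")], STATEMENT)
# ":" split up front: nothing is left for the fix actions to recognize.
eager = run([split(" []:"), fix(MAC), fix(CLOCK)], STATEMENT)
print(deferred)
print(eager)
```

In the deferred order the MAC address and the clock time survive as single words; in the eager order they are cut into two-character fragments before any fix action sees them.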

The process_line result dict

LogParser.process_line ties the two stages together: it strips the line feed, returns None for an empty line, runs the header stage, and — only when the header produced a non-None message — runs the statement stage and attaches the tokenized output (_common.py:133-163).

A typical result looks like:

{
  "timestamp": datetime.datetime(...),    # reassembled by the HeaderParser
  "host":      "host-device1",            # from a Hostname/String Item (key = the item's value_name)
  "message":   "system[12345]: ...",      # the statement string
  "words":     ["system", "12345", ...],  # from the StatementParser
  "symbols":   ["", "[", "]: ", ...],     # the separators around the words
}

Key facts about the shape:

  • The header keys are not fixed: host is present because a default rule names an item host. Any item's value goes into the dict under its own value_name, so the available header keys depend on the rule (_common.py:40, preset.py:44). The constants KEY_TIMESTAMP="timestamp", KEY_STATEMENT="message", KEY_WORDS="words", KEY_SYMBOLS="symbols" name the standard keys (_common.py:6-9).
  • symbols is always one longer than words: len(symbols) == len(words) + 1, because there is a separator before the first word and after the last word (statement.py:84, statement.py:99-101). Either end may be the empty string.
  • words and symbols are absent when message is None: the statement stage runs only inside if mes is not None (_common.py:158-163). A rule that parses a tag-less metadata line and leaves the body empty therefore yields a dict with no words / symbols.
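The length invariant from the second bullet follows directly from how a flagged part list can be folded into two lists. The separate function below is a hypothetical sketch of that idea, with made-up flag values, and is not the library's _separate implementation.

```python
UNKNOWN, FIXED, SEPARATORS = "unknown", "fixed", "separators"  # hypothetical flag values

def separate(parts):
    """Fold a (substring, flag) list into (words, symbols).

    symbols starts with one slot and gains one slot per word, so
    len(symbols) == len(words) + 1 holds by construction.
    """
    words, symbols = [], [""]
    for text, flag in parts:
        if flag == SEPARATORS:
            symbols[-1] += text      # adjacent separators merge into one symbol
        else:
            words.append(text)
            symbols.append("")       # open the slot that follows this word
    return words, symbols

# Hand-written part list for the statement "system[12345]: up"
PARTS = [
    ("system", UNKNOWN), ("[", SEPARATORS), ("12345", UNKNOWN),
    ("]", SEPARATORS), (":", SEPARATORS), (" ", SEPARATORS), ("up", FIXED),
]

words, symbols = separate(PARTS)
print(words)
print(symbols)
```

Because every word is surrounded by exactly two symbol slots, interleaving symbols and words reconstructs the original statement string.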

When all header rules fail, process_line re-raises LogParseFailure, unless the LogParser was created with ignore_failure=True, in which case it returns None (_common.py:150-156).
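The control flow described in this section can be condensed into a short sketch. toy_header and toy_statement are deliberately naive, hypothetical stand-ins for the two stages, and LogParseFailure is a local class; only the branching (empty line, header failure, message of None) is meant to mirror the description above.

```python
import re

class LogParseFailure(Exception):
    """Local stand-in for the exception named in the text."""

def toy_header(line):
    """Hypothetical header stage: 'host' followed by an optional message."""
    match = re.match(r"^(?P<host>\S+)(?: (?P<message>.*))?$", line)
    if match is None:
        raise LogParseFailure(line)
    return match.groupdict()

def toy_statement(message):
    """Hypothetical statement stage: split on whitespace runs only."""
    pieces = re.split(r"(\s+)", message)
    return pieces[0::2], [""] + pieces[1::2] + [""]

def process_line(line, ignore_failure=False):
    line = line.rstrip("\r\n")
    if line == "":
        return None                      # empty line
    try:
        result = toy_header(line)
    except LogParseFailure:
        if ignore_failure:
            return None                  # swallow the failure
        raise
    message = result.get("message")
    if message is not None:              # statement stage is conditional
        result["words"], result["symbols"] = toy_statement(message)
    return result

full = process_line("host-device1 link up\n")
header_only = process_line("host-device1\n")
print(full)
print(header_only)
```

Callers that index result["words"] unconditionally therefore need to handle both a None return value and a dict without the words / symbols keys.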

End-to-End Data Flow

flowchart TD
    L["Raw log line (str)"] --> SF["strip_linefeed; empty line -> None"]
    SF --> PH["process_header: try HeaderParser list from the front (first match wins)"]
    PH -->|no rule matches| FAIL["raise LogParseFailure (or None if ignore_failure)"]
    PH -->|matched| HP["compiled re.Pattern match -> Item.pick per item -> reassemble datetime"]
    HP --> D1["dict: timestamp, host, message (statement str)"]
    D1 -->|message is None| OUT2["dict without words / symbols"]
    D1 -->|message is not None| PS["process_statement: StatementParser applies Action list in order"]
    PS --> SEP["_separate: (part, flag) list -> words / symbols (len(symbols) = len(words)+1)"]
    SEP --> OUT["dict: timestamp, host, message, words, symbols"]

In words: raw line → split off the header (timestamp / host) → tokenize the statement → a structured list of words and their separators. Both stages share the same skeleton — apply an ordered, swappable rule sequence from the front — which is what makes the parser configurable: swap or reorder the HeaderParser list to support a new log format, and swap or reorder the Action list to change how statements are tokenized.

Customization Paths

There are three ways to drive this pipeline, in increasing distance from code:

  1. In code: init_parser(header_parsers=[...], statement_parser=StatementParser([...])). You can reuse preset.default_statement_parser() while replacing only the header rules, and vice versa.
  2. Presets: call preset.default() / preset.apache_errorlog_parser() directly, or read them as templates for your own (example/loghub_* are full worked examples).
  3. External parser script + CLI: write a .py that binds a LogParser to a module-level variable named parser, then point the CLI at it with python -m log2seq --parser script.py; load_parser_script imports it (_common.py:190-205, __main__.py:101-108).
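The third path relies on importing a file by path and reading one module-level variable. A minimal sketch of that mechanism with the standard library follows; load_parser_from_script is a hypothetical helper and may differ from what load_parser_script actually does, and the script here binds a placeholder dict where a real script would bind a LogParser.

```python
import importlib.util
import tempfile
from pathlib import Path

def load_parser_from_script(path, var_name="parser"):
    """Import a .py file by path and return its module-level variable."""
    spec = importlib.util.spec_from_file_location("user_parser_script", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)       # runs the script's top-level code
    if not hasattr(module, var_name):
        raise ValueError(f"{path} does not define a module-level '{var_name}'")
    return getattr(module, var_name)

# Placeholder script: a real one would build and bind a LogParser instead.
SCRIPT = 'parser = {"kind": "placeholder for a LogParser instance"}\n'

with tempfile.TemporaryDirectory() as tmp:
    script_path = Path(tmp, "my_parser.py")
    script_path.write_text(SCRIPT)
    loaded = load_parser_from_script(script_path)

print(loaded)
```

Since the script is executed on load, it should only be pointed at files you trust.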
