Architecture Overview
This page is the entry point for developers who want to understand how log2seq is structured internally: the responsibility of each module, and how a single raw log line travels through the library until it becomes a structured record.
log2seq is a preprocessing library that turns a syslog-style line
(timestamp host statement and its many variants) into structured fields. It
works in two stages: first it splits the line into a header (the
timestamp, host and other metadata) and a statement (the free-format body),
then it tokenizes the statement into a sequence of words and the
symbols that separate them. Both stages are rule-based and customizable.
Downstream consumers (e.g. amulog) use the returned timestamp / host /
message / words as the input to event extraction, template mining and
time-series analysis, so the quality of this output — especially timestamp
determinism and the host/message boundary — propagates to everything built on
top of it.
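The two stages can be pictured with a minimal, self-contained sketch. It does not use log2seq: the `HEADER` and `WORD` patterns and the `parse_line` helper are hypothetical stand-ins for illustration, and they handle only one fixed line shape (ISO timestamp, host, statement).

```python
import re
from datetime import datetime

# Hypothetical patterns for illustration only; log2seq builds its own
# regexes from Item rules and tokenizes with Action rules.
HEADER = re.compile(r"(?P<timestamp>\S+) (?P<host>\S+) (?P<message>.*)")
WORD = re.compile(r"[^ \[\]:=]+")  # a word is any run of non-separator characters


def parse_line(line):
    """Stage 1: split header from statement. Stage 2: tokenize the statement."""
    match = HEADER.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"header did not match: {line!r}")
    record = match.groupdict()
    record["timestamp"] = datetime.fromisoformat(record["timestamp"])
    # findall yields the words; split on the same pattern yields the text
    # between them, including the (possibly empty) leading and trailing pieces.
    record["words"] = WORD.findall(record["message"])
    record["symbols"] = WORD.split(record["message"])
    return record


record = parse_line("2024-01-02T03:04:05 host-device1 sshd[123]: session opened\n")
print(record["words"])    # ['sshd', '123', 'session', 'opened']
print(record["symbols"])  # ['', '[', ']: ', ' ', '']
```

The real library replaces both hard-coded patterns with ordered, user-definable rule lists, which is what the rest of this page describes.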
For the precise contracts of the rule classes referenced below (Item and
Action subclasses, the methods you implement when writing a custom rule), see
Header Rules and Statement Rules. The public
API is summarized in Python API.
log2seq/
├── __init__.py # Public API facade: re-exports LogParser, init_parser, exceptions, KEY_* (no logic)
├── _common.py # Shared base (keys, exceptions, strip_linefeed) + LogParser orchestrator + init_parser / load_parser_script factories
├── header.py # Header parsing: HeaderParser + the Item class hierarchy (extract timestamp / host)
├── statement.py # Statement splitting: StatementParser + the Action class hierarchy (produce words / symbols)
├── preset.py # Bundled ready-to-use parsers (default syslog/ISO, Apache error log)
└── __main__.py # CLI entry point built on click (python -m log2seq)
header.py and statement.py hold most of the code (the Item and Action
class families); _common.py, preset.py and __main__.py are small glue
layers, and __init__.py is a thin re-export facade.
| Module | Key classes / functions | Responsibility |
|---|---|---|
| __init__.py | (re-exports only) | Public facade. Re-exports LogParser, init_parser, the exceptions ParserDefinitionError / LogParseFailure, and the result-dict key constants KEY_TIMESTAMP / KEY_STATEMENT / KEY_WORDS / KEY_SYMBOLS. Defines __version__ (__init__.py:1-12). |
| _common.py | LogParser, init_parser, load_parser_script, strip_linefeed, KEY_*, exceptions | Shared lowest layer plus the central orchestrator. LogParser binds one-or-more HeaderParsers to one StatementParser and drives the per-line pipeline (_common.py:31-163). init_parser builds the default configuration; load_parser_script dynamically imports a user parser script for the CLI (_common.py:166-205). |
| header.py | HeaderParser, Item and its subclasses | Parse the header. A HeaderParser compiles a list of Items into a single regex, matches it at the start of the line, extracts each item's value, and (optionally) reassembles the time fields into a datetime (header.py:121-211). |
| statement.py | StatementParser, Action (_ActionBase) and its subclasses | Tokenize the statement. A StatementParser applies an ordered list of Actions and finally separates the result into (words, symbols) (statement.py:23-112). |
| preset.py | default_header_parsers, default_statement_parser, default, apache_errorlog_parser | Bundled parsers for common formats; also serve as worked examples for writing your own (preset.py:15-148). |
| __main__.py | main, iter_lines | The python -m log2seq CLI: runs each line of the input files (or stdin) through a parser and prints the result (__main__.py:68-148). |
__init__ and __main__ sit at the top; header, statement and preset
form the rule layer; _common is the shared base. Two edges run "upward" from
_common into the rule layer and are therefore done as lazy (in-function)
imports to avoid import cycles:
- LogParser.__init__ imports header._HeaderParserBase to normalize a single parser into a list (_common.py:75).
- init_parser imports preset to obtain the default rules (_common.py:181-186).
So although _common is conceptually the lowest layer, it does know about
header and preset at call time. header.py and statement.py depend only
on _common (for the KEY_* constants and exceptions); preset.py pulls in
header and statement via from ... import * and binds them with
LogParser.
The whole library is organized around LogParser, which composes two ordered,
swappable rule sequences.
LogParser holds a list of HeaderParsers. process_header tries them from
the front, in order, and the first one that matches is used; the rest are
skipped. If every parser fails, it raises LogParseFailure (_common.py:100-116).
Order is priority: put the more specific rule first. This is how one parser can
accept several log formats at once — e.g. the default configuration has a
syslog rule first and an ISO-date rule second (intended for Python-logging
asctime output) (preset.py:40-49). Within a single HeaderParser, the list of Items is
compiled into one re.Pattern; placement is defined either by a separator
(the easier, recommended option) or by a full_format template that anchors
fixed delimiters such as Apache's [time] [level]
(header.py:163-176, header.py:207-211). Exactly one Statement item is
mandatory in every rule (header.py:237-241); it captures the free-format body
under the message key.
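The first-match-wins behaviour can be sketched without the library. Everything below is a hypothetical stand-in: `ParseFailure` plays the role of LogParseFailure, and the two regexes are simplified illustrations, not the patterns the preset actually compiles from its Items.

```python
import re


class ParseFailure(Exception):
    """Hypothetical stand-in for the library's LogParseFailure."""


# Ordered rules, most specific first. Each pattern captures the free-format
# body under "message", mirroring the mandatory Statement item.
HEADER_RULES = [
    ("syslog", re.compile(
        r"(?P<month>[A-Z][a-z]{2}) +(?P<day>\d{1,2}) "
        r"(?P<time>\d{2}:\d{2}:\d{2}) (?P<host>\S+) (?P<message>.*)")),
    ("iso", re.compile(
        r"(?P<date>\d{4}-\d{2}-\d{2})[T ](?P<time>\d{2}:\d{2}:\d{2}) "
        r"(?P<host>\S+) (?P<message>.*)")),
]


def process_header(line):
    """Try each rule from the front; the first match wins, the rest are skipped."""
    for name, pattern in HEADER_RULES:
        match = pattern.match(line)
        if match is not None:
            fields = match.groupdict()
            fields["rule"] = name  # recorded only to show which rule fired
            return fields
    raise ParseFailure(f"no header rule matched: {line!r}")


print(process_header("Jan  2 03:04:05 host1 sshd[1]: ok")["rule"])  # syslog
print(process_header("2024-01-02 03:04:05 host1 started")["rule"])  # iso
```

Because evaluation stops at the first match, a broad rule placed ahead of a narrow one would shadow it, which is why the more specific rule goes first.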
process_statement delegates to the one StatementParser, which applies its
list of Actions sequentially to the statement and then calls _separate
to produce the final (words, symbols) lists (statement.py:87-112).
Internally the statement is carried as a list of (substring, flag) tuples.
The flag is one of three values (statement.py:17-20):
- UNKNOWN: not yet processed; still a candidate for later actions to split or fix.
- FIXED: confirmed as a single word; not changed by later actions.
- SEPARATORS: a delimiter; becomes a symbol, never a word.
Each Action.do(parts) -> parts consumes that list and returns a new one,
re-flagging substrings as it goes. Because actions only ever transform this
tuple list, the order is significant and you can freely insert or reorder them.
The default statement parser is a four-step pipeline: split on common symbols
except :, then FixIP, then fix timestamps and MAC addresses, then split on :.
Deferring the : split to the end is what protects IPv6 addresses, clock times
and MAC addresses from being broken apart (preset.py:68-74).
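To see why action order matters, here is a small model of the flagged-tuple pipeline. It is a sketch under assumptions, not the library's code: the flag values, `split_action`, `fix_action`, `separate` and `run` are hypothetical names, and merging adjacent separators into one symbol is an assumed behaviour.

```python
import re

UNKNOWN, FIXED, SEPARATORS = 0, 1, 2  # placeholder flag values


def split_action(pattern):
    """Build an action that cuts UNKNOWN parts at every match of `pattern`."""
    regex = re.compile(pattern)

    def do(parts):
        out = []
        for text, flag in parts:
            if flag != UNKNOWN:  # FIXED words and separators pass through untouched
                out.append((text, flag))
                continue
            pos = 0
            for m in regex.finditer(text):
                if m.start() > pos:
                    out.append((text[pos:m.start()], UNKNOWN))
                out.append((m.group(), SEPARATORS))
                pos = m.end()
            if pos < len(text):
                out.append((text[pos:], UNKNOWN))
        return out
    return do


def fix_action(pattern):
    """Build an action that marks UNKNOWN parts fully matching `pattern` as FIXED."""
    regex = re.compile(pattern)

    def do(parts):
        return [(t, FIXED) if f == UNKNOWN and regex.fullmatch(t) else (t, f)
                for t, f in parts]
    return do


def separate(parts):
    """Turn the flagged list into (words, symbols) with len(symbols) == len(words) + 1."""
    words, symbols, pending = [], [], ""
    for text, flag in parts:
        if flag == SEPARATORS:
            pending += text  # assumption: adjacent separators merge into one symbol
        else:
            symbols.append(pending)
            words.append(text)
            pending = ""
    symbols.append(pending)
    return words, symbols


def run(statement, actions):
    parts = [(statement, UNKNOWN)]
    for do in actions:
        parts = do(parts)
    return separate(parts)


CLOCK = r"\d{2}:\d{2}:\d{2}"
deferred = [split_action(r"[ =]+"), fix_action(CLOCK), split_action(r":+")]
eager = [split_action(r"[ =:]+"), fix_action(CLOCK)]

print(run("job: start=03:04:05", deferred)[0])  # ['job', 'start', '03:04:05']
print(run("job: start=03:04:05", eager)[0])     # ['job', 'start', '03', '04', '05']
```

In the deferred ordering the clock time is already FIXED when the colon split runs, so it survives as one word; splitting on the colon first leaves nothing for the fix step to recognize.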
LogParser.process_line ties the two stages together: it strips the line feed,
returns None for an empty line, runs the header stage, and — only when the
header produced a non-None message — runs the statement stage and attaches
the tokenized output (_common.py:133-163).
A typical result looks like:
{
"timestamp": datetime.datetime(...), # reassembled by the HeaderParser
"host": "host-device1", # from a Hostname/String Item (key = the item's value_name)
"message": "system[12345]: ...", # the statement string
"words": ["system", "12345", ...], # from the StatementParser
"symbols": ["", "[", "]: ", ...], # the separators around the words
}

Key facts about the shape:
- The header keys are not fixed: host is present because a default rule names an item host. Any item's value goes into the dict under its own value_name, so the available header keys depend on the rule (_common.py:40, preset.py:44). The constants KEY_TIMESTAMP = "timestamp", KEY_STATEMENT = "message", KEY_WORDS = "words", KEY_SYMBOLS = "symbols" name the standard keys (_common.py:6-9).
- symbols is always one longer than words: len(symbols) == len(words) + 1, because there is a separator before the first word and after the last word (statement.py:84, statement.py:99-101). Either end may be the empty string.
- words and symbols are absent when message is None: the statement stage runs only inside if mes is not None (_common.py:158-163). A rule that parses a tag-less metadata line and leaves the body empty therefore yields a dict with no words / symbols.
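The length invariant makes it easy to interleave the two lists again. The helper below is a hypothetical sketch (`rebuild_statement` is not part of log2seq), and it assumes separators are kept verbatim, which the example result suggests but this page does not state as a guarantee.

```python
from itertools import zip_longest


def rebuild_statement(record):
    """Interleave symbols and words: s0 w0 s1 w1 ... sN.

    Returns None when the record carries no tokenized output
    (for example when its message was None).
    """
    words = record.get("words")
    symbols = record.get("symbols")
    if words is None or symbols is None:
        return None
    if len(symbols) != len(words) + 1:
        raise ValueError("expected len(symbols) == len(words) + 1")
    pieces = []
    # symbols is one longer than words, so pad the missing last word with "".
    for symbol, word in zip_longest(symbols, words, fillvalue=""):
        pieces.append(symbol)
        pieces.append(word)
    return "".join(pieces)


example = {
    "message": "system[12345]: started",
    "words": ["system", "12345", "started"],
    "symbols": ["", "[", "]: ", ""],
}
print(rebuild_statement(example))           # system[12345]: started
print(rebuild_statement({"host": "h1"}))    # None
```

Checking for missing keys before indexing is the safe way to consume records, since a header-only line produces a dict without words or symbols.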
When all header rules fail, process_line re-raises LogParseFailure, unless
the LogParser was created with ignore_failure=True, in which case it returns
None (_common.py:150-156).
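The control flow of the per-line driver, including the ignore_failure switch, can be modelled as follows. This is a sketch of the behaviour described above rather than the library's code: `ParseFailure`, `toy_header` and `toy_statement` are hypothetical stand-ins for LogParseFailure and the two rule-driven stages.

```python
class ParseFailure(Exception):
    """Hypothetical stand-in for the library's LogParseFailure."""


def toy_header(line):
    """Toy header stage: '<host> <message>'; lines starting with '!' fail."""
    if line.startswith("!"):
        raise ParseFailure(line)
    host, _, rest = line.partition(" ")
    return {"host": host, "message": rest or None}


def toy_statement(message):
    """Toy statement stage: split on single spaces only."""
    words = message.split(" ")
    symbols = [""] + [" "] * (len(words) - 1) + [""]
    return words, symbols


def process_line(line, header_fn, statement_fn, ignore_failure=False):
    line = line.rstrip("\r\n")       # strip the line feed
    if line == "":
        return None                  # empty line
    try:
        record = header_fn(line)     # header stage
    except ParseFailure:
        if ignore_failure:
            return None              # swallow the failure on request
        raise                        # otherwise re-raise to the caller
    message = record.get("message")
    if message is not None:          # statement stage runs only with a body
        record["words"], record["symbols"] = statement_fn(message)
    return record


print(process_line("host1 a b\n", toy_header, toy_statement))
print(process_line("host1\n", toy_header, toy_statement))                    # no words / symbols
print(process_line("!bad", toy_header, toy_statement, ignore_failure=True))  # None
```

Note that None can mean either an empty line or a suppressed failure when ignore_failure is set, so callers that need to tell the two apart should leave the flag off and catch the exception themselves.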
flowchart TD
L["Raw log line (str)"] --> SF["strip_linefeed; empty line -> None"]
SF --> PH["process_header: try HeaderParser list from the front (first match wins)"]
PH -->|no rule matches| FAIL["raise LogParseFailure (or None if ignore_failure)"]
PH -->|matched| HP["compiled re.Pattern match -> Item.pick per item -> reassemble datetime"]
HP --> D1["dict: timestamp, host, message (statement str)"]
D1 -->|message is None| OUT2["dict without words / symbols"]
D1 -->|message is not None| PS["process_statement: StatementParser applies Action list in order"]
PS --> SEP["_separate: (part, flag) list -> words / symbols (len(symbols) = len(words)+1)"]
SEP --> OUT["dict: timestamp, host, message, words, symbols"]
In words: raw line → split off the header (timestamp / host) → tokenize the
statement → a structured list of words and their separators. Both stages share
the same skeleton — apply an ordered, swappable rule sequence from the front —
which is what makes the parser configurable: swap or reorder the HeaderParser
list to support a new log format, and swap or reorder the Action list to
change how statements are tokenized.
There are three ways to drive this pipeline, in increasing distance from code:
- In code: init_parser(header_parsers=[...], statement_parser=StatementParser([...])). You can reuse preset.default_statement_parser() while replacing only the header rules, and vice versa.
- Presets: call preset.default() / preset.apache_errorlog_parser() directly, or read them as templates for your own (example/loghub_* are full worked examples).
- External parser script + CLI: write a .py that binds a LogParser to a module-level variable named parser, then point the CLI at it with python -m log2seq --parser script.py; load_parser_script imports it (_common.py:190-205, __main__.py:101-108).
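For the third option, the following sketch shows one way a script loader of this kind could work using the standard importlib machinery. It is an assumption about the general technique, not a copy of load_parser_script: the function name `load_parser_from_script`, the module name `user_parser_script`, and the stand-in script contents are all hypothetical.

```python
import importlib.util
import os
import tempfile


def load_parser_from_script(path, variable="parser"):
    """Import the file at `path` as a module and return its `variable` attribute."""
    spec = importlib.util.spec_from_file_location("user_parser_script", path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path!r}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the script's top-level code
    try:
        return getattr(module, variable)
    except AttributeError:
        raise AttributeError(f"{path!r} does not define {variable!r}") from None


# Demo with a stand-in script. A real parser script would instead bind a
# LogParser, for example the object returned by init_parser.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "my_parser.py")
    with open(script, "w") as f:
        f.write("parser = {'kind': 'stand-in for a LogParser'}\n")
    loaded = load_parser_from_script(script)

print(loaded)  # {'kind': 'stand-in for a LogParser'}
```

Since the script is executed as ordinary Python, it should only be pointed at files you trust.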
- Header Rules / Statement Rules: the Item / Action contracts you implement to write custom rules.
- Python API: the public classes, functions and result keys.