Semantic Log Line Classifier

A modular Python application for processing and classifying log files through a configurable processor pipeline, with LLM-assisted regex pattern generation for semantic classification.

Overview

The classifier reads a log file line by line, runs each line through a configurable pipeline of processors, and matches the processed result against a registry of regex patterns; when no pattern matches, an LLM generates a new regex and class name, which is added to the registry.

Architecture

Input File → [Line Reader] → [Processor Pipeline] → [Classifier] → [Report Generator]

Processor Pipeline (in order):

Timestamp Remover → IP/Port Remover → IP Remover → GUID Remover → Tokenizer
    → Token Normalizer → Token Filter → Token Counter → Patcher

Processor Pipeline

The default pipeline consists of the following processors, in order (a short usage sketch follows the list):

  1. TimestampRemover - Removes ISO 8601 timestamps, common date formats, and Unix timestamps
  2. IPPortRemover - Removes IP addresses with port numbers (e.g., 10.68.21.11:48438)
  3. IPRemover - Removes standalone IP addresses (e.g., 127.0.0.1)
  4. GUIDRemover - Removes UUIDs and 32-char hex identifiers
  5. Tokenizer - Splits input on whitespace into tokens
  6. TokenNormalizer - Normalizes tokens (lowercase, remove non-alphanumeric)
  7. TokenFilter - Filters out single characters, empty strings, and stop words
  8. TokenCounter - Counts token occurrences
  9. Patcher - Converts token counts to a single string format
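
For example, running a raw line through the default pipeline (a minimal sketch using the Pipeline API shown in Usage below; the exact output depends on the Patcher's string format):

from log_classifier.pipeline import Pipeline

pipeline = Pipeline.default()
raw = '2026-01-10T15:29:45Z 10.68.21.11:48438 request_failed status_code=404'

# run() takes a list of lines; timestamps and IP:port pairs are
# stripped before tokenization, normalization, and counting.
processed = pipeline.run([raw])
print(processed)  # a patched string of normalized token counts (illustrative)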

Installation

The project is organized as plain Python packages with no packaging or installation machinery. Install only the LLM provider package(s) you plan to use:

# For Anthropic/Claude (default)
pip install anthropic

# For OpenAI
pip install openai

# Or install both
pip install anthropic openai

No other external dependencies are required (uses standard library for everything else).

Usage

Basic Example

from pathlib import Path
from log_classifier.pipeline import Pipeline
from log_classifier.classifier import Classifier
from log_classifier.reporter import Reporter

# Setup components
# Option 1: Use factory function (recommended)
from log_classifier.classifier import create_llm_client
llm_client = create_llm_client("anthropic", api_key="your-api-key")
# Or for OpenAI:
# llm_client = create_llm_client("openai", api_key="your-api-key")

# Option 2: Use provider-specific configs
# from log_classifier.classifier import AnthropicLLMClient, AnthropicConfig
# config = AnthropicConfig.from_env()  # Uses ANTHROPIC_API_KEY env var
# llm_client = AnthropicLLMClient(config)
classifier = Classifier(llm_client)
pipeline = Pipeline.default()
reporter = Reporter(Path("./output"))

# Process log file
with open("input.log") as f:
    for line_num, line in enumerate(f, 1):
        line = line.strip()
        if not line:
            continue
        
        # Process line through pipeline
        processed = pipeline.run([line])
        if processed:
            # Classify the processed line
            class_name = classifier.classify(processed, line, line_num)

# Generate reports
classes = classifier.get_all_classes()
reporter.generate(classes)

Custom Pipeline

You can create a custom pipeline with specific processors:

from log_classifier.pipeline import Pipeline
from log_classifier.processors import (
    TimestampRemover, GUIDRemover, Tokenizer,
    TokenNormalizer, TokenFilter, TokenCounter, Patcher
)

# Create custom pipeline
custom_processors = [
    TimestampRemover(),
    GUIDRemover(),
    Tokenizer(),
    TokenNormalizer(),
    TokenFilter(),
    TokenCounter(),
    Patcher(),
]
custom_pipeline = Pipeline(custom_processors)
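
Custom processors can be added to the list as well. A minimal sketch, assuming the abstract base class in processors/base.py requires a single process() method (check the actual interface before copying):

from log_classifier.processors.base import Processor

class SeverityRemover(Processor):
    """Hypothetical processor that strips leading severity tags
    (e.g. "ERROR:", "WARN:") so they do not dominate token counts.
    Assumes Processor requires a single process(line) method; see
    processors/base.py for the real interface."""

    SEVERITIES = ("ERROR:", "WARN:", "INFO:", "DEBUG:")

    def process(self, line: str) -> str:
        for tag in self.SEVERITIES:
            if line.startswith(tag):
                return line[len(tag):].lstrip()
        return line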

Custom Stop Words

You can customize the stop words list for the TokenFilter:

from log_classifier.processors import TokenFilter

custom_stop_words = frozenset({"custom", "stop", "words"})
token_filter = TokenFilter(stop_words=custom_stop_words)
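
The customized filter then slots into a pipeline like any other processor (a sketch; processor order follows the default pipeline above, and the stop words here are hypothetical):

from log_classifier.pipeline import Pipeline
from log_classifier.processors import (
    Tokenizer, TokenNormalizer, TokenFilter, TokenCounter, Patcher
)

# Hypothetical stop words tuned for HTTP access logs
http_stop_words = frozenset({"get", "post", "http", "https"})
pipeline = Pipeline([
    Tokenizer(),
    TokenNormalizer(),
    TokenFilter(stop_words=http_stop_words),
    TokenCounter(),
    Patcher(),
])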

LLM Configuration

The classifier supports multiple LLM providers (Anthropic/Claude and OpenAI). Multiple configuration methods are supported:

Method 1: Factory Function (Recommended for Multi-Provider)

Use the create_llm_client() factory function for easy provider switching:

from log_classifier.classifier import create_llm_client

# Anthropic/Claude (default)
llm_client = create_llm_client("anthropic", api_key="your-api-key")

# OpenAI
llm_client = create_llm_client("openai", api_key="your-api-key", model="gpt-4o")

# Using environment variables
# export ANTHROPIC_API_KEY="your-api-key" or export OPENAI_API_KEY="your-api-key"
llm_client = create_llm_client("anthropic")  # Uses ANTHROPIC_API_KEY env var

Method 2: Provider-Specific Configuration

Anthropic/Claude Configuration:

from log_classifier.classifier import AnthropicLLMClient, AnthropicConfig, ClaudeModels

# Using environment variables
config = AnthropicConfig.from_env()
llm_client = AnthropicLLMClient(config)

# Using model constants
config = AnthropicConfig.with_model(ClaudeModels.OPUS_4, api_key="your-api-key")
llm_client = AnthropicLLMClient(config)

# Direct configuration
config = AnthropicConfig(
    api_key="your-api-key",
    model=ClaudeModels.SONNET_4,
    max_tokens=1024
)
llm_client = AnthropicLLMClient(config)

# Available Claude models:
# - ClaudeModels.SONNET_4 (default)
# - ClaudeModels.OPUS_4
# - ClaudeModels.HAIKU_4
# - ClaudeModels.SONNET_3_5
# - ClaudeModels.OPUS_3
# - ClaudeModels.HAIKU_3

OpenAI Configuration:

from log_classifier.classifier import OpenAILLMClient, OpenAIConfig, OpenAIModels

# Using environment variables
config = OpenAIConfig.from_env()
llm_client = OpenAILLMClient(config)

# Using model constants
config = OpenAIConfig.with_model(OpenAIModels.GPT_4O, api_key="your-api-key")
llm_client = OpenAILLMClient(config)

# Direct configuration
config = OpenAIConfig(
    api_key="your-api-key",
    model=OpenAIModels.GPT_4O,
    max_tokens=1024,
    temperature=0.0  # Lower for more deterministic outputs
)
llm_client = OpenAILLMClient(config)

# Available OpenAI models:
# - OpenAIModels.GPT_4O (default)
# - OpenAIModels.GPT_4O_MINI
# - OpenAIModels.GPT_4_TURBO
# - OpenAIModels.GPT_4
# - OpenAIModels.GPT_3_5_TURBO

Method 3: Backward Compatibility (Anthropic Only)

The old LLMConfig and LLMClient aliases still work for backward compatibility:

from log_classifier.classifier import LLMConfig, LLMClient

# Old-style configuration still works
config = LLMConfig(api_key="your-api-key")
llm_client = LLMClient(config)

Output Format

The reporter generates two types of output:

Per-Class JSON Files (1.json, 2.json, etc.)

Each class gets its own JSON file with all members:

{
  "class_name": "HTTP 404 Request Failed",
  "regex": ".*code.*404.*event.*request.*failed.*",
  "member_count": 3,
  "members": [
    {
      "original_line": "{\"code\": 404, \"event\": \"request_finished\"...}",
      "line_number": 2,
      "processed_line": "code\u00001\u0001event\u00001..."
    }
  ]
}

Summary Report (report.json)

A summary of all classes:

{
  "total_classes": 3,
  "total_lines_processed": 7,
  "classes": [
    {
      "class_name": "HTTP 404 Request Failed",
      "regex": ".*code.*404.*event.*request.*failed.*",
      "member_count": 3
    }
  ]
}
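
Because the reports are plain JSON, downstream tooling can consume them with the standard library alone. A small sketch that prints the classes from report.json sorted by size (field names follow the schema above):

import json
from pathlib import Path

report = json.loads(Path("./output/report.json").read_text())

print(f"{report['total_classes']} classes, "
      f"{report['total_lines_processed']} lines processed")
for cls in sorted(report["classes"],
                  key=lambda c: c["member_count"], reverse=True):
    print(f"{cls['member_count']:>5}  {cls['class_name']}")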

Project Structure

log_classifier/
├── processors/          # Processing pipeline components
│   ├── base.py         # Abstract Processor base class
│   ├── timestamp.py    # TimestampRemover
│   ├── ip.py           # IPRemover, IPPortRemover
│   ├── guid.py         # GUIDRemover
│   ├── tokenizer.py    # Tokenizer
│   ├── normalizer.py   # TokenNormalizer
│   ├── filter.py       # TokenFilter
│   ├── counter.py      # TokenCounter
│   └── patcher.py      # Patcher
├── pipeline/           # Pipeline orchestration
│   └── pipeline.py     # Pipeline class
├── classifier/         # Classification system
│   ├── classifier.py   # Classifier and data models
│   └── llm_client.py   # LLM integration
├── reporter/           # Report generation
│   └── reporter.py     # Reporter class
└── tests/              # Test suite
    ├── sample_data.py  # Sample log lines
    ├── test_processors.py
    ├── test_pipeline.py
    └── test_classifier.py

Dependencies

Required

  • Python 3.10+
  • At least one LLM provider package:
    • anthropic - For Anthropic/Claude models
    • openai - For OpenAI models
    • Or both for flexibility

Standard Library (no installation needed)

  • re - Regular expressions
  • json - JSON handling
  • dataclasses - Data structures
  • pathlib - File paths
  • typing - Type hints
  • abc - Abstract base classes
  • collections - Counter for token counting
  • os - Environment variables

How It Works

  1. Processing: Each log line is run through the pipeline:
     • Timestamps, IPs, and GUIDs are removed
     • Text is tokenized and normalized
     • Tokens are filtered and counted
     • The result is converted to a patched string format
  2. Classification: Processed lines are matched against existing regex patterns (sketched below):
     • If a match is found, the line is added to that class
     • If no match is found, the LLM generates a new regex pattern and class name
     • New classes are appended to the registry in order
  3. Reporting: Classification results are written to JSON files:
     • One file per class containing all member log lines
     • A summary file with class statistics
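
The match-then-generate flow in step 2 can be pictured as follows (an illustrative sketch only; the real logic lives in classifier/classifier.py, and generate_pattern here stands in for the LLM call):

import re

def classify(processed: str,
             registry: list[tuple[str, re.Pattern]],
             generate_pattern) -> str:
    # Try existing patterns in registration order.
    for class_name, pattern in registry:
        if pattern.search(processed):
            return class_name
    # No match: ask the LLM for a new class name and regex,
    # then register the pair for subsequent lines.
    class_name, regex = generate_pattern(processed)
    registry.append((class_name, re.compile(regex)))
    return class_name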

Configuration

Environment Variables

Environment variables are supported for both providers:

Anthropic/Claude:

  • ANTHROPIC_API_KEY (required): Your Anthropic API key
  • ANTHROPIC_MODEL (optional): Model identifier (default: claude-sonnet-4-20250514)
  • ANTHROPIC_MAX_TOKENS (optional): Maximum tokens per request (default: 1024)

OpenAI:

  • OPENAI_API_KEY (required): Your OpenAI API key
  • OPENAI_MODEL (optional): Model identifier (default: gpt-4o)
  • OPENAI_MAX_TOKENS (optional): Maximum tokens per request (default: 1024)
  • OPENAI_TEMPERATURE (optional): Temperature for sampling (default: 0.0)

Set them in your shell:

# For Anthropic
export ANTHROPIC_API_KEY="your-api-key-here"
export ANTHROPIC_MODEL="claude-opus-4-20250514"  # Optional
export ANTHROPIC_MAX_TOKENS="2048"  # Optional

# For OpenAI
export OPENAI_API_KEY="your-api-key-here"
export OPENAI_MODEL="gpt-4o"  # Optional
export OPENAI_MAX_TOKENS="2048"  # Optional
export OPENAI_TEMPERATURE="0.0"  # Optional

Then use them in code:

from log_classifier.classifier import create_llm_client, AnthropicConfig, OpenAIConfig

# Using factory function (automatically uses env vars)
anthropic_client = create_llm_client("anthropic")
openai_client = create_llm_client("openai")

# Or using provider-specific configs
from log_classifier.classifier import AnthropicLLMClient, OpenAILLMClient
anthropic_config = AnthropicConfig.from_env()
openai_config = OpenAIConfig.from_env()
anthropic_client = AnthropicLLMClient(anthropic_config)
openai_client = OpenAILLMClient(openai_config)

Override specific values:

# Override model while using env var for API key
anthropic_client = create_llm_client("anthropic", model="claude-haiku-4-20250514")
openai_client = create_llm_client("openai", model="gpt-4o-mini")

Testing

Run the test suite:

cd log_classifier
python -m pytest tests/

Or run individual test modules from the repository root:

python -m unittest log_classifier.tests.test_processors
python -m unittest log_classifier.tests.test_pipeline
python -m unittest log_classifier.tests.test_classifier

Example Input

The classifier expects JSON log lines like:

{"event": "Created RunSession 222202", "timestamp": "2026-01-10T15:29:20.266784Z", "trace_id": "a6dfb9dd-eb72-4e42-a1ff-cdac6105020e"}
{"code": 404, "event": "request_finished", "timestamp": "2026-01-10T15:29:45.865083Z"}
{"event": "request_failed", "status_code": 404, "remote_addr": "10.68.21.11:48438", "timestamp": "2026-01-10T15:29:45.865522Z"}

Limitations

  • Requires an API key from at least one LLM provider (Anthropic or OpenAI)
  • LLM API calls are made for each unmatched log line; for large files, consider caching or batching (a caching sketch follows this list)
  • No retry logic for LLM API failures (exceptions are raised)
  • Classification is based on processed token patterns, not semantic understanding of log content
  • OpenAI models require JSON mode support (GPT-3.5-turbo and GPT-4 series)
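
Since identical log lines often reduce to the same processed string, a simple memoizing wrapper around the LLM client can avoid repeated calls. A hypothetical sketch; it assumes a generate_pattern(processed_line) method, so adapt it to the actual client interface in classifier/llm_client.py:

class CachingLLMClient:
    """Hypothetical wrapper that memoizes pattern generation so that
    identical processed lines trigger only one LLM call. Assumes the
    wrapped client exposes generate_pattern(processed_line); adapt to
    the real interface in classifier/llm_client.py."""

    def __init__(self, inner):
        self._inner = inner
        self._cache = {}

    def generate_pattern(self, processed_line: str):
        if processed_line not in self._cache:
            self._cache[processed_line] = self._inner.generate_pattern(processed_line)
        return self._cache[processed_line]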

Contributing

This project uses simple modular Python packages with direct imports. Each module should be independently testable.

License

[Add your license information here]
