Semantic Log Line Classifier

A modular Python application for processing and classifying log files through a configurable processor pipeline, with LLM-assisted regex pattern generation for semantic classification.

Overview

The classifier reads a log file line by line, runs each line through a configurable pipeline of processors, and matches the processed result against a registry of regex patterns; when no pattern matches, an LLM generates a new regex and class name, which is added to the registry.

Architecture

Input File → [Line Reader] → [Processor Pipeline] → [Classifier] → [Report Generator]

Processor Pipeline (in order):

Timestamp Remover → IP/Port Remover → IP Remover → GUID Remover → Tokenizer
    → Token Normalizer → Token Filter → Token Counter → Patcher

Processor Pipeline

The default pipeline consists of the following processors, in order (a short usage sketch follows the list):

  1. TimestampRemover - Removes ISO 8601 timestamps, common date formats, and Unix timestamps
  2. IPPortRemover - Removes IP addresses with port numbers (e.g., 10.68.21.11:48438)
  3. IPRemover - Removes standalone IP addresses (e.g., 127.0.0.1)
  4. GUIDRemover - Removes UUIDs and 32-char hex identifiers
  5. Tokenizer - Splits input on whitespace into tokens
  6. TokenNormalizer - Normalizes tokens (lowercase, remove non-alphanumeric)
  7. TokenFilter - Filters out single characters, empty strings, and stop words
  8. TokenCounter - Counts token occurrences
  9. Patcher - Converts token counts to a single string format
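
For example, running a raw line through the default pipeline (a minimal sketch using the Pipeline API shown in Usage below; the exact output depends on the Patcher's string format):

from log_classifier.pipeline import Pipeline

pipeline = Pipeline.default()
raw = '2026-01-10T15:29:45Z 10.68.21.11:48438 request_failed status_code=404'

# run() takes a list of lines; timestamps and IP:port pairs are
# stripped before tokenization, normalization, and counting.
processed = pipeline.run([raw])
print(processed)  # a patched string of normalized token counts (illustrative)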

Installation

The project is organized as plain Python packages with no packaging or installation machinery. Install only the LLM provider package(s) you plan to use:

# For Anthropic/Claude (default)
pip install anthropic

# For OpenAI
pip install openai

# Or install both
pip install anthropic openai

No other external dependencies are required (uses standard library for everything else).

Usage

Basic Example

from pathlib import Path
from log_classifier.pipeline import Pipeline
from log_classifier.classifier import Classifier
from log_classifier.reporter import Reporter

# Setup components
# Option 1: Use factory function (recommended)
from log_classifier.classifier import create_llm_client
llm_client = create_llm_client("anthropic", api_key="your-api-key")
# Or for OpenAI:
# llm_client = create_llm_client("openai", api_key="your-api-key")

# Option 2: Use provider-specific configs
# from log_classifier.classifier import AnthropicLLMClient, AnthropicConfig
# config = AnthropicConfig.from_env()  # Uses ANTHROPIC_API_KEY env var
# llm_client = AnthropicLLMClient(config)
classifier = Classifier(llm_client)
pipeline = Pipeline.default()
reporter = Reporter(Path("./output"))

# Process log file
with open("input.log") as f:
    for line_num, line in enumerate(f, 1):
        line = line.strip()
        if not line:
            continue
        
        # Process line through pipeline
        processed = pipeline.run([line])
        if processed:
            # Classify the processed line
            class_name = classifier.classify(processed, line, line_num)

# Generate reports
classes = classifier.get_all_classes()
reporter.generate(classes)

Custom Pipeline

You can create a custom pipeline with specific processors:

from log_classifier.pipeline import Pipeline
from log_classifier.processors import (
    TimestampRemover, GUIDRemover, Tokenizer,
    TokenNormalizer, TokenFilter, TokenCounter, Patcher
)

# Create custom pipeline
custom_processors = [
    TimestampRemover(),
    GUIDRemover(),
    Tokenizer(),
    TokenNormalizer(),
    TokenFilter(),
    TokenCounter(),
    Patcher(),
]
custom_pipeline = Pipeline(custom_processors)
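
Custom processors can be added to the list as well. A minimal sketch, assuming the abstract base class in processors/base.py requires a single process() method (check the actual interface before copying):

from log_classifier.processors.base import Processor

class SeverityRemover(Processor):
    """Hypothetical processor that strips leading severity tags
    (e.g. "ERROR:", "WARN:") so they do not dominate token counts.
    Assumes Processor requires a single process(line) method; see
    processors/base.py for the real interface."""

    SEVERITIES = ("ERROR:", "WARN:", "INFO:", "DEBUG:")

    def process(self, line: str) -> str:
        for tag in self.SEVERITIES:
            if line.startswith(tag):
                return line[len(tag):].lstrip()
        return line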

Custom Stop Words

You can customize the stop words list for the TokenFilter:

from log_classifier.processors import TokenFilter

custom_stop_words = frozenset({"custom", "stop", "words"})
token_filter = TokenFilter(stop_words=custom_stop_words)
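
The customized filter then slots into a pipeline like any other processor (a sketch; processor order follows the default pipeline above, and the stop words here are hypothetical):

from log_classifier.pipeline import Pipeline
from log_classifier.processors import (
    Tokenizer, TokenNormalizer, TokenFilter, TokenCounter, Patcher
)

# Hypothetical stop words tuned for HTTP access logs
http_stop_words = frozenset({"get", "post", "http", "https"})
pipeline = Pipeline([
    Tokenizer(),
    TokenNormalizer(),
    TokenFilter(stop_words=http_stop_words),
    TokenCounter(),
    Patcher(),
])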

LLM Configuration

The classifier supports multiple LLM providers (Anthropic/Claude and OpenAI). Multiple configuration methods are supported:

Method 1: Factory Function (Recommended for Multi-Provider)

Use the create_llm_client() factory function for easy provider switching:

from log_classifier.classifier import create_llm_client

# Anthropic/Claude (default)
llm_client = create_llm_client("anthropic", api_key="your-api-key")

# OpenAI
llm_client = create_llm_client("openai", api_key="your-api-key", model="gpt-4o")

# Using environment variables
# export ANTHROPIC_API_KEY="your-api-key" or export OPENAI_API_KEY="your-api-key"
llm_client = create_llm_client("anthropic")  # Uses ANTHROPIC_API_KEY env var

Method 2: Provider-Specific Configuration

Anthropic/Claude Configuration:

from log_classifier.classifier import AnthropicLLMClient, AnthropicConfig, ClaudeModels

# Using environment variables
config = AnthropicConfig.from_env()
llm_client = AnthropicLLMClient(config)

# Using model constants
config = AnthropicConfig.with_model(ClaudeModels.OPUS_4, api_key="your-api-key")
llm_client = AnthropicLLMClient(config)

# Direct configuration
config = AnthropicConfig(
    api_key="your-api-key",
    model=ClaudeModels.SONNET_4,
    max_tokens=1024
)
llm_client = AnthropicLLMClient(config)

# Available Claude models:
# - ClaudeModels.SONNET_4 (default)
# - ClaudeModels.OPUS_4
# - ClaudeModels.HAIKU_4
# - ClaudeModels.SONNET_3_5
# - ClaudeModels.OPUS_3
# - ClaudeModels.HAIKU_3

OpenAI Configuration:

from log_classifier.classifier import OpenAILLMClient, OpenAIConfig, OpenAIModels

# Using environment variables
config = OpenAIConfig.from_env()
llm_client = OpenAILLMClient(config)

# Using model constants
config = OpenAIConfig.with_model(OpenAIModels.GPT_4O, api_key="your-api-key")
llm_client = OpenAILLMClient(config)

# Direct configuration
config = OpenAIConfig(
    api_key="your-api-key",
    model=OpenAIModels.GPT_4O,
    max_tokens=1024,
    temperature=0.0  # Lower for more deterministic outputs
)
llm_client = OpenAILLMClient(config)

# Available OpenAI models:
# - OpenAIModels.GPT_4O (default)
# - OpenAIModels.GPT_4O_MINI
# - OpenAIModels.GPT_4_TURBO
# - OpenAIModels.GPT_4
# - OpenAIModels.GPT_3_5_TURBO

Method 3: Backward Compatibility (Anthropic Only)

The old LLMConfig and LLMClient aliases still work for backward compatibility:

from log_classifier.classifier import LLMConfig, LLMClient

# Old-style configuration still works
config = LLMConfig(api_key="your-api-key")
llm_client = LLMClient(config)

Output Format

The reporter generates two types of output:

Per-Class JSON Files (1.json, 2.json, etc.)

Each class gets its own JSON file with all members:

{
  "class_name": "HTTP 404 Request Failed",
  "regex": ".*code.*404.*event.*request.*failed.*",
  "member_count": 3,
  "members": [
    {
      "original_line": "{\"code\": 404, \"event\": \"request_finished\"...}",
      "line_number": 2,
      "processed_line": "code\u00001\u0001event\u00001..."
    }
  ]
}

Summary Report (report.json)

A summary of all classes:

{
  "total_classes": 3,
  "total_lines_processed": 7,
  "classes": [
    {
      "class_name": "HTTP 404 Request Failed",
      "regex": ".*code.*404.*event.*request.*failed.*",
      "member_count": 3
    }
  ]
}
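
Because the reports are plain JSON, downstream tooling can consume them with the standard library alone. A small sketch that prints the classes from report.json sorted by size (field names follow the schema above):

import json
from pathlib import Path

report = json.loads(Path("./output/report.json").read_text())

print(f"{report['total_classes']} classes, "
      f"{report['total_lines_processed']} lines processed")
for cls in sorted(report["classes"],
                  key=lambda c: c["member_count"], reverse=True):
    print(f"{cls['member_count']:>5}  {cls['class_name']}")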

Project Structure

log_classifier/
├── processors/          # Processing pipeline components
│   ├── base.py         # Abstract Processor base class
│   ├── timestamp.py    # TimestampRemover
│   ├── ip.py           # IPRemover, IPPortRemover
│   ├── guid.py         # GUIDRemover
│   ├── tokenizer.py    # Tokenizer
│   ├── normalizer.py   # TokenNormalizer
│   ├── filter.py       # TokenFilter
│   ├── counter.py      # TokenCounter
│   └── patcher.py      # Patcher
├── pipeline/           # Pipeline orchestration
│   └── pipeline.py     # Pipeline class
├── classifier/         # Classification system
│   ├── classifier.py   # Classifier and data models
│   └── llm_client.py   # LLM integration
├── reporter/           # Report generation
│   └── reporter.py     # Reporter class
└── tests/              # Test suite
    ├── sample_data.py  # Sample log lines
    ├── test_processors.py
    ├── test_pipeline.py
    └── test_classifier.py

Dependencies

Required

  • Python 3.10+
  • At least one LLM provider package:
    • anthropic - For Anthropic/Claude models
    • openai - For OpenAI models
    • Or both for flexibility

Standard Library (no installation needed)

  • re - Regular expressions
  • json - JSON handling
  • dataclasses - Data structures
  • pathlib - File paths
  • typing - Type hints
  • abc - Abstract base classes
  • collections - Counter for token counting
  • os - Environment variables

How It Works

  1. Processing: Each log line is run through the pipeline:
     • Timestamps, IPs, and GUIDs are removed
     • Text is tokenized and normalized
     • Tokens are filtered and counted
     • The result is converted to a patched string format
  2. Classification: Processed lines are matched against existing regex patterns (sketched below):
     • If a match is found, the line is added to that class
     • If no match is found, the LLM generates a new regex pattern and class name
     • New classes are appended to the registry in order
  3. Reporting: Classification results are written to JSON files:
     • One file per class containing all member log lines
     • A summary file with class statistics
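
The match-then-generate flow in step 2 can be pictured as follows (an illustrative sketch only; the real logic lives in classifier/classifier.py, and generate_pattern here stands in for the LLM call):

import re

def classify(processed: str,
             registry: list[tuple[str, re.Pattern]],
             generate_pattern) -> str:
    # Try existing patterns in registration order.
    for class_name, pattern in registry:
        if pattern.search(processed):
            return class_name
    # No match: ask the LLM for a new class name and regex,
    # then register the pair for subsequent lines.
    class_name, regex = generate_pattern(processed)
    registry.append((class_name, re.compile(regex)))
    return class_name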

Configuration

Environment Variables

Environment variables are supported for both providers:

Anthropic/Claude:

  • ANTHROPIC_API_KEY (required): Your Anthropic API key
  • ANTHROPIC_MODEL (optional): Model identifier (default: claude-sonnet-4-20250514)
  • ANTHROPIC_MAX_TOKENS (optional): Maximum tokens per request (default: 1024)

OpenAI:

  • OPENAI_API_KEY (required): Your OpenAI API key
  • OPENAI_MODEL (optional): Model identifier (default: gpt-4o)
  • OPENAI_MAX_TOKENS (optional): Maximum tokens per request (default: 1024)
  • OPENAI_TEMPERATURE (optional): Temperature for sampling (default: 0.0)

Set them in your shell:

# For Anthropic
export ANTHROPIC_API_KEY="your-api-key-here"
export ANTHROPIC_MODEL="claude-opus-4-20250514"  # Optional
export ANTHROPIC_MAX_TOKENS="2048"  # Optional

# For OpenAI
export OPENAI_API_KEY="your-api-key-here"
export OPENAI_MODEL="gpt-4o"  # Optional
export OPENAI_MAX_TOKENS="2048"  # Optional
export OPENAI_TEMPERATURE="0.0"  # Optional

Then use them in code:

from log_classifier.classifier import create_llm_client, AnthropicConfig, OpenAIConfig

# Using factory function (automatically uses env vars)
anthropic_client = create_llm_client("anthropic")
openai_client = create_llm_client("openai")

# Or using provider-specific configs
from log_classifier.classifier import AnthropicLLMClient, OpenAILLMClient
anthropic_config = AnthropicConfig.from_env()
openai_config = OpenAIConfig.from_env()
anthropic_client = AnthropicLLMClient(anthropic_config)
openai_client = OpenAILLMClient(openai_config)

Override specific values:

# Override model while using env var for API key
anthropic_client = create_llm_client("anthropic", model="claude-haiku-4-20250514")
openai_client = create_llm_client("openai", model="gpt-4o-mini")

Testing

Run the test suite:

cd log_classifier
python -m pytest tests/

Or run individual test modules from the repository root:

python -m unittest log_classifier.tests.test_processors
python -m unittest log_classifier.tests.test_pipeline
python -m unittest log_classifier.tests.test_classifier

Example Input

The classifier expects JSON log lines like:

{"event": "Created RunSession 222202", "timestamp": "2026-01-10T15:29:20.266784Z", "trace_id": "a6dfb9dd-eb72-4e42-a1ff-cdac6105020e"}
{"code": 404, "event": "request_finished", "timestamp": "2026-01-10T15:29:45.865083Z"}
{"event": "request_failed", "status_code": 404, "remote_addr": "10.68.21.11:48438", "timestamp": "2026-01-10T15:29:45.865522Z"}

Limitations

  • Requires an API key from at least one LLM provider (Anthropic or OpenAI)
  • LLM API calls are made for each unmatched log line; for large files, consider caching or batching (a caching sketch follows this list)
  • No retry logic for LLM API failures (exceptions are raised)
  • Classification is based on processed token patterns, not semantic understanding of log content
  • OpenAI models require JSON mode support (GPT-3.5-turbo and GPT-4 series)
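
Since identical log lines often reduce to the same processed string, a simple memoizing wrapper around the LLM client can avoid repeated calls. A hypothetical sketch; it assumes a generate_pattern(processed_line) method, so adapt it to the actual client interface in classifier/llm_client.py:

class CachingLLMClient:
    """Hypothetical wrapper that memoizes pattern generation so that
    identical processed lines trigger only one LLM call. Assumes the
    wrapped client exposes generate_pattern(processed_line); adapt to
    the real interface in classifier/llm_client.py."""

    def __init__(self, inner):
        self._inner = inner
        self._cache = {}

    def generate_pattern(self, processed_line: str):
        if processed_line not in self._cache:
            self._cache[processed_line] = self._inner.generate_pattern(processed_line)
        return self._cache[processed_line]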

Contributing

This project uses simple modular Python packages with direct imports. Each module should be independently testable.

License

[Add your license information here]
