A Go library that automatically learns URL patterns from batches of URLs and normalizes them by detecting dynamic segments.
We built this library to normalize URLs at Zori when processing customer analytics events. High-cardinality URLs with dynamic IDs, UUIDs, and dates create storage and query performance issues that this library solves by:
- Automatic pattern detection - No manual route definitions needed
- Batch-based learning - Discovers patterns from actual URLs in each batch
- Cardinality reduction - Converts
/users/12345/profileto/users/{id}/profile - Type-aware - Distinguishes UUIDs, IuDs, dates, hashes, and other parameter types
Works for analytics, metrics storage, log normalization, and any system that needs to aggregate URL-based data.
- Batch-Based Processing: Create a new classifier instance per batch to avoid memory growth
- Smart Pattern Detection: Detects high-cardinality segments automatically
- Parameter Type Classification: Identifies UUIDs, IDs, dates, timestamps, hashes, tokens, and slugs
- No Upfront Knowledge: Learns patterns from actual URLs in each batch
- Configurable: Adjust cardinality thresholds and minimum sample requirements
go get github.com/ZoriHQ/trie-url-classifierpackage main
import "fmt"
func main() {
// Process a batch of URLs
batch := []string{
"/projects/d381b052-99eb-40f2-9ede-9bce790faae1/analytics",
"/projects/a1b2c3d4-e5f6-7890-abcd-ef1234567890/analytics",
"/projects/12345678-1234-1234-1234-123456789012/analytics",
"/users/123456/profile",
"/users/789012/profile",
"/users/345678/profile",
}
// Create a NEW classifier for this batch
classifier := NewClassifier()
// Learn patterns from the batch
classifier.Learn(batch)
// Normalize URLs in the batch
for _, url := range batch {
pattern := classifier.Classify(url)
fmt.Printf("%s -> %s\n", url, pattern)
}
// After processing, discard the classifier
// For the next batch, create a NEW instance
}Process batches of URLs independently:
func processBatch(urlBatch []string) {
classifier := NewClassifier()
classifier.Learn(urlBatch)
for _, url := range urlBatch {
normalizedURL := classifier.Classify(url)
// Use normalizedURL for storage, aggregation, etc.
}
}Important: Create a new classifier for each batch to avoid memory growth.
Customize the classifier behavior:
classifier := NewClassifier(
WithCardinalityThreshold(0.75), // 75% unique values = dynamic (default)
WithMinSamples(2), // Minimum samples before detection (default: 2)
)- CardinalityThreshold (default: 0.75): Ratio of unique values to total count. Higher = stricter detection
- MinSamples (default: 2): Minimum samples needed at a position before considering it for parametrization
The classifier automatically detects and labels these parameter types:
| Type | Pattern | Example |
|---|---|---|
{uuid} |
UUID v4 format | d381b052-99eb-40f2-9ede-9bce790faae1 |
{id} |
Numeric ID (6+ digits) or prefixed IDs | 123456, cus_abc123 |
{hash} |
24+ hex characters | 507f1f77bcf86cd799439011 |
{date} |
ISO date (YYYY-MM-DD) | 2024-01-15 |
{timestamp} |
Unix timestamp (10+ digits) | 1705334400 |
{token} |
JWT tokens | eyJhbGci... |
{slug} |
Hyphenated words with numbers | my-post-12345 |
{param} |
Generic parameter (fallback) | Any other dynamic value |
- Build Trie: URLs are split by
/and inserted into a trie structure - Track Cardinality: Each node tracks unique values and total count
- Detect Patterns: High-cardinality positions (many unique values) = dynamic segments
- Classify Types: Dynamic values are classified using regex patterns
- Normalize: New URLs are matched against learned patterns
Given training URLs:
/projects/uuid-1/analytics
/projects/uuid-2/analytics
/projects/uuid-3/analytics
The trie detects:
- "projects" appears 3 times with same value → static
- Second segment has 3 unique values in 3 traversals → dynamic (100% cardinality)
- "analytics" appears 3 times with same value → static
Result: /projects/{uuid}/analytics
Creates a new URL pattern classifier with optional configuration.
Learns patterns from a batch of URLs. Can be called multiple times.
Normalizes a URL based on learned patterns.
WithCardinalityThreshold(threshold float64) OptionWithMinSamples(min int) Option
The classifier detects dynamic segments through:
- Cardinality analysis - Segments with many unique values (≥75% by default) are marked as dynamic
- Pattern matching - Even with identical values, segments matching UUID/ID/date patterns are parameterized
- Static preservation - Common words like "api", "users", "settings" remain static
This handles both diverse batches and edge cases like repeated identical URLs.
- Space: O(N × M) where N = number of unique URLs, M = average URL segments
- Time:
- Learning: O(N × M) for N URLs
- Classification: O(M) for M segments
Contributions welcome! Please open an issue or PR.
MIT License