See CONTRIBUTING.md for local contributor workflows and SECURITY.md for disclosure guidance.
A Generalized Inverted Index (GIN) for JSON data, designed for row-group pruning in columnar storage formats like Parquet.
- String indexing - Exact match and IN queries on string fields
- Numeric indexing - Range queries (GT, GTE, LT, LTE) with per-row-group min/max stats
- Field transformers - Convert values (e.g., date strings to epoch) for efficient range queries
- Trigram indexing - Full-text CONTAINS queries using n-gram matching
- Regex support - Pattern matching with trigram-based candidate selection
- Null tracking - IS NULL / IS NOT NULL predicates
- Bloom filter - Fast-path rejection for non-existent values
- HyperLogLog - Efficient cardinality estimation
- Compression - zstd-compressed binary serialization
- Parquet integration - Build from Parquet, embed in metadata, sidecar files, S3 support
- CLI tool - Command-line interface for build, query, info, and extract operations
A serverless pruning index for data lakes - the GIN index is a compact, immutable index designed to answer one question: "Which row groups might contain my data?"
Querying large data lakes is expensive. When you search for trace_id=abc123 across millions of Parquet files, traditional approaches either:
- Full scan - Read every row group (~TB of data, high latency, high cost)
- Database approach - Run PostgreSQL/Elasticsearch cluster (~ms latency, operational burden)
- Parquet stats - Use built-in min/max (useless for high-cardinality strings)
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Serverless Row-Group Pruning โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ โ
โ 1. Cache index anywhere 2. Prune locally 3. Read only โ
โ (<1MB for millions of files) (~1ยตs) matching RGsโ
โ โ
โ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโ โ
โ โ memcached โ โโโโโโโโโโถ โ GIN Index โ โโถ โ S3/GCS โ โ
โ โ nginx โ decode โ Evaluate() โ โ [RG 5, 23] โ โ
โ โ CDN edge โ โ โ โ โ โ
โ โ localStorageโ โ Result: 3 โ โ Skip 99% โ โ
โ โโโโโโโโโโโโโโโโ โ row groups โ โ of data โ โ
โ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโ โ
โ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
| Challenge | PostgreSQL GIN | Elasticsearch | This GIN Index |
|---|---|---|---|
| Deployment | Database cluster | Search cluster | Just bytes - cache anywhere |
| Query latency | ~1ms | ~5-10ms | ~1ยตs - client-side |
| High cardinality | Index bloat | Shard overhead | Bloom filter fast-path |
| Index size | MB-GB | GB | ~30KB per 1K row groups |
| Arbitrary JSON | Schema required | Mapping required | Auto-discovered paths |
- Log/observability platforms - Query by
trace_id,request_id, arbitrary labels - Vector databases - Pre-filter segments before expensive ANN search
- Data lake query engines - Pruning index for DuckDB, Trino, Spark
- Edge/serverless - Cache index at CDN edge, query without backend
The index decouples pruning (which row groups to read) from execution (DuckDB, Trino, Spark). Your query engine handles the actual data reading - this index just tells it where to look.
go get github.com/amikos-tech/ami-ginpackage main
import (
"fmt"
gin "github.com/amikos-tech/ami-gin"
)
func main() {
// Create builder for 3 row groups
builder := gin.NewBuilder(gin.DefaultConfig(), 3)
// Add documents to row groups
builder.AddDocument(0, []byte(`{"name": "alice", "age": 30}`))
builder.AddDocument(1, []byte(`{"name": "bob", "age": 25}`))
builder.AddDocument(2, []byte(`{"name": "alice", "age": 40}`))
// Build index
idx := builder.Finalize()
// Query: find row groups where name = "alice"
result := idx.Evaluate([]gin.Predicate{
gin.EQ("$.name", "alice"),
})
fmt.Println(result.ToSlice()) // [0, 2]
}GIN Index v0.1.0 intentionally focuses on the proven single-index predicate surface described above.
- OR/AND composites are not part of the v0.1.0 query API yet.
- Index merge across multiple index files is intentionally deferred beyond v0.1.0.
- Query-time transformers are not supported in v0.1.0; transformations must happen at index-build time.
gin.EQ("$.status", "active")
gin.NE("$.status", "deleted")
gin.IN("$.status", "active", "pending", "review")
gin.NIN("$.status", "deleted", "archived") // NOT INgin.GT("$.price", 100.0) // price > 100
gin.GTE("$.price", 100.0) // price >= 100
gin.LT("$.price", 500.0) // price < 500
gin.LTE("$.price", 500.0) // price <= 500
// Combined range
idx.Evaluate([]gin.Predicate{
gin.GTE("$.price", 100.0),
gin.LTE("$.price", 500.0),
})Transform date strings into numeric epoch milliseconds for efficient range queries:
// Configure transformers for date fields
config, _ := gin.NewConfig(
gin.WithFieldTransformer("$.created_at", gin.ISODateToEpochMs), // RFC3339
gin.WithFieldTransformer("$.birth_date", gin.DateToEpochMs), // YYYY-MM-DD
gin.WithFieldTransformer("$.custom_ts", gin.CustomDateToEpochMs("2006/01/02 15:04")),
)
builder, _ := gin.NewBuilder(config, numRGs)
// Add documents - dates are automatically transformed to epoch ms
builder.AddDocument(0, []byte(`{"created_at": "2024-01-15T10:30:00Z", "birth_date": "1990-05-20"}`))
builder.AddDocument(1, []byte(`{"created_at": "2024-06-15T14:00:00Z", "birth_date": "1985-03-10"}`))
idx := builder.Finalize()
// Query with epoch milliseconds
july2024 := float64(time.Date(2024, 7, 1, 0, 0, 0, 0, time.UTC).UnixMilli())
result := idx.Evaluate([]gin.Predicate{gin.GT("$.created_at", july2024)})
// Date range: Q1 2024
jan := float64(time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC).UnixMilli())
apr := float64(time.Date(2024, 4, 1, 0, 0, 0, 0, time.UTC).UnixMilli())
result = idx.Evaluate([]gin.Predicate{
gin.GTE("$.created_at", jan),
gin.LT("$.created_at", apr),
})Built-in date transformers:
ISODateToEpochMs- RFC3339/ISO8601 (2024-01-15T10:30:00Z)DateToEpochMs- Date only (2024-01-15)CustomDateToEpochMs(layout)- Custom Go time layout
Built-in string transformers:
ToLower- Lowercase normalization for case-insensitive queriesEmailDomain- Extract and lowercase domain from email (alice@Example.COMโexample.com)URLHost- Extract and lowercase host from URL (https://API.Example.COM/v1โapi.example.com)RegexExtract(pattern, group)- Extract substring via regex capture groupRegexExtractInt(pattern, group)- Extract and convert to numeric
Built-in numeric transformers:
IPv4ToInt- IPv4 address to uint32 for range queries (192.168.1.1โ3232235777)SemVerToInt- Semantic version to integer (2.1.3โ2001003)DurationToMs- Go duration string to milliseconds (1h30mโ5400000)NumericBucket(size)- Bucket values for histograms (150with size100โ100)BoolNormalize- Normalize boolean-like values ("yes","1","on"โtrue)
IP subnet helpers (for use with IPv4ToInt):
CIDRToRange(cidr)- Parse CIDR notation, returns(start, end float64, err)InSubnet(path, cidr)- Returns[]Predicatefor subnet membership check
Custom transformers:
// Create your own transformer
myTransformer := func(v any) (any, bool) {
s, ok := v.(string)
if !ok {
return nil, false
}
// Your transformation logic
return transformedValue, true
}
config, _ := gin.NewConfig(gin.WithFieldTransformer("$.my_field", myTransformer))Example: IP subnet queries (network/security logs)
config, _ := gin.NewConfig(
gin.WithFieldTransformer("$.client_ip", gin.IPv4ToInt),
)
// "192.168.1.1" indexed as 3232235777
// Query: Find IPs in 192.168.1.0/24 subnet using InSubnet helper
result := idx.Evaluate(gin.InSubnet("$.client_ip", "192.168.1.0/24"))
// Or use CIDRToRange for manual control
start, end, _ := gin.CIDRToRange("10.0.0.0/8")
result = idx.Evaluate([]gin.Predicate{
gin.GTE("$.client_ip", start),
gin.LTE("$.client_ip", end),
})Example: Version range queries (software metadata)
config, _ := gin.NewConfig(
gin.WithFieldTransformer("$.version", gin.SemVerToInt),
)
// "v2.1.3" indexed as 2001003
// Query: Find versions >= 2.0.0
result := idx.Evaluate([]gin.Predicate{
gin.GTE("$.version", float64(2000000)),
})Example: Case-insensitive email queries
config, _ := gin.NewConfig(
gin.WithFieldTransformer("$.email", gin.ToLower),
)
// "Alice@Example.COM" indexed as "alice@example.com"
result := idx.Evaluate([]gin.Predicate{
gin.EQ("$.email", "alice@example.com"),
})Example: Extract error codes from log messages
config, _ := gin.NewConfig(
gin.WithFieldTransformer("$.message", gin.RegexExtract(`ERROR\[(\w+)\]:`, 1)),
)
// "ERROR[E1234]: Connection failed" indexed as "E1234"
result := idx.Evaluate([]gin.Predicate{
gin.EQ("$.message", "E1234"),
})// Uses trigram index for substring matching
gin.Contains("$.description", "hello")
gin.Contains("$.title", "database") // matches "database", "databases", etc.// Uses trigram index for regex candidate selection
gin.Regex("$.message", "ERROR|WARNING") // Alternation
gin.Regex("$.brand", "Toyota|Tesla|Ford") // Multiple literals
gin.Regex("$.log", "error.*timeout") // Prefix + wildcard + suffix
gin.Regex("$.code", "[A-Z]{3}_[0-9]+") // Pattern with literalsThe Regex operator extracts literal strings from regex patterns and uses the trigram index for candidate row-group selection. This enables efficient pruning before actual regex matching.
How it works:
- Parse regex pattern and extract literal substrings
- For alternations like
(error|warn)_message, extracts combined literals:["error_message", "warn_message"] - Query trigram index for each literal
- Union results (OR semantics for alternation)
- Row groups not containing any literal are pruned
Limitations:
- Requires trigram index enabled (
EnableTrigrams: true) - Literals shorter than trigram length (default: 3) cannot prune
- Pure wildcard patterns (
.*) return all row groups - This is candidate selection, not regex execution - actual matching happens at query time
gin.IsNull("$.optional_field")
gin.IsNotNull("$.required_field")// Nested objects
gin.EQ("$.user.address.city", "New York")
// Array elements (wildcard)
gin.EQ("$.tags[*]", "important")
gin.IN("$.roles[*]", "admin", "editor")Supported path syntax:
$- root$.field- dot notation$['field']- bracket notation$.items[*]- array wildcard
Not supported (will error):
$.items[0]- array indices$..field- recursive descent$.items[0:5]- slices$[?(@.price > 10)]- filters
Validate paths before use:
if err := gin.ValidateJSONPath("$.user.name"); err != nil {
log.Fatal(err)
}// Encode to bytes (zstd compressed)
data, err := gin.Encode(idx)
// Save to file
os.WriteFile("index.gin", data, 0644)
// Load and decode
data, _ := os.ReadFile("index.gin")
idx, err := gin.Decode(data)The GIN index integrates directly with Parquet files, supporting three storage strategies:
- Sidecar file - Index stored as
data.parquet.ginalongside the Parquet file - Embedded metadata - Index stored in Parquet file's key-value metadata
- Build-time embedding - Index built and embedded during Parquet file creation
// Build index from a Parquet file's JSON column
idx, err := gin.BuildFromParquet("data.parquet", "attributes", gin.DefaultConfig())// Write index as sidecar file (data.parquet.gin)
err := gin.WriteSidecar("data.parquet", idx)
// Read sidecar
idx, err := gin.ReadSidecar("data.parquet")
// Check if sidecar exists
if gin.HasSidecar("data.parquet") {
// ...
}cfg := gin.DefaultParquetConfig() // MetadataKey: "gin.index"
// Rebuild existing Parquet file with embedded index
err := gin.RebuildWithIndex("data.parquet", idx, cfg)
// Check if Parquet has embedded index
hasIdx, err := gin.HasGINIndex("data.parquet", cfg)
// Read embedded index
idx, err := gin.ReadFromParquetMetadata("data.parquet", cfg)// Tries embedded metadata first, falls back to sidecar
idx, err := gin.LoadIndex("data.parquet", gin.DefaultParquetConfig())When creating a new Parquet file, you can embed the index during creation:
// Get key-value pair for Parquet metadata
key, value, err := gin.EncodeToMetadata(idx, gin.DefaultParquetConfig())
// key = "gin.index", value = base64-encoded compressed index
// Use with parquet-go writer
writer := parquet.NewGenericWriter[Record](f,
parquet.KeyValueMetadata(key, value),
)Helper functions for working with multiple files:
// Local filesystem
if gin.IsDirectory("./data") {
// List all .parquet files in directory
parquetFiles, err := gin.ListParquetFiles("./data")
// List all .gin files in directory
ginFiles, err := gin.ListGINFiles("./data")
// Process each file
for _, f := range parquetFiles {
idx, _ := gin.BuildFromParquet(f, "attributes", gin.DefaultConfig())
gin.WriteSidecar(f, idx)
}
}
// S3
s3Client, _ := gin.NewS3ClientFromEnv()
// List all .parquet files under prefix
parquetKeys, err := s3Client.ListParquetFiles("bucket", "data/")
// List all .gin files under prefix
ginKeys, err := s3Client.ListGINFiles("bucket", "data/")All operations support S3 paths via AWS SDK v2:
// Configure from environment variables:
// AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION
// AWS_ENDPOINT_URL (for MinIO, LocalStack), AWS_S3_PATH_STYLE=true
s3Client, err := gin.NewS3ClientFromEnv()
// Build from S3
idx, err := s3Client.BuildFromParquet("bucket", "path/to/data.parquet", "attributes", gin.DefaultConfig())
// Write sidecar to S3
err := s3Client.WriteSidecar("bucket", "path/to/data.parquet", idx)
// Read sidecar from S3
idx, err := s3Client.ReadSidecar("bucket", "path/to/data.parquet")
// Load index (tries embedded, then sidecar)
idx, err := s3Client.LoadIndex("bucket", "path/to/data.parquet", gin.DefaultParquetConfig())A command-line tool is provided for common operations:
# Install
go install github.com/amikos-tech/ami-gin/cmd/gin-index@latest
# Build sidecar index
gin-index build -c attributes data.parquet
gin-index build -c attributes -o custom.gin data.parquet
# Build and embed into Parquet file
gin-index build -c attributes -embed data.parquet
# Query index
gin-index query data.parquet.gin '$.status = "error"'
gin-index query data.parquet.gin '$.count > 100'
gin-index query data.parquet.gin '$.name IN ("alice", "bob")'
# Show index info
gin-index info data.parquet.gin
# Extract embedded index to sidecar
gin-index extract -o data.parquet.gin data.parquet
# S3 paths (uses AWS env vars)
gin-index build -c attributes s3://bucket/data.parquet
gin-index query s3://bucket/data.parquet.gin '$.status = "ok"'Batch Processing (Directory/S3 Prefix):
Process multiple files at once by passing a directory or S3 prefix:
# Build index for all .parquet files in a directory
gin-index build -c attributes ./data/
gin-index build -c attributes -embed ./data/
# Query all .gin files in a directory
gin-index query ./data/ '$.status = "error"'
# Show info for all .gin files
gin-index info ./data/
# S3 prefix - processes all .parquet files under the prefix
gin-index build -c attributes s3://bucket/data/
gin-index query s3://bucket/data/ '$.status = "error"'
gin-index info s3://bucket/data/
# Glob patterns work too
gin-index build -c attributes './data/*.parquet'
gin-index query './data/*.gin' '$.level = "error"'CLI Query Syntax:
- Equality:
$.field = "value",$.field != "value" - Numeric:
$.field > 100,$.field >= 100,$.field < 100,$.field <= 100 - IN/NOT IN:
$.field IN ("a", "b"),$.field NOT IN (1, 2, 3) - Null:
$.field IS NULL,$.field IS NOT NULL - Contains:
$.field CONTAINS "substring" - Regex:
$.field REGEX "pattern"(e.g.,$.brand REGEX "Toyota|Tesla")
config := gin.GINConfig{
CardinalityThreshold: 10000, // Use bloom-only for high-cardinality paths
BloomFilterSize: 65536,
BloomFilterHashes: 5,
EnableTrigrams: true, // Enable CONTAINS queries
TrigramMinLength: 3,
HLLPrecision: 12, // HyperLogLog precision (4-16)
PrefixBlockSize: 16,
}
builder := gin.NewBuilder(config, numRowGroups)See the examples directory:
go run ./examples/basic/main.go # Equality queries
go run ./examples/range/main.go # Numeric ranges
go run ./examples/transformers/main.go # Date field transformers
go run ./examples/transformers-advanced/main.go # IP, SemVer, email, regex transformers
go run ./examples/fulltext/main.go # CONTAINS queries
go run ./examples/regex/main.go # Regex pattern matching
go run ./examples/null/main.go # NULL handling
go run ./examples/nested/main.go # Nested JSON and arrays
go run ./examples/serialize/main.go # Persistence
go run ./examples/full/main.go # All types and operators
go run ./examples/parquet/main.go # Parquet integration (sidecar, embedded, queries)Run benchmarks with:
go test -bench=. -benchmem -benchtime=1s| Operation | Latency | Notes |
|---|---|---|
| EQ query | ~1ยตs | Bloom filter + sorted term lookup |
| Range query (GT/LT) | 4-24ยตs | Min/max stats scan |
| IN query (10 values) | ~8ยตs | Union of EQ results |
| CONTAINS query | 2-17ยตs | Trigram intersection |
| IsNull/IsNotNull | 2-4ยตs | Bitmap lookup |
| Bloom lookup | ~100ns | Fast path rejection |
| AddDocument | ~43ยตs | JSON parsing + indexing |
| Encode (1K RGs) | ~4ms | zstd compression |
| Decode (1K RGs) | ~2ms | zstd decompression |
| Row Groups | Encoded Size | Per RG |
|---|---|---|
| 100 | 6.7 KB | 67 bytes |
| 500 | 18 KB | 36 bytes |
| 1,000 | 30 KB | 30 bytes |
| 2,000 | 51 KB | 26 bytes |
Query time scales well with index size:
- 10 RGs: ~340ns
- 100 RGs: ~530ns
- 1,000 RGs: ~680ns
- 5,000 RGs: ~800ns
Build time is linear with document count and complexity:
- 100 docs (7 fields): ~1.2ms
- 1,000 docs: ~6.7ms
- High cardinality (10K unique values): ~3.3ms per 1K docs
| Component | Operation | Latency |
|---|---|---|
| Bloom Filter | Add | ~100ns |
| Bloom Filter | Lookup | ~100ns |
| RGSet (10K) | Intersect | ~12ยตs |
| RGSet (10K) | Union | ~10ยตs |
| Trigram | Add (50 chars) | ~16ยตs |
| Trigram | Search | 1-6ยตs |
| HyperLogLog | Add | ~70ns |
| HyperLogLog | Estimate | 7-410ยตs (precision dependent) |
| Prefix Compress | 1K terms | ~60ยตs |
Simulating a log storage scenario:
- 1M documents across 50K row groups (~20 docs/RG)
- 10 labels: 2 integers (
status_code,duration_ms) + 8 strings - Mix of cardinalities:
trace_id(high),service(low),host(medium) - Trigrams disabled (no FTS)
| Metric | Value |
|---|---|
| Index Size | 289 KB (0.28 MB) |
| Bytes per RG | 5.9 bytes |
| Bytes per doc | 0.3 bytes |
| Build time | 464ms |
| Encode | 41ms |
| Decode | 41ms |
Query Performance:
| Query | Latency | Notes |
|---|---|---|
trace_id=X (high cardinality) |
950ns | Bloom filter fast-path |
service=api (low cardinality) |
6.5ยตs | ~10K RGs match |
trace_id=X AND level=error |
6ยตs | High card + low card |
duration_ms > 5000 |
244ยตs | Range scan over 50K RGs |
service=api AND env=prod AND status>=400 |
285ยตs | 3 predicates combined |
Key takeaway: High-cardinality lookups (trace ID, request ID) are sub-microsecond. The entire index for 1M documents fits in 289 KB - easily cacheable in memory, localStorage, or CDN edge.
The benchmark suite (benchmark_test.go) covers:
- Builder Performance - Document ingestion, batch loading, finalization
- Query Performance - All operators, parallel queries, multiple predicates
- Serialization - Encode/decode latency, compression ratios
- Components - Bloom filter, RGSet, trigram, HLL, prefix compression
- Scaling - Row group count, document size, cardinality, nesting depth
| Aspect | This GIN Index | PostgreSQL GIN |
|---|---|---|
| Query Latency | ~1ยตs (EQ) | ~0.7-1.2ms per predicate |
| Deployment | Embedded bytes, no server | Requires PostgreSQL server |
| Cacheability | Cache anywhere (nginx, memcached, CDN) | Tied to database buffer cache |
| Index Size | 26-67 bytes/row-group | Larger, includes posting lists |
| Range Queries | Native min/max stats | Poor (GIN doesn't support ranges) |
| Full-Text | Trigram-based CONTAINS | Full-featured tsvector/tsquery |
| ACID | No (read-only after build) | Full transaction support |
PostgreSQL GIN uses Bitmap Index Scans which cost ~0.7-1.2ms each when cached. This index achieves ~1ยตs queries by being purpose-built for row-group pruning with simpler data structures.
| Aspect | This GIN Index | Parquet Min/Max Stats | Parquet Bloom Filters |
|---|---|---|---|
| String Equality | Exact term โ RG bitmap | Only min/max (poor for strings) | Yes, but per-column only |
| CONTAINS/FTS | Trigram index | No | No |
| Multi-path Queries | Single index file | Scattered in column chunks | Scattered in column chunks |
| Cardinality | HyperLogLog estimates | No | No |
| Null Tracking | Explicit null/present bitmaps | Null count only | No |
| Index Location | Footer or sidecar file | Column chunk metadata | Column chunk metadata |
Parquet's built-in bloom filters are effective for single-column equality but require reading multiple column chunks for multi-field queries. This GIN index consolidates all paths into one structure.
| Aspect | This GIN Index | Delta Lake | Apache Iceberg |
|---|---|---|---|
| Statistics | Per-path term index + min/max | First 32 columns min/max | Partition-level + column stats |
| High Cardinality | Bloom filter fallback | Requires Z-ordering | Requires sorting |
| JSON Support | Native path extraction | Requires schema | Requires schema |
| Query Planning | Client-side, cacheable | Spark/engine dependent | Engine dependent |
| Deployment | Standalone bytes | Delta transaction log | Metadata tables |
Delta Lake's data skipping relies on Z-ordering for effectiveness with high-cardinality columns. This GIN index handles high cardinality natively via bloom filters.
| Aspect | This GIN Index | Elasticsearch |
|---|---|---|
| Query Latency | ~1ยตs | ~1-10ms (network + processing) |
| Deployment | Embedded, no server | Cluster required |
| Index Size | ~30KB for 1K row-groups | GB+ for equivalent data |
| Use Case | Row-group pruning | Full search engine |
| Updates | Rebuild required | Near real-time |
Elasticsearch provides millisecond-level latency for searches but requires cluster infrastructure. This index is designed for embedding in data lake metadata.
This index was born from log storage needs - indexing arbitrary attributes/labels where:
- High cardinality is the norm - trace IDs, request IDs, user IDs, session tokens
- Schema is unknown - arbitrary key-value labels attached at runtime
- Queries are selective - "find logs where
trace_id=abc123" should be instant
Traditional solutions struggle here:
| Challenge | PostgreSQL GIN | Parquet Stats | This GIN Index |
|---|---|---|---|
trace_id (millions unique) |
Index bloat, slow writes | Min/max useless | Bloom filter fast-path |
user.email (arbitrary path) |
Requires schema | Column must exist | Auto-discovered paths |
labels["env"] (dynamic keys) |
JSONB @> operator (~1ms) | Not supported | Native path indexing (~1ยตs) |
| Mixed types per path | Type coercion issues | Single type per column | Tracks observed types |
Log/observability example:
{
"timestamp": "2024-01-15T10:30:00Z",
"level": "error",
"trace_id": "abc123def456",
"user": {"id": "user_98765", "email": "alice@example.com"},
"labels": {"env": "prod", "region": "us-east-1", "version": "2.1.0"},
"message": "Connection timeout to downstream service"
}Query: trace_id=abc123def456 AND labels.env=prod AND level=error
- Bloom filter rejects non-matching row groups instantly
- High-cardinality
trace_iddoesn't degrade performance - Arbitrary
labels.*paths indexed automatically
Vector database metadata filtering:
Vector databases need efficient pre-filtering before similarity search. Without good metadata indexing, you either:
- Scan all vectors then filter (slow)
- Filter first with poor index (still slow)
- Build separate metadata infrastructure (complex)
{
"id": "doc_12345",
"embedding": [0.1, 0.2, ...],
"metadata": {
"source": "arxiv",
"year": 2024,
"authors": ["Alice", "Bob"],
"topics": ["machine-learning", "transformers"],
"cited_by": 142,
"full_text": "We present a novel approach to..."
}
}Query: Find similar vectors WHERE metadata.source=arxiv AND metadata.year>=2023 AND metadata.topics[*]=transformers
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Vector DB Hybrid Search โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ โ
โ 1. Metadata Filter (GIN Index) 2. Vector Search โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโ โ
โ โ source=arxiv โ โ โ โ
โ โ year>=2023 โโโโโโโโโโโผโโโโโโโถโ ANN Search โ โ
โ โ topics[*]=transformers โ โ (only on โ โ
โ โ โ โ segments 2,5) โ โ
โ โ Result: segments [2, 5] โ โ โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโ โ
โ ~1ยตs search scope โ
โ reduced 80% โ
โ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
This enables:
- Pre-filtering - Prune segments before expensive ANN search
- Flexible schemas - Each document can have different metadata fields
- High cardinality - Filter by
doc_id,user_id,session_id - Range + equality -
year>=2023 AND source=arxiv - Array membership -
topics[*]=machine-learning - Full-text on metadata -
CONTAINS(full_text, "transformer")
The second differentiator is deployment flexibility:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Data Lake Query Flow โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ โ
โ Client โ
โ โ โ
โ โ 1. Fetch GIN index (cached) โ
โ โผ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ nginx/memcached/CDN/local cache โ โ
โ โ โโโโโโโโโโโโโโโ โ โ
โ โ โ index.gin โ โ 30KB, serves in <1ms โ โ
โ โ โ (cached) โ โ โ
โ โ โโโโโโโโโโโโโโโ โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ โ
โ โ 2. Evaluate predicates locally (~1ยตs) โ
โ โ Result: [RG 5, RG 23, RG 47] โ
โ โ โ
โ โ 3. Read only matching row groups from object storage โ
โ โผ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ S3 / GCS / Azure Blob โ โ
โ โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ โ
โ โ โ data.parquet โ โ โ
โ โ โ RG 0 โโโโโโโโ skipped โ โ โ
โ โ โ RG 5 โโโโโโโโ read โ โ โ
โ โ โ RG 10 โโโโโโโ skipped โ โ โ
โ โ โ RG 23 โโโโโโโโ read โ โ โ
โ โ โ RG 47 โโโโโโโโ read โ โ โ
โ โ โ ... โ โ โ
โ โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Benefits:
- No database server - Index is just bytes, evaluate anywhere
- Cache at edge - nginx, memcached, CDN, browser localStorage
- Cross-language - Binary format works in any language
- Offline capable - Cache index locally for disconnected queries
- Cost efficient - Avoid scanning TB of Parquet data
This architecture is ideal for:
- Log/observability platforms - Index arbitrary labels, query by trace ID
- Vector databases - Pre-filter segments before ANN search
- Serverless query engines - No database to manage
- Browser-based data explorers - Cache index in localStorage
- Edge computing / IoT analytics - Offline-capable querying
- Cost-sensitive data lake queries - Minimize S3/GCS egress
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ GINIndex โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ Header โ โ
โ โ โข Version, NumRowGroups, NumDocs, NumPaths โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ Path Directory โ โ
โ โ pathID โ { PathName, ObservedTypes, Cardinality, Flags } โ โ
โ โ โ โ
โ โ Example: โ โ
โ โ 0 โ { "$.name", String, 150, 0x00 } โ โ
โ โ 1 โ { "$.age", Int, 80, 0x00 } โ โ
โ โ 2 โ { "$.tags[*]", String, 50000, FlagBloomOnly } โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ Global Bloom Filter โ โ
โ โ Fast rejection for path=value pairs โ โ
โ โ Contains: "$.name=alice", "$.name=bob", "$.age=30", ... โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ StringIndex โ โ NumericIndex โ โ
โ โ (per pathID) โ โ (per pathID) โ โ
โ โ โ โ โ โ
โ โ pathID: 0 ($.name) โ โ pathID: 1 ($.age) โ โ
โ โ โโโโโโโโโโโฌโโโโโโโโโ โ โ โโโโโโโฌโโโโโโโฌโโโโโโโโ โ
โ โ โ Term โ RGSet โ โ โ โ RG โ Min โ Max โโ โ
โ โ โโโโโโโโโโโผโโโโโโโโโค โ โ โโโโโโโผโโโโโโโผโโโโโโโคโ โ
โ โ โ "alice" โ {0,2} โ โ โ โ 0 โ 25 โ 35 โโ โ
โ โ โ "bob" โ {1} โ โ โ โ 1 โ 20 โ 45 โโ โ
โ โ โ "carol" โ {2,3} โ โ โ โ 2 โ 30 โ 30 โโ โ
โ โ โโโโโโโโโโโดโโโโโโโโโ โ โ โโโโโโโดโโโโโโโดโโโโโโโโ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ NullIndex โ โ TrigramIndex โ โ
โ โ (per pathID) โ โ (per pathID) โ โ
โ โ โ โ โ โ
โ โ pathID: 1 ($.age) โ โ pathID: 3 ($.desc) โ โ
โ โ โโโโโโโโโโโโฌโโโโโโโโโ โ โ โโโโโโโโโโโฌโโโโโโโโโ โ โ
โ โ โ NullRGs โ {4,7} โ โ โ โ Trigram โ RGSet โ โ โ
โ โ โ Present โ {0-9} โ โ โ โโโโโโโโโโโผโโโโโโโโโค โ โ
โ โ โโโโโโโโโโโโดโโโโโโโโโ โ โ โ "hel" โ {0,2} โ โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโ โ โ "ell" โ {0,2} โ โ โ
โ โ โ "llo" โ {0,2,5}โ โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโ โ โ "wor" โ {1,3} โ โ โ
โ โ DocID Mapping โ โ โโโโโโโโโโโดโโโโโโโโโ โ โ
โ โ (optional) โ โโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ โ โ
โ โ pos โ DocID โ โโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ 0 โ 1000 โ โ PathCardinality (HLL) โ โ
โ โ 1 โ 1001 โ โ (per pathID) โ โ
โ โ 2 โ 1020 โ โ โ โ
โ โ 3 โ 1021 โ โ Estimates unique vals โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Note: All RGSet bitmaps use Roaring Bitmaps for efficient compression
BUILD PHASE
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ โ
โ JSON Documents โ
โ โ โ
โ โผ โ
โ โโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ DocID 0 โโโโโโถโ GINBuilder โ โ
โ โ RG: 0 โ โ โ โ
โ โโโโโโโโโโโ โ AddDocument(docID, json) โ โ
โ โโโโโโโโโโโ โ โ โ โ
โ โ DocID 1 โโโโโโถโ โผ โ โ
โ โ RG: 0 โ โ โโโโโโโโโโโโโโโ โ โ
โ โโโโโโโโโโโ โ โ Walk JSON โ โ โ
โ โโโโโโโโโโโ โ โ Extract: โ โ โ
โ โ DocID 2 โโโโโโถโ โ โข paths โ โ โ
โ โ RG: 1 โ โ โ โข values โ โ โ
โ โโโโโโโโโโโ โ โ โข types โ โ โ
โ โฎ โ โโโโโโโโฌโโโโโโโ โ โ
โ โ โ โ โ
โ โ โผ โ โ
โ โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ โ
โ โ โ Update per-path structures: โ โ โ
โ โ โ โข stringTerms[term] โ RGSet.Set(pos) โ โ โ
โ โ โ โข numericStats[pos].Min/Max โ โ โ
โ โ โ โข nullRGs.Set(pos) if null โ โ โ
โ โ โ โข trigrams.Add(term, pos) โ โ โ
โ โ โ โข bloom.Add(path=value) โ โ โ
โ โ โ โข hll.Add(value) โ โ โ
โ โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ โ
โ โ โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ โ
โ โผ โ
โ Finalize() โ
โ โ โ
โ โผ โ
โ โโโโโโโโโโโโโโโ โ
โ โ GINIndex โ โ
โ โโโโโโโโโโโโโโโ โ
โ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
QUERY PHASE
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ โ
โ Predicates: [EQ("$.name", "alice"), GT("$.age", 25)] โ
โ โ โ
โ โผ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ idx.Evaluate(predicates) โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ โ
โ โโโโโโโโโโโโโโโโโโโโโโโโฌโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โผ โผ โผ โ
โ โโโโโโโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโโ โ
โ โ Predicateโ โ Predicateโ โ ... โ โ
โ โ 1 โ โ 2 โ โ โ โ
โ โโโโโโฌโโโโโโ โโโโโโฌโโโโโโ โโโโโโโโโโโโ โ
โ โ โ โ
โ โผ โผ โ
โ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โ
โ โ Bloom Check โ โ Bloom Check โ โโโ Fast rejection path โ
โ โ path=value? โ โ (skip for โ โ
โ โโโโโโโโฌโโโโโโโโ โ ranges) โ โ
โ โ โโโโโโโโฌโโโโโโโโ โ
โ โผ โผ โ
โ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โ
โ โ StringIndex โ โ NumericIndex โ โ
โ โ lookup term โ โ scan min/max โ โ
โ โ โ RGSet โ โ โ RGSet โ โ
โ โโโโโโโโฌโโโโโโโโ โโโโโโโโฌโโโโโโโโ โ
โ โ โ โ
โ โ RGSet{0,2} โ RGSet{0,1,2} โ
โ โ โ โ
โ โโโโโโโโโโโฌโโโโโโโโโโโ โ
โ โ โ
โ โผ โ
โ โโโโโโโโโโโโโโโ โ
โ โ Intersect โ โ
โ โ (AND all) โ โ
โ โโโโโโโโฌโโโโโโโ โ
โ โ โ
โ โผ โ
โ โโโโโโโโโโโโโโโ โ
โ โ RGSet{0,2} โ โโโ Matching row groups โ
โ โโโโโโโโฌโโโโโโโ โ
โ โ โ
โ โผ โ
โ โโโโโโโโโโโโโโโ โ
โ โ ToSlice() โ โ [0, 2] โ
โ โ or โ โ
โ โ MatchingDocIDs() โ [DocID...] โ
โ โโโโโโโโโโโโโโโ โ
โ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
For composite document identifiers (e.g., file + row group):
// Encode file index and row group into single DocID
codec := gin.NewRowGroupCodec(20) // 20 RGs per file
builder := gin.NewBuilderWithCodec(config, totalRGs, codec)
docID := codec.Encode(fileIndex, rgIndex) // e.g., file=3, rg=15 โ DocID=75
builder.AddDocument(docID, jsonDoc)
// Query and decode results
result := idx.Evaluate(predicates)
for _, docID := range idx.MatchingDocIDs(result) {
decoded := codec.Decode(docID) // [3, 15]
fileIdx, rgIdx := decoded[0], decoded[1]
}The GIN index maintains several data structures:
- Path Directory - Maps JSON paths to their metadata (types, cardinality)
- String Index - For each path, maps terms to row-group bitmaps (Roaring)
- Numeric Index - Per-row-group min/max values for range pruning
- Null Index - Bitmaps tracking which row groups have null/present values
- Trigram Index - Maps 3-character sequences to row-group bitmaps
- Global Bloom Filter - Fast rejection of non-existent path=value pairs
- DocID Mapping - Optional external DocID to internal position mapping
Query evaluation intersects the matching row-group bitmaps from each predicate.
The NewBuilder(config, numRGs) requires the total number of row groups at construction time. This is intentional:
-
Complement operations require universe size - Operations like
AllRGs()andInvert()need to know the total number of row groups to compute complements. When a query cannot prune (e.g., unknown path, graceful degradation), the index returns "all row groups" - which requires knowing what "all" means. -
Parquet metadata provides this - The index is designed for Parquet row-group pruning. In this context, the number of row groups is always available from Parquet file metadata before indexing begins.
-
Bounds checking - The builder validates that document positions don't exceed the declared row group count, catching configuration errors early.
MIT