Skip to content

amikos-tech/ami-gin

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

19 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

GIN Index

CI Go Reference License: MIT

See CONTRIBUTING.md for local contributor workflows and SECURITY.md for disclosure guidance.

A Generalized Inverted Index (GIN) for JSON data, designed for row-group pruning in columnar storage formats like Parquet.

Features

  • String indexing - Exact match and IN queries on string fields
  • Numeric indexing - Range queries (GT, GTE, LT, LTE) with per-row-group min/max stats
  • Field transformers - Convert values (e.g., date strings to epoch) for efficient range queries
  • Trigram indexing - Full-text CONTAINS queries using n-gram matching
  • Regex support - Pattern matching with trigram-based candidate selection
  • Null tracking - IS NULL / IS NOT NULL predicates
  • Bloom filter - Fast-path rejection for non-existent values
  • HyperLogLog - Efficient cardinality estimation
  • Compression - zstd-compressed binary serialization
  • Parquet integration - Build from Parquet, embed in metadata, sidecar files, S3 support
  • CLI tool - Command-line interface for build, query, info, and extract operations

Why GIN Index?

A serverless pruning index for data lakes - the GIN index is a compact, immutable index designed to answer one question: "Which row groups might contain my data?"

The Problem

Querying large data lakes is expensive. When you search for trace_id=abc123 across millions of Parquet files, traditional approaches either:

  • Full scan - Read every row group (~TB of data, high latency, high cost)
  • Database approach - Run PostgreSQL/Elasticsearch cluster (~ms latency, operational burden)
  • Parquet stats - Use built-in min/max (useless for high-cardinality strings)

The Solution

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                    Serverless Row-Group Pruning                      โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                                      โ”‚
โ”‚   1. Cache index anywhere         2. Prune locally     3. Read only  โ”‚
โ”‚      (<1MB for millions of files)    (~1ยตs)              matching RGsโ”‚
โ”‚                                                                      โ”‚
โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”               โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚
โ”‚   โ”‚  memcached   โ”‚  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ–ถ   โ”‚  GIN Index   โ”‚ โ”€โ–ถ โ”‚ S3/GCS     โ”‚ โ”‚
โ”‚   โ”‚  nginx       โ”‚    decode     โ”‚  Evaluate()  โ”‚    โ”‚ [RG 5, 23] โ”‚ โ”‚
โ”‚   โ”‚  CDN edge    โ”‚               โ”‚              โ”‚    โ”‚            โ”‚ โ”‚
โ”‚   โ”‚  localStorageโ”‚               โ”‚ Result: 3    โ”‚    โ”‚ Skip 99%   โ”‚ โ”‚
โ”‚   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜               โ”‚ row groups   โ”‚    โ”‚ of data    โ”‚ โ”‚
โ”‚                                  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚
โ”‚                                                                      โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Key Advantages

Challenge PostgreSQL GIN Elasticsearch This GIN Index
Deployment Database cluster Search cluster Just bytes - cache anywhere
Query latency ~1ms ~5-10ms ~1ยตs - client-side
High cardinality Index bloat Shard overhead Bloom filter fast-path
Index size MB-GB GB ~30KB per 1K row groups
Arbitrary JSON Schema required Mapping required Auto-discovered paths

Designed For

  • Log/observability platforms - Query by trace_id, request_id, arbitrary labels
  • Vector databases - Pre-filter segments before expensive ANN search
  • Data lake query engines - Pruning index for DuckDB, Trino, Spark
  • Edge/serverless - Cache index at CDN edge, query without backend

The index decouples pruning (which row groups to read) from execution (DuckDB, Trino, Spark). Your query engine handles the actual data reading - this index just tells it where to look.

Installation

go get github.com/amikos-tech/ami-gin

Quick Start

package main

import (
    "fmt"
    gin "github.com/amikos-tech/ami-gin"
)

func main() {
    // Create builder for 3 row groups
    builder := gin.NewBuilder(gin.DefaultConfig(), 3)

    // Add documents to row groups
    builder.AddDocument(0, []byte(`{"name": "alice", "age": 30}`))
    builder.AddDocument(1, []byte(`{"name": "bob", "age": 25}`))
    builder.AddDocument(2, []byte(`{"name": "alice", "age": 40}`))

    // Build index
    idx := builder.Finalize()

    // Query: find row groups where name = "alice"
    result := idx.Evaluate([]gin.Predicate{
        gin.EQ("$.name", "alice"),
    })
    fmt.Println(result.ToSlice()) // [0, 2]
}

Known limitations

GIN Index v0.1.0 intentionally focuses on the proven single-index predicate surface described above.

  • OR/AND composites are not part of the v0.1.0 query API yet.
  • Index merge across multiple index files is intentionally deferred beyond v0.1.0.
  • Query-time transformers are not supported in v0.1.0; transformations must happen at index-build time.

Query Types

Equality

gin.EQ("$.status", "active")
gin.NE("$.status", "deleted")
gin.IN("$.status", "active", "pending", "review")
gin.NIN("$.status", "deleted", "archived")  // NOT IN

Numeric Range

gin.GT("$.price", 100.0)    // price > 100
gin.GTE("$.price", 100.0)   // price >= 100
gin.LT("$.price", 500.0)    // price < 500
gin.LTE("$.price", 500.0)   // price <= 500

// Combined range
idx.Evaluate([]gin.Predicate{
    gin.GTE("$.price", 100.0),
    gin.LTE("$.price", 500.0),
})

Date Range Queries with Field Transformers

Transform date strings into numeric epoch milliseconds for efficient range queries:

// Configure transformers for date fields
config, _ := gin.NewConfig(
    gin.WithFieldTransformer("$.created_at", gin.ISODateToEpochMs),  // RFC3339
    gin.WithFieldTransformer("$.birth_date", gin.DateToEpochMs),     // YYYY-MM-DD
    gin.WithFieldTransformer("$.custom_ts", gin.CustomDateToEpochMs("2006/01/02 15:04")),
)
builder, _ := gin.NewBuilder(config, numRGs)

// Add documents - dates are automatically transformed to epoch ms
builder.AddDocument(0, []byte(`{"created_at": "2024-01-15T10:30:00Z", "birth_date": "1990-05-20"}`))
builder.AddDocument(1, []byte(`{"created_at": "2024-06-15T14:00:00Z", "birth_date": "1985-03-10"}`))

idx := builder.Finalize()

// Query with epoch milliseconds
july2024 := float64(time.Date(2024, 7, 1, 0, 0, 0, 0, time.UTC).UnixMilli())
result := idx.Evaluate([]gin.Predicate{gin.GT("$.created_at", july2024)})

// Date range: Q1 2024
jan := float64(time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC).UnixMilli())
apr := float64(time.Date(2024, 4, 1, 0, 0, 0, 0, time.UTC).UnixMilli())
result = idx.Evaluate([]gin.Predicate{
    gin.GTE("$.created_at", jan),
    gin.LT("$.created_at", apr),
})

Built-in date transformers:

  • ISODateToEpochMs - RFC3339/ISO8601 (2024-01-15T10:30:00Z)
  • DateToEpochMs - Date only (2024-01-15)
  • CustomDateToEpochMs(layout) - Custom Go time layout

Built-in string transformers:

  • ToLower - Lowercase normalization for case-insensitive queries
  • EmailDomain - Extract and lowercase domain from email (alice@Example.COM โ†’ example.com)
  • URLHost - Extract and lowercase host from URL (https://API.Example.COM/v1 โ†’ api.example.com)
  • RegexExtract(pattern, group) - Extract substring via regex capture group
  • RegexExtractInt(pattern, group) - Extract and convert to numeric

Built-in numeric transformers:

  • IPv4ToInt - IPv4 address to uint32 for range queries (192.168.1.1 โ†’ 3232235777)
  • SemVerToInt - Semantic version to integer (2.1.3 โ†’ 2001003)
  • DurationToMs - Go duration string to milliseconds (1h30m โ†’ 5400000)
  • NumericBucket(size) - Bucket values for histograms (150 with size 100 โ†’ 100)
  • BoolNormalize - Normalize boolean-like values ("yes", "1", "on" โ†’ true)

IP subnet helpers (for use with IPv4ToInt):

  • CIDRToRange(cidr) - Parse CIDR notation, returns (start, end float64, err)
  • InSubnet(path, cidr) - Returns []Predicate for subnet membership check

Custom transformers:

// Create your own transformer
myTransformer := func(v any) (any, bool) {
    s, ok := v.(string)
    if !ok {
        return nil, false
    }
    // Your transformation logic
    return transformedValue, true
}
config, _ := gin.NewConfig(gin.WithFieldTransformer("$.my_field", myTransformer))

Example: IP subnet queries (network/security logs)

config, _ := gin.NewConfig(
    gin.WithFieldTransformer("$.client_ip", gin.IPv4ToInt),
)
// "192.168.1.1" indexed as 3232235777

// Query: Find IPs in 192.168.1.0/24 subnet using InSubnet helper
result := idx.Evaluate(gin.InSubnet("$.client_ip", "192.168.1.0/24"))

// Or use CIDRToRange for manual control
start, end, _ := gin.CIDRToRange("10.0.0.0/8")
result = idx.Evaluate([]gin.Predicate{
    gin.GTE("$.client_ip", start),
    gin.LTE("$.client_ip", end),
})

Example: Version range queries (software metadata)

config, _ := gin.NewConfig(
    gin.WithFieldTransformer("$.version", gin.SemVerToInt),
)
// "v2.1.3" indexed as 2001003

// Query: Find versions >= 2.0.0
result := idx.Evaluate([]gin.Predicate{
    gin.GTE("$.version", float64(2000000)),
})

Example: Case-insensitive email queries

config, _ := gin.NewConfig(
    gin.WithFieldTransformer("$.email", gin.ToLower),
)
// "Alice@Example.COM" indexed as "alice@example.com"
result := idx.Evaluate([]gin.Predicate{
    gin.EQ("$.email", "alice@example.com"),
})

Example: Extract error codes from log messages

config, _ := gin.NewConfig(
    gin.WithFieldTransformer("$.message", gin.RegexExtract(`ERROR\[(\w+)\]:`, 1)),
)
// "ERROR[E1234]: Connection failed" indexed as "E1234"
result := idx.Evaluate([]gin.Predicate{
    gin.EQ("$.message", "E1234"),
})

Full-Text Search (CONTAINS)

// Uses trigram index for substring matching
gin.Contains("$.description", "hello")
gin.Contains("$.title", "database")  // matches "database", "databases", etc.

Regex Matching

// Uses trigram index for regex candidate selection
gin.Regex("$.message", "ERROR|WARNING")        // Alternation
gin.Regex("$.brand", "Toyota|Tesla|Ford")      // Multiple literals
gin.Regex("$.log", "error.*timeout")           // Prefix + wildcard + suffix
gin.Regex("$.code", "[A-Z]{3}_[0-9]+")         // Pattern with literals

The Regex operator extracts literal strings from regex patterns and uses the trigram index for candidate row-group selection. This enables efficient pruning before actual regex matching.

How it works:

  1. Parse regex pattern and extract literal substrings
  2. For alternations like (error|warn)_message, extracts combined literals: ["error_message", "warn_message"]
  3. Query trigram index for each literal
  4. Union results (OR semantics for alternation)
  5. Row groups not containing any literal are pruned

Limitations:

  • Requires trigram index enabled (EnableTrigrams: true)
  • Literals shorter than trigram length (default: 3) cannot prune
  • Pure wildcard patterns (.*) return all row groups
  • This is candidate selection, not regex execution - actual matching happens at query time

Null Handling

gin.IsNull("$.optional_field")
gin.IsNotNull("$.required_field")

Nested Fields and Arrays

// Nested objects
gin.EQ("$.user.address.city", "New York")

// Array elements (wildcard)
gin.EQ("$.tags[*]", "important")
gin.IN("$.roles[*]", "admin", "editor")

JSONPath Support

Supported path syntax:

  • $ - root
  • $.field - dot notation
  • $['field'] - bracket notation
  • $.items[*] - array wildcard

Not supported (will error):

  • $.items[0] - array indices
  • $..field - recursive descent
  • $.items[0:5] - slices
  • $[?(@.price > 10)] - filters

Validate paths before use:

if err := gin.ValidateJSONPath("$.user.name"); err != nil {
    log.Fatal(err)
}

Serialization

// Encode to bytes (zstd compressed)
data, err := gin.Encode(idx)

// Save to file
os.WriteFile("index.gin", data, 0644)

// Load and decode
data, _ := os.ReadFile("index.gin")
idx, err := gin.Decode(data)

Parquet Integration

The GIN index integrates directly with Parquet files, supporting three storage strategies:

  1. Sidecar file - Index stored as data.parquet.gin alongside the Parquet file
  2. Embedded metadata - Index stored in Parquet file's key-value metadata
  3. Build-time embedding - Index built and embedded during Parquet file creation

Build Index from Parquet

// Build index from a Parquet file's JSON column
idx, err := gin.BuildFromParquet("data.parquet", "attributes", gin.DefaultConfig())

Sidecar Workflow

// Write index as sidecar file (data.parquet.gin)
err := gin.WriteSidecar("data.parquet", idx)

// Read sidecar
idx, err := gin.ReadSidecar("data.parquet")

// Check if sidecar exists
if gin.HasSidecar("data.parquet") {
    // ...
}

Embedded Metadata Workflow

cfg := gin.DefaultParquetConfig() // MetadataKey: "gin.index"

// Rebuild existing Parquet file with embedded index
err := gin.RebuildWithIndex("data.parquet", idx, cfg)

// Check if Parquet has embedded index
hasIdx, err := gin.HasGINIndex("data.parquet", cfg)

// Read embedded index
idx, err := gin.ReadFromParquetMetadata("data.parquet", cfg)

Auto-Loading (Embedded First, Then Sidecar)

// Tries embedded metadata first, falls back to sidecar
idx, err := gin.LoadIndex("data.parquet", gin.DefaultParquetConfig())

Encode for Parquet Metadata (Build-Time Embedding)

When creating a new Parquet file, you can embed the index during creation:

// Get key-value pair for Parquet metadata
key, value, err := gin.EncodeToMetadata(idx, gin.DefaultParquetConfig())
// key = "gin.index", value = base64-encoded compressed index

// Use with parquet-go writer
writer := parquet.NewGenericWriter[Record](f,
    parquet.KeyValueMetadata(key, value),
)

Batch Processing (Programmatic)

Helper functions for working with multiple files:

// Local filesystem
if gin.IsDirectory("./data") {
    // List all .parquet files in directory
    parquetFiles, err := gin.ListParquetFiles("./data")

    // List all .gin files in directory
    ginFiles, err := gin.ListGINFiles("./data")

    // Process each file
    for _, f := range parquetFiles {
        idx, _ := gin.BuildFromParquet(f, "attributes", gin.DefaultConfig())
        gin.WriteSidecar(f, idx)
    }
}

// S3
s3Client, _ := gin.NewS3ClientFromEnv()

// List all .parquet files under prefix
parquetKeys, err := s3Client.ListParquetFiles("bucket", "data/")

// List all .gin files under prefix
ginKeys, err := s3Client.ListGINFiles("bucket", "data/")

S3 Support

All operations support S3 paths via AWS SDK v2:

// Configure from environment variables:
// AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION
// AWS_ENDPOINT_URL (for MinIO, LocalStack), AWS_S3_PATH_STYLE=true
s3Client, err := gin.NewS3ClientFromEnv()

// Build from S3
idx, err := s3Client.BuildFromParquet("bucket", "path/to/data.parquet", "attributes", gin.DefaultConfig())

// Write sidecar to S3
err := s3Client.WriteSidecar("bucket", "path/to/data.parquet", idx)

// Read sidecar from S3
idx, err := s3Client.ReadSidecar("bucket", "path/to/data.parquet")

// Load index (tries embedded, then sidecar)
idx, err := s3Client.LoadIndex("bucket", "path/to/data.parquet", gin.DefaultParquetConfig())

CLI Tool

A command-line tool is provided for common operations:

# Install
go install github.com/amikos-tech/ami-gin/cmd/gin-index@latest

# Build sidecar index
gin-index build -c attributes data.parquet
gin-index build -c attributes -o custom.gin data.parquet

# Build and embed into Parquet file
gin-index build -c attributes -embed data.parquet

# Query index
gin-index query data.parquet.gin '$.status = "error"'
gin-index query data.parquet.gin '$.count > 100'
gin-index query data.parquet.gin '$.name IN ("alice", "bob")'

# Show index info
gin-index info data.parquet.gin

# Extract embedded index to sidecar
gin-index extract -o data.parquet.gin data.parquet

# S3 paths (uses AWS env vars)
gin-index build -c attributes s3://bucket/data.parquet
gin-index query s3://bucket/data.parquet.gin '$.status = "ok"'

Batch Processing (Directory/S3 Prefix):

Process multiple files at once by passing a directory or S3 prefix:

# Build index for all .parquet files in a directory
gin-index build -c attributes ./data/
gin-index build -c attributes -embed ./data/

# Query all .gin files in a directory
gin-index query ./data/ '$.status = "error"'

# Show info for all .gin files
gin-index info ./data/

# S3 prefix - processes all .parquet files under the prefix
gin-index build -c attributes s3://bucket/data/
gin-index query s3://bucket/data/ '$.status = "error"'
gin-index info s3://bucket/data/

# Glob patterns work too
gin-index build -c attributes './data/*.parquet'
gin-index query './data/*.gin' '$.level = "error"'

CLI Query Syntax:

  • Equality: $.field = "value", $.field != "value"
  • Numeric: $.field > 100, $.field >= 100, $.field < 100, $.field <= 100
  • IN/NOT IN: $.field IN ("a", "b"), $.field NOT IN (1, 2, 3)
  • Null: $.field IS NULL, $.field IS NOT NULL
  • Contains: $.field CONTAINS "substring"
  • Regex: $.field REGEX "pattern" (e.g., $.brand REGEX "Toyota|Tesla")

Configuration

config := gin.GINConfig{
    CardinalityThreshold: 10000,  // Use bloom-only for high-cardinality paths
    BloomFilterSize:      65536,
    BloomFilterHashes:    5,
    EnableTrigrams:       true,   // Enable CONTAINS queries
    TrigramMinLength:     3,
    HLLPrecision:         12,     // HyperLogLog precision (4-16)
    PrefixBlockSize:      16,
}

builder := gin.NewBuilder(config, numRowGroups)

Examples

See the examples directory:

go run ./examples/basic/main.go        # Equality queries
go run ./examples/range/main.go        # Numeric ranges
go run ./examples/transformers/main.go # Date field transformers
go run ./examples/transformers-advanced/main.go # IP, SemVer, email, regex transformers
go run ./examples/fulltext/main.go     # CONTAINS queries
go run ./examples/regex/main.go        # Regex pattern matching
go run ./examples/null/main.go         # NULL handling
go run ./examples/nested/main.go       # Nested JSON and arrays
go run ./examples/serialize/main.go    # Persistence
go run ./examples/full/main.go         # All types and operators
go run ./examples/parquet/main.go      # Parquet integration (sidecar, embedded, queries)

Benchmarks

Run benchmarks with:

go test -bench=. -benchmem -benchtime=1s

Performance Summary (Apple M3 Max)

Operation Latency Notes
EQ query ~1ยตs Bloom filter + sorted term lookup
Range query (GT/LT) 4-24ยตs Min/max stats scan
IN query (10 values) ~8ยตs Union of EQ results
CONTAINS query 2-17ยตs Trigram intersection
IsNull/IsNotNull 2-4ยตs Bitmap lookup
Bloom lookup ~100ns Fast path rejection
AddDocument ~43ยตs JSON parsing + indexing
Encode (1K RGs) ~4ms zstd compression
Decode (1K RGs) ~2ms zstd decompression

Index Size

Row Groups Encoded Size Per RG
100 6.7 KB 67 bytes
500 18 KB 36 bytes
1,000 30 KB 30 bytes
2,000 51 KB 26 bytes

Scaling Characteristics

Query time scales well with index size:

  • 10 RGs: ~340ns
  • 100 RGs: ~530ns
  • 1,000 RGs: ~680ns
  • 5,000 RGs: ~800ns

Build time is linear with document count and complexity:

  • 100 docs (7 fields): ~1.2ms
  • 1,000 docs: ~6.7ms
  • High cardinality (10K unique values): ~3.3ms per 1K docs

Component Performance

Component Operation Latency
Bloom Filter Add ~100ns
Bloom Filter Lookup ~100ns
RGSet (10K) Intersect ~12ยตs
RGSet (10K) Union ~10ยตs
Trigram Add (50 chars) ~16ยตs
Trigram Search 1-6ยตs
HyperLogLog Add ~70ns
HyperLogLog Estimate 7-410ยตs (precision dependent)
Prefix Compress 1K terms ~60ยตs

Real-World Scenario: 1M Docs / 50K Row Groups

Simulating a log storage scenario:

  • 1M documents across 50K row groups (~20 docs/RG)
  • 10 labels: 2 integers (status_code, duration_ms) + 8 strings
  • Mix of cardinalities: trace_id (high), service (low), host (medium)
  • Trigrams disabled (no FTS)
Metric Value
Index Size 289 KB (0.28 MB)
Bytes per RG 5.9 bytes
Bytes per doc 0.3 bytes
Build time 464ms
Encode 41ms
Decode 41ms

Query Performance:

Query Latency Notes
trace_id=X (high cardinality) 950ns Bloom filter fast-path
service=api (low cardinality) 6.5ยตs ~10K RGs match
trace_id=X AND level=error 6ยตs High card + low card
duration_ms > 5000 244ยตs Range scan over 50K RGs
service=api AND env=prod AND status>=400 285ยตs 3 predicates combined

Key takeaway: High-cardinality lookups (trace ID, request ID) are sub-microsecond. The entire index for 1M documents fits in 289 KB - easily cacheable in memory, localStorage, or CDN edge.

Benchmark Categories

The benchmark suite (benchmark_test.go) covers:

  1. Builder Performance - Document ingestion, batch loading, finalization
  2. Query Performance - All operators, parallel queries, multiple predicates
  3. Serialization - Encode/decode latency, compression ratios
  4. Components - Bloom filter, RGSet, trigram, HLL, prefix compression
  5. Scaling - Row group count, document size, cardinality, nesting depth

Comparison with Other Solutions

vs PostgreSQL GIN/JSONB

Aspect This GIN Index PostgreSQL GIN
Query Latency ~1ยตs (EQ) ~0.7-1.2ms per predicate
Deployment Embedded bytes, no server Requires PostgreSQL server
Cacheability Cache anywhere (nginx, memcached, CDN) Tied to database buffer cache
Index Size 26-67 bytes/row-group Larger, includes posting lists
Range Queries Native min/max stats Poor (GIN doesn't support ranges)
Full-Text Trigram-based CONTAINS Full-featured tsvector/tsquery
ACID No (read-only after build) Full transaction support

PostgreSQL GIN uses Bitmap Index Scans which cost ~0.7-1.2ms each when cached. This index achieves ~1ยตs queries by being purpose-built for row-group pruning with simpler data structures.

vs Parquet Built-in Statistics

Aspect This GIN Index Parquet Min/Max Stats Parquet Bloom Filters
String Equality Exact term โ†’ RG bitmap Only min/max (poor for strings) Yes, but per-column only
CONTAINS/FTS Trigram index No No
Multi-path Queries Single index file Scattered in column chunks Scattered in column chunks
Cardinality HyperLogLog estimates No No
Null Tracking Explicit null/present bitmaps Null count only No
Index Location Footer or sidecar file Column chunk metadata Column chunk metadata

Parquet's built-in bloom filters are effective for single-column equality but require reading multiple column chunks for multi-field queries. This GIN index consolidates all paths into one structure.

vs Delta Lake / Iceberg Data Skipping

Aspect This GIN Index Delta Lake Apache Iceberg
Statistics Per-path term index + min/max First 32 columns min/max Partition-level + column stats
High Cardinality Bloom filter fallback Requires Z-ordering Requires sorting
JSON Support Native path extraction Requires schema Requires schema
Query Planning Client-side, cacheable Spark/engine dependent Engine dependent
Deployment Standalone bytes Delta transaction log Metadata tables

Delta Lake's data skipping relies on Z-ordering for effectiveness with high-cardinality columns. This GIN index handles high cardinality natively via bloom filters.

vs Elasticsearch

Aspect This GIN Index Elasticsearch
Query Latency ~1ยตs ~1-10ms (network + processing)
Deployment Embedded, no server Cluster required
Index Size ~30KB for 1K row-groups GB+ for equivalent data
Use Case Row-group pruning Full search engine
Updates Rebuild required Near real-time

Elasticsearch provides millisecond-level latency for searches but requires cluster infrastructure. This index is designed for embedding in data lake metadata.

Key Advantage: High-Cardinality Arbitrary JSON

This index was born from log storage needs - indexing arbitrary attributes/labels where:

  • High cardinality is the norm - trace IDs, request IDs, user IDs, session tokens
  • Schema is unknown - arbitrary key-value labels attached at runtime
  • Queries are selective - "find logs where trace_id=abc123" should be instant

Traditional solutions struggle here:

Challenge PostgreSQL GIN Parquet Stats This GIN Index
trace_id (millions unique) Index bloat, slow writes Min/max useless Bloom filter fast-path
user.email (arbitrary path) Requires schema Column must exist Auto-discovered paths
labels["env"] (dynamic keys) JSONB @> operator (~1ms) Not supported Native path indexing (~1ยตs)
Mixed types per path Type coercion issues Single type per column Tracks observed types

Log/observability example:

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "error",
  "trace_id": "abc123def456",
  "user": {"id": "user_98765", "email": "alice@example.com"},
  "labels": {"env": "prod", "region": "us-east-1", "version": "2.1.0"},
  "message": "Connection timeout to downstream service"
}

Query: trace_id=abc123def456 AND labels.env=prod AND level=error

  • Bloom filter rejects non-matching row groups instantly
  • High-cardinality trace_id doesn't degrade performance
  • Arbitrary labels.* paths indexed automatically

Vector database metadata filtering:

Vector databases need efficient pre-filtering before similarity search. Without good metadata indexing, you either:

  1. Scan all vectors then filter (slow)
  2. Filter first with poor index (still slow)
  3. Build separate metadata infrastructure (complex)
{
  "id": "doc_12345",
  "embedding": [0.1, 0.2, ...],
  "metadata": {
    "source": "arxiv",
    "year": 2024,
    "authors": ["Alice", "Bob"],
    "topics": ["machine-learning", "transformers"],
    "cited_by": 142,
    "full_text": "We present a novel approach to..."
  }
}

Query: Find similar vectors WHERE metadata.source=arxiv AND metadata.year>=2023 AND metadata.topics[*]=transformers

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                   Vector DB Hybrid Search                       โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                                 โ”‚
โ”‚   1. Metadata Filter (GIN Index)         2. Vector Search       โ”‚
โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”       โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”‚
โ”‚   โ”‚ source=arxiv                โ”‚       โ”‚                  โ”‚   โ”‚
โ”‚   โ”‚ year>=2023        โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ–ถโ”‚  ANN Search      โ”‚   โ”‚
โ”‚   โ”‚ topics[*]=transformers      โ”‚       โ”‚  (only on        โ”‚   โ”‚
โ”‚   โ”‚                             โ”‚       โ”‚   segments 2,5)  โ”‚   โ”‚
โ”‚   โ”‚ Result: segments [2, 5]     โ”‚       โ”‚                  โ”‚   โ”‚
โ”‚   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜       โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ”‚
โ”‚           ~1ยตs                              search scope        โ”‚
โ”‚                                             reduced 80%         โ”‚
โ”‚                                                                 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

This enables:

  • Pre-filtering - Prune segments before expensive ANN search
  • Flexible schemas - Each document can have different metadata fields
  • High cardinality - Filter by doc_id, user_id, session_id
  • Range + equality - year>=2023 AND source=arxiv
  • Array membership - topics[*]=machine-learning
  • Full-text on metadata - CONTAINS(full_text, "transformer")

Cacheable Pruning Index

The second differentiator is deployment flexibility:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                     Data Lake Query Flow                        โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                                 โ”‚
โ”‚   Client                                                        โ”‚
โ”‚     โ”‚                                                           โ”‚
โ”‚     โ”‚ 1. Fetch GIN index (cached)                              โ”‚
โ”‚     โ–ผ                                                           โ”‚
โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚   โ”‚  nginx/memcached/CDN/local cache                        โ”‚  โ”‚
โ”‚   โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                                        โ”‚  โ”‚
โ”‚   โ”‚  โ”‚ index.gin   โ”‚  โ† 30KB, serves in <1ms               โ”‚  โ”‚
โ”‚   โ”‚  โ”‚ (cached)    โ”‚                                        โ”‚  โ”‚
โ”‚   โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                                        โ”‚  โ”‚
โ”‚   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚
โ”‚     โ”‚                                                           โ”‚
โ”‚     โ”‚ 2. Evaluate predicates locally (~1ยตs)                    โ”‚
โ”‚     โ”‚    Result: [RG 5, RG 23, RG 47]                          โ”‚
โ”‚     โ”‚                                                           โ”‚
โ”‚     โ”‚ 3. Read only matching row groups from object storage     โ”‚
โ”‚     โ–ผ                                                           โ”‚
โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚   โ”‚  S3 / GCS / Azure Blob                                  โ”‚  โ”‚
โ”‚   โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”‚  โ”‚
โ”‚   โ”‚  โ”‚ data.parquet                                    โ”‚    โ”‚  โ”‚
โ”‚   โ”‚  โ”‚  RG 0 โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ skipped                         โ”‚    โ”‚  โ”‚
โ”‚   โ”‚  โ”‚  RG 5 โ—€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ read                            โ”‚    โ”‚  โ”‚
โ”‚   โ”‚  โ”‚  RG 10 โ”€โ”€โ”€โ”€โ”€โ”€โ”€ skipped                         โ”‚    โ”‚  โ”‚
โ”‚   โ”‚  โ”‚  RG 23 โ—€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ read                           โ”‚    โ”‚  โ”‚
โ”‚   โ”‚  โ”‚  RG 47 โ—€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ read                           โ”‚    โ”‚  โ”‚
โ”‚   โ”‚  โ”‚  ...                                            โ”‚    โ”‚  โ”‚
โ”‚   โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ”‚  โ”‚
โ”‚   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚
โ”‚                                                                 โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Benefits:

  • No database server - Index is just bytes, evaluate anywhere
  • Cache at edge - nginx, memcached, CDN, browser localStorage
  • Cross-language - Binary format works in any language
  • Offline capable - Cache index locally for disconnected queries
  • Cost efficient - Avoid scanning TB of Parquet data

This architecture is ideal for:

  • Log/observability platforms - Index arbitrary labels, query by trace ID
  • Vector databases - Pre-filter segments before ANN search
  • Serverless query engines - No database to manage
  • Browser-based data explorers - Cache index in localStorage
  • Edge computing / IoT analytics - Offline-capable querying
  • Cost-sensitive data lake queries - Minimize S3/GCS egress

Architecture

Index Structure

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                              GINIndex                                       โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                                                             โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”‚
โ”‚  โ”‚ Header                                                               โ”‚   โ”‚
โ”‚  โ”‚  โ€ข Version, NumRowGroups, NumDocs, NumPaths                         โ”‚   โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ”‚
โ”‚                                                                             โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”‚
โ”‚  โ”‚ Path Directory                                                       โ”‚   โ”‚
โ”‚  โ”‚  pathID โ†’ { PathName, ObservedTypes, Cardinality, Flags }           โ”‚   โ”‚
โ”‚  โ”‚                                                                      โ”‚   โ”‚
โ”‚  โ”‚  Example:                                                            โ”‚   โ”‚
โ”‚  โ”‚    0 โ†’ { "$.name",   String,  150,   0x00 }                         โ”‚   โ”‚
โ”‚  โ”‚    1 โ†’ { "$.age",    Int,     80,    0x00 }                         โ”‚   โ”‚
โ”‚  โ”‚    2 โ†’ { "$.tags[*]", String, 50000, FlagBloomOnly }                โ”‚   โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ”‚
โ”‚                                                                             โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   โ”‚
โ”‚  โ”‚ Global Bloom Filter                                                  โ”‚   โ”‚
โ”‚  โ”‚  Fast rejection for path=value pairs                                 โ”‚   โ”‚
โ”‚  โ”‚  Contains: "$.name=alice", "$.name=bob", "$.age=30", ...            โ”‚   โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ”‚
โ”‚                                                                             โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                      โ”‚
โ”‚  โ”‚ StringIndex           โ”‚  โ”‚ NumericIndex          โ”‚                      โ”‚
โ”‚  โ”‚ (per pathID)          โ”‚  โ”‚ (per pathID)          โ”‚                      โ”‚
โ”‚  โ”‚                       โ”‚  โ”‚                       โ”‚                      โ”‚
โ”‚  โ”‚ pathID: 0 ($.name)    โ”‚  โ”‚ pathID: 1 ($.age)     โ”‚                      โ”‚
โ”‚  โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚  โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”โ”‚                      โ”‚
โ”‚  โ”‚ โ”‚  Term   โ”‚ RGSet  โ”‚  โ”‚  โ”‚ โ”‚ RG  โ”‚ Min  โ”‚ Max  โ”‚โ”‚                      โ”‚
โ”‚  โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค  โ”‚  โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”คโ”‚                      โ”‚
โ”‚  โ”‚ โ”‚ "alice" โ”‚ {0,2}  โ”‚  โ”‚  โ”‚ โ”‚  0  โ”‚  25  โ”‚  35  โ”‚โ”‚                      โ”‚
โ”‚  โ”‚ โ”‚ "bob"   โ”‚ {1}    โ”‚  โ”‚  โ”‚ โ”‚  1  โ”‚  20  โ”‚  45  โ”‚โ”‚                      โ”‚
โ”‚  โ”‚ โ”‚ "carol" โ”‚ {2,3}  โ”‚  โ”‚  โ”‚ โ”‚  2  โ”‚  30  โ”‚  30  โ”‚โ”‚                      โ”‚
โ”‚  โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚  โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”˜โ”‚                      โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                      โ”‚
โ”‚                                                                             โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                      โ”‚
โ”‚  โ”‚ NullIndex             โ”‚  โ”‚ TrigramIndex          โ”‚                      โ”‚
โ”‚  โ”‚ (per pathID)          โ”‚  โ”‚ (per pathID)          โ”‚                      โ”‚
โ”‚  โ”‚                       โ”‚  โ”‚                       โ”‚                      โ”‚
โ”‚  โ”‚ pathID: 1 ($.age)     โ”‚  โ”‚ pathID: 3 ($.desc)    โ”‚                      โ”‚
โ”‚  โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚  โ”‚ โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚                      โ”‚
โ”‚  โ”‚ โ”‚ NullRGs  โ”‚ {4,7}  โ”‚ โ”‚  โ”‚ โ”‚ Trigram โ”‚ RGSet  โ”‚ โ”‚                      โ”‚
โ”‚  โ”‚ โ”‚ Present  โ”‚ {0-9}  โ”‚ โ”‚  โ”‚ โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค โ”‚                      โ”‚
โ”‚  โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚  โ”‚ โ”‚ "hel"   โ”‚ {0,2}  โ”‚ โ”‚                      โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚ โ”‚ "ell"   โ”‚ {0,2}  โ”‚ โ”‚                      โ”‚
โ”‚                             โ”‚ โ”‚ "llo"   โ”‚ {0,2,5}โ”‚ โ”‚                      โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” โ”‚ โ”‚ "wor"   โ”‚ {1,3}  โ”‚ โ”‚                      โ”‚
โ”‚  โ”‚ DocID Mapping          โ”‚ โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ”‚                      โ”‚
โ”‚  โ”‚ (optional)             โ”‚ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                      โ”‚
โ”‚  โ”‚                        โ”‚                                                 โ”‚
โ”‚  โ”‚ pos โ†’ DocID            โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                     โ”‚
โ”‚  โ”‚  0  โ†’ 1000             โ”‚  โ”‚ PathCardinality (HLL) โ”‚                     โ”‚
โ”‚  โ”‚  1  โ†’ 1001             โ”‚  โ”‚ (per pathID)          โ”‚                     โ”‚
โ”‚  โ”‚  2  โ†’ 1020             โ”‚  โ”‚                       โ”‚                     โ”‚
โ”‚  โ”‚  3  โ†’ 1021             โ”‚  โ”‚ Estimates unique vals โ”‚                     โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                     โ”‚
โ”‚                                                                             โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Note: All RGSet bitmaps use Roaring Bitmaps for efficient compression

Data Flow

                                BUILD PHASE
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                                                                             โ”‚
โ”‚   JSON Documents                                                            โ”‚
โ”‚        โ”‚                                                                    โ”‚
โ”‚        โ–ผ                                                                    โ”‚
โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”     โ”‚
โ”‚   โ”‚ DocID 0 โ”‚โ”€โ”€โ”€โ”€โ–ถโ”‚                  GINBuilder                       โ”‚     โ”‚
โ”‚   โ”‚ RG: 0   โ”‚     โ”‚                                                   โ”‚     โ”‚
โ”‚   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜     โ”‚  AddDocument(docID, json)                        โ”‚     โ”‚
โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”     โ”‚       โ”‚                                          โ”‚     โ”‚
โ”‚   โ”‚ DocID 1 โ”‚โ”€โ”€โ”€โ”€โ–ถโ”‚       โ–ผ                                          โ”‚     โ”‚
โ”‚   โ”‚ RG: 0   โ”‚     โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                                 โ”‚     โ”‚
โ”‚   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜     โ”‚  โ”‚ Walk JSON   โ”‚                                 โ”‚     โ”‚
โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”     โ”‚  โ”‚ Extract:    โ”‚                                 โ”‚     โ”‚
โ”‚   โ”‚ DocID 2 โ”‚โ”€โ”€โ”€โ”€โ–ถโ”‚  โ”‚  โ€ข paths    โ”‚                                 โ”‚     โ”‚
โ”‚   โ”‚ RG: 1   โ”‚     โ”‚  โ”‚  โ€ข values   โ”‚                                 โ”‚     โ”‚
โ”‚   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜     โ”‚  โ”‚  โ€ข types    โ”‚                                 โ”‚     โ”‚
โ”‚        โ‹ฎ          โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜                                 โ”‚     โ”‚
โ”‚                   โ”‚         โ”‚                                         โ”‚     โ”‚
โ”‚                   โ”‚         โ–ผ                                         โ”‚     โ”‚
โ”‚                   โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”     โ”‚     โ”‚
โ”‚                   โ”‚  โ”‚ Update per-path structures:             โ”‚     โ”‚     โ”‚
โ”‚                   โ”‚  โ”‚  โ€ข stringTerms[term] โ†’ RGSet.Set(pos)   โ”‚     โ”‚     โ”‚
โ”‚                   โ”‚  โ”‚  โ€ข numericStats[pos].Min/Max            โ”‚     โ”‚     โ”‚
โ”‚                   โ”‚  โ”‚  โ€ข nullRGs.Set(pos) if null             โ”‚     โ”‚     โ”‚
โ”‚                   โ”‚  โ”‚  โ€ข trigrams.Add(term, pos)              โ”‚     โ”‚     โ”‚
โ”‚                   โ”‚  โ”‚  โ€ข bloom.Add(path=value)                โ”‚     โ”‚     โ”‚
โ”‚                   โ”‚  โ”‚  โ€ข hll.Add(value)                       โ”‚     โ”‚     โ”‚
โ”‚                   โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜     โ”‚     โ”‚
โ”‚                   โ”‚                                                   โ”‚     โ”‚
โ”‚                   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜     โ”‚
โ”‚                                           โ”‚                                 โ”‚
โ”‚                                           โ–ผ                                 โ”‚
โ”‚                                    Finalize()                               โ”‚
โ”‚                                           โ”‚                                 โ”‚
โ”‚                                           โ–ผ                                 โ”‚
โ”‚                                    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                          โ”‚
โ”‚                                    โ”‚  GINIndex   โ”‚                          โ”‚
โ”‚                                    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                          โ”‚
โ”‚                                                                             โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

                                QUERY PHASE
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚                                                                             โ”‚
โ”‚   Predicates: [EQ("$.name", "alice"), GT("$.age", 25)]                     โ”‚
โ”‚        โ”‚                                                                    โ”‚
โ”‚        โ–ผ                                                                    โ”‚
โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”      โ”‚
โ”‚   โ”‚                     idx.Evaluate(predicates)                     โ”‚      โ”‚
โ”‚   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜      โ”‚
โ”‚        โ”‚                                                                    โ”‚
โ”‚        โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                     โ”‚
โ”‚        โ–ผ                      โ–ผ                      โ–ผ                     โ”‚
โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”          โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”          โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                 โ”‚
โ”‚   โ”‚ Predicateโ”‚          โ”‚ Predicateโ”‚          โ”‚   ...    โ”‚                 โ”‚
โ”‚   โ”‚    1     โ”‚          โ”‚    2     โ”‚          โ”‚          โ”‚                 โ”‚
โ”‚   โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”˜          โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”˜          โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                 โ”‚
โ”‚        โ”‚                     โ”‚                                              โ”‚
โ”‚        โ–ผ                     โ–ผ                                              โ”‚
โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                                    โ”‚
โ”‚   โ”‚ Bloom Check  โ”‚     โ”‚ Bloom Check  โ”‚   โ—€โ”€โ”€ Fast rejection path          โ”‚
โ”‚   โ”‚ path=value?  โ”‚     โ”‚   (skip for  โ”‚                                    โ”‚
โ”‚   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜     โ”‚    ranges)   โ”‚                                    โ”‚
โ”‚          โ”‚             โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                                    โ”‚
โ”‚          โ–ผ                    โ–ผ                                             โ”‚
โ”‚   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                                    โ”‚
โ”‚   โ”‚ StringIndex  โ”‚     โ”‚ NumericIndex โ”‚                                    โ”‚
โ”‚   โ”‚ lookup term  โ”‚     โ”‚ scan min/max โ”‚                                    โ”‚
โ”‚   โ”‚ โ†’ RGSet      โ”‚     โ”‚ โ†’ RGSet      โ”‚                                    โ”‚
โ”‚   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜     โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                                    โ”‚
โ”‚          โ”‚                    โ”‚                                             โ”‚
โ”‚          โ”‚    RGSet{0,2}      โ”‚    RGSet{0,1,2}                            โ”‚
โ”‚          โ”‚                    โ”‚                                             โ”‚
โ”‚          โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                                             โ”‚
โ”‚                    โ”‚                                                        โ”‚
โ”‚                    โ–ผ                                                        โ”‚
โ”‚             โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                                                 โ”‚
โ”‚             โ”‚  Intersect  โ”‚                                                 โ”‚
โ”‚             โ”‚  (AND all)  โ”‚                                                 โ”‚
โ”‚             โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜                                                 โ”‚
โ”‚                    โ”‚                                                        โ”‚
โ”‚                    โ–ผ                                                        โ”‚
โ”‚             โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                                                 โ”‚
โ”‚             โ”‚ RGSet{0,2}  โ”‚  โ—€โ”€โ”€ Matching row groups                       โ”‚
โ”‚             โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜                                                 โ”‚
โ”‚                    โ”‚                                                        โ”‚
โ”‚                    โ–ผ                                                        โ”‚
โ”‚             โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”                                                 โ”‚
โ”‚             โ”‚ ToSlice()   โ”‚  โ†’ [0, 2]                                      โ”‚
โ”‚             โ”‚     or      โ”‚                                                 โ”‚
โ”‚             โ”‚ MatchingDocIDs() โ†’ [DocID...]                                โ”‚
โ”‚             โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                                                 โ”‚
โ”‚                                                                             โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

DocID Codec (Optional)

For composite document identifiers (e.g., file + row group):

// Encode file index and row group into single DocID
codec := gin.NewRowGroupCodec(20)  // 20 RGs per file
builder := gin.NewBuilderWithCodec(config, totalRGs, codec)

docID := codec.Encode(fileIndex, rgIndex)  // e.g., file=3, rg=15 โ†’ DocID=75
builder.AddDocument(docID, jsonDoc)

// Query and decode results
result := idx.Evaluate(predicates)
for _, docID := range idx.MatchingDocIDs(result) {
    decoded := codec.Decode(docID)  // [3, 15]
    fileIdx, rgIdx := decoded[0], decoded[1]
}

How It Works

The GIN index maintains several data structures:

  1. Path Directory - Maps JSON paths to their metadata (types, cardinality)
  2. String Index - For each path, maps terms to row-group bitmaps (Roaring)
  3. Numeric Index - Per-row-group min/max values for range pruning
  4. Null Index - Bitmaps tracking which row groups have null/present values
  5. Trigram Index - Maps 3-character sequences to row-group bitmaps
  6. Global Bloom Filter - Fast rejection of non-existent path=value pairs
  7. DocID Mapping - Optional external DocID to internal position mapping

Query evaluation intersects the matching row-group bitmaps from each predicate.

Design Notes

Why numRGs Must Be Known Upfront

The NewBuilder(config, numRGs) requires the total number of row groups at construction time. This is intentional:

  1. Complement operations require universe size - Operations like AllRGs() and Invert() need to know the total number of row groups to compute complements. When a query cannot prune (e.g., unknown path, graceful degradation), the index returns "all row groups" - which requires knowing what "all" means.

  2. Parquet metadata provides this - The index is designed for Parquet row-group pruning. In this context, the number of row groups is always available from Parquet file metadata before indexing begins.

  3. Bounds checking - The builder validates that document positions don't exceed the declared row group count, catching configuration errors early.

License

MIT

About

Custom GIN embeddable index for data lakes

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors