@churromorales churromorales commented Jan 27, 2026

Description

This PR introduces TemporalMergePolicy, a new merge policy designed for time-series workloads where documents contain a timestamp field. The policy groups segments into time windows and merges segments within the same window, but never across different windows. This preserves temporal locality and improves query performance for time-range queries. Relates to #15412.

How it works

Time Bucketing

Segments are assigned to time windows based on their maximum timestamp:

  • Exponential bucketing (default): Recent data uses small windows (e.g., 1 hour), older data uses progressively larger windows (4 hours, 16 hours, etc.)
  • Fixed bucketing: All time windows have the same size
  • Old data bucket: Segments older than maxAgeSeconds are placed in a special bucket and not merged
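
As a rough illustration of the bucketing described above, the following sketch shows one way a window could be chosen for a segment. It is not the PR's implementation: the class and method names are hypothetical, the 1-hour base window and growth factor of 4 are taken from the "1 hour, 4 hours, 16 hours" example, and the rule that a window size applies to ages below four times that size is an assumption.

```java
/** Hypothetical sketch of time-window assignment; not the PR's actual code. */
class TimeBucketSketch {
  // Assumed values, taken from the "1 hour, 4 hours, 16 hours" example.
  static final long BASE_WINDOW_SECONDS = 3600L;
  static final long GROWTH_FACTOR = 4L;

  /** Exponential bucketing: the window size grows by GROWTH_FACTOR as age increases. */
  static long exponentialWindowSeconds(long ageSeconds) {
    long size = BASE_WINDOW_SECONDS;
    // Assumed tiering: a window size applies to ages below size * GROWTH_FACTOR.
    // The first condition guards against overflow for extremely large ages.
    while (size <= Long.MAX_VALUE / GROWTH_FACTOR && ageSeconds >= size * GROWTH_FACTOR) {
      size *= GROWTH_FACTOR;
    }
    return size;
  }

  /** Start of the window containing the segment's maximum timestamp. */
  static long windowStart(long maxTimestampSeconds, long windowSeconds) {
    return Math.floorDiv(maxTimestampSeconds, windowSeconds) * windowSeconds;
  }

  /** Segments older than maxAgeSeconds go to the old-data bucket and are not merged. */
  static boolean isOldData(long ageSeconds, long maxAgeSeconds) {
    return ageSeconds > maxAgeSeconds;
  }

  public static void main(String[] args) {
    System.out.println(exponentialWindowSeconds(1800L));       // segment 30 minutes old
    System.out.println(exponentialWindowSeconds(5 * 3600L));   // segment 5 hours old
    System.out.println(windowStart(7250L, 3600L));             // window start for a 1-hour window
  }
}
```

With fixed bucketing, the same windowStart helper would simply be called with a constant window size for every segment.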

Merge Triggers

Merges are triggered when a time window meets two conditions:

  1. Contains at least minThreshold segments (default: 4)
  2. Total document count in the window exceeds largestSegment * compactionRatio (compactionRatio default: 1.2)
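
The two conditions can be sketched as a single predicate. This is a minimal illustration, not the PR's code: the class and method names are hypothetical, and it assumes largestSegment means the document count of the largest segment in the window.

```java
/** Hypothetical sketch of the merge trigger for one time window. */
class MergeTriggerSketch {
  /**
   * @param docCounts document counts of the segments in a single time window
   * @param minThreshold minimum number of segments required (default 4 per the description)
   * @param compactionRatio ratio applied to the largest segment (default 1.2 per the description)
   */
  static boolean shouldMerge(long[] docCounts, int minThreshold, double compactionRatio) {
    // Condition 1: the window must contain at least minThreshold segments.
    if (docCounts.length < minThreshold) {
      return false;
    }
    long total = 0L;
    long largest = 0L;
    for (long count : docCounts) {
      total += count;
      largest = Math.max(largest, count);
    }
    // Condition 2: the total must exceed largestSegment * compactionRatio.
    return total > largest * compactionRatio;
  }

  public static void main(String[] args) {
    System.out.println(shouldMerge(new long[] {100, 100, 100, 100}, 4, 1.2));
    System.out.println(shouldMerge(new long[] {1000, 10, 10, 10}, 4, 1.2));
  }
}
```

Under this reading, a window dominated by one large segment (1000 documents plus three segments of 10) does not trigger a merge, because rewriting the large segment would gain little.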

Key Constraints

  • Never merge across time windows: Even forceMerge(1) respects bucket boundaries
  • Old data protection: Very old segments (configurable via maxAgeSeconds) are excluded from merging
  • Concurrency safety: Properly checks MergeContext.getMergingSegments() to avoid "segment already merging" errors
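
A simplified sketch of how these constraints might shape candidate selection follows. Segment names and window keys are plain strings and longs here, which is an assumption made to keep the example self-contained; in the policy itself the set of in-flight segments would come from MergeContext.getMergingSegments(). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch: merge candidates are grouped per window and never mixed. */
class WindowCandidatesSketch {
  /**
   * @param segmentToWindow segment name mapped to its window key (for example, the window start)
   * @param merging names of segments that are already part of a running merge
   * @return candidate segments per window, excluding segments that are already merging
   */
  static Map<Long, List<String>> candidatesByWindow(
      Map<String, Long> segmentToWindow, Set<String> merging) {
    Map<Long, List<String>> byWindow = new TreeMap<>();
    for (Map.Entry<String, Long> entry : segmentToWindow.entrySet()) {
      if (merging.contains(entry.getKey())) {
        continue; // skip segments that are already merging
      }
      byWindow.computeIfAbsent(entry.getValue(), k -> new ArrayList<>()).add(entry.getKey());
    }
    return byWindow;
  }

  public static void main(String[] args) {
    Map<String, Long> segments = new LinkedHashMap<>();
    segments.put("_0", 0L);
    segments.put("_1", 0L);
    segments.put("_2", 3600L);
    segments.put("_3", 0L);
    // Each map value is an independent candidate list; a merge is only ever built from one list.
    System.out.println(candidatesByWindow(segments, Set.of("_1")));
  }
}
```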

Handling Late-Arriving and Out-of-Order Data

Time-series data rarely arrives perfectly in order. TemporalMergePolicy handles various timing scenarios:

Late-Arriving Data

When data with older timestamps arrives after newer data has been indexed:

  • Each segment is assigned to a time window based on its maximum timestamp
  • A segment containing mostly recent data with a few old records will be placed in the recent bucket
  • A segment containing only old data will be placed in the appropriate older bucket
  • Segments with mixed timestamps (spanning multiple windows) are assigned based on their max timestamp

Example:

  Segment A: timestamps [2024-01-01 to 2024-01-02] → Jan 2024 bucket
  Segment B: timestamps [2024-02-01 to 2024-02-02] → Feb 2024 bucket
  Segment C: timestamps [2024-01-15 to 2024-01-16] → Jan 2024 bucket (late arrival)

Result: Segments A and C can merge together (same bucket), but never with B
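
The example above can be reproduced with a small sketch that groups segments by the calendar month of their maximum timestamp. Month-sized buckets are used only because the example uses them; the class and method names are hypothetical and not part of the PR.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: bucket segments by the month of their maximum timestamp. */
class LateArrivalExample {
  static Map<YearMonth, List<String>> groupByMonth(Map<String, LocalDate> maxTimestampBySegment) {
    Map<YearMonth, List<String>> buckets = new TreeMap<>();
    for (Map.Entry<String, LocalDate> entry : maxTimestampBySegment.entrySet()) {
      YearMonth bucket = YearMonth.from(entry.getValue());
      buckets.computeIfAbsent(bucket, k -> new ArrayList<>()).add(entry.getKey());
    }
    return buckets;
  }

  public static void main(String[] args) {
    // Insertion order mirrors arrival order: C arrives after B but carries January data.
    Map<String, LocalDate> segments = new LinkedHashMap<>();
    segments.put("A", LocalDate.of(2024, 1, 2));
    segments.put("B", LocalDate.of(2024, 2, 2));
    segments.put("C", LocalDate.of(2024, 1, 16));
    System.out.println(groupByMonth(segments)); // prints {2024-01=[A, C], 2024-02=[B]}
  }
}
```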

Future Data

Data with timestamps in the future (beyond current time):

  • Treated as age = 0 (most recent)
  • Placed in the smallest (most recent) time window
  • Prevents errors from clock skew or timestamp bugs
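
The clamping behavior described above amounts to a one-line computation. The sketch below assumes timestamps in epoch seconds; the class and method names are hypothetical.

```java
/** Hypothetical sketch: future timestamps are clamped to age 0. */
class FutureDataSketch {
  static long ageSeconds(long nowSeconds, long maxTimestampSeconds) {
    // A maximum timestamp ahead of the clock yields a negative difference; clamp it to zero
    // so the segment lands in the most recent window.
    return Math.max(0L, nowSeconds - maxTimestampSeconds);
  }

  public static void main(String[] args) {
    System.out.println(ageSeconds(1_000L, 400L));   // prints 600
    System.out.println(ageSeconds(1_000L, 5_000L)); // prints 0
  }
}
```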

Out-of-Order Writes Within a Segment

If a single segment contains documents spanning multiple time windows:

  • The segment is bucketed by its max timestamp only
  • This prevents pathological cases where a single document with a far-future timestamp would prevent merging
  • Trade-off: Some temporal mixing can occur within individual segments before merging

@github-actions github-actions bot added this to the 11.0.0 milestone Jan 27, 2026