Use a new synthetic _id format for time-series datastreams #137274

tlrx · 2025-10-28T16:12:47Z

This change follows #136810 and introduces a new format for document _id fields in time-series datastreams. In order to test this new format, the TSDB synthetic terms and postings format implementations are also changed.

Why change the document _id format?

The current _id format is composed of a routing hash, the hashed value of the _tsid and a timestamp. While it is possible to extract the routing hash and the timestamp from a document _id value, it is not possible to extract the original _tsid value.

This is an issue for synthetic _id as the document _id value is not indexed anymore: instead the synthetic _id is computed at runtime from the value of the routing hash, _tsid and timestamp. Therefore we need to be able to extract the routing hash/_tsid/timestamp values from the _id value and vice-versa.

The format for the synthetic _id in this pull request has been changed to be:

_tsid (variable length)
Long.MAX_VALUE - timestamp (unsigned long on 8 bytes encoded using big endian)
routing hash _ts_routing_hash (4 bytes)

Extracting the values from the _id is then used for routing GET and DELETE requests to the appropriate shard, or for setting the _tsid/timestamp/_ts_routing_hash fields in tombstone documents.
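
The layout above can be sketched as follows. This is a minimal illustration and not the code from this pull request: the class and method names are hypothetical, timestamps are assumed to be non-negative epoch milliseconds, and the routing hash is written big-endian here purely for simplicity (the description does not state its byte order).

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch of the described layout: _tsid | Long.MAX_VALUE - timestamp | routing hash.
final class SyntheticIdLayoutSketch {

    // Size of the fixed-length suffix that follows the variable-length _tsid.
    private static final int SUFFIX_BYTES = Long.BYTES + Integer.BYTES;

    static byte[] encode(byte[] tsid, long timestamp, int routingHash) {
        // ByteBuffer is big-endian by default, which matches the timestamp encoding described above.
        ByteBuffer buffer = ByteBuffer.allocate(tsid.length + SUFFIX_BYTES);
        buffer.put(tsid);
        buffer.putLong(Long.MAX_VALUE - timestamp); // assumes timestamp >= 0
        buffer.putInt(routingHash);                 // byte order is an assumption of this sketch
        return buffer.array();
    }

    static byte[] extractTsid(byte[] id) {
        // The _tsid is everything before the 12-byte suffix.
        return Arrays.copyOfRange(id, 0, id.length - SUFFIX_BYTES);
    }

    static long extractTimestamp(byte[] id) {
        long inverted = ByteBuffer.wrap(id).getLong(id.length - SUFFIX_BYTES);
        return Long.MAX_VALUE - inverted;
    }

    static int extractRoutingHash(byte[] id) {
        return ByteBuffer.wrap(id).getInt(id.length - Integer.BYTES);
    }

    public static void main(String[] args) {
        byte[] id = encode(new byte[] { 10, 20, 30 }, 1_700_000_000_000L, 0x12345678);
        System.out.println(id.length);
        System.out.println(extractTimestamp(id));
    }
}
```

Because the two trailing components have a fixed size, the variable-length _tsid can be recovered without a length prefix by reading from the end of the array.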

Note that in searches, the document _id is built using the existing TsIdLoader (that has been adjusted).

It is also important that the generated _id can be sorted lexicographically, as Lucene stops applying doc values updates when it seeks to a term that is greater than the one used in the soft-update. The ordering of the arrays of bytes representing the _id must match the ordering of documents in the segment, and to do that the timestamp value is stored in the array as Long.MAX_VALUE - timestamp. Docs with higher timestamp are then sorted first.
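
To make the ordering argument concrete, here is a small sketch (hypothetical names, not code from this pull request) that encodes only the timestamp component as Long.MAX_VALUE - timestamp in big endian and compares the resulting byte arrays lexicographically as unsigned bytes. It assumes non-negative timestamps.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch: checks that inverted big-endian timestamps sort newest-first as bytes.
final class InvertedTimestampOrderSketch {

    static byte[] invertedBigEndian(long timestamp) {
        // ByteBuffer writes big-endian by default; assumes timestamp >= 0.
        return ByteBuffer.allocate(Long.BYTES).putLong(Long.MAX_VALUE - timestamp).array();
    }

    // Negative result: ts1 sorts before ts2 in lexicographic (unsigned byte) order.
    static int compare(long ts1, long ts2) {
        return Arrays.compareUnsigned(invertedBigEndian(ts1), invertedBigEndian(ts2));
    }

    public static void main(String[] args) {
        // The more recent timestamp (2000) sorts before the older one (1000).
        System.out.println(compare(2_000L, 1_000L) < 0); // prints "true"
    }
}
```

With the same _tsid prefix, a document with a higher timestamp therefore gets a lexicographically smaller _id, consistent with documents being sorted by descending timestamp in the segment.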

Impact on synthetic terms and postings format

The SyntheticIdTermsEnum and SyntheticIdPostingsEnum introduced in #136810 had to be adjusted for the new _id format. We expect their implementation to be somewhat slow as it requires several lookups to work. In a different change we'll add a bloom filter on top of these enumerations to avoid costly lookups.

In contrast to #136810, the SyntheticIdTermsEnum and SyntheticIdPostingsEnum implementations are ready for review.

Tests improvements

Tests have been improved to delete random documents and/or look up documents from in-memory or flushed-to-disk segments. Searches and search-by-id should also work.

The following features are NOT covered by tests:

aggregations over _id likely do not work
nested-docs (if that's possible in time-series datastreams?)
and many more (split/clone, malformed _ids rejections etc)

...a-streams/src/internalClusterTest/java/org/elasticsearch/datastreams/TSDBSyntheticIdsIT.java

kkrik-es · 2025-10-29T15:29:05Z

server/src/main/java/org/elasticsearch/index/IndexSortConfig.java

            TIME_SERIES_SORT = new FieldSortSpec[] { new FieldSortSpec(TimeSeriesIdFieldMapper.NAME), timeStampSpec };
+            TIME_SERIES_WITH_SYNTHETIC_ID_SORT = new FieldSortSpec[] {
+                new FieldSortSpec(TimeSeriesIdFieldMapper.NAME),
+                new FieldSortSpec(DataStreamTimestampFieldMapper.DEFAULT_PATH) };


Do you need to drop the DESC ordering for timestamp? This is used for processing recent data first, and dropping it will penalize query performance.

I changed the ordering of the timestamp because it was easier to reason about end-to-end.

When applying soft-updates of documents, Lucene iterates over the terms (that is, the _id values) to know which documents must be soft-updated. It has an optimization to stop applying updates once it finds a term (i.e., another _id value) that is greater than the one used for the soft-update.

This comparison is done between two _id values (the value for the update and the next value from the terms enumeration in the segment), which are in fact arrays of bytes. So we want the lexicographical ordering of those arrays to match the ordering of documents in the segment, and using big-endian encoded timestamps allows that (i.e., if timestamp1 < timestamp2 then byte array 1 < byte array 2). It also means that documents must be sorted in the segment according to the natural ordering of the timestamp long, that is, ascending.
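
As a simplified illustration of why the byte ordering must match the enumeration order (this is not Lucene's actual code, and the names are hypothetical), consider a scan that stops at the first term greater than the target:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a scan with early termination over terms in enumeration order.
final class EarlyTerminationSketch {

    // Returns the index of target, or -1 if the scan stops or ends before finding it.
    static int findWithEarlyStop(List<byte[]> termsInEnumerationOrder, byte[] target) {
        for (int i = 0; i < termsInEnumerationOrder.size(); i++) {
            int cmp = Arrays.compareUnsigned(termsInEnumerationOrder.get(i), target);
            if (cmp == 0) {
                return i;
            }
            if (cmp > 0) {
                // A greater term was seen: a sorted enumeration cannot contain the target later.
                return -1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<byte[]> sorted = List.of(new byte[] { 1 }, new byte[] { 2 }, new byte[] { 3 });
        List<byte[]> unsorted = List.of(new byte[] { 1 }, new byte[] { 3 }, new byte[] { 2 });
        byte[] target = new byte[] { 2 };
        System.out.println(findWithEarlyStop(sorted, target));   // prints "1"
        System.out.println(findWithEarlyStop(unsorted, target)); // prints "-1": the match is missed
    }
}
```

If the terms are enumerated in an order that disagrees with their byte ordering, the early stop can skip a matching term, which is why the two orderings have to agree.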

@tlrx Can we encode the synthetic ID using _tsid and Long.MAX_VALUE - timestamp instead? Changing the index sort would break downsampling and rate calculation in ES|QL.

Great idea, thanks Nhat! I pushed d71316d.

Thank you, Tanguy!

kkrik-es · 2025-10-29T15:44:44Z

server/src/main/java/org/elasticsearch/index/codec/tsdb/TSDBSyntheticIdFieldsProducer.java

+            assert 0 <= tsIdOrd : tsIdOrd;
+            assert tsIdOrd < tsIdDocValues.getValueCount() : tsIdOrd;
+
+            for (int docID = 0; docID != DocIdSetIterator.NO_MORE_DOCS; docID = tsIdDocValues.nextDoc()) {


@martijnvg looks like a good candidate for docvalue skipper? Food for thought, nothing needed in this PR.

Yes, Francisco identified this too.

Yes, and doc value skippers are already enabled for the _tsid field. This sounds like a good follow-up change.

Additionally the tsdb codec for primary sort field has a special encoding that allows us to binary search to the target ordinal. (see SortedOrdinalReader).

Thanks both, I'll address this in a follow up.

server/src/main/java/org/elasticsearch/index/mapper/TsidExtractingIdFieldMapper.java

fcofdez

Looks great, I left a few minor comments. I would wait until someone with more expertise comments on the sorting changes.

fcofdez · 2025-10-30T11:05:26Z

server/src/main/java/org/elasticsearch/cluster/routing/IndexRouting.java

-            // see IndexRequest#autoGenerateTimeBasedId.
-            return hashToShardId(ByteUtils.readIntLE(idBytes, addIdWithRoutingHash ? idBytes.length - 9 : 0));
+            int hash;
+            if (addIdWithRoutingHash) {


nit: I wonder if we would see some performance degradation from all this new branching? I wouldn't expect it to be significant, but I wanted to mention it.

I think this specific branch is OK, but your comment made me think about how the IndexRouting is instantiated, and we don't want to read the USE_SYNTHETIC_ID setting when routing every operation.

So I pushed 933e280 to compute the useTimeSeriesSyntheticId flag once and for all when the IndexMetadata is built, and to use this flag for routing operations.

server/src/main/java/org/elasticsearch/index/codec/tsdb/TSDBSyntheticIdCodec.java

server/src/main/java/org/elasticsearch/index/codec/tsdb/TSDBSyntheticIdFieldsProducer.java

...a-streams/src/internalClusterTest/java/org/elasticsearch/datastreams/TSDBSyntheticIdsIT.java

fcofdez · 2025-10-31T19:55:25Z

server/src/main/java/org/elasticsearch/index/mapper/ParsedDocument.java

        document.add(versionField);
        if (useSyntheticId) {
            // Use a synthetic _id field which is not indexed nor stored
            document.add(IdFieldMapper.syntheticIdField(id));


I wonder whether, once we adopt the new stored fields format (which would skip storing the _id), we will need to adapt SearchBasedChangesSnapshot, which loads the _id from the stored fields as far as I can tell from the code.

My understanding is that SearchBasedChangesSnapshot uses TsIdLoader to load document ids since #97409. I'll see if I can write a test for this in a follow up.

I take this back: it uses the stored _id field. More work is needed for the SearchBasedChangesSnapshot implementations to work with synthetic _id, but that should not prevent merging this PR.

elasticsearchmachine · 2025-11-03T11:58:55Z

Hi @tlrx, I've created a changelog YAML for you.

tlrx · 2025-11-03T13:07:55Z

Thanks all for your reviews. I applied your feedback, this is ready for another round of reviews.

server/src/main/java/org/elasticsearch/index/codec/tsdb/TSDBSyntheticIdCodec.java

kkrik-es · 2025-11-04T09:46:59Z

server/src/main/java/org/elasticsearch/index/mapper/ParsedDocument.java

+            var routingHash = TsidExtractingIdFieldMapper.extractRoutingHashBytesFromSyntheticId(uid);
+
+            if (useDocValuesSkipper) {
+                document.add(SortedDocValuesField.indexedField(TimeSeriesIdFieldMapper.NAME, timeSeriesId));


I think we're adding skippers for @timestamp first, but not for tsid? @martijnvg to double-check this part.

Skippers are added for both _tsid and @timestamp.

kkrik-es

Thanks Tanguy, this is much cleaner now. Please wait for Martijn to check the Lucene changes.

fcofdez

LGTM

martijnvg

Thanks @tlrx, LGTM.

martijnvg · 2025-11-05T08:01:19Z

server/src/main/java/org/elasticsearch/index/codec/tsdb/TSDBSyntheticIdFieldsProducer.java

+            assert 0 <= tsIdOrd : tsIdOrd;
+            assert tsIdOrd < tsIdDocValues.getValueCount() : tsIdOrd;
+
+            for (int docID = 0; docID != DocIdSetIterator.NO_MORE_DOCS; docID = tsIdDocValues.nextDoc()) {


Yes, and doc value skippers are already enabled for the _tsid field. This sounds like a good follow-up change.

Additionally the tsdb codec for primary sort field has a special encoding that allows us to binary search to the target ordinal. (see SortedOrdinalReader).

martijnvg · 2025-11-05T08:07:53Z

server/src/main/java/org/elasticsearch/index/mapper/ParsedDocument.java

+            var routingHash = TsidExtractingIdFieldMapper.extractRoutingHashBytesFromSyntheticId(uid);
+
+            if (useDocValuesSkipper) {
+                document.add(SortedDocValuesField.indexedField(TimeSeriesIdFieldMapper.NAME, timeSeriesId));


Skippers are added for both _tsid and @timestamp.


tlrx · 2025-11-05T13:32:04Z

Thanks all!

…37274) This pull request follows elastic#136810 and introduces a new format for documents _id fields in time-series datastreams. In order to test this new format, the TSDB synthetic terms and postings format implementations are also changed. The current _id format is composed of a routing hash, the hashed value of the _tsid and a timestamp. While it is possible to extract the routing hash and the timestamp from a document _id value, it is not possible to extract the original _tsid value. This is an issue for synthetic _id as the document _id value is not indexed anymore: instead the synthetic _id is computed at runtime from the value of the routing hash, _tsid and timestamp. Therefore we need to be able to extract the routing hash/_tsid/timestamp values from the _id value and vice-versa. The format for the synthetic _id in this pull request has been changed to be: _tsid (variable length) Long.MAX_VALUE - timestamp (unsigned long on 8 bytes encoded using big endian) routing hash _ts_routing_hash (4 bytes) Extracting the values from the _id is then used for routing GETs and DELETEs requests to the appropriate shard or setting the _tsid/timestamp/_ts_routing_hash fields in tombstone documents. Note that in searches, the document _id is built using the existing TsIdLoader (that has been adjusted). It is also important that the generated _id can be sorted lexicographically, as Lucene stops applying doc values updates when it seeks to a term that is greater than the one used in the soft-update. The ordering of the arrays of bytes representing the _id must match the ordering of documents in the segment, and to do that the timestamp value is stored in the array as Long.MAX_VALUE - timestamp. Docs with higher timestamp are then sorted first. The SyntheticIdTermsEnum and SyntheticIdPostingsEnum introduced in elastic#136810 had to be adjusted for the new _id format. We expect their implementation to be somewhat slow as it requires several lookups to work. 
In a different change we'll add a bloom filter on top of these enumerations to avoid costly lookups. Relates elastic#136304

BASE=7a23516cce48dcd78aed0075a398b604531f1e81 HEAD=608ff674cd870bd5574c62682139de356f081672 Branch=main


tlrx added 6 commits October 24, 2025 18:34

Change document _id format for time series datastreams

e562e8c

fix bug

9a9df49

Merge branch 'main' into 2025/10/24/new-id-format

39e6cd4

fix remaining bug

f6234c3

fix sorting

51d66a3

Merge branch 'main' into 2025/10/24/new-id-format

6fd8a69

elasticsearchmachine added the v9.3.0 label Oct 28, 2025

tlrx added 6 commits October 28, 2025 17:23

Merge branch 'main' into 2025/10/24/new-id-format

9babebf

fix compiling and tests

8cfa2fa

Merge branch 'main' into 2025/10/24/new-id-format

46cc58b

fix sort config

d885dda

fix sort config

b22f59c

Merge branch 'main' into 2025/10/24/new-id-format

ab2be04

tlrx requested review from fcofdez, kkrik-es and martijnvg October 29, 2025 15:00

tlrx marked this pull request as ready for review October 29, 2025 15:00

elasticsearchmachine added the needs:triage label Oct 29, 2025

kkrik-es reviewed Oct 29, 2025

...a-streams/src/internalClusterTest/java/org/elasticsearch/datastreams/TSDBSyntheticIdsIT.java (outdated, resolved)

kkrik-es reviewed Oct 29, 2025

server/src/main/java/org/elasticsearch/index/mapper/TsidExtractingIdFieldMapper.java (outdated, resolved)

tlrx added 2 commits October 30, 2025 11:27

Merge branch 'main' into 2025/10/24/new-id-format

63911a7

fix merge

b50b64c

fcofdez reviewed Oct 30, 2025


compute useTimeSeriesSyntheticId in metadata

933e280

fcofdez reviewed Oct 31, 2025


tlrx added 3 commits November 3, 2025 09:24

Merge branch 'main' into 2025/10/24/new-id-format

ad94e5d

remove update

4662f94

startDocID >= 0

b3428c7

tlrx added 2 commits November 3, 2025 12:58

Update docs/changelog/137274.yaml

96eb36a

ensure no postings

136a267

tlrx requested review from dnhatn, fcofdez and kkrik-es November 3, 2025 13:07

tlrx added 4 commits November 3, 2025 15:28

Merge branch 'main' into 2025/10/24/new-id-format

dda5531

remove sort

3b22c46

Merge branch 'main' into 2025/10/24/new-id-format

6a4a9e1

remove compound

546e23b

kkrik-es reviewed Nov 4, 2025

server/src/main/java/org/elasticsearch/index/codec/tsdb/TSDBSyntheticIdCodec.java (outdated, resolved)

kkrik-es reviewed Nov 4, 2025


kkrik-es approved these changes Nov 4, 2025


Merge branch 'main' into 2025/10/24/new-id-format

5529733

fcofdez approved these changes Nov 4, 2025


Merge branch 'main' into 2025/10/24/new-id-format

7e82813

martijnvg approved these changes Nov 5, 2025


tlrx added 5 commits November 5, 2025 09:36

Merge branch 'main' into 2025/10/24/new-id-format

d731f9f

Merge branch '2025/10/24/new-id-format' of github.com:tlrx/elasticsea…

eb05d57

…rch into 2025/10/24/new-id-format

feedback

3655dc3

Merge branch 'main' into 2025/10/24/new-id-format

59687e8

fix setting registration

608ff67

tlrx merged commit 184c51b into elastic:main Nov 5, 2025
34 checks passed

tlrx deleted the 2025/10/24/new-id-format branch November 5, 2025 13:31

phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Nov 6, 2025

Mirror upstream elastic#137274 as single snapshot commit for AI review

293f648

BASE=7a23516cce48dcd78aed0075a398b604531f1e81 HEAD=608ff674cd870bd5574c62682139de356f081672 Branch=main

phananh1010 added a commit to phananh1010/elasticsearch that referenced this pull request Nov 7, 2025

Mirror upstream elastic#137274 as single snapshot commit for AI review

d9df1c3

BASE=7a23516cce48dcd78aed0075a398b604531f1e81 HEAD=608ff674cd870bd5574c62682139de356f081672 Branch=main
