-
Notifications
You must be signed in to change notification settings - Fork 25.7k
Use a new synthetic _id format for time-series datastreams #137274
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
...a-streams/src/internalClusterTest/java/org/elasticsearch/datastreams/TSDBSyntheticIdsIT.java
Outdated
Show resolved
Hide resolved
| TIME_SERIES_SORT = new FieldSortSpec[] { new FieldSortSpec(TimeSeriesIdFieldMapper.NAME), timeStampSpec }; | ||
| TIME_SERIES_WITH_SYNTHETIC_ID_SORT = new FieldSortSpec[] { | ||
| new FieldSortSpec(TimeSeriesIdFieldMapper.NAME), | ||
| new FieldSortSpec(DataStreamTimestampFieldMapper.DEFAULT_PATH) }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to drop the DESC ordering for timestamp? This is used for processing recent data first, and dropping it will penalize query performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the ordering of the timestamp because it was easier to reason about end-to-end.
When applying soft-updates of documents, Lucene iterates over the terms (this is _id values) to know which documents must be soft-updated. It has an optimization to stop applying updates once it finds a term (ie, another _id value) that is greater than the one used for the soft-update.
This comparison is done between two values of _id (the value for the update, and the next value from the terms enumeration in the segment) which are in fact arrays of bytes. So we want the lexicographical ordering of those arrays to match the ordering of documents in the segment, and using Big Endian encoded timestamps allows that (ie, if timestamp1 < timestamp2 then arrays of bytes 1 < arrays of bytes 2). It also mean that documents must be sorted according to timestamp long natural ordering in the segment, so ascending.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tlrx Can we encode the synthetic ID using _tisd and Long.MAX_VALUE - timestamp instead? Changing the index sort would break downsampling and rate calculation in ES|QL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea, thanks Nhat! I pushed d71316d.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, Tanguy!
| assert 0 <= tsIdOrd : tsIdOrd; | ||
| assert tsIdOrd < tsIdDocValues.getValueCount() : tsIdOrd; | ||
|
|
||
| for (int docID = 0; docID != DocIdSetIterator.NO_MORE_DOCS; docID = tsIdDocValues.nextDoc()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@martijnvg looks like a good candidate for docvalue skipper? Food for thought, nothing needed in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Francisco identified this too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and doc value skipper are already enabled for _tsid field. This sounds like a good followup change.
Additionally the tsdb codec for primary sort field has a special encoding that allows us to binary search to the target ordinal. (see SortedOrdinalReader).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks both, I'll address this in a follow up.
server/src/main/java/org/elasticsearch/index/mapper/TsidExtractingIdFieldMapper.java
Outdated
Show resolved
Hide resolved
fcofdez
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, I left a few minor comments. A would wait until someone with more expertise comments about the sorting changes.
| // see IndexRequest#autoGenerateTimeBasedId. | ||
| return hashToShardId(ByteUtils.readIntLE(idBytes, addIdWithRoutingHash ? idBytes.length - 9 : 0)); | ||
| int hash; | ||
| if (addIdWithRoutingHash) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I wonder if we would see some performance degradation from all this new branching? I won't expect it to be important but I wanted to mention it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this specific branch is OK, but your comment made me think about how the IndexRouting is instanciated, and we don't want to read the USE_SYNTHETIC_ID setting for routing every operation.
So I pushed 933e280 to compute the useTimeSeriesSyntheticId flag once and for all when IndexMetadata are built, and uses this flag for routing operations.
server/src/main/java/org/elasticsearch/index/codec/tsdb/TSDBSyntheticIdCodec.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/codec/tsdb/TSDBSyntheticIdFieldsProducer.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/codec/tsdb/TSDBSyntheticIdFieldsProducer.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/codec/tsdb/TSDBSyntheticIdFieldsProducer.java
Outdated
Show resolved
Hide resolved
...a-streams/src/internalClusterTest/java/org/elasticsearch/datastreams/TSDBSyntheticIdsIT.java
Show resolved
Hide resolved
| document.add(versionField); | ||
| if (useSyntheticId) { | ||
| // Use a synthetic _id field which is not indexed nor stored | ||
| document.add(IdFieldMapper.syntheticIdField(id)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if once we adopt the new stored fields format which would skip storing the _id we would need to adapt the SearchBasedChangesSnapshot where we load the _id from the stored fields as far as I can read in the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that SearchBasedChangesSnapshot uses TsIdLoader to load document ids since #97409. I'll see if I can write a test for this in a follow up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking this back, it uses the stored _id field. More work is needed to have SearchBasedChangesSnapshot implementations work with synthetic _id, but that should not prevent the merge of this PR.
|
Hi @tlrx, I've created a changelog YAML for you. |
|
Thanks all for your reviews. I applied your feedback, this is ready for another round of reviews. |
server/src/main/java/org/elasticsearch/index/codec/tsdb/TSDBSyntheticIdCodec.java
Outdated
Show resolved
Hide resolved
| var routingHash = TsidExtractingIdFieldMapper.extractRoutingHashBytesFromSyntheticId(uid); | ||
|
|
||
| if (useDocValuesSkipper) { | ||
| document.add(SortedDocValuesField.indexedField(TimeSeriesIdFieldMapper.NAME, timeSeriesId)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're adding skippers for @timestamp first, but not for tsid? @martijnvg to double-check this part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Skippers are added for both _tsid and @timestamp.
kkrik-es
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Tanguy, this is much cleaner now. Please wait for Martijn to check the Lucene changes.
fcofdez
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
martijnvg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tlrx, LGTM.
| assert 0 <= tsIdOrd : tsIdOrd; | ||
| assert tsIdOrd < tsIdDocValues.getValueCount() : tsIdOrd; | ||
|
|
||
| for (int docID = 0; docID != DocIdSetIterator.NO_MORE_DOCS; docID = tsIdDocValues.nextDoc()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and doc value skipper are already enabled for _tsid field. This sounds like a good followup change.
Additionally the tsdb codec for primary sort field has a special encoding that allows us to binary search to the target ordinal. (see SortedOrdinalReader).
| var routingHash = TsidExtractingIdFieldMapper.extractRoutingHashBytesFromSyntheticId(uid); | ||
|
|
||
| if (useDocValuesSkipper) { | ||
| document.add(SortedDocValuesField.indexedField(TimeSeriesIdFieldMapper.NAME, timeSeriesId)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Skippers are added for both _tsid and @timestamp.
|
Thanks all! |
…37274) This pull request follows elastic#136810 and introduces a new format for documents _id fields in time-series datastreams. In order to test this new format, the TSDB synthetic terms and postings format implementations are also changed. The current _id format is composed of a routing hash, the hashed value of the _tsid and a timestamp. While it is possible to extract the routing hash and the timestamp from a document _id value, it is not possible to extract the original _tsid value. This is an issue for synthetic _id as the document _id value is not indexed anymore: instead the synthetic _id is computed at runtime from the value of the routing hash, _tsid and timestamp. Therefore we need to be able to extract the routing hash/_tsid/timestamp values from the _id value and vice-versa. The format for the synthetic _id in this pull request has been changed to be: _tsid (variable length) Long.MAX_VALUE - timestamp (unsigned long on 8 bytes encoded using big endian) routing hash _ts_routing_hash (4 bytes) Extracting the values from the _id is then used for routing GETs and DELETEs requests to the appropriate shard or setting the _tsid/timestamp/_ts_routing_hash fields in tombstone documents. Note that in searches, the document _id is built using the existing TsIdLoader (that has been adjusted). It is also important that the generated _id can be sorted lexicographically, as Lucene stops applying doc values updates when it seeks to a term that is greater than the one used in the soft-update. The ordering of the arrays of bytes representing the _id must match the ordering of documents in the segment, and to do that the timestamp value is stored in the array as Long.MAX_VALUE - timestamp. Docs with higher timestamp are then sorted first. The SyntheticIdTermsEnum and SyntheticIdPostingsEnum introduced in elastic#136810 had to be adjusted for the new _id format. We expect their implementation to be somewhat slow as it requires several lookups to work. In a different change we'll add a bloom filter on top of these enumerations to avoid costly lookups. Relates elastic#136304
BASE=7a23516cce48dcd78aed0075a398b604531f1e81 HEAD=608ff674cd870bd5574c62682139de356f081672 Branch=main
BASE=7a23516cce48dcd78aed0075a398b604531f1e81 HEAD=608ff674cd870bd5574c62682139de356f081672 Branch=main
…37274) This pull request follows elastic#136810 and introduces a new format for documents _id fields in time-series datastreams. In order to test this new format, the TSDB synthetic terms and postings format implementations are also changed. The current _id format is composed of a routing hash, the hashed value of the _tsid and a timestamp. While it is possible to extract the routing hash and the timestamp from a document _id value, it is not possible to extract the original _tsid value. This is an issue for synthetic _id as the document _id value is not indexed anymore: instead the synthetic _id is computed at runtime from the value of the routing hash, _tsid and timestamp. Therefore we need to be able to extract the routing hash/_tsid/timestamp values from the _id value and vice-versa. The format for the synthetic _id in this pull request has been changed to be: _tsid (variable length) Long.MAX_VALUE - timestamp (unsigned long on 8 bytes encoded using big endian) routing hash _ts_routing_hash (4 bytes) Extracting the values from the _id is then used for routing GETs and DELETEs requests to the appropriate shard or setting the _tsid/timestamp/_ts_routing_hash fields in tombstone documents. Note that in searches, the document _id is built using the existing TsIdLoader (that has been adjusted). It is also important that the generated _id can be sorted lexicographically, as Lucene stops applying doc values updates when it seeks to a term that is greater than the one used in the soft-update. The ordering of the arrays of bytes representing the _id must match the ordering of documents in the segment, and to do that the timestamp value is stored in the array as Long.MAX_VALUE - timestamp. Docs with higher timestamp are then sorted first. The SyntheticIdTermsEnum and SyntheticIdPostingsEnum introduced in elastic#136810 had to be adjusted for the new _id format. We expect their implementation to be somewhat slow as it requires several lookups to work. In a different change we'll add a bloom filter on top of these enumerations to avoid costly lookups. Relates elastic#136304
This change follows #136810 and introduces a new format for documents
_idfields in time-series datastreams. In order to test this new format, the TSDB synthetic terms and postings format implementations are also changed.Why change the document _id format?
The current _id format is composed of a routing hash, the hashed value of the _tsid and a timestamp. While it is possible to extract the routing hash and the timestamp from a document _id value, it is not possible to extract the original _tsid value.
This is an issue for synthetic
_idas the document_idvalue is not indexed anymore: instead the synthetic_idis computed at runtime from the value of the routing hash,_tsidand timestamp. Therefore we need to be able to extract the routing hash/_tsid/timestamp values from the _id value and vice-versa.The format for the synthetic _id in this pull request has been changed to be:
_tsid(variable length)_ts_routing_hash(4 bytes)Extracting the values from the _id is then used for routing GETs and DELETEs requests to the appropriate shard or setting the _tsid/timestamp/_ts_routing_hash fields in tombstone documents.
Note that in searches, the document _id is built using the existing
TsIdLoader(that has been adjusted).It is also important that the generated _id can be sorted lexicographically, as Lucene stops applying doc values updates when it seeks to a term that is greater than the one used in the soft-update. The ordering of the arrays of bytes representing the
_idmust match the ordering of documents in the segment, and to do that the timestamp value is stored in the array asLong.MAX_VALUE - timestamp. Docs with higher timestamp are then sorted first.Impact on synthetic terms and postings format
The
SyntheticIdTermsEnumandSyntheticIdPostingsEnumintroduced in #136810 had to be adjusted for the new_idformat. We expect their implementation to be somewhat slow as it requires several lookups to work. In a different change we'll add a bloom filter on top of these enumerations to avoid costly lookups.In contrast to #136810,
SyntheticIdTermsEnumandSyntheticIdPostingsEnumimplementations are ready for reviews.Tests improvements
Tests have been improved to delete random documents and/or lookup documents from in-memory or flushed to disk segments. Searches and search-by-id should also work.
The following feature are NOT covered by tests: