Integrate stored fields format bloom filter with synthetic _id #138515
base: main
Hi @fcofdez, I've created a changelog YAML for you. |
tlrx
left a comment
I need to take a deeper look but overall approach looks good.
...er/src/main/java/org/elasticsearch/index/codec/ES93TSDBDefaultCompressionLucene103Codec.java
server/src/main/java/org/elasticsearch/index/codec/ES93TSDBLuceneDefaultCodec.java
| Property.Final | ||
| ); | ||
|
|
||
| public static final boolean USE_STORED_FIELDS_BLOOM_FILTER_FOR_ID_FEATURE_FLAG = new FeatureFlag("stored_field_bloom_filter") |
Do you think we need another feature flag, or could it be folded with the existing one for synthetic id?
I'm not sure it makes a lot of sense to test one without the other, but maybe I'm missing a point.
My idea is that the bloom filter is an optimization on top of the synthetic id. But happy to get rid of the feature flag and the index.mapping.use_stored_field_bloom_filter_id index setting if we think that's redundant. It'll simplify the code a bit.
My idea is that the bloom filter is an optimization on top of the synthetic id
I agree, but I think we won't use synthetic ids without a bloom filter on top of them, and having two feature flags complicates the code. If that's OK, I would prefer to use only one feature flag for both.
I won't block the PR for this so if you want to keep it that's OK too.
I got rid of the setting and feature flag in 799fb3a
|
|
||
| } | ||
|
|
||
| private enum StorageMode { |
I find this storage mode a bit confusing. Maybe a useBloomFilterSyntheticId local variable would be simpler?
I got rid of it in a72a66a
|
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
| private final TSDBStoredFieldsFormat storedFieldsFormat; | ||
|
|
||
| TSDBCodecWithSyntheticId(String name, Codec delegate, BigArrays bigArrays) { | ||
| super(name, new TSDBSyntheticIdCodec(delegate)); |
I'm planning to incorporate the code from TSDBSyntheticIdCodec into this class in a follow-up PR. But I wanted to keep the change size under control.
tlrx
left a comment
Great work @fcofdez ! I only left minor comments, the direction makes sense to me and we can improve in follow ups. I'd like to have Martijn or Alan review the codec part before merging though.
| boolean useSyntheticId = IndexSettings.TSDB_SYNTHETIC_ID_FEATURE_FLAG | ||
| && mapperService != null | ||
| && mapperService.getIndexSettings().useTimeSeriesSyntheticId() | ||
| && mapperService.getIndexSettings().getMode() == IndexMode.TIME_SERIES; |
mapperService.getIndexSettings().useTimeSeriesSyntheticId() already ensures that the index is a time-series index and that the feature flag is enabled.
Simplified in d0d94ef
| private final TSDBStoredFieldsFormat storedFieldsFormat; | ||
|
|
||
| TSDBCodecWithSyntheticId(String name, Codec delegate, BigArrays bigArrays) { | ||
| super(name, new TSDBSyntheticIdCodec(delegate)); |
We can merge TSDBSyntheticIdCodec and TSDBCodecWithSyntheticId together in a follow up.
Ok, I just saw your #138515 (comment) 👍
| return new FilterLeafReader.FilterTerms(delegate.terms(field)) { | ||
| @Override | ||
| public TermsEnum iterator() throws IOException { | ||
| return new LazyFilterTermsEnum() { | ||
| private TermsEnum delegate; | ||
|
|
||
| @Override | ||
| protected TermsEnum getDelegate() throws IOException { | ||
| if (delegate == null) { | ||
| delegate = in.iterator(); | ||
| } | ||
| return delegate; | ||
| } |
nit: I was confused by the two delegates (the one in the lazy enum and the one in the bloom filter) and by what in was referring to.
Maybe something like this would help?
final Terms terms = delegate.terms(field);
return new FilterLeafReader.FilterTerms(terms) {
    @Override
    public TermsEnum iterator() throws IOException {
        return new LazyFilterTermsEnum() {
            private TermsEnum termsEnum;

            @Override
            protected TermsEnum getDelegate() throws IOException {
                if (termsEnum == null) {
                    termsEnum = terms.iterator();
                }
                return termsEnum;
            }

            @Override
            public boolean seekExact(BytesRef text) throws IOException {
                if (bloomFilter.mayContainTerm(field, text) == false) {
                    return false;
                }
                return getDelegate().seekExact(text);
            }
        };
    }
};
Good idea, changed in 92a6daa
server/src/main/java/org/elasticsearch/index/codec/storedfields/TSDBStoredFieldsFormat.java
| var legacyBestSpeedCodec = new LegacyPerFieldMapperCodec(Lucene103Codec.Mode.BEST_SPEED, mapperService, bigArrays); | ||
| if (ZSTD_STORED_FIELDS_FEATURE_FLAG) { | ||
| codecs.put(DEFAULT_CODEC, new PerFieldMapperCodec(Zstd814StoredFieldsFormat.Mode.BEST_SPEED, mapperService, bigArrays)); | ||
| PerFieldMapperCodec defaultZstdCodec = new PerFieldMapperCodec( |
If we want to reduce the scope of this change, we could create our own default_code_with_synthetic_id and hard-code it in INDEX_CODEC_SETTING for all time-series indices with use_synthetic_id enabled.
Here we go for the complete solution immediately, for which I'm ok too.
I don't have a strong opinion about this. I'm ok with both approaches. The downside of an extra codec is that we need to maintain it indefinitely whereas with this change as long as the feature flag is off we keep the current behaviour.
...c/main/java/org/elasticsearch/index/codec/bloomfilter/ES93BloomFilterStoredFieldsFormat.java
| * @see StoredFieldsFormat | ||
| */ | ||
| public class TSDBStoredFieldsFormat extends StoredFieldsFormat { | ||
| private final StoredFieldsFormat storedFieldsFormat; |
nit: I would call this delegate
Tackled in e1cbce6
|
|
||
| TSDBStoredFieldsWriter(Directory directory, SegmentInfo si, IOContext context) throws IOException { | ||
| boolean success = false; | ||
| List<Closeable> toClose = new ArrayList<>(); |
nit:
| List<Closeable> toClose = new ArrayList<>(); | |
| List<Closeable> toClose = new ArrayList<>(2); |
Tackled in 680edf1
|
|
||
| TSDBStoredFieldsReader(Directory directory, SegmentInfo si, FieldInfos fn, IOContext context) throws IOException { | ||
| boolean success = false; | ||
| List<Closeable> toClose = new ArrayList<>(); |
| List<Closeable> toClose = new ArrayList<>(); | |
| List<Closeable> toClose = new ArrayList<>(2); |
Tackled in 680edf1
...a-streams/src/internalClusterTest/java/org/elasticsearch/datastreams/TSDBSyntheticIdsIT.java
tlrx
left a comment
LGTM, nice work @fcofdez !
Let's wait for @martijnvg approval too before merging.
This PR integrates ES93BloomFilterStoredFieldsFormat with the synthetic _id lookups. For that, it introduces a new set of Codecs meant to be used only by TIME_SERIES indices. These new codecs are necessary to cover the case when the codec is loaded through SPI (i.e. after a shard relocation or node restarts).

The new codecs just wrap the existing codecs and extend them with the necessary plumbing to populate the bloom filter during indexing.
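
For readers unfamiliar with the pattern, here is a minimal, standalone sketch of the idea discussed in this PR: a bloom filter is populated with ids at indexing time, and an id lookup consults it before opening the (lazily created) expensive delegate. This is not the PR's code and uses no Lucene or Elasticsearch classes. SimpleBloomFilter, GuardedIdLookup, the hashing scheme and the map standing in for the terms dictionary are all hypothetical names and assumptions chosen for illustration.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

public class BloomGuardedLookupSketch {

    /** Hypothetical, simplified bloom filter; not ES93BloomFilterStoredFieldsFormat. */
    public static final class SimpleBloomFilter {
        private static final int NUM_HASHES = 3;
        private final BitSet bits;
        private final int numBits;

        public SimpleBloomFilter(int numBits) {
            this.numBits = numBits;
            this.bits = new BitSet(numBits);
        }

        /** Called for every id at "indexing" time. */
        public void add(String id) {
            for (int i = 0; i < NUM_HASHES; i++) {
                bits.set(position(id, i));
            }
        }

        /** False means the id is definitely absent; true means it may be present. */
        public boolean mayContain(String id) {
            for (int i = 0; i < NUM_HASHES; i++) {
                if (bits.get(position(id, i)) == false) {
                    return false;
                }
            }
            return true;
        }

        // Double hashing: derive several bit positions from a single hash code.
        private int position(String id, int i) {
            int h1 = id.hashCode();
            int h2 = (h1 >>> 16) | 1;
            return Math.floorMod(h1 + i * h2, numBits);
        }
    }

    /** Checks the filter first and opens the expensive delegate only when needed. */
    public static final class GuardedIdLookup {
        private final SimpleBloomFilter filter;
        private final Supplier<Map<String, Integer>> delegateSupplier;
        private Map<String, Integer> delegate; // created lazily, like the lazy TermsEnum

        public GuardedIdLookup(SimpleBloomFilter filter, Supplier<Map<String, Integer>> delegateSupplier) {
            this.filter = filter;
            this.delegateSupplier = delegateSupplier;
        }

        /** Returns the doc id for the given id, or null if it is not present. */
        public Integer lookup(String id) {
            if (filter.mayContain(id) == false) {
                return null; // short-circuit: the delegate is never touched
            }
            if (delegate == null) {
                delegate = delegateSupplier.get();
            }
            return delegate.get(id);
        }

        public boolean delegateOpened() {
            return delegate != null;
        }
    }

    public static void main(String[] args) {
        // The map stands in for a segment's id -> doc id terms dictionary.
        Map<String, Integer> segment = new HashMap<>();
        segment.put("ts-1", 0);
        segment.put("ts-2", 1);

        SimpleBloomFilter filter = new SimpleBloomFilter(1024);
        for (String id : segment.keySet()) {
            filter.add(id);
        }

        GuardedIdLookup lookup = new GuardedIdLookup(filter, () -> segment);
        System.out.println("ts-1 -> " + lookup.lookup("ts-1"));
    }
}
```

A bloom filter never reports a false negative, so skipping the delegate when mayContain returns false is safe. A false positive only costs one unnecessary delegate lookup, which still returns the correct answer.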