Contributor

@fcofdez fcofdez commented Nov 24, 2025

This PR integrates ES93BloomFilterStoredFieldsFormat with the synthetic _id lookups. For that, it introduces a new set of codecs meant to be used only by TIME_SERIES indices. These new codecs are necessary to cover the case where the codec is loaded through SPI (i.e. after a shard relocation or a node restart).

The new codecs just wrap the existing codecs and extend them with the necessary plumbing to populate the bloom filter during indexing.
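
As a rough illustration of the wrapping described above, here is a minimal sketch in plain Java. It is not the code in this PR: IdFieldWriter, PlainIdWriter, ToyBloomFilter and BloomPopulatingWriter are hypothetical stand-ins for the Lucene and Elasticsearch types, and the two-probe hashing is an arbitrary choice made for the example.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical stand-in for a stored-fields writer; not the Lucene API.
interface IdFieldWriter {
    void writeId(String id);
}

// Hypothetical "existing" writer that only stores the ids.
final class PlainIdWriter implements IdFieldWriter {
    final List<String> stored = new ArrayList<>();

    @Override
    public void writeId(String id) {
        stored.add(id);
    }
}

// Toy bloom filter that sets two bit positions per id.
final class ToyBloomFilter {
    private final BitSet bits;
    private final int size;

    ToyBloomFilter(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    void add(String id) {
        bits.set(slot(id, 0));
        bits.set(slot(id, 1));
    }

    // May return true for an id that was never added, never false for one that was.
    boolean mayContain(String id) {
        return bits.get(slot(id, 0)) && bits.get(slot(id, 1));
    }

    private int slot(String id, int seed) {
        int h = id.hashCode() * 31 + seed * 0x9E3779B9;
        return Math.floorMod(h, size);
    }
}

// Wrapper: forwards every write to the delegate and also records the id in the bloom filter.
final class BloomPopulatingWriter implements IdFieldWriter {
    private final IdFieldWriter delegate;
    private final ToyBloomFilter bloomFilter;

    BloomPopulatingWriter(IdFieldWriter delegate, ToyBloomFilter bloomFilter) {
        this.delegate = delegate;
        this.bloomFilter = bloomFilter;
    }

    @Override
    public void writeId(String id) {
        bloomFilter.add(id);
        delegate.writeId(id);
    }
}
```

The delegate stays unaware of the bloom filter, so in this sketch the wrapper can be left out without changing what the delegate stores.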

@elasticsearchmachine
Collaborator

Hi @fcofdez, I've created a changelog YAML for you.

Member

@tlrx tlrx left a comment

I need to take a deeper look but overall approach looks good.

Property.Final
);

public static final boolean USE_STORED_FIELDS_BLOOM_FILTER_FOR_ID_FEATURE_FLAG = new FeatureFlag("stored_field_bloom_filter")
Member

Do you think we need another feature flag, or could it be folded with the existing one for synthetic id?

I'm not sure it makes a lot of sense to test one without the other, but maybe I'm missing a point.

Contributor Author

My idea is that the bloom filter is an optimization on top of the synthetic id. But happy to get rid of the feature flag and the index.mapping.use_stored_field_bloom_filter_id index setting if we think that's redundant. It'll simplify the code a bit.

Member

> My idea is that the bloom filter is an optimization on top of the synthetic id

I agree, but I think we won't use synthetic ids without a bloom filter on top of them, and having two feature flags complicates the code. If that's OK, I would prefer to use only one feature flag for both.

I won't block the PR for this so if you want to keep it that's OK too.

Contributor Author

I got rid of the setting and feature flag in 799fb3a


}

private enum StorageMode {
Member

I find this storage mode a bit confusing. Maybe a useBloomFilterSyntheticId local variable would be simpler?

Contributor Author

I got rid of it in a72a66a

@fcofdez fcofdez marked this pull request as ready for review November 25, 2025 13:07
@fcofdez fcofdez requested a review from a team as a code owner November 25, 2025 13:07
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@fcofdez fcofdez requested a review from kkrik-es November 25, 2025 13:23
private final TSDBStoredFieldsFormat storedFieldsFormat;

TSDBCodecWithSyntheticId(String name, Codec delegate, BigArrays bigArrays) {
    super(name, new TSDBSyntheticIdCodec(delegate));
Contributor Author

@fcofdez fcofdez Nov 25, 2025

I'm planning to incorporate the code from TSDBSyntheticIdCodec into this class in a follow-up PR. But I wanted to keep the change size under control.

Member

@tlrx tlrx left a comment

Great work @fcofdez! I only left minor comments; the direction makes sense to me and we can improve in follow-ups. I'd like to have Martijn or Alan review the codec part before merging, though.

boolean useSyntheticId = IndexSettings.TSDB_SYNTHETIC_ID_FEATURE_FLAG
    && mapperService != null
    && mapperService.getIndexSettings().useTimeSeriesSyntheticId()
    && mapperService.getIndexSettings().getMode() == IndexMode.TIME_SERIES;
Member

mapperService.getIndexSettings().useTimeSeriesSyntheticId() already ensures that the index is a time-series index and that the feature flag is enabled.

Contributor Author

Simplified in d0d94ef
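
For readers following this thread, the simplification can be sketched as below. This is not the code from d0d94ef: SketchIndexMode, SketchIndexSettings and SyntheticIdCheck are hypothetical stand-ins, and the sketch assumes, as stated in the review comment, that useTimeSeriesSyntheticId() already covers the feature flag and the index mode.

```java
// Hypothetical stand-ins; these are not the Elasticsearch classes.
enum SketchIndexMode { STANDARD, TIME_SERIES }

final class SketchIndexSettings {
    private final boolean featureFlagEnabled;
    private final SketchIndexMode mode;
    private final boolean syntheticIdSetting;

    SketchIndexSettings(boolean featureFlagEnabled, SketchIndexMode mode, boolean syntheticIdSetting) {
        this.featureFlagEnabled = featureFlagEnabled;
        this.mode = mode;
        this.syntheticIdSetting = syntheticIdSetting;
    }

    // Assumption: the accessor folds the feature flag and index mode checks into one place.
    boolean useTimeSeriesSyntheticId() {
        return featureFlagEnabled && mode == SketchIndexMode.TIME_SERIES && syntheticIdSetting;
    }
}

final class SyntheticIdCheck {
    // Only the null guard remains at the call site; the other checks live in the accessor.
    static boolean useSyntheticId(SketchIndexSettings settings) {
        return settings != null && settings.useTimeSeriesSyntheticId();
    }
}
```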

private final TSDBStoredFieldsFormat storedFieldsFormat;

TSDBCodecWithSyntheticId(String name, Codec delegate, BigArrays bigArrays) {
    super(name, new TSDBSyntheticIdCodec(delegate));
Member

We can merge TSDBSyntheticIdCodec and TSDBCodecWithSyntheticId together in a follow-up.

Member

Ok, I just saw your #138515 (comment) 👍

Comment on lines 53 to 65
return new FilterLeafReader.FilterTerms(delegate.terms(field)) {
    @Override
    public TermsEnum iterator() throws IOException {
        return new LazyFilterTermsEnum() {
            private TermsEnum delegate;

            @Override
            protected TermsEnum getDelegate() throws IOException {
                if (delegate == null) {
                    delegate = in.iterator();
                }
                return delegate;
            }
Member

nit: I was confused by the two delegate fields (the one in the lazy terms enum and the one in the bloom filter) and by what the in field was referring to.

Maybe something like this would help?

        final Terms terms = delegate.terms(field);
        return new FilterLeafReader.FilterTerms(terms) {
            @Override
            public TermsEnum iterator() throws IOException {
                return new LazyFilterTermsEnum() {
                    private TermsEnum termsEnum;

                    @Override
                    protected TermsEnum getDelegate() throws IOException {
                        if (termsEnum == null) {
                            termsEnum = terms.iterator();
                        }
                        return termsEnum;
                    }

                    @Override
                    public boolean seekExact(BytesRef text) throws IOException {
                        if (bloomFilter.mayContainTerm(field, text) == false) {
                            return false;
                        }
                        return getDelegate().seekExact(text);
                    }
                };
            }
        };

Contributor Author

Good idea, changed in 92a6daa
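
The lookup pattern discussed in this thread (consult the bloom filter first, and only open the underlying terms enum when the filter cannot rule the id out) can be sketched without Lucene as follows. IdLookup, SortedIdLookup and BloomGuardedLookup are hypothetical names, the bloom filter is reduced to a Predicate, and the delegateCreations counter exists only to make the laziness observable in this example.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical, simplified stand-in for a terms enum; not the Lucene API.
interface IdLookup {
    boolean seekExact(String id);
}

// Stand-in for the expensive lookup that we would rather not open.
final class SortedIdLookup implements IdLookup {
    private final Set<String> ids;

    SortedIdLookup(Set<String> ids) {
        this.ids = ids;
    }

    @Override
    public boolean seekExact(String id) {
        return ids.contains(id);
    }
}

final class BloomGuardedLookup implements IdLookup {
    private final Predicate<String> mayContain;      // stands in for the bloom filter
    private final Supplier<IdLookup> lookupFactory;  // opens the expensive lookup
    private IdLookup lookup;                         // created lazily on first use
    int delegateCreations = 0;                       // for illustration only

    BloomGuardedLookup(Predicate<String> mayContain, Supplier<IdLookup> lookupFactory) {
        this.mayContain = mayContain;
        this.lookupFactory = lookupFactory;
    }

    private IdLookup getDelegate() {
        if (lookup == null) {
            lookup = lookupFactory.get();
            delegateCreations++;
        }
        return lookup;
    }

    @Override
    public boolean seekExact(String id) {
        if (mayContain.test(id) == false) {
            // Definite miss: the expensive lookup is never opened.
            return false;
        }
        // Possible hit (or a false positive): confirm against the real lookup.
        return getDelegate().seekExact(id);
    }
}
```

A real bloom filter can report false positives, which is why a positive answer still has to be confirmed by the delegate; a negative answer is final.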

var legacyBestSpeedCodec = new LegacyPerFieldMapperCodec(Lucene103Codec.Mode.BEST_SPEED, mapperService, bigArrays);
if (ZSTD_STORED_FIELDS_FEATURE_FLAG) {
    codecs.put(DEFAULT_CODEC, new PerFieldMapperCodec(Zstd814StoredFieldsFormat.Mode.BEST_SPEED, mapperService, bigArrays));
    PerFieldMapperCodec defaultZstdCodec = new PerFieldMapperCodec(
Member

If we want to reduce the scope of this change, we could create our own default_code_with_synthetic_id and hard-code it in INDEX_CODEC_SETTING for all time-series indices with use_synthetic_id enabled.

Here we go for the complete solution immediately, which I'm OK with too.

Contributor Author

I don't have a strong opinion about this. I'm OK with both approaches. The downside of an extra codec is that we need to maintain it indefinitely, whereas with this change, as long as the feature flag is off, we keep the current behaviour.

 * @see StoredFieldsFormat
 */
public class TSDBStoredFieldsFormat extends StoredFieldsFormat {
    private final StoredFieldsFormat storedFieldsFormat;
Member

nit: I would call this delegate

Contributor Author

Tackled in e1cbce6


TSDBStoredFieldsWriter(Directory directory, SegmentInfo si, IOContext context) throws IOException {
    boolean success = false;
    List<Closeable> toClose = new ArrayList<>();
Member

nit:

Suggested change
- List<Closeable> toClose = new ArrayList<>();
+ List<Closeable> toClose = new ArrayList<>(2);

Contributor Author

Tackled in 680edf1


TSDBStoredFieldsReader(Directory directory, SegmentInfo si, FieldInfos fn, IOContext context) throws IOException {
    boolean success = false;
    List<Closeable> toClose = new ArrayList<>();
Member

Suggested change
- List<Closeable> toClose = new ArrayList<>();
+ List<Closeable> toClose = new ArrayList<>(2);

Contributor Author

Tackled in 680edf1
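
For context on the success flag and toClose list in the two constructors above, here is a minimal sketch of that cleanup pattern in plain Java. It is not the PR's code: TwoPartResource, ResourceOpener and closeQuietly are hypothetical names, and the sketch assumes the constructor opens exactly two resources, which is what the suggested initial capacity of 2 implies.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical factory for a resource that may fail to open.
interface ResourceOpener {
    Closeable open() throws IOException;
}

final class TwoPartResource implements Closeable {
    private final Closeable first;
    private final Closeable second;

    TwoPartResource(ResourceOpener firstOpener, ResourceOpener secondOpener) throws IOException {
        boolean success = false;
        // Exactly two resources are tracked, so the list is presized to 2.
        List<Closeable> toClose = new ArrayList<>(2);
        try {
            first = firstOpener.open();
            toClose.add(first);
            second = secondOpener.open();
            toClose.add(second);
            success = true;
        } finally {
            if (success == false) {
                // Release whatever was opened before the failure.
                closeQuietly(toClose);
            }
        }
    }

    private static void closeQuietly(List<Closeable> resources) {
        for (Closeable resource : resources) {
            try {
                resource.close();
            } catch (IOException e) {
                // Ignored so that the original failure propagates.
            }
        }
    }

    @Override
    public void close() throws IOException {
        try {
            first.close();
        } finally {
            second.close();
        }
    }
}
```

If the second open fails, the first resource is closed before the exception leaves the constructor, so nothing leaks.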

Member

@tlrx tlrx left a comment

LGTM, nice work @fcofdez!

Let's wait for @martijnvg's approval too before merging.
