Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add setting to ignore dynamic fields when field limit is reached #96235

Merged

Conversation

felixbarny
Copy link
Member

@felixbarny felixbarny commented May 20, 2023

Adds a new index.mapping.total_fields.ignore_dynamic_beyond_limit index setting.

When set to true, new fields are added to the mapping as long as the field limit (index.mapping.total_fields.limit) is not exceeded. Fields that would exceed the limit are not added to the mapping, similar to dynamic: false. Ignored fields are added to the _ignored metadata field.

Relates to #89911

To make this easier to review, this is split into the following PRs:

Related but not a prerequisite:

@github-actions
Copy link

Documentation preview:

@elasticsearchmachine elasticsearchmachine added external-contributor Pull request authored by a developer outside the Elasticsearch team needs:triage Requires assignment of a team area label v8.9.0 labels May 20, 2023
@felixbarny felixbarny changed the title Ignore dynamic beyond limit Add setting to ignore dynamic fields when field limit is reached May 20, 2023
@felixbarny felixbarny added >enhancement :Search/Mapping Index mappings, including merging and defining field types labels May 20, 2023
@elasticsearchmachine
Copy link
Collaborator

Hi @felixbarny, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine added Team:Search Meta label for search team and removed needs:triage Requires assignment of a team area label labels May 20, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@ruflin
Copy link
Member

ruflin commented May 22, 2023

I like the general approach here. It makes it clear, the limit is for dynamic mappings. If the overall limit is 1024 and the user has specified 1020 fields, 4 are open for dynamic fields. Every time an index rolls over, the game begins agains. If a user by accident created too many fields in an index under a data stream but then has fixed ingestion, a rollover resets the counter for the new index.

For the future

I'm starting to wonder, if index.mapping.total_fields.limit should actually only apply to dynamic fields. If a user puts predefined fields in, there is likely a reason for it and Elasticsearch should not block this users. Instead field limits are only for the cases of dynamic fields. Having this config option on by default and have a dynamic field limit only would also solve the problem. But likely we can't differentiate after creation, what a dynamic and what a static field is.

@felixbarny
Copy link
Member Author

I'm starting to wonder, if index.mapping.total_fields.limit should actually only apply to dynamic fields.

Makes sense to me. This has also been discussed in #89911

@javanna
Copy link
Member

javanna commented May 22, 2023

When we discussed this last with the team we talked about introducing a new dynamic mode rather than a new index settings that affects how existing dynamic modes work. Also, we said we'd want to better understand how users are going to consume the additional info added to the _ignored field. It is not so straight-forward to understand the reason why a field name is added to the _ignored field, and also the field is not yet aggregatable. We'd like to figure out the overall plan before committing individual changes that may be incremental steps but not necessarily in the desired direction.

@ruflin
Copy link
Member

ruflin commented May 22, 2023

We'd like to figure out the overall plan before committing individual changes that may be incremental steps but not necessarily in the desired direction.

We are one of the main requestors of this feature. For us, this incremental change solves the problem for now. I agree, eventually there should be a more holistic overhaul on how all the pieces work together but this should not block the addition of this config option, especially as it is rather small change.

we said we'd want to better understand how users are going to consume the additional info added to the _ignored field. I

The tradeoff we are making here is, dropping documents vs having the info available. We should not drop documents and optimise for this first. We can work in a second step to make it easier to retrieve to the user.

@felixbarny
Copy link
Member Author

It is not so straight-forward to understand the reason why a field name is added to the _ignored field,

Agree but that's a problem that already exists today (for example, is _ignored set because if ignore_above or ignore_malformed?). I acknowledge that this change exacerbates the problem by adding more reasons for why the _ignored field exists but this is an orthogonal problem that we'll need to look at regardless.

I don't see why we should block progress on this until we have a way to store the _ignored reason.

and also the field is not yet aggregatable.

Again, this is orthogonal and should be handled via #59946

@felixbarny
Copy link
Member Author

@javanna do you agree on the approach of just relying on _ignored for now and look at how to improve the experience around the _ignored field as a separate step that can come later? I think even with the limitations around _ignored this improvement already greatly simplifies the experience for users when dealing with dynamic fields. Progress over perfection 🙂

When we discussed this last with the team we talked about introducing a new dynamic mode rather than a new index settings that affects how existing dynamic modes work.

I've tried that, too in this PR: #96233
While that works, it would require two new modes for dynamic for adding runtime fields and concrete fields until the limit. Therefore, I found the setting a bit more elegant as it applies to both true and runtime. But I don't have strong feelings here. If you prefer adding more modes to the dynamic option, I can change it.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@javanna javanna dismissed stale reviews from benwtrent and piergm January 25, 2024 15:35

Code has changed considerably since it was reviewed, another round needed.

felixbarny added a commit to felixbarny/elasticsearch that referenced this pull request Jan 25, 2024
This is in preparation of elastic#96235.
At the moment, there's no difference between MAPPING_AUTO_UPDATE and MAPPING_AUTO_UPDATE_PREFLIGHT.

After the other PR is merged, when the merge reason is auto-update and if ignore_dynamic_beyond_limit is set,
the merge process will only add dynamically mapped fields until the field limit is reached and ignores additional ones.
elasticsearchmachine pushed a commit that referenced this pull request Jan 26, 2024
This is in preparation of
#96235. At the moment,
there's no difference between MAPPING_AUTO_UPDATE and
MAPPING_AUTO_UPDATE_PREFLIGHT.

After the other PR is merged, when the merge reason is auto-update and
if ignore_dynamic_beyond_limit is set, the merge process will only add
dynamically mapped fields until the field limit is reached and ignores
additional ones.
Copy link
Member

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments, this is not far at all!

Copy link
Member

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did another round. The main bit that's left open is how we expose the new behaviour I will get back to you as soon as that's discussed and decided with the team.

Copy link
Member

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@felixbarny felixbarny added the auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Feb 1, 2024
@felixbarny
Copy link
Member Author

felixbarny commented Feb 2, 2024

@elasticmachine update branch

@elasticsearchmachine elasticsearchmachine merged commit f642b8a into elastic:main Feb 2, 2024
16 checks passed
@felixbarny felixbarny deleted the ignore-dynamic-beyond-limit branch February 2, 2024 10:54
elasticsearchmachine pushed a commit that referenced this pull request Feb 6, 2024
Today, we're counting all mappers, including mappers for subfields that
aren't explicitly added to the mapping towards the field limit.

This means that some field types, such as `search_as_you_type` or
`percolator` count as more than one field even though that's not
apparent to users as they're just defining them as a single field in the
mapping.

This change makes it so that each field mapper only counts as one. We're
still counting multi-fields.

This makes it easier to understand for users why the field limit is hit.

~In addition to that, it also simplifies
#96235 as it makes the
implementation of `Mapper.Builder#getTotalFieldsCount` much easier and
easier to align with `Mapper#getTotalFieldsCount`. This reduces the risk
of over- or under-estimating the field count of a `Mapper.Builder` in
`DocumentParserContext#addDynamicMapper`, which in turn reduces the risk
of data loss due to the issue described here:
#96235 (comment)

*Edit: due to #103865, we
don't need an implementation of `getTotalFieldsCount` or `mapperSize` in
`Mapper.Builder`. Still, this PR more closely aligns
`Mapper#getTotalFieldsCount` with `MappingLookup#getTotalFieldsCount`,
which  `DocumentParserContext#addDynamicMapper` uses to determine
whether the field limit is hit*

A potential risk of this is that we're now effectively allowing more
fields in the mapping. It may be surprising to users that more fields
can be added to a mapping. Although, I'd not expect negative
consequences from that. Generally, I'd  expect users to be happy about
any change that reduces the risk of data loss.

We could also think about whether to apply the new counting logic only
to new indices (depending on the `IndexVersion`). However, that would
add more complexity and I'm not convinced about the value. We'd then
need to maintain two different ways of counting fields and also require
passing in the `IndexVersion` to `MappingLookup` which previously didn't
require the `IndexVersion`.

This PR is meant as a conversation starter. It would also simplify
#96235 but I don't think
this blocks that PR in any way.

I'm curious about the opinion of @javanna and @jpountz on this.
felixbarny added a commit to felixbarny/elasticsearch that referenced this pull request Feb 8, 2024
Today, we're counting all mappers, including mappers for subfields that
aren't explicitly added to the mapping towards the field limit.

This means that some field types, such as `search_as_you_type` or
`percolator` count as more than one field even though that's not
apparent to users as they're just defining them as a single field in the
mapping.

This change makes it so that each field mapper only counts as one. We're
still counting multi-fields.

This makes it easier to understand for users why the field limit is hit.

~In addition to that, it also simplifies
elastic#96235 as it makes the
implementation of `Mapper.Builder#getTotalFieldsCount` much easier and
easier to align with `Mapper#getTotalFieldsCount`. This reduces the risk
of over- or under-estimating the field count of a `Mapper.Builder` in
`DocumentParserContext#addDynamicMapper`, which in turn reduces the risk
of data loss due to the issue described here:
elastic#96235 (comment)

*Edit: due to elastic#103865, we
don't need an implementation of `getTotalFieldsCount` or `mapperSize` in
`Mapper.Builder`. Still, this PR more closely aligns
`Mapper#getTotalFieldsCount` with `MappingLookup#getTotalFieldsCount`,
which  `DocumentParserContext#addDynamicMapper` uses to determine
whether the field limit is hit*

A potential risk of this is that we're now effectively allowing more
fields in the mapping. It may be surprising to users that more fields
can be added to a mapping. Although, I'd not expect negative
consequences from that. Generally, I'd  expect users to be happy about
any change that reduces the risk of data loss.

We could also think about whether to apply the new counting logic only
to new indices (depending on the `IndexVersion`). However, that would
add more complexity and I'm not convinced about the value. We'd then
need to maintain two different ways of counting fields and also require
passing in the `IndexVersion` to `MappingLookup` which previously didn't
require the `IndexVersion`.

This PR is meant as a conversation starter. It would also simplify
elastic#96235 but I don't think
this blocks that PR in any way.

I'm curious about the opinion of @javanna and @jpountz on this.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >enhancement external-contributor Pull request authored by a developer outside the Elasticsearch team :Search/Mapping Index mappings, including merging and defining field types :StorageEngine/Logs You know, for Logs Team:Search Meta label for search team Team:StorageEngine test-update-serverless v8.13.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add setting to ignore dynamic fields when field limit is reached