String field mappings and fielddata / doc values settings #12394
+1 One issue with the default dynamic mappings that you propose is that we would fail at indexing large string fields, given that the keyword mapping could generate large tokens while Lucene refuses to index tokens that are greater than 32k. So we might want to set
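For context on that limit: Lucene rejects any single term longer than 32766 bytes once UTF-8 encoded, and the cap applies to encoded bytes rather than characters, so multi-byte text hits it sooner than the character count suggests. A minimal Python sketch of the check (the constant matches Lucene's limit; the helper name is my own):

```python
# Lucene refuses any single term whose UTF-8 encoding exceeds this many
# bytes (its "immense term" limit). A keyword-style field indexes the whole
# string as one token, so long values trip this even when the character
# count looks modest. Constant value matches Lucene; helper name is mine.
MAX_TERM_BYTES = 32766

def fits_in_one_token(value: str) -> bool:
    """True if `value` could be indexed as a single keyword token."""
    return len(value.encode("utf-8")) <= MAX_TERM_BYTES

print(fits_in_one_token("a" * 32766))       # ASCII, exactly at the limit: True
print(fits_in_one_token("\u00e9" * 20000))  # 2-byte chars -> 40000 bytes: False
```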
++ I meant to mention that. I'll update the description and add it
+1
+1 !!!
This is a big win. It makes strings much clearer, and should make for faster default behaviour!
I am ±0 on this one. I think the change is useful but I feel like it's solving the wrong problem. I think what we really need is a mechanism to define custom types. So, users could define "text" the way they want, with reasonable defaults, in a single place, and then just refer to this type everywhere in a mapping. We can still pre-populate a set of default custom types such as the "text" and "keyword" proposed above, but this should be done through the same mechanism that is available to users, who should be able to redefine these custom types.
+1 to this issue, it will make string settings much easier for users to understand.
Huge +1. This simplifies so much and it does what you want out of the box. @imotov I think that's an interesting point, but this significantly improves what is currently happening/allowed with not_analyzed strings (e.g., adding a filter). Perhaps your idea alone should be a separate feature? I like the concept of globally controlled type defaults.
+1 on this, I think if implemented properly it could lead to possibly adding something like @imotov suggested in the future, but for now just starting with these two is a good idea.
+1
+[insert some ridiculously large number]
i hope so! |
@clintongormley What is the current thinking for:
Is there a way to know if a token filter outputs more than a single value? |
Hi @mattweber - long time! @rmuir can confirm but I believe it is not a problem, in the same way that a numeric field can have multiple values. What we need to avoid is proper full text tokenization.
@clintongormley Thanks. I was thinking token filters like (edge)ngram and word delimiter might be an issue. Probably rare though.
No, we don't need to allow filters that have multiple values. That is really really trappy! Yes, in Lucene we annotate filters that are "safe". This can also be used to make things like wildcard "analysis" more intuitive. But Elasticsearch does not make use of this right now.
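A toy illustration (plain Python, not Elasticsearch code) of why one-to-many filters are trappy on keyword fields: a filter like edge-ngram fans a single stored value out into several index terms, so a terms aggregation would report terms the user never wrote, while a one-to-one filter like lowercase keeps exactly one term per value.

```python
def lowercase(token: str) -> list[str]:
    # One-to-one filter: always exactly one output token per input token.
    return [token.lower()]

def edge_ngrams(token: str, min_gram: int = 2) -> list[str]:
    # One-to-many filter: a single input token fans out into many terms.
    return [token[:i] for i in range(min_gram, len(token) + 1)]

print(lowercase("York"))    # one value in, one term out
print(edge_ngrams("york"))  # one value in, three terms out
```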
😍 +1 |
+1 |
The plan is to disable aggs on analyzed fields by default and to throw an exception if the user tries to do so. I believe this is the right thing to do, as it is much more common for an unaware user to sort/aggregate on the wrong field (full text instead of raw) and blow up their heap. Aggs can be enabled on analyzed fields on a live index by updating the mappings, so it is quite easy for users who know what they are doing to resolve the situation to their liking.
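For reference, in the 5.x releases this ended up as a boolean `fielddata` setting on `text` fields, which can be flipped on a live index through the mapping API. A sketch, with the index, type, and field names invented for the example:

```
PUT my_index/_mapping/my_type
{
  "properties": {
    "my_text_field": {
      "type": "text",
      "fielddata": true
    }
  }
}
```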
Index time boosting and aggs are unrelated. This is actually to do with whether field length norms are written for a not_analyzed field or not. Field length norms are for full text relevance calculations (but index time boosting is hacked in by adjusting the norm). I would argue that enabling norms on keyword (not_analyzed) fields counts as spooky action at a distance.
Actually most fields we have do, for instance numerics. Maybe we could look into applying mapping boosts as query-time boosts instead of index-time boosts, so that they would not implicitly enable norms? (This would not change how
that could work
What would be the solution then? I see two use cases for index time boosting: the first one is to boost documents independently in the same field, and the other one is to boost fields inside a document independently when multi fields are indexed into the same field (the _all case). The first use case is implemented via the norms and the second one via the payloads. I don't see how we could implement it at query time if we don't index the boost value. @jpountz am I missing something?
That sounds right to me! The issue I think is that some users might also be using mapping boosts as a way to tell elasticsearch that some fields are more important for relevance than others (for multi-field queries). But this is more a job for query-time boosting I think, hence the proposal to not apply the mapping boosts via Field.setBoost like today but in MappedFieldType.termQuery via a BoostQuery?
Should this be marked 5.0.0 and breaking now? @clintongormley
+1
I understand the split, but I feel it doesn't support a lot of useful options. I have several use cases for using aggregations on the result of analysis. Currently it is impossible to do that without resorting to using fielddata (which I would really like to avoid) or storing the field twice and using ingest-time denormalization (essentially duplicating the job of the analyzer), which doesn't seem clean.
Yes but this will likely be implemented in 5.x rather than 5.0.
ICU normalization is typically the kind of thing we want to allow with this keyword
…pcoming elasticsearch 5.0. Elasticsearch 5.0 has refactored mappings as described in elastic/elasticsearch#12394. Most notably the string field has been replaced by the `text` and `keyword` fields.
This has been implemented with some differences to the original proposal:
I wonder why running a terms aggregation on a keyword field with doc_values enabled consumes fielddata cache? This was unexpected for me. It is easy to reproduce with ES version 5.6.4 or 6.2.3. Can anyone help me? @jpountz @clintongormley @elasticmachine
This is because of global ordinal maps.
This issue addresses a few topics:

* Splitting `string` fields out into `text` and `keyword` fields (Rethink string versus not_analyzed string mappings and support #11901)
* Deprecating fielddata for fields that support doc values
* Simplifying the `fielddata` and `doc_values` settings (Improve fielddata mappings #8693) and the `norms` settings
* Good out-of-the-box dynamic mappings for string fields

## `string` → `text`/`keyword`
Today, we use `string` both for full-text and for structured keywords. We don't support doc-values on `analyzed` string fields, which means that strings which are essentially keywords (but eg need to be lowercased) cannot use doc-values.

Proposal:

* Deprecate `string` fields
* Add `text` fields, which support the full analysis chain and don't support doc-values
* Add `keyword` fields, which support only the `keyword` tokenizer, and have doc-values enabled by default
* Change `index` to accept `true`|`false`
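Under this proposal, a mapping might read something like the following (field names and the analyzer choice are invented for illustration; this is a sketch of the proposal as written, not the final implemented syntax):

```json
{
  "properties": {
    "body":   { "type": "text",    "analyzer": "standard" },
    "status": { "type": "keyword" }
  }
}
```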
Question: Should `keyword` fields allow token filters that introduce new tokens?

## Deprecating fielddata for fields that support doc values
In-memory fielddata is limited by the size of the heap, and has been one of the biggest pain-points for users. Doc-values are slightly slower but: (1) don't suffer from the same latency as fielddata, (2) are not limited by heap size, (3) don't impact garbage collection, (4) allow much greater scaling.
All fields that support doc values already have them enabled by default.
Proposal:

* Deprecate in-memory fielddata on all fields that support doc values
The question arises: what happens if the user disables doc values then decides that actually they DO want to aggregate on that field after all? The answer is the same as if they had set a field to `index: false` - they have to reindex.

## Fielddata and doc values settings
Today we have these settings:

* `doc_values`: `true`|`false`
* `fielddata.format`: `disabled`|`doc_values`|`paged_bytes`|`array`
* `fielddata.loading`: `lazy`|`eager`|`eager_global_ordinals`
* `fielddata.filters`: `frequency:{}`, `regex:{}`
These become a lot simpler if we deprecate fielddata for all but analyzed string fields.
Proposal for fields that support doc values:

* `doc_values`: `true` (default) | `false`
* `global_ordinals`: `lazy` (default) | `eager`

Proposal for analyzed string fields:

* `fielddata`: `disabled` (default) | `lazy` | `eager`
* `global_ordinals`: `lazy` (default) | `eager`
* `fielddata.filters`: `frequency:{}`, `regex:{}`
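Put together, the proposed settings would look something like this in a mapping (field names invented; this sketches the proposal as written above, and the shipped 5.0 syntax differs in places, e.g. `fielddata` became a boolean on `text` fields):

```json
{
  "properties": {
    "tags": {
      "type": "keyword",
      "doc_values": true,
      "global_ordinals": "lazy"
    },
    "body": {
      "type": "text",
      "fielddata": "disabled"
    }
  }
}
```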
If, in the future, we can automatically figure out which global ordinals need to be built eagerly, then we can remove the `global_ordinals` setting.

## Norms settings
Similar to the above, we have:

* `norms.enabled`: `true`|`false`
* `norms.loading`: `lazy`|`eager`
In Lucene 5.3, norms are disk based, so the lazy/eager issue is less important (eager in this case would mean force-loading the norms into the file system cache, a decision which we can probably make automatically in the future).
Proposal:

* `norms`: `true`|`false`, with norms kept only for `text` fields

## Good out-of-the-box dynamic mappings for string fields
Today, when we detect a new string field, we add it as an `analyzed` `string`, with `lazy` fielddata loading enabled. While this allows users to get going with full text search, sorting and aggregations (with limitations, eg `new` + `york`), it's a poor default for heap usage.

Proposal:
Add a `text` main field (with fielddata loading disabled) and a `keyword` multi-field by default.
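The example mappings proposed here can be sketched from what Elasticsearch 5.0 actually shipped: a dynamically mapped string gets a `text` main field with a `keyword` sub-field, where the `ignore_above: 256` guard on the sub-field addresses the 32k-token concern raised earlier in the thread (syntax per the 5.0 defaults as I understand them):

```json
{
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}
```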