Remove possibility for conflicting field definitions and ambiguous field resolution #8870

Closed
clintongormley opened this Issue Dec 10, 2014 · 22 comments

@clintongormley
Member

Fields with the same name in different types in the same index should have the same mapping. Previously, this has been advised as "best practice", but relying on advice has proved insufficient (see #4081 for the many issues that have resulted from allowing conflicting field definitions). Instead we need to enforce this in the API.

This issue (which replaces #4081) is a meta issue listing all of the changes that need to be made:

  • #8871 - Internally, field mappings should be stored at the index level, rather than at the type level. Types are essentially a way of grouping fields. While the mapping APIs will not change, an exception will be thrown if an attempt is made to map a field in one type in a way that conflicts with the mapping of a field with the same name in another type. The _parent field is the one exception - this field will remain at the type level.
  • #8872 Always require the full path for field names (wildcards still allowed), and remove the ability to prepend the full path with the _type.
  • #8874 Remove type-level analyzer, search_analyzer, index_analyzer settings.
  • #9279 Remove per-document _analyzer
  • #8143 Remove the ability to change the mapping for _uid, _id, _index, _type, _field_names.
  • #6730 Deprecate extracting _routing and _id from document fields, but it should still be possible to set _routing to required.
  • #8142 Always store _timestamp, _ttl, _routing, _parent as these are required for reindexing.
  • #8142 Should we also always require that the _source field should be stored? Disabling source prevents reindexing, updates, and out-of-the-box highlighting. Perhaps with the new compression options (#8863), the ability to disable the _source field is less useful?
  • #8875 Remove the _boost field (deprecated in v1.0.0.RC1).
  • #8877 Remove the ability to delete mappings.
  • #6677 Deprecate index_name and path settings
  • #9443 Allow the user to upgrade mappings removing old settings without requiring reindexing (where possible)
  • #9927 Throw an exception if a field mapping is invalid
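The rule proposed in #8871 can be illustrated with a minimal sketch. This is not Elasticsearch's implementation: `IndexFieldRegistry`, `MappingConflictError`, and `put_mapping` are hypothetical names invented for the example, and mappings are assumed to be plain dictionaries shaped like the `properties` section of a mapping. Fields are keyed by full path at the index level, and a conflicting definition from another type is rejected.

```python
class MappingConflictError(ValueError):
    """Raised when a field is mapped differently by two types of one index."""


class IndexFieldRegistry:
    """Hypothetical index-level store of field mappings, keyed by full path."""

    def __init__(self):
        # full field path -> (mapping dict, name of the type that defined it first)
        self._fields = {}

    def put_mapping(self, type_name, properties):
        """Register the fields of one type, rejecting conflicting definitions."""
        flat = dict(self._flatten(properties, ""))
        # Validate everything first so a failed call leaves the registry unchanged.
        for path, mapping in flat.items():
            existing = self._fields.get(path)
            if existing is not None and existing[0] != mapping:
                raise MappingConflictError(
                    f"field [{path}] in type [{type_name}] conflicts with "
                    f"the mapping already defined by type [{existing[1]}]"
                )
        for path, mapping in flat.items():
            self._fields.setdefault(path, (mapping, type_name))

    @staticmethod
    def _flatten(properties, prefix):
        """Yield (full_path, mapping) pairs, descending into object fields."""
        for name, mapping in properties.items():
            path = prefix + name
            if "properties" in mapping:
                yield from IndexFieldRegistry._flatten(mapping["properties"], path + ".")
            else:
                yield path, mapping
```

With this sketch, two types may both map `login` as a string, but a third type mapping `login` as a date raises `MappingConflictError`. Because only the full path identifies a field, `user.name` and `company.name` remain independent.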

Alternatives:

  • Fields with different types can be renamed to distinguish their purpose, eg login_name vs login_date.
  • Different types can be separated into different indices.
@OlegYch
OlegYch commented Dec 10, 2014

what about fields with different path (but same name) and different mapping in one type?

@clintongormley
Member

@OlegYch Since we will no longer support short names, only the full path is used to identify a field, so it is only the path that matters.

@OlegYch
OlegYch commented Dec 10, 2014

we are using 'mapping types' precisely to store different kinds of documents in the same index (so that we can use parent/child queries)
so far prefixing field path with _type worked fine for us
this change would mean that we would lose ability to use parent/child queries if there happens to be a conflict, as they would have to be in different indexes
the conflicts will probably be rare, and we could probably change documents schema if they arise, but a better way to prevent or diagnose the conflicts than an exception on put would be nice
i'm also wondering if there is no way to somehow append field type to its name internally (like one would do to resolve conflicts manually)?

@clintongormley
Member

the conflicts will probably be rare, and we could probably change documents schema if they arise, but a better way to prevent or diagnose the conflicts than an exception on put would be nice
i'm also wondering if there is no way to somehow append field type to its name internally (like one would do to resolve conflicts manually)?

As you said, conflicts are rare (for most people). Normally a field with the same name refers to the same type of data. The alternative (eg prefixing the field name with the type) would create much more sparsity in the index, and would impact every cross-type query (as multiple fields would need to be queried). Right now, we have opted for making the common case correct and efficient. However, we have left the APIs as they are in case, sometime in the future, we manage to figure out a cleverer way of handling conflicting field definitions.

@clintongormley clintongormley added the Meta label Dec 17, 2014
@rore
rore commented Dec 29, 2014

For the record, I want to raise here again our objection to the way this change is planned.

We have several use cases in which we have an index with custom types that have custom fields. Types are not pre-defined and fields are not pre-defined. There's high potential of fields with the same name under different types, including fields with the same name that have different field type (and it happens). We have a lot of this kind of data.

This modeling was aligned with the way Types were presented by Elasticsearch (and still are - in our last meetup Types were referred to by Boaz as equivalent to "tables").

So with this change we will be forced to either hard-code the type as a prefix for all fields ourselves, or encapsulate all documents with a root "type" node. It can be done but is ugly, requires patchy handling and also reindexing all data.

A better option that will allow such use cases is having a setting on the index level to configure field type isolation. So if we know that we need field type isolation (and we don't have cross-type searches and are willing to pay the overhead), we can set it to "type" level, and internally fields will be prefixed by the type.

@clintongormley
Member

Hi @rore

There's high potential of fields with the same name under different types, including fields with the same name that have different field type (and it happens). We have a lot of this kind of data.

If you have this, then your data is essentially broken today, and you can end up with incorrect results, exceptions, or even corrupt indices.

The first thing that we're trying to do is to make everything safe and predictable. We are leaving the mapping APIs as they are so that, in the future, we may be able to revisit this decision and provide more alternatives.

@shivangshah

Going on the record here, we have the exact same use case as @rore mentioned above. The details of the use case can be found here as well: http://stackoverflow.com/questions/29041509/field-names-with-the-same-name-across-types-having-different-index-type-in-elast/29053553#29053553

TL;DR: Our types and fields per each type are not regulated at the Application level. Customers can create new types and corresponding new fields dynamically (or remove them) and we can't control it.

The ideal scenario here will be: while indexing a document ES already asks for the type, so during indexing at least there shouldn't be any problem and ES should handle indexing the fields per type. However during searching, if a type is not provided but only the field name, it should pick up the default mapping for that index (maybe that will be another feature) and that's the type it will use to search.

@shivangshah

Also, we CANNOT have an index per type because you have no idea how many types are going to get created per customer (and of course you can have thousands of customers). So now, in that case, in one index you can have multiple "types". Now let's say a type is "customer" and another type is "Company". They both can easily have a field name called "random" which can be of different core types. Moving the types to "index" level essentially makes ALL the field names and their core types unique to the index. This essentially makes each index specific to each type (for fear of conflicting field names, especially when you don't know what customers are going to create), which means that the number of indexes is almost equal to the number of types. If a customer has 40 types and you have 1000 customers, that's 40K indexes. Almost impossible to maintain.

@kaidad
kaidad commented Mar 18, 2015

We are in the exact same position as @shivangshah and @rore. It really feels like the abstraction around doctype is completely broken if it doesn't provide isolation at the mapping level.

@AshwinJay

It came as a shock to us to see that this seemingly basic feature is broken. This is the equivalent of saying that the following 2 SQL statements in an RDBMS are not valid because the data type of time conflicts with another table:

CREATE TABLE employee_timesheet ( time datetime, ... );

CREATE TABLE stop_watch ( time int, ... );

@gondar
gondar commented Apr 16, 2015

@clintongormley The way you describe it, it seems that the default data model for logstash is broken then. In logstash multiple differing log types are all placed in the same index (per day). If you enforce that all the same names need to have the same type you are effectively enforcing the same schema for all different logs from different sources.

@jordansissel any comments on that?

@rore
rore commented Apr 16, 2015

@gondar Awesome point.

@polyfractal
Member

This is the equivalent of saying that the 2 SQL statements in an RDBMS are not valid because the data type of time conflicts with another table

@AshwinJay This is not actually equivalent, since RDBMS tables do not share schemas. You hear people relating tables to Elasticsearch types, because conceptually they are similar. But schematically, tables are closer to Elasticsearch indices.

There isn't really a good SQL example for Elasticsearch types. The equivalent is closer to creating two columns with the same name but different types inside a single table ... which RDBMS won't allow either. Elasticsearch has allowed it, but the behavior is trappy and dangerous at times.

However during searching, if a type is not provided but only the field name, it should pickup the default mapping for that index (maybe that will be another feature) and thats the type it will use to search.

This is where the trappy behavior comes into play (it's also the current behavior). There are two broad categories, aggregations and search.

  • Aggregations have their own "un-inversion" logic which will take a field's inverted index and hydrate it into memory. To be efficient (computationally and in memory), this process needs all data to be of the same type. E.g. numerics use delta compression and variable integer blocks, while strings use ordinal maps. You simply can't put two different data types into the same data structure. Presently field data will just explode if you try... you can see that many of the issues referencing this ticket deal with this very problem. But there is no good way to fix this. You'd have to hold separate data structures for different types, but that introduces gaps which ruin how the data structures work.
  • For search, queries are built based on the field's mappings. So if you have a field mapped as a long and query it with a Range, that decomposes to a NumericRangeQuery which is then used to search across all the docs. Now imagine that NumericRangeQuery encounters a string field during the process ... the options are to either ignore that field or just explode. Neither is a good choice, although omitting it is the "better" of the two (and current behavior). There is really no efficient way to "do the right thing" when there are multiple data types in a single field.

I think it's important to note that this isn't just trappy at search time, it can also break things at index time:

  • Doc values use different compression schemes for different data types as well, and Lucene enforces a single type-per-field for doc values so it can use a single compression scheme.
  • If norms are disabled on one field, Lucene will disable it for the other same-named fields.
  • Lucene enforces the "lowest-common-denominator" index options between two different fields, and will downgrade as necessary. So things like frequencies (scoring), positions (phrase queries) and offsets (postings highlighter) might be disabled depending on the conflict.

Please keep in mind that these changes aren't because we dislike the functionality, but because the functionality can be unsafe (and surprising in many cases).

They both can easily have a field name called "random" which can be of different core types.

I think the easy solution is to just prefix custom fields from a customer with some unique customer ID prefix? This is essentially the "fix" if Elasticsearch were to do it internally, except the "fix" makes the general-case much worse for most users.
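A minimal sketch of that prefixing approach, done on the application side before indexing. The function name `prefix_fields` and the `<prefix>_<field>` naming convention are assumptions made for illustration, not something Elasticsearch provides; metadata-style keys starting with an underscore are left untouched.

```python
def prefix_fields(doc, prefix):
    """Return a copy of doc with each top-level custom field renamed to '<prefix>_<name>'.

    Keys that start with an underscore (for example _id) are assumed to be
    metadata and are kept as they are.
    """
    return {
        (name if name.startswith("_") else f"{prefix}_{name}"): value
        for name, value in doc.items()
    }


# Two customers can now both use a field called "random" with different data
# types, because the stored field names no longer collide.
doc_a = prefix_fields({"_id": "1", "random": 3.14}, "cust42")
doc_b = prefix_fields({"_id": "2", "random": "some text"}, "cust77")
```

The trade-off is the one described above: every distinct prefix creates a distinct field, so the index becomes sparser and a query across customers has to name each prefixed field.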

If you enforce that all the same names need to have the same type you are effectively enforcing the same schema for all different logs from different sources.

@gondar I'm curious, how often do you have the same name but different types in log message extractions? E.g. how often is message a string in one log, but a float in another? I would have guessed that your Logstash pipeline would have normalized the values into a structured format that is consistent (even if coming from disparate sources)?

I don't have a ton of operational log experience myself, so this question is 100% curiosity :)

@shivangshah

@polyfractal Out of curiosity, I am really interested in knowing what your thoughts are on the usage of "type" in elasticsearch now, because essentially, per index, the building blocks are pretty much fields, and other than grouping them in "types" there is no use case (at least none that I can think of). Also, if that's going to be the case, what's the purpose of providing a type when indexing a document anymore, when literally NOTHING is dependent on type other than maybe faster queries based on type filters (which can easily be achieved by just having a type field in your document itself rather than having ES take care of it and worry about all the things discussed above)?

@rjernst
Member
rjernst commented Apr 17, 2015

maybe faster queries based on type filters (which can easily be achieved by just having a type field in your document itself rather than having ES take care of it and worry about all the things discussed above)

@shivangshah This is all types ever were. They are a shortcut for sets of fields, and the fact that the system (currently) allows conflicts is a bug.

Types have a couple uses:

  • The mechanism for specifying parent/child relationships
  • Decreased file handles when fields are common (meaning the same name) across different types (mapping types, not data types...i really wish we could revamp this terminology b/c it is very confusing to talk about)

There are also some performance optimizations that can happen on queries for a subset of types, by sorting on type at index time, which is planned for #8873 (and which cannot be attained using separate indexes).

@timbunce

Is there some way to detect and report inconsistent mappings across types?
E.g., has someone written a script to do this that people could run on their indices?

@clintongormley
Member

@timbunce such a script is planned as part of the migration advisory plugin #10214
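Until then, a check along these lines can be sketched in a few lines of Python. This is not the script planned in #10214, only an illustration. It assumes the input is the per-type mapping dictionary for one index, in the shape `{type_name: {"properties": {...}}}`, which is how the get-mapping response is understood to be nested; the function name `find_conflicts` is hypothetical.

```python
import json
from collections import defaultdict


def find_conflicts(type_mappings):
    """Return {full_field_path: {type_name: mapping}} for fields whose
    definitions differ between the types of a single index."""
    by_path = defaultdict(dict)

    def walk(type_name, properties, prefix=""):
        for name, mapping in properties.items():
            path = prefix + name
            if "properties" in mapping:
                # Object field: descend and extend the full path.
                walk(type_name, mapping["properties"], path + ".")
            else:
                by_path[path][type_name] = mapping

    for type_name, body in type_mappings.items():
        walk(type_name, body.get("properties", {}))

    conflicts = {}
    for path, per_type in by_path.items():
        # Serialize with sorted keys so equal mappings compare equal.
        distinct = {json.dumps(m, sort_keys=True) for m in per_type.values()}
        if len(distinct) > 1:
            conflicts[path] = per_type
    return conflicts
```

Fields are compared by full path, so `address.zip` in one type is only checked against `address.zip` in the others. Any difference in the mapping counts as a conflict here, including settings such as analyzers, which is stricter than comparing the `type` value alone.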

@jhansen-tt

+1. This has been a major pain ever since I started using ES, and still happens in 1.6. Deleting the type alone doesn't fix the problem -- I have to completely delete the entire index and re-index everything, and then sometimes the problem goes away. I do set up a strict mapping before I index anything. It really feels to be a timing thing between the shards.

@jordansissel

it seems that default data model for logstash is broken then.

Maybe? This only impacts users who have the same field name mapped to different data types under different document types in the same index. I don't know how many users this affects. Given Logstash has had this behavior (by default a daily index) for many years and given the anecdote that I can't recall many reports of this problem from Logstash users in those years, I'm not sure how big a problem this will be for Logstash users.

I will confess that prior to this ticket, my assumptions were that type mappings were fully independent even if they shared field names (w/ different mappings) - I now accept this as incorrect, but I don't know how great an impact this has had on Logstash users at this time.

If you enforce that all the same names need to have the same type you are effectively enforcing the same schema for all different logs from different sources.

These are two different things. The constraint "Fields in the same index but on different document types must have the same mapping" is not the same as saying "the same schema must exist for all documents in the same index regardless of type" - the conflict is only when two fields share the same name but have different mappings in one index.

More research is needed for the Logstash side of things, but it's possible we may want to change the default index to include the 'type' field from Logstash (it'll be a backwards-compatibility-breaking change, if we do this). Hopefully the script from #10214 will help users figure out how this will impact them before upgrading, and we can address further from there.

@jhansen-tt

After reading this:

https://www.elastic.co/guide/en/elasticsearch/reference/1.3/mapping.html

I think this should be taken out of the docs:
In practice though, this restriction is almost never an issue.

It looks like this is probably my issue, but I think the documentation should be updated to say that using different mapping characteristics on fields with the same name across multiple types is not supported, because some searches fall apart completely, such as sorting.

@monowai
monowai commented Jul 2, 2015

Indexes can have the same field name with a different type. Doesn't this change move the query problem out of the index/type level and up into the index? It seems to me that if I had a query spanning indexes I'd still have the same fieldname+datatype conflict problem.

Is there any merit in resolving this as part of the query DSL? If you have conflicting field+datatypes then could the query allow the caller to specify which field+datatype they wanted, ignoring docs that don't match the criteria?

@rjernst
Member
rjernst commented Jul 3, 2015

@monowai The important thing about #8871 was making field types consistent within an index. You are correct that across indexes, the problem can still exist. However, whether this is an error case depends on the query. If the query is not parseable in one of the indexes, an exception would be raised. This was already the case, but now it should be consistently raised, while before mixed field types within an index could have masked the problem (depending on the order the document types were loaded within mappings).
