Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove possibility for conflicting field definitions and ambiguous field resolution #8870

Closed
clintongormley opened this issue Dec 10, 2014 · 22 comments

Comments

@clintongormley
Copy link
Contributor

@clintongormley clintongormley commented Dec 10, 2014

Fields with the same name in different types in the same index should have the same mapping. Previously, this has been advised as "best practice", but relying on advice has proved insufficient (see #4081 for the many issues that have resulted from allowing conflicting field definitions). Instead we need to enforce this in the API.

This issue (which replaces #4081) is a meta issue listing all of the changes that need to be made:

  • #8871 - Internally, field mappings should be stored at the index level, rather than at the type level. Types are essentially a way of grouping fields. While the mapping APIs will not change, an exception will be thrown if an attempt is made to map a field in one type in a way that conflicts with the mapping of a field with same name in another type. The _parent field is the one exception - this field will remain at the type level.
  • #8872 Always require the full path for field names (wildcards still allowed), and remove the ability to prepend the full path with the _type.
  • #8874 Remove type-level analyzer, search_analyzer, index_analyzer settings.
  • #9279 Remove per-document _analyzer
  • #8143 Remove the ability to change the mapping for _uid, _id, _index, _type, _field_names.
  • #6730 Deprecate extracting _routing and _id from document fields, but should still be possible to set _routing to required.
  • #8142 Always store _timestamp, _ttl, _routing, _parent as these are required for reindexing.
  • #8142 Should we also always require that the _source field should be stored? Disabling source prevents reindexing, updates, and out-of-the-box highlighting. Perhaps with the new compression options (#8863), the ability to disable the _source field is less useful?
  • #8875 Remove the _boost field (deprecated in v1.0.0.RC1).
  • #8877 Remove the ability to delete mappings.
  • #6677 Deprecate index_name and path settings
  • #9443 Allow the user to upgrade mappings removing old settings without requiring reindexing (where possible)
  • #9927 Throw an exception if field mappings is invalid

Alternatives:

  • Fields with different types can be renamed to distinguish their purpose, eg login_name vs login_date.
  • Different types can be separated into different indices.
@OlegYch
Copy link

@OlegYch OlegYch commented Dec 10, 2014

what about fields with different path (but same name) and different mapping in one type?

@clintongormley
Copy link
Contributor Author

@clintongormley clintongormley commented Dec 10, 2014

@OlegYch Since we will no longer support short names, only the full path is used to identify a field, so it is only the path that matters.

@OlegYch
Copy link

@OlegYch OlegYch commented Dec 10, 2014

we are using 'mapping types' precisely to store different kinds of documents in the same index (so that we can use parent/child queries)
so far prefixing field path with _type worked fine for us
this change would mean that we would lose ability to use parent/child queries if there happens to be a conflict, as they would have to be in different indexes
the conflicts will probably be rare, and we could probably change documents schema if they arise, but a better way to prevent or diagnose the conflicts than an exception on put would be nice
i'm also wondering if there is no way to somehow append field type to its name internally (like one would do to resolve conflicts manually)?

@clintongormley
Copy link
Contributor Author

@clintongormley clintongormley commented Dec 10, 2014

the conflicts will probably be rare, and we could probably change documents schema if they arise, but a better way to prevent or diagnose the conflicts than an exception on put would be nice
i'm also wondering if there is no way to somehow append field type to its name internally (like one would do to resolve conflicts manually)?

As you said, conflicts are rare (for most people). Normally a field with the same name refers to the same type of data. The alternative (eg prefixing the field name with the type) would create much more sparsity in the index, and would impact every cross-type query (as multiple fields would need to be queried). Right now, we have opted for making the common case correct and efficient. However, we have left the APIs as they are in case, sometime in the future, we manage to figure out a cleverer way of handling conflicting field definitions.

@rore
Copy link

@rore rore commented Dec 29, 2014

For the record, I want to raise here again our objection to the way this change is planned.

We have several use cases in which we have an index with custom types that have custom fields. Types are not pre-defined and fields are not pre-defined. There's high potential of fields with the same name under different types, including fields with the same name that have different field type (and it happens). We have a lot of this kind of data.

This modeling was aligned with the way Types where presented by Elasticsearch (and still are - in our last meetup Types where referred to by Boaz as equivalent to "tables").

So with this change we will be enforced to either hard-code the type as a prefix for all fields ourselves, or encapsulate all documents with a root "type" node. It can be done but is ugly, requires patchy handling and also reindexing all data.

A better option that will allow such use cases is having a setting on the index level to configure field type isolation. So if we know that we need field type isolation (and we don't have cross-type searches and are willing to pay the overhead), we can set it to "type" level, and internally fields will be prefixed by the type.

@clintongormley
Copy link
Contributor Author

@clintongormley clintongormley commented Dec 29, 2014

Hi @rore

There's high potential of fields with the same name under different types, including fields with the same name that have different field type (and it happens). We have a lot of this kind of data.

If you have this, then your data is essentially broken today, and you can end up with incorrect results, exceptions, or even corrupt indices.

The first thing that we're trying to do is to make everything safe and predictable. We are leaving the mapping APIs as they are so that, in the future, we may be able to revisit this decision and provide more alternatives.

@jhansen-tt
Copy link

@jhansen-tt jhansen-tt commented Jun 25, 2015

+1. This has been a major pain ever since I started using ES, and still happens in 1.6. Deleting the type alone doesn't fix the problem -- I have to completely delete the entire index and re-index everything, and then sometimes the problem goes away. I do set up a strict mapping before I index anything. It really feels to be a timing thing between the shards.

@jordansissel
Copy link
Contributor

@jordansissel jordansissel commented Jun 25, 2015

it seems that default data model for logstash is broken then.

Maybe? This only impacts users who have the same field name mapped to different data types under different document types in the same index. I don't know how many users this affects. Given Logstash has had this behavior (by default a daily index) for many years and given the anecdote that I can't recall much reports of this problem from Logstash users in those years, I'm not sure how big a problem this will be for Logstash users.

I will confess that prior to this ticket, my assumptions were that type mappings were fully independent even if they shared field names (w/ different mappings) - I know accept this as incorrect, but I don't know great an impact this has had against Logstash users at this time.

If you enforce that all the same names need to have the same type you are effectively enforcing the same schema for all different logs from different sources.

These are two different things. One constraint "Fields in the same index but on different document types must have the same mapping" is not the same as saying "the same schema must exist for all documents in the same index regardless of type - the conflict is only when two fields occupy the same name but different mappings in one index.

More research is needed for the Logstash side of things, but it's possible we may want to change the default index to include the 'type' field from Logstash (it'll be a backwards-compatibility-breaking change, if we do this). Hopefully the script from #10214 will help users figure out how this will impact them before upgrading, and we can address further from there.

@jhansen-tt
Copy link

@jhansen-tt jhansen-tt commented Jun 25, 2015

After reading this:

https://www.elastic.co/guide/en/elasticsearch/reference/1.3/mapping.html

I think this should be taken out of the docs:
In practice though, this restriction is almost never an issue.

It looks like this is probably my issue, but I think the documentation should be updated to say that using different mapping characteristics on fields with the same name across multiple types is not supported, because some searches fall apart completely, such as sorting.

@monowai
Copy link

@monowai monowai commented Jul 2, 2015

Indexes can have the same field name with a different type. Doesn't this change move the query problem out of the index/type level and up in to the index? It seems to me that I had a query spanning indexes I'd still have the same fieldname+datatype conflict problem.

Is there any merit in resolving this as part of the query DSL? If you have conflicting field+datatypes then could the query allow the caller to specify which field+datatype they wanted ignoring docs that don't match the criteria.

@rjernst
Copy link
Member

@rjernst rjernst commented Jul 3, 2015

@monowai The important thing about #8871 was making field types consistent within an index. You are correct that across indexes, the problem can still exist. However, whether this is an error case depends on the query. If the query is not parseable in one of the indexes, an exception would be raised. This was already the case, but now it should be consistently raised, while before mixed field types within an index could have masked the problem (depending on the order the document types were loaded within mappings).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
You can’t perform that action at this time.