Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Typeless parent/child #20257

Closed
martijnvg opened this issue Aug 31, 2016 · 34 comments
Closed

Typeless parent/child #20257

martijnvg opened this issue Aug 31, 2016 · 34 comments
Assignees
Labels
>breaking :Search/Search Search-related issues that do not fall into other categories

Comments

@martijnvg
Copy link
Member

martijnvg commented Aug 31, 2016

Related to #15613

Types are going to be removed. The parent/child is tightly coupled to types, so that needs to be changed. Parent/child still needs to distinguish between a parent or a child document, so the idea is that a new meta field type name _join will be allow that. This new field would replace _parent meta field. Each _join meta field type would maintain an indexed field to distinguish between parent and child documents and a doc values field needed for the join operation.

Example of how to use typeless parent/child:

PUT /stackoverflow
{
  "mapping": {
    "_join": ["question-to-answers"]
    }
  }
}

Indexing question document (parent):

PUT stackoverflow/1?join=question-to-answers
{
  "question" : "..."
}

Indexing answer document (child):

PUT stackoverflow/2?join=question-to-answers&parent=1
{
  "answer" : "..."
}

Besides the parent the typeless parent/child will also need a join url parameter. In order to prevent adding more feature specific options to transport and rest layer, meta fields should be completely isolated, from rest to mapping layer. This will result in cleaner code and allows parent/child to be moved to a module.

Adding answer-to-comments join field:

PUT /index1
{
  "mapping": {
    "_join": ["question-to-answers", "answer-to-comments"]
    }
  }
}

Indexing question document (parent):

PUT stackoverflow/1?join=question-to-answers
{
  "question" : "..."
}

Indexing answer document (child):

PUT stackoverflow/2?join=question-to-answers,answer-to-comments&parent=1
{
  "answer" : "..."
}

Indexing comment document (grand child):

PUT stackoverflow/3?join=answer-to-comments&parent=2&routing=1
{
  "comment" : "..."
}
@jpountz
Copy link
Contributor

jpountz commented Sep 1, 2016

I'm not sure we need to retain the notion of types, maybe we could go with something like below? You say that parent_type is used for the name of the internal join field, bu we could directly use the name of the join field directly for that? (in that case question-to-answers)

PUT /stackoverflow
{
  "mappings": {
    "properties": {
      "question-to-answers": {
        "type": "join"
      }
    }
  }
}

PUT stackoverflow/1
{
  "question" : "...",
  "question-to-answers" : null # no parent id, so this doc is a parent
}

PUT stackoverflow/2?routing=1
{
  "answer" : "...",
  "question-to-answers" : "1" # id of the parent
}

Something else that made me wonder when reading your proposal is that users generally do not like modifying the structure of their documents for specifying metadata (about the join in that case), so maybe it should remain a meta field, something like below?

PUT /stackoverflow
{
  "mappings": {
    "_join": [
      "question-to-answers"
    ]
  }
}

PUT stackoverflow/1?join=question-to-answers
{
  "question" : "..."
}

PUT stackoverflow/2?join=question-to-answers&parent=1
{
  "answer" : "..."
}

@martijnvg
Copy link
Member Author

martijnvg commented Sep 1, 2016

I'm not sure we need to retain the notion of types, maybe we could go with something like below?

When thinking about this refactoring I thought we needed to distinguish between the different documents (in case of multiple join fields), but when answering this question I realize we don't :) ,
because all documents in the same index will have different ids. (currently that is not the case as ids are only unique within a single type)

Something else that made me wonder when reading your proposal is that users generally do not like modifying the structure of their documents for specifying metadata (about the join in that case)

The other reason I moved away from metadata is that besides the parent there is a need of an additional meta field. I didn't want to add more options to IndexRequest and other classes for a feature that isn't used as much as our core features. This gives us cleaner code (p/c's code being encapsulated in only a FieldMapper impl and QueryBuilder impls) and does give the opportunity to eventually isolate parent/child into a module. The price of this is that the relationship needs to be specified in the source of the document.

If we can make metadata completely pluggable (from rest layer to field mapper layer) then we specify the required parameter in the url and having the ability to isolate the p/c code. But I don't feel that this should be a requirement for this refactoring. Although if we really want this we do have the time to develop this.

@jpountz
Copy link
Contributor

jpountz commented Sep 1, 2016

I didn't want to add more options to IndexRequest and other classes for a feature that isn't used as much as our core features.

This is a very good point.

@rjernst
Copy link
Member

rjernst commented Sep 1, 2016

If we can make metadata completely pluggable (from rest layer to field mapper layer) then we specify the required parameter in the url

I think we should do this. It is already a deficiency in the pluggability of Metadata fields. And I like @jpountz suggested api.

@martijnvg
Copy link
Member Author

@rjernst @jpountz I've updated the description of the issue to match with Adrien's proposal.

@stephanebastian
Copy link

Hi All,

Interesting discussion, I'm jumping-in with a couple of questions/comments.

Right now, as a user I can independently define the properties of my parent and child documents.
If I understand correctly, one consequence of removing types, is that I would define ONE mapping defining all properties of both my parent and child document (basically a merge of my current child and parent mapping files). Is that right?

Then, when inserting a document, I would either set properties allowed by the parent OR properties allowed by the child. ES won't have any way to tell if the properties are 'parent properties' or 'child properties'. Am I correct ?
If that's right, it seems like a major step backward in terms of functionality. What do you guys think?

Another question: Would it be possible to support cross index parent-child relationships?

All the best

@jpountz
Copy link
Contributor

jpountz commented Mar 15, 2017

Your assumptions are correct. It is indeed hard to keep feature parity when moving forward, however we think that removing types remains a good trade-off by making Elasticsearch easier to use, easier to understand and possibly faster.

Cross index parent-child relationships would not be possible as-is since Elasticsearch relies on the fact that a parent and all its children are on the same shard.

@martijnvg
Copy link
Member Author

Then, when inserting a document, I would either set properties allowed by the parent OR properties allowed by the child. ES won't have any way to tell if the properties are 'parent properties' or 'child properties'.

I would like to add to this that today we use the type as marker of what is a child and what is a parent. Today you could use the same field in both parent and child, so properties alone isn't enough to identify what is what. In this case both types would have the same field defined in their mappings.

With the proposed change, the join parameter would indicate what is the parent and what is the child document. If both parent and child documents use the same field they will use the same field mapping instead of two field mappings in the two type mappings.

So I think there is no step backwards in terms of functionality, but the way how parent child relationships are defined is just different with types no longer being there.

@cmorss
Copy link

cmorss commented Mar 28, 2017

I totally understand moving mappings up to the index level (basically removal of types), however we use strict mapping and there will be no way with this proposal that we'll be able to enforce which fields are allowed in a child vs a parent. If my understanding is correct, I'll be able to enforce mapping on the union of my logical children fields with logical parent fields, but if a field is logically "allowed" in a child there is nothing preventing it from also being in a parent document.

I can certainly live with these constraints, I just want to be sure I understand and plan for the future.

Thanks - C

@jimczi
Copy link
Contributor

jimczi commented Apr 26, 2017

After some internal discussions with @jpountz and @martijnvg I'd like to update this issue with the latest status.
Firstly naming, instead of join I'd like to call it by parent join I think it is important since it would indicate what kind of join we're supporting.
Secondly and since this feature is targeted to 5.5 I think we should add it as a module with the following restrictions:

  • the value of the join (name + parent) must be encoded in the source
  • the routing for any document (parent, child or grand-child) is checked but not automatically added. This means that user should add the routing manually for each document. If the document is only a child we can ensure that routing == parent but if it's also a grand-children we can just ensure that routing was set with some values.
    This restrictions are useful to simplify the integration of this feature as a module without being intrusive in all indexation layers (rest and mappings).
    We can later work on:

If we can make metadata completely pluggable (from rest layer to field mapper layer) then we specify the required parameter in the url
but I don't think we can achieve this for 5.5. This is not ideal to require the join metadata to be inside the source but adding more metadata handling to all required layers would be worst I think.

So the proposal is as follow:

Defining a single parent-child relation:

"join": {
  "questions": "answers"
}

Defining a parent-child-grand-child relation:

"join": {
  "questions": {
     "answers": "comments"
   }
}

Defining multiple parent-child relation:

"join": {
  "questions": {
     "answers": "comments"
   },
   "products": "items"
}

With this format the relation between each entity is explicit and we can use the hierarchy to validate join values inside documents.
Also it is possible to update a mapping to add a child to an existing parent:

Adding a child to an existing parent:

"join": {
  "parent": "child1"
}

"join": {
  "parent": {
     "child1": {},
     "child2": {}
}

"join": {
  "parent": {
     "child1":  "grand_child",
     "child2": {}
}

Indexing a parent question:

PUT stackoverflow/1
{
  "join" : {
     "name": "questions"
  },
  "question": "..."
}

Indexing a child answer:

PUT stackoverflow/2?routing=1
{
  "join" : {
     "name": "answers",
    "parent": "1"
  },
  "answer": "..."
}

Indexing a grandchild comment:

PUT stackoverflow/3?routing=1
{
  "join" : {
     "name": "comments",
     "parent": "2"
  },
  "comment": "..."
}

So the plan here is to have a metadata ParentJoinFieldMapper separated in a module. Then we would migrate has_parent, has_child queries and the children aggregation to it. These queries and aggs would still be compatible with the legacy parent_child but only for 5.x and 6.x. Lastly we must find a way to also migrate inner_hits in order to be able to completely replace the functionality of the current parent_child.

@clintongormley WDYT ?

@jpountz
Copy link
Contributor

jpountz commented Apr 26, 2017

How about replacing

"join": {
  "questions": {
     "answers": "comments"
   }
}

with

"join": {
  "questions": "answers",
  "answers": "comments"
}

It feels more natural to me, but maybe I'm missing something.

@jimczi
Copy link
Contributor

jimczi commented Apr 26, 2017

IMO the first option is more visual, you see more clearly the relation tree and you don't have to repeat the join name but I am fine either way.

@clintongormley
Copy link

The first option is more visual but is somewhat confusing, eg why does questions take a map, but answers just takes comments? A stricter syntax would be:

"join": {
  "questions": {
     "answers": {
        "comments": {}
     }
   }
}

but that encourages users to use several layers of inheritance, which is unwise.

I think I prefer the simple version that @jpountz suggested. In the case that a parent has multiple child types, it could be:

"join": {
  "questions": ["answers","comments"]
}

Question: Does the parent document need to know about which join field to use? If a document is neither a parent nor a child, does the join field still contain a value, or is it null?

@jimczi
Copy link
Contributor

jimczi commented Apr 26, 2017

but that encourages users to use several layers of inheritance, which is unwise.

Right and since most of the use cases are for a single parent-child relation the syntax would be the same anyway: "questions": "answers".
I'll start with the simple version that @jpountz suggested.

Does the parent document need to know about which join field to use

Yes because we use a different docvalue field for each "parent=>child" relation.

If a document is neither a parent nor a child, does the join field still contain a value, or is it null?

It is not required to add a join field in this case so the field can be missing in the document.

This was referenced May 11, 2017
jimczi added a commit that referenced this issue May 12, 2017
* Add parent-join module

This change adds a new module named `parent-join`.
The goal of this module is to provide a replacement for the `_parent` field but as a first step this change only moves the `has_child`, `has_parent` queries and the `children` aggregation to this module.
These queries and aggregations are no longer in core but they are deployed by default as a module.

Relates #20257
jimczi added a commit that referenced this issue May 16, 2017
* Add parent-join module

This change adds a new module named `parent-join`.
The goal of this module is to provide a replacement for the `_parent` field but as a first step this change only moves the `has_child`, `has_parent` queries and the `children` aggregation to this module.
These queries and aggregations are no longer in core but they are deployed by default as a module.

Relates #20257
jimczi added a commit to jimczi/elasticsearch that referenced this issue May 16, 2017
This change removes the field data specialization needed for the parent field and replaces it with
a simple DocValuesIndexFieldData. The underlying global ordinals are retrieved via a new function called
IndexOrdinalsFieldData#getOrdinalMap.
The children aggregation is also modified to use a simple WithOrdinals value source rather than the deleted WithOrdinals.Parent.

Relates elastic#20257
@johnrfrank
Copy link

@jimczi will this new join field type rely on Global Ordinals? Or, more to the point, what will be the practical considerations relevant to using _join? How will important details like the byte-length of record identifiers impact the use of _join?

Thanks for any guidance on this -- or anticipated guidance :-)

Very excited to see this new feature coming together.

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Jun 2, 2017
… work with the new join field type and

at the same time maintaining support for the `_parent` meta field type.

Relates to elastic#20257
jimczi added a commit to jimczi/elasticsearch that referenced this issue Jun 6, 2017
This change moves the parent_id query to the parent-join module and handles the case when only the parent-join field can
be declared on an index (index with single type on).
If single type is off it uses the legacy parent join field mapper and switch to the new one otherwise (default in 6).

Relates elastic#20257
jimczi added a commit that referenced this issue Jun 6, 2017
This change moves the parent_id query to the parent-join module and handles the case when only the parent-join field can be declared on an index (index with single type on).
If single type is off it uses the legacy parent join field mapper and switch to the new one otherwise (default in 6).

Relates #20257
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Jun 7, 2017
at the same time maintaining support for the `_parent` meta field type/

Relates to elastic#20257
jimczi added a commit to jimczi/elasticsearch that referenced this issue Jun 7, 2017
…hild relation within documents of the same index (elastic#24978)

* Introduce ParentJoinFieldMapper, a field mapper that creates parent/child relation within documents of the same index

This change adds a new field mapper named ParentJoinFieldMapper. This mapper is a replacement for the ParentFieldMapper but instead of using the types in the mapping
it uses an internal field to materialize parent/child relation within a single index.
This change also adds a fetch sub phase that automatically retrieves the join name (parent or child name) and the parent id for child documents in the response hit fields.
The compatibility with `has_parent`, `has_child` queries and `children` agg will be added in a follow up.

Relates elastic#20257
jimczi pushed a commit to jimczi/elasticsearch that referenced this issue Jun 7, 2017
… work with the new join field type and

at the same time maintaining support for the `_parent` meta field type.

Relates to elastic#20257
jimczi added a commit to jimczi/elasticsearch that referenced this issue Jun 7, 2017
This change moves the parent_id query to the parent-join module and handles the case when only the parent-join field can be declared on an index (index with single type on).
If single type is off it uses the legacy parent join field mapper and switch to the new one otherwise (default in 6).

Relates elastic#20257
jimczi pushed a commit to jimczi/elasticsearch that referenced this issue Jun 7, 2017
at the same time maintaining support for the `_parent` meta field type/

Relates to elastic#20257
jimczi added a commit that referenced this issue Jun 7, 2017
This is a full backport of the typeless parent child feature (parent-join) introduced in master.
It includes:
* Introduce ParentJoinFieldMapper, a field mapper that creates parent/child relation within documents of the same index (#24978)
* Disallow multiple parent-join fields per mapping (#25002)
* Change `has_child`, `has_parent` queries and `childen` aggregation to work with the new join field type and at the same time maintaining support for the `_parent` meta field type.
* Move parent_id query to the parent-join module (#25072)
* Changed inner_hits to work with the new join field type and
at the same time maintaining support for the `_parent` meta field type

Relates #20257
jimczi added a commit to jimczi/elasticsearch that referenced this issue Jun 14, 2017
This commit adds the docs for the new parent-join field.
It explains how to define, index and query this new field.

Relates elastic#20257
@shawnjohnson
Copy link

Will this change also remove Nested objects? Or are there plans to improve the performance gap between a nested object query and a parent-child query?

Parent-child queries can be 5 to 10 times slower than the equivalent nested query!

@jimczi
Copy link
Contributor

jimczi commented Jun 16, 2017

Will this change also remove Nested objects?

No, nested objects are not impacted by this change.

Or are there plans to improve the performance gap between a nested object query and a parent-child query?

This issue is just about typeless parent join that will replace the parent/child with types. We did not change the internals and how the query is executed so it should be similar in terms of performance.
Though we decided to index the parent id this time so inner_hits that retrieves children should be faster and the indexing cost should be a little higher.

Or, more to the point, what will be the practical considerations relevant to using _join? How will important details like the byte-length of record identifiers impact the use of _join?

The same as before ;)
We use doc_values and global_ordinals so the length of the id can have an impact on disk consumption and RAM consumption. But again, this is not a new approach for parent/child, it is just a rewriting without types so if your use case worked with the previous model it should behave the same in the new one.

jimczi added a commit that referenced this issue Jun 16, 2017
* Add documentation for the new parent-join field

This commit adds the docs for the new parent-join field.
It explains how to define, index and query this new field.

Relates #20257
jimczi added a commit that referenced this issue Jun 16, 2017
* Add documentation for the new parent-join field

This commit adds the docs for the new parent-join field.
It explains how to define, index and query this new field.

Relates #20257
@jimczi
Copy link
Contributor

jimczi commented Jun 16, 2017

The typeless parent-join has landed in master and 5.x.
You can find the documentation here:
https://www.elastic.co/guide/en/elasticsearch/reference/master/parent-join.html

The first release for this will be 5.6 where users can start exploring this typeless parent/child by setting mapping.single_type to true:
https://www.elastic.co/guide/en/elasticsearch/reference/5.x/parent-join.html

New issues can be opened for enhancements or bugs but this long standing issue can be closed !

@jimczi jimczi closed this as completed Jun 16, 2017
s1monw added a commit that referenced this issue Jun 26, 2017
This removes the remaining usage of `mapping.single_type` from the parent join
module and moves it's bwc test to the mixed cluster tests

Relates to #24961
Relates to #20257
@clintongormley clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Parent/Child labels Feb 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>breaking :Search/Search Search-related issues that do not fall into other categories
Projects
None yet
Development

No branches or pull requests