Parent / Child Support #553

Closed
kimchy opened this Issue Dec 7, 2010 · 6 comments

Owner

kimchy commented Dec 7, 2010

Parent/child document support allows defining a parent relationship from a child type to a parent type.

Mapping

The relationship is defined with a simple mapping definition in the child type's mapping. For example, given a blog type and a blog_tag child type, the mapping for blog_tag should be:

{
    "blog_tag" : {
        "_parent" : {
            "type" : "blog"
        }
    }
}

The above defines a parent mapping, and the type of the parent.

Indexing

When indexing a child document, it must be routed to the same shard as its parent. This relies on the routing capability: when a doc is indexed with a parent id, that id is automatically used as the routing value (unless a routing value is explicitly defined). Indexing a document with a parent id is simple:

curl -XPUT 'localhost:9200/blogs/blog_tag/1122?parent=1111' -d '
{
    "tag" : "something"
}
'
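
To illustrate why the parent id doubles as the routing value, here is a minimal Python sketch. The function names, the shard count of 5, and the use of zlib.crc32 as the hash are assumptions made for illustration only; Elasticsearch uses its own internal hash function.

```python
import zlib

def routing_value(doc_id, parent=None, routing=None):
    """Pick the routing value: explicit routing wins, then the parent id, then the doc's own id."""
    if routing is not None:
        return routing
    if parent is not None:
        return parent
    return doc_id

def shard_for(value, num_shards=5):
    """Map a routing value to a shard (crc32 is a stand-in hash, not the real one)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_shards

parent_shard = shard_for(routing_value("1111"))
child_shard = shard_for(routing_value("1122", parent="1111"))
print(parent_shard == child_shard)  # True: both hash the value "1111"
```

Because the child hashes its parent's id rather than its own, both documents land on the same shard, which is what the join at query time relies on.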

There is an option to set _parent in each bulk index item as well.
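
As a rough sketch of what such a bulk item could look like, the Python helper below builds the two newline-delimited JSON lines (action line, then source line) for one child document. The helper name bulk_index_item is hypothetical; the _index, _type and _id keys follow the usual bulk action layout, and a later comment in this thread notes that both parent and _parent are accepted as the key.

```python
import json

def bulk_index_item(index, doc_type, doc_id, parent, source):
    """Return the two NDJSON lines (action + source) for one child document."""
    action = {
        "index": {
            "_index": index,
            "_type": doc_type,
            "_id": doc_id,
            "_parent": parent,  # parent id, also used as the routing value
        }
    }
    return json.dumps(action) + "\n" + json.dumps(source) + "\n"

body = bulk_index_item("blogs", "blog_tag", "1122", "1111", {"tag": "something"})
print(body, end="")
```

The resulting string is what would be sent as the body of a bulk request; several items can simply be concatenated.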

Querying

There are several mechanisms for querying child documents. The idea behind the child filter / query is that its inner query runs against the child documents, and its result is the set of parent docs matching those child documents.

The way it is implemented is that the child queries are first run on their own, and their results are "joined" to the parent documents. Then, the main query runs with the results of the child queries, which are the parent docs.

has_child

The first is the has_child filter and has_child query (which is a simple constant_score query wrapping the has_child filter):

{
    "has_child" : {
        "type" : "blog_tag",
        "query" : {
            "term" : {
                "tag" : "something"
            }
        }
    }
}

The type is the child type to query against. The parent type to return is automatically detected based on the mappings.

Neither the query nor the filter does any scoring, and the "join" that finds the parent doc for a child doc is performed for each matching child doc.
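
The join described above can be pictured with a small in-memory Python sketch. It is only a model of the behaviour, not the actual implementation: the document layout (type, parent, source keys) and the predicate standing in for the inner query are assumptions.

```python
def has_child(children, child_type, matches):
    """Return the ids of parents that have at least one matching child.

    children: iterable of dicts with "type", "parent" and "source" keys (hypothetical layout).
    matches:  predicate standing in for the inner child query.
    """
    parent_ids = set()
    for child in children:
        if child["type"] == child_type and matches(child["source"]):
            # The "join": resolve the parent for every matching child doc.
            parent_ids.add(child["parent"])
    return parent_ids

children = [
    {"type": "blog_tag", "parent": "1111", "source": {"tag": "something"}},
    {"type": "blog_tag", "parent": "1111", "source": {"tag": "other"}},
    {"type": "blog_tag", "parent": "2222", "source": {"tag": "other"}},
]
result = has_child(children, "blog_tag", lambda doc: doc["tag"] == "something")
print(result)  # {'1111'}
```

Note that the result is a plain set of parent ids with no scores attached, and that the parent lookup happens once per matching child.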

top_children

The top_children query runs the child query with an estimated hits size and aggregates the resulting hit docs into parent docs. If there aren't enough parent docs to satisfy the requested from/size of the search request, it is run again with a wider (more hits) search.

The top_children query also provides scoring capabilities, with the ability to specify max, sum or avg as the score type.
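
As an illustration of the three score types, the Python sketch below folds child scores into one score per parent. The function name and the (parent_id, score) input format are assumptions for the example; only the max, sum and avg names come from the description above.

```python
from collections import defaultdict

def aggregate_scores(child_hits, score="max"):
    """Fold (parent_id, child_score) pairs into one score per parent."""
    grouped = defaultdict(list)
    for parent_id, child_score in child_hits:
        grouped[parent_id].append(child_score)
    combine = {
        "max": max,
        "sum": sum,
        "avg": lambda scores: sum(scores) / len(scores),
    }[score]
    return {parent_id: combine(scores) for parent_id, scores in grouped.items()}

hits = [("1111", 1.0), ("1111", 3.0), ("2222", 2.0)]
print(aggregate_scores(hits, "max"))  # {'1111': 3.0, '2222': 2.0}
print(aggregate_scores(hits, "sum"))  # {'1111': 4.0, '2222': 2.0}
print(aggregate_scores(hits, "avg"))  # {'1111': 2.0, '2222': 2.0}
```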

One downside of using top_children is that if more child docs match than the number of hits requested when executing the child query, then the total_hits result of the search response will be incorrect.

The number of hits asked for in the first child query run is controlled by the factor parameter (defaults to 5). For example, when asking for 10 docs with from 0, the child query will execute with 50 hits expected. If not enough parents are found (in our example, 10), and there are still more child docs to query, then the number of search hits is expanded by multiplying it by the incremental_factor (defaults to 2).
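
The widening behaviour can be sketched in Python as below. This is a simplified model, not the real implementation: the list of parent ids stands in for running the child query, and computing the first request as (from + size) * factor is an assumption inferred from the 10 docs to 50 hits example above.

```python
def top_children_parents(child_hits, size, from_=0, factor=5, incremental_factor=2):
    """Collect parents from a growing window of child hits.

    child_hits: parent ids of the matching child docs, best hit first
                (a stand-in for running the child query).
    Returns (parents, requested), where requested is the final number of
    child hits asked for.
    """
    if factor < 1 or incremental_factor <= 1:
        raise ValueError("factor must be >= 1 and incremental_factor > 1")
    needed = from_ + size
    requested = needed * factor
    while True:
        window = child_hits[:requested]
        parents = list(dict.fromkeys(window))  # dedupe, keep hit order
        # Stop when enough parents were found or no child docs are left.
        if len(parents) >= needed or requested >= len(child_hits):
            return parents[from_:from_ + size], requested
        requested *= incremental_factor

# 120 matching child docs, 10 per parent: the first 50 hits cover only 5 parents.
child_hits = ["p%d" % (i // 10) for i in range(120)]
parents, requested = top_children_parents(child_hits, size=10)
print(len(parents), requested)  # 10 100
```

In this run the first pass asks for 50 child hits, finds only 5 parents, and the second pass doubles the window to 100 hits, which yields the 10 parents requested.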

The required parameters are the query and type (the child type to execute the query on). Here is an example with all different parameters, including the default values:

{
    "top_children" : {
        "type": "blog_tag",
        "query" : {
            "term" : {
                "tag" : "something"
            }
        },
        "score" : "max",
        "factor" : 5,
        "incremental_factor" : 2
    }
}

Faceting

Faceting on the child query phase (on the results of the query executed) can be done by specifying a scope with a custom name in the query / filter. All facets now accept a scope to run on (similar to global set to true), and can now be executed on docs matching the child query.

Query Performance

In general, the top_children performance will be much better than the has_child performance. This is because joining the child to its parent is done in the top_children case against the expected number of hits returned, while in the has_child case, it is executed against all child docs matching the child query.

Memory Considerations

With the current implementation, all _id values are loaded into memory (heap) to support fast lookups, so make sure there is enough memory for them.

Owner

kimchy commented Dec 7, 2010

Parent / Child Support, closed by 54437c1.

Member

medcl commented Dec 14, 2010

this feature really rocks!

Contributor

apatrida commented Dec 15, 2010

A few months back, I found this interesting solution to parent/child documents that can be indexed at the same time... (notes that follow were from that time, it has progressed I am sure since)

http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene

And implemented a base version here:

https://issues.apache.org/jira/browse/LUCENE-2454

It does look to have the problems mentioned in the JIRA issue:

  • Doesn't work well when parent/child docs cross index segment
  • Consumes more bits in things that track a bit per doc in the index (i.e. filters/facets)
  • Should consider rolling child docs into the returned parent
  • Probably messes with IDF and scoring

medcl pushed a commit to medcl/elasticsearch that referenced this issue Jul 1, 2011

Contributor

gpstathis commented Feb 4, 2012

Is there a way to disable the expected number of hits returned by the child query in top_children? Basically having it executed against all child docs matching the child query like has_child with the understanding that performance would indeed suffer? Or does one just have to estimate a really large factor?

I assume this is a bug on ES documentation, since the parent on the bulk request is set with "parent", rather than "_parent": https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-parent-child.html (sorry to post it here, I don't know where else to put it).

Owner

clintongormley commented Jun 29, 2015

@doublebyte actually both forms are accepted

This issue was closed.
