Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add field collapsing for search request #22337

Merged
merged 2 commits into from Jan 23, 2017

Conversation

Projects
None yet
8 participants
@jimczi
Copy link
Member

commented Dec 23, 2016

This is a POC for implementing field collapsing for search requests.
The field collapsing is implemented for query_then_fetch and dfs_query_then_fetch using the query and fetch phase. No additional phase is required even when inner_hits for each collapsed result are computed. For instance the following query:

GET _search
{
   "collapse": {
      "field": "category",
      "inner_hits": {
         "name": "collapsed_hits",
         "size": 3
      }
   }
}

... would return the top hits "collapsed" on the category field and for each collapsed result the inner_hits named collapsed_hits contains the 3 best hits within this category.
The inner_hits retrieval is done in the fetch phase and does not require any index routing to be accurate.

The main advantages of using this versus a combo terms + top_hits aggregation (as described here https://www.elastic.co/guide/en/elasticsearch/guide/current/top-hits.html) are:

  • The collapse + inner_hits is done in two phases (instead of 1 in the combo agg) so the top hits of each collapsed key are always accurate.
  • The field collapsing is done at the top hits level only. We don't need to compute the values for each collapsed key in the result set, just the top hits. This saves a lot of memory compared to the terms aggregation behavior.
  • Since it's applied on the top hits only the collapsing can be much faster than using the combo agg.
  • The collapsing top docs collector does not use global ordinals for strings which saves extra memory compared to the aggs solution.
  • Paging is possible, though with the same limitation than for regular search.
  • search_after and scroll are not implemented yet but could be achieved as well (though without the optimization when sorting on _doc only).

I've opened this PR to show approximatively the amount of changes required. The implementation should be cleaned and at the moment there is no test or docs so the only way to test this is to launch a node with this branch. It's implemented directly in the existing search actions since I did not wanted to create a new request/response/action for this.
I though it would be less intrusive which is why I need feedback at this point to decide if this POC should be continued.

Fixes #21833

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Dec 23, 2016

@markharwood, @jpountz this is an early WIP but I think it's ready for a preliminary review. We've discussed some time ago regarding the feasibility of this so I wanted to give a try. I'd like to know if the direction I am taking seems justified and whether this feature should be included in the regular search or not.

@clintongormley

This comment has been minimized.

Copy link
Member

commented Dec 23, 2016

@jimczi could you show an example of how the results would be returned?

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Dec 23, 2016

Here is an example of how to use this feature:

PUT t/t/1
{
    "price":10,
    "category": 1
}

PUT t/t/2
{
    "price":100,
    "category": 1
}

PUT t/t/3
{
    "price":100,
    "category": 2
}

GET _search
{
   "collapse": {
      "field": "category",
      "inner_hits": {
         "name": "collapsed_hits",
         "size": 3,
         "sort": [
            {
               "price": {
                  "order": "desc"
               }
            }
         ]
      }
   }
}

This query return one collapsed result per category (the best one sorted by _score) and the best inner hits for each category sorted by price:

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": -1,
      "hits": [
         {
            "_index": "t",
            "_type": "t",
            "_id": "2",
            "_score": 1,
            "_source": {
               "price": 100,
               "category": 1
            },
            "sort": [
               1
            ],
            "inner_hits": {
               "collapsed_hits": {
                  "hits": {
                     "total": 2,
                     "max_score": null,
                     "hits": [
                        {
                           "_type": "t",
                           "_id": "2",
                           "_score": null,
                           "_source": {
                              "price": 100,
                              "category": 1
                           },
                           "sort": [
                              100
                           ]
                        },
                        {
                           "_type": "t",
                           "_id": "1",
                           "_score": null,
                           "_source": {
                              "price": 10,
                              "category": 1
                           },
                           "sort": [
                              10
                           ]
                        }
                     ]
                  }
               }
            }
         },
         {
            "_index": "t",
            "_type": "t",
            "_id": "3",
            "_score": 1,
            "_source": {
               "price": 100,
               "category": 2
            },
            "sort": [
               1
            ],
            "inner_hits": {
               "collapsed_hits": {
                  "hits": {
                     "total": 1,
                     "max_score": null,
                     "hits": [
                        {
                           "_type": "t",
                           "_id": "3",
                           "_score": null,
                           "_source": {
                              "price": 100,
                              "category": 2
                           },
                           "sort": [
                              100
                           ]
                        }
                     ]
                  }
               }
            }
         }
      ]
   }
}

For both categories the first inner_hit is the same as the parent hit. This is due to the fact that inner_hits in a collapsing context works with the same document than their parent hit. Since inner_hits can define its own sort it's also possible to get different document but when the same sort is used then the first inner_hits is the same as the parent hit. This can be fixed by adding from: 1 in the inner_hits definition.

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Dec 23, 2016

I compared the combo aggs solution and this PR using this two queries:

GET _search
{
   "collapse": {
      "field": "category",
      "inner_hits": {
         "name": "collapsed_hits",
         "size": 3,
         "sort": [
            {
               "price": {
                  "order": "desc"
               }
            }
         ]
      }
   }
}
GET _search
{
   "size": 0,
   "aggs": {
      "groups": {
         "terms": {
            "field": "category",
            "order": {
               "top_score": "desc"
            },
            "size": 10
         },
         "aggs": {
            "top_score": {
               "max": {
                  "script": "_score"
               }
            },
            "top_hit": {
               "top_hits": {
                  "size": 3,
                  "sort": [
                     {
                        "price": {
                           "order": "desc"
                        }
                     }
                  ]
               }
            }
         }
      }
   }
}

I've created 1M documents with random categories from 0 to 10,000.
The collapse query takes 12ms in average and the aggs one takes 80ms. I did not compare the memory footprint but it should also show a nice save on the collapse side ;)

@markharwood

This comment has been minimized.

Copy link
Contributor

commented Dec 23, 2016

Looks promising! I gave this a quick spin and had a problem with this:

when the same sort is used then the first inner_hits is the same as the parent hit. This can be fixed by adding from: 1 in the inner_hits definition.

from:1 looked to work in one of my tests but on an index with a single shard it was returning a dup - see
https://gist.github.com/markharwood/fec539e954dfffe162af8817b962ffa1

This was happening if I was using natural scoring order (as in my example) or when sorting both root and inner by identical sort criteria.

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Dec 26, 2016

Thanks for looking into it Mark.
I pushed a fix for the issue with a 1-shard index. In this case we still need to prune the doc list if from > 0 because all shards returned their inner top hits starting at 0.

@jpountz

This comment has been minimized.

Copy link
Contributor

commented Dec 26, 2016

I wish it was less intrusive too. Maybe we should focus for now on what gives the change most of its added value: the ability to get one doc per group in the top hits. While the ability to fetch the top docs per group is convenient, it is something that can be done from client-side with similar efficiency? This way things like the fetch phase could remain unmodified?

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Dec 26, 2016

While the ability to fetch the top docs per group is convenient, it is something that can be done from client-side with similar efficiency? This way things like the fetch phase could remain unmodified?

I agree. The client-side solution is easy and we can maybe think of a better solution when simple collpase is correctly integrated.
I'll remove the top docs per group option and see how it simplifies the PR. Thanks for looking @jpountz

@clintongormley

This comment has been minimized.

Copy link
Member

commented Dec 28, 2016

I wish it was less intrusive too. Maybe we should focus for now on what gives the change most of its added value: the ability to get one doc per group in the top hits. While the ability to fetch the top docs per group is convenient, it is something that can be done from client-side with similar efficiency? This way things like the fetch phase could remain unmodified?

I understand your motivation here, but it feels like removing the inner hits removes a big reason to use this API, certainly for me.

@jpountz

This comment has been minimized.

Copy link
Contributor

commented Dec 28, 2016

Fair enough. Then I think we need to either have clearer evidence this would be a very popular feature. To me the red flag for this feature right now is that it requires to modify the way the search types work (query_then_fetch, etc.). Unless I'm missing something, I don't think we can make it go away without either removing the retrieval of the top hits per group or requiring that the grouping field was used as a routing key at index time.

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Dec 28, 2016

To me the red flag for this feature right now is that it requires to modify the way the search types work (query_then_fetch, etc.). Unless I'm missing something, I don't think we can make it go away without either removing the retrieval of the top hits per group or requiring that the grouping field was used as a routing key at index time

I've tried to modify the existing search type as little as possible but some changes are required to make the inner_hits accurate. The most problematic one is that we need to send the fetch request to all shards instead of only the ones that have a top hit to show. This would not be required if the grouping field was used as a routing key at index time as Adrien mentioned but one of the goal of this PR was to break this limitation.
I still think that collapse without inner_hits can be useful. For instance if we don't provide inner_hits we could allow a script instead of a field for the collapse key.
Now if inner_hits is the main motivation maybe we can re-think about adding a phase to the existing search type or provide a new search type for the collapsing/grouping ?

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Dec 28, 2016

I tried another approach for the inner_hits retrieval.
Instead of changing the fetch phase to retrieve the inner_hits locally, I am sending a query to all shards to get the top inner_hits per document. This is not invasive because it can be implemented as a simple fetch sub phase but it requires to send queries over the wire in the fetch phase. So basically each top hit send a query to get its top inner hits. This is done behind the scene in a sub fetch phase and does not require to change the search type.
I pushed another iteration that does what's describe above. The gist of it can be found in CollapseFetchSubPhase which retrieves the inner_hits by sending a query to all shards involved in the main query. This change may be controversial so I did not implement the inner hits option fully.
In fact at this stage the inner hits definition cannot be changed at all and a collapse query always return 3 inner hits per document.

@jpountz

This comment has been minimized.

Copy link
Contributor

commented Dec 29, 2016

Interesting... I see you had to add a hack to get a client that does not complain about the frozen state, but if my understanding is correct this would not be an issue since it should be fine to call the client in the fetch phase since it is not subject to caching? This requires to transfor docs twice over the network but I like it better in terms of sustainability. While we are looking at simplifications, I wonder whether we could reuse a Lucene collector rather than implementing a new one. This looks rather similar to the grouping collectors?

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Dec 29, 2016

I see you had to add a hack to get a client that does not complain about the frozen state, but if my understanding is correct this would not be an issue since it should be fine to call the client in the fetch phase since it is not subject to caching?

I think so and we can also add a safeguard in the setting to disallow inner queries by default.

This requires to transfor docs twice over the network but I like it better in terms of sustainability.

Note also that this is more optimized than the first version because the inner query is done in two phases. In fact aside from the network trip doing a distributed query at this point does not add any computation. We would do exactly the same if this was implemented as a new search type.

While we are looking at simplifications, I wonder whether we could reuse a Lucene collector rather than implementing a new one. This looks rather similar to the grouping collectors?

Yes it's a modified version of the AbstractFirstPassGroupingCollector in Lucene. I'll just extend this collector and adapt it for our need if the new design is accepted. I'll also try to minimize the changes to the TopDocs serialization.

core/src/main/java/org/apache/lucene/search/collapse/CollapseTopDocs.java Outdated
import java.util.List;
import java.util.Set;

public class CollapseTopDocs extends TopFieldDocs {

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jan 3, 2017

Member

I wonder if we can use Lucene's TopGroups? It already has the merge logic.

@jimczi jimczi added review and removed WIP labels Jan 5, 2017

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Jan 5, 2017

I pushed another iteration that implements the design we discussed earlier.
I think it's ready for another review. I've added tests and reused the Lucene's grouping class. @martijnvg I had to keep the custom merge of the CollapseTopDocs because the merge of the grouping in Lucene is done in two phases, the first phase merges the top groups but does not record the top doc per group and the second phase merges the inner hits per group. We cannot use the first pass merge because we need the top doc (we just want to collapse the result set) and the second pass merge is done in the CollapseFetchSubPhase by sending a query per group in the cluster.

@martijnvg
Copy link
Member

left a comment

@jimczi I did a review and it looks good. My main review point is about the client hack and we need docs for this.

core/src/main/java/org/elasticsearch/search/query/QueryPhase.java Outdated
} else if (topDocsCollector instanceof CollapsingTopDocsCollector) {
topDocs = ((CollapsingTopDocsCollector) topDocsCollector).getTopDocs();
} else {
throw new RuntimeException("Should never happen");

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jan 10, 2017

Member

maybe throw an IllegalStateException here instead?

core/src/main/java/org/elasticsearch/search/SearchService.java Outdated
}
final CollapseContext collapseContext;
try {
collapseContext = source.collapse().build(context.getQueryShardContext());

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jan 10, 2017

Member

maybe do this instead?

 CollapseContext collapseContext = source.collapse().build(context.getQueryShardContext());
 context.collapse(collapseContext);
@@ -221,6 +225,9 @@ public SearchSourceBuilder(StreamInput in) throws IOException {
profile = in.readBoolean();
searchAfterBuilder = in.readOptionalWriteable(SearchAfterBuilder::new);
sliceBuilder = in.readOptionalWriteable(SliceBuilder::new);
if (in.getVersion().onOrAfter(Version.V_6_0_0_alpha1_UNRELEASED)) {

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jan 10, 2017

Member

I think we should consider after merging to backport this to 5.x branch?

core/src/main/java/org/apache/lucene/search/grouping/CollapseTopDocs.java Outdated
* Returns a new CollapseTopDocs, containing topN collapsed results across
* the provided CollapseTopDocs, sorting by score. Each {@link CollapseTopDocs} instance must be sorted.
**/
public static CollapseTopDocs merge(Sort sort, int start, int size,

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jan 10, 2017

Member

Can we have a unit test which tests just the merging?

core/src/main/java/org/elasticsearch/search/fetch/subphase/CollapseFetchSubPhase.java Outdated
} else {
hitField.values().add(null);
}
hit.fields().put("_collapse", hitField);

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jan 10, 2017

Member

We should make sure that _collapse isn't an actual stored or docvalues field. We shoud either forbid at index time fields in a document with the name _collapse or just fail the search request here if someone does collapsing and requests to fetch a stored / doc values field with the name _collapse. I think we should do the latter as I expect this scenario to be rare.

core/src/main/java/org/elasticsearch/search/fetch/subphase/CollapseFetchSubPhase.java Outdated
if (options == null) {
continue;
}
Client client = context.getQueryShardContext().getClientNoCache();

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jan 10, 2017

Member

I don't think we should make external calls here. This is the wrong place to do this.

We should do this at a different level. I think the best place is a hook point that @polyfractal worked in #19587. On the transport layer it is okay to make additional external calls. Unfortunately Zach's PR never got merged.

What we can do is move this code that executes a search and enriches the search response to the same place Zach tried to add this extension point.

I think we should add something like this to SearchTransportAction.java:

At line 164

Index: core/src/main/java/org/elasticsearch/action/search/TransportSearchAction.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
--- core/src/main/java/org/elasticsearch/action/search/TransportSearchAction.java	(date 1483612278000)
+++ core/src/main/java/org/elasticsearch/action/search/TransportSearchAction.java	(revision )
@@ -33,15 +33,26 @@
 import org.elasticsearch.common.settings.Setting.Property;
 import org.elasticsearch.common.settings.Settings;
 import org.elasticsearch.index.Index;
+import org.elasticsearch.index.query.BoolQueryBuilder;
+import org.elasticsearch.index.query.InnerHitBuilder;
+import org.elasticsearch.index.query.QueryBuilder;
+import org.elasticsearch.index.query.QueryBuilders;
+import org.elasticsearch.index.query.TermQueryBuilder;
+import org.elasticsearch.search.SearchHit;
+import org.elasticsearch.search.SearchHits;
 import org.elasticsearch.search.SearchService;
 import org.elasticsearch.search.builder.SearchSourceBuilder;
+import org.elasticsearch.search.collapse.CollapseBuilder;
 import org.elasticsearch.search.internal.AliasFilter;
+import org.elasticsearch.search.internal.InternalSearchHit;
+import org.elasticsearch.search.sort.SortBuilder;
 import org.elasticsearch.tasks.Task;
 import org.elasticsearch.threadpool.ThreadPool;
 import org.elasticsearch.transport.TransportService;
 
 import java.util.Collections;
 import java.util.HashMap;
+import java.util.Iterator;
 import java.util.Map;
 import java.util.Set;
 import java.util.concurrent.Executor;
@@ -150,8 +161,23 @@
             }
         }
 
+        // Only enrich the search response iff collapsing has been specified:
+        final ActionListener<SearchResponse> wrapper;
+        if (searchRequest.source().collapse() != null && searchRequest.source().collapse().getInnerHit() != null) {
+            wrapper = ActionListener.wrap(searchResponse -> {
+                if (searchResponse.getHits().getHits().length == 0) {
+                    listener.onResponse(searchResponse);
+                } else {
+                    Iterator<SearchHit> hitsIterator = searchResponse.getHits().iterator();
+                    expandCollapsedHits(searchRequest, hitsIterator, searchResponse, listener);
+                }
+            }, listener::onFailure);
+        } else {
+            wrapper = listener;
+        }
+
         searchAsyncAction((SearchTask)task, searchRequest, shardIterators, startTimeInMillis, clusterState,
-            Collections.unmodifiableMap(aliasFilter), concreteIndexBoosts, listener).start();
+            Collections.unmodifiableMap(aliasFilter), concreteIndexBoosts, wrapper).start();
     }
 
     @Override
@@ -204,4 +230,82 @@
                 + "] to a greater value if you really want to query that many shards at the same time.");
         }
     }
+
+    void expandCollapsedHits(SearchRequest originalSearchRequest, Iterator<SearchHit> hits, SearchResponse searchResponse, ActionListener<SearchResponse> finalListener) {
+        if (hits.hasNext() == false) {
+            finalListener.onResponse(searchResponse);
+            return;
+        }
+
+        CollapseBuilder collapseBuilder = originalSearchRequest.source().collapse();
+        InternalSearchHit searchHit = (InternalSearchHit) hits.next();
+        BoolQueryBuilder boolQuery = new BoolQueryBuilder();
+        Object collapseValue = searchHit.field("_collapse").getValue();
+        if (collapseValue != null) {
+            boolQuery.filter(new TermQueryBuilder(collapseBuilder.getField(), collapseValue));
+        } else {
+            boolQuery.mustNot(QueryBuilders.existsQuery(collapseBuilder.getField()));
+        }
+        QueryBuilder query = originalSearchRequest.source().query();
+        if (query != null) {
+            boolQuery.must(query);
+        }
+
+        SearchRequest searchRequest = new SearchRequest(originalSearchRequest.indices());
+        searchRequest.types(originalSearchRequest.types());
+        SearchSourceBuilder sourceBuilder = new SearchSourceBuilder();
+        sourceBuilder.query(boolQuery);
+        addInnerHitOptions(sourceBuilder, collapseBuilder.getInnerHit());
+        searchRequest.source(sourceBuilder);
+        execute(searchRequest, ActionListener.wrap(innerHitsSearchResponse -> {
+            SearchHits innerHits = innerHitsSearchResponse.getHits();
+            if (searchHit.getInnerHits() == null) {
+                searchHit.setInnerHits(new HashMap<>(1));
+            }
+            searchHit.getInnerHits().put(collapseBuilder.getInnerHit().getName(), innerHits);
+            expandCollapsedHits(originalSearchRequest, hits, searchResponse, finalListener);
+        }, finalListener::onFailure));
+    }
+
+    private void addInnerHitOptions(SearchSourceBuilder source, InnerHitBuilder options) {
+        source.from(options.getFrom());
+        source.size(options.getSize());
+        if (options.getSorts() != null) {
+            for (SortBuilder<?> sort : options.getSorts()) {
+                source.sort(sort);
+            }
+        } else if (source.sorts() != null) {
+            for (SortBuilder<?> sort : source.sorts()){
+                source.sort(sort);
+            }
+        }
+        if (options.getFetchSourceContext() != null) {
+            if (options.getFetchSourceContext().includes() == null && options.getFetchSourceContext().excludes() == null) {
+                source.fetchSource(options.getFetchSourceContext().fetchSource());
+            } else {
+                source.fetchSource(options.getFetchSourceContext().includes(),
+                    options.getFetchSourceContext().excludes());
+            }
+        }
+        if (options.getDocValueFields() != null) {
+            for (String field : options.getDocValueFields()) {
+                source.docValueField(field);
+            }
+        }
+        if (options.getStoredFieldsContext() != null && options.getStoredFieldsContext().fieldNames() != null) {
+            for (String field : options.getStoredFieldsContext().fieldNames()) {
+                source.storedField(field);
+            }
+        }
+        if (options.getScriptFields() != null) {
+            for (SearchSourceBuilder.ScriptField field : options.getScriptFields()) {
+                source.scriptField(field.fieldName(), field.script());
+            }
+        }
+        if (options.getHighlightBuilder() != null) {
+            source.highlighter(options.getHighlightBuilder());
+        }
+        source.explain(options.isExplain());
+        source.trackScores(options.isTrackScores());
+    }
 }

Then when the work on #19587 continues, this expand collapsed hit logic should be migrated too.

core/src/test/java/org/elasticsearch/search/collapse/CollapseBuilderTests.java Outdated
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;

public class CollapseBuilderTests extends ESTestCase {

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jan 10, 2017

Member

maybe extend from AbstractWireSerializingTestCase? This should increase test coverage.

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Jan 12, 2017

@martijnvg thanks !
I pushed another commit to address your comments. I've restricted the collapsing to single valued keyword and number fields with doc_values. We can maybe add the multi value support later but as is it would break the inner_hits option. I've removed the metadata field _collapse and now the collapse key is retrieved through the docvalue_fields (automatically added to the request if not already present).
The feature is now documented and I've also added some test regarding the top doc collapsing/merging.
Could you please take another look ?

@martijnvg
Copy link
Member

left a comment

Looks great! I left some small comments.

I've removed the metadata field _collapse and now the collapse key is retrieved through the docvalue_fields (automatically added to the request if not already present).

+1 I really like that approach.

core/src/main/java/org/elasticsearch/index/query/InnerHitBuilder.java Outdated
@@ -183,6 +183,7 @@ private InnerHitBuilder(InnerHitBuilder other) {
}
}


This comment has been minimized.

Copy link
@martijnvg

martijnvg Jan 18, 2017

Member

newlines in this commit aren't needed?

core/src/main/java/org/elasticsearch/action/search/TransportSearchAction.java Outdated
Map<String, Float> concreteIndexBoosts,
ActionListener<SearchResponse> listener) {
final Function<String, DiscoveryNode> nodesLookup = state.nodes()::get;
final long clusterStateVersion = state.version();
Executor executor = threadPool.executor(ThreadPool.Names.SEARCH);
AbstractSearchAsyncAction searchAsyncAction;
switch (searchRequest.searchType()) {
switch(searchRequest.searchType()) {

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jan 18, 2017

Member

I'm ok with making these style changes, so I think this commit isn't needed.

@@ -513,6 +523,16 @@ public SliceBuilder slice() {
return sliceBuilder;
}


public CollapseBuilder collapse() {

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jan 18, 2017

Member

I noticed that the collapse field wasn't taken into account in hascode, equals and swallowCopy methods.
Can you maybe alter SearchSourceBuilderTests.java to also take into account an CollapseBuilder instance?

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jan 20, 2017

Member

Looking at the latest change, and I don't see that the collapse field is taken into account here and the no changes have been made to SearchSourceBuilderTests.java. Can you make these changes?

This comment has been minimized.

Copy link
@jimczi

jimczi Jan 20, 2017

Author Member

These changes are done already ;)
Check RandomSearchRequestGenerator.java where I added a random collapse builder

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jan 20, 2017

Member

Ah sorry, I see, but can take collapse into account in the hashcode and equals method of SearchSourceBuilder?

This comment has been minimized.

Copy link
@jimczi

jimczi Jan 20, 2017

Author Member

Oups, it should yes. I pushed a commit to add the collapse builder in equals and hashcode.

@@ -792,6 +793,17 @@ private void parseSource(DefaultSearchContext context, SearchSourceBuilder sourc
}
context.storedFieldsContext(source.storedFields());
}

if (source.collapse() != null) {

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jan 18, 2017

Member

What if we add this logic to SearchBuilder#collapse(...) setter? Downside is that Java client users could unset doc values fields, but we can add a validation check for that in SearchBuilder#rewrite(...). I think that this approach would be cleaner than this changes in this file.

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jan 20, 2017

Member

What do you think about this ^ comment?

This comment has been minimized.

Copy link
@jimczi

jimczi Jan 20, 2017

Author Member

I've changed it so now we add the field directly in DocValueFetchSubPhase if the collapse field is defined.
It's not what you proposed but I think it's cleaner, WDYT ?

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jan 20, 2017

Member

Cool, that is better than where is initially was.

I think the downside of modifying the search context compared to the search source builder is that the query gets logged on the ES side (slow query log) then it isn't obvious that this request also fetches doc values. Someone would need to know that this is what collapse does. So to me adding this logic here is less transparent compared than in SearchSourceBuilder. What do you think?

This comment has been minimized.

Copy link
@jimczi

jimczi Jan 20, 2017

Author Member

I think it's ok if we document it ? To me it's the opposite, I like the fact that we don't change the original search request and that the logic is implemented directly in the fetch sub phase but if you feel strongly against it I can replace it with your solution.

This comment has been minimized.

Copy link
@martijnvg

martijnvg Jan 20, 2017

Member

I have no strong feelings too about it. It was just a thought stuck in my mind. I'm ok keeping it this way.

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Jan 19, 2017

@elasticmachine test this please

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Jan 19, 2017

Ok I've rebased the branch because it had nothing to do with the initial version.
@martijnvg this now use the search response listener to send the additional inner group queries.
I've also changed the logic to retrieve the collapsing key in the hits, it is now done directly in the doc values fetch sub phase. I think it's ready for another review !

@martijnvg
Copy link
Member

left a comment

LGTM - Push and ship it! :)

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Jan 20, 2017

retest this please

Add top hits collapsing to search request
The field collapsing is done with a custom top docs collector that "collapse" search hits with same field value.
The distributed aspect is resolve using the two passes that the regular search uses. The first pass "collapse" the top hits, then the coordinating node merge/collapse the top hits from each shard.

```
GET _search
{
   "collapse": {
      "field": "category",
   }
}
```

This change also adds an ExpandCollapseSearchResponseListener that intercepts the search response and expands collapsed hits using the CollapseBuilder#innerHit} options.
The retrieval of each inner_hits is done by sending a query to all shards filtered by the collapse key.

```
GET _search
{
   "collapse": {
      "field": "category",
      "inner_hits": {
	"size": 2
      }
   }
}
```
@jpountz
Copy link
Contributor

left a comment

I left some comments.

if (value == null) {
return null;
}
return value.longValue();

This comment has been minimized.

Copy link
@jpountz

jpountz Jan 23, 2017

Contributor

this method could just return value given that longs are immutable?

This comment has been minimized.

Copy link
@jimczi

jimczi Jan 23, 2017

Author Member

Yes, I've started with MutableLong and switched to Long afterward. I'll return the value directly, thanks.


abstract T copy(T value, T reuse);

abstract void setNextReader(LeafReader reader) throws IOException;

This comment has been minimized.

Copy link
@jpountz

jpountz Jan 23, 2017

Contributor

I wish the API was more in-line with things like collectors and comparators, ie. LeafCollapsingDocValuesSource CollapsingDocValuesSource.getLeafSource(LeafReaderContext context)

This comment has been minimized.

Copy link
@jimczi

jimczi Jan 23, 2017

Author Member

It's in-line with the FirstPassGroupingCollector and the only user is the CollapsingTopDocsCollector. The goal is not to reuse this elsewhere ?

This comment has been minimized.

Copy link
@jpountz

jpountz Jan 23, 2017

Contributor

fair enough

}
};
}
docsWithField = DocValues.getDocsWithField(reader, field);

This comment has been minimized.

Copy link
@jpountz

jpountz Jan 23, 2017

Contributor

this assignment looks redundant?

This comment has been minimized.

Copy link
@jimczi

jimczi Jan 23, 2017

Author Member

It is, thanks

throw new IllegalStateException("failed to collapse " + docID +
", the collapse field must be single valued");
}
return sorted.valueAt(0);

This comment has been minimized.

Copy link
@jpountz

jpountz Jan 23, 2017

Contributor

Can we return 0 when the value count is 0 to be consistent with the singe-value case, or throw a proper exception? I am concerned this code could raise a weird error message otherwise.

This comment has been minimized.

Copy link
@jimczi

jimczi Jan 23, 2017

Author Member

The value count must be greater than 0 if we reach this statement, there is an asset above.
We always check if the document has a value for the field before calling this method.


default:
throw new IllegalStateException("unexpected doc values type "
+ type + "` for field `" + field + "`");

This comment has been minimized.

Copy link
@jpountz

jpountz Jan 23, 2017

Contributor

do we need to handle NONE? I'm wondering this could happen if there is a segment that does not have any values for that field

This comment has been minimized.

Copy link
@jimczi

jimczi Jan 23, 2017

Author Member

Right, I'll add a test for that

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Jan 23, 2017

Thanks @jpountz
I pushed another commit to address your comments

@jpountz
Copy link
Contributor

left a comment

LGTM

@jimczi jimczi merged commit e48bc2e into elastic:master Jan 23, 2017

2 checks passed

CLA Commit author is a member of Elasticsearch
Details
elasticsearch-ci Build finished.
Details

@jimczi jimczi deleted the jimczi:collapsing branch Jan 23, 2017

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Jan 23, 2017

Thanks so much @martijnvg and @jpountz for all the reviews ;)
Now merging to 5.x

@jimczi jimczi removed the review label Jan 23, 2017

jimczi added a commit that referenced this pull request Jan 23, 2017

Add field collapsing for search request (#22337)
* Add top hits collapsing to search request

The field collapsing is done with a custom top docs collector that "collapse" search hits with same field value.
The distributed aspect is resolve using the two passes that the regular search uses. The first pass "collapse" the top hits, then the coordinating node merge/collapse the top hits from each shard.

```
GET _search
{
   "collapse": {
      "field": "category",
   }
}
```

This change also adds an ExpandCollapseSearchResponseListener that intercepts the search response and expands collapsed hits using the CollapseBuilder#innerHit} options.
The retrieval of each inner_hits is done by sending a query to all shards filtered by the collapse key.

```
GET _search
{
   "collapse": {
      "field": "category",
      "inner_hits": {
	"size": 2
      }
   }
}
```

@Mpdreamz Mpdreamz referenced this pull request Feb 15, 2017

Closed

Field collapse support #2596

@byronvoorbach

This comment has been minimized.

Copy link
Contributor

commented Feb 15, 2017

Nice feature! Been waiting a while for this 👍 😄
Do you have any future plans for nested field collapsing?? (Not the type, but multi level collapsing)

@Mpdreamz

This comment has been minimized.

Copy link
Member

commented Feb 16, 2017

Small additional feature request (happy to open a new issue if folks agree).

Return the collapse value as part of the innerhits, e.g instead of:

"inner_hits" : {
          "stateofbeing" : {
                  "hits" : {

Something along the lines of:

"inner_hits" : {
          "stateofbeing" : {
                  "collapse_value" : "happy",
                  "hits" : {
@jimczi

This comment has been minimized.

Copy link
Member Author

commented Feb 16, 2017

@Mpdreamz we always return the collapse value in each top hit:

hits": [
      {
        "_index": "reddit-05-15",
        "_type": "comments",
        "_id": "cr56nez",
        "fields": {
          "stateofbeing": [
            "happy"
          ]
        }
        "inner_hits": {
           ....

@Mpdreamz is it what you're looking for ?

@byronvoorbach please note that this feature existed in Elasticsearch before this PR. This implementation is just an optimization for simple cases like single level field collapsing, simple sort and pagination. For an advanced feature like multi level field collapsing you can use the aggregations for the moment.

@Mpdreamz

This comment has been minimized.

Copy link
Member

commented Feb 16, 2017

@jimczi doh 🙈 that totally works.

anti-social added a commit to anti-social/elasticmagic that referenced this pull request Mar 2, 2017

Removed GroupedPageFilter
Soon there will be another ability to group hits:

- elastic/elasticsearch#21833
- elastic/elasticsearch#22337
@monotone

This comment has been minimized.

Copy link

commented Apr 11, 2017

@jimczi
Hi, I want to know, how to get the total number of hits after collapse?
The total field of response is just the total number of query hits.

@byronvoorbach

This comment has been minimized.

Copy link
Contributor

commented Apr 11, 2017

Is source filtering inside the inner_hits supported?? I get a:

"caused_by": {
    "type": "illegal_argument_exception",
    "reason": "[inner_hits] _source doesn't support values of type: START_ARRAY"
  }

When trying:

 "collapse": {
    "field": "collapse_field",
    "inner_hits": {
      "name": "name",
      "size": 5,
      "_source": [
        "field_1",
        "field_2"
      ],
      "sort": [
        {
          "_score": "asc"
        }
      ]
    }
  }

On Elasticsearch 5.3

Edit: Switching out the array for a single string value also throws an exception:

"reason": "[inner_hits] _source doesn't support values of type: VALUE_STRING"

anti-social added a commit to anti-social/elasticmagic that referenced this pull request Apr 11, 2017

Dev/query filters (#30)
* Query filters refactoring

* Removed GroupedPageFilter

Soon there will be another ability to group hits:

- elastic/elasticsearch#21833
- elastic/elasticsearch#22337

* Updated tox, travis and vagga configs. Put documentation building dependencies into requirements_doc.txt

* Fix travis & codecov

* Pep8

* Tests for nested query filters

* Query filters codec tests

* Apply 80th column for queryfilters

* Removed assert_expr utility function
@jimczi

This comment has been minimized.

Copy link
Member Author

commented Apr 11, 2017

@monotone @byronvoorbach this pr is closed so it should not be used to list up questions/issues you have with the feature.
Please use the forum (discuss.elastic.co) or create a new issue describing the problem you found with field collapsing.

For the record, the number of groups after collapsing is not computed since it would be too costly. You can run a cardinality aggregation to get this information.

For source filtering, I think it's just a parsing issue. It accepts an object with includes and excludes but not a simple array of strings. The following works:

"collapse": {
    "field": "collapse_field",
    "inner_hits": {
      "name": "name",
      "size": 5,
      "_source": {
        "includes": ["field_1", "field_2"]
      },
      "sort": [
        {
          "_score": "asc"
        }
      ]
    }
  }

Can you open an issue with the description of the problem?

@monotone

This comment has been minimized.

Copy link

commented Apr 12, 2017

ok, I got it. thanks

@byronvoorbach

This comment has been minimized.

Copy link
Contributor

commented Apr 12, 2017

Hi @jimczi

Sorry for polluting this pr with my question. I figured that this might be a bug since source filtering inside the query doesn't have a need for a specific 'include/exclude' and that it would be relevant for closing off this story. You're example fixed my issue, but not sure if you'd like me to still post this difference in structure on discuss???

@jimczi

This comment has been minimized.

Copy link
Member Author

commented Apr 12, 2017

@byronvoorbach that's a bug ;) Would you mind open an issue for it in github ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.