Multi_match should not enable coordination in bool query with BM25 #18944

Closed
clintongormley opened this issue Jun 17, 2016 · 5 comments
Labels
>bug :Search/Search Search-related issues that do not fall into other categories

Comments

@clintongormley

In 5.0 we use BM25, which means that query coordination should always be disabled. This works correctly with the bool query but the multi_match query enables coordination incorrectly:

PUT t/t/1
{
  "foo": "one",
  "bar": "two"
}

GET t/_search
{
  "query": {
    "multi_match": {
      "query": "one two",
      "fields": ["foo", "bar"]
    }
  },
  "explain": true
}

Returns:

            {
              "value": 0.5,
              "description": "coord(1/2)",
              "details": []
            }
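
For readers unfamiliar with the `coord(1/2)` entry: a minimal sketch of the coordination factor, assuming the classic (TF/IDF) behaviour where coord is the number of matching optional clauses divided by the total number of clauses. The function name `coord` and its parameter names are illustrative here, not an Elasticsearch API.

```python
def coord(overlap: int, max_overlap: int) -> float:
    """Fraction of a bool query's optional clauses that a document matched.

    Assumption: mirrors the classic TF/IDF coord factor (overlap / maxOverlap).
    """
    if max_overlap <= 0:
        raise ValueError("max_overlap must be positive")
    return overlap / max_overlap


# The explain output above reports one matching clause out of two,
# so the clause score is multiplied by 0.5.
print(coord(1, 2))
```

A document matching both clauses would get a factor of 1.0, which is how coord pushes documents that match more terms to the top.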
@clintongormley clintongormley added >bug :Search/Search Search-related issues that do not fall into other categories v5.0.0-alpha4 labels Jun 17, 2016
@jimczi
Contributor

jimczi commented Jun 17, 2016

> In 5.0 we use BM25, which means that query coordination should always be disabled.

The default similarity is still TF/IDF, which is referred to as classic. I opened #18948 to change the default similarity to BM25.

> This works correctly with the bool query but the multi_match query enables coordination incorrectly

This is how the match_query works. It's the same on 2.x; I didn't test 1.7, but it should do the same.
Coords are disabled only when multiple terms are at the same position in the query; otherwise coords are always enabled, and we rely on this functionality for relevancy (documents matching many terms are scored first). Regarding BM25, things will change since coords are not taken into account in this similarity, but this should not be considered a bug? To be honest, I don't know what the impact on relevancy is for queries produced by a match_query or a multi_match_query. @jpountz @rmuir WDYT?

@rmuir
Contributor

rmuir commented Jun 17, 2016

> This is how the match_query works.

This is why a SynonymQuery was added when BM25 became the default: it handles this case in a more generic way for any scoring system (including classic TF/IDF):

> One issue was the generation of synonym queries (posinc=0) by QueryBuilder (used by parsers). This is kind of a corner case (query-time synonyms), but we should make it nicer. The current code in trunk disables coord, which makes no sense for anything but the vector space impl. Instead, this patch adds a SynonymQuery which treats occurrences of any term as a single pseudoterm. With english wordnet as a query-time synonym dict, this query gives 12% improvement in MAP for title queries on BM25, and 2% with Classic (not significant). So its a better generic approach for synonyms that works with all scoring models.
>
> I wanted to use BlendedTermQuery, but it seems to have problems at a glance, it tries to "take on the world", it has problems like not working with distributed scoring (doesn't consult indexsearcher for stats). Anyway this one is a different, simpler approach, which only works for a single field, and which calls tf(sum) a single time.

https://issues.apache.org/jira/browse/LUCENE-6789

Please use it :)
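
To illustrate the "single pseudoterm" idea from the quoted description, here is a rough sketch that compares summing per-synonym tf values with calling tf once on the summed frequency. It assumes the classic tf of sqrt(freq) and deliberately ignores idf, norms, and boosts; the function names are hypothetical and this is not the Lucene implementation.

```python
import math
from typing import Callable, Sequence


def classic_tf(freq: float) -> float:
    # Classic TF/IDF term-frequency component: sqrt(freq).
    return math.sqrt(freq)


def separate_terms_tf(freqs: Sequence[float],
                      tf: Callable[[float], float] = classic_tf) -> float:
    # Each synonym scored as its own clause, then summed (coord disabled).
    return sum(tf(f) for f in freqs)


def pseudoterm_tf(freqs: Sequence[float],
                  tf: Callable[[float], float] = classic_tf) -> float:
    # All synonyms treated as one term: tf(sum) is called a single time.
    return tf(sum(freqs))


# Hypothetical document containing one synonym once and another three times.
freqs = [1, 3]
print(separate_terms_tf(freqs), pseudoterm_tf(freqs))
```

Because tf is computed on the combined frequency, a document is not rewarded simply for using several different synonyms of the same word, and the same approach works with any tf function passed in.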

@rpedela

rpedela commented Jun 20, 2016

I am currently using 2.3.3 and planning to start experimenting with BM25 since Lucene 6.0 makes that the default. I assumed it was ready to be used in ES as well. Is that not the case? Should I wait until 5.0 especially since I use multi_match heavily?

@jimczi
Contributor

jimczi commented Jun 20, 2016

Thanks for the clarification @rmuir.
@rpedela please use it as well ;) Concerning multi_match, you may want to experiment with different boosts, as the scoring for BM25 is different and the range of possible values differs.
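
One way to experiment with boosts is to generate the request body with per-field boosts using the `field^boost` syntax. The sketch below only builds the JSON body; the helper name `multi_match_body` and the boost values are made up for illustration, and the fields reuse `foo` and `bar` from the example at the top of the issue.

```python
import json
from typing import Dict


def multi_match_body(query: str, boosts: Dict[str, float]) -> dict:
    """Build a multi_match search body with per-field boosts (field^boost)."""
    fields = [f"{name}^{boost}" for name, boost in boosts.items()]
    return {
        "query": {"multi_match": {"query": query, "fields": fields}},
        "explain": True,  # keep the explanation to compare scores across runs
    }


# Hypothetical boosts: weight "foo" twice as much as "bar".
body = multi_match_body("one two", {"foo": 2, "bar": 1})
print(json.dumps(body, indent=2))
```

The printed body can be sent with `GET t/_search`; comparing the `explain` output for a few boost combinations shows how the BM25 score ranges shift.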

@jimczi
Contributor

jimczi commented Jun 21, 2016

@clintongormley I think we can close this issue (please reopen if you disagree). The coords are a TF/IDF thing that was added as a countermeasure for terms with very high term frequency where the score constantly increases and never reaches a saturation point like in BM25.
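
The saturation point mentioned here can be seen by comparing the two term-frequency components directly. This is a simplified sketch: it assumes the classic tf of sqrt(freq) and the BM25 tf component with k1 = 1.2, and it omits BM25 length normalization (equivalent to b = 0) as well as idf.

```python
import math


def classic_tf(freq: float) -> float:
    # Classic TF/IDF: grows without bound as freq increases.
    return math.sqrt(freq)


def bm25_tf(freq: float, k1: float = 1.2) -> float:
    # BM25 tf component without length normalization (assumes b = 0).
    # Approaches, but never reaches, k1 + 1 as freq grows.
    return freq * (k1 + 1) / (freq + k1)


for freq in (1, 10, 100, 1000):
    print(freq, round(classic_tf(freq), 3), round(bm25_tf(freq), 3))
```

With classic tf, a very frequent term keeps adding to the score, whereas the BM25 component levels off below k1 + 1, which is why the extra coord correction is less relevant there.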

4 participants