Synonym Query Style (configurable single term syn queries) #35422

Open (wants to merge 5 commits into base: master)

softwaredoug commented Nov 9, 2018

In Lucene, SynonymQuery became the default behavior for single term synonym queries, which ES 6.0 inherited. While this is a good default, it interferes with other legitimate uses of synonyms. You don't always want to blend document frequencies of the terms.

It's common to want to expand a term to a broader term while maintaining the specificity of the original term (e.g. jeans => jeans, trousers). I and others have written about these techniques for implementing hierarchical vocabularies. In these cases blending ends up doing more harm than good, so this PR makes this behavior configurable.

This PR introduces an option to the match query, synonym_query_style which can have the values blended (default), most_terms, or best_terms.

{
   "match": {
      "text": {
        "query": "blue jeans",
        "synonym_query_style": "most_terms"
       }
   }
}

With a synonym file containing jeans, trousers, this would turn into the search text:jeans text:trousers. This is basically the pre-6.0 behavior.

Using best_terms changes the synonym query to a dismax, text:jeans | text:trousers, which tends to ignore the broader term if the narrower term is present.
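To make the difference between the three styles concrete, here is a minimal sketch in plain Java. It is not the PR's implementation and does not use Lucene: it assumes a simplified BM25-like score with no length normalization, assumes that blended behaves like SynonymQuery (term frequencies summed, document frequency taken as the maximum across synonyms), and uses made-up document frequencies. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Toy scoring model for comparing the three styles; not the PR's implementation.
final class SynonymStyleSketch {

    // BM25-style inverse document frequency: ln(1 + (N - df + 0.5) / (df + 0.5)).
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Score of one term in one document; length normalization is ignored, k1 fixed at 1.2.
    static double termScore(long docCount, long docFreq, double tf) {
        return idf(docCount, docFreq) * tf / (tf + 1.2);
    }

    // blended: score the synonyms as a single term (assumed: summed tf, max df).
    static double blended(long docCount, long[] docFreqs, double[] tfs) {
        long df = Arrays.stream(docFreqs).max().orElse(0L);
        double tf = Arrays.stream(tfs).sum();
        return termScore(docCount, df, tf);
    }

    // most_terms: boolean OR, so the per-term scores add up.
    static double mostTerms(long docCount, long[] docFreqs, double[] tfs) {
        double sum = 0.0;
        for (int i = 0; i < tfs.length; i++) {
            sum += termScore(docCount, docFreqs[i], tfs[i]);
        }
        return sum;
    }

    // best_terms: dismax without a tie breaker, so only the best per-term score counts.
    static double bestTerms(long docCount, long[] docFreqs, double[] tfs) {
        double best = 0.0;
        for (int i = 0; i < tfs.length; i++) {
            best = Math.max(best, termScore(docCount, docFreqs[i], tfs[i]));
        }
        return best;
    }

    public static void main(String[] args) {
        long docCount = 1000;
        long[] docFreqs = {50, 400};   // hypothetical: jeans is rare, trousers is common
        double[] jeansDoc = {1, 0};    // document mentions only "jeans"
        double[] trousersDoc = {0, 1}; // document mentions only "trousers"

        System.out.printf("blended:    jeans doc %.3f, trousers doc %.3f%n",
                blended(docCount, docFreqs, jeansDoc), blended(docCount, docFreqs, trousersDoc));
        System.out.printf("most_terms: jeans doc %.3f, trousers doc %.3f%n",
                mostTerms(docCount, docFreqs, jeansDoc), mostTerms(docCount, docFreqs, trousersDoc));
        System.out.printf("best_terms: jeans doc %.3f, trousers doc %.3f%n",
                bestTerms(docCount, docFreqs, jeansDoc), bestTerms(docCount, docFreqs, trousersDoc));
    }
}
```

Under these assumptions, blended scores the two documents identically, while most_terms and best_terms rank the document containing the rarer term higher. The two non-blended styles only diverge when a document contains both terms.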

Some notes on this PR:

  • Tests seem to be failing locally for me for unrelated functionality (transport client stuff...).
  • Any feedback on where to place a test is greatly appreciated. I looked for how other match query options were tested and did not find anything.

softwaredoug added some commits Nov 9, 2018

softwaredoug commented Nov 10, 2018

Hey @romseygeek I know you do a lot with synonyms in ES, I wonder if you could have a gander at this?

*/
public MatchQueryBuilder synonymQueryStyle(MatchQuery.SynonymQueryStyle synQueryStyle) {
if (synQueryStyle == null) {
throw new IllegalArgumentException("[" + NAME + "] requires synQueryStyle to be non-null");

martin-g Nov 10, 2018

Wouldn't it be better if the exception message used synonym_query_style instead of synQueryStyle? I think it would be clearer.

@@ -534,10 +539,20 @@ public MultiMatchQueryBuilder zeroTermsQuery(MatchQuery.ZeroTermsQuery zeroTerms
return this;
}

public MultiMatchQueryBuilder synonymQueryStyle(MatchQuery.SynonymQueryStyle synQueryStyle) {
if (synonymQueryStyle == null) {
throw new IllegalArgumentException("[" + NAME + "] requires synonym query style to be non-null");

martin-g Nov 10, 2018

Here it is cleaner, but again I think using synonym_query_style would be better.

@@ -175,6 +231,8 @@ public void writeTo(StreamOutput out) throws IOException {

protected boolean autoGenerateSynonymsPhraseQuery = true;

protected SynonymQueryStyle synonymQueryStyle = SynonymQueryStyle.BLENDED_TERMS;

martin-g Nov 10, 2018

Why not use DEFAULT_SYNONYM_QUERY_STYLE as the value? That way, if the default ever changes, it only has to be changed in one place.

case BLENDED_TERMS:
return blendTermsQuery(terms, mapper);
default:
throw new IllegalStateException("unrecognized synonymQueryStyle passed when creating newSynonymQuery");

martin-g Nov 10, 2018

For easier debugging it would be good to print the (unknown) value of this.synonymQueryStyle

softwaredoug commented Nov 10, 2018

@martin-g - thanks for the review! I believe I addressed your comments

elasticmachine commented Nov 12, 2018

Pinging @elastic/es-search-aggs

colings86 added the v7.0.0 label Nov 12, 2018

softwaredoug commented Nov 12, 2018

Any chance this could be backported into 6.5? (or whatever appropriate 6.x release makes sense?)

byronvoorbach commented Nov 14, 2018

Nice work @softwaredoug, really looking forward to this feature!

romseygeek commented Nov 14, 2018

Hi @softwaredoug, this looks very interesting. Can you look at adding tests to MatchQueryBuilderTests in server, and also to 50_queries_with_synonyms in the analysis-common module?

jimczi left a comment

Thanks @softwaredoug . I have some concerns with this approach that I'll try to summarize here.

> It's common to want to expand a term to a broader term while maintaining the specificity of the original term (e.g. jeans => jeans, trousers).

The broadening term depends on the frequencies of the terms in the index, so it depends on the context. trousers may seem broader than jeans, but there is no guarantee that it will have a smaller score. It might even be the opposite if the corpus does not use this term often.

> This PR introduces an option to the match query, synonym_query_style which can have the values blended (default), most_terms, or best_terms.

It works if all your synonyms should have the same scoring behavior, but how do you handle the case where you need different behavior per synonym rule?
I don't like the boolean scoring approach because it is impossible to determine the score difference when you define your synonym rule (you need to know the internal frequencies). One alternative solution would be to add the ability to set a boost per synonym in the rule. We'd still use the SynonymQuery all the time to make sure that scores are comparable, but the boost could be applied per match. So if jeans has a boost of 2 we'd multiply the total term frequency for that term by 2 and do the same for document term frequency (a document that matches jeans would have the document frequency for that term multiplied by 2). I don't know how much work it would take to change the synonym filter and the query builder to handle this kind of boosting, but this approach would bring more flexibility and consistent scoring. WDYT?
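A rough sketch of the per-synonym boost idea, in plain Java rather than against Lucene: each synonym's within-document term frequency is multiplied by its boost before the frequencies are summed, and the similarity is applied once with a shared document frequency. The class, the method names, the simplified BM25-like formula, and the numbers are all assumptions for illustration, not an existing API.

```java
// Sketch of "boost per synonym inside a single blended score"; all names are hypothetical.
final class BoostedSynonymSketch {

    // BM25-style inverse document frequency: ln(1 + (N - df + 0.5) / (df + 0.5)).
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Each synonym's term frequency is scaled by its boost before summing.
    static double boostedTf(double[] tfs, double[] boosts) {
        if (tfs.length != boosts.length) {
            throw new IllegalArgumentException("tfs and boosts must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < tfs.length; i++) {
            sum += boosts[i] * tfs[i];
        }
        return sum;
    }

    // One similarity call for the whole synonym group, so scores stay comparable.
    static double score(long docCount, long sharedDocFreq, double[] tfs, double[] boosts) {
        double tf = boostedTf(tfs, boosts);
        return idf(docCount, sharedDocFreq) * tf / (tf + 1.2); // k1 = 1.2, no length norm
    }

    public static void main(String[] args) {
        double[] boosts = {2.0, 1.0};  // hypothetical rule: jeans boosted by 2, trousers by 1
        double[] jeansDoc = {1, 0};
        double[] trousersDoc = {0, 1};
        System.out.println("jeans doc:    " + score(1000, 400, jeansDoc, boosts));
        System.out.println("trousers doc: " + score(1000, 400, trousersDoc, boosts));
    }
}
```

Because the document frequency is shared, the ordering between the two documents here depends only on the boosts in the rule, not on how often each term happens to occur in the corpus.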

alessandrobenedetti commented Nov 15, 2018

Hi @jimczi,
You may find this work I did a few months ago on the Apache Solr side interesting:
https://issues.apache.org/jira/browse/SOLR-12238
Inline:

> The broadening term depends on the frequencies of the terms in the index, so it depends on the context. trousers may seem broader than jeans, but there is no guarantee that it will have a smaller score. It might even be the opposite if the corpus does not use this term often.

I agree, and I believe that if you choose the <best_terms> approach, you need to be aware that you are preferring the rarest term from the index's perspective, not from the human perspective.
Furthermore, using <best_terms> out of the box doesn't guarantee that documents matching the exact entity win over documents containing a rarer synonym.
Basically it boils down to a question:
Documents with hypernym < Documents with synonym < Documents with hyponym OR
Documents with hypernym < Documents with hyponym < Documents with synonym (I prefer this)

And then, how to implement this? Modelling it through the current synonym rules could be tricky.
Maybe some sort of lightweight taxonomy format?
And then applying different boosts to each hypernym/synonym/hyponym?
Blending the document frequency hierarchically, so the hyponym gets the original document frequency of the searched term plus its own?
But how do you make Documents with hypernym < Documents with hyponym?
A simple boost would not win over a strongly unbalanced document frequency...

alessandrobenedetti left a comment

I believe it is a nice addition to ES.
I would add some more testing, but this is already specified in the other reviews.
It is a +1 from me; it is a very good starting point for managing advanced synonyms.
It can then be improved and expanded in the future.

softwaredoug commented Nov 15, 2018

Hey @jimczi

Yes, I agree that won't always work. Synonyms are pretty messy in practice :), and the more tools we can give an ES practitioner the better.

It depends on whether you're labeling true alternate labels (usa,united states) or terms with less clear synonymy (khakis,trousers) where specificity matters.

In the latter case, they are putting themselves in the mind of the searcher. They have the same notions of specificity that a searcher would. So if they expect that "trousers" is common and "khakis" is rare, they would expect that pattern to be reflected in scoring. Taking the boolean or dismax of synonyms using each term's specificity (measured by df) reflects these semantics; SynonymQuery tends to reflect the semantics of the former case (alternate labels with 99% overlap in meaning). When SynonymQuery is applied to scenarios where the overlap isn't close to 100%, it creates unexpected scoring (why are generic trousers showing up higher when I search for khakis!); when boolean/dismax is applied to the former it also tends to create confusion (why is 'usa' always showing up higher than 'united states').

To your question, what I see in practice is people create synonym filters that map to specific semantics and use them in that way. So they know they have a synonym filter that captures loose synonymy/hypernymy/hyponymy, and another filter that's capturing alternate labels (true synonyms, acronyms, decompounds, common misspellings, etc.). But forcing everything into SynonymQuery now makes it more difficult to capture the khakis,trousers usage of synonyms that many expect.

I agree it would be useful to have more metadata per synonym for a variety of things (see Alessandro's great work here).

There's also a set of practices around using the document frequency at index time to map to the specificity users expect - see this blog article on index-time taxonomies:
https://opensourceconnections.com/blog/2016/12/23/elasticsearch-synonyms-patterns-taxonomies/

jimczi left a comment

> synonyms are pretty messy in practice :), and the more tools we can give an ES practitioner the better.

I disagree; the SynonymQuery fixed the scoring of single term synonyms nicely. It only handles the case where all synonyms are true synonyms, but it handles that case well. The score for the merged term is very close (we have to approximate the document frequency because we don't know where terms co-occur in the corpus) to the "true" score if synonyms were applied at index time. This is a nice property and it is very easy to explain how it works. The boolean alternative is what caused a lot of issues in the past, so I don't see this PR as an enhancement but rather as a way to restore the buggy behavior ;). As I said in my previous comment, I think we can handle the importance of synonyms directly in the SynonymQuery: if we can add a boost per term, then all sorts of synonyms can be added and we would not break the BM25 scoring logic. The tricky part is to propagate these boosts from the synonym filter to the query builder, but we could use a special attribute for this purpose.

softwaredoug commented Nov 15, 2018

I disagree - I think SynonymQuery has created as many issues as it has fixed :). In fact, it's probably broken the more common case (users specifying inexact synonyms). You're proposing that the only solution now is hard-coded prioritization per synonym by the user? Keep in mind, people have millions of synonyms, and have long expected the search engine to be good at figuring out which terms are more exact than others.

We already have document frequency, which is a good approximation of the importance of a term. I think it's a reasonable approach to let them use the term's document frequency to prioritize synonyms.

softwaredoug commented Nov 15, 2018

BTW I would love to add per-synonym prioritization. I just don't think it should be the default way of doing this when there are good term stats around, and it would require a fair amount of work to get in place.

renekrie commented Nov 15, 2018

@jimczi I wonder why we shouldn't have both: per synonym boost factors and a parameterisable strategy for dealing with df in synonyms - in my opinion that's two different concerns.

romseygeek commented Nov 16, 2018

> It depends on whether you're labeling true alternate labels (usa,united states) or terms with less clear synonymy (khakis,trousers) where specificity matters.

The first instance here will always use a blended term frequency, because it resolves to a SpanQuery due to the difference in path lengths between the synonyms, but that's a separate issue...

I agree with @jimczi here; the synonym filter is really built to deal with true synonyms, and for things like hypernyms I think we need a different solution. Document frequency gives information about the relative importance of terms within a corpus; it doesn't tell us anything about semantic relationships, and we need another way to deal with that.

I've had a quick look at SynonymMap, and it's not going to be simple to add weights in there. One possibility would be to allow synonym filters to customize the TypeAttribute on a token, so you could have one filter for pure synonyms, and then one that sets the type on each token to 'hypernym' instead; then the query builder could detect that and use it to adjust term frequencies. This would allow the fairly common use case where people want the original term to score more highly than a synonym.
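One way to picture the token-type idea is the following plain-Java sketch. It does not use Lucene's TypeAttribute or any real query builder: the Token class, the "hypernym" and "synonym" type strings, and the hypernymWeight parameter are stand-ins chosen for illustration only.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: a query builder that weights tokens by the type a filter attached to them.
final class TypedSynonymSketch {

    // Minimal stand-in for a token carrying a term and a type label.
    static final class Token {
        final String term;
        final String type;

        Token(String term, String type) {
            this.term = term;
            this.type = type;
        }
    }

    // Tokens typed "hypernym" get the reduced weight; every other type keeps weight 1.0.
    static Map<String, Double> termWeights(List<Token> tokensAtPosition, double hypernymWeight) {
        Map<String, Double> weights = new LinkedHashMap<>();
        for (Token token : tokensAtPosition) {
            weights.put(token.term, "hypernym".equals(token.type) ? hypernymWeight : 1.0);
        }
        return weights;
    }

    public static void main(String[] args) {
        // Hypothetical analysis output for the query term "jeans".
        List<Token> tokens = Arrays.asList(
                new Token("jeans", "word"),
                new Token("denims", "synonym"),
                new Token("trousers", "hypernym"));
        System.out.println(termWeights(tokens, 0.5));
    }
}
```

The weights could then be used to scale term frequencies, so that the original term and its true synonyms score above the hypernym.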

softwaredoug commented Nov 16, 2018

@romseygeek does @alessandrobenedetti's work in this Jira provide any help? http://lucene.472066.n3.nabble.com/jira-Created-SOLR-12238-Synonym-Query-Style-Boost-By-Payload-td4386107.html

You could natively support a different format that simply expressed a hierarchy, and inject a payload attribute on the token in the token stream that forced the sort of ordering you would like. If you had a hierarchy that was

\trousers\khakis

The taxonomy filter could have configuration that lets you control the weight of each level (or the amount by which to reduce the weight as you descend levels), as in:

"analysis": {
  "filter": {
    "clothing_taxonomy": {
      "type": "taxonomy",
      "taxonomy": [
        "\\trousers\\khakis",
        "\\trousers\\jeans"
      ],
      "weight_discount": 0.1
    }
  }
}

The output of running "trousers" through this filter would be:

token: trousers
payload: 1.0

Comparatively running "jeans" through does:

token:trousers
payload:1.0

token:jeans
payload:0.1   <-- notice it's weight_discount * parent's boost

This payload then acts as a boost within SynonymQuery.
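The expansion described above could be sketched roughly as follows. This is plain Java, not a Lucene TokenFilter, and the taxonomy filter itself is only a proposal in this thread; the class name, the expand method, and the pass-through behavior for terms outside the taxonomy are assumptions.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the proposed taxonomy expansion: emit every level down to the term, with payloads.
final class TaxonomyPayloadSketch {

    // Returns token -> payload for the given term. The root level gets 1.0 and each
    // level below it gets the parent's payload multiplied by weightDiscount.
    static Map<String, Double> expand(String term, List<String> paths, double weightDiscount) {
        Map<String, Double> tokens = new LinkedHashMap<>();
        for (String path : paths) {
            // "\trousers\jeans" -> ["trousers", "jeans"]
            String trimmed = path.startsWith("\\") ? path.substring(1) : path;
            String[] levels = trimmed.split("\\\\");
            int depth = Arrays.asList(levels).indexOf(term);
            if (depth < 0) {
                continue; // this path does not contain the term
            }
            double payload = 1.0;
            for (int i = 0; i <= depth; i++) {
                tokens.putIfAbsent(levels[i], payload);
                payload *= weightDiscount;
            }
        }
        if (tokens.isEmpty()) {
            tokens.put(term, 1.0); // assumption: terms outside the taxonomy pass through unchanged
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> taxonomy = Arrays.asList("\\trousers\\khakis", "\\trousers\\jeans");
        System.out.println(expand("trousers", taxonomy, 0.1));
        System.out.println(expand("jeans", taxonomy, 0.1));
    }
}
```

With the two-path taxonomy from the configuration above and a weight_discount of 0.1, "trousers" expands to itself only, while "jeans" expands to trousers with payload 1.0 and jeans with payload 0.1, matching the example output.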

One challenge with this representation format is when there's both synonyms and a taxonomy. For example if one decides that blue_jeans is a synonym of jeans. I would advocate for using a subsequent synonym filter that preserves the payload to deal with this.

softwaredoug commented Nov 17, 2018

I also want to add that SynonymQuery was brought into Lucene as part of an unrelated patch, without really any community discussion. So I'm not sure we should treat it as sacrosanct.

softwaredoug commented Nov 17, 2018

This paper on query expansion summarizes the state of the art on synonym scoring. Specifically 3.2 where scoring of synonyms from conceptnet/wordnet is discussed. I'm still going through it, but I think it could have a lot of insight on how something like SynonymQuery should behave
