Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a minimum_should_match parameter when Common query has only high frequent terms #3188

Closed
hc opened this Issue Jun 14, 2013 · 19 comments

Comments

Projects
None yet
3 participants
@hc
Copy link
Contributor

hc commented Jun 14, 2013

When a Common query has only high frequent terms, they are combined in a boolean MUST query.

I'd like to add the possibility to specify a minimum_should_match for this specific case. Because when I use a very low cutoff_frequency, the boolean fallback query is too restrictive in this case.

It requires to override the buildQuery method in the Lucene CommonTermsQuery class. I called the parameter "high_freq_only_minimum_should_match", but that might be a bit far-fetched... what do you think?

@ghost ghost assigned s1monw Jun 14, 2013

@s1monw

This comment has been minimized.

Copy link
Contributor

s1monw commented Jun 14, 2013

hey man, great to see you are using this feature. I agree this could be tricky if you have a low cutoff. Yet, I am not sure if we should add yet another parameter or just use a disjunction operator if minimum_should_match is set. This would make it a bit easier? I'd like to push all the changes we make here upstream as well I think this is good input. Does this query work well for you so far?

@hc

This comment has been minimized.

Copy link
Contributor Author

hc commented Jun 14, 2013

I used to manually export the top terms of my index and do exactly what the CommonTerm query does but in my frontend application. So this new query is a great addition for me.

Do you mean using the same minimum_should_match parameter for both the "main" query and the fallback one? Because I would probably want a higher minimum_should_match for the query with only high terms.

For example if I have a query with only high frequent terms: "How to live your life with her?" I want a more restrictive minimum_should_match than for this query: "A chromosome is an organized structure of DNA and protein"

Something like:

{ "query" : "...",
"cutoff_frequency" : 0.0002,
"minimum_should_match" : "2/50%",
"high_freq_only_minimum_should_match" : "4/75%" }

But it might be too specific to my case.

@hc

This comment has been minimized.

Copy link
Contributor Author

hc commented Jun 15, 2013

Here an attempt at what I meant: hc@cc4414b

@s1monw

This comment has been minimized.

Copy link
Contributor

s1monw commented Jun 16, 2013

hey cedric, I see what you mean though. I think this makes sense and I think I am ok with the change on buildQuery() - yet I don't think we should introduce yet another parameter on the top level. What do you think about this:

{ 
  "query" : "...",
  "cutoff_frequency" : 0.0002,
  "minimum_should_match" : { "low_freq" :  "2/50%", "high_freq" :  "4/75%" }
}

and we can also accept this:

{ 
  "query" : "...",
  "cutoff_frequency" : 0.0002,
  "minimum_should_match" : "2/50%"
}

this is equivalent to:

{ 
  "query" : "...",
  "cutoff_frequency" : 0.0002,
  "minimum_should_match" : { "low_freq" :  "2/50%" }
}

which makes this elegant and backwards compatible, makes sense?

@hc

This comment has been minimized.

Copy link
Contributor Author

hc commented Jun 17, 2013

It makes more sense yes, makes it cleaner. I did the changes there :
hc@277f999

I don't know if I did it "the right way", let me know.

@s1monw

This comment has been minimized.

Copy link
Contributor

s1monw commented Jun 17, 2013

I commented on the commit... looks good :)

@hc

This comment has been minimized.

Copy link
Contributor Author

hc commented Jun 17, 2013

Commented too, I pushed the changes there: https://github.com/hc/elasticsearch/tree/common_terms_minimum_high

@s1monw

This comment has been minimized.

Copy link
Contributor

s1monw commented Jun 17, 2013

cool looks awesome. I think it's ok to break BW compat on such a new "experimental" query in the Java API. I just thought about this for a while and given those changes I think it would make sense to apply the highFreqMinimumShouldMatch also if we have low freq terms. Since the special case we try to solve here is really a common case at the end of the day this would make it consistent and easier to document, does this make sense?

@hc

This comment has been minimized.

Copy link
Contributor Author

hc commented Jun 17, 2013

In my mind, setting the highFreqMinimumShouldMatch is to avoid returning no result. But if the same parameter is also applied to the main query, it might reduce the number of results, no?

In your suggestion, if I am looking for "a lot like protein lipid", with "4/75%" as earlier, I am ending with a query where all terms are required? protein, lipid (low freq) + a, lot, like (high freq)

Or I didn't get what you mean :)

@s1monw

This comment has been minimized.

Copy link
Contributor

s1monw commented Jun 17, 2013

In my mind, setting the highFreqMinimumShouldMatch is to avoid returning no result. But if the same parameter is also applied to the main query, it might reduce the number of results, no?

not really. so lets take your example a lot like protein lipid you will end up with two queries just like this:

{ 
  "query" : "a lot like protein lipid ",
  "cutoff_frequency" : 0.0002,
  "minimum_should_match" : { "low_freq" :  "2/50%" , "high_freq" : "4/75%" }
}

which is somewhat equivalent to:

bool : {
  must : [ { query: "protein lipid",  "minimum_should_match" :   "2/50%"  } ],
 should : [ {query : "a lot like" ,  "minimum_should_match" :   "4/70%"  }]
}

so the high freq part is executed as a should while if you don't have any low freq terms it's executed as a must so in that case the high freq part will only contribute to the score iff the minimum_should_match criteria is met on the should part otherwise you only get the score from the must part, does this make sense now?

@hc

This comment has been minimized.

Copy link
Contributor Author

hc commented Jun 17, 2013

Ah yes you are right, I had my head so much in the special case with only high terms I forgot there were inner boolean queries.

Now that makes sense to me :)

By the way, I just noticed the "highFreqBoost" and "lowFreqBoost" variables, they could also be exposed? Well maybe in another ticket.

@s1monw

This comment has been minimized.

Copy link
Contributor

s1monw commented Jun 17, 2013

Now that makes sense to me :)

awesome, can you update the PR? if you updated it can you squash the commits into one so I can pull it in?

By the way, I just noticed the "highFreqBoost" and "lowFreqBoost" variables, they could also be exposed? Well maybe in another ticket.

I'd be happy to help you doing this and I agree lets do it in another ticket!

thanks!

@hc

This comment has been minimized.

Copy link
Contributor Author

hc commented Jun 17, 2013

I merged my commits : hc@d4cd1aa

@s1monw

This comment has been minimized.

Copy link
Contributor

s1monw commented Jun 17, 2013

this looks very very cool! thanks man for doing this. I will pull this in soon today or tomorrow.
Can I convince you that it would be awesome to have this folded back into Lucene as well? I'd be happy to help you working on a port to lucene. We try in general to push all our extension / fixes upstream so 1. everybody benefits from them, 2. we don't have to maintain too much code and 3. we don't diverge from the lucene functionallity.

If you need help how to get started here is a guide or just go ahead and ask me :)

@s1monw

This comment has been minimized.

Copy link
Contributor

s1monw commented Jun 17, 2013

oh one more thing, can you add a Closes #3188 as the last line in the commit message and open a pull-request for this so I can easily pull it in!

thanks!

@s1monw s1monw closed this in d41c37f Jun 17, 2013

s1monw added a commit that referenced this issue Jun 17, 2013

Add support for "high_freq" and "low_freq" parameters for Common Query
"minimum_should_match" parameter. High freq parameters is used when the
query has only high frequent terms.

Closes #3188
@s1monw

This comment has been minimized.

Copy link
Contributor

s1monw commented Jun 17, 2013

thanks man!

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Jun 17, 2013

Add support for "high_freq" and "low_freq" parameters for Common Query
"minimum_should_match" parameter. High freq parameters is used when the
query has only high frequent terms.

Closes elastic#3188
@kimchy

This comment has been minimized.

Copy link
Member

kimchy commented Jun 23, 2013

can this be properly documented on how it can be used in the issue description, so we can update the docs once 0.90.2 is released?

@hc

This comment has been minimized.

Copy link
Contributor Author

hc commented Jun 27, 2013

@hc

This comment has been minimized.

Copy link
Contributor Author

hc commented Jun 28, 2013

It's backward compatible, no breaking I think

mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015

Add support for "high_freq" and "low_freq" parameters for Common Query
"minimum_should_match" parameter. High freq parameters is used when the
query has only high frequent terms.

Closes elastic#3188
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.