-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LUCENE-9537 #2097
LUCENE-9537 #2097
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @cammiemw! This looks very nice (think Borat's voice).
It makes me nervous that we are having to "fork" some complex low-level Lucene search time classes (Scorable
, Scorer
, BooleanQuery/Scorer
, DisjunctionScorer
, etc.). I think it might also be confusing to users to have to pick entirely different Query
class when they just want to have this nice smoothing. But, maybe we can improve on that in followon improvements.
@@ -46,6 +46,11 @@ public float score() throws IOException { | |||
return in.score(); | |||
} | |||
|
|||
@Override | |||
public float smoothingScore(int docId) throws IOException { | |||
return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm should this be return in.smoothingScore(docId)
instead?
@@ -59,6 +59,11 @@ | |||
@Override | |||
public final float score() { return score; } | |||
|
|||
@Override | |||
public float smoothingScore(int docId) throws IOException { | |||
return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, should we also cache the smoothingScore
for this hit?
Or, if we will keep it at returning 0
, couldn't we remove this impl and inherit the default from Scorable
?
@@ -0,0 +1,22 @@ | |||
package org.apache.lucene.search; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add the standard Apache copyright header here and in all the new classes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done :-)
*/ | ||
public class IndriAndQuery extends IndriQuery { | ||
|
||
public IndriAndQuery(List<BooleanClause> clauses) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, BooleanClause
can hold any Lucene Query
implementation, but it looks like we are only supporting TermQuery
and other IndriAndQuery
clauses (just from reading the javadoc)? If so, should we check/enforce this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello, any comment on this one?
//If the query exists in the document, score the document | ||
//Otherwise, compute a smoothing score, which acts like an idf | ||
//for subqueries/terms | ||
if (docId == scorerDocId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you factor out the 2nd two statements under each of the true
and false
clauses here? I.e., only the first line needs to be conditional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done :-)
|
||
/** | ||
* The Indri implemenation of a disjunction scorer which stores the subscorers | ||
* for for the child queries. The score and smoothingScore methods use the list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove extra for
?
*/ | ||
abstract public class IndriScorer extends Scorer { | ||
|
||
private float boost; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm is this (to store boost
) the only reason to have a separate IndriScorer
? If I remember right, Lucene used to apply boost
similarly (every scorer kept track of it) but at one point we moved all boosting to a dedicated BoostQuery
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this because I apply the boost in the scorer rather than in the similarity (such as in LMDirichletSimilarity), and I divide by the sum of the boosts. I originally did this to exactly match the Indri scores; however, this is not a huge priority. I don't have much issue with dropping this if it doesn't fit in the Lucene workflow well.
/** | ||
* Returns the smoothing score of the current document matching the query. | ||
* This score is used when the query/term does not appear in the document. | ||
* This can return 0 or a smoothing score. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe link to Indri paper that describes/motivates this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added more detail and a paper the describes the motivation of the smoothing score. The description of how the smoothing score is used is at the bottom of page 11 in the paper. It is important to note that most of the explanation has to do with when the score is a product. Even though the IndriAndScorer does not use a product, the smoothing score is still helpful for acting like an idf. Additionally, there are many more Indri operators that I would like to add that do use a product.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks.
* engine (http://www.lemurproject.org/indri.php). Indri Dirichelet Smoothing! | ||
* tf_E + mu*P(t|D) P(t|E)= ------------------------ documentLength + documentMu | ||
* mu*P(t|C) + tf_D where P(t|D)= --------------------- doclen + mu | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, add a few words giving some intuition about what mu
does? It looks like it roughly lets you tune how important document length is in the scoring?
Also, the formatting of the above equations looks like it got garbled? You will need to use html/javadoc markup to make the formatting survive future developers viewing in browser...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried adding more formatting and a description of mu. Let me know if you would like to see anything different or more. Thanks!
/** A Query that matches documents matching combinations of | ||
* {@link TermQuery}s or other IndriAndQuerys. | ||
*/ | ||
public class IndriAndQuery extends IndriQuery { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally in Lucene AND
implies MUST
, i.e. required clauses.
But this query is actually disjunctive, right? Documents will match even if they are missing some of the terms. Should we name it IndriOrQuery
maybe? Or, IndriBooleanQuery
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree the naming is confusing. I have taken the naming schema as well as the logic from the original Indri search engine implementation. The issue with renaming it is that there is already IndriOrQuery, which I have created and hope to be able to add at a future time. I will continue to think about whether there is a better name for the IndriAndQuery though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK let's keep the (somewhat confusing) naming for now. Naming is the hardest part!!
…smoothing scores for TermQueries that have a frequency of 0
Thanks for your comments and quick response time @mikemccand! I think I have addressed and replied to each one. An alternative to touching so many low-level lucene classes would be to create a separate interface with only the smoothingScore method. Then the only Lucene class that I would need to change would be TermScorer to implement that interface and add the smoothing score method. This might be less risky although perhaps not as clean. As I mentioned in some of my comments, I am hoping that can be a base for suggesting to add more Indri content. I have several more query types and an Indri parser implemented already, but I thought it might make sense to start with a small piece of functionality so I could hopefully learn the code base and process. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great to me!
I left small comments -- I think if you add a default implementation you can remove all the return 0
methods in Scorable
implementations.
I hope another Lucene developer, with more experience on the search side of Lucene, will also chime in ...
/** A Query that matches documents matching combinations of | ||
* {@link TermQuery}s or other IndriAndQuerys. | ||
*/ | ||
public class IndriAndQuery extends IndriQuery { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK let's keep the (somewhat confusing) naming for now. Naming is the hardest part!!
*/ | ||
public class IndriAndQuery extends IndriQuery { | ||
|
||
public IndriAndQuery(List<BooleanClause> clauses) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello, any comment on this one?
* toString, equals, getClauses, and iterator. | ||
* | ||
*/ | ||
public abstract class IndriQuery extends Query |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Random question: will IndriQuery
take advantage of Block MAX Weak And optimization? The added smoothingScore
must alter the optimization logic to find the min block score to skip to?
I think it's OK to defer BMW to followon improvements, as long as it is not kicking in incorrectly here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, IndriQuery does not take advantage of Block MAX Weak and optimization. We iterate through all documents that that have a posting for at least one of the search terms. I would be interested in expanding the smoothing score functionality to more parts of lucene in the future.
/** | ||
* Returns the smoothing score of the current document matching the query. | ||
* This score is used when the query/term does not appear in the document. | ||
* This can return 0 or a smoothing score. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks.
@@ -32,4 +33,9 @@ public int docID() { | |||
public float score() { | |||
return score; | |||
} | |||
|
|||
@Override | |||
public float smoothingScore(int docId) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, why not add a default method in the Scorable
interface to return 0
? Then we don't have to add this default method in all these subclasses?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this makes more sense :-) I have added the default implementation of smoothingScore in Scorable and reverted my changes to add the smoothingScore method to all the unnecessary extending classes.
subs); | ||
} else { | ||
Scorer scorer = scorer(context); | ||
int advanced = scorer.iterator().advance(doc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NULLPTR_DEREFERENCE: accessing memory that is the null pointer on line 117 indirectly during the call to IndriAndWeight.scorer(...)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, can scorer
ever be null? Other Lucene queries can/do return null
scorers to indicate that there are no matches for this query in this segment, and callers need to check for that. But maybe in this context it never happens?
Thank you @cammiemw! Looks like there are some small style problems with your PR -- see the above |
Hi @mikemccand! I think the issues are fixed in my latest commit. Let me know if there is anything else that you think I need. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @cammiemw, this looks great now!
Could you add a CHANGES.txt
entry too, briefly summarizing the new smoothing / Indri search-time capability?
Thanks @mikemccand! I have added an entry in CHANGES.txt under API changes. I am hoping to be able to add more of the Indri search engine functionality that build upon these smoothingScore changes in the near future! Once this PR is merged, I would love to start working on adding a few more Indri operators such as the Indri OR, Indri Weighted Sum, and Indri Weighted And, and add either an Indri Query Parser or add to an existing lucene query parser so that it is easier to use these operators. Do you think it makes sense to open a new jira ticket for these changes once this PR is merged? About how long does the process or merging a PR take? Thank you! |
Thank you! I ran top-level
|
Hello @cammiemw, I ran top-level
|
Hi @mikemccand, apologies for the failed test! Unfortunately, I cannot replicate the error in my environments and gradle test as well as that specific test come back successful. I am going to keep digging into what is different in my local. For right now, I looked into the code and could see what was causing the error, so I pushed the change that I think fixes it. If that fails for you, I will need to get my local environment working so that I get the same results that you do. Thanks so much for your time! |
Thanks @cammiemw, that test now passes again! I ran top-level tests multiple times and each time, a solr test failed for what looked like (not certain) unrelated to this great change. On the 4th try, all solr tests finally passed. And I will push soon! Thank you for persisting @cammiemw and sorry for the slow iterations here. I am excited that we are cross-fertilizing these two impactful open-source search engines! |
Thanks @cammiemw! One thing I noticed after I pushed was the new |
Hi @mikemccand! The smoothingScore method requires a docId because the most important implementation of smoothingScore is in TermScorer, which uses the postingsEnum to get the docId. However, in the case of the smoothing score, there will be no posting for the term in the document. Let me know if you think a different way of getting the docId would work better. Thanks so much for all your help and time throughout this process! I am very excited about this change and hope that I can continue to add more. |
// the FilterFeatureScorer may simply inherit Scorer's default implementation | ||
if (scorerClassMethod.getName().equals("smoothingScore")) continue; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This surprised me. Created https://issues.apache.org/jira/browse/SOLR-15958 and (draft) apache/solr#567 to consider further.
Description
This pull request implements logic from our academic search engine Indri: http://www.lemurproject.org/indri.php. The functionality that is implemented is a smoothing score for search terms or subqueries that are not present in the document being scored. The smoothing score acts like an idf so that documents that do not have terms or subqueries that are more frequent in the index are not penalized as much as documents that do not have less frequent terms or subqueries. Additionally, Indri's dirichelet smoothing similarity has been added.
Solution
The smoothingScore method has been added to the Scorable interface and implemented in the abstract class Scorer. The classes IndriAndQuery, IndriAndWeight, and IndriAndScorer have been added to call the smoothingScore method on documents where the search term or subquery are not present. The class IndriDirichletSimilarity has been added for implementing Indri's equation for the Language Model with Dirichlet smoothing.
Tests
TestIndriAndQuery and TestIndriDirichletSmoothing have been added. I am happy to expand upon these tests and implement more tests.
Checklist
Please review the following and check all that apply:
master
branch../gradlew check
.