# Boosting

## The process of giving higher relevance to a set of documents over others is called __boosting__

A good example is Google itself, where some search results are boosted to the top (or to a place that catches your attention) because they come from a sponsored source. In essence, Google has boosted the sponsored result to bring it to prominence. A well-designed search interface should adapt to user input and modify the search results accordingly, for instance when the user drills down or has to choose among otherwise equal results. This is where boosting can play a big part.
There are two stages where documents can be boosted: at index time and at query time.

![alt text](0.png)

## 1. Index time boost

Index time boosting means programmatically attaching a boost to a document, or to one or more of its fields, at the time of indexing. You are not setting the score itself: the score depends on many factors (for example, the tokens in the query), so what is being set is a number stored against a field that takes part in the score calculation for a given query. This is where the __norm__ comes into play. The norm is that single number stored against the field, which affects the document’s score and thus its position in the search result pecking order. Norm is short for normalization factor. Norm values are written into the index, so they do not have to be computed for every query, which can potentially (again, potentially) help reduce query time.

__When should I use index time boosting?__

_This depends on the business scenario at hand. Index time boosting is useful when you know beforehand which subset of documents needs to be boosted. As a real-world example, say you have a shopping site selling cars, with visitors from around the world. The search results for cars must favor the country of the currently logged-in user: for example, boost all products based in India for users whose current address is in India._

To increase the scores for certain documents that match a query, regardless of what that query may be, one can use index-time boosts.
~~~
document.setDocumentBoost(x)
~~~
The default boost for a field is 1, so setting a value between 0 and 1 would down boost the document.

Index-time boosts can be specified per-field also, so only queries matching on that specific field will get the extra boost.

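It is also possible to add different boosts to different fields of a document. The only requirement here is that the boosted fields have norms enabled (in Solr, `omitNorms` must not be set to `true` for them), because the boost is stored as part of the norm.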
~~~
document.addField("title", "Foo Bar", x);
~~~
It’s important to know that the boost (either for a document or for a field) will be considered when calculating the final score for a document given a search. It is not the final score of the document. Boosting documents is not the same as sorting documents.
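
To make the last point concrete, here is a toy sketch in Python. It is not Lucene's scoring formula: `match_score` and `index_boost` are hypothetical numbers, and the only idea carried over from the text is that the stored boost acts as one factor in the final score.

```python
# Toy model: final score = query-dependent match score * boost stored at index time.
# All names and numbers are made up for illustration.
docs = [
    {"id": "A", "match_score": 3.0, "index_boost": 1.0},  # strong match, default boost
    {"id": "B", "match_score": 1.0, "index_boost": 2.0},  # weak match, boosted
    {"id": "C", "match_score": 2.0, "index_boost": 0.5},  # decent match, down-boosted
]


def final_score(doc):
    """Combine the match score with the stored boost."""
    return doc["match_score"] * doc["index_boost"]


ranked = sorted(docs, key=final_score, reverse=True)
for doc in ranked:
    print(doc["id"], final_score(doc))
```

Document B carries the highest boost but still ranks below A, because the boost only scales the match score. Sorting by the boost alone would have put B first.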

## 2. Query time boost
Boosting at query time is much more dynamic as it doesn’t require re-indexing and can be specified with every new request. Also, what gets boosted is not a document or a field, but a __subquery__ on the search. The simplest way to achieve query time boosting is by using the __^__ character plus the boost number on the query, for example:
~~~
foo^5 bar
~~~
Much more complex expressions can also be used for query time boosting, like:
~~~
title:(foo bar)^5 OR content:(foo bar)^2 OR foo OR bar
title:(foo bar)^5 OR title:"foo bar"^20 OR ...
~~~
The syntax can be very simple for simple cases, but it will get more and more complex with more complex use cases.

However, this syntax requires either an expert user who knows how to use it, or some application logic that injects it in the background after the user enters the query and before the request is sent. Dismax provides alternatives for query time boosting that are just as dynamic as the previous one, but with a much simpler syntax.
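
The "application logic" mentioned above could look roughly like the following Python sketch. The function name `inject_boosts` and the field-to-boost mapping are assumptions made for illustration, and the sketch does not escape special query characters, which a real application would need to do.

```python
def inject_boosts(user_query, field_boosts):
    """Build a boosted query string from plain user input.

    field_boosts maps a field name to its boost factor, e.g. {"title": 5}.
    """
    terms = user_query.split()
    grouped = "(" + " ".join(terms) + ")"
    clauses = [f"{field}:{grouped}^{boost}" for field, boost in field_boosts.items()]
    # Keep the unboosted terms as a fallback on the default field.
    clauses.extend(terms)
    return " OR ".join(clauses)


boosted_query = inject_boosts("foo bar", {"title": 5, "content": 2})
print(boosted_query)
```

For the input `foo bar` this produces the first of the complex expressions shown earlier, without the user ever typing a caret.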

# Boosting Terms
Lucene/Solr provides the relevance level of matching documents based on the terms found. To boost a term use the caret symbol __^__ with a boost factor (a number) at the end of the term you are searching. __The higher the boost factor, the more relevant the term will be.__

Boosting allows you to control the relevance of a document by boosting its terms. For example, if you are searching for
`"jakarta apache"` and you want the term `"jakarta"` to be more relevant, you can boost it by adding the `^` symbol along with the boost factor immediately after the term.

For example, you could type: `jakarta^4 apache`

This will make documents with the term `jakarta` appear more relevant. You can also boost __Phrase Terms__ as in the example:
`"jakarta apache"^4 "Apache Lucene"`

By default, the boost factor is __1__. Although the boost factor must be positive, it can be less than 1 (for example, it could be 0.2).	
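
As a small illustration of these rules, the hypothetical helper below (the name `boost` is not part of Lucene or Solr) formats a term or phrase with its boost factor and rejects non-positive factors.

```python
def boost(term, factor=1):
    """Return a term (or quoted phrase) followed by its ^boost suffix."""
    if factor <= 0:
        raise ValueError("boost factor must be positive")
    # Multi-word input is treated as a phrase and wrapped in double quotes.
    text = f'"{term}"' if " " in term else term
    # A factor of 1 is the default, so the suffix can be omitted.
    return text if factor == 1 else f"{text}^{factor}"


print(boost("jakarta", 4), boost("apache"))
print(boost("jakarta apache", 4), boost("Apache Lucene"))
print(boost("jakarta", 0.2))
```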

# Boosting Fields
The Dismax Query Parser (QP) will create a query that will be executed on many different fields, even if the user hasn’t specified any. This is one of the most important improvements of the Dismax QP over the Lucene QP. But sometimes, not all the fields have the same importance. 

_Sometimes, a hit on the title field is more important than a hit on the content field, or a hit on the content can be more important than a hit on the comments field._

The Dismax Query Parser provides the ability to consider some fields more important than others with the “qf” (short for “query fields”) parameter, the same parameter that specifies the fields on which to execute the user query. A common value for this parameter could be:
`qf=title^5 content^2 comments^0.5`

This will translate a user query like “foo bar” into something similar to:
`title:(foo bar)^5 OR content:(foo bar)^2 OR comments:(foo bar)^0.5`
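
From application code, these parameters are simply part of the request's query string. The sketch below builds that string with Python's standard library; `defType=dismax` is the usual way to select the Dismax parser in Solr, while the host, core name, and request handler are deliberately left out because they depend on the installation.

```python
from urllib.parse import urlencode, parse_qs

params = {
    "defType": "dismax",                     # select the Dismax Query Parser
    "q": "foo bar",                          # raw user input, no boost syntax needed
    "qf": "title^5 content^2 comments^0.5",  # fields to search, with their boosts
}

# urlencode takes care of escaping spaces and the caret character.
query_string = urlencode(params)
print(query_string)

# Decoding the string again gives back the original qf value.
print(parse_qs(query_string)["qf"][0])
```

The user input stays free of boost syntax; the field weights live entirely in the `qf` parameter.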

# Boosting Phrases
As with query fields, the Dismax Query Parser will also execute the user query as a phrase query on the fields listed in the “pf” (“phrase fields”) parameter. As in the qf parameter, a different boost can be specified for each of the phrase fields:
`pf=title^20 content^10`

This will translate a user query like “foo bar” into:
`title:"foo bar"^20 OR content:"foo bar"^10`

This phrase query is used only for boosting the documents that result from the original query; it does not add new hits.
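
The effect can be pictured with a toy scoring function in Python. This is a simplification and not Dismax's actual formula: `toy_score`, `qf_boost`, and `pf_boost` are made-up names, and matching is reduced to whitespace tokens.

```python
def toy_score(text, query, qf_boost=1.0, pf_boost=10.0):
    """Toy scoring: qf_boost per matching query term, plus pf_boost when the
    whole query occurs as a phrase. Returns 0.0 when no term matches."""
    tokens = text.lower().split()
    terms = query.lower().split()
    score = qf_boost * sum(1 for t in terms if t in tokens)
    if score == 0:
        return 0.0  # the phrase boost never turns a non-match into a hit
    n = len(terms)
    has_phrase = any(tokens[i:i + n] == terms for i in range(len(tokens) - n + 1))
    return score + (pf_boost if has_phrase else 0.0)


print(toy_score("foo bar baz", "foo bar"))  # terms adjacent: phrase boost applies
print(toy_score("bar baz foo", "foo bar"))  # both terms present, but not as a phrase
print(toy_score("baz qux", "foo bar"))      # no match at all
```

Both of the first two documents are hits, but the one containing the exact phrase scores higher.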

# Boost Queries
Sometimes it is necessary to boost some documents regardless of the user query. A typical example of boost queries is __boosting sponsored documents__. The user searches for “car rental”, but the application has some sponsored documents that should be boosted. A good way of doing this is by using boost queries.

__A boost query is a query that is executed in the background with each user query, and that boosts the documents matching it.__

For this example, the boost query (specified by the “bq” parameter) would be something like:
`bq=sponsored:true`

The boost query does not determine which documents are considered a hit and which are not; it only influences the score of the results.
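
A toy Python sketch of that behaviour follows. The documents, scores, and the `apply_boost_query` helper are invented for illustration; the only properties taken from the text are that the boost query adjusts scores and leaves the set of hits unchanged.

```python
def apply_boost_query(hits, docs, matches_bq, bonus=5.0):
    """Add a bonus to every hit that matches the boost query.

    hits maps doc id -> score from the user query; its key set is not changed.
    """
    return {
        doc_id: score + (bonus if matches_bq(docs[doc_id]) else 0.0)
        for doc_id, score in hits.items()
    }


docs = {
    1: {"title": "cheap car rental", "sponsored": False},
    2: {"title": "car rental deals", "sponsored": True},
    3: {"title": "bike repair", "sponsored": True},  # sponsored, but not a hit
}
hits = {1: 2.4, 2: 1.9}  # made-up scores for the user query "car rental"

boosted = apply_boost_query(hits, docs, lambda d: d["sponsored"])
ranking = sorted(boosted, key=boosted.get, reverse=True)
print(ranking)
```

Document 3 is sponsored but does not match "car rental", so it stays out of the results; document 2 moves ahead of document 1.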

# Boost Functions
Boost Functions are very similar to boost queries; in fact, they can achieve the same goals. The difference between boost functions and boost queries is that the boost function is an arbitrary function instead of a query. 
__A typical example of boost functions is boosting those documents that are more recent than others.__ 
Imagine a forum search application, where the user is searching for forum entries with the text “foo bar”. The application should display all the forum entries that talk about “foo bar” but usually the most recent entries are more important (most users will want to see updated entries, and not historical). 

The boost function will be executed in the background with each user query, and will boost some documents in some way.
For this example, a boost function (specified by the “bf” parameter) could be something like:
`bf=recip(ms(NOW,publicationDate),3.16e-11,1,1)`

The same as with the boost queries, this function will not determine which documents are a hit and which are not, it will just add additional score to them.
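
To see what this particular function does, it helps to evaluate it by hand. Solr's `recip(x,m,a,b)` computes `a / (m*x + b)`; the Python sketch below re-implements that formula and feeds it a few document ages in milliseconds, standing in for `ms(NOW,publicationDate)`. The chosen ages are arbitrary examples.

```python
def recip(x, m, a, b):
    """Reciprocal function a / (m * x + b), as used in the bf example."""
    return a / (m * x + b)


MS_PER_DAY = 24 * 60 * 60 * 1000

for label, days in [("today", 0), ("1 month", 30), ("1 year", 365), ("10 years", 3650)]:
    age_ms = days * MS_PER_DAY  # stands in for ms(NOW, publicationDate)
    print(f"{label:>8}: {recip(age_ms, 3.16e-11, 1, 1):.3f}")
```

A year is roughly 3.16e10 milliseconds, so with `m=3.16e-11` the function returns 1.0 for a document published right now and about 0.5 for one that is a year old, decaying further as the document ages.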

A note on boost functions: boost functions can also be used with the Lucene QP by using the special `_val_` key inside the query.

# Example

~~~
import sys
import lucene

from java.nio.file import Paths
from java.util import LinkedHashMap
from org.apache.lucene.analysis.standard import StandardAnalyzer, ClassicAnalyzer
from org.apache.lucene.search import IndexSearcher, BoostQuery, BooleanQuery, TermQuery
from org.apache.lucene.search import BooleanClause
from org.apache.lucene.index import DirectoryReader, Term
from org.apache.lucene.queryparser.classic import QueryParser, MultiFieldQueryParser
from org.apache.lucene.store import SimpleFSDirectory
from org.apache.lucene.queryparser.simple import SimpleQueryParser
from org.apache.lucene.queryparser.classic import QueryParserBase
from org.apache.lucene.util import Version
import pandas as pd
~~~

~~~
lucene.initVM()  # start the JVM; required before using any Lucene class

# Open the existing index stored in the local 'index' directory
directory = SimpleFSDirectory(Paths.get('index'))
reader = DirectoryReader.open(directory)
searcher = IndexSearcher(reader)
analyzer = StandardAnalyzer()

# Fields that are searched when the query does not name a field
fields = ['title', 'genres', 'directors', 'top_3_cast', 'storyline', 'synopsis']
clauses = [BooleanClause.Occur.SHOULD] * len(fields)  # one SHOULD clause per field
parser = MultiFieldQueryParser(fields, analyzer)
parser.setDefaultOperator(QueryParserBase.OR_OPERATOR)
~~~

# Query Time Boosting Terms

The first query searches the `top_3_cast` field for the term `Tom`, without any boost:

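~~~
query = 'top_3_cast:"Tom"'
# Parse the query string. PyLucene code commonly calls parse() through the
# class, passing the parser instance as the first argument.
query = MultiFieldQueryParser.parse(parser, query)
hits = searcher.search(query, 100)  # keep the top 100 hits
print("Found %d document(s) that matched query '%s':" % (hits.totalHits, query))
~~~

~~~
Found 302 document(s) that matched query 'top_3_cast:tom':
~~~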


~~~
rows = []
for hit in hits.scoreDocs:
    doc = searcher.doc(hit.doc)  # load the stored fields of the matching document
    rows.append([doc.get('id'), doc.get('title'), doc.get('genres'),
                 doc.get('directors'), doc.get('top_3_cast'),
                 doc.get('storyline'), doc.get('synopsis'), hit.score])
    # Uncomment to see how the score of each hit was computed:
    # print("explain: ", searcher.explain(query, hit.doc).toString())

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
df = pd.DataFrame(rows)
~~~

~~~
explain:  5.541642 = weight(top_3_cast:tom in 5555) [BM25Similarity], 
result of:
  5.541642 = score(doc=5555,freq=2.0 = termFreq=2.0), 
  product of:
    3.9971538 = idf, 
    computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) 
    from:
      302.0 = docFreq
      16468.0 = docCount
    1.386397 = tfNorm, 
    computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) 
    from:
      2.0 = termFreq=2.0
      1.2 = parameter k1
      0.75 = parameter b
      6.1806536 = avgFieldLength
      6.0 = fieldLength
~~~
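
The numbers in this explain output can be checked by hand. The sketch below re-implements the two formulas quoted in the output in plain Python and plugs in the reported values. The function names `bm25_idf` and `bm25_tf_norm` are made up here and are not part of Lucene. Lucene computes in single precision, so agreement is only expected to a few decimal places.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """idf as quoted in the explain output."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    """tfNorm as quoted in the explain output."""
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * field_length / avg_field_length))


idf = bm25_idf(doc_freq=302, doc_count=16468)
tf_norm = bm25_tf_norm(freq=2.0, field_length=6.0, avg_field_length=6.1806536)
score = idf * tf_norm  # the explain output calls this "product of"

print(f"idf={idf:.4f} tfNorm={tf_norm:.4f} score={score:.4f}")
```

The term frequency of 2 explains why this document scores higher than documents where `tom` occurs only once in `top_3_cast`.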

~~~
df.columns = ['id', 'title', 'genres', 'director', 'top_3_cast', 'storyline', 'synopsis', 'score']
df.head(13)
~~~

| | id | title | genres | director | top_3_cast | storyline | synopsis | score |
|---|---|---|---|---|---|---|---|---|
| 0 | 120815 | Saving Private Ryan | ['Drama', 'War'] | ['Steven Spielberg'] | ['Tom Hanks', ' Tom Sizemore', ' Edward Burns'] | Opening with the Allied invasion of Norman... | | 5.541642 |
| 1 | 1645187 | Walk a Mile in My Pradas | ['Comedy'] | ['Joey Sylvester'] | ['Nathaniel Marston', ' Tom Archdeacon', ' Tom... | A magic Christmas ornament turns two men's... | | 5.541642 |
| 2 | 106218 | Ad Fundum | ['Drama'] | ['Erik Van Looy'] | ['Tom Van Landuyt', ' Mathias Sercu', ' Tom Va... | Sammy Raes, a nice, naive law-freshman fro... | | 5.075861 |
| 3 | 2461216 | William and the Windmill | ['Documentary', 'Biography', 'Drama'] | ['Ben Nabors'] | ['William Kamkwamba', ' Tom Rielly'] | William Kamkwamba, a young Malawian, build... | Saint James Films and TomCat Films, LLC in ass... | 4.671401 |
| 4 | 2124818 | Janapar | ['Documentary', 'Adventure', 'Romance'] | ['James W. Newton', 'Tom Allen'] | ['Tom Allen', ' Andrew Welch'] | 23-year-old Englishman Tom Allen is all se... | | 4.671401 |
| 5 | 478209 | Metal: A Headbanger's Journey | ['Documentary', 'Music'] | ['Sam Dunn', 'Scot McFadyen', '1 more credit',... | ['Tom Araya', ' Gavin Baddeley', ' Blasphemer'] | Sam Dunn is a 30-year old anthropologist w... | | 4.335995 |
| 6 | 56868 | Billy Liar | ['Comedy', 'Drama', 'Romance'] | ['John Schlesinger'] | ['Tom Courtenay', ' Wilfred Pickles', ' Mona W... | A young British clerk in a gloomy North Co... | | 4.045527 |
| 7 | 89945 | Rustlers' Rhapsody | ['Comedy', 'Western'] | ['Hugh Wilson'] | ['Tom Berenger', ' G.W. Bailey', ' Marilu Henn... | While the audience watches a black and whi... | | 4.045527 |
| 8 | 120755 | Mission: Impossible II | ['Action', 'Adventure', 'Thriller'] | ['John Woo'] | ['Tom Cruise', ' Dougray Scott', ' Thandie New... | Chimera is a deadly virus that will bear a... | | 4.045527 |
| 9 | 240515 | Freddy Got Fingered | ['Comedy'] | ['Tom Green'] | ['Tom Green', ' Rip Torn', ' Marisa Coughlan'] | Gordon, 28, an aspiring animator, leaves h... | | 4.045527 |


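Without a boost, the film with Tom Hanks and Tom Sizemore is at position 0 (score 5.541642), while a Tom Hardy film sits at position 12 (score 4.045527), beyond the rows listed above.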






The second query adds the phrase `"Tom Hardy"` with a boost of 2, which `explain` reports as `2.0 = boost`:

~~~
query = 'top_3_cast:"Tom" OR top_3_cast:"Tom Hardy"^2'
query = MultiFieldQueryParser.parse(parser, query)
hits = searcher.search(query, 100)
print("Found %d document(s) that matched query '%s':" % (hits.totalHits, query))
~~~

~~~
Found 302 document(s) that matched query 'top_3_cast:tom (top_3_cast:"tom hardy")^2.0':
~~~


~~~
rows = []
for hit in hits.scoreDocs:
    doc = searcher.doc(hit.doc)  # load the stored fields of the matching document
    rows.append([doc.get('id'), doc.get('title'), doc.get('genres'),
                 doc.get('directors'), doc.get('top_3_cast'),
                 doc.get('storyline'), doc.get('synopsis'), hit.score])
    # Uncomment to see how the score of each hit was computed:
    # print("explain: ", searcher.explain(query, hit.doc).toString())

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
df = pd.DataFrame(rows)
~~~

~~~
explain:  23.3875 = sum of:
  4.045527 = weight(top_3_cast:tom in 681) [BM25Similarity], 
  result of:
    4.045527 = score(doc=681,freq=1.0 = termFreq=1.0
), product of:
      3.9971538 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) 
      from:
        302.0 = docFreq
        16468.0 = docCount
      1.0121019 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
        1.0 = termFreq=1.0
        1.2 = parameter k1
        0.75 = parameter b
        6.1806536 = avgFieldLength
        6.0 = fieldLength
  19.341974 = weight(top_3_cast:"tom hardy" in 681) [BM25Similarity], 
  result of:
    19.341974 = score(doc=681,freq=1.0 = phraseFreq=1.0
), product of:
      2.0 = boost
      9.555349 = idf(), sum of:
        3.9971538 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
          302.0 = docFreq
          16468.0 = docCount
        5.558195 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
          63.0 = docFreq
          16468.0 = docCount
      1.0121019 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
        1.0 = phraseFreq=1.0
        1.2 = parameter k1
        0.75 = parameter b
        6.1806536 = avgFieldLength
        6.0 = fieldLength

~~~
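
The boosted score can be reproduced by hand from the values in this output. The Python sketch below uses the quoted formulas directly; the helper name `idf` is chosen here for illustration. It shows that the boost of 2.0 enters as a plain multiplier on the phrase clause, whose idf is the sum of the idf values for `tom` and `hardy`. Lucene works in single precision, so the match is approximate.

```python
import math


def idf(doc_freq, doc_count=16468):
    """idf as quoted in the explain output."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# tfNorm for freq=1 with k1=1.2, b=0.75, fieldLength=6.0, avgFieldLength=6.1806536
k1, b = 1.2, 0.75
tf_norm = (1.0 * (k1 + 1)) / (1.0 + k1 * (1 - b + b * 6.0 / 6.1806536))

term_score = idf(302) * tf_norm                      # top_3_cast:tom
phrase_score = 2.0 * (idf(302) + idf(63)) * tf_norm  # (top_3_cast:"tom hardy")^2.0
total = term_score + phrase_score                    # "sum of" in the explain output

print(f"term={term_score:.4f} phrase={phrase_score:.4f} total={total:.4f}")
```

Without the `^2` the phrase clause would contribute half as much, so the boost alone accounts for roughly 9.67 of the total.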

~~~
df.columns = ['id', 'title', 'genres', 'director', 'top_3_cast', 'storyline', 'synopsis', 'score']
df.head(10)
~~~

| | id | title | genres | director | top_3_cast | storyline | synopsis | score |
|---|---|---|---|---|---|---|---|---|
| 0 | 1172570 | Bronson | ['Action', 'Biography', 'Crime'] | ['Nicolas Winding Refn'] | ['Tom Hardy', ' Kelly Adams', ' Luing Andrews'] | In 1974, a hot-headed 19 year old named Mi... | | 23.387501 |
| 1 | 1291584 | Warrior | ['Drama', 'Sport'] | ["Gavin O'Connor"] | ['Joel Edgerton', ' Tom Hardy', ' Nick Nolte'] | Two brothers face the fight of a lifetime ... | | 23.387501 |
| 2 | 1345836 | The Dark Knight Rises | ['Action', 'Thriller'] | ['Christopher Nolan'] | ['Christian Bale', ' Gary Oldman', ' Tom Hardy'] | Despite his tarnished reputation after the... | | 23.387501 |
| 3 | 375912 | Layer Cake | ['Action', 'Crime', 'Drama'] | ['Matthew Vaughn'] | ['Daniel Craig', ' Tom Hardy', ' Jamie Foreman'] | A successful cocaine dealer, who has earne... | | 23.387501 |
| 4 | 2692904 | Locke | ['Drama'] | ['Steven Knight'] | ['Tom Hardy', ' Olivia Colman', ' Ruth Wilson'] | Leaving the construction site on the eve o... | | 23.387501 |
| 5 | 1891806 | From the Ashes | ['Documentary', 'Sport'] | ['James Erskine'] | ['Tom Hardy', ' Ian Botham', ' Mike Brearley'] | From The Ashes is an uplifting and remarka... | | 23.387501 |
| 6 | 1212450 | Lawless | ['Crime', 'Drama'] | ['John Hillcoat'] | ['Shia LaBeouf', ' Tom Hardy', ' Jason Clarke'] | In 1931, in Franklin County, Virginia, For... | | 23.387501 |
| 7 | 1596350 | This Means War | ['Action', 'Comedy', 'Romance'] | ['McG'] | ['Reese Witherspoon', ' Chris Pine', ' Tom Har... | Two CIA agents, Tuck and Frank who are als... | | 23.387501 |
| 8 | 120815 | Saving Private Ryan | ['Drama', 'War'] | ['Steven Spielberg'] | ['Tom Hanks', ' Tom Sizemore', ' Edward Burns'] | Opening with the Allied invasion of Norman... | | 5.541642 |
| 9 | 1645187 | Walk a Mile in My Pradas | ['Comedy'] | ['Joey Sylvester'] | ['Nathaniel Marston', ' Tom Archdeacon', ' Tom... | A magic Christmas ornament turns two men's... | | 5.541642 |


With the boost, the Tom Hardy films occupy positions 0 to 7 (score about 23), and the Tom Hanks film drops to position 8 (score about 5).