# BM25 from the ground up

BM25 definition. Let ${\bf X}$ be a corpus of documents. Let ${\bf x} \in {\bf X}$ be a document. Let ${\bf q}$ be a query.

$$
bm25({\bf q}, {\bf x}; {\bf X}) = 
\sum_{i=1}^{\vert {\bf q} \vert} IDF(q_i; {\bf X}) \frac{f(q_i; {\bf x}) \cdot (k_1 +1)}{f(q_i; {\bf x}) + k_1\cdot \left( 1  - b + b\frac{ \vert{\bf x} \vert}{ {\bf \hat{X}} } \right) }
$$

In this formula:

- $q_i$ is the $i$-th query term.
   
- $IDF(q_i; {\bf X})$ is the inverse document frequency of the query term $q_i$.
   - The IDF component measures how many documents in the corpus contain a term and “penalizes” terms that are common. The formula Lucene/BM25 uses for this part is:
    $$
    \log\left(1 + \frac{  \vert{\bf X} \vert- |X_{q_i}| + 0.5}{|X_{q_i}|  + 0.5} \right)
    $$
    
      - $\vert{\bf X}\vert$ (Lucene's `docCount`) is the total number of documents that have a value for the field in the shard (or across shards, if you’re using `search_type=dfs_query_then_fetch`).
      - $|X_{q_i}|$ is the number of documents that contain the $i$-th query term.
      
      
- $\vert{\bf x}\vert / {\bf \hat{X}}$ is how long a document is relative to the average document length: $\vert{\bf x}\vert$ is the number of terms in ${\bf x}$ (the field length) and ${\bf \hat{X}}$ is the average document length in the corpus.

- $b$ controls how strongly the length ratio affects the score. The larger $b$ is, the more the document's length relative to the average is amplified. If $b$ is set to 0, the length ratio is nullified and the length of the document has no bearing on the score. By default, $b$ has a value of 0.75 in Elasticsearch.

- $k_1$ controls term-frequency saturation: it limits how much repeated occurrences of the same term can keep adding to the score.

- $f(q_i; {\bf x})$ is the number of times the $i$-th query term occurs in document ${\bf x}$.
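
To make the pieces of the formula concrete, here is a minimal sketch that evaluates the IDF term and the contribution of a single query term by hand. The function names `idf` and `term_score` and the toy numbers (9 documents, an average length of 3 terms) are assumptions chosen for illustration; they are not part of Lucene or Elasticsearch.

```python
import math


def idf(corpus_size, doc_freq):
    """IDF as written above: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (corpus_size - doc_freq + 0.5) / (doc_freq + 0.5))


def term_score(freq, doc_len, avg_doc_len, idf_value, k1=1.5, b=0.75):
    """Contribution of one query term to the BM25 sum."""
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf_value * freq * (k1 + 1) / (freq + k1 * length_norm)


# illustrative numbers: one term found in 1 of 9 documents, another in 3 of 9
rare = idf(corpus_size=9, doc_freq=1)
common = idf(corpus_size=9, doc_freq=3)
print(rare > common)  # the rarer term gets the larger IDF

# a document of average length: the length factor is exactly 1
average = term_score(freq=1, doc_len=3, avg_doc_len=3, idf_value=common)
# a longer document with the same term count scores lower when b > 0
longer = term_score(freq=1, doc_len=6, avg_doc_len=3, idf_value=common)
# with b = 0 the document length is ignored, so both scores coincide
no_norm_short = term_score(1, 3, 3, common, b=0.0)
no_norm_long = term_score(1, 6, 3, common, b=0.0)
print(average > longer, no_norm_short == no_norm_long)
```

For a document of exactly average length and a term frequency of 1, the fraction reduces to $(k_1 + 1) / (1 + k_1) = 1$, so the term contributes just its IDF.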


In [45]:
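# scratch check of the idiom used later to build a vocabulary:
# a new word gets the next free integer id, a known word keeps its id
vocab = {}
vocab['a'] = 0
vocab['b'] = 1
word = 'f'
vocab[word] = vocab.get(word, len(vocab))
vocab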

{'a': 0, 'b': 1, 'f': 2}

In [2]:
# we'll generate some fake texts to experiment with
corpus = [
    'Human machine interface for lab abc computer applications',
    'A survcey of user opinion of computer system response time',
    'The EPS user interface management system',
    'System and human system engineering testing of EPS',
    'Relation of user perceived response time to error measurement',
    'The generation of random binary unordered trees',
    'The intersection graph of paths in trees',
    'Graph minors IV Widths of trees and well quasi ordering',
    'Graph minors A survey'
]

# remove stop words and tokenize them (we probably want to do some more
# preprocessing with our text in a real world setting, but we'll keep
# it simple here)
stopwords = set(['for', 'a', 'of', 'the', 'and', 'to', 'in'])
texts = [
    [word for word in document.lower().split() if word not in stopwords]
    for document in corpus
]

# build a word count dictionary so we can remove words that appear only once
word_count_dict = {}
for text in texts:
    for token in text:
        word_count = word_count_dict.get(token, 0) + 1
        word_count_dict[token] = word_count

texts = [[token for token in text if word_count_dict[token] > 1] for text in texts]
texts

[['human', 'interface', 'computer'],
 ['user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors']]
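
The counting and filtering step above can also be written with `collections.Counter` from the standard library. This is only an alternative sketch of the same idea; the tiny `docs` list and the names `counts` and `filtered` are made up for illustration.

```python
from collections import Counter

# hypothetical tokenized documents, standing in for `texts` above
docs = [
    ['graph', 'minors', 'trees'],
    ['graph', 'survey'],
    ['trees'],
]

# count every token across all documents in one pass
counts = Counter(token for doc in docs for token in doc)

# keep only tokens that occur more than once in the whole collection
filtered = [[token for token in doc if counts[token] > 1] for doc in docs]
print(filtered)
```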

In [86]:
import math

class BM25:
    """
    Best Match 25.

    Parameters
    ----------
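    k1 : float, default 1.5
        Controls term frequency saturation: larger values let repeated
        occurrences of a term keep adding to the score for longer.

    b : float, default 0.75
        Controls document length normalization. 0 disables it, 1 applies
        it fully.

    Attributes
    ----------
    tf_ : list[dict[int, int]]
        Term frequency per document, keyed by vocabulary id. So [{0: 1}]
        means the first document contains the term with id 0 once.

    df_ : dict[int, int]
        Document frequency per vocabulary id, i.e. the number of documents
        in the corpus that contain the term.

    idf_ : dict[int, float]
        Inverse document frequency per vocabulary id.

    doc_len_ : list[int]
        Number of terms per document. So [3] means the first
        document contains 3 terms.

    corpus_ : list[list[str]]
        The input corpus.

    corpus_size_ : int
        Number of documents in the corpus.

    avg_doc_len_ : float
        Average number of terms for documents in the corpus.

    vocabulary : dict[str, int]
        Maps each term to its integer id, assigned in order of first
        appearance during fit.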
    """

    def __init__(self, k1=1.5, b=0.75):
        self.b = b
        self.k1 = k1

    def fit(self, corpus):
        """
        Fit the various statistics that are required to calculate BM25 ranking
        score using the corpus given.

        Parameters
        ----------
        corpus : list[list[str]]
            Each element in the list represents a document, and each document
            is a list of the terms.

        Returns
        -------
        self
        """
        tf = []
        df = {}
        idf = {}
        doc_len = []
        corpus_size = 0
        vocabulary = {}
        
        for document in corpus:
            corpus_size += 1
            doc_len.append(len(document))
            
            # compute tf (term frequency) per document
            frequencies = {}
            for term in document:
                vocabulary[term] = vocabulary.get(term, len(vocabulary))
                vocab_id = vocabulary[term]
                term_count = frequencies.get(vocab_id, 0) + 1
                frequencies[vocab_id] = term_count
            tf.append(frequencies)

            # compute df (document frequency) per term; the keys of
            # `frequencies` are vocabulary ids, so df is keyed by id too
            for vocab_id in frequencies:
                df[vocab_id] = df.get(vocab_id, 0) + 1

        for term, freq in df.items():
            idf[term] = math.log(1 + (corpus_size - freq + 0.5) / (freq + 0.5))

        self.tf_ = tf
        self.df_ = df
        self.idf_ = idf
        self.doc_len_ = doc_len
        self.corpus_ = corpus
        self.corpus_size_ = corpus_size
        self.avg_doc_len_ = sum(doc_len) / corpus_size
        self.vocabulary = vocabulary
        return self

    def search(self, query):
        scores = [self._score(query, index) for index in range(self.corpus_size_)]
        return scores

    def _score(self, query, index):
        """Compute the BM25 score of one document for a tokenized query."""
        score = 0.0

        doc_len = self.doc_len_[index]
        frequencies = self.tf_[index]
        for term in query:
            # tf_ and idf_ are keyed by vocabulary id, so map the term first;
            # terms never seen during fit have no id and contribute nothing
            vocab_id = self.vocabulary.get(term)
            if vocab_id is None or vocab_id not in frequencies:
                continue

            freq = frequencies[vocab_id]
            numerator = self.idf_[vocab_id] * freq * (self.k1 + 1)
            denominator = freq + self.k1 * (1 - self.b + self.b * doc_len / self.avg_doc_len_)
            score += (numerator / denominator)

        return score


In [88]:
query = 'The intersection of graph survey and trees'
query = [word for word in query.lower().split() if word not in stopwords]

bm25 = BM25()
bm25.fit(texts)
scores = bm25.search(query)

for score, doc in zip(scores, corpus):
    score = round(score, 3)
    print(str(score) + '\t' + doc)

0.0	Human machine interface for lab abc computer applications
0.0	A survcey of user opinion of computer system response time
0.0	The EPS user interface management system
0.0	System and human system engineering testing of EPS
0.0	Relation of user perceived response time to error measurement
1.5	The generation of random binary unordered trees
2.47	The intersection graph of paths in trees
2.1	Graph minors IV Widths of trees and well quasi ordering
1.235	Graph minors A survey
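
Since `search` returns one score per document in corpus order, a natural follow-up is to rank the documents by score. A minimal sketch, assuming a helper named `top_k` (hypothetical, not part of the class above) and made-up scores and titles:

```python
def top_k(scores, documents, k=3):
    """Return the k highest-scoring (score, document) pairs, best first."""
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    # drop documents that matched no query term at all
    return [(score, doc) for score, doc in ranked if score > 0][:k]


# illustrative values only, not the output of a real search
example_scores = [0.0, 1.2, 0.4, 2.5]
example_docs = ['doc a', 'doc b', 'doc c', 'doc d']

for score, doc in top_k(example_scores, example_docs, k=2):
    print(str(score) + '\t' + doc)
```

Filtering out zero scores is a design choice of this sketch: a score of 0 means the document shares no term with the query, so listing it in a ranking is rarely useful.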


In [65]:
bm25.tf_

[{0: 1, 1: 1, 2: 1},
 {3: 1, 2: 1, 4: 1, 5: 1, 6: 1},
 {7: 1, 3: 1, 1: 1, 4: 1},
 {4: 2, 0: 1, 7: 1},
 {3: 1, 5: 1, 6: 1},
 {8: 1},
 {9: 1, 8: 1},
 {9: 1, 10: 1, 8: 1},
 {9: 1, 10: 1}]

In [66]:
bm25.__dict__

{'b': 0.75,
 'k1': 1.5,
 'tf_': [{0: 1, 1: 1, 2: 1},
  {3: 1, 2: 1, 4: 1, 5: 1, 6: 1},
  {7: 1, 3: 1, 1: 1, 4: 1},
  {4: 2, 0: 1, 7: 1},
  {3: 1, 5: 1, 6: 1},
  {8: 1},
  {9: 1, 8: 1},
  {9: 1, 10: 1, 8: 1},
  {9: 1, 10: 1}],
 'df_': {0: 2, 1: 2, 2: 2, 3: 3, 4: 3, 5: 2, 6: 2, 7: 2, 8: 3, 9: 3, 10: 2},
 'idf_': {0: 1.3862943611198906,
  1: 1.3862943611198906,
  2: 1.3862943611198906,
  3: 1.0498221244986776,
  4: 1.0498221244986776,
  5: 1.3862943611198906,
  6: 1.3862943611198906,
  7: 1.3862943611198906,
  8: 1.0498221244986776,
  9: 1.0498221244986776,
  10: 1.3862943611198906},
 'doc_len_': [3, 5, 4, 4, 3, 1, 2, 3, 2],
 'corpus_': [['human', 'interface', 'computer'],
  ['user', 'computer', 'system', 'response', 'time'],
  ['eps', 'user', 'interface', 'system'],
  ['system', 'human', 'system', 'eps'],
  ['user', 'response', 'time'],
  ['trees'],
  ['graph', 'trees'],
  ['graph', 'minors', 'trees'],
  ['graph', 'minors']],
 'corpus_size_': 9,
 'avg_doc_len_': 3.0,
 'vocabulary': {'hum

In [67]:
corpus[3]

'System and human system engineering testing of EPS'

In [68]:
query = 'The intersection of graph survey and trees'
query = [word for word in query.lower().split() if word not in stopwords]

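# NOTE: `_score` expects a flat list of terms. Wrapping `query` in another
# list makes the only "term" a list, which cannot be looked up in a dict,
# hence the TypeError below. Pass `query` directly: bm25._score(query, 8)
scores = bm25._score([query], 8)
scores

TypeError: unhashable type: 'list'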

In [69]:
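# the attribute set in fit is named `vocabulary` (no trailing underscore)
bm25.vocabulary

{'human': 0,
 'interface': 1,
 'computer': 2,
 'user': 3,
 'system': 4,
 'response': 5,
 'time': 6,
 'eps': 7,
 'trees': 8,
 'graph': 9,
 'minors': 10}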