# Implementing a custom retrieval method based on term statistics from Elasticsearch

  1. Implement the TF-IDF weighting formula by completing the `score()` method.
      
  2. Implement Language Modeling (query (log)likelihood) scoring as a new `score_lm()` method.
  
  3. Implement BM25 scoring as a new `score_bm25()` method.


We score documents by taking the dot product of query and document term weights: $score(q,d)=\sum_{t \in q}w_{t,q} w_{t,d}$.
Different retrieval models can be instantiated by setting these weights as follows:

| Retrieval Model | $w_{t,q}$ | $w_{t,d}$ |
| -- | -- | -- |
| TF-IDF | $f_{t,q}$ | $\frac{f_{t,d}}{|d|} IDF_t$ |
| LM | $f_{t,q}$ | $\log \Big( (1-\lambda) \frac{f_{t,d}}{|d|} + \lambda \frac{f_{t,C}}{cl} \Big)$ |
| BM25 | $f_{t,q}$ | $\frac{f_{t,d} (1+k_1)}{f_{t,d} + k_1(1-b+b\frac{|d|}{avgdl})} \times IDF_t$ |

 - $f_{t,q}$ is the frequency of term t in query q
 - $f_{t,d}$ is the frequency of term t in document d
 - $f_{t,C}$ is the frequency of term t in the entire collection
 - $IDF_{t}=\log \frac{N}{df_t}$
 - $N$ is the total number of documents in the collection
 - $df_t$ is the number of documents that contain term t
 - $|d|$ is the length of document d
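 - $cl$ is the collection length (the sum of all document lengths, $\sum_{d'}|d'|$)
 - $avgdl$ is the average document length in the collection ($cl / N$)
 - $\lambda$ is a smoothing parameter
 - $k_1$ and $b$ are the BM25 parameters: $k_1$ controls term frequency saturation and $b$ controls document length normalization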
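
To make the table concrete, the three document term weights can be sketched as plain Python functions of the statistics defined above. This is a minimal sketch, not the reference solution: the function names are made up for illustration, and the default values `lam=0.1`, `k1=1.2`, and `b=0.75` are commonly used settings assumed here, not values given in this exercise.

```python
import math


def idf(n_docs, df_t):
    """IDF_t = log(N / df_t), using the natural logarithm."""
    return math.log(n_docs / df_t)


def w_td_tfidf(f_td, dl, n_docs, df_t):
    """TF-IDF: length-normalized term frequency times IDF."""
    return (f_td / dl) * idf(n_docs, df_t)


def w_td_lm(f_td, dl, f_tC, cl, lam=0.1):
    """LM: log of the document model interpolated with the collection model.

    lam=0.1 is an assumed default for the smoothing parameter.
    """
    return math.log((1 - lam) * f_td / dl + lam * f_tC / cl)


def w_td_bm25(f_td, dl, avgdl, n_docs, df_t, k1=1.2, b=0.75):
    """BM25: saturated, length-normalized term frequency times IDF.

    k1=1.2 and b=0.75 are assumed defaults.
    """
    length_norm = 1 - b + b * dl / avgdl
    return f_td * (1 + k1) / (f_td + k1 * length_norm) * idf(n_docs, df_t)


# Illustrative numbers only (not taken from the AQUAINT index)
print(w_td_tfidf(f_td=2, dl=100, n_docs=1000, df_t=10))
print(w_td_lm(f_td=2, dl=100, f_tC=50, cl=100000))
print(w_td_bm25(f_td=2, dl=100, avgdl=100, n_docs=1000, df_t=10))
```

In all three models the query term weight is simply $f_{t,q}$, so only the document side differs.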
  

In [37]:
from elasticsearch import Elasticsearch
import pprint

In [33]:
INDEX_NAME = "aquaint"
DOC_TYPE = "doc"
FIELD = "content"

## Scoring method

This is our "custom" scoring method. 
  - `es` is an Elasticsearch object; this is needed for getting term statistics from Elasticsearch.
  - `qterms` holds a sequence of query terms. These terms must be analyzed the same way the documents were analyzed during indexing; otherwise they will not match the indexed terms.
  - `doc_id` is the document's ID.
  
The scoring method computes the dot product between the query term weights and document term weights: $score(q,d)=\sum_{t \in q}w_{t,q} w_{t,d}$.

In [61]:
def score(es, qterms, doc_id):
    # Total number of documents in the index
    n = es.count(index=INDEX_NAME, doc_type=DOC_TYPE)["count"]

    # Getting term frequency statistics for the given document field from Elasticsearch
    tv = es.termvectors(index=INDEX_NAME, doc_type=DOC_TYPE, id=doc_id, fields=[FIELD],
                              term_statistics=True).get("term_vectors", {}).get(FIELD, {})
    
    # uncomment to see what information ES returns
    #pprint.pprint(tv)
        
    dl = sum(stats["term_freq"] for stats in tv["terms"].values())  # length of the document
    cl = tv["field_statistics"]["sum_ttf"]  # collection length (total number of terms in a given field in all documents)
    avg_dl = cl / n
    
    print(avg_dl)
    
    s = 0  # holds the retrieval score
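    for t in qterms:
        # The term vector only lists terms that occur in this document;
        # for TF-IDF a missing term has weight 0, so it can be skipped
        if t not in tv["terms"]:
            continue
        df_t = tv["terms"][t]["doc_freq"]  # number of docs in the collection that contain that term
        f_td = tv["terms"][t]["term_freq"]  # raw frequency of t in d (number of times term t appears in doc d)
        f_tC = tv["terms"][t]["ttf"]  # frequency of t in the entire collection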
        
        # TODO: setting query and document term weights
        w_tq = 1
        w_td = 1
        s += w_tq * w_td
    return s
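
One possible way to fill in the TF-IDF weights is sketched below. It is a hedged illustration, not the intended solution: `score_tfidf` and `toy_tv` are hypothetical names, and `toy_tv` is a hand-made dictionary that only mimics the shape of the term vector used above (`terms`, `term_freq`, `doc_freq`, `ttf`), so the sketch runs without an Elasticsearch instance.

```python
import math
from collections import Counter


def score_tfidf(tv, qterms, n_docs):
    """TF-IDF score of one document, given its term vector dictionary."""
    terms = tv["terms"]
    dl = sum(stats["term_freq"] for stats in terms.values())  # document length
    score = 0.0
    # Iterate over distinct query terms; f_tq is the term's frequency in the query
    for t, f_tq in Counter(qterms).items():
        if t not in terms:
            continue  # term does not occur in the document: weight 0
        stats = terms[t]
        w_tq = f_tq
        w_td = stats["term_freq"] / dl * math.log(n_docs / stats["doc_freq"])
        score += w_tq * w_td
    return score


# Hypothetical term vector with made-up statistics
toy_tv = {
    "field_statistics": {"sum_ttf": 1000},
    "terms": {
        "tropical": {"term_freq": 2, "doc_freq": 10, "ttf": 25},
        "storms": {"term_freq": 1, "doc_freq": 20, "ttf": 40},
        "season": {"term_freq": 1, "doc_freq": 50, "ttf": 90},
    },
}

print(score_tfidf(toy_tv, ["tropical", "storms", "hurricane"], n_docs=100))
```

Counting query terms with `Counter` keeps the sum over distinct terms, so a repeated query term contributes $f_{t,q}$ times rather than $f_{t,q}^2$ times. Note that skipping missing terms is only safe for TF-IDF and BM25; under LM a term absent from the document still contributes $\log(\lambda \frac{f_{t,C}}{cl})$, which requires collection statistics that the document's own term vector does not contain.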

## Query analyzer

See [indices.analyze](https://elasticsearch-py.readthedocs.io/en/master/api.html#elasticsearch.client.IndicesClient.analyze).

In [24]:
def analyze_query(es, query):
    tokens = es.indices.analyze(index=INDEX_NAME, body={"text": query})["tokens"]
    query_terms = []
    for t in sorted(tokens, key=lambda x: x["position"]):
        query_terms.append(t["token"])
    return query_terms
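
The reason for analyzing the query can be illustrated without Elasticsearch. In the sketch below, `toy_analyze` is a hypothetical stand-in that only lowercases and splits on non-alphanumeric characters; the analyzer actually configured for the index may do more or less than this. `indexed_terms` is a made-up set standing in for the terms stored in the index.

```python
import re


def toy_analyze(text):
    """Stand-in analyzer: lowercase, then split on non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


indexed_terms = {"tropical", "storms", "season"}  # hypothetical indexed terms

raw_terms = "Tropical Storms!".split()            # naive whitespace split
analyzed_terms = toy_analyze("Tropical Storms!")  # same processing as "indexing"

raw_matches = [t for t in raw_terms if t in indexed_terms]
analyzed_matches = [t for t in analyzed_terms if t in indexed_terms]

print(raw_matches)
print(analyzed_matches)
```

The raw tokens keep their capitalization and punctuation and therefore match nothing, while the analyzed tokens match the indexed terms.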

## Main

In [5]:
es = Elasticsearch()

In [6]:
query = "tropical storms"

### Retrieve the top-1 document using Elasticsearch

We search a single field (set in `FIELD`).

In [8]:
res = es.search(index=INDEX_NAME, q=query, df=FIELD, _source=False, size=1).get("hits", {})

Get the ID of the first hit.

In [47]:
doc_id = res["hits"][0]["_id"]
print(doc_id)

APW19990810.0014


### Re-score this document using our own retrieval method.

First, we need to transform the query string into a sequence of query terms, using the same analysis procedure that was used for building the index.

In [26]:
qterms = analyze_query(es, query)
print(qterms)

['tropical', 'storms']


Then, we compute the "custom" retrieval score for this document.

In [62]:
new_score = score(es, qterms, doc_id)

423.6087138266466


In [None]:
print(new_score)