# Reciprocal rank fusion

As you have seen in the previous notebooks, we now have several result lists:
* Semantic results from different embedding models
* Lexical results from `tantivy`

Unfortunately, the scores of these lists are *incommensurable*: a cosine similarity
and a lexical relevance score live on different scales, so we cannot sort the combined
results by score. However, the *rank* of a document within each list tells us how well
it matches relative to the other documents in that list, and ranks are comparable across
lists. Here, we implement an algorithm which *fuses* these result lists using only the
rank in the individual lists, not the score.
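To make the point about scales concrete, here is a minimal sketch that turns two score lists into ranks. The scores, the document ids and the helper `to_ranks` are made up for illustration and do not come from the notebook data.

```python
# Hypothetical scores on two very different scales (illustration only)
cosine_scores = {"a": 0.71, "b": 0.64, "c": 0.58}   # e.g. cosine similarity
lexical_scores = {"b": 14.2, "c": 9.7, "d": 3.1}    # e.g. a lexical relevance score

def to_ranks(scores):
    """Map each document id to its 1-based rank (highest score first)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {doc_id: rank for rank, doc_id in enumerate(ordered, start=1)}

cosine_ranks = to_ranks(cosine_scores)
lexical_ranks = to_ranks(lexical_scores)
print(cosine_ranks)
print(lexical_ranks)
```

The raw numbers cannot be compared directly, but "rank 1" means the same thing in both lists.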

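The fused score of a document $d$ is ${\rm score}(d) = \sum_{i} \frac{1}{k + {\rm rank}_i(d)}$
with $k=60$, where the sum runs over all result lists $i$ that contain $d$ and
${\rm rank}_i(d)$ is the 1-based position of $d$ in list $i$. A list that does not contain
the document contributes nothing. The heart of the notebook is the function `rrf` which
iterates over a unique list of all document ids and calculates the scores with the formula above.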
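As a small worked example of the formula, the following sketch fuses two toy ranked lists in plain Python. The function `rrf_scores` and the toy ids are hypothetical; the notebook's own `rrf` function works on dataframes instead.

```python
from collections import defaultdict

def rrf_scores(ranked_lists, k=60):
    """Sum 1 / (k + rank) over every list in which a document appears."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        # ranks are 1-based: the best hit of each list has rank 1
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)

# Hypothetical toy result lists (document ids, best match first)
toy_semantic = [3, 1, 7]
toy_lexical = [1, 9, 3]

toy_scores = rrf_scores([toy_semantic, toy_lexical])
toy_order = sorted(toy_scores, key=toy_scores.get, reverse=True)
print(toy_order)
```

Document 1 (ranks 2 and 1) scores $1/62 + 1/61$ and ends up ahead of document 3 (ranks 1 and 3), which scores $1/61 + 1/63$. Documents that appear in only one list come last.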

We apply the algorithm to several combinations of result lists (semantic results from
different models, and lexical results).

## Load data (from previous notebook)

In [None]:
import json
with open("sentences.json") as f:
    sentences = json.load(f)

In [None]:
len(sentences)

## Retrieval

In [None]:
import numpy as np
with open("sentences-mqa.npy", "rb") as f:
    sembeddings = np.load(f)

In [None]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

In [None]:
import pandas as pd
def search_semantic(query, text, corpus_embeddings, model, query_prompt_name=None, top=20):
    # encode the query into an embedding vector
    question_embedding = model.encode(query, normalize_embeddings=True, prompt_name=query_prompt_name)
    
    # Determine similarity (vectors are normalized)
    sim = model.similarity(question_embedding, corpus_embeddings)[0].numpy() 
    # Alternative: sim = np.dot(corpus_embeddings, question_embedding)
    
    # Get the `top` most similar documents by sorting
    hits = [ { "id": i, "text": text[i], "score": sim[i] } 
                     for i in sim.argsort()[::-1][0:top] ]
    
    # Return as dataframe
    return pd.DataFrame(hits)

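Open the existing `tantivy` index from the `tantivy-index` directory (it is assumed to have been built in a previous notebook).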

In [None]:
import tantivy
schema_builder = tantivy.SchemaBuilder()
schema_builder.add_integer_field("id", stored=True)
schema_builder.add_text_field("text", stored=True)
schema = schema_builder.build()
index = tantivy.Index(schema, "tantivy-index")

In [None]:
def search_lexical(query, index, top=20):
    searcher = index.searcher()
    # parse the query string against the "text" field
    parsed_query = index.parse_query(query, ["text"])
    # respect the requested number of hits instead of a hard-coded limit
    search_results = searcher.search(parsed_query, limit=top).hits
    res = []
    for (score, doc_id) in search_results:
        doc = searcher.doc(doc_id)
        res.append({ "id": doc["id"][0], "text": doc["text"][0], "score": score })

    return pd.DataFrame(res)

In [None]:
question = "Is the climate crisis worse in poorer countries?"

In [None]:
pd.set_option('display.max_colwidth', 0)

In [None]:
# semantic search dataframe
sdf = search_semantic(question, sentences, sembeddings, model).set_index("id")
sdf

In [None]:
# lexical search dataframe
ldf = search_lexical(question, index).set_index("id")
ldf

### Reciprocal rank fusion

In [None]:
import numpy as np
def rrf(dataframes):
    docs = []  # list of matching docs
    ranks = [] # ranks for each document in the separate lists
    ids = []   # list of unique ids
    for df in dataframes:
        ids += list(df.index)
    # only use each id once
    for i in np.unique(ids):
        # the score
        s = 0
        # we also want to record the rank for debugging and visualization
        rank = []
        for df in dataframes:
            # if the current id is in the index
            if i in df.index:
                # 1-based rank of the document in this list
                r = list(df.index).index(i) + 1
                # add the reciprocal rank contribution (k = 60)
                s += 1 / (60.0 + r)
                rank.append(r)
            else:
                rank.append(None)

        # append to the list of docs, including the score and the rank list
        docs.append({ "id": i, "text": sentences[i], "score": s })
        ranks.append(rank)

    # convert to a dataframe
    df = pd.DataFrame(docs)
    df[[f"result_{i}" for i in range(len(dataframes))]] = ranks
    return df.set_index("id").sort_values("score", ascending=False)

In [None]:
# run the rank fusion for semantic and lexical search
rrf([sdf, ldf]).style.background_gradient(cmap='coolwarm')
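To help read the fused scores in the table above: a document gets the largest possible score when it is ranked first in every list. A minimal sketch of that upper bound, using a hypothetical helper `max_rrf_score` that is not part of the notebook:

```python
def max_rrf_score(n_lists, k=60):
    """Largest possible fused score: rank 1 in each of the n_lists lists."""
    return n_lists / (k + 1)

two_list_max = max_rrf_score(2)    # upper bound when fusing two lists
four_list_max = max_rrf_score(4)   # upper bound when fusing four lists
print(two_list_max, four_list_max)
```

Scores close to this bound indicate that the lists agree on a document. A score near $1/(k+1)$ or below means the document was found by one list only.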

In [None]:
model2 = SentenceTransformer('google/embeddinggemma-300m')
with open("sentences-gemma.npy", "rb") as f:
    sembeddings2 = np.load(f)

In [None]:
model3 = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0", trust_remote_code=True)
with open("sentences-arctic.npy", "rb") as f:
    sembeddings3 = np.load(f)

In [None]:
sdf2 = search_semantic("task: search result | query: " + question, 
                       sentences, sembeddings2, model2).set_index("id")

In [None]:
sdf3 = search_semantic(question, 
                       sentences, sembeddings3, model3, query_prompt_name="query").set_index("id")

In [None]:
rrf([ldf, sdf, sdf2, sdf3]).head(20).style.background_gradient(cmap='coolwarm')

If you think the result is *spoiled* by the lexical matches, fuse only the semantic result lists:

In [None]:
rrf([sdf, sdf2, sdf3]).head(20).style.background_gradient(cmap='coolwarm')