# Post-retrieval processing

In the "Post-retrieval" phase of RAG, the retrieved documents are processed to extract the relevant information in order to optimize generation.

The retrieval phase results in a list of documents. 

This notebook demonstrates three different techniques for post-retrieval processing:

- Reranking
- Compression
- Fusion

### Setup libraries and environment

In [None]:
%pip install llmlingua llama-index-postprocessor-rankgpt-rerank llama-index-postprocessor-cohere-rerank llama-index-postprocessor-longllmlingua

In [None]:
# NOTE: This is ONLY necessary in jupyter notebook.
# Details: Jupyter runs an event-loop behind the scenes.
#          This results in nested event-loops when we start an event-loop to make async queries.
#          This is normally not allowed, we use nest_asyncio to allow it for convenience.
import nest_asyncio
nest_asyncio.apply()

In [None]:
import os
from dotenv import load_dotenv

from IPython.display import display, Markdown

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.openai import OpenAI
from llama_index.core.schema import QueryBundle
from llama_index.postprocessor.rankgpt_rerank import RankGPTRerank
from llama_index.postprocessor.cohere_rerank import CohereRerank
from llama_index.postprocessor.longllmlingua import LongLLMLinguaPostprocessor

Add the following to a `.env` file in the root of the project if not already there.

```
OPENAI_API_KEY=<YOUR_KEY_HERE>
COHERE_API_KEY=<YOUR_KEY_HERE>
```

Sign up for Cohere and create an API key here: [Cohere Dashboard](https://dashboard.cohere.com/api-keys)

In [None]:
load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
COHERE_API_KEY = os.getenv("COHERE_API_KEY")

In [None]:
from util.helpers import get_wiki_pages, create_and_save_wiki_md_files
pages = get_wiki_pages(["Vincent Van Gogh"])
create_and_save_wiki_md_files(pages=pages, path="./data/docs/wiki/")

In [None]:
documents = SimpleDirectoryReader("./data/docs/wiki/").load_data()

In [None]:
llm = OpenAI(api_key=OPENAI_API_KEY, model="gpt-4-turbo")

In [None]:
Settings.chunk_size = 124
Settings.chunk_overlap = 10
Settings.llm = llm

index = VectorStoreIndex.from_documents(
    documents=documents,
)

In [None]:
query = "Describe the later life of Vincent Van Gogh."

## Reranking

Since vectors are essentially compressions of the meaning behind some text, some information is lost. So what do we do if relevant information falls below the `top_k` cutoff for our retrieval? The simplest solution would be to increase the `top_k` value, but this would increase the computational cost. Another problem is that LLMs suffer from the "Lost in the Middle" phenomenon: they tend to focus on the beginning and end of the input prompt. This means that it's prudent to have the most relevant information at the top of the list.

A solution to this problem is **reranking**. Reranking reorders the retrieved document chunks so that the most pertinent results come first, and keeps only the top few, which reduces the overall document pool. It therefore serves a dual purpose in information retrieval, acting as both an enhancer and a filter, and delivers refined inputs for more precise language model processing.
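The retrieve-then-rerank idea can be sketched in plain Python. This is a minimal illustration, not how LlamaIndex implements it: `retrieve_then_rerank`, `word_overlap` and `overlap_density` are hypothetical names, and the two scoring functions are toy stand-ins for embedding similarity (cheap first stage) and an LLM or dedicated reranking model (costlier second stage).

```python
from typing import Callable, List, Tuple


def retrieve_then_rerank(
    query: str,
    chunks: List[str],
    coarse_score: Callable[[str, str], float],
    fine_score: Callable[[str, str], float],
    top_k: int = 10,
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Keep the top_k chunks by a cheap score, then reorder them with a
    costlier score and return only the best top_n."""
    candidates = sorted(
        chunks, key=lambda c: coarse_score(query, c), reverse=True
    )[:top_k]
    rescored = [(c, fine_score(query, c)) for c in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_n]


# Toy stand-ins (hypothetical): count shared words for the coarse stage,
# and normalise that count by chunk length for the fine stage.
def word_overlap(query: str, chunk: str) -> float:
    query_words = set(query.lower().split())
    return float(len(query_words & set(chunk.lower().split())))


def overlap_density(query: str, chunk: str) -> float:
    words = chunk.lower().split()
    return word_overlap(query, chunk) / max(len(words), 1)


chunks = [
    "Van Gogh moved to Arles in 1888",
    "In later life Van Gogh stayed at an asylum",
    "Sunflowers is a series of still life paintings",
    "The later life of Van Gogh ended in Auvers-sur-Oise",
]
ranked = retrieve_then_rerank(
    "later life of Van Gogh", chunks, word_overlap, overlap_density, top_k=3, top_n=2
)
for text, score in ranked:
    print(f"{score:.2f}  {text}")
```

The first stage can afford a generous `top_k` because it is cheap, while the second stage only ever sees `top_k` candidates and passes `top_n` of them on to the generation step.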

In this example we will see two approaches to reranking:
- LLM reranking 
    - having a language model rerank the documents
    - specifically, we will use RankGPT using ChatGPT from OpenAI
- Reranking using Cohere Rerank 3, a managed reranking model by Cohere

### LLMRerank

The benefits of using a language model to rerank documents are that it can understand the context of the query and the documents, and can provide a more nuanced ranking.

RankGPT uses the following prompt to rank the retrieved documents:
```
You are RankGPT, an intelligent assistant that can rank passages based on their relevancy to the query.

I will provide you with {num} passages, each indicated by number identifier []. 

Rank the passages based on their relevance to query: {query}.
```
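To make the listwise approach concrete, the sketch below fills in the `{num}` and `{query}` placeholders from the prompt excerpt and turns a model reply into an ordering. It rests on assumptions: the passages are laid out in a single prompt for simplicity, the reply is assumed to list bracketed identifiers from best to worst (for example `[2] > [3] > [1]`), and `build_prompt` and `parse_ranking` are hypothetical helpers, not part of `RankGPTRerank`.

```python
import re
from typing import List

# Wording taken from the prompt excerpt above.
PROMPT_TEMPLATE = (
    "I will provide you with {num} passages, each indicated by number identifier [].\n"
    "Rank the passages based on their relevance to query: {query}."
)


def build_prompt(query: str, passages: List[str]) -> str:
    """Fill in the placeholders and append the numbered passages."""
    header = PROMPT_TEMPLATE.format(num=len(passages), query=query)
    body = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return f"{header}\n\n{body}"


def parse_ranking(reply: str, num_passages: int) -> List[int]:
    """Turn a reply such as '[2] > [3] > [1]' into zero-based indices.

    Assumption: the model answers with bracketed identifiers, best first.
    Duplicates and out-of-range identifiers are ignored.
    """
    order: List[int] = []
    for match in re.findall(r"\[(\d+)\]", reply):
        idx = int(match) - 1
        if 0 <= idx < num_passages and idx not in order:
            order.append(idx)
    # Append anything the model omitted, keeping the original order.
    order.extend(i for i in range(num_passages) if i not in order)
    return order


passages = ["Early years in Zundert", "Final months in Auvers", "Time in Arles"]
prompt = build_prompt("Describe the later life of Vincent Van Gogh.", passages)
order = parse_ranking("[2] > [3] > [1]", len(passages))
reranked = [passages[i] for i in order]
print(reranked)
```

Parsing defensively matters here because a language model may repeat, skip, or invent identifiers; falling back to the original order for anything missing keeps every passage in the result.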

In [None]:
reranker = RankGPTRerank(top_n=3, verbose=True)

In [None]:
query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[reranker],
)

In [None]:
response = query_engine.query(query)
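separator = "\n\n---------------\n\n"  # kept outside the f-string: backslashes in f-string expressions fail before Python 3.12
display(Markdown(f"Nodes:\n\n{separator.join(node.text for node in response.source_nodes)}"))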
display(Markdown(f'<b>{response}</b>'))

### Cohere

In this example we will use [**Rerank 3**](https://cohere.com/blog/rerank-3), which is a managed reranking model by **Cohere** that can be used to rerank documents. It is a transformer model that is trained on a large dataset of queries and documents to rerank documents based on their relevance to the query.

The model includes
- 4k context length to significantly improve search quality for longer documents 
- Ability to search over multi-aspect and semi-structured data like emails, invoices, JSON documents, code, and tables
- Multilingual coverage of 100+ languages 

Since it is closed source, we cannot go through the inner workings of the model, but in many applications it has been shown to be very effective at reducing latency and increasing the accuracy of the generation step. An open source alternative is [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large), available on Hugging Face.

In [None]:
reranker = CohereRerank(api_key=COHERE_API_KEY, top_n=3, model="rerank-english-v3.0")

In [None]:
query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[reranker],
)

In [None]:
response = query_engine.query(query)
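separator = "\n\n---------------\n\n"  # kept outside the f-string: backslashes in f-string expressions fail before Python 3.12
display(Markdown(f"Nodes:\n\n{separator.join(node.text for node in response.source_nodes)}"))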
display(Markdown(f'<b>{response}</b>'))

## Prompt Compression

Prompt compression is the process of reducing the length of the prompt to focus on the most important information. This can be useful when the prompt is too long or contains irrelevant information. It's also an effective way to reduce the computational cost of the model (both time and money spent) and to combat the "Lost in the Middle" phenomenon.

In this example we will be using **LLMLingua** developed by Microsoft Research ([original paper](https://arxiv.org/pdf/2310.05736)) to reduce the size of prompts, while keeping the information that is relevant to the query.
The main idea behind LLMLingua is to use a smaller language model to calculate the mutual information between the prompt and the query and use this to perform prompt compression.

More specifically we will be using a process called **LongLLMLingua**. This process starts by reordering the documents to have the most relevant information at the top. Then it uses LLMLingua to compress the prompt and finally uses the compressed prompt to generate the final output.
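The reorder-then-compress flow can be illustrated with a toy example. This is only a sketch of the general idea and not the LLMLingua algorithm: where LLMLingua relies on a small language model to judge which tokens matter, this example substitutes simple word frequency (rarer words are treated as more informative) and word overlap with the query. The function name `compress` and its parameters are hypothetical.

```python
import math
from collections import Counter
from typing import List


def compress(documents: List[str], query: str, target_tokens: int) -> str:
    """Toy prompt compression: reorder documents, then prune common tokens."""
    query_words = set(query.lower().split())

    # Step 1: put the documents sharing the most words with the query first.
    def relevance(doc: str) -> int:
        return len(query_words & set(doc.lower().split()))

    ordered = sorted(documents, key=relevance, reverse=True)

    # Step 2: score every token; rarer tokens get higher self-information.
    # (Stand-in for the scores a small language model would provide.)
    tokens = [tok for doc in ordered for tok in doc.split()]
    counts = Counter(tok.lower() for tok in tokens)
    total = len(tokens)

    def info(tok: str) -> float:
        return -math.log(counts[tok.lower()] / total)

    # Step 3: keep the target_tokens highest-scoring tokens in reading order.
    by_score = sorted(range(total), key=lambda i: info(tokens[i]), reverse=True)
    keep = set(by_score[:target_tokens])
    return " ".join(tok for i, tok in enumerate(tokens) if i in keep)


docs = [
    "The painter lived in the south of France and the light there shaped the work",
    "The later life of the painter was marked by illness",
]
compressed = compress(docs, "later life of the painter", target_tokens=8)
print(compressed)
```

Frequent filler words such as "the" are dropped first, and because the more relevant document is moved to the front, its tokens win ties when the budget runs out.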

Other methods of prompt compression include:
- [Selective Context](https://arxiv.org/pdf/2304.12102)


In [None]:
llm_lingua_compressor = LongLLMLinguaPostprocessor(
    instruction_str="Given the context, please answer the final question",
    target_token=300,
    
    rank_method="longllmlingua",
    additional_compress_kwargs={
        "condition_compare": True,
        "condition_in_question": "after",
        "context_budget": "+100",
        "reorder_context": "sort",  # enable document reorder
        "dynamic_context_compression_ratio": 0.4, # enable dynamic compression ratio
    },
)

If the processor fails with a "CUDA" error, you might need to make sure that PyTorch is installed with CUDA support.

Check by running `torch.cuda.is_available()`. If it returns `True`, PyTorch can use CUDA. If it returns `False`, you need to install PyTorch with CUDA support.
The following pip command installs a version of PyTorch with CUDA support. (The right version may vary with your computer and OS.)

In [None]:

%pip install torch==2.3.0+cu118 --index-url https://download.pytorch.org/whl/cu118
%pip install --force-reinstall Pillow

In [None]:
import torch
torch.cuda.is_available()

### Compare the results before and after compression

In [None]:
retriever = index.as_retriever(similarity_top_k=10)

In [None]:
retrieved_nodes = retriever.retrieve(query)

In [None]:
new_retrieved_nodes = llm_lingua_compressor.postprocess_nodes(
    retrieved_nodes, query_bundle=QueryBundle(query_str=query)
)

In [None]:
original_contexts = "\n\n".join([n.get_content() for n in retrieved_nodes])
compressed_contexts = "\n\n".join([n.get_content() for n in new_retrieved_nodes])

original_tokens = llm_lingua_compressor._llm_lingua.get_token_length(original_contexts)
compressed_tokens = llm_lingua_compressor._llm_lingua.get_token_length(compressed_contexts)

print("Original Contexts:")
print("-------------------")
print(original_contexts)
print("-------------------")
print("Compressed Contexts:")
print("-------------------")
print(compressed_contexts)
print("-------------------")
print("Original Tokens:", original_tokens)
print("Compressed Tokens:", compressed_tokens)
print("Compression Ratio:", f"{original_tokens/(compressed_tokens + 1e-5):.2f}x")

### Create Query Engine

In [None]:
query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[llm_lingua_compressor],
)