# Mastering Advanced RAG Techniques with LlamaIndex

In this lesson you will learn the following:
- Cohere re-ranker.
- Streaming query engines with top-n retrieval.
- `SubQuestionQueryEngine`, which breaks a question into sub-questions so that each part of the question can be retrieved efficiently.

# Intro
The Retrieval-Augmented Generation (RAG) pipeline heavily relies on retrieval performance guided by the adoption of various techniques and advanced strategies. Methods like query expansion, query transformations, and query construction each play a distinct role in refining the search process. 

# Querying in LlamaIndex

- Retrievers: These classes are designed to retrieve a set of nodes from an index based on a given query. Retrievers source the relevant data from the index.
- Query Engine: It is the central class that processes a query and returns a response object. Query Engine leverages the retrievers and the response synthesizer modules to curate the final output.
- Query Transform: It is a class that enhances a raw query string with various transformations to improve the retrieval efficiency. It can be used in conjunction with a Retriever and a Query Engine.

# Query Construction
The core idea is to answer user queries by leveraging the inherent structure of the data. For instance, a query like "movies about aliens in the year 1980" combines a semantic component like "aliens" (which will get better results if retrieved through vector storage) with a structured component like "year == 1980". The process involves translating a natural language query into the query language of a specific database, such as SQL for relational databases or Cypher for graph databases.
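
The split between a semantic component and a structured component can be sketched in a few lines. This is a toy illustration only: the function name `construct_query` and the regex rule for spotting a year are assumptions made for this example, not part of LlamaIndex. A real system would typically use an LLM to produce the structured filter.

```python
import re
from typing import Dict, Tuple


def construct_query(question: str) -> Tuple[str, Dict[str, int]]:
    """Split a question into semantic text and structured metadata filters."""
    filters: Dict[str, int] = {}
    semantic = question
    # Hypothetical rule: a phrase such as "in the year 1980" becomes year == 1980.
    match = re.search(r"\b(?:in\s+)?(?:the\s+)?year\s+(\d{4})\b", question)
    if match:
        filters["year"] = int(match.group(1))
        # Remove the structured phrase so only the semantic part remains.
        semantic = question[: match.start()] + question[match.end():]
    # Collapse any leftover whitespace.
    semantic = " ".join(semantic.split())
    return semantic, filters


semantic_text, metadata_filters = construct_query("movies about aliens in the year 1980")
print(semantic_text)     # text to send to the vector store
print(metadata_filters)  # filter to apply on structured metadata
```

The semantic text would go to vector search, while the filter dictionary would be translated into the target database's query language.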

# Query Expansion

- Query expansion works by extending the original query with additional terms or phrases that are related or synonymous.

- If the original query is too narrow or uses specific terminology, query expansion can add broader or more commonly used terms relevant to the topic. For example, "climate change effects" can be expanded with related terms or synonyms such as "global warming impact," "environmental consequences," or "temperature rise implications."

- One way to do this is to use the `synonym_expand_policy` option of the `KnowledgeGraphRAGRetriever` class. In LlamaIndex, query expansion is usually more effective when combined with the Query Transform class.
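
As a rough sketch of the idea, the snippet below expands a query using a small hand-written synonym table. The table `SYNONYMS` and the function `expand_query` are hypothetical names invented for this example; they do not reproduce how `KnowledgeGraphRAGRetriever` expands synonyms internally.

```python
from typing import Dict, List

# Hypothetical, hand-written synonym table. A real system might derive
# this from a thesaurus, a knowledge graph, or an LLM.
SYNONYMS: Dict[str, List[str]] = {
    "climate change": ["global warming"],
    "effects": ["impact", "consequences"],
}


def expand_query(query: str, synonyms: Dict[str, List[str]] = SYNONYMS) -> List[str]:
    """Return the original query plus variants with one phrase swapped for a synonym."""
    normalized = query.lower().strip(" .?!")
    variants = [normalized]
    for phrase, alternatives in synonyms.items():
        if phrase in normalized:
            for alternative in alternatives:
                candidate = normalized.replace(phrase, alternative)
                if candidate not in variants:
                    variants.append(candidate)
    return variants


for variant in expand_query("climate change effects."):
    print(variant)
```

Each variant can then be sent to the retriever, and the retrieved nodes merged and de-duplicated before synthesis.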

# Query Transformation

- Modifies the query to retrieve more relevant information.
- It can include changes in query structure, the use of synonyms, or the inclusion of contextual information.
- Produces a query better optimized for search engines and vector databases, e.g. "What were Microsoft's revenues in 2021?" -> "Microsoft revenues 2021".
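
The Microsoft example above can be reproduced with a very simple keyword-style transformation. This is a minimal sketch under stated assumptions: the stopword list and the function `to_keyword_query` are made up for illustration, and production systems usually rely on an LLM or a proper NLP pipeline instead.

```python
import re

# Hypothetical stopword list, kept short for the example.
STOPWORDS = {
    "what", "were", "was", "is", "are", "the", "in", "of",
    "a", "an", "did", "do", "does",
}


def to_keyword_query(question: str) -> str:
    """Reduce a natural-language question to its content-bearing keywords."""
    # Drop possessive endings ("Microsoft's" -> "Microsoft"),
    # handling both straight and curly apostrophes.
    text = re.sub(r"['’]s\b", "", question)
    # Keep alphanumeric tokens only, which also removes punctuation.
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return " ".join(token for token in tokens if token.lower() not in STOPWORDS)


print(to_keyword_query("What were Microsoft's revenues in 2021?"))
# prints: Microsoft revenues 2021
```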

# Query Engine

- Lets you interact with data using natural language queries.
- Multiple query engines can be combined for enhanced functionality, catering to complex data interrogation needs.
- Use Chat Engines to provide a more dynamic and engaging interaction with data.



In [27]:
from dotenv import load_dotenv
load_dotenv()

True

In [4]:
! mkdir -p './data/'
! wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O './data/paul_graham_essay.txt'

--2024-03-09 12:30:16--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘./data/paul_graham_essay.txt’


2024-03-09 12:30:17 (1.05 MB/s) - ‘./data/paul_graham_essay.txt’ saved [75042/75042]



In [9]:
from llama_index.core.readers import SimpleDirectoryReader

# load documents
documents = SimpleDirectoryReader(input_files=["./data/paul_graham_essay.txt"]).load_data()

print(len(documents))

1


In [13]:
from llama_index.core import ServiceContext

service_context = ServiceContext.from_defaults(chunk_size=512, chunk_overlap=64)
node_parser = service_context.node_parser
# The default splitter is SentenceSplitter
nodes = node_parser.get_nodes_from_documents(documents)



In [14]:
# Instantiate vector db
from llama_index.vector_stores.deeplake import DeepLakeVectorStore

my_activeloop_org_id = "akshatsingh1718"
my_activeloop_dataset_name = "LlamaIndex_paulgraham_essays"
dataset_path = f"hub://{my_activeloop_org_id}/{my_activeloop_dataset_name}"

# Create a vector store to hold the index over the documents
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=False)



Your Deep Lake dataset has been successfully created!


 

In [16]:
from llama_index.core import StorageContext, VectorStoreIndex

storage_context = StorageContext.from_defaults(vector_store=vector_store)
storage_context.docstore.add_documents(nodes)
vector_index = VectorStoreIndex(nodes, storage_context=storage_context)

Uploading data to deeplake dataset.


100%|██████████| 40/40 [00:04<00:00,  9.34it/s]

Dataset(path='hub://akshatsingh1718/LlamaIndex_paulgraham_essays', tensors=['text', 'metadata', 'embedding', 'id'])

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  ------- 
   text       text      (40, 1)      str     None   
 metadata     json      (40, 1)      str     None   
 embedding  embedding  (40, 1536)  float32   None   
    id        text      (40, 1)      str     None   


 

In [17]:
# create query engine 
query_engine = vector_index.as_query_engine(streaming=True, similarity_top_k=10)

In [18]:
streaming_response = query_engine.query(
    "What does Paul Graham do?",
)
streaming_response.print_response_stream()
# Streaming gives users the sense that the chatbot is typing in real time and reduces perceived idle time.

Paul Graham is involved in various activities such as founding and running companies, writing essays, working on programming projects like creating a new Lisp dialect called Arc, and starting initiatives like Y Combinator to fund and support startups.

In [19]:
query_engine = vector_index.as_query_engine(similarity_top_k=10)    

In [20]:
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.query_engine import SubQuestionQueryEngine

query_engine_tools = [
    QueryEngineTool(
        query_engine=query_engine,
        metadata=ToolMetadata(
            name="pg_essay",
            description="Paul Graham essay on What I Worked On",
        ),
    ),
]

query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    service_context=service_context,
    use_async=True,
)

In [25]:
response = query_engine.query(
    "How was Paul Grahams life different before, during, and after YC?"
)
print( ">>> The final response:\n", response )

Generated 3 sub questions.
[pg_essay] Q: What did Paul Graham work on before Y Combinator?
[pg_essay] Q: What did Paul Graham work on during Y Combinator?
[pg_essay] Q: What did Paul Graham work on after Y Combinator?
[pg_essay] A: Paul Graham worked on a new version of Arc before Y Combinator.
[pg_essay] A: Paul Graham worked on writing essays, working on Y Combinator (YC), and developing a new version of Arc during his time at Y Combinator.
[pg_essay] A: After Y Combinator, Paul Graham worked on a new dialect of Lisp called Arc in a house he bought in Cambridge.
>>> The final response:
 Paul Graham's life involved working on a new version of Arc before Y Combinator, during Y Combinator he focused on writing essays, working on Y Combinator (YC), and developing a new version of Arc, and after Y Combinator, he worked on a new dialect

In [26]:
response = query_engine.query(
    "What is python programming language ?"
)
print( ">>> The final response:\n", response )

Generated 1 sub questions.
[pg_essay] Q: What is Python programming language?
[pg_essay] A: Python is a high-level programming language known for its simplicity and readability. It supports multiple programming paradigms, including procedural, object-oriented, and functional programming. Python is widely used for web development, data analysis, artificial intelligence, scientific computing, and more. It emphasizes code readability and has a large standard library that makes it suitable for various applications.
>>> The final response:
 Python is a high-level programming language that is recognized for its simplicity, readability, and versatility. It supports various programming paradigms such as procedural, object-oriented, and functional programming. Python finds extensive use in web development, data analysis, artificial intelligence, scientific computing, and other fields due to its emphasis on code readability and its comprehensiv

# Custom Retriever Engine

- A `QueryEngine` depends heavily on its retriever and the retriever's parameters (such as the number of documents returned).
- LlamaIndex supports custom retrievers that combine different retriever styles, creating more nuanced retrieval strategies that adapt to individual queries.
- `RetrieverQueryEngine` operates with a designated retriever. Two common choices are:
    - `VectorIndexRetriever`: This is the retriever used so far. It fetches the top-k nodes most similar to the query, so the results closely align with the query. It is ideal where precision and relevance to the specific query matter most.
    - `SummaryIndexRetriever`: This approach is less concerned with aligning closely to the specific context of the question and more with providing a broad overview. It is useful where a comprehensive sweep of information is needed.
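
To make the idea of combining retriever styles concrete, here is a minimal, library-free sketch. Everything in it is hypothetical: the toy corpus, the two stand-in retrievers, and the `combine_retrievers` helper. It only illustrates the pattern of intersecting ("AND") or unioning ("OR") the node IDs returned by two retrievers, and is not the LlamaIndex API.

```python
from typing import Callable, Dict, Set

Retriever = Callable[[str], Set[str]]

# Toy corpus standing in for indexed nodes (node id -> text).
CORPUS: Dict[str, str] = {
    "n1": "paul graham founded y combinator",
    "n2": "arc is a dialect of lisp",
    "n3": "y combinator funds startups",
}


def keyword_retriever(query: str) -> Set[str]:
    """Return every node that shares at least one word with the query."""
    words = set(query.lower().split())
    return {node_id for node_id, text in CORPUS.items() if words & set(text.split())}


def top1_overlap_retriever(query: str) -> Set[str]:
    """Stand-in for a top-k similarity retriever with k = 1 (word overlap as the score)."""
    words = set(query.lower().split())
    best = max(CORPUS, key=lambda node_id: len(words & set(CORPUS[node_id].split())))
    return {best}


def combine_retrievers(first: Retriever, second: Retriever, mode: str = "AND") -> Retriever:
    """Build a retriever that intersects ("AND") or unions ("OR") two result sets."""
    if mode not in ("AND", "OR"):
        raise ValueError("mode must be 'AND' or 'OR'")

    def retrieve(query: str) -> Set[str]:
        a, b = first(query), second(query)
        return a & b if mode == "AND" else a | b

    return retrieve


strict = combine_retrievers(keyword_retriever, top1_overlap_retriever, mode="AND")
broad = combine_retrievers(keyword_retriever, top1_overlap_retriever, mode="OR")
print(sorted(strict("y combinator startups")))
print(sorted(broad("y combinator startups")))
```

The "AND" mode favors precision, while the "OR" mode favors recall, which mirrors the trade-off between the two retriever types described above.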

# Reranking

- Reranking means re-evaluating and re-ordering search results to present the most relevant options first.
- Removing the lower-scoring chunks can improve the LLM's performance.
- It sorts the search results according to their relevance to the query.
- Retrieval can fetch multiple documents, some of which may be irrelevant to the user's query.
- The Cohere reranker is used for complex and domain-specific queries.
- It is not a replacement for the search engine but a supplementary tool for sorting search results.

In [29]:
! pip3 install cohere

Collecting cohere
  Downloading cohere-4.53-py3-none-any.whl.metadata (6.2 kB)
Collecting fastavro<2.0,>=1.8 (from cohere)
  Downloading fastavro-1.9.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.5 kB)
Downloading cohere-4.53-py3-none-any.whl (52 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52.8/52.8 kB 1.4 MB/s eta 0:00:00
Downloading fastavro-1.9.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 6.1 MB/s eta 0:00:00
Installing collected packages: fastavro, cohere
Successfully installed cohere-4.53 fastavro-1.9.4


In [30]:
import cohere
import os

# os.environ['COHERE_API_KEY'] = "<YOUR_COHERE_API_KEY>"

# Get your cohere API key on: www.cohere.com
co = cohere.Client(os.environ['COHERE_API_KEY'])

# Example query and passages
query = "What is the capital of the United States?"
documents = [
   "Carson City is the capital city of the American state of Nevada. At the  2010 United States Census, Carson City had a population of 55,274.",

   "The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean that are a political division controlled by the United States. Its capital is Saipan.",
   
   "Charlotte Amalie is the capital and largest city of the United States Virgin Islands. It has about 20,000 people. The city is on the island of Saint Thomas.",
   
   "Washington, D.C. (also known as simply Washington or D.C., and officially as the District of Columbia) is the capital of the United States. It is a federal district. ",
   
   "Capital punishment (the death penalty) has existed in the United States since before the United States was a country. As of 2017, capital punishment is legal in 30 of the 50 states.",
   
   "North Dakota is a state in the United States. 672,591 people lived in North Dakota in the year 2010. The capital and seat of government is Bismarck."
   ]

In [31]:
results = co.rerank(query=query, documents=documents, top_n=3, model='rerank-english-v2.0') # Change top_n to change the number of results returned. If top_n is not passed, all results will be returned.
for idx, r in enumerate(results):
  print(f"Document Rank: {idx + 1}, Document Index: {r.index}")
  print(f"Document: {r.document['text']}")
  print(f"Relevance Score: {r.relevance_score:.2f}")
  print("\n")

# Results can also be filtered by relevance score against a threshold:
# if r.relevance_score < threshold: exclude the document

Document Rank: 1, Document Index: 3
Document: Washington, D.C. (also known as simply Washington or D.C., and officially as the District of Columbia) is the capital of the United States. It is a federal district. 
Relevance Score: 0.98


Document Rank: 2, Document Index: 1
Document: The Commonwealth of the Northern Mariana Islands is a group of islands in the Pacific Ocean that are a political division controlled by the United States. Its capital is Saipan.
Relevance Score: 0.30


Document Rank: 3, Document Index: 4
Document: Capital punishment (the death penalty) has existed in the United States since before the United States was a country. As of 2017, capital punishment is legal in 30 of the 50 states.
Relevance Score: 0.28




In [33]:
! pip install llama-index-postprocessor-cohere-rerank

Collecting llama-index-postprocessor-cohere-rerank
  Downloading llama_index_postprocessor_cohere_rerank-0.1.2-py3-none-any.whl.metadata (720 bytes)
Downloading llama_index_postprocessor_cohere_rerank-0.1.2-py3-none-any.whl (2.7 kB)
Installing collected packages: llama-index-postprocessor-cohere-rerank
Successfully installed llama-index-postprocessor-cohere-rerank-0.1.2


In [47]:
import os
from llama_index.postprocessor.cohere_rerank import CohereRerank

# The reranker ranks the 10 retrieved nodes against the query and then selects the top 2
cohere_rerank = CohereRerank(api_key=os.environ['COHERE_API_KEY'], top_n=2)
query_engine = vector_index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[cohere_rerank],
)

response = query_engine.query(
    "What did Sam Altman do in this essay?",
)
print(response)

Sam Altman initially declined the offer to become the president of Y Combinator as he wanted to start a startup focused on making nuclear reactors. However, after persistent persuasion, he eventually agreed to take over as the president starting with the winter 2014 batch.


# Advanced Retrievals

- An alternative method for retrieving relevant documents is to use document summaries instead of fragmented snippets or brief text chunks when responding to queries. This helps the answer reflect the entire context or topic being examined.

- **Recursive retrieval**: This works well with hierarchical structures where there are relationships between nodes. It applies to PDFs, for example, which may contain sub-data such as tables and diagrams alongside references to other documents.

- **Small-to-Big retrieval**: This starts with concise, focused sentences to find the most relevant section of content, using RAG techniques such as `SentenceWindowNodeParser` and `HierarchicalNodeParser`, and then expands to the surrounding context. This technique is particularly useful when the initial query may not encompass the entirety of the relevant information or when the data's relationships are intricate and multi-layered.
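
The small-to-big idea can be sketched as a sentence-window lookup: match the query against single sentences, then return the best sentence together with its neighbors. This is a simplified, library-free sketch. The helper names, the regex-based sentence splitting, and the word-overlap scoring are all assumptions of this example; `SentenceWindowNodeParser` in LlamaIndex works with embeddings and node metadata rather than this toy scoring.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentence_window_retrieve(query: str, text: str, window: int = 1) -> str:
    """Find the best-matching sentence, then return it with `window` neighbors on each side."""
    sentences = split_sentences(text)
    if not sentences:
        return ""
    query_words = set(re.findall(r"\w+", query.lower()))
    # "Small" step: score each individual sentence by word overlap with the query.
    scores = [len(query_words & set(re.findall(r"\w+", s.lower()))) for s in sentences]
    best = max(range(len(sentences)), key=scores.__getitem__)
    # "Big" step: widen the match to its surrounding context.
    start = max(0, best - window)
    end = min(len(sentences), best + window + 1)
    return " ".join(sentences[start:end])


essay = (
    "Arc is a Lisp dialect. It was written by Paul Graham. "
    "He later started Y Combinator. The program funds startups."
)
print(sentence_window_retrieve("Who started Y Combinator?", essay, window=1))
```

Matching on a single sentence keeps retrieval precise, while handing the wider window to the LLM gives it enough context to answer.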