<a href="https://colab.research.google.com/github/anshupandey/AI_Agents/blob/main/AAP_C11_RAG_Advance_Retrieval_Techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **RAG: Advance retrieval techniques using LangChain**

- Using LangChain - multi query retrieval, few other advance techniques from LangChain

- Vector DB: PineCone (sign in at https://app.pinecone.io/organizations?sessionType=login )

- Embedding model: vertex ai text embedding model

- LLM: Gemini-1.5-flash

- Domain: Education

Using Advanced Retrieval techniques like
- Multi-Query retriever
- Contextual compression
- Long-Context Reorder

This Lab focuses on implementing Retriever-Augmented Generation (RAG) system, designed for the botany domain, leverages advanced retrieval techniques facilitated by **LangChain**, employing **PineCone** as the vector database and **Gemini-1.5-Flash** for language modeling. This configuration enhances the precision and efficiency of botanical information retrieval and generation, ideal for academic research, educational content, and professional consultation.

**LangChain** serves as the orchestration framework that implements sophisticated retrieval techniques, crucial for dealing with complex botanical data. Key features include:

**Multi-Query Retriever**: This technique allows the system to handle multiple queries simultaneously or queries with multiple components, enhancing the system's ability to retrieve comprehensive information from a large dataset.
**Contextual Compression**: Here, relevant information is compressed into context-rich embeddings, preserving essential data while optimizing storage and retrieval processes.
**Long-Context Reorder**: This technique reorganizes the retrieved data to maintain coherence and relevance over extended contexts, vital for understanding complex botanical interactions and processes.
Weaviate is utilized as the vector database, storing and managing the vector embeddings generated using OpenAI's text-embedding-03 model. These embeddings encode detailed semantic relationships within botanical texts, enabling the system to retrieve highly relevant information rapidly and accurately.


Overall, this RAG setup offers a powerful solution for the botany domain, enhancing the retrieval and presentation of botanical knowledge through advanced AI-driven techniques. It significantly aids in research, teaching, and practical applications by delivering accurate, contextually enriched responses swiftly, thus supporting deeper insights and learning in botany.







Necessary libraries for implementing this "RAG: Advance retrieval techniques using LangChain"

## Environment Setup

In [1]:
!pip install -q langchain  langchain-community pypdf pinecone-client langchain-pinecone
!pip install -q -U langchain-core langgraph langchain-experimental --quiet
!pip install --upgrade --quiet google-cloud-aiplatform requests
!pip install -q -U langchain-google-vertexai --quiet

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m987.6/987.6 kB[0m [31m6.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.2/2.2 MB[0m [31m12.9 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m295.7/295.7 kB[0m [31m9.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m216.4/216.4 kB[0m [31m5.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m366.5/366.5 kB[0m [31m13.8 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m129.4/129.4 kB[0m [31m4.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m215.9/215.9 kB[0m [31m7.5 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m49.2/49.2 kB[0m [31m1.4 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━

In [2]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

In [1]:
import os
os.environ["PINECONE_API_KEY"] = "15101c0c-1e64-42c3-929a-c8c2d60cd14a"

In [2]:
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

In [3]:
PROJECT_ID = "jrproject-402905"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)

Function for showing retrieved document results in intuitive way

In [4]:
!wget -q https://anshupandey.blob.core.windows.net/generativeaidocs/BotanyDocs.zip
!unzip -q BotanyDocs.zip

In [5]:
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

The RecursiveCharacterTextSplitter is used in Retriever-Augmented Generation (RAG) models for handling long text inputs.

Here are the main points for its use:

- Length Management
- Context Preservation
- Efficient Processing

## Data Preparation

In [9]:
# Import the RecursiveCharacterTextSplitter class.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize a RecursiveCharacterTextSplitter with specified separators and configurations.
# The text will be split into chunks around 1000 characters long, with a 20-character overlap between chunks.

character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],  # The hierarchy of separators to use for splitting.
    chunk_size=2000,  # Target size of each chunk in characters.
    chunk_overlap=400)


Here we are using four pdfs related to Stock market investments and strategies. These are our external data sources on which we will implement RAG solution

PyPDFLoader class from the langchain_community.document_loaders module to load and process multiple PDF documents

Loop Through PDF Files: The code iterates over the list of PDF filenames using a for-loop.

Loading PDF Content: For each PDF file in the list, an instance of PyPDFLoader is created with the filename as an argument.

Extract and Split Text: The load_and_split method of the PyPDFLoader instance is called to load the text content from the PDF file and split it into manageable segments. The method uses a text_splitter for this purpose, which is RecursiveCharacterTextSplitter defined above cell

In [10]:
from langchain_community.document_loaders import PyPDFLoader
pdfs=["intro_botany.pdf","Basics_of_Plants.pdf"]
docs=[]
for i in pdfs:
    loader = PyPDFLoader(i)
    docs.extend(loader.load_and_split(text_splitter=character_splitter))

In [11]:
len(docs)

214

In [12]:
docs[0]

Document(metadata={'source': 'intro_botany.pdf', 'page': 0}, page_content='Chapter 1\nIntroduction to the Introduction\n1.1 Plants, Botany, and Kingdoms\nBotany is the scientiﬁc study of plants and plant-like organisms. It helps us under-\nstand why plants are so vitally important to the world. Plants start the majority of\nfood and energy chains, they provide us with oxygen, food and medicine.\nPlants can be divided into two groups: plants 1andplants 2. Plants 1contain all pho-\ntosynthetic organisms which use light, H 2O, and CO 2to make organic compounds\nand O 2. Plants 1are deﬁned ecologically (based on their role in nature).\nSome plants 1can be bacteria or even animals! One example of this a green slug,\nElysia chlorotica (see Fig. 1.1). Green slugs collect chloroplasts from algae and use\nthem for their entire life as food producers. Therefore, green slugs are both animals\nand plants 1.\nPlants 2areall organisms from Vegetabilia kingdom . Normally, plants 2are green\norganisms

In [13]:
len(docs)

214

We are using Openai embeddings "text-embedding-ada-002"

## Vector Embeddings

In [14]:
from langchain_google_vertexai.embeddings import VertexAIEmbeddings
embeddings = VertexAIEmbeddings(model_name="textembedding-gecko@001")

## Vector DB Setup and Retrieval

We are using Pinecone vectorDB for storing the Vector embeddings

In [15]:
from pinecone import Pinecone, ServerlessSpec
import time

use_serverless=False

# configure client
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))



# check for and delete index if already exists
index_name = 'langchain-rag'
if index_name in pc.list_indexes().names():
    pc.delete_index(index_name)

# create a new index
pc.create_index(
    name=index_name,
    dimension=768,
    metric="cosine",
    spec=ServerlessSpec(
        cloud='aws',
        region='us-east-1'
    )
)

# wait for index to be initialized
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)

In [16]:
from langchain_pinecone import PineconeVectorStore

index_name = "langchain-rag"

docsearch = PineconeVectorStore.from_documents(docs, embeddings, index_name=index_name)



In [17]:
from langchain_core.messages import HumanMessage

In [18]:
from langchain_google_vertexai import ChatVertexAI
model = ChatVertexAI(model="gemini-1.5-flash-001")

This function will retrieve top n results from vectorDB which are semantically similar to the query

In [19]:
def get_similiar_docs(query, k=6, score=False):
  if score:
    similar_docs = docsearch.similarity_search_with_score(query, k=k)
  else:
    similar_docs = docsearch.similarity_search(query, k=k)
  return similar_docs

query=input("what is you query? ")
pretty_print_docs(get_similiar_docs(query))
# What is needed to understand life of plants

what is you query? What is needed to understand life of plants
Document 1:

goal of the analysis is the creation of a phylogeny tree (cladogram ) which becomes
the basis of classiﬁcation. Below is a short instruction which explains the basics of
the cladistic analysis on the artiﬁcial example of several “families” of plants.
1. Start with determining the “players”—all subtaxa from bigger group. In our
case, it will be these three “families”:
Alphaceae
Betaceae
Gammaceae
2. Describe these three groups:
Alphaceae : Flowers red, petioles short, leaves whole, spines absent
Betaceae : Flowers red, petioles long, leaves whole, spines absent
Gammaceae : Flowers green, petioles short, leaves dissected, spines present
3. Determine individual characters (we will need at least 2 N+ 1 characters where
Nis number of studied taxa):
(1) Flower color
(2) Petiole size
(3) Dissection of leaves
(4) Presence of spines
4.Polarize the characters : every character should have at least two character
states wh

## Implementing RAG Chain

This below is the basic chaining process of langchain for RAG, we have provided two ways for Chainings


The first code snippet sets up a retrieval chain for answering questions based on context using Langchain. It involves creating a retrieval chain with a specific retriever and language model, structuring the question-answer process with a ChatPromptTemplate, and invoking the retrieval chain with a user input question.

In [20]:
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

In [21]:
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
chain = (
    {"context":docsearch.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

chain.invoke("All organic molecule is made of what?")

'All organic molecules are made of some organic skeleton. \n'

**-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------**

The second code snippet constructs a chain for answering questions based on a provided context. It involves defining a template for structuring the question-answer process, creating a ChatPromptTemplate from the template, initializing a language model, and setting up a chain with components like the context (retriever), a placeholder for the question, a prompt template, the language model, and an output parser.

In [22]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate


# Create a retrieval chain
retriever =docsearch.as_retriever()  # Define the retriever based on your specific requirements


prompt = ChatPromptTemplate.from_template(
    """Answer the following question based only on the provided context:
    <context>
    {context}
    </context>
    Question: {input}"""
)

document_chain = create_stuff_documents_chain(model, prompt)
retrieval_chain = create_retrieval_chain(retriever, document_chain)

# Invoke the retrieval chain with the question
ans=retrieval_chain.invoke({"input": input()}) # What is needed to understand life of plants
ans

What is needed to understand life of plants


{'input': 'What is needed to understand life of plants',
 'context': [Document(metadata={'page': 159.0, 'source': 'intro_botany.pdf'}, page_content='goal of the analysis is the creation of a phylogeny tree (cladogram ) which becomes\nthe basis of classiﬁcation. Below is a short instruction which explains the basics of\nthe cladistic analysis on the artiﬁcial example of several “families” of plants.\n1. Start with determining the “players”—all subtaxa from bigger group. In our\ncase, it will be these three “families”:\nAlphaceae\nBetaceae\nGammaceae\n2. Describe these three groups:\nAlphaceae : Flowers red, petioles short, leaves whole, spines absent\nBetaceae : Flowers red, petioles long, leaves whole, spines absent\nGammaceae : Flowers green, petioles short, leaves dissected, spines present\n3. Determine individual characters (we will need at least 2 N+ 1 characters where\nNis number of studied taxa):\n(1) Flower color\n(2) Petiole size\n(3) Dissection of leaves\n(4) Presence of spine

In [23]:
ans['answer']

'The provided text focuses on cladistic analysis and plant classification, not on the overall understanding of plant life.  Therefore, the context does not provide an answer to the question of what is needed to understand the life of plants. \n'

**------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------**

These functions are created for showing the final cleaned results from advanced retrieval techniques

In [24]:
from langchain.chains.combine_documents import create_stuff_documents_chain

import re
def clean_and_format_answer(text):

  cleaned_text = text.strip()

  # Split into paragraphs and remove extra newlines
  paragraphs = re.split(r"\n\n|\n", text.strip())
  formatted_paragraphs = [paragraph.strip() for paragraph in paragraphs]

  # Join paragraphs back with a single newline
  answer = "".join(formatted_paragraphs)

  return answer


def answerRetriever(llm,docs,question):
    prompt = ChatPromptTemplate.from_template(
        """Answer the following question based only on the provided context:
        <context>
        {context}
        </context>
        Question: {input}"""
    )
    document_chain = create_stuff_documents_chain(llm, prompt)

    ans=document_chain.invoke({
        "context": unique_docs,
        "input":question
    })
    return ans

Now we are going to leverage some advanced retrieval techniques

# Multi-Query Retriever

The Multi-Query Retriever automates prompt tuning by using a language model to generate multiple queries from different perspectives for a given user input query. It retrieves relevant documents for each query and combines the unique results across all queries. By generating diverse perspectives on the same question, it aims to overcome limitations of distance-based retrieval and provide a richer set of results. The retriever simplifies the process of generating varied queries and retrieving a broader range of potentially relevant documents based on user input

In [25]:
from langchain_google_vertexai import ChatVertexAI
llm = ChatVertexAI(model="gemini-1.5-flash-001")

In [31]:
from langchain.retrievers.multi_query import MultiQueryRetriever
question = "Explain the whole process of photosynthesis"
retriever_from_llm = MultiQueryRetriever.from_llm(retriever=retriever, llm=llm)

In [27]:
# Set logging for the queries
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

In [74]:
from typing import List
from langchain.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from pydantic import BaseModel, Field


# Output parser will split the LLM result into a list of queries
class LineList(BaseModel):
    # "lines" is the key (attribute name) of the parsed output
    lines: List[str] = Field(description="Lines of text")


class LineListOutputParser(PydanticOutputParser):
    def __init__(self) -> None:
        super().__init__(pydantic_object=LineList)

    def parse(self, text: str) -> LineList:
        lines = text.content.strip()#.split("\n")
        return lines #LineList(lines=lines)


output_parser = LineListOutputParser()

QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five
    different versions of the given user question to retrieve relevant documents from a vector
    database. By generating multiple perspectives on the user question, your goal is to help
    the user overcome some of the limitations of the distance-based similarity search.
    Provide these alternative questions separated by newlines.
    Original question: {question}""",
)

llm = ChatVertexAI(model="gemini-1.5-flash-001")

# Chain
multi_query_chain = QUERY_PROMPT | llm | output_parser.parse
# Other inputs
question = "Explain the whole process of photosynthesis"

In [75]:
result = multi_query_chain.invoke({"question": question})
print(result)

Here are five different versions of the user question "Explain the whole process of photosynthesis" to retrieve relevant documents from a vector database:

1. **Focus on stages:** What are the distinct stages involved in photosynthesis, and how do they work together?
2. **Highlight inputs and outputs:** What are the key inputs and outputs of photosynthesis, and how are they transformed during the process?
3. **Emphasize light dependence:** How does light energy play a role in photosynthesis, and what are the specific reactions that depend on it?
4. **Focus on the role of chlorophyll:** What is the function of chlorophyll in photosynthesis


In [76]:
# Run
retriever = multi_query_chain | docsearch.as_retriever()
# Results
unique_docs = retriever.invoke(input=question)
len(unique_docs)

4

In [77]:
pretty_print_docs(unique_docs)

Document 1:

Sun rays
Photosystems↓↓
Segregation of ions↓↓
Difference of potentials↓↓
Proton pump↓↓
ATP↓↓
CO 2assimilation↓↓
Figure 2.4. The logical chain of light stage reactions (hydrogen carrier not shown).
2.3 Enzymatic Stage
The enzymatic stage has many participants. These include carbon dioxide, hydro-
gen carrier with hydrogen (NADPH), ATP , ribulose biphosphate (RuBP , or C 5), and
Rubisco along with some other enzymes. Everything occurs in the matrix (stroma)
of the chloroplast.
The main event of the enzymatic stage is CO 2assimilation with C 5into short-living
C6molecules. Assimilation requires Rubisco as an enzyme. Next, this temporary C 6
breaks into two C 3molecules (PGA). Then, PGA will participate in the complex set of
reactions which spend NADPH and ATP as sources of hydrogen and energy, respec-
tively; and yields (through the intermediate stage of PGAL) one molecule of glucose
(C6H12O6) for every six assimilated molecules of CO 2. NADP+, ADP and P iwill go
back to the 

In [78]:
answerRetriever(llm,unique_docs,question)

'Photosynthesis is a process that plants use to convert light energy into chemical energy in the form of glucose. It occurs in two stages: the light-dependent reactions and the light-independent reactions (Calvin cycle).\n\n**Light-dependent reactions:**\n\n1. **Sun rays:** Sunlight is absorbed by chlorophyll in photosystems.\n2. **Photosystems:** This energy excites electrons, which are then passed along an electron transport chain.\n3. **Segregation of ions:** The movement of electrons creates a difference in potential across the thylakoid membrane.\n4. **Difference of potentials:** This potential drives the movement of protons ('

# Contextual compression

Contextual compression in Langchain involves compressing retrieved documents using the context of a given query to filter out irrelevant information. It aims to return only the most relevant information by shortening document contents or removing documents altogether. This process helps optimize expensive language model calls and improve response quality by focusing on the essential content related to the query. Contextual compression involves passing queries to a base retriever, then compressing the retrieved documents using a Document Compressor to enhance the relevance of the information returned to the user

In [None]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=docsearch.as_retriever()
)

print(question)

In [None]:
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

In [None]:
clean_and_format_answer(answerRetriever(llm,compressed_docs,question))

'Photosynthesis is the process by which plants convert sunlight, water, and carbon dioxide into glucose and oxygen. It begins with the light stage, where sunlight is absorbed by photosystems in the chloroplasts of plant cells. This leads to the segregation of ions, a difference in potentials, and the creation of a proton pump that generates ATP. ATP is essential for providing energy for the conversion of carbon dioxide into glucose.During the light stage, water is split into protons, electrons, and oxygen by photosystem II, which also produces ATP and forwards electrons to photosystem I. Photosystem I then uses these electrons to create NADPH, a hydrogen carrier. The light stage results in the accumulation of energy (ATP) and hydrogen (NADPH), with oxygen released as a byproduct.The enzymatic stage of photosynthesis takes place in the matrix (stroma) of the chloroplast and involves various participants such as carbon dioxide, NADPH, ATP, ribulose biphosphate (RuBP), and the enzyme Rubi


# Long-Context Reorder

Long-Context Reorder in Langchain is a technique used to address performance degradation when models access relevant information in the middle of long contexts. By reordering documents after retrieval, this method ensures that more relevant elements are placed at the beginning and end of the document list, while less relevant ones are in the middle. This reordering helps models avoid ignoring provided documents and enhances performance when dealing with long contexts and multiple retrieved documents. Long-Context Reorder is a valuable tool to optimize information retrieval and processing in complex language models

In [81]:
from langchain_community.document_transformers import (
    LongContextReorder,
)

In [86]:
query = "Explain the whole process of photosynthesis"

retriever = docsearch.as_retriever()
# Get relevant documents ordered by relevance score
docs = retriever.get_relevant_documents(question)
docs

[Document(metadata={'page': 14.0, 'source': 'intro_botany.pdf'}, page_content='starch\nDNAgranum thylakoid\nmatrix (stroma)Figure 2.5. Chloroplast.\n* * *\nTo summarize, the logic of photosynthesis (Fig. 2.8) is based on a simple idea: make\nsugar from carbon dioxide . Imagine if we have letters “s”, “g”, “u”, and “a” and need\nto build the word “sugar”. Obviously, we will need two things: the letter “r” and the\nenergy to put these letters in the right order. The same story occurs in photosynthe-\nsis: it will need hydrogen (H) which is the “absent letter” from CO 2because sugars\nmust contain H, O and C. NADP+/NADPH is used as hydrogen supplier, and energy\nis ATP which is created via proton pump, and the proton pump starts because light\nhelps to concentrate protons in the reservoir.\n33 Version June 7, 2021'),
 Document(metadata={'page': 14.0, 'source': 'intro_botany.pdf'}, page_content='starch\nDNAgranum thylakoid\nmatrix (stroma)Figure 2.5. Chloroplast.\n* * *\nTo summarize, the 

In [87]:
reordering = LongContextReorder()
reordered_docs = reordering.transform_documents(docs)

# Confirm that the 4 relevant documents are at beginning and end.
reordered_docs

[Document(metadata={'page': 14.0, 'source': 'intro_botany.pdf'}, page_content='starch\nDNAgranum thylakoid\nmatrix (stroma)Figure 2.5. Chloroplast.\n* * *\nTo summarize, the logic of photosynthesis (Fig. 2.8) is based on a simple idea: make\nsugar from carbon dioxide . Imagine if we have letters “s”, “g”, “u”, and “a” and need\nto build the word “sugar”. Obviously, we will need two things: the letter “r” and the\nenergy to put these letters in the right order. The same story occurs in photosynthe-\nsis: it will need hydrogen (H) which is the “absent letter” from CO 2because sugars\nmust contain H, O and C. NADP+/NADPH is used as hydrogen supplier, and energy\nis ATP which is created via proton pump, and the proton pump starts because light\nhelps to concentrate protons in the reservoir.\n33 Version June 7, 2021'),
 Document(metadata={'page': 13.0, 'source': 'intro_botany.pdf'}, page_content='Sun rays\nPhotosystems↓↓\nSegregation of ions↓↓\nDifference of potentials↓↓\nProton pump↓↓\nATP

In [90]:
# We prepare and run a custom Stuff chain with reordered docs as context.
ans = answerRetriever(llm,reordered_docs,query)
ans



''

In [None]:
clean_and_format_answer(ans)

## Thank You