# AI-Driven Research Assistant Powered by Agentic RAG

__RAG__ (Retrieval-Augmented Generation) is a technique for enhancing the accuracy and reliability of generative AI models with facts fetched from external sources.
- __Retrieve__ the most relevant data
- __Augment__ the query with that context
- __Generate__ the response
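
The three steps can be pictured with a minimal, library-free sketch. Everything in it is an assumption made for illustration: the `DOCS` corpus is hypothetical, retrieval is plain word overlap instead of embeddings, and `generate` is a placeholder rather than a real LLM call.

```python
import re

# Hypothetical in-memory corpus standing in for the external source.
DOCS = [
    "The Transformer relies entirely on attention mechanisms.",
    "Recurrent networks process tokens one step at a time.",
    "DetectGPT uses probability curvature to detect machine-generated text.",
]

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, docs, top_k=1):
    """Retrieve: rank documents by how many query words they share."""
    query_words = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return ranked[:top_k]

def augment(query, context_docs):
    """Augment: place the retrieved text in front of the query."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuery: {query}\nAnswer:"

def generate(prompt):
    """Generate: placeholder for an LLM call; it only reports the prompt size."""
    return f"[LLM response to a prompt of {len(prompt)} characters]"

query = "How does DetectGPT detect machine-generated text?"
prompt = augment(query, retrieve(query, DOCS))
print(prompt)
print(generate(prompt))
```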

### Starting with a basic RAG implementation

__Import required dependencies for LlamaIndex and OpenAI__

`OpenAI GPT-3.5-turbo` is used by default.  
The OpenAI API key is stored in a `.env` file in the project directory.

In [2]:
# from openai import OpenAI

In [82]:
import os
from dotenv import load_dotenv
load_dotenv()

True
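
`load_dotenv()` returning `True` does not by itself show that the key was defined. A small guard such as the sketch below can fail early with a clear message. `require_env` is a hypothetical helper, not part of `python-dotenv`, and `EXAMPLE_API_KEY` is a made-up variable used only for the demonstration.

```python
import os

def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check the .env file")
    return value

# Hypothetical variable so the demo runs without a real key.
os.environ.setdefault("EXAMPLE_API_KEY", "placeholder-value")
print(len(require_env("EXAMPLE_API_KEY")) > 0)
```

In this notebook the call would be `require_env("OPENAI_API_KEY")`, placed right after `load_dotenv()`.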

In [83]:
# client = OpenAI()

In [84]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

__Load research paper files from local directory__

In [86]:
documents = SimpleDirectoryReader("data/data1").load_data()

__Initialize an index and create a query engine__

In [87]:
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

The default LlamaIndex query engine combines a retriever, node postprocessing, and a response synthesizer.
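
How the three stages chain together can be sketched in plain Python. This is not LlamaIndex's implementation: the `NODES` scores are hypothetical, the postprocessing step is assumed to be a simple score cutoff, and the synthesizer only assembles the text an LLM would receive.

```python
# Hypothetical (text, similarity score) pairs; a real retriever would
# compute the scores from embeddings.
NODES = [
    ("Self-attention relates all positions of a sequence.", 0.91),
    ("The encoder is a stack of identical layers.", 0.78),
    ("An unrelated passage.", 0.32),
]

def retrieve(nodes, top_k=3):
    """Stage 1: keep the top_k highest-scoring nodes."""
    return sorted(nodes, key=lambda node: node[1], reverse=True)[:top_k]

def postprocess(nodes, cutoff=0.5):
    """Stage 2: drop nodes whose score falls below the cutoff."""
    return [node for node in nodes if node[1] >= cutoff]

def synthesize(query, nodes):
    """Stage 3: assemble the prompt an LLM would answer from."""
    context = "\n".join(text for text, _ in nodes)
    return f"Context:\n{context}\n\nQuery: {query}"

def query_engine(query, nodes):
    """Chain the three stages in order."""
    return synthesize(query, postprocess(retrieve(nodes)))

print(query_engine("Explain the transformer model architecture.", NODES))
```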

__Ask the LLM questions about the local research document__

In [88]:
response = query_engine.query("Explain how transformers can outperform RNNs in machine translation tasks.")
print(response)

Transformers can outperform RNNs in machine translation tasks due to their unique architecture based entirely on attention mechanisms. By replacing the recurrent layers commonly used in encoder-decoder architectures with multi-headed self-attention, transformers can capture long-range dependencies more effectively. This allows them to process input sequences in parallel, making them more efficient and faster to train compared to RNNs. Additionally, transformers are better at handling context information across the entire input sequence, leading to improved translation quality and performance.


In [89]:
response = query_engine.query("Explain the transformer model architecture.")
print(response)

The Transformer model architecture consists of an encoder and a decoder, both composed of a stack of identical layers. Each layer in the encoder has two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. Residual connections and layer normalization are applied around each sub-layer. The decoder, in addition to the two sub-layers in each layer, inserts a third sub-layer that performs multi-head attention over the encoder stack's output. The model uses residual connections and layer normalization in the decoder as well. The self-attention mechanism in the decoder is modified to prevent positions from attending to subsequent positions. This architecture allows for efficient sequence transduction based entirely on attention, without the need for recurrent layers commonly used in encoder-decoder models.


### Creating our custom RAG pipeline

__Import required dependencies from LlamaIndex__

In [280]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings
from llama_index.core import PromptTemplate
from llama_index.core import get_response_synthesizer
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

__LLM and embedding model__

In [91]:
Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding()

__Custom retriever__

In [94]:
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)
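
What `similarity_top_k` controls can be illustrated with a small sketch that ranks vectors against a query and keeps the `k` best. It assumes cosine similarity as the measure and uses hypothetical 3-dimensional vectors; real embeddings have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to query_vec."""
    scores = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scores.sort(reverse=True)
    return [i for _, i in scores[:k]]

# Hypothetical embeddings, for illustration only.
doc_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.05, 0.0]
print(top_k(query_vec, doc_vecs, k=2))
```

A larger `k` hands more context to the synthesizer at the cost of a longer prompt.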

__Custom synthesizer__

In [95]:
response_synthesizer = get_response_synthesizer()

__Custom query engine__

In [96]:
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

__Query with custom query engine__

In [98]:
response = query_engine.query("Explain the transformer model architecture.")
print(response)

The Transformer model architecture consists of an encoder and a decoder, each composed of a stack of identical layers. The encoder stack includes self-attention layers where each position can attend to all positions in the previous layer. The decoder stack contains self-attention layers as well, allowing each position to attend to all positions up to that point. Additionally, the decoder uses encoder-decoder attention layers to enable every position in the decoder to attend over all positions in the input sequence. The model also incorporates positional encodings to provide information about the relative or absolute position of tokens in the sequence. Furthermore, the architecture includes feed-forward networks applied to each position separately and identically, along with embeddings and softmax functions for token conversion and prediction.


__Custom prompt__

In [92]:
qa_prompt_str = """\
Context information is below.
--------------------------
{context_str}
--------------------------
Given the context information, answer the query.
Query: {query_str}
Answer: \
"""

In [93]:
prompt_tmpl = PromptTemplate(qa_prompt_str)

In [97]:
fmt_prompt = prompt_tmpl.format(
    context_str="",
    query_str="Explain the transformer model architecture."
)
print(fmt_prompt)

Context information is below.
--------------------------

--------------------------
Given the context information, answer the query.
Query: Explain the transformer model architecture.
Answer: 
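
The example above formats the template with an empty `context_str`. The sketch below shows how the same placeholders behave once retrieved text is supplied, using Python's built-in `str.format` instead of `PromptTemplate`. The `chunks` list and the `build_prompt` helper are hypothetical.

```python
QA_PROMPT = (
    "Context information is below.\n"
    "--------------------------\n"
    "{context_str}\n"
    "--------------------------\n"
    "Given the context information, answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)

def build_prompt(chunks, query_str):
    """Join the retrieved chunks and substitute them into the template."""
    context_str = "\n\n".join(chunks)
    return QA_PROMPT.format(context_str=context_str, query_str=query_str)

# Hypothetical retrieved chunks, used only for illustration.
chunks = [
    "The encoder maps an input sequence to continuous representations.",
    "The decoder generates the output sequence one element at a time.",
]
print(build_prompt(chunks, "Explain the transformer model architecture."))
```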


__Query with custom prompt__

In [99]:
response = query_engine.query(fmt_prompt)
print(response)

The Transformer model architecture consists of an encoder and a decoder, each composed of a stack of identical layers. The encoder stack includes two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. The self-attention mechanism allows each position in the encoder to attend to all positions in the previous layer. The decoder stack also contains two sub-layers from the encoder, along with an additional sub-layer that performs multi-head attention over the encoder stack's output. The model utilizes residual connections and layer normalization around each sub-layer to facilitate learning. Additionally, the model incorporates positional encodings to provide information about the order of tokens in the sequence. The Transformer model is based entirely on attention mechanisms, eliminating the need for recurrence and convolutions, allowing for more parallelization and achieving state-of-the-art results in machine translation tasks.


__Combined custom prompt method__

In [100]:
def custom_prompt_template(qa_prompt_str, context_str="", query_str=""):
    """Fill the QA prompt template with the given context and query."""
    prompt_tmpl = PromptTemplate(qa_prompt_str)
    fmt_prompt = prompt_tmpl.format(
        context_str=context_str,
        query_str=query_str
    )
    return fmt_prompt

__Query with combined custom prompt method__

In [101]:
query_str = "Summarize the 'Attention is all you need' paper."

response = query_engine.query(custom_prompt_template(qa_prompt_str, query_str=query_str))
print(response)

The paper "Attention Is All You Need" introduces the Transformer model, which is a sequence transduction model based solely on attention mechanisms, eliminating the need for recurrent or convolutional neural networks in the encoder-decoder architecture. The Transformer model uses stacked self-attention and point-wise fully connected layers for both the encoder and decoder. It achieves state-of-the-art results in machine translation tasks, such as English-to-German and English-to-French, outperforming previous models in terms of quality and training efficiency. The model allows for more parallelization, faster training times, and improved translation quality compared to traditional models based on recurrent layers.


### Advanced queries with agents

__Import required dependencies from LlamaIndex__

In [102]:
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.agent.openai import OpenAIAgent
from llama_index.core.tools import QueryEngineTool
from llama_index.core.tools import FunctionTool

In [151]:
from llama_index.core.selectors import PydanticSingleSelector

__Documents, indices and query engines for research paper 1__

In [119]:
document1 = SimpleDirectoryReader("data/data1").load_data()
index1 = VectorStoreIndex.from_documents(document1)
queryEngine1 = index1.as_query_engine()

__Documents, indices and query engines for research paper 2__

In [120]:
document2 = SimpleDirectoryReader("data/data2").load_data()
index2 = VectorStoreIndex.from_documents(document2)
queryEngine2 = index2.as_query_engine()

__Query with research paper 1__

In [121]:
response1 = queryEngine1.query("What is this paper about? And what is the title of the paper?")
print(response1)

The paper discusses the Transformer model, which is a sequence transduction model based entirely on attention. It replaces the recurrent layers commonly used in encoder-decoder architectures with multi-headed self-attention. The paper highlights that the Transformer can be trained faster than architectures based on recurrent or convolutional layers, achieving state-of-the-art results on translation tasks. The title of the paper is "attention is all you need."


__Query with research paper 2__

In [123]:
response2 = queryEngine2.query("What is this paper about? And what is the title of the paper?")
print(response2)

The paper discusses the challenges of detecting machine-generated text and proposes a method using probability curvature for zero-shot detection. The title of the paper is "Zero-Shot Machine-Generated Text Detection using Probability Curvature."


__Router Query Engine__

A router query engine wraps several query engines and uses them as tools. Each tool carries a description, which the LLM uses to decide which query engine should answer a given question.

In [154]:
routed_query_engine = RouterQueryEngine(
    selector=PydanticSingleSelector.from_defaults(),
    query_engine_tools=[
        QueryEngineTool.from_defaults(
            query_engine=queryEngine1,
            description=(
                "Useful for retrieving specific context for transformer architecture which relies solely"
                " on attention mechanisms and eliminates the need for recurrence in sequence transduction models"
            ),
        ),
        
        QueryEngineTool.from_defaults(
            query_engine=queryEngine2,
            description=(
                "Useful for retrieving specific context for the challenges of detecting machine-generated text"
                " and proposes a method using probability curvature for zero-shot detection"
            ),
        ),
    ]
)
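
The routing idea can be sketched without an LLM. In the toy version below the selector is swapped for plain word overlap between the query and each tool description, which is not how `PydanticSingleSelector` works; the descriptions are shortened and the lambdas are hypothetical stand-ins for the real query engines.

```python
import re

def words(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

# Shortened descriptions; the lambdas stand in for real query engines.
TOOLS = [
    {"description": "transformer architecture which relies solely on attention mechanisms",
     "engine": lambda q: f"[paper 1 engine] {q}"},
    {"description": "detecting machine-generated text using probability curvature",
     "engine": lambda q: f"[paper 2 engine] {q}"},
]

def select_tool(query, tools):
    """Return the index of the tool whose description shares the most words with the query."""
    query_words = words(query)
    overlaps = [len(query_words & words(tool["description"])) for tool in tools]
    return overlaps.index(max(overlaps))

def route(query, tools):
    """Send the query to the selected engine and return its answer."""
    return tools[select_tool(query, tools)]["engine"](query)

print(route("How can machine generated text be identified?", TOOLS))
```

The quality of the descriptions matters in both versions: a vague description gives the selector little to match against.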

__Query with router query engine__

In [155]:
res4 = routed_query_engine.query("What machine learning method is specifically built and is best for machine translation tasks?")
print(res4)

The Transformer model is specifically built for machine translation tasks and has been shown to outperform previous state-of-the-art models on tasks like English-to-German and English-to-French translation.


In [167]:
res4.metadata

{'5da5998b-58da-4654-9841-419c4d33eb19': {'file_path': 'D:\\LLMProjects\\ResearchAssistant_RAG\\data\\data1\\NIPS-2017-attention-is-all-you-need-Paper.txt',
  'file_name': 'NIPS-2017-attention-is-all-you-need-Paper.txt',
  'file_type': 'text/plain',
  'file_size': 33049,
  'creation_date': '2024-07-10',
  'last_modified_date': '2024-07-10'},
 '72be0f23-b858-4beb-a65a-b983c34ef80c': {'file_path': 'D:\\LLMProjects\\ResearchAssistant_RAG\\data\\data1\\NIPS-2017-attention-is-all-you-need-Paper.txt',
  'file_name': 'NIPS-2017-attention-is-all-you-need-Paper.txt',
  'file_type': 'text/plain',
  'file_size': 33049,
  'creation_date': '2024-07-10',
  'last_modified_date': '2024-07-10'},
 'selector_result': MultiSelection(selections=[SingleSelection(index=0, reason='The transformer architecture, which relies solely on attention mechanisms and eliminates the need for recurrence, is specifically built for machine translation tasks.')])}

In [165]:
res4.metadata['selector_result']

MultiSelection(selections=[SingleSelection(index=0, reason='The transformer architecture, which relies solely on attention mechanisms and eliminates the need for recurrence, is specifically built for machine translation tasks.')])

__`queryEngine1`, which has knowledge of the research file `NIPS-2017-attention-is-all-you-need-Paper.txt`, is used to answer this query.__

In [168]:
res5 = routed_query_engine.query("How can machine generated text be identified?")
print(res5)

Machine-generated text can be identified by comparing the log probability of a candidate passage under a particular source model with the average log probability of several perturbations of the passage under the same source model. If the perturbed passages tend to have lower average log probability than the original passage, then the candidate passage is likely to have come from the source model.


In [169]:
res5.metadata

{'9a5ffd6a-f3c1-433c-b8fb-d572f0fe2c67': {'file_path': 'D:\\LLMProjects\\ResearchAssistant_RAG\\data\\data2\\2301.11305v2.txt',
  'file_name': '2301.11305v2.txt',
  'file_type': 'text/plain',
  'file_size': 62736,
  'creation_date': '2024-07-10',
  'last_modified_date': '2024-07-10'},
 '2efdb51e-393d-4ed5-8085-110d5557d5c5': {'file_path': 'D:\\LLMProjects\\ResearchAssistant_RAG\\data\\data2\\2301.11305v2.txt',
  'file_name': '2301.11305v2.txt',
  'file_type': 'text/plain',
  'file_size': 62736,
  'creation_date': '2024-07-10',
  'last_modified_date': '2024-07-10'},
 'selector_result': MultiSelection(selections=[SingleSelection(index=1, reason="The choice (2) is more relevant to the question 'How can machine generated text be identified?' as it specifically addresses the challenges of detecting machine-generated text and proposes a method using probability curvature for zero-shot detection.")])}

In [170]:
res5.metadata['selector_result']

MultiSelection(selections=[SingleSelection(index=1, reason="The choice (2) is more relevant to the question 'How can machine generated text be identified?' as it specifically addresses the challenges of detecting machine-generated text and proposes a method using probability curvature for zero-shot detection.")])

__`queryEngine2`, which has knowledge of the research file `2301.11305v2.txt`, is used to answer this query.__

### Agents

__Simple functions that an agent can use__

In [180]:
def multiply(a, b):
    """Multiply two numbers and return the result"""
    return a * b

multiply_tool = FunctionTool.from_defaults(fn=multiply)

In [181]:
def add(a, b):
    """Add two numbers and return the result"""
    return a + b

add_tool = FunctionTool.from_defaults(fn=add)

__Convert previous router query engine into a tool__

In [196]:
query_engine_tool = QueryEngineTool.from_defaults(
    query_engine=routed_query_engine,
    description="A tool that can answer questions about different research papers"
)

__OpenAI Agent that can utilize these tools__

In [197]:
agent = OpenAIAgent.from_tools(
    tools=[query_engine_tool, multiply_tool, add_tool],
    verbose=True
)

__Queries that utilize specific tools__

These queries use the router query engine tool. Each query is then routed further to the query engine that has knowledge of the relevant research paper.

We can see the functions being called by the agent.

In [198]:
response = agent.chat("What is the secret that makes transformers the best for machine translation tasks? Explain.")
print(response)

Added user message to memory: What is the secret that makes transformers the best for machine translation tasks? Explain.
=== Calling Function ===
Calling function: query_engine_tool with args: {"input":"The secret that makes transformers the best for machine translation tasks"}
Got output: The secret that makes transformers the best for machine translation tasks is their architecture based solely on attention mechanisms, which allows for drawing global dependencies between input and output without the need for recurrence or convolutions. This design enables more parallelization, faster training times, and superior translation quality compared to models that rely on recurrent or convolutional layers.

The secret that makes transformers the best for machine translation tasks is their architecture based solely on attention mechanisms. This design allows transformers to draw global dependencies between input and output without the need for recurrence or convolutions. As a result, transfor

In [201]:
response = agent.chat("I think my students are submitting assignments with the help of ChatGPT. How can I detect if the work is done by using AI?")
print(response)

Added user message to memory: I think my students are submitting assignments with the help of ChatGPT. How can I detect if the work is done by using AI?
=== Calling Function ===
Calling function: query_engine_tool with args: {"input":"Detecting assignments done using AI"}
Got output: DetectGPT is a method used for detecting machine-generated text, specifically from models like GPT-3 and Jurassic-2 Jumbo. It compares favorably with supervised models trained for machine-generated text detection, showing strong performance in detecting text generated by these AI models.

To detect if the work submitted by your students is done using AI, you can use a method called DetectGPT. This method is specifically designed to detect machine-generated text, particularly from models like GPT-3 and Jurassic-2 Jumbo. DetectGPT compares favorably with supervised models trained for machine-generated text detection and shows strong performance in identifying text generated by these AI models.


In [203]:
response = agent.chat("How can machine generated text be identified? Use the tools.")
print(response)

Added user message to memory: How can machine generated text be identified? Use the tools.
=== Calling Function ===
Calling function: query_engine_tool with args: {"input": "Identifying machine-generated text"}
Got output: DetectGPT is a zero-shot method for identifying machine-generated text by comparing the log probability of a candidate passage under a specific source model with the average log probability of several perturbations of the passage under the same model. If the perturbed passages consistently have lower average log probability than the original passage, it is likely that the candidate passage was generated by the source model. This method leverages the observation that machine-generated text tends to lie in regions with negative curvature in the log probability function, while human-written text does not exhibit this tendency.

=== Calling Function ===
Calling function: query_engine_tool with args: {"input": "Using AI tools for text identification"}
Got output: AI tools

__Queries that use `multiply_tool` and `add_tool` to perform multi-step calculations, which LLMs often get wrong on their own.__

In [199]:
response = agent.chat("What is (121 * 3) + 42?")
print(response)

Added user message to memory: What is (121 * 3) + 42?
=== Calling Function ===
Calling function: multiply with args: {"a":121,"b":3}
Got output: 363

=== Calling Function ===
Calling function: add with args: {"a":363,"b":42}
Got output: 405

The result of (121 * 3) + 42 is 405.
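
The trace above can be mimicked with a plain dispatch table. In this sketch the plan is hard-coded, whereas the agent obtains it from the LLM, and the `"$prev"` marker for "result of the previous call" is a hypothetical convention invented for the example.

```python
def multiply(a, b):
    """Multiply two numbers and return the result."""
    return a * b

def add(a, b):
    """Add two numbers and return the result."""
    return a + b

# Tool registry: the name the planner uses, mapped to the function to call.
TOOLS = {"multiply": multiply, "add": add}

def run_tool_calls(calls):
    """Execute tool calls in order; "$prev" is replaced by the previous result."""
    result = None
    for name, args in calls:
        resolved = {key: (result if value == "$prev" else value) for key, value in args.items()}
        result = TOOLS[name](**resolved)
        print(f"Calling function: {name} with args: {resolved} -> {result}")
    return result

# Hard-coded plan for "(121 * 3) + 42"; an agent would get this from the LLM.
plan = [("multiply", {"a": 121, "b": 3}), ("add", {"a": "$prev", "b": 42})]
print(run_tool_calls(plan))
```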


In [210]:
response = agent.chat("What is (44.5 + 7.3) * (9.6 + 5)?")
print(response)

Added user message to memory: What is (44.5 + 7.3) * (9.6 + 5)?
=== Calling Function ===
Calling function: add with args: {"a":44.5,"b":7.3}
Got output: 51.8

=== Calling Function ===
Calling function: add with args: {"a":9.6,"b":5}
Got output: 14.6

=== Calling Function ===
Calling function: multiply with args: {"a":51.8,"b":14.6}
Got output: 756.28

The result of (44.5 + 7.3) * (9.6 + 5) is 756.28.


__Complex query for which the agent has to use multiple tools__

The agent runs multiple tools by making multiple function calls.

In [221]:
tmp = agent.chat("Explain transformers. Once you are done, explain how are transformers involved in DetectGPT. "
    " Then, add 2 and 3 then multiply the sum by 4.")
print(tmp)

Added user message to memory: Explain transformers. Once you are done, explain how are transformers involved in DetectGPT.  Then, add 2 and 3 then multiply the sum by 4.
=== Calling Function ===
Calling function: query_engine_tool with args: {"input": "Explanation of transformers"}
Got output: Transformers are a network architecture that relies solely on attention mechanisms, eliminating the need for recurrence and convolutions. They are designed to capture global dependencies between input and output sequences without considering their distance. This architecture allows for enhanced parallelization during training, leading to faster training times and improved translation quality.

=== Calling Function ===
Calling function: query_engine_tool with args: {"input": "Role of transformers in DetectGPT"}
Got output: Transformers play a crucial role in DetectGPT by serving as the foundational architecture for sequence transduction models based entirely on attention. They replace the traditio

__Stream Chat: Streams response instantly__

The response is delivered immediately, piece by piece, as the LLM generates it.

In [219]:
response = agent.stream_chat(
    "Explain transformers. Once you are done, explain how are transformers involved in DetectGPT. "
    " Then, add 2 and 3 then multiply the sum by 4."
)

response_gen = response.response_gen

for token in response_gen:
    print(token, end="")

Added user message to memory: Explain transformers. Once you are done, explain how are transformers involved in DetectGPT.  Then, add 2 and 3 then multiply the sum by 4.
=== Calling Function ===
Calling function: query_engine_tool with args: {"input": "Explanation of transformers"}
Got output: Transformers are a network architecture that relies solely on attention mechanisms, eliminating the need for recurrent or convolutional layers typically used in sequence transduction models. By leveraging attention to capture global dependencies between input and output sequences, transformers enable more efficient parallelization during training, leading to faster training times and improved performance. This architecture has shown superior quality in tasks like machine translation, achieving state-of-the-art results with significantly reduced training costs compared to traditional models.

=== Calling Function ===
Calling function: query_engine_tool with args: {"input": "Role of transformers in
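
The piece-by-piece delivery can be pictured as a Python generator that yields tokens as they become available. The sketch below is only an analogy: `stream_tokens` is a hypothetical function that splits a finished string, whereas a real stream yields tokens while the model is still producing them.

```python
def stream_tokens(text):
    """Yield the text one space-delimited token at a time."""
    for token in text.split(" "):
        yield token + " "

answer = "Transformers rely solely on attention mechanisms."
for token in stream_tokens(answer):
    # flush=True pushes each token to the screen immediately.
    print(token, end="", flush=True)
print()
```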

### Storage Context

There is no need to parse the documents every time.

In [223]:
from llama_index.core import StorageContext, load_index_from_storage

__Persist the index to a local directory__

In [230]:
index.storage_context.persist(persist_dir="storage")

In [231]:
storage_context = StorageContext.from_defaults(persist_dir="storage")

__Load stored data and create query engine__

In [233]:
saved_index = load_index_from_storage(storage_context)

In [234]:
query_engine_from_saved = saved_index.as_query_engine()
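
The persist-then-reload round trip can be sketched with the standard library. This is not LlamaIndex's storage layout: the `index.json` file name, the `persist` and `load` helpers, and the toy index with made-up embeddings are all assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path

def persist(index, persist_dir):
    """Write the index to <persist_dir>/index.json."""
    path = Path(persist_dir)
    path.mkdir(parents=True, exist_ok=True)
    (path / "index.json").write_text(json.dumps(index))

def load(persist_dir):
    """Read the index back from <persist_dir>/index.json."""
    return json.loads((Path(persist_dir) / "index.json").read_text())

# Hypothetical toy index: chunk text mapped to a made-up embedding.
toy_index = {"Attention is all you need.": [0.1, 0.9], "DetectGPT": [0.8, 0.2]}

with tempfile.TemporaryDirectory() as tmp_dir:
    persist(toy_index, tmp_dir)
    restored = load(tmp_dir)

print(restored == toy_index)
```

The saving comes from skipping the parse and embed steps on later runs; only the load is repeated.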

__Query with stored data__

In [237]:
res0 = query_engine_from_saved.query("How is the accuracy for DetectGPT measured? Explain.")
print(res0)

The accuracy for DetectGPT is measured using the BLEU score, which is a metric commonly used to evaluate the quality of machine translation.


### Chat with data

__The chat engine is context-aware and can handle follow-up questions using the saved message history.__

In [272]:
from llama_index.core.chat_engine.context import ContextChatEngine
from llama_index.core.base.llms.types import ChatMessage

__Sample message history from the last query__

In [273]:
messageHistory = [
    ChatMessage(
        role = "user",
        content = "How is the accuracy for DetectGPT measured? Explain."
    ),
    ChatMessage(
        role = "assistant",
        content = "The accuracy for DetectGPT is measured using the BLEU score, which is a metric commonly used to evaluate the quality of machine translation."
    )
]

__Context aware chat engine__

In [274]:
retriever2 = saved_index.as_retriever(similarity_top_k=3)

chatEngine = ContextChatEngine.from_defaults(retriever=retriever2, chat_history=messageHistory)

__Query with context aware chat engine about previous response__

In [277]:
res9 = chatEngine.chat("What was the last thing you mentioned?")
print(res9)

I mentioned that the BLEU score is used to assess the accuracy of the English-to-German translation produced by the Transformer model in the context of the paper.


The chat engine can recall previous responses, which allows follow-up questions that keep the earlier context in mind.
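
Why the saved history makes follow-ups possible can be seen in a small sketch: every new prompt carries the earlier turns along with the retrieved context. `build_chat_prompt` and the prompt layout are hypothetical and do not reproduce what `ContextChatEngine` sends to the model.

```python
def build_chat_prompt(history, context, user_message):
    """Combine retrieved context, prior turns, and the new message into one prompt."""
    lines = [f"Context: {context}"]
    lines += [f"{role}: {content}" for role, content in history]
    lines.append(f"user: {user_message}")
    return "\n".join(lines)

history = [
    ("user", "How is the accuracy for DetectGPT measured? Explain."),
    ("assistant", "The accuracy for DetectGPT is measured using the BLEU score."),
]
question = "What was the last thing you mentioned?"
prompt = build_chat_prompt(history, "retrieved text goes here", question)
print(prompt)

# After the model replies, both turns are appended so the next question sees them.
history.append(("user", question))
history.append(("assistant", "[model reply]"))
```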